DBSCAN for clustering data by location and density

Paul · Jan 25, 2016 · Viewed 7.3k times

I'm using the method dbscan::dbscan in order to cluster my data by location and density.

My data looks like this:

str(data)
'data.frame': 4872 obs. of 3 variables:
 $ price    : num ...
 $ lat      : num ...
 $ lng      : num ...

I'm using the following code:

EPS = 7
cluster.dbscan <- dbscan(data, eps = EPS, minPts = 30, borderPoints = TRUE,
                         search = "kdtree")
plot(lat ~ lng, data = data, col = cluster.dbscan$cluster + 1L, pch = 20)

But the result isn't satisfying at all; the points aren't really clustered.

enter image description here

I would like to have the clusters nicely defined, something like this: enter image description here

I also tried a decision tree classifier, tree::tree, which works better, but I can't tell whether it really gives a good classification.

File:

http://www.file-upload.net/download-11246655/file.csv.html

Question:

  • is it possible to achieve what I want?
  • am I using the right method?
  • should I play more with the parameters? if yes, with which?

Answer

Has QUIT--Anony-Mousse · Jan 26, 2016

This is the output of a careful density-based clustering using the quite new HDBSCAN* algorithm.

It uses Haversine distance instead of Euclidean distance, because the coordinates are latitude and longitude!
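For readers who want to stay in R, here is a minimal sketch of the same idea: build a Haversine distance object and hand it to dbscan::dbscan, which accepts a dist object. Note that the call in the question passes all three columns (including price) and measures eps = 7 in raw Euclidean units, so, as far as I can tell, price is part of that distance. The helper names haversine_km and haversine_dist, the toy coordinates, and the eps and minPts values are my own assumptions for illustration, not the parameters behind the figure.

```r
# Haversine (great-circle) distance in kilometres between points given in
# decimal degrees. 6371 km is the mean Earth radius.
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Pairwise distance object for a data frame with columns `lat` and `lng`.
haversine_dist <- function(df) {
  n <- nrow(df)
  m <- outer(seq_len(n), seq_len(n), function(i, j) {
    haversine_km(df$lat[i], df$lng[i], df$lat[j], df$lng[j])
  })
  as.dist(m)
}

# Toy coordinates (hypothetical, not the asker's file).
pts <- data.frame(lat = c(52.52, 52.53, 48.14), lng = c(13.40, 13.41, 11.58))
d <- haversine_dist(pts)

# If the dbscan package is installed, pass the dist object directly;
# eps is then in kilometres. These values are placeholders to tune.
if (requireNamespace("dbscan", quietly = TRUE)) {
  cl <- dbscan::dbscan(d, eps = 5, minPts = 2)
  print(cl$cluster)
}
```

On the real data you would call haversine_dist(data[, c("lat", "lng")]). For 4872 points the full distance matrix fits in memory but is not small, so expect a noticeable pause.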

It identified some 50-something regions that are substantially more dense than their surroundings. In this figure, some clusters look as if they had only 3 elements, but they do have many more.

enter image description here

The points in the outermost area are noise: they do not belong to any cluster at all!

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction -algorithm SLINKHDBSCANLinearMemory -algorithm.distancefunction geo.LatLngDistanceFunction -hdbscan.minPts 20 -hdbscan.minclsize 20)

OPTICS is another density-based algorithm; here is a result:

enter image description here

Again, we have a "noise" area of red dots that are not dense at all.

Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.1 -algorithm.distancefunction geo.LatLngDistanceFunction -optics.minpts 25

The OPTICS plot for this data set looks like this:

enter image description here

You can see there are many small valleys that correspond to clusters. But there is no "large" structure here.
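A reachability plot like this can also be produced in R. The sketch below uses optics and extractXi from the dbscan package on synthetic planar points, so it is an illustration of the workflow rather than a reproduction of the ELKI run; the data, the eps value, and the variable names are assumptions of mine. For the real coordinates you would pass a Haversine dist object instead of the raw matrix.

```r
library(dbscan)  # assumes the dbscan package is installed

set.seed(42)
# Synthetic planar points (hypothetical stand-in for the real coordinates):
# two tight blobs plus uniform background points.
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.2), ncol = 2),
  matrix(runif(100, min = -2, max = 5), ncol = 2)
)

# Compute the cluster ordering and the reachability distances.
# eps is only an upper bound on the neighbourhood radius here.
res <- optics(x, eps = 10, minPts = 25)

# The reachability plot: valleys correspond to dense regions.
plot(res)

# Extract clusters with the xi method (xi = 0.1, as in the answer).
res_xi <- extractXi(res, xi = 0.1)
table(res_xi$cluster)  # how many points received each cluster label
```

The two blobs should show up as two valleys in the plot, separated by the higher reachability values of the background points.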

You probably were looking for a result like this:

enter image description here

But in fact, this is a meaningless and rather random way of breaking the data into large chunks. Sure, it minimizes variance; but it does not at all care about the structure of the data. Points within one cluster will frequently have less in common than points in different clusters. Just look at the points at the border between the red, orange, and violet clusters.
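The answer does not name the method behind that figure. Assuming it is something like k-means, which minimizes within-cluster variance, the point can be checked with a small sketch: even on data with no structure at all, k-means returns as many tidy-looking chunks as you ask for. The uniform data and the choices of k below are hypothetical.

```r
set.seed(1)
# Structureless data: 500 points drawn uniformly from the unit square.
x <- cbind(lng = runif(500), lat = runif(500))

# k-means (assumed here as the variance-minimizing method) still returns
# k groups, whatever k is requested.
km3 <- kmeans(x, centers = 3, nstart = 10)
km6 <- kmeans(x, centers = 6, nstart = 10)

# Total within-cluster sum of squares for each solution.
c(k3 = km3$tot.withinss, k6 = km6$tot.withinss)

# The partition looks clean even though there is nothing to find.
plot(x, col = km6$cluster, pch = 20)
```

The variance goes down as k goes up, but that says nothing about whether the borders between chunks follow any real structure.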

Last but not least, an old-timer: hierarchical clustering with complete linkage:

enter image description here

and the dendrogram:

enter image description here

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.SimplifiedHierarchyExtraction -algorithm AnderbergHierarchicalClustering -algorithm.distancefunction geo.LatLngDistanceFunction -hierarchical.linkage CompleteLinkageMethod -hdbscan.minclsize 50)

Not too bad. Complete linkage works on such data rather well, too. But you could merge or split any of these clusters.
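In R, complete linkage is available in base R through hclust. The sketch below uses synthetic planar points and cutree rather than the ELKI hierarchy extraction used for the figure, so treat it as an illustration of the merge-or-split point only; the data and the values of k are assumptions.

```r
set.seed(7)
# Synthetic planar points (hypothetical): three loose groups.
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Euclidean distance here; for lat/lng, use a Haversine dist object instead.
d <- dist(x)
hc <- hclust(d, method = "complete")

# The dendrogram.
plot(hc, labels = FALSE)

# Cutting the tree at different levels merges or splits clusters.
coarse <- cutree(hc, k = 3)
fine <- cutree(hc, k = 6)
table(coarse, fine)
```

Each fine cluster sits inside exactly one coarse cluster, which is why any of them can be merged or split just by moving the cut.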