I'm using dbscan::dbscan to cluster my data by location and density.
My data looks like this:
str(data)
'data.frame': 4872 obs. of 3 variables:
$ price : num ...
$ lat : num ...
$ lng : num ...
Now I'm using the following code:
library(dbscan)

EPS = 7
cluster.dbscan <- dbscan(data, eps = EPS, minPts = 30, borderPoints = TRUE,
                         search = "kdtree")
plot(lat ~ lng, data = data, col = cluster.dbscan$cluster + 1L, pch = 20)
but the result isn't satisfying at all; the points aren't really clustered.
I would like to have the clusters nicely defined, something like this:
I also tried the decision tree classifier tree::tree, which works better, but I can't tell whether it is really a good classification.
File:
http://www.file-upload.net/download-11246655/file.csv.html
Question:
How can I cluster this data into nicely defined, meaningful groups?

Answer:
This is the output of a careful density-based clustering using the fairly new HDBSCAN* algorithm.
Using Haversine distance instead of Euclidean!
It identified around 50 regions that are substantially denser than their surroundings. In this figure, some clusters look as if they had only 3 elements, but they actually have many more.
The outermost area contains the noise points, which do not belong to any cluster at all!
(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction -algorithm SLINKHDBSCANLinearMemory -algorithm.distancefunction geo.LatLngDistanceFunction -hdbscan.minPts 20 -hdbscan.minclsize 20)
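Those parameters are for ELKI. If you want to stay in R, here is a minimal sketch of the same idea (HDBSCAN on haversine distances), assuming the dbscan and geosphere packages; hdbscan() can take a precomputed dist object, but note that this builds a full O(n²) distance matrix, which is fine at ~4900 points but not for much larger data:

library(dbscan)
library(geosphere)

data <- read.csv("file.csv")                  # columns: price, lat, lng
# Pairwise haversine distances in metres; distm() expects (lng, lat) order.
d  <- as.dist(distm(data[, c("lng", "lat")], fun = distHaversine))
cl <- hdbscan(d, minPts = 20)                 # minPts as in the ELKI run above
plot(lat ~ lng, data = data, col = cl$cluster + 1L, pch = 20)  # cluster 0 = noise

R's hdbscan() has no separate minimum-cluster-size parameter; minPts plays both roles, so the result will not match the ELKI output exactly.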
OPTICS is another density-based algorithm; here is a result:
Again, we have a "noise" area whose red dots are not dense at all.
(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.1 -algorithm.distancefunction geo.LatLngDistanceFunction -optics.minpts 25)
The OPTICS plot for this data set looks like this:
You can see there are many small valleys that correspond to clusters. But there is no "large" structure here.
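To reproduce something similar in R, a rough sketch assuming dbscan's optics() and extractXi(), reusing the haversine dist object d from the HDBSCAN example above:

res <- optics(d, minPts = 25)      # reachability ordering, as in the ELKI run
res <- extractXi(res, xi = 0.1)    # Xi-based cluster extraction
plot(res)                          # reachability plot: valleys are clusters
plot(lat ~ lng, data = data, col = res$cluster + 1L, pch = 20)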
You were probably looking for a result like this:
But in fact, this is a meaningless and rather random way of breaking the data into large chunks. Sure, it minimizes variance, the way k-means does; but it does not care at all about the structure of the data. Points within one cluster will frequently have less in common than points in different clusters. Just look at the points at the border between the red, orange, and violet clusters.
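For reference, such a partition is easy to produce in base R with k-means; the number of centers below is an arbitrary guess, which is part of the problem:

km <- kmeans(data[, c("lng", "lat")], centers = 9, nstart = 20)
plot(lat ~ lng, data = data, col = km$cluster, pch = 20)

Rerunning with a different centers value or random seed rearranges the chunks, underlining how arbitrary the boundaries are.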
Last but not least, the old-timer: hierarchical clustering with complete linkage:
and the dendrogram:
(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.SimplifiedHierarchyExtraction -algorithm AnderbergHierarchicalClustering -algorithm.distancefunction geo.LatLngDistanceFunction -hierarchical.linkage CompleteLinkageMethod -hdbscan.minclsize 50)
Not too bad. Complete linkage works rather well on such data, too. But you could just as well merge or split any of these clusters.
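The R equivalent needs nothing beyond the haversine dist object d built earlier; where to cut the dendrogram is a free choice, which is exactly that merge-or-split arbitrariness:

hc <- hclust(d, method = "complete")
plot(hc)                            # the dendrogram
cl <- cutree(hc, k = 50)            # k = 50 is an arbitrary cut
plot(lat ~ lng, data = data, col = cl, pch = 20)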