Cluster high dimensional data with python and DBSCAN

Ekgren picture Ekgren · Apr 22, 2013 · Viewed 7.1k times · Source

I have a dataset with 1000 dimensions and I am trying to cluster the data with DBSCAN in Python. I have a hard time understanding what metric to choose and why.

Can someone explain this? And how should I decide what values to set eps to?

I am interested in the finer structure of the data so the min_value is set to 2. Now I use the regular metric that is preset for dbscan in sklearn, but for small eps values, such as eps < 0.07, I get a few clusters but miss many points and for larger values i get several smaller clusters and one huge. I do understand that everything depends on the data at hand but I am interested in tips on how to choose eps values in a coherent and structured way and what metrics to choose!

I have read this question and the answers there are with regards to 10 dimensions I have 1000 :) and I also do not know how to evaluate my metric so it would be interesting with a more elaborate explanation then: evaluate your metric!

Edit: Or tips on other clustering algorithms that work on high dimensional data with an existing python implementation.

Answer

Has QUIT--Anony-Mousse picture Has QUIT--Anony-Mousse · Apr 22, 2013

First of all, with minPts=2 you aren't actually doing DBSCAN clustering, but the result will degenerate into single-linkage clustering.

You really should use minPts=10 or higher.

Unfortunately, you didn't bother to tell us what distance metric you actually use!

Epsilon really depends heavily on your data set and metric. We cannot help you there without knowing the parameters and your data set. Have you tried plotting a distance histogram to see which values are typical? That probably is the best heuristic to choose this threshold: look at quantiles of the distance histogram (or a sample thereof).

However, note that OPTICS does get rid of this parameter (at least when you have a proper implementation). When extracting clusters with the Xi method, you only need epsilon large enough to not cut structure you are interested in (and small enough to get the runtime you want - larger is slower, although not linearly). Xi then gives a relative increase in distance that is considered to be significant.