Python Clustering Algorithms

astromax picture astromax · Nov 13, 2013 · Viewed 14k times · Source

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily know, and in addition to this, no a priori linking lengths are known (similar to this question).

I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.

Here is an example of what dbscan find: dbscanfail

You can see that there really are two separate populations here, though adjusting the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.

Is there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.

Answer

astromax picture astromax · Nov 14, 2013

I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).

Here's an example of what it gives:

MeanShiftResults

I also want to point out that in the documentation is states that it may not scale well.