I have 1 million 5-dimensional points that I need to group into k clusters, with k << 1 million. In each cluster, no two points should be too far apart (e.g., each cluster should fit inside a bounding sphere of a specified radius). That means there will probably have to be many clusters of size 1.
But! I need the running time to be well below n^2; n log n or so should be fine. The reason I'm doing this clustering is to avoid computing the full n-by-n distance matrix (which takes n^2 time, i.e. many hours); instead I want to compute distances only between clusters.
I tried the pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach:
Slice space into 20 pieces along each dimension (so there are 20^5 grid boxes in total). Store each cluster in the grid box that contains its centroid.
For each point, look at the grid boxes within r (the maximum bounding-sphere radius) and check the clusters stored there. If one is close enough, add the point to that cluster; otherwise start a new cluster.
However, this gives me more clusters than I want. I've also implemented two similar variants of this approach, and they produce very different clusterings.
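To make that concrete, here is a simplified sketch of the greedy approach (to keep it short I use a cell width of r instead of a fixed 20 slices per dimension, and I don't update centroids as points are added):

```python
import numpy as np
from itertools import product

def greedy_grid_cluster(points, r):
    """Greedy pass: assign each point to an existing cluster whose centroid is
    within r, otherwise start a new cluster. Clusters are indexed by the grid
    cell containing their centroid, so only nearby cells are searched."""
    dim = points.shape[1]
    grid = {}                      # grid cell -> indices of clusters stored there
    centroids, members = [], []    # one centroid and one member list per cluster

    for i, p in enumerate(points):
        key = tuple((p // r).astype(int))   # cell width r => 3 cells per axis to search
        best, best_d = None, r
        # look at clusters whose centroids fall in the 3^5 surrounding cells
        for off in product((-1, 0, 1), repeat=dim):
            for c in grid.get(tuple(k + o for k, o in zip(key, off)), ()):
                d = np.linalg.norm(p - centroids[c])
                if d <= best_d:
                    best, best_d = c, d
        if best is None:
            grid.setdefault(key, []).append(len(centroids))
            centroids.append(p)
            members.append([i])
        else:
            members[best].append(i)   # centroid left fixed to keep the sketch simple
    return centroids, members
```

The order in which points are visited changes which clusters get created, which I suspect is why my two implementations disagree.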
Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.
Consider an approximate nearest neighbor (ANN) algorithm or locality-sensitive hashing (LSH). Neither directly solves the clustering problem, but both will tell you which points are "close" to one another. By tuning the parameters you can make "close" as close as you want. And it's fast.
More precisely, LSH can provide a hash function h such that, for two points x and y and a distance metric d,

d(x, y) <= R1  =>  P(h(x) = h(y)) >= P1
d(x, y) >= R2  =>  P(h(x) = h(y)) <= P2

where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
Here is information on LSH, including the E2LSH manual. ANN is similar in spirit; David Mount has information here, or try FLANN (which has Matlab and Python bindings).
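For concreteness, here is a rough sketch of the LSH-plus-postprocessing idea using the standard random-projection (p-stable) scheme. The parameter values and the greedy postprocess are only illustrative; a library such as E2LSH or FLANN will pick the hashing parameters far better than these defaults:

```python
import numpy as np
from collections import defaultdict

def lsh_cluster(points, r, n_bits=12, n_tables=8, seed=0):
    """Sketch: p-stable (random-projection) LSH buckets followed by a greedy
    postprocess. Points within distance r of each other are likely to share a
    bucket in at least one table; true distances are verified before grouping,
    so the randomness only affects which close pairs get found."""
    rng = np.random.default_rng(seed)
    n, dim = points.shape

    # Build the hash tables: each code is n_bits quantised random projections.
    tables = []
    for _ in range(n_tables):
        dirs = rng.normal(size=(dim, n_bits))      # random directions
        offs = rng.uniform(0.0, r, size=n_bits)    # random shifts
        codes = np.floor((points @ dirs + offs) / r).astype(int)
        buckets = defaultdict(list)
        for i, code in enumerate(map(tuple, codes)):
            buckets[code].append(i)
        tables.append((codes, buckets))

    # Greedy postprocess: every unassigned point starts a cluster and absorbs
    # its still-unassigned bucket-mates that are genuinely within r of it.
    label = np.full(n, -1)
    clusters = []
    for i in range(n):
        if label[i] != -1:
            continue
        cand = set()
        for codes, buckets in tables:
            cand.update(buckets[tuple(codes[i])])
        new = [j for j in cand
               if label[j] == -1 and np.linalg.norm(points[j] - points[i]) <= r]
        label[new] = len(clusters)   # new always contains i itself
        clusters.append(new)
    return clusters
```

The inner loop only touches each point's bucket-mates, so the total work is roughly n times the average bucket size rather than n^2, and every point in a cluster is within r of that cluster's seed point.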