Can anyone point me to a hierarchical clustering tool (preferable in python) that can cluster ~1 Million objects? I have tried hcluster
and also Orange.
hcluster
had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).
I am running on a 64bit Xeon CPU (2.53GHz) and 8GB of RAM + 3GB swap on Ubuntu 11.10.
The problem probably is that they will try to compute the full 2D distance matrix (about 8 GB naively with double precision) and then their algorithm will run in O(n^3)
time anyway.
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are not at all convincing usually. In particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue hierarchical clustering, I belive that ELKI (Java though) has a O(n^2)
implementation of SLINK
. Which at 1 million objects should be approximately 1 million times as fast. I don't know if they already have CLINK
, too. And I'm not sure if there actually is any sub-O(n^3)
algorithm for other variants than single-link and complete-link.
Consider using other algorithms. k-means for example scales very well with the number of objects (it's just not very good usually either, unless your data is very clean and regular). DBSCAN
and OPTICS
are quite good in my opinion, once you have a feel for the parameters. If your data set is low dimensional, they can be accelerated quite well with an appropriate index structure. They should then run in O(n log n)
, if you have an index with O(log n)
query time. Which can make a huge difference for large data sets. I've personally used OPTICS
on a 110k images data set without problems, so I can imagine it scales up well to 1 million on your system.