Text clustering and topic extraction

Misconstruction · May 30, 2013 · Viewed 8.9k times

I'm doing some text mining using the excellent scikit-learn module. I'm trying to cluster and classify scientific abstracts.

I'm looking for a way to cluster my set of tf-idf representations without having to specify the number of clusters in advance. I haven't been able to find a good algorithm that can do that and still handle large sparse matrices decently. I have been looking into simply using scikit-learn's KMeans, but it has no way to determine the optimal number of clusters (for example using BIC). I have also tried Gaussian mixture models (using the best BIC score to select the model), but they are awfully slow.
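For reference, the BIC selection I mean looks roughly like this sketch (GaussianMixture is the class in newer scikit-learn releases, older ones call it GMM, and neither accepts sparse input, so X_dense stands for a densified or dimensionality-reduced version of my tf-idf matrix):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X_dense: dense (or dimensionality-reduced) version of the tf-idf matrix;
# Gaussian mixtures in scikit-learn do not accept sparse input.
bics = []
for k in range(2, 16):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X_dense)
    bics.append(gmm.bic(X_dense))

best_k = int(np.argmin(bics)) + 2  # BIC is lower-is-better; offset for range start
```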

After I have clustered the documents, I would like to be able to look into the topics of each cluster, meaning the words they tend to use. Is there a way to extract this information, given the data matrix and cluster labels? Maybe by taking the mean of each cluster and inverse-transforming it with the tf-idf vectorizer? I've previously tried using chi-square and random forests to rank feature importance, but that doesn't tell me which label class uses which words.
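In other words, something like this sketch (X is my tf-idf matrix, labels the cluster assignments, and vectorizer the fitted TfidfVectorizer):

```python
import numpy as np

# Assumed inputs: X is the sparse tf-idf matrix, labels is an array of
# cluster ids (one per row of X), vectorizer is the fitted TfidfVectorizer.
terms = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in recent versions

for k in np.unique(labels):
    rows = np.where(labels == k)[0]
    centroid = np.asarray(X[rows].mean(axis=0)).ravel()  # mean tf-idf vector of the cluster
    top = centroid.argsort()[::-1][:10]                  # 10 heaviest terms
    print("cluster %d: %s" % (k, ", ".join(terms[top])))
```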

I've tried using the NMF decomposition (simply adapting the example code from scikit-learn's website) to do topic detection. It worked great and produced very meaningful topics very quickly. However, I did not find a way to use it to assign each data point to a cluster, nor to automatically determine the 'optimal' number of clusters. But it's the sort of thing I'm looking for.
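What I would like is roughly this sketch, where the argmax over the document-topic weights W serves as a hard cluster label (the variable names are mine, not from the scikit-learn example):

```python
import numpy as np
from sklearn.decomposition import NMF

# X and vectorizer are the tf-idf matrix and fitted TfidfVectorizer from above;
# n_topics still has to be picked by hand.
n_topics = 10
nmf = NMF(n_components=n_topics, random_state=42)
W = nmf.fit_transform(X)   # (n_docs, n_topics) document-topic weights
H = nmf.components_        # (n_topics, n_terms) topic-term weights

# Hard assignment: label each document with the topic it loads most heavily on.
labels = W.argmax(axis=1)

# Top terms per topic, as in the scikit-learn example code.
terms = np.array(vectorizer.get_feature_names())
for t, row in enumerate(H):
    print("topic %d: %s" % (t, ", ".join(terms[row.argsort()[::-1][:10]])))
```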

I also read somewhere that it's possible to extract topic information directly from a fitted LDA model, but I don't understand how it's done. Since I have already implemented an LDA as a baseline classifier and visualisation tool, this might be an easy solution.
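In case "LDA" means latent Dirichlet allocation here: the topic-word information sits in the fitted model's topic-word matrix. A sketch using scikit-learn's LatentDirichletAllocation (only available in newer releases; gensim's LdaModel.show_topics gives the same), noting that LDA is usually fit on raw term counts rather than tf-idf:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# docs is a hypothetical list of abstract strings.
count_vec = CountVectorizer(max_df=0.5, stop_words="english")
X_counts = count_vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topic = lda.fit_transform(X_counts)   # per-document topic proportions

# Each row of components_ is an (unnormalized) topic-word distribution.
terms = np.array(count_vec.get_feature_names())
for t, row in enumerate(lda.components_):
    print("topic %d: %s" % (t, ", ".join(terms[row.argsort()[::-1][:10]])))
```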

If I manage to produce meaningful clusters/topics, I am going to compare them to some human-made labels (not topic-based), to see how they correspond. But that's a topic for another thread :-)

Answer

ogrisel · May 30, 2013

You can try TF-IDF with a low max_df, e.g. max_df=0.5, and then k-means (or MiniBatchKMeans). To find a good value for K you can try one of these heuristics:

  • the gap statistic
  • the prediction strength

Executive summaries of both are given in this blog post: http://blog.echen.me/2011/03/19/counting-clusters/
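A rough sketch of the suggested pipeline (docs stands for your list of abstracts, and the fixed n_clusters is a placeholder until one of the heuristics above picks K):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# docs is a hypothetical list of abstract strings.
vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english")
X = vectorizer.fit_transform(docs)

# n_clusters=20 is a placeholder: K would come from the gap statistic
# or the prediction strength.
km = MiniBatchKMeans(n_clusters=20, random_state=42)
labels = km.fit_predict(X)
```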

Neither of those methods is implemented in sklearn. I would be very interested to hear if you find either of them useful for your problem. If so, it would probably be worth discussing how best to contribute a default implementation to scikit-learn.
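For concreteness, here is a minimal sketch of the gap statistic using the simple "largest gap" rule rather than the full standard-error rule from the paper; it assumes a dense, reasonably low-dimensional X (for sparse tf-idf you would typically reduce first, e.g. with TruncatedSVD):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=5, seed=42):
    # Gap(k) = E[log W_k under uniform reference data] - log W_k,
    # where KMeans.inertia_ is exactly the within-cluster dispersion W_k.
    rng = np.random.RandomState(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, random_state=seed).fit(X).inertia_)
        # Reference dispersions: uniform draws over the bounding box of X.
        ref_logs = [
            np.log(KMeans(n_clusters=k).fit(
                rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_logs) - log_wk)
    return int(np.argmax(gaps)) + 1  # simplest rule: K with the largest gap
```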