Better text documents clustering than tf/idf and cosine similarity?

machine-learning data-mining cluster-analysis text-mining

Jack Twain · Jul 9, 2013 · Viewed 10.9k times · Source

I'm trying to cluster the Twitter stream: I want to put each tweet into a cluster of tweets that talk about the same topic. I tried an online clustering algorithm with tf/idf and cosine similarity, but the results are quite bad.

The main disadvantage of tf/idf is that it clusters documents that are keyword-similar, so it is only good at identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now, using tf/idf, the clustering algorithm will fail miserably because the sentences share only one keyword, even though they both talk about the same topic.
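To make the two comparisons above concrete, here is a minimal sketch of tf/idf plus cosine similarity on the example sentences. It is only an illustration under assumptions: the tokenizer is a naive whitespace split, no stop words are removed, and the smoothed idf formula `log((1 + n) / (1 + df)) + 1` is one common variant, not necessarily the one a given library or my own pipeline uses.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): split on whitespace, strip basic punctuation, lowercase.
    tokens = (w.strip(".,!?").lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    # Build one sparse {term: weight} dict per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf; other variants exist.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
vecs = tfidf_vectors(docs)
sim_shared_keywords = cosine(vecs[0], vecs[1])  # first pair of sentences
sim_same_topic = cosine(vecs[0], vecs[2])       # second pair of sentences

print(f"many shared keywords: {sim_shared_keywords:.3f}")
print(f"one shared keyword:   {sim_same_topic:.3f}")
```

The second pair scores far lower than the first, because the only overlapping term is "Stackoverflow", so any threshold that groups the first pair will split the second.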

My question: are there better techniques to cluster documents?

Answer

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
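As a rough illustration of the idea (not a tuned setup), here is a minimal sketch of LSA using a plain truncated SVD from NumPy. The assumptions are mine: raw term counts instead of tf-idf weights, a naive tokenizer, the three sentences from the question as the whole corpus, and `k=2` latent dimensions chosen arbitrarily. On a corpus this small the numbers mean little; LSA only starts to pay off when it can learn term co-occurrence from many documents.

```python
import numpy as np


def term_document_counts(docs):
    # Naive tokenizer (an assumption): whitespace split, strip punctuation, lowercase.
    tokenized = [[w.strip(".,!?").lower() for w in d.split()] for d in docs]
    tokenized = [[t for t in toks if t] for toks in tokenized]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    # Rows are documents, columns are vocabulary terms.
    counts = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenized):
        for t in toks:
            counts[row, index[t]] += 1
    return counts, vocab


def lsa_embed(counts, k):
    # Truncated SVD: counts ~ U_k S_k V_k^T; rows of U_k S_k are the reduced document vectors.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine_matrix(x):
    # Pairwise cosine similarity between the rows of x.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
counts, vocab = term_document_counts(docs)
doc_vectors = lsa_embed(counts, k=2)  # k=2 is an arbitrary choice for this toy corpus
similarities = cosine_matrix(doc_vectors)

print(np.round(similarities, 2))
```

In practice you would more likely feed a tf-idf matrix into a library implementation, for example scikit-learn's `TfidfVectorizer` followed by `TruncatedSVD`, and then run your clustering algorithm with cosine similarity on the reduced vectors.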

Topic models such as LDA might work even better.
