How can i cluster document using k-means (Flann with python)?

Phyo Arkar Lwin picture Phyo Arkar Lwin · Sep 19, 2012 · Viewed 13.3k times · Source

I want to cluster documents based on similarity.

I haved tried ssdeep (similarity hashing), very fast but i was told that k-means is faster and flann is fastest of all implementations, and more accurate so i am trying flann with python bindings but i can't find any example how to do it on text (it only support array of numbers).

I am very very new to this field (k-means, natural language processing). What i need is speed and accuracy.

My questions are:

  1. Can we do document similarity grouping / Clustering using KMeans (Flann do not allow any text input it seems )
  2. Is Flann the right choice? If not please suggest me High performance library that support text/docs clustering, that have python wrapper/API.
  3. Is k-means the right algorithm?

Answer

dhg picture dhg · Sep 19, 2012

You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent is as a vector of word counts.

So here's what you do:

  1. Count up the number of times each word appears in the document.

  2. Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.

  3. Make a vector for each document based on the counts of the feature words.

Here's an example.

If your "documents" are single sentences, and they look like (one doc per line):

there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog

If my set of feature words are [dog, cat, street, pizza, lunch], then I can convert each document into a vector:

[1, 1, 0, 0, 0]  // dog 1 time, cat 1 time
[0, 0, 0, 1, 1]  // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0]  // dog 2 times, cat 1 time, street 1 time

You can use these vectors in your k-means algorithm and it will hopefully group the first and third sentence together because they are similar, and make the second sentence a separate cluster since it is very different.