I want to cluster documents based on similarity.
I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster and that FLANN is the fastest (and more accurate) of all implementations, so I am trying FLANN with its Python bindings. However, I can't find any example of how to use it on text (it only supports arrays of numbers).
I am very new to this field (k-means, natural language processing). What I need is speed and accuracy.
My questions are:
You need to represent your document as an array of numbers (a.k.a. a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.
So here's what you do:
1. Count up the number of times each word appears in the document.
2. Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (a.k.a. "stopwords") like "the", "a", etc.
3. Make a vector for each document based on the counts of the feature words.
Here's an example.
If your "documents" are single sentences and they look like this (one doc per line):
there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog
If my set of feature words is [dog, cat, street, pizza, lunch], then I can convert each document into a vector:
[1, 1, 0, 0, 0] // dog 1 time, cat 1 time
[0, 0, 0, 1, 1] // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0] // dog 2 times, cat 1 time, street 1 time
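If it helps, here is a minimal sketch of that conversion in Python. It reuses the example sentences and feature words from above; the helper name to_vector is just for illustration.

from collections import Counter

feature_words = ["dog", "cat", "street", "pizza", "lunch"]

documents = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

def to_vector(doc, features):
    """Count how often each feature word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in features]

vectors = [to_vector(doc, feature_words) for doc in documents]
print(vectors)  # [[1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [2, 1, 1, 0, 0]]

Real documents would need better tokenization (punctuation, stemming, stopword removal), but the idea is the same.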
You can use these vectors in your k-means algorithm, and it will hopefully group the first and third sentences together because they are similar, and put the second sentence in a separate cluster since it is very different.
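As a sketch of that clustering step, here is the same set of vectors fed to SciPy's kmeans2. I'm using SciPy rather than FLANN only because it is the shortest way to show the idea; any k-means implementation that accepts numeric arrays (FLANN included) is used the same way.

import numpy as np
from scipy.cluster.vq import kmeans2

# The count vectors from the example above, one row per document.
vectors = np.array([
    [1, 1, 0, 0, 0],   # "there is a dog who chased a cat"
    [0, 0, 0, 1, 1],   # "someone ate pizza for lunch"
    [2, 1, 1, 0, 0],   # "the dog and a cat walk down the street toward another dog"
], dtype=float)

# Cluster into k=2 groups; minit="points" picks initial centroids from the data.
centroids, labels = kmeans2(vectors, 2, minit="points")
print(labels)  # e.g. [0, 1, 0]: docs 1 and 3 share a cluster, doc 2 stands alone
               # (the cluster ids themselves are arbitrary and may be swapped)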