To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".
My question: Do you have experience with these algorithms? Which one gives you better results?
In addition to that: Could you tell me how to code the Cosine similarity in PHP? For Hamming distance, I've already got the code:
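function check($terms1, $terms2) {
    // Count how often each term occurs in the first document.
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    // For every term of the second document, add its frequency in the first.
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    // Normalize by both document lengths; 500 is the scaling constant used here.
    return $totalScore * 500 / (count($terms1) * count($terms2));
}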
I don't want to use any other algorithm. I would just like help deciding between the two.
And maybe someone can say something about how to improve the algorithms. Do you get better results if you filter out stop words or common words?
I hope you can help me. Thanks in advance!
A Hamming distance should be computed between two strings of equal length, and it takes the order of the elements into account.
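To make that constraint concrete, here is a minimal sketch of a Hamming distance over two word arrays. The function name `hammingDistance` is hypothetical and not part of the question's code; it simply counts the positions at which the two sequences differ and refuses input of unequal length.

```php
<?php
// Hypothetical helper for illustration only: counts the positions
// at which two equally long token sequences differ.
function hammingDistance(array $tokensA, array $tokensB): int
{
    if (count($tokensA) !== count($tokensB)) {
        throw new InvalidArgumentException('Hamming distance needs sequences of equal length.');
    }

    // Re-index both arrays so that positions line up as 0, 1, 2, ...
    $tokensA = array_values($tokensA);
    $tokensB = array_values($tokensB);

    $distance = 0;
    foreach ($tokensA as $i => $token) {
        if ($token !== $tokensB[$i]) {
            $distance++; // mismatch at position $i
        }
    }
    return $distance;
}
```

For example, `hammingDistance(explode(' ', 'the cat sat'), explode(' ', 'the dog sat'))` returns 1, while two documents with different word counts cannot be compared at all.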
Since your documents almost certainly differ in length, and if word positions do not matter, cosine similarity is the better choice (note that, depending on your needs, better solutions exist). :)
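Here is a cosine similarity function for 2 arrays of words:
function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    // Vocabulary: every distinct token from either document.
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // Use the tokens as array keys so that isset() gives a fast membership test.
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0; // 1 if the token occurs in A
        $y = isset($uniqueTokensB[$token]) ? 1 : 0; // 1 if the token occurs in B
        $a += $x * $y; // dot product of the two binary vectors
        $b += $x;      // squared norm of A (x is 0 or 1, so x * x == x)
        $c += $y;      // squared norm of B
    }

    // Cosine = dot product / (|A| * |B|); 0 if either document is empty.
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}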
It is fast: it uses isset() rather than in_array(), which is a killer on large arrays.
As you can see, the result does not take into account the "magnitude" of each word: a word counts once, no matter how often it occurs.
I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)
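If the term frequencies mentioned in the question should count, the same idea can be applied to frequency vectors instead of binary ones. The following is only a sketch under that assumption; the name `cosineSimilarityTf` is hypothetical, and `array_count_values()` is used to build the frequency vectors, as in the question's own code.

```php
<?php
// Hypothetical variant for illustration: cosine similarity over
// term-frequency vectors rather than binary presence vectors.
function cosineSimilarityTf(array $tokensA, array $tokensB): float
{
    // Term frequencies: token => number of occurrences.
    $countsA = array_count_values($tokensA);
    $countsB = array_count_values($tokensB);

    // Dot product: only tokens present in both documents contribute.
    $dot = 0;
    foreach ($countsA as $token => $count) {
        if (isset($countsB[$token])) {
            $dot += $count * $countsB[$token];
        }
    }

    // Squared Euclidean norms of both frequency vectors.
    $normA = 0;
    foreach ($countsA as $count) {
        $normA += $count * $count;
    }
    $normB = 0;
    foreach ($countsB as $count) {
        $normB += $count * $count;
    }

    // Guard against empty documents to avoid a division by zero.
    if ($normA === 0 || $normB === 0) {
        return 0.0;
    }
    return $dot / sqrt($normA * $normB);
}
```

With this variant, a word that is repeated many times in both documents pulls the score up more than a word that appears once, which may or may not be desirable for detecting near-duplicates.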
The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Further reading:
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298