String similarity in PHP: levenshtein like function for long strings

anon picture anon · Feb 23, 2011 · Viewed 9.6k times · Source

The function levenshtein in PHP works on strings with maximum length 255. What are good alternatives to compute a similarity score of sentences in PHP.

Basically I have a database of sentences, and I want to find approximate duplicates. similar_text function is not giving me expected results. What is the easiest way for me to detect similar sentences like below:

$ss="Jack is a very nice boy, isn't he?";
$pp="jack is a very nice boy is he";

$ss=strtolower($ss);  // convert to lower case as we dont care about case
$pp=strtolower($pp);

$score=similar_text($ss, $pp);
echo "$score %\n";  // Outputs just 29 %

$score=levenshtein ( $ss, $pp );
echo "$score\n";  // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(

Answer

Ferdinand Beyer picture Ferdinand Beyer · Feb 23, 2011

The levenshtein algorithm has a time complexity of O(n*m), where n and m are the lengths of the two input strings. This is pretty expensive and computing such a distance for long strings will take a long time.

For whole sentences, you might want to use a diff algorithm instead, see for example: Highlight the difference between two strings in PHP

Having said this, PHP also provides the similar_text function which has an even worse complexity (O(max(n,m)**3)) but seems to work on longer strings.