String similarity -> Levenshtein distance

Fede Lerner picture Fede Lerner · Jul 26, 2012 · Viewed 10k times · Source

I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't find the following examples as similar:

CONAIR
AIRCON

The algorithm will give a distance of 6. So for this word of 6 letters (You look at the word with the highest amount of letters), the difference is of 100% => the similarity is 0%.

I need to find a way to find the similarities between two string, but also taking into consideration cases like the one I presented before.

Is there a better algorithm I can use? Or what do you guys recommend me?

EDIT: I've also looked into the "Damerau–Levenshtein" algorithm, which adds transpositions. The problem is that this transpositions are only for adjacent characters (and not for a number of characters).

Answer

maniek picture maniek · Jul 26, 2012

I would divide the term into unigrams, bigrams and trigrams, then calculate cosine similarity.