Percentage rank of matches using Levenshtein Distance matching

user1368587 picture user1368587 · May 2, 2012 · Viewed 18k times · Source

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as number of operations required to convert the search string into the matched string. I want to present the results in ranked percentage list of top "N" (say 10) matches.

Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate logic for expressing the distance as a percentage, which would qualitatively refelct how close "as a percentage" is each result to the query string, with 100% indicating an exact match.

I considered the following options:

Q = query string
M = matched string
PM = Percentage Match
Option 1. PMi = (1 - Lev_distance(Q, Mi)/Strlen(Q)) * 100
Option 2. PMi = (1 - Lev_distance(Q, Mi)/max(Strlen(Q), strlen(Mi))) * 100

Option 1 has the possibility of negative percentages in case the distance is greater than the search string length, where the match string is long. For example query "ABC" matched with "ABC Corp." would result in a negative match percent.

Option 2 does not appear to give a consistent percentage across a set of Mi, as each calculation would possibly use a different denominator and hence the resulting percentage values would not be normalized.

Only other way I can think of is to ditch the comparison of the lev_distance to either string lenghts, but instead present the comparitive distances of the top "N" matches as an inverse percentile rank (100-percentile-rank).

Any thoughts? Are there better approaches? I must be missing something as Levenshtein distance is probably the most common algorithm for fuzzy matches and this must be a very common problem.

Answer

Celio Camargo Junior picture Celio Camargo Junior · Feb 11, 2014

I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.

int levDis = Lev_distance(Q, Mi)
int bigger = max(strlen(Q), strlen(Mi))
double pct = (bigger - levDis) / bigger

It should return 100% if both strings are exactly the same and 0% if they are totaly different.

(sorry if my english isn't that good)