Say I have the following two strings in my database:
(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'
My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).
For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).
At first, I thought of using a well-known string metric like Damerau-Levenshtein distance or trigrams, but this leads to unwanted results, as you can see here:
http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University
Difference to (1): 37
Difference to (2): 14
(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.
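To reproduce the effect locally, here is a minimal sketch using plain Levenshtein distance (without the Damerau transposition step, so the number for (2) may differ slightly from the site's). The names `levenshtein`, `QUERY` and `CANDIDATES` are just ones I picked for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


QUERY = "Alabama University"
CANDIDATES = [
    "Levi Watkins Learning Center - Alabama State University",  # (1)
    "ETH Library",                                              # (2)
]

for candidate in CANDIDATES:
    print(levenshtein(QUERY, candidate), candidate)

# The candidate with the smallest edit distance is the short one, (2).
best = min(CANDIDATES, key=lambda c: levenshtein(QUERY, c))
print("best match:", best)
```

Every character of (1) that is not part of the query has to be inserted, so its length alone costs 37 edits, while the distance to (2) can never exceed the length of the longer of the two strings (18).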
I also tried trigrams (using the JavaScript library fuzzySet), but I got similar results there.
Is there a string metric that would recognize the similarity of the search string to (1)?
You could try the Word Mover's Distance (https://github.com/mkusner/wmd) instead. One advantage of this algorithm is that it works on whole words rather than characters and takes their meaning into account: the difference between two documents is computed from distances between word embeddings. The paper can be found here
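To give a feel for the idea, here is a rough sketch of a one-sided relaxed variant of it: each query word simply moves all of its weight to the nearest word in the candidate. This is only a lower bound on the full Word Mover's Distance, not the optimal-transport computation from the repository above. The vectors in `TOY_VECTORS` are made up for illustration; in practice you would load trained embeddings such as word2vec. The helper names `tokenize` and `relaxed_wmd` are mine.

```python
import math
import re
from collections import Counter

# Made-up 2-D "embeddings", for illustration only. Real use needs trained vectors.
TOY_VECTORS = {
    "alabama": (1.0, 0.0), "state": (0.8, 0.3), "university": (0.0, 1.0),
    "library": (0.3, 0.9), "eth": (0.5, 0.5), "learning": (0.2, 0.8),
    "center": (0.4, 0.4), "levi": (0.9, 0.9), "watkins": (0.7, 0.7),
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def relaxed_wmd(query, candidate, vectors):
    """One-sided relaxed WMD: every query word travels to its nearest candidate word."""
    q_tokens = [t for t in tokenize(query) if t in vectors]
    c_tokens = [t for t in tokenize(candidate) if t in vectors]
    if not q_tokens or not c_tokens:
        return math.inf  # nothing to compare
    weights = Counter(q_tokens)
    total = sum(weights.values())
    cost = 0.0
    for word, count in weights.items():
        nearest = min(math.dist(vectors[word], vectors[c]) for c in c_tokens)
        cost += (count / total) * nearest  # normalized bag-of-words weight
    return cost


query = "Alabama University"
candidates = [
    "Levi Watkins Learning Center - Alabama State University",  # (1)
    "ETH Library",                                              # (2)
]
for c in candidates:
    print(round(relaxed_wmd(query, c, TOY_VECTORS), 3), c)
```

Because both query words occur verbatim in (1), their travel cost there is zero regardless of the embeddings, so (1) ranks first, and the extra words in (1) are not penalized in this query-to-candidate direction.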