String similarity metrics in Python

agiliq · Sep 24, 2009 · Viewed 46.2k times

I want to measure the similarity between two strings. This page has examples of some similarity metrics. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints?

  1. I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
  2. False negatives are acceptable; false positives are not, except in extremely rare cases.
  3. This is done in a non-real-time setting, so speed is not much of a concern.
  4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

Answer

Nadia Alramli · Sep 24, 2009

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095
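
For context, the difflib documentation defines ratio() as 2.0*M/T, where T is the total number of elements in both sequences and M is the number of matching elements. Here the lowercased strings share a 20-character block and have 21 + 20 = 41 characters in total, so the ratio is 40/41. A minimal sketch that recomputes the value from the matching blocks (the helper name ratio_by_hand is made up for illustration and is not part of difflib):

```python
import difflib

def ratio_by_hand(s1, s2):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(a=s1, b=s2)
    # M: summed length of all matching blocks
    # (the final block is a zero-length sentinel and adds nothing)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of elements in both sequences
    total = len(s1) + len(s2)
    # difflib itself returns 1.0 when both sequences are empty
    return 2.0 * matches / total if total else 1.0

a = 'Hello, All you people'.lower()
b = 'hello, all You peopl'.lower()

# The hand-computed value agrees with the library's ratio()
print(ratio_by_hand(a, b) == difflib.SequenceMatcher(a=a, b=b).ratio())
```

Note that this is a measure based on matching subsequences, not an edit distance, which is why it is "not the same thing" as Levenshtein.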

You can wrap this in a function:

def similar(seq1, seq2):
    # Case-insensitive comparison; a ratio above 0.9 counts as a match
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
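
Since the question tolerates false negatives but not false positives, the cutoff is the main knob to tune. The following is only a sketch under assumptions: the function name similar_with_threshold, the threshold parameter, and the whitespace normalization are additions for illustration and are not part of the answer above.

```python
import difflib

def similar_with_threshold(seq1, seq2, threshold=0.9):
    """Fuzzy match with an adjustable cutoff (hypothetical variant of similar)."""
    # Ignore case and collapse runs of whitespace before comparing
    norm1 = ' '.join(seq1.lower().split())
    norm2 = ' '.join(seq2.lower().split())
    return difflib.SequenceMatcher(a=norm1, b=norm2).ratio() > threshold

# Ratio is 40/41 (about 0.976), so it passes the default cutoff of 0.9
print(similar_with_threshold('Hello, All you people', 'hello, all You peopl'))
# A stricter cutoff of 0.98 rejects the same pair
print(similar_with_threshold('Hello, All you people', 'hello, all You peopl', 0.98))
```

Raising the threshold makes false positives rarer at the cost of more false negatives, which matches the trade-off stated in the question. A suitable value depends on the data and would need to be checked against real examples.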