Fuzzy string matching in Python

BernardL · Aug 16, 2016 · Viewed 11k times

I have two lists, each with over a million names, that follow slightly different naming conventions. The goal here is to match the records that are similar, using a 95% confidence threshold.

I am aware that there are libraries I can use for this, such as the FuzzyWuzzy module in Python.

However, comparing every string in one list against every string in the other seems far too expensive: in this case it would take roughly a million times a million comparisons.

Are there any other more efficient methods for this problem?

UPDATE:

So I created a bucketing function and applied a simple normalization: removing whitespace and symbols, converting the values to lowercase, and so on.

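from fuzzywuzzy import process
from tqdm import tqdm

# dftest is a pandas DataFrame with 'Name' and 'YM' columns
for n in list(dftest['YM'].unique()):
    n = str(n)
    # all names that share this year-month bucket
    frame = dftest['Name'][dftest['YM'] == n]
    print(len(frame))
    print(n)
    for names in tqdm(frame):
        # best fuzzy match for this name within the same bucket
        closest = process.extractOne(names, frame)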

Using Python's pandas, the data is loaded into smaller buckets grouped by year and month of birth, and then FuzzyWuzzy's process.extractOne is used to get the best match.

The results are still somewhat disappointing. In testing, the code above was run on a test data frame containing only 5 thousand names and took almost a whole hour.

The test data is split up by:

  • Name
  • Year Month of Date of Birth

I am comparing names only against other names whose YM falls in the same bucket.

Could the problem be because of the FuzzyWuzzy module I am using? Appreciate any help.

Answer

DhruvPathak · Aug 16, 2016

There are several levels of optimization possible here that bring this problem down from O(n^2) to a lower time complexity.

  • Preprocessing: Sort your list in the first pass, building an output map in which the key for each string is its normalized form. Normalizations may include:

    • lowercase conversion,
    • removal of whitespace and special characters,
    • transformation of Unicode to ASCII equivalents where possible (use unicodedata.normalize or the unidecode module).

    This would result in "Andrew H Smith", "andrew h. smith", "ándréw h. smith" generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similar grouped names.

You can use this utility method to normalize your string (it does not include the Unicode part, though):

import re


def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=None):
    """ Processes a string for similarity comparisons: removes the substrings in ignore_list
        and, if normalized is True, cleans special characters and whitespace
    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True, the method removes special characters and whitespace
                            from the string and converts it to lowercase
        ignore_list (list) : the substrings (treated as regex patterns) to be removed
                             from the input string
    Returns:
       str : the processed string
    """
    # None instead of a mutable [] default, which would be shared across calls
    for ignore_str in ignore_list or []:
        input_str = re.sub(ignore_str, "", input_str, flags=re.IGNORECASE)

    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and whitespace
        input_str = re.sub(r"\W", "", input_str).strip()

    return input_str
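The Unicode step that the function above leaves out, together with the grouping of names under their normalized key, could look roughly like the sketch below (Python 3, standard library only). The helper names ascii_fold, make_key and group_by_key are made up for illustration, and decomposing with NFKD and then dropping combining marks is an assumption that works for accented Latin letters but will not handle every script.

```python
import re
import unicodedata
from collections import defaultdict


def ascii_fold(text):
    """Replace accented letters with their base letters where possible."""
    # NFKD splits e.g. "á" into "a" plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_key(name):
    """Build the normalized key: accent-folded, lowercase, no non-word characters."""
    return re.sub(r"\W", "", ascii_fold(name).lower())


def group_by_key(names):
    """Map each normalized key to the list of original names that produce it."""
    groups = defaultdict(list)
    for name in names:
        groups[make_key(name)].append(name)
    return dict(groups)


names = ["Andrew H Smith", "andrew h. smith", "ándréw h. smith", "Andrew H. Smeeth"]
groups = group_by_key(names)
```

Here the first three names end up under the key andrewhsmith, so only two keys (andrewhsmith and andrewhsmeeth) are left for the more expensive fuzzy comparison.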
  • Similar strings will now already lie in the same bucket if their normalized keys are the same.

  • For further comparison, you need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since names that are similar in this way need fuzzy string matching on top of the normalized comparison done above.

  • Bucketing: Do you really need to compare a 5-character key with a 9-character key to see whether they are a 95% match? No, you do not. So you can bucket your strings by length, e.g. 5-character names are matched against 4-6 character names, 6-character names against 5-7 character names, and so on. A limit of n-1 to n+1 characters for an n-character key is a reasonably good bucket for most practical matching.

  • Beginning match: Most variations of a name will have the same first character in normalized form (e.g. Andrew H Smith, ándréw h. smith and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith and andrewhsmeeth respectively). They will usually not differ in the first character, so you can match keys starting with a only against other keys that start with a and fall within the length buckets. This greatly reduces your matching time. There is no need to match the key andrewhsmith against bndrewhsmith, as a name variation in the first letter will rarely exist.
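The length and first-character buckets could be combined into a single index, roughly as in the sketch below. The names build_blocks and candidates are hypothetical, and indexing on the pair (first character, length) is just one way to arrange it, not the only one.

```python
from collections import defaultdict


def build_blocks(keys):
    """Index keys by (first character, length) so each lookup stays small."""
    blocks = defaultdict(list)
    for key in keys:
        if key:  # skip empty keys, they have no first character
            blocks[(key[0], len(key))].append(key)
    return blocks


def candidates(key, blocks):
    """Yield keys with the same first character and a length within one of len(key)."""
    for length in (len(key) - 1, len(key), len(key) + 1):
        for other in blocks.get((key[0], length), []):
            if other != key:
                yield other


keys = ["andrewhsmith", "andrewhsmeeth", "andrewsmith", "bndrewhsmith", "annlee"]
blocks = build_blocks(keys)
found = sorted(candidates("andrewhsmith", blocks))
```

For andrewhsmith this leaves only andrewhsmeeth and andrewsmith to score; bndrewhsmith and annlee are never compared. When matching two separate lists, you would build the index from one list and look up the keys of the other.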

Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage. You may drop either jaro_winkler or difflib to optimize for speed and result quality:

import difflib

import jellyfish


def find_string_similarity(first_str, second_str, normalized=False, ignore_list=None):
    """ Calculates matching ratio between two strings
    Args:
        first_str (str) : First String
        second_str (str) : Second String
        normalized (bool) : if True, the method removes special characters and whitespace
                            from the strings, then calculates the matching ratio
        ignore_list (list) : substrings to be replaced with "" in both strings
    Returns:
       Float Value : Returns a matching ratio between 1.0 ( most matching ) and 0.0 ( not matching )
                    using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                    equal weightage to each
    Examples:
        >>> find_string_similarity("hello world","Hello,World!",normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship","entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])
        1.0
    """
    # None instead of a mutable [] default, which would be shared across calls
    ignore_list = ignore_list or []
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    difflib_ratio = difflib.SequenceMatcher(None, first_str, second_str).ratio()
    # jellyfish expects text (str) input; recent releases name this function
    # jaro_winkler_similarity, older ones jaro_winkler
    jaro_winkler_ratio = jellyfish.jaro_winkler_similarity(first_str, second_str)
    # equal weight to both scores
    return (difflib_ratio + jaro_winkler_ratio) / 2.0
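If you keep only the difflib half of the score for speed, one possible shortcut is to reject most pairs before the full comparison runs. The sketch below is not the function above: is_match is a hypothetical helper that uses SequenceMatcher alone, relying on its real_quick_ratio() and quick_ratio() methods, which the standard library documents as upper bounds on ratio(). Note that a 0.95 threshold on difflib alone is stricter than 0.95 on the averaged score, so the cut-off would need tuning on your own data.

```python
import difflib


def is_match(first_key, second_key, threshold=0.95):
    """Return True if difflib's ratio for the two keys reaches the threshold."""
    matcher = difflib.SequenceMatcher(None, first_key, second_key)
    # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
    # so a pair that fails either one cannot pass the full comparison
    if matcher.real_quick_ratio() < threshold:
        return False
    if matcher.quick_ratio() < threshold:
        return False
    return matcher.ratio() >= threshold
```

You would call this only on the pairs that survive the length and first-letter buckets, for example is_match("andrewhsmith", "andrewhsmeeth", threshold=0.85).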