Fuzzy Wuzzy String Matching on 2 Large Data Sets Based on a Condition - python

Nirav · Mar 17, 2017 · Viewed 11.7k times

I have 2 large data sets that I have read into Pandas DataFrames (~20K rows and ~40K rows respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of rows. So I thought I would try fuzzy string matching to see if it improves the number of output matches.

I approached this by trying to create a new column in DF1 (20K rows) that was the result of applying the fuzzywuzzy extractOne function on DF1[addressline] to DF2[addressline]. I quickly realized that this would take forever, since it would be doing on the order of 800 million comparisons (20K × 40K).

Both of these datasets have "County" fields, and my question is this: is there a way to conditionally do a fuzzy string match on the "addressline" fields in both DFs, restricted to rows where the "county" fields are the same? Researching questions similar to mine, I stumbled upon this discussion: Fuzzy logic on big datasets using Python

However, I am still fuzzy (no pun intended) on how to go about grouping/blocking fields based on county. Any advice would be greatly appreciated!
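To make the cost of the unblocked approach concrete, here is a minimal sketch that counts pairwise comparisons with and without blocking on a shared key. The county labels and group sizes are made up for illustration, and the two helper functions are hypothetical names, not part of pandas or fuzzywuzzy.

```python
from collections import Counter

def comparisons_unblocked(left_keys, right_keys):
    # Every left row is compared against every right row.
    return len(left_keys) * len(right_keys)

def comparisons_blocked(left_keys, right_keys):
    # Each left row is compared only against right rows with the same key.
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(n * right_counts[key] for key, n in left_counts.items())

# Hypothetical county labels, one per row of each DataFrame.
left = ['A'] * 3 + ['B'] * 2 + ['C'] * 1
right = ['A'] * 4 + ['B'] * 1 + ['D'] * 5

full = comparisons_unblocked(left, right)   # 6 * 10 = 60
blocked = comparisons_blocked(left, right)  # 3*4 + 2*1 + 1*0 = 14

# The sizes from the question, without any blocking.
question_total = 20_000 * 40_000            # 800,000,000
```

How much blocking saves on the real data depends on how evenly the rows are spread across counties, which the question does not state.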

import pandas as pd
from fuzzywuzzy import fuzz, process  # fuzz supplies the scorer used below

def fuzzy_match(x, choices, scorer, cutoff):
    # extractOne returns None when nothing scores at or above the cutoff
    match = process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)
    return match[0] if match else None

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'], 'ID': ['X', 'U', 'X', 'Y']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'], 'ID': ['X', 'U', 'X', 'Y']})
# Normalize case before comparing
test['Address1'] = test['Address1'].str.lower()
test2['Address1'] = test2['Address1'].str.lower()
test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args=(test2['Address1'], fuzz.ratio, 80))

I've added 2 images that are sample sets of the 2 different DFs imported into Excel. Not all the fields have been included since they aren't important to my question. To reiterate my end goal: I want a new column in one of the DFs that holds the top result from fuzzy matching an address line against the address lines in the 2nd DF, but only for those lines where the counties match between both DFs. From there I plan to merge the two DFs, joining the fuzzy-matched address in the first DF to the address line column in the 2nd DF. Hopefully this doesn't sound confusing.
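For the final merge step described above, a minimal sketch might look like the following. It assumes the fuzzy-matched column already exists, with None where nothing cleared the cutoff. The sample rows and the Owner column are invented for illustration, and ID stands in for the county field as in the code above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['X', 'U', 'Y'],
    'Address1': ['123 cheese way', '234 cookie place', '999 nowhere road'],
    # Assumed output of an earlier fuzzy-matching step; None means no match.
    'FuzzyAddress1': ['123 chese wy', '234 kookie pl', None],
})
df2 = pd.DataFrame({
    'ID': ['X', 'U', 'Y'],
    'Address1': ['123 chese wy', '234 kookie pl', '456 pretzel junktion'],
    'Owner': ['Ann', 'Bob', 'Cy'],  # hypothetical extra column to pull across
})

# Join on the blocking key plus the fuzzy-matched address. A left join keeps
# unmatched df1 rows, and indicator=True adds a _merge column that flags them.
merged = df1.merge(
    df2,
    how='left',
    left_on=['ID', 'FuzzyAddress1'],
    right_on=['ID', 'Address1'],
    suffixes=('', '_df2'),
    indicator=True,
)
```

Rows whose _merge value is left_only are the ones that found no fuzzy match, which makes them easy to pull out for manual review.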

Answer

maxymoo · Mar 17, 2017

You could adapt your fuzzy_match function to take the ID as a variable and use it to subset your choices before doing the fuzzy search. Note that this requires applying the function over the whole DataFrame, row by row, rather than just over the address column.

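from fuzzywuzzy import fuzz, process

def fuzzy_match(x, choices, scorer, cutoff):
    # x is one row of test; keep only the rows of choices that share its ID
    candidates = choices.loc[choices['ID'] == x['ID'], 'Address1']
    match = process.extractOne(x['Address1'],
                               choices=candidates,
                               scorer=scorer,
                               score_cutoff=cutoff)
    # extractOne returns None when nothing reaches the cutoff
    return match[0] if match else None

# axis=1 passes each row of test (not each column) to fuzzy_match
test['FuzzyAddress1'] = test.apply(fuzzy_match,
                                   args=(test2, fuzz.ratio, 80),
                                   axis=1)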
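The function above filters the second DataFrame again for every row of the first. One possible variation is to build the per-key candidate lists once and look them up by key. The sketch below shows that idea using only the standard library: difflib.SequenceMatcher is swapped in for fuzz.ratio (the scores are similar in spirit but not guaranteed to be identical), and build_blocks and best_match are hypothetical helper names.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def build_blocks(records, key, field):
    # Group the values of `field` by the blocking `key`, once, up front.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec[field])
    return blocks

def best_match(query, candidates, cutoff=80):
    # Return the highest-scoring candidate at or above `cutoff`, else None.
    best, best_score = None, -1.0
    for cand in candidates:
        score = 100 * SequenceMatcher(None, query, cand).ratio()
        if score >= cutoff and score > best_score:
            best, best_score = cand, score
    return best

# Lower-cased sample rows from the question; ID stands in for county.
left = [
    {'Address1': '123 cheese way', 'ID': 'X'},
    {'Address1': '234 cookie place', 'ID': 'U'},
    {'Address1': '345 pizza drive', 'ID': 'X'},
    {'Address1': '456 pretzel junction', 'ID': 'Y'},
]
right = [
    {'Address1': '123 chese wy', 'ID': 'X'},
    {'Address1': '234 kookie pl', 'ID': 'U'},
    {'Address1': '345 pizzza dr', 'ID': 'X'},
    {'Address1': '456 pretzel junktion', 'ID': 'Y'},
]

blocks = build_blocks(right, key='ID', field='Address1')

# Each left address is scored only against right addresses with the same ID.
matches = {
    rec['Address1']: best_match(rec['Address1'], blocks.get(rec['ID'], []))
    for rec in left
}
```

With pandas, the same lookup table could be built from the second DataFrame with something like test2.groupby('ID')['Address1'].apply(list).to_dict(), after which fuzzy_match would only need a dictionary lookup per row.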