I have the following problem.
I have a dataframe master that contains sentences, such as:
master
Out[8]:
                  original
0  this is a nice sentence
1      this is another one
2    stackoverflow is nice
For every row in master, I look up the best match in another dataframe slave using fuzzywuzzy. I use fuzzywuzzy because the matched sentences between the two dataframes could differ a bit (extra characters, trailing whitespace, etc.).
For instance, slave could be
slave
Out[10]:
   my_value                       name
0         2                hello world
1         1            congratulations
2         2   this is a nice sentence 
3         3        this is another one
4         1      stackoverflow is nice
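Note the trailing space in slave's third name (see the code below): the matching rows are close but not identical, which is why an exact merge won't work. A minimal check with fuzzywuzzy's token_set_ratio, the same scorer used in the example that follows, shows it shrugs off that kind of difference:
>>> from fuzzywuzzy import fuzz
>>> fuzz.token_set_ratio('this is a nice sentence', 'this is a nice sentence ')
100
>>> fuzz.token_set_ratio('this is a nice sentence', 'hello world')  # unrelated text scores much lower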
Here is a fully-functional, wonderful, compact working example :)
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
import difflib

master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})

slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [2, 1, 2, 3, 1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # use fuzzywuzzy to see how close original and name are
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
    # return the my_value corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'my_value']

master['my_value'] = master.original.apply(lambda x: helper(x, slave))
The million-dollar question is: can I parallelize my apply code above?
After all, every row in master is compared to all the rows in slave (slave is a small dataset and I can hold many copies of the data in RAM).
I don't see why I could not run multiple comparisons (i.e. process multiple rows at the same time).
Problem: I don't know how to do that, or whether that's even possible.
Any help greatly appreciated!
You can parallelize this with Dask.dataframe.
>>> import dask.dataframe as dd
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1
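Depending on your Dask version, the name= keyword above may no longer be accepted; newer releases of dask.dataframe expect a meta= argument on Series.apply describing the output name and dtype. Assuming my_value is an integer column, the equivalent call would look roughly like:
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('my_value', 'int64'))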
Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes will cause data to serialize and move around your machine, which might slow things down a bit.
You can experiment between using threads and processes, or a distributed system, by managing the get= keyword argument to the compute() method.
import dask.multiprocessing
import dask.threaded
>>> dmaster.compute(get=dask.threaded.get) # this is default for dask.dataframe
>>> dmaster.compute(get=dask.multiprocessing.get) # try processes instead
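On recent Dask versions the get= keyword has been replaced by scheduler=, so the same experiment would look roughly like this (argument values taken from the current Dask API; worth double-checking against your installed version):
>>> dmaster.compute(scheduler='threads')      # multi-threaded scheduler
>>> dmaster.compute(scheduler='processes')    # multiprocessing scheduler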