New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby However, that only seems to work for grouped data frames.
My use case is different: I have a list of holidays and for my current row/date want to find the no-of-days before and after this day to the next holiday.
This is the function I call via apply:
def get_nearest_holiday(x, pivot):
nearestHoliday = min(x, key=lambda x: abs(x- pivot))
difference = abs(nearesHoliday - pivot)
return difference / np.timedelta64(1, 'D')
How can I speed it up?
I experimented a bit with pythons pools - but it was neither nice code, nor did I get my computed results.
For the parallel approach this is the answer based on Parallelize apply after pandas groupby:
from joblib import Parallel, delayed
import multiprocessing
def get_nearest_dateParallel(df):
df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
return df
def applyParallel(dfGrouped, func):
retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
return pd.concat(retLst)
print ('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)
but I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays)