I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
User Time Col1 newcol1 newcol2 newcol3 newcol4
0 1 6 [cat, dog, goat] 0 0 0 0
1 1 6 [cat, sheep] 0 0 0 0
2 1 12 [sheep, goat] 0 0 0 0
3 2 3 [cat, lion] 0 0 0 0
4 2 5 [fish, goat, lemur] 0 0 0 0
5 3 9 [cat, dog] 0 0 0 0
6 4 4 [dog, goat] 0 0 0 0
7 4 11 [cat] 0 0 0 0
At the moment I have a function which loops through and calculates values for 'newcol1
' and 'newcol2
' based on whether the 'User
' has changed since the previous row and also whether the difference in the 'Time
' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1
' and 'Col2
' and updates 'newcol3
' and 'newcol4
' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):
def myJFunc(df):
... #initialize jnum counter
... jnum = 0;
... #loop through each row of dataframe (not including the first/zeroeth)
... for i in range(1,len(df)):
... #has user changed?
... if df.User.loc[i] == df.User.loc[i-1]:
... #has time increased by more than 1 (hour)?
... if abs(df.Time.loc[i]-df.Time.loc[i-1])>1:
... #update new columns
... df['newcol2'].loc[i-1] = 1;
... df['newcol1'].loc[i] = 1;
... #increase jnum
... jnum += 1;
... #has content changed?
... if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
... #record this change
... df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]];
... #different user?
... elif df.User.loc[i] != df.User.loc[i-1]:
... #update new columns
... df['newcol1'].loc[i] = 1;
... df['newcol2'].loc[i-1] = 1;
... #store jnum elsewhere (code not included here) and reset jnum
... jnum = 1;
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff
" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)
(Also, my User
column in reality is unicode not int, which could be problematic for speedy comparisons)
I was thinking along the same lines as Andy, just with groupby
added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff
or shift
. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
User Time Col1 time_diff Col1_0 Col1_0_prev
0 1 6 [cat, dog, goat] NaN cat NaN
1 1 6 [cat, sheep] 0 cat cat
2 1 12 [sheep, goat] 6 sheep cat
3 2 3 [cat, lion] NaN cat NaN
4 2 5 [fish, goat, lemur] 2 fish cat
5 3 9 [cat, dog] NaN cat NaN
6 4 4 [dog, goat] NaN dog NaN
7 4 11 [cat] 7 cat dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.