I want to improve the execution time of a groupby in Python pandas. I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a month and add this information to a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the execution time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?
Here's one way to proceed:
Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea as we would be using NumPy functions later on, which are optimized to work with NumPy arrays.
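As a minimal sketch of this slicing step (a made-up toy frame, not the real data, but with the same column names as in the question):
import numpy as np
import pandas as pd

# Toy frame standing in for the real contracts data
toy = pd.DataFrame({'Client': [2, 0, 2],
                    'Month':  [1, 3, 1],
                    'Contrat': [10, 11, 12]})
arr_slice = toy[['Client', 'Month']].values
print(arr_slice.shape, arr_slice.dtype)   # (3, 2) and an integer dtype, e.g. int64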
Convert the data from the two columns ['Client', 'Month'] into a single 1D array, which would be a linear index equivalent of it, considering the elements from the two columns as pairs. Thus, we can assume that the elements from 'Client' represent the row indices, whereas the 'Month' elements are the column indices. This is like going from 2D to 1D. But the issue would be deciding the shape of the 2D grid on which to perform such a mapping. To cover all pairs, one safe assumption is a 2D grid whose dimensions are one more than the max along each column, because of 0-based indexing in Python. Thus, we would get the linear indices.
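To make the linear-index idea concrete, here is a tiny sketch using the same made-up values as the toy slice above; identical (Client, Month) pairs collapse to the same linear index:
import numpy as np

# Made-up (Client, Month) pairs; rows 0 and 2 are the same pair
pairs = np.array([[2, 1],
                  [0, 3],
                  [2, 1]])
# Grid shape is one more than the max along each column -> (3, 4)
lidx = np.ravel_multi_index(pairs.T, pairs.max(0) + 1)
print(lidx)   # [9 3 9] -> duplicate pairs map to the same linear index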
Next up, we tag each linear index based on its uniqueness among the others. I think this would correspond to the keys obtained with groupby. We also need to get the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps the respective count onto each element.
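Continuing the toy example, here is a quick sketch of the tagging and counting step with np.unique, and of mapping the counts back onto every element:
import numpy as np

# Linear indices from the toy example above
lidx = np.array([9, 3, 9])
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
# unq     -> [3 9]      the unique linear indices, i.e. the group keys
# unqtags -> [1 0 1]    which group each element belongs to
# counts  -> [1 2]      size of each group
print(counts[unqtags])  # [2 1 2] -> per-element group sizes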
That's the whole idea! Here's the implementation -
import numpy as np

# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions:
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
    unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results:
In [143]: # Let's create a dataframe of length 10000 with 100 unique IDs
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally, time them:
In [145]: # Let's create a dataframe of length 10000 with 100 unique IDs
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop