Groupby in Python pandas: Fast Way

Náthali · Jul 1, 2016 · Viewed 11.4k times

I want to speed up a groupby in Python pandas. I have this code:

df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

The objective is to count how many contracts a client has in a given month and store this count in a new column (Nbcontrats), as illustrated by the small sketch after the column list below.

  • Client: client code
  • Month: month of data extraction
  • Contrat: contract number
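Here is a small sketch of what I want, on toy data (not my real dataset):

import pandas as pd

# Toy data: client 1 has two contracts in month 1, client 2 has one
df = pd.DataFrame({'Client':  [1, 1, 2],
                   'Month':   [1, 1, 1],
                   'Contrat': [10, 11, 12]})
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
# df["Nbcontrats"] -> [2, 2, 1]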

I want to reduce this time. Below, I am timing it on only a subset of my real data:

%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop

df.shape
Out[309]: (7464, 61)

How can I improve the execution time?

Answer

Divakar · Jul 1, 2016

Here's one way to proceed:

  • Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused step, as the NumPy functions used later on are optimized to work with NumPy arrays.

  • Convert the data from those two columns into a single 1D array of linear indices, treating the elements from the two columns as pairs. That is, treat the 'Client' elements as row indices and the 'Month' elements as column indices of a 2D grid, and map each pair to a single integer, i.e. go from 2D to 1D. The only issue is deciding the shape of that 2D grid. To cover all pairs, a safe choice is a grid whose dimensions are one more than the max along each column, because of 0-based indexing in Python. That gives us the linear indices.

  • Next up, we tag each linear index based on its uniqueness among the others; these tags correspond to the keys that groupby would produce. We also get the count of each unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps each element to its group's count. A tiny worked example of these steps follows.
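To make the last two steps concrete, here is a tiny worked example on made-up (Client, Month) pairs (the numbers are purely illustrative):

import numpy as np

# Three rows of (Client, Month) pairs; the grid shape is max along each column + 1
arr_slice = np.array([[1, 1],
                      [1, 1],
                      [2, 1]])
lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)   # -> [3, 3, 5]

# Tag each linear index by uniqueness and count each unique key
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
# unq -> [3, 5], unqtags -> [0, 0, 1], counts -> [2, 1]

print(counts[unqtags])   # [2 2 1], the per-row group sizes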

That's the whole idea! Here's the implementation -

import numpy as np

# Save the relevant columns as a NumPy array for the NumPy operations that follow
arr_slice = df[['Client', 'Month']].values

# Get the linear index equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)

# Get unique IDs corresponding to each linear index (i.e. group) and the group counts
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)

# Index into counts with the tags to map each element to its group count
df["Nbcontrats"] = counts[unqtags]

Runtime test

1) Define functions:

def original_app(df):
    # Baseline: pandas groupby + transform
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    # NumPy-based approach described above
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]

2) Verify results:

In [143]: # Let's create a dataframe of length 10000 with 100 unique IDs
     ...: arr = np.random.randint(0, 100, (10000, 3))
     ...: df = pd.DataFrame(arr, columns=['Client', 'Month', 'Contrat'])
     ...: df1 = df.copy()
     ...: 
     ...: # Run the two functions on identical inputs
     ...: original_app(df)
     ...: vectorized_app(df1)
     ...: 

In [144]: np.allclose(df["Nbcontrats"], df1["Nbcontrats"])
Out[144]: True

3) Finally, time them:

In [145]: # Let's create a dataframe of length 10000 with 100 unique IDs
     ...: arr = np.random.randint(0, 100, (10000, 3))
     ...: df = pd.DataFrame(arr, columns=['Client', 'Month', 'Contrat'])
     ...: df1 = df.copy()
     ...: 

In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop

In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop