Pandas groupby apply vs transform with specific functions

jpp · Jun 28, 2018 · Viewed 7.6k times · Source

I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.

Here's a minimal example. First let's use groupby + apply with set:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

g = df.groupby(['a', 'b'])['type'].apply(set)

print(g)

a  b
1  1    {0, 1}
2  2    {0, 1}
3  3    {0, 1}
Name: type, dtype: object

This works fine, but I want the resulting set, calculated groupwise, in a new column of the original dataframe. So I try to use transform:

df['g'] = df.groupby(['a', 'b'])['type'].transform(set)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)

TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'

This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:

g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)

print(df)

   a  b  type       g
0  1  1     1  {0, 1}
1  2  2     0  {0, 1}
2  3  3     1  {0, 1}
3  1  1     0  {0, 1}
4  2  2     1  {0, 1}
5  3  3     0  {0, 1}
6  3  3     1  {0, 1}
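(The same explicit mapping can also be written as a merge; a minimal sketch of the equivalent workaround:)

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

# Aggregate each group to a set, then merge the result back on the group keys.
g = df.groupby(['a', 'b'])['type'].apply(set).rename('g').reset_index()
df = df.merge(g, on=['a', 'b'], how='left')  # how='left' preserves the original row order
```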

But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?

Answer

rafaelc · Jul 1, 2018

I believe, first of all, that there is some room for intuition in choosing between these functions, as their names are quite meaningful.

In your first example, you are not actually trying to transform your values, but rather to aggregate them (and aggregation works the way you intended).

But getting into code, the transform docs are quite suggestive in saying that

Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.

When you do

df.groupby(['a', 'b'])['type'].transform(some_func)

you are actually transforming the pd.Series object of each group into a new object using your some_func function. The catch is that this new object must either have the same size as the group or be broadcastable to the size of the group chunk.

Therefore, if you transform your series using tuple or list, you will basically be transforming the object

0    1
1    2
2    3
dtype: int64

into

[1,2,3]

But notice that these values are then assigned back to their respective indexes, which is why you see no difference in the transform operation: the row that held the .iloc[0] value of the pd.Series will now hold the [1, 2, 3][0] value of the transformed list (the same applies to tuple), and so on. Ordering and size matter here, because otherwise you could mess up your groups and the transform would fail (and this is exactly why set is not a proper function to be used in this case: it guarantees neither a fixed size nor an order).
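This can be checked directly on the question's dataframe: transforming with list hands each value back to its original row, so the result reproduces the type column exactly.

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

# list(series) has the same length as the group chunk, so transform accepts it
# and places each element back at its original row position.
out = df.groupby(['a', 'b'])['type'].transform(list)

assert out.tolist() == df['type'].tolist()  # every value lands back where it came from
```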


The second part of the quoted text says "broadcastable to the size of the group chunk".

This means that you can also transform your pd.Series to an object that can be used in all rows. For example

df.groupby(['a', 'b'])['type'].transform(lambda k: 50)

would work. Why? Even though 50 is not iterable, it is broadcastable: the value is simply used repeatedly in all positions of your initial pd.Series.
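A quick check of the broadcasting case on the question's dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

# The scalar 50 is broadcast to every row of each group chunk,
# so the result has one value per original row.
out = df.groupby(['a', 'b'])['type'].transform(lambda k: 50)

assert (out == 50).all() and len(out) == len(df)
```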


Why can you apply using set?

Because the apply method doesn't have this size constraint on the result. It actually allows three different result types, and it infers whether you want to expand, reduce or broadcast your results. Notice that you can't reduce when transforming. From the apply docs:

result_type : {'expand', 'reduce', 'broadcast', None}, default None

These only act when axis=1 (columns):

  1. 'expand' : list-like results will be turned into columns.

  2. 'reduce' : returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.

  3. 'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

By default (result_type=None), the final return type is inferred from the return type of the applied function.
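The three modes can be seen on a plain DataFrame.apply with axis=1 (a minimal illustration, separate from the groupby example above; result_type was added in pandas 0.23):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# 'expand': the returned list-like is turned into columns 0, 1, 2.
expanded = df.apply(lambda row: [row['x'], row['y'], row['x'] + row['y']],
                    axis=1, result_type='expand')
assert expanded.shape == (2, 3)

# 'reduce': the same list-like result is kept as a single Series of objects.
reduced = df.apply(lambda row: [row['x'], row['y']], axis=1, result_type='reduce')
assert reduced.shape == (2,)

# 'broadcast': a scalar result is broadcast back to the original shape,
# retaining the original index and columns.
broadcast = df.apply(lambda row: 0, axis=1, result_type='broadcast')
assert broadcast.shape == df.shape
```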