I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, and reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.
Here's a minimal example. First let's use groupby + apply with set:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b':[1,2,3,1,2,3,3], 'type':[1,0,1,0,1,0,1]})
g = df.groupby(['a', 'b'])['type'].apply(set)
print(g)
a b
1 1 {0, 1}
2 2 {0, 1}
3 3 {0, 1}
This works fine, but I want the resulting set calculated groupwise in a new column of the original dataframe. So I try to use transform:
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:
g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?
I believe, first of all, that there is some room for intuition in using these functions, as their names are quite meaningful.
In your first result, you are not actually trying to transform your values, but rather to aggregate them (which would work in the way you intended).
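For instance, here is a minimal sketch of the aggregation route (assuming a pandas version whose GroupBy.agg accepts an arbitrary callable; the result should match your apply(set) output above):

# Aggregation reduces each (a, b) group to a single value, so returning a set
# per group is fine here, unlike transform, which must preserve the group size.
agg_result = df.groupby(['a', 'b'])['type'].agg(lambda s: set(s))
print(agg_result)
# a  b
# 1  1    {0, 1}
# 2  2    {0, 1}
# 3  3    {0, 1}
# Name: type, dtype: object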
But getting into the code, the transform docs are quite suggestive in saying that:

Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.
When you do

df.groupby(['a', 'b'])['type'].transform(some_func)

you are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the chunk.
Therefore, if you transform your series using tuple or list, you will basically be transforming the object

0    1
1    2
2    3
dtype: int64

into

[1, 2, 3]
But notice that these values are now assigned back to their respective indexes, and that is why you see no difference in the transform operation. The row that had the .iloc[0] value from the pd.Series will now have the [1, 2, 3][0] value from the transformed list (and the same applies to tuple), and so on. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform wouldn't work (and this is exactly why set is not a proper function to use in this case).
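As a quick sanity check, here is a sketch with the same df as above (assuming a reasonably recent pandas): a same-sized result such as a tuple is simply written back onto the original index, so the column looks unchanged.

same = df.groupby(['a', 'b'])['type'].transform(tuple)
# Each group's tuple has the same length as the group, so the values are
# aligned back position by position and nothing appears to change.
print((same == df['type']).all())  # True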
The second part of the quoted text says "broadcastable to the size of the group chunk". This means that you can also transform your pd.Series to an object that can be used in all rows. For example,
df.groupby(['a', 'b'])['type'].transform(lambda k: 50)
would work. Why? Even though 50 is not iterable, it is broadcastable: this value is used repeatedly in all positions of your initial pd.Series.
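The same broadcasting applies to any per-group scalar, not just a constant. A small sketch (the new column names here are just for illustration):

df['fifty'] = df.groupby(['a', 'b'])['type'].transform(lambda k: 50)   # constant, repeated for every row of each group
df['grp_mean'] = df.groupby(['a', 'b'])['type'].transform('mean')      # per-group scalar, also broadcast back
print(df[['a', 'b', 'type', 'fifty', 'grp_mean']])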
Why can you apply using set?
Because the apply method doesn't have this constraint of size on the result. It actually has three different result types, and it infers whether you want to expand, reduce or broadcast your results. Note that you can't reduce when transforming.
By default (result_type=None), the final return type is inferred from the return type of the applied function.
result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None
These only act when axis=1 (columns):
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
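To see those options in action, here is a small sketch using DataFrame.apply with axis=1 (the method the quoted parameter belongs to; it requires pandas >= 0.23):

expanded = df.apply(lambda row: [row['a'], row['type']], axis=1, result_type='expand')
print(expanded.head(3))   # the list-like result is expanded into two new columns (0 and 1)

reduced = df.apply(lambda row: [row['a'], row['type']], axis=1, result_type='reduce')
print(reduced.head(3))    # the same function, but kept as a single column of lists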