I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.
I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas
forces us to hold results for 1000^3 groups.
My question: is there a convenient way to use groupby
with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:
numpy
.groupby
, reverting to categories later.I'm hoping there's a way to modify just this particular pandas
idiosyncrasy. A simple example is below. Instead of 4 categories I want in the output, I end up with 12.
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
df.groupby(group_cols, as_index=False).sum()
Group1 Group2 Group3 Value
# A A A NaN
# A A C NaN
# A A D NaN
# A B A NaN
# A B C 54.34
# A B D 826.74
# B A A 765.40
# B A C 514.50
# B A D NaN
# B B A NaN
# B B C NaN
# B B D NaN
Bounty update
The issue is poorly addressed by pandas development team (cf github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:
Bounty update #2
To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas
library methods so that categories are treated in a way that facilitates groupby
/ set_index
operations.
Since Pandas 0.23.0, the groupby
method can now take a parameter observed
which fixes this issue if it is set to True
(False by default).
Below is the exact same code as in the question with just observed=True
added :
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
df.groupby(group_cols, as_index=False, observed=True).sum()