I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode
them (with value 1
representing a given element existing in a row and 0
in the case of absence).
For example, taking dataframe df
Col1 Col2 Col3
C 33 [Apple, Orange, Banana]
A 2.5 [Apple, Grape]
B 42 [Banana]
I would like to convert this to:
df
Col1 Col2 Apple Orange Banana Grape
C 33 1 1 1 0
A 2.5 1 0 0 1
B 42 0 0 1 0
How can I use pandas/sklearn to achieve this?
We can also use sklearn.preprocessing.MultiLabelBinarizer:
Often we want to use sparse DataFrame for the real world data in order to save a lot of RAM.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df = df.join(
pd.DataFrame.sparse.from_spmatrix(
mlb.fit_transform(df.pop('Col3')),
index=df.index,
columns=mlb.classes_))
result:
In [38]: df
Out[38]:
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
In [39]: df.dtypes
Out[39]:
Col1 object
Col2 float64
Apple Sparse[int32, 0]
Banana Sparse[int32, 0]
Grape Sparse[int32, 0]
Orange Sparse[int32, 0]
dtype: object
In [40]: df.memory_usage()
Out[40]:
Index 128
Col1 24
Col2 24
Apple 16 # <--- NOTE!
Banana 16 # <--- NOTE!
Grape 8 # <--- NOTE!
Orange 8 # <--- NOTE!
dtype: int64
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
columns=mlb.classes_,
index=df.index))
Result:
In [77]: df
Out[77]:
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0