Dear power Pandas experts:
I'm trying to implement a function that flattens a DataFrame column whose elements may be lists. For each row where the designated column holds a list, all the other columns should be duplicated once per list element, and the designated column should hold one value from the list in each resulting row.
The following illustrates my requirements:
input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
   A       B
0  1  [a, b]
1  2       c
expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])
   A  B
0  1  a
0  1  b
1  2  c
I feel that there might be an elegant solution/concept for it, but I'm struggling.
Here is my attempt, which does not work yet.
def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)
In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:
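import pandas

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    # One [index label, value] pair per list element.
    # Note: apply(list) also splits a multi-character string into characters.
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            # .items() replaces .iteritems(), which was removed in pandas 2.0
            for i, y in input[column].apply(list).items()
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    return (
        # drop(columns=...) replaces the positional axis form drop(column, 1)
        input.drop(columns=column)
        .merge(column_flat, left_index=True, right_index=True)
    )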
The only limitation at the moment is that the column order changes: the flattened column ends up rightmost instead of in its original position. That should be feasible to fix.
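One possible way to address the column-order limitation is to reselect the columns in their original order after the merge. The following is only a sketch: the function name flatten_column_keep_order is hypothetical, and it also assumes that non-list cells should be kept whole rather than passed through list(), which is a small change from the function above.

```python
import pandas as pd


def flatten_column_keep_order(df, column):
    """Flatten a list-valued column while keeping the original column order.

    Hypothetical variant of flattenColumn; the name is illustrative only.
    """
    # One (index label, value) pair per list element.
    # Assumption: a non-list cell is treated as a single value, so a
    # multi-character string is not split into characters.
    pairs = [
        (i, value)
        for i, cell in df[column].items()
        for value in (cell if isinstance(cell, list) else [cell])
    ]
    column_flat = pd.DataFrame(pairs, columns=["I", column]).set_index("I")
    # Align the index name with the input so the merge result keeps it.
    column_flat.index.name = df.index.name
    merged = df.drop(columns=column).merge(
        column_flat, left_index=True, right_index=True
    )
    # Reselect the columns in the order they had in the input frame.
    return merged[list(df.columns)]


df = pd.DataFrame({"A": [1, 2], "B": [["a", "b"], "c"], "C": [10, 20]})
result = flatten_column_keep_order(df, "B")
```

Here result should have its columns in the order A, B, C, with the row for A == 1 repeated once per element of its list. Note that a row whose list is empty would be dropped by the inner merge.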
I guess the easiest way to flatten a list of lists is pure Python code, as this object type is not well suited to pandas or numpy. So you can do it with, for example:
>>> b_flat = pd.DataFrame([[i, x]
...               for i, y in input['B'].apply(list).items()
...               for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')
With column B flattened, you can merge it back:
>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c
[3 rows x 2 columns]
If you want a fresh 0..n-1 index instead of the repeated original one, you can append .reset_index(drop=True) to the last command.
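As a side note, newer pandas releases (0.25 and later, as far as I know) ship DataFrame.explode, which seems to cover this use case directly. This is a different technique from the merge approach above, shown only as a sketch on the example frame from the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})

# explode repeats the other columns once per list element;
# scalar cells such as 'c' pass through unchanged.
flat = df.explode('B')

# Optional: replace the repeated index (0, 0, 1) with a fresh one.
flat_reset = flat.reset_index(drop=True)
```

Unlike the merge approach, explode leaves the exploded column in its original position, so no reordering step is needed.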