Sklearn preprocessing - PolynomialFeatures - How to keep column names/headers of the output array / dataframe

Afflatus · Apr 19, 2016 · Viewed 12.5k times

TL;DR: How do I get headers for the output NumPy array returned by sklearn.preprocessing.PolynomialFeatures()?


Let's say I have the following code...

import pandas as pd
import numpy as np
from sklearn import preprocessing as pp

a = np.ones(3)
b = np.ones(3) * 2
c = np.ones(3) * 3

input_df = pd.DataFrame([a,b,c])
input_df = input_df.T
input_df.columns=['a', 'b', 'c']

input_df

    a   b   c
0   1   2   3
1   1   2   3
2   1   2   3

poly = pp.PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)
print(output_nparray)

[[ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]]

How can I get that 3x10 matrix (output_nparray) to carry over the a, b, c labels, so that each output column shows how it relates to the input columns above?

Answer

Guiem Bosch · Apr 20, 2016

Working example, all in one line (I assume "readability" is not the goal here):

target_feature_names = ['x'.join('{}^{}'.format(col, exp) for col, exp in zip(input_df.columns, p) if exp != 0) for p in poly.powers_]
output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
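For readers who want to see what the one-liner does, here is a minimal sketch of the same idea as a plain loop. It makes a few assumptions: `names_from_powers` is a hypothetical helper name, the `powers` list is hardcoded in the layout of `poly.powers_` for degree 2 and three inputs so the snippet runs without scikit-learn, and the constant column is labeled '1' where the one-liner would leave an empty string.

```python
def names_from_powers(columns, powers):
    """Build one label per output column from a matrix of exponents."""
    names = []
    for row in powers:
        # Keep only the input columns that take part in this term.
        parts = ['{}^{}'.format(col, exp)
                 for col, exp in zip(columns, row) if exp != 0]
        # An all-zero row is the constant (bias) column; label it '1'
        # (an assumption of this sketch, the one-liner yields '').
        names.append('x'.join(parts) if parts else '1')
    return names


# Exponent rows in the layout of poly.powers_ for degree 2 and three inputs.
# Hardcoded here (assumption) so the sketch does not need scikit-learn.
powers = [
    [0, 0, 0],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [2, 0, 0], [1, 1, 0], [1, 0, 1],
    [0, 2, 0], [0, 1, 1], [0, 0, 2],
]

target_feature_names = names_from_powers(['a', 'b', 'c'], powers)
print(target_feature_names)
```

With a fitted transformer, `poly.powers_` and `input_df.columns` could be passed in place of the hardcoded values.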

Update: as @OmerB pointed out, now you can use the get_feature_names method:

>>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']