Here's what I got from a tutorial:
# Data Preprocessing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
This is the X matrix with the encoded dummy variables:
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 4.400000000000000000e+01 7.200000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 2.700000000000000000e+01 4.800000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 3.000000000000000000e+01 5.400000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 3.800000000000000000e+01 6.100000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 4.000000000000000000e+01 6.377777777777778101e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 3.500000000000000000e+01 5.800000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 3.877777777777777857e+01 5.200000000000000000e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 4.800000000000000000e+01 7.900000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 5.000000000000000000e+01 8.300000000000000000e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 3.700000000000000000e+01 6.700000000000000000e+04
The problem is that there are no column labels. I tried:
something = pd.get_dummies(X)
But I get the following exception:
Exception: Data must be 1-dimensional
Most sklearn methods don't care about column names, as they're mainly concerned with the math behind the ML algorithms they implement. You can add column names back onto the OneHotEncoder output after fit_transform(), if you can figure out the label encoding ahead of time.
First, grab the column names of your predictors from the original dataset, excluding the first one (which we reserve for LabelEncoder):
X_cols = dataset.columns[1:-1]
X_cols
# Index(['Age', 'Salary'], dtype='object')
Now get the order of the encoded labels. LabelEncoder() sorts the classes it finds, so for these country names the integer mapping is alphabetical:
labels = labelencoder_X.fit(X[:, 0]).classes_
labels
# ['France' 'Germany' 'Spain']
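If you want to convince yourself of that ordering without sklearn, here is a minimal sketch using np.unique, which returns the sorted unique values together with the integer code of each row. The countries array is made up for illustration, and as far as I know LabelEncoder produces the same classes and codes for string input like this:

```python
import numpy as np

# Hypothetical country column, standing in for X[:, 0] before encoding
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany", "France"])

# np.unique sorts the distinct values; return_inverse gives, for every row,
# the position of its value in that sorted array (the integer code)
classes, codes = np.unique(countries, return_inverse=True)

print(classes)
print(codes)
```

The position of each name in classes is the integer it gets encoded to, and therefore also the position of its dummy column after one-hot encoding.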
Combine these column names, and then add them to X when you convert it to a DataFrame:
# X[:, 0] is overwritten with integer codes here, so labels must already be captured before this line
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
encoded_cols = np.append(labels, X_cols)
# ...
X = onehotencoder.fit_transform(X).toarray()
encoded_df = pd.DataFrame(X, columns=encoded_cols)
encoded_df
   France  Germany  Spain        Age        Salary
0     1.0      0.0    0.0  44.000000  72000.000000
1     0.0      0.0    1.0  27.000000  48000.000000
2     0.0      1.0    0.0  30.000000  54000.000000
3     0.0      0.0    1.0  38.000000  61000.000000
4     0.0      1.0    0.0  40.000000  63777.777778
5     1.0      0.0    0.0  35.000000  58000.000000
6     0.0      0.0    1.0  38.777778  52000.000000
7     1.0      0.0    0.0  48.000000  79000.000000
8     0.0      1.0    0.0  50.000000  83000.000000
9     1.0      0.0    0.0  37.000000  67000.000000
NB: For example data I'm using this dataset, which seems either very similar or identical to the one used by OP. Note how the values in encoded_df are identical to OP's X matrix.
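As a side note on the pd.get_dummies(X) attempt: get_dummies is meant for a Series or DataFrame, not a 2-D NumPy array. If you don't need the sklearn encoders, a minimal sketch that calls it on the original DataFrame keeps the labels for free. The frame below is a made-up sample with the same column names, not the real Data.csv:

```python
import pandas as pd

# Hypothetical sample with the same predictor columns as the question
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# Encode only the Country column; the other columns pass through unchanged.
# dtype=float gives 1.0/0.0 values like the OneHotEncoder output.
encoded_df = pd.get_dummies(dataset, columns=["Country"], dtype=float)
print(encoded_df)
```

With this approach the untouched columns come first and the dummy columns are appended after them, so the column order differs from the OneHotEncoder result above.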