I am trying to convert categorical values into binary values using pandas. The idea is to consider every unique categorical value as a feature (i.e. a column) and put 1 or 0 depending on whether a particular object (i.e. row) was assigned to this category. The following is the code:
data = pd.read_csv('somedata.csv')
converted_val = data.T.to_dict().values()
vectorizer = DV( sparse = False )
vec_x = vectorizer.fit_transform( converted_val )
numpy.savetxt('out.csv',vec_x,fmt='%10.0f',delimiter=',')
My question is, how to save this converted data with the column names?. In the above code, I am able to save the data using numpy.savetxt
function, but this simply saves the array and the column names are lost. Alternatively, is there a much efficient way to perform the above operation?.
You mean "one-hot" encoding?
Say you have the following dataset:
import pandas as pd
df = pd.DataFrame([
['green', 1, 10.1, 0],
['red', 2, 13.5, 1],
['blue', 3, 15.3, 0]])
df.columns = ['color', 'size', 'prize', 'class label']
df
Now, you have multiple options ...
color_mapping = {
'green': (0,0,1),
'red': (0,1,0),
'blue': (1,0,0)}
df['color'] = df['color'].map(color_mapping)
df
import numpy as np
y = df['class label'].values
X = df.iloc[:, :-1].values
X = np.apply_along_axis(func1d= lambda x: np.array(list(x[0]) + list(x[1:])), axis=1, arr=X)
print('Class labels:', y)
print('\nFeatures:\n', X)
Yielding:
Class labels: [0 1 0]
Features:
[[ 0. 0. 1. 1. 10.1]
[ 0. 1. 0. 2. 13.5]
[ 1. 0. 0. 3. 15.3]]
DictVectorizer
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.transpose().to_dict().values())
X
Yielding:
array([[ 0. , 0. , 1. , 0. , 10.1, 1. ],
[ 1. , 0. , 0. , 1. , 13.5, 2. ],
[ 0. , 1. , 0. , 0. , 15.3, 3. ]])
get_dummies
pd.get_dummies(df)