XGBoost Categorical Variables: Dummification vs encoding

ishido · Dec 14, 2015 · Viewed 45.7k times

When using XGBoost we need to convert categorical variables into numeric.

Would there be any difference in performance/evaluation metrics between the methods of:

  1. dummifying your categorical variables
  2. encoding your categorical variables from e.g. (a,b,c) to (1,2,3)

ALSO:

Would there be any reason not to go with method 2 by using, for example, `LabelEncoder`?

Answer

T. Scharf · Dec 18, 2015

xgboost only deals with numeric columns.

Suppose you have a feature [a,b,b,c] that describes a categorical variable, i.e. there is no numeric relationship between the categories.

Using LabelEncoder you will simply have this:

array([0, 1, 1, 2])

This simply maps each string ('a','b','c') to an integer, nothing more, but xgboost will wrongly interpret the feature as having a numeric (ordinal) relationship: it will happily split on "letter < 1.5" as if that comparison were meaningful.
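As a minimal sketch, the label encoding above can be reproduced with scikit-learn's `LabelEncoder` (the example data comes straight from the answer; the variable names are mine):

```python
from sklearn.preprocessing import LabelEncoder

# Categorical feature with no numeric relationship between levels
feature = ['a', 'b', 'b', 'c']

le = LabelEncoder()
encoded = le.fit_transform(feature)  # classes are sorted alphabetically: a->0, b->1, c->2
print(encoded)  # [0 1 1 2]
```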

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

This is the proper representation of a categorical variable for xgboost or any other machine learning tool.
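A sketch of how to produce that one-hot matrix with scikit-learn (note: modern scikit-learn, roughly 0.20+, accepts string input directly; older versions required label-encoding first):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same categorical feature, reshaped to a single column as the encoder expects
feature = np.array(['a', 'b', 'b', 'c']).reshape(-1, 1)

ohe = OneHotEncoder()  # returns a sparse matrix by default
onehot = ohe.fit_transform(feature).toarray()  # densify for display
print(onehot)
```

Each row now has exactly one 1, in the column for its category, so no spurious ordering is implied.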

Pandas `get_dummies` is a nice tool for creating dummy variables, and is easier to use, in my opinion.
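The same result with `get_dummies`, as a sketch (column and prefix names are my own choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'b', 'c']})

# One dummy column per category; prefix keeps the original feature name visible
dummies = pd.get_dummies(df['letter'], prefix='letter')
print(dummies)
```

The output has columns `letter_a`, `letter_b`, `letter_c`, ready to concatenate back onto the numeric features before training.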

Method #2 in the above question will not represent the data properly.