Working of labelEncoder in sklearn

Neo · Jan 21, 2017 · Viewed 25.2k times

Say I have the following input feature:

hotel_id = [1, 2, 3, 2, 3]

This is a categorical feature with numeric values. If I give it to the model as-is, the model will treat it as a continuous variable, i.e., 2 > 1.

If I apply sklearn.preprocessing.LabelEncoder() then I will get:

hotel_id = [0, 1, 2, 1, 2] 

So is this encoded feature considered continuous or categorical? If it is treated as continuous, then what's the use of LabelEncoder()?

P.S. I know about one-hot encoding, but there are around 100 hotel_ids, so I don't want to use it. Thanks

Answer

Tgsmith61591 · Jan 21, 2017

The LabelEncoder is a way to encode class levels. In addition to the integer example you've included, consider the following example:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>>
>>> train = ["paris", "paris", "tokyo", "amsterdam"]
>>> test = ["tokyo", "tokyo", "paris"]
>>> le.fit(train).transform(test)
array([2, 2, 1])

What the LabelEncoder allows us to do, then, is to assign ordinal levels to categorical data. However, what you've noted is correct: namely, the [2, 2, 1] is treated as numeric data. This is a good candidate for using the OneHotEncoder for dummy variables (which I know you said you were hoping not to use).
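To make the two-step route concrete, here is a minimal sketch of chaining LabelEncoder and OneHotEncoder on your hotel_id feature (the variable names are illustrative; `.toarray()` is used because OneHotEncoder returns a sparse matrix by default):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

hotel_id = [1, 2, 3, 2, 3]

# Step 1: map the raw ids onto ordinal levels 0..n_classes-1
le = LabelEncoder()
encoded = le.fit_transform(hotel_id)  # array([0, 1, 2, 1, 2])

# Step 2: expand the single ordinal column into one dummy column per hotel
ohe = OneHotEncoder()
dummies = ohe.fit_transform(encoded.reshape(-1, 1)).toarray()
# dummies has shape (5, 3): one row per sample, one column per hotel id
```

With ~100 hotel_ids this produces ~100 columns, which is exactly the cost you were hoping to avoid; the trade-off is that no spurious ordering (2 > 1) is imposed on the model.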

Note that in older versions of scikit-learn the LabelEncoder had to be used prior to one-hot encoding, as the OneHotEncoder could not handle string-valued data directly; it was therefore frequently used as a precursor to one-hot encoding. (Recent versions of OneHotEncoder accept string categories directly.)

Alternatively, it can encode your target into a usable array. If, for instance, train were your target for classification, you would use a LabelEncoder to turn it into your y variable.
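As a sketch of that target-encoding use (the feature matrix X here is made up purely for illustration; note that many scikit-learn estimators also accept string targets directly, so the explicit encoding mainly makes the label mapping visible):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X = [[0.0], [0.1], [5.0], [9.9]]  # illustrative numeric features
train = ["paris", "paris", "tokyo", "amsterdam"]

# Encode the string labels into integers (sorted alphabetically:
# amsterdam -> 0, paris -> 1, tokyo -> 2)
le = LabelEncoder()
y = le.fit_transform(train)  # array([1, 1, 2, 0])

clf = LogisticRegression().fit(X, y)

# Map integer predictions back to the original string labels
labels = le.inverse_transform(clf.predict(X))
```

The `inverse_transform` step is the practical payoff: you can recover the original class names from the model's integer output.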