Pandas dataframe encode Categorical variable with thousands of unique values

roqds · Feb 3, 2018

I have a dataframe with data on schools in a few thousand cities. The school is the row identifier and the city is encoded as follows:

school city          category   capacity
1      azez6576sebd  45         23
2      dsqozbc765aj  12         236
3      sqdqsd12887s  8          63 
4      azez6576sebd  7          234 
...

How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach for converting a categorical variable with thousands of levels to numeric?

Thank you.

Answer

BENY · Feb 3, 2018

You can use the category dtype in pandas; the equivalent in sklearn is LabelEncoder.

df.city=df.city.astype('category').cat.codes
df
Out[385]: 
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234
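
For anyone who wants to run this end to end, here is a minimal sketch of the same idea. It rebuilds the four sample rows from the question and also keeps a lookup from each integer code back to the original city string; the names `city_cat` and `code_to_city` are just illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
    "category": [45, 12, 8, 7],
    "capacity": [23, 236, 63, 234],
})

# Convert to the category dtype once, so the code <-> label mapping stays available.
city_cat = df["city"].astype("category")

# Lookup from integer code back to the original city string (illustrative helper).
code_to_city = dict(enumerate(city_cat.cat.categories))

# Replace the city strings with their integer codes.
df["city"] = city_cat.cat.codes

print(df)
print(code_to_city)
```

Keep in mind that the codes are only identifiers: pandas assigns them from the sorted category labels, so their numeric order carries no meaning about the cities, and missing values would come out as -1.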