I have a dataframe with data on schools in a few thousand cities. The school is the row identifier and the city is encoded as follows:
school city category capacity
1 azez6576sebd 45 23
2 dsqozbc765aj 12 236
3 sqdqsd12887s 8 63
4 azez6576sebd 7 234
...
How can I convert the city variable to numeric, knowing that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach to converting a categorical variable with thousands of levels to numeric?
Thank you.
You can use the category dtype in pandas; it gives the same kind of integer codes as LabelEncoder in sklearn:
df['city'] = df['city'].astype('category').cat.codes
df
Out[385]:
school city category capacity
0 1 0 45 23
1 2 1 12 236
2 3 2 8 63
3 4 0 7 234
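If you need to get back from the integer codes to the original city strings later, it may help to keep the mapping around before overwriting the column. Below is a minimal sketch of that idea, using the four sample rows from the question; the names `city_cat`, `code_to_city` and `decoded` are just illustrative and not part of any API.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
    "category": [45, 12, 8, 7],
    "capacity": [23, 236, 63, 234],
})

# Convert once to the category dtype so the code-to-label mapping is available.
city_cat = df["city"].astype("category")

# Lookup table: integer code -> original city string (illustrative name).
code_to_city = dict(enumerate(city_cat.cat.categories))

# Replace the string column with its integer codes.
df["city"] = city_cat.cat.codes

# Decode back to the original strings when needed.
decoded = df["city"].map(code_to_city)
```

Note that the codes are just labels assigned in the sorted order of the city strings, so their numeric size carries no meaning about the cities themselves.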