I'm using LabelEncoder and OneHotEncoder from sklearn in a machine learning project to encode the labels (country names) in the dataset. Everything works well and my model runs without problems. The project is to classify whether a bank customer will stay with or leave the bank based on a number of features, including the customer's country.
My issue arises when I want to predict (classify) a new customer (one only). The data for the new customer is still not pre-processed (i.e., country names are not encoded). Something like the following:
new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])
In the online course where I am learning machine learning, the instructor opened the pre-processed dataset that included the encoded data, manually looked up the code for France, and updated it in new_customer as follows:
new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])
I believe that this is not practical. There must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. So how can I automate this process, or generate the codes for the labels? Thanks in advance.
It seems like you may be looking for the .transform() method of your estimator.
>>> from sklearn.preprocessing import LabelEncoder
>>> c = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
>>> enc = LabelEncoder().fit(c)
>>> encoded = enc.transform(c)
>>> encoded
array([1, 2, 3, 3, 2, 0, 1])
>>> enc.transform(['France'])
array([1])
This takes the "mapping" that was learned when you called fit(c) and applies it to new data (in this case, a new label). You can see this mapping in reverse:
>>> enc.inverse_transform(encoded)
array(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'], dtype='<U6')
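If you want the list of countries and their codes that the question asks for, one option is to read it off the fitted encoder's classes_ attribute. This is a minimal sketch: the variable names countries and mapping are only illustrative, and it assumes the encoder was fitted on the same labels as in the example above.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
enc = LabelEncoder().fit(countries)

# classes_ holds the sorted unique labels seen during fit;
# the position of each label in that array is its integer code.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)  # {'China': 0, 'France': 1, 'UK': 2, 'US': 3}
```

Note that transform() raises a ValueError for a label that was not present when the encoder was fitted, so a new customer's country has to be one the encoder has already seen.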
As mentioned by the answer here, if you want to do this between Python sessions, you could serialize the estimator to disk like this:
import pickle
with open('enc.pickle', 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)
Then load this in a new session and transform incoming data with it.
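The loading side could look roughly like the sketch below. To keep it self-contained it first writes the pickle and then reads it back; the temporary-directory path and the name loaded_enc are assumptions made for illustration, and in practice the load would happen in a separate session.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Fit and save the encoder (stands in for the original training session).
enc = LabelEncoder().fit(['France', 'UK', 'US', 'China'])
path = os.path.join(tempfile.gettempdir(), 'enc.pickle')  # illustrative location
with open(path, 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

# Later, in a new session: load the fitted encoder and reuse its mapping.
with open(path, 'rb') as file:
    loaded_enc = pickle.load(file)

codes = loaded_enc.transform(['France'])
print(codes)  # [1]
```

Only unpickle files you trust, and it is safest to load the encoder with the same scikit-learn version that was used to save it.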