I have a series like:
df['ID'] = ['ABC123', 'IDF345', ...]
I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.
During the training, I'm doing as follows:
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id, i.e., if the same values are present, transform them according to the above label encoder; otherwise, assign a new numerical value.
In the test file, I was doing as follows:
new_df['ID'] = le_id.transform(new_df.ID)
But, I'm getting the following error: ValueError: y contains new labels
How do I fix this?? Thanks!
UPDATE:
So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' is given and under which a 'Low' is given. For example, below a 'High' is given when there are multiple entries with the same BankNum and different IDs.
df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
And then predict it on something like:
BankNum | ID |
00982222 | AB999 |
00982222 | AB999 |
00981111 | AB890 |
I'm doing something like this:
df['BankNum'] = df.BankNum.astype(np.float128)
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)
I think the error message is very clear: your test dataset contains ID labels which have not been included in your training data set. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure that each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.
One possible solution is to search through your data set at the beginning, get a list of all unique ID values, fit the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
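A minimal sketch of this first idea, assuming the prediction data is already available when the encoder is fitted. The two small frames are made-up stand-ins for df and new_df, and the ID_enc column name is only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for the training and prediction data.
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343", "ED200"]})
new_df = pd.DataFrame({"ID": ["AB999", "ED245", "AB890"]})

# Collect every ID that occurs in either frame and fit the encoder on all of them,
# so that transform() never meets a label it has not seen.
all_ids = pd.concat([df["ID"], new_df["ID"]]).unique()
le_id = LabelEncoder()
le_id.fit(all_ids)

# The same fitted encoder is then used for both frames.
df["ID_enc"] = le_id.transform(df["ID"])
new_df["ID_enc"] = le_id.transform(new_df["ID"])
```

Note that this only removes the error: the codes assigned to IDs that never appear in the training rows carry no information the classifier could have learned from.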
Another possible solution is to check that the test data contain only labels which have been seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
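The fallback idea could look roughly like the following sketch. The UNKNOWN placeholder, the toy frames, and the ID_enc column are assumptions made for illustration, not part of the original code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # hypothetical placeholder for IDs not seen in training

# Hypothetical toy frames standing in for the training and prediction data.
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new_df = pd.DataFrame({"ID": ["ED245", "AB999"]})

# Fit on the training IDs plus the placeholder, so the placeholder gets its own code.
le_id = LabelEncoder()
le_id.fit(list(df["ID"]) + [UNKNOWN])
df["ID_enc"] = le_id.transform(df["ID"])

# Replace every ID the encoder does not know with the placeholder, then transform.
is_known = new_df["ID"].isin(le_id.classes_)
new_ids = new_df["ID"].where(is_known, UNKNOWN)
new_df["ID_enc"] = le_id.transform(new_ids)
```

Here ED245 keeps the code it had during training, while AB999 is mapped to the code of the placeholder.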