Error: Classification metrics can't handle a mix of multiclass-multioutput and multilabel-indicator targets

Lossan · Jun 24, 2018 · Viewed 8.4k times

I am a newbie to machine learning in general.

I am trying to do multilabel text classification. I have the original labels for these documents as well as the classification results (from an MLkNN classifier), represented as one-hot encodings (19,000 documents x 200 labels). I am now trying to evaluate the classification with micro- and macro-averaged f1_score, but I get this error on line 3:

ValueError: Classification metrics can't handle a mix of multiclass-multioutput and multilabel-indicator targets

I don't know how to solve it. This is my code:

1. y_true = np.loadtxt("target_matrix.txt")
2. y_pred = np.loadtxt("classification_results.txt")

3. print(f1_score(y_true, y_pred, average='macro'))
4. print(f1_score(y_true, y_pred, average='micro'))

I also tried to use cross_val_score to run the classification and get the evaluation in one step, but ran into another error (raised from the cross_val_score line):

File "_csparsetools.pyx", line 20, in scipy.sparse._csparsetools.lil_get1
File "_csparsetools.pyx", line 48, in scipy.sparse._csparsetools.lil_get1
IndexError: column index (11) out of bounds

This is my code:

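import numpy as np
from sklearn.model_selection import cross_val_score
from skmultilearn.adapt import MLkNN

# Document vectors (features) and the hand-built label matrix
X = np.loadtxt("docvecs.txt", delimiter=",")
y = np.loadtxt("target_matrix.txt", dtype='int')

cv_scores = []
mlknn = MLkNN(k=10)
scores = cross_val_score(mlknn, X, y, cv=5, scoring='f1_micro')  # raises the IndexError above
cv_scores.append(scores)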

Any help with either of the errors is much appreciated. Thanks.

Answer

Lossan · Jun 25, 2018

I was creating the y array manually, and that seems to have been my mistake. I now use MultiLabelBinarizer to create it, as in the following example, and it works:

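import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.adapt import MLkNN

# Toy label lists; in practice there must be one list per row of X
train_foo = [['sci-fi', 'thriller'],['comedy'],['sci-fi', 'thriller'],['comedy']]
mlb = MultiLabelBinarizer()
mlb_label_train = mlb.fit_transform(train_foo)  # 0/1 indicator matrix, one column per label

X = np.loadtxt("docvecs.txt", delimiter=",")
cv_scores = []
mlknn = MLkNN(k=3)
scores = cross_val_score(mlknn, X, mlb_label_train, cv=5, scoring='f1_macro')
cv_scores.append(scores)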
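A quick way to check whether a hand-built label matrix is the culprit is to look for entries that are not 0 or 1. As far as I understand, scikit-learn treats a 2-D target with more than two distinct values as multiclass-multioutput rather than multilabel-indicator, which would explain the "mix" in the error message; type_of_target in sklearn.utils.multiclass reports how each array is classified. The sketch below is dependency-free, and the helper name non_binary_entries and the sample rows are made up for illustration:

```python
# Hypothetical helper (not part of scikit-learn): find entries that are not 0 or 1.
def non_binary_entries(matrix):
    """Return (row, column, value) for every entry that is neither 0 nor 1."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value not in (0, 1)
    ]


# Made-up sample data: the second row holds a 2 instead of a 0/1 flag.
hand_built = [[1, 0, 0], [0, 2, 0], [0, 1, 1]]
binarized = [[1, 0, 0], [0, 0, 1], [0, 1, 1]]

print(non_binary_entries(hand_built))  # [(1, 1, 2)]
print(non_binary_entries(binarized))   # []
```

If this returns anything for y_true or y_pred, that matrix is not a pure indicator matrix, and rebuilding it with MultiLabelBinarizer is probably the simplest fix.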

The documentation for MultiLabelBinarizer is in the scikit-learn reference, under sklearn.preprocessing.
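For reference, here is a rough sketch of what the micro and macro averages compute once both matrices are proper 0/1 indicator matrices. It is a hand-rolled illustration, not scikit-learn's implementation; the function names and the sample data are made up, and returning 0.0 for a label with no true or predicted positives is an assumption of this sketch:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0.0 here when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro, macro) F1 for two equally shaped 0/1 matrices."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per label: [tp, fp, fn]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1  # true positive
            elif p:
                counts[j][1] += 1  # false positive
            elif t:
                counts[j][2] += 1  # false negative
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels first, then compute a single F1.
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    micro = f1_from_counts(tp, fp, fn)
    return micro, macro


# Made-up data: 4 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
micro, macro = multilabel_f1(y_true, y_pred)
print(micro, macro)
```

With 200 labels of very different frequency, the two numbers can differ a lot: macro is pulled down by rare labels that are predicted badly, while micro is dominated by the frequent ones.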