Python: Logistic regression max_iter parameter is reducing the accuracy

nurlubanu · Jul 18, 2019 · Viewed 7.4k times

I am doing multiclass/multilabel text classification, and I am trying to get rid of the "ConvergenceWarning".

When I increased max_iter from the default to 4000, the warning disappeared. However, my model accuracy dropped from 78 to 75.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# bag-of-words counts -> TF-IDF weighting -> logistic regression
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5, solver='lbfgs',
                                              multi_class='ovr', random_state=0,
                                              class_weight='balanced')),
                  ])
logreg.fit(X_train, y_train)


y_pred = logreg.predict(X_test)

print('Logistic Regression Accuracy %s' % accuracy_score(y_pred, y_test))

cv_score = cross_val_score(logreg, train_tfidf, y_train, cv=10, scoring='accuracy')
print("CV Score : Mean : %.7g | Std : %.7g | Min : %.7g | Max : %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))

Why is my accuracy reduced when max_iter = 4000? Is there any other way to fix the warning "ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations."?

Answer

Maurício Collaça · Jan 5, 2020

The data used in the question is missing, so it's not possible to reproduce the problem, only to guess.

Some things to check:

1) Many estimators such as LogisticRegression like (not to say require) scaled data. Depending on your data, you may want to scale with MaxAbsScaler, MinMaxScaler, StandardScaler or RobustScaler. The optimal choice depends on the kind of problem you are trying to solve, on data properties like sparsity, on whether negative values are acceptable to the downstream estimator, and so on. Scaling the data usually speeds up convergence and may even remove the need to increase max_iter. For example:
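
A minimal sketch, assuming the same pipeline as in the question: a scaler step is inserted after the TF-IDF transform. MaxAbsScaler is used here only as an illustration, because it works on the sparse matrices produced by TF-IDF without densifying them.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('scale', MaxAbsScaler()),   # scales each feature to [-1, 1] while keeping sparsity
                   ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr',
                                              class_weight='balanced', random_state=0)),
                  ])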

2) In my experience, solvers other than "liblinear" require more max_iter iterations to converge on the same input data.

3) I didn't see any `max_iter` set in your code snippet. It currently defaults to `100` (sklearn 0.22).

4) I saw you set the regularization parameter C=100000. This drastically reduces the regularization, as C is the inverse of the regularization strength. It is expected to consume more iterations and may lead to an overfit model.

5) I didn't expect that a higher max_iter would get you lower accuracy. The solver is diverging rather than converging, most likely because the data is not scaled, the random state is not fixed, or the tolerance tol (default 1e-4) is too high. Setting these parameters explicitly is sketched below.
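
A hedged sketch pulling points 2) to 5) together: the values below (max_iter=1000, C=1.0) are illustrative guesses, not settings tuned to the question's data, and tol is simply spelled out at its default.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs',        # try 'liblinear' or 'saga' as well
                         multi_class='ovr',
                         C=1.0,                  # much stronger regularization than C=1e5
                         max_iter=1000,          # explicit, instead of the default 100
                         tol=1e-4,               # the default stopping tolerance, made explicit
                         class_weight='balanced',
                         random_state=0)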

6) Check your cross_val_score cross-validation parameter cv. If I'm not wrong, the default behaviour doesn't set the random state, which results in a variable mean accuracy. One way to pin it down is sketched below.
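
A minimal sketch that passes an explicit, seeded splitter instead of the plain cv=10; the fold count and seed are arbitrary, and X_train (raw text) is assumed here because logreg is a full pipeline that includes the vectorizer.

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # same folds on every run
cv_score = cross_val_score(logreg, X_train, y_train, cv=cv, scoring='accuracy')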