sklearn logistic regression with unbalanced classes

agentscully · Feb 13, 2013 · Viewed 14.7k times

I'm solving a classification problem with sklearn's logistic regression in Python.

My problem is a fairly generic one. I have a dataset with two classes/results (positive/negative, or 1/0), but the set is highly unbalanced: there are ~5% positives and ~95% negatives.

I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.

What I've done thus far is to build a balanced training set by taking all entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to adjust the model afterwards so that it works on the original unbalanced population/set.
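For reference, the subsampling step looks roughly like this (a minimal sketch; X and y are assumed to be NumPy arrays, with y holding the 0/1 labels):

```python
import numpy as np

rng = np.random.RandomState(42)

pos_idx = np.where(y == 1)[0]   # all positive entries (~5%)
neg_idx = np.where(y == 0)[0]   # all negative entries (~95%)

# draw as many negatives as there are positives, without replacement
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)

balanced_idx = np.concatenate([pos_idx, neg_sample])
rng.shuffle(balanced_idx)

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
```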

What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.

Answer

ogrisel · Feb 13, 2013

Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
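For logistic regression, that would look like the following minimal sketch (note that in newer sklearn releases the "auto" option has been renamed "balanced"):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="auto" weights samples inversely proportional to class
# frequencies; in recent sklearn versions the same option is spelled
# class_weight="balanced"
clf = LogisticRegression(class_weight="auto")
clf.fit(X, y)   # X, y are the original, unbalanced data

proba = clf.predict_proba(X)[:, 1]   # probability of the positive class
```

With class weighting you can train on the full unbalanced set directly, so there is no need to correct the model afterwards.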

Also, you can rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise).
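Here is a minimal over-sampling sketch, assuming X is a 2-D NumPy array of continuous features and y a 0/1 label vector (the noise scale 0.01 is an arbitrary choice and should be adapted to your feature scales):

```python
import numpy as np

rng = np.random.RandomState(0)
pos, neg = X[y == 1], X[y == 0]

# over-sample the positives with replacement until they match the
# number of negatives, then jitter the duplicates with slight
# Gaussian noise so they are not exact copies
extra = pos[rng.randint(0, len(pos), size=len(neg) - len(pos))]
extra = extra + rng.normal(scale=0.01, size=extra.shape)

X_res = np.vstack([neg, pos, extra])
y_res = np.concatenate([np.zeros(len(neg)),
                        np.ones(len(pos) + len(extra))])
```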