I'm solving a classification problem with sklearn's logistic regression in Python.
My problem is a fairly generic one. I have a dataset with two classes (positive/negative, or 1/0), but the set is highly unbalanced: there are ~5% positives and ~95% negatives.
I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.
What I've done thus far is build a balanced training set by selecting the entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to modify the model so that it works on the original unbalanced population.
What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.
Have you tried passing class_weight="auto" to your classifier? (In newer scikit-learn releases this option is spelled class_weight="balanced".) Not all classifiers in sklearn support this, but some do. Check the docstrings.
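As a rough sketch of the class-weight approach, the snippet below fits LogisticRegression with and without class weighting on a synthetic 95/5 dataset. The data from make_classification, the variable names, and the choice of recall as the metric are all assumptions made for illustration, and it uses the "balanced" spelling from current scikit-learn rather than "auto".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): ~95% negatives, ~5% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratify so the train and test sets both keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: every sample counts equally in the loss.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Weighted: errors on the rare class are penalized more heavily,
# with weights inversely proportional to class frequencies.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(
    X_train, y_train
)

pred_plain = plain.predict(X_test)
pred_weighted = weighted.predict(X_test)

# Recall on the positive class, measured on the untouched unbalanced test set.
recall_plain = recall_score(y_test, pred_plain)
recall_weighted = recall_score(y_test, pred_weighted)
print("recall without weighting:", recall_plain)
print("recall with weighting:   ", recall_weighted)
```

Because the weighting happens inside the loss function, the model is trained on the full unbalanced set and can be applied to it directly, with no separate balanced subset to build.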
You can also rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise to the copies).
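Both resampling options can be sketched with plain NumPy. The toy arrays, the variable names, and the noise scale (1% of each feature's standard deviation) are arbitrary assumptions for illustration, not recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1000 rows, 4 features, exactly 5% positives.
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:50] = 1

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Option 1: under-sample. Keep all positives and an equally sized
# random subset of the negatives (drawn without replacement).
keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([pos_idx, keep_neg])
X_under, y_under = X[under_idx], y[under_idx]

# Option 2: over-sample. Draw positives with replacement until the
# classes are the same size, and jitter the copies with small Gaussian noise.
n_extra = len(neg_idx) - len(pos_idx)
extra_pos = rng.choice(pos_idx, size=n_extra, replace=True)
noise = rng.normal(scale=0.01 * X.std(axis=0), size=(n_extra, X.shape[1]))
X_over = np.vstack([X, X[extra_pos] + noise])
y_over = np.concatenate([y, np.ones(n_extra, dtype=int)])

print("under-sampled:", X_under.shape, "positive share:", y_under.mean())
print("over-sampled: ", X_over.shape, "positive share:", y_over.mean())
```

In a sketch like this, the resampling would normally be applied to the training split only, so that evaluation still happens on data with the original class ratio.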