How to perform undersampling in scikit-learn?

Gaurav Patil · Mar 23, 2015

We have a retinal dataset in which diseased eyes make up 70 percent of the samples and non-diseased eyes make up the remaining 30 percent. We want a dataset in which the diseased and non-diseased samples are equal in number. Is there a function that can do this?

Answer

RickardSjogren · Mar 23, 2015

I would do this with a pandas DataFrame and numpy.random.choice, which make it easy to sample rows at random and produce equally sized groups. An example:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy group, without replacement:

healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]

To automatically pick a subsample of the same size as the non-healthy group you can do:

sample_size = (data.Healthy == 0).sum()  # Number of non-healthy rows; same as len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
healthy_sample = data.loc[random_indices]
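
To end up with the balanced dataset the question asks for, the subsample still has to be combined with the non-healthy rows. A minimal sketch of one way to do that follows. The helper name undersample, its label_col and seed parameters, and the final shuffle are assumptions made for illustration and are not part of pandas, NumPy, or scikit-learn. It also assumes the DataFrame index has no duplicate labels.

```python
import numpy as np
import pandas as pd


def undersample(df, label_col, seed=0):
    """Hypothetical helper: shrink every class to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    # Size of the smallest class; every class is cut down to this many rows.
    n_min = df[label_col].value_counts().min()
    kept = []
    for _, group in df.groupby(label_col):
        # Draw n_min index labels from this class without replacement.
        chosen = rng.choice(group.index.to_numpy(), size=n_min, replace=False)
        kept.append(df.loc[chosen])
    # Shuffle the combined rows so the classes are not stored in blocks.
    return pd.concat(kept).sample(frac=1, random_state=seed)


data = pd.DataFrame(np.random.default_rng(0).standard_normal((7, 4)))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

balanced = undersample(data, 'Healthy')
print(balanced['Healthy'].value_counts())
```

With the seven-row example, balanced holds two healthy and two non-healthy rows. Fixing seed makes the draw repeatable, which helps when comparing models trained on the same subsample.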