How to perform undersampling in scikit-learn?

Question 1

How to perform undersampling in scikit-learn?

python python-2.7 dataset scikit-learn sampling

Gaurav Patil · Mar 23, 2015 · Viewed 18.7k times · Source

Answer

I would do this with a pandas DataFrame and numpy.random.choice, which makes it easy to draw random samples and produce equally sized datasets. An example:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy population you do:

healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]

To automatically pick a subsample of the same size as the non-healthy group you can do:

sample_size = sum(data.Healthy == 0)  # Equivalent to len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
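The subsample still has to be joined with the non-healthy rows to get a balanced dataset. A minimal sketch of that last step, assuming the same toy frame as above; the seeded generator `rng` and the names `non_healthy_indices` and `balanced` are illustrative additions, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (an assumption, not required).
rng = np.random.default_rng(0)

# Same toy layout as above: five healthy rows, two non-healthy rows.
data = pd.DataFrame(rng.standard_normal((7, 4)))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

healthy_indices = data[data.Healthy == 1].index
non_healthy_indices = data[data.Healthy == 0].index

# Draw as many healthy rows as there are non-healthy rows, without replacement.
sample_size = len(non_healthy_indices)
random_indices = rng.choice(healthy_indices, sample_size, replace=False)

# Stack the sampled healthy rows on top of all non-healthy rows.
balanced = pd.concat([data.loc[random_indices], data.loc[non_healthy_indices]])

# Both classes should now have the same number of rows.
print(balanced.Healthy.value_counts())
```

Shuffling `balanced` afterwards, for example with `balanced.sample(frac=1)`, may be worthwhile if the downstream model is sensitive to row order.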

Question 2

We have a retinal dataset in which diseased-eye samples make up 70 percent of the data and non-diseased-eye samples make up the remaining 30 percent. We want a dataset in which the diseased and non-diseased samples are equal in number. Is there a function that can do this?
