We have a retinal dataset in which diseased-eye samples make up 70 percent of the data and non-diseased-eye samples make up the remaining 30 percent. We want a dataset in which the diseased and non-diseased samples are equal in number. Is there a function that can do this?
I would choose to do this with a pandas DataFrame and numpy.random.choice. That way it is easy to do random sampling to produce equally sized datasets. An example:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]
This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy population you do:
healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]
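As a side note, the same draw can probably be done with pandas alone through DataFrame.sample. The following is a minimal sketch that rebuilds the toy frame from above; the random_state value is an assumption, added only so the result is repeatable.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above.
data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

# Draw two healthy rows without replacement (the default for sample).
# random_state=0 is an arbitrary choice for reproducibility.
healthy_sample = data[data.Healthy == 1].sample(n=2, random_state=0)
```

This avoids handling the index by hand, at the cost of being a different random draw than the one made by numpy.random.choice.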
To automatically pick a subsample of the same size as the non-healthy group you can do:
sample_size = sum(data.Healthy == 0) # Equivalent to len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
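The lines above stop at the random indices, so here is a minimal sketch of how the balanced dataset itself might be assembled from them. The names non_healthy and balanced, the seed, and the final shuffle are my own additions for illustration. In the retinal dataset from the question the diseased group is the larger one, so that is the group you would downsample instead.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed only to make the sketch repeatable

# Same toy data as in the example above.
data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

# Minority group (kept whole) and index of the majority group.
non_healthy = data[data.Healthy == 0]
healthy_indices = data[data.Healthy == 1].index

# Downsample the majority group to the size of the minority group.
sample_size = len(non_healthy)
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
healthy_sample = data.loc[random_indices]

# Stack both groups and shuffle the rows so the classes are interleaved.
balanced = pd.concat([healthy_sample, non_healthy]).sample(frac=1, random_state=0)
```

Note that downsampling throws away part of the larger group, so whether this is acceptable depends on how much data you have.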