How to split data into balanced training and test sets in sklearn

Jeanne · Feb 18, 2016 · Viewed 42.5k times

I am using sklearn for a multi-class classification task. I need to split all my data into train_set and test_set, and I want to randomly take the same number of samples from each class. Currently, I am using this function

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)

but it gives an unbalanced dataset. Any suggestions?

Answer

Guiem Bosch · Feb 18, 2016

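Although Christian's suggestion is correct, train_test_split can itself give you stratified results through the stratify parameter. A stratified split keeps the class proportions of Target roughly the same in the training and test sets.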

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

The catch is that the stratify parameter is only available from sklearn version 0.17 onward.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as the labels array.
    New in version 0.17: stratify splitting
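For anyone on a newer sklearn, here is a minimal sketch of the same call. It assumes sklearn 0.18 or later, where train_test_split is imported from sklearn.model_selection (the cross_validation module was deprecated in 0.18 and removed in 0.20). The toy Data and Target arrays are made up for illustration and are not from the question. Note that stratify preserves the class ratio of Target in both splits; it does not force an equal number of samples per class.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 80 + [1] * 20)

# Same call as in the answer, with the newer import path.
X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Count the samples per class in each split; both should keep
# roughly the 80/20 ratio found in Target.
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", train_counts)
print("test: ", test_counts)
```

If the class ratio in the two printed counters matches the ratio in Target, the split is stratified. If you really need equal counts per class, the data would have to be resampled per class before or after splitting, which stratify alone does not do.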