I am using sklearn for multi-classification task. I need to split alldata into train_set and test_set. I want to take randomly the same sample number from each class. Actually, I amusing this function
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives unbalanced dataset! Any suggestion.
Although Christian's suggestion is correct, technically train_test_split
should give you stratified results by using the stratify
param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The trick here is that it starts from version 0.17
in sklearn
.
From the documentation about the parameter stratify
:
stratify : array-like or None (default is None) If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting