Stratified Train/Test-split in scikit-learn

pir · Apr 3, 2015

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)   

However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.

Answer

Andreas Mueller · Apr 6, 2015

[update for 0.17]

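See the docs of sklearn.model_selection.train_test_split (the stratify parameter was added in 0.17; since 0.18 the function lives in sklearn.model_selection instead of sklearn.cross_validation):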

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/update for 0.17]
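
To check that the split is really stratified, you can compare the class counts on both sides. This is only a minimal sketch: the toy arrays X and y (80 samples of class 0, 20 of class 1) and the random_state value are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class ratio of y in both subsets;
# random_state is fixed here only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Per-class counts in each subset (index 0 = class 0, index 1 = class 1).
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

With these counts the 80/20 ratio divides evenly, so the training set should hold 60 and 15 samples per class and the test set 20 and 5.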

There is a pull request here. But you can simply do train, test = next(iter(StratifiedKFold(...))) and use the train and test indices if you want.
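
A sketch of that StratifiedKFold workaround, written against the current sklearn.model_selection API, where the splitter is no longer iterable directly and you call .split(X, y) instead. Choosing four folds to get a 75%/25% split is my assumption, and the toy X and y are hypothetical, not taken from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Four folds: each test fold holds about 25% of the samples,
# leaving about 75% for training.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) pair of index arrays.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

This only gives test fractions of the form 1/n_splits, so train_test_split with stratify is the more direct choice whenever it is available.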