How to do a random stratified sampling with Python (Not a train/test split)?

asl · May 6, 2018 · Viewed 10.3k times

I am looking for the best way to do random stratified sampling, as used in surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and I have no target. I just want to draw random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).

Python is my main language.

Thank you for any help

Answer

Furkan Gursoy · Sep 5, 2019

Provided the variables are already binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but borrowing a function from it should do no harm.

Note that with a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.

If you test the following method, please share whether it behaves as expected or not.

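from sklearn.model_selection import train_test_split

# population is the pandas DataFrame to sample from.
# test_size=0.999 discards 99.9% of the rows, so stratified_sample holds about 0.1% of them.
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])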
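
To check whether the one-liner behaves as expected, a minimal sketch on synthetic data might look like the following. The population, its integer-coded bins, the 10% sample fraction (test_size=0.9), and random_state=0 are all assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical population whose variables are already binned.
# Bins are coded as small integers purely for illustration.
rng = np.random.default_rng(0)
n = 10_000
population = pd.DataFrame({
    "income": rng.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2]),
    "sex": rng.integers(0, 2, size=n),
    "age": rng.choice([0, 1, 2], size=n),
})

strata = ["income", "sex", "age"]

# Keep 10% of the rows as the sample; the other 90% (the "test" part) is discarded.
stratified_sample, _ = train_test_split(
    population,
    test_size=0.9,
    stratify=population[strata],
    random_state=0,  # fixed seed so the draw is reproducible
)

# Share of each income/sex/age combination in the population and in the sample.
pop_share = population.groupby(strata).size() / len(population)
sample_share = stratified_sample.groupby(strata).size() / len(stratified_sample)

# Largest absolute difference between the two sets of shares.
max_gap = (pop_share - sample_share.reindex(pop_share.index, fill_value=0)).abs().max()

print(len(stratified_sample), "rows sampled")
print("index unique (no duplicate rows):", stratified_sample.index.is_unique)
print("largest share difference:", max_gap)
```

If the stratification works, the largest share difference should be close to zero and the sample index should contain no repeats. Keep in mind that train_test_split raises an error when any combination of bins occurs only once, so very sparse strata may need coarser bins.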