Sampling a dataframe based on a given distribution

stackit · Oct 13, 2015 · Viewed 9.5k times

How can I sample a pandas dataframe or a GraphLab SFrame according to a given class/label distribution? For example, I have a data frame with a label/class column and want to select rows so that each class label is fetched equally often, giving a uniform distribution of class labels. Better still would be to draw samples according to whatever class distribution we specify.

+------+-------+-------+
| col1 | col2  | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+


The goal is to extract rows from the first dataframe according to the frequency distribution in the second, where the nostoextract column gives the number of rows to extract per class. The result is a sampled frame in which each class appears at most 2 times. If there are not enough rows of a class to meet the required count, the sampling should ignore the shortfall and continue. The resulting dataframe is to be used for a decision-tree-based classifier.

As a commenter puts it: the sampled dataframe has to contain nostoextract different instances of the corresponding class, unless there are not enough examples for a given class, in which case you just take all the available ones.

Answer

Thomas Kimber · Dec 22, 2015

Can you split your first dataframe into class-specific sub-dataframes, and then sample at will from those?

i.e.

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....
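With many classes, writing one filter per label gets tedious. As a rough sketch of the same split, a groupby can build all the sub-dataframes at once. The toy data and the name by_class are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Toy data mirroring the table in the question (illustrative only).
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "C", "A", "B", "B"],
})

# One sub-dataframe per class label, keyed by the label.
# by_class is a hypothetical name chosen for this sketch.
by_class = {label: group for label, group in df.groupby("class")}

dfa = by_class["A"]  # same rows as df[df['class'] == 'A']
```

Each value in by_class keeps the original row index, so the pieces can be traced back to df later.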

Then, once you've split the data into dfa, dfb, dfc, take as many rows from the top as you need (if the dataframes have no particular sort order):

dfasamplefive = dfa[:5]

Or use the sample method as described by a previous commenter to directly take a random sample:

dfasamplefive = dfa.sample(n=5)
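One caveat: with the default replace=False, sample raises a ValueError when n is larger than the number of rows. Since the question asks to take whatever is available in that case, one option is to cap n first. A minimal sketch, where the two-row dfa and the variable wanted are assumptions for illustration:

```python
import pandas as pd

# A class with only two rows (illustrative data).
dfa = pd.DataFrame({"col1": [4, 321], "col2": [45, 1], "class": ["A", "A"]})

wanted = 5  # hypothetical requested count

# Never ask for more rows than the class actually has.
n = min(wanted, len(dfa))

# random_state is set only to make the sketch reproducible.
dfa_sample = dfa.sample(n=n, random_state=0)
```

Here dfa_sample simply contains both available rows, in random order.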

If that suits your needs, all that's left to do is automate the process, reading the number of rows to sample for each class from your second (control) dataframe, which holds the desired counts.
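One possible way to automate this is sketched below. The function name sample_by_counts and its parameters are hypothetical, and two behaviours are assumptions rather than anything stated in the answer: classes missing from the control frame contribute no rows, and a class with too few rows contributes all of them.

```python
import pandas as pd

def sample_by_counts(df, counts, label_col="class",
                     count_col="nostoextract", random_state=None):
    """Sample up to counts[count_col] rows of df for each class label."""
    # Map each label to its requested number of rows.
    wanted = dict(zip(counts[label_col], counts[count_col]))
    parts = []
    for label, group in df.groupby(label_col):
        # Cap the request at the rows available for this class.
        n = min(int(wanted.get(label, 0)), len(group))
        if n > 0:
            parts.append(group.sample(n=n, random_state=random_state))
    if not parts:
        return df.iloc[0:0]  # empty frame with the same columns
    return pd.concat(parts)

# Data and control frame taken from the question.
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "C", "A", "B", "B"],
})
counts = pd.DataFrame({"class": ["A", "B", "C"], "nostoextract": [2, 2, 2]})

sampled = sample_by_counts(df, counts, random_state=0)
```

The rows in sampled come out grouped by class; if the classifier is sensitive to row order, shuffling with sampled.sample(frac=1) is one way to mix them.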