Random Sample of a subset of a dataframe in Pandas

WGP · Jun 28, 2016 · Viewed 65.6k times

Say I have a DataFrame with 100,000 entries and want to split it into 100 sections of 1,000 entries each.

How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1,000 rows are the first section, the next 1,000 are the second, and so on.

Many thanks.

Answer

Andy Hayden · Jun 28, 2016

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*Call it on one of the section DataFrames, not on the full frame.
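
For the layout in the question (100,000 ordered rows, 100 sections of 1,000), one way to get a section is to slice by position with iloc and then call sample on the slice. This is only a minimal sketch: the frame below is made up, and `section_size`, `k`, and `sample_section` are hypothetical names, not anything from pandas. Passing `random_state` is optional and just makes the draw repeatable.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data: 100,000 ordered rows (made up for illustration).
df = pd.DataFrame({"A": np.arange(100_000), "B": np.arange(100_000) * 2})

section_size = 1000  # rows per section, as described in the question


def sample_section(frame, k, n=50, random_state=None):
    """Return n random rows from the k-th section (0-based) of frame."""
    start = k * section_size
    # iloc slices by position, so this works whatever the index labels are.
    section = frame.iloc[start:start + section_size]
    return section.sample(n=n, random_state=random_state)


# 50 rows drawn without replacement from rows 3000..3999.
picked = sample_section(df, k=3, random_state=0)
```

The sampled rows keep their original index labels, so you can still tell where in the full frame each one came from.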

Note: If the sample size is larger than the DataFrame, this raises an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
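
If a section might hold fewer rows than the sample you ask for (a short final section, say), an alternative to replace=True is to cap the sample size at the section length. A rough sketch, where `safe_sample` is a hypothetical helper and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])


def safe_sample(frame, n):
    """Sample without replacement, never asking for more rows than exist."""
    return frame.sample(n=min(n, len(frame)))


# Only 4 rows exist, so this returns all 4 in random order instead of raising.
capped = safe_sample(df, 5)

# With replacement, 5 rows come back and at least one row must repeat.
with_repeats = df.sample(5, replace=True)
```

Which one you want depends on whether repeated rows are acceptable in the sample.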