I have files (A,B,C etc) each having 12,000 data points. I have divided the files into batches of 1000 points and computed the value for each batch. So now for each file we have 12 values, which is loaded in a pandas Data Frame (shown below).
file value_1 value_2
0 A 1 43
1 A 1 89
2 A 1 22
3 A 1 87
4 A 1 43
5 A 1 89
6 A 1 22
7 A 1 87
8 A 1 43
9 A 1 89
10 A 1 22
11 A 1 87
12 A 1 83
13 B 0 99
14 B 0 23
15 B 0 29
16 B 0 34
17 B 0 99
18 B 0 23
19 B 0 29
20 B 0 34
21 B 0 99
22 B 0 23
23 B 0 29
24 B 0 34
25 C 1 62
- - - -
- - - -
Now as the next step I need to randomly select a file, and for that file randomly select a sequence of 4 batches for value_1. The later, I believe can be done with df.sample(), but I'm not sure how to randomly select the files. I tried to make it work with np.random.choice(data['file'].unique()), but doesn't seems correct.
Thanks for the help in advance. I'm pretty new to pandas and python in general.
If I understand what you are trying to get at, the following should be of help:
# Test dataframe
import numpy as np
import pandas as pd
data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
'value_1': np.repeat([1,0,1],12),
'value_2': np.random.randint(20, 100, 36)})
# Select a file
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)
# Get a random index from data1
start_ix = np.random.choice(data1.index[:-3])
# Get a sequence starting at the random index from the previous step
print(data.loc[start_ix:start_ix+3])