I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it is taking a while to plot.
I wanted to sub-sample 10,000 randomly chosen rows. This should be reproducible, so that the same sequence of random numbers is generated in each run.
The question "Sample two pandas dataframes the same way" seems to be on the right track, but with that approach I cannot guarantee the subsample size.
You can select random elements from the index with np.random.choice. E.g., to select 5 random rows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10))
df.loc[np.random.choice(df.index, 5, replace=False)]
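The question also asks for reproducibility, which the snippet above does not address on its own. A minimal sketch of one way to get it, assuming a fixed seed is acceptable: draw from a seeded np.random.RandomState instead of the global generator. The helper name subsample, the seed value, and the toy DataFrame are made up for illustration.

```python
import numpy as np
import pandas as pd


def subsample(df, n, seed=0):
    """Return n rows of df, chosen without replacement from a seeded generator."""
    rng = np.random.RandomState(seed)  # local generator, so the global state is untouched
    chosen = rng.choice(df.index, size=n, replace=False)
    return df.loc[chosen]


# Toy data standing in for the large .tsv file.
df = pd.DataFrame({"x": np.arange(1000)})

sub_a = subsample(df, 100, seed=42)
sub_b = subsample(df, 100, seed=42)  # same seed, so the same rows in the same order
```

With the same seed and the same input, both calls should return identical rows, and the size is always exactly n because replace=False draws distinct index labels.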
This function is new in NumPy 1.7. If you want a solution that works with an older NumPy, you can shuffle the index and take the first elements of the result:
df.loc[np.random.permutation(df.index)[:5]]
With either approach the rows of the resulting DataFrame are no longer in their original order. If the order matters for plotting (for example, a line plot), you can simply call .sort_index() afterwards (.sort() in old pandas versions).
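As a possible shortcut, newer pandas versions (0.16.1 and later) provide DataFrame.sample, which accepts a random_state for reproducibility. The sketch below assumes such a version is available and combines it with restoring the index order; the column name and sizes are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the large .tsv file.
df = pd.DataFrame({"y": np.arange(1000) ** 2})

# Draw exactly 50 rows with a fixed seed, then restore the original row order.
sampled = df.sample(n=50, random_state=1).sort_index()

# Repeating the call with the same seed should give the same subsample.
sampled_again = df.sample(n=50, random_state=1).sort_index()
```

Sorting by index after sampling keeps the subsample suitable for line plots, where the row order determines how the points are connected.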