Load a small random sample from a large CSV file into an R data frame

P.Escondido · Mar 7, 2014 · Viewed 14.9k times

The CSV file to be processed does not fit into memory. How can one read ~20K random lines from it into a data frame to compute basic statistics on the sample?

Answer

Jed · Mar 7, 2014

You can also just do it in the terminal with Perl.

perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

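This won't necessarily get you exactly 20,000 lines: each line is kept independently with probability 0.01, so you get roughly 1% of the total. To land near 20,000, set the threshold to about 20,000 divided by the number of lines in the file. It will, however, be really fast, since the file is streamed line by line and never held in memory, and you'll end up with both the original and the subset in your directory. You can then load the smaller file into R however you want, for example with read.csv.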
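
If you need exactly 20,000 rows, or the CSV has a header row you want to keep (the one-liner above treats the first line like any other, so it usually gets dropped), something along these lines might work. It is only a sketch: sample_csv is a name I made up, it assumes GNU coreutils shuf is installed, and it assumes no quoted field contains a line break.

```shell
# Hypothetical helper, not part of the original answer.
# Usage: sample_csv INPUT OUTPUT N
# Assumes GNU coreutils `shuf` and one CSV record per physical line.
sample_csv() {
    # Copy the header row unchanged.
    head -n 1 "$1" > "$2"
    # Append N data rows drawn at random without replacement.
    # If the file has fewer than N data rows, all of them are kept.
    tail -n +2 "$1" | shuf -n "$3" >> "$2"
}
```

Calling sample_csv biglist.txt subset.txt 20000 would then write the header plus 20,000 random rows to subset.txt. Unlike the rand() filter, this gives a fixed sample size, at the cost of depending on shuf.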