Strategies for reading in CSV files in pieces?

Ari B. Friedman · Feb 19, 2012 · Viewed 21.9k times

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.

Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?

Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks: Quickly reading very large tables as dataframes in R

I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.

So reading it in pieces still seems like the best way to go for now.

Answer

Jacob H · May 22, 2015

After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!

1) Open a connection to your file

con = file("file.csv", "r")

2) Read in chunks of data with read.csv

read.csv(con, nrows=CHUNK_SIZE,...)   # CHUNK_SIZE is a number of rows, e.g. 1e5

Side note: defining colClasses will greatly speed things up. Set the class of any unwanted column to "NULL" (the string) so that it is skipped entirely.

3) Do whatever you need to do

4) Repeat.

5) Close the connection

close(con)

The advantage of this approach comes from the connection itself. By opening a connection manually, you open the data set and do not close it until you call close(), so as you loop through the data set you never lose your place. Reading without one will likely slow things down. Imagine a data set with 1e7 rows that you want to load in chunks of 1e5 rows. Because the connection is open, we get the first 1e5 rows by running read.csv(con, nrows=1e5,...), and to get the second chunk we simply run read.csv(con, nrows=1e5,...) again, and so on. (If the file has a header row, it is consumed by the first call, so pass header=FALSE and col.names=... from the second chunk onward.)
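The steps above can be sketched as a loop. This is only a minimal sketch under stated assumptions: the demo file, its column names (id, x, junk), the tiny chunk size, and the filter x > 2 are all made up for illustration, and the header line is read separately so that every chunk can reuse the same column names.

```r
# Hypothetical demo file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, x = (1:25) / 10, junk = "z"),
          path, row.names = FALSE)

chunk_size <- 10                 # illustrative; something like 1e5 for real data
con <- file(path, "r")           # open once; the read position persists

# Consume the header line ourselves and parse the column names from it.
header_line <- readLines(con, n = 1)
col_names <- scan(text = header_line, what = "", sep = ",", quiet = TRUE)

kept <- list()
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = col_names,
             colClasses = c("integer", "numeric", "NULL")),  # "NULL" drops junk
    error = function(e) NULL     # read.csv errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  kept[[length(kept) + 1]] <- chunk[chunk$x > 2, ]  # subset before storing
  if (nrow(chunk) < chunk_size) break               # short chunk: end of file
}
close(con)

result <- do.call(rbind, kept)
```

Only the subsetted rows are kept between iterations, so peak memory is roughly one chunk plus whatever has been retained so far.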

If we did not use a connection we would get the first chunk the same way, read.csv("file.csv", nrows=1e5,...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrows=1e5,...). Clearly this is inefficient: we have to find row 1e5+1 all over again, despite having just read row 1e5.
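For contrast, here is a rough sketch of the skip-based alternative. The demo file and the helper read_chunk are hypothetical names used only for illustration. Note that when the file has a header line, skip has to count that line too, and every call re-scans the file from the top.

```r
# Hypothetical demo file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, x = (1:25) / 10), path, row.names = FALSE)

chunk_size <- 10
col_names <- names(read.csv(path, nrows = 1))  # read one row just for the names

# read_chunk is an illustrative helper, not part of base R.
read_chunk <- function(k) {
  # Skip the header line plus the (k - 1) * chunk_size rows before chunk k.
  read.csv(path, header = FALSE, col.names = col_names,
           skip = 1 + (k - 1) * chunk_size, nrows = chunk_size)
}

second <- read_chunk(2)  # rows 11 to 20 of the data
```

Each call to read_chunk(k) has to pass over all earlier lines again, which is the cost the open connection avoids.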

Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with it.

I hope this helps someone.

UPDATE

People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.