I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file.
The only options I know of are read.table, which is very wasteful when I only want a couple of columns, or scan, which seems too low-level for what I want.
Is there a better option, either in pure R or by calling out to some other shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the question: how do you call a shell script and capture its output in R?)
Sometimes I do something like this when I have the data in a tab-delimited file:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
That lets cut do the data selection, which it can do without using much memory at all.
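The same idea carries over to comma-delimited files via cut's -d option. A sketch, assuming a hypothetical myFile.csv with a header row:

# cut passes the header line through, so read.table can still pick up
# the column names with header = TRUE
df <- read.table(pipe("cut -d',' -f1,5,28 myFile.csv"),
                 sep = ",", header = TRUE)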
See "Only read limited number of columns" for a pure-R version, using "NULL" in the colClasses argument to read.table.
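For reference, a minimal sketch of that colClasses trick, assuming a hypothetical tab-delimited myFile.txt with 28 columns of which we want columns 1, 5, and 28:

# Mark every column "NULL" so read.table skips it entirely, then set
# the wanted columns to NA so their types are auto-detected
classes <- rep("NULL", 28)
classes[c(1, 5, 28)] <- NA
df <- read.table("myFile.txt", sep = "\t", colClasses = classes)

Only the three selected columns end up in memory, although R still has to scan every line of the file.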