I wanted to know if there is a limit to the number of rows that can be read with data.table's fread function. I am working with a table of 4 billion rows and 4 columns, about 40 GB on disk. It appears that fread reads only the first ~840 million rows. It does not give any errors but returns to the R prompt as if it had read all the data!
I understand that fread is not intended for "prod use" at the moment, and I wanted to find out whether there is a timeframe for a production release.
The reason I am using data.table is that, for files of this size, it is extremely efficient at processing the data compared with loading the file into a data.frame, etc.
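For reference, a minimal sketch of the call in question ("file.csv" stands in for the 40 GB, 4-column file; the sep is an assumption):

library(data.table)

DT <- fread("file.csv", sep = ",")
nrow(DT)   # comes back at ~840 million rather than ~4 billion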
At the moment, I am trying two other alternatives:
1) Using scan and passing the result to data.table:
data.table(matrix(scan("file.csv", what = "integer", sep = ","), ncol = 4))
Resulted in --
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
too many items
2) Breaking the file into multiple segments of roughly 500 million rows each with Unix split, then looping over the pieces and reading each one with fread. This is a bit cumbersome, but it appears to be the only workable solution (a sketch of what I mean follows).
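A minimal sketch of that workflow, assuming the pieces were produced beforehand with something like split -l 500000000 file.csv chunk_ (the chunk file prefix, the sep, and the absence of a header row are all assumptions here):

library(data.table)

chunk_files <- list.files(pattern = "^chunk_")      # pieces produced by Unix split
pieces <- vector("list", length(chunk_files))
for (i in seq_along(chunk_files)) {
  pieces[[i]] <- fread(chunk_files[i], sep = ",")   # read each ~500M-row piece
}
DT <- rbindlist(pieces)                             # recombine into one data.table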
I think there may be an Rcpp-based way to do this even faster, but I am not sure how that is generally implemented.
Thanks in advance.
I was able to accomplish this using feedback from another post on Stack Overflow. The process was very fast: 40 GB of data was read in about 10 minutes by calling fread iteratively. Running foreach %dopar% by itself to read the files into new data.tables sequentially failed to work, due to limitations that are also mentioned in the post below.
Note: The file list (file_map) was prepared by simply running --
file_map <- list.files(pattern="test.$") # Replace pattern to suit your requirement
mclapply with big objects - "serialization is too large to store in a raw vector"
Quoting --
library(parallel)     # for mclapply
library(data.table)   # for fread

collector <- vector("list", length(file_map)) # more complex than normal for speed
for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x) # <----- CHANGED THIS LINE to fread
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}
# Additional lines (in place of the rbind in the URL above)
finalList <- data.table()   # start with an empty accumulator
for (i in 1:length(collector)) {
  finalList <- rbindlist(list(finalList, yourFunction(collector[[i]][[1]])))
}
# Replace yourFunction as needed; in my case it was an operation I performed on each segment, joining the results with rbindlist at the end.
My function included a loop using foreach %dopar% that executed across several cores for each file listed in file_map. Working file by file allowed me to use %dopar% without running into the "serialization is too large to store in a raw vector" error that I hit when running it on the combined file. A rough sketch of the pattern is below.
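For illustration only, a sketch of what such a per-file function might look like; the chunking scheme and the lapply(.SD, sum) step are placeholders rather than my actual operation, and the core count is an assumption:

library(foreach)
library(doParallel)
library(data.table)

registerDoParallel(cores = 10)   # assumed core count; adjust to your machine

yourFunction <- function(dt) {
  # split the row indices of one file into 10 chunks and process them in parallel
  idx_chunks <- split(seq_len(nrow(dt)), cut(seq_len(nrow(dt)), 10, labels = FALSE))
  res <- foreach(idx = idx_chunks, .combine = rbind, .packages = "data.table") %dopar% {
    dt[idx, lapply(.SD, sum)]   # placeholder per-chunk operation
  }
  as.data.table(res)
}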
Another helpful post is at -- loading files in parallel not working with foreach + data.table