I have some R scripts, where I have to load several dataframe in R as quickly as possible. This is quite important as reading the data is the slowest part of the procedure. E.g.: plotting from different dataframes. I get the data in sav (SPSS) format, but I could transform it to any format as suggested. Merging the dataframes is not an option unfortunately.
What could be the fastest way to load the data? I was thinking of the following:
read.spss
.attr
values of the variables (e.g. variable.labels from Spss imported files)? Or this should be done in a separate table?Any other thoughts are welcome. Thank you for every suggestion in advance!
I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution loading only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, that is why I made up (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development, any recommendation is welcome!
I will soon post a vignette with accurate benchmark results with the help of microbenchmark package.
Thank you all for the tips and answers, I did some summary and experiment based on that.
See a little test with a public database (ESS 2008 in Hungary) below. The database have 1508 cases and 508 variables, so it could be a mid-sized data. That might be a good example to do the test on (for me), but of course special needs would require an experiment with adequate data.
Reading the data from SPSS sav file without any modification:
> system.time(data <- read.spss('ESS_HUN_4.sav'))
user system elapsed
2.214 0.030 2.376
Loading with a converted binary object:
> save('data',file='ESS_HUN_4.Rdata')
> system.time(data.Rdata <- load('ESS_HUN_4.Rdata'))
user system elapsed
0.28 0.00 0.28
Trying with csv:
> write.table(data, file="ESS_HUN_4.csv")
> system.time(data.csv <- read.csv('ESS_HUN_4.csv'))
user system elapsed
1.730 0.010 1.824
Trying with "fine-tuned" csv loading:
> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=","))
user system elapsed
1.296 0.014 1.362
Also with package sqldf, which seems to load csv files a lot faster:
> library(sqldf)
> f <- file("ESS_HUN_4.csv")
> system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t")))
user system elapsed
0.939 0.106 1.071
And also loading the data from a MySQL database running on localhost:
> library(RMySQL)
> con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='')
> dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE)
> system.time(data <- dbReadTable(con, 'data'))
user system elapsed
0.583 0.026 1.055
> query <-('SELECT * FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
user system elapsed
0.270 0.020 0.473
Here, I think we should add the two system.time
reported, as connecting to the data also counts in our case. Please comment, if I misunderstood something.
But let us see if querying only some variables, as eg. while plotting we do not need all the dataframe in most cases, and querying only two variables is enough to create a nice plot of them:
> query <-('SELECT c1, c19 FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
user system elapsed
0.030 0.000 0.112
Which seems really great! Of course just after loading the table with dbReadTable
Summary: nothing to beat reading the whole data from binary file, but reading only a few columns (or other filtered data) from the same database table might be also weighted in some special cases.
Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD.
UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object - which looks a lot faster then any method examined above.
Be aware: the code will look really bad, but still very effective :)
First, I save all columns of a data.frame into different binary objects via the following loop:
attach(data)
for (i in 1:length(data)) {
save(list=names(data)[i],file=paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep=''))
}
detach(data)
And then I load two columns of the data:
> system.time(load('ESS_HUN_4-c19.Rdata')) +
> system.time(load('ESS_HUN_4-c1.Rdata')) +
> system.time(data.c1_c19 <- cbind(c1, c19))
user system elapsed
0.003 0.000 0.002
Which looks like a "superfast" method! :) Note: it was loaded 100 times faster than the fastest (loading the whole binary object) method above.
I have made up a very tiny package (named: saves), look in github for more details if interested.
UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - if only the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources for details or the one on my homepage, and let me introduce also a nice boxplot of some benchmark done:
This boxplot shows the benefit of using saves package to load only a subset of variables against load
and read.table
or read.csv
from base, read.spss
from foreign or sqldf
or RMySQL
packages.