What is a good way to read line-by-line in R?

David B · Nov 5, 2010

I have a file where each line is a set of results collected in a specific replicate of an experiment. The number of results in each experiment (i.e. the number of columns in each row) may differ. The order of the results within a row also carries no meaning (the first result in row 1 and the first result in row 2 are no more related than any other pair; these are sets of results).

The file looks something like this:

2141 0 5328 5180 357 5335 1 5453 5325 5226 7 4880 5486 0 
2650 0 5280 4980 5243 5301 4244 5106 5228 5068 5448 3915 4971 5585 4818 4388 5497 4914 5364 4849 4820 4370
2069 2595 2478 4941 
2627 3319 5192 5106 32 4666 3999 5503 5085 4855 4135 4383 4770 
2005 2117 2803 2722 2281 2248 2580 2697 2897 4417 4094 4722 5138 5004 4551 5758 5468 17361 
1914 1977 2414 100 2711 2171 3041 5561 4870 4281 4691 4461 5298 3849 5166 5578 5520 4634 4836 4905 5105 5089
2539 2326 0 4617 3735 0 5122 5439 5238 1
25 5316 21173 4492 5038 5944 5576 5424 5139 5184 5 5096 4963 2771 2808 2592 2
4963 9428 17152 5467 5202 6038 5094 5221 5469 5079 3753 5080 5141 4097 5173 11338 4693 5273 5283 5110 4503 51
2024 2 2822 5097 5239 5296 4561 

except each line is much longer (up to a few thousand values). As can be seen, all values are non-negative integers.

In short, this is not a normal table where the columns have meanings. It's just a bunch of results, one set per line.

I would like to read all the results, then do some operations on each experiment (row), such as calculating the ecdf. I would also like to calculate the average ecdf over all the replicates.

My problem: how should I read this strange-looking file? I'm so used to read.table that I'm not sure I've ever tried anything else. Do I have to use something low-level like readLines? I guess the preferred output would be a list (or vector?) of vectors. I looked at scan, but it seems all vectors must be of the same length there.

Any suggestions will be appreciated.

UPDATE Following the suggestions below, I now do something like this:

con <- file('myfile')
open(con)
results.list <- list()
current.line <- 1
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  results.list[[current.line]] <- as.integer(unlist(strsplit(line, split = " ")))
  current.line <- current.line + 1
}
close(con)

Seems to work. Does it look OK?

When I run summary(results.list) I get:

      Length Class  Mode  
 [1,] 1091   -none- numeric
 [2,] 1070   -none- numeric
   ....

Shouldn't the class be integer? And what is the mode?

Answer

JD Long · Nov 5, 2010

The example Josh linked to is one that I use all the time.

inputFile <- "/home/jal/myFile.txt"
con <- file(inputFile, open = "r")

dataList <- list()
ecdfList <- list()

# Read one line at a time until the connection is exhausted
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    # Split the line on spaces and convert the pieces to numbers
    myVector <- as.numeric(strsplit(oneLine, " ")[[1]])
    dataList <- c(dataList, list(myVector))

    # Wrap the ecdf in list() so it is appended as a single element
    myEcdf <- ecdf(myVector)
    ecdfList <- c(ecdfList, list(myEcdf))
}

close(con)

I edited the example to create two lists from your example data. dataList is a list in which each element is the vector of numeric values parsed from one line of your text file. ecdfList is a list in which each element is the ecdf for the corresponding line.
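
Once the two lists exist, one possible reading of the "average ecdf over all the replicates" from the question is a pointwise mean of the per-line ecdfs on a shared grid. The sketch below is only an illustration under that assumption. It builds the lists with lapply over three rows copied from the question instead of the connection loop, and the names sampleLines, grid, ecdfMatrix and meanEcdf are made up for the example.

```r
# Three rows copied from the question; a real run would read them from the file.
sampleLines <- c("2069 2595 2478 4941",
                 "2539 2326 0 4617 3735 0 5122 5439 5238 1",
                 "2024 2 2822 5097 5239 5296 4561")

# Split each line on whitespace and convert the pieces to integer vectors.
dataList <- lapply(strsplit(trimws(sampleLines), "[[:space:]]+"), as.integer)
ecdfList <- lapply(dataList, ecdf)

# Shared evaluation grid (assumption: every distinct observed value).
grid <- sort(unique(unlist(dataList)))

# Evaluate each ecdf on the grid: rows are grid points, columns are lines.
ecdfMatrix <- sapply(ecdfList, function(f) f(grid))

# Pointwise mean across lines.
meanEcdf <- rowMeans(ecdfMatrix)
```

Each ecdf is an ordinary function, so ecdfList[[1]](3000) gives the fraction of values in the first line that are at most 3000.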

You should probably add some try() or tryCatch() logic in there to handle cases where the ecdf can't be created, for example because a line is empty or contains no usable values. But the above example should get you pretty close. Good luck!
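
As a rough sketch of that error handling, the ecdf() call could be wrapped in a small helper that returns NULL instead of stopping the loop. The helper name safeEcdf is hypothetical, and turning the error into a warning is just one choice among several.

```r
# Hypothetical helper: returns an ecdf, or NULL when one cannot be built.
safeEcdf <- function(values) {
  tryCatch(
    ecdf(values),
    error = function(e) {
      # Report the problem but let the caller carry on with the next line.
      warning("skipping line: ", conditionMessage(e))
      NULL
    }
  )
}

okEcdf  <- safeEcdf(c(2069, 2595, 2478, 4941))
# An empty line yields no values, which makes ecdf() fail.
badEcdf <- suppressWarnings(safeEcdf(numeric(0)))
```

Inside the reading loop, the caller can check is.null() on the result before appending it to ecdfList.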