R+Hadoop: How to read a CSV file from HDFS and execute mapreduce?

Hao Huang · Aug 7, 2013 · Viewed 8.4k times

In the following example:

  small.ints = to.dfs(1:1000)
  mapreduce(
    input = small.ints, 
    map = function(k, v) cbind(v, v^2))

The data input for the mapreduce function is an object named small.ints, which refers to blocks in HDFS.

Now I have a CSV file already stored in HDFS as

"hdfs://172.16.1.58:8020/tmp/test_short.csv"

How do I get an object for it?

As far as I know (which may be wrong), if I want data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:

data=from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",make.input.format(format="csv",sep=","))
mydata=data$val

It seems OK to use this method to get mydata and then do object=to.dfs(mydata), but the problem is that the test_short.csv file is huge, around a terabyte in size, and memory can't hold the output of from.dfs!

Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" directly as the mapreduce input and call from.dfs() inside the map function, will I be able to get the data blocks?

Any advice would be appreciated!

Answer

piccolbo · Aug 7, 2013

mapreduce(input = path, input.format = make.input.format(...), map ...)

from.dfs is for small data. In most cases you won't use from.dfs in the map function. The map function's arguments (k and v) already hold a portion of the input data.
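
To make the answer concrete, here is a minimal sketch of passing the HDFS path straight to mapreduce. It assumes rmr2 is installed and configured for the cluster (for example, the HADOOP_CMD and HADOOP_STREAMING environment variables are set), and it reuses the path and CSV format from the question. The row-counting map and the summing reduce are hypothetical placeholders for whatever per-chunk work is actually needed.

```r
library(rmr2)

# Path taken from the question; the file is read in place on HDFS.
csv.path <- "hdfs://172.16.1.58:8020/tmp/test_short.csv"

# Same input format the question passed to from.dfs().
csv.format <- make.input.format(format = "csv", sep = ",")

result <- mapreduce(
  input = csv.path,
  input.format = csv.format,
  # Each call to map sees only one chunk of the file: with the csv
  # format, v is a data frame of rows, so no from.dfs() is needed here.
  map = function(k, v) {
    # Hypothetical per-chunk work: emit the number of rows in this chunk.
    keyval(1, nrow(v))
  },
  # Hypothetical reduce: add up the per-chunk row counts.
  reduce = function(k, counts) keyval(k, sum(counts))
)

# The output is tiny, so pulling it into memory with from.dfs is fine.
from.dfs(result)
```

Only the final, reduced result is brought into R's memory. The terabyte-sized input never leaves HDFS as a whole, which is why from.dfs is reserved for the small output rather than the input.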