In the following example:
small.ints = to.dfs(1:1000)
mapreduce(
input = small.ints,
map = function(k, v) cbind(v, v^2))
The data input for the mapreduce function is an object named small.ints, which refers to blocks in HDFS.
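As I understand it, the full pattern behind that example looks roughly like this (result and squares are just my own variable names; this is a sketch, not code from the tutorial):
library(rmr2)

# to.dfs writes the vector to a temporary location on HDFS and returns an
# object that only points at those blocks; the data itself stays in HDFS
small.ints = to.dfs(1:1000)

# mapreduce likewise returns a reference to its HDFS output, which can be
# read back into R with from.dfs (safe here only because the result is small)
result = mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
squares = from.dfs(result)$val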
Now I have a CSV file already stored in HDFS at
"hdfs://172.16.1.58:8020/tmp/test_short.csv"
How do I get an object for it?
As far as I know (and I may be wrong), if I want data from a CSV file as input for mapreduce, I first have to build a table in R that contains all the values in the CSV file. I do have a method like this:
data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                format = make.input.format(format = "csv", sep = ","))
mydata = data$val
It seems OK to use this method to get mydata, and then do object = to.dfs(mydata), but the problem is that test_short.csv is huge, around a terabyte, and memory can't hold the output of from.dfs!
Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" as the mapreduce input directly and do the from.dfs() thing inside the map function, would I get the data in blocks?
Please give me some advice!
mapreduce(input = path, input.format = make.input.format(...), map = ...)
from.dfs is for small data. In most cases you won't use from.dfs in the map function; the map function's arguments already hold a portion of the input data.
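A minimal sketch of that, assuming rmr2 is loaded and the CSV has no header row (csv.format and result are my own names, and the column picked for the squared values plus the keyval wrapping are only illustrative):
library(rmr2)

csv.format = make.input.format(format = "csv", sep = ",")

result = mapreduce(
  input = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
  input.format = csv.format,
  map = function(k, v) {
    # with the csv input format, v arrives as a data frame containing one
    # chunk of rows, never the whole file, so memory use stays bounded
    # even for a TB-sized CSV
    keyval(NULL, cbind(v, v[, 1]^2))
  })
The result is again only a reference to output stored in HDFS; read it back with from.dfs only if that output is small enough to fit in memory.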