I'm working on a job that processes a nested directory structure containing files on multiple levels:
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
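For reference, here's a minimal sketch of the kind of walk I mean, written against the new-API FileInputFormat; it adds each file rather than each directory, which is the variant I'd expect to sidestep the directory-as-split crash (class and method names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class NestedInputs {
        // Descend from root, registering every plain file as an input
        // path; directories are only traversed, never added, so no
        // input split can end up pointing at a directory.
        public static void addNestedInputs(Job job, FileSystem fs, Path root)
                throws IOException {
            for (FileStatus status : fs.listStatus(root)) {
                if (status.isDir()) {
                    addNestedInputs(job, fs, status.getPath());
                } else {
                    FileInputFormat.addInputPath(job, status.getPath());
                }
            }
        }
    }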
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
EDIT - apparently there's an open bug on this.
I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.
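As far as I can tell this works because FileInputFormat expands glob patterns in input paths, and any directory the glob matches is then listed one level further down. The same trick should therefore work for a plain Java job; a sketch using the tree from the question (job setup details omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class GlobInput {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "glob-input");
            // "one/*/*" matches one/two/bar.txt and one/two/gaa.txt directly;
            // it also matches the directory one/three/four, whose files are
            // picked up when FileInputFormat lists that directory.
            FileInputFormat.addInputPath(job, new Path("one/*/*"));
            // ... set mapper, reducer, output path, etc., then submit.
        }
    }

Note that a glob like this only reaches a fixed depth (matched directories are listed one level further), so deeper trees would need additional */ components.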