I'm working on a job that processes a nested directory structure containing files on multiple levels:
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
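For reference, here's a minimal sketch of the kind of walk I mean, written against the new-API FileInputFormat; it adds each file rather than each directory, which is the variant I'd expect to sidestep the directory-as-split crash (class and method names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class NestedInputs {
        // Descend from root, registering every plain file as an input
        // path; directories are only traversed, never added, so no
        // input split can end up pointing at a directory.
        public static void addNestedInputs(Job job, FileSystem fs, Path root)
                throws IOException {
            for (FileStatus status : fs.listStatus(root)) {
                if (status.isDir()) {
                    addNestedInputs(job, fs, status.getPath());
                } else {
                    FileInputFormat.addInputPath(job, status.getPath());
                }
            }
        }
    }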
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
EDIT - apparently there's an open bug on this.
I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.
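As far as I can tell this works because FileInputFormat expands glob patterns in input paths, and any directory the glob matches is then listed one level further down. The same trick should therefore work for a plain Java job; a sketch using the tree from the question (job setup details omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class GlobInput {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "glob-input");
            // "one/*/*" matches one/two/bar.txt and one/two/gaa.txt directly;
            // it also matches the directory one/three/four, whose files are
            // picked up when FileInputFormat lists that directory.
            FileInputFormat.addInputPath(job, new Path("one/*/*"));
            // ... set mapper, reducer, output path, etc., then submit.
        }
    }

Note that a glob like this only reaches a fixed depth (matched directories are listed one level further), so deeper trees would need additional */ components.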