How to list all files in a directory and its subdirectories in hadoop hdfs

nik686 · Jul 5, 2012 · Viewed 102.3k times

I have a folder in HDFS that has two subfolders, each of which has about 30 subfolders, each of which, finally, contains XML files. I want to list all the XML files given only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this:

FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );

but it only lists the two immediate subfolders and does not go any deeper. Is there any way to do this in Hadoop?
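The behaviour described above is expected: listStatus() returns only the immediate children of a directory, so going deeper means recursing on the directory entries yourself. A minimal sketch of that workaround, assuming the same FileSystem/Path setup as above (the XmlLister class name and the .xml filter are illustrative, not part of the question):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XmlLister {
    // listStatus() only returns the direct children of "dir",
    // so recurse into each subdirectory manually.
    static void collectXml(FileSystem fs, Path dir, List<Path> out) throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                collectXml(fs, status.getPath(), out);   // descend into subfolders
            } else if (status.getPath().getName().endsWith(".xml")) {
                out.add(status.getPath());               // keep only XML files
            }
        }
    }
}
```

Calling collectXml(fs, new Path(args[0]), result) then fills result with every XML file under the main folder, however deeply nested.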

Answer

Prasoon Joshi · Sep 3, 2014

If you are using the Hadoop 2.x API, there is a more elegant solution:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    // the second boolean parameter turns on recursive listing
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while (fileStatusListIterator.hasNext()) {
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        // do stuff with the file, e.g.
        job.addFileToClassPath(fileStatus.getPath());
    }