Batch rename in Hadoop

beefyhalo · Feb 6, 2013 · Viewed 11.3k times

How can I rename all files in an HDFS directory to have a .lzo extension? .lzo.index files should not be renamed.

For example, this directory listing:

file0.lzo file0.lzo.index file0.lzo_copy_1 

could be renamed to:

file0.lzo file0.lzo.index file0.lzo_copy_1.lzo 

These files are LZO-compressed, and I need them to have the .lzo extension to be recognized by Hadoop.

Answer

mt_ · Feb 6, 2013

If you don't want to write Java code for this, I think using the command-line HDFS API is your best bet:

mv in Hadoop

hadoop fs -mv URI [URI …] <dest>
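
For example, renaming the copy from the question's listing (assuming it lives under /user/foo/bar) would look like:

% hadoop fs -mv /user/foo/bar/file0.lzo_copy_1 /user/foo/bar/file0.lzo_copy_1.lzo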

You can get the paths using a small one liner:

% hadoop fs -ls /user/foo/bar | awk  '!/^d/ {print $8}'

/user/foo/bar/blacklist
/user/foo/bar/books-eng
...

The awk filter removes directories from the output (their permission string starts with d; $8 is the path column). Now you can put these files into a variable:

% files=$(hadoop fs -ls /user/foo/bar | awk  '!/^d/ {print $8}')

and rename each file:

% for f in $files; do hadoop fs -mv $f $f.lzo; done
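
If you'd rather avoid the intermediate variable, the same rename works as a single pipeline with xargs (a sketch, untested; the extra $8 condition skips the blank line awk emits for the "Found n items" header):

% hadoop fs -ls /user/foo/bar | awk '!/^d/ && $8 {print $8}' | xargs -I{} hadoop fs -mv {} {}.lzo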

You can also use awk to filter the files by other criteria. The following should exclude files matching the regex nolzo from the list; it's untested, but it shows how flexible these filters can be:

% files=$(hadoop fs -ls /user/foo/bar | awk  '!/^d|nolzo/ {print $8}' )
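
For the criteria in the question itself, skipping files that already end in .lzo as well as the .lzo.index files, a filter along these lines should do it (equally untested):

% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d/ && $8 !~ /\.lzo(\.index)?$/ {print $8}')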

Test that it works by replacing the hadoop command with echo:

% for f in $files; do echo $f $f.lzo; done
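
With the example listing from above, that dry run should print the planned renames without touching anything, roughly:

/user/foo/bar/blacklist /user/foo/bar/blacklist.lzo
/user/foo/bar/books-eng /user/foo/bar/books-eng.lzo
...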

Edit: Updated examples to use awk instead of sed for more reliable output.

The "right" way to do it is probably using the HDFS Java API .. However using the shell is probably faster and more flexible for most jobs.