s3distcp: copy files and a directory from HDFS to S3 in a single command

sashmi · May 8, 2017 · Viewed 11.7k times

I have the following two files and one directory in HDFS:

-rw-r--r-- 1 hadoop hadoop 11194859 2017-05-05 19:53 hdfs:///outputfiles/abc_output.txt
drwxr-xr-x - hadoop hadoop 0 2017-05-05 19:28 hdfs:///outputfiles/sample_directory
-rw-r--r-- 1 hadoop hadoop 68507436 2017-05-05 19:55 hdfs:///outputfiles/sample_output.txt

I want to copy abc_output.txt and sample_directory in gzip format onto S3 from HDFS in a single command. I don't want the files to be combined on S3.

My S3 bucket should contain the following: abc_output.txt.gzip and sample_directory.gzip

I tried the following:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=hdfs:///outputfiles/ --dest=s3://bucket-name/outputfiles/ --outputCodec=gzip

But this copies all files and folders from source to destination.

Referring to Deduce the HDFS path at runtime on EMR, I also tried the command below:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=hdfs:///outputfiles/ --dest=s3://bucket-name/outputfiles/ --srcPattern=.*abc_output.txt.sample_directory. --outputCodec=gzip

But this failed as well.

Answer

jc mannem · May 8, 2017

S3DistCp supports two options that control which files are copied (and compressed) from source to destination: --srcPattern and --groupBy. See http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

The documentation is explicit about what can be done; anything beyond that is not supported.

Using --srcPattern, you can write a regex that matches your source files. s3distcp will simply copy each matched file to the destination individually.

For example, --srcPattern='.*(txt|sample_folder).*' will copy all files whose names contain txt, and it will recreate the matching directories in the destination in order to copy files from source folders named sample_folder.

http://regexr.com/3ftn0 (you can design regexes based on your requirements.)
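Applied to the layout in the question, a srcPattern invocation might look like the sketch below. The bucket name is a placeholder, and the regex is an assumption based on the file names shown above:

```shell
# Sketch: copy abc_output.txt and everything under sample_directory,
# gzip-compressing each matched file individually on the way to S3.
# "bucket-name" is a placeholder -- substitute your own bucket.
s3-dist-cp \
  --s3Endpoint=s3.amazonaws.com \
  --src=hdfs:///outputfiles/ \
  --dest=s3://bucket-name/outputfiles/ \
  --srcPattern='.*(abc_output\.txt|sample_directory).*' \
  --outputCodec=gzip
```

Note that sample_output.txt does not match this pattern, so it would be left behind.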

If you apply --outputCodec=gzip in addition to --srcPattern, each matched file is compressed individually; they are not zipped together as a whole. If you need all matched files gzipped into one single file (without their contents being concatenated), then you would have to run s3-dist-cp and a compression step separately.
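One possible interpretation of that separate compression step is sketched below, under the assumption that the matched files fit on local disk; the paths and bucket name are placeholders:

```shell
# Sketch: pull the matched files out of HDFS into a local staging
# directory, bundle them into one gzip-compressed tar archive
# (each file stays intact inside the archive), and upload it.
mkdir -p staging
hdfs dfs -get hdfs:///outputfiles/abc_output.txt staging/
hdfs dfs -get hdfs:///outputfiles/sample_directory staging/
tar -czf outputfiles.tar.gz -C staging .
aws s3 cp outputfiles.tar.gz s3://bucket-name/outputfiles/
```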

If you want to concatenate the file abc_output.txt and all files inside sample_directory into a single file and compress it in gzip format, you need to use --groupBy. For groupBy, the matching regex pattern should be similar, but it must contain a parenthesized group that indicates how files should be grouped: all files matching the same parenthesized value are combined into a single output file.

For example :

--groupBy='.*(file|noname).*[0-9].*' --outputCodec=gz 

on http://regexr.com/3ftn9 will concatenate the contents of all matched files and create one .gz file.
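Applied to the question's layout, a groupBy invocation might look like the sketch below. The regex is an assumption: because the capture group (outputfiles) yields the same value for every match, all matched files should fall into one group and hence one output file, while the non-capturing (?:...) part only filters which files match and does not affect grouping:

```shell
# Sketch: concatenate abc_output.txt and all files under
# sample_directory into a single gzip-compressed output on S3.
# "bucket-name" is a placeholder; the regex is an assumption.
s3-dist-cp \
  --s3Endpoint=s3.amazonaws.com \
  --src=hdfs:///outputfiles/ \
  --dest=s3://bucket-name/ \
  --groupBy='.*(outputfiles).*(?:abc_output\.txt|sample_directory).*' \
  --outputCodec=gz
```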