I have the following two files and one directory in HDFS:
-rw-r--r-- 1 hadoop hadoop 11194859 2017-05-05 19:53 hdfs:///outputfiles/abc_output.txt
drwxr-xr-x - hadoop hadoop 0 2017-05-05 19:28 hdfs:///outputfiles/sample_directory
-rw-r--r-- 1 hadoop hadoop 68507436 2017-05-05 19:55 hdfs:///outputfiles/sample_output.txt
I want to copy abc_output.txt and sample_directory from HDFS to S3 in gzip format with a single command. I don't want the files to be combined on S3.
My S3 bucket should contain the following: abc_output.txt.gzip sample_directory.gzip
I tried the following:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=hdfs:///outputfiles/ --dest=s3://bucket-name/outputfiles/ --outputCodec=gzip
But this copies all files and folders from source to destination.
Referring to Deduce the HDFS path at runtime on EMR, I also tried the following command:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=hdfs:///outputfiles/ --dest=s3://bucket-name/outputfiles/ --srcPattern=.*abc_output.txt.sample_directory. --outputCodec=gzip
but this failed as well.
S3DistCp supports two options that control how files are selected and compressed while copying from source to destination: --srcPattern and --groupBy. See http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
The documentation spells out what can be done; anything beyond that cannot be done.
Using --srcPattern, you write a regex that matches your source files; s3-dist-cp then copies each matched file to the destination individually.
For example, --srcPattern='.*(txt|sample_folder).*' will copy all files having a txt extension, and it will create the matching directories in the destination to copy files that live inside source folders named sample_folder.
http://regexr.com/3ftn0 (you can design regexes based on your requirement.)
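As a quick sanity check of how such a --srcPattern regex selects files, here is a minimal sketch in Python (the paths below are hypothetical, not from the question):

```python
import re

# The --srcPattern regex from the example above
pattern = re.compile(r".*(txt|sample_folder).*")

# Hypothetical source paths
paths = [
    "hdfs:///outputfiles/data1.txt",
    "hdfs:///outputfiles/sample_folder/part-00000",
    "hdfs:///outputfiles/image.png",
]

# s3-dist-cp copies each path the regex matches, individually
matched = [p for p in paths if pattern.match(p)]
print(matched)  # the .txt file and the file under sample_folder
```

Note the pattern matches txt anywhere in the path, not strictly as an extension, so anchor it (e.g. '.*\.txt$') if that distinction matters for your data.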
If you apply --outputCodec=gzip in addition to --srcPattern, each matched file is compressed individually; they are not zipped together as a whole. If you need all matched files gzipped into one single file (without their contents concatenated), you would have to run s3-dist-cp first and then a separate compression step on the output.
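To illustrate that two-step alternative, here is a rough sketch using Python's tarfile module: the copied files are bundled into one gzip archive as separate members, rather than having their contents concatenated as --groupBy would do. File names and contents are hypothetical.

```python
import os
import tarfile
import tempfile

# Stand-ins for files already copied out of HDFS (hypothetical contents)
workdir = tempfile.mkdtemp()
for name, data in [("abc_output.txt", b"a\n"), ("sample_output.txt", b"b\n")]:
    with open(os.path.join(workdir, name), "wb") as f:
        f.write(data)

# Step 2: bundle them into one .tar.gz; each file stays a distinct member
archive = os.path.join(workdir, "outputs.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    for name in ("abc_output.txt", "sample_output.txt"):
        tar.add(os.path.join(workdir, name), arcname=name)

with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)  # both file names, preserved as separate archive members
```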
If you want to concatenate abc_output.txt and all files inside sample_directory into a single file and compress that in gzip format, you need to use --groupBy. For groupBy, the matching regex must contain a parenthesized group that indicates how files should be grouped: all files whose match for the parenthesized part is the same are combined into a single output file.
For example:
--groupBy='.*(file|noname).*[0-9].*' --outputCodec=gz
on http://regexr.com/3ftn9 will concatenate the contents of all matched files and create one .gz file per group.
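The grouping behavior can be modeled roughly as follows: files whose parenthesized capture yields the same value end up in the same output file. This is a sketch with hypothetical file names, not s3-dist-cp's actual implementation:

```python
import re
from collections import defaultdict

# The --groupBy regex from the example above
pattern = re.compile(r".*(file|noname).*[0-9].*")

# Hypothetical input file names
files = ["myfile1.log", "myfile2.log", "noname3.log", "other.log"]

groups = defaultdict(list)
for name in files:
    m = pattern.fullmatch(name)
    if m:
        # Files sharing the same captured value are concatenated
        # into a single (optionally gzip-compressed) output file
        groups[m.group(1)].append(name)

print(dict(groups))  # 'file' group gets two files, 'noname' gets one
```

Note that "other.log" is silently skipped, since s3-dist-cp only copies files that match the --groupBy pattern.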