I'm loading 28 GB file in hadoop hdfs using webhdfs and it takes ~25 mins to load.
I tried loading same file using hdfs put and It took ~6 mins. Why there is so much difference in performance?
What is recommended to use? Can somebody explain or direct me to some good link it will be really helpful.
Below us the command I'm using
curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"
this will redirect to a datanode address which I use in next step to write the data.
Hadoop provides several ways of accessing HDFS
All of the following support almost all features of the filesystem -
1. FileSystem (FS) shell commands: Provides easy access of Hadoop file system operations as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS.
This needs hadoop client to be installed and involves the client to write blocks directly to one Data Node. All versions of Hadoop do not support all options for copying between filesystems.2. WebHDFS: It defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop, Advantage being language agnostic way(curl, php etc....).
WebHDFS needs access to all nodes of the cluster and when some data is read, it is transmitted from the source node directly but **there is a overhead of http ** (1)FS Shell but works agnostically and no problems with different hadoop cluster and versions.3. HttpFS. Read and write data to HDFS in a cluster behind a firewall. Single node will act as GateWay node through which all the data will be transfered and performance wise I believe this can be even slower but preferred when needs to pull the data from public source into a secured cluster.
So choose rightly!.. Going down the list will always be an alternative when the choice above it is not available to you.