I have implemented a task in Hive. Currently it is working fine on my single node cluster. Now I am planning to deploy it on AWS.
I don't know anything about the AWS. If I plan to deploy it then what should I choose Amazon EC2 or Amazon EMR?
I want to improve the performance of my task. Which one is better and reliable for me? How to approach towards them? I heard that we can also register our VM setting as it is on AWS. Is it possible?
Please suggest me as soon as possible.
Many Thanks.
EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster for running Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little bit extra as compared to an EC2 instance. A quick check on Amazon prices today reveals that a small EC2 instances costs $0.08/hour while a small EMR instance costs $0.015/hour extra.
In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), creating and maintaining and AMI and using it. Moreover, EMR's version of Hadoop and Hive has some patches that are not available (atleast, not yet) on Apache Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or may be, the cloudera distributions) and wouldn't have access to those patches (like native support for S3 or commands like ALTER TABLE my_table RECOVER PARTITIONS
References: