Spark - Which instance type is preferred for AWS EMR cluster?

shihpeng · May 25, 2015

I am running some machine learning algorithms on an EMR Spark cluster. Which kind of instance should I use to get the best cost/performance ratio?

At roughly the same price, I can choose among:

Instance    vCPU   ECU   Memory (GiB)
m3.xlarge   4      13    15
c4.xlarge   4      16    7.5
r3.xlarge   4      13    30.5

Which kind of instance should be used in an EMR Spark cluster?

Answer

eliasah · May 25, 2015

Generally speaking, it depends on your use case and your needs. But I can suggest a minimum configuration based on the information you have shared.

You seem to be trying to train an ALS factorization or SVD on matrices of about 2 to 4 GB. That's actually not much data.

You'll need at least 1 master and 2 worker nodes to set up and configure a small distributed cluster. The master won't be doing any of the computing, so it won't need many resources, but it will of course be handling task scheduling and the like.

You can add slaves (instances) according to your needs.

  • 1 x master : m5.xlarge (replacing the older m3.xlarge) - vCPU : 4 , RAM : 16 GB with EBS storage.
  • 2 x slaves : c5.4xlarge (replacing the older c3.4xlarge) - vCPU : 16, RAM : 32 GB with EBS storage.
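
As a rough illustration, the layout above (1 master plus 2 workers) could be written down as a plain Python dictionary in the shape that boto3's EMR `run_job_flow` call accepts for its `Instances` argument. This is only a sketch: the helper `build_instance_groups` and the group names are made up for illustration, and c5.4xlarge is assumed for the workers because it matches the 16 vCPU / 32 GB figures. Nothing here talks to AWS.

```python
def build_instance_groups(master_type="m5.xlarge",
                          worker_type="c5.4xlarge",
                          worker_count=2):
    """Return an EMR-style instance layout: 1 master plus N core nodes."""
    if worker_count < 1:
        raise ValueError("need at least one worker node")
    return {
        "InstanceGroups": [
            {
                "Name": "master",            # hypothetical label
                "InstanceRole": "MASTER",
                "InstanceType": master_type,
                "InstanceCount": 1,
            },
            {
                "Name": "workers",           # hypothetical label
                "InstanceRole": "CORE",
                "InstanceType": worker_type,
                "InstanceCount": worker_count,
            },
        ],
        # Assumption: shut the cluster down once its steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    }


layout = build_instance_groups()
total_nodes = sum(g["InstanceCount"] for g in layout["InstanceGroups"])
print(total_nodes)
```

Adding more workers later is then just a matter of raising `worker_count`.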

EDIT : As mentioned in the comments, 5th generation instances are now available for each of the instance types mentioned in this thread: R5, M5, and C5. In general, latest-generation instance types are cheaper and more performant than their older counterparts.

C3, C4, and C5 are compute-optimized instances with high-performance processors and the lowest price per unit of compute performance in EC2. The memory-optimized R3, R4, and R5 families are the ones recommended for distributed memory caches and in-memory analytics, but C5 will do the job for you at a lower price.
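
To see the compute-versus-memory trade-off in numbers, here is a small sketch that works out memory per vCPU from the three instances listed in the question. The figures are copied from that table; the helper name `memory_per_vcpu` is just for illustration, and whether your job is compute-bound or memory-bound is something you would still have to measure.

```python
# (vCPU, ECU, memory in GiB), copied from the table in the question.
INSTANCES = {
    "m3.xlarge": (4, 13, 15.0),
    "c4.xlarge": (4, 16, 7.5),
    "r3.xlarge": (4, 13, 30.5),
}


def memory_per_vcpu(name):
    """GiB of RAM available per vCPU for the given instance type."""
    vcpu, _ecu, mem_gib = INSTANCES[name]
    return mem_gib / vcpu


ratios = {name: memory_per_vcpu(name) for name in INSTANCES}
most_compute = max(INSTANCES, key=lambda n: INSTANCES[n][1])  # highest ECU
most_memory = max(ratios, key=ratios.get)                     # most RAM per vCPU

for name in sorted(ratios, key=ratios.get):
    print(f"{name}: {ratios[name]:.3f} GiB per vCPU")
```

At the same vCPU count, the c4 gives the most ECU and the r3 the most memory per core, with the m3 in between.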

Performance Optimizations :

  • Amazon EMR charges in hourly increments. This means that once you start a cluster, you pay for the entire hour. That's important to remember, because if you are paying for a full hour of an Amazon EMR cluster anyway, improving your data processing time by a matter of minutes may not be worth your time and effort.

  • Don't forget that adding more nodes to increase performance is cheaper than spending time optimizing your cluster.
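
To make the two points above concrete, here is a minimal sketch of the billing model as described (every started hour is charged in full for each node). The rate and the run times are invented for illustration and are not real AWS prices, and the second comparison assumes the job scales perfectly when the node count doubles, which real jobs rarely do.

```python
import math


def billed_cost(runtime_minutes, nodes, hourly_rate):
    """Cost when every started hour is billed in full for each node."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * nodes * hourly_rate


RATE = 0.25  # hypothetical price per node-hour

# Shaving 15 minutes off a 55-minute job changes nothing: both bill one hour.
cost_55_min = billed_cost(55, nodes=2, hourly_rate=RATE)
cost_40_min = billed_cost(40, nodes=2, hourly_rate=RATE)

# Assuming perfect scaling, twice the nodes for half the time costs the same.
cost_2_nodes = billed_cost(120, nodes=2, hourly_rate=RATE)
cost_4_nodes = billed_cost(60, nodes=4, hourly_rate=RATE)

print(cost_55_min, cost_40_min, cost_2_nodes, cost_4_nodes)
```

Under these assumptions the extra nodes get you the result in half the wall-clock time for the same bill, while the tuning effort only pays off if it moves you under an hour boundary.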

Reference : Amazon EMR Best Practices - Parviz Deyhim.

EDIT : You might also consider enabling Ganglia to monitor your cluster resources: CPU, RAM, and network I/O. This would also help you tune your EMR cluster. In practice there is no configuration to do: just follow the documentation to add it to your EMR cluster at creation time.
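
If you create the cluster through boto3's `run_job_flow`, applications are requested through its `Applications` parameter as a list of `{"Name": ...}` entries. A minimal sketch of building that list follows; the helper `emr_applications` is hypothetical, and whether Ganglia is offered depends on the EMR release you pick, so check the release documentation first.

```python
def emr_applications(with_ganglia=True):
    """Build the Applications list for a cluster-creation request."""
    apps = [{"Name": "Spark"}]
    if with_ganglia:
        # Requested at creation time, alongside Spark.
        apps.append({"Name": "Ganglia"})
    return apps


apps = emr_applications()
print([a["Name"] for a in apps])
```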