Best practice Cassandra setup on EC2 with a large amount of data

John Z · Jan 27, 2014 · Viewed 13.6k times

I am doing a large migration from physical machines to EC2 instances.

As of right now I have 3 x.large nodes, each with 4 instance store drives (RAID 0, 1.6TB). After I set this up I remembered that "The data on an instance store volume persists only during the life of the associated Amazon EC2 instance; if you stop or terminate an instance, any data on instance store volumes is lost."

What do people usually do in this situation? I am worried that if one of the boxes crashes, all of the data on that box will be lost unless it is 100% replicated on another.

http://www.hulen.com/?p=326 I read in the above link that these guys use ephemeral drives and periodically back up the content using EBS drives and snapshots.

In this question here: How do I take a backup of aws ec2 instance/ephemeral storage? People claim that you cannot back up ephemeral data onto EBS snapshots.

Is my best choice to use a few EBS drives, RAID 0 them together, and take snapshots directly from them? I know this is probably the most expensive solution; however, it seems to make the most sense.

Any info would be great.

Thank you for your time.

Answer

Arya · Jan 27, 2014

I have been running Cassandra on EC2 for over 2 years. To address your concerns, you need to set up a proper availability architecture on EC2 for your Cassandra cluster. Here is a short list of things to consider:

  1. Consider at least 3 zones for setting up your cluster;
  2. Use NetworkTopologyStrategy with Ec2Snitch/Ec2MultiRegionSnitch to propagate a replica of your data to each zone; this means that the machines in each zone, taken together, hold your full data set; for example, the strategy_options would be like {us-east:3}.
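
As a rough illustration of tip 2, here is a minimal Python sketch that builds the CQL3 statement for such a keyspace. The helper name `create_keyspace_cql` and the keyspace name `my_keyspace` are made up for this example; the datacenter name `us-east` is taken from the strategy_options above and assumes the EC2 snitch reports the region under that name.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CQL3 CREATE KEYSPACE statement using NetworkTopologyStrategy.

    keyspace        -- keyspace name (hypothetical, chosen by the caller)
    replicas_per_dc -- dict mapping datacenter name to replica count,
                       e.g. {"us-east": 3}
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # Sort datacenters so the generated statement is deterministic.
    for dc, rf in sorted(replicas_per_dc.items()):
        parts.append("'{}': {}".format(dc, int(rf)))
    return "CREATE KEYSPACE {} WITH replication = {{{}}};".format(
        keyspace, ", ".join(parts)
    )


if __name__ == "__main__":
    # Three replicas in the us-east datacenter, one per zone when the
    # cluster has the same number of nodes in each of 3 zones.
    print(create_keyspace_cql("my_keyspace", {"us-east": 3}))
```

The generated statement can be pasted into cqlsh. Adding a second region later would mean adding another entry to the dict, assuming the cluster runs the multi-region snitch.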

The above two tips should satisfy basic availability in AWS, and if your queries are sent using LOCAL_QUORUM, your application will be fine even if one zone goes down.

If you are concerned about 2 zones going down (don't recall it happened in AWS for the past 2 years of my use), then you can also add another region to your cluster.
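
The zone-failure reasoning above can be checked with a little arithmetic. This is only a sketch: it assumes a quorum is floor(RF / 2) + 1 replicas and that replicas are spread evenly across zones (RF divisible by the zone count); the function names are made up for illustration.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def survives_zone_loss(replication_factor, zones, zones_down):
    """Return True if a quorum is still reachable after losing zones.

    Assumes replicas are placed evenly, so each zone holds
    replication_factor // zones replicas of every row.
    """
    replicas_per_zone = replication_factor // zones
    alive = replication_factor - replicas_per_zone * zones_down
    return alive >= quorum(replication_factor)


if __name__ == "__main__":
    # RF=3 over 3 zones: one zone down leaves 2 replicas, quorum is 2.
    print(survives_zone_loss(3, 3, 1))
    # Two zones down leaves 1 replica, which is below the quorum of 2.
    print(survives_zone_loss(3, 3, 2))
```

With RF=3 in one region, losing two zones breaks quorum in that region, which is why a second region is the fallback for that case.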

With the above, if any node dies for any reason, you can restore it from nodes in other zones. After all, Cassandra was designed to provide you with this kind of availability.

About EBS vs Ephemeral:

I have always been against using EBS volumes in anything production because EBS is one of the worst AWS services in terms of availability. The volumes go down several times a year, and their downtime usually cascades to other AWS services like ELBs and RDS. They are also network-attached storage, so every read and write has to go over the network. Don't use them. Even DataStax doesn't recommend them:

http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/../../cassandra/architecture/architecturePlanningEC2_c.html

About Backups:

I use a solution called Priam (https://github.com/Netflix/Priam), which was written by Netflix. It can take a nightly snapshot of your cluster and copy everything to S3. If you enable incremental_backups, it also uploads incremental backups to S3. In case a node goes down, you can trigger a restore on that specific node using a simple API call. Restoring this way is a lot faster and does not put a lot of streaming load on your other nodes. I also added a patch to it which lets you do fancy things like bringing up multiple DCs inside one AWS region.

You can read about my setup here: http://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes

Hope the above helps.