What are the differences between a node, a cluster, and a datacenter in a Cassandra NoSQL database?

enjazweb · Jan 28, 2015 · Viewed 65.4k times

I am trying to duplicate data in a Cassandra NoSQL database for a school project using DataStax OpsCenter. From what I have read, there are three key terms: cluster, node, and datacenter. From what I understand, the data in a node can be duplicated in another node that exists in another cluster, and all the nodes that contain the same (duplicated) data compose a datacenter. Is that right?

If it is not, what is the difference?

Answer

Akbar Ahmed · Feb 12, 2015

The hierarchy of elements in Cassandra is:

  • Cluster
    • Data center(s)
      • Rack(s)
        • Server(s)
          • Node (more accurately, a vnode)

A Cluster is a collection of Data Centers.

A Data Center is a collection of Racks.

A Rack is a collection of Servers.

A Server contains 256 virtual nodes (or vnodes) by default.

A vnode is the data storage layer within a server.

Note: A server is the Cassandra software. A server is installed on a machine, where a machine is either a physical server, an EC2 instance, or similar.
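The hierarchy above can be sketched as nested containers. This is only an illustrative model in Python, not Cassandra code: the class names, the cluster name "demo", the rack name "rack1", and the server names are all hypothetical, and the 256 figure is the default quoted in this answer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    num_vnodes: int = 256  # default quoted in the answer (assumption for this sketch)


@dataclass
class Rack:
    name: str
    servers: List[Server] = field(default_factory=list)


@dataclass
class DataCenter:
    name: str
    racks: List[Rack] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    data_centers: List[DataCenter] = field(default_factory=list)

    def total_vnodes(self) -> int:
        # Walk cluster -> data center -> rack -> server and add up the vnodes.
        return sum(
            server.num_vnodes
            for dc in self.data_centers
            for rack in dc.racks
            for server in rack.servers
        )


# Hypothetical layout: one cluster, two data centers, one rack each.
cluster = Cluster("demo", [
    DataCenter("dc-sf", [Rack("rack1", [Server("sf-1"), Server("sf-2"), Server("sf-3")])]),
    DataCenter("dc-ny", [Rack("rack1", [Server("ny-1"), Server("ny-2")])]),
])
print(cluster.total_vnodes())
```

The point of the model is simply that every level contains the one below it, and that all of it sits inside a single cluster.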

Now to specifically address your questions.

An individual unit of data is called a partition. And yes, partitions are replicated across multiple nodes, but those nodes all belong to the same cluster, not to different clusters. Each copy of the partition is called a replica.

In a multi-data center cluster, replication is configured per data center. For example, if you have a data center in San Francisco named dc-sf and another in New York named dc-ny, you can set the number of replicas for each one independently.

As an example, you could set dc-sf to have 3 replicas and dc-ny to have 2 replicas.

Those numbers are called the replication factor. You would specifically say dc-sf has a replication factor of 3, and dc-ny has a replication factor of 2. In simple terms, dc-sf would have 3 copies of the data spread across three vnodes, while dc-ny would have 2 copies of the data spread across two vnodes.
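In practice, per-data-center replication factors are set on the keyspace with NetworkTopologyStrategy. The following is a minimal sketch in Python that only builds the CQL statement as a string; the helper name create_keyspace_cql and the keyspace name school_project are made up for illustration, and the data center names must match whatever names your cluster actually reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replication factor per data center.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    for dc, rf in replicas_per_dc.items():
        if not isinstance(rf, int) or rf < 1:
            raise ValueError(f"replication factor for {dc!r} must be a positive integer")

    options = {"class": "NetworkTopologyStrategy", **replicas_per_dc}
    # repr() gives 'single-quoted' strings and bare integers, which is what CQL map literals use.
    body = ", ".join(f"'{key}': {value!r}" for key, value in options.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


statement = create_keyspace_cql("school_project", {"dc-sf": 3, "dc-ny": 2})
print(statement)
```

The resulting statement could then be run in cqlsh or through a driver. Every table in that keyspace would keep 3 replicas of each partition in dc-sf and 2 in dc-ny.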

While each server has 256 vnodes by default, Cassandra places the replicas of a partition on vnodes that belong to different physical servers, so losing one server does not take out more than one replica.
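To make the "different physical servers" idea concrete, here is a toy token ring in Python. It is a simplified sketch and not Cassandra's actual placement code: MD5 stands in for the partitioner's hash function, the server names and partition key are hypothetical, and racks and data centers are ignored. It hashes the partition key onto the ring, then walks clockwise and skips any vnode whose server has already been chosen.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    # Stand-in hash for this sketch; a real cluster uses its configured partitioner.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def build_ring(servers, vnodes_per_server=256):
    # Each server owns many vnodes, i.e. many positions scattered around the ring.
    return sorted(
        (token(f"{server}#{i}"), server)
        for server in servers
        for i in range(vnodes_per_server)
    )


def pick_replicas(ring, partition_key, replication_factor):
    tokens = [t for t, _ in ring]
    # First vnode clockwise from the partition's token (wrapping around the ring).
    start = bisect_right(tokens, token(partition_key)) % len(ring)
    chosen = []
    for offset in range(len(ring)):
        server = ring[(start + offset) % len(ring)][1]
        if server not in chosen:  # skip vnodes on servers that already hold a replica
            chosen.append(server)
        if len(chosen) == replication_factor:
            break
    return chosen


ring = build_ring(["sf-1", "sf-2", "sf-3", "sf-4"])
replicas = pick_replicas(ring, "user:42", replication_factor=3)
print(replicas)
```

With four servers and a replication factor of 3, the function always returns three distinct servers, even though neighbouring vnodes on the ring may belong to the same machine.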

To summarize:

  • Data is replicated across multiple virtual nodes (each server contains 256 vnodes by default)
  • Each copy of the data is called a replica
  • The unit of data is called a partition
  • Replication is controlled per data center