What is the byte size of common Cassandra data types, for use when calculating partition disk usage?

nicgul · Oct 17, 2016

I am trying to calculate the partition size for each row in a table with an arbitrary number of columns and types, using a formula from the DataStax Academy Data Modeling Course.

In order to do that I need to know the "size in bytes" of some common Cassandra data types. I tried to google this, but I got many conflicting suggestions, so I am puzzled.

The data types I would like to know the byte size of are:

  • A single Cassandra TEXT character (the answers I found range from 2 to 4 bytes)
  • A Cassandra DECIMAL
  • A Cassandra INT (I suppose it is 4 bytes)
  • A Cassandra BIGINT (I suppose it is 8 bytes)
  • A Cassandra BOOLEAN (I suppose it is 1 byte, or is it a single bit?)

Any other considerations regarding data type sizes in Cassandra would of course also be appreciated.

Adding more info, since this seems to have caused confusion: I am only trying to estimate the worst-case disk usage the data would occupy, without any compression or other optimizations done by Cassandra behind the scenes.

I am following the DataStax Academy course DS220 (see link at the end) and implementing its formula; I will use the info from the answers here as variables in that formula.

https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size

Answer

James Fremen · Jan 18, 2017

I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of the worst case using the formulae in the DS220 course up front, at design time. The effect of compression varies with the algorithm and with patterns in the data. From DS220 and http://cassandra.apache.org/doc/latest/cql/types.html:

uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPv6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully.. no source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: an estimate
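
As a rough illustration of how the list above could feed a worst-case estimate, here is a minimal Python sketch. It is not the DS220 formula itself; it only sums the raw value sizes of one row and ignores per-cell metadata, compression and any storage-format overhead. The names `FIXED_SIZES`, `value_size` and `row_size` are made up for this example, and it assumes text/varchar values are stored as UTF-8 (as the CQL types documentation describes), so each character takes 1 to 4 bytes.

```python
# Fixed-width sizes in bytes, copied from the list above.
FIXED_SIZES = {
    "uuid": 16, "timeuuid": 16,
    "timestamp": 8, "bigint": 8, "counter": 8, "double": 8, "time": 8,
    "date": 4, "float": 4, "int": 4,
    "smallint": 2, "tinyint": 1, "boolean": 1,
}


def value_size(cql_type, value=None):
    """Estimate the uncompressed size in bytes of a single column value."""
    if cql_type in FIXED_SIZES:
        return FIXED_SIZES[cql_type]
    if cql_type in ("text", "varchar"):
        # Assumption: UTF-8 encoding, so 1 to 4 bytes per character.
        return len(value.encode("utf-8"))
    if cql_type == "ascii":
        return len(value)  # 1 byte per character
    if cql_type == "inet":
        # Crude check: IPv6 literals contain ':' and IPv4 literals do not.
        return 16 if ":" in value else 4
    # decimal, blob and collections need their own estimate.
    raise ValueError("no size estimate for type: " + cql_type)


def row_size(columns):
    """Sum the estimated value sizes of (cql_type, value) pairs for one row."""
    return sum(value_size(cql_type, value) for cql_type, value in columns)


if __name__ == "__main__":
    row = [("int", 7), ("bigint", 1), ("boolean", True), ("text", "héllo")]
    print(row_size(row))
```

For real data you would feed in average or maximum string lengths per column rather than a single sample value, and then plug the per-column sizes into the course formula.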

hope it helps