I'm investigating the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cassandra.
Some people describe a column family as a collection of rows, where each row contains columns [1], [2]. An example of this model (column families are uppercased):
{
  "USER":
  {
    "codinghorror": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "jonskeet": { "name": "Jon Skeet", "email": "[email protected]" }
  },
  "BOOKMARK":
  {
    "codinghorror":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    },
    "jonskeet":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}
Other sites describe a column family as a group of related columns within a row [3], [4]. Data from the previous example, modeled in this fashion:
{
  "codinghorror":
  {
    "USER": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "BOOKMARK":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    }
  },
  "jonskeet":
  {
    "USER": { "name": "Jon Skeet", "email": "[email protected]" },
    "BOOKMARK":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}
A possible rationale behind the first model is that not all column families are related the way USER and BOOKMARK are. This implies that not all column families contain identical keys. Placing the column families at the outer level feels more natural from this point of view.
The name 'column family' implies a group of columns. This is exactly how column families are presented in the second model.
Both models are valid representations of the data. I realize that these representations are solely for communicating the data to humans; applications don't 'think' of data in such a way.
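To make the "both are valid representations" point concrete, here is a minimal Python sketch that converts one nesting into the other. The function name `swap_outer_levels` and the abbreviated sample data are my own, purely for illustration; nothing here is an API of Bigtable, HBase or Cassandra.

```python
from collections import defaultdict

# Model 1: column family -> row key -> columns (abbreviated sample data)
by_family = {
    "USER": {
        "codinghorror": {"name": "Jeff", "blog": "http://codinghorror.com/"},
        "jonskeet": {"name": "Jon Skeet"},
    },
    "BOOKMARK": {
        "codinghorror": {"http://codinghorror.com/": "My awesome blog"},
        "jonskeet": {"http://manning.com/skeet2/": "C# in Depth, Second Edition"},
    },
}


def swap_outer_levels(nested):
    """Swap the two outer levels of a dict of dicts of dicts."""
    swapped = defaultdict(dict)
    for outer_key, inner in nested.items():
        for inner_key, columns in inner.items():
            # Copy the innermost dict so the result shares no state with the input
            swapped[inner_key][outer_key] = dict(columns)
    return dict(swapped)


# Model 2: row key -> column family -> columns
by_row = swap_outer_levels(by_family)

# Applying the same swap again gives back model 1
round_trip = swap_outer_levels(by_row)
```

For this data the swap is lossless in both directions. One caveat of the sketch: a column family with no rows would disappear in the row-first form, which is one way to see why the first model can feel more natural when column families do not share keys.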
What is the 'standard' definition of a column family? Is it a collection of rows, or a group of related columns within a row?
I have to write a paper on the subject, so I'm also interested in how people usually explain the 'column family' concept to other people. The two models seem to contradict each other. I'd like to use the 'correct' or generally accepted model to describe column-family stores.
I have settled on the second model for explaining the data model in my paper. I'm still interested in how you explain the data model of column-family stores to other people.
The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, which can contain any columns, in a sparse fashion (so each row can have a different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).
A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.
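As a rough illustration of those two properties (sparse rows, and row keys that are unique only within one column family), here is a toy in-memory sketch in Python. The `ColumnFamily` class and its `insert`/`get` methods are hypothetical names made up for this example; this is not the Cassandra client API and says nothing about how Cassandra stores data.

```python
class ColumnFamily:
    """Toy in-memory column family: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, columns):
        # A row key maps to exactly one row per column family, so writing
        # to an existing key merges columns instead of adding a second row.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Return a copy; an unknown key simply has no columns.
        return dict(self.rows.get(row_key, {}))


user = ColumnFamily("USER")
bookmark = ColumnFamily("BOOKMARK")

# Two writes to the same key end up in one row
user.insert("codinghorror", {"name": "Jeff"})
user.insert("codinghorror", {"blog": "http://codinghorror.com/"})

# Sparse: this row has a different set of column names
user.insert("jonskeet", {"name": "Jon Skeet"})

# The same row key can be reused in another column family
bookmark.insert("codinghorror", {"http://unicorns.com/": "Weaponized ponies"})
```

Here `user.get("codinghorror")` and `bookmark.get("codinghorror")` return unrelated sets of columns, even though the key is the same.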
In Cassandra this matters because the data in a particular column family is stored in the same files on disk - so it is more efficient to place data items that are likely to be retrieved together in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition - one might consider all the data about a particular key to be a "row", but partitioned by ColumnFamily. However, in Cassandra it is not really a single row, because the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies for the same row key.
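The "row partitioned by column family" view can be sketched the same way. In this Python example each column family is a plain dict acting as its own store, and `logical_row` is a hypothetical helper that stitches together the pieces stored under one key. It is only a mental model under those assumptions, not a description of Cassandra's storage engine.

```python
# Each column family is its own store: row key -> {column name: value}
user = {"codinghorror": {"name": "Jeff"}}
bookmark = {"codinghorror": {"http://unicorns.com/": "Weaponized ponies"}}
families = {"USER": user, "BOOKMARK": bookmark}


def logical_row(row_key, families):
    """Collect the pieces stored under one row key, grouped by column family."""
    return {
        family_name: dict(rows[row_key])
        for family_name, rows in families.items()
        if row_key in rows
    }


before = logical_row("codinghorror", families)

# This write touches only the BOOKMARK store; USER is not involved at all
bookmark["codinghorror"]["http://codinghorror.com/"] = "My awesome blog"

after = logical_row("codinghorror", families)
```

The combined "row" exists only as a view assembled at read time, which is why the second model works well for explaining the data to people while the first is closer to how the answer above describes Cassandra.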