What NoSQL DB to use for sparse Time Series like data?

angerman · Apr 15, 2012

I'm planning a side project that deals with time-series-like data. I'd like to give one of those shiny new NoSQL DBs a try and am looking for a recommendation.

For a (growing) set of symbols I will have a list of (time, value) tuples that grows over time. On any given update, only some symbols will receive a new tuple, and completely new symbols may be added.

The database should therefore allow:

  • Add Symbols with initial one-element (tuple) list. E.g. A: [(2012-04-14 10:23, 50)]
  • Update Symbols with a new tuple. (Append that tuple to the list of that symbol).
  • Read the data for a given symbol. (Ideally even let me specify the time frame for which the data should be returned)

The create and update operations should ideally be atomic. Being able to read multiple symbols at once would be a nice extra.

Performance is not critical. Updates/Creates will happen roughly once every few hours.

Answer

yamen · Apr 15, 2012

I believe literally all the major NoSQL databases will support that requirement, especially if you don't actually have a large volume of data (which raises the question: why NoSQL?).

That said, I recently had to design and work with a NoSQL database for time series data, so I can give some input on that design, which can then be extrapolated to the others.

Our chosen database was Cassandra, and our design was as follows:

  • A single keyspace for all 'symbols'
  • Each symbol was a new row
  • Each time entry was a new column for that relevant row
  • Each value (can be more than a single value) was the value part of the time entry

This lets you achieve everything you asked for, most notably reading the data for a single symbol, optionally restricted to a time range (column range calls). Although you said performance wasn't critical, it was for us, and this design performed well: all data for any single symbol is by definition sorted (column name sort) and always stored on the same node (no cross-node communication for simple queries). Finally, this design translates well to other NoSQL databases that have dynamic columns.
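To make the row-per-symbol, column-per-timestamp layout concrete, here is a minimal in-memory sketch in Python. It is not Cassandra client code: `WideRowStore` and its methods are hypothetical names invented for illustration, and the only thing it tries to mimic is that each row keeps its entries sorted by time, which is what makes range reads cheap.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


class WideRowStore:
    """Toy stand-in for one keyspace: one row per symbol (hypothetical class)."""

    def __init__(self):
        # symbol -> (sorted timestamps, values in the same order),
        # mimicking columns kept sorted by column name within a row.
        self._rows = {}

    def append(self, symbol, ts, value):
        """Create the row if it is missing, then add one time entry."""
        times, values = self._rows.setdefault(symbol, ([], []))
        i = bisect_left(times, ts)
        if i < len(times) and times[i] == ts:
            # Assumption: writing an existing column name overwrites its value.
            values[i] = value
        else:
            times.insert(i, ts)
            values.insert(i, value)

    def read(self, symbol, start=None, end=None):
        """Return (time, value) pairs with start <= time <= end (both optional)."""
        times, values = self._rows.get(symbol, ([], []))
        lo = 0 if start is None else bisect_left(times, start)
        hi = len(times) if end is None else bisect_right(times, end)
        return list(zip(times[lo:hi], values[lo:hi]))

    def read_many(self, symbols, start=None, end=None):
        """Read several symbols at once, one range slice per row."""
        return {s: self.read(s, start, end) for s in symbols}


store = WideRowStore()
store.append("A", datetime(2012, 4, 14, 10, 23), 50)
store.append("A", datetime(2012, 4, 14, 16, 5), 52)
store.append("B", datetime(2012, 4, 14, 12, 0), 7)
print(store.read("A", start=datetime(2012, 4, 14, 12, 0)))
```

In a real store the sorted order and the single-row locality come from the database itself; the sketch only shows why a time-range read reduces to a slice of one row.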

Further to this, here's some information on using MongoDB (and capped collections if necessary) for a time series store: MongoDB as a Time Series Database

Finally, here's a discussion of SQL vs NoSQL for time series: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql

I can add to that discussion the following:

  • The learning curve for NoSQL will be higher. You don't get the added flexibility and functionality for free in terms of 'soft costs'. Who will be supporting this database operationally?
  • If you expect this functionality to grow in future (either more fields added to each time entry, or much larger capacity in terms of the number of symbols or the size of each symbol's time series), then definitely go with NoSQL. The flexibility benefit is huge, and the scalability you get (with the above design) on both the 'per symbol' and 'number of symbols' basis is almost unbounded (I say almost unbounded: maximum columns per row is in the billions, and maximum rows per keyspace is unbounded, I believe).
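As a rough sanity check on "almost unbounded", the per-row limit can be compared with the update rate from the question. Both numbers below are assumptions: 2 billion stands in for "in the billions", and one update every 3 hours stands in for "once every few hours".

```python
# Assumed limit: about 2 billion columns per row ("in the billions" above).
MAX_COLUMNS_PER_ROW = 2_000_000_000
# Assumed rate: one new time entry per symbol every 3 hours.
UPDATES_PER_DAY = 24 / 3

days_to_fill = MAX_COLUMNS_PER_ROW / UPDATES_PER_DAY
years_to_fill = days_to_fill / 365.25
print(f"{years_to_fill:,.0f} years to fill one row")
```

Under these assumptions a single symbol would take hundreds of thousands of years to reach the column limit, so for this workload the per-symbol limit is not a practical concern.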