Using HBase to store time series data

gurrie · Nov 8, 2010 · Viewed 17.5k times

We are trying to use HBase to store time-series data. Our current model stores each time series as versions within a single cell. This means a cell could end up holding millions of versions, and queries on the time series would retrieve a range of versions using the setTimeRange method available on the Get class in HBase.

e.g.

{
    "row1" : {
        "columnFamily1" : {
            "column1" : {
                1 : "1",
                2 : "2"
            },
            "column2" : {
                1 : "1"
            }
        }
    }
}

Is this a reasonable model to store time-series data in HBase?

Or would an alternative model be more suitable: storing the data points across multiple columns (is it possible to query across columns?) or across multiple rows?

Answer

Donald Miner · Apr 26, 2012

I don't think you should use versioning to store the time series here. Not because it won't work, but because versioning isn't designed for that use case, and there are other ways to model the data.


I suggest using the time step as the column qualifier and the data itself as the value. Something like:

{
    "row1" : {
        "columnFamily1" : {
            "col1-000001" : "1",
            "col1-000002" : "2",
            "col1-000003" : "91",
            "col2-000001" : "31"
        }
    }
}

One nice thing here is that HBase stores column qualifiers in sorted (byte-lexicographic) order, so when you read the time series back the items come out in time order, provided the time step is zero-padded as in the example above.
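To illustrate why the zero-padding in the qualifiers matters, here is a minimal sketch in plain Python. It does not talk to HBase; it only models the byte-wise lexicographic ordering that HBase applies to qualifiers. The `qualifier` helper and the padding width of 6 are assumptions made for this example, chosen to match the `col1-000001` names above.

```python
def qualifier(column: str, step: int, width: int = 6) -> bytes:
    """Build a qualifier such as b'col1-000001' (hypothetical helper)."""
    return f"{column}-{step:0{width}d}".encode("ascii")

# Cells written in arbitrary order, keyed by zero-padded qualifier.
cells = {qualifier("col1", step): str(value).encode("ascii")
         for step, value in [(10, 7), (2, 2), (1, 1)]}

# Python compares bytes lexicographically, which models HBase's ordering.
ordered = sorted(cells)

# Without padding, step 10 sorts before step 2.
unpadded = sorted(f"col1-{step}".encode("ascii") for step in (10, 2, 1))

print(ordered)
print(unpadded)
```

With padding, the sorted order matches the numeric order of the steps. Without it, `col1-10` lands between `col1-1` and `col1-2`, so the series would read back out of order.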


Another realistic option is to make the series identifier the first part of the row key and append the time step to it. Something like:

{
    "fooseries-00001" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "fooseries-00002" : {
        "columnFamily1" : {
            "val" : "2"
        }
    }
}

This has the nice feature that range scans within a particular series are easy. For example, pulling out fooseries's steps 104 to 199 is trivial to implement and efficient, because rows are stored sorted by key and those steps sit next to each other.
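The range scan can be sketched in plain Python as below. This is a model of the idea, not HBase client code: the `scan` function is a hypothetical stand-in that mimics the usual scan convention of an inclusive start row and an exclusive stop row, and `row_key` is an assumed helper that pads the step to 5 digits like the keys above.

```python
from bisect import bisect_left

def row_key(series: str, step: int, width: int = 5) -> bytes:
    """Build a row key such as b'fooseries-00104' (hypothetical helper)."""
    return f"{series}-{step:0{width}d}".encode("ascii")

def scan(table: dict, start_row: bytes, stop_row: bytes):
    """Yield (key, value) for start_row <= key < stop_row, in key order."""
    keys = sorted(table)                 # models HBase's sorted row storage
    lo = bisect_left(keys, start_row)    # first key >= start_row
    hi = bisect_left(keys, stop_row)     # first key >= stop_row (excluded)
    for key in keys[lo:hi]:
        yield key, table[key]

# Two series interleaved in one table; the values are made up.
table = {row_key("fooseries", s): s * 10 for s in range(1, 301)}
table.update({row_key("barseries", s): -s for s in range(1, 301)})

# Steps 104 to 199 inclusive: the stop row is exclusive, so use step 200.
rows = list(scan(table, row_key("fooseries", 104), row_key("fooseries", 200)))
print(len(rows), rows[0], rows[-1])
```

Because the series name is the key prefix, the scan never touches rows of `barseries`. In the HBase Java client the equivalent would be a `Scan` bounded by a start row and a stop row.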

The downside of this layout is that deleting an entire series requires a bit more management and synchronization, since the series is spread across many rows. Another downside is that MapReduce analytics will have a harder time analyzing this data. With the first approach (one row per series), the entire time series is passed to a single map() call, while here map() is called once for each time step.
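The difference in map() granularity can be shown with a small plain-Python model. Nothing here uses Hadoop or HBase; `run_map` is a hypothetical driver that simply calls a map function once per row, which is the behavior described above, and the sum is an arbitrary stand-in for a per-series analysis.

```python
from collections import defaultdict

def run_map(table, map_fn):
    """Call map_fn once per row; return (number of calls, emitted pairs)."""
    emitted = []
    for key in sorted(table):
        emitted.extend(map_fn(key, table[key]))
    return len(table), emitted

# First layout: one row per series, one column per time step.
wide = {b"fooseries": {b"col1-000001": 1, b"col1-000002": 2, b"col1-000003": 91}}

# Second layout: one row per time step.
tall = {
    b"fooseries-00001": {b"val": 1},
    b"fooseries-00002": {b"val": 2},
    b"fooseries-00003": {b"val": 91},
}

def wide_map(key, columns):
    # The whole series is in hand, so the analysis can finish in the mapper.
    yield key, sum(columns.values())

def tall_map(key, columns):
    # Only one step is visible; emit it under the series id for a reducer.
    yield key.rsplit(b"-", 1)[0], columns[b"val"]

def reduce_sum(pairs):
    """Group emitted pairs by series id and sum the values."""
    totals = defaultdict(int)
    for series, value in pairs:
        totals[series] += value
    return dict(totals)

wide_calls, wide_out = run_map(wide, wide_map)
tall_calls, tall_out = run_map(tall, tall_map)
tall_totals = reduce_sum(tall_out)
print(wide_calls, wide_out)
print(tall_calls, tall_totals)
```

Both layouts reach the same total, but the row-per-step layout needs a grouping (reduce) stage to reassemble the series, whereas the row-per-series layout can do the work in a single map() call.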