Using Cassandra for time series data

Woojun Kim · May 7, 2015 · Viewed 14.3k times

I'm researching how to store logs in Cassandra.
The schema for the logs would be something like this.

EDIT: I've changed the schema in order to make some clarification.

CREATE TABLE log_date (
  userid bigint,
  time timeuuid,
  reason text,
  item text,
  price int,
  count int,
  PRIMARY KEY ((userid), time)                              -- option #1
  PRIMARY KEY ((userid), time, reason, item, price, count)  -- option #2
);

A new table will be created every day, so each table contains logs for only one day.

My querying condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price, and count will not be used as hints or conditions in queries at all.
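
For reference, under the table-per-day design that query would presumably look something like this (the table name log_20150506 is hypothetical, standing in for whichever day's table is being read):

SELECT * FROM log_20150506 WHERE userid = 1234;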

My question is which PRIMARY KEY design suits this better.
EDIT: And the key here is that I want to store the logs in a schematic way.

If I choose #1, many columns would be created per log, and the likelihood of having even more values per log is very high. The schema above is just an example; a log can contain values like subreason, friendid, and so on.

If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any useful information about the overhead of composite columns.

Which one should I choose? Please help.

Answer

jbarrueta · May 7, 2015

My advice is that neither of your two options is ideal for your time series; the fact that you're creating a table per day doesn't seem optimal either.

Instead, I'd recommend creating a single table, partitioned by userid and date, with a timeuuid as the clustering column for the event. An example would look like this:

CREATE TABLE log_per_day (
    userid bigint,
    date text,
    time timeuuid,
    value text,
    PRIMARY KEY ((userid, date), time)
);

This puts all of a user's events for a day in a single row and lets you run your query per day per user.

Declaring time as the clustering column gives you a wide row, so you can insert as many events as you need in a day.

So the row key is a composite of userid plus the date as text, e.g.:

insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID1, 'my value');

insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID2, 'my value2');

The two inserts above will land in the same row, so you can read both back with a single query.
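
To sketch the read side, assuming the table and sample data above, the per-user, per-day query hits exactly one partition; and since time is a timeuuid clustering column, you can also slice within the day using the built-in minTimeuuid/maxTimeuuid functions:

-- all of a user's events for one day (a single-partition read)
SELECT * FROM log_per_day
WHERE userid = 1000 AND date = '2015-05-06';

-- only events in a time window within that day
SELECT * FROM log_per_day
WHERE userid = 1000 AND date = '2015-05-06'
  AND time >= minTimeuuid('2015-05-06 00:00')
  AND time <  maxTimeuuid('2015-05-06 12:00');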

Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.

Hope it helps,

José Luis