Cassandra: get all records in a time range

Faber · Sep 9, 2013 · Viewed 70.5k times

I have to work with a column family that has (user_id, timestamp) as key. In my query I would like to fetch all records in a given time range independent of the user_id. This is the exact table schema:

CREATE TABLE userlog (
  user_id text,
  ts timestamp,
  action text,
  app_type text,
  channel_name text,
  channel_session_id text,
  pid text,
  region_id text,
  PRIMARY KEY (user_id, ts)
);

I tried to run

SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200' ALLOW FILTERING;

which works fine on my local Cassandra installation, which holds only a small data set, but fails with

Request did not complete within rpc_timeout.

on the production system containing all the data.

Is there a query, preferably in CQL, that runs smoothly against this column family, or do we have to change the design?

Answer

Richard · Sep 9, 2013

The request times out because Cassandra takes longer than rpc_timeout (10 seconds by default) to return the data. For your query, Cassandra attempts to fetch the entire result set before returning anything. With more than a few records this can easily exceed the timeout.

For queries that produce a lot of data you need to page, e.g.

SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200' AND token(user_id) > previous_token LIMIT 100 ALLOW FILTERING;

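where previous_token is the token of the last user_id returned by the previous page. Because LIMIT can cut off a page in the middle of a user's rows, you will also need to page on ts to guarantee you get all the records for the last user_id returned.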
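The two levels of paging described above can be sketched as a small loop. This is only an illustration, not a definitive implementation: `Row`, `iter_rows`, `fetch_range` and `fetch_partition` are hypothetical names, and the two callbacks are assumed to run the CQL shown in the comments (including the fixed ts bounds) through whatever client you use, returning rows in the order Cassandra delivers them. It also assumes no two user_ids share a token.

```python
from typing import Callable, Iterator, List, NamedTuple, Optional


class Row(NamedTuple):
    token: int    # token(user_id), selected alongside the row
    user_id: str
    ts: int       # timestamp; any orderable type would do


# fetch_range(after_token, limit) is assumed to run something like:
#   SELECT token(user_id), user_id, ts, ... FROM userlog
#   WHERE ts >= <start> AND ts <= <end> AND token(user_id) > <after_token>
#   LIMIT <limit> ALLOW FILTERING;
# (with the token clause omitted when after_token is None)
FetchRange = Callable[[Optional[int], int], List[Row]]

# fetch_partition(user_id, after_ts, limit) is assumed to run something like:
#   SELECT token(user_id), user_id, ts, ... FROM userlog
#   WHERE user_id = <user_id> AND ts > <after_ts> AND ts <= <end>
#   LIMIT <limit>;
FetchPartition = Callable[[str, int, int], List[Row]]


def iter_rows(fetch_range: FetchRange,
              fetch_partition: FetchPartition,
              page_size: int = 100) -> Iterator[Row]:
    """Yield every row in the time range, one bounded page at a time."""
    after_token: Optional[int] = None
    while True:
        page = fetch_range(after_token, page_size)
        yield from page
        if len(page) < page_size:
            return  # short page: the token range is exhausted
        last = page[-1]
        # A full page may have cut the last user's rows short,
        # so drain the rest of that partition by paging on ts.
        after_ts = last.ts
        while True:
            rest = fetch_partition(last.user_id, after_ts, page_size)
            yield from rest
            if len(rest) < page_size:
                break
            after_ts = rest[-1].ts
        # Continue with partitions whose token is strictly greater.
        after_token = last.token
```

Each individual request is bounded by page_size, so no single query has to return the whole data set before the timeout.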

Alternatively, in Cassandra 2.0.0 (just released), paging is done transparently so your original query should work with no timeout or manual paging.

ALLOW FILTERING means Cassandra reads through all your data but returns only the rows within the specified range. This is efficient only if the range covers most of the data. If you wanted to find records within, e.g., a 5-minute time window, this would be very inefficient.