SELECT DISTINCT cql ignores WHERE clause

Diplow picture Diplow · Oct 24, 2014 · Viewed 43.1k times · Source

Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?

Cqlsh version:

Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]

Table considered:

DESCRIBE TABLE events;

CREATE TABLE events (
  userid uuid,
  "timestamp" timestamp,
  event_type text,
  data text,
  PRIMARY KEY (userid, "timestamp", event_type)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

Table content:

SELECT * FROM events;

 userid                               | timestamp                | event_type | data
--------------------------------------+--------------------------+------------+------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 |       toto | null

(6 rows)

Request1: Request without DISTINCT

SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(3 rows)

Request2: Same request with DISTINCT

SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(2 rows)

EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)

I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.

Answer

Aaron picture Aaron · Oct 24, 2014

It happens that way because in Cassandra CQL DISTINCT is designed to return only the partition (row) keys of your table (column family)...which must be unique. Therefore, the WHERE clause can only operate on partition keys when used with DISTINCT (which in your case, isn't terribly useful). If you take the DISTINCT out, WHERE can then be used to evaluate the clustering (column) keys within each partition key (albeit, with ALLOW FILTERING).

I feel compelled to mention that ALLOW FILTERING is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids after a certain timestamp) then I would suggest partitioning your data by event_type instead:

PRIMARY KEY (event_type, "timestamp", userid)

Then you'll be able to run this query without ALLOW FILTERING.

SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'

Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on timeseries data modeling for more ideas on how to model for this problem.