Cassandra Error - Clustering column cannot be restricted (preceding column is restricted by a non-EQ relation)

Sabarish Sathasivan picture Sabarish Sathasivan · Dec 22, 2015 · Viewed 12.9k times · Source

We are using Cassandra as the data historian for our fleet management solution. We have a table in Cassandra , which stores the details of journey made by the vehicle. The table structure is as given below

CREATE TABLE  journeydetails(
bucketid text,
vehicleid text,
starttime timestamp,
stoptime timestamp,
travelduration bigint,
PRIMARY KEY (bucketid,vehicleid,starttime,travelduration)
);

Where:

  1. bucketid :- partition key which is a combination of month and year
  2. vehicleid : -unique id of the vehicle
  3. starttime :- start time of the journey
  4. endtime :- endtime of the journey
  5. travelduration:- duration of travel in milliseconds

We would like to run the following query - get all the travels of a vehicle - 1234567 between 2015-12-1 and 2015-12-3 whose travel duration is greater than 30 minutes

When I run this query:

select * from  journeydetails where bucketid in('2015-12') and vehicleid in('1234567') 
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59' 
  and travelduration > 1800000;

I get this result:

InvalidRequest: code=2200 [Invalid query] message="Clustering column "travelduration"
cannot be restricted (preceding column "starttime" is restricted by a non-EQ relation)

Does anyone have a recommendation on how to fix this issue?

Answer

Aaron picture Aaron · Dec 22, 2015
select * from journeydetails where bucketid in('2015-12') and vehicleid in('1234567')
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59' 
  and travelduration > 1800000;

That's not going to work. The reason goes back to how Cassandra stores data on-disk. The idea with Cassandra is that it is very efficient at returning a single row with a precise key, or at returning a continuous range of rows from the disk.

Your rows are partitioned by bucketid, and then sorted on disk by vehicleid, starttime, and travelduration. Because you are already executing a range query (non-EQ relation) on starttime, you cannot restrict the key that follows. This is because the travelduration restriction may disqualify some of the rows in your range condition. This would result in an inefficient, non-continuous read. Cassandra is designed to protect you from writing queries (such as this), which may have unpredictable performance.

Here are two alternatives:

1- If you could restrict all of your key columns prior to travelduration (with an equals relation), then you could apply a your greater-than condition:

select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
  and starttime='2015-12-1 00:00:00' and travelduration > 1800000;

Of course, restricting on an exact starttime may not be terribly useful.

2- Another approach would be to omit travelduration altogether, and then your original query would work.

select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
  and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59';

Unfortunately, Cassandra does not offer a large degree of query flexibility. Many people have found success using a solution like Spark (alongside Cassandra) to achieve this level of reporting.

And just a side note, but don't use IN unless you have to. Querying with IN is similar to using a secondary index, in that Cassandra has to talk to several nodes to satisfy your query. Calling it with a single item probably isn't too big of a deal. But IN is one of those old RDBMS habits that you should really break before getting too deep into Cassandra.