Cassandra IN clause on index

CatInABox picture CatInABox · May 26, 2014 · Viewed 11.5k times · Source

Assume a simple table with one insert (or without this insert, doesn't really matter).

CREATE TABLE test (
 x int,
 y int,
 z int,
 PRIMARY KEY (x, y, z)
);


create index z_index on test (z);

insert into test(x, y, z) values (1,2,3);

I struggle to understand why I cannot query with an in clause on the index z:

cqlsh:test> select * from test where z in (3);
Bad Request: PRIMARY KEY part z cannot be restricted (preceding part y is either not restricted or by a non-EQ relation)

It is possible with a simple equals predicate:

cqlsh:test> select * from test where z = 3;

 x | y | z
---+---+---
 1 | 2 | 3

(0 rows)

I thought having an index on z would keep a mapping from specific values of z to rows but this assumption seems wrong.

Why this doesn't work the way I expected? I guess the index works differently.

EDIT: I am using [cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]

Answer

Aaron picture Aaron · May 29, 2014

While the documentation on DataStax is normally really good, I couldn't find anything discussing the details behind this. I did however, run across this article called A deep look at the CQL WHERE clause. The second section titled "WHERE clause restrictions for SELECT statements" has several paragraphs discussing the IN restriction.

To paraphrase, it basically says this:

For single column keys, the IN operator is allowed without restriction. For composite primary keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column.

In your case, x is your partition key, which means that x is the only column that will support the IN operator of CQL. If you really do need to be able to support IN queries on column z, then you would have to de-normalize your data, and build a (redundant) table designed to support that query. For instance:

CREATE TABLE test (
 x int,
 y int,
 z int,
 PRIMARY KEY (z)
);

...would support the query, but the values of z may not be unique. In that case, you could define x and/or y as a LIST<int> and that would do it.

Also, DataStax does have documentation available on when not to use an index, and they state that the same conditions apply to the use of the IN operator.

Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.