Finding distinct values of non Primary Key column in CQL Cassandra

Avi picture Avi · Mar 7, 2016 · Viewed 15.4k times · Source

I use the following code for creating table:

CREATE KEYSPACE mykeyspace
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE mykeyspace;
CREATE TABLE users (
  user_id int PRIMARY KEY,
  fname text,
  lname text
);
INSERT INTO users (user_id,  fname, lname)
  VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id,  fname, lname)
  VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id,  fname, lname)
  VALUES (1746, 'john', 'smith');

I would like to find the distinct value of lname column (that is not a PRIMARY KEY). I would like to get the following result:

 lname
-------
 smith

By using SELECT DISTINCT lname FROM users; However since lname is not a PRIMARY KEY I get the following error:

InvalidRequest: code=2200 [Invalid query] message="SELECT DISTINCT queries must
only request partition key columns and/or static columns (not lname)"
cqlsh:mykeyspace> SELECT DISTINCT lname FROM users;

How can I get the distinct values from lname?

Answer

Aaron picture Aaron · Mar 7, 2016

User - Undefined_variable - makes two good points:

  • In Cassandra, you need to build your data model to match your query patterns. This sometimes means duplicating your data into additional tables, to attain the desired level of query flexibility.
  • DISTINCT only works on partition keys.

So, one way to get this to work, would be to build a specific table to support that query:

CREATE TABLE users_by_lname (
    lname text,
    fname text,
    user_id int,
    PRIMARY KEY (lname, fname, user_id)
);

Now after I run your INSERTs to this new query table, this works:

aploetz@cqlsh:stackoverflow> SELECT DISTINCT lname FROm users_by_lname ;

 lname
-------
 smith
   doe

(2 rows)

Notes: In this table, all rows with the same partition key (lname) will be sorted by fname, as fname is a clustering key. I added user_id as an additional clustering key, just to ensure uniqueness.