Cassandra CQL searching for element in list

kazy picture kazy · Nov 5, 2014 · Viewed 12k times · Source

I have a table that has a column of list type (tags):

CREATE TABLE "Videos" (
    video_id UUID,
    title VARCHAR,
    tags LIST<VARCHAR>,
    PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);

I have plenty of rows containing various values in the tags column, ie. ["outdoor","funny cats","funny mice"].

I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?

Answer

Aaron picture Aaron · Nov 5, 2014

To directly answer your question, yes there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (while adding a definition for upload_timestamp timeuuid) and put some values in it.

aploetz@cqlsh:stackoverflow> SELECT * FROM videos ;

 video_id                             | upload_timestamp                     | tags                                          | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
 2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d |             ['sci-fi', 'action', 'adventure'] |                 Star Wars
 ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d |                               ['documentary'] | The Witches of Whitewater
 15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] |              Pulp Fiction

(3 rows)

Next, I'll create a secondary index on the tags column:

aploetz@cqlsh:stackoverflow> CREATE INDEX ON videos (tags);

Now, if I want to query the videos that contain the tag "action," I can accomplish this with the CONTAINS keyword:

aploetz@cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';

 video_id                             | upload_timestamp                     | tags                                          | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
 2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d |             ['sci-fi', 'action', 'adventure'] |    Star Wars
 15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction

(2 rows)

With this all being said, I should pass along a couple of warnings:

  • Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you are expecting to have to query by tag often, then the right way to solve this would be to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag,video_id)
  • You don't need the double-quotes in your table name. In fact, having it in quotes may cause you problems (ok, maybe minor irritations) down the road.