Cassandra NOT EQUAL Operator

Babu James picture Babu James · Feb 21, 2014 · Viewed 14.9k times · Source

Question to all Cassandra experts out there.

I have a column family with about a million records.

I would like to query these records in such a way that I should be able to perform a Not-Equal-To kind of operation.

I Googled on this and it seems I have to use some sort of Map-Reduce.

Can somebody tell me what are the options available in this regard.

Answer

Daniel S. picture Daniel S. · Feb 21, 2014

I can suggest a few approaches.

1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as a boolean columns (i.e.: column isEqualToUnitedStates with true or false).

2) Otherwise, consider emulating the unsupported query != X by combining results of two separate queries, < X and > X on the client-side.

3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.

So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query could look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'; Now of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and filter item != 'Widget' in the client application.

4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would returning too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that would scan all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.