My understanding that rows are overwritten when another row with identical primary keys is inserted.
For example:
I have columns (user_id int, item_id int, site_id int)
, and my PRIMARY KEY(user_id, item_id)
If I had the following table:
user_id, item_id, site_id
2 3 4
and I insert user_id : 2, item_id : 3, site_id : 10
, my new table would be:
user_id, item_id, site_id
2 3 10
not
user_id, item_id, site_id
2 3 4
2 3 10
Is this simple case hold in all cases? Are any subtleties that I likely not aware off? Also, I could not find this in the docs and came to this conclusion by playing around with cassandra, can anyone provide a doc source?
Yes, this is how Cassandra is designed to operate. In all cases where an UPDATE
or INSERT
is executed, data will be updated (based on the keys) if it exists, and inserted it it does not. An important point to remember, is that under the hood, UPDATE
and INSERT
are synonymous. If you think about those two as being the same, then you can start to understand why it works the way that it does.
That being said, you are correct, in that you do have to look closely to find an explicit reference to this behavior in the documentation. I found the closest references in the docs and listed them below:
From the UPDATE documentation:
The row is created if none existed before, and updated otherwise. Specify the row to update in the WHERE clause by including all columns composing the partition key. ... The UPDATE SET operation is not valid on a primary key field.
From the INSERT documentation:
You do not have to define all columns, except those that make up the key. ... If the column exists, it is updated. The row is created if none exists.
Now while these excerpts may not come right out and say "be careful not to overwrite", I did manage to find an article on Planet Cassandra that was more explicit: How to Do an Upsert in Cassandra
Cassandra is a distributed database that avoids reading before a write, so an INSERT or UPDATE sets the column values you specify regardless of whether the row already exists. This means inserts can update existing rows, and updates can create new rows. It also means it’s easy to accidentally overwrite existing data, so keep that in mind.