Please help me understand the use-case behind SELECT ... FOR UPDATE
.
Question 1: Is the following a good example of when SELECT ... FOR UPDATE
should be used?
Given:
The application wants to list all rooms and their tags, but needs to differentiate between rooms with no tags versus rooms that have been removed. If SELECT ... FOR UPDATE is not used, what could happen is:
[id = 1]
[id = 1, name = 'cats']
[room_id = 1, tag_id = 1]
SELECT id FROM rooms;
returns [id = 1]
DELETE FROM room_tags WHERE room_id = 1;
DELETE FROM rooms WHERE id = 1;
SELECT tags.name FROM room_tags, tags WHERE room_tags.tag_id = 1 AND tags.id = room_tags.tag_id;
Now Thread 1 thinks that room 1 has no tags, but in reality the room has been removed. To solve this problem, Thread 1 should SELECT id FROM rooms FOR UPDATE
, thereby preventing Thread 2 from deleting from rooms
until Thread 1 is done. Is that correct?
Question 2: When should one use SERIALIZABLE
transaction isolation versus READ_COMMITTED
with SELECT ... FOR UPDATE
?
Answers are expected to be portable (not database-specific). If that's not possible, please explain why.
The only portable way to achieve consistency between rooms and tags and making sure rooms are never returned after they had been deleted is locking them with SELECT FOR UPDATE
.
However in some systems locking is a side effect of concurrency control, and you achieve the same results without specifying FOR UPDATE
explicitly.
To solve this problem, Thread 1 should
SELECT id FROM rooms FOR UPDATE
, thereby preventing Thread 2 from deleting fromrooms
until Thread 1 is done. Is that correct?
This depends on the concurrency control your database system is using.
MyISAM
in MySQL
(and several other old systems) does lock the whole table for the duration of a query.
In SQL Server
, SELECT
queries place shared locks on the records / pages / tables they have examined, while DML
queries place update locks (which later get promoted to exclusive or demoted to shared locks). Exclusive locks are incompatible with shared locks, so either SELECT
or DELETE
query will lock until another session commits.
In databases which use MVCC
(like Oracle
, PostgreSQL
, MySQL
with InnoDB
), a DML
query creates a copy of the record (in one or another way) and generally readers do not block writers and vice versa. For these databases, a SELECT FOR UPDATE
would come handy: it would lock either SELECT
or the DELETE
query until another session commits, just as SQL Server
does.
When should one use
REPEATABLE_READ
transaction isolation versusREAD_COMMITTED
withSELECT ... FOR UPDATE
?
Generally, REPEATABLE READ
does not forbid phantom rows (rows that appeared or disappeared in another transaction, rather than being modified)
In Oracle
and earlier PostgreSQL
versions, REPEATABLE READ
is actually a synonym for SERIALIZABLE
. Basically, this means that the transaction does not see changes made after it has started. So in this setup, the last Thread 1
query will return the room as if it has never been deleted (which may or may not be what you wanted). If you don't want to show the rooms after they have been deleted, you should lock the rows with SELECT FOR UPDATE
In InnoDB
, REPEATABLE READ
and SERIALIZABLE
are different things: readers in SERIALIZABLE
mode set next-key locks on the records they evaluate, effectively preventing the concurrent DML
on them. So you don't need a SELECT FOR UPDATE
in serializable mode, but do need them in REPEATABLE READ
or READ COMMITED
.
Note that the standard on isolation modes does prescribe that you don't see certain quirks in your queries but does not define how (with locking or with MVCC
or otherwise).
When I say "you don't need SELECT FOR UPDATE
" I really should have added "because of side effects of certain database engine implementation".