I'm a little bit confused reading about PostgreSQL deadlocks.
A typical deadlock example is:
-- Transaction 1
UPDATE customer SET ... WHERE id = 1
UPDATE customer SET ... WHERE id = 2
-- Transaction 2
UPDATE customer SET ... WHERE id = 2
UPDATE customer SET ... WHERE id = 1
But what if I change the code as follows:
-- Transaction 1
UPDATE customer SET ... WHERE id IN (1, 2)
-- Transaction 2
UPDATE customer SET ... WHERE id IN (1, 2)
Will be a possibility of deadlock here?
Essentially my question is: in the 2nd case does PostgreSQL lock rows one-by-one, or lock the entire scope covered by the WHERE
condition?
Thanks in advance!
In PostgreSQL the rows will be locked as they are updated -- in fact, the way this actually works is that each tuple (version of a row) has a system field called xmin
to indicate which transaction made that tuple current (by insert or update) and a system field called xmax
to indicate which transaction expired that tuple (by update or delete). When you access data, it checks each tuple to determine whether it is visible to your transaction, by checking your active "snapshot" against these values.
If you are executing an UPDATE and a tuple which matches your search conditions has an xmin which would make it visible to your snapshot and an xmax of an active transaction, it blocks, waiting for that transaction to complete. If the transaction which first updated the tuple rolls back, your transaction wakes up and processes the row; if the first transaction commits, your transaction wakes up and takes action depending on the current transaction isolation level.
Obviously, a deadlock is the result of this happening to rows in different order. There is no row-level lock in RAM which can be obtained for all rows at the same time, but if rows are updated in the same order you can't have the circular locking. Unfortunately, the suggested IN(1, 2)
syntax doesn't guarantee that. Different sessions may have different costing factors active, a background "analyze" task may change statistics for the table between the generation of one plan and the other, or it may be using a seqscan and be affected by the PostgreSQL optimization which causes a new seqscan to join one already in progress and "loop around" to reduce disk I/O.
If you do the updates one at a time in the same order, in application code or using a cursor, then you will have only simple blocking, not deadlocks. In general, though, relational databases are prone to serialization failures, and it is best to access them through a framework which will recognize them based on SQLSTATE and automatically retry the entire transaction from the start. In PostgreSQL a serialization failure will always have a SQLSTATE of 40001 or 40P01.
http://www.postgresql.org/docs/current/interactive/mvcc-intro.html