I have the following UPSERT in PostgreSQL 9.5:
INSERT INTO chats ("user", "contact", "name")
VALUES ($1, $2, $3),
($2, $1, NULL)
ON CONFLICT("user", "contact") DO NOTHING
RETURNING id;
If there are no conflicts it returns something like this:
----------
| id |
----------
1 | 50 |
----------
2 | 51 |
----------
But if there are conflicts it doesn't return any rows:
----------
| id |
----------
I want to return the new id
columns if there are no conflicts or return the existing id
columns of the conflicting columns.
Can this be done? If so, how?
The currently accepted answer seems ok for a single conflict target, few conflicts, small tuples and no triggers. It avoids concurrency issue 1 (see below) with brute force. The simple solution has its appeal, the side effects may be less important.
For all other cases, though, do not update identical rows without need. Even if you see no difference on the surface, there are various side effects:
It might fire triggers that should not be fired.
It write-locks "innocent" rows, possibly incurring costs for concurrent transactions.
It might make the row seem new, though it's old (transaction timestamp).
Most importantly, with PostgreSQL's MVCC model a new row version is written for every UPDATE
, no matter whether the row data changed. This incurs a performance penalty for the UPSERT itself, table bloat, index bloat, performance penalty for subsequent operations on the table, VACUUM
cost. A minor effect for few duplicates, but massive for mostly dupes.
Plus, sometimes it is not practical or even possible to use ON CONFLICT DO UPDATE
. The manual:
For
ON CONFLICT DO UPDATE
, aconflict_target
must be provided.
A single "conflict target" is not possible if multiple indexes / constraints are involved.
You can achieve (almost) the same without empty updates and side effects. Some of the following solutions also work with ON CONFLICT DO NOTHING
(no "conflict target"), to catch all possible conflicts that might arise - which may or may not be desirable.
WITH input_rows(usr, contact, name) AS (
VALUES
(text 'foo1', text 'bar1', text 'bob1') -- type casts in first row
, ('foo2', 'bar2', 'bob2')
-- more?
)
, ins AS (
INSERT INTO chats (usr, contact, name)
SELECT * FROM input_rows
ON CONFLICT (usr, contact) DO NOTHING
RETURNING id --, usr, contact -- return more columns?
)
SELECT 'i' AS source -- 'i' for 'inserted'
, id --, usr, contact -- return more columns?
FROM ins
UNION ALL
SELECT 's' AS source -- 's' for 'selected'
, c.id --, usr, contact -- return more columns?
FROM input_rows
JOIN chats c USING (usr, contact); -- columns of unique index
The source
column is an optional addition to demonstrate how this works. You may actually need it to tell the difference between both cases (another advantage over empty writes).
The final JOIN chats
works because newly inserted rows from an attached data-modifying CTE are not yet visible in the underlying table. (All parts of the same SQL statement see the same snapshots of underlying tables.)
Since the VALUES
expression is free-standing (not directly attached to an INSERT
) Postgres cannot derive data types from the target columns and you may have to add explicit type casts. The manual:
When
VALUES
is used inINSERT
, the values are all automatically coerced to the data type of the corresponding destination column. When it's used in other contexts, it might be necessary to specify the correct data type. If the entries are all quoted literal constants, coercing the first is sufficient to determine the assumed type for all.
The query itself (not counting the side effects) may be a bit more expensive for few dupes, due to the overhead of the CTE and the additional SELECT
(which should be cheap since the perfect index is there by definition - a unique constraint is implemented with an index).
May be (much) faster for many duplicates. The effective cost of additional writes depends on many factors.
But there are fewer side effects and hidden costs in any case. It's most probably cheaper overall.
Attached sequences are still advanced, since default values are filled in before testing for conflicts.
About CTEs:
Assuming default READ COMMITTED
transaction isolation. Related:
The best strategy to defend against race conditions depends on exact requirements, the number and size of rows in the table and in the UPSERTs, the number of concurrent transactions, the likelihood of conflicts, available resources and other factors ...
If a concurrent transaction has written to a row which your transaction now tries to UPSERT, your transaction has to wait for the other one to finish.
If the other transaction ends with ROLLBACK
(or any error, i.e. automatic ROLLBACK
), your transaction can proceed normally. Minor possible side effect: gaps in sequential numbers. But no missing rows.
If the other transaction ends normally (implicit or explicit COMMIT
), your INSERT
will detect a conflict (the UNIQUE
index / constraint is absolute) and DO NOTHING
, hence also not return the row. (Also cannot lock the row as demonstrated in concurrency issue 2 below, since it's not visible.) The SELECT
sees the same snapshot from the start of the query and also cannot return the yet invisible row.
Any such rows are missing from the result set (even though they exist in the underlying table)!
This may be ok as is. Especially if you are not returning rows like in the example and are satisfied knowing the row is there. If that's not good enough, there are various ways around it.
You can check the row count of the output and repeat the statement if it does not match the row count of the input. May be good enough for the rare case. The point is to start a new query (can be in the same transaction), which will then see the newly committed rows.
Or check for missing result rows within the same query and overwrite those with the brute force trick demonstrated in Alextoni's answer.
WITH input_rows(usr, contact, name) AS ( ... ) -- see above
, ins AS (
INSERT INTO chats AS c (usr, contact, name)
SELECT * FROM input_rows
ON CONFLICT (usr, contact) DO NOTHING
RETURNING id, usr, contact -- we need unique columns for later join
)
, sel AS (
SELECT 'i'::"char" AS source -- 'i' for 'inserted'
, id, usr, contact
FROM ins
UNION ALL
SELECT 's'::"char" AS source -- 's' for 'selected'
, c.id, usr, contact
FROM input_rows
JOIN chats c USING (usr, contact)
)
, ups AS ( -- RARE corner case
INSERT INTO chats AS c (usr, contact, name) -- another UPSERT, not just UPDATE
SELECT i.*
FROM input_rows i
LEFT JOIN sel s USING (usr, contact) -- columns of unique index
WHERE s.usr IS NULL -- missing!
ON CONFLICT (usr, contact) DO UPDATE -- we've asked nicely the 1st time ...
SET name = c.name -- ... this time we overwrite with old value
-- SET name = EXCLUDED.name -- alternatively overwrite with *new* value
RETURNING 'u'::"char" AS source -- 'u' for updated
, id --, usr, contact -- return more columns?
)
SELECT source, id FROM sel
UNION ALL
TABLE ups;
It's like the query above, but we add one more step with the CTE ups
, before we return the complete result set. That last CTE will do nothing most of the time. Only if rows go missing from the returned result, we use brute force.
More overhead, yet. The more conflicts with pre-existing rows, the more likely this will outperform the simple approach.
One side effect: the 2nd UPSERT writes rows out of order, so it re-introduces the possibility of deadlocks (see below) if three or more transactions writing to the same rows overlap. If that's a problem, you need a different solution - like repeating the whole statement as mentioned above.
If concurrent transactions can write to involved columns of affected rows, and you have to make sure the rows you found are still there at a later stage in the same transaction, you can lock existing rows cheaply in the CTE ins
(which would otherwise go unlocked) with:
...
ON CONFLICT (usr, contact) DO UPDATE
SET name = name WHERE FALSE -- never executed, but still locks the row
...
And add a locking clause to the SELECT
as well, like FOR UPDATE
.
This makes competing write operations wait till the end of the transaction, when all locks are released. So be brief.
More details and explanation:
Defend against deadlocks by inserting rows in consistent order. See:
Explicit type casts for the first row of data in the free-standing VALUES
expression may be inconvenient. There are ways around it. You can use any existing relation (table, view, ...) as row template. The target table is the obvious choice for the use case. Input data is coerced to appropriate types automatically, like in the VALUES
clause of an INSERT
:
WITH input_rows AS (
(SELECT usr, contact, name FROM chats LIMIT 0) -- only copies column names and types
UNION ALL
VALUES
('foo1', 'bar1', 'bob1') -- no type casts here
, ('foo2', 'bar2', 'bob2')
)
...
This does not work for some data types. See:
This also works for all data types.
While inserting into all (leading) columns of the table, you can omit column names. Assuming table chats
in the example only consists of the 3 columns used in the UPSERT:
WITH input_rows AS (
SELECT * FROM (
VALUES
((NULL::chats).*) -- copies whole row definition
('foo1', 'bar1', 'bob1') -- no type casts needed
, ('foo2', 'bar2', 'bob2')
) sub
OFFSET 1
)
...
Aside: don't use reserved words like "user"
as identifier. That's a loaded footgun. Use legal, lower-case, unquoted identifiers. I replaced it with usr
.