There are two tables linked by an id:
item_tbl (id)
link_tbl (item_id)
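For reference, a minimal sketch of the schema this implies (column types assumed; the real tables may have more columns):

CREATE TABLE item_tbl (
    id integer PRIMARY KEY
);
CREATE TABLE link_tbl (
    item_id integer  -- logically references item_tbl.id, but orphans exist,
                     -- so no FK constraint is enforced
);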
There are some records in link_tbl that don't have a matching row in item_tbl. A select that counts them would be:
SELECT COUNT(*)
FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
WHERE itm.id IS NULL
I would like to delete those orphan records (those without a match in the other table) from link_tbl, but the only way I could think of was:
DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
There are:
262,086,253 records in link_tbl
3,033,811 records in item_tbl
16,844,347 orphan records in link_tbl
The server has 4GB of RAM and an 8-core CPU.
EXPLAIN DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
Returns:
Delete on link_tbl lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
-> Seq Scan on link_tbl lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..79298.10 rows=3063207 width=4)
-> Seq Scan on item_tbl itm (cost=0.00..52016.07 rows=3063207 width=4)
The questions are:
Is there a better way to delete the orphan records from link_tbl?
How accurate is the explain above, i.e. how long could it take to delete those records?
Resolution:
Thank you all for your advice, it was very helpful. I finally used the delete advised by Erwin Brandstetter (https://stackoverflow.com/a/15959896/1331340), but I tweaked it a little:
DELETE FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 10000
AND lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
WHERE itm.id BETWEEN 0 AND 10000)
I compared the results for NOT IN and NOT EXISTS; the output is below. I used COUNT instead of DELETE, which should not matter for a relative comparison:
EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 20000
AND lnk.item_id NOT IN (SELECT itm.id
FROM item_tbl itm
WHERE itm.id BETWEEN 0 AND 20000);
QUERY PLAN
Aggregate (cost=6002667.56..6002667.57 rows=1 width=0) (actual time=226817.086..226817.088 rows=1 loops=1)
-> Seq Scan on link_tbl lnk (cost=1592.50..5747898.65 rows=101907564 width=0) (actual time=206.029..225289.570 rows=566625 loops=1)
Filter: ((item_id >= 0) AND (item_id <= 20000) AND (NOT (hashed SubPlan 1)))
SubPlan 1
-> Index Scan using item_tbl_pkey on item_tbl itm (cost=0.00..1501.95 rows=36221 width=4) (actual time=0.056..99.266 rows=17560 loops=1)
Index Cond: ((id >= 0) AND (id <= 20000))
Total runtime: 226817.211 ms
EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk WHERE lnk.item_id>0 AND lnk.item_id<20000
AND NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id=lnk.item_id);
QUERY PLAN
Aggregate (cost=8835772.00..8835772.01 rows=1 width=0) (actual time=1209235.133..1209235.135 rows=1 loops=1)
-> Hash Anti Join (cost=102272.16..8835771.99 rows=1 width=0) (actual time=19315.170..1207900.612 rows=566534 loops=1)
Hash Cond: (lnk.item_id = itm.id)
-> Seq Scan on link_tbl lnk (cost=0.00..5091076.55 rows=203815128 width=4) (actual time=0.016..599147.604 rows=200301872 loops=1)
Filter: ((item_id > 0) AND (item_id < 20000))
-> Hash (cost=52016.07..52016.07 rows=3063207 width=4) (actual time=19313.976..19313.976 rows=3033811 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 26672kB
-> Seq Scan on item_tbl itm (cost=0.00..52016.07 rows=3063207 width=4) (actual time=0.013..9274.158 rows=3033811 loops=1)
Total runtime: 1209260.228 ms
NOT EXISTS was 5 times slower.
The actual delete of the data didn't take as long as I feared; I was able to delete it in 5 batches (10000-20000, 20000-100000, 100000-200000, 200000-1000000 and 1000000-1755441). First I found the max item_id, so I only had to go through half of the id range.
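For the record, the batching could also be scripted instead of issuing the ranges by hand. A minimal sketch, assuming the NOT IN form from above and a fixed batch width (note that a DO block runs as a single transaction; issue the deletes as separate statements if you want to commit between batches):

DO $$
DECLARE
    lo   bigint := 0;
    hi   bigint;
    step bigint := 100000;  -- batch width, tune to taste
BEGIN
    SELECT max(item_id) INTO hi FROM link_tbl;  -- upper bound of the id range
    WHILE lo <= hi LOOP
        DELETE FROM link_tbl lnk
        WHERE lnk.item_id BETWEEN lo AND lo + step - 1
          AND lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
                                  WHERE itm.id BETWEEN lo AND lo + step - 1);
        lo := lo + step;
    END LOOP;
END $$;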
When I tried NOT IN or NOT EXISTS without the range restriction (with SELECT COUNT) it didn't even finish: I let it run overnight and it was still running in the morning.
I think what I was looking for was the DELETE with USING from wildplasser's answer (https://stackoverflow.com/a/15988033/1331340), but it came too late:
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;
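Translated to the tables from the question, that would be something like this (a sketch; DISTINCT added because link_tbl.item_id is not unique):

DELETE FROM link_tbl lnk
USING (
    SELECT DISTINCT l2.item_id
    FROM link_tbl l2
    LEFT JOIN item_tbl itm ON itm.id = l2.item_id
    WHERE itm.id IS NULL
    ) sq
WHERE sq.item_id = lnk.item_id
    ;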
I benchmarked four typical queries with different settings for {work_mem, effective_cache_size, random_page_cost}; these settings have the largest influence on the selected plan. I first did a warm-up run with my default settings to warm the cache. Note: the test set is small enough to allow all needed pages to be present in cache.
The test-set
SET search_path=tmp;
/************************/
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE one
( id SERIAL NOT NULL PRIMARY KEY
, payload varchar
);
CREATE TABLE two
( id SERIAL NOT NULL PRIMARY KEY
, one_id INTEGER REFERENCES one
, payload varchar
);
INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
VACUUM ANALYZE one;
VACUUM ANALYZE two;
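A quick sanity check of what the set-up produces (expected: one keeps its 30000 rows; two ends up with 240000 rows after the three doublings, with one_id set on roughly 10% of them):

SELECT count(*)      AS two_rows     -- 240000
     , count(one_id) AS linked_rows  -- ~24000; count(col) skips NULLs
FROM two;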
/***************/
The queries:
\echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
WHERE t.one_id = o.id
);
\echo NOT IN()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
;
\echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;
\echo USING (subquery self WHERE NOT EXISTS(two))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
WHERE NOT EXISTS ( SELECT *
FROM two t WHERE t.one_id = o2.id
)
) sq
WHERE sq.id = o.id
;
The result (summarised; total runtime in ms)
                              NOT EXISTS()   NOT IN()  USING(LEFT JOIN NULL)  USING(NOT EXISTS)
1) rpc=4.0 csz=1M wmm=64            80.358  14389.026                 77.620             72.917
2) rpc=4.0 csz=1M wmm=64000         60.527     69.104                 51.851             51.004
3) rpc=1.5 csz=1M wmm=64            69.804  10758.480                 80.402             77.356
4) rpc=1.5 csz=1M wmm=64000         50.872     69.366                 50.763             53.339
5) rpc=4.0 csz=1G wmm=64            84.117   7625.792                 69.790             69.627
6) rpc=4.0 csz=1G wmm=64000         49.964     67.018                 49.968             49.380
7) rpc=1.5 csz=1G wmm=64            68.567   3650.008                 70.283             69.933
8) rpc=1.5 csz=1G wmm=64000         49.800     67.298                 50.116             50.345
legend:
rpc := "random_page_cost"
csz := "effective_cache_size"
wmm := "work_mem"
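These are ordinary session settings, so each row can be reproduced with plain SET commands; for example, row 8 corresponds to (work_mem given in kB, as in the legend):

SET random_page_cost = 1.5;
SET effective_cache_size = '1GB';
SET work_mem = '64000kB';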
As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, the setting 64(kB) is very low, but that *more or less* corresponds to large data sets, which won't fit in hash tables either.
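A general caveat with NOT IN, separate from the timings: if the subquery yields any NULL (and SELECT one_id FROM two does here, since most one_id values were left NULL), then x NOT IN (...) never evaluates to true and such a DELETE removes no rows at all; NOT EXISTS does not have this problem. A minimal illustration:

SELECT 1 WHERE 2 NOT IN (VALUES (1), (NULL));  -- returns no rows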
EXTRA: during the warm-up phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be the result of a conflict with the vacuum daemon, which is still active after the table set-up:
PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
NOT EXISTS()
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Delete on one o (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
-> Hash Anti Join (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
Hash Cond: (o.id = t.one_id)
-> Seq Scan on one o (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
-> Hash (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1015kB
-> Seq Scan on two t (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
Total runtime: 467824.652 ms
(9 rows)