Find difference between two big tables in PostgreSQL

odiszapc picture odiszapc · Mar 11, 2013 · Viewed 27.4k times · Source

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)

Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.

This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

Answer

Erwin Brandstetter picture Erwin Brandstetter · Mar 11, 2013

EXISTS seems like the best option.

tbl1 is the table with surplus rows in this example:

SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

SELECT *
FROM   tbl1
FULL   OUTER JOIN tbl2 USING (col)
WHERE  tbl2 col IS NULL OR
       tbl1.col IS NULL;

Overview over basic techniques in a later post:

Aside: The data type uuid is efficient for md5 hashes: