efficient way to compare two tables in bigquery

Nico Albers picture Nico Albers · Jul 12, 2018 · Viewed 13.5k times · Source

I am interested in comparing, whether two tables contain the same data.

I could do it like this:

#standardSQL
SELECT
    key1, key2
FROM
(
    SELECT 
    table1.key1,
    table1.key2,
    table1.column1 - table2.column1 as col1,
    table1.col2 - table2.col2 as col2
    FROM
        `table1` AS table1
    LEFT JOIN
        `table2` AS table2
    ON
        table1.key1 = table2.key1
    AND
        table1.key2 = table2.key2
)
WHERE 
    col1 != 0
OR
    col2 != 0

But when I want to compare all numerical columns, this is kind of hard, especially if I want to do it for multiple table combinations.

Therefore my question: Is someone aware of a possibility to iterate over all numerical columns and restrict the result set to those keys where any of these differences where not zero?

Answer

Jordan Arseno picture Jordan Arseno · Sep 12, 2019

In Standard SQL, we found using a UNION ALL of two EXCEPT DISTINCT's works for our use cases:

(
  SELECT * FROM table1
  EXCEPT DISTINCT
  SELECT * from table2
)

UNION ALL

(
  SELECT * FROM table2
  EXCEPT DISTINCT
  SELECT * from table1
)

This will produce differences in both directions:

  • rows in table1 that are not in table2
  • rows in table2 that are not in table1

Notes and caveats:

  • table1 and table2 must be of the same width and have columns in the same order and type.
  • this does not work directly with STRUCT or ARRAY data types. You should either UNNEST, or use TO_JSON_STRING to convert the these data types first.
  • this does not directly work with GEOGRAPHY either, you must cast to text first using ST_AsText