I have two tables in redshift:
In addition to it, I have a view called qry_both_days defined as following:
CREATE OR REPLACE qry_both_days AS
SELECT * FROM tbl_current_day
UNION SELECT * FROM tbl_previous_day;
When I run a query on one of the separate tables, I get very good performance as expected. For example, the following query runs 5 seconds:
select count(distinct person_id) from tbl_current_day;
-- (person_id is of type int)
Explain plan:
XN Aggregate (cost=1224379.82..1224379.82 rows=1 width=4)
-> XN Subquery Scan volt_dt_0 (cost=1224373.80..1224378.61 rows=481 width=4)
-> XN HashAggregate (cost=1224373.80..1224373.80 rows=481 width=4)
-> XN Seq Scan on tbl_current_day (cost=0.00..979499.04 rows=97949904 width=4)
Note that width is 4 bytes, as it's supposed to be, as my column is of type int.
HOWEVER, when I run the same query on qry_both_days the query runs 20 times slower, while I would expect it to run only 2 times slower, as it should go over twice more rows:
select count(distinct person_id) from qry_both_days;
Explain plan:
XN Aggregate (cost=55648338.34..55648338.34 rows=1 width=4)
-> XN Subquery Scan volt_dt_0 (cost=55648335.84..55648337.84 rows=200 width=4)
-> XN HashAggregate (cost=55648335.84..55648335.84 rows=200 width=4)
-> XN Subquery Scan qry_both_days (cost=0.00..54354188.49 rows=517658938 width=4)
-> XN Unique (cost=0.00..49177599.11 rows=517658938 width=190)
-> XN Append (cost=0.00..10353178.76 rows=517658938 width=190)
-> XN Subquery Scan "*SELECT* 1" (cost=0.00..89649.20 rows=4482460 width=190)
-> XN Seq Scan on tbl_current_day (cost=0.00..44824.60 rows=4482460 width=190)
-> XN Subquery Scan "*SELECT* 2" (cost=0.00..90675.00 rows=4533750 width=187)
-> XN Seq Scan on tbl_previous_day (cost=0.00..45337.50 rows=4533750 width=187)
The problem: width is now 190, not 4 bytes as it's supposed to be!!! Anybody knows how to make RedShift pick only the relevant columns on UNION SELECT?
Thanks!
UNION
used by itself removes duplicate rows, e.g., uses an implied DISTINCT
, as per the SQL spec.
That means that a lot more processing is required to prepare the output.
If you do not want DISTINCT
results then you should always use UNION ALL
to make sure the database is not checking for potential dupes.