Very bad performance of UNION select query in RedShift / ParAccel

Question 1

Very bad performance of UNION select query in RedShift / ParAccel

amazon-redshift paraccel

diemacht · Nov 19, 2013 · Viewed 7k times · Source

Answer

Answer

UNION used by itself removes duplicate rows, e.g., uses an implied DISTINCT, as per the SQL spec.

That means that a lot more processing is required to prepare the output.

If you do not want DISTINCT results then you should always use UNION ALL to make sure the database is not checking for potential dupes.

Question 2

I have two tables in redshift:

tbl_current_day - about 4.5M rows
tbl_previous_day - about 4.5M rows, with the same data exactly as tbl_current_day

In addition to it, I have a view called qry_both_days defined as following:

CREATE OR REPLACE qry_both_days AS 
SELECT * FROM tbl_current_day
UNION SELECT * FROM tbl_previous_day;

When I run a query on one of the separate tables, I get very good performance as expected. For example, the following query runs 5 seconds:

select count(distinct person_id) from tbl_current_day;
-- (person_id is of type int)

Explain plan:

 XN Aggregate  (cost=1224379.82..1224379.82 rows=1 width=4)
   ->  XN Subquery Scan volt_dt_0  (cost=1224373.80..1224378.61 rows=481 width=4)
         ->  XN HashAggregate  (cost=1224373.80..1224373.80 rows=481 width=4)
               ->  XN Seq Scan on tbl_current_day  (cost=0.00..979499.04 rows=97949904 width=4)

Note that width is 4 bytes, as it's supposed to be, as my column is of type int.

HOWEVER, when I run the same query on qry_both_days the query runs 20 times slower, while I would expect it to run only 2 times slower, as it should go over twice more rows:

select count(distinct person_id) from qry_both_days;

Explain plan:

 XN Aggregate  (cost=55648338.34..55648338.34 rows=1 width=4)
   ->  XN Subquery Scan volt_dt_0  (cost=55648335.84..55648337.84 rows=200 width=4)
         ->  XN HashAggregate  (cost=55648335.84..55648335.84 rows=200 width=4)
               ->  XN Subquery Scan qry_both_days  (cost=0.00..54354188.49 rows=517658938 width=4)
                     ->  XN Unique  (cost=0.00..49177599.11 rows=517658938 width=190)
                           ->  XN Append  (cost=0.00..10353178.76 rows=517658938 width=190)
                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..89649.20 rows=4482460 width=190)
                                       ->  XN Seq Scan on tbl_current_day  (cost=0.00..44824.60 rows=4482460 width=190)
                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..90675.00 rows=4533750 width=187)
                                       ->  XN Seq Scan on tbl_previous_day  (cost=0.00..45337.50 rows=4533750 width=187)

The problem: width is now 190, not 4 bytes as it's supposed to be!!! Anybody knows how to make RedShift pick only the relevant columns on UNION SELECT?

Thanks!

Very bad performance of UNION select query in RedShift / ParAccel

Answer

Related questions