JOIN versus EXISTS performance

Question 1

JOIN versus EXISTS performance

sql teradata

BellevueBob · Dec 3, 2012 · Viewed 16.8k times · Source

Answer

Answer

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using the performance increase from NOT EXISTS over a LEFT JOIN is very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.

Question 2

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.

I do see other SO answers, like this and this, but my question is specific to Teradata.

For example, consider these two queries, which return identical results:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
join     MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1

The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).

There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).

Both queries perform the same (I'm using Teradata 13.10).

A similar experiment to find non-matches comparing a LEFT OUTER JOIN with a corresponding IS NULL where clause to a NOT EXISTS sub-query does show a performance difference:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
where    x.srv_accs_id is null
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where not exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1

The second query plan is faster (2.21 versus 2.14 seconds as described by EXPLAIN).

My example may be too trivial to see a difference; I'm just looking for coding guidance.

JOIN versus EXISTS performance

Answer

Related questions