General rules for simplifying SQL statements

MicSim · Jul 1, 2009 · Viewed 16.9k times

I'm looking for some "inference rules" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size. Does anything like that exist? Any papers, any tools? Any equivalences that you have found on your own? It's somewhat similar to query optimization, but not in terms of performance.

To state it differently: given a (complex) query with JOINs, SUBSELECTs, and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result, by applying some transformation rules?

So I'm looking for equivalent transformations of SQL statements, like the fact that most SUBSELECTs can be rewritten as a JOIN.
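For instance (a sketch with hypothetical employee/dept tables), here is the kind of rewrite I mean:

SELECT  e.*
FROM    employee e
WHERE   e.dept_id IN
        (
        SELECT  d.id
        FROM    dept d
        WHERE   d.active = 1
        )

-- equivalent as long as dept.id is unique, so the JOIN
-- cannot multiply rows of employee
SELECT  e.*
FROM    employee e
JOIN    dept d
ON      d.id = e.dept_id
WHERE   d.active = 1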

Answer

Quassnoi · Jul 1, 2009

To state it differently: given a (complex) query with JOINs, SUBSELECTs, and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result, by applying some transformation rules?

That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM    mytable
WHERE   col1 > @value1 OR col2 < @value2

can be transformed into this:

SELECT  *
FROM    mytable
WHERE   col1 > @value1
UNION
SELECT  *
FROM    mytable
WHERE   col2 < @value2

or this:

SELECT  mo.*
FROM    (
        SELECT  id
        FROM    mytable
        WHERE   col1 > @value1
        UNION
        SELECT  id
        FROM    mytable
        WHERE   col2 < @value2
        ) mi
JOIN    mytable mo
ON      mo.id = mi.id

Both of these look uglier but can yield better execution plans. (Note that UNION removes duplicates, so these rewrites return exactly the same rows as the original only when the table's rows are distinct, e.g. when id is a primary key.)

One of the most common things to do is replacing this query:

SELECT  *
FROM    mytable
WHERE   col IN
        (
        SELECT  othercol
        FROM    othertable
        )

with this one:

SELECT  *
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    othertable o
        WHERE   o.othercol = mo.col
        )
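This particular equivalence stops being a blind rule once negation is involved: NOT IN and NOT EXISTS behave differently when the subquery can return NULLs, which is one more reason to prefer the EXISTS form. A sketch of the trap, using the same tables:

-- if othertable.othercol contains even one NULL, the NOT IN predicate
-- never evaluates to TRUE and this returns no rows; the corresponding
-- NOT EXISTS version returns the non-matching rows as expected
SELECT  *
FROM    mytable mo
WHERE   col NOT IN
        (
        SELECT  othercol
        FROM    othertable
        )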

In some RDBMS's (like PostgreSQL), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace one with the other:

SELECT  mo.grouper,
        (
        SELECT  SUM(col)
        FROM    mytable mi
        WHERE   mi.grouper = mo.grouper
        )
FROM    (
        SELECT  DISTINCT grouper
        FROM    mytable
        ) mo

vs.

SELECT  grouper, SUM(col)
FROM    mytable
GROUP BY
        grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.
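A quick way to see the difference is to compare the plans directly (the exact plan shapes depend on the PostgreSQL version and statistics; historically DISTINCT produced a Sort with a Unique on top, while GROUP BY could use a HashAggregate):

EXPLAIN
SELECT  DISTINCT grouper
FROM    mytable

EXPLAIN
SELECT  grouper
FROM    mytable
GROUP BY
        grouper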

MySQL lacks FULL OUTER JOIN, so this query:

SELECT  t1.col1, t2.col2
FROM    table1 t1
FULL OUTER JOIN
        table2 t2
ON      t1.id = t2.id

can be emulated in MySQL as:

SELECT  t1.col1, t2.col2
FROM    table1 t1
LEFT JOIN
        table2 t2
ON      t1.id = t2.id
UNION ALL
SELECT  NULL, t2.col2
FROM    table1 t1
RIGHT JOIN
        table2 t2
ON      t1.id = t2.id
WHERE   t1.id IS NULL

The WHERE t1.id IS NULL filter in the second branch keeps only the rows of table2 that have no match in table1, so matched rows are not duplicated. See this article in my blog on how to do this more efficiently in MySQL.

This hierarchical query in Oracle:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    animal
START WITH
        animal_id = :id
CONNECT BY
        PRIOR animal_id IN (father, mother)
ORDER BY
        animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
START WITH
        animal_id = :id
CONNECT BY
        parent = PRIOR animal_id
ORDER BY
        animal_id

The latter one is more performant.

See this article in my blog for the execution plan details.

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM    ranges
WHERE   end_date >= @start
        AND start_date <= @end

But in SQL Server, this more complex query yields the same results faster:

SELECT  *
FROM    ranges
WHERE   (start_date > @start AND start_date <= @end)
        OR (@start BETWEEN start_date AND end_date)

And believe it or not, I have an article in my blog on this too.

SQL Server also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM    mytable mi
JOIN    mytable mo
ON      mo.id <= mi.id
GROUP BY
        mi.id

can be more efficiently rewritten using, Lord help me, cursors (you heard me right: cursors, more efficiently and SQL Server in one sentence).

See this article in my blog on how to do it.
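Since SQL Server 2012, window aggregates accept a ROWS frame, so on modern versions the running sum can be written natively. A sketch reusing the mytable columns from above:

SELECT  id,
        SUM(value) OVER
        (
        ORDER BY id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_sum
FROM    mytable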

There is a certain kind of query commonly met in financial applications that searches for the effective rate for a currency, like this one in Oracle:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM    t_transaction x
JOIN    t_rate r
ON      (rte_currency, rte_date) IN
        (
        SELECT  xac_currency, MAX(rte_date)
        FROM    t_rate
        WHERE   rte_currency = xac_currency
                AND rte_date <= xac_date
        )

This query can be heavily rewritten to use an equality condition, which allows a HASH JOIN instead of NESTED LOOPS:

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date

Despite being bulky as hell, the latter query is 6 times faster.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to JOIN with.
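Distilled to its essence, the calendar trick is: generate one row per day, attach to each day the latest rate date on or before it, and then the final join can use plain equality. A minimal sketch (with a hypothetical fixed date range instead of deriving the bounds from t_rate as the full query does):

SELECT  c.cal_date,
        (
        SELECT  MAX(r.rte_date)
        FROM    t_rate r
        WHERE   r.rte_date <= c.cal_date
        ) AS eff_date
FROM    (
        -- one row per day for one year
        SELECT  DATE '2009-01-01' + LEVEL - 1 AS cal_date
        FROM    dual
        CONNECT BY
                LEVEL <= 365
        ) c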