I'm looking for some "inference rules" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size. Does there exist something like that? Any papers, any tools? Any equivalencies that you found on your own? It's somehow similar to query optimization, but not in terms of performance.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
So, I'm looking for equivalent transformations of SQL statements like the fact that most SUBSELECTs can be rewritten as a JOIN.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
That's exactly what optimizers do for a living (not that I'm saying they always do this well).
Since SQL
is a set based language, there are usually more than one way to transform one query to other.
Like this query:
SELECT *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2
can be transformed into this:
SELECT *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2
or this:
SELECT mo.*
FROM (
SELECT id
FROM mytable
WHERE col1 > @value1
UNION
SELECT id
FROM mytable
WHERE col2 < @value2
) mi
JOIN mytable mo
ON mo.id = mi.id
, which look uglier but can yield better execution plans.
One of the most common things to do is replacing this query:
SELECT *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)
with this one:
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)
In some RDBMS
's (like PostgreSQL
), DISTINCT
and GROUP BY
use the different execution plans, so sometimes it's better to replace one with the other:
SELECT mo.grouper,
(
SELECT SUM(col)
FROM mytable mi
WHERE mi.grouper = mo.grouper
)
FROM (
SELECT DISTINCT grouper
FROM mytable
) mo
vs.
SELECT mo.grouper, SUM(col)
FROM mytable
GROUP BY
mo.grouper
In PostgreSQL
, DISTINCT
sorts and GROUP BY
hashes.
MySQL
lacks FULL OUTER JOIN
, so it can be rewritten as folloing:
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT OUTER JOIN
table2 t2
ON t1.id = t2.id
vs.
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL
, but see this article in my blog on how to do this more efficiently in MySQL
:
This hierarchical query in Oracle
:
SELECT DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id
can be transformed to this:
SELECT DISTINCT(animal_id) AS animal_id
FROM (
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id
, the latter one being more performant.
See this article in my blog for the execution plan details:
To find all ranges that overlap the given range, you can use the following query:
SELECT *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end
, but in SQL Server
this more complex query yields same results faster:
SELECT *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)
, and believe it or not, I have an article in my blog on this too:
SQL Server
also lacks an efficient way to do cumulative aggregates, so this query:
SELECT mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id
can be more efficiently rewritten using, Lord help me, cursors (you heard me right: cursors
, more efficiently
and SQL Server
in one sentence).
See this article in my blog on how to do it:
There is a certain kind of query commonly met in financial applications that searches for the effective rate for a currency, like this one in Oracle
:
SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)
This query can be heavily rewritten to use an equality condition which allows a HASH JOIN
instead of NESTED LOOPS
:
WITH v_rate AS
(
SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
FROM (
SELECT cur_id, dte_date,
(
SELECT MAX(rte_date)
FROM t_rate ri
WHERE rte_currency = cur_id
AND rte_date <= dte_date
) AS rte_effdate
FROM (
SELECT (
SELECT MAX(rte_date)
FROM t_rate
) - level + 1 AS dte_date
FROM dual
CONNECT BY
level <=
(
SELECT MAX(rte_date) - MIN(rte_date)
FROM t_rate
)
) v_date,
(
SELECT 1 AS cur_id
FROM dual
UNION ALL
SELECT 2 AS cur_id
FROM dual
) v_currency
) v_eff
LEFT JOIN
t_rate
ON rte_currency = cur_id
AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
FROM t_transaction x
GROUP BY
xac_currency, TRUNC(xac_date)
)
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date
Despite being bulky as a hell, the latter query is 6
times faster.
The main idea here is replacing <=
with =
, which requires building an in-memory calendar table. to JOIN
with.