I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the DISTINCT
with a GROUP BY
clause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was that SELECT DISTINCT
and GROUP BY
operated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation where SELECT DISTINCT
runs faster?
Note: In the following query, WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in the SELECT
and would result in about a tenth of all available data
SQL using DISTINCT
:
SELECT DISTINCT
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
) AS CHILD_COUNT
FROM
ITEMS
INNER JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
LEFT OUTER JOIN ITEM_METADATA
ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
LEFT OUTER JOIN JOB_INVENTORY
ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID
LEFT OUTER JOIN JOB_TASK_INVENTORY
ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
LEFT OUTER JOIN JOB_TASKS
ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID
LEFT OUTER JOIN JOBS
ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
LEFT OUTER JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID
LEFT OUTER JOIN TASK_STEP_INFORMATION
ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE
TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
ORDER BY
ITEMS.ITEM_CODE
SQL using GROUP BY
:
SELECT
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
) AS CHILD_COUNT
FROM
ITEMS
INNER JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
LEFT OUTER JOIN ITEM_METADATA
ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
LEFT OUTER JOIN JOB_INVENTORY
ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID
LEFT OUTER JOIN JOB_TASK_INVENTORY
ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
LEFT OUTER JOIN JOB_TASKS
ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID
LEFT OUTER JOIN JOBS
ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
LEFT OUTER JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID
LEFT OUTER JOIN TASK_STEP_INFORMATION
ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE
TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
GROUP BY
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS
ORDER BY
ITEMS.ITEM_CODE
Here is the Oracle query plan for the query using DISTINCT
:
Here is the Oracle query plan for the query using GROUP BY
:
The performance difference is probably due to the execution of the subquery in the SELECT
clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by
, it would execute once after the group by.
Try replacing it with a join, instead:
select . . .,
parentcnt
from . . . left outer join
(SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
FROM ITEM_PARENTS
) p
on items.item_id = p.parent_item_id