I am performing some tests on a HSQLDB
server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCT
query:
SELECT DISTINCT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
It takes around 90 seconds!!!
Then I tried using GROUP BY
:
SELECT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
GROUP BY business_key
And it takes 1 second!!!
Trying to figure out the difference I ran EXLAIN PLAN FOR
but it seems to give the same information for both queries.
EXLAIN PLAN FOR DISTINCT ...
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EDIT:
I did additional tests. With 500 000 records in HSQLDB
with all distinct business keys, the performance of DISTINCT
is now better - 3 seconds, vs GROUP BY
which took around 9 seconds.
In MySQL
both queries preform the same:
MySQL: 500 000 rows - 5 000 distinct business keys:
Both queries: 0.5 second
MySQL: 500 000 rows - all distinct business keys:
SELECT DISTINCT ...
- 11 seconds
SELECT ... GROUP BY business_key
- 13 seconds
So the problem is only related to HSQLDB
.
I will be very grateful if someone can explain why there is such a drastic difference.
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct
approach is executed like:
business_key
values to a temporary tableThe group by
could be executed like:
business key
in a hashtableThe first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.