Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

polygenelubricants picture polygenelubricants · May 12, 2011 · Viewed 28k times · Source

I have a huge table, having a much smaller number (by orders of magnitude) of distinct values on some column x.

I need to do a query like SELECT DISTINCT x FROM hugeTable, and I want to do this relatively fast.

I did something like CREATE INDEX hugeTable_by_x ON hugeTable(x), but for some reason, even though the output is small, the query execution is not as fast. The query plan shows that 97% of the time is spent on Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table. This is followed by, among other things, a Hash Match operation.

Since I created an index on column x, can I not expect this query to run very quickly?

Note that I'm using Microsoft SQL Server 2005.

Answer

Remus Rusanu picture Remus Rusanu · May 12, 2011

This is likely not a problem of indexing, but one of data design. Normalization, to be precise. The fact that you need to query distinct values of a field, and even willing to add an index, is a strong indicator that the field should be normalized into a separate table with a (small) join key. Then the distinct values will be available immediately by scanning the much smaller lookup foreign table.

Update
As a workaround, you can create an indexed view on an aggregate by the 'distinct' field. COUNT_BIG is an aggregate that is allowed in indexed views:

create view vwDistinct
with schemabinding
as select x, count_big(*)
from schema.hugetable
group by x;

create clustered index cdxDistinct on vwDistinct(x);

select x from vwDistinct with (noexpand);