I have a huge table with a much smaller number (by orders of magnitude) of distinct values in some column x. I need to run a query like SELECT DISTINCT x FROM hugeTable, and I want it to be relatively fast.
I created an index with CREATE INDEX hugeTable_by_x ON hugeTable(x), but for some reason, even though the output is small, the query execution is not as fast as I expected. The query plan shows that 97% of the time is spent on an Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table, followed by (among other things) a Hash Match operation. Since I created an index on column x, shouldn't I expect this query to run very quickly?
Note that I'm using Microsoft SQL Server 2005.
This is likely not an indexing problem but a data-design problem: normalization, to be precise. The fact that you need to query the distinct values of a column, and are even willing to add an index for it, is a strong indicator that the column should be normalized out into a separate lookup table with a (small) join key. The distinct values are then available immediately by scanning the much smaller lookup table.
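As an illustration only, a minimal sketch of that design could look like the following; the lookup table name xValues, the surrogate key xId, and the varchar(100) type are assumptions, not part of your actual schema:

create table dbo.xValues (
    xId int not null primary key,      -- small surrogate join key
    x   varchar(100) not null unique   -- one row per distinct value of x
);

create table dbo.hugeTable (
    -- ... the other columns of the huge table ...
    xId int not null
        constraint FK_hugeTable_xValues references dbo.xValues(xId)  -- the huge table stores only the key
);

-- the distinct values now come straight from the tiny lookup table
select x from dbo.xValues;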
Update
As a workaround, you can create an indexed view that aggregates by the 'distinct' column. COUNT_BIG is an aggregate that is allowed in indexed views:
create view dbo.vwDistinct
with schemabinding
as
select x, count_big(*) as cnt   -- indexed views that use GROUP BY must include COUNT_BIG(*), and the column must be named
from dbo.hugeTable              -- schemabinding requires a two-part table name; the dbo schema is assumed
group by x;
go

-- the first index on a view must be a unique clustered index
create unique clustered index cdxDistinct on dbo.vwDistinct(x);
go

select x from dbo.vwDistinct with (noexpand);
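SQL Server maintains the view's index automatically as hugeTable changes, so the query reduces to a scan of an index that holds only one row per distinct value of x. Note that on editions other than Enterprise, the NOEXPAND hint is needed for the optimizer to read from the indexed view directly.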