CHECKSUM() collisions in SQL Server 2005

Cade Roux · Jun 22, 2009

I've got a table of 5,651,744 rows with a primary key made up of 6 columns (int x 3, smallint, varchar(39), varchar(2)). I am looking to improve performance of this table and of another table which shares this primary key (plus one additional column) but has 37M rows.

In anticipation of adding a column to create the hash key, I did an analysis and found 18,733 collisions.

SELECT  SUM(CT)
FROM    (
         SELECT HASH_KEY
               ,COUNT(*) AS CT
         FROM   (
                 SELECT CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                                 GL_ACCT_NUM, ACCT_NUM, APPN_CD) AS HASH_KEY
                 FROM   CUST_ACCT_PRFTBLT
                ) AS X
         GROUP BY HASH_KEY
         HAVING COUNT(*) > 1
        ) AS Y

SELECT  COUNT(*)
FROM    CUST_ACCT_PRFTBLT

It's about twice as bad with BINARY_CHECKSUM()

Does this seem too high (0.33%), given how small a fraction of the 32-bit destination space I'm covering? And if the collisions are this high, is there any benefit in joining on this manufactured key first, for the cost of the extra 4 bytes per row, given that you still have to join on the regular columns to handle the occasional collision?
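As a rough sanity check (my addition, not from the original post): for an ideal 32-bit hash, the birthday approximation puts the expected number of colliding pairs at about n(n-1)/(2·2^32), which for n = 5,651,744 works out to roughly 3,700 pairs. Seeing 18,733 colliding rows therefore suggests CHECKSUM() distributes noticeably worse than an ideal hash would. The estimate can be computed directly:

```sql
-- Birthday-bound estimate of colliding pairs for an ideal 32-bit hash:
-- expected pairs ~= n * (n - 1) / (2 * 2^32)
SELECT CAST(5651744 AS FLOAT) * 5651743 / (2 * 4294967296.0)
       AS EXPECTED_COLLIDING_PAIRS   -- on the order of 3,700
```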

Answer

HLGEM · Jun 22, 2009

I don't see where adding a checksum will get you anything with that level of collisions. Even one collision is too many, since it would cause you to join to the wrong data. If you can't guarantee you are joining to the correct record, any performance improvement is pointless because it comes at the cost of data integrity. This appears to be financial data, so you had better be really sure your queries won't return bad results: you could end up debiting or crediting the wrong accounts if there are any collisions.
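If the hash is used at all, it can only narrow the join, never replace it; every join must still compare the full natural key. A sketch of what that looks like (the name CUST_ACCT_DTL for the 37M-row table is my invention, substitute the real one):

```sql
SELECT  p.DATA_DT_ID, p.ACCT_NUM      -- whichever columns you actually need
FROM    CUST_ACCT_PRFTBLT AS p
        JOIN CUST_ACCT_DTL AS d       -- hypothetical name for the 37M-row table
          ON d.HASH_KEY     = p.HASH_KEY      -- cheap 4-byte comparison first
         AND d.DATA_DT_ID   = p.DATA_DT_ID    -- the full natural key is still
         AND d.BANK_NUM     = p.BANK_NUM      -- required to weed out collisions
         AND d.COST_CTR_NUM = p.COST_CTR_NUM
         AND d.GL_ACCT_NUM  = p.GL_ACCT_NUM
         AND d.ACCT_NUM     = p.ACCT_NUM
         AND d.APPN_CD      = p.APPN_CD
```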

If you do go this route, Marc is right that you should, if at all possible, pre-compute the hash (adding a computation that has to happen against every record in multimillion-record tables is not likely to improve performance, in my experience). If you can add the precomputed column (and you'll need triggers to keep it up to date), then you may not need to join on all six of the other columns to ensure no collisions. Then you might possibly see improved performance. All you can do is test your theory. But be very sure you don't have any collisions.
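One way to pre-compute in SQL Server 2005 without hand-written triggers is a PERSISTED computed column, which the engine maintains automatically and which can be indexed because CHECKSUM() over these columns is deterministic (column and index names below are my own suggestions):

```sql
-- A PERSISTED computed column is stored and kept current by the engine:
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD HASH_KEY AS CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                             GL_ACCT_NUM, ACCT_NUM, APPN_CD) PERSISTED;

CREATE INDEX IX_CUST_ACCT_PRFTBLT_HASH
    ON CUST_ACCT_PRFTBLT (HASH_KEY);
```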

Have you considered using a surrogate key, with a unique index on the six natural key fields, instead? Then you could join on the surrogate key, and that would likely improve performance a good bit. It can't be efficient to join on six columns (one of them a varchar) instead of one surrogate key. I realize that, given the size of the data, this might be harder to refactor than in a non-production system, but it really might be worth the downtime to permanently fix a persistent performance problem. Only you can say how complex this change would be and how hard it would be to change all the sps or queries to use the better join. However, it might be feasible to try.
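A minimal sketch of that refactor, assuming the table from the question (all names below are my own; the child table would also need a backfill and foreign key, omitted here):

```sql
-- Surrogate key, plus a unique index that still enforces the natural key:
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD CUST_ACCT_ID INT IDENTITY(1,1) NOT NULL;

CREATE UNIQUE INDEX UX_CUST_ACCT_PRFTBLT_NK
    ON CUST_ACCT_PRFTBLT (DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                          GL_ACCT_NUM, ACCT_NUM, APPN_CD);

-- The 37M-row table would then carry CUST_ACCT_ID instead of the six
-- natural-key columns, and joins collapse to a single INT comparison.
```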