I have a complex query which uses a lot of binary checksum function, when I was testing it with some test data for two distinct records it actually returned me same checksum value. Please find the test data I used below
SELECT BINARY_CHECKSUM(16 ,'EP30461105',1) AS BinaryCheckSumEx UNION ALL
SELECT BINARY_CHECKSUM(21 ,'EP30461155',1) AS BinaryCheckSumEx
Now I am trying to use HASHBYTES function with 'MD5' algorithm for which I can be certain to get unique records, but what concerns me now is that in the current query I use the 'Checksum' value to join in my 'Merge' statements to look for new records. Since 'HashBytes' returns me Varbinary data type how much of a performance overhead I can expect when I replace the join conditions with the 'HashByte' field.
SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))
And moreover I need to create hashing for multiple columns in which case I need to have an additional Concat function will this have additional overhead to my performance.
Here are the options:
Using index on hash as VARBINARY
Using BINARY_CHECKSUM and CHECKSUM
However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx
I would also be careful about converting the hashed value to BIGINT given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5 -- are greater than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). And converting binary values larger than 8 bytes to BIGINTsilently truncates the values. Hence you lose accuracy and increasing occurrences of false positives. The following query shows this behavior:
SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF), -- 7 bytes = 72057594037927935
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF), -- 8 bytes = -1
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF), -- 9 bytes = -1
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF) -- 10 bytes = -1
Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary
a) If you're using SQL 2008 or above
SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', CONTENT),2)
b) If you're using SQL 2005
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)
PS: If you wonder which Hash Algorithm you should use:
MD5 = 16 bytes
SHA1 = 20 bytes
SHA2_256 = 32 bytes
SHA2_512 = 64 bytes
Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/
For your second question, you should make Hash columns PERSISTED, to avoid the impact on running each query.