VBA hash string

nixda picture nixda · Feb 5, 2013 · Viewed 47.3k times · Source

How do I get a short hash of a long string using Excel VBA

Whats given

  • Input string is not longer than 80 characters
  • Valid input characters are: [0..9] [A_Z] . _ /
  • Valid output characters are [0..9] [A_Z] [a_z] (lower and upper case can be used)
  • The output hash shouldn't be longer than ~12 characters (shorter is even better)
  • No need to be unique at all since this will result in a too long hash

What I have done so far

I thought this SO answer is a good start since it generates a 4-digit Hex-Code (CRC16).

But 4 digits were to little. In my test with 400 strings 20% got a duplicate somewhere else.
The chance to generate a collision is too high.

Sub tester()
    For i = 2 To 433
        Cells(i, 2) = CRC16(Cells(i, 1))
    Next i
End Sub


Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String

Crc = &HFFFF

For nC = 1 To Len(txt)
    j = Val("&H" + Mid(txt, nC, 2))
    Crc = Crc Xor j
    For j = 1 To 8
        mask = 0
        If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
        Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
    Next j
Next nC

CRC16 = Hex$(Crc)
End Function

How to reproduce

You can copy these 400 test strings from pastebin.
Paste them to column A in a new Excel workbook and execute the code above.

Q: How do I get a string hash which is short enough (12 chars) and long enough to get a small percentage of duplicates.

Answer

nixda picture nixda · Feb 7, 2013

Maybe others will find this useful.

I have collected some different functions to generate a short hash of a string in VBA.
I don't take credit for the code and all sources are referenced.

enter image description here

  1. CRC16
    • Function: =CRC16HASH(A1) with this Code
    • hash is a 4 characters long HEX string
    • 19 code lines
    • 4 digits long hash = 624 collisions in 6895 lines = 9 % collision rate
  2. CRC16 numeric
    • Function: =CRC16NUMERIC(A1) with this Code
    • hash is a 5 digits long number
    • 92 code lines
    • 5 digits long hash = 616 collisions in 6895 lines = 8.9 % collision rate
  3. CRC16 twice
    • Function: =CRC16TWICE(A1) with this Code
    • hash is a 8 characters long HEX string
    • hash can be expanded to 12/16/20 etc. characters to reduce collision rate even more
    • 39 code lines
    • 8 digits long hash = 18 collisions in 6895 lines = 0.23 % collision rate
  4. SHA1
    • Function: =SHA1TRUNC(A1) with this Code
    • hash is a 40 characters long HEX string
    • 142 code lines
    • can be truncated
    • 4 digits hash = 726 collisions in 6895 lines = 10.5 % collision rate
    • 5 digits hash = 51 collisions in 6895 lines = 0.73 % collision rate
    • 6 digits hash = 0 collisions in 6895 lines = 0 % collision rate
  5. SHA1 + Base64
    • Function: =BASE64SHA1(A1) with this Code
    • hash is a 28 characters long unicode string (case sensitive + special chars)
    • 41 code lines
    • requires .NET since it uses library "Microsoft MSXML"
    • can be truncated
    • 4 digits hash = 36 collisions in 6895 lines = 0.5 % collision rate
    • 5 digits hash = 0 collisions in 6895 lines = 0 % collision rate

Here is my test workbook with all example functions and a big number of test strings.

Feel free to add own functions.