Can one construct a "good" hash function using CRC32C as a base?

DavidD picture DavidD · Apr 22, 2010 · Viewed 8.7k times · Source

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?

Update How about this? Only 16 bits are suitable for a hash value. Fine. If your table is 65535 or less then great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set. Then, use that as an index into an array of tables. This works if your table is south of 1mm entries. I'd bet that's cheaper/faster that the best-performing hash functions. Now that GCC 4.5 has a CRC32 intrinsic it should be easy to test...if only I had the copious spare time to work on it.

David

Answer

mjv picture mjv · Apr 22, 2010

Revisited, August 2014
Prompted by Arnaud Bouchez in a recent comment, and in view of other answers and comments, I acknowledge that the original answer needs to be altered or for the least qualified. I left the original as-is, at the end, for reference.

First, and maybe most important, a fair answer to the question depends on the intended use of the hash code: What does one mean by "good" [hash function...]? Where/how will the hash be used? (e.g. is it for hashing a relatively short input key? Is it for indexing / lookup purposes, to produce message digests or yet other uses? How long is the desired hash code itself, all 32 bits [of CRC32 or derivatives thereof], more bits, fewer... etc?
The OP questions calls for "a faster general-purpose hash function", so the focus is on SPEED (something less CPU intensive and/or something which can make use of parallel processing of various nature). We may note here that the computation time for the hash code itself is often only part of the problem in an application of hash (for example if the size of the hash code or its intrinsic characteristics result in many collisions which require extra cycles to be dealt with). Also the requirement for "general purpose" leaves many questions as to the possible uses.

With this in mind, a short and better answer is, maybe:

Yes, the hardware implementations of CRC32C on newer Intel processors can be used to build faster hash codes; beware however that depending on the specific implementation of the hash and on its application the overall results may be sub-optimal because of the frequency of collisions, of the need to use longer codes. Also, for sure, cryptographic uses of the hash should be carefully vetted because the CRC32 algorithm itself is very weak in this regard.

The original answer cited a article on Evaluating Hash functions by Bret Mulvey and as pointed in Mdlg's answer, the conclusion of this article are erroneous in regards to CRC32 as the implementation of CRC32 it was based on was buggy/flawed. Despite this major error in regards to CRC32, the article provides useful guidance as to the properties of hash algorithms in general. The URL to this article is now defunct; I found it on archive.today but I don't know if the author has it at another location and also whether he updated it.

Other answers here cite CityHash 1.0 as an example of a hash library that uses CRC32C. Apparently, this is used in the context of some longer (than 32 bits) hash codes but not for the CityHash32() function itself. Also, the use of CRC32 by City Hash functions is relatively small, compared with all the shifting and shuffling and other operations that are performed to produce the hash code. (This is not a critique of CityHash for which I have no hands-on experience. I'll go on a limb, from a cursory review of the source code that CityHash functions produce good, e.g. ell distributed codes, but are not significantly faster than various other hash functions.)

Finally, you may also find insight on this issue in a quasi duplicate question on SO .


Original answer and edit (April 2010)

A priori, this sounds like a bad idea!.

CRC32 was not designed for hashing purposes, and its distribution is likely to not be uniform, hence making it a relatively poor hash-code. Furthermore, its "scrambling" power is relatively weak, making for a very poor one-way hash, as would be used in cryptographic applications.

[BRB: I'm looking for online references to that effect...]

Google's first [keywords = CRC32 distribution] hit seems to confirm this :
Evaluating CRC32 for hash tables

Edit: The page cited above, and indeed the complete article provides a good basis of what to look for in Hash functions.
Reading [quickly] this article, confirmed the blanket statement that in general CRC32 should not be used as a hash, however, and depending on the specific purpose of the hash, it may be possible to use, at least in part, a CRC32 as a hash code.

For example the lower (or higher, depending on implementation) 16 bits of the CRC32 code have a relatively even distribution, and, provided that one isn't concerned about the cryptographic properties of the hash code (i.e. for example the fact that similar keys produce very similar codes), it may be possible to build a hash code which uses, say, a concatenation of the lower [or higher] 16 bits for two CRC32 codes produced with the two halves (or whatever division) of the original key.
One would need to run tests to see if the efficiency of the built-in CRC32 instruction, relative to an alternative hash functions, would be such that the overhead of calling the instruction twice and splicing the code together etc. wouldn't result in an overall slower function.