i need to transfer large files across network and need to create checksum for them on hourly basis. so the speed for generating checksum is critical for me.
somehow i can't make zlib.crc32 and zlib.adler32 working with files larger than 4GB on Windows XP Pro 64bit machine. i suspect i've hit the 32bit limitation here? using hashlib.md5 i could get a result but the problem is the speed. it takes roughly about 5 minutes to generate an md5 for 4.8GB file. task manager shows that the process is using one core only.
my questions are:
PS: i'm working on somethimg similar like an "Asset Management" system, kind of like svn but the asset consist of large compressed image files. the files have tiny bit incremental changes. the hashing/checksum is needed for detecting changes and error detection.
It's an algorithm selection problem, rather than a library/language selection problem!
There appears to be two points to consider primarily:
Apparently, the answer to the second question is something like 'some false negative allowed' since the reliability of any 32 bits hash, relative to a 4Gb message, even in a moderately noisy channel, is not going to be virtually absolute.
Assuming that I/O can be improved through multithreading, we may choose a hash that doesn't require a sequential scan of the complete message. Instead we can maybe work the file in parallel, hashing individual sections and either combining the hash values or appending them, to form a longer, more reliable error detection device.
The next step could be to formalize this handling of files as ordered sections, and to transmit them as such (to be re-glued together at the recipient's end). This approach, along additional information about the way the files are produced (for ex. they may be exclusively modified by append, like log files), may even allow to limit the amount of hash calculation required. The added complexity of this approach needs to weighted against the desire to have zippy fast CRC calculation.
Side note: Alder32 is not limited to message sizes below a particular threshold. It may just be a limit of the zlib API. (BTW, the reference I found about zlib.adler32 used a buffer, and well... this approach is to be avoided in the context of our huge messages, in favor of streamed processes: read a little from file, calculate, repeat..)