I am trying to find a collision between two messages that will lead to the same CRC hash. Considering I am using CRC32, is there any way I can shorten the list of possible messages I have to try when doing a brute force attack?
Any links to websites with hints on this will be helpful. I already have a brute force algorithm that will do this but it simply increment integers and sees if it will match other hashes.
It depends entirely on what you mean by "message". If you can append four bytes of gibberish to one of the messages. (I.E. four bytes that have no meaning within the context of the message.) Then it becomes trivial in the truest sense of the word.
Thinking in terms of bits moving through the CRC32 state machine.
CRC32 is based on a galois feedback shift register, each bit in its state will be replaced with the induction of 32 bits from the payload data. At the induction of each bit, the positions indicated by the polynomial will be exclusive ored with the sequence observed from the end of the Shift register. This sequence is not influenced by the input data until the shift register has been filled.
As an example, imagine we have a shift register filled with initial state 10101110, polynomial 10000011, and filling with unknown bits, X.
Polynomial * ** |feedback (End of SR.)
State 10101110 0
State X1010111 1
State XX101000 0
State XXX10100 0
State XXXX1010 0
State XXXXX101 1
State XXXXXX01 1
State XXXXXXX1 1
State XXXXXXXX 0
The feedback isn't in terms of X until the SR has been filled! So in order to generate a message with a predetermined checksum, you take your new message, generate it's CRC and work out it's next 32 bits of feedback. This you can do in 32 steps of the CRC function. You then need to calculate the effect this feedback has on the contents of the shift register.
A shortcut for doing this is to pad your message with four zero bytes and then look at the checksum. (Checksum is the state of the SR at the end, which if padded with four zero bytes is the influence of the feedback and the empty bytes.)
Exclusive OR that influence with the checksum value you want, replace the four byte trailer with that computed value and regenerate the checksum. You could do this with any program that generates CRC32, a hex editor, and a calculator that can handle hex.
If you want to generate two messages that both make complete sense and don't contain trailing garbage, things get a little harder. Identify a number of sections that you can write plausible alternatives, with exactly the same length.
Using english prose as an example. "I think that this can work" and "I believe in this approach" Have broadly similar meanings, and exactly the same length.
Identifying enough examples in your message is the tricky bit (Unless you want to cheat with whitespace!) CRC 32 is linear, provided the data has the correct offset within the message. So CRC([messagea][padding])^CRC([padding][messageb])=CRC([messagea][messageb]) There are some caveats with word alignment that you'll need to cope with, as a general hint, you want to extend the passages out into the "fixed" parts of the message. As a general rule you want to have alternatives for n*1.5 passages, where n is the size of the CRC.
You can now calculate the CRC that the skeletal message has, the impression that each alternative passage would have on it, and then draw up a table comparing the influence that each alternative for each passage would have. You then need to select alternatives that will modify the skeletal CRC to match the CRC you want. That problem is actually quite fun to solve, First off find any alternatives that uniquely modify a bit, if that bit needs to change for your CRC, select that alternative and fold it's influence into the CRC, then go round again. That should reduce the solution space that you then need to search.
That's quite a tough thing to code up, but it would generate your collisions in a very short time span.