Algorithm for efficient diffing of huge files

usr · Jan 8, 2010 · Viewed 7.3k times

I have to store two files A and B, both of which are very large (around 100 GB each). However, B is likely to be similar to A in large parts, so I could store A and diff(A, B). There are two interesting aspects to this problem:

  1. The files are too big to be analyzed by any diff library I know of, because those libraries work in memory.
  2. I don't actually need a diff - a typical diff has inserts, edits and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from the old file at an arbitrary offset" (see the sketch below).
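
To make that format concrete, here is a minimal sketch in Python of the kind of delta I have in mind (the names Copy, Insert and apply_delta are purely illustrative); producing such a delta efficiently is the part I am missing:

    # Hypothetical two-operation delta format: "copy a byte range from the
    # old file A at an arbitrary offset" and "insert new bytes".
    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Copy:
        src_offset: int  # where to read in the old file A
        length: int      # how many bytes to copy

    @dataclass
    class Insert:
        data: bytes      # literal bytes that only exist in B

    Delta = List[Union[Copy, Insert]]

    def apply_delta(old_path: str, delta: Delta, out_path: str) -> None:
        # Rebuild B by replaying the operations against A.
        with open(old_path, "rb") as old, open(out_path, "wb") as out:
            for op in delta:
                if isinstance(op, Copy):
                    old.seek(op.src_offset)
                    out.write(old.read(op.length))
                else:
                    out.write(op.data)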

I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: write an algorithm that can store files A and B with as few bytes as possible, given that both are quite similar.

Additional info: Although big parts might be identical, they are likely to sit at different offsets and be out of order. That last fact is why a conventional diff might not save much.

Answer

martinus · Jan 9, 2010

You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:

  1. Create a signature of one file, e.g.:

    rdiff signature A sig.txt
    
  2. Using the generated signature file sig.txt and the other big file, create the delta:

    rdiff delta sig.txt B delta
    
  3. Now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run:

    rdiff patch A delta B
    

In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast; I get about 40 MB per second on my PC. I just tried it on an 8 GB file, and the memory used by rdiff was about 1 MB.
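
If you want to drive those three steps from a script, here is a minimal sketch in Python that shells out to the rdiff binary. It assumes rdiff is on your PATH and uses the same file names A, B, sig.txt and delta as above; B.restored is just an illustrative output name for the verification step:

    import filecmp
    import subprocess

    def run(*cmd: str) -> None:
        # Run an rdiff command and fail loudly if it returns non-zero.
        subprocess.run(cmd, check=True)

    # 1. Signature of the old file A.
    run("rdiff", "signature", "A", "sig.txt")

    # 2. Delta that transforms A into B, built from the signature and B.
    run("rdiff", "delta", "sig.txt", "B", "delta")

    # 3. Recreate B from A plus the delta, then check it matches the original.
    run("rdiff", "patch", "A", "delta", "B.restored")
    assert filecmp.cmp("B", "B.restored", shallow=False), "reconstruction differs from B"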