I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create the diff PDFs in a batch process. The PDF files are construction plans, so a pure text compare doesn't work.
Something like:
<tool> file1.pdf file2.pdf -o diff-out.pdf
Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.
Any other solution is also welcome.
I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:
ImageMagick's compare command
the pdftk utility (if you have multipage PDFs)
Ghostscript (optional)
md5sum (optional)
It should be quite easy to port this to a .bat batch file for DOS/Windows.
But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff".
That diff image is saved as a new PDF to make it better accessible on different OS platforms.
I'm using this for example to discover minimal page display differences when font substitution in PDF processing comes into play.
It can happen that there is no visible difference between your PDFs even though their MD5 hashes and/or file sizes differ. In this case the "diff" output PDF page is all white. You can detect this condition automatically and rename or delete the all-white diffs, so that you only have to visually inspect the non-white ones.
Here are the building blocks:
Use the pdftk command line utility to split multipage PDF files into multiple single-page PDFs:
pdftk file_1.pdf burst output somewhere/file_1---page_%03d.pdf
pdftk file_2.pdf burst output somewhere/file_2---page_%03d.pdf
If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
Use the compare command line utility from ImageMagick to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder \
-log "%u %m:%l %e" \
somewhere/file_1---page_001.pdf \
somewhere/file_2---page_001.pdf \
-compose src \
somewhereelse/file_1--file_2---diff_page_001.pdf
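Since the question asks for batch processing, the single compare call above can be wrapped in a loop over all burst pages. This is only a minimal sketch: the helper name diff_name is made up for illustration, the directories somewhere and somewhereelse are the placeholders used above, and the logging options (-verbose, -debug, -log) are left out for brevity.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): map a single-page
# file of file_1 to the name of its diff PDF, following the naming used above.
diff_name() {
    page=${1##*---page_}    # keep what follows "---page_", e.g. "001.pdf"
    printf 'somewhereelse/file_1--file_2---diff_page_%s\n' "$page"
}

for p1 in somewhere/file_1---page_*.pdf; do
    [ -e "$p1" ] || continue                        # glob matched nothing
    p2="somewhere/file_2---page_${p1##*---page_}"   # same page of file_2
    if [ ! -e "$p2" ]; then
        echo "no counterpart for $p1, skipping" >&2
        continue
    fi
    compare "$p1" "$p2" -compose src "$(diff_name "$p1")"
done
```

Pages that exist in only one of the two PDFs are reported and skipped, because compare needs both inputs.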
Because of automatically inserted meta data (such as the current date+time), PDF output is not working well for MD5hash-based file comparisons.
If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using Ghostscript's ppmraw output device. You can do that like this:
First, find out what the page size of your PDF is. The little utility identify, which comes as part of any ImageMagick installation, reports it:
identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf
You can store this value in an environment variable like this:
export my_size=$(identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf)
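Because the value is later passed verbatim to Ghostscript, it may be worth checking that it really has the form WIDTHxHEIGHT. If the PDF unexpectedly contains more than one page, for example, identify may print one value per page without a separator. The following is a small sketch under that assumption; the function name valid_size is hypothetical.

```shell
# Hypothetical helper: succeed only if $1 looks like WIDTHxHEIGHT in whole pixels.
valid_size() {
    case "$1" in
        ''|*[!0-9x]*) return 1 ;;   # empty, or characters other than digits and "x"
        x*|*x|*x*x*)  return 1 ;;   # a number is missing, or there is more than one "x"
        *x*)          return 0 ;;   # exactly one "x" with digits on both sides
        *)            return 1 ;;   # digits only, no "x"
    esac
}

if ! valid_size "${my_size:-}"; then
    echo "unexpected page size '${my_size:-}', expected something like 595x842" >&2
fi
```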
Now Ghostscript comes into play, with a command line that includes the page size stored in that variable:
gs \
-o somewhereelse/file_1--file_2---diff_page_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
somewhereelse/file_1--file_2---diff_page_001.pdf
This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:
gs \
-o somewhereelse/file_1--file_2---whitepage_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
-c "showpage"
The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.
Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:
MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')
if [ "x${MD5_1}" = "x${MD5_2}" ]; then
mv \
somewhereelse/file_1--file_2---diff_page_001.pdf \
somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
rm \
somewhereelse/file_1--file_2---*_page_001.ppm # delete both PPMs
fi
This spares you from having to visually inspect "diff PDFs" that do not have any differences.
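For more than one page, the MD5 check can be put into a small function and applied to every diff page in turn. This is a sketch, not the original script: the function name same_md5 is hypothetical, and it assumes that the diff and whitepage PPMs for each page were already created with the Ghostscript commands shown above.

```shell
# Hypothetical helper: succeed if both files have the same MD5 hash.
same_md5() {
    sum_a=$(md5sum < "$1") || return 2    # output looks like "<hash>  -"
    sum_b=$(md5sum < "$2") || return 2
    [ "${sum_a%% *}" = "${sum_b%% *}" ]   # compare only the hash part
}

for diff_ppm in somewhereelse/file_1--file_2---diff_page_*.ppm; do
    [ -e "$diff_ppm" ] || continue        # glob matched nothing
    page=${diff_ppm##*---diff_page_}      # e.g. "001.ppm"
    page=${page%.ppm}                     # e.g. "001"
    white_ppm="somewhereelse/file_1--file_2---whitepage_${page}.ppm"
    diff_pdf="somewhereelse/file_1--file_2---diff_page_${page}.pdf"
    if same_md5 "$diff_ppm" "$white_ppm"; then
        # all-white diff: mark the PDF and drop both PPMs
        mv "$diff_pdf" "somewhereelse/file_1--file_2---NODIFFERENCE_page_${page}.pdf"
        rm "$diff_ppm" "$white_ppm"
    fi
done
```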