PDF compare on linux command line

Christof Aenderl · Jun 24, 2011

I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.

Something like:

<tool> file1.pdf file2.pdf -o diff-out.pdf

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.

Answer

Kurt Pfeifle · Jun 27, 2011

I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

  1. ImageMagick's compare command
  2. the pdftk utility (if you have multipage PDFs)
  3. Ghostscript (optional)
  4. md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

  • Each pixel that remains unchanged becomes white.
  • Each pixel that got changed is painted in red.

That diff image is saved as a new PDF so that it is easy to view on different OS platforms.

I use this, for example, to discover minimal page display differences when font substitution comes into play during PDF processing.

It can happen that there is no visible difference between your PDFs even though their MD5 hashes and/or file sizes differ. In this case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white PDFs, so that you only have to inspect the non-white ones visually.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple single-page PDFs:

pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", that is likely the case.

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder \
       -log "%u %m:%l %e" \
        somewhere/file_1---page_001.pdf \
        somewhere/file_2---page_001.pdf \
       -compose src \
        somewhereelse/file_1--file_2---diff_page_001.pdf
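
To cover every page in one go, the compare call can be wrapped in a loop. The following is a minimal sketch, not the script mentioned above: the function name diff_all_pages and the COMPARE override (handy for dry runs) are made up for illustration, and the file naming is assumed to be the one used in the pdftk step. The -verbose, -debug and -log options are left out for brevity.

```shell
#!/bin/sh
# Sketch: create a diff PDF for every page pair produced by "pdftk ... burst".
# Assumes files named file_1---page_NNN.pdf and file_2---page_NNN.pdf.

diff_all_pages() {
    in_dir=$1     # directory holding the single-page PDFs
    out_dir=$2    # directory that receives the diff PDFs
    mkdir -p "$out_dir"
    for page1 in "$in_dir"/file_1---page_*.pdf; do
        [ -e "$page1" ] || continue        # the glob matched nothing
        num=${page1##*---page_}            # e.g. "001.pdf"
        num=${num%.pdf}                    # e.g. "001"
        page2=$in_dir/file_2---page_$num.pdf
        if [ ! -e "$page2" ]; then
            echo "page $num: no counterpart in file_2, skipped" >&2
            continue
        fi
        # compare exits with status 1 when the pages differ; that is the
        # expected case here, so it is not treated as a failure.
        # COMPARE is a hypothetical hook for swapping in another command.
        ${COMPARE:-compare} "$page1" "$page2" -compose src \
            "$out_dir/file_1--file_2---diff_page_$num.pdf" || true
    done
}
```

Called as "diff_all_pages somewhere somewhereelse", it leaves one diff PDF per page pair in somewhereelse and reports pages that exist in only one of the inputs.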

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using Ghostscript's ppmraw output device. This takes a few steps.

First, find out the page size of your PDF. The identify utility, which comes as part of any ImageMagick installation, reports it:

 identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

 export my_size=$(identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf)

Now Ghostscript comes into play, using a command line that includes the page size discovered above, as stored in the variable:

 gs \
   -o somewhereelse/file_1--file_2---diff_page_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
    somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

 gs \
   -o somewhereelse/file_1--file_2---whitepage_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
   -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.
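
If 72 dpi turns out to be too coarse, both gs calls need the same -r value, and the -g value has to grow with it, because -g is given in device pixels. The helper below is only a sketch under one assumption: that identify reported the size at its default density of 72 dpi, as in the command above. The name scale_size is hypothetical.

```shell
# Sketch: scale a "WIDTHxHEIGHT" size measured at 72 dpi to another resolution.
# Uses integer arithmetic and rounds to the nearest pixel.
scale_size() {
    size=$1                 # e.g. "595x842", as stored in my_size
    dpi=$2                  # target resolution, e.g. 150
    w=${size%x*}            # part before the "x"
    h=${size#*x}            # part after the "x"
    echo "$(( (w * dpi + 36) / 72 ))x$(( (h * dpi + 36) / 72 ))"
}
```

With this, the two gs calls would use -r150 together with -g$(scale_size "$my_size" 150) instead of -r72 and -g${my_size}.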

MD5 sum

Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs, and therefore rename or delete the diff PDF:

 MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
 MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')

 if [ "x${MD5_1}" = "x${MD5_2}" ]; then
     mv  \
       somewhereelse/file_1--file_2---diff_page_001.pdf \
       somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
     rm  \
       somewhereelse/file_1--file_2---*_page_001.ppm            # delete both PPMs
 fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
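
As a variation on the MD5 step, the two PPMs can also be compared byte for byte with cmp, which avoids computing hashes. This is a swapped-in technique rather than the method described above, and the function name flag_if_blank is made up for illustration. It assumes the diff PDF follows the ...---diff_page_NNN.pdf naming used throughout.

```shell
# Sketch: rename a diff PDF whose rendered page equals the white reference page.
flag_if_blank() {
    diff_pdf=$1      # e.g. somewhereelse/file_1--file_2---diff_page_001.pdf
    diff_ppm=$2      # PPM rendered from the diff PDF
    white_ppm=$3     # PPM of the empty white page
    # cmp -s prints nothing and succeeds only if both files are identical
    if cmp -s "$diff_ppm" "$white_ppm"; then
        prefix=${diff_pdf%---diff_page_*}     # everything before the marker
        suffix=${diff_pdf##*---diff_page_}    # e.g. "001.pdf"
        mv "$diff_pdf" "${prefix}---NODIFFERENCE_page_${suffix}"
        rm -f "$diff_ppm" "$white_ppm"        # delete both PPMs
    fi
}
```

Pages with real differences are left untouched, so whatever still carries "diff_page" in its name afterwards is what needs a visual check.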