How to convert PDF binary parts into ASCII/ANSI so I can look at it in a text editor?

simplybest55 · Aug 10, 2010

Most PDFs contain lots of binary-looking parts in between some ASCII. But I remember having seen PDFs where such binary parts were largely absent, so one could open them in a text editor to study their structure.

Is there a trick, tool, or command that will convert binary PDF parts to ASCII/ANSI? (Preferably "free as in beer" or even "free as in liberty")

Answer

Kurt Pfeifle · Aug 14, 2010

[Updated 2014-10-15]

Using Ghostscript

Ghostscript has a small utility program, written in PostScript, in its source code repository. It's called pdfinflt.ps. If you are lucky, it may already slumber in a 'toolbin' subdirectory of your Ghostscript installation location. Otherwise, get it from the Ghostscript source code repository.

Now run it together with your targeted input PDF through the Ghostscript interpreter:

gswin32c.exe -- c:/path/to/pdfinflt.ps your-input.pdf deflated-output.pdf

pdfinflt.ps will (try to) expand all 'streams' contained in the PDF which use the following compression filters/methods: /FlateDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode.

It will not attempt to remove /RunLengthDecode, /CCITTFaxDecode, /DCTDecode, /JBIG2Decode and /JPXDecode. (Compressed/binary fonts will also pass unchanged into the output PDF.)

If you are in an adventurous mood, you may dare to uncomment those lines in the utility which disable /RunLengthDecode, /DCTDecode and /CCITTFaxDecode and see if it still works...
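
To get a feeling for what "expanding a stream" means: /FlateDecode is zlib/deflate compression, so a stream that uses only this filter can be inflated with a few lines of Python. What follows is a minimal sketch under that assumption, not a description of how pdfinflt.ps works internally. The helper name inflate_streams is made up for this illustration, and the raw byte scan is only a rough heuristic: it ignores filter chains, predictors and encryption.

```python
import re
import zlib

# "stream" keyword followed by an end-of-line, but not the tail of "endstream".
STREAM_START = re.compile(rb"(?<!end)stream\r?\n")


def inflate_streams(pdf_bytes):
    """Return the inflated payload of every stream that zlib can decode.

    Hypothetical helper: it scans raw bytes instead of parsing the PDF.
    """
    decoded = []
    for match in STREAM_START.finditer(pdf_bytes):
        start = match.end()
        end = pdf_bytes.find(b"endstream", start)
        if end == -1:
            break
        inflater = zlib.decompressobj()
        try:
            payload = inflater.decompress(pdf_bytes[start:end])
        except zlib.error:
            # Not zlib data (for example a JPEG image under /DCTDecode).
            continue
        if inflater.eof:  # keep only streams that inflated completely
            decoded.append(payload)
    return decoded


# Hand-made fragment for illustration, not a complete PDF file.
content = b"BT /F1 12 Tf (Hello) Tj ET"
packed = zlib.compress(content)
fragment = (
    b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(fragment) == [content])
```

This only lets you peek at stream contents. It does not write a valid PDF back to disk, which is why a real tool is the better choice for anything beyond a quick look.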


Using qpdf

Another useful tool for transforming a PDF into a form that can be inspected in a text editor is qpdf. It is a "command-line program that does structural, content-preserving transformations on PDF files".

Example usage:

 qpdf                                  \
   --qdf                               \
   --object-streams=disable            \
     input-with-compressed-objects.pdf \
     output-with-expanded-objects.pdf

  1. The --qdf switch turns on QDF mode, whose output organizes and re-orders the objects neatly. It adds comments to track the original object IDs and page content streams. All object dictionaries are written into a "normalized" standard format for easier parsing.

  2. The --object-streams=disable switch causes the extraction of (otherwise not recognizable) individual objects that are compressed into another object's stream data.
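
Whether --object-streams=disable makes a difference for a given file depends on whether the file contains object streams at all. A rough way to check is to count the /Type /ObjStm dictionaries in the raw bytes. This is only a sketch: count_object_streams is a hypothetical helper name, and a byte scan is a heuristic, not a real PDF parse.

```python
import re
import sys

# Object stream dictionaries carry "/Type /ObjStm" (whitespace is optional).
OBJSTM_RE = re.compile(rb"/Type\s*/ObjStm\b")


def count_object_streams(pdf_bytes):
    """Count dictionaries that look like object streams (heuristic scan)."""
    return len(OBJSTM_RE.findall(pdf_bytes))


if __name__ == "__main__":
    # Usage (file names are up to you): python count_objstm.py a.pdf b.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, count_object_streams(handle.read()))
```

Run against the input and the output of the qpdf command above, the count for the output file should drop to zero if the switch did its job.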


Using mutool

Artifex, the creators of Ghostscript, offer another tool that is available under a Free and Open Source Software license: MuPDF.

MuPDF comes with a command line tool, mutool, which can also expand compressed PDF object streams:

 mutool        \
    clean      \
   -d          \
   -a          \
    input.pdf  \
    output.pdf \
    4,7,8,9

  1. clean: re-writes the PDF;
  2. -d: de-compresses all streams;
  3. -a: ASCIIhex encodes all binary streams;
  4. 4,7,8,9: selects pages 4, 7, 8 and 9 for inclusion in output.pdf.
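
For those wondering what "ASCIIhex encoding" amounts to: under the PDF /ASCIIHexDecode filter every byte is written as two hexadecimal digits, whitespace is ignored, and ">" marks the end of the data. Here is a small sketch of both directions. The function names are made up for this illustration, and the exact line layout that mutool writes may differ from what asciihex_encode produces.

```python
import binascii


def asciihex_encode(data):
    """Encode bytes as hex digits followed by the '>' end-of-data marker."""
    return binascii.hexlify(data) + b">"


def asciihex_decode(text):
    """Decode /ASCIIHexDecode data: stop at '>', ignore whitespace."""
    digits = text.split(b">", 1)[0]       # everything before the end marker
    digits = b"".join(digits.split())     # drop spaces, tabs and line breaks
    if len(digits) % 2:
        digits += b"0"                    # an odd final digit is padded with 0
    return binascii.unhexlify(digits)


raw = bytes([0x00, 0xFF, 0x10, 0x80])
encoded = asciihex_encode(raw)
print(encoded)
print(asciihex_decode(encoded) == raw)
```

The price for the text editor friendliness is size: the hex form takes two characters per original byte.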

Using pdftk

Last, here is how to use the pdftk tool to uncompress a PDF's object streams:

pdftk your-input.pdf cat output uncompressed-output.pdf uncompress

Note the final uncompress word in the command line.


Pick your favorite

All of the above tools are available for Linux, Mac OS X, Unix and Windows.

My own favorite is qpdf for most practical cases.

However, you should make your own experiments and compare the (different) output of each of the suggested tools. Then make your own pick.
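
One way to put numbers on such a comparison is to measure how much of each output file is plain printable ASCII and how many /FlateDecode filters are left in it. The script below is only a sketch of that idea: printable_ratio and summarize_bytes are hypothetical helper names, and the metrics are crude (a /FlateDecode count, for instance, says nothing about images or fonts that a tool deliberately leaves alone).

```python
import string
import sys

# Bytes a text editor shows without trouble: letters, digits,
# punctuation and ASCII whitespace.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def printable_ratio(data):
    """Fraction of bytes that are printable ASCII (1.0 for empty input)."""
    if not data:
        return 1.0
    return sum(byte in TEXT_BYTES for byte in data) / len(data)


def summarize_bytes(data):
    """Collect a few crude metrics about one PDF's raw bytes."""
    return {
        "bytes": len(data),
        "printable": round(printable_ratio(data), 3),
        "flate_filters": data.count(b"/FlateDecode"),
    }


if __name__ == "__main__":
    # Usage: pass the output files of the different tools as arguments.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, summarize_bytes(handle.read()))
```

A higher printable ratio and a lower filter count suggest an output that is easier to read in a text editor, but looking at the files yourself is still the real test.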