Most PDFs contain lots of binary-looking parts in between some ASCII. But I also remember having seen PDFs where such binary parts were largely absent, and one could open them in a text editor to study their structure.
Is there a trick, tool, or command that will convert binary PDF parts to ASCII/ANSI? (Preferably "free as in beer" or even "free as in liberty")
[Updated 2014-10-15]
Ghostscript has a small utility program written in PostScript in its source code repository. It's called pdfinflt.ps. If you are lucky, it may already slumber in a 'toolbin' subdirectory of your Ghostscript installation location. Otherwise, get it from the Ghostscript source code repository.
Now run it together with your targeted input PDF through the Ghostscript interpreter:
gswin32c.exe -- c:/path/to/pdfinflt.ps your-input.pdf deflated-output.pdf
pdfinflt.ps will (try to) expand all 'streams' contained in the PDF which use the following compression filters/methods: /FlateDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode.
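To make "expanding a stream" concrete, here is a minimal Python sketch (not part of pdfinflt.ps) that undoes three of these filters on raw stream bytes with the standard library only. The function name decode_stream is my own. It assumes you have already cut the bytes between "stream" and "endstream" out of the file, it ignores any /DecodeParms predictor, and it leaves out /LZWDecode because the standard library has no LZW decoder.

```python
import base64
import binascii
import zlib


def decode_stream(data: bytes, filter_name: str) -> bytes:
    """Undo one PDF stream filter (hypothetical helper, stdlib only)."""
    if filter_name == "/FlateDecode":
        # FlateDecode data is a zlib stream.
        return zlib.decompress(data)
    if filter_name == "/ASCIIHexDecode":
        # Hex digits, whitespace allowed, terminated by '>'.
        hex_digits = b"".join(data.split(b">", 1)[0].split())
        if len(hex_digits) % 2:
            hex_digits += b"0"  # an odd final digit is padded with 0
        return binascii.unhexlify(hex_digits)
    if filter_name == "/ASCII85Decode":
        # PDF terminates the data with '~>' and allows whitespace inside it.
        body = b"".join(data.split(b"~>", 1)[0].split())
        return base64.a85decode(body)
    raise ValueError(f"filter not handled in this sketch: {filter_name}")
```

A stream with several filters (for example [/ASCII85Decode /FlateDecode]) would need one call per filter, in the order listed.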
It will not attempt to remove /RunLengthDecode, /CCITTFaxDecode, /DCTDecode, /JBIG2Decode and /JPXDecode. (Compressed/binary fonts will also pass unchanged into the output PDF.)
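If you want a rough idea beforehand of which filters a given PDF uses, something like the following Python sketch may help. It is only a heuristic of my own: it scans the raw file bytes for /Filter entries, so it cannot see dictionaries hidden inside compressed object streams, and it may occasionally match bytes inside binary data. The names count_filters, EXPANDED and LEFT_ALONE are made up for this example; the two sets simply mirror the lists above.

```python
import re
from collections import Counter

# Filters pdfinflt.ps tries to expand, and those it leaves alone (per the text above).
EXPANDED = {"FlateDecode", "LZWDecode", "ASCII85Decode", "ASCIIHexDecode"}
LEFT_ALONE = {"RunLengthDecode", "CCITTFaxDecode", "DCTDecode",
              "JBIG2Decode", "JPXDecode"}

# Matches "/Filter /Name" as well as "/Filter [/Name1 /Name2]".
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count filter names found in uncompressed parts of the file."""
    counts = Counter()
    for match in _FILTER_RE.finditer(pdf_bytes):
        for name in _NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts
```

Usage would be along the lines of count_filters(open("your-input.pdf", "rb").read()), then comparing the keys against EXPANDED and LEFT_ALONE.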
If you are in an adventurous mood, you may dare to uncomment those lines in the utility which disable /RunLengthDecode, /DCTDecode and /CCITTFaxDecode and see if it still works...
qpdf
Another useful tool to transform a PDF into an internal format that enables text editor access is qpdf. It is a "command-line program that does structural, content-preserving transformations on PDF files".
Example usage:
qpdf \
--qdf \
--object-streams=disable \
input-with-compressed-objects.pdf \
output-with-expanded-objects.pdf
The output of the QDF mode enforced by the --qdf switch organizes and re-orders the objects neatly. It adds comments to track the original object IDs and page content streams. All object dictionaries are written into a "normalized" standard format for easier parsing.
The --object-streams=disable switch causes the extraction of individual objects that are compressed into another object's stream data (and would otherwise not be recognizable).
mutool
Artifex, the creators of Ghostscript, offer another tool that is available under a Free and Open Source Software license: MuPDF.
MuPDF comes with a command line tool, mutool, which can also expand compressed PDF object streams:
mutool \
clean \
-d \
-a \
input.pdf \
output.pdf \
4,7,8,9
clean: re-writes the PDF;
-d: de-compresses all streams;
-a: ASCIIhex encodes all binary streams;
4,7,8,9: selects pages 4, 7, 8 and 9 for inclusion in output.pdf.
pdftk
Last, here is how to use the pdftk tool to uncompress PDF objects' streams:
pdftk your-input.pdf cat output uncompressed-output.pdf uncompress
Note the final uncompress word in the command line.
All of the above tools are available for Linux, Mac OS X, Unix and Windows.
My own favorite is QPDF for most practical cases.
However, you should make your own experiments and compare the (different) output of each of the suggested tools. Then make your own pick.
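For such a comparison, a small Python wrapper along these lines could save some typing. It is only a sketch: build_commands and run_available are names I made up, the command lines are the ones shown above, and the Ghostscript/pdfinflt.ps variant is left out because the script's location differs per installation. Tools that are not on your PATH are skipped.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(src: str, out_dir: str) -> dict:
    """Map each tool name to the argv list that writes <out_dir>/<tool>.pdf."""
    out = Path(out_dir)
    return {
        "qpdf": ["qpdf", "--qdf", "--object-streams=disable",
                 src, str(out / "qpdf.pdf")],
        "mutool": ["mutool", "clean", "-d", "-a",
                   src, str(out / "mutool.pdf")],
        "pdftk": ["pdftk", src, "cat", "output",
                  str(out / "pdftk.pdf"), "uncompress"],
    }


def run_available(src: str, out_dir: str) -> dict:
    """Run every installed tool; return the size in bytes of each output file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    sizes = {}
    for tool, argv in build_commands(src, out_dir).items():
        if shutil.which(argv[0]) is None:
            continue  # tool not installed, skip it
        # check=False: a nonzero exit status does not abort the comparison.
        subprocess.run(argv, check=False)
        produced = Path(out_dir) / f"{tool}.pdf"
        if produced.exists():
            sizes[tool] = produced.stat().st_size
    return sizes
```

Calling run_available("your-input.pdf", "expanded") would then leave one output file per installed tool in the "expanded" directory, ready to be opened side by side in a text editor.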