pdftk will not decompress data streams

Question 1

pdf pdftk

James Duvall · Feb 25, 2013 · Viewed 9.4k times · Source

Answer

If you are not attached to pdftk, you can use qpdf. For instance, you could use:

$ qpdf --stream-data=uncompress input.pdf output.pdf

For what it is worth, if a stream contains binary blobs, they might still appear as binary. The rest of the stream will be uncompressed, whether you use pdftk or qpdf. qpdf's --stream-data option lets you choose whether streams are compressed, preserved as they are, or uncompressed.

From qpdf manual:

When --stream-data=uncompress is specified, qpdf will attempt to remove any non-lossy filters that it supports. This includes /FlateDecode, /LZWDecode, /ASCII85Decode, and /ASCIIHexDecode. This can be very useful for inspecting the contents of various streams.

The same could happen with pdftk.
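Since the question involves hundreds of files, the qpdf call above can be wrapped in a short script. This is only a sketch: it assumes qpdf is installed and on the PATH, and the function names and folder arguments are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst):
    """Return the argument list for the qpdf call shown in the answer."""
    return ["qpdf", "--stream-data=uncompress", str(src), str(dst)]


def uncompress_folder(in_dir, out_dir):
    """Run qpdf on every *.pdf in in_dir and write the results to out_dir.

    Returns the names of the files qpdf could not process.
    (Hypothetical helper; the folder layout is an assumption.)
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out_path / src.name
        result = subprocess.run(build_qpdf_command(src, dst))
        # qpdf exits with 0 on success and 3 when it succeeded with warnings.
        if result.returncode not in (0, 3):
            failed.append(src.name)
    return failed
```

Calling `uncompress_folder("nitro_pdfs", "uncompressed")` (both folder names are placeholders) would leave the originals untouched and return a list of files worth inspecting by hand.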

Question 2

I have been trying to work with pdftk to inspect information from compressed pdf streams created by Nitro Reader, but pdftk will not decompress the streams. It produces no errors, but it does not seem to do anything beyond reordering the pdf objects. Here is a minimal example of one of these pdfs.

    pdftk test.pdf output test-d.pdf uncompress

When I try pdftk on other pdfs, it seems to work fine. If I manually extract the data streams and decompress them using zlib in Python, they decompress properly. Also, if I open the pdf in Adobe Reader and re-save, pdftk works fine on the resulting pdf.
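For reference, the manual extraction described above might look roughly like the following. This is not the exact script used in the question, just a minimal sketch: it assumes every stream of interest is plain Flate data with no predictor or chained filters, and it locates streams with a regular expression rather than by parsing the object structure. The name `inflate_streams` is made up.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream".
# The lookbehind avoids matching the tail of "endstream" itself.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed contents of every zlib-compressed stream.

    Streams that are not valid zlib data are skipped silently.
    """
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        # decompressobj tolerates the end-of-line bytes before "endstream".
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(raw)
        except zlib.error:
            continue  # not Flate data, or a filter this sketch ignores
        if inflater.eof:  # keep only streams that decoded completely
            results.append(data)
    return results
```

A real parser would read each stream's /Length and /Filter entries instead of scanning for keywords, so treat this as a quick inspection aid only.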

I have manually inspected the Nitro pdf to the best of my ability, and it seems to be a valid pdf. I am very confused as to what is going on here.

As background to the problem, I have hundreds of these pdfs, and I am trying to search for certain keywords, which I should be able to do if I can automate the decompression.
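To make the goal concrete, the search step could be sketched like this once the files have been uncompressed. The function names, the folder argument, and the keyword list are all hypothetical, and the sketch assumes the keywords appear as literal ASCII text in the file; text that is split across drawing operators or stored in a custom font encoding would be missed.

```python
from pathlib import Path


def find_keywords(pdf_bytes, keywords):
    """Return the keywords that occur (case-insensitively) in the raw bytes."""
    haystack = pdf_bytes.lower()
    return [kw for kw in keywords if kw.lower().encode("latin-1") in haystack]


def scan_folder(folder, keywords):
    """Map each PDF file name in folder to the keywords found in it.

    Assumes the files in folder already have uncompressed streams.
    """
    hits = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        found = find_keywords(path.read_bytes(), keywords)
        if found:
            hits[path.name] = found
    return hits
```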

pdftk version 1.45
Windows 7 Home Premium SP1
Nitro Reader 2 version 2.5.0.36

Thanks, James
