I have been trying to use pdftk to inspect the contents of compressed PDF streams created by Nitro Reader, but pdftk will not decompress the streams. It produces no errors, but it does not seem to do anything beyond reordering the PDF objects. Here is a minimal example of one of these PDFs.
pdftk test.pdf output test-d.pdf uncompress
When I try pdftk on other pdfs, it seems to work fine. If I manually extract the data streams and decompress them using zlib in Python, they decompress properly. Also, if I open the pdf in Adobe Reader and re-save, pdftk works fine on the resulting pdf.
I have manually inspected the Nitro pdf to the best of my ability, and it seems to be a valid pdf. I am very confused as to what is going on here.
As background to the problem, I have hundreds of these PDFs, and I am trying to search them for certain keywords, which I should be able to do if I can automate the decompression.
pdftk version 1.45
Windows 7 Home Premium SP1
Nitro Reader 2 version 2.5.0.36
Thanks, James
If you are not attached to pdftk, you can use qpdf. For instance, you could use:
$ qpdf --stream-data=uncompress input.pdf output.pdf
For what it is worth, if there are blobs, they might still appear as binary, although the rest of the stream data will be uncompressed (with either pdftk or qpdf). qpdf allows you to uncompress all or only the streams.
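Since the question mentions hundreds of files, the command above can be wrapped in a small batch script. This is only a sketch: it assumes qpdf is installed and on the PATH, and the helper names qpdf_command and uncompress_all are made up for illustration.

```python
import subprocess
from pathlib import Path


def qpdf_command(src, dst):
    """Build the qpdf invocation shown above for a single file."""
    return ["qpdf", "--stream-data=uncompress", str(src), str(dst)]


def uncompress_all(in_dir, out_dir):
    """Run qpdf on every *.pdf in in_dir and return {file name: exit code}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out / src.name
        # Assumes the qpdf executable is on the PATH; otherwise
        # subprocess.run raises FileNotFoundError.
        proc = subprocess.run(qpdf_command(src, dst),
                              capture_output=True, text=True)
        # qpdf documents exit code 0 for success, 3 for success with
        # warnings, and 2 for errors.
        results[src.name] = proc.returncode
    return results
```

Calling uncompress_all("pdfs", "pdfs-uncompressed") (both directory names are placeholders) writes one uncompressed copy per input file, which you can then search with whatever text tool you prefer.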
From the qpdf manual:
When --stream-data=uncompress is specified, qpdf will attempt to remove any non-lossy filters that it supports. This includes /FlateDecode, /LZWDecode, /ASCII85Decode, and /ASCIIHexDecode. This can be very useful for inspecting the contents of various streams.
The same could happen with pdftk.
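If neither tool cooperates, the manual zlib approach you describe in the question can also be automated. The following is a rough sketch, not a real PDF parser: it simply scans the raw bytes for stream ... endstream pairs and tries to inflate each one. The names inflate_streams and find_keywords are hypothetical, and it only handles streams whose sole filter is /FlateDecode.

```python
import re
import zlib

# Naive pattern: everything between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the decompressed data of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline left after the
            # compressed data, which zlib.decompress would reject.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue  # not zlib data (uncompressed or another filter)


def find_keywords(pdf_bytes, keywords):
    """Return the subset of keywords found in any inflated stream."""
    found = set()
    for data in inflate_streams(pdf_bytes):
        for keyword in keywords:
            if keyword.encode("latin-1") in data:
                found.add(keyword)
    return found
```

Keep in mind that text in a content stream is not always stored as plain literal strings (it may be hex-encoded or use a font-specific encoding), so a byte search like this can miss keywords that are visible on the page.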