pdftk Error: Failed to open PDF file:

Manoj Dhiman picture Manoj Dhiman · May 15, 2015 · Viewed 21.8k times · Source

I am using pdftk library to extract the form fields from the pdf .Everything is just running fine except the one issue that i got a pdf file pdf file link. which causes the error is given bellow

Error: Failed to open PDF file: 
   http://www.uscis.gov/sites/default/files/files/form/i-9.pdf
Done.  Input errors, so no output created.

command for this is

root@ri8-MS-7788:/home/ri-8# pdftk http://192.168.1.43/form/i-9.pdf dump_data_fields

the same command is working for all other forms .

Attempt1

I have tried to encrypt the pdf to unsafe version but it produce the same error . here is the command

pdftk http://192.168.1.43/forms/i-9.pdf input_pw foopass output /var/www/forms/un-i-9.pdf

Update

this is my full function to handle this

public function Formanalysis($pdfname) 
    {
         $pdffile=Yii::app()->getBaseUrl(true).'/uploads/forms/'.$pdfname;
        exec("pdftk ".$pdffile." dump_data_fields 2>&1", $output,$retval);
        //got an error for some pdf if these are secure 
        if(strpos($output[0],'Error') !== false)
        {
            $unsafepdf=Yii::getPathOfAlias('webroot').'/uploads/forms/un-'.$pdfname;
            //echo "pdftk ".$pdffile." input_pw foopass output ".$unsafepdf;
            exec("pdftk ".$pdffile." input_pw foopass output ".$unsafepdf);
            exec("pdftk ".$unsafepdf." dump_data_fields 2>&1", $outputunsafe,$retval);
            return $outputunsafe ;
            //$response=array('0'=>'error','error'=>$output[0]);
            //return $response;

        }

        //if (strpos($output[0],'Error') !== false){ echo  "error to run" ; }   // this is the option to handle error 
        return $output;

    }

Answer

Bruno Lowagie picture Bruno Lowagie · May 15, 2015

PdfTk is a tool that was created by compiling an obsolete version of iText to an executable using the GNU Compiler for Java (GCJ) (PdfTk is not endorsed by iText Group NV).

I have examined your PDF and it uses two technologies that weren't supported by iText at the time PdfTk was created: XFA and compressed cross-reference tables.

The latter is what causes your problem. PdfTk expects your file to end like this:

xref
0 7
0000000000 65535 f 
0000000258 00000 n 
0000000015 00000 n 
0000000346 00000 n 
0000000146 00000 n 
0000000397 00000 n 
0000000442 00000 n 
trailer
<</ID [<c8bf0ac531b0fc7b5b9ec5daf0296834><ec4dde54d00305ebbec62f3f6bbca974>]/Root 5 0 R/Size 7/Info 6 0 R>>
%iText-5.4.3
startxref
595
%%EOF

In this snippet startxref marks the byte offset of xref which is where the cross-reference table starts. This table contains the byte-offsets of all the objects in the PDF.

When you look at the PDF you refer to, you see that it ends like this:

64 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Encrypt 972 0 R/Filter/FlateDecode/ID[<85C47EA3EFE49E4CB0F087350055FDDC><C3F1748360D0464FBA02D711DE864630>]/Info 970 0 R/Length 283/Root 973 0 R/Size 971/Type/XRef/W[1 3 1]>>stream
hÞìÒ±JQЙ·»7J¢©ÕØ(Xþ„ù »h%¤É¤¶”€mZ+;ÁN,,ÁÆ6 XÁ&‚("î½YŒI‘Bî‡áμ]ö1Áð÷³cfþ‹ûÐÚLî`z„Ýôœùw÷N×X?ÙkNv`hÁÒj¦G[œiÀå»›œ?b½Än…ÉëàÍþ gY—i7WW‡òj®îÍ°u¸Ò‡Ñ:óÆÛ™ñÎë&'×݈§ü†ù!ÿñ€ù%,\ácçÙ9˜ì±Þ€S¼Ãd—‰Áy~×.ø¶Åìþßn_˜$9Ôüw£X9#åxzçgRüüóÙwÝ¡œÄNJ©½’Ú+©½’R{%µWR{%ÿ·á”;`_ z6Ø
endstream
endobj
startxref
116
%%EOF

In this case, startxref still refers to where the first cross-reference table starts (it's a linearized PDF), but the cross reference table is stored inside an object, and that object is compressed (see the gibberish between the stream and endstream keywords).

Compressed cross-reference tables and compressed objects were introduced in PDF 1.5 (2003), but they aren't supported by PdfTk. You'll have to find a tool that can deal with such streams (e.g. a recent version of iText, which is the real stuff when compared to PdfTk), or you have to save your PDF as a PDF 1.4 before you treat it with PdfTk (but you'll lose the XFA, because XFA was also introduced in PDF 1.5).

Update:

Since you are asking about form fields, I'm adding the following attachment:

enter image description here

This screenshot was taken using iText RUPS (which proves that iText can open the document). To the right, you see that the same form is defined twice:

enter image description here

If you would walk down the tree under Fields, you'd find all the fields that are stored in the PDF using AcroForm technology. To the left, you can see the description of such a field:

enter image description here

If you look under XFA, you notice that the same form is also defined using the XML Forms Architecture. If you click on datasets, you see the XML description of the dataset in the lower panel:

enter image description here

All of this information can be accessed programmatically using iText (Java) or iTextSharp (C#). PdfTk is merely a tool based on a very old version of this technology.