How to deal with ContentNotFoundError when using wkhtmltopdf?

Murali Mopuru · Sep 17, 2014

Can someone tell me how to resolve the following issues?

  1. wkhtmltopdf doesn't have an option to pass proxy info (-p or --proxy), unlike previous versions, and it doesn't use the system $http_proxy and $https_proxy environment variables either.

  2. wkhtmltopdf doesn't work with HTTPS/SSL, even though I set LD_LIBRARY_PATH for libssl.so and libcrypto.so:

    [deploy@localhost ~]$ wkhtmltopdf https://www.google.co.in google.pdf
    loaded the Generic plugin 
    Loading page (1/2)
    Error: Failed loading page https://www.google.co.in (sometimes it will work just to ignore this error with --load-error-handling ignore)
    Exit with code 1 due to network error: UnknownNetworkError
    

    and

    [deploy@localhost ~]$ wkhtmltoimage https://www.google.co.in sample.jpg
    loaded the Generic plugin 
    Loading page (1/2)
    Error: Failed loading page https://www.google.co.in (sometimes it will work just to ignore this error with --load-error-handling ignore)
    Exit with code 1 due to network error: UnknownNetworkError
    
  3. wkhtmltopdf works only partially with HTTP. The output PDF files are missing some content, backgrounds, or positioning:

    [deploy@localhost ~]$ wkhtmltopdf http://localhost:8880/ sample.pdf
    loaded the Generic plugin 
    Loading page (1/2)
    Printing pages (2/2)                                               
    Done                                                           
    Exit with code 1 due to network error: ContentNotFoundError
    
    [deploy@localhost ~]$ wkhtmltoimage http://localhost:8880/ sample.jpg
    loaded the Generic plugin 
    Loading page (1/2)
    Rendering (2/2)                                                    
    Done                                                               
    Exit with code 1 due to network error: ContentNotFoundError
    

Note: I'm using wkhtmltopdf-0.12.1-1.fc20.x86_64 and qt-4.8.6-10.fc20.x86_64.

Answer

kenorb · May 20, 2015

Unfortunately, wkhtmltopdf doesn't handle complex websites well, because it uses the Qt/QtWebKit library, which seems to have some issues.

One problem is that wkhtmltopdf doesn't support relative addresses (GitHub: #1634, #1886, #2359, QTBUG-46240) such as:

<img src="/images/filetypes/txt.png">
<script src="//cdn.optimizely.com/js/653710485.js">

and it loads them as local files. One solution I've found is to correct the HTML file in place with the ex editor:

ex -V1 page.html <<-EOF
  %s,'//,'http://,ge 
  %s,"//,"http://,ge 
  %s,'/,'http://www.example.com/,ge
  %s,"/,"http://www.example.com/,ge
  wq " Update changes and quit.
EOF

However, this won't help when files loaded from the remote server (for example, linked stylesheets or scripts) themselves contain these types of URLs.
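The same rewrite can also be sketched in Python using only the standard library. This is a rough sketch, not a full HTML parser: it assumes the page came from the base URL you pass in (http://www.example.com/ here is just a placeholder), it only touches quoted src and href attributes, and the name absolutize is made up for illustration.

```python
import re
from urllib.parse import urljoin

# Matches quoted src="..." or href='...' attributes (assumption: values are quoted).
ATTR_RE = re.compile(r'''\b(src|href)=(["'])(.*?)\2''', re.IGNORECASE)


def absolutize(html, base_url):
    """Rewrite relative src/href values against base_url (hypothetical helper)."""
    def fix(match):
        attr, quote, value = match.groups()
        # Leave empty values and same-page anchors untouched.
        if not value or value.startswith("#"):
            return match.group(0)
        # urljoin resolves "/path" and "//host/path" against the base
        # and returns already-absolute URLs unchanged.
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"
    return ATTR_RE.sub(fix, html)


if __name__ == "__main__":
    page = '<img src="/images/filetypes/txt.png">'
    print(absolutize(page, "http://www.example.com/"))
```

Unlike the ex substitutions, urljoin also resolves document-relative paths such as images/a.png, but the limitation above still applies: URLs inside remotely loaded CSS or JavaScript are not rewritten.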

Another problem is that it doesn't handle missing resources. You can try specifying --load-error-handling ignore, but in most cases it doesn't work (see #2051), so this issue is still outstanding. The workaround is simply to remove the invalid resources before conversion.
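Removing the invalid resources can be scripted as well. Below is a minimal sketch in Python, assuming the HTML already uses absolute URLs and that only img and script elements with a quoted src matter; drop_missing and url_exists are hypothetical names, and a HEAD request is only one possible way to decide that a resource is missing (some servers reject HEAD).

```python
import re
import urllib.request

# Matches <img ...> or <script ...></script> elements that carry a quoted src.
TAG_RE = re.compile(
    r'''<(img|script)\b[^>]*?\ssrc=(["'])(.*?)\2[^>]*>(?:\s*</script>)?''',
    re.IGNORECASE,
)


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds (assumed check)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        # urlopen raises HTTPError (an OSError subclass) for 4xx/5xx responses.
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers malformed URLs.
        return False


def drop_missing(html, exists=url_exists):
    """Remove img/script elements whose src fails the exists() check."""
    def keep_or_drop(match):
        return match.group(0) if exists(match.group(3)) else ""
    return TAG_RE.sub(keep_or_drop, html)
```

You would then write the result of drop_missing(html) to a local file and run wkhtmltopdf on that file. The exists parameter is there so the check can be swapped out, for example for a cached lookup.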

As an alternative to wkhtmltopdf, you can use htmldoc, or PhantomJS with an additional script such as rasterize.js:

phantomjs rasterize.js http://example.com/ example.pdf

or dompdf (an HTML to PDF converter for PHP, which you can install with Composer), with sample code below:

<?php
// somewhere early in your project's loading, require the Composer autoloader
// see: http://getcomposer.org/doc/00-intro.md
$HOMEDIR = "/Users/foo";
require $HOMEDIR . '/.composer/vendor/autoload.php';

// disable DOMPDF's internal autoloader if you are using Composer
define('DOMPDF_ENABLE_AUTOLOAD', FALSE);
// allow DOMPDF to fetch remote resources such as images and stylesheets
define('DOMPDF_ENABLE_REMOTE', TRUE);

// include DOMPDF's default configuration
require_once $HOMEDIR . '/.composer/vendor/dompdf/dompdf/dompdf_config.inc.php';

// fetch the HTML page to convert (the URL must return HTML, not a PDF)
$htmlString = file_get_contents("https://example.com/foo.html");

// render the HTML and send the resulting PDF to the browser
$dompdf = new DOMPDF();
$dompdf->load_html($htmlString);
$dompdf->render();
$dompdf->stream("sample.pdf");