Can Apache FOP be used to convert arbitrary HTML to PDF?

B. Abraham · Jan 18, 2017 · Viewed 8.1k times

I have tried to use Apache FOP to convert HTML to PDF (HTML --> XHTML --> XSL-FO --> PDF). I used the xhtml2fo.xsl from Antenna House for the xhtml --> XSL-FO conversion.

It works for simple html files.

It does not work for HTML files with styling (via embedded CSS or the style attribute). A PDF is created but completely unformatted. I am trying to convert HTML files whose styling and content I have little control over.

Writing an XSLT for each HTML file is not practical in my use case.

Currently, I do have a working implementation with flyingsaucer. However, the requirement calls for an implementation without AGPL license.

My question is: Can this be achieved with FOP?

Appreciate any help

Answer

lfurini · Jan 22, 2017

tl;dr version:

In the most general situation, no, you cannot use FOP to convert any HTML while preserving the original styles (and changing the formatter would not solve the problem).

However, you can use FOP (or any other formatter) to handle a large subset of HTML documents reasonably well; this may require some adjustments to the XSLT.


Why it cannot work in general

HTML --> XHTML --> XSL-FO --> PDF

Your description of the necessary transformation chain is spot on.

However, FOP is only involved in the last step: with the exception of the features that are not implemented yet, the final PDF file should respect the typographical characteristics expressed in the FO file.

I used the xhtml2fo.xsl from Antenna House for the xhtml --> XSL-FO conversion [...]

A PDF is created but completely unformatted

Is the stylesheet you are using this one from the AntennaHouse site?

From a quick look, it appears to convert the style="..." attribute into separate attributes in the FO output, but it does not process external CSS files.

As a result, HTML files styled with external CSS will be transformed into FO files without any formatting attributes (font-family, font-size, text-align, ...).

Can this be achieved with FOP?

If that's indeed the case, the formatter cannot do anything but use the default values, a few of which (font-family comes to mind) are application-dependent.

So, depending on the formatter you use, you will get a slightly different result, but still an "unformatted" one.

What you need is either a tool to "merge" the html and css files, inlining the styles so that the XSLT can process them, or a different stylesheet capable of taking into account the external css files (but I suspect it would not be easy to write one working in a general case).
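
The first option, merging the CSS into the markup, can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not a general CSS engine: it assumes the input is already well-formed XHTML, it understands only bare tag, .class and #id selectors, and it ignores specificity, combinators and inheritance. The function names (parse_simple_css, matches, inline_styles) are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without a prefix (affects ElementTree globally).
ET.register_namespace("", XHTML_NS)


def parse_simple_css(css_text):
    """Return (selector, declarations) pairs in stylesheet order.

    Only bare tag, .class and #id selectors are kept. At-rules such as
    @media are not understood, so this is unsuitable for complex CSS.
    """
    rules = []
    css_text = re.sub(r"/\*.*?\*/", "", css_text, flags=re.S)  # drop comments
    for selector_group, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css_text):
        declarations = "; ".join(d.strip() for d in body.split(";") if d.strip())
        for selector in selector_group.split(","):
            selector = selector.strip()
            if declarations and re.fullmatch(r"[.#]?[A-Za-z][\w-]*", selector):
                rules.append((selector, declarations))
    return rules


def matches(element, selector):
    """Check one element against one simple selector."""
    if selector.startswith("."):
        return selector[1:] in element.get("class", "").split()
    if selector.startswith("#"):
        return element.get("id") == selector[1:]
    local_name = element.tag.split("}")[-1]  # strip the namespace part
    return local_name == selector.lower()


def inline_styles(xhtml_text, css_text):
    """Copy matching CSS declarations into each element's style attribute."""
    root = ET.fromstring(xhtml_text)
    rules = parse_simple_css(css_text)
    for element in root.iter():
        parts = [decl for sel, decl in rules if matches(element, sel)]
        existing = element.get("style", "").strip().rstrip(";")
        if existing:
            parts.append(existing)  # the element's own style comes last
        if parts:
            element.set("style", "; ".join(parts))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
            '<p class="note">Hello</p></body></html>')
    print(inline_styles(page, "p { font-size: 12pt } .note { text-align: center }"))
```

Matching rules are written in stylesheet order and the element's own style attribute comes last; whether the downstream XSLT lets a later declaration override an earlier one is something to verify. For real-world pages, a proper CSS inliner is a better choice than this sketch.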

What can be fixed with little effort

While processing HTML tables, the linked XSLT uses the fo:table-and-caption element, which is not supported by FOP, so the tables "disappear" from the output.

This can be fixed with a small change in the XSLT, or (probably a cleaner solution) with a custom stylesheet that pulls in the original one and overrides its table template:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fo="http://www.w3.org/1999/XSL/Format"
  xmlns:html="http://www.w3.org/1999/xhtml">

  <xsl:include href="xhtml2fo.xsl"/>

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>

  <xsl:template match="html:table" priority="2">
    <fo:table xsl:use-attribute-sets="table">
      <!-- warning: table caption is not processed! -->
      <xsl:call-template name="process-table"/>
    </fo:table>
  </xsl:template>

</xsl:stylesheet>

It is possible that the stylesheet you are actually using needs a few similar adjustments to better work in conjunction with FOP.
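
One way to spot such cases is to inspect the intermediate FO file before handing it to FOP. The Python sketch below counts occurrences of FO elements from a watch list; the default list contains only fo:table-and-caption, the one element named above, and any other entries would have to come from FOP's own compliance documentation. The function name find_fo_elements is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FO_NS = "http://www.w3.org/1999/XSL/Format"


def find_fo_elements(fo_text, local_names=("table-and-caption",)):
    """Count FO elements whose local name appears in local_names."""
    prefix = "{" + FO_NS + "}"
    counts = Counter()
    for element in ET.fromstring(fo_text).iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            if name in local_names:
                counts[name] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = (
        '<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">'
        "<fo:table-and-caption><fo:table/></fo:table-and-caption>"
        "</fo:root>"
    )
    print(find_fo_elements(sample))
```

An empty result means none of the watched elements are present; it does not prove that everything in the file is supported.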

Disclosure: I'm a FOP developer, though not very active nowadays.