Convert doc to pdf using Apache POI

user1710922 · Jul 24, 2013

I am trying to convert doc to pdf using Apache POI, but the resulting PDF contains only text; it has none of the formatting, such as images, tables, or alignment.

How can I convert doc to pdf while keeping all the formatting, such as tables, images, and alignment?

Here is my code:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class demo {
    public static void main(String[] args) {

        POIFSFileSystem fs = null;
        Document document = new Document();

        try {
            System.out.println("Starting the test");
            fs = new POIFSFileSystem(new FileInputStream("Resume.doc"));

            HWPFDocument doc = new HWPFDocument(fs);
            // WordExtractor yields plain text only, so tables, images
            // and alignment never reach the PDF
            WordExtractor we = new WordExtractor(doc);

            OutputStream file = new FileOutputStream("test.pdf");

            PdfWriter.getInstance(document, file);
            document.open();

            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                // strip the line break that ends each extracted paragraph
                paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                System.out.println("Length:" + paragraphs[i].length());
                System.out.println("Paragraph" + i + ": " + paragraphs[i]);
                // add the paragraph to the document
                document.add(new Paragraph(paragraphs[i]));
            }

            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        } finally {
            // close the document
            document.close();
        }
    }
}

Answer

mkl · Jul 25, 2013

The task at hand is converting doc to pdf while keeping all the formatting, such as tables, images, and alignment.

Creating your own converter class

There already are WordToXxxConverter classes in Apache POI, namely WordToFoConverter, WordToHtmlConverter, and WordToTextConverter. The last one is most likely too lossy to serve as an example for your requirements, but the first two are adequate.

All these converter classes are derived from the common base class AbstractWordConverter, which provides a basic framework for Word conversion classes. Furthermore, each of these classes makes use of a matching *DocumentFacade class which wraps the creation of the concrete target (or some intermediate) format: FoDocumentFacade, HtmlDocumentFacade, or TextDocumentFacade.

To implement your task, therefore, you should also derive a converter class from AbstractWordConverter and, when implementing the abstract methods, let yourself be inspired by the three concrete implementation classes. Just like in the other converter classes, concentrating the code specific to the PDF library in a PdfDocumentFacade class seems like a good idea.
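To make the facade idea a bit more concrete, here is a minimal sketch of what such a class could look like. Everything in it is hypothetical: the PdfDocumentFacade interface, its method names, and the RecordingPdfFacade stand-in are not part of Apache POI or iText. The stand-in only records the calls it receives; a real implementation would translate them into calls to your PDF library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical facade: the converter talks only to this interface, so all
// PDF-library-specific code stays inside a single implementing class.
interface PdfDocumentFacade {
    void addParagraph(String text, String alignment);
    void beginTable(int columns);
    void addTableCell(String text);
    void endTable();
    void addImage(byte[] imageData);
}

// Stand-in implementation that records calls instead of writing a PDF.
class RecordingPdfFacade implements PdfDocumentFacade {
    final List<String> calls = new ArrayList<>();

    public void addParagraph(String text, String alignment) {
        calls.add("paragraph[" + alignment + "]: " + text);
    }

    public void beginTable(int columns) {
        calls.add("beginTable(" + columns + ")");
    }

    public void addTableCell(String text) {
        calls.add("cell: " + text);
    }

    public void endTable() {
        calls.add("endTable");
    }

    public void addImage(byte[] imageData) {
        calls.add("image(" + imageData.length + " bytes)");
    }
}

class FacadeDemo {
    public static void main(String[] args) {
        RecordingPdfFacade facade = new RecordingPdfFacade();
        // A converter derived from AbstractWordConverter would issue
        // calls like these from its callbacks.
        facade.addParagraph("Resume", "center");
        facade.beginTable(2);
        facade.addTableCell("Name");
        facade.addTableCell("John Doe");
        facade.endTable();
        facade.calls.forEach(System.out::println);
    }
}
```

Keeping the interface this small also fits the step-by-step approach: you can begin with addParagraph alone and add table and image support once the basics work.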

If you want to start simple and add the more complex details later, you might begin by reusing much of the WordToTextConverter implementation code and, as soon as that works at least on a proof-of-concept level, extend the functionality to cover more and more of the formatting information.

Unfortunately, this converter framework is somewhat DOM element centric: AbstractWordConverter callbacks expect and forward DOM elements as indicators of the current target document context. At first glance, though, the base class does not seem to rely on that context actually being a DOM element, so you might get away with copying that base class and replacing those DOM element parameters with a more apropos type or, even better, a generic class parameter.
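The generic-parameter variant could look roughly like the following sketch. It is a heavily simplified, hypothetical stand-in for a copied base class, not the actual AbstractWordConverter API: GenericWordConverter, processParagraphs, and processParagraph are made-up names, and paragraphs are reduced to plain strings to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical copy of the converter base class in which the target
// context is a type parameter C instead of a DOM element.
abstract class GenericWordConverter<C> {

    // Walks the paragraphs and forwards each one, together with the
    // current target context, to the subclass callback.
    void processParagraphs(List<String> paragraphs, C context) {
        for (String paragraph : paragraphs) {
            processParagraph(paragraph, context);
        }
    }

    // Callback that subclasses implement for their own target format.
    protected abstract void processParagraph(String text, C context);
}

// Example subclass whose context is a plain StringBuilder; a PDF
// converter would use a type from its PDF library here instead.
class PlainTextConverter extends GenericWordConverter<StringBuilder> {
    @Override
    protected void processParagraph(String text, StringBuilder out) {
        out.append(text).append('\n');
    }
}

class GenericConverterDemo {
    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        new PlainTextConverter()
                .processParagraphs(Arrays.asList("First paragraph", "Second paragraph"), out);
        System.out.print(out);
    }
}
```

With the context typed this way, a PDF-specific subclass gets its own container type (a page, a table cell, and so on) handed to each callback without any casts from DOM elements.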

Using existing Word-to-XXX converters in combination with existing XXX-to-PDF converters

If this seems too complex or time-consuming for your resources, you might try a different approach: use the output of one of the existing converters mentioned above as input for another conversion to PDF.

Using existing conversion classes will produce results sooner, but multi-step conversions tend to be more lossy than single-step ones. The decision is up to you.

In the code you posted in your question you used iText classes. iText does support conversion from HTML to PDF, with certain limitations, using the XMLWorker provided in the iText XML Worker sub-project. In ancient iText versions there also used to be the now deprecated HTMLWorker class. Thus, using the WordToHtmlConverter in combination with the iText XMLWorker may be an option for you.
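One step in such a two-stage pipeline deserves attention: WordToHtmlConverter builds its result as a DOM document, while XMLWorker parses well-formed XHTML from a stream, so the DOM has to be serialized in between. The following sketch shows only that glue step, using nothing but JDK classes. The DomToXhtml class and its sampleDocument method are hypothetical; the hand-built DOM merely stands in for the converter output. The assumption here is that serializing with the "xml" output method is the safer choice, because the "html" method may leave empty elements such as br unclosed.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToXhtml {

    // Serializes a DOM tree as well-formed XML so that an XHTML
    // parser can read it back.
    static String serialize(Document dom) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Hand-built stand-in for the DOM that a Word-to-HTML conversion
    // would produce (assumption: the real output is far richer).
    static Document sampleDocument() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element body = dom.createElement("body");
            Element paragraph = dom.createElement("p");
            paragraph.setAttribute("style", "text-align:center");
            paragraph.setTextContent("Resume");
            body.appendChild(paragraph);
            body.appendChild(dom.createElement("br"));
            html.appendChild(body);
            dom.appendChild(html);
            return dom;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample DOM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

The resulting string (or its bytes) would then be handed to the HTML-to-PDF step; how much of the inline styling survives depends on the limitations mentioned above.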

Alternatively, Apache also provides XSL-FO processing to PDF (Apache FOP). Applied to the output of WordToFoConverter, this may also be an option.