Apache POI or docx4j for dealing with docx documents

becks · Feb 21, 2013 · Viewed 28.4k times

Which do you think is better for reading a docx document as Java objects, and why?

In other words, which library supports the most Word tags?

Answer

JasonPlutext · Feb 24, 2013

Disclosure: I lead the docx4j project.

Although docx4j can also handle pptx and xlsx, it is mostly used for docx manipulation. By way of illustration, as at the time of writing, there are nearly 1000 topics in the docx4j forum. The pptx forum has only 10% of the volume.

Whatever you want to do with the docx document, docx4j ought to be able to help you. There's a single page overview of a generic workflow.

For many common requirements, docx4j provides a higher-level API. These include:

  • Create/open/save docx (of course)

  • Report/document generation, using a variety of approaches: (i) Variable substitution, (ii) XML data binding (particularly strong), and (iii) Mailmerge

  • Export as HTML, XHTML

  • Export as PDF (with font support)
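
To make the first report-generation approach concrete, here is a minimal sketch of the idea behind variable substitution, using only the JDK. It is not docx4j's implementation: the class and method names (PlaceholderSketch, substitute) are hypothetical, and it works on a plain string, whereas in a real docx a placeholder such as ${name} may be split across several runs, which a library has to deal with first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of docx4j or POI.
class PlaceholderSketch {

    // Matches placeholders of the form ${name}; group 1 captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${key} with its mapped value; unknown keys are left as they are. */
    static String substitute(String text, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = values.getOrDefault(m.group(1), m.group(0));
            // quoteReplacement stops '$' and '\' in the value being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "becks");
        values.put("library", "docx4j");
        System.out.println(substitute("Dear ${name}, generated with ${library}.", values));
    }
}
```

Leaving unknown placeholders untouched is a design choice in this sketch; it makes missing data easy to spot in the generated output.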

For anything else, you can manipulate the JAXB representation of the docx to your heart's content. JAXB is a Java community standard, included in Java 6, and with a strong alternative implementation in EclipseLink's MOXy. (POI uses XMLBeans instead of JAXB.)
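
For context on what that object representation models: a docx is a ZIP package whose main part, word/document.xml, holds elements such as w:p (paragraph), w:r (run) and w:t (text), and docx4j's JAXB classes (for example org.docx4j.wml.P) correspond to those elements. The sketch below reads that XML with the JDK alone, without docx4j or POI. The names DocxXmlSketch, samplePackage and paragraphTexts are hypothetical, and the sample package contains only the one entry, so Word would not open it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; uses only the JDK.
class DocxXmlSketch {

    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Builds an in-memory ZIP holding only word/document.xml (not a complete docx). */
    static byte[] samplePackage() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("word/document.xml"));
                zip.write(xml.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample package", e);
        }
    }

    /** Returns the concatenated w:t text of every w:p in word/document.xml. */
    static List<String> paragraphTexts(byte[] pkg) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // The parser reads to the end of the current entry; we return right after.
                Document doc = factory.newDocumentBuilder().parse(zip);
                NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
                List<String> result = new ArrayList<>();
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    result.add(sb.toString());
                }
                return result;
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read package", e);
        }
        throw new IllegalStateException("word/document.xml not found");
    }

    public static void main(String[] args) {
        System.out.println(paragraphTexts(samplePackage()));
    }
}
```

Working at this level means handling every element by hand; a typed object model, whether docx4j's JAXB classes or POI's XMLBeans-based ones, spares you that.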

There's a web app to help you explore a docx, and generate Java code to create corresponding Java objects.

Of course, if there is some specific task you have in mind, it may be that docx4j or POI has a particular strength there.

Both docx4j and POI are ASL v2 licensed.

docx4j is actively maintained; its source code is on GitHub.

In addition, commercial support is available for docx4j if you want it, as are several commercial extensions, e.g. MergeDocx.

docx4j does rely on POI as a library for its implementation of the OLE 2 Compound Document format, which we're grateful for.