How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?

Robert Campbell · Sep 29, 2009

On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.

Answer

Tanuj Chatterjee · Oct 22, 2009

This worked for me. Make sure the required JARs are on your classpath (you may need to upgrade xmlbeans, among others).

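import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Returns the document's text as a plain string, without styles or formatting.
public String extractText(InputStream in) throws Exception {
    XWPFDocument doc = new XWPFDocument(in);              // parse the OOXML package
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);   // text extractor for DOCX
    return ex.getText();
}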
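
For comparison, here is a rough JDK-only sketch of what "plain text from a DOCX" amounts to. It assumes a DOCX is a ZIP archive whose main body is the entry word/document.xml, and that visible text sits in w:t elements grouped into w:p paragraphs. The class name DocxPlainText is hypothetical, and this is not how POI implements XWPFWordExtractor. It is a deliberate simplification that ignores tabs, line breaks, headers, footers, footnotes, and table layout, so prefer the POI call above for real documents.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Apache POI.
public class DocxPlainText {

    // Namespace of WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a DOCX stream, one line per paragraph. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parseBody(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }

    private static String parseBody(InputStream xml) throws Exception {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inText; // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    out.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.newSAXParser().parse(xml, handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxPlainText <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Run it as java DocxPlainText report.docx (the file name is only an example). Runs that Word splits mid-sentence are concatenated as-is, and each paragraph ends with a newline.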