I am using docx4j for reading .docx files and I need to get the paragraph of a document and replace strings

yams · Nov 2, 2012

I am using docx4j to read and parse .docx files, but when I iterate over what I expected to be paragraphs, each pass of the loop returns only part of a paragraph instead of the whole paragraph. Below is a sample of the code I am using.

private void replaceAcrAndDef(String acrName, String acrParensName, String oldDef, String newDef){
    String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
    List<Object> paragraphs = template.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
    for (Object obj : paragraphs){
        Text text = (Text) ((JAXBElement)obj).getValue();
        String textValue = text.getValue();
        System.out.println(textValue);
    }
}

During one pass of the for loop above, this is what gets printed as the first paragraph:

"Team has a deep understanding of the requirements by having direct MDA experience for the Mission, Test and Administrative and General Services networks and systems. The benefits to re a low risk, responsive Team with an established understanding of Mission, Processes and Priorities. Our use of an integrated based"

But the last part of the paragraph is missing; it only comes out in subsequent passes. What am I doing wrong here?

The entire contents of the paragraph are:

Team has a deep understanding of the requirements by having direct MDA experience for the Mission, Test and Administrative and General Services networks and systems. The benefits to are a low risk, responsive Team with an established understanding of Mission, Processes and Priorities. Our use of an integrated Information Technology based Role-Based Administration (RBA) approach works in synergy with associate contractors, existing processes and the addition of our complementary processes.

I do not know whether there is a way to get the entire paragraph at once, but if there is, that would be great, because I need to do String replacement on a paragraph-by-paragraph basis.

Answer

Eddie G. · Nov 5, 2012

Expanding my comments into an answer:

I guess the paragraph contains more than one text element (w:t), so selecting "//w:t" returns it in pieces rather than as a whole. Could you provide a sample document with this issue? What about extracting the text with TextUtils.extractText on the paragraph element?
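To illustrate why selecting w:t yields fragments: Word commonly splits one paragraph (w:p) into several runs (w:r), each carrying its own w:t. The sketch below does not use docx4j at all; it parses a hand-written, hypothetical XML fragment with the JDK's own DOM API and joins the w:t pieces per paragraph. The class name, method name, and sample text are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphTextDemo {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical fragment: one paragraph whose text is split across three runs.
    static final String SAMPLE =
        "<w:body xmlns:w=\"" + W_NS + "\">"
      + "<w:p>"
      + "<w:r><w:t xml:space=\"preserve\">Our use of an integrated </w:t></w:r>"
      + "<w:r><w:t xml:space=\"preserve\">Information Technology </w:t></w:r>"
      + "<w:r><w:t>based approach</w:t></w:r>"
      + "</w:p>"
      + "</w:body>";

    // Returns one string per w:p, built by concatenating all w:t descendants.
    static List<String> paragraphTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            List<String> result = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                StringBuilder joined = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    joined.append(texts.item(j).getTextContent());
                }
                result.add(joined.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Iterating over w:t would print three fragments; per w:p there is one string.
        for (String paragraph : paragraphTexts(SAMPLE)) {
            System.out.println(paragraph);
        }
    }
}
```

Selecting "//w:t" on this fragment would give three separate strings, which matches the behaviour described in the question; grouping by w:p gives the whole paragraph in one piece.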

Try P.toString(). It uses TextUtils internally, and you can also call TextUtils directly with a StringWriter.


Using P.toString():

// Request paragraphs
final String XPATH_TO_SELECT_PARAGRAPHS = "//w:p";
final List<Object> jaxbNodes = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_PARAGRAPHS, true);

for (Object jaxbNode : jaxbNodes){
    final String paragraphString = jaxbNode.toString();
    System.out.println(paragraphString);
}

Using TextUtils.extractText(...) and StringWriter:

for (Object jaxbNode : jaxbNodes){
    final StringWriter stringWriter = new StringWriter();
    TextUtils.extractText(jaxbNode, stringWriter);
    final String paragraphString = stringWriter.toString();
    System.out.println(paragraphString);
}
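Extracting the paragraph text is only half of the original goal, since the question also asks for String replacement per paragraph. One possible approach, sketched here in plain Java without docx4j, is to join the run texts, replace on the joined string, and then put the result into the first run while emptying the others. The class and method names are hypothetical, and the sample strings are made up. In docx4j the returned values would presumably be written back to the corresponding Text objects (for example with Text.setValue), which is an assumption and not something the answer above shows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunReplaceSketch {
    // Joins the run texts, replaces on the whole paragraph text, and returns
    // new run texts: the full result in the first run, empty strings elsewhere.
    static List<String> replaceAcrossRuns(List<String> runTexts, String oldDef, String newDef) {
        List<String> result = new ArrayList<>();
        if (runTexts.isEmpty()) {
            return result;
        }
        String joined = String.join("", runTexts);
        result.add(joined.replace(oldDef, newDef));
        for (int i = 1; i < runTexts.size(); i++) {
            result.add("");
        }
        return result;
    }

    public static void main(String[] args) {
        // The search string spans both runs, so a per-run replace would miss it.
        List<String> runs = Arrays.asList("Role-Based Admin", "istration (RBA)");
        System.out.println(
            replaceAcrossRuns(runs, "Role-Based Administration", "RBA-style administration"));
    }
}
```

The trade-off of this design is that the whole paragraph ends up with the formatting of its first run, so it only suits paragraphs where per-run formatting does not matter.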