How to extract text from an existing docx file using python-docx

python python-2.7 python-3.x python-docx

Nancy · Aug 10, 2014 · Viewed 119.3k times · Source

Answer

You can try this:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
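
If link text is missing from `para.text`, as the question reports, one possible workaround is to read the document XML directly. The following is a minimal sketch using only the standard library, not python-docx. It assumes the body text lives in `word/document.xml` and that the visible text is held in `w:t` elements. `get_text_with_links` is a hypothetical helper name, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def get_text_with_links(path_or_file):
    """Return paragraph text, including text nested inside hyperlinks.

    Hypothetical helper for illustration; not a python-docx API.
    """
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # iter() walks all descendants, so <w:t> nodes wrapped in
    # <w:hyperlink> are collected along with ordinary runs.
    for para in root.iter("{%s}p" % W_NS):
        pieces = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

This returns the displayed link text, not the link target, which is stored separately in the package's relationship files. Headers, footers, and footnotes live in other parts of the archive and are not covered. Newer python-docx releases may already include hyperlink text in `para.text`, so it is worth checking that first.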

Question 2

I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses an opendocx function, while the Read the Docs documentation uses a Document class. Also, the documentation only shows how to add text to a docx file, not how to read an existing one.

The first one (opendocx) does not work and may be deprecated. For the second, I tried:

from docx import Document

document = Document('test_doc.docx')

print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x... >

Then I did:

for p in document.paragraphs:
    print(p.text)

It returned all the text, but a few things were missing: none of the URLs (CTRL+CLICK to follow the link) appeared in the console output.

What is the issue? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?
