I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses an opendocx function, while the Read the Docs documentation uses a Document class. Both also only show how to add text to a docx file, not how to read an existing one.
The first one (opendocx) does not work and may be deprecated. For the second, I tried:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
    print(p.text)
This returned all the text, but a few things were missing: none of the URLs (Ctrl+Click to follow the link) appeared in the console output.
What is the issue? Why are the URLs missing?
How can I get the complete text without iterating over a loop (something like open().read())?
You can try this:
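import docx

def getText(filename):
    # Open the document and collect the text of every body paragraph
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraphs into a single newline-separated string
    return '\n'.join(fullText)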
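
On the missing URLs: in the document XML, link text sits inside a w:hyperlink element rather than directly in the paragraph's runs, and older python-docx releases skipped it when building para.text (I believe more recent releases include it, so upgrading may be enough). If that is not an option, here is a minimal sketch that bypasses python-docx and reads word/document.xml from the .docx archive with the standard library only. The function name get_text_with_links is made up for illustration. It collects every w:t text node, so it returns the visible link text, not the link target, and it ignores tabs and line breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:p> and <w:t>
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def get_text_with_links(filename):
    """Return the document text, one line per paragraph.

    Hypothetical helper: it gathers all <w:t> nodes under each <w:p>,
    including those nested inside <w:hyperlink>.
    """
    # A .docx file is a zip archive; the main body is word/document.xml
    with zipfile.ZipFile(filename) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # iter() walks all descendants, so text inside hyperlinks is kept
        pieces = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Usage is the same as getText above, for example print(get_text_with_links('test_doc.docx')). Unlike doc.paragraphs, this walks every w:p in the file, so paragraphs inside tables are included as well.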