Best way to extract text from a Word doc without using COM/automation?

Kevin picture Kevin · Sep 3, 2008 · Viewed 60.2k times · Source

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

Answer

mikemaccana picture mikemaccana · Dec 30, 2009

(Same answer as extracting text from MS word files in python)

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

document = opendocx('Hello world.docx')

# This location is where most document content lives 
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print getdocumenttext(document)

See Python DocX site

100% Python, no COM, no .net, no Java, no parsing serialized XML with regexs, no crap.