Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)
Antiword seems like a reasonable option, but it may be abandoned.
A Python solution would be ideal, but doesn't appear to be available.
(Same answer as extracting text from MS word files in python)
Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:
from docx import *
document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text (getdocumenttext returns a list of paragraph strings)
print('\n'.join(getdocumenttext(document)))
See Python DocX site
100% Python, no COM, no .net, no Java, no parsing serialized XML with regexes, no crap.
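For anyone curious what a pure-Python approach involves under the hood, here is a minimal sketch using only the standard library. It is not the docx module's implementation; it just relies on the fact that a .docx file is a zip archive whose main content sits in word/document.xml. The function name extract_docx_text is made up for illustration, and this only handles .docx, not the older binary .doc format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(path):
    """Return the text of each non-empty paragraph in a .docx file.

    Hypothetical helper for illustration; `path` may be a filename
    or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Each <w:p> is a paragraph; its visible text lives in <w:t> elements.
    for para in root.iter("{%s}p" % W_NS):
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Calling extract_docx_text('Hello world.docx') returns a list of strings, one per paragraph. It ignores headers, footers, and footnotes, which live in other parts of the archive, so treat it as a starting point rather than a complete extractor.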