How to extract just plain text from .doc & .docx files?

docextract · Apr 15, 2011 · Viewed 69.3k times

Can anyone recommend a way to extract just the plain text from a .doc or .docx file?

I've found this, but wondered if there are any other suggestions.

Answer

rob · Sep 2, 2014

If you want pure plain text (which was my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I found this on commandlinefu.

It unzips the .docx file, pulls out the actual document (word/document.xml), and then strips all the XML tags. Obviously, all formatting is lost.
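
If unzip and sed aren't available, the same idea can be sketched in Python with only the standard library. This is a rough equivalent, not part of the original one-liner: the function name docx_to_text is made up for illustration, and it assumes word/document.xml is UTF-8 encoded. It differs from the sed version in three ways: it puts a newline at the end of each paragraph so paragraphs don't run together, it decodes XML entities such as &amp;, and it keeps non-ASCII printable characters. Like the one-liner, it only applies to .docx, since the older binary .doc format is not a zip archive.

```python
import html
import re
import zipfile


def docx_to_text(source):
    """Return the plain text of a .docx given a path or a file-like object.

    Hypothetical helper for illustration; assumes the main document part
    is stored as UTF-8 at word/document.xml.
    """
    with zipfile.ZipFile(source) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")

    # Turn each paragraph end into a newline so paragraphs stay separated.
    xml = xml.replace("</w:p>", "\n")

    # Strip every XML tag, as the first sed expression does.
    text = re.sub(r"<[^>]+>", "", xml)

    # Decode entities such as &amp; and &lt; left behind in the text.
    text = html.unescape(text)

    # Drop non-printable characters, but keep the newlines added above.
    return "".join(ch for ch in text if ch.isprintable() or ch == "\n")
```

Calling print(docx_to_text("some.docx")) would then write the text to the terminal, one paragraph per line. Empty paragraphs, tables, headers, and footnotes are not handled specially here.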