Does anyone know of anything they can recommend for extracting just the plain text from a .doc or .docx file?
I've found this - wondered if there were any other suggestions?
If you want just the plain text (my requirement), then all you need is:
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
I found this on commandlinefu.
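It pulls word/document.xml (the actual document body) out of the .docx archive and writes it to stdout, then sed strips all the XML tags and any non-printable characters. Obviously all formatting is lost, and the paragraphs run together because the line breaks lived in the tags. Note that this only works for .docx, which is a zip archive; the older binary .doc format needs a different tool.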