I want to add an word import function to our CMS, the only problem I cannot seems to find a good library for reading docx files (Word 2007).
Do anyone has some recommendations, the library should be able to extract content of the document and basic styling like italic, bold, superscript?
Thanks for your help
docx
files are actually just containers for the document's XML. You should be able to unzip the docx file and then go to the word folder inside, then to the document.xml. This has the actual text. But things like the fonts and styles are in other xml files in the docx container, so you'll probably want to mess around a bit and figure out what is what and how to match it up (start by using namespaces, I bet).
But yea, unzip the file, then use simplexml to convert it into something you can actually mess around with.