People who send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to preserve only the basic formatting (headings, lists and emphasis) and no images.
When I convert them with LibreOffice "Save as HTML", the resulting files are huge. For example, a 112K doc file becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).
I tried this script: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708 based on tidy and sed, and it reduced the size to about 150K, but there are still many useless SPANs.
I tried to copy and paste into Kompozer, an HTML editor, and then save as HTML, but it converted all my non-Latin (Hebrew) letters to entities such as "ְ", which increased the size to 750K!
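To illustrate the entity problem: a numeric reference such as `&#1513;` takes seven bytes, while the same Hebrew letter takes two bytes in UTF-8. Below is a minimal Python sketch that turns such references back into plain characters. The function name `decode_non_ascii_refs` is made up for this example, and it assumes the file is then saved and served as UTF-8.

```python
import re

# Matches decimal (&#1513;) and hexadecimal (&#x5E9;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_non_ascii_refs(html_text: str) -> str:
    """Replace numeric references to non-ASCII characters with the characters.

    References to ASCII characters (for example &#60; for "<") and named
    entities such as &amp; are left alone so the markup stays valid.
    """

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if 128 <= code_point <= 0x10FFFF and not is_surrogate:
            return chr(code_point)
        return match.group(0)  # keep ASCII or invalid references untouched

    return _NUMERIC_REF.sub(replace, html_text)
```

Writing the result with `open(path, "w", encoding="utf-8")` keeps the letters as two-byte sequences instead of seven-byte references.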
I tried docvert: https://github.com/holloway/docvert/issues/6 but found out that it requires a Python library that requires other libraries, etc., which seems like an endless chain of dependencies...
Is there a simple way to create clean HTML from Office documents?
I was using http://word2cleanhtml.com/ until I realised that MS Word itself gives the option to save a document as HTML.
On selecting this, the .docx file becomes .html, and it is the best HTML version of a Word doc that I've seen. It's certainly better than all these online tools.
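If the saved HTML still carries more markup than you want, one possible follow-up is a small whitelist filter that keeps only headings, paragraphs, lists and emphasis and drops every other tag (FONT, SPAN, images, styles). This is only a rough sketch using Python's standard `html.parser` module; the names `KEEP`, `SKIP`, `WhitelistFilter` and `clean_html` are invented for this example, and the tag list is an assumption you would adjust to your documents.

```python
from html import escape
from html.parser import HTMLParser

# Tags to keep (without attributes). Everything else is dropped,
# but the text inside dropped tags is preserved.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p",
        "ul", "ol", "li", "em", "strong", "b", "i"}
# Tags whose whole content is discarded.
SKIP = {"head", "style", "script"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities into real characters,
        # so Hebrew text comes out as plain letters, not &#...; references.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape &, < and > so the output is still valid HTML.
            self.out.append(escape(data, quote=False))


def clean_html(text: str) -> str:
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

For example, `clean_html('<p><span>Hello</span><span>,</span> <b>world</b></p>')` gives `<p>Hello, <b>world</b></p>`. Read the input with the encoding the file was saved in and write the result as UTF-8.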