From MS Word or LibreOffice to clean HTML

Erel Segal-Halevi · Jan 24, 2013

People who send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to preserve only the basic formatting (headings, lists and emphasis) and no images.

When I convert them with LibreOffice "Save as HTML", the resulting files are huge. For example, a 112K doc file becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).

I tried this script, which is based on tidy and sed: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708. It reduced the size to about 150K, but there are still many useless SPANs.
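To make the goal concrete, here is a minimal sketch of a whitelist filter that keeps only heading, list and emphasis tags and drops everything else (FONT, SPAN, attributes, styles) while keeping the text. It uses only Python's standard `html.parser`. The tag whitelist and the names `Cleaner` and `clean_html` are assumptions made for illustration, not part of the script linked above, and real Word or LibreOffice output may need extra handling.

```python
from html import escape
from html.parser import HTMLParser

# Tags to keep: headings, paragraphs, lists and emphasis (an assumed whitelist).
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p",
        "ul", "ol", "li", "em", "strong", "b", "i"}
# Elements whose text content is dropped entirely.
SKIP_CONTENT = {"style", "script", "head", "title"}


class Cleaner(HTMLParser):
    """Re-emit whitelisted tags without attributes; keep text, drop other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element from SKIP_CONTENT

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities were decoded by the parser; re-escape only &, < and >,
            # so non-Latin letters stay as plain characters.
            self.out.append(escape(data, quote=False))


def clean_html(source):
    parser = Cleaner()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    messy = '<p><font face="Arial"><span lang="he">שלום</span><span>,</span></font> <b>world</b></p>'
    print(clean_html(messy))
```

The filter does not repair badly nested markup, so running tidy first and this kind of filter second may still be worthwhile.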

I tried to copy and paste into Kompozer, an HTML editor, and then save as HTML, but it converted all my non-Latin (Hebrew) letters to entities such as "&#1456;", which increased the size to 750K!
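The size increase is expected: a Hebrew letter takes 2 bytes in UTF-8 but 7 bytes when written as a numeric character reference like `&#1513;`. As a rough sketch, the references can be turned back into plain characters with a regular expression, assuming the file is then saved as UTF-8 with a matching charset declaration. The function name `decode_non_ascii_refs` is hypothetical. ASCII references and named entities such as `&amp;` are deliberately left untouched so the markup is not broken.

```python
import re

# Matches decimal (&#1513;) and hexadecimal (&#x5E9;) character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_non_ascii_refs(markup):
    """Replace numeric references to non-ASCII characters with the characters."""

    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (e.g. &#60; for "<"), surrogates and
        # out-of-range values exactly as they were.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_REF.sub(repl, markup)


if __name__ == "__main__":
    print(decode_non_ascii_refs("<p>&#1513;&#1500;&#1493;&#1501; &amp; &#60;</p>"))
```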

I tried docvert: https://github.com/holloway/docvert/issues/6 but found out that it requires a Python library that in turn requires other libraries, and so on, which seems like an endless chain of dependencies...

Is there a simple way to create clean HTML from Office documents?

Answer

Tarun · Sep 28, 2013

I was using http://word2cleanhtml.com/ until I realised that MS Word itself offers the option to save a document as HTML.

When you select this option, the .docx file is saved as .html, and the result is the best HTML version of a Word document that I've seen. It's certainly better than all these online tools.