Styling Microsoft-Word-Generated HTML

Walker · Jan 30, 2011 · Viewed 14.1k times

Ugh. Word is notorious for its bloated, convoluted, non-standards-compliant, non-semantic HTML. Unfortunately, I have a professor who is requiring us to generate an outline to very exacting standards. I'd rather not hand-write it, so I decided to make something that would be useful for my classmates as well. I created the outline using a simple numbered list in NeoOffice on my Mac, exported it as HTML, and wrote quite a bit of CSS to style it. Then I got someone to create an ordered list in Word for Windows, export it as HTML, and send it to me to check compatibility. After scrolling miles down the page, trying to repress a shudder, I saw a problem. Word did not use <ol> and <li>. It used mountains of nested <span>s with classes out the wazoo. I hate to see all my work go to waste, but this content is impossible to work with: I'd have to style on a document-by-document basis, rather than with a universal stylesheet.

Ideally, Word would generate HTML using standard tags so that I could style it just like any other list, but this doesn't seem to be the case. How can I make it generate lists that actually use <ol> and <li> rather than <span>, or at least modify something in my code to work with the weird way it does create lists?

Answer

Herc · Jun 18, 2014

The guys who wrote Winword and its HTML generation are smart guys. If it were easy to use HTML features in a purist way, they would have done so.

Word is about creating paper-optimised layouts. It supports concepts such as tab-stops and multi-level numbering that HTML doesn't support, or is only just starting to. As a result, the HTML version of a Word document is not 'nice' HTML, but an attempt to retain the features of the Word document accurately.

When Word re-opens an HTML file it has saved, it does some clever reverse-engineering on the document, so that it renders in Word looking pretty much like it started. Equally, if you insert the HTML as a snippet into a web page, retaining Word's CSS, the results are pretty faithful. In this case there is a culture clash between the underlying CSS of the web page and Word's CSS, and some effort is required to make the best of a bad job. Word's HTML doesn't use UTF-8 by default either (Western-language installs typically save as windows-1252), which needs some handling.
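To make the encoding point concrete, here is a minimal Python sketch of one way to handle it. It assumes the exported file declares its charset in a meta tag and falls back to windows-1252 when it doesn't; the function name `word_html_to_utf8` and the fallback choice are my own illustration, not anything Word or the product mentioned here provides.

```python
import re


def word_html_to_utf8(raw: bytes, fallback: str = "windows-1252") -> bytes:
    """Decode Word-exported HTML and return it re-encoded as UTF-8.

    Hypothetical helper: assumes the charset is declared near the top of
    the file, otherwise falls back to `fallback`.
    """
    # Look for charset=... in the first few kilobytes of the file.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:4096], re.IGNORECASE)
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        text = raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode with the fallback instead.
        text = raw.decode(fallback, errors="replace")
    # Rewrite the first charset declaration so it matches the new encoding.
    text = re.sub(r'(charset=["\']?)[\w-]+', r"\g<1>utf-8", text,
                  count=1, flags=re.IGNORECASE)
    return text.encode("utf-8")
```

Curly quotes and dashes are the usual casualties when this step is skipped, so they are a good thing to check after converting.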

HTML Tidy can be used to rip out Word mark-up, but some more massaging is required after this for good rendering within a web page. I have worked for 15 years on a product which does this mixing of Word and web pages, and the results can be quite good if you fine-tune the CSS.
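As an illustration of the kind of massaging involved, here is a rough Python sketch that rebuilds a nested <ol> from Word's list paragraphs. It assumes each list item is a <p> whose style attribute carries something like `mso-list:l0 level2 lfo1`, with the visible number wrapped in a `<![if !supportLists]>…<![endif]>` block; check that against your own export before relying on it. The function names are made up for this example, it discards inline formatting inside items, and a regex is no substitute for a real parser on messy input.

```python
import re
from html import escape, unescape

# Assumed shape of a Word list paragraph: <p ... style='...mso-list:l0 level2 lfo1'>
LIST_PARA = re.compile(
    r"<p\b[^>]*mso-list:\s*l\d+\s+level(\d+)[^>]*>(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)
# The hard-coded "1." / "a." marker Word emits for non-Word renderers.
MARKER = re.compile(r"<!\[if !supportLists\]>.*?<!\[endif\]>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def extract_items(word_html):
    """Return (level, plain_text) pairs for each Word list paragraph."""
    items = []
    for level, body in LIST_PARA.findall(word_html):
        body = MARKER.sub("", body)          # drop the fake number
        text = unescape(TAG.sub("", body))   # drop spans, keep the text
        items.append((int(level), " ".join(text.split())))
    return items


def to_ordered_list(items):
    """Build nested <ol>/<li> markup from (level, text) pairs."""
    parts = []
    depth = 0
    for level, text in items:
        if level > depth:
            # Going deeper: a nested <ol> lives inside the still-open <li>.
            # If a level is skipped, open an empty <li> to stay well-formed.
            parts.append("<ol>" + "<li><ol>" * (level - depth - 1))
        else:
            # Same level or shallower: close the previous item and any
            # nested lists that have ended.
            parts.append("</li>" + "</ol></li>" * (depth - level))
        parts.append("<li>" + escape(text))
        depth = level
    if depth:
        parts.append("</li>" + "</ol></li>" * (depth - 1) + "</ol>")
    return "".join(parts)
```

Calling `to_ordered_list(extract_items(html))` on the exported file's contents gives plain list markup that a universal stylesheet can target, at the cost of losing whatever Word-specific layout the original paragraphs carried.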

We used Word because we are creating paper-versions, and importing text from reports written in Word, not because we couldn't find a dedicated HTML editor.

I would not recommend using Word to create tidy purist HTML. You wouldn't use a can-opener to open a bottle of wine, would you?

Life would be much simpler if: a) Microsoft re-engineered the myriad options in its highly confusing 'Bullets and Numbering' feature, and b) HTML provided native, properly featured multi-level numbering support, instead of the afterthought approaches currently available. The weakness of HTML in this area can be seen in the flimsy numbering options available in Google Docs.

So much has improved with HTML 5, maybe we can hope that HTML 6 will help bridge the word processor / HTML editor divide.