HTML 4, HTML 5, XHTML, MIME types - the definitive resource

deceze picture deceze · Apr 18, 2010 · Viewed 8.6k times · Source

The topics of HTML vs. XHTML and XHTML as text/html vs. XHTML as XHTML are quite complex. Unfortunately it's hard to get a complete picture, since information is spread mostly in bits and pieces around the web or is buried deep in W3C tech jargon. In addition there's some misinformation being circulated. I propose to make this the definitive SO resource about the topic, describing the most important aspects of:

  • HTML 4
  • HTML 5
  • XHTML 1.0 as text/html, application/xml+xhtml
  • XHTML 1.1 as application/xml+xhtml

What are the practical implications of each?
What are common pitfalls?
What is the importance of proper MIME types for each?
How do different browsers handle them?

I'd like to see one answer per technology. I'm making this a community wiki, so rather than contributing redundant answers, please edit answers to complete the picture. Feel free to start with stubs. Also feel free to edit this question.

Answer

Alohci picture Alohci · Apr 18, 2010

Contents.

  • Terminology
  • Languages and Serializations
  • Specifications
  • Browser Parsers and Content (MIME) Types
  • Browser Support
  • Validators and Document Type Definitions
  • Quirks, Limited Quirks, and Standards modes.

Terminology

One of the difficulties of describing this is clearly that the terminology within the official specifications has changed over the years, since HTML was first introduced. What follows below is based on HTML5 terminology. Also, "file" is used as a generic term to mean a file, document, input stream, octet stream, etc to avoid having to make fine distinctions.

Languages and Serializations

HTML and XHTML are defined in terms of a language and a serialization.

The language defines the vocabulary of the elements and attributes, and their content model, i.e. which elements are permitted inside which other elements, which attributes are allowed on which element, along with the purpose and meaning of each element and attribute.

The serialization defines how mark-up is used to describe these elements and attributes within a text document. This includes which tags are required and which can be inferred, and the rules for those inferences. It describes such things as how void elements should be marked up (e.g. “>” vs “/>”) and when attribute values need to be quoted.

Specifications

The HTML 4.01 specification is the current specification that defines both the HTML language and the HTML serialization.

The XML 1.0 specification defines a serialization but leaves the language to be defined by other specifications, which are termed “XML applications”

The XHTML 1.0 and 1.1 specifications are both in use. Essentially, they use the same language as HTML 4.01 but use a different serialization, one that is compatible with the XML 1.0 specification. i.e. XHTML is an XML application.

The HTML5 (as of 2010-04-18, draft) specification describes a new language for both HTML and XHTML. This language is mostly a superset of the HTML 4.01 language, but is intended to only be backward compatible with existing web tools, (e.g. browsers, search engines and authoring tools) and not with previous specifications, where differences arise. So the meaning of some elements are occasionally changed from the earlier specifications. Similarly, each of the serializations are backward compatible with the current tools.

Browser Parsers and Content (MIME) Types

When a text file is sent to a browser, it is parsed into its internal memory structure (object model). To do so it uses a parser which follows either the HTML serialization rules or XML serialization rules. Which parser it uses depends on what it deduces the content type to be, based for non-local files on the “content-type” HTTP header. Internally, once the file has been parsed, the browser treats the object model in almost the same way, regardless of whether it was originally supplied using an HTML or XHTML serialization.

For a browser to use its XHTML parser, the content type HTTP header must be one of the XML content types. Most commonly, this is either application/xml or application/xhtml+xml. Any non XML content type will mean that the file, regardless of whether it meets all the XHTML language and serialization rules or not, will not be processed by the browser as XHTML.

Using a HTTP content type of text/html (or in most fallback scenarios, where the content type is missing or any other non-XML type) will cause the browser to use its HTML serialization parser.

One key difference between the two parsers is that the HTML serialization parser performs error recovery. If the input file to the parser does not meet the HTML serialization rules, the parser will recover in ways reverse engineered from previous browsers and carry on building its object model until it reaches the end of the file. HTML5 contains the first normative definition of the recovery but no mainstream browser has shipped an implementation of the algorithm enabled in a release version as of 2010-04-26.

In contrast, the XML serialization parser, will stop when it encounters anything that it cannot interpret as XML (i.e. when it discovers that the file is not XML well-formed). This is required of parsers by the XML 1.0 specification.

Browser Support

Most modern browsers contain support for both an HTML parser and an XML parser. However, in Microsoft Internet Explorer versions 8.0 and earlier, the XML parser cannot directly create an object model for rendering as an HTML page. The XML structure can, however be processed with an XSLT file to create a stream which in turn be parsed using the HTML parser to create a object model that can be rendered.

Starting with Internet Explorer 9 Platform Preview, XHTML supplied using an XML content type can be parsed directly in the same way as the other modern browsers.

When their XML parsers detect that their input files are not XML well-formed, some browsers display an error message, and others show the page as constructed up to the point where the error was detected and some offer the user the opportunity to have the file re-parsed using their HTML parser.

Validators and Document Type Definitions

HTML and XHTML files can begin with a Document Type Definition (DTD) declaration which indicates the language and serialization that is being used in the document. Validators, such as the one at http://validator.w3.org/ use this information to match the language and serialization used within the file against the rules defined in the DTD. It then reports errors based on where the rules in the DTD are violated by mark up in the file.

Not all HTML serialization and language rules can be described in a DTD, so validators only test for a subset of all the rules described by the specifications.

HTML 4.01 and XHTML 1.0 define Strict, Transitional, and Frameset DTDs which differ in the language elements and attributes that are permitted in compliant files.

Validators based on HTML5 such as validator.nu behave more like browsers, processing the page according to the HTTP content type and using a non DTD-based rule set so that they catch errors that cannot be described by DTDs.

Quirks, Limited Quirks, and Standards modes.

Browsers do not validate the files sent to them. Nor do they use any DTD declaration to determine the language or serialization of the file. However, they do use it to guess the era in which the page was created, and therefore the likely parsing and rendering behaviour the author would have expected of a browser at that time. Accordingly, they define three parsing and rendering modes, known as Quirks mode, Limited Quirks (or Almost Standards) mode and Standards mode.

Any file served using an XML content type is always processed in standards mode. For files parsed using the HTML parser, if there is no DTD provided or the DTD is determined to be very old, browsers use their quirks mode. Broadly speaking, HTML 4.01 and XHTML files processed as text/html will be processed with limited quirks mode if they contain a transitional DTD and with standards mode if using a strict DTD.

Where the DTD is not recognised, the mode is determined by a complex set of rules. One special case is where the public and system identifiers are omitted and the declaration is simply <!DOCTYPE html>. This is known to be the shortest doctype declaration where current browsers will treat the file as standards mode. For that reason, it is the declaration specified to be used for HTML5 compliant files.