PHP "pretty print" HTML (not Tidy)

Jack Sleight picture Jack Sleight · Apr 20, 2009 · Viewed 25.3k times · Source

I'm using the DOM extension in PHP to build some HTML documents, and I want the output to be formatted nicely (with new lines and indentation) so that it's readable, however, from the many tests I've done:

  1. "formatOutput = true" doesn't work at all with saveHTML(), only saveXML()
  2. Even if I used saveXML(), it still only works on elements created via the DOM, not elements that are included with loadHTML(), even with "preserveWhiteSpace = false"

If anyone knows differently I'd really like to know how they got it to work.

So, I have a DOM document, and I'm using saveHTML() to output the HTML. As it's coming from the DOM I know it is valid, there's no need to "Tidy" or validate it in any way.

I'm simply looking for a way to get nicely formatted output from the output I receive from the DOM extension.

NB. As you may have guessed, I don't want to use the Tidy extension as a) it does a lot more that I need it too (the markup is already valid) and b) it actually makes changes to the HTML content (such as the HTML 5 doctype and some elements).

Follow Up:

OK, with the help of the answer below I've worked out why the DOM extension wasn't working. Although the given example works, it still wasn't working with my code. With the help of this comment I found that if you have any text nodes where isWhitespaceInElementContent() is true no formatting will be applied beyond that point. This happens regardless of whether or not preserveWhiteSpace is false. The solution is to remove all of these nodes (although I'm not sure if this may have adverse effects on the actual content).

Answer

stefs picture stefs · Apr 20, 2009

you're right, there seems to be no indentation for HTML (others are also confused). XML works, even with loaded code.

<?php
function tidyHTML($buffer) {
    // load our document into a DOM object
    $dom = new DOMDocument();
    // we want nice output
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($buffer);
    $dom->formatOutput = true;
    return($dom->saveHTML());
}

// start output buffering, using our nice
// callback function to format the output.
ob_start("tidyHTML");

?>
<html>
    <head>
    <title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
<?php
// this will be called implicitly, but we'll
// call it manually to illustrate the point.
ob_end_flush();
?>

result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo">
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>

the same with saveXML() ...

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>foo bar</title>
    <meta name="bar" value="foo"/>
  </head>
  <body>
    <h1>bar foo</h1>
    <p>It's like comparing apples to oranges.</p>
  </body>
</html>

probably forgot to set preserveWhiteSpace=false before loadHTML?

disclaimer: i stole most of the demo code from tyson clugg/php manual comments. lazy me.


UPDATE: i now remember some years ago i tried the same thing and ran into the same problem. i fixed this by applying a dirty workaround (wasn't performance critical): i just somehow converted around between SimpleXML and DOM until the problem vanished. i suppose the conversion got rid of those nodes. maybe load with dom, import with simplexml_import_dom, then output the string, parse this with DOM again and then printed it pretty. as far as i remember this worked (but it was really slow).