How do I pretty-print HTML with Nokogiri?

Jarsen · Dec 14, 2009 · Viewed 27.8k times

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Answer

Phrogz · Oct 20, 2011

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
# Parse as XML; noblanks drops whitespace-only text nodes during parsing
doc = Nokogiri::XML(html, &:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml(indent: 3, indent_text: ".")
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>