Strip text from HTML document using Ruby

davidsmalley · Sep 30, 2009

There are lots of examples of how to strip HTML tags from a document using Ruby. Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.

What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.

I considered looping through the document and setting inner_html to nil on each element, but that would really have to be done in reverse: the first element (the root) has an inner_html containing the entire rest of the document. Ideally I'd start at the innermost elements and set inner_html to nil while moving up through the ancestors.

Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.

Answer

andre-r · Sep 30, 2009

This works too:

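require 'nokogiri'
doc = Nokogiri::HTML(your_html)   # your_html: the HTML source as a String
doc.xpath("//text()").remove      # select every text node and unlink it from the tree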