Remove all JavaScript from an HTML page

user1049097 · Nov 28, 2011

I've tried using the Sanitize gem to clean a string which contains the HTML of a website.

It removed only the <script> tags themselves, leaving the JavaScript that was inside them in the output.

What can I use to remove the JavaScript from a page?

Answer

Phrogz · Nov 28, 2011
require 'open-uri'      # included with Ruby; only needed to load HTML from a URL
require 'nokogiri'      # gem install nokogiri   read more at http://nokogiri.org

html = URI.open('http://stackoverflow.com')          # Fetch the page (Kernel#open no longer opens URLs on Ruby 3.0+)
doc = Nokogiri.HTML(html)                            # Parse the document

doc.css('script').remove                             # Remove <script>…</script>
puts doc                                             # Source w/o script blocks

doc.xpath("//@*[starts-with(name(),'on')]").remove   # Remove on____ attributes
puts doc                                             # Source w/o scripts or on____ handlers (javascript: URLs remain)