Remove all javascript tags and style tags from html with python and the lxml module

john-charles picture john-charles · Dec 18, 2011 · Viewed 18.9k times · Source

I am parsing an html document using the http://lxml.de/ library. So far I have figured out how to strip tags from an html document In lxml, how do I remove a tag but retain all contents? but the method described in that post leaves all the text, stripping the tags with out removing the actual script. I have also found a class reference to lxml.html.clean.Cleaner http://lxml.de/api/lxml.html.clean.Cleaner-class.html but this is clear as mud as to how to actually use the class to clean the document. Any help, perhaps a short example would be helpful to me!

Answer

aculich picture aculich · Dec 18, 2011

Below is an example to do what you want. For an HTML document, Cleaner is a better general solution to the problem than using strip_elements, because in cases like this you want to strip out more than just the <script> tag; you also want to get rid of things like onclick=function() attributes on other tags.

#!/usr/bin/env python

import lxml
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True # This is True because we want to activate the javascript filter
cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))

You can get a list of the options you can set in the lxml.html.clean.Cleaner documentation; some options you can just set to True or False (the default) and others take a list like:

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note that the difference between kill vs remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).