from lxml.html.clean import clean_html, Cleaner
def clean(text):
try:
cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
remove_tags = ['a', 'li', 'td'])
print (len(cleaner.clean_html(text))- len(text))
return cleaner.clean_html(text)
except:
print 'Error in clean_html'
print sys.exc_info()
return text
I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags
and links=True
any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?
solution from David concatenates the text with no separator:
import lxml.html
document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print document.text_content()
but this one helped me - concatenation the way I needed:
from lxml import etree
print "\n".join(etree.XPath("//text()")(document))