I need to browse the DOM tree of a parsed HTML document.
I'm using uTidyLib before parsing the string with lxml
a = tidy.parseString(html_code, options) dom = etree.fromstring(str(a))
sometimes I get an error, it seems that tidylib is not able to repair malformed html.
how can I parse every HTML file without getting an error (parsing only some parts of files that can not be repaired)?
Beautiful Soup does a good job with invalid/broken HTML
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
>>> print soup.prettify()
<htm>
<body>
<table>
<tr>
<td>
hi
</td>
</tr>
</table>
</body>
</htm>