How to parse malformed HTML in Python

lorenzov · May 24, 2009 · Viewed 7.4k times

I need to browse the DOM tree of a parsed HTML document.

I'm running the HTML through uTidyLib before parsing the resulting string with lxml:

a = tidy.parseString(html_code, options)
dom = etree.fromstring(str(a))

Sometimes I get an error; it seems that tidylib is not always able to repair malformed HTML.

How can I parse every HTML file without getting an error? For files that cannot be repaired, parsing only part of the file is acceptable.

Answer

dbr · May 24, 2009

Beautiful Soup does a good job with invalid or broken HTML:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
>>> print soup.prettify()
<htm>
 <body>
  <table>
   <tr>
    <td>
     hi
    </td>
   </tr>
  </table>
 </body>
</htm>
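
For readers on Python 3 who cannot install a third-party parser, the same "never fail, keep whatever can be recovered" idea can be sketched with the standard library's html.parser, which does not raise on malformed markup. This is not Beautiful Soup and not the method used in the answer above; the Node class, TolerantTreeBuilder, and parse_html names are hypothetical helpers invented for this illustration, and the recovery rule (close the nearest matching open tag, ignore stray end tags) is a simplifying assumption rather than what a browser or Beautiful Soup does.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: a tag name, attributes, and children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings

    def find_all(self, tag):
        """Yield every descendant element named `tag`, in document order."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def text(self):
        """Concatenate all text found below this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TolerantTreeBuilder(HTMLParser):
    """Build a Node tree without raising on stray or missing end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("[document]")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; if nothing
        # matching is open, the stray end tag is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse_html(markup):
    """Parse possibly malformed HTML and return the root Node."""
    builder = TolerantTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# Misnested </tr></td>, an unclosed <p> and <b>, and no closing </html>.
broken = "<html><body><table><tr><td>hi</tr></td><p>unclosed <b>bold</body>"
dom = parse_html(broken)
cells = [td.text() for td in dom.find_all("td")]
print(cells)
```

Running this prints ['hi']: the misplaced </tr> closes both the row and the cell, the stray </td> is dropped, and the unclosed <p> and <b> are closed implicitly when </body> arrives. A dedicated library will usually produce a tree closer to what a browser builds, so this sketch is best treated as a fallback.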