Parsing HTML in Python

Andy Baker · Apr 4, 2009 · Viewed 48.7k times

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and it's now deprecated.

I would prefer it if it could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean.

Answer

Andrei Taranchenko · Apr 4, 2009

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, I believe).
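
For the native-parser route, here is a minimal sketch. It assumes the parser meant above is the standard library's HTMLParser class (the HTMLParser module in Python 2, html.parser in Python 3, which is what the sketch uses). LinkCollector and extract_links are hypothetical names made up for illustration. The parser is event-driven and does not build a tree, so unclosed tags and stray end tags generally pass through without raising.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the markup through the parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Slightly malformed input: unclosed <p> tags and a stray </div>.
    sample = '<p>Unclosed <a href="http://example.com/a">one<p><a href="/b">two</a></div>'
    print(extract_links(sample))
```

Since no tree is built, any nesting logic (for example, tracking which element you are currently inside) has to be kept in your own handler state, which is the same low-level trade-off as sgmllib.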