I am working on a project that will involve parsing HTML.
After searching around, I found two likely options: BeautifulSoup and lxml.html.
Is there any reason to prefer one over the other? I used lxml for XML some time ago and feel I would be more comfortable with it; however, BeautifulSoup seems to be much more common.
I know I should use the one that works for me, but I was looking for personal experiences with both.
The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.
Edit:
This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
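For anyone wondering what "lxml as the internal parser" looks like in practice, here is a minimal sketch. It assumes the `bs4` package is installed, and ideally `lxml` as well. The sample markup and the helper name `first_paragraph_text` are made up for illustration. If lxml is missing, BeautifulSoup raises `FeatureNotFound`, so the sketch falls back to the standard library's `html.parser`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, just for illustration.
SAMPLE = "<html><body><p class='intro'>Hello, <b>world</b></p></body></html>"


def first_paragraph_text(markup, parser="lxml"):
    """Return the text of the first <p> element, or None if there is none."""
    try:
        # The second argument selects the underlying parser.
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # lxml is not installed; fall back to the built-in parser.
        soup = BeautifulSoup(markup, "html.parser")
    paragraph = soup.find("p")
    return paragraph.get_text() if paragraph is not None else None


if __name__ == "__main__":
    print(first_paragraph_text(SAMPLE))
```

The parser is only a constructor argument, so the rest of the BeautifulSoup code stays the same whichever backend ends up doing the parsing.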