How to parse HTML from eMail body - Python

skme picture skme · Jul 14, 2013 · Viewed 7.4k times · Source

I'm trying to parse incoming emails in python. I get emails which are part text part HTML. I want to get the HTML part and find a table in the HTML.

I tried using beatifulsoup. But when trying the next code, the bs only get the first "" part and not all the HTML part :

# connecting to the gmail imap server
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user,pwd)
# use m.list() to get all the mailboxes, "INBOX" to get only inbox
m.select("INBOX")
resp, items = m.search(None, '(UNSEEN)') # you could filter using the IMAP rules here (check http://www.example-code.com/csharp/imap-search-critera.asp)
items = items[0].split() # getting the mails id

for emailid in items:
    # getting the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    soup = bs(text)

How can I use 'bs' for the entire HTML part? Or, is there any other way to parse out an html table from the email body?

'bs' seems to be the best for me, cause I want to find a specific HTML Body which contains specific keyword, and 'bs' search can retrieve the entire table and let me iterate in it.

Answer

skme picture skme · Aug 11, 2013

Apparently, I used a wrong parser.

Once I changed into 'lxml' parser, it worked just fine.

need to change the next line:

soup = bs(text,"lxml");