How to parse HTML from eMail body - Python

Question 1

python html email beautifulsoup email-parsing

skme · Jul 14, 2013 · Viewed 7.4k times · Source

Answer

Apparently I was using the wrong parser.

Once I switched to the 'lxml' parser, it worked just fine.

The following line needs to change:

soup = bs(text, "lxml")
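To illustrate, here is a minimal sketch of naming the parser explicitly and then pulling out a table by keyword. The sample HTML, the keyword "Invoice", and the helper names (`make_soup`, `find_table_with_keyword`, `table_rows`) are made up for this example and are not from the original code. The fallback to `html.parser` is also an assumption, added only so the sketch still runs where lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample standing in for the HTML part of an email.
SAMPLE_HTML = """
<html><body>
<p>Weekly report</p>
<table><tr><td>Other</td><td>1</td></tr></table>
<table>
  <tr><th>Item</th><th>Qty</th></tr>
  <tr><td>Invoice</td><td>3</td></tr>
</table>
</body></html>
"""


def make_soup(html):
    """Parse with an explicitly named parser instead of relying on the default."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Assumption: if lxml is not installed, use the parser bundled with Python.
        return BeautifulSoup(html, "html.parser")


def find_table_with_keyword(soup, keyword):
    """Return the first <table> whose text contains keyword, or None."""
    for table in soup.find_all("table"):
        if keyword in table.get_text():
            return table
    return None


def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


soup = make_soup(SAMPLE_HTML)
table = find_table_with_keyword(soup, "Invoice")
rows = table_rows(table)
print(rows)
```

Passing the parser name on every call keeps the result the same across machines, since the default parser depends on which libraries happen to be installed.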

Question 2

I'm trying to parse incoming emails in python. I get emails which are part text part HTML. I want to get the HTML part and find a table in the HTML.

I tried using BeautifulSoup, but with the following code bs only picks up the first "" part and not the whole HTML part:

import imaplib
from bs4 import BeautifulSoup as bs  # assumed import; the original snippet did not show where bs comes from

# connect to the Gmail IMAP server
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)  # user and pwd are defined elsewhere
# use m.list() to get all the mailboxes; "INBOX" selects only the inbox
m.select("INBOX")
# you could filter using the IMAP search rules here (see http://www.example-code.com/csharp/imap-search-critera.asp)
resp, items = m.search(None, '(UNSEEN)')
items = items[0].split()  # get the mail ids

for emailid in items:
    # get the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    soup = bs(text)

How can I make bs parse the entire HTML part? Or is there another way to extract an HTML table from the email body?

bs seems like the best fit for me because I want to find a specific HTML body that contains a specific keyword, and a bs search can retrieve the entire table and let me iterate over it.
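For the "part text, part HTML" side of the problem, one possible approach is to let the standard library's email package pick out the HTML part before handing it to bs. This is only a sketch: it assumes the whole message is available as bytes (for example, fetched with '(RFC822)' rather than '(UID BODY[TEXT])', which omits the headers the MIME parser needs). The helper name `extract_html` and the sample message are made up for illustration.

```python
import email
from email import policy
from email.message import EmailMessage


def extract_html(raw_bytes):
    """Return the text/html part of a raw message as a string, or None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() searches the MIME structure for the preferred body type.
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


# Build a hypothetical two-part message so the sketch runs without a mail server.
sample = EmailMessage()
sample["Subject"] = "Report"
sample["From"] = "sender@example.com"
sample["To"] = "me@example.com"
sample.set_content("Plain-text version of the report.")
sample.add_alternative(
    "<html><body><table><tr><td>Invoice</td></tr></table></body></html>",
    subtype="html",
)

# With imaplib, raw would instead be data[0][1] from an '(RFC822)' fetch.
raw = sample.as_bytes()
html = extract_html(raw)
print(html)
```

The returned string can then be passed to bs together with an explicit parser name.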
