Python : How to parse the Body from a raw email , given that raw email does not have a "Body" tag or anything

user2621078 picture user2621078 · Jul 26, 2013 · Viewed 135.1k times · Source

It seems easy to get the

From
To
Subject

etc via

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

assuming that "a" is the raw-email string which looks something like this.

a = """From [email protected] Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <[email protected]>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: [email protected]
Subject: oooooooooooooooo
To: [email protected]
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""

THE QUESTION

how do you get the Body of this email via python ?

So far this is the only code i am aware of but i have yet to test it.

if email.is_multipart():
    for part in email.get_payload():
        print part.get_payload()
else:
    print email.get_payload()

is this the correct way ?

or maybe there is something simpler such as...

import email
b = email.message_from_string(a)
bbb = b['body']

?

Answer

Todor Minakov picture Todor Minakov · Sep 29, 2015

To be highly positive you work with the actual email body (yet, still with the possibility you're not parsing the right part), you have to skip attachments, and focus on the plain or html part (depending on your needs) for further processing.

As the before-mentioned attachments can and very often are of text/plain or text/html part, this non-bullet-proof sample skips those by checking the content-disposition header:

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously on mime parts, and get_payload(decode=True) does the dirty work on decoding base64 etc. for you.

Some background - as I implied, the wonderful world of MIME emails presents a lot of pitfalls of "wrongly" finding the message body. In the simplest case it's in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world - it's often surrounded in multipart/alternative, related, mixed etc. content. Wikipedia describes it tightly - MIME, but considering all these cases below are valid - and common - one has to consider safety nets all around:

Very common - pretty much what you get in normal editor (Gmail,Outlook) sending formatted text with an attachment:

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel

Relatively simple - just alternative representation:

multipart/alternative
 |
 +- text/plain
 +- text/html

For good or bad, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg

Hope this helps a bit.

P.S. My point is don't approach email lightly - it bites when you least expect it :)