I am pretty new to Python and am trying to parse email from Gmail using Python's imaplib and email modules. It works well for the most part, but I am having trouble with email attachments.
I would like to extract all of the plaintext from the email, ignore any HTML that may be included as an alternative content type, and detach and save all other attachments. I have been trying the following:
    # ...imaplib connection and mailbox selection...
    typ, msg_data = c.fetch(num, '(RFC822)')
    email_body = msg_data[0][1]
    mail = email.message_from_string(email_body)
    body = ''
    for part in mail.walk():
        # Collect every text/plain part, wherever it sits in the message
        if part.get_content_type() == 'text/plain':
            body = body + '\n' + part.get_payload()
        else:
            continue
This was my original attempt to take just the plaintext portions of an email, but when someone sends an email with a text file attached, the contents of that file also end up in the 'body' variable above.
Can someone tell me how to extract the plaintext portions of an email while ignoring the secondary HTML that is sometimes present, and also save all other types of file attachments as files? I apologize if this doesn't make a lot of sense; I will update the question with more clarification if needed.
If you just need to keep text attachments out of the body variable with what you have there, it should be as simple as this:
    mail = email.message_from_string(email_body)
    body = ''
    for part in mail.walk():
        c_type = part.get_content_type()
        c_disp = part.get('Content-Disposition')
        # A text/plain part with no Content-Disposition header is body text, not an attachment
        if c_type == 'text/plain' and c_disp is None:
            body = body + '\n' + part.get_payload()
        else:
            continue
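Then, if the Content-Disposition indicates that the part is an attachment, you should be able to use part.get_filename() and part.get_payload() to handle the file (part.get_payload(decode=True) returns the decoded bytes when the attachment is base64- or quoted-printable-encoded). I don't know whether any of this varies between mail servers, but it's basically what I've used in the past to interface with mine.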
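For what it's worth, here is a minimal sketch of how the two pieces might fit together: plain-text parts without a Content-Disposition go into the body, HTML alternatives are skipped, and anything carrying a filename is written to disk. It assumes Python 3, where the fetched message is bytes and is parsed with email.message_from_bytes rather than email.message_from_string. The function name split_message and the save_dir parameter are made up for illustration; they are not part of imaplib or email.

```python
import email
import os


def split_message(raw_bytes, save_dir):
    """Return (body_text, saved_paths) for one raw RFC 822 message.

    Hypothetical helper: raw_bytes would be msg_data[0][1] from the fetch call.
    """
    mail = email.message_from_bytes(raw_bytes)
    body_parts = []
    saved = []
    for part in mail.walk():
        # Multipart containers only hold other parts; skip them.
        if part.get_content_maintype() == 'multipart':
            continue
        filename = part.get_filename()
        c_disp = part.get('Content-Disposition')
        if part.get_content_type() == 'text/plain' and c_disp is None:
            # decode=True undoes base64 / quoted-printable and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'utf-8'
            body_parts.append(payload.decode(charset, errors='replace'))
        elif filename:
            # basename() drops any directory components in the sender-supplied name.
            path = os.path.join(save_dir, os.path.basename(filename))
            with open(path, 'wb') as fh:
                fh.write(part.get_payload(decode=True))
            saved.append(path)
        # Everything else (e.g. an inline text/html alternative) is ignored.
    return '\n'.join(body_parts), saved
```

With the fetch from the question, this would be called as something like body, files = split_message(msg_data[0][1], '/some/existing/dir'). Note that a part marked Content-Disposition: inline without a filename is neither added to the body nor saved here, so you may want to loosen the check if your senders' mail clients produce those.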