Python: How to parse the body from a raw email, given that the raw email has no "Body" tag or anything like it

It seems easy to get the

From
To
Subject

etc. via

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

Assume "a" is the raw email string, which looks like this.

a = """From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: oooooooooooooooo
To: ooo@a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"


This is a multi-part message in MIME format.


--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit


ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo


--bound1374805739--"""

THE QUESTION

How do I get the body of this email in Python?

So far this is the only code I am aware of, but I have not tested it yet.

# b is the message object returned by email.message_from_string(a)
if b.is_multipart():
    for part in b.get_payload():
        print(part.get_payload())
else:
    print(b.get_payload())

Is this the correct way?

Or maybe there is something simpler, such as...

import email
b = email.message_from_string(a)
bbb = b['body']


Use Message.get_payload

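import email

b = email.message_from_string(a)
if b.is_multipart():
    # for a multipart message, get_payload() returns a list of sub-messages
    for payload in b.get_payload():
        # if payload.is_multipart(): ...
        print(payload.get_payload())
else:
    # for a single-part message, get_payload() returns the body string
    print(b.get_payload())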
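The commented-out is_multipart() check above hints that a part can itself be multipart. Here is a minimal sketch of handling that with recursion, assuming Python 3. The helper name collect_payloads and the sample message are made up for illustration and are not part of the email package.

```python
import email

# Hypothetical sample: a multipart/mixed message wrapping a multipart/alternative.
RAW = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: nested example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain

plain body
--inner
Content-Type: text/html

<p>html body</p>
--inner--
--outer--
"""


def collect_payloads(msg):
    """Return (content_type, payload) pairs for every leaf part, depth first."""
    if msg.is_multipart():
        found = []
        for part in msg.get_payload():  # a list of sub-messages
            found.extend(collect_payloads(part))
        return found
    # leaf part: get_payload() returns the (still transfer-encoded) string
    return [(msg.get_content_type(), msg.get_payload())]


leaves = collect_payloads(email.message_from_string(RAW))
for content_type, payload in leaves:
    print(content_type, repr(payload))
```

This only flattens the tree. Deciding which leaf is "the" body is a separate question.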

There is no b['body'] in Python. You have to use get_payload().

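if isinstance(mailEntity.get_payload(), list):
    # multipart message: get_payload() returns a list of parts
    for eachPayload in mailEntity.get_payload():
        # do the things you want here;
        # the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # this means there is only a single part (e.g. text/plain);
    # use mailEntity.get_payload() to get the body
    print(mailEntity.get_payload())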

Good Luck.

To be reasonably sure you are working with the actual email body (and even then you may still be parsing the wrong part), you have to skip attachments and focus on the plain or HTML part (depending on your needs) for further processing.

Since attachments can be, and very often are, of type text/plain or text/html themselves, this non-bullet-proof sample skips them by checking the Content-Disposition header:

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
else:
    # not multipart - i.e. plain text, no attachments, keeping fingers crossed
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously over MIME parts, and get_payload(decode=True) does the dirty work of decoding base64 etc. for you.
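One thing worth knowing when using this in Python 3: get_payload(decode=True) undoes the transfer encoding and hands back bytes, so you still have to decode it to get a str. A minimal sketch with a made-up base64 message. Falling back to UTF-8 when the part declares no charset is my assumption, not something the library decides for you.

```python
import email

# Hypothetical single-part message whose body is base64 encoded.
RAW = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: encoded example
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQ=
"""

msg = email.message_from_string(RAW)

raw_payload = msg.get_payload()          # str, still base64 text
decoded = msg.get_payload(decode=True)   # bytes, transfer encoding removed

# Use the declared charset; the "utf-8" fallback is an assumption.
charset = msg.get_content_charset() or "utf-8"
text = decoded.decode(charset, errors="replace")
print(text)
```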

Some background: as I implied, the wonderful world of MIME emails presents a lot of pitfalls for "wrongly" finding the message body. In the simplest case it's in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world: the body is often wrapped in multipart/alternative, related, mixed etc. content. Wikipedia's MIME article describes it concisely, but since all the cases below are valid, and common, one has to put safety nets all around:

Very common, and pretty much what you get from a normal editor (Gmail, Outlook) when sending formatted text with an attachment:

multipart/mixed
|
+- multipart/related
|   |
|   +- multipart/alternative
|   |   |
|   |   +- text/plain
|   |   +- text/html
|   |
|   +- image/png
|
+-- application/msexcel

Relatively simple - just alternative representation:

multipart/alternative
|
+- text/plain
+- text/html

For good or bad, this structure is also valid:

multipart/alternative
|
+- text/plain
+- multipart/related
    |
    +- text/html
    +- image/jpeg
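If you want to see which of these layouts a given message actually has, a few lines of recursion will print it. This is a rough sketch. structure_lines is a hypothetical helper name, and the demo message is built in code only to have something to inspect. It works the same on a message parsed with email.message_from_string.

```python
from email.message import EmailMessage


def structure_lines(msg, depth=0):
    """Return one indented content-type line per MIME part."""
    lines = ["    " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for part in msg.get_payload():  # sub-messages of a multipart container
            lines.extend(structure_lines(part, depth + 1))
    return lines


# Build a small multipart/alternative message to inspect.
demo = EmailMessage()
demo["Subject"] = "structure demo"
demo.set_content("plain text version")
demo.add_alternative("<p>HTML version</p>", subtype="html")

print("\n".join(structure_lines(demo)))
```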

Hope this helps a bit.

P.S. My point is don't approach email lightly - it bites when you least expect it :)

There is a very good, well-documented package available for parsing email contents (imported as mailparser).

import mailparser


mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)

How to Use:

mail.attachments: list of all attachments
mail.body
mail.to

If emails is a pandas DataFrame and emails.message is the column holding the raw email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)


def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs


import email
# Parse the emails into a list email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)


# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages


emails.head()
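One caveat about split_email_addresses above: splitting on ',' goes wrong when a display name itself contains a comma. The standard library's email.utils.getaddresses copes with that. A possible drop-in sketch follows. parse_addresses is a made-up name, and returning only the bare addresses (dropping display names) is my choice, not something the original code did.

```python
from email.utils import getaddresses


def parse_addresses(header_value):
    """Return a frozenset of bare addresses from a To/From/Cc header value."""
    if not header_value:
        return None
    pairs = getaddresses([header_value])  # list of (display_name, address)
    return frozenset(addr for _name, addr in pairs if addr)


print(parse_addresses('"Doe, John" <john@example.com>, jane@example.com'))
```

It could be passed to .map() in place of split_email_addresses.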

Here's the code that works for me every time (for Outlook emails):

# Read the subject and body of emails in a folder (or subfolder)

import win32com.client  # provided by the pywin32 package

# create the MAPI namespace object
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# get to the desired folder (MyEmail@xyz.com is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['MyEmail@xyz.com'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:  # the for loop already iterates over the mails
    if message.Unread == True:    # gets only 'Unread' emails
        subject_content = message.subject  # subject line of the mail
        body_content = message.body        # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False    # mark the mail as 'Read'

Python 3.6+ provides built-in convenience methods to find and decode the plain text body, as in @Todor Minakov's answer. You can use the EmailMessage.get_body() and get_content() methods:

import email
import email.policy

# s is the raw email string
msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
print(body)

Note this will give None if there is no (obvious) plain text body part.

If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage factory:

import email
import email.policy
import mailbox

# mboxfile is the path to the mbox file
mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
    ...

Note you must pass email.policy.default as the policy, since it's not the default...
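To make that last point concrete, here is a small sketch comparing the two. Without a policy argument you get a legacy Message object, which has no get_body(). With policy=email.policy.default you get an EmailMessage. The sample string is made up.

```python
import email
import email.policy
from email.message import EmailMessage, Message

# Hypothetical minimal message: two headers, a blank line, then the body.
RAW = "From: root@a1.local.tld\nSubject: policy demo\n\nhello\n"

legacy = email.message_from_string(RAW)  # compat32 policy is used when none is given
modern = email.message_from_string(RAW, policy=email.policy.default)

print(type(legacy).__name__, hasattr(legacy, "get_body"))
print(type(modern).__name__, hasattr(modern, "get_body"))
```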

A small update based on Doctor J's answer. It parses the plaintext portion (if any) of the email message. You may want to try getting the HTML part as well, since the (bad) habit of sending HTML-only mails is increasingly popular.

from email import message_from_string
from email import policy

raw_string = raw_string.strip()  # where raw_string is the email message (DATA)
msg = message_from_string(raw_string, policy=policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
print(body)

When working with email DATA as strings, it's necessary to strip leading/trailing whitespace. I wasted a lot of time before I figured that out!
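For the HTML fallback mentioned above, get_body() accepts a preference list, so one call can try text/plain first and text/html second. A minimal sketch, assuming Python 3.6+. get_text_body is a hypothetical helper name and the sample message is made up.

```python
from email import message_from_string, policy

# Hypothetical HTML-only message.
RAW = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: html only
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<p>only html here</p>
"""


def get_text_body(raw_string):
    """Return (subtype, text), preferring text/plain over text/html."""
    msg = message_from_string(raw_string.strip(), policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return None, None  # no obvious text body part
    return body.get_content_subtype(), body.get_content()


subtype, text = get_text_body(RAW)
print(subtype, text)
```

The returned subtype tells you whether you got plain text or HTML that still needs its tags stripped.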