从引用的回复中解析电子邮件内容

小开

There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common and parse new patterns as you come across them.

Keep in mind that some people insert replies inside the quoted text (My boss for example answers questions on the same line as I asked them) so whatever you do, you might lose some information you would have liked to keep.

小开

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, it's parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as To, From, CC, etc... lines) and you're done.

If the messages you are working with do not have the headers, you can also use similarity matching to determine what parts of an email are the reply text. In this case you're stuck with doing similarity matching to determine the text that is repeated. In this case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're doing to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:

a line (as seen in outlook).
Angle Brackets
"---Original Message---"
"On such-and-such day, so-and-so wrote:"

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!

小开

First of all, this is a tricky task.

You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from outlook, thunderbird, Gmail, Apple mail, and mail.ru.

I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.

new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);
new Regex("from:\\s*$", RegexOptions.IgnoreCase);

To remove quotation in the end:

new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);

Here is my small collection of test responses (samples divided by --- ):

From: test@test.com [mailto:test@test.com]
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <test@test.com>


>  text
----
test@test.com wrote:
> text
----
test@test.com wrote:         text
text
----
2009/1/13 <test@test.com>


>  text
----
test@test.com wrote:         text
text
----
2009/1/13 <test@test.com>


> text
> text
----
2009/1/13 <test@test.com>


> text
> text
----
test@test.com wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, test@test.com <test@test.com> wrote:


> text
> text

小开

If you control the original message (e.g. notifications from a web application), you can put a distinct, identifiable header in place, and use that as the delimiter for the original post.

小开

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:

def extract_reply(text, address)
regex_arr = [
Regexp.new("From:\s*" + Regexp.escape(address), Regexp::IGNORECASE),
Regexp.new("<" + Regexp.escape(address) + ">", Regexp::IGNORECASE),
Regexp.new(Regexp.escape(address) + "\s+wrote:", Regexp::IGNORECASE),
Regexp.new("^.*On.*(\n)?wrote:$", Regexp::IGNORECASE),
Regexp.new("-+original\s+message-+\s*$", Regexp::IGNORECASE),
Regexp.new("from:\s*$", Regexp::IGNORECASE)
]


text_length = text.length
#calculates the matching regex closest to top of page
index = regex_arr.inject(text_length) do |min, regex|
[(text.index(regex) || text_length), min].min
end


text[0, index].strip
end

It's worked pretty well so far.

小开

By far the easiest way to do this is by placing a marker in your content, such as:

--- Please reply above this line ---

As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.

Facebook can do this, but unless your project has a big budget, you probably can't.

Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.

Likewise if the email client uses a different date string, or doesn't include a date string the regex will fail.

小开

Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.

public string ExtractReply(string text, string address)
{
var regexes = new List<Regex>() { new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
new Regex(Regex.Escape(address) + "\\s+wrote:", RegexOptions.IgnoreCase),
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase),
new Regex("from:\\s*$", RegexOptions.IgnoreCase),
new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
};


var index = text.Length;


foreach(var regex in regexes){
var match = regex.Match(text);


if(match.Success && match.Index < index)
index = match.Index;
}


return text.Substring(0, index).Trim();
}

小开

This is a good solution. Found it after searching for so long.

One addition, as mentioned above, this is case wise, so the above expressions did not correctly parse my gmail and outlook (2010) responses, for which I added the following two Regex(s). Let me know for any issues.

//Works for Gmail
new Regex("\\n.*On.*<(\\r\\n)?" + Regex.Escape(address) + "(\\r\\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),

Cheers

小开

It is old post, however, not sure if you are aware github has a Ruby lib extracting the reply. If you use .NET, I have a .NET one at https://github.com/EricJWHuang/EmailReplyParser

小开

If you use SigParser.com's API, it will give you an array of all the broken out emails in a reply chain from a single email text string. So if there are 10 emails, you'll get the text for all 10 of the emails.

You can view the detailed API spec here.

https://api.sigparser.com/

小开

It should be fairly easy these days, given text/html content type works for you (with Outlook being an exception; see details below). Here is a table with with the real testing results of parsing options in various desktop email clients:

Mail client	Reply message format	HTML can be parsed easily and reliably	HTML tags to be deleted	Plain text quote marker
web.de	always html	yes	`<div name="quote">`	- (always html)
Thunderbird	same as in the original message	yes	`<div class="moz-cite-prefix">`, `<blockquote type="cite">`	"On 26.10.2022 12:37, John Doe wrote:"
Gmail	both	yes	`<div class="gmail_quote">`	"On Thu, Oct 27, 2022 at 1:39 PM John Doe john@inbox.test wrote:"
Outlook 2016, 2019	same as in the original message	Probably impossible due to use of some weird Word processor	unknown	Plain text-only message: "-----Original Message-----", multipart: 3 blank lines with some space followed by "From: John Doe john@inbox.test"
Apple	unknown	yes	`<blockquote type="cite">`	"> On 22. Dec 2021, at 12:50, John Doe john@inbox.test wrote:"