如何在 C # 中将 HTML 转换为文本?

我正在寻找 C # 代码转换为纯文本的 HTML 文档。

我不是在寻找简单的标记剥离,而是一些将输出纯文本与 合情合理原始布局的保存。

输出应该是这样的:

W3C 上的 Html2Txt

我已经看过 HTML 敏捷包,但我不认为这是我需要的。还有人有其他建议吗?

编辑: 我刚刚从 CodePlex下载了 HTML 敏捷包,并运行了 Html2Txt 项目。多么令人失望(至少将 html 转换为文本的模块) !它所做的只是撕掉标签,压平桌子等等。输出与 Html2Txt@W3C 生成的输出完全不同。可惜这个消息来源似乎不可用。 我想看看是否有更“罐装”的解决方案。

编辑2: 谢谢大家的建议。苍蝇特警队告诉了我我想去的方向。我可以使用 System.Diagnostics.Process类运行带有“-dump”开关的 lynx.exe,将文本发送到标准输出,并使用 ProcessStartInfo.UseShellExecute = falseProcessStartInfo.RedirectStandardOutput = true捕获标准输出。我将用 C # 类包装这一切。这段代码只是偶尔被调用,所以我并不太关心生成一个新的进程与在代码中完成它的区别。而且,山猫跑得很快! !

130321 次浏览

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.

Have you tried http://www.aaronsw.com/2002/html2text/ it's Python, but open source.

I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/

I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..

http://www.codeplex.com/htmlagilitypack

Some sample on SO..

HTML Agility pack - parsing tables

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

You could use this:

 public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}

Updated

Thanks for the comments I have updated to improve this function

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);

This library is not free, this is commercial product and it is my own product.

I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first

You can use WebBrowser control to render in memory your html content. After LoadCompleted event fired...

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

In Genexus You can made with Regex

&pattern = '<[^>]+>'

&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")

In Genexus possiamo gestirlo con Regex,

Assuming you have well formed html, you could also maybe try an XSL transform.

Here's an example:

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;


class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}


public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}


public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}


static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}

Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on codeproject, which covers many conversion usecases:

Convert HTML to Plain Text

Yep, looks so big, but works fine.

Just a note about the HtmlAgilityPack for posterity. The project contains an example of parsing text to html, which, as noted by the OP, does not handle whitespace at all like anyone writing HTML would envisage. There are full-text rendering solutions out there, noted by others to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;


//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{


public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}


public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}


public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}


internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString =  "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>

...will be transformed into:

Whatever Inc.




Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:


1.  Please confirm this is your email by replying.
2.  Then perform this step.


Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:


*   a point.
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.


Sincerely,


The whatever.com team




Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.




Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:


Please confirm this is your email by replying.


Then perform this step.




Please solve this . Then, in any order, could you please:


a point.


another point, with a hyperlink.




Sincerely,




The whatever.com team


Ph: 000 000 000
mail: whatever st

If you are using .NET framework 4.5 you can use System.Net.WebUtility.HtmlDecode() which takes a HTML encoded string and returns a decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.

Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

 public string StripHTML(WebBrowser webp)
{
try
{
doc.execCommand("SelectAll", true, null);
IHTMLSelectionObject currentSelection = doc.selection;


if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";


}

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.

Instead I used that utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

This function convert "What You See in the browser" to plain text with line breaks. (If you want to see result in the browser just use commented return value)

public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}

Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.

var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;

I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.

Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:

var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();

Using the two together you can get the plain text from the HTML.