How do you convert Html to plain text?

小开

Maybe you do want more than one billion records, but my point is that there is some limit on the number you want, and it is less than 18 quintillion. For the sake of stability, optimization, and possibly usability, I would suggest putting some meaningful limit on the query. This would also reduce confusion for anyone who has never seen that magical looking number, and have the added benefit of communicating at least how many records you are willing to handle at once.

If you really must get all 18 quintillion records from your database, maybe what you really want is to grab them in increments of 100 million and loop 184 billion times.

e:

I ran into a very similar issue when practicing LC#1321, in which I have to select all the dates but the first 6 dates are skipped.

MySQL 8.0 Reference Manual - ROW_NUMBER()

小开

I achieved this in MySQL with the help of ROW_NUMBER() window function and subquery. For example, the following query returns all the results with the first five rows skipped:

SELECT
fieldname1,
fieldname2
FROM(
SELECT
*,
ROW_NUMBER() OVER() row_num
FROM
mytable
) tmp
WHERE
row_num > 5;

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

You may need to add some more logics in the subquery, especially in OVER() to fit your need. In addition, RANK()/DENSE_RANK() window functions may be used instead of ROW_NUMBER() depending on your real offset logic.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

Reference:

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

小开

最佳答案

I am creating a GUI frontend for the Eve Online API in Python.

If you can use regular expressions there are many web pages out there with good info:

I have successfully pulled the XML data from their server.

http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx

I am trying to grab the value from a node called "name":

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')
print name

http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search

If you need the more complex behaviour of a CFG I would suggest using a third party tool, unfortunately I don't know of a good one to recommend.

小开

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

小开

HTTPUtility.HTMLEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

This seems to find the node, but the output is below:
[<DOM Element: name at 0x11e6d28>]
If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as < and > for HTTP transmission.

How could I get it to print the value of the node?

quivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as < and > for HTTP transmission.

HTTPUtility.HTMLEncode() method, detailed here:

public static void HtmlEncode(
string s,
TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();

小开

I know that's not super specific, but I just discovered it, and so far it's a lot easier to get my head around than the minidom (since so many nodes are essentially white space).

I had a similar case, what worked for me was:

For instance, you have the tag name and the actual text together, just as you'd probably expect:

>>> element[0]
<Element currentTime at 40984d0>
>>> element[0].tag
'currentTime'
>>> element[0].text
'2010-04-12 02:45:45'e

小开

name.firstChild.childNodes[0].data

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumple upon this question.

using System.Text.RegularExpressions;

XML is supposed to be simple and it really is and I don't know why python's minidom did it so complicated... but it's how it's made

Then...

private string StripHtml(string source)
{
string output;


//get rid of HTML tags
output = Regex.Replace(source, "<[^>]*>", string.Empty);


//get rid of multiple blank lines
output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);


return output;
}

小开

de>

I think the easiest way is to make a 'string' extension method (based on what user Richard have suggested):

using System;
using System.Text.RegularExpressions;


public static class StringHelpers
{
public static string StripHTML(this string HTMLText)
{
var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
return reg.Replace(HTMLText, "");
}
}

Then...

private string StripHtml(string source)
{
string output;


//get rid of HTML tags
output = Regex.Replace(source, "<[^>]*>", string.Empty);


//get rid of multiple blank lines
output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);


return output;
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();

lt;/span>";

I use this extension method to convert html formated comments to plain text so it will be displayed correctly on a crystal report, and it works perfect!

小开

Thats works for me. BUT I DONT FIND A METHOD WITH NAME 'ConvertToPlainText' IN 'HtmlAgilityPack'.

小开

Here is a slightly modified answer of Henrik's for multiple nodes (ie. when getElementsByTagName returns more than one instance)

images = xml.getElementsByTagName("imageUrl")
for i in images:
print " ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE)

I could not use HtmlAgilityPack, so I wrote a second best solution for myself

private static string HtmlToPlainText(string html)
{
const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);


var text = html;
//Decode html specific characters
text = System.Net.WebUtility.HtmlDecode(text);
//Remove tag whitespace/line breaks
text = tagWhiteSpaceRegex.Replace(text, "><");
//Replace <br /> with line breaks
text = lineBreakRegex.Replace(text, Environment.NewLine);
//Strip formatting
text = stripFormattingRegex.Replace(text, string.Empty);


return text;
}

小开

pFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing

The simplest way I found:

HtmlFilter.ConvertToPlainText(html);

const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: , , , , ,

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);

The dll can be found in folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);

In VS 2015, the dll also requires reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

小开

var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

Three Step Process for converting HTML into Plain Text

var text = html; //Decode html specific characters text = System.Net.WebUtility.HtmlDecode(text); //Remove tag whitespace/line breaks text = tagWhiteSpaceRegex.Replace(text, "><"); //Replace with line breaks

First You need to Install Nuget Package For HtmlAgilityPack text = lineBreakRegex.Replace(text, Environment.NewLine); Second Create This class

public class HtmlToText
{
public HtmlToText()
{
}


public string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);


StringWriter sw = new StringWriter();
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}


public string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);


StringWriter sw = new StringWriter();
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}


private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
foreach(HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText);
}
}


public void ConvertTo(HtmlNode node, TextWriter outText)
{
string html;
switch(node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;


case HtmlNodeType.Document:
ConvertContentTo(node, outText);
break;


case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
break;


// get text
html = ((HtmlTextNode)node).Text;


// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
break;


// check the text is meaningful and not a bunch of whitespaces
if (html.Trim().Length > 0)
{
outText.Write(HtmlEntity.DeEntitize(html));
}
break;


case HtmlNodeType.Element:
switch(node.Name)
{
case "p":
// treat paragraphs as crlf
outText.Write("\r\n");
break;
}


if (node.HasChildNodes)
{
ConvertContentTo(node, outText);
}
break;
}
}
}

//Strip formatting

By using above class with reference to Judah Himango's answer

text = stripFormattingRegex.Replace(text, string.Empty); return text;

Third you need to create the Object of above class and Use ConvertHtml(HTMLContent) Method for converting HTML into Plain Text rather than ConvertToPlainText(string html);

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

小开

The question has been answered, my contribution consists in clarifying one thing that may confuse beginners:

I used xml.dom.minidom, the various fields are explained on this page, MiniDom Python.

小开

Some of the suggested and correct answers used firstChild.data and others used firstChild.nodeValue instead. In case you are wondering what is the different between them, you should remember they do the same thing because nodeValue is just an alias for data.

I had the same question, just my html had a simple pre-known layout, like:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using such simple code:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

The reference to my statement can be found as a comment on the source code of minidom:

Which outputs:
abc
def

小开

#nodeValue is an alias for data

= new StringWriter()) {

Did not write but an using:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;


namespace foo {
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText {


public static string Convert(string path) {
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}


public static string ConvertHtml(string html) {
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}


public static string ConvertDoc(HtmlDocument doc) {
using (StringWriter sw = new StringWriter()) {
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}


internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
foreach (HtmlNode subnode in node.ChildNodes) {
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText) {
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
string html;
switch (node.NodeType) {
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style")) {
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html)) {
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0) {
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
html = html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name) {
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten) {
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href")) {
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if (textInfo.ListIndex > 0) {
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
} else {
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt")) {
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src")) {
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes) {
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
}
if (endElementString != null) {
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo {
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace { get; set; }
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper {
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper) {
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper) {
return new BoolWrapper { Value = boolWrapper };
}
}
}

小开

ConvertTo(doc.DocumentNode, sw);

It has limitation that not collapsing long inline whitespace, but it is definitely portable and respects layout like webbrowser.

static string HtmlToPlainText(string html) {
string buf;
string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
"fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
"noscript|ol|output|p|pre|section|table|tfoot|ul|video";


string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);


// Replace br tag to newline.
buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);


// (Optional) remove styles and scripts.
buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);


// Remove all tags.
buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);


// Replace HTML entities.
buf = WebUtility.HtmlDecode(buf);
return buf;
}

小开

sw.Flush();

I have faced similar problem and found best solution . Below code works perfect for me.

  private string ConvertHtml_Totext(string source)
{
try
{
string result;


// Remove HTML Development formatting
// Replace line breaks with space
// because browsers inserts space
result = source.Replace("\r", " ");
// Replace line breaks with space
// because browsers inserts space
result = result.Replace("\n", " ");
// Remove step-formatting
result = result.Replace("\t", string.Empty);
// Remove repeating spaces because browsers ignore them
result = System.Text.RegularExpressions.Regex.Replace(result,
@"( )+", " ");


// Remove the header (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*head([^>])*>","<head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*head( )*>)","</head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<head>).*(</head>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// remove all scripts (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*script([^>])*>","<script>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*script( )*>)","</script>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
//result = System.Text.RegularExpressions.Regex.Replace(result,
//         @"(<script>)([^(<script>\.</script>)])*(</script>)",
//         string.Empty,
//         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<script>).*(</script>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// remove all styles (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*style([^>])*>","<style>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*style( )*>)","</style>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<style>).*(</style>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// insert tabs in spaces of <td> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*td([^>])*>","\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// insert line breaks in places of <BR> and <LI> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*br( )*>","\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*li( )*>","\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// insert line paragraphs (double line breaks) in place
// if <P>, <DIV> and <TR> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*div([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*tr([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*p([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// Remove remaining tags like <a>, links, images,
// comments etc - anything that's enclosed inside < >
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<[^>]*>",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// replace special characters:
result = System.Text.RegularExpressions.Regex.Replace(result,
@" "," ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


result = System.Text.RegularExpressions.Regex.Replace(result,
@"&bull;"," * ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&lsaquo;","<",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&rsaquo;",">",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&trade;","(tm)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&frasl;","/",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&lt;","<",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&gt;",">",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&copy;","(c)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&reg;","(r)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove all others. More can be added, see
// http://hotwired.lycos.com/webmonkey/reference/special_characters/
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&(.{2,6});", string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// for testing
//System.Text.RegularExpressions.Regex.Replace(result,
//       this.txtRegex.Text,string.Empty,
//       System.Text.RegularExpressions.RegexOptions.IgnoreCase);


// make line breaking consistent
result = result.Replace("\n", "\r");


// Remove extra line breaks and tabs:
// replace over 2 breaks with 2 and over 4 tabs with 4.
// Prepare first to remove any whitespaces in between
// the escaped characters and remove redundant tabs in between line breaks
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)( )+(\r)","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\t)( )+(\t)","\t\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\t)( )+(\r)","\t\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)( )+(\t)","\r\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove redundant tabs
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)(\t)+(\r)","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove multiple tabs following a line break with just one tab
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)(\t)+","\r\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Initial replacement target string for line breaks
string breaks = "\r\r\r";
// Initial replacement target string for tabs
string tabs = "\t\t\t\t\t";
for (int index=0; index<result.Length; index++)
{
result = result.Replace(breaks, "\r\r");
result = result.Replace(tabs, "\t\t\t\t");
breaks = breaks + "\r";
tabs = tabs + "\t";
}


// That's it.
return result;
}
catch
{
MessageBox.Show("Error");
return source;
}

return sw.ToString(); }

}

Escape characters such as \n and \r had to be removed first because they cause regexes to cease working as expected.

internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {

Moreover, to make the result string display correctly in the textbox, one might need to split it up and set textbox's Lines property instead of assigning to Text property.

foreach (HtmlNode subnode in node.ChildNodes) {

this.txtResult.Lines = ConvertTo(subnode, outText, textInfo); StripHTML(this.txtSource.Text).Split("\r".ToCharArray());

}

Source : https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

小开

}

I think it has a simple answer:

public string RemoveHTMLTags(string HTMLCode)
{
string str=System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
return str;
}

小开

public static void ConvertTo(HtmlNode node, TextWriter outText) {

For anyone looking for an exact solution to the OP question for a textual abbreviation of a given html document, without newlines and HTML tags, please find the solution below.

ConvertTo(node, outText, new PreceedingDomTextInfo(false)); }

Like with every proposed solution, there are some assumptions with the code below:

script or style tags should not contain script and style tags as a string html; part of script
only major inline elements will be inlined without case HtmlNodeType.Comment: space, i.e. hello should output hello. List of inline // don't output comments tags: https://www.w3schools.com/htmL/html_blocks.asp

break;

Considering the above, the following string extension with compiled regular expressions will output expected plain text with regard to html escaped characters and null on null input.

public static class StringExtensions
{
public static string ConvertToPlain(this string html)
{
if (html == null)
{
return html;
}


html = scriptRegex.Replace(html, string.Empty);
html = inlineTagRegex.Replace(html, string.Empty);
html = tagRegex.Replace(html, " ");
html = HttpUtility.HtmlDecode(html);
html = multiWhitespaceRegex.Replace(html, " ");


return html.Trim();
}


private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
}