<p>If you do have to worry about <code><script></code> tags and the like then you'll need something a bit more powerful then regular expressions because you need to track state, omething more like a Context Free Grammar (CFG). Althought you might be able to accomplish it with 'Left To Right' or non-greedy matching.</p> Get Element value with minidom with Python

HTTPUtility.HTMLEncode() method, detailed here:

public static void HtmlEncode(
string s,
TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
166352 次浏览

It should just be

name[0].firstChild.nodeValue

Probably something like this if it's the text part you want...

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')


print " ".join(t.nodeValue for t in name[0].childNodes if t.nodeType == t.TEXT_NODE)

The text part of a node is considered a node in itself placed as a child-node of the one you asked for. Thus you will want to go through all its children and find all child nodes that are text nodes. A node can have several text nodes; eg.

<name>
blabla
<somestuff>asdf</somestuff>
znylpx
</name>

You want both 'blabla' and 'znylpx'; hence the " ".join(). You might want to replace the space with a newline or so, or perhaps by nothing.

I know this question is pretty old now, but I thought you might have an easier time with ElementTree

from xml.etree import ElementTree as ET
import datetime


f = ET.XML(data)


for element in f:
if element.tag == "currentTime":
# Handle time data was pulled
currentTime = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
if element.tag == "cachedUntil":
# Handle time until next allowed update
cachedUntil = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
if element.tag == "result":
# Process list of skills
pass

The MIT licensed HtmlAgilityPack has in one of its samples a method that converts from HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(string html);

I know that's not super specific, but I just discovered it, and so far it's a lot easier to get my head around than the minidom (since so many nodes are essentially white space).

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello world!
datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S") if element.tag == "result": # Process list of skills

For instance, you have the tag name and the actual text together, just as you'd probably expect:

>>> element[0]
<Element currentTime at 40984d0>
>>> element[0].tag
'currentTime'
>>> element[0].text
'2010-04-12 02:45:45'e
pass

you can use something like this.It worked out for me

doc = parse('C:\\eve.xml')
my_node_list = doc.getElementsByTagName("name")
my_n_node = my_node_list[0]
my_child = my_n_node.firstChild
my_text = my_child.data
print my_text
var yourTextString = yourHtmlString.StripHTML();

There not a method with the name 'ConvertToPlainText' in the HtmlAgilityPack but you can convert a html string to CLEAR string with :

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace("&nbsp", "");

I use this extension method to convert html formated comments to plain text so it will be displayed correctly on a crystal report, and it works perfect!

Here is my solution:

public string StripHTML(string html)
{
if (string.IsNullOrWhiteSpace(html)) return "";


// could be stored in static variable
var regex = new Regex("<[^>]+>|\\s{2}", RegexOptions.IgnoreCase);
return System.Web.HttpUtility.HtmlDecode(regex.Replace(html, ""));
}

The above answer is correct, namely:

name[0].firstChild.nodeValue

Example:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

However for me, like others, my value was further down the tree:

name[0].firstChild.firstChild.nodeValue
Nodes) )

To find this I used the following:

def scandown( elements, indent ):
for el in elements:
print("   " * indent + "nodeName: " + str(el.nodeName) )
print("   " * indent + "nodeValue: " + str(el.nodeValue) )
print("   " * indent + "childNodes: " + str(el.childNodes) )
scandown(el.childNodes, indent + 1)


scandown( doc.getElementsByTagName('text'), 0 )
scandown(el.childNodes, indent + 1)

Running this for my simple SVG file created with Inkscape this gave me:

nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c6d0>]
nodeName: tspan
nodeValue: None
childNodes: [<DOM Text node "'MY STRING'">]
nodeName: #text
nodeValue: MY STRING
childNodes: ()
nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c800>]
nodeName: tspan
nodeValue: None
childNodes: [<DOM Text node "'MY WORDS'">]
nodeName: #text
nodeValue: MY WORDS
childNodes: ()
scandown( doc.getElementsByTagName('text'), 0 )

Running this for my simple SVG file created with Inkscape this gave me:

nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c6d0>]
nodeName: tspan
nodeValue: None
childNodes: [<DOM Text node "'MY STRING'">]
nodeName: #text
nodeValue: MY STRING
childNodes: ()
nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c800>]
nodeName: tspan
nodeValue: None
childNodes: [<DOM Text node "'MY WORDS'">]
nodeName: #text
nodeValue: MY WORDS
childNodes: ()

I used xml.dom.minidom, the various fields are explained on this page, MiniDom Python.

}

What does generator comprehension do? How does it work? I couldn't find a tutorial about it.

break; } } } internal class PreceedingDomTextInfo { public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) { IsFirstTextOfDocWritten = isFirstTextOfDocWritten; } public bool WritePrecedingWhiteSpace { get; set; } public bool LastCharWasSpace { get; set; } public readonly BoolWrapper IsFirstTextOfDocWritten; public int ListIndex { get; set; } } internal class BoolWrapper { public BoolWrapper() { } public bool Value { get; set; } public static implicit operator bool(BoolWrapper boolWrapper) { return boolWrapper.Value; } public static implicit operator BoolWrapper(bool boolWrapper) { return new BoolWrapper { Value = boolWrapper }; } } }

It's a tree, and there may be nested elements. Try:

def innerText(self, sep=''):
t = ""
for curNode in self.childNodes:
if (curNode.nodeType == Node.TEXT_NODE):
t += sep + curNode.nodeValue
elif (curNode.nodeType == Node.ELEMENT_NODE):
t += sep + curNode.innerText(sep=sep)
return t