Get all text inside a tag in lxml

I'd like to write a code snippet that grabs all of the text inside the <content> tag, in lxml, in all three instances below, including the inner tags. I've tried tostring(getchildren()), but that misses the text between the tags. I haven't had much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"


<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"




<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
             list(chain(*([c.text, tostring(c, encoding='unicode'), c.tail]
                          for c in node.getchildren()))) +
             [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Does text_content() do what you need?
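For instance (a minimal sketch: `text_content()` is available on lxml.html elements, and it returns only the text, dropping the markup, so it handles example 2 but not examples 1 and 3):

```python
import lxml.html

# Parse the question's third sample as an HTML fragment.
node = lxml.html.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>")

# text_content() concatenates all text nodes but strips the tags.
print(node.text_content())  # Text outside tag Text inside tag
```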

import urllib2
from lxml import etree
url = 'some_url'

Getting the url:

test = urllib2.urlopen(url)
page = test.read()

Getting all of the html code, including the table tag:

tree = etree.HTML(page)

The xpath selector (note that xpath() returns a list, so we take the first element):

table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the html code of the table; this did the job for me.

So you can extract a tag's text content with an XPath text() query, and tags including their content using tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)
text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

This last line with strip() is not pretty (strip() removes a set of characters, not a substring, so it only works here by coincidence), but it works.

If this is an <a> tag, you can try:

node.values()

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
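For the question's third example this gives (note that itertext() yields only the text nodes, so the <div> tags themselves are dropped):

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>")

# itertext() walks the subtree and yields every text and tail fragment.
print(''.join(node.itertext()))  # Text outside tag Text inside tag
```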

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
-         list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
+         list(chain(*([tostring(c)] for c in node.getchildren()))) +
          [node.tail])

it seems to avoid the duplication he refers to.

Defining stringify_children this way may be less complicated:

from lxml import etree


def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

import re
from lxml import etree


node = etree.fromstring("""
<content>Text before inner tag
<div>Text
<em>inside</em>
tag
</div>
Text after inner tag
</content>""")


print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z",
                etree.tostring(node, encoding='unicode'),
                re.DOTALL).group(1))

A version of albertov's stringify_children that solves the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail)
                    for child in node.getchildren())),
            (node.tail,)) if chunk)

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html


def stringify_children(node):
    """Given an lxml tag, return its contents as a string

    >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
    >>> node = lxml.html.fragment_fromstring(html)
    >>> stringify_children(node)
    '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node and attacks the problem from a different angle than the other working solutions.

The following snippet, which uses Python generators, works perfectly and is very efficient:

''.join(node.itertext()).strip()

One of the simplest code snippets that actually worked for me, as per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is a node/tag whose complete text you are trying to read. Note that it doesn't get rid of script and style tags though.
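For example, on the question's third sample (note that, like text_content(), method="text" drops the inner tags):

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>")

# method="text" serializes only the text nodes, not the markup.
print(etree.tostring(node, method="text", encoding='unicode'))
# Text outside tag Text inside tag
```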

Here is a working solution. We can get the content with its parent tag and then cut the parent tag from the output.

import re
from lxml import etree


def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    content_with_parent = etree.tostring(parent_element, encoding='unicode')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE | re.UNICODE)
        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip()  # remove whitespace on the margins

    start_tag, content_without_parent, end_tag = re.findall(
        RE_CUT, content_with_parent, flags=re.UNICODE | re.MULTILINE | re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match end tag while getting content with tags.')

    return content_without_parent

parent_element must be of type Element.

Note that if you want the text decoded (no numeric HTML entities left in it), leave the html_entities parameter as False.
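The core of the trick is the RE_CUT regular expression; here it is in isolation (a minimal sketch, with encoding='unicode' assumed so tostring() returns str):

```python
import re

from lxml import etree

# Capture the start tag name, the inner content, and the end tag name.
RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'

s = etree.tostring(
    etree.fromstring(
        "<content>Text outside tag <div>Text inside tag</div></content>"),
    encoding='unicode').strip()

# re.DOTALL lets (.*) span newlines; the greedy match stops at the
# outermost closing tag.
match = re.findall(RE_CUT, s, flags=re.DOTALL)[0]
print(match[1])  # Text outside tag <div>Text inside tag</div>
```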

lxml has a method for that (on lxml.html elements):

node.text_content()

Just a quick enhancement, since the answer has already been given: if you want to clean the inside text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
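On a whitespace-heavy version of the question's markup, this normalizes the text fragments into a single clean line:

```python
from lxml import etree

node = etree.fromstring(
    "<content>\n  Text outside tag\n  <div>Text inside tag</div>\n</content>")

# Strip each text fragment, then join with single spaces and trim the result.
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
print(clean_string)  # Text outside tag Text inside tag
```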