How to remove an element in lxml

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et


xml="""
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
"""


tree=et.fromstring(xml)


for bad in tree.xpath("//fruit[@state='rotten']"):
    # remove this element from the tree
    pass


print(et.tostring(tree, pretty_print=True).decode())

I would like it to print:

<groceries>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without storing a temporary variable and printing to it manually, like this:

newxml = "<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
    # encoding="unicode" makes tostring return str rather than bytes
    newxml += et.tostring(elt, encoding="unicode")


newxml += "</groceries>"

Use the remove method of an lxml element, called on the parent of the element you want to delete:

tree=et.fromstring(xml)


for bad in tree.xpath("//fruit[@state='rotten']"):
    # grab the parent of the element and call remove directly on it
    bad.getparent().remove(bad)


print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, this one works even when the elements to remove are not direct children of the root node of your XML.

You're looking for the remove function. Call the remove method of the element's parent and pass it the subelement to remove.

import lxml.etree as et


xml="""
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<punnet>
<fruit state="rotten">strawberry</fruit>
<fruit state="fresh">blueberry</fruit>
</punnet>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
"""


tree=et.fromstring(xml)


for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)


print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
<fruit state="fresh">pear</fruit>
<punnet>
<fruit state="fresh">blueberry</fruit>
</punnet>
<fruit state="fresh">starfruit</fruit>
<fruit state="fresh">peach</fruit>
</groceries>

I ran into this situation:

<div>
<script>
some code
</script>
text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me: its with_tail=(bool) parameter controls whether the text that follows the element is removed as well.

I still don't know whether it can be combined with an XPath filter on the tag, so I am posting this for information only.

Here is the doc:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::

    strip_elements(some_element,
        'simpletagname',             # non-namespaced tag
        '{http://some/ns}tagname',   # namespaced tag
        '{http://some/other/ns}*',   # any tag from a namespace
        lxml.etree.Comment           # comments
        )
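
To make the effect of with_tail concrete, here is a minimal sketch that applies strip_elements to the div/script case above. It assumes the snippet is parsed as XML with lxml.etree; the variable names (source, without_tail, with_tail_kept) are only illustrative.

```python
from lxml import etree

source = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the text following <script> is removed too.
div = etree.fromstring(source)
etree.strip_elements(div, "script")
without_tail = etree.tostring(div, encoding="unicode")

# with_tail=False: <script> is removed, the text following it stays in <div>.
div = etree.fromstring(source)
etree.strip_elements(div, "script", with_tail=False)
with_tail_kept = etree.tostring(div, encoding="unicode")

print(without_tail)
print(with_tail_kept)
```

Note that strip_elements selects by tag name only, so it cannot filter on an attribute value the way the XPath expression in the question does.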

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

But it removes the element including its tail, which is a problem if you are processing mixed-content documents like HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

Becomes

<div></div>

which is probably not always what you want :) I wrote a helper function that removes just the element and keeps its tail:

def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy in lxml
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)


for bad in tree.xpath("//fruit[@state='rotten']"):
    remove_element(bad)

This way it will keep the tail text:

<div> Hello!</div>
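
As a quick way to see the difference, the sketch below runs both approaches on a mixed-content snippet in which the removed element has a preceding sibling. The helper name remove_keep_tail and the sample markup are assumptions made for this illustration, not part of lxml.

```python
from lxml import etree


def remove_keep_tail(el):
    """Hypothetical helper: remove el but re-attach its tail text."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            # Tail text moves onto the previous sibling.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: tail text joins the parent's leading text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)


source = '<div><b>Hi</b><fruit state="rotten">avocado</fruit> Hello!</div>'

# Plain remove(): the tail " Hello!" disappears with the element.
plain = etree.fromstring(source)
for bad in plain.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)
plain_result = etree.tostring(plain, encoding="unicode")

# Tail-preserving removal: " Hello!" ends up after <b>Hi</b>.
kept = etree.fromstring(source)
for bad in kept.xpath("//fruit[@state='rotten']"):
    remove_keep_tail(bad)
kept_result = etree.tostring(kept, encoding="unicode")

print(plain_result)
print(kept_result)
```

The explicit "is not None" test matters here: the previous sibling has no child elements, and lxml treats such an element as falsy.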

You could also use html from lxml to solve that:

from lxml import html


xml="""
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
"""


tree = html.fromstring(xml)


print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))


for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()


print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>




//AFTER
<groceries>


<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>


<fruit state="fresh">peach</fruit>
</groceries>

The remove function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the tail text. In the groceries example, preserving the tail text is superfluous because it contains only whitespace and a newline, which can be considered ignorable whitespace.

To remove an element (and its content) while preserving its tail, you can use the following function:

def remove_node(child, keep_content=False):
    """
    Remove an XML element, preserving its tail text.

    :param child: XML element to remove
    :param keep_content: ``True`` to keep child text and sub-elements.
    """
    parent = child.getparent()
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    if keep_content:
        # insert: child text
        child_text = child.text or u""
        if prev_node is None:
            parent.text = u"{0}{1}".format(parent_text, child_text) or None
        else:
            prev_tail = prev_node.tail or u""
            prev_node.tail = u"{0}{1}".format(prev_tail, child_text) or None
        # insert: child elements
        index = parent.index(child)
        parent[index:index] = child[:]
    # insert: child tail
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    child_tail = child.tail or u""
    if prev_node is None:
        parent.text = u"{0}{1}".format(parent_text, child_tail) or None
    else:
        prev_tail = prev_node.tail or u""
        prev_node.tail = u"{0}{1}".format(prev_tail, child_tail) or None
    # remove: child
    parent.remove(child)

Here is a demo:

from lxml import etree


tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1)


etree.dump(tree)
# <root>text  tail</root>

If you want to preserve the content, you can do:

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1, keep_content=True)


etree.dump(tree)
# <root>text before <bad>inner</bad> after tail</root>