Using BeautifulSoup to find an HTML tag that contains certain text

I'm trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So, the element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

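I'm able to get all of the matching text (see the line above). But I want the parent element of the text to be what matches, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all of the h2 elements to be returned, not the matching text.

Any ideas?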

from BeautifulSoup import BeautifulSoup
import re


html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""


soup = BeautifulSoup(html_text)




# Each match is a text node; its .parent is the enclosing tag.
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
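Because the search is by text only, the loop returns the parent of every matching text node, whatever its tag, so the `<h1>` in the sample matches as well. If only `h2` parents are wanted, one option is to filter on the parent's name. The following is a minimal sketch rather than part of the original answer: it assumes Python 3 with Beautiful Soup 4 (`bs4`, version 4.4 or later, where `string=` is the newer spelling of `text=`), and the name `h2_parents` is just illustrative.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Find every matching text node, then keep only those whose parent is an <h2>.
h2_parents = [
    node.parent
    for node in soup.find_all(string=pattern)
    if node.parent.name == "h2"
]

for tag in h2_parents:
    print(tag)
```

Each entry in `h2_parents` is a tag object, so it can serve directly as a starting point for further navigation.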

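BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to [a list of] BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.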

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re


html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""


soup = BeautifulSoup(html_text)


# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')


pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}


print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
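The session above uses BeautifulSoup 3 on Python 2. As a rough cross-check of the same distinction under Beautiful Soup 4, the sketch below inspects the types directly. It assumes Python 3 and `bs4` 4.4 or later (for the `string=` argument); the variable names `text_node` and `tag_node` are only illustrative.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

text_node = soup.find(string=pattern)  # search by text only
tag_node = soup.find("h2")             # search by tag name

# A text-only search yields the string itself, not the enclosing tag.
print(isinstance(text_node, NavigableString))
print(isinstance(tag_node, Tag))

# The string's .parent is the very tag object found by name.
print(text_node.parent is tag_node)
```

All three checks should print `True`, which is why `.parent` is the natural way to get from a matched string back to its element.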

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>]
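Since the goal in the question was to use the matched elements as a starting point for traversing the tree, here is a small sketch of what that could look like with bs4. It assumes Python 3 and `bs4` 4.4 or later, and uses `string=`, the newer name for the `text=` argument. The `<p>` elements in the sample are invented purely for illustration and are not from the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup; the <p> elements are made up for this sketch.
html_text = """
<h2>this is cool #12345678901</h2>
<p>details for the first heading</p>
<h2>this is nothing</h2>
<p>details for the second heading</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Combining a tag name with string= returns the <h2> tags themselves.
headings = soup.find_all("h2", string=re.compile(r" #\S{11}"))

for heading in headings:
    # Navigate from the matched heading to the paragraph that follows it.
    following = heading.find_next_sibling("p")
    print(heading.get_text(), "->", following.get_text())
```

Here `find_next_sibling` is one of several navigation methods available on a tag; `find_parent` or `find_next` could be used the same way, depending on the document's structure.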