BeautifulSoup - Search by text inside a tag

Observe the following problem:

import re
from bs4 import BeautifulSoup as BS


soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")


# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)


soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")


# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces

>>> a2 = soup.find(
...     'a',
...     href="/customer-menu/1/accounts/1/update"
... )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the docs, Beautiful Soup uses the regular expression's match function, not its search function, so I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n')  # Returns None


pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n')  # Returns MatchObject

Alright, that looks good. Let's try it with soup:

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")


soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

EDIT:

My solution, based on geckon's answer: I implemented these helpers:

import re


MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)




def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # element.find(text=...) looks at the strings nested inside element
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined
        raise ValueError(
            "Too many matches:\n" + "\n".join(str(m) for m in matches)
        )
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')


The problem is that an <a> tag with an <i> tag inside doesn't have the string attribute you expect it to have. First, let's look at what the text argument of find() does.

NOTE: text is the old name of this argument; since Beautiful Soup 4.4.0 it is called string.

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's look at what a Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None. find() therefore has nothing to match the pattern against and returns None.
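A minimal sketch of the difference, using shortened markup and the built-in html.parser (both are assumptions for illustration; the question does not name a parser):

```python
from bs4 import BeautifulSoup

# One <a> with a single text child, one with an <i> tag plus text.
plain = BeautifulSoup('<a href="/x">Edit</a>', "html.parser")
nested = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
)

plain_string = plain.a.string      # the only child is a NavigableString
nested_string = nested.a.string    # two children (<i> and text), so None
nested_text = nested.a.get_text()  # joins every string nested inside <a>

print(repr(plain_string))
print(repr(nested_string))
print(repr(nested_text))
```

This is why filters based on .text or get_text() still see "Edit" in the nested case, while the text/string argument does not.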

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS


soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")


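links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")


thelink = None  # stays None if no link contains "Edit"
for link in links:
    # link.find(text=...) searches the strings nested anywhere inside the link
    if link.find(text=re.compile("Edit")):
        thelink = link
        break


print(thelink)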

I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.

You can pass .find a function that returns True if the tag's text contains "Edit":

In [51]: def Edit_in_text(tag):
....:     return tag.name == 'a' and 'Edit' in tag.text
....:


In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>

EDIT:

You can use the .get_text() method instead of .text in your function, which gives the same result:

def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()

Or in one line, using a lambda:

soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
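If the same kind of filter is needed for several tags or strings, it can be wrapped in a small factory. The sketch below is only an illustration: tag_with_text is a hypothetical helper name, the second link is invented sample markup, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup


def tag_with_text(name, needle):
    """Build a filter for find()/find_all(): a tag called `name`
    whose full text (including nested tags) contains `needle`."""
    def matches(tag):
        return tag.name == name and needle in tag.get_text()
    return matches


soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
<i class="fa fa-trash"></i> Delete
</a>
""", "html.parser")

# The function filter can be combined with attribute filters such as href.
link = soup.find(
    tag_with_text("a", "Edit"),
    href="/customer-menu/1/accounts/1/update",
)
missing = soup.find(tag_with_text("a", "Archive"))

print(link["href"])
print(missing)
```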

With soupsieve 2.1.0 or later, you can use the :-soup-contains CSS pseudo-class selector to target a node's text. It replaces the deprecated :contains() form.

from bs4 import BeautifulSoup as BS


soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
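The same selector also appears to cover the nested case from the question, because :-soup-contains looks at the text of an element and its descendants. A short sketch, assuming soupsieve 2.1.0 or later is installed; the second link and the html.parser choice are illustrative assumptions:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
<i class="fa fa-trash"></i> Delete
</a>
""", "html.parser")

# Attribute selector and text selector combined in one CSS expression.
edit_link = soup.select_one('a[href$="/update"]:-soup-contains("Edit")')

# All links whose text (including nested tags) contains "Edit".
hrefs = [a["href"] for a in soup.select('a:-soup-contains("Edit")')]

print(edit_link["href"])
print(hrefs)
```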
  1. Method 1: check the tag's string (a plain string pattern matches only when the tag's .string is exactly that string)
    pattern = 'Edit'
    a2 = soup.find_all('a', string=pattern)[0]
  2. Method 2: use a lambda to test every element
    a2 = soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)

Good Luck