How to find tags with only certain attributes - BeautifulSoup

How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?

For example, I want to find all <td valign="top"> tags.

The following code: raw_card_data = soup.fetch('td', {'valign':re.compile('top')})

gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top

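I also tried: raw_card_data = soup.findAll(re.compile('<td valign="top">')) and this returns nothing (probably because the regex is wrong).

I was wondering if there is a way in BeautifulSoup to say "find <td> tags whose only attribute is valign:top".

UPDATE For example, if an HTML document contained the following <td> tags: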

<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />

I would want only the first <td> tag (<td valign="top">) to be returned.


Just pass it as an argument to findAll:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body></html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
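The session above uses the old BeautifulSoup 3 import. As a minimal sketch, assuming the newer bs4 package is installed, the same keyword filter can be written with find_all. The sample markup and the variable names (html, matches, texts) are made up for illustration. Note that the filter matches on the valign value only, so a tag with extra attributes is returned too:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, extended with a third cell that has two attributes.
html = """
<table>
<tr><td>First!</td>
<td valign="top">Second!</td>
<td width="580" valign="top">Third!</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword filter: matches every <td> whose valign is "top",
# regardless of any other attributes the tag carries.
matches = soup.find_all("td", valign="top")
texts = [td.get_text() for td in matches]
print(texts)
```

This is why the keyword form alone does not answer the "only this attribute" part of the question.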

As explained in the BeautifulSoup documentation, you can use this:

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

EDIT:

To return tags that have only the valign="top" attribute, you can check the length of the tag's attrs property:

from BeautifulSoup import BeautifulSoup


html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'


soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})


for result in results:
    if len(result.attrs) == 1:
        print result

That returns:

<td valign="top">.....</td>

You can use a lambda function in findAll, as explained in the documentation. So in your case, to search for td tags that have only valign = "top", use the following:

# tag.get avoids a KeyError when the tag's single attribute is not valign
td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
    len(tag.attrs) == 1 and
    tag.get("valign") == "top")
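A self-contained sketch of the function-filter approach, assuming bs4 (where tag.attrs is a dict). The helper name only_valign_top and the extra width-only cell are hypothetical, added to show that a tag with a single, different attribute is rejected rather than raising an error:

```python
from bs4 import BeautifulSoup

# Sample cells from the question, plus one cell whose only attribute is width.
html = (
    '<td valign="top">.....</td>'
    '<td width="580" valign="top">.......</td>'
    '<td width="580">...</td>'
    '<td>.....</td>'
)
soup = BeautifulSoup(html, "html.parser")


def only_valign_top(tag):
    """Return True for <td> tags whose attributes are exactly valign="top"."""
    # Comparing the whole attrs dict checks the name, the value,
    # and the absence of any other attribute in one step.
    return tag.name == "td" and tag.attrs == {"valign": "top"}


exact = soup.find_all(only_valign_top)
print(exact)
```

Comparing the full dict is one possible design; checking len(tag.attrs) == 1 together with tag.get("valign") gives the same result here.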

The easiest way to do this is with the newer CSS-style select method:

soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
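As a rough sketch (assuming bs4 with its CSS selector support available), the attribute selector matches on valign only, so it still returns the cell that also has a width attribute. One way to narrow it to the sole-attribute case is an ordinary Python filter afterwards. The names selected and sole are illustrative:

```python
from bs4 import BeautifulSoup

# Sample cells from the question.
html = (
    '<td valign="top">.....</td>'
    '<td width="580" valign="top">.......</td>'
    '<td>.....</td>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: every <td> whose valign equals "top".
selected = soup.select('td[valign="top"]')

# Keep only the tags that carry no attribute other than valign.
sole = [td for td in selected if len(td.attrs) == 1]

print(len(selected), len(sole))
```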

If you want to search by attribute name only, whatever its value:

from bs4 import BeautifulSoup
import re


# html is assumed to be an HTTP response object (for example, from requests)
soup = BeautifulSoup(html.text, 'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})

As per Steve Lorimer, it is better to pass True instead of a regex:

results = soup.findAll("td", {"valign" : True})
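A small comparison of the two forms, assuming bs4. The sample cells and variable names are made up. Both filters should select the same tags here, since the regex matches any value and True matches any tag that has the attribute at all:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical cells: two with a valign attribute, one without.
html = (
    '<td valign="top">a</td>'
    '<td valign="bottom">b</td>'
    '<td width="580">c</td>'
)
soup = BeautifulSoup(html, "html.parser")

# Regex form: the pattern matches any attribute value.
by_regex = soup.find_all("td", {"valign": re.compile(r".*")})

# True form: matches any tag that has the attribute, whatever its value.
by_true = soup.find_all("td", {"valign": True})

regex_texts = [td.get_text() for td in by_regex]
true_texts = [td.get_text() for td in by_true]
print(regex_texts, true_texts)
```

The True form states the intent directly and avoids compiling a pattern.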

Adding a combination of Chris Redford's and Amr's answers, you can also search for an attribute name with any value using the select command:

from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')

Find using an attribute in any tag:

<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})


<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
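A runnable sketch of the attrs form, assuming bs4. The attrs dictionary is needed for names like data-sort because a hyphenated name cannot be written as a Python keyword argument. The table content below is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical table row mixing <th> and <td> tags.
html = (
    '<table><tr>'
    '<th class="team" data-sort="team">Team</th>'
    '<th data-sort="points">Points</th>'
    '<td class="team">Ajax</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# No tag name given: any tag with class "team" matches.
by_class = soup.find_all(attrs={"class": "team"})

# data-sort is not a valid Python identifier, so it goes in the attrs dict.
by_data = soup.find_all(attrs={"data-sort": "team"})

class_names = [tag.name for tag in by_class]
data_texts = [tag.get_text() for tag in by_data]
print(class_names, data_texts)
```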
 

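If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the attribute, just set it to True.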

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})

This returns every td tag that has a valign attribute, whatever its value. It is useful when your project pulls information from a widely used tag such as div and you only care about the ones carrying a specific attribute.