Testing whether an attribute is present on a tag in BeautifulSoup

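I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.

For example, for each <script> tag, if the attribute for is present, do something; otherwise, if the attribute bar is present, do something else.

Here is what I am doing currently: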

outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})

This way I get only the <script> tags that have the for attribute, and I lose all the others (those without the for attribute).

If I understand correctly, you just want all the script tags, and then to check each one for certain attributes?

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()

For future reference, has_key has been deprecated in Beautiful Soup 4. You now need to use has_attr:

scriptTags = outputDoc.find_all('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
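
To tie this back to the question, here is a minimal sketch of branching on for versus bar with has_attr. The sample HTML and the classify helper are made up for illustration, and html.parser is assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = """
<script for="window" event="onload">a()</script>
<script bar="1">b()</script>
<script src="app.js"></script>
"""

soup = BeautifulSoup(html, "html.parser")


def classify(tag):
    """Hypothetical helper: report which of the two attributes was found."""
    if tag.has_attr("for"):
        return "has for"
    elif tag.has_attr("bar"):
        return "has bar"
    return "neither"


# Fetch every <script> once, then decide per tag.
labels = [classify(script) for script in soup.find_all("script")]
print(labels)
```

Fetching all the script tags first and testing each one afterwards keeps the tags that lack for, which is what the filtered findAll call in the question loses.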

If you only need the tag(s) that have certain attribute(s), you can use a lambda:

soup = bs4.BeautifulSoup(YOUR_CONTENT)

  • Tags with attribute

tags = soup.find_all(lambda tag: 'src' in tag.attrs)

OR

tags = soup.find_all(lambda tag: tag.has_attr('src'))

  • Specific tag with attribute

tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)

  • Etc ...

Thought it might be useful.
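
A function filter is also handy for conditions that a plain attribute filter cannot express, such as "has bar but not for". The sketch below assumes made-up sample HTML and a hypothetical helper name, bar_without_for:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = (
    '<script for="w"></script>'
    '<script bar="1"></script>'
    '<script for="w" bar="2"></script>'
    '<div bar="3"></div>'
)
soup = BeautifulSoup(html, "html.parser")


def bar_without_for(tag):
    """Match <script> tags that carry 'bar' but do not carry 'for'."""
    return (
        tag.name == "script"
        and tag.has_attr("bar")
        and not tag.has_attr("for")
    )


# find_all calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(bar_without_for)
print([m["bar"] for m in matches])
```

A named function does the same job as the lambda and is easier to read once the condition grows past one clause.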

You don't need any lambdas to filter by attribute, you can simply use some_attribute=True in find or find_all.

script_tags = soup.find_all('script', some_attribute=True)

# or

script_tags = soup.find_all('script', {"some-data-attribute": True})
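
One caveat for the attribute in the question: for is a reserved word in Python, so it cannot be written as a keyword argument, and names containing a hyphen cannot either. The dictionary form covers both cases. A small sketch, with made-up sample HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = (
    '<script for="window"></script>'
    '<script data-id="7"></script>'
    '<script></script>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("script", for=True) would be a SyntaxError,
# so the attribute name goes into the attrs dictionary instead.
with_for = soup.find_all("script", attrs={"for": True})

# Hyphenated names are not valid Python identifiers either.
with_data_id = soup.find_all("script", attrs={"data-id": True})

print(len(with_for), with_data_id[0]["data-id"])
```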

Here are more examples with other approaches as well:

soup = bs4.BeautifulSoup(html)

# Find all tags with a specific attribute.
tags = soup.find_all(src=True)
tags = soup.select("[src]")

# Find all meta tags with either a name or an http-equiv attribute.
soup.select("meta[name],meta[http-equiv]")

# Find any tag with a name or a src attribute.
soup.select("[name], [src]")

# Find the first script with a src attribute.
tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")

# Find all tags with a name attribute beginning with foo
# or a src beginning with /path.
# (A value that is not a plain identifier, such as /path, must be quoted.)
soup.select('[name^=foo], [src^="/path"]')

# Find all tags with a name attribute that contains foo
# or a src that contains whatever.
soup.select("[name*=foo], [src*=whatever]")

# Find all tags with a name attribute that ends with foo
# or a src that ends with whatever.
soup.select("[name$=foo], [src$=whatever]")

You can also use regular expressions with find or find_all:

import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with
soup.find_all("script", src=re.compile("whatever$"))
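
As a runnable sketch of the regular-expression filters, the example below uses made-up sample HTML and made-up patterns. Beautiful Soup applies the pattern with its search() method, so the ^ and $ anchors are what restrict a match to the start or end of the value, and tags without a src attribute are never matched:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = (
    '<script src="/static/app.js"></script>'
    '<script src="https://cdn.example.com/lib.min.js"></script>'
    '<script>inline()</script>'
)
soup = BeautifulSoup(html, "html.parser")

# src starts with /static/
local = soup.find_all("script", src=re.compile(r"^/static/"))

# src ends with .min.js (dots escaped so they match literally)
minified = soup.find_all("script", src=re.compile(r"\.min\.js$"))

print([t["src"] for t in local])
print([t["src"] for t in minified])
```

If the text to match comes from user input or contains punctuation, passing it through re.escape first avoids accidental metacharacters.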

By using the pprint module you can examine the contents of an element.

from pprint import pprint


pprint(vars(element))

Using this on a bs4 element will print something similar to this:

{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
'can_be_empty_element': False,
'contents': [u'\n\t\t\t\tNESNA\n\t'],
'hidden': False,
'name': u'span',
'namespace': None,
'next_element': u'\n\t\t\t\tNESNA\n\t',
'next_sibling': u'\n',
'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
'parser_class': <class 'bs4.BeautifulSoup'>,
'prefix': None,
'previous_element': u'\n',
'previous_sibling': u'\n'}

To access an attribute (let's say the class list), use the following:

class_list = element.attrs.get('class', [])

You can filter elements using this approach:

for script in soup.find_all('script'):
    if 'for' in script.attrs:
        pass  # ... Has 'for' attr (even if its value is empty)
    elif "myClass" in script.attrs.get('class', []):
        pass  # ... Has class "myClass"
    else:
        pass  # ... Do something else
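
When the goal is to test for presence, note that a truthiness test on attrs.get('for') and a membership test give different answers for an attribute that is present but empty. A minimal sketch, using a made-up one-tag document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the attribute exists, but its value is empty.
soup = BeautifulSoup('<script for=""></script>', "html.parser")
script = soup.find("script")

present = "for" in script.attrs          # membership: is the attribute there?
truthy = bool(script.attrs.get("for"))   # truthiness: is its value non-empty?

print(present, truthy)
```

Use the membership test (or has_attr) to ask whether the attribute exists, and the get form only when an empty value should count as missing.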

You can check whether an attribute is present:

scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()

A simple way to select just what you need.

outputDoc.select("script[for]")
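
CSS selectors can also express the "absent" side of the question through :not(). The sketch below assumes a Beautiful Soup version whose select() is backed by Soup Sieve (4.7 or later) and uses made-up sample HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = (
    '<script for="w"></script>'
    '<script bar="1"></script>'
    '<script></script>'
)
soup = BeautifulSoup(html, "html.parser")

with_for = soup.select("script[for]")
bar_only = soup.select("script[bar]:not([for])")
neither = soup.select("script:not([for]):not([bar])")

print(len(with_for), len(bar_only), len(neither))
```

The three selectors split the script tags into the same groups as an if / elif / else over has_attr, at the cost of walking the document three times.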