Can I remove script tags with BeautifulSoup?

Can <script> tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use Regular Expressions or something else?


小开

Best answer

from bs4 import BeautifulSoup

soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
# Remove every <script> element together with its contents.
for s in soup.select('script'):
    s.extract()
print(soup)
# prints: baba

小开

As stated in the official documentation, you can use the extract method to remove the whole subtree that matches the search.

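from bs4 import BeautifulSoup
a = BeautifulSoup("<html><body><script>aaa</script></body></html>", "html.parser")
# extract() detaches each matching <script> subtree; find_all is the bs4 name for findAll.
for x in a.find_all('script'):
    x.extract()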
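One detail worth noting about extract(): it returns the element it removed, so the detached subtree can still be inspected afterwards. A minimal sketch, assuming bs4 with the built-in html.parser; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><body><script>aaa</script><p>kept</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches each tag from the tree and returns it,
# so the removed elements can be collected in a list.
removed = [tag.extract() for tag in soup.find_all("script")]

remaining = str(soup)
removed_markup = [str(tag) for tag in removed]
print(remaining)       # the document without its <script> tags
print(removed_markup)  # the detached <script> elements
```

If the removed elements are not needed afterwards, a plain for loop reads more clearly than a list comprehension.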

小开

Updated answer for future reference: the right answer is decompose(). You can use other approaches, but decompose works in place.

Example usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()  # removes the <i> tag and its contents in place
print(str(soup))
# prints: <p>This is a slimy text and </p>

Very useful for getting rid of debris such as <script>, <img>, and so on.
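Applied to the original question, the same idea could be wrapped in a small helper. This is only a sketch: strip_tags and its names parameter are hypothetical, not part of BeautifulSoup, and it assumes bs4 with the built-in html.parser.

```python
from bs4 import BeautifulSoup

def strip_tags(html, names=("script", "img")):
    """Return html with every tag listed in names removed, contents included.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names; decompose() destroys each match in place.
    for tag in soup.find_all(list(names)):
        tag.decompose()
    return str(soup)

cleaned = strip_tags('<p>Hi<script>alert(1)</script><img src="x.png"/> there</p>')
print(cleaned)
# prints: <p>Hi there</p>
```

Pass a different names tuple, for example ("script", "style"), to strip other tags.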