Python/BeautifulSoup-如何从元素中删除所有标记?

如何简单地从 BeautifulSoup 中找到的元素中去除所有标记?

146811 次浏览

it looks like this is the way to do! as simple as that

with this line you are joining together the all text parts within the current element

''.join(htmlelement.find(text=True))

You can use the decompose method in bs4:

soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>')


for a in soup.find('a').children:
if isinstance(a,bs4.element.Tag):
a.decompose()


print soup


Out: <html><body><a href="http://example.com/">I linked to </a></body></html>

why has no answer I've seen mentioned anything about the unwrap method? Or, even easier, the get_text method

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

With BeautifulStoneSoup gone in bs4, it's even simpler in Python3

from bs4 import BeautifulSoup


soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

Use get_text(), it returns all the text in a document or beneath a tag, as a single Unicode string.

For instance, remove all different script tags from the following text:

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

The expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Here is the source code:

#!/usr/bin/env python3
from bs4 import BeautifulSoup


text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text)


print(soup.get_text())

Here is the source code: you can get the text which is exactly in the URL

URL = ''
page = requests.get(URL)
soup = bs4.BeautifulSoup(page.content,'html.parser').get_text()
print(soup)

Code to simply get the contents as text instead of html:

'html_text' parameter is the string which you will pass in this function to get the text

from bs4 import BeautifulSoup


soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)