Strip HTML from strings in Python

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print line

When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element, not the formatting itself. If it finds '<a href="whatever.example">some text</a>', it should print only 'some text'; '<b>hello</b>' prints 'hello'; and so on. How would one go about doing this?


You could use a different HTML parser (like lxml, or Beautiful Soup), one that offers functions to extract just the text. Or, you can run a regular expression on your line string that strips out the tags. See the Python documentation for more.
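For example, a minimal sketch of the parser-based approach, assuming Beautiful Soup 4 is installed (the regex route is shown in later answers):

from bs4 import BeautifulSoup

line = '<a href="whatever.example">some text</a>'
# get_text() returns the concatenated text of all elements, with the tags removed
print(BeautifulSoup(line, 'html.parser').get_text())  # some text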

I always used this function to strip HTML tags, as it requires only the Python standard library:

For Python 3:

from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO


class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

You can write your own function:

def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text

If you need to strip HTML tags for text processing, a simple regular expression will do. The following regex will strip most HTML tags fairly reliably:

import re


re.sub('<[^<]+?>', '', text)

For those who don't understand regex: this searches for a string <...> where the inner content consists of one or more (+) characters that are not a <. The ? means that it matches the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately, thanks to the ?. Without it, it would match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the ^< may be unnecessary.
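A quick demonstration of the pattern (a sketch; the unescaped 2 < 3 here happens to survive because the run after the stray < never reaches a >):

import re

text = 'if 2 < 3 then <b>yes</b>'
print(re.sub('<[^<]+?>', '', text))  # if 2 < 3 then yes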

This method works flawlessly for me and requires no additional installations:

import re
import htmlentitydefs


def convertentity(m):
    if m.group(1) == '#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)


def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)


html = converthtml(html)
html.replace("&nbsp;", " ")  ## Get rid of the remnants of certain formatting (subscript, superscript, etc.)
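Note that this answer is Python 2 only (htmlentitydefs and unichr no longer exist in Python 3). On Python 3, a minimal sketch of the entity-decoding part, assuming the standard library's html.unescape (available since 3.4) is enough for your needs:

import html

# html.unescape decodes both named and numeric character references
print(html.unescape('2 &lt; 3 &amp; &#928;'))  # 2 < 3 & Π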

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use, because it strips entities).

import html.parser


class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)


def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean the result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Named entities are handled as per HTML 5.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = '<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print(repr(html_to_text(html)))

Result:

'Demo (¬ Δημώ)'

Security note: Don't confuse HTML stripping (converting HTML to plain text) with HTML sanitizing (converting plain text to HTML). This answer removes HTML and decodes entities into plain text, which does not make the result safe to use in an HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is not hard: any time you insert a plain-text string into HTML output, always HTML-escape it (using html.escape(s)), even if you "know" it doesn't contain HTML (e.g. because you stripped the HTML content).
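A minimal sketch of that rule, reusing html_to_text from above:

import html

untrusted = '&lt;script&gt;alert("Hello");&lt;/script&gt;'
text = html_to_text(untrusted)  # now plain text: <script>alert("Hello");</script>
safe = html.escape(text)        # escape again before inserting into any HTML output
print(safe)                     # &lt;script&gt;alert(&quot;Hello&quot;);&lt;/script&gt;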

However, the OP asks about printing the result to the console, in which case no HTML escaping is needed. Instead, you may want to strip ASCII control characters, as they can trigger unwanted behavior (especially on Unix terminals):

import re
text = html_to_text(untrusted_html_input)
clean_text = re.sub(r'[\0-\x1f\x7f]+', '', text)
# Alternatively, if you want to allow newlines:
# clean_text = re.sub(r'[\0-\x09\x0b-\x1f\x7f]+', '', text)
print(clean_text)

I have used Eloff's answer successfully with Python 3.1 [many thanks!].

I upgraded to Python 3.2.3, and ran into errors.

The solution, provided here thanks to the responder Thomas K, was to insert super().__init__() into the following code:

def __init__(self):
    self.reset()
    self.fed = []

... in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it works for Python 3.2.3.

Thanks again to Thomas K for the fix and to Eloff for the original code provided!

If you need to preserve HTML entities (i.e. &amp;), I added the "handle_entityref" method to Eloff's answer:

from HTMLParser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), I'll give you a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

If you want to strip all HTML tags, the easiest way I found is using BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup


def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

I tried the code of the accepted answer, but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the above code block.

The short version!

import re, html
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')


# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)


# Clean up anything else by escaping
ready_for_web = html.escape(no_tags)

Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

Why can't I just strip the tags and leave it?

It's one thing to keep people from <i>italicizing</i> things, without leaving stray i's floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave behind things like unclosed comments (<!--) and angle brackets that aren't part of tags (blah <<<><blah). The HTMLParser versions can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.example/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.

Django's strip_tags, an improved (see the next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided that the resulting string is HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!
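A minimal sketch of the combination they recommend, assuming Django is installed (both helpers live in django.utils.html):

from django.utils.html import escape, strip_tags

text = strip_tags('<a href="#">some text</a>')  # 'some text'
safe = escape(text)  # always escape before putting the result back into HTML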

To strip tags with HTMLParser, you have to run it more than once.

It's easy to bypass the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop, until running it again doesn't change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.


def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value
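_strip_once itself isn't shown above; a sketch of what such a helper can look like, following the MLStripper pattern from the accepted answer (the class name here is illustrative, not Django's exact code):

from html.parser import HTMLParser


class _Stripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps entity references intact instead of decoding them
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        self.fed.append('&#%s;' % name)


def _strip_once(value):
    s = _Stripper()
    s.feed(value)
    s.close()
    return ''.join(s.fed)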

Of course, none of this is an issue if you always escape the result of strip_tags().

Update 19 March 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good stuff to copy or use

My example code doesn't handle HTML entities; the packaged Django and MarkupSafe versions do.

My example code comes from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django 1.7's templates.

Django's strip_tags and other HTML utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained, and you can copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes", for example:
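A minimal sketch of such a rule with Bleach (the policy here is made up; with strip=True, disallowed tags are removed rather than escaped):

import bleach

dirty = 'ok to <i>italicize</i>, not ok to <iframe src="http://evil.example/"></iframe>'
print(bleach.clean(dirty, tags=['i'], strip=True))
# ok to <i>italicize</i>, not ok to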

Understand the properties of your tag stripper! Run fuzz tests on it! That is the research I did for this answer.

Sheepish note - the question itself is about printing to the console, but this is the top Google result for "python strip html from string", which is why this answer is 99% about the web.

The solutions using HTML-Parser are all breakable, if they only run once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is exactly what you want to prevent. If you use an HTML parser, keep running it and counting tags until zero remain to be replaced:

from HTMLParser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    must_filtered = True
    while must_filtered:
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

This is a quick fix and can be optimized even further, but it will work fine. This code will replace all non-empty tags with "" and strip all HTML tags from a given input text. You can run it with ./file.py input output

#!/usr/bin/python
import sys


def replace(strng, replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    # write file
    writeto = open(sys.argv[2], 'w')

    # read file and store it in list
    f = open(sys.argv[1], 'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    # remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line

        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp, line[lessThanPos:count + 1])
                    lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt", "<")
        lineTemp = lineTemp.replace("&gt", ">")
        writeto.write(lineTemp)
    writeto.close()
    print "Write To --- >", sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ", sys.argv[0], " inputfile outputfile"

You can use the BeautifulSoup get_text() feature:

from bs4 import BeautifulSoup


html_str = '''
<td><a href="http://www.fakewebsite.example">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.example">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)


print(soup.get_text())
#or via attribute of Soup Object: print(soup.text)

It is advisable to explicitly specify the parser, e.g. as BeautifulSoup(html_str, features="html.parser"), for the output to be reproducible.

I am parsing Github readmes, and I find that the following really works well:

import re
import lxml.html


def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub


def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

Then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />


sky is a web scraping framework, implemented with the latest python versions in mind (3.4+).
It uses the asynchronous `asyncio` framework, as well as many popular modules
and extensions.


Most importantly, it aims for **next generation** web crawling where machine intelligence
is used to speed up the development/maintainance/reliability of crawling.


It mainly does this by considering the user to be interested in content
from *domains*, not just a collection of *single pages*
([templating approach](#templating-approach))."""


strip_markdown(strip_html(readme))

correctly removes all the markdown and html.

A solution based on lxml.html (lxml is a native library, so it can be much faster than any pure-Python solution).

To install the lxml module, use pip install lxml.

Remove ALL tags

from lxml import html




## from file-like object or URL
tree = html.parse(file_like_object_or_url)


## from string
tree = html.fromstring('safe <script>unsafe</script> safe')


print(tree.text_content().strip())


### OUTPUT: 'safe unsafe safe'


Remove ALL tags with pre-sanitizing HTML (dropping some tags)

from lxml import html
from lxml.html.clean import clean_html


tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
Detailed answers to any questions you might have
</span>""")


## text only
print(clean_html(tree).text_content().strip())


### OUTPUT: 'Detailed answers to any questions you might have'

Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly lxml.html.clean does.

If you need more control over which specific tags should be dropped before converting to text, then create a custom lxml Cleaner with the desired options, e.g.:

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div'))
sanitized_html = cleaner.clean_html(unsafe_html)

To customize how plain text is generated, you can use lxml.etree.tostring instead of text_content():

from lxml.etree import tostring


print(tostring(tree, method='text', encoding=str))


A Python 3 adaption of søren-løvborg's answer:

from html.parser import HTMLParser
from html.entities import html5


class HTMLTextExtractor(HTMLParser):
    """ Adaption of http://stackoverflow.com/a/7778368/196732 """
    def __init__(self):
        # convert_charrefs=False so the charref/entityref handlers below get called
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        # html5 maps entity names (with a trailing ';') to their replacement text
        if name + ';' in html5:
            self.result.append(html5[name + ';'])

    def get_text(self):
        return ''.join(self.result)


def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

The Beautiful Soup package does this immediately for you.

from bs4 import BeautifulSoup


soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

With BeautifulSoup, html2text, or the code from @Eloff, most of the time some HTML elements or JavaScript code still remain...

So you can use a combination of these libraries and delete the markdown formatting as well (Python 3):

import re
import html2text
from bs4 import BeautifulSoup


def html2Text(html):
    def removeMarkdown(text):
        for current in ["^[ #*]{2,30}", "^[ ]{0,30}\d\\\.", "^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text

    def removeAngular(text):
        angular = re.compile("[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text

    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me, but it can be enhanced, of course...

For one project, I needed to strip HTML, but also css and js. Thus, I made a variation of Eloff's answer:

from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # required on Python 3.2+ (see the earlier fix)
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
        self.css = False

    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag == "script":
            self.css = True

    def handle_endtag(self, tag):
        if tag == "style" or tag == "script":
            self.css = False

    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:

from html.parser import HTMLParser


def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)

Here is my solution for Python 3.

import html
import re


def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>", txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt = txt.replace(tag, '')
    return txt

Not sure if it is perfect, but it solved my use case and seems simple.

Simple code! This will remove all kinds of tags and the content inside of them.

def rm(s):
    start = False
    end = False
    s = ' ' + s
    for i in range(len(s) - 1):
        if i < len(s):
            if start != False:
                if s[i] == '>':
                    end = i
                    s = s[:start] + s[end + 1:]
                    start = end = False
            else:
                if s[i] == '<':
                    start = i
    if '<' in s and '>' in s[s.index('<'):]:
        # was self.rm(s): this is a plain function, and the result must be kept;
        # recurse only while a complete tag can still remain
        s = rm(s)
    else:
        s = s.replace('&nbsp;', ' ')
    return s

But it will not give a complete result if the text itself contains <> symbols.

Here is a simple solution that strips HTML tags and decodes HTML entities, based on the amazingly fast lxml library:

from lxml import html


def strip_html(s):
    return str(html.fromstring(s).text_content())


strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.
# This is a regex solution.
import re


def removeHtml(html):
    if not html:
        return html
    # Remove comments first
    innerText = re.compile('<!--[\s\S]*?-->').sub('', html)
    while innerText.find('>') >= 0:  # Loop through nested tags
        text = re.compile('<[^<>]+?>').sub('', innerText)
        if text == innerText:
            break
        innerText = text

    return innerText.strip()

2020 update

Use the Mozilla Bleach library; it really lets you customize which tags and which attributes to keep, and also filter out attributes based on their values.

Here are 2 cases to illustrate:

1) Do not allow any HTML tags or attributes

Take this sample raw text:

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p>
"""

2) Remove all HTML tags and attributes from the raw text

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

3) Allow only img tags with the srcset attribute

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with the srcset attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

This is how I do it, though I have no idea what I'm doing. I take data from an HTML table by stripping out the HTML tags.

This takes the string "name" and returns the string "name1" without HTML tags.

x = 0
anglebrackets = 0
name1 = ""
while x < len(name):
    if name[x] == "<":
        anglebrackets = anglebrackets + 1
    if name[x] == ">":
        anglebrackets = anglebrackets - 1
    if anglebrackets == 0:
        if name[x] != ">":
            name1 = name1 + name[x]
    x = x + 1
import re


def remove(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
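A quick usage check of the above (the non-greedy <.*?> matches each tag individually):

print(remove('<b>hello</b> <a href="whatever.example">some text</a>'))  # hello some text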