Removing a substring with Python

I scraped some text from a forum, and this is the raw material I have so far:

string = 'i think mabe 124 + <font color="black"><font face="Times New Roman">but I don\'t have a big experience it just how I see it in my eyes <font color="green"><font face="Arial">fun stuff'

What I don't want are the substrings "<font color="black"><font face="Times New Roman">" and "<font color="green"><font face="Arial">". I want to keep the rest of the string and drop only these, so the result should look like this:

resultString = "i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

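How can I do this? I actually used Beautiful Soup to extract the string above from a forum. Now I would probably prefer a regular expression to remove these parts.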

import re
re.sub('<.*?>', '', string)
"i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

The re.sub function takes a regular expression and replaces every match in the string with the second parameter. In this case, we search for all tags ('<.*?>') and replace them with nothing ('').

The ? after .* makes the quantifier non-greedy: each match stops at the first > it finds instead of running on to the last > in the string.
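To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both patterns on a shortened string in the spirit of the one from the question. The sample string and variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened, illustrative sample modelled on the question's string.
sample = 'a <font color="black"><font face="Arial">b <font color="green">c'

# Greedy: .* extends to the LAST '>' so a single match swallows
# everything from the first '<' onward, including the text "b ".
greedy = re.sub(r'<.*>', '', sample)

# Non-greedy: .*? stops at the FIRST '>' so each tag is matched
# on its own and the text between tags survives.
non_greedy = re.sub(r'<.*?>', '', sample)

print(repr(greedy))      # 'a c'
print(repr(non_greedy))  # 'a b c'
```

Note that . does not match a newline by default, so a tag that spans several lines would need the re.DOTALL flag.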

More about the re module.

>>> import re
>>> st = " i think mabe 124 + <font color=\"black\"><font face=\"Times New Roman\">but I don't have a big experience it just how I see it in my eyes <font color=\"green\"><font face=\"Arial\">fun stuff"
>>> re.sub("<.*?>","",st)
" i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"
>>>
from bs4 import BeautifulSoup
BeautifulSoup(text, features="html.parser").text

Apologies to those who were looking for more detail in my answer.

I'll explain it.

Beautiful Soup is a widely used Python package that helps developers work with HTML from within Python.

The line above takes the HTML text (text) and turns it into a BeautifulSoup object, which means that behind the scenes it parses everything (every HTML tag within the given text).

Once that is done, we simply request all the text from the parsed object.
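As a rough illustration of the "parse, then collect the text" idea, the sketch below uses the standard library's html.parser.HTMLParser instead of Beautiful Soup, so it runs without third-party packages. The class name TextCollector and the helper strip_tags are hypothetical names introduced here, and Beautiful Soup's own internals are more elaborate than this.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return only the text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


# Shortened version of the string from the question.
text = ('i think mabe 124 + <font color="black">'
        '<font face="Times New Roman">but I don\'t have a big experience')
print(strip_tags(text))
```

Unlike the regular expression, a parser-based approach does not depend on every tag fitting the '<.*?>' shape, which is the main reason to prefer it on messy forum HTML.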