Best way to strip punctuation from a string

Not necessarily simpler, but a different approach if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?"  # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

I usually use something like this:

>>> s = "string. With. Punctuation?"  # Sample string
>>> import string
>>> for c in string.punctuation:
...     s = s.replace(c, "")
...
>>> s
'string With Punctuation'

Accepted answer

From an efficiency perspective, you're not going to beat

s.translate(None, string.punctuation)

For newer versions of Python (Python 3), use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It performs raw string operations in C with a lookup table; short of writing your own C code, not much will beat that.
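As a minimal Python 3 sketch of the lookup-table approach described above, the table can be built once and reused for every string. The names `strip_punctuation` and `_PUNCT_TABLE` are illustrative and do not come from the original answer.

```python
import string

# Build the translation table once; the third argument of str.maketrans
# lists the characters that str.translate should delete.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("string. With. Punctuation?"))
```

Keeping the table at module level avoids rebuilding it on every call, which matters when the function runs over many strings.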

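If speed isn't a worry, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as the timings below show. For this type of problem, doing the work at as low a level as possible pays off.

Timing code (Python 2):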

import re, string, timeit
s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

myString.translate(None, string.punctuation)  # Python 2 str only

This may not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can also generalize and strip other types of characters:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

This will also strip characters like ~*+§$, which may or may not count as "punctuation" depending on one's point of view.
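The category-based filtering above can be sketched for Python 3 as a small helper. The function name `strip_categories` and its `majors` parameter are hypothetical, introduced only for illustration. The helper compares the first letter of each character's Unicode general category against the letters you pass in.

```python
from unicodedata import category


def strip_categories(text, majors='P'):
    """Drop characters whose Unicode major category letter is in majors."""
    return ''.join(ch for ch in text if category(ch)[0] not in majors)


sample = 'a~b$c. «d»'
print(strip_categories(sample))        # removes punctuation only
print(strip_categories(sample, 'SP'))  # removes punctuation and symbols
```

With the default argument, `~` and `$` survive because Unicode classifies them as symbols (S), not punctuation (P).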

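I like to use a function like this:

import string

def scrub(abc):
    # Trim punctuation from the end, then from the start. The "abc and"
    # guard avoids an IndexError once the string is empty.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc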

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

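For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.

To remove (some?) punctuation, then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J. F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))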
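A short usage sketch of the full-Unicode table, assuming Python 3. The names `REMOVE_PUNCT_MAP` and `strip_unicode_punctuation` are illustrative. The table is built once because constructing it scans every code point.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None.
# Building the table walks the whole Unicode range, so do it only once.
REMOVE_PUNCT_MAP = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)


def strip_unicode_punctuation(text):
    """Delete every character that Unicode classifies as punctuation."""
    return text.translate(REMOVE_PUNCT_MAP)


print(strip_unicode_punctuation('«Hello», world…'))
```

After the one-time setup cost, each call is a single str.translate pass over the input.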

Here's a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation you want:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # One pass over the list per punctuation mark is enough.
        wordList = [word.replace(punc,'') for word in wordList]
    return wordList

A one-liner might be helpful in not very strict cases:

''.join([c for c in s if c.isalnum() or c.isspace()])

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

For convenience, I have summed up the notes on stripping punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for a detailed description.

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

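I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w), number characters (\d), and space characters (\s):

import re
s = "string. With. Punctuation?"  # Sample string
# \d is redundant here because \w already matches digits.
out = re.sub(r'[^\w\d\s]+', '', s)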

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

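string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex  # third-party module: pip install regex
s = u"string. With. Some・Really Weird、Non？ASCII。 「（Punctuation）」?"
remove = regex.compile(r'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

It removes all Unicode punctuation.
It's easily modifiable; for example, you can drop the \p{S} if you want to remove punctuation but keep symbols like $.
You can get really specific about what you want to keep and what you want to remove; for example, \p{Pd} will only remove dashes.
This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.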

Here's a solution without regex.

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then

Replaces the punctuation with spaces
Replaces multiple spaces between words with a single space
Removes the trailing spaces, if any, with strip()
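The three steps above can be sketched for Python 3, where `string.maketrans` no longer exists and `str.maketrans` takes its place. The helper name `punctuation_to_spaces` is an illustrative choice, not part of the original answer.

```python
import string

# One-to-one table: each ASCII punctuation character maps to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    return ' '.join(text.translate(_PUNCT_TO_SPACE).split())


print(punctuation_to_spaces("!where??and!!or$$then:)"))
```

Because `split()` already discards leading and trailing whitespace, a separate `strip()` call is not required in this version.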

# FIRST METHOD
# Storing all punctuations in a variable
punctuation = '!?,.:;"\')(_-'
newstring = ''  # Creating empty string
word = raw_input("Enter string: ")
for i in word:
    if i not in punctuation:
        newstring += i
print("The string without punctuation is: " + newstring)

# SECOND METHOD
word = raw_input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
newstring = word.translate(None, punctuation)
print("The string without punctuation is: " + newstring)

# Output for both methods
# Enter string: hello! welcome -to_python(programming.language)??,
# The string without punctuation is: hello welcome topythonprogramminglanguage

with open('one.txt','r') as myFile:
    str1 = myFile.read()
print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i," ")
myList = []
myList.extend(str1.split(" "))
print(str1)
for i in myList:
    print(i,end='\n')
    print("____________")

Removing stop words from a text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')
with open('one.txt','r') as myFile:
    str1 = myFile.read()
stop_words = "not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList = []
myList.extend(str1.split(" "))
for i in myList:
    if i not in stop_words:
        print("____________")
        print(i,end='\n')

import re
s = "string. With. Punctuation?"  # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

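As an update, I rewrote @Brian's example in Python 3 and moved the regex compile step inside the function. My thought was to time every single step needed to make the function work. Perhaps you are using distributed computing, can't share the regex object between your workers, and need the re.compile step at each worker. I was also curious to time two different implementations of maketrans for Python 3:

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)

I also added another method that uses a set, taking advantage of the intersection function to reduce the number of iterations.

This is the complete code: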

import re, string, timeit
s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

Why is nobody using this?

 ''.join(filter(str.isalnum, s))

Too slow?
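One thing to keep in mind with this one-liner: `str.isalnum` returns False for whitespace, so spaces are removed along with the punctuation. A minimal Python 3 sketch of the difference follows; the variable names are illustrative.

```python
s = "string. With. Punctuation?"

# str.isalnum is False for whitespace, so spaces are dropped as well.
alnum_only = ''.join(filter(str.isalnum, s))

# Variant that keeps whitespace by accepting it explicitly.
alnum_and_space = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())

print(alnum_only)
print(alnum_and_space)
```

Whether losing the spaces is acceptable depends on what the cleaned string is used for afterwards.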

Taking Unicode into account. Code checked in Python 3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

Here's another easy way to do it using RegEx:

import re

punct = re.compile(r'(\w+)')

sentence = 'This ! is : a # sample $ sentence.'  # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
# This is a sample sentence

Try this one :)

import regex  # third-party module, not the standard-library re
regex.sub(r'\p{P}','', s)

I was looking for a really simple solution. Here's what I got:

import re

s = "string. With. Punctuation?"
s = re.sub(r'[\W\s]', ' ', s)

print(s)
# 'string  With  Punctuation '

You can also do this:

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())

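The question does not have a lot of specifics, so the approach I took is to come up with a solution for the simplest interpretation of the problem: just remove the punctuation.

Note that the solutions presented don't account for contractions (e.g., you're) or hyphenated words (e.g., anal-retentive), where it is debatable whether those marks should be treated as punctuation. They also don't account for non-English character sets or anything like that, because those specifics were not mentioned in the question. Someone argued that a space is punctuation, which is technically correct, but to me that makes no sense in the context of the question at hand.

import string

# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using list comprehension
''.join('' if c in string.punctuation else c for c in s)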
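If contractions and hyphenated words should survive, one possible approach is to keep an apostrophe or hyphen only when it has a word character on both sides. The sketch below rests on that assumption; it is not something the question asks for, and the name `strip_outer_punctuation` is hypothetical. Since `\w` also matches the underscore, underscores are kept too.

```python
import re

# Alternative 1 matches an apostrophe or hyphen with a word character on
# both sides; alternative 2 captures any other punctuation character.
_PATTERN = re.compile(r"(?<=\w)['-](?=\w)|([^\w\s])")


def strip_outer_punctuation(text):
    """Remove punctuation except apostrophes/hyphens inside a word."""
    # Group 1 is empty when alternative 1 matched, so the match is kept.
    return _PATTERN.sub(lambda m: '' if m.group(1) else m.group(0), text)


print(strip_outer_punctuation("you're anal-retentive, right?"))
```

Quotation marks around a word are still removed, because they have a word character on only one side.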

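Apparently I can't edit the selected answer, so here's an update that works for Python 3. The translate approach is still the most efficient option when doing non-trivial transformations.

Credit for the original heavy lifting goes to @Brian above. Thanks also to @ddejohn for his excellent suggestion for improving the original test.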

#!/usr/bin/env python3
"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209
"""
import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

When you deal with Unicode strings, I recommend using the PyPI regex module, because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).

Just install the package by typing pip install regex (or pip3 install regex) in your terminal and pressing ENTER.

If you need to remove punctuation and symbols of any kind (i.e. anything other than letters, digits and whitespace), you can use

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

See a Python demo online:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace \s pattern to the character class.

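For serious natural language processing (NLP), you should let a library like SpaCy handle punctuation through tokenization, which you can then manually tweak to your needs.

For example, how do you want to handle hyphens in words? Exceptional cases like abbreviations? Opening and closing quotes? URLs? In NLP it's often useful to separate a contraction like "let's" into "let" and "'s" for further processing.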
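To make the "let" / "'s" split concrete, here is a toy standard-library sketch. It is not SpaCy and does not reproduce SpaCy's tokenizer; the pattern and the name `toy_tokenize` are assumptions chosen only to illustrate the idea that punctuation becomes its own token instead of being deleted.

```python
import re

# Try, in order: a run of word characters, a clitic such as "'s",
# then any single non-space, non-word character (punctuation).
_TOKEN = re.compile(r"\w+|'\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return _TOKEN.findall(text)


print(toy_tokenize("Let's go."))
```

A real tokenizer handles many more cases (URLs, abbreviations, quotes), which is the reason the answer recommends a dedicated library.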