Best way to strip punctuation from a string

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?


Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

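From an efficiency perspective, you're not going to beat

s.translate(None, string.punctuation)

For later versions of Python (Python 3), use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It performs raw string operations in C with a lookup table. Short of writing your own C code, there is not much that will beat that.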
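A minimal sketch of the same idea for Python 3, assuming the table is built once and reused across calls. The names PUNCT_TABLE and strip_ascii_punctuation are illustrative and not part of the answer.

```python
import string

# Build the lookup table once; it maps each ASCII punctuation
# code point to None, which tells translate() to delete it.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_ascii_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(PUNCT_TABLE)

print(strip_ascii_punctuation("string. With. Punctuation?"))
```

Keeping the table at module level means the per-call cost is only the translate() itself.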

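If speed isn't a concern, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as you can see from the timings below. For this type of problem, doing the work at as low a level as possible pays off.

Timing code: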

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

myString.translate(None, string.punctuation)

This might not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can generalize and strip other types of characters as well:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

It will also strip characters like ~*+§$, which may or may not be "punctuation" depending on one's point of view.
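To make the category test concrete, here is a small Python 3 sketch. The helper name strip_categories and the sample string are assumptions for illustration; only unicodedata.category comes from the answer.

```python
import unicodedata

def strip_categories(text, major_classes='P'):
    """Drop characters whose Unicode general category starts with
    one of the letters in major_classes (e.g. 'P' or 'SP')."""
    return ''.join(ch for ch in text
                   if unicodedata.category(ch)[0] not in major_classes)

sample = 'a~b*c+d§e$f.g'
print(strip_categories(sample))        # punctuation categories only
print(strip_categories(sample, 'SP'))  # punctuation and symbol categories
```

With the default argument, symbols such as ~, + and $ are kept because their categories start with S rather than P.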

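I like to use a function like this:

import string

def scrub(abc):
    # Trim punctuation from the end, then from the start.
    # The "abc and" guard avoids an IndexError on empty strings.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc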
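If the goal is only to trim punctuation from both ends, the built-in str.strip accepts a set of characters and does the same job without explicit loops. A sketch, reusing the name scrub purely for illustration:

```python
import string

def scrub(text):
    """Remove leading and trailing ASCII punctuation.
    Punctuation inside the string is left untouched."""
    return text.strip(string.punctuation)

print(scrub("...hello, world!?"))
```

Note that this, like the loop version, leaves interior punctuation such as the comma in place.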

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
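One detail worth checking with this pattern: \w matches letters, digits and the underscore, so underscores are not removed. A short sketch with an assumed sample string:

```python
import re

s = "snake_case, kebab-case. Done?"

# \w includes '_', so the underscore survives this pattern.
print(re.sub(r'[^\w\s]', '', s))

# Listing the underscore explicitly removes it as well.
print(re.sub(r'[^\w\s]|_', '', s))
```

Whether the underscore counts as punctuation depends on the use case, so the second pattern is only one possible variant.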

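For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J. F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))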
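A usage sketch for such a table, in Python 3. The table is rebuilt here so the example is self-contained; the name UNICODE_PUNCT_TABLE and the sample text are assumptions. Building the table scans the whole code point range, so it is best done once and reused.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None.
UNICODE_PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

text = '«Hello», 「world」。'
print(text.translate(UNICODE_PUNCT_TABLE))
```

The spaces stay because whitespace is in category Zs, not in any P category.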

Here is a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation that you desire:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        for word in wordList:
            wordList=[word.replace(punc,'') for word in wordList]
    return wordList

A one-liner might be helpful in not very strict cases:

''.join([c for c in s if c.isalnum() or c.isspace()])
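A quick sketch of what this filter keeps and drops in Python 3, where str.isalnum is Unicode-aware. The function name and sample string are illustrative assumptions.

```python
def keep_alnum_and_space(text):
    """Keep only characters that are alphanumeric or whitespace."""
    return ''.join(c for c in text if c.isalnum() or c.isspace())

print(keep_alnum_and_space("naïve café_menu: 2 items!"))
```

Accented letters pass the isalnum test, while the underscore does not, which differs from the \w-based regex approaches.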

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

For convenience, I sum up the notes on stripping punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for the detailed description.

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or whitespace (\s):

import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]+', '', s)

>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s*', s)

['string', 'With', 'Punctuation']
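One caveat with the snippet above: since Python 3.7, re.split also splits on empty matches, so a pattern such as \s*, which can match the empty string, no longer splits only at whitespace. A sketch of two alternatives that avoid the issue:

```python
import re

s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]', '', s)

# \s+ needs at least one whitespace character, so it never matches empty.
words = re.split(r'\s+', s)

# str.split() with no arguments also splits on runs of whitespace.
assert words == s.split()
print(words)
```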

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
remove = regex.compile(r'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

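Personally, I believe this is the best way to remove punctuation from a string in Python because:

  • It removes all Unicode punctuation.
  • It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation but keep symbols like $.
  • You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
  • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.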
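The same category-based selection can be approximated with the standard library when the third-party regex module is not available. A sketch, assuming only dash punctuation (general category Pd) should be removed; remove_dashes is an illustrative name and this is a stand-in, not the regex-module code from the answer.

```python
import unicodedata

def remove_dashes(text):
    """Drop characters in Unicode category Pd (dash punctuation) only."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Pd')

# \u2013 is EN DASH; the ASCII hyphen-minus is also in category Pd.
print(remove_dashes("well-known, 2010\u20132020!"))
```

Other punctuation, such as the comma and exclamation mark, is kept because it belongs to category Po.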

Here's a solution without regex.

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then

  • Replaces the punctuation with spaces
  • Replaces multiple spaces between words with a single space
  • Removes the trailing spaces, if any, with strip()
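string.maketrans does not exist in Python 3; str.maketrans accepts the same two equal-length arguments. A sketch of the steps listed above for Python 3, with illustrative names (SPACE_TABLE, punctuation_to_spaces):

```python
import string

# Map every ASCII punctuation character to a space.
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    # split() with no arguments discards leading and trailing whitespace,
    # so a separate strip() is not needed after the join.
    return ' '.join(text.translate(SPACE_TABLE).split())

print(punctuation_to_spaces("!where??and!!or$$then:)"))
```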

# FIRST METHOD
# Storing all punctuations in a variable
punctuation='!?,.:;"\')(_-'
newstring ='' # Creating empty string
word = raw_input("Enter string: ")
for i in word:
    if(i not in punctuation):
        newstring += i
print ("The string without punctuation is: " + newstring)

# SECOND METHOD
word = raw_input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
newstring = word.translate(None, punctuation)
print ("The string without punctuation is: " + newstring)

# Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage

with open('one.txt','r')as myFile:
    str1=myFile.read()
    print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i," ")
myList=[]
myList.extend(str1.split(" "))
print (str1)
for i in myList:
    print(i,end='\n')
    print ("____________")

Remove stop words from a text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt','r')as myFile:
    str1=myFile.read()

stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList=[]
myList.extend(str1.split(" "))
for i in myList:
    if i not in stop_words:
        print ("____________")
        print(i,end='\n')

import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

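As an update, I rewrote the @Brian example in Python 3 and changed it to move the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing, can't share the regex object between your workers, and need the re.compile step on each worker. Also, I was curious to time two different implementations of maketrans for Python 3: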

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)
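As a sanity check, the two calls can be compared directly. This sketch suggests they build the same table, which would mean any timing difference comes from constructing the table rather than from translate itself; the variable names are illustrative.

```python
import string

table_from_dict = str.maketrans({key: None for key in string.punctuation})
table_from_args = str.maketrans('', '', string.punctuation)

# Both are plain dicts mapping code points (ints) to None.
print(table_from_dict == table_from_args)
print(table_from_args[ord('!')])
```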

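Plus, I added another method that uses set, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code: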

import re, string, timeit
s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

Why don't any of you use this?

''.join(filter(str.isalnum, s))

Too slow?

Considering Unicode. Code checked in Python 3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

Here's another easy way to do it using RegEx:

import re

punct = re.compile(r'(\w+)')

sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
# prints: This is a sample sentence

Try this one :)

regex.sub(r'\p{P}','', s)

I was looking for a really simple solution. Here's what I got:

import re

s = "string. With. Punctuation?"
s = re.sub(r'[\W\s]', ' ', s)

print(s)
# prints (note the doubled and trailing spaces): 'string  With  Punctuation '

You can also do this:

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())
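A sketch of how this behaves on a concrete sentence; the function name and sample text are assumptions. Because str.strip only touches the ends of each word, apostrophes and hyphens inside words are preserved.

```python
import string

def strip_word_edges(text):
    """Strip punctuation from both ends of each whitespace-separated word."""
    return ' '.join(word.strip(string.punctuation) for word in text.split())

print(strip_word_edges('"Well," she said, "you\'re well-known!"'))
```

A token made up entirely of punctuation becomes an empty string, so a doubled space can remain in the result (for example, "a - b" yields "a" and "b" separated by two spaces).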

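The question does not have a lot of specifics, so the approach I took is to come up with a solution for the simplest interpretation of the problem: just remove the punctuation.

Note that the solutions presented don't account for contractions (e.g., you're) or hyphenated words (e.g., anal-retentive), where it is debatable whether the apostrophe or hyphen should be treated as punctuation. Nor do they account for non-English character sets or anything like that, because those specifics were not mentioned in the question. Someone argued that a space is punctuation, which is technically correct, but to me it makes zero sense in the context of the question at hand.

import string

# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using list comprehension
''.join('' if c in string.punctuation else c for c in s)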

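Apparently I can't supply edits to the selected answer, so here's an update which works for Python 3. The translate approach is still the most efficient option when doing non-trivial transformations.

Credit for the original heavy lifting goes to @Brian above. And thanks to @ddejohn for the excellent suggestion for improving the original test.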

#!/usr/bin/env python3

"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209

"""

import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

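When you deal with Unicode strings, I suggest using the PyPI regex module because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).

Just install the package by typing pip install regex (or pip3 install regex) in your terminal and hitting ENTER.

In case you need to remove punctuation and symbols of any kind (i.e. anything other than letters, digits and whitespace), you can use

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class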

Python demo:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace \s pattern to the character class.

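For serious natural language processing (NLP), you should let a library like SpaCy handle punctuation through tokenization, which you can then manually tweak to your needs.

For example, how do you want to handle hyphens in words? Exceptional cases like abbreviations? Opening and closing quotes? URLs? In NLP it's often useful to separate a contraction like "let's" into "let" and "'s" for further processing.

SpaCy example tokenization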
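To illustrate why tokenizing first can be preferable to deleting characters, here is a deliberately naive sketch using only the standard re module. This is not how SpaCy tokenizes; the pattern, TOKEN_RE and toy_tokenize are assumptions made up for this example.

```python
import re

# A word, a clitic starting with an apostrophe, or any single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|'\w+|[^\w\s]")

def toy_tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = toy_tokenize("Let's go, shall we?")
print(tokens)

# Punctuation can then be dropped per token instead of per character.
words = [t for t in tokens if any(c.isalnum() for c in t)]
print(words)
```

Because the contraction is kept as its own token, later processing can decide what to do with it, which a blanket character filter cannot.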