How to get rid of punctuation using the NLTK tokenizer?


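As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). If you have Unicode strings in Python 2, make sure each one is a unicode object (not a 'str' encoded with an encoding such as 'utf-8').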

from nltk.tokenize import word_tokenize, sent_tokenize


text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))


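Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else: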

from nltk.tokenize import RegexpTokenizer


tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
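
For comparison, here is a minimal sketch of the same idea using only the standard library. It assumes that re.findall with the same pattern is an acceptable stand-in for RegexpTokenizer; the variable names are illustrative.

```python
import re

sentence = 'Eighty-seven miles to go, yet.  Onward!'
# \w+ matches runs of letters, digits and underscores; everything else is dropped
tokens = re.findall(r'\w+', sentence)
print(tokens)
```

Note that the hyphen is dropped too, so 'Eighty-seven' becomes two tokens.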


I just used the following code, which removed all the punctuation:

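import nltk

# raw is the input text as a single string
tokens = nltk.wordpunct_tokenize(raw)
type(tokens)  # list

text = nltk.Text(tokens)
type(text)  # nltk.text.Text

# keep only purely alphabetic tokens, lowercased
words = [w.lower() for w in text if w.isalpha()]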


I use this code to remove punctuation:

import nltk


def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    # keep only alphanumeric tokens, lowercased
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)


getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

If you want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")


You don't really need NLTK to remove punctuation. You can remove it with plain Python. For strings:

import string
s = '... some string with punctuation ...'
# Python 2 only; in Python 3 use s.translate(str.maketrans('', '', string.punctuation))
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)

Then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed (such as digits).
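
A minimal sketch of removing digits together with punctuation, assuming Python 3. The helper name strip_chars is illustrative and not part of any library.

```python
import string


def strip_chars(text, chars=string.punctuation + string.digits):
    """Return text with every character listed in chars removed."""
    return text.translate(str.maketrans('', '', chars))


print(strip_chars('Flat no. 42, 3rd floor!'))
```

Whitespace is left untouched, so removed characters can leave double spaces behind; a tokenizer applied afterwards will normally absorb them.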


I think you need some kind of regular expression matching (the following code is in Python 3):

import string
import re
import nltk


s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
# re.escape keeps characters such as ], \ and ^ literal inside the character class
ll = [x for x in l if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize.


The code below removes all punctuation marks as well as non-alphabetic characters.

http://www.nltk.org/book/ch01.html

import nltk


s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"


words = nltk.word_tokenize(s)


words = [word.lower() for word in words if word.isalpha()]


print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']


Remove punctuation (the code below removes '.' along with the other punctuation characters):

import sys
import unicodedata
from nltk.tokenize import word_tokenize
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
w = word_tokenize(text_string)  # now tokenize the string

Sample input/output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
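
To illustrate why a table built from Unicode categories can matter, here is a small sketch comparing it with string.punctuation, which covers ASCII only. The sample string is made up for the example.

```python
import string
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P' (punctuation) to None
unicode_punct = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)
ascii_punct = str.maketrans('', '', string.punctuation)

# Curly quotes, an em dash and a horizontal ellipsis, written as escapes
sample = '\u201cHello\u201d \u2014 world\u2026'
print(sample.translate(ascii_punct))    # non-ASCII punctuation is left in place
print(sample.translate(unicode_punct))  # all four punctuation marks are removed
```

One caveat: ASCII characters such as $ and + belong to the symbol categories (S*), not to P*, so the category-based table leaves them in place even though string.punctuation includes them.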


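Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong: if you remove punctuation before tokenising, words such as can't get broken into pieces (such as can and t), which is very likely to affect your program negatively.

Hence the solution is to tokenise first and then remove the punctuation tokens.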

import string


from nltk.tokenize import word_tokenize


tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']


tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

...and then, if you wish, you can replace certain tokens such as 'm with am.
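
A minimal sketch of that last step, assuming a small hand-written mapping. The CONTRACTIONS dict is hypothetical and far from complete, and the token list mirrors the filtered output shown in the comments above.

```python
# Hypothetical mapping from clitic tokens to full words; extend as needed
CONTRACTIONS = {"'m": "am", "n't": "not", "'re": "are"}

tokens = ['I', "'m", 'a', 'southern', 'salesman']
# Look each token up in the mapping and fall back to the token itself
expanded = [CONTRACTIONS.get(token, token) for token in tokens]
print(expanded)
```

Tokens such as 's are left out of the mapping on purpose, because they are ambiguous between "is", "has" and the possessive.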


Just adding to the solution by @rmalouf: the version below will not include any numbers, because \w+ is equivalent to [a-zA-Z0-9_]+ (for ASCII text).

from nltk.tokenize import RegexpTokenizer
# the trailing + is needed; without it every letter becomes its own token
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
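
The difference between the two patterns can be checked with plain re. This is a sketch that assumes re.findall behaves like RegexpTokenizer for these patterns; the sample sentence is made up.

```python
import re

sentence = 'Call 911 or dial ext_42, now!'
with_digits = re.findall(r'\w+', sentence)         # keeps digits and underscores
letters_only = re.findall(r'[a-zA-Z]+', sentence)  # ASCII letters only
print(with_digits)
print(letters_only)
```

With the letters-only pattern, 'ext_42' is cut down to 'ext' and '911' disappears entirely, so it is worth checking whether numbers carry meaning in your data before dropping them.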


You can do it in one line without nltk (Python 3.x).

import string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))