How can I split a text into sentences?

I have a text file. I need to get a list of sentences.

How can this be done? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates it does the job:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)
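
If loading the tokenizer fails because the punkt data isn't installed yet, a one-time download is needed first (this step is an addition, not part of the original recipe):

import nltk
nltk.download('punkt')  # fetches the punkt models used by tokenizers/punkt/english.pickle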

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The core of the regex is ' *\. +': a period surrounded by 0 or more spaces on the left and 1 or more on the right (to keep something like the period in re.split from being counted as a sentence break). The version above also accepts ? and ! as terminators, optionally followed by closing quotes or brackets.

Obviously, this isn't the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).
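
For reference, a small usage sketch of the split above on a made-up string (the sample text and the output comments are mine):

import re

text = 'He said hi. She waved back! (A quiet moment.) Then they left.'
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
print(sentences)
# ['He said hi', 'She waved back', '(A quiet moment', 'Then they left', '']
# the terminators are consumed by the split, and a trailing empty string may appear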

You can use this function to create a new tokenizer for Russian (and some other languages):

def russianTokenizer(text):
    # pad punctuation with spaces so it splits off as separate tokens
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ')
    # collapse the runs of spaces introduced above (applied several times on purpose)
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

Then call it like this:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
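
For reference, on the example above this returns the words and punctuation marks as separate tokens (the output comment is mine):

print(tokens)
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']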

No doubt NLTK is the most suitable for this purpose. But getting started with NLTK is quite painful (though once you've installed it, you reap the rewards).

Here is the simple re-based code from http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
    and return a list '''
    import re
    # to split by multiple characters
    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question

Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'.

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python.
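
For reference, a quick smoke test of the functions above (the sample string and the output comment are mine):

print(find_sentences('Does it work? Yes! It handles i.e. and e.g. too.'))
# ['Does it work?', 'Yes!', 'It handles i.e. and e.g. too.']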

This function can split the whole of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re

alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"


def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    if "..." in text: text = text.replace("...", "<prd><prd><prd>")
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
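
For reference, calling it on the edge-case example quoted above yields the expected two sentences (the output comment is mine):

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining "
        "Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst.")
print(split_into_sentences(text))
# ['Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer.',
#  'He also worked at craigslist.org as a business analyst.']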

Instead of using regular expressions to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."


>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Ref: https://stackoverflow.com/a/9474645/2877052

You can try using spaCy instead of regex.

import spacy
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())
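
Note that on newer spaCy releases (3.x) the 'en' shortcut and Span.string are gone; a minimal equivalent sketch would be:

import spacy

nlp = spacy.load('en_core_web_sm')  # after: python -m spacy download en_core_web_sm
tokens = nlp('''Your text here''')
for sent in tokens.sents:
    print(sent.text.strip())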

I had to read subtitle files and split them into sentences. After pre-processing (like removing the time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below neatly split them into sentences. Probably I was lucky that the sentences always (correctly) ended with a space. Try this first, and if it has any exceptions, add more checks and balances.

# Very approximate way to split the text into sentences - Break after ? . and !
fullFile = re.sub("(\!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh! Well. I now realize that since my content was Spanish, I did not run into the issues of dealing with "Mr. Smith" and the like. Still, if someone wants a quick and dirty parser...

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."


sent_tokenize(sentence)

I hope this will help you with Latin, Chinese, and Arabic texts:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

I was working on a similar task and came across this question; after following a few links and working through a few nltk exercises, the code below worked like magic for me.

from nltk.tokenize import sent_tokenize
  

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

Output:

['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Also, be careful with additional top-level domains that aren't covered in the answers above.

For example, .info, .biz, .ru, and .online will throw off some sentence parsers, but they aren't included above.

Here is some information on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

The problem can be addressed by editing the code above to read:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"

Using spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.string.strip())

Might as well add this, since this is the first post that shows up when searching for splitting text into groups of n sentences.

This works with a variable split length, which determines how many sentences get joined together in the end.

import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # 3 sentences for example

elements = nltk.tokenize.sent_tokenize(text)  # text holds the input string
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
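
For reference, a small sketch of what the windowing does with a made-up five-sentence text and split_length = 2 (output comments are mine):

text = "I came. I saw. I conquered. Then I left. The end."
elements = nltk.tokenize.sent_tokenize(text)
for seg in windowed(elements, n=2, step=2):
    print(" ".join(t for t in seg if t))
# I came. I saw.
# I conquered. Then I left.
# The end.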

If NLTK's sent_tokenize is not an option (e.g. you'd need a lot of GPU RAM on long texts) and regex doesn't work properly across languages, sentence-splitter may be worth a try.
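
A minimal sketch, assuming the sentence-splitter package from PyPI (pip install sentence-splitter):

from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')
print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))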

Using Stanza, a natural language processing library that works for many human languages:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

doc = nlp(t_en)  # t_en holds the input text
for sentence in doc.sentences:
    print(sentence.text)

Using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is Thired ')
for sentence in doc.sents:
    print(sentence)

But if you want to get a sentence by index, for example:

# doesn't work: doc.sents is a generator, so it can't be indexed
doc.sents[0]

Use:

list(doc.sents)[0]

I really liked spaCy, but I recently discovered two new approaches to sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD (extremely accurate).

text = ...


from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')


from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I split 20k sentences using five different methods; here are the elapsed times:

  • spaCy Sentencizer: 1.16934 s
  • spaCy Parse: 25.97063 s
  • PySBD: 9.03505 s
  • NLTK: 0.30512 s
  • BlingFire: 0.07933 s