n-grams in Python: four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print(list(string_bigrams))  # bigrams() returns a lazy iterator in NLTK 3

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

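I have never dealt with nltk, but I did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D will give you the histogram of your N-word sequences.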

N = 3  # example n-gram size
D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts) - N + 1):  # +1 so the last N-gram is included
    gram = tuple(strparts[i:i+N])
    try:
        D[gram] += 1
    except KeyError:  # first time this N-gram is seen
        D[gram] = 1
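
The same histogram can also be built with collections.Counter from the standard library. This is only a minimal sketch: the helper name ngram_counts and the sample sentence are made up for illustration, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count how often each word n-gram occurs in text (whitespace tokens)."""
    words = text.split()
    # zip over n staggered views of the word list; zip stops at the shortest
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)


counts = ngram_counts("the cat sat on the mat the cat ran", 2)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Counter takes care of the "first time seen" case, so no try/except is needed.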

You can easily whip up your own function to do this using itertools:

from itertools import islice, tee
s = 'spam and eggs'
N = 3
# In Python 3 the built-in zip is lazy, so itertools.izip is no longer needed
trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
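
The example above works on characters. The same tee/islice idea should apply to any iterable, including a list of words. Here is a small sketch, where sliding_ngrams is a hypothetical helper name and not part of itertools:

```python
from itertools import islice, tee


def sliding_ngrams(iterable, n):
    """Lazily yield n-tuples of consecutive items from any iterable."""
    copies = tee(iterable, n)  # n independent iterators over the same items
    # Advance the i-th copy by i items so the copies are staggered
    staggered = (islice(it, i, None) for i, it in enumerate(copies))
    return zip(*staggered)


print(list(sliding_ngrams("spam and eggs".split(), 2)))
# [('spam', 'and'), ('and', 'eggs')]
```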

I'm surprised that this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']

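Great native-Python answers have been given by other users. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).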

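There is an ngram module in nltk that people seldom use. That is not because n-grams are hard to read, but because training a model on n-grams with n > 3 results in a lot of data sparsity.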

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
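
To get a feel for the data-sparsity point, one rough check is to measure what fraction of the n-grams in a text occur only once. The sketch below uses plain Python rather than nltk. The function name singleton_ratio and the toy sentence are assumptions made up for this illustration, and a real corpus would behave less dramatically.

```python
from collections import Counter


def singleton_ratio(tokens, n):
    """Fraction of the n-grams in tokens that occur exactly once."""
    grams = list(zip(*(tokens[i:] for i in range(n))))
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(1 for c in counts.values() if c == 1) / len(grams)


tokens = "the cat sat on the mat and the cat sat on the rug".split()
for n in (1, 2, 4, 6):
    print(n, round(singleton_ratio(tokens, n), 2))
```

On this toy input the ratio grows with n, and at n = 6 every n-gram is unique, which is the kind of sparsity that makes higher-order models hard to train.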

Here is another simple way to do n-grams:

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = list(ngrams(tokenize, 2))  # ngrams() is lazy in NLTK 3, so wrap it in list()
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams = list(ngrams(tokenize, 3))
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams = list(ngrams(tokenize, 4))
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

For four-grams, support is already in NLTK. Here is a piece of code that can help you with this:

import nltk
from nltk.collocations import QuadgramCollocationFinder

# You should tokenize your text
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)
fourgrams = QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram, freq)

Hope it helps.

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
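# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('{1}-grams: {0}'.format(vect.get_feature_names_out().tolist(), ngram_size))

Output:

4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']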

You can set ngram_size to any positive integer. That is, you can split a text into four-grams, five-grams or even hundred-grams.

Nltk is great, but sometimes it is overhead for a project:

import re

def tokenize(text, ngrams=1):
    # Replace punctuation and whitespace characters with a single space
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Example use:

>>> text = "This is an example text"
>>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
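
One thing to be aware of: the character class above also replaces apostrophes, so a contraction like "it's" is split into two tokens. If that is not what you want, a possible variant is to match words directly instead of deleting punctuation. This is only a sketch, and both the name tokenize_ngrams and the regex are my own assumptions rather than anything standard:

```python
import re


def tokenize_ngrams(text, n=1):
    """Return word n-grams, keeping contractions such as "it's" in one token."""
    # \w+(?:'\w+)* matches a word optionally followed by apostrophe parts
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for gram in tokenize_ngrams("I really like python, it's pretty awesome.", 4):
    print(gram)
```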

Using only nltk tools:

from nltk.tokenize import word_tokenize
from nltk.util import ngrams


def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

Example output:

get_ngrams('This is the simplest text i could think of', 3 )


['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep the n-grams as tuples instead of joined strings, just remove the ' '.join.

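Here is a more elegant approach to building bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once as is and once offset by one element.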

string = "I really like python, it's pretty awesome."


def find_bigrams(s):
    input_list = s.split(" ")
    # list() is needed in Python 3, where zip() returns a lazy iterator
    return list(zip(input_list, input_list[1:]))


def find_ngrams(s, n):
    input_list = s.split(" ")
    return list(zip(*[input_list[i:] for i in range(n)]))


find_bigrams(string)


[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
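
The answer only demonstrates the bigram case, so here is a short self-contained sketch of the general version, with the intermediate shifted lists spelled out. The variable name shifted is only for illustration.

```python
def find_ngrams(s, n):
    words = s.split()
    # words[0:], words[1:], ..., words[n-1:] are the same list shifted by 0..n-1
    shifted = [words[i:] for i in range(n)]
    # zip stops at the shortest list, so incomplete windows at the end are dropped
    return list(zip(*shifted))


print(find_ngrams("I really like python", 3))
# [('I', 'really', 'like'), ('really', 'like', 'python')]
```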

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred, as you say), but you want to use pure Python, I would do:

from itertools import chain


def n_grams(seq, n=1):
    """Returns an iterator over the n-grams of a list of tokens."""
    shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams  # to join inside the generator: (" ".join(i) for i in tupleNGrams)


def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an iterator over all n-grams for n in range(*ngramRange); the upper bound is exclusive."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,4)))  # upper bound is exclusive, so n = 1, 2, 3
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

Roughly the same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
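
A caveat on reading the numbers above: in Python 3 with a recent NLTK, these calls appear to return lazy iterators, so the timings may mostly reflect building the input string rather than producing the n-grams. If you want to time the actual work, one option is to exhaust the iterator inside the timed function. Below is a minimal sketch using the standard timeit module. The helper names zip_ngrams and consume are made up, the input is smaller than in the runs above, and no timing result is claimed here.

```python
import timeit


def zip_ngrams(tokens, n):
    """Lazy n-grams over a list of tokens via staggered slices."""
    return zip(*(tokens[i:] for i in range(n)))


tokens = ('test the ngrams iterator vs nltk ' * 10**4).split()


def consume():
    # Exhaust the iterator so every n-gram is actually built
    return sum(1 for _ in zip_ngrams(tokens, 5))


seconds = timeit.timeit(consume, number=10)
print(f"{seconds / 10:.4f} s per run for {consume()} five-grams")
```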

Reposting from my previous answer.

You can get all 4- to 6-grams using the code below, without any other package:

from itertools import chain


def get_m_2_ngrams(input_list, min_n, max_n):
    # min_n/max_n avoid shadowing the built-in min() and max()
    for s in chain(*[get_ngrams(input_list, k) for k in range(min_n, max_n+1)]):
        yield ' '.join(s)


def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])


if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

The output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

You can find more details on this blog.

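People have already answered pretty well for the scenario where you need bigrams or trigrams, but if you need every n-gram of the sentence, you can use nltk.util.everygrams: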

>>> from nltk.util import everygrams
>>> message = "who let the dogs out"
>>> msg_split = message.split()
>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

If you have a limit, as with trigrams where the maximum length should be 3, you can use the max_len parameter to specify it.

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can simply change the max_len parameter to get n-grams of any size, i.e. four-grams, five-grams, six-grams or even hundred-grams.

The previously mentioned solutions can be modified to achieve the same result, but this one is much more straightforward.

For further reading, click here.

When you just need one specific size, such as bigrams or trigrams, you can use nltk.util.ngrams, as mentioned in M.A.Hassan's answer.
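
If you cannot install NLTK, a rough pure-Python stand-in for this behaviour might look like the sketch below. The name every_ngrams and its min_len/max_len parameters are my own choices that merely imitate the call shown above, and the order in which NLTK itself yields the n-grams may differ between versions.

```python
def every_ngrams(tokens, min_len=1, max_len=None):
    """Yield all n-grams of tokens for min_len <= n <= max_len, shortest first."""
    tokens = list(tokens)
    if max_len is None:
        max_len = len(tokens)  # default: up to the whole sequence
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


msg_split = "who let the dogs out".split()
print(list(every_ngrams(msg_split, max_len=2)))
```

For the five-word message this gives the five unigrams followed by the four bigrams.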

After about seven years, here's a more elegant answer using collections.deque:

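import collections
import itertools


def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    # itertools.cycle((d,)) yields the same deque forever, so window is always d
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)
    print(' '.join(d))  # last window; using d also works when the loop never runs


words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']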

Output:

In [236]: ngrams(words, 2)
I am
am become
become death,
death, the
the destroyer
destroyer of
of worlds


In [237]: ngrams(words, 3)
I am become
am become death,
become death, the
death, the destroyer
the destroyer of
destroyer of worlds


In [238]: ngrams(words, 4)
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of
the destroyer of worlds


In [239]: ngrams(words, 1)
I
am
become
death,
the
destroyer
of
worlds
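
If you would rather collect the n-grams than print them, the same deque idea can be written as a generator. This is a sketch of a variation, not the code from the answer above, and iter_ngrams is a hypothetical name:

```python
from collections import deque


def iter_ngrams(words, n):
    """Yield each n-gram of words as a tuple, holding only n items at a time."""
    window = deque(maxlen=n)
    for word in words:        # works for any iterable, including generators
        window.append(word)   # once full, the deque drops its oldest item
        if len(window) == n:
            yield tuple(window)


words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']
for gram in iter_ngrams(words, 3):
    print(' '.join(gram))
```

Because it never slices the input, this version also accepts a stream of tokens that does not fit in memory.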


If you want a pure iterator solution for large strings with constant memory usage:

import itertools
import re  # needed for re.finditer below
from typing import Iterable


def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    # One independent token iterator per position in the n-gram
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip first words: the k-th iterator ends up advanced by k tokens
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))

    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']

Doing n-grams in Python is quite easy, for example:

def n_gram(tokens, n):  # renamed from list to avoid shadowing the built-in
    return [tokens[i:i+n] for i in range(len(tokens)-n+1)]

If you then do:

text = "I really like python, it's pretty awesome."
n_gram(text.split(" "), 4)

You will get:

[['I', 'really', 'like', 'python,'],
['really', 'like', 'python,', "it's"],
['like', 'python,', "it's", 'pretty'],
['python,', "it's", 'pretty', 'awesome.']]

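This is an old question, but if you want the n-grams as a list of substrings (rather than a list of lists or tuples) and do not want to import anything, the following code works fine and is easy to read: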

def get_substrings(phrase, n):
    phrase = phrase.split()
    substrings = []
    for i in range(len(phrase)):
        if len(phrase[i:i+n]) == n:
            substrings.append(' '.join(phrase[i:i+n]))
    return substrings

You can use it, for example, to get all n-grams of a list of terms up to a words long, like this:

a = 5
terms = [
    "An n-gram is a contiguous sequence of n items",
    "An n-gram of size 1 is referred to as a unigram",
]

for term in terms:
    for i in range(1, a+1):
        print(f"{i}-grams: {get_substrings(term, i)}")

Prints:

1-grams: ['An', 'n-gram', 'is', 'a', 'contiguous', 'sequence', 'of', 'n', 'items']
2-grams: ['An n-gram', 'n-gram is', 'is a', 'a contiguous', 'contiguous sequence', 'sequence of', 'of n', 'n items']
3-grams: ['An n-gram is', 'n-gram is a', 'is a contiguous', 'a contiguous sequence', 'contiguous sequence of', 'sequence of n', 'of n items']
4-grams: ['An n-gram is a', 'n-gram is a contiguous', 'is a contiguous sequence', 'a contiguous sequence of', 'contiguous sequence of n', 'sequence of n items']
5-grams: ['An n-gram is a contiguous', 'n-gram is a contiguous sequence', 'is a contiguous sequence of', 'a contiguous sequence of n', 'contiguous sequence of n items']
1-grams: ['An', 'n-gram', 'of', 'size', '1', 'is', 'referred', 'to', 'as', 'a', 'unigram']
2-grams: ['An n-gram', 'n-gram of', 'of size', 'size 1', '1 is', 'is referred', 'referred to', 'to as', 'as a', 'a unigram']
3-grams: ['An n-gram of', 'n-gram of size', 'of size 1', 'size 1 is', '1 is referred', 'is referred to', 'referred to as', 'to as a', 'as a unigram']
4-grams: ['An n-gram of size', 'n-gram of size 1', 'of size 1 is', 'size 1 is referred', '1 is referred to', 'is referred to as', 'referred to as a', 'to as a unigram']
5-grams: ['An n-gram of size 1', 'n-gram of size 1 is', 'of size 1 is referred', 'size 1 is referred to', '1 is referred to as', 'is referred to as a', 'referred to as a unigram']