WordNet lemmatization and POS tagging in Python

I want to use the WordNet lemmatizer in Python, and I have learned that the default POS tag is NOUN, so it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
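For example, a quick check of the behaviour I mean:

from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lmtzr.lemmatize('are')       # 'are' -- default pos is noun, so nothing happens
lmtzr.lemmatize('are', 'v')  # 'be'  -- correct lemma once pos is specified as verb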

My question is: what is the best way to perform the above lemmatization accurately?

I did the POS tagging using nltk.pos_tag, but I am lost in mapping the Treebank POS tags to WordNet-compatible POS tags. Please help.

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)

I get the output tags in NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it on my data directly?


First of all, you can use nltk.pos_tag() directly without training it. The function will load a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

nltk.tag._POS_TAGGER
# 'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.
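For example (illustrative; the exact tags can vary slightly between tagger versions):

import nltk

nltk.pos_tag(nltk.word_tokenize("Dogs are running quickly"))
# e.g. [('Dogs', 'NNS'), ('are', 'VBP'), ('running', 'VBG'), ('quickly', 'RB')]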

The following function would map the treebank tags to WordNet part of speech names:

from nltk.corpus import wordnet


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)
# 'go'

Check the return value before passing it to the lemmatizer, because an empty string would give a KeyError.
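For example, a minimal guard (a small sketch reusing get_wordnet_pos and lemmatizer from above; the sentence is just for illustration):

import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("The cats were running"))
for word, tag in tagged:
    wn_tag = get_wordnet_pos(tag)
    if wn_tag:  # skip the empty string to avoid the KeyError
        print(lemmatizer.lemmatize(word, wn_tag))
    else:
        print(lemmatizer.lemmatize(word))  # fall back to the default (noun)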

As shown in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]

@Suzana_K's solution was working, but there are some cases that result in a KeyError, as @Clock Slave mentioned.

Convert Treebank tags to WordNet tags:

from nltk.corpus import wordnet


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  # makes the if-statement below easy

Now, we pass a pos argument into the lemmatize function only if we have a WordNet tag:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not supply a tag in case of None
        lemma = lemmatizer.lemmatize(word)
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag)
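For example, collecting the lemmas of a sample sentence with the pieces above (the sentence is just for illustration):

import nltk

tokens = nltk.word_tokenize("The cats were sitting near the mice")
tagged = nltk.pos_tag(tokens)
lemmas = []
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    lemmas.append(lemmatizer.lemmatize(word, pos=wntag) if wntag else lemmatizer.lemmatize(word))
print(lemmas)
# e.g. ['The', 'cat', 'be', 'sit', 'near', 'the', 'mouse']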

Steps to convert: Document -> Sentences -> Tokens -> POS -> Lemmas

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'


class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        """
        out : [['What', 'can', 'I', 'say', 'about', 'this', 'place', '.'], ...]
        """
        # split into single sentences
        sentences = self.splitter.tokenize(text)
        # tokenize each sentence
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass

    def get_wordnet_pos(self, treebank_tag):
        """
        return the WordNet POS tag (a, n, r, v) that the WordNet lemmatizer expects
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # the default pos in lemmatization is noun
            return wordnet.NOUN

    def pos_tag(self, tokens):
        # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ...
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using the POS tag
        # convert into a feature set of (original word, lemmatized word, [POS tag]) triples,
        # i.e. [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...
        pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word, pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens


lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

# step 1: split the document into sentences, then tokenize
tokens = splitter.split(text)

# step 2: lemmatization using the POS tagger
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)

You can do this in one line:

wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

Then use wnpos(nltk_pos) to get the POS to give to .lemmatize(). In your case, lmtzr.lemmatize(word=tagged[0][0], pos=wnpos(tagged[0][1])).
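For example (a quick sketch; the sentence is only for illustration):

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'
lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize("She was running quickly"))
print([lmtzr.lemmatize(word, wnpos(tag)) for word, tag in tagged])
# e.g. ['She', 'be', 'run', 'quickly']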

You can create a map using the Python defaultdict and take advantage of the fact that, for the lemmatizer, the default tag is noun.

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict


tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()


for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
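The defaultdict is what makes this safe: any first letter that is not in the map (e.g. 'D' for DT or 'I' for IN) silently falls back to wn.NOUN instead of raising a KeyError. A quick check:

tag_map['V']  # 'v'
tag_map['D']  # 'n' -- the unseen key falls back to wn.NOUN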

After searching the internet, I found this solution: going from a sentence to a "bag of words" by splitting, POS tagging, lemmatizing, and cleaning (removing punctuation and stop words). Here's my code:

import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

punctuation = u",.?!()-_\"\'\\\n\r\t;:+*<>@#§^$%&|/"
stop_words_eng = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tag_dict = {"J": wn.ADJ,
            "N": wn.NOUN,
            "V": wn.VERB,
            "R": wn.ADV}


def extract_wnpostag_from_postag(tag):
    # take the first letter of the tag
    # the second parameter is the default in case of a missing key in the dictionary
    return tag_dict.get(tag[0].upper(), None)


def lemmatize_tupla_word_postag(tupla):
    """
    given a tuple of the form (wordString, posTagString), like ('guitar', 'NN'), return the lemmatized word
    """
    tag = extract_wnpostag_from_postag(tupla[1])
    return lemmatizer.lemmatize(tupla[0], tag) if tag is not None else tupla[0]


def bag_of_words(sentence, stop_words=None):
    if stop_words is None:
        stop_words = stop_words_eng
    original_words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(original_words)  # returns a list of tuples (word, tagString), like ('And', 'CC')
    original_words = None
    lemmatized_words = [lemmatize_tupla_word_postag(ow) for ow in tagged_words]
    tagged_words = None
    cleaned_words = [w for w in lemmatized_words if (w not in punctuation) and (w not in stop_words)]
    lemmatized_words = None
    return cleaned_words


sentence = "Two electric guitar rocks players, and also a better bass player, are standing off to two sides reading corpora while walking"
print(sentence, "\n\n bag of words:\n", bag_of_words(sentence))

You can do it as follows:

import nltk
from nltk.corpus import wordnet

wordnet_map = {
    "N": wordnet.NOUN,
    "V": wordnet.VERB,
    "J": wordnet.ADJ,
    "R": wordnet.ADV
}


def pos_tag_wordnet(text):
    """
    Create pos_tag with wordnet format
    """
    pos_tagged_text = nltk.pos_tag(text)

    # map the POS tagging output to the WordNet tags
    pos_tagged_text = [
        (word, wordnet_map.get(pos_tag[0])) if pos_tag[0] in wordnet_map.keys()
        else (word, wordnet.NOUN)
        for (word, pos_tag) in pos_tagged_text
    ]

    return pos_tagged_text
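A possible usage sketch (the sentence is just for illustration), feeding the mapped tags straight into the lemmatizer:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
tagged = pos_tag_wordnet(word_tokenize("The children were playing happily"))
print(tagged)
# e.g. [('The', 'n'), ('children', 'n'), ('were', 'v'), ('playing', 'v'), ('happily', 'r')]
print([lemmatizer.lemmatize(word, tag) for word, tag in tagged])
# e.g. ['The', 'child', 'be', 'play', 'happily']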