How to get a vector for a sentence from the word2vec vectors of its tokens

I have used word2vec to generate vectors for a list of tokens from a large document. Given a sentence, is it possible to get a vector for the sentence from the vectors of the tokens in the sentence?


It is possible, but not with word2vec alone. Composing word vectors to obtain higher-level representations for sentences (and further for paragraphs and documents) is a very active research topic. There is no single best way to do this; it really depends on the task you want to apply the vectors to. You can try concatenation, simple summation, pointwise multiplication, convolution, etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what works best for you.
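As a minimal sketch of the simplest compositions, assuming a gensim KeyedVectors object holding your trained word2vec vectors (the file name here is a placeholder):

import numpy as np
from gensim.models import KeyedVectors

# load trained word2vec vectors (file name is a placeholder)
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

tokens = [w for w in 'this is a sentence'.split() if w in kv]
mat = np.array([kv[w] for w in tokens])  # one row per in-vocabulary token

sum_vec = mat.sum(axis=0)         # simple summation
mean_vec = mat.mean(axis=0)       # averaging
prod_vec = mat.prod(axis=0)       # pointwise multiplication
concat_vec = np.concatenate(mat)  # concatenation (length grows with the sentence)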

You can get vector representations of sentences during the training phase (join the test and train sentences in a single file and run the word2vec code obtained from the link below).

Code for sentence2vec has been shared by Tomas Mikolov here. It assumes the first word of each line to be the sentence id. Compile the code using

gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops

and run it using

./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
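With -sentence-vectors 1, the sentence vectors are written into vectors.txt alongside the word vectors, keyed by the per-line sentence ids. Here is a minimal sketch for pulling them back out, assuming ids of the form _*0, _*1, ... (the convention used in the paragraph-vector experiments; adjust the prefix to whatever ids your alldata-id.txt actually uses):

import numpy as np

sentence_vecs = {}
with open('vectors.txt') as f:
    next(f)  # skip the header line (vocabulary size and dimensionality)
    for line in f:
        parts = line.split()
        token, vec = parts[0], np.array(parts[1:], dtype=float)
        if token.startswith('_*'):  # assumed sentence-id prefix, not an ordinary word
            sentence_vecs[token] = vec

print(sentence_vecs.get('_*0'))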

EDIT

Gensim (development version) seems to have a method to infer vectors for new sentences. Check out the model.infer_vector(NewDocument) method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py
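A minimal sketch of that workflow with gensim's Doc2Vec (the toy corpus and the hyperparameter values below are stand-ins):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy training corpus; each document carries a unique tag
corpus = [TaggedDocument(words=text.split(), tags=[str(i)])
          for i, text in enumerate(['the cat sat on the mat',
                                    'the dog barked at the cat'])]

model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=40)

# infer a vector for an unseen sentence
new_vec = model.infer_vector('the cat barked'.split())
print(new_vec.shape)  # (100,)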

There are different methods to get sentence vectors:

  1. Doc2Vec: you can train your dataset using Doc2Vec and then use the sentence vectors.
  2. Average of Word2Vec vectors: you can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.
  3. Average of Word2Vec vectors with TF-IDF: this is one of the best approaches, and the one I would recommend. Take the word vectors, weight each one by its TF-IDF score, and average them; the result represents your sentence vector (see the sketch after this list).
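A minimal sketch of option 3, assuming scikit-learn's TfidfVectorizer for the IDF weights and a gensim KeyedVectors object kv for the word vectors (both are assumptions; any TF-IDF implementation works):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['the cat sat on the mat', 'the dog barked at the cat']
tfidf = TfidfVectorizer().fit(sentences)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def sentence_vector(sentence, kv):
    # weighting each token occurrence by its IDF yields a TF-IDF-weighted
    # average, since repeated words are counted once per occurrence
    pairs = [(kv[w], idf.get(w, 1.0)) for w in sentence.split() if w in kv]
    if not pairs:
        return np.zeros(kv.vector_size)
    vecs, weights = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=weights)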

It depends on the usage:

1) If you only want to get sentence vectors for some known data, check out the paragraph vector in these papers:

Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. arXiv e-prints, 4:1188–1196.

A. M. Dai, C. Olah, and Q. V. Le. 2015. Document Embedding with Paragraph Vectors. arXiv e-prints, July.

2) If you want a model to estimate sentence vectors for unknown (test) sentences with an unsupervised approach:

You could check out this paper:

Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US.

3) Researchers have also looked at using the output of a certain layer in an RNN or LSTM network; a recent example is:

http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195

4) With gensim's doc2vec, many researchers could not get good results; to overcome this problem, the following paper uses doc2vec based on pre-trained word vectors.

Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

5) tweet2vec or sent2vec.

Facebook has the SentEval project for evaluating the quality of sentence vectors.

https://github.com/facebookresearch/SentEval

6) There is more information in the following paper:

Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering


And nowadays you can use BERT:

Google released the source code as well as pretrained models.

https://github.com/google-research/bert

And here is an example of running BERT as a service:

https://github.com/hanxiao/bert-as-service
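A minimal usage sketch with bert-as-service; the BertClient API below follows that project's README, and the model directory path is a placeholder:

# start the server first (separate process), e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=1
from bert_serving.client import BertClient

bc = BertClient()
vecs = bc.encode(['How are you?', 'It is a nice day.'])
print(vecs.shape)  # (2, 768) for a BERT-base model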

I've had good results from:

  1. Summing the word vectors (with TF-IDF weighting). This ignores word order, but for many applications it is sufficient (especially for short documents).
  2. FastSent

There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.

First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.

A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by a parse tree of a sentence, using matrix-vector operations. This method works for sentence-level sentiment analysis, because it depends on parsing.

A deep averaging network (DAN) can provide sentence embeddings: word bi-grams are averaged and passed through a feedforward deep neural network (DNN).

It has been found that transfer learning using sentence embeddings tends to outperform word-level transfer, as it preserves the semantic relationship.

You don't need to start the training from scratch; pretrained DAN models are available (check out the Universal Sentence Encoder module on Google's TensorFlow Hub).

Google's Universal Sentence Encoder embeddings are a more recent solution to this problem. It doesn't use Word2vec, but it results in a competing solution.

Here is a walk-through with TFHub and Keras.
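A minimal sketch of getting sentence embeddings from the Universal Sentence Encoder via TensorFlow Hub (the module URL is the published one; picking version 4 is an assumption, use whichever is current):

import tensorflow_hub as hub

# downloads and caches the pretrained model on first use
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
embeddings = embed(['How are you?', 'It is a nice day.'])
print(embeddings.shape)  # (2, 512)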

Suppose this is the current sentence:

import gensim

# load your trained word2vec vectors (path is a placeholder)
model = gensim.models.KeyedVectors.load_word2vec_format('path of your training dataset', binary=True)

strr = 'i am'
strr2 = strr.split()
print(strr2)
model[strr2]  # a matrix of word vectors, one row per token; average the rows for a sentence embedding