
Distributed Representations of Sentences and Documents


Quoc Le, Tomas Mikolov
Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.

Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

1. Introduction

Text classification and clustering play an important role in many applications, e.g., document retrieval, web search, spam filtering. At the heart of these applications are machine learning algorithms such as logistic regression or K-means.

These algorithms typically require the text input to be represented as a fixed-length vector. Perhaps the most common fixed-length vector representation for texts is the bag-of-words or bag-of-n-grams (Harris, 1954) due to its simplicity, efficiency and often surprising accuracy. However, the bag-of-words (BOW) representation has many disadvantages. The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and high dimensionality.
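As a concrete illustration of the word-order weakness, the short sketch below (plain Python, standard library only; the two example sentences are hypothetical and not from the paper) builds bag-of-words counts for two sentences that mean different things but use the same words, and shows that their representations are identical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Return a word -> count mapping; all ordering information is discarded."""
    return Counter(sentence.lower().split())

# Two sentences with different meanings but the same multiset of words.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)  # True: bag-of-words cannot distinguish the two sentences
```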

Bag-of-words and bag-of-n-grams have very little sense about the semantics of the words or, more formally, the distances between the words. This means that words "powerful," "strong" and "Paris" are equally distant despite the fact that, semantically, "powerful" should be closer to "strong" than to "Paris."

In this paper, we propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector is to emphasize the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.

In our model, the vector representation is trained to be useful for predicting words in a paragraph.

More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
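A minimal numpy sketch of this inference step is shown below: the word-vector matrix and the softmax parameters stay frozen, and only a freshly initialized paragraph vector is updated by gradient steps so that it becomes good at predicting the words of the new paragraph. The function and variable names (infer_paragraph_vector, W, U, b) are illustrative assumptions, the context vectors are averaged rather than concatenated to keep the shapes simple, and a plain softmax loss is used, whereas the paper relies on hierarchical softmax in practice.

```python
import numpy as np

def infer_paragraph_vector(context_ids, target_ids, W, U, b,
                           dim=50, lr=0.025, epochs=50, seed=0):
    """Infer a vector for an unseen paragraph (illustrative sketch only).

    W : (vocab, dim)  fixed, already-trained word-vector matrix
    U : (vocab, dim)  fixed softmax weights
    b : (vocab,)      fixed softmax biases
    context_ids : list of lists, word ids of each context window
    target_ids  : list, id of the word each window should predict
    Only the new paragraph vector d is updated; W, U, b stay frozen.
    """
    rng = np.random.default_rng(seed)
    d = rng.normal(scale=0.01, size=dim)            # new paragraph vector

    for _ in range(epochs):
        for ctx, tgt in zip(context_ids, target_ids):
            # h averages the paragraph vector with the context word vectors
            # (the paper concatenates or averages; averaging is the simpler choice).
            h = (d + W[ctx].sum(axis=0)) / (len(ctx) + 1)
            y = b + U @ h                            # un-normalized log-probabilities
            p = np.exp(y - y.max())
            p /= p.sum()                             # softmax
            grad_y = p
            grad_y[tgt] -= 1.0                       # dL/dy for cross-entropy loss
            grad_h = U.T @ grad_y
            d -= lr * grad_h / (len(ctx) + 1)        # update only the paragraph vector
    return d
```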

Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that, after the model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations (e.g., "strong" is close to "powerful").

Following these successful techniques, researchers have tried to extend the models to go beyond word level to achieve phrase-level or sentence-level representations (Mitchell & Lapata, 2010; Zanzotto et al., 2010; Yessenalina & Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013c).

For instance, a simple approach is using a weighted average of all the words in the document. A more sophisticated approach is combining the word vectors in an order given by a parse tree of a sentence, using matrix-vector operations (Socher et al., 2011b). Both approaches have weaknesses. The first approach, weighted averaging of word vectors, loses the word order in the same way as the standard bag-of-words models do. The second approach, using a parse tree to combine word vectors, has been shown to work for only sentences because it relies on parsing.

Paragraph Vector is capable of constructing representations of input sequences of variable length.

Unlike some of the previous approaches, it is general and applicable to texts of any length: sentences, paragraphs, and documents. It does not require task-specific tuning of the word weighting function, nor does it rely on parse trees. Further in the paper, we will present experiments on several benchmark datasets that demonstrate the advantages of Paragraph Vector. For example, on a sentiment analysis task, we achieve new state-of-the-art results, better than complex methods, yielding a relative improvement of more than 16% in terms of error rate. On a text classification task, our method convincingly beats bag-of-words models, giving a relative improvement of about 30%.

2. Algorithms

We start by discussing previous methods for learning word vectors.

These methods are the inspiration for our Paragraph Vector methods.

2.1. Learning Vector Representation of Words

This section introduces the concept of distributed vector representation of words. A well known framework for learning the word vectors is shown in Figure 1. The task is to predict a word given the other words in a context.

[Figure 1. A framework for learning word vectors. A context of three words ("the," "cat," and "sat") is used to predict the fourth word ("on"). The input words are mapped to columns of the matrix W to predict the output word.]

In this framework, every word is mapped to a unique vector, represented by a column in a matrix W. The column is indexed by the position of the word in the vocabulary. The concatenation or sum of the vectors is then used as features for prediction of the next word in a sentence.

More formally, given a sequence of training words w_1, w_2, w_3, ..., w_T, the objective of the word vector model is to maximize the average log probability

\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})

The prediction task is typically done via a multiclass classifier, such as softmax. There, we have

p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}

Each y_i is an un-normalized log-probability for output word i, computed as

y = b + U h(w_{t-k}, \ldots, w_{t+k}; W)    (1)

where U, b are the softmax parameters and h is constructed by a concatenation or average of word vectors extracted from W. In practice, hierarchical softmax (Morin & Bengio, 2005; Mnih & Hinton, 2008; Mikolov et al., 2013c) is preferred to softmax for fast training.
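To make the objective concrete, the following sketch (numpy only; the function name, the toy parameters, and the choice of averaging the context vectors are illustrative assumptions, not the paper's implementation) computes the scores y = b + U h of Eq. (1), applies the softmax, and returns the log-probability of the target word, i.e. the quantity whose average over positions the model maximizes. For convenience W stores one word vector per row rather than per column as in the paper.

```python
import numpy as np

def context_log_prob(W, U, b, context_ids, target_id):
    """log p(w_t | context) for one training position, following Eq. (1).

    W : (vocab, dim) word-vector matrix (one row per vocabulary word)
    U : (vocab, dim) softmax weight matrix
    b : (vocab,)     softmax biases
    h is built as the average of the context word vectors; the paper allows
    either concatenation or averaging.
    """
    h = W[context_ids].mean(axis=0)                            # combine context vectors
    y = b + U @ h                                              # un-normalized log-probabilities
    log_p = y - y.max() - np.log(np.exp(y - y.max()).sum())    # numerically stable log-softmax
    return log_p[target_id]

# Toy usage with random parameters (purely illustrative).
rng = np.random.default_rng(0)
vocab, dim = 1000, 50
W = rng.normal(scale=0.1, size=(vocab, dim))
U = rng.normal(scale=0.1, size=(vocab, dim))
b = np.zeros(vocab)
print(context_log_prob(W, U, b, context_ids=[3, 17, 42], target_id=7))
```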

