
Learning Word Vectors for Sentiment Analysis



Transcription of Learning Word Vectors for Sentiment Analysis

Learning Word Vectors for Sentiment Analysis
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts
Stanford University, Stanford, CA 94305

Abstract

Vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g., star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification.

We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.

1 Introduction

Word representations are a critical component of many natural language processing systems. It is common to represent words as indices in a vocabulary, but this fails to capture the rich relational structure of the lexicon. Vector-based models do much better in this regard. They encode continuous similarities between words as distance or angle between word vectors in a high-dimensional space. The general approach has proven useful in tasks such as word sense disambiguation, named entity recognition, part of speech tagging, and document retrieval (Turney and Pantel, 2010; Collobert and Weston, 2008; Turian et al., 2010). In this paper, we present a model to capture both semantic and sentiment similarities among words. The semantic component of our model learns word vectors via an unsupervised probabilistic model of documents.
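As a concrete illustration of the distance-or-angle point above, the short sketch below (not from the paper; the vectors and names are invented for illustration) computes cosine similarity, the standard angle-based similarity used with word vectors in such models.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional word vectors, just to show the computation.
vec_wonderful = np.array([0.9, 0.1, 0.4, 0.2])
vec_amazing   = np.array([0.8, 0.2, 0.5, 0.1])
print(cosine_similarity(vec_wonderful, vec_amazing))  # close to 1.0 for similar words
```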

However, in keeping with linguistic and cognitive research arguing that expressive content and descriptive semantic content are distinct (Kaplan, 1999; Jay, 2000; Potts, 2007), we find that this basic model misses crucial sentiment information. For example, while it learns that wonderful and amazing are semantically close, it does not capture the fact that these are both very strong positive sentiment words, at the opposite end of the sentiment spectrum. Thus, we extend the model with a supervised sentiment component that is capable of embracing many social and attitudinal aspects of meaning (Wilson et al., 2004; Alm et al., 2005; Andreevskaia and Bergler, 2006; Pang and Lee, 2005; Goldberg and Zhu, 2006; Snyder and Barzilay, 2007). This component of the model uses the vector representation of words to predict the sentiment annotations on contexts in which the words appear. This causes words expressing similar sentiment to have similar vector representations.

The full objective function of the model thus learns semantic vectors that are imbued with nuanced sentiment information. In our experiments, we show how the model can leverage document-level sentiment annotations of a sort that are abundant online in the form of consumer reviews for movies, products, etc. The technique is sufficiently general to work also with continuous and multi-dimensional notions of sentiment as well as non-sentiment annotations (e.g., political affiliation, speaker commitment). After presenting the model in detail, we provide illustrative examples of the vectors it learns, and then we systematically evaluate the approach on document-level and sentence-level classification tasks. Our experiments involve the small, widely used sentiment and subjectivity corpora of Pang and Lee (2004), which permits us to make comparisons with a number of related approaches and published results.

We also show that this dataset contains many correlations between examples in the training and testing sets. This leads us to evaluate on, and make publicly available, a large dataset of informal movie reviews from the Internet Movie Database (IMDB).

2 Related work

The model we present in the next section draws inspiration from prior work on both probabilistic topic modeling and vector-space models for word meanings. Latent Dirichlet Allocation (LDA; Blei et al., 2003) is a probabilistic document model that assumes each document is a mixture of latent topics. For each latent topic T, the model learns a conditional distribution p(w|T) for the probability that word w occurs in T. One can obtain a k-dimensional vector representation of words by first training a k-topic model and then filling the matrix with the p(w|T) values (normalized to unit length). The result is a word-topic matrix in which the rows are taken to represent word meanings.
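The word-vector extraction just described can be sketched as follows. This is not code from the paper; it assumes scikit-learn's LatentDirichletAllocation and a bag-of-words count matrix, and simply normalizes each word's vector of topic weights to unit length.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a wonderful amazing movie", "a terrible boring movie", "stock prices fell sharply"]

# Bag-of-words counts: documents x vocabulary.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Train a k-topic LDA model (k = 2 here, purely for illustration).
k = 2
lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)

# lda.components_ has shape (k, |V|); column j holds word j's weight in each topic.
# Transpose to a |V| x k word-topic matrix and normalize rows to unit length,
# so each row can be read as a k-dimensional word vector.
word_topic = lda.components_.T
word_vectors = word_topic / np.linalg.norm(word_topic, axis=1, keepdims=True)
```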

However, because the emphasis in LDA is on modeling topics, not word meanings, there is no guarantee that the row (word) vectors are sensible as points in a k-dimensional space. Indeed, we show in section 4 that using LDA in this way does not deliver robust word vectors. The semantic component of our model shares its probabilistic foundation with LDA, but is factored in a manner designed to discover word vectors rather than latent topics. Some recent work introduces extensions of LDA to capture sentiment in addition to topical information (Li et al., 2010; Lin and He, 2009; Boyd-Graber and Resnik, 2010). Like LDA, these methods focus on modeling sentiment-imbued topics rather than embedding words in a vector space. Vector space models (VSMs) seek to model words directly (Turney and Pantel, 2010). Latent Semantic Analysis (LSA), perhaps the best known VSM, explicitly learns semantic word vectors by applying singular value decomposition (SVD) to factor a term-document co-occurrence matrix.

It is typical to weight and normalize the matrix values prior to SVD. To obtain a k-dimensional representation for a given word, only the entries corresponding to the k largest singular values are taken from the word's basis in the factored matrix. Such matrix factorization-based approaches are extremely successful in practice, but they force the researcher to make a number of design choices (weighting, normalization, dimensionality reduction algorithm) with little theoretical guidance to suggest which to prefer. Using term frequency (tf) and inverse document frequency (idf) weighting to transform the values in a VSM often increases the performance of retrieval and categorization systems. Delta idf weighting (Martineau and Finin, 2009) is a supervised variant of idf weighting in which the idf calculation is done for each document class and then one value is subtracted from the other. Martineau and Finin present evidence that this weighting helps with sentiment classification, and Paltoglou and Thelwall (2010) systematically explore a number of weighting schemes in the context of sentiment analysis. The success of delta idf weighting in previous work suggests that incorporating sentiment information into VSM values via supervised methods is helpful for sentiment analysis.
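A compact sketch of the two ideas above (not from the paper; the library calls, smoothing, and the exact delta-idf formula are assumptions): an LSA-style word vector is a row of the term factor after a rank-k truncated SVD of the weighted term-document matrix, and delta idf replaces the single idf term with the difference between per-class idf values.

```python
import numpy as np

def lsa_word_vectors(X: np.ndarray, k: int) -> np.ndarray:
    """Rank-k LSA: X is a |V| x |D| (term x document) weighted co-occurrence matrix.
    Returns a |V| x k matrix whose rows are word vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]            # keep only the k largest singular values

def delta_idf(df_pos: np.ndarray, df_neg: np.ndarray, n_pos: int, n_neg: int) -> np.ndarray:
    """Supervised delta idf in the spirit of Martineau and Finin (2009):
    idf computed separately on positive and negative documents, then subtracted.
    df_pos / df_neg are per-term document frequencies; the +1 smoothing is an assumption."""
    idf_pos = np.log((n_pos + 1) / (df_pos + 1))
    idf_neg = np.log((n_neg + 1) / (df_neg + 1))
    return idf_pos - idf_neg           # large magnitude = term skews toward one class
```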

We adopt this insight, but we are able to incorporate it directly into our model's objective function. (Section 4 compares our approach with a representative sample of such weighting schemes.)

3 Our Model

To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations. This component does not require labeled data, and shares its foundation with probabilistic topic models such as LDA. The sentiment component of our model uses sentiment annotations to constrain words expressing similar sentiment to have similar representations. We can efficiently learn parameters for the joint objective function using alternating maximization.

3.1 Capturing Semantic Similarities

We build a probabilistic model of a document using a continuous mixture distribution over words indexed by a multi-dimensional random variable θ. We assume words in a document are conditionally independent given the mixture variable θ.

We assign a probability to a document d using a joint distribution over the document and θ. The model assumes each word w_i ∈ d is conditionally independent of the other words given θ. The probability of a document is thus

p(d) = \int p(d, \theta)\, d\theta = \int p(\theta) \prod_{i=1}^{N} p(w_i \mid \theta)\, d\theta,    (1)

where N is the number of words in d and w_i is the i-th word in d. We use a Gaussian prior on θ. We define the conditional distribution p(w_i | θ) using a log-linear model with parameters R and b. The energy function uses a word representation matrix R ∈ ℝ^{β×|V|} where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = R_w corresponding to that word's column in R. The random variable θ is also a β-dimensional vector, θ ∈ ℝ^β, which weights each of the β dimensions of words' representation vectors. We additionally introduce a bias b_w for each word to capture differences in overall word frequencies. The energy assigned to a word w given these model parameters is

E(w; \theta, \phi_w, b_w) = -\theta^{T} \phi_w - b_w.    (2)

To obtain the distribution p(w|θ) we use a softmax,

p(w \mid \theta; R, b) = \frac{\exp(-E(w; \theta, \phi_w, b_w))}{\sum_{w' \in V} \exp(-E(w'; \theta, \phi_{w'}, b_{w'}))}    (3)
                       = \frac{\exp(\theta^{T} \phi_w + b_w)}{\sum_{w' \in V} \exp(\theta^{T} \phi_{w'} + b_{w'})}.    (4)

The number of terms in the denominator's summation grows linearly in |V|, making exact computation of the distribution possible. For a given θ, a word w's occurrence probability is related to how closely its representation vector φ_w matches the scaling direction of θ. This idea is similar to the word vector inner product used in the log-bilinear language model of Mnih and Hinton (2007). Equation 1 resembles the probabilistic model of LDA (Blei et al., 2003), which models documents as mixtures of latent topics. One could view the entries of a word vector φ as that word's association strength with respect to each latent topic. The random variable θ then defines a weighting over topics. However, our model does not attempt to model individual topics, but instead directly models word probabilities conditioned on the topic mixture variable θ.
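A minimal numerical sketch of equations (2)-(4), not taken from the paper and with all shapes and names invented for illustration: given θ, the word representation matrix R, and the bias vector b, the distribution over the vocabulary is a softmax of θᵀR + b.

```python
import numpy as np

def word_distribution(theta: np.ndarray, R: np.ndarray, b: np.ndarray) -> np.ndarray:
    """p(w | theta; R, b) for every word in the vocabulary (equations 2-4).

    theta: (beta,) mixture variable; R: (beta, |V|) word representation matrix
    (column w is phi_w); b: (|V|,) per-word bias terms.
    """
    energies = -(theta @ R + b)        # E(w) = -theta^T phi_w - b_w, for all words at once
    logits = -energies                 # softmax is taken over the negative energies
    logits -= logits.max()             # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()         # normalize over the |V| vocabulary entries

# Tiny example with beta = 3 dimensions and a 5-word vocabulary.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
R = rng.normal(size=(3, 5))
b = np.zeros(5)
print(word_distribution(theta, R, b))  # sums to 1 over the 5 words
```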

