GloVe: Global Vectors for Word Representation


Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA

Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

1 Introduction

Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013). Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference.

For example, the analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation king - queen = man - woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).
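As an aside on how this analogy evaluation is typically scored, here is a minimal sketch (our own illustration, not code from the paper): the answer to "a is to b as c is to ?" is taken to be the vocabulary word closest to vec(b) - vec(a) + vec(c).

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the vocabulary word whose
    vector is most similar (by cosine) to vec(b) - vec(a) + vec(c).
    Assumes `vectors` maps words to L2-normalized numpy arrays."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the analogy benchmark excludes the three query words
        sim = float(target @ vec)  # cosine similarity for unit vectors
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# solve_analogy("king", "queen", "man", vectors) should return "woman"
# when the vectors encode king - queen = man - woman.
```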

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark. We provide the source code for the model as well as trained word vectors.
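The specific objective is developed later in the paper; purely as a hedged sketch of what a weighted least-squares loss over nonzero co-occurrence counts can look like (the function name and the weighting constants below are illustrative assumptions, not taken from this excerpt):

```python
import numpy as np

def weighted_ls_loss(W, W_ctx, b, b_ctx, cooc, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over the nonzero entries of a word-word
    co-occurrence table. `cooc` maps (i, j) index pairs to counts X_ij;
    W and W_ctx are word and context embedding matrices, b and b_ctx are
    bias vectors. The weight f(x) = min((x / x_max) ** alpha, 1) keeps very
    frequent pairs from dominating the sum."""
    loss = 0.0
    for (i, j), x_ij in cooc.items():
        f = min((x_ij / x_max) ** alpha, 1.0)
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
        loss += f * err ** 2
    return loss
```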

2 Related Work

Matrix Factorization Methods. Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of "term-document" type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of "term-term" type, i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.
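To make the term-term setup concrete, here is a minimal window-counting sketch (a simplification for illustration; HAL additionally weights each co-occurrence by distance and distinguishes left from right context):

```python
from collections import defaultdict

def term_term_counts(tokens, window=10):
    """Count how often each context word c appears within `window` positions
    of each word w, giving counts[w][c]: a simplified, symmetric variant of
    the term-term matrices used by HAL and related methods."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

# Example: term_term_counts("ice is solid and steam is gas".split(), window=2)
```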

A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: the number of times two words co-occur with "the" or "and", for example, will have a large effect on their similarity despite conveying relatively little about their semantic relatedness. A number of techniques exist that address this shortcoming of HAL, such as the COALS method (Rohde et al., 2006), in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. An advantage of this type of transformation is that the raw co-occurrence counts, which for a reasonably sized corpus might span 8 or 9 orders of magnitude, are compressed so as to be distributed more evenly in a smaller interval. A variety of newer models also pursue this approach, including a study (Bullinaria and Levy, 2007) that indicates that positive pointwise mutual information (PPMI) is a good transformation. More recently, a square root type transformation in the form of Hellinger PCA (HPCA) (Lebret and Collobert, 2014) has been suggested as an effective way of learning word representations.
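As a small sketch of the PPMI transformation mentioned above (a dense matrix is used here for readability; practical implementations typically work on sparse matrices):

```python
import numpy as np

def ppmi(cooc):
    """Positive pointwise mutual information of a dense co-occurrence matrix
    (rows = words, columns = context words):
    PPMI(i, j) = max(0, log(P(i, j) / (P(i) * P(j))))."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_ij = cooc / total
    p_i = p_ij.sum(axis=1, keepdims=True)   # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)   # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0            # zero counts give -inf/nan; clamp
    return np.maximum(pmi, 0.0)
```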

Shallow Window-Based Methods. Another approach is to learn word representations that aid in making predictions within local context windows. For example, Bengio et al. (2003) introduced a model that learns word vector representations as part of a simple neural network architecture for language modeling. Collobert and Weston (2008) decoupled the word vector training from the downstream training objectives, which paved the way for Collobert et al. (2011) to use the full context of a word for learning the word representations, rather than just the preceding context as is the case with language models.

Recently, the importance of the full neural network structure for learning useful word representations has been called into question. The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a) propose a simple single-layer architecture based on the inner product between two word vectors. Mnih and Kavukcuoglu (2013) also proposed closely-related vector log-bilinear models, vLBL and ivLBL, and Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric. In the skip-gram and ivLBL models, the objective is to predict a word's context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.
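To make the two prediction directions concrete, here is a hedged sketch of skip-gram-style scoring with a full softmax over the vocabulary (the actual models rely on hierarchical softmax or sampling approximations to make this tractable; the function below is our illustration, not any model's released code):

```python
import numpy as np

def log_p_context_given_center(center, context, W_in, W_out):
    """log P(context | center) under a softmax over inner products, the shape
    of objective used by skip-gram and ivLBL; CBOW and vLBL invert the
    direction and predict the center word from its context."""
    scores = W_out @ W_in[center]                 # one score per vocab word
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())  # numerically stable log-sum-exp
    return scores[context] - log_z
```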

Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.

Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.

3 The GloVe Model

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning. In this section, we shed some light on this question.

We use our insights to construct a new model for word representation, which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by X, whose entries X_ij tabulate the number of times word j occurs in the context of word i. Let X_i = sum_k X_ik be the number of times any word appears in the context of word i. Finally, let P_ij = P(j|i) = X_ij / X_i be the probability that word j appears in the context of word i.

Table 1: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.

Probability and Ratio   | k = solid    | k = gas      | k = water    | k = fashion
P(k|ice)                | 1.9 x 10^-4  | 6.6 x 10^-5  | 3.0 x 10^-3  | 1.7 x 10^-5
P(k|steam)              | 2.2 x 10^-5  | 7.8 x 10^-4  | 2.2 x 10^-3  | 1.8 x 10^-5
P(k|ice)/P(k|steam)     | 8.9          | 8.5 x 10^-2  | 1.36         | 0.96
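A toy illustration of this notation (our own helper, not the paper's released implementation):

```python
import numpy as np

def cooccurrence_probabilities(X):
    """Given a word-word co-occurrence matrix X, where X[i, j] counts how
    often word j occurs in the context of word i, return the matrix of
    P_ij = P(j|i) = X_ij / X_i, with X_i = sum_k X_ik."""
    X = np.asarray(X, dtype=float)
    X_i = X.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.where(X_i > 0, X / X_i, 0.0)
    return P
```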

We begin with a simple example that showcases how certain aspects of meaning can be extracted directly from co-occurrence probabilities. Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. For words k related to ice but not steam, say k = solid, we expect the ratio P_ik/P_jk will be large. Similarly, for words k related to steam but not ice, say k = gas, the ratio should be small. For words k like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to one. Table 1 shows these probabilities and their ratios for a large corpus, and the numbers confirm these expectations. Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better able to discriminate between the two relevant words.

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.
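Worked through with the rounded probabilities printed in Table 1, the ratio probe described above looks like this (the small differences from the table's ratio row come from that rounding):

```python
# Rounded probabilities from Table 1.
p_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for k in p_ice:
    print(f"P({k}|ice) / P({k}|steam) = {p_ice[k] / p_steam[k]:.2f}")
# solid   -> ~8.6   (>> 1: specific to ice)
# gas     -> ~0.08  (<< 1: specific to steam)
# water   -> ~1.36  (~ 1: related to both)
# fashion -> ~0.94  (~ 1: related to neither)
```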

