Text mining and topic models - University of California ...

Charles Elkan, 2014

Text mining means the application of learning algorithms to documents consisting of words and sentences. Text mining tasks include classifier learning, clustering, and theme identification. Classifiers for documents are useful for many applications. Major uses for binary classifiers include spam detection and personalization of streams of news articles. Multiclass classifiers are useful for routing messages to recipients. Most classifiers for documents are designed to categorize according to subject matter. However, it is also possible to learn to categorize according to qualitative criteria such as helpfulness for product reviews submitted by consumers.

In many applications of multiclass classification, a single document can belong to more than one category, so it is correct to predict more than one label. This task is specifically called multilabel classification. In standard multiclass classification, the classes are mutually exclusive, so a special type of negative correlation is fixed in advance.

In multilabel classification, it is important to learn the positive and negative correlations between categories.

The bag-of-words representation

In order to do text mining, the first question we must answer is how to represent documents. For genuine understanding of natural language one must obviously preserve the order of the words in documents. However, for many large-scale data mining tasks, including classifying and clustering documents, it is sufficient to use a simple representation that loses all information about word order.

Given a collection of documents, the first task to perform is to identify the set of all words used at least once in at least one document. This set is called the vocabulary. Often, it is reduced in size by keeping only words that are used in at least two documents. (Words that are found only once are often mis-spellings or other mistakes.) Although the vocabulary is a set, we fix an arbitrary ordering for it so we can refer to word 1 through word m, where m is the size of the vocabulary.

Once the vocabulary has been fixed, each document is represented as a vector with integer entries of length m.

If this vector is x, then its jth component x_j is the number of appearances of word j in the document. The length of the document is n = Σ_{j=1}^m x_j. For typical documents, n is much smaller than m, and x_j = 0 for most words j.

Many applications of text mining also eliminate from the vocabulary so-called stop words. These are words that are common in most documents and do not correspond to any particular subject matter. In linguistics these words are sometimes called function words or closed-class words. They include pronouns (you, he, it), connectives (and, because, however), prepositions (to, of, before), and auxiliaries (have, been, can, should). Stop words may also include generic nouns (amount, part, nothing) and verbs (do, have). It is important to appreciate, however, that generic words carry a lot of information for many tasks, including identifying the author of a document or detecting its genre.

A collection of documents is represented as a two-dimensional matrix where each row describes a document and each column corresponds to a word.
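To make this representation concrete, here is a minimal Python sketch (not from the notes; the toy documents and the stop-word list are invented) that builds a vocabulary of words appearing in at least two documents and the corresponding document-term count matrix.

```python
from collections import Counter

# Toy documents and an illustrative stop-word list.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
stop_words = {"the", "on"}

# Tokenize and count words per document, dropping stop words.
counts = [Counter(w for w in d.split() if w not in stop_words) for d in docs]

# Vocabulary: words used in at least two documents, in a fixed arbitrary order.
doc_freq = Counter(w for c in counts for w in c)
vocab = sorted(w for w, df in doc_freq.items() if df >= 2)
m = len(vocab)

# Document-term matrix: one row per document, one integer count per vocabulary word.
X = [[c[w] for w in vocab] for c in counts]
for row in X:
    print(row)
```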

Each entry in this matrix is an integer count; most entries are zero. It makes sense to view each column as a feature. It also makes sense to learn a low-rank approximation of the whole matrix; doing this is called latent semantic analysis (LSA) and is discussed elsewhere.

The multinomial distribution

Once we have a representation for individual documents, the natural next step is to select a model for a set of documents. It is important to understand the difference between a representation and a model. A representation is a way of encoding an entity as a data structure. A model is an abstraction of a set of entities, for example a probability distribution. Given a training set of documents, we will choose values for the parameters of a probabilistic model that make the training documents have high probability. Then, given a test document, we can evaluate its probability according to the model. The higher this probability is, the more similar the test document is to the training set.

The probability distribution that we use is the multinomial.

Mathematically, this distribution is

p(x; \theta) = \left( \frac{n!}{\prod_{j=1}^{m} x_j!} \right) \left( \prod_{j=1}^{m} \theta_j^{x_j} \right)    (1)

where the data x are a vector of non-negative integers and the parameters θ are a real-valued vector. Both vectors have the same length m.

Intuitively, θ_j is the probability of word j, while x_j is the count of word j. Each time word j appears in the document it contributes an amount θ_j to the total probability, hence the term θ_j^{x_j}. The components of θ are non-negative and have unit sum: Σ_{j=1}^m θ_j = 1. A vector with these properties is called a unit vector.

Each vector x of word counts can be generated by any member of an equivalence class of sequences of words. In Equation (1) the first factor in parentheses is called a multinomial coefficient. It is the size of the equivalence class of x, that is the number of different word sequences that yield the same counts. The second factor in parentheses in Equation (1) is the probability of any individual member of the equivalence class of x.

Like any discrete distribution, a multinomial has to sum to one, where the sum is over all possible data points.
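As a quick numerical check of Equation (1) (my own sketch, with an invented parameter vector), the following code enumerates every possible count vector for documents of a fixed length n over a tiny vocabulary and verifies that the probabilities sum to one.

```python
import math
from itertools import product

theta = [0.5, 0.3, 0.2]     # illustrative parameter vector, vocabulary size m = 3
m, n = len(theta), 4        # consider all documents of length n = 4

def multinomial_prob(x, theta):
    # Equation (1): (n! / prod_j x_j!) * prod_j theta_j^{x_j}
    coeff = math.factorial(sum(x))
    for xj in x:
        coeff //= math.factorial(xj)
    p = float(coeff)
    for xj, tj in zip(x, theta):
        p *= tj ** xj
    return p

# Enumerate every word sequence of length n, collapse each to its count vector x,
# and sum p(x; theta) once per distinct x.
count_vectors = {tuple(seq.count(j) for j in range(m))
                 for seq in product(range(m), repeat=n)}
total = sum(multinomial_prob(x, theta) for x in count_vectors)
print(total)    # 1.0 up to floating-point rounding
```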

Here, a data point is a document containing n words. The number of such documents is exponential in their length n: it is m^n. The probability of any individual document will therefore be very small. What is important is the relative probability of different documents. A document that mostly uses words with high probability will have higher relative probability.

At first sight, computing the probability of a document requires O(m) time because of the product over j. However, if x_j = 0 then θ_j^{x_j} = 1, so the jth factor can be omitted from the product. Similarly, 0! = 1, so the jth factor can be omitted from Π_{j=1}^m x_j!. Hence, computing the probability of a document needs only O(n) time.

Because the probabilities of individual documents decline exponentially with length n, it is necessary to do numerical computations with log probabilities:

\log p(x; \theta) = \log n! - \left[ \sum_{j=1}^{m} \log x_j! \right] + \left[ \sum_{j=1}^{m} x_j \log \theta_j \right]

Given a set of training documents, the maximum-likelihood estimate of the jth parameter is

\theta_j = \frac{1}{T} \sum_{x} x_j

where the sum is over all documents x belonging to the training set.

The normalizing constant is T = Σ_x Σ_j x_j, which is the sum of the sizes of all training documents.

If a multinomial has θ_j = 0 for some j, then every document with x_j > 0 for this j has zero probability, regardless of any other words in the document. Probabilities that are perfectly zero are undesirable, so we want θ_j > 0 for all j. Smoothing with a constant c is the way to achieve this. We set

\theta_j = \frac{1}{T'} \left( c + \sum_{x} x_j \right).

The constant c is called a pseudocount. Intuitively, it is a notional number of appearances of word j that are assumed to exist, regardless of the true number of appearances. Typically c is chosen in the range 0 < c ≤ 1. Because the equality Σ_j θ_j = 1 must be preserved, the normalizing constant must be T' = mc + T. In order to avoid big changes in the estimated probabilities θ_j, one should have c < T/m.

Technically, one multinomial is a distribution over all documents of a fixed size n. Therefore, what is learned by the maximum-likelihood process just described is in fact a different distribution for each size n. These distributions, although separate, have the same parameter values.
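The estimation and scoring steps above can be sketched in a few lines of Python (the function names, toy counts, and pseudocount value are mine, not from the notes): smoothed parameters are estimated as θ_j = (c + Σ_x x_j) / (mc + T), and the log probability of a test document is computed in O(n) time by skipping zero counts.

```python
import math

def smoothed_theta(train_counts, c=1.0):
    """Smoothed estimates theta_j = (c + sum_x x_j) / (m*c + T)."""
    m = len(train_counts[0])
    totals = [sum(x[j] for x in train_counts) for j in range(m)]
    T = sum(totals)                              # total size of all training documents
    return [(c + totals[j]) / (m * c + T) for j in range(m)]

def log_prob(x, theta):
    """log p(x; theta) from Equation (1), skipping zero counts (O(n) work)."""
    n = sum(x)
    log_p = math.lgamma(n + 1)                   # log n!
    for xj, tj in zip(x, theta):
        if xj > 0:
            log_p += xj * math.log(tj) - math.lgamma(xj + 1)
    return log_p

# Toy example: vocabulary of size m = 4, two training documents, one test document.
train = [[3, 1, 0, 0], [2, 0, 1, 0]]
theta = smoothed_theta(train, c=0.5)
print(theta, sum(theta))                         # all theta_j > 0, summing to 1
print(log_prob([2, 1, 0, 1], theta))             # higher means more similar to training
```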

Parameters are said to be tied if they are required to be the same for different distributions.

Generative processes

Suppose that we have a collection of documents, and we want to find an organization for these documents; that is, we want to do unsupervised learning. A common way to do unsupervised learning is to assume that the data were generated by some probabilistic process, and then to infer the parameters of this process. The generative process is a specification of a parametrized family of distributions. Learning is based on the principle of maximum likelihood, or some refinement of this principle such as maximum a posteriori (MAP) estimation.

The generative process for a single document is as follows:

A: fix a multinomial distribution with parameter vector φ of length V
B: for each word in the document
       draw a word w according to φ

Above, step A sets up the probability distributions that are then used in step B to produce the observed training data.

A single multinomial distribution can only represent a category of closely related documents.
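A rough simulation of steps A and B (the vocabulary, the probabilities, and the document length are invented for illustration):

```python
import random

# Step A: fix a multinomial over the vocabulary (both are illustrative).
vocab = ["price", "market", "goal", "match", "team"]
phi = [0.30, 0.25, 0.20, 0.15, 0.10]     # parameter vector of length V = 5

# Step B: draw each word of the document independently according to phi.
def generate_document(n):
    return random.choices(vocab, weights=phi, k=n)

print(generate_document(8))
```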

For a collection of documents of multiple categories, a simple generative process is

A: fix a multinomial α over categories 1 to K
   for category number 1 to category number K
       fix a multinomial with parameter vector θ_k
B: for document number 1 to document number M
       draw a category z according to α
       for each word in the document
           draw a word w according to θ_z

Note that z is an integer between 1 and K. For each document, the value of z is hidden, meaning that it exists conceptually, but it is never known, not even for training data. The generative process above gives the following global probability distribution:

f(x) = \sum_{k=1}^{K} \alpha_k \, f(x; \theta_k)

where x is a document and θ_k is the parameter vector of the kth component. In general, x could be a data point of any type and θ_k could be the parameters of any appropriate distribution. K is called the number of components in the mixture model. For each k, f(x; θ_k) is the distribution of component number k. The scalar α_k is the proportion of component number k. A distribution like this is called a mixture distribution.
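As an illustration of this formula, the sketch below (toy numbers and helper names of my own) evaluates f(x) = Σ_k α_k f(x; θ_k) for K = 2 multinomial components.

```python
import math

def multinomial_pmf(x, theta):
    # Equation (1) for a dense count vector x and parameter vector theta.
    log_p = math.lgamma(sum(x) + 1)
    for xj, tj in zip(x, theta):
        if xj > 0:
            log_p += xj * math.log(tj) - math.lgamma(xj + 1)
    return math.exp(log_p)

def mixture_prob(x, alpha, thetas):
    # f(x) = sum_k alpha_k f(x; theta_k): the hidden category z is summed out.
    return sum(a * multinomial_pmf(x, th) for a, th in zip(alpha, thetas))

# K = 2 multinomial components over a vocabulary of size 3 (numbers are invented).
alpha = [0.6, 0.4]                       # mixing proportions alpha_k
thetas = [[0.7, 0.2, 0.1],               # component 1 favours word 0
          [0.1, 0.2, 0.7]]               # component 2 favours word 2
print(mixture_prob([3, 1, 0], alpha, thetas))
```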

In general, the components can be probability density functions (pdfs), for example Gaussians, or probability mass functions (pmfs), for example multinomials. We will see later how to do maximum likelihood estimation for mixture distributions, using a technique called expectation-maximization (EM).

Latent Dirichlet allocation

The specific topic model we consider is called latent Dirichlet allocation (LDA). (The same abbreviation is also used for linear discriminant analysis, which is unrelated.) LDA is based on the intuition that each document contains words from multiple topics; the proportion of each topic in each document is different, but the topics themselves are the same for all documents.

The generative process assumed by the LDA model is as follows:

Given: Dirichlet distribution with parameter vector α of length K
Given: Dirichlet distribution with parameter vector β of length V
for topic number 1 to topic number K
    draw a multinomial with parameter vector φ_k according to β
for document number 1 to document number M
    draw a topic distribution, i.e. a multinomial θ, according to α
    for each word in the document
        draw a topic z according to θ
        draw a word w according to φ_z

Note that z is an integer between 1 and K for each word.

Before we go into the LDA model in mathematical detail, we need to review the Dirichlet distribution.
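Before that review, here is a rough simulation of the LDA generative process listed above (a sketch under my own naming; the Dirichlet draws are implemented by normalizing independent Gamma samples, and all parameter values are illustrative):

```python
import random

def sample_lda_corpus(alpha, beta, M, doc_len):
    """Simulate the LDA generative process (a sketch; names and values are mine).

    alpha: Dirichlet parameter vector of length K (prior on topic proportions)
    beta:  Dirichlet parameter vector of length V (prior on word distributions)
    Returns M documents, each a list of word indices in 0..V-1.
    """
    K, V = len(alpha), len(beta)

    def dirichlet(params):
        # Draw from a Dirichlet by normalizing independent Gamma(a, 1) samples.
        g = [random.gammavariate(a, 1.0) for a in params]
        s = sum(g)
        return [v / s for v in g]

    phi = [dirichlet(beta) for _ in range(K)]        # one multinomial phi_k per topic
    docs = []
    for _ in range(M):
        theta = dirichlet(alpha)                     # topic distribution for this document
        words = []
        for _ in range(doc_len):
            z = random.choices(range(K), weights=theta, k=1)[0]   # draw a topic z
            w = random.choices(range(V), weights=phi[z], k=1)[0]  # draw a word from phi_z
            words.append(w)
        docs.append(words)
    return docs

print(sample_lda_corpus(alpha=[0.5, 0.5, 0.5], beta=[0.1] * 8, M=2, doc_len=10))
```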

