Transcription of Machine Learning: Generative and Discriminative Models
1 Machine Learning: Generative and Discriminative ModelsSargur N. Learning Course: ~srihari/CSE574 LearningSrihari2 Outline of Presentation1. What is Machine Learning?ML applications, ML as Search2. Generative and Discriminative Taxonomy3. Generative - Discriminative PairsClassifiers: Na ve Bayes and Logistic RegressionSequential Data: HMMs and CRFs4. Performance Comparison in Sequential ApplicationsNLP: Table extraction, POS tagging, Shallow parsing, Handwritten word recognition, Document analysis5. Advantages, disadvantages6. Summary7. ReferencesMachine LearningSrihari31. Machine Learning Programming computers to use example data or past experience Well-Posed Learning Problems A computer program is said to learn from experience E with respect to class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience LearningSrihari4 Problems Too Difficult To Program by Hand Learning to drive an autonomous vehicle Train computer-controlled vehicles to steer correctly Drive at 70 mph for 90 miles on public highways Associate steering commands with image sequencesTask T: driving on public, 4-lane highway using vision sensorsPerform measure P: average distance traveled before error (as judged by human overseer)Training E: sequence of images and steering commands recorded while observing a human driverMachine LearningSrihari5 Example Problem.
2 Handwritten Digit Recognition Handcrafted rules will result in large no of rules and exceptions Better to have a Machine that learns from a large training setWide variability of same numeralMachine LearningSrihari6 Other Applications of Machine Learning Recognizing spoken words Speaker-specific strategies for recognizing phonemes and words from speech Neural networks and methods for learning HMMs for customizing to individual speakers, vocabularies and microphone characteristics Search engines Information extraction from text Data mining Very large databases to learn general regularities implicit in data Classify celestial objects from image data Decision tree for objects in sky survey: 3 terabytesMachine LearningSrihari7ML as Searching Hypotheses Space Very large space of possible hypotheses to fit: observed data and any prior knowledge held by the observerMethodHypothesis SpaceConcept LearningBoolean ExpressionsDecision TreesAll Possible TreesNeural NetworksWeight SpaceMachine LearningSrihari8ML Methodologies are increasingly statistical Rule-based expert systems being replaced by probabilistic Generative Models Example: Autonomous agents in AI ELIZA : natural language rules to emulate therapy session Manual specification of Models , theories are increasingly difficult Greater availability of data and computational power to migrate away from rule-based and manually specified Models to probabilistic data-driven modesMachine LearningSrihari9 The Statistical ML Approach1.
3 Data CollectionLarge sample of data of how humans perform the task2. Model SelectionSettle on a parametric statistical model of the process3. Parameter EstimationCalculate parameter values by inspecting the dataUsing learned model perform:4. SearchFind optimal solution to given problemMachine LearningSrihari102. Generative and Discriminative Models : An analogy The task is to determine the language that someone is speaking Generative approach: is to learn each language and determine as to which language the speech belongs to Discriminative approach: is determine the linguistic differences without learning any language a much easier task! Machine LearningSrihari11 Taxonomy of ML Models Generative Methods Model class-conditional pdfs and prior probabilities Generative since sampling can generate synthetic data points Popular Models Gaussians, Na ve Bayes, Mixtures of multinomials Mixtures of Gaussians, Mixtures of experts, Hidden Markov Models (HMM) Sigmoidal belief networks, Bayesian networks, Markov random fields Discriminative Methods Directly estimate posterior probabilities No attempt to model underlying probability distributions Focus computational resources on given task better performance Popular Models Logistic regression, SVMs Traditional neural networks, Nearest neighbor Conditional Random Fields (CRF) Generative Models (graphical)
4 Parent nodeselects betweencomponentsMarkov RandomFieldQuick Medical Reference -DTDiagnosingDiseases from SymptomsMachine LearningSrihari13 Successes of Generative Methods NLP Traditional rule-based or Boolean logic systems (eg Dialog and Lexis-Nexis) are giving way to statistical approaches (Markov Models and stochastic context free grammars) Medical Diagnosis QMR knowledge base, initially a heuristic expert systems for reasoning about diseases and symptoms has been augmented with decision theoretic formulation Genomics and Bioinformatics Sequences represented as Generative HMMsMachine LearningSrihari14 Discriminative Classifier: SVMN onlinear decision boundary(x1, x2) (x1, x2, x1x2) Linear boundaryin high-dimensionalspaceMachine LearningSrihari15 Three support vectors are shown as solid dotsSupport Vector Machines Support vectors are those nearest patterns at distance b from hyperplane SVM finds hyperplane with maximum distance from nearest training patterns For full description of SVMs ~srihari/CSE555 LearningSrihari163.
5 Generative - Discriminative Pairs Na ve Bayes and Logistic Regression form a Generative - Discriminative pair for classification Their relationship mirrors that between HMMs and linear-chain CRFs for sequential dataMachine LearningSrihari17 Graphical Model RelationshipNa ve Bayes ClassifierLogistic RegressionHidden Markov ModelSEQUENCEC onditional Random FieldCONDITIONGENERATIVEDISCRIMINATIVEyx x1xMXx1xNYy1yNp(y,x)p(y/x)p(Y,X)p(Y/X)CO NDITIONSEQUENCEM achine LearningSrihari18 Generative Classifier: Bayes Given variables x =(x1 ,..,xM ) and class variable y Joint pdf is p(x,y) Called Generative model since we can generate more samples artificially Given a full joint pdf we can Marginalize Condition By conditioning the joint pdf we form a classifier Computational problem: If x is binary then we need 2M values If 100 samples are needed to estimate a given probability, M=10, and there are two classes then we need 2048 samplesx()(x, )pyp y= (x, )(|x)(x)pypyp= Machine LearningSrihari19Na ve Bayes Classifier Goal is to predict single class variable y given a vector of features x=(x1.)
6 ,xM ) Assume that once class labels are known the features are independent Joint probability model has the form Need to estimate only M probabilities Factor graph obtained by defining factors (y)=p(y), m (y,xm )=p(xm ,y)1(,x)()( | )Mmmpypypx y== Machine LearningSrihari20 Discriminative Classifier: Logistic Regression Feature vector x Two-class classification: class variable y has values C1 and C2 A posteriori probability p(C1 |x) written asp(C1 |x) =f(x) = (wTx) where It is known as logistic regression in statistics Although it is a model for classification rather than for regression1()1exp( )aa =+ a (a)Properties:A. Symmetry (-a)=1- (a)B. Inversea=ln( /1- )known as known as log odds since it is the ratioln[p(C1 |x)/p(C2 |x)]C. Derivatived /da= (1- )Logistic SigmoidMachine LearningSrihari21 Logistic Regression versus Generative Bayes Classifier Posterior probability of class variable y is In a Generative model we estimate the class- conditionals (which are used to determine a) In the Discriminative approach we directly estimate a as a linear function of x , a = wTx111112 21122(x |) ( )(|x)(x |) ( )(x |) ()(x |) ( )1 =( ) where ln1 exp()(x |) ()pCpCpCpC pCpC pCpCpCaaapCpC =+==+ Machine LearningSrihari22 Logistic Regression Parameters For M-dimensional feature space logistic regression has M parameters w=(w1.
7 ,wM ) By contrast, Generative approach by fitting Gaussian class-conditional densities will result in 2M parameters for means, M(M+1)/2 parameters for shared covariance matrix, and one for class prior p(C1 ) Which can be reduced to O(M) parameters by assuming independence via Na ve BayesMachine LearningSrihari23 Multi-class Logistic Regression Case of K>2 classes Known as normalized exponentialwhere ak =ln p(x|Ck )p(Ck ) Normalized exponential also known as softmax since if ak >>aj then p(Ck |x)=1 and p(Cj |x)=0 In logistic regression we assume activations given by ak =wkTx(|)( )(|x)(|)( )exp( ) =exp( )kkkjjjkjjpx C pCpCpxC pCaa= Machine LearningSrihari24 Graphical Model for Logistic Regression Multiclass logistic regression can be written as Rather than using one weight per class we can define feature functions that are nonzero only for a single class This notation mirrors the usual notation for CRFs111( | x)exp where(x)(x) = expKyyjjjKyyjjyjpyxZZx == =+ + 11(|x)exp(,x)(x)Kkkkpyf yZ = = Machine LearningSrihari254.
8 Sequence Models Classifiers predict only a single class variable Graphical Models are best to model many variables that are interdependent Given sequence of observations X={xn }n=1N Underlying sequence of states Y={yn }n=1 NMachine LearningSrihari26 Generative Model: HMM X is observed data sequence to be labeled, Y is the random variable over the label sequences HMM is a distribution that Models p(Y, X) Joint distribution is Highly structured network indicates conditional independences, past states independent of future states Conditional independence of observed given its ,(|)(|)Nnnnnnp() pyypy == YXxMachine LearningSrihari27 Discriminative Model for Sequential Data CRF Models the conditional distribution p(Y/X) CRF is a random field globally conditioned on the observation X The conditional distribution p(Y|X) that follows from the joint distribution p(Y,X) can be rewritten as a Markov Random Fieldy1y2ynyNXMachine LearningSrihari28 Markov Random Field (MRF) Also called undirected graphical model Joint distribution of set of variables x is defined by an undirected graph aswhere C is a maximal clique (each node connected to every other node)
9 ,xC is the set of variables in that clique, C is a potential function (or local or compatibility function)such that C (xC ) > 0, typically C (xC ) = exp{-E(xC )}, andis the partition function for normalization Model refers to a family of distributions and Field refers to a specific one1(x)(x )CCCpZ = x(x )CCCZ = Machine LearningSrihari29 MRF with Input-Output Variables X is a set of input variables that are observed Element of X is denoted x Y is a set of output variables that we predict Element of Y is denoted y A are subsets of X U Y Elements of A that are in A ^ X are denoted xA Element of A that are in A ^ Y are denoted yA Then undirected graphical model has the formx,y1(x,y)(x , y ) where Z=(x , y )AAAAAAAApZ= Machine LearningSrihari30 MRF Local Function Assume each local function has the formwhere A is a parameter vector, fA are feature functions and m=1,..M are feature subscripts(x , y ) exp(x , y ) AAAAmAmAAmf = Machine LearningSrihari31 From HMM to CRF In an HMM Can be rewritten as Further rewritten as Which gives us Note that Z cancels out11,(|)(|)Nnnnnnp() pyypy == YXx1{}{ }{}{ },1(, )exp1 11 1nnnnij yi y joi yi xonijSn iSoOpZ ==== =+ YX111() exp(,,)Mmm n nnmpfyyZ = = Y, Xx11'1'1exp( ,,)(,)()(',)exp( ,,)Mmm n nnmMymm n nnymfyypyxppy xfyy = = == xY|XxIndicator function:1{x = x } takes value 1when x=x and 0 otherwiseParameters of the distribution: ={ ij , oi }Feature Functions have the form fm (yn ,yn-1 ,xn ).
10 Need one feature for eachstate transition (i,j)fij (y,y ,x)=1{y=i} 1{y =j} and one for each state- observation pairfio (y,y ,x)=1{y=i} 1{x=o} Machine LearningSrihari32 CRF definition A linear chain CRF is a distribution p(Y|X) that takes the form Where Z(X) is an instance specific normalization function111(|)exp(, , )(X)Mmm n nnmpfyyZ = = YXx11(X)exp( ,,)Mmm n nnymZfyy = = xMachine LearningSrihari33 Functional ModelsNa ve Bayes ClassifierLogistic RegressionHidden Markov ModelConditional Random FieldGENERATIVEDISCRIMINATIVEyxx1xMXx1xN Yy1yN1(|)Mmmp(y, )p(y)p xy== x1'1exp( , )(|)exp( ', )MmmmMmmymfypyfy == = xxx11,(|)(|)Nnnnnnp() pyypy == YXx111'1exp( ,,)(|)exp(',',)Mmm n nnmMmm nnnmfyypfy y = = = yxYXxynxnMachine LearningSrihari34 NLP: Part Of Speech Taggingw = The quick brown fox jumped over the lazy dogs = DET VERB ADJ NOUN-S VERB-P PREP DET ADJ NOUN-SFor a sequence of words w = {w1 ,w2.}