Transcription of INTRODUCTION TO Machine Learning - Computer Science
1 INTRODUCTION TOMachine LearningETHEM ALPAYDIN The MIT Press, Slides forCHAPTER 10:Linear DiscriminationLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )3 Likelihood- vs. Discriminant-based Classification Likelihood-based:Assume a model for p(x|Ci), use Bayes rule to calculate P(Ci|x) gi(x) = log P(Ci|x) Discriminant-based:Assume a model for gi(x| i); no density estimation Estimating the boundaries is enough; no need to accurately estimate the densities inside the boundariesLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )4 Linear Discriminant Linear discriminant: Advantages: Simple: O(d) space/computation Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring) Optimal when p(x|Ci) are Gaussian with shared cov matrix.
2 Useful when classes are (almost) linearly separableLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )5 Generalized Linear Model Quadratic discriminant: Higher-order (product) terms:Map from xto zusing nonlinear basis functions and use a linear discriminant in z-spaceLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )6 Two ClassesLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )7 GeometryLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )8 Multiple ClassesClasses arelinearly separableLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )
3 9 Pairwise SeparationLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )10 From Discriminants to Posteriors When p (x| Ci ) ~ N( i, )Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )11 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )12 Sigmoid (Logistic) FunctionLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )13 Gradient-Descent E(w|X) is error with parameters won sample X Gradient Gradient-descent: Starts from random wand updates witeratively in the negative direction of gradientLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )14 Gradient-Descentwtwt+1 E (wt)E (wt+1)Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )15 Logistic Discrimination Two classes.
4 Assume log likelihood ratio is linearLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )16 Training: Two ClassesLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )17 Training: Gradient-DescentLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )18 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )19101001000 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )20K>2 ClassessoftmaxLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )21 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )
5 22 ExampleLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )23 Generalizing the Linear Model Quadratic: Sum of basis functions:where (x)are basis functions Kernels in SVM Hidden units in neural networksLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )24 Discrimination by Regression Classes are NOT mutually exclusive and exhaustiveLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )25 Optimal Separating Hyperplane(Cortes and Vapnik, 1995.)
6 Vapnik, 1995)Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )26 Margin Distance from the discriminant to the closest instances on either side Distance of x to the hyperplane is We require For a unique sol n, fix and to max marginLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )27 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )28 Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )29 Most tare 0 and only a small number have t>0.
7 They are the support vectorsLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )30 Soft Margin Hyperplane Not linearly separable Soft error New primal isLecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )31 Kernel Machines Preprocess input xby basis functions The SVM solution Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )32 Kernel Functions Polynomials of degree q: Radial-basis functions: Sigmoidal functions:(Cherkassky and Mulier, 1998)Lecture Notes for E Alpayd n 2004 INTRODUCTION to Machine Learning The MIT Press ( )33 SVM for Regression Use a linear model (possibly kernelized) Use the -sensitive error function