
Lecture 2: The SVM classifier - University of Oxford


C19 Machine Learning, Hilary 2015. A. Zisserman.

Outline: review of linear classifiers (linear separability, the perceptron); the Support Vector Machine (SVM) classifier (wide margin, cost function, slack variables); loss functions revisited; optimization.

Binary classification. Given training data (x_i, y_i) for i = 1, ..., N, with x_i ∈ R^d and y_i ∈ {-1, 1}, learn a classifier f(x) such that

    f(x_i) ≥ 0 if y_i = +1   and   f(x_i) < 0 if y_i = -1,

i.e. y_i f(x_i) > 0 for a correct classification.

Linear separability. Some data sets are linearly separable; others are not.

Linear classifiers. A linear classifier has the form f(x) = w^T x + b. In 2D the discriminant f(x) = 0 is a line: w is the normal to the line and b is the bias; w is known as the weight vector. The classifier predicts one class where f(x) > 0 and the other where f(x) < 0.

In 3D the discriminant is a plane, and in nD it is a hyperplane. For a K-NN classifier it was necessary to 'carry' the training data, whereas for a linear classifier the training data is used to learn w and is then discarded: only (w, b) is needed for classifying new data.
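To illustrate the last point, here is a minimal sketch (not from the lecture) of evaluating a linear classifier with NumPy; the weight vector w, the bias b and the query points are made-up values.

import numpy as np

# Parameters of a linear classifier f(x) = w^T x + b
# (made-up values; in practice they come from training).
w = np.array([2.0, -1.0])   # weight vector, normal to the decision boundary
b = -0.5                    # bias

def predict(X, w, b):
    """Return +1/-1 labels for the rows of X using the sign of w^T x + b."""
    scores = X @ w + b              # f(x) for each point
    return np.where(scores >= 0, 1, -1)

X_new = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
print(predict(X_new, w, b))         # prints [ 1 -1 ]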

The Perceptron classifier. Given linearly separable data x_i labelled into two categories y_i ∈ {-1, 1}, find a weight vector w such that the discriminant function f(x_i) = w^T x_i + b separates the categories for i = 1, ..., N. How can we find this separating hyperplane?

The Perceptron algorithm. Absorb the bias into the weight vector by writing w ← (w, w_0) and x_i ← (x_i, 1), so that the classifier is f(x_i) = w^T x_i. Then:

    Initialize w = 0.
    Cycle through the data points {x_i, y_i}:
        if x_i is misclassified, update w ← w + α y_i x_i.
    Repeat until all the data is correctly classified.

(Figure: in 2D, w before and after an update; the update rotates w towards the misclassified point.) Note that after convergence w = Σ_i α_i x_i, i.e. a linear combination of the training points.

If the data is linearly separable, then the algorithm will converge, but convergence can be slow, and the separating line can end up close to the training data. We would prefer a larger margin, for generalization.

(Figure: perceptron example.) What is the best w?
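A minimal sketch (not from the lecture) of this perceptron loop in NumPy; the toy data and the learning rate α are made up, and the bias is absorbed by appending a 1 to each point.

import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Perceptron: X is (N, d), y is (N,) with entries in {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # x_i <- (x_i, 1)
    w = np.zeros(Xb.shape[1])                        # initialize w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                   # x_i is misclassified
                w += alpha * yi * xi                 # perceptron update
                mistakes += 1
        if mistakes == 0:                            # all points classified correctly
            break
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))   # last entry of w is the learned bias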

The maximum margin solution is the most stable under perturbations of the inputs.

Support Vector Machine. (Figure: linearly separable data, the separating plane w^T x + b = 0, the support vectors, and the classifier written as a sum over the support vectors, f(x) = Σ_i α_i y_i (x_i^T x) + b.)

SVM sketch derivation. Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w. Choose the normalization such that w^T x_+ + b = +1 and w^T x_- + b = -1 for the positive and negative support vectors respectively. Then the margin is given by

    (w / ||w||) . (x_+ - x_-) = w^T (x_+ - x_-) / ||w|| = 2 / ||w||.

(Figure: the planes w^T x + b = -1, 0, +1 and the margin 2/||w||.)

SVM optimization. Learning the SVM can be formulated as an optimization:

    max_w 2/||w||   subject to   w^T x_i + b ≥ +1 if y_i = +1,  and  w^T x_i + b ≤ -1 if y_i = -1,  for i = 1, ..., N.

Or equivalently:

    min_w ||w||^2   subject to   y_i (w^T x_i + b) ≥ 1  for i = 1, ..., N.

This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum.

Linear separability again: what is the best w?
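A minimal sketch (not from the lecture) of this hard-margin formulation solved as a small constrained optimization with SciPy. The toy data is made up, and in practice a dedicated quadratic-programming solver would be used; this only illustrates min ||w||^2 subject to y_i (w^T x_i + b) ≥ 1.

import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2D data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return w @ w                        # ||w||^2 (the bias is not penalized)

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0        # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))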

The points can be linearly separated, but only with a very narrow margin; possibly the large-margin solution is better, even though one constraint is violated. In general there is a trade-off between the margin and the number of mistakes on the training data.

Introduce slack variables. (Figure: a misclassified point with slack ξ/||w|| > 2/||w||, a margin-violating point with slack ξ/||w|| < 1/||w||, and points with ξ = 0 on or outside the margin.)
• ξ_i ≥ 0 for every point.
• For 0 < ξ ≤ 1 the point is between the margin and the correct side of the hyperplane: a margin violation.
• For ξ > 1 the point is misclassified.

Soft margin solution. The optimization problem becomes

    min_{w ∈ R^d, ξ_i ∈ R^+}  ||w||^2 + C Σ_i^N ξ_i   subject to   y_i (w^T x_i + b) ≥ 1 - ξ_i  for i = 1, ..., N.

• Every constraint can be satisfied if ξ_i is sufficiently large.
• C is a regularization parameter: a small C allows constraints to be easily ignored (large margin); a large C makes constraints hard to ignore (narrow margin); C = ∞ enforces all constraints (hard margin).
• This is still a quadratic optimization problem and there is a unique minimum.
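A minimal sketch (not from the lecture) of the effect of C, using scikit-learn's SVC with a linear kernel; scikit-learn is not mentioned in the lecture, and the toy data and C values are made up. The margin width 2/||w|| is read off the fitted model.

import numpy as np
from sklearn.svm import SVC

# Toy 2D data: separable, but only with a narrow margin, because one
# negative point sits close to the positive cluster (made up).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [1.4, 1.4]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in [1e6, 10.0, 0.1]:                 # ~hard margin, then softer margins
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)
    errors = np.sum(clf.predict(X) != y)
    print(f"C={C:g}: margin={margin:.2f}, training errors={errors}, "
          f"support vectors={len(clf.support_)}")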

Note that there is now only one free parameter, C. (Figure: a data set that is linearly separable, but only with a narrow margin, fitted with C = ∞, a hard margin, and with C = 10, a soft margin.)

Application: pedestrian detection in computer vision.

Objective: detect (localize) standing humans in an image (cf. face detection with a sliding-window classifier). This reduces object detection to binary classification: does an image window contain a person or not?

Method: the HOG detector (Dalal and Triggs, CVPR 2005).

Training data and features:
• Positive data: 1208 positive window examples.
• Negative data: 1218 negative window examples (initially).
• Feature: histogram of oriented gradients (HOG). The window is tiled into 8 x 8 pixel cells, and each cell is represented by a HOG (a histogram of gradient orientations, peaked at the dominant direction). Feature vector dimension = 16 x 8 (tiling) x 8 (orientations) = 1024.

(Figure: averaged positive examples.)

Algorithm:
• Training (learning): represent each example window by a HOG feature vector x_i ∈ R^d, with d = 1024, and train an SVM classifier.
• Testing (detection): apply the sliding-window classifier f(x) = w^T x + b.

(Figures: the learned model; slides from Deva Ramanan.)

Optimization. Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

    min_{w ∈ R^d, ξ_i ∈ R^+}  ||w||^2 + C Σ_i^N ξ_i   subject to   y_i (w^T x_i + b) ≥ 1 - ξ_i  for i = 1, ..., N.

The constraint y_i (w^T x_i + b) ≥ 1 - ξ_i can be written more concisely as y_i f(x_i) ≥ 1 - ξ_i, which, together with ξ_i ≥ 0, is equivalent to

    ξ_i = max(0, 1 - y_i f(x_i)).
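A minimal numeric illustration (not from the lecture) of this equivalence, with made-up scores: for a fixed classifier, the smallest feasible slack for each point is exactly the hinge value max(0, 1 - y_i f(x_i)), and its size says whether the point is outside the margin, a margin violation, or misclassified.

import numpy as np

# Made-up classifier outputs f(x_i) and labels y_i for six points.
f = np.array([2.3, 0.4, -0.2,  1.0, -1.5,  0.7])
y = np.array([  1,   1,    1,   -1,   -1,   -1])

margins = y * f                            # y_i f(x_i)
slack = np.maximum(0.0, 1.0 - margins)     # xi_i = max(0, 1 - y_i f(x_i))

for m, s in zip(margins, slack):
    if s == 0:
        status = "on or outside the margin (no loss)"
    elif s <= 1:
        status = "margin violation"
    else:
        status = "misclassified"
    print(f"y*f = {m:5.2f}   slack = {s:4.2f}   {status}")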

Hence the learning problem is equivalent to the unconstrained optimization problem over w:

    min_{w ∈ R^d}  ||w||^2 + C Σ_i^N max(0, 1 - y_i f(x_i)),

i.e. a regularization term plus a loss function.

Loss function. Points fall into three categories:
• y_i f(x_i) > 1: the point is outside the margin and makes no contribution to the loss.
• y_i f(x_i) = 1: the point is on the margin and makes no contribution to the loss, as in the hard-margin case.
• y_i f(x_i) < 1: the point violates the margin constraint and contributes to the loss.

Loss functions. The SVM uses the hinge loss max(0, 1 - y_i f(x_i)), an approximation to the 0-1 loss.

Optimization continued. Consider

    min_{w ∈ R^d}  C Σ_i^N max(0, 1 - y_i f(x_i)) + ||w||^2.

Does this cost function have a unique solution? Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)? If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

Convex functions. (Figure: examples of convex and non-convex functions.) A non-negative sum of convex functions is convex, and the SVM cost is the sum of a convex loss and a convex regularizer, so it is convex.

Gradient (or steepest) descent algorithm for the SVM. First, rewrite the optimization problem as an average:

    min_w C(w) = (λ/2) ||w||^2 + (1/N) Σ_i^N max(0, 1 - y_i f(x_i))
               = (1/N) Σ_i^N [ (λ/2) ||w||^2 + max(0, 1 - y_i f(x_i)) ],

with λ = 2/(NC), up to an overall scale of the problem, and f(x) = w^T x + b. Because the hinge loss is not differentiable, a sub-gradient is computed.

To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t - η_t ∇_w C(w_t),

where η is the learning rate. The sub-gradient of the hinge loss L(x_i, y_i; w) = max(0, 1 - y_i f(x_i)), with f(x_i) = w^T x_i + b, is

    ∂L/∂w = -y_i x_i   if y_i f(x_i) < 1,
    ∂L/∂w = 0          otherwise.

Sub-gradient descent algorithm for the SVM:

    C(w) = (1/N) Σ_i^N [ (λ/2) ||w||^2 + L(x_i, y_i; w) ].
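A minimal sketch (not from the lecture) of this batch sub-gradient descent in NumPy. The toy data, λ and learning rate are made up, and the bias is folded into w by appending a 1 to each point, so it is regularized along with w as in the averaged formulation above.

import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, eta=0.1, n_iters=200):
    """Minimize (lam/2)||w||^2 + (1/N) sum_i max(0, 1 - y_i w^T x_i)."""
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])      # absorb the bias into w
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        margins = y * (Xb @ w)
        active = margins < 1                  # points whose hinge loss is non-zero
        # Sub-gradient of the cost: lam*w - (1/N) * sum over active points of y_i x_i.
        grad = lam * w - (y[active][:, None] * Xb[active]).sum(axis=0) / N
        w = w - eta * grad
    return w

# Toy data (made up).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(svm_subgradient_descent(X, y))   # last entry of w is the bias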

The iterative update is

    w_{t+1} ← w_t - η ∇_w C(w_t) = w_t - η (1/N) Σ_i^N ( λ w_t + ∂L(x_i, y_i; w_t)/∂w ),

where η is the learning rate. Each iteration t then involves cycling through the training data with the updates:

    w_{t+1} ← w_t - η (λ w_t - y_i x_i)   if y_i f(x_i) < 1,
    w_{t+1} ← w_t - η λ w_t               otherwise.

In the Pegasos algorithm the learning rate is set at η_t = 1/(λ t).

Pegasos: stochastic gradient descent algorithm. Randomly sample points from the training data instead of cycling through them. (Figures: the energy decreasing over the iterations, and the resulting classifier on a 2D example.) A minimal sketch of this update appears at the end of these notes.

Background reading and more:
• Next lecture: see that the SVM can be expressed as a sum over the support vectors, f(x) = Σ_i α_i y_i (x_i^T x) + b.
• On the web page (~az/lectures/ml): links to SVM tutorials and video lectures, and a MATLAB SVM demo.
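As referenced above, a minimal sketch (not from the lecture) of the Pegasos-style stochastic sub-gradient update, with made-up toy data and λ; the bias is again absorbed into w by appending a 1 to each point.

import numpy as np

def pegasos(X, y, lam=0.1, n_iters=1000, seed=0):
    """Pegasos-style SGD: at step t, sample one (x_i, y_i) and update w
    with learning rate eta_t = 1 / (lam * t)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])        # absorb the bias into w
    w = np.zeros(Xb.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(N)                     # pick a random training point
        eta = 1.0 / (lam * t)                   # eta_t = 1 / (lam * t)
        if y[i] * (Xb[i] @ w) < 1:              # hinge loss is active for this point
            w = w - eta * (lam * w - y[i] * Xb[i])
        else:
            w = w - eta * lam * w
    return w

# Toy data (made up).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(pegasos(X, y))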

