
Deep Learning using Linear Support Vector Machines


Yichuan Tang, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.

Abstract

Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these deep learning models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

1. Introduction

Deep learning using neural networks has claimed state-of-the-art performances in a wide range of tasks. These include (but are not limited to) speech (Mohamed et al., 2009; Dahl et al., 2010) and vision (Jarrett et al., 2009; Ciresan et al., 2011; Rifai et al., 2011a; Krizhevsky et al., 2012).

All of the above mentioned papers use the softmax activation function (also known as multinomial logistic regression) for classification. The support vector machine is a widely used alternative to softmax for classification (Boser et al., 1992). Using SVMs (especially linear ones) in combination with convolutional nets has been proposed in the past as part of a multistage process. In particular, a deep convolutional net is first trained using supervised or unsupervised objectives to learn good invariant hidden latent representations.

The corresponding hidden variables of data samples are then treated as input and fed into linear (or kernel) SVMs (Huang & LeCun, 2006; Lee et al., 2009; Quoc et al., 2010; Coates et al., 2011). This technique usually improves performance, but the drawback is that the lower level features are not fine-tuned with respect to the SVM's objective. Other papers have proposed similar models with joint training of the weights at lower layers, using both standard neural nets and convolutional neural nets (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012). In other related work, Weston et al. (2008) proposed a semi-supervised embedding algorithm for deep learning where the hinge loss is combined with the contrastive loss from siamese networks (Hadsell et al., 2006), and the lower layer weights are learned using stochastic gradient descent.

Vinyals et al. (2012) learn a recursive representation using linear SVMs at every layer, but without joint fine-tuning of the hidden representation. In this paper, we show that for some deep architectures, a linear SVM top layer instead of a softmax is beneficial. We optimize the primal problem of the SVM, and the gradients can be backpropagated to learn lower level features. Our models are essentially the same as the ones proposed in (Zhong & Ghosh, 2000; Nagi et al., 2012), with the minor novelty of using the loss from the L2-SVM instead of the standard hinge loss. Compared to nets using a top layer softmax, we demonstrate superior performance on MNIST, CIFAR-10, and on a recent Kaggle competition on recognizing face expressions.

Optimization is done using stochastic gradient descent on small minibatches. Comparing the two models, we believe the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.

2. The model

2.1. Softmax

For classification problems using deep learning techniques, it is standard to use the softmax or 1-of-K encoding at the top. For example, given 10 possible classes, the softmax layer has 10 nodes denoted by p_i, where i = 1, ..., 10. The p_i specify a discrete probability distribution, therefore \sum_{i=1}^{10} p_i = 1.

Let h be the activation of the penultimate layer nodes and let W be the weight connecting the penultimate layer to the softmax layer. The total input into the softmax layer, given by a, is

    a_i = \sum_k h_k W_{ki},    (1)

and then we have

    p_i = \frac{\exp(a_i)}{\sum_{j=1}^{10} \exp(a_j)}.    (2)

The predicted class \hat{i} would be

    \hat{i} = \arg\max_i p_i = \arg\max_i a_i.    (3)
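As a concrete illustration of Eqs. 1-3, the following NumPy sketch computes the softmax activations and the predicted class from a penultimate activation vector h and a weight matrix W. The shapes, random values, and variable names are made up for the example and are not taken from the paper.

    import numpy as np

    # Hypothetical penultimate activations: 64 hidden units, 10 classes.
    rng = np.random.default_rng(0)
    h = rng.standard_normal(64)          # penultimate layer activations
    W = rng.standard_normal((64, 10))    # weights into the softmax layer

    a = h @ W                            # Eq. 1: a_i = sum_k h_k W_ki
    p = np.exp(a - a.max())              # subtract max for numerical stability
    p /= p.sum()                         # Eq. 2: softmax probabilities, sum to 1

    # Eq. 3: the argmax of p equals the argmax of a, so normalization
    # does not change the predicted class.
    assert np.argmax(p) == np.argmax(a)
    predicted_class = int(np.argmax(a))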

2.2. Support Vector Machines

Linear support vector machines (SVMs) were originally formulated for binary classification. Given training data and its corresponding labels (x_n, t_n), n = 1, ..., N, x_n \in \mathbb{R}^D, t_n \in \{-1, +1\}, SVM learning consists of the following constrained optimization:

    \min_{w, \xi_n}  \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n    (4)
    s.t.  w^T x_n t_n \geq 1 - \xi_n  \forall n,   \xi_n \geq 0  \forall n.

The \xi_n are slack variables which penalize data points that violate the margin requirements. Note that we can include the bias by augmenting all data vectors x_n with a scalar value of 1. The corresponding unconstrained optimization problem is the following:

    \min_{w}  \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0).    (5)

The objective of Eq. 5 is known as the primal form of the L1-SVM, with the standard hinge loss. Since the L1-SVM is not differentiable, a popular variation known as the L2-SVM minimizes the squared hinge loss instead:

    \min_{w}  \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0)^2.    (6)

The L2-SVM is differentiable and imposes a bigger (quadratic vs. linear) loss on points that violate the margin.
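A minimal NumPy sketch of the two primal objectives in Eqs. 5 and 6, assuming made-up binary data (x_n, t_n) with t_n in {-1, +1} and with the bias absorbed into w by augmenting each x_n with a constant 1, as described above:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 20
    X = rng.standard_normal((N, D))
    X = np.hstack([X, np.ones((N, 1))])   # augment with 1 to absorb the bias
    t = rng.choice([-1.0, 1.0], size=N)   # labels in {-1, +1}
    w = rng.standard_normal(D + 1)
    C = 1.0                               # penalty on margin violations

    margins = 1.0 - (X @ w) * t           # 1 - w^T x_n t_n for every example
    hinge = np.maximum(margins, 0.0)

    l1_svm_obj = 0.5 * w @ w + C * hinge.sum()         # Eq. 5: hinge loss
    l2_svm_obj = 0.5 * w @ w + C * (hinge ** 2).sum()  # Eq. 6: squared hinge

The squared hinge penalizes margin violations quadratically, which is what makes the L2-SVM objective differentiable in w.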

To predict the class label of a test datum x:

    \arg\max_{t} (w^T x) t.    (7)

For kernel SVMs, optimization must be performed in the dual. However, scalability is a problem with kernel SVMs, and in this paper we will only be using linear SVMs with standard deep learning models.

2.3. Multiclass SVMs

The simplest way to extend SVMs to multiclass problems is the so-called one-vs-rest approach (Vapnik, 1995). For K class problems, K linear SVMs are trained independently, where the data from the other classes form the negative cases. Hsu & Lin (2002) discuss other alternative multiclass SVM approaches, but we leave those to future work. Denoting the output of the k-th SVM as

    a_k(x) = w_k^T x,    (8)

the predicted class is

    \arg\max_k a_k(x).    (9)

Note that prediction using SVMs is exactly the same as using a softmax (Eq. 3).
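Continuing the sketch, one-vs-rest prediction with K independently trained linear SVMs reduces to a single matrix product followed by an argmax. The weight matrix below is a made-up stand-in that is assumed to hold the K trained weight vectors as columns.

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 20, 10
    W_svm = rng.standard_normal((D, K))   # column k holds w_k of the k-th SVM
    x = rng.standard_normal(D)            # a test input

    a = x @ W_svm                         # Eq. 8: a_k(x) = w_k^T x
    predicted_class = int(np.argmax(a))   # Eq. 9: pick the largest score

As noted above, this is exactly the prediction rule of the softmax layer in Eq. 3; only the training objective differs.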

The only difference between softmax and multiclass SVMs is in their objectives, parametrized by all of the weight matrices W. The softmax layer minimizes cross-entropy (equivalently, it maximizes the log-likelihood), while SVMs simply try to find the maximum margin between data points of different classes.
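To make that distinction concrete, the following sketch evaluates both objectives on the same top-layer scores a for a minibatch. The one-vs-rest margin loss treats class k as positive (t = +1) for examples of class k and as negative (t = -1) otherwise, i.e. Eq. 6's data term applied per class; the minibatch size, scores, and labels are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 8, 10
    a = rng.standard_normal((N, K))       # top-layer scores for a minibatch
    y = rng.integers(0, K, size=N)        # integer class labels

    # Softmax objective: average cross-entropy (negative log-likelihood).
    p = np.exp(a - a.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    cross_entropy = -np.mean(np.log(p[np.arange(N), y]))

    # One-vs-rest L2-SVM objective (data term only): squared hinge loss where
    # the target is +1 for the true class and -1 for every other class.
    T = -np.ones((N, K))
    T[np.arange(N), y] = 1.0
    sq_hinge = np.maximum(1.0 - a * T, 0.0) ** 2
    margin_loss = sq_hinge.sum(axis=1).mean()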

2.4. Deep Learning with Support Vector Machines

Most deep learning methods for classification using fully connected layers and convolutional layers have used the softmax layer objective to learn the lower level parameters. There are exceptions, notably the papers by (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012), supervised embedding with nonlinear NCA (Salakhutdinov & Hinton, 2007), and semi-supervised deep embedding (Weston et al., 2008). In this paper, we use the L2-SVM's objective to train deep neural nets for classification. Lower layer weights are learned by backpropagating the gradients from the top layer linear SVM. To do this, we need to differentiate the SVM objective with respect to the activation of the penultimate layer. Let the objective in Eq. 5 be l(w), and let the input x be replaced with the penultimate activation h. Then

    \frac{\partial l(w)}{\partial h_n} = -C t_n w \, \mathbb{I}\{1 > w^T h_n t_n\},    (10)

where \mathbb{I}\{\cdot\} is the indicator function.
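A sketch of this backpropagated gradient, assuming a single binary SVM on top for clarity; the minibatch, weights, and targets are made up. The first gradient follows Eq. 10; the second is the corresponding derivative of the squared-hinge objective in Eq. 6, which works out to -2C t_n w max(1 - w^T h_n t_n, 0).

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 8, 64
    H = rng.standard_normal((N, D))       # penultimate activations (minibatch)
    t = rng.choice([-1.0, 1.0], size=N)   # binary targets
    w = rng.standard_normal(D)            # top-layer SVM weights
    C = 1.0

    margin = 1.0 - (H @ w) * t            # 1 - w^T h_n t_n per example

    # Eq. 10 (L1-SVM): the indicator I{1 > w^T h_n t_n} switches the hinge
    # on or off for each example.
    grad_l1 = -C * (margin > 0).astype(float)[:, None] * t[:, None] * w

    # Squared-hinge (L2-SVM) variant from Eq. 6: violating points receive a
    # gradient proportional to how far they violate the margin.
    grad_l2 = -2.0 * C * np.maximum(margin, 0.0)[:, None] * t[:, None] * w

    # grad_l1 / grad_l2 would then be backpropagated through the lower layers.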

