
Deep Learning using Linear Support Vector Machines


Yichuan Tang, tang@cs.toronto.edu, Department of Computer Science, University of Toronto. Toronto, Ontario, Canada.


Abstract

Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these deep learning models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on the popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

Introduction

Deep learning using neural networks has claimed state-of-the-art performance in a wide range of tasks. These include (but are not limited to) speech (Mohamed et al., 2009; Dahl et al., 2010) and vision (Jarrett et al., 2009; Ciresan et al., 2011; Rifai et al., 2011a; Krizhevsky et al., 2012).

All of the above mentioned papers use the softmax activation function (also known as multinomial logistic regression) for classification. The support vector machine (SVM) is a widely used alternative to softmax for classification (Boser et al., 1992). Using SVMs (especially linear SVMs) in combination with convolutional nets has been proposed in the past as part of a multistage process. In particular, a deep convolutional net is first trained using supervised/unsupervised objectives to learn good invariant hidden latent representations. The corresponding hidden variables of data samples are then treated as inputs and fed into linear (or kernel) SVMs (Huang & LeCun, 2006; Lee et al., 2009; Quoc et al., 2010; Coates et al., 2011).

This technique usually improves performance, but the drawback is that the lower level features are not fine-tuned with respect to the SVM's objective. Other papers have also proposed similar models, but with joint training of the weights at the lower layers, using both standard neural nets as well as convolutional neural nets (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012). In other related works, Weston et al. (2008) proposed a semi-supervised embedding algorithm for deep learning where the hinge loss is combined with the contrastive loss from siamese networks (Hadsell et al., 2006). Lower layer weights are learned using stochastic gradient descent. Vinyals et al. (2012) learns a recursive representation using linear SVMs at every layer, but without joint fine-tuning of the hidden representation. In this paper, we show that for some deep architectures, a linear SVM top layer instead of a softmax is beneficial.

We optimize the primal problem of the SVM and the gradients can be backpropagated to learn lower level features. Our models are essentially the same as the ones proposed in (Zhong & Ghosh, 2000; Nagi et al., 2012), with the minor novelty of using the loss from the L2-SVM instead of the standard hinge loss. Compared to nets using a top layer softmax, we demonstrate superior performance on MNIST, CIFAR-10, and on a recent Kaggle competition on recognizing face expressions. Optimization is done using stochastic gradient descent on small minibatches. Comparing the two models, we believe the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.

The Softmax

For classification problems using deep learning techniques, it is standard to use the softmax or 1-of-K encoding at the top. For example, given 10 possible classes, the softmax layer has 10 nodes denoted by $p_i$, where $i = 1, \dots, 10$. The $p_i$ specify a discrete probability distribution, therefore $\sum_{i=1}^{10} p_i = 1$.

Let $h$ be the activation of the penultimate layer nodes and let $W$ be the weight connecting the penultimate layer to the softmax layer. The total input into the softmax layer, given by $a$, is

$$a_i = \sum_k h_k W_{ki} \qquad (1)$$

and then we have

$$p_i = \frac{\exp(a_i)}{\sum_{j=1}^{10} \exp(a_j)} \qquad (2)$$

The predicted class $\hat{i}$ would be

$$\hat{i} = \arg\max_i p_i = \arg\max_i a_i \qquad (3)$$
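As a quick illustration, the following is a minimal NumPy sketch (not the paper's C++/CUDA implementation) of Eqs. 1-3: the penultimate activation $h$ is projected through $W$ to class scores, the scores are turned into probabilities, and the class with the largest score is predicted. The shapes and example sizes are assumptions for illustration only.

```python
import numpy as np

def softmax_predict(h, W):
    """h: (D,) penultimate activation; W: (D, K) weights into the softmax layer."""
    a = h @ W                          # Eq. 1: a_i = sum_k h_k * W_ki
    a = a - a.max()                    # numerical stabilization (does not change p)
    p = np.exp(a) / np.exp(a).sum()    # Eq. 2: softmax probabilities
    return p, int(np.argmax(p))        # Eq. 3: argmax_i p_i == argmax_i a_i

# Hypothetical sizes: D = 5 penultimate units, K = 10 classes.
rng = np.random.default_rng(0)
p, predicted = softmax_predict(rng.normal(size=5), rng.normal(size=(5, 10)))
```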

Support Vector Machines

Linear support vector machines (SVMs) were originally formulated for binary classification. Given training data and its corresponding labels $(x_n, t_n)$, $n = 1, \dots, N$, $x_n \in \mathbb{R}^D$, $t_n \in \{-1, +1\}$, SVM learning consists of the following constrained optimization:

$$\min_{w, \xi_n} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \qquad (4)$$
$$\text{s.t.} \quad w^T x_n t_n \geq 1 - \xi_n \quad \forall n, \qquad \xi_n \geq 0 \quad \forall n$$

The $\xi_n$ are slack variables which penalize data points that violate the margin requirements. Note that we can include the bias by augmenting all data vectors $x_n$ with a scalar value of 1. The corresponding unconstrained optimization problem is the following:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0) \qquad (5)$$

The objective of Eq. 5 is known as the primal form of the L1-SVM, with the standard hinge loss. Since the L1-SVM is not differentiable, a popular variation known as the L2-SVM minimizes the squared hinge loss instead:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0)^2 \qquad (6)$$

The L2-SVM is differentiable and imposes a bigger (quadratic vs. linear) loss on points which violate the margin. To predict the class label of a test point $x$:

$$\arg\max_t \; (w^T x)\, t \qquad (7)$$

For kernel SVMs, optimization must be performed in the dual. However, scalability is a problem with kernel SVMs, and in this paper we will only be using linear SVMs with standard deep learning models.
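The following is a minimal NumPy sketch, under assumed shapes, of the two primal objectives (Eqs. 5-6) for a single binary linear SVM and the prediction rule of Eq. 7; it is meant only to make the loss functions concrete, not to reproduce the paper's training code.

```python
import numpy as np

def l1_svm_objective(w, X, t, C):
    """Eq. 5: 0.5 * w^T w + C * sum_n max(1 - w^T x_n t_n, 0)."""
    hinge = np.maximum(1.0 - t * (X @ w), 0.0)
    return 0.5 * w @ w + C * hinge.sum()

def l2_svm_objective(w, X, t, C):
    """Eq. 6: same as Eq. 5, but with the squared hinge loss."""
    hinge = np.maximum(1.0 - t * (X @ w), 0.0)
    return 0.5 * w @ w + C * (hinge ** 2).sum()

def svm_predict(w, x):
    """Eq. 7: choose t in {-1, +1} maximizing (w^T x) * t, i.e. the sign of w^T x."""
    return 1 if x @ w >= 0 else -1

# Assumed shapes: X is (N, D) with a constant-1 column appended for the bias,
# t is (N,) with entries in {-1, +1}, and C is the slack penalty.
```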

Multiclass SVMs

The simplest way to extend SVMs to multiclass problems is the so-called one-vs-rest approach (Vapnik, 1995). For K class problems, K linear SVMs are trained independently, where the data from the other classes form the negative cases. Hsu & Lin (2002) discuss other alternative multiclass SVM approaches, but we leave those to future work. Denoting the output of the k-th SVM as

$$a_k(x) = w_k^T x \qquad (8)$$

the predicted class is

$$\arg\max_k a_k(x) \qquad (9)$$

Note that prediction using SVMs is exactly the same as using a softmax in Eq. 3. The only difference between softmax and multiclass SVMs is in their objectives, parametrized by all of the weight matrices $W$. The softmax layer minimizes cross-entropy or maximizes the log-likelihood, while SVMs simply try to find the maximum margin between data points of different classes.
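As a small illustration of the one-vs-rest scheme in Eqs. 8-9 (shapes are assumptions, not the paper's code): each of the K linear SVMs scores the input independently, and the largest score wins.

```python
import numpy as np

def one_vs_rest_targets(y, K):
    """Binary targets for the k-th SVM: +1 for class k, -1 for every other class."""
    return np.where(y[:, None] == np.arange(K), 1.0, -1.0)   # shape (N, K)

def one_vs_rest_predict(W, x):
    """W: (D, K), one weight vector per class as columns; x: (D,)."""
    scores = x @ W                  # Eq. 8: a_k(x) = w_k^T x for each class k
    return int(np.argmax(scores))   # Eq. 9: predicted class = argmax_k a_k(x)
```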

Deep Learning with Support Vector Machines

Most deep learning methods for classification using fully connected layers and convolutional layers have used the softmax layer objective to learn the lower level parameters. There are exceptions, notably in papers by (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012), supervised embedding with nonlinear NCA (Salakhutdinov & Hinton, 2007), and semi-supervised deep embedding (Weston et al., 2008). In this paper, we use the L2-SVM's objective to train deep neural nets for classification. Lower layer weights are learned by backpropagating the gradients from the top layer linear SVM. To do this, we need to differentiate the SVM objective with respect to the activation of the penultimate layer. Let the objective in Eq. 5 be $l(w)$, and let the input $x$ be replaced with the penultimate activation $h$; then

$$\frac{\partial l(w)}{\partial h_n} = -C\, t_n\, w \,\mathbb{1}\{1 > w^T h_n t_n\} \qquad (10)$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. Likewise, for the L2-SVM we have

$$\frac{\partial l(w)}{\partial h_n} = -2C\, t_n\, w \,\max(1 - w^T h_n t_n, 0) \qquad (11)$$

From this point on, the backpropagation algorithm is exactly the same as for standard softmax-based deep learning networks.
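The gradients in Eqs. 10-11 are the only piece that changes relative to a softmax top layer. A minimal NumPy sketch is shown below; the shapes are assumptions (w and h_n of length D, t_n a scalar in {-1, +1}), not the paper's implementation.

```python
import numpy as np

def l1_svm_grad_h(w, h_n, t_n, C):
    """Eq. 10: -C * t_n * w if the margin is violated (1 > w^T h_n t_n), else 0."""
    violated = 1.0 > t_n * (w @ h_n)
    return -C * t_n * w if violated else np.zeros_like(w)

def l2_svm_grad_h(w, h_n, t_n, C):
    """Eq. 11: -2C * t_n * w * max(1 - w^T h_n t_n, 0)."""
    slack = max(1.0 - t_n * (w @ h_n), 0.0)
    return -2.0 * C * t_n * w * slack
```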

We found the L2-SVM to be slightly better than the L1-SVM most of the time, and we will use the L2-SVM in the experiments section.

Facial Expression Recognition

This competition/challenge was hosted by the ICML 2013 workshop on representation learning, organized by the LISA at University of Montreal. The contest itself was hosted on Kaggle, with over 120 competing teams during the initial developmental period.

The data consist of 28,709 48x48 images of faces under 7 different types of expression. See Fig. 1 for examples and their corresponding expression categories. The validation and test sets consist of 3,589 images, and this is a classification task.

Solution

We submitted the winning solution. Our private test score is almost 2% higher than that of the 2nd place team.

Due to label noise and other factors such as corrupted data, human performance is roughly estimated to be between 65% and 68% [1].

[1] Personal communication from the competition organizers.

[Figure 1. Training data. Each column consists of faces of the same expression; starting from the leftmost column: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral.]

Our submission consists of a simple convolutional neural network with a linear one-vs-all SVM at the top. Stochastic gradient descent with momentum is used for training, and several models are averaged to slightly improve the generalization capabilities. Data preprocessing consisted of first subtracting the mean value of each image and then setting the image norm to be 100. Each pixel is then standardized by removing its mean and dividing its value by the standard deviation of that pixel, across all training images. Our implementation is in C++ and CUDA, with ports to Matlab using MEX files.
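A rough NumPy sketch of the preprocessing described above (not the authors' C++/CUDA/MEX code; the array shapes and the epsilon guard are assumptions): subtract each image's mean, rescale the image to norm 100, then standardize every pixel location across the training set.

```python
import numpy as np

def preprocess(images, eps=1e-8):
    """images: (N, 48, 48) float array of training faces."""
    X = images.reshape(len(images), -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)                             # subtract each image's mean
    X *= 100.0 / (np.linalg.norm(X, axis=1, keepdims=True) + eps)  # set each image's norm to 100
    pixel_mean = X.mean(axis=0)
    pixel_std = X.std(axis=0) + eps
    X = (X - pixel_mean) / pixel_std                               # standardize each pixel across training images
    return X.reshape(images.shape), pixel_mean, pixel_std
```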

