arXiv:1505.00387v2 [cs.LG] 3 Nov 2015

Highway Networks

Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
Università della Svizzera italiana (USI)
Scuola universitaria professionale della Svizzera italiana (SUPSI)
Galleria 2, 6928 Manno-Lugano, Switzerland

Abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

Note: A full paper extending this study is available, with additional references, experiments and analysis.

Introduction

Many recent empirical breakthroughs in supervised machine learning have been achieved through the application of deep neural networks.

Network depth (referring to the number of successive computation layers) has played perhaps the most important role in these successes. For instance, the top-5 image classification accuracy on the 1000-class ImageNet dataset has increased from ~84% (Krizhevsky et al., 2012) to ~95% (Szegedy et al., 2014; Simonyan & Zisserman, 2014) through the use of ensembles of deeper architectures and smaller receptive fields (Ciresan et al., 2011a;b; 2012) in just a few years.

Presented at the Deep Learning Workshop, International Conference on Machine Learning, Lille, France, 2015. Copyright 2015 by the author(s).

On the theoretical side, it is well known that deep networks can represent certain function classes exponentially more efficiently than shallow ones (e.g. the work of Håstad (1987); Håstad & Goldmann (1991) and recently of Montufar et al. (2014)). As argued by Bengio et al. (2013), the use of deep networks can offer both computational and statistical efficiency for complex tasks.

However, training deeper networks is not as straightforward as simply adding layers. Optimization of deep networks has proven to be considerably more difficult, leading to research on initialization schemes (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015), techniques of training networks in multiple stages (Simonyan & Zisserman, 2014; Romero et al., 2014) or with temporary companion loss functions attached to some of the layers (Szegedy et al., 2014; Lee et al., 2015).

In this extended abstract, we present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995).

Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.

In preliminary experiments, we found that highway networks as deep as 900 layers can be optimized using simple Stochastic Gradient Descent (SGD) with momentum. For up to 100 layers we compare their training behavior to that of traditional networks with normalized initialization (Glorot & Bengio, 2010; He et al., 2015). We show that optimization of highway networks is virtually independent of depth, while for traditional networks it suffers significantly as the number of layers increases.

We also show that architectures comparable to those recently presented by Romero et al. (2014) can be directly trained to obtain similar test set accuracy on the CIFAR-10 dataset without the need for a pre-trained teacher network.

Notation

We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. $\mathbf{0}$ and $\mathbf{1}$ denote vectors of zeros and ones respectively, and $\mathbf{I}$ denotes an identity matrix. The function $\sigma(x)$ is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$, $x \in \mathbb{R}$.

Highway Networks

A plain feedforward neural network typically consists of $L$ layers where the $l$-th layer ($l \in \{1, 2, \ldots, L\}$) applies a non-linear transform $H$ (parameterized by $\mathbf{W_{H,l}}$) on its input $\mathbf{x}_l$ to produce its output $\mathbf{y}_l$.

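Thus, $\mathbf{x}_1$ is the input to the network and $\mathbf{y}_L$ is the network's output. Omitting the layer index and biases for clarity,

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}). \tag{1}$$

$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.

For a highway network, we additionally define two non-linear transforms $T(\mathbf{x}, \mathbf{W_T})$ and $C(\mathbf{x}, \mathbf{W_C})$ such that

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot C(\mathbf{x}, \mathbf{W_C}). \tag{2}$$

We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set $C = 1 - T$, giving

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot (1 - T(\mathbf{x}, \mathbf{W_T})). \tag{3}$$

The dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ must be the same for Equation (3) to be valid. Note that this re-parametrization of the layer transformation is much more flexible than Equation (1). In particular, observe that

$$\mathbf{y} = \begin{cases} \mathbf{x}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{4}$$

Similarly, for the Jacobian of the layer transform,

$$\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{cases} \mathbf{I}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H'(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{5}$$

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i$-th unit computes $y_i = H_i(\mathbf{x})$, a highway network consists of multiple blocks such that the $i$-th block computes a block state $H_i(\mathbf{x})$ and transform gate output $T_i(\mathbf{x})$.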
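Equations (3) and (4) can be illustrated with a small sketch in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`affine`, `highway_layer`), the choice of tanh for $H$ and of the logistic function $\sigma$ for the gate $T$, and all weight values are assumptions made for this example. The text above only says that $H$ is usually an affine transform followed by a non-linearity. Biases, which Equation (1) omits for clarity, are written out here so the gate can be pushed towards 0 or 1.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Return W x + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def highway_layer(x, W_H, b_H, W_T, b_T):
    """Equation (3): y = H(x) * T(x) + x * (1 - T(x)), applied per component.

    Assumption for this sketch: H is tanh(affine) and T is sigmoid(affine).
    """
    h = [math.tanh(a) for a in affine(W_H, b_H, x)]  # block states H_i(x)
    t = [sigmoid(a) for a in affine(W_T, b_T, x)]    # transform gates T_i(x)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]


# Hypothetical example values; x, H(x) and T(x) all have the same size (3).
x = [0.5, -1.0, 2.0]
W_H = [[0.1, 0.2, -0.3], [0.0, 0.4, 0.1], [-0.2, 0.3, 0.5]]
b_H = [0.0, 0.0, 0.0]
W_T = [[0.0] * 3 for _ in range(3)]  # zero weights: the gate is set by its bias

# Very negative gate bias: T is close to 0, so the layer carries x through.
y_carry = highway_layer(x, W_H, b_H, W_T, [-50.0] * 3)

# Very positive gate bias: T is close to 1, so the layer behaves like a plain layer.
y_transform = highway_layer(x, W_H, b_H, W_T, [50.0] * 3)

print(y_carry)
print(y_transform)
```

With the gate saturated near 0 the output matches the input, and with the gate saturated near 1 it matches the plain-layer output $H(\mathbf{x})$, which are the two limiting cases of Equation (4).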

Finally, each block produces the block output $y_i = H_i(\mathbf{x}) \cdot T_i(\mathbf{x}) + x_i \cdot (1 - T_i(\mathbf{x}))$, which is connected to the next layer.

Constructing Highway Networks

As mentioned earlier, Equation (3) requires that the dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ be the same. In cases when it is desirable to change the size of the representation, one can replace $\mathbf{x}$ with $\hat{\mathbf{x}}$ obtained by suitably sub-sampling or zero-padding $\mathbf{x}$. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.

Convolutional highway layers are constructed similarly to fully connected layers.
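The first alternative, replacing $\mathbf{x}$ with a resized $\hat{\mathbf{x}}$ on the carry path, might look like the sketch below. The text does not specify how sub-sampling or zero-padding is carried out, so the helpers here (`zero_pad`, `sub_sample`, `highway_output`) are hypothetical: trailing zeros for padding and evenly spaced entries for sub-sampling are just one plausible reading. The study itself uses the other alternative, a plain layer to change dimensionality.

```python
def zero_pad(x, size):
    """Return x extended with trailing zeros to `size` entries (assumed scheme)."""
    if size < len(x):
        raise ValueError("zero_pad can only grow a vector")
    return list(x) + [0.0] * (size - len(x))


def sub_sample(x, size):
    """Return `size` evenly spaced entries of x (assumed scheme)."""
    if not 0 < size <= len(x):
        raise ValueError("sub_sample needs 0 < size <= len(x)")
    step = len(x) / size
    return [x[int(i * step)] for i in range(size)]


def highway_output(h, t, x_hat):
    """Equation (3) with x replaced by x_hat: y = h * t + x_hat * (1 - t)."""
    if not (len(h) == len(t) == len(x_hat)):
        raise ValueError("h, t and x_hat must have the same size")
    return [h_i * t_i + x_i * (1.0 - t_i)
            for h_i, t_i, x_i in zip(h, t, x_hat)]


# Hypothetical values: a 4-dimensional input feeding a 2-unit highway layer.
x = [1.0, 2.0, 3.0, 4.0]
x_hat = sub_sample(x, 2)   # carry path shrunk to match the layer width
h = [0.5, 0.5]             # block states H_i(x), assumed already computed
t = [0.0, 1.0]             # first block carries, second block transforms
y = highway_output(h, t, x_hat)

print(zero_pad(x, 6))
print(x_hat)
print(y)
```

Only the carry term changes size here; $H$ and $T$ would need weight matrices that map the original input to the new width, which this sketch assumes has already happened.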
