Transcription of arXiv:1505.00387v2 [cs.LG] 3 Nov 2015
Highway Networks

Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
Università della Svizzera italiana (USI)
Scuola universitaria professionale della Svizzera italiana (SUPSI)
Galleria 2, 6928 Manno-Lugano, Switzerland

Abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

Note: A full paper extending this study is available, with additional references, experiments and analysis.

Introduction

Many recent empirical breakthroughs in supervised machine learning have been achieved through the application of deep neural networks.
Network depth (referring to the number of successive computation layers) has played perhaps the most important role in these successes. For instance, the top-5 image classification accuracy on the 1000-class ImageNet dataset has increased from 84% (Krizhevsky et al., 2012) to 95% (Szegedy et al., 2014; Simonyan & Zisserman, 2014) through the use of ensembles of deeper architectures and smaller receptive fields (Ciresan et al., 2011a;b; 2012) in just a few years.

Presented at the Deep Learning Workshop, International Conference on Machine Learning, Lille, France, 2015. Copyright 2015 by the author(s).

On the theoretical side, it is well known that deep networks can represent certain function classes exponentially more efficiently than shallow ones (e.g., the work of Håstad (1987); Håstad & Goldmann (1991) and recently of Montufar et al. (2014)). As argued by Bengio et al. (2013), the use of deep networks can offer both computational and statistical efficiency for complex tasks.

However, training deeper networks is not as straightforward as simply adding layers. Optimization of deep networks has proven to be considerably more difficult, leading to research on initialization schemes (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015), techniques of training networks in multiple stages (Simonyan & Zisserman, 2014; Romero et al., 2014) or with temporary companion loss functions attached to some of the layers (Szegedy et al., 2014; Lee et al., 2015).

In this extended abstract, we present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow, which is inspired by Long Short Term Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995).
Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.

In preliminary experiments, we found that highway networks as deep as 900 layers can be optimized using simple Stochastic Gradient Descent (SGD) with momentum. For up to 100 layers we compare their training behavior to that of traditional networks with normalized initialization (Glorot & Bengio, 2010; He et al., 2015). We show that optimization of highway networks is virtually independent of depth, while for traditional networks it suffers significantly as the number of layers increases.
We also show that architectures comparable to those recently presented by Romero et al. (2014) can be directly trained to obtain similar test set accuracy on the CIFAR-10 dataset without the need for a pre-trained teacher network.

Notation

We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. $\mathbf{0}$ and $\mathbf{1}$ denote vectors of zeros and ones respectively, and $\mathbf{I}$ denotes an identity matrix. The function $\sigma(x)$ is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$, $x \in \mathbb{R}$.

Highway Networks

A plain feedforward neural network typically consists of $L$ layers where the $l$-th layer ($l \in \{1, 2, \ldots, L\}$) applies a non-linear transform $H$ (parameterized by $\mathbf{W_{H,l}}$) on its input $\mathbf{x}_l$ to produce its output $\mathbf{y}_l$.
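Thus, $\mathbf{x}_1$ is the input to the network and $\mathbf{y}_L$ is the network's output. Omitting the layer index and biases for clarity,

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}). \tag{1}$$

$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.

For a highway network, we additionally define two non-linear transforms $T(\mathbf{x}, \mathbf{W_T})$ and $C(\mathbf{x}, \mathbf{W_C})$ such that

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot C(\mathbf{x}, \mathbf{W_C}). \tag{2}$$

We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set $C = 1 - T$, giving

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot (1 - T(\mathbf{x}, \mathbf{W_T})). \tag{3}$$

The dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ must be the same for Equation (3) to be valid. Note that this re-parametrization of the layer transformation is much more flexible than Equation (1). In particular, observe that

$$\mathbf{y} = \begin{cases} \mathbf{x}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{4}$$

Similarly, for the Jacobian of the layer transform,

$$\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{cases} \mathbf{I}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H'(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{5}$$

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i$-th unit computes $y_i = H_i(\mathbf{x})$, a highway network consists of multiple blocks such that the $i$-th block computes a block state $H_i(\mathbf{x})$ and transform gate output $T_i(\mathbf{x})$.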
Finally, it produces the block output $y_i = H_i(\mathbf{x}) \cdot T_i(\mathbf{x}) + x_i \cdot (1 - T_i(\mathbf{x}))$, which is connected to the next layer.

Constructing Highway Networks

As mentioned earlier, Equation (3) requires that the dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ be the same. In cases when it is desirable to change the size of the representation, one can replace $\mathbf{x}$ with $\hat{\mathbf{x}}$ obtained by suitably sub-sampling or zero-padding $\mathbf{x}$. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.

Convolutional highway layers are constructed similarly to fully connected layers.
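As an illustration of Equation (3) and of the dimensionality remark above, the following is a minimal pure-Python sketch of a fully connected highway layer. It is not the authors' implementation. The helper names (`affine`, `plain_layer`, `highway_layer`), the choice of tanh for $H$, the choice of a sigmoid for the transform gate $T$, and all numeric values are assumptions made for this example only.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z)) from the Notation section."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, weights, bias):
    """Return W x + b, with W stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def plain_layer(x, w_h, b_h):
    """Equation (1): y = H(x). Assumption: H is affine followed by tanh."""
    return [math.tanh(a) for a in affine(x, w_h, b_h)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Equation (3): y = H(x) * T(x) + x * (1 - T(x)), element-wise.

    Assumption: the transform gate T is affine followed by a sigmoid.
    """
    h = plain_layer(x, w_h, b_h)                    # block states H_i(x)
    t = [sigmoid(a) for a in affine(x, w_t, b_t)]   # gate outputs T_i(x)
    if not (len(x) == len(h) == len(t)):
        raise ValueError("x, H(x) and T(x) must have the same dimensionality")
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]


# Illustrative values only; none of these numbers come from the paper.
x = [0.5, -1.0, 2.0]
w_h = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.4], [-0.2, 0.1, 0.3]]
b_h = [0.0, 0.1, -0.1]
w_t = [[0.0] * 3 for _ in range(3)]

# Gate saturated near 0: the layer carries its input (first case of Equation 4).
carried = highway_layer(x, w_h, b_h, w_t, [-30.0] * 3)
# Gate saturated near 1: the layer behaves like a plain layer (second case).
transformed = highway_layer(x, w_h, b_h, w_t, [30.0] * 3)

# Changing the representation size: a plain layer maps 3 -> 2 units,
# then highway layers of width 2 are stacked on top.
w_in = [[0.3, -0.1, 0.2], [0.1, 0.4, -0.2]]
hidden = plain_layer(x, w_in, [0.0, 0.0])
w_h2, w_t2 = [[0.2, -0.4], [0.3, 0.1]], [[0.1, 0.0], [0.0, 0.1]]
for _ in range(3):  # weights reused across layers only to keep the sketch short
    hidden = highway_layer(hidden, w_h2, [0.0, 0.0], w_t2, [-1.0, -1.0])

print(carried, transformed, hidden)
```

With the gate bias pushed strongly negative the output should be numerically close to `x`, and with it pushed strongly positive the output should be close to `plain_layer(x, w_h, b_h)`, mirroring the two cases of Equation (4). In a trained network each layer would have its own learned weights; the shared weights in the loop are a shortcut for the sketch.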