
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift


Transcription of Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

arXiv:1502.03167v3 [cs.LG] 2 Mar 2015

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe (Google) and Christian Szegedy (Google)

Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.

Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1 Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters \Theta of the network, so as to minimize the loss

\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)

where x_1 ... x_N is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch x_1 ... x_m of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing

\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}.
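To make the update above concrete, here is a minimal NumPy sketch of one SGD step on a mini-batch. The quadratic loss, the gradient function, and the learning rate value are illustrative assumptions, not something specified in the paper.

    import numpy as np

    # Illustrative least-squares loss l(x_i, theta) = 0.5 * (theta^T x_i - y_i)^2,
    # whose per-example gradient is (theta^T x_i - y_i) * x_i.
    def minibatch_gradient(theta, X, y):
        # X: (m, d) mini-batch of inputs, y: (m,) targets
        residual = X @ theta - y            # (m,)
        return X.T @ residual / len(y)      # (1/m) * sum_i d l(x_i, theta) / d theta

    rng = np.random.default_rng(0)
    theta = rng.normal(size=3)              # model parameters Theta
    alpha = 0.1                             # learning rate (assumed value)

    X_batch = rng.normal(size=(32, 3))      # mini-batch of size m = 32
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5])

    # One SGD step: Theta <- Theta - alpha * (1/m) * sum_i d l(x_i, Theta) / d Theta
    theta -= alpha * minibatch_gradient(theta, X_batch, y_batch)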

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms.
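The claim that the mini-batch gradient becomes a better estimate of the full-training-set gradient as m grows can be checked empirically. The sketch below uses a toy least-squares problem with assumed sizes (N, d, and the batch sizes are arbitrary choices, not from the paper) and compares how far mini-batch gradients fall from the full gradient.

    import numpy as np

    rng = np.random.default_rng(1)
    N, d = 10000, 5
    X = rng.normal(size=(N, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
    theta = rng.normal(size=d)

    def grad(Xb, yb):
        # Gradient of the mean squared-error loss over the given batch.
        return Xb.T @ (Xb @ theta - yb) / len(yb)

    full = grad(X, y)  # gradient over the whole training set

    for m in (4, 256):
        errs = []
        for _ in range(200):
            idx = rng.choice(N, size=m, replace=False)
            errs.append(np.linalg.norm(grad(X[idx], y[idx]) - full))
        # Larger mini-batches give gradient estimates closer to the full gradient.
        print(f"batch size {m:4d}: mean deviation {np.mean(errs):.3f}")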

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008).

However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

\ell = F_2(F_1(u, \Theta_1), \Theta_2)

where F_1 and F_2 are arbitrary transformations, and the parameters \Theta_1, \Theta_2 are to be learned so as to minimize the loss \ell. Learning \Theta_2 can be viewed as if the inputs x = F_1(u, \Theta_1) are fed into the sub-network

\ell = F_2(x, \Theta_2).

For example, a gradient descent step

\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}

(for batch size m and learning rate \alpha) is exactly equivalent to that for a stand-alone network F_2 with input x.
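Since \Theta_2 enters the loss only through F_2, the two gradient computations coincide. The short numerical check below makes this explicit; F_1 and F_2 are chosen arbitrarily for illustration (they are not from the paper), and finite differences stand in for backpropagation.

    import numpy as np

    rng = np.random.default_rng(2)
    u = rng.normal(size=4)
    theta1 = rng.normal(size=(3, 4))            # parameters Theta_1 of F1
    theta2 = rng.normal(size=3)                 # parameters Theta_2 of F2

    F1 = lambda u, t1: np.tanh(t1 @ u)          # arbitrary transformation F1(u, Theta_1)
    F2 = lambda x, t2: 0.5 * (t2 @ x) ** 2      # arbitrary scalar loss F2(x, Theta_2)

    def grad_wrt_theta2(f, t2, eps=1e-6):
        # Central finite differences of the scalar function f with respect to t2.
        g = np.zeros_like(t2)
        for i in range(len(t2)):
            e = np.zeros_like(t2)
            e[i] = eps
            g[i] = (f(t2 + e) - f(t2 - e)) / (2 * eps)
        return g

    x = F1(u, theta1)                           # the sub-network input x = F1(u, Theta_1)
    g_composed = grad_wrt_theta2(lambda t2: F2(F1(u, theta1), t2), theta2)
    g_standalone = grad_wrt_theta2(lambda t2: F2(x, t2), theta2)

    # The gradients agree, so a gradient step on Theta_2 is the same whether F2 is
    # viewed as part of the composed network or as a stand-alone network fed with x.
    assert np.allclose(g_composed, g_standalone)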

Therefore, the input distribution properties that make training more efficient, such as having the same distribution between the training and test data, apply to training the sub-network as well. As such, it is advantageous for the distribution of x to remain fixed over time. Then, \Theta_2 does not have to readjust to compensate for the change in the distribution of x.

A fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network as well. Consider a layer with a sigmoid activation function z = g(Wu + b), where u is the layer input, the weight matrix W and bias vector b are the layer parameters to be learned, and g(x) = \frac{1}{1 + \exp(-x)}.

As |x| increases, g'(x) tends to zero. This means that for all dimensions of x = Wu + b except those with small absolute values, the gradient flowing down to u will vanish and the model will train slowly. However, since x is affected by W, b and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of x into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
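A quick way to see the saturation effect described above is to evaluate the sigmoid derivative g'(x) = g(x)(1 - g(x)) at a few magnitudes of x. The sample points below are arbitrary and only illustrative.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # g'(x) = g(x) * (1 - g(x)) for g(x) = 1 / (1 + exp(-x))
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in (0.0, 2.0, 5.0, 10.0):
        # As |x| grows, g'(x) tends to zero, so gradients flowing through a
        # saturated sigmoid unit vanish and the layers below train slowly.
        print(f"x = {x:5.1f}   g'(x) = {sigmoid_grad(x):.6f}")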

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers the promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets.

It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, Batch Normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes. In Sec.
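As a rough illustration of the normalization step described above, the sketch below standardizes each activation dimension using the mean and variance of the current mini-batch. The epsilon term and the learnable scale gamma and shift beta come from the full algorithm given later in the paper, and the shapes and values chosen here are assumptions for illustration only.

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # x: (m, d) mini-batch of layer inputs; gamma, beta: (d,) learned scale and shift.
        mu = x.mean(axis=0)                       # per-dimension mini-batch mean
        var = x.var(axis=0)                       # per-dimension mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)     # normalized to roughly zero mean, unit variance
        # Scale and shift so the inserted transformation can still represent the identity.
        return gamma * x_hat + beta

    rng = np.random.default_rng(3)
    x = 5.0 + 3.0 * rng.normal(size=(32, 4))      # a mini-batch with shifted, scaled inputs
    y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0), y.std(axis=0))          # approximately 0 and 1 per dimension

In a network, such a step would typically sit between a layer's affine transform Wu + b and a saturating nonlinearity such as the sigmoid above, which is how it keeps the nonlinearity inputs away from the saturated regime; the exact placement and the handling of gamma and beta at training and inference time are specified later in the paper.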

