Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1

Matthieu [...]
1 Université de Montréal
2 Technion - Israel Institute of Technology
3 Columbia University
4 CIFAR Senior Fellow
* Indicates equal contribution. Ordering determined by coin flip.
17 Mar 2016

Abstract

We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At train-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

Introduction

Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016), and even abstract art (Mordvintsev et al., 2015).

Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphics Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015) and specialized computer hardware (Farabet et al., 2011a;b; Pham et al., 2012; Chen et al., 2014a;b; Esser et al., 2015).

This paper makes the following contributions:

• We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations, at run-time, and when computing the parameter gradients at train-time (see Section 1).

• We conduct two sets of experiments, each implemented on a different framework, namely Torch7 (Collobert et al., 2011) and Theano (Bergstra et al., 2010; Bastien et al., 2012), which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve nearly state-of-the-art results (see Section 2).
• We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce memory consumption (size and number of accesses), and replace most arithmetic operations with bit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.

• Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).
• The code for training and running our BNNs is available on-line (in both the Theano and Torch frameworks).

1. Binarized Neural Networks

In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through the discretization.

1.1. Deterministic vs Stochastic Binarization

When training a BNN, we constrain both the weights and the activations to either +1 or -1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:

\[
x^b = \mathrm{Sign}(x) =
\begin{cases}
+1 & \text{if } x \ge 0, \\
-1 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable.
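As an illustration only, the deterministic binarization of Eq. (1) can be sketched in plain Python. The function name `binarize_deterministic` and the sample inputs are assumptions made for this sketch and do not come from the released Theano or Torch code.

```python
def binarize_deterministic(x):
    """Eq. (1): map each real value to +1 if it is >= 0, otherwise to -1."""
    # An explicit comparison is used so that an input of exactly 0 maps to +1;
    # a sign routine that returns 0 at x == 0 would not match Eq. (1).
    return [1.0 if v >= 0 else -1.0 for v in x]


# Hypothetical real-valued weights or pre-activations.
x = [-1.5, -0.0, 0.0, 0.3, 2.0]
xb = binarize_deterministic(x)
print(xb)
```

Every output is exactly +1 or -1, so each binarized value can in principle be stored in a single bit.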
It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:

\[
x^b =
\begin{cases}
+1 & \text{with probability } p = \sigma(x), \\
-1 & \text{with probability } 1 - p,
\end{cases}
\tag{2}
\]

where $\sigma$ is the hard sigmoid function:

\[
\sigma(x) = \mathrm{clip}\!\left(\frac{x+1}{2}, 0, 1\right) = \max\!\left(0, \min\!\left(1, \frac{x+1}{2}\right)\right).
\tag{3}
\]

The stochastic binarization is more appealing than the sign function, but harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.

1.2. Gradient Computation and Accumulation

Although our BNN training method uses binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1.
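The following plain-Python sketch illustrates two of the ideas introduced so far: the stochastic binarization of Eqs. (2) and (3), and the use of a real-valued accumulator behind a binary weight. It is a toy illustration under stated assumptions, not the authors' Algorithm 1: the function names, the fixed seed, the learning rate `lr`, and the constant gradient `grad` are all hypothetical values chosen for the example.

```python
import random


def hard_sigmoid(x):
    """Eq. (3): clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_deterministic(x):
    """Eq. (1): +1 if x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def binarize_stochastic(x, rng):
    """Eq. (2): +1 with probability hard_sigmoid(x), -1 otherwise."""
    return 1.0 if rng.random() < hard_sigmoid(x) else -1.0


# Part 1: the sample mean of the stochastic binarizer approaches
# 2 * hard_sigmoid(x) - 1, which equals x clipped to [-1, 1].
rng = random.Random(0)  # fixed seed, chosen only for reproducibility
n_samples = 20000
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
sample_means = [
    sum(binarize_stochastic(x, rng) for _ in range(n_samples)) / n_samples
    for x in inputs
]

# Part 2: small updates only add up if they are stored in a real-valued
# accumulator. Powers of two keep the floating-point arithmetic exact.
lr = 2.0 ** -6   # hypothetical learning rate
grad = 1.0       # hypothetical constant gradient pushing the weight down
w_real = 0.25    # real-valued accumulator; it binarizes to +1 at the start
w_naive = 1.0    # binary weight that is re-binarized after every update
flip_step = None
for step in range(1, 101):
    w_real -= lr * grad
    w_naive = binarize_deterministic(w_naive - lr * grad)
    if flip_step is None and binarize_deterministic(w_real) < 0:
        flip_step = step

print(sample_means)
print(flip_step, w_naive)
```

In this toy setting the binarized value of the accumulator flips to -1 at step 17, once the small updates have summed past zero, whereas the weight that is re-binarized after every update never leaves +1 because each individual step is too small to change its sign.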
Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.

Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.

1.3. Propagating Gradients Through Discretization

The derivative of the sign function is zero almost everywhere, making it apparently incompatible with backpropagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons.
They found in their experiments that the fastest training was obtained when using the "straight-through estimator," previously introduced in Hinton (2012)'s lectures.

We follow a similar approach but use the version of the straight-through estimator that takes into account the saturation effect, and uses deterministic rather than stochastic sampling of the bit. Consider the sign function quantization

\[
q = \mathrm{Sign}(r),
\]

and assume that an estimator $g_q$ of the gradient $\partial C / \partial q$ has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of $\partial C / \partial r$ is simply

\[
g_r = g_q \, 1_{|r| \le 1}.
\tag{4}
\]

Note that this preserves the gradient's information and cancels the gradient when $r$ is too large.
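A minimal plain-Python sketch of Eq. (4) may help make the estimator concrete. The names `sign_forward` and `sign_backward_ste`, the sample values of `r`, and the constant upstream gradient are hypothetical choices for this illustration; the released code may organize the computation differently.

```python
def sign_forward(r):
    """Forward pass: q = Sign(r), with Sign(0) = +1 as in Eq. (1)."""
    return [1.0 if v >= 0 else -1.0 for v in r]


def sign_backward_ste(r, g_q):
    """Backward pass, Eq. (4): g_r = g_q * 1_{|r| <= 1}.

    The upstream gradient passes through unchanged where |r| <= 1 and is
    cancelled (set to zero) where r has saturated.
    """
    return [g if abs(v) <= 1.0 else 0.0 for v, g in zip(r, g_q)]


# Hypothetical pre-activations and an upstream gradient estimate g_q.
r = [-1.5, -1.0, -0.2, 0.0, 0.7, 1.0, 3.0]
g_q = [0.5] * len(r)

q = sign_forward(r)
g_r = sign_backward_ste(r, g_q)
print(q)
print(g_r)
```

Only the first and last entries, where |r| exceeds 1, receive a zero gradient; all other entries receive the upstream gradient unchanged, including the boundary values r = -1 and r = 1.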