
Lipschitz regularity of deep neural networks: analysis and efficient estimation

Kevin Scaman, Huawei Noah's Ark Lab
Aladin Virmaux, Huawei Noah's Ark Lab

Abstract

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two-layer neural networks, the exact computation of this quantity is NP-hard and state-of-the-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function.

We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third, we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations.

1 Introduction

Deep neural networks made a striking entrée in machine learning and quickly became state-of-the-art algorithms in many tasks such as computer vision [1, 2, 3, 4], speech recognition and generation [5, 6] or natural language processing [7, 8].

However, deep neural networks are known for being very sensitive to their input, and adversarial examples provide a good illustration of their lack of robustness [9, 10]. Indeed, a well-chosen small perturbation of the input image can mislead a neural network and significantly decrease its classification accuracy. One metric to assess the robustness of neural networks to small perturbations is the Lipschitz constant (see Definition 1), which upper bounds the relationship between input perturbation and output variation for a given distance. For generative models, the recent Wasserstein GAN [11] improved the training stability of GANs by reformulating the optimization problem as a minimization of the Wasserstein distance between the real and generated distributions [12]. However, this method relies on an efficient way of constraining the Lipschitz constant of the critic, which was only partially addressed in the original paper, and the object of several follow-up works [13, 14].

Recently, Lipschitz continuity was used in order to improve the state-of-the-art in several deep learning topics: (1) for robust learning, avoiding adversarial attacks was achieved in [15] by constraining local Lipschitz constants in neural networks. (2) For generative models, using spectral normalization on each layer allowed [13] to successfully train a GAN on the ILSVRC2012 dataset. (3) In deep learning theory, novel generalization bounds critically rely on the Lipschitz constant of the neural network [16, 17, 18]. To the best of our knowledge, the first upper bound on the Lipschitz constant of a neural network was described in [9], as the product of the spectral norms of linear layers (a special case of our generic algorithm, see Proposition 1).
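To make this classical bound concrete, the following is a minimal sketch (not from the paper) that computes the product of the spectral norms of the linear layers of a PyTorch MLP. It is a valid Lipschitz upper bound only under the assumption that all activations are 1-Lipschitz (as ReLU is), and the model below is an arbitrary toy example.

```python
import torch

def product_of_spectral_norms(model: torch.nn.Module) -> float:
    """Upper bound L(f) <= prod_k ||M_k||_2 for an MLP with 1-Lipschitz activations."""
    bound = 1.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            # ||M_k||_2 is the largest singular value of the weight matrix.
            bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return bound

# Toy 2-layer ReLU MLP (arbitrary sizes, for illustration only).
mlp = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
print(product_of_spectral_norms(mlp))
```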

More recently, the Lipschitz constant of scatter networks was analyzed in [19]. Unfortunately, this analysis does not extend to more general architectures. Our aim in this paper is to provide a rigorous and practice-oriented study on how Lipschitz constants of neural networks and automatically differentiable functions may be estimated. We first precisely define the notion of Lipschitz constant of vector-valued functions in Section 2, and then show in Section 3 that its estimation is, even for 2-layer Multi-Layer-Perceptrons (MLP), NP-hard. In Section 4, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. Moreover, we show how the Lipschitz constant of most neural network layers may be computed efficiently using automatic differentiation algorithms [20] and libraries such as PyTorch [21].

Notably, we extend the power method to convolution layers using automatic differentiation to speed up the computations. In Section 6, we provide a theoretical analysis of AutoLip in the case of sequential neural networks, and show that the upper bound may lose a multiplicative factor per activation layer, which may significantly downgrade the estimation quality of AutoLip and lead to a very large and unrealistic upper bound. In order to prevent this, we propose an improved algorithm called SeqLip in the case of sequential neural networks, and show in Section 7 that SeqLip may significantly improve on AutoLip. Finally, we discuss the different algorithms on the AlexNet [1] neural network for computer vision using the proposed methods.

2 Background and notations

In the following, we denote as $\langle x, y \rangle$ and $\|x\|_2$ the scalar product and $L_2$-norm of the Hilbert space $\mathbb{R}^n$, $x \odot y$ the coordinate-wise product of $x$ and $y$, and $f \circ g$ the composition between the functions $f : \mathbb{R}^k \to \mathbb{R}^m$ and $g : \mathbb{R}^n \to \mathbb{R}^k$. For any differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$ and any point $x \in \mathbb{R}^n$, we will denote as $D_x f \in \mathbb{R}^{m \times n}$ the differential operator of $f$ at $x$, also called the Jacobian matrix. Note that, in the case of real-valued functions ($m = 1$), the gradient of $f$ is the transpose of the differential operator: $\nabla f(x) = (D_x f)^\top$. Finally, $\mathrm{diag}_{n,m}(x) \in \mathbb{R}^{n \times m}$ is the rectangular matrix with $x \in \mathbb{R}^{\min\{n,m\}}$ along the diagonal and $0$ outside of it. When unambiguous, we will use the notation $\mathrm{diag}(x)$ instead of $\mathrm{diag}_{n,m}(x)$. All proofs are available as supplemental material.

Definition 1. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is called Lipschitz continuous if there exists a constant $L$ such that
\[ \forall x, y \in \mathbb{R}^n, \quad \|f(x) - f(y)\|_2 \le L \, \|x - y\|_2 . \]
The smallest $L$ for which the previous inequality is true is called the Lipschitz constant of $f$ and will be denoted $L(f)$.
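As a simple illustration of Definition 1 (a worked example added here, not part of the original text): for an affine map $f(x) = Mx + b$,
\[ \|f(x) - f(y)\|_2 = \|M(x - y)\|_2 \le \|M\|_2 \, \|x - y\|_2 , \]
with equality when $x - y$ is aligned with a top right-singular vector of $M$, so $L(f) = \|M\|_2$. Likewise, the ReLU $\rho(t) = \max\{0, t\}$ satisfies $|\rho(s) - \rho(t)| \le |s - t|$ coordinate-wise, hence $L(\rho) = 1$.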

For locally Lipschitz functions (i.e., functions whose restriction to some neighborhood around any point is Lipschitz), the Lipschitz constant may be computed using its differential operator.

Theorem 1 (Rademacher [22]). If $f : \mathbb{R}^n \to \mathbb{R}^m$ is a locally Lipschitz continuous function, then $f$ is differentiable almost everywhere. Moreover, if $f$ is Lipschitz continuous, then
\[ L(f) = \sup_{x \in \mathbb{R}^n} \|D_x f\|_2 \qquad (1) \]
where $\|M\|_2 = \sup_{\{x : \|x\|_2 = 1\}} \|Mx\|_2$ is the operator norm of the matrix $M \in \mathbb{R}^{m \times n}$.

In particular, if $f$ is real-valued ($m = 1$), its Lipschitz constant is the maximum norm of its gradient, $L(f) = \sup_x \|\nabla f(x)\|_2$, on its domain set. Note that the supremum in Theorem 1 is a slight abuse of notation, since the differential $D_x f$ is defined almost everywhere in $\mathbb{R}^n$, except for a set of Lebesgue measure zero. The code used in this paper is available online.

3 Exact Lipschitz computation is NP-hard

In this section, we show that the exact computation of the Lipschitz constant of neural networks is NP-hard, hence motivating the need for good approximation algorithms. More precisely, upper bounds are in this case more valuable, as they ensure that the variation of the function, when subject to an input perturbation, remains small.
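As an aside (an illustration added here, not the paper's algorithm), the operator norm appearing in (1) can be estimated for a single explicit matrix by a few steps of power iteration on $M^\top M$. A minimal PyTorch sketch, where the comparison with torch.linalg.matrix_norm is only a sanity check:

```python
import torch

def operator_norm(M: torch.Tensor, n_iter: int = 100) -> float:
    """Estimate ||M||_2 (largest singular value) by power iteration on M^T M."""
    v = torch.randn(M.shape[1])
    for _ in range(n_iter):
        v = M.t() @ (M @ v)       # one power-iteration step on M^T M
        v = v / v.norm()          # keep the iterate normalized
    return (M @ v).norm().item()  # ||Mv||_2 with ||v||_2 = 1

M = torch.randn(50, 30)
print(operator_norm(M), torch.linalg.matrix_norm(M, ord=2).item())  # should roughly agree
```

The power method mentioned in the introduction builds on this kind of computation, but applied through automatic differentiation rather than to an explicitly stored matrix.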

A neural network is, in essence, a succession of linear operators and non-linear activation functions. The most simplistic model of neural network is the Multi-Layer-Perceptron (MLP), as defined below.

Definition 2 (MLP). A $K$-layer Multi-Layer-Perceptron $f_{MLP} : \mathbb{R}^n \to \mathbb{R}^m$ is the function
\[ f_{MLP}(x) = T_K \circ \rho_{K-1} \circ \cdots \circ \rho_1 \circ T_1(x), \]
where $T_k : x \mapsto M_k x + b_k$ is an affine function and $\rho_k : x \mapsto (g_k(x_i))_{i \in [1, n_k]}$ is a non-linear activation function.

Most standard deep network architectures follow to some extent the MLP structure. It turns out that, even for 2-layer MLPs, the computation of the Lipschitz constant is NP-hard.

Problem 1 (LIP-CST). LIP-CST is the decision problem associated to the exact computation of the Lipschitz constant of a 2-layer MLP with ReLU activations.
Input: Two matrices $M_1 \in \mathbb{R}^{l \times n}$ and $M_2 \in \mathbb{R}^{m \times l}$, and a constant $\ell \ge 0$.
Question: Let $f = M_2 \circ \rho \circ M_1$ where $\rho(x) = \max\{0, x\}$ is the ReLU activation function. Is the Lipschitz constant $L(f) \le \ell$?

Theorem 2. Problem 1 is NP-hard.
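As a numerical illustration of Problem 1 (toy dimensions chosen arbitrarily here, not taken from the paper): sampling Jacobian operator norms only yields lower bounds on the supremum in (1), whereas the product of spectral norms gives a cheap upper bound. A sketch assuming PyTorch:

```python
import torch

# Toy 2-layer ReLU network f = M2 ∘ ρ ∘ M1 (biases omitted), arbitrary sizes.
M1, M2 = torch.randn(20, 10), torch.randn(5, 20)
f = lambda x: torch.relu(x @ M1.t()) @ M2.t()

# Sampling Jacobian operator norms only lower-bounds L(f) ...
lower = max(
    torch.linalg.matrix_norm(
        torch.autograd.functional.jacobian(f, torch.randn(10)), ord=2
    ).item()
    for _ in range(1000)
)
# ... while the product of spectral norms is always an upper bound.
upper = (torch.linalg.matrix_norm(M1, ord=2) * torch.linalg.matrix_norm(M2, ord=2)).item()
print(lower, "<= L(f) <=", upper)
```

Tightening the gap between such bounds is what AutoLip and SeqLip, described next, aim for.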

Theorem 2 shows that, even for extremely simple neural networks, exact Lipschitz computation is not achievable in polynomial time (assuming that $P \neq NP$). The proof of Theorem 2 is available in the supplemental material and relies on a reduction to the NP-hard problem of quadratic concave minimization on a hypercube, obtained by considering well-chosen matrices $M_1$ and $M_2$.

4 AutoLip: a Lipschitz upper bound through automatic differentiation

Efficient implementations of backpropagation in modern deep learning libraries such as PyTorch [21] or TensorFlow [23] rely on the concept of automatic differentiation [24, 20]. Simply put, automatic differentiation is a principled approach to the computation of gradients and differential operators of functions resulting from $K$ successive operations.

Definition 3. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is computable in $K$ operations if it is the result of $K$ simple functions in the following way: there exist $(\theta_1, \ldots, \theta_K)$ functions of the input $x$ and $(g_1, \ldots, g_K)$, where $g_k$ is a function of $(\theta_i)_{i \le k-1}$, such that
\[ \theta_0(x) = x, \qquad \theta_K(x) = f(x), \qquad \forall k \in [1, K], \ \theta_k(x) = g_k(x, \theta_1(x), \ldots, \theta_{k-1}(x)). \qquad (2) \]

[Figure: example of a computation graph.]
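To give a flavour of what automatic differentiation provides in this context (an illustration added here, not the paper's AutoLip or SeqLip algorithms; it assumes PyTorch, zero padding, and a fixed input shape): a backward pass yields vector-Jacobian products, so the transpose of a convolution never has to be materialized, which is enough to run a power iteration for the operator norm of a convolution layer.

```python
import torch
import torch.nn.functional as F

def conv_operator_norm(weight: torch.Tensor, input_shape, n_iter: int = 50) -> float:
    """Estimate the operator norm of x -> conv2d(x, weight) by power iteration,
    using autograd to apply the transpose of the convolution."""
    x = torch.randn(1, *input_shape, requires_grad=True)
    for _ in range(n_iter):
        y = F.conv2d(x, weight, padding=1)                 # W x
        g, = torch.autograd.grad(0.5 * (y ** 2).sum(), x)  # gradient equals W^T W x
        with torch.no_grad():
            x.copy_(g / g.norm())                          # normalize the iterate
    return F.conv2d(x, weight, padding=1).norm().item()    # ||W x||_2 with ||x||_2 = 1

# Arbitrary 3x3 convolution on 32x32 RGB inputs, for illustration only.
weight = torch.randn(16, 3, 3, 3)  # (out_channels, in_channels, kH, kW)
print(conv_operator_norm(weight, (3, 32, 32)))
```

The value obtained this way depends on the chosen padding convention and input size, so it should be read as indicative rather than as a canonical layer constant.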

