Theory of Deep Learning - Princeton University


CONTRIBUTORS: Raman Arora, Sanjeev Arora, Joan Bruna, Nadav Cohen, Rong Ge, Suriya Gunasekar, Chi Jin, Jason Lee, Tengyu Ma, Behnam Neyshabur, Zhao Song, ...

THEORY OF DEEP LEARNING

Contents

1 Basic Setup and some math notions (some useful math facts; Singular Value Decomposition)
2 Basics of Optimization (the Taylor lemma; gradient descent)
3 Backpropagation and its Variants (runtime analysis; the chain rule; naive feedforward algorithm (not efficient!); backpropagation (linear time); Hessian-vector product in linear time: Pearlmutter's trick)
4 Basics of generalization theory (Occam's razor formalized; simple bounds on generalization; data-dependent complexity; interpretation: ability to correlate with random labels; PAC-Bayes bounds)
5 Advanced Optimization notions
6 Algorithmic Regularization (linear models in regression: squared loss; bias induced by updates of local search; bias induced by parameterization of model; linear models in classification; models with exponential-tailed losses; implicit bias in function space)
7 Tractable Landscapes for Nonconvex Optimization (preliminaries and challenges in nonconvex landscapes; cases with a unique global minimum; linear objective for generalized linear models; symmetry, saddle points and locally optimizable functions; case study: top eigenvector of a matrix; all critical directions of improvement)
8 Ultra-wide Neural Networks and Neural Tangent Kernel (evolving equation on predictions; NTK on ultra-wide neural networks; optimization and generalization of ultra-wide neural networks via NTK; NTK formula for multilayer fully-connected neural networks)
9 Inductive Biases due to Algorithmic Regularization (matrix sensing; deep neural networks; landscape of the optimization problem; implicit bias in local optima; role of parametrization)
10 Unsupervised Learning: Overview (possible goals of unsupervised learning; training objective for density estimation: log likelihood; variational autoencoder (VAE); main open question)
11 Generative Adversarial Nets (basic definitions)
12 Representation Learning
13 Examples of Theorems, Proofs, Algorithms, Tables, Figures (examples of theorems and proofs; example of a long equation; examples of algorithms, figures, and tables)

List of Figures

Chain rule: to compute the gradient it suffices to compute derivatives with respect to the nodes; the derivative with respect to a node $z$ can be computed using a weighted sum of the derivatives with respect to all nodes that $z$ feeds into.
3.3 Vector version of the above.
6.1 Steepest descent w.r.t. $\|\cdot\|_{4/3}$: the global minimum to which steepest descent converges depends on $\eta$. Here $w_0 = [0,0,0]$, $w^*_{\|\cdot\|} = \arg\min_{w \in \mathcal{G}} \|w\|_{4/3}$ denotes the minimum-norm global minimum, and $w^{\infty}_{\eta \to 0}$ denotes the solution of infinitesimal SD as $\eta \to 0$. Note that even as $\eta \to 0$, the expected characterization does not hold, i.e., $w^{\infty}_{\eta \to 0} \neq w^*_{\|\cdot\|}$.
Landscapes for nonconvex optimization. From left to right: local minimum, saddle point, and flat region.
Convergence rate vs. projections onto eigenvectors of the kernel matrix; test error vs. complexity measure.
Optimization landscape (top) and contour plot (bottom) for a single hidden-layer linear autoencoder network with one-dimensional input and output and a hidden layer of width $r = 2$ with dropout, for different values of the regularization parameter $\lambda$. Left: for $\lambda = 0$ the problem reduces to squared-loss minimization, which is rotation invariant as suggested by the level sets. Middle: for $\lambda > 0$ the global optima shrink toward the origin; all local minima are global and are equalized, i.e., the weights are parallel to the vector $(1, -1)$. Right: as $\lambda$ increases, the global optima shrink further.
Fit of Pearson's Crab Data as a mixture of two Gaussians. (Credit: MIX homepage at McMaster University.)
An autoencoder defined using a density distribution $p(h, x)$, where $h$ is the latent feature vector corresponding to the visible vector $x$. The process of computing $h$ given $x$ is called encoding and the reverse is called decoding. In general, applying the encoder on $x$ followed by the decoder would not give $x$ again, since the composed transformation is a sample from a distribution.
A chasing sequence.

List of Tables

We ignore the $O(\cdot)$ for simplicity. The $\ell_\infty/\ell_2$ is the strongest possible guarantee, with $\ell_2/\ell_2$ coming second, $\ell_2/\ell_1$ third, and exactly $k$-sparse being the weakest. We also note that [? ? ? ?] all obtain improved analyses of the Restricted Isometry Property; the algorithm is suggested and analyzed (modulo the RIP property) in [?]. The work in [?] does not explicitly state the extension to the $d$-dimensional case, but it can easily be inferred from the arguments. [? ? ? ?] work when the universe size in each dimension is a power of 2.

This monograph discusses the emerging theory of deep learning. It is based upon a graduate seminar taught at Princeton University in Fall 2019, in conjunction with a Special Year on Optimization, Statistics, and Machine Learning at the Institute for Advanced Study.

1 Basic Setup and some math notions

This chapter introduces the basic nomenclature: training/test error, generalization error, etc. Tengyu notes: Todos: Illustrate with plots: a typical training curve and test curve. Mention some popular architectures (feedforward, convolutional, pooling, ResNet, DenseNet) in a brief paragraph each.

We review the basic notions in statistical learning theory.

- A space of possible data points $\mathcal{X}$.

- A space of possible labels $\mathcal{Y}$.
- A joint probability distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. We assume that our training data consist of $n$ data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$, each drawn independently from $\mathcal{D}$.
- Hypothesis space: $\mathcal{H}$ is a family of hypotheses, or a family of predictors. E.g., $\mathcal{H}$ could be the set of all neural networks with a fixed architecture: $\mathcal{H} = \{h_\theta\}$, where $h_\theta$ is a neural net parameterized by parameters $\theta$.
- Loss function: $\ell : (\mathcal{X} \times \mathcal{Y}) \times \mathcal{H} \to \mathbb{R}$. E.g., in binary classification where $\mathcal{Y} = \{-1, +1\}$, given a hypothesis $h_\theta(x)$, the logistic loss function for $h_\theta$ on data point $(x, y)$ is
  $$\ell((x, y), \theta) = \frac{1}{1 + \exp(y\, h_\theta(x))}.$$
- Expected loss: $L(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell((x, y), h)]$. Recall that $\mathcal{D}$ is the data distribution over $\mathcal{X} \times \mathcal{Y}$.
- Training loss (also known as empirical risk): $\widehat{L}(h) = \frac{1}{n}\sum_{i=1}^{n} \ell((x^{(i)}, y^{(i)}), h)$, where $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ are the $n$ training examples drawn from $\mathcal{D}$.
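To make these definitions concrete, here is a small numerical sketch. It is only an editorial illustration, not part of the original notes: the two-dimensional Gaussian data distribution, the 5% label-noise rate, the particular $\theta$, and all variable names are assumptions chosen for the example. It instantiates $\mathcal{H}$ as linear predictors $h_\theta(x) = \langle \theta, x \rangle$ with the logistic loss above, and contrasts the training loss $\widehat{L}$ on $n = 100$ samples with a Monte Carlo estimate of the expected loss $L$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint distribution D over X x Y, with X = R^2, Y = {-1, +1}:
# the label is the sign of <w_true, x>, flipped with probability 0.05.
w_true = np.array([1.0, -2.0])

def sample_D(n):
    x = rng.normal(size=(n, 2))
    y = np.sign(x @ w_true)
    flip = rng.random(n) < 0.05
    y[flip] *= -1
    return x, y

def loss(theta, x, y):
    # Logistic loss from the text, 1 / (1 + exp(y * h_theta(x))),
    # for the linear hypothesis h_theta(x) = <theta, x>.
    return 1.0 / (1.0 + np.exp(y * (x @ theta)))

theta = np.array([0.8, -1.5])          # one fixed hypothesis in H

x_train, y_train = sample_D(100)
train_loss = loss(theta, x_train, y_train).mean()   # empirical risk L_hat(h)

x_test, y_test = sample_D(200_000)                  # fresh samples from D
expected_loss = loss(theta, x_test, y_test).mean()  # Monte Carlo estimate of L(h)

print(f"training loss:  {train_loss:.3f}")
print(f"expected loss: ~{expected_loss:.3f}")
```

The gap between the two printed numbers is the generalization gap studied in the chapter on generalization theory.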

- Empirical risk minimizer (ERM): $\widehat{h} \in \arg\min_{h \in \mathcal{H}} \widehat{L}(h)$.
- Regularization: Suppose we have a regularizer $R(h)$; then the regularized loss is $\widehat{L}_\lambda(h) = \widehat{L}(h) + \lambda R(h)$.

Suriya notes: Misc notations: gradient, Hessian, norms.

Some useful math facts

Now we list some useful math tools. In this section we introduce the probability tools we use in the proofs: first, tail bounds for scalar random variables; then a fact about the density of Gaussian distributions; finally, a concentration result for random matrices.

Lemma (Markov's inequality). If $x$ is a nonnegative random variable and $t > 0$, then the probability that $x$ is at least $t$ is at most the expectation of $x$ divided by $t$:
$$\Pr[x \ge t] \le \frac{\mathbb{E}[x]}{t}.$$

Lemma (Chebyshev's inequality). Let $x$ denote a random variable with finite variance and let $t > 0$; then
$$\Pr\left[\,|x - \mathbb{E}[x]| \ge t\,\right] \le \frac{\mathrm{Var}[x]}{t^2}.$$

Lemma (Chernoff bound [?]). Let $X = \sum_{i=1}^n X_i$, where $X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let $\mu = \mathbb{E}[X] = \sum_{i=1}^n p_i$. Then
$$\Pr[X \ge (1+\delta)\mu] \le \exp(-\delta^2 \mu / 3), \quad \forall \delta > 0;$$
$$\Pr[X \le (1-\delta)\mu] \le \exp(-\delta^2 \mu / 2), \quad \forall\, 0 < \delta < 1.$$

Lemma (Hoeffding bound [?]). Let $X_1, \ldots, X_n$ denote $n$ independent bounded variables with $X_i \in [a_i, b_i]$. Let $X = \sum_{i=1}^n X_i$; then
$$\Pr\left[\,|X - \mathbb{E}[X]| \ge t\,\right] \le 2 \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

Lemma (Bernstein inequality [?]). Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ almost surely for all $i$. Then, for all positive $t$,
$$\Pr\!\left[\sum_{i=1}^n X_i > t\right] \le \exp\!\left(-\frac{t^2/2}{\sum_{j=1}^n \mathbb{E}[X_j^2] + Mt/3}\right).$$

Lemma (Anti-concentration of the Gaussian distribution). Let $X \sim N(0, \sigma^2)$, that is, the probability density function of $X$ is given by $\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}$. Then
$$\Pr[\,|X| \le t\,] \in \left(\frac{2}{3}\frac{t}{\sigma},\ \frac{4}{5}\frac{t}{\sigma}\right).$$

Lemma (Matrix Bernstein [?]). Consider a finite sequence $\{X_1, \ldots, X_m\} \subset \mathbb{R}^{n_1 \times n_2}$ of independent random matrices with common dimension $n_1 \times n_2$. Assume that $\mathbb{E}[X_i] = 0$ and $\|X_i\| \le M$ for all $i \in [m]$. Let $Z = \sum_{i=1}^m X_i$, and let $\mathrm{Var}[Z]$ be the matrix variance statistic of the sum:
$$\mathrm{Var}[Z] = \max\left\{ \Big\|\sum_{i=1}^m \mathbb{E}[X_i X_i^\top]\Big\|,\ \Big\|\sum_{i=1}^m \mathbb{E}[X_i^\top X_i]\Big\| \right\}.$$
Then
$$\mathbb{E}[\|Z\|] \le \left(2\,\mathrm{Var}[Z] \log(n_1 + n_2)\right)^{1/2} + \tfrac{1}{3} M \log(n_1 + n_2),$$
and, for all $t \ge 0$,
$$\Pr[\|Z\| \ge t] \le (n_1 + n_2) \exp\!\left(-\frac{t^2/2}{\mathrm{Var}[Z] + Mt/3}\right).$$
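These bounds are easy to sanity-check by simulation. The following sketch (an editorial illustration; the choices $n = 100$, $t = 20$, uniform variables on $[-1, 1]$, and the trial count are arbitrary) compares the empirical tail probability of a sum of bounded independent variables with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(1)

# X_1, ..., X_n i.i.d. uniform on [a_i, b_i] = [-1, 1]; X = sum_i X_i, E[X] = 0.
n, trials, t = 100, 200_000, 20.0
X = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)

empirical = np.mean(np.abs(X) >= t)

# Hoeffding: Pr[|X - E[X]| >= t] <= 2 exp(-2 t^2 / sum_i (b_i - a_i)^2),
# and here sum_i (b_i - a_i)^2 = 4n.
hoeffding = 2 * np.exp(-2 * t**2 / (4 * n))

print(f"empirical tail:  {empirical:.1e}")   # roughly 5e-4
print(f"Hoeffding bound: {hoeffding:.1e}")   # about 2.7e-1
```

The bound holds but is visibly loose here; tail bounds of this kind trade tightness for generality.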

Todo: Explain these in a paragraph.

A useful shorthand will be the following: if $y_1, y_2, \ldots, y_m$ are independent random variables, each having mean $0$ and taking values in $[-1, 1]$, then their average $\frac{1}{m}\sum_i y_i$ behaves like a Gaussian variable with mean zero and variance at most $1/m$. In other words, the probability that this average is at least $\epsilon$ in absolute value is at most $\exp(-\epsilon^2 m)$.

2 Basics of Optimization

This chapter sets up the basic analysis framework for gradient-based optimization algorithms and discusses how it applies to deep learning.

Sanjeev notes: Suggestion: when introducing the usual abstractions like the Lipschitz constant, Hessian norm, etc., let's relate them concretely to what they mean in the context of deep learning (noting that the Lipschitz constant is w.r.t. the vector of parameters). Be frank about what these numbers might be for deep learning, or even how feasible it is to estimate them. (Maybe that discussion can go in the sidebar.) BTW, it may be useful to give some numbers for the empirical Lipschitz constant encountered in practice; one suspects that the optimization speed analysis is rather pessimistic.

Suriya notes: To ground optimization in our case, we can also mention that $f$ is often of either the ERM or the stochastic optimization form $L(w) = \mathbb{E}[\ell(w; x, y)]$; it might also be useful to mention that outside of this chapter, we typically use $f$ as an alternative for $h$ to denote the function computed by the model.

Tengyu notes: should we use $w$ or $\theta$ in this section? Suriya notes: I remembered that we agreed on $w$ for parameters a long time back; did we go back to theta?

Gradient descent

Suppose we would like to optimize a continuous function $f(w)$ over $w \in \mathbb{R}^d$:
$$\min_{w \in \mathbb{R}^d} f(w).$$
The gradient descent (GD) algorithm is
$$w_0 = \text{initialization}, \qquad w_{t+1} = w_t - \eta \nabla f(w_t),$$
where $\eta$ is the step size or learning rate. The motivation or justification of GD is that the update direction $-\nabla f(w_t)$ is locally the steepest descent direction.
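A minimal runnable sketch of this update rule follows; it is an editorial example, with the quadratic objective $f(w) = \frac{1}{2}\|Aw - b\|^2$, the step size, and the iteration count all chosen arbitrarily for illustration.

```python
import numpy as np

# Minimize f(w) = 0.5 * ||A w - b||^2 by gradient descent.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_f(w):
    # Gradient of 0.5 * ||A w - b||^2 is A^T (A w - b).
    return A.T @ (A @ w - b)

eta = 0.05            # step size (learning rate)
w = np.zeros(2)       # w_0: initialization
for _ in range(500):
    w = w - eta * grad_f(w)   # w_{t+1} = w_t - eta * grad f(w_t)

print("GD solution:   ", w)
print("exact solution:", np.linalg.solve(A, b))
```

For this convex quadratic, GD converges to the unique minimizer whenever $\eta < 2/\lambda_{\max}(A^\top A)$; for deep learning objectives no such clean condition is available, echoing the note above about how hard these constants are to estimate for deep models.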

