
CSC321 Lecture 10: Automatic Differentiation




Transcription of CSC321 Lecture 10: Automatic Differentiation

Roger Grosse

Overview

Implementing backprop by hand is like programming in assembly language. You'll probably never do it, but it's important for having a mental model of how everything works.

Lecture 6 covered the math of backprop, which you are using to code it up for a particular network for Assignment 1.

This lecture: how to build an automatic differentiation (autodiff) library, so that you never have to write derivatives by hand.

We'll cover a simplified version of Autograd, a lightweight autodiff tool. PyTorch's autodiff feature is based on very similar principles.

Confusing Terminology

Automatic differentiation (autodiff) refers to a general way of taking a program which computes a value, and automatically constructing a procedure for computing derivatives of that value.

In this lecture, we focus on reverse mode autodiff.

There is also a forward mode, which is for computing directional derivatives.

Backpropagation is the special case of autodiff applied to neural nets. But in machine learning, we often use backprop synonymously with autodiff.

Autograd is the name of a particular autodiff package. But lots of people, including the PyTorch developers, got confused and started using "autograd" to mean "autodiff".

What Autodiff Is Not

Autodiff is not finite differences. Finite differences are expensive, since you need to do a forward pass for each derivative. It also induces huge numerical error. Normally, we only use it for testing.

Autodiff is both efficient (linear in the cost of computing the value) and numerically stable.

Autodiff is not symbolic differentiation (e.g. Mathematica). Symbolic differentiation can result in complex and redundant expressions.

[Figures: Mathematica's derivatives for one layer of soft ReLU (univariate case); derivatives for two layers of soft ReLU.]

There might not be a convenient formula for the derivatives. The goal of autodiff is not a formula, but a procedure for computing derivatives.

What Autodiff Is

Recall how we computed the derivatives of logistic least squares regression. An autodiff system should transform the left-hand side (computing the loss) into the right-hand side (computing the derivatives).

Computing the loss:

$$
\begin{aligned}
z &= wx + b \\
y &= \sigma(z) \\
\mathcal{L} &= \tfrac{1}{2}(y - t)^2
\end{aligned}
$$

Computing the derivatives:

$$
\begin{aligned}
\bar{\mathcal{L}} &= 1 \\
\bar{y} &= y - t \\
\bar{z} &= \bar{y}\,\sigma'(z) \\
\bar{w} &= \bar{z}\,x \\
\bar{b} &= \bar{z}
\end{aligned}
$$

An autodiff system will convert the program into a sequence of primitive operations which have specified routines for computing derivatives. In this representation, backprop can be done in a completely mechanical way.
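To make the contrast with finite differences concrete, here is a minimal Python sketch (not taken from the lecture) that evaluates the hand-derived error signals for logistic least squares and compares them with central finite differences. The function names, the input values, and the step size `eps` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, t):
    # Forward computation: z = wx + b, y = sigma(z), L = (y - t)^2 / 2
    z = w * x + b
    y = sigmoid(z)
    return 0.5 * (y - t) ** 2

def loss_grads(w, b, x, t):
    # Hand-derived backward pass, one line per error signal.
    z = w * x + b
    y = sigmoid(z)
    L_bar = 1.0
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1.0 - y)  # sigma'(z) = sigma(z) * (1 - sigma(z))
    w_bar = z_bar * x
    b_bar = z_bar
    return w_bar, b_bar

def finite_diff_grads(w, b, x, t, eps=1e-6):
    # Central differences: two extra forward passes per parameter.
    w_fd = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
    b_fd = (loss(w, b + eps, x, t) - loss(w, b - eps, x, t)) / (2 * eps)
    return w_fd, b_fd

# Illustrative inputs (assumed, not from the slides).
w_bar, b_bar = loss_grads(0.5, -0.3, 2.0, 1.0)
w_fd, b_fd = finite_diff_grads(0.5, -0.3, 2.0, 1.0)
print("analytic:", w_bar, b_bar)
print("finite differences:", w_fd, b_fd)
```

The two results should agree to several digits, which is why finite differences remain useful for testing. The cost of the finite-difference version grows with the number of parameters, while the backward pass costs about as much as one forward pass.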

Original program:

$$
\begin{aligned}
z &= wx + b \\
y &= \frac{1}{1 + \exp(-z)} \\
\mathcal{L} &= \tfrac{1}{2}(y - t)^2
\end{aligned}
$$

Sequence of primitive operations:

$$
\begin{aligned}
t_1 &= wx \\
z &= t_1 + b \\
t_3 &= -z \\
t_4 &= \exp(t_3) \\
t_5 &= 1 + t_4 \\
y &= 1 / t_5 \\
t_6 &= y - t \\
t_7 &= t_6^2 \\
\mathcal{L} &= t_7 / 2
\end{aligned}
$$

Autograd

The rest of this lecture covers how Autograd is implemented. Source code is available for the original Autograd package, and for a pedagogical implementation of Autograd; you are encouraged to read the code. Thanks to Matt Johnson for providing this!

Building the Computation Graph

Most autodiff systems, including Autograd, explicitly construct the computation graph. Some frameworks like TensorFlow provide mini-languages for building computation graphs directly. Disadvantage: you need to learn a totally new API. Autograd instead builds them by tracing the forward pass computation, allowing for an interface nearly indistinguishable from NumPy.

The Node class represents a node of the computation graph.
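The primitive-operation sequence for logistic least squares listed earlier can be differentiated by walking the same operations in reverse order, applying one local derivative rule per step. The sketch below is written out by hand to illustrate that idea; it is not output from Autograd. The variable names mirror t1 through t7, and the `_bar` suffix for error signals is a naming choice made here.

```python
import math

def forward(w, b, x, t):
    # One line per primitive operation.
    t1 = w * x
    z = t1 + b
    t3 = -z
    t4 = math.exp(t3)
    t5 = 1.0 + t4
    y = 1.0 / t5
    t6 = y - t
    t7 = t6 ** 2
    L = t7 / 2.0
    return L, (t4, t5, t6)  # keep the intermediates the backward sweep needs

def backward(x, saved):
    # Visit the primitive operations in reverse, one derivative rule each.
    t4, t5, t6 = saved
    L_bar = 1.0
    t7_bar = L_bar / 2.0               # L  = t7 / 2
    t6_bar = t7_bar * 2.0 * t6         # t7 = t6^2
    y_bar = t6_bar                     # t6 = y - t
    t5_bar = y_bar * (-1.0 / t5 ** 2)  # y  = 1 / t5
    t4_bar = t5_bar                    # t5 = 1 + t4
    t3_bar = t4_bar * t4               # t4 = exp(t3)
    z_bar = -t3_bar                    # t3 = -z
    t1_bar = z_bar                     # z  = t1 + b
    b_bar = z_bar
    w_bar = t1_bar * x                 # t1 = w * x
    return w_bar, b_bar

# Illustrative inputs (assumed, not from the slides).
L, saved = forward(0.5, -0.3, 2.0, 1.0)
w_bar, b_bar = backward(2.0, saved)
print(L, w_bar, b_bar)
```

Each backward line depends only on the matching forward line, which is what makes the procedure mechanical: a library needs one derivative rule per primitive rather than one per program.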

A Node has the following attributes:

value, the actual value computed on a particular set of inputs
fun, the primitive operation defining the node
args and kwargs, the arguments the op was called with
parents, the parent Nodes

Autograd's fake NumPy module provides primitive ops which look and feel like NumPy functions, but secretly build the computation graph. They wrap around NumPy functions.

Vector-Jacobian Products

Previously, I suggested deriving backprop equations in terms of sums and indices, and then vectorizing them. But we'd like to implement our primitive operations in vectorized form.

The Jacobian is the matrix of partial derivatives:

$$
\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$

The backprop equation (single child node) can be written as a vector-Jacobian product (VJP):

$$
\bar{x}_j = \sum_i \bar{y}_i \frac{\partial y_i}{\partial x_j}
\qquad\qquad
\bar{\mathbf{x}} = \bar{\mathbf{y}}^\top \mathbf{J}
$$

That gives a row vector.

The row vector $\bar{\mathbf{y}}^\top \mathbf{J}$ can be treated as a column vector by taking $\bar{\mathbf{x}} = \mathbf{J}^\top \bar{\mathbf{y}}$.

Examples

Matrix-vector product:

$$
\mathbf{z} = \mathbf{W}\mathbf{x} \qquad \mathbf{J} = \mathbf{W} \qquad \bar{\mathbf{x}} = \mathbf{W}^\top \bar{\mathbf{z}}
$$

Elementwise operations:

$$
\mathbf{y} = \exp(\mathbf{z}) \qquad
\mathbf{J} = \begin{pmatrix} \exp(z_1) & & 0 \\ & \ddots & \\ 0 & & \exp(z_D) \end{pmatrix} \qquad
\bar{\mathbf{z}} = \exp(\mathbf{z}) \circ \bar{\mathbf{y}}
$$

Note: we never explicitly construct the Jacobian. It's usually simpler and more efficient to compute the VJP directly.

For each primitive operation, we must specify VJPs for each of its arguments. Consider $y = \exp(x)$. Its VJP is a function which takes in the output gradient (i.e. $\bar{y}$), the answer ($y$), and the arguments ($x$), and returns the input gradient ($\bar{x}$).

defvjp is a convenience routine for registering VJPs. It just adds them to a dict.

Backward Pass

Recall that the backprop computations are more modular if we view them as message passing. This procedure can be implemented directly using the data structures we've introduced.

In the code for the backward pass, the argument g is the error signal for the end node; for us this is always $\bar{\mathcal{L}} = 1$.

grad is just a wrapper around make_vjp, which builds the computation graph and feeds it to the backward pass. grad itself is viewed as a VJP, if we treat $\bar{\mathcal{L}}$ as the $1 \times 1$ matrix with entry 1.
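The two VJP examples from this section (the matrix-vector product and elementwise exp) can be checked against an explicitly built Jacobian. The following is a small pure-Python sketch under assumed inputs; the helper names `matvec_vjp`, `exp_vjp`, and `explicit_vjp` are hypothetical and are not part of Autograd.

```python
import math

def matvec_vjp(W, z_bar):
    # VJP of z = W x with respect to x: x_bar = W^T z_bar.
    return [sum(W[i][j] * z_bar[i] for i in range(len(W)))
            for j in range(len(W[0]))]

def exp_vjp(z, y_bar):
    # VJP of y = exp(z): an elementwise product, no D x D matrix needed.
    return [math.exp(zi) * gi for zi, gi in zip(z, y_bar)]

def explicit_vjp(J, y_bar):
    # Reference computation: x_bar_j = sum_i y_bar_i * J[i][j].
    return [sum(y_bar[i] * J[i][j] for i in range(len(J)))
            for j in range(len(J[0]))]

# Matrix-vector product: here the Jacobian is W itself.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
z_bar = [1.0, -1.0]
x_bar = matvec_vjp(W, z_bar)

# Elementwise exp: build the diagonal Jacobian only to check the shortcut.
z = [0.0, 1.0]
y_bar = [2.0, 3.0]
J_exp = [[math.exp(z[i]) if i == j else 0.0 for j in range(len(z))]
         for i in range(len(z))]
z_grad_direct = exp_vjp(z, y_bar)
z_grad_explicit = explicit_vjp(J_exp, y_bar)
print(x_bar, z_grad_direct, z_grad_explicit)
```

For exp, the direct VJP does D multiplications, while the explicit route stores and multiplies a D by D matrix that is almost entirely zeros. That difference is the reason for never constructing the Jacobian.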

In symbols, treating $\bar{\mathcal{L}}$ as the $1 \times 1$ matrix with entry 1:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial \mathbf{w}} \bar{\mathcal{L}}
$$

Recap

We saw three main parts to the code:

tracing the forward pass to build the computation graph
vector-Jacobian products for primitive ops
the backwards pass

Building the computation graph requires fancy NumPy gymnastics, but the other two items are basically what I showed you. You're encouraged to read the full code (<200 lines!).

Differentiating through a Fluid Simulation

#end-to-end-examples

Gradient-Based Hyperparameter Optimization
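As a recap of the three parts of the code (tracing, VJPs, and the backward pass), here is a minimal scalar-only sketch in plain Python. It borrows the names Node, defvjp, and grad from the lecture, but it is a simplified stand-in written for illustration, not Autograd's actual implementation: the `primitive` wrapper, the `VJPS` dict, the `toposort` helper, and the VJP signature `(g, ans, *args)` are all assumptions made here.

```python
import math

class Node:
    """A computation-graph node recorded while tracing the forward pass."""
    def __init__(self, value, fun=None, args=(), parents=()):
        self.value = value      # value computed on this particular input
        self.fun = fun          # primitive that produced it (None for inputs)
        self.args = args        # raw argument values the primitive received
        self.parents = parents  # (argument index, parent Node) pairs

VJPS = {}  # primitive -> one VJP function per argument

def defvjp(fun, *vjps):
    VJPS[fun] = vjps

def primitive(raw_fun):
    """Wrap a plain function so that calling it on Nodes extends the graph."""
    def wrapped(*args):
        values = tuple(a.value if isinstance(a, Node) else a for a in args)
        parents = tuple((i, a) for i, a in enumerate(args) if isinstance(a, Node))
        ans = raw_fun(*values)
        return Node(ans, wrapped, values, parents) if parents else ans
    return wrapped

add = primitive(lambda a, b: a + b)
mul = primitive(lambda a, b: a * b)
neg = primitive(lambda a: -a)
exp = primitive(math.exp)
reciprocal = primitive(lambda a: 1.0 / a)

# Each VJP maps (output gradient g, answer, *args) to one input gradient.
defvjp(add, lambda g, ans, a, b: g, lambda g, ans, a, b: g)
defvjp(mul, lambda g, ans, a, b: g * b, lambda g, ans, a, b: g * a)
defvjp(neg, lambda g, ans, a: -g)
defvjp(exp, lambda g, ans, a: g * ans)
defvjp(reciprocal, lambda g, ans, a: -g * ans * ans)

def toposort(end_node):
    """Return nodes ordered so that each node comes before its parents."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for _, parent in node.parents:
            visit(parent)
        order.append(node)
    visit(end_node)
    return reversed(order)

def backward_pass(g, end_node):
    """Propagate error signals from the end node back to every ancestor."""
    grads = {id(end_node): g}
    for node in toposort(end_node):
        node_g = grads[id(node)]
        for argnum, parent in node.parents:
            contribution = VJPS[node.fun][argnum](node_g, node.value, *node.args)
            grads[id(parent)] = grads.get(id(parent), 0.0) + contribution
    return grads

def grad(fun, argnum=0):
    """Return a function computing d fun / d args[argnum] for scalar outputs."""
    def gradfun(*args):
        args = list(args)
        start = Node(args[argnum])
        args[argnum] = start
        end = fun(*args)
        if not isinstance(end, Node):
            return 0.0  # the output does not depend on the chosen argument
        return backward_pass(1.0, end)[id(start)]
    return gradfun

def loss(w, b, x, t):
    # Logistic least squares written with the traced primitives.
    z = add(mul(w, x), b)
    y = reciprocal(add(1.0, exp(neg(z))))
    diff = add(y, neg(t))
    return mul(0.5, mul(diff, diff))

# Illustrative inputs (assumed, not from the slides).
dL_dw = grad(loss, 0)(0.5, -0.3, 2.0, 1.0)
dL_db = grad(loss, 1)(0.5, -0.3, 2.0, 1.0)
print(dL_dw, dL_db)
```

The loss function is written once as ordinary-looking code, and the derivative comes from the recorded graph plus the registered VJPs. Wrapping only the chosen argument in a Node means untraced arguments flow through as plain numbers, which keeps the graph limited to what the gradient actually needs.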

