
Vector, Matrix, and Tensor Derivatives


Erik Learned-Miller

The purpose of this document is to help you learn to take derivatives of vectors, matrices, and higher order tensors (arrays with three dimensions or more), and to help you take derivatives with respect to vectors, matrices, and higher order tensors.

1 Simplify, simplify, simplify

Much of the confusion in taking derivatives involving arrays stems from trying to do too many things at once. These things include taking derivatives of multiple components simultaneously, taking derivatives in the presence of summation notation, and applying the chain rule. By doing all of these things at the same time, we are more likely to make errors, at least until we have a lot of experience.

1.1 Expanding notation into explicit sums and equations for each component

In order to simplify a given calculation, it is often useful to write out the explicit formula for a single scalar element of the output in terms of nothing but scalar variables.

Once one has an explicit formula for a single scalar element of the output in terms of other scalar values, then one can use the calculus that you used as a beginner, which is much easier than trying to do matrix math, summations, and derivatives all at the same time.

Suppose we have a column vector $\vec{y}$ of length C that is calculated by forming the product of a matrix W that is C rows by D columns with a column vector $\vec{x}$ of length D:

$$\vec{y} = W \vec{x}. \quad (1)$$

Suppose we are interested in the derivative of $\vec{y}$ with respect to $\vec{x}$. A full characterization of this derivative requires the (partial) derivatives of each component of $\vec{y}$ with respect to each component of $\vec{x}$, which in this case will contain $C \times D$ values since there are C components in $\vec{y}$ and D components of $\vec{x}$.

Let's start by computing one of these, say, the 3rd component of $\vec{y}$ with respect to the 7th component of $\vec{x}$. That is, we want to compute $\partial \vec{y}_3 / \partial \vec{x}_7$, which is just the derivative of one scalar with respect to another. The first thing to do is to write down the formula for computing $\vec{y}_3$ so we can take its derivative.

From the definition of matrix-vector multiplication, the value $\vec{y}_3$ is computed by taking the dot product between the 3rd row of W and the vector $\vec{x}$:

$$\vec{y}_3 = \sum_{j=1}^{D} W_{3,j} \vec{x}_j. \quad (2)$$

At this point, we have reduced the original matrix equation (Equation 1) to a scalar equation. This makes it much easier to compute the desired derivative.

1.2 Removing summation notation

While it is certainly possible to compute derivatives directly from Equation 2, people frequently make errors when differentiating expressions that contain summation notation ($\sum$) or product notation ($\prod$). When you're beginning, it is sometimes useful to write out a computation without any summation notation to make sure you're doing everything correctly. Using 1 as the first index, we have

$$\vec{y}_3 = W_{3,1}\vec{x}_1 + W_{3,2}\vec{x}_2 + \ldots + W_{3,7}\vec{x}_7 + \ldots + W_{3,D}\vec{x}_D.$$

Of course, I have explicitly included the term that involves $\vec{x}_7$, since that is what we are differentiating with respect to. At this point, we can see that the expression for $\vec{y}_3$ only depends upon $\vec{x}_7$ through a single term, $W_{3,7}\vec{x}_7$.

Since none of the other terms in the summation include $\vec{x}_7$, their derivatives with respect to $\vec{x}_7$ are all 0. Thus, we have

$$\frac{\partial \vec{y}_3}{\partial \vec{x}_7} = \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,1}\vec{x}_1 + W_{3,2}\vec{x}_2 + \ldots + W_{3,7}\vec{x}_7 + \ldots + W_{3,D}\vec{x}_D \right] \quad (3)$$
$$= 0 + 0 + \ldots + \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,7}\vec{x}_7 \right] + \ldots + 0 \quad (4)$$
$$= \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,7}\vec{x}_7 \right] \quad (5)$$
$$= W_{3,7}. \quad (6)$$

By focusing on one component of $\vec{y}$ and one component of $\vec{x}$, we have made the calculation about as simple as it can be. In the future, when you are confused, it can help to try to reduce a problem to this most basic setting to see where you are going wrong.

1.3 Completing the derivative: the Jacobian matrix

Recall that our original goal was to compute the derivatives of each component of $\vec{y}$ with respect to each component of $\vec{x}$, and we noted that there would be $C \times D$ of these. They can be written out as a matrix in the following form:

$$\begin{bmatrix}
\frac{\partial \vec{y}_1}{\partial \vec{x}_1} & \frac{\partial \vec{y}_1}{\partial \vec{x}_2} & \frac{\partial \vec{y}_1}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_1}{\partial \vec{x}_D} \\
\frac{\partial \vec{y}_2}{\partial \vec{x}_1} & \frac{\partial \vec{y}_2}{\partial \vec{x}_2} & \frac{\partial \vec{y}_2}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_2}{\partial \vec{x}_D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial \vec{y}_C}{\partial \vec{x}_1} & \frac{\partial \vec{y}_C}{\partial \vec{x}_2} & \frac{\partial \vec{y}_C}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_C}{\partial \vec{x}_D}
\end{bmatrix}$$

In this particular case, this is called the Jacobian matrix, but this terminology is not too important for our purposes.

Notice that for the equation $\vec{y} = W\vec{x}$, the partial of $\vec{y}_3$ with respect to $\vec{x}_7$ was simply given by $W_{3,7}$.
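This single partial can also be checked numerically. The following is a minimal NumPy sketch, not part of the original notes; the sizes, random seed, and variable names are illustrative. It compares a finite-difference estimate of $\partial \vec{y}_3 / \partial \vec{x}_7$ with the entry $W_{3,7}$.

```python
import numpy as np

# Illustrative sizes: any C >= 3 and D >= 7 work for this example.
C, D = 5, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((C, D))   # C rows by D columns
x = rng.standard_normal(D)        # column vector ~x of length D

def y3(x):
    """The 3rd component of ~y = W ~x (index 2, since NumPy is 0-indexed)."""
    return (W @ x)[2]

# Finite-difference estimate of the partial of y_3 with respect to x_7 (index 6).
eps = 1e-6
dx = np.zeros(D)
dx[6] = eps
numeric = (y3(x + dx) - y3(x)) / eps

print(numeric, W[2, 6])              # the two numbers agree closely
print(np.isclose(numeric, W[2, 6]))  # True
```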

If you go through the same process for other components, you will find that, for all i and j,

$$\frac{\partial \vec{y}_i}{\partial \vec{x}_j} = W_{i,j}.$$

This means that the matrix of partial derivatives is

$$\begin{bmatrix}
\frac{\partial \vec{y}_1}{\partial \vec{x}_1} & \frac{\partial \vec{y}_1}{\partial \vec{x}_2} & \frac{\partial \vec{y}_1}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_1}{\partial \vec{x}_D} \\
\frac{\partial \vec{y}_2}{\partial \vec{x}_1} & \frac{\partial \vec{y}_2}{\partial \vec{x}_2} & \frac{\partial \vec{y}_2}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_2}{\partial \vec{x}_D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial \vec{y}_C}{\partial \vec{x}_1} & \frac{\partial \vec{y}_C}{\partial \vec{x}_2} & \frac{\partial \vec{y}_C}{\partial \vec{x}_3} & \ldots & \frac{\partial \vec{y}_C}{\partial \vec{x}_D}
\end{bmatrix}
=
\begin{bmatrix}
W_{1,1} & W_{1,2} & W_{1,3} & \ldots & W_{1,D} \\
W_{2,1} & W_{2,2} & W_{2,3} & \ldots & W_{2,D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
W_{C,1} & W_{C,2} & W_{C,3} & \ldots & W_{C,D}
\end{bmatrix}.$$

This, of course, is just W itself. Thus, after all this work, we have concluded that for $\vec{y} = W\vec{x}$, we have

$$\frac{d\vec{y}}{d\vec{x}} = W.$$
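The same kind of numerical check can be run for the whole Jacobian at once. Below is a small sketch (again with made-up sizes and names, not from the original text) that assembles the C-by-D matrix of partials by finite differences, one column at a time, and confirms that it matches W.

```python
import numpy as np

C, D = 5, 8
rng = np.random.default_rng(1)
W = rng.standard_normal((C, D))
x = rng.standard_normal(D)

def f(x):
    return W @ x                      # ~y = W ~x

# Column j of the Jacobian holds the partials of every y_i with respect to x_j.
eps = 1e-6
J = np.zeros((C, D))
for j in range(D):
    dx = np.zeros(D)
    dx[j] = eps
    J[:, j] = (f(x + dx) - f(x)) / eps

print(np.allclose(J, W))              # True: for ~y = W ~x, d~y/d~x = W
```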

2 Row vectors instead of column vectors

It is important in working with different neural networks packages to pay close attention to the arrangement of weight matrices, data matrices, and so on. For example, if a data matrix X contains many different vectors, each of which represents an input, is each data vector a row or column of the data matrix X?

In the example from the first section, we worked with a vector $\vec{x}$ that was a column vector. However, you should also be able to use the same basic ideas when $\vec{x}$ is a row vector.

2.1 Example 2

Let $\vec{y}$ be a row vector with C components computed by taking the product of another row vector $\vec{x}$ with D components and a matrix W that is D rows by C columns:

$$\vec{y} = \vec{x} W.$$

Note that, despite the fact that $\vec{y}$ and $\vec{x}$ have the same number of components as before, the shape of W is the transpose of the shape that we used before for W. In particular, since we are now left-multiplying by $\vec{x}$, whereas before $\vec{x}$ was on the right, W must be transposed for the matrix algebra to work.

In this case, you will see, by writing

$$\vec{y}_3 = \sum_{j=1}^{D} \vec{x}_j W_{j,3},$$

that

$$\frac{\partial \vec{y}_3}{\partial \vec{x}_7} = W_{7,3}.$$

Notice that the indexing into W is the opposite from what it was in the first example. However, when we assemble the full Jacobian matrix, we can still see that in this case as well,

$$\frac{d\vec{y}}{d\vec{x}} = W. \quad (7)$$
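A quick numerical sketch of this row-vector version (illustrative sizes, not from the original text): with $\vec{y} = \vec{x} W$, the partial of $\vec{y}_3$ with respect to $\vec{x}_7$ comes out as $W_{7,3}$, and stacking the partials with the components of $\vec{x}$ indexing rows and the components of $\vec{y}$ indexing columns reproduces W, which is one way to read Equation 7.

```python
import numpy as np

C, D = 5, 8
rng = np.random.default_rng(2)
W = rng.standard_normal((D, C))       # now D rows by C columns
x = rng.standard_normal(D)            # row vector ~x with D components

def f(x):
    return x @ W                      # ~y = ~x W, a row vector with C components

# Partial of y_3 with respect to x_7 (0-based indices 2 and 6).
eps = 1e-6
dx = np.zeros(D)
dx[6] = eps
numeric = (f(x + dx)[2] - f(x)[2]) / eps
print(np.isclose(numeric, W[6, 2]))   # True: the indexing into W is flipped

# Stack all partials as J[j, i] = dy_i/dx_j; this D-by-C array is W itself.
J = np.zeros((D, C))
for j in range(D):
    dx = np.zeros(D)
    dx[j] = eps
    J[j, :] = (f(x + dx) - f(x)) / eps
print(np.allclose(J, W))              # True, matching Equation 7
```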

3 Dealing with more than two dimensions

Let's consider another closely related problem, that of computing

$$\frac{d\vec{y}}{dW}.$$

In this case, $\vec{y}$ varies along one coordinate while W varies along two coordinates. Thus, the entire derivative is most naturally contained in a three-dimensional array. We avoid the term "three-dimensional matrix" since it is not clear how matrix multiplication and other matrix operations are defined on a three-dimensional array.

Dealing with three-dimensional arrays, it becomes perhaps more trouble than it's worth to try to find a way to display them. Instead, we should simply define our results as formulas which can be used to compute the result on any element of the desired three-dimensional array.

Let's again compute a scalar derivative between one component of $\vec{y}$, say $\vec{y}_3$, and one component of W, say $W_{7,8}$. Let's start with the same basic setup in which we write down an equation for $\vec{y}_3$ in terms of other scalar components. Now we would like an equation that expresses $\vec{y}_3$ in terms of scalar values, and shows the role that $W_{7,8}$ plays in its computation.

However, what we see is that $W_{7,8}$ plays no role in the computation of $\vec{y}_3$, since

$$\vec{y}_3 = \vec{x}_1 W_{1,3} + \vec{x}_2 W_{2,3} + \ldots + \vec{x}_D W_{D,3}. \quad (8)$$

In other words,

$$\frac{\partial \vec{y}_3}{\partial W_{7,8}} = 0.$$

However, the partials of $\vec{y}_3$ with respect to elements of the 3rd column of W will certainly be non-zero. For example, the derivative of $\vec{y}_3$ with respect to $W_{2,3}$ is given by

$$\frac{\partial \vec{y}_3}{\partial W_{2,3}} = \vec{x}_2, \quad (9)$$

as can be easily seen by examining Equation 8.

In general, when the index of the $\vec{y}$ component is equal to the second index of W, the derivative will be non-zero, but will be zero otherwise.
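These two facts, that $\partial \vec{y}_3 / \partial W_{7,8} = 0$ and $\partial \vec{y}_3 / \partial W_{2,3} = \vec{x}_2$, can also be confirmed numerically. The sketch below (illustrative sizes and names, not from the original notes) perturbs individual entries of W in the row-vector model $\vec{y} = \vec{x} W$.

```python
import numpy as np

C, D = 9, 8                           # need C >= 8 so that W has an 8th column
rng = np.random.default_rng(3)
W = rng.standard_normal((D, C))       # D rows by C columns
x = rng.standard_normal(D)

def y3(W):
    """The 3rd component of ~y = ~x W (0-based index 2)."""
    return (x @ W)[2]

def dy3_dW(j, k, eps=1e-6):
    """Finite-difference estimate of the partial of y_3 with respect to W_{j,k}."""
    dW = np.zeros_like(W)
    dW[j, k] = eps
    return (y3(W + dW) - y3(W)) / eps

print(np.isclose(dy3_dW(6, 7), 0.0))   # dy_3/dW_{7,8} = 0: W_{7,8} plays no role
print(np.isclose(dy3_dW(1, 2), x[1]))  # dy_3/dW_{2,3} = x_2, as in Equation 9
```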

We can write

$$\frac{\partial \vec{y}_j}{\partial W_{i,j}} = \vec{x}_i,$$

but the other elements of the 3-d array will be 0. If we let F represent the 3-d array representing the derivative of $\vec{y}$ with respect to W, where

$$F_{i,j,k} = \frac{\partial \vec{y}_i}{\partial W_{j,k}},$$

then $F_{i,j,i} = \vec{x}_j$, but all other entries of F are zero.

Finally, if we define a new two-dimensional array G as

$$G_{i,j} = F_{i,j,i},$$

we can see that all of the information we need about F can be stored in G, and that the non-trivial portion of F is really two-dimensional, not three-dimensional. Representing the important part of derivative arrays in a compact way is critical to efficient implementations of neural networks.
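As a concrete sketch of what this looks like (illustrative sizes; F and G follow the definitions above), the full three-dimensional array F can be built from the formula, checked against finite differences, and then compressed into the two-dimensional array G.

```python
import numpy as np

C, D = 4, 6
rng = np.random.default_rng(4)
W = rng.standard_normal((D, C))       # D rows by C columns
x = rng.standard_normal(D)            # row vector ~x, so ~y = ~x W has C components

# F[i, j, k] = dy_i / dW_{j,k}.  From the text: F[i, j, i] = x[j], all else 0.
F = np.zeros((C, D, C))
for i in range(C):
    for j in range(D):
        F[i, j, i] = x[j]

# Check F against a brute-force finite-difference computation.
eps = 1e-6
F_numeric = np.zeros_like(F)
for j in range(D):
    for k in range(C):
        dW = np.zeros_like(W)
        dW[j, k] = eps
        F_numeric[:, j, k] = (x @ (W + dW) - x @ W) / eps
print(np.allclose(F, F_numeric))      # True

# The non-trivial part of F fits in the two-dimensional array G[i, j] = F[i, j, i].
G = np.array([[F[i, j, i] for j in range(D)] for i in range(C)])
print(G.shape)                              # (C, D)
print(np.allclose(G, np.tile(x, (C, 1))))   # every row of G is just ~x here
```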

4 Multiple data points

It is a good exercise to repeat some of the previous examples, but using multiple examples of $\vec{x}$, stacked together to form a matrix X. Let's assume that each individual $\vec{x}$ is a row vector of length D, and that X is a two-dimensional array with N rows and D columns. W, as in our last example, will be a matrix with D rows and C columns. Y, given by

$$Y = XW,$$

will also be a matrix, with N rows and C columns. Thus, each row of Y will give a row vector associated with the corresponding row of the input X.

Sticking to our technique of writing down an expression for a given component of the output, we have

$$Y_{i,j} = \sum_{k=1}^{D} X_{i,k} W_{k,j}.$$

We can see immediately from this equation that among the derivatives

$$\frac{\partial Y_{a,b}}{\partial X_{c,d}},$$

they are all zero unless $a = c$. That is, since each component of Y is computed using only the corresponding row of X, derivatives of components between different rows of Y and X are all zero.

Furthermore, we can see that

$$\frac{\partial Y_{i,j}}{\partial X_{i,k}} = W_{k,j} \quad (10)$$

doesn't depend at all upon which row of Y and X we are considering. In fact, the matrix W holds all of these partials as it is; we just have to remember to index into it according to Equation 10 to obtain the specific partial derivative that we want.

If we let $Y_{i,:}$ be the ith row of Y and let $X_{i,:}$ be the ith row of X, then we see that

$$\frac{\partial Y_{i,:}}{\partial X_{i,:}} = W,$$

which is a simple generalization of our previous result from Equation 7.
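A numerical sketch of the multiple-data-point case (illustrative sizes and names): perturbing a single entry $X_{i,k}$ changes only row i of Y, and the nonzero partials within that row match Equation 10.

```python
import numpy as np

N, D, C = 3, 6, 4
rng = np.random.default_rng(5)
X = rng.standard_normal((N, D))       # N data points, each a row vector of length D
W = rng.standard_normal((D, C))       # D rows by C columns
Y = X @ W                             # N rows by C columns

def dY_dX(c, d, eps=1e-6):
    """Finite-difference estimate of dY/dX_{c,d}, an N-by-C array of partials."""
    dX = np.zeros_like(X)
    dX[c, d] = eps
    return ((X + dX) @ W - Y) / eps

# Perturbing X[1, 2] only changes row 1 of Y ...
P = dY_dX(1, 2)
print(np.allclose(P[[0, 2], :], 0.0))   # rows other than row 1 are untouched
# ... and the partials within that row are the corresponding row of W (Equation 10):
print(np.allclose(P[1, :], W[2, :]))    # dY_{1,j}/dX_{1,2} = W_{2,j} for every j
```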

5 The chain rule in combination with vectors and matrices

Now that we have worked through a couple of basic examples, let's combine these ideas with an example of the chain rule. Again, assuming $\vec{y}$ and $\vec{x}$ are column vectors, let's start with the equation

$$\vec{y} = V W \vec{x},$$

and try to compute the derivative of $\vec{y}$ with respect to $\vec{x}$. We could simply observe that the product of two matrices V and W is simply another matrix, call it U, and therefore

$$\frac{d\vec{y}}{d\vec{x}} = VW = U.$$

However, we want to go through the process of using the chain rule to define intermediate results, so that we can see how the chain rule applies in the context of non-scalar derivatives.

Let us define the intermediate result

$$\vec{m} = W \vec{x}.$$

Then we have that

$$\vec{y} = V \vec{m}.$$

We can then write, using the chain rule, that

$$\frac{d\vec{y}}{d\vec{x}} = \frac{d\vec{y}}{d\vec{m}} \frac{d\vec{m}}{d\vec{x}}.$$

To make sure that we know exactly what this means, let's take the old approach of analyzing one component at a time, starting with a single component of $\vec{y}$ and a single component of $\vec{x}$:

$$\frac{d\vec{y}_i}{d\vec{x}_j} = \frac{d\vec{y}_i}{d\vec{m}} \frac{d\vec{m}}{d\vec{x}_j}.$$

But how exactly should we interpret the product on the right? The idea with the chain rule is to multiply the change in $\vec{y}_i$ with respect to each scalar intermediate variable by the change in the scalar intermediate variable with respect to $\vec{x}_j$.
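This points toward the standard interpretation: sum that product over each scalar intermediate variable $\vec{m}_k$. As a hedged sketch (illustrative sizes; the size K of $\vec{m}$ and the variable names are assumptions, not from the original text), the finite-difference Jacobian of $\vec{y} = VW\vec{x}$ matches both the matrix product VW and the componentwise sum over the intermediate variable $\vec{m}$.

```python
import numpy as np

K, C, D = 5, 4, 6                     # ~m has K components; V is C-by-K, W is K-by-D
rng = np.random.default_rng(6)
V = rng.standard_normal((C, K))
W = rng.standard_normal((K, D))
x = rng.standard_normal(D)

def f(x):
    m = W @ x                         # intermediate result ~m = W ~x
    return V @ m                      # ~y = V ~m

# Finite-difference Jacobian d~y/d~x, one column per component of ~x.
eps = 1e-6
J = np.zeros((C, D))
for j in range(D):
    dx = np.zeros(D)
    dx[j] = eps
    J[:, j] = (f(x + dx) - f(x)) / eps

print(np.allclose(J, V @ W))          # d~y/d~x = VW = U

# Chain rule, one component at a time:
# dy_i/dx_j = sum over k of (dy_i/dm_k) * (dm_k/dx_j) = sum_k V[i, k] * W[k, j].
i, j = 2, 3
chain = sum(V[i, k] * W[k, j] for k in range(K))
print(np.isclose(J[i, j], chain))     # True
```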

