Vector, Matrix, and Tensor Derivatives
Erik Learned-Miller
The purpose of this document is to help you learn to take derivatives of vectors, matrices,
and higher order tensors (arrays with three dimensions or more), and to help you take
derivatives with respect to vectors, matrices, and higher order tensors.
1 Simplify, Simplify, Simplify

1.1 Expanding notation into explicit sums and equations for each component
In order to simplify a given calculation, it is often useful to write out the explicit formula for
a single scalar element of the output in terms of nothing but scalar variables. Once one has
an explicit formula for a single scalar element of the output in terms of other scalar values,
then one can use the calculus that you used as a beginner, which is much easier than trying
to do matrix math, summations, and derivatives all at the same time.
As a first example, suppose we have a column vector ~y of length C that is calculated as the product of a matrix W, with C rows and D columns, and a column vector ~x of length D:
\[
\vec{y} = W\vec{x}. \tag{1}
\]
Suppose we want the derivative of ~y with respect to ~x. In the spirit of simplification, let us start with a single piece of this derivative, say the derivative of the 3rd component of ~y with respect to the 7th component of ~x,
\[
\frac{\partial \vec{y}_3}{\partial \vec{x}_7},
\]
which is just the derivative of one scalar with respect to another.
The first thing to do is to write down the formula for computing ~y3 so we can take its
derivative. From the definition of matrix-vector multiplication, the value ~y3 is computed by
taking the dot product between the 3rd row of W and the vector ~x:
\[
\vec{y}_3 = \sum_{j=1}^{D} W_{3,j}\,\vec{x}_j. \tag{2}
\]
At this point, we have reduced the original matrix equation (Equation 1) to a scalar equation.
This makes it much easier to compute the desired derivatives.
While it is possible to differentiate Equation 2 directly, it is often safer, when just learning, to write the sum out without any summation notation:
\[
\vec{y}_3 = W_{3,1}\vec{x}_1 + W_{3,2}\vec{x}_2 + \dots + W_{3,7}\vec{x}_7 + \dots + W_{3,D}\vec{x}_D.
\]
Of course, I have explicitly included the term that involves ~x7, since that is what we are differentiating with respect to. At this point, we can see that the expression for ~y3 only depends upon ~x7 through a single term, W3,7~x7. Since none of the other terms in the summation include ~x7, their derivatives with respect to ~x7 are all 0. Thus, we have
\[
\begin{aligned}
\frac{\partial \vec{y}_3}{\partial \vec{x}_7}
&= \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,1}\vec{x}_1 + W_{3,2}\vec{x}_2 + \dots + W_{3,7}\vec{x}_7 + \dots + W_{3,D}\vec{x}_D \right] && (3)\\
&= 0 + 0 + \dots + \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,7}\vec{x}_7 \right] + \dots + 0 && (4)\\
&= \frac{\partial}{\partial \vec{x}_7}\left[ W_{3,7}\vec{x}_7 \right] && (5)\\
&= W_{3,7}. && (6)
\end{aligned}
\]
By focusing on one component of ~y and one component of ~x, we have made the calculation
about as simple as it can be. In the future, when you are confused, it can help to try to
reduce a problem to this most basic setting to see where you are going wrong.
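If you want to convince yourself of Equation 6 numerically, the following small sketch (using NumPy; the sizes C = 5 and D = 10, the random seed, and the choice of components are arbitrary, and the indices are 0-based) compares a finite-difference estimate against the corresponding entry of W:

```python
import numpy as np

# A quick numerical check of Equation 6: for y = W x, the partial derivative
# of one component y[i] with respect to one component x[j] equals W[i, j].
# The sizes and the random seed below are arbitrary; indices are 0-based.
rng = np.random.default_rng(0)
C, D = 5, 10
W = rng.normal(size=(C, D))
x = rng.normal(size=D)

i, j = 3, 7                      # which component of y and of x to probe
eps = 1e-6
x_plus = x.copy()
x_plus[j] += eps                 # nudge the j-th component of x

finite_diff = ((W @ x_plus)[i] - (W @ x)[i]) / eps
print(finite_diff, W[i, j])      # the two values agree up to rounding error
```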
The full set of C × D partial derivatives, one for each component of ~y paired with each component of ~x, can be written out as a matrix in the following form:
\[
\begin{bmatrix}
\frac{\partial \vec{y}_1}{\partial \vec{x}_1} & \frac{\partial \vec{y}_1}{\partial \vec{x}_2} & \frac{\partial \vec{y}_1}{\partial \vec{x}_3} & \dots & \frac{\partial \vec{y}_1}{\partial \vec{x}_D} \\
\frac{\partial \vec{y}_2}{\partial \vec{x}_1} & \frac{\partial \vec{y}_2}{\partial \vec{x}_2} & \frac{\partial \vec{y}_2}{\partial \vec{x}_3} & \dots & \frac{\partial \vec{y}_2}{\partial \vec{x}_D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial \vec{y}_C}{\partial \vec{x}_1} & \frac{\partial \vec{y}_C}{\partial \vec{x}_2} & \frac{\partial \vec{y}_C}{\partial \vec{x}_3} & \dots & \frac{\partial \vec{y}_C}{\partial \vec{x}_D}
\end{bmatrix}
\]
In this particular case, this is called the Jacobian matrix, but this terminology is not too
important for our purposes.
Notice that for the equation
~y = W ~x,
the partial of ~y3 with respect to ~x7 was simply given by W3,7 . If you go through the same
process for other components, you will find that, for all i and j,
\[
\frac{\partial \vec{y}_i}{\partial \vec{x}_j} = W_{i,j}.
\]
This means that the entire matrix of partial derivatives is exactly the matrix W. In other words, for
\[
\vec{y} = W\vec{x},
\]
we have
\[
\frac{d\vec{y}}{d\vec{x}} = W.
\]
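The same kind of check can be run for the whole Jacobian at once. Here is a sketch (again NumPy, with arbitrary sizes and seed) that assembles the matrix of all partial derivatives column by column with finite differences and compares it to W:

```python
import numpy as np

# Assemble the full Jacobian of y = W x by finite differences, one column
# (one input component) at a time, and compare it entrywise to W.
rng = np.random.default_rng(1)
C, D = 4, 6                       # arbitrary sizes
W = rng.normal(size=(C, D))
x = rng.normal(size=D)

eps = 1e-6
jacobian = np.zeros((C, D))
for j in range(D):
    x_plus = x.copy()
    x_plus[j] += eps
    jacobian[:, j] = (W @ x_plus - W @ x) / eps   # column j is dy / dx_j

print(np.max(np.abs(jacobian - W)))               # near zero: dy/dx = W
```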
2 Row vectors instead of column vectors
2.1 Example 2
Let ~y be a row vector with C components computed by taking the product of another row
vector ~x with D components and a matrix W that is D rows by C columns.
~y = ~xW.
Importantly, despite the fact that ~y and ~x have the same number of components as before,
the shape of W is the transpose of the shape that we used before for W . In particular, since
we are now left-multiplying by ~x, whereas before ~x was on the right, W must be transposed
for the matrix algebra to work.
In this case, you will see, by writing
\[
\vec{y}_3 = \sum_{j=1}^{D} \vec{x}_j W_{j,3}
\]
that
\[
\frac{\partial \vec{y}_3}{\partial \vec{x}_7} = W_{7,3}.
\]
Notice that the indexing into W is the opposite from what it was in the first example.
However, when we assemble the full Jacobian matrix, we can still see that in this case as
well,
\[
\frac{d\vec{y}}{d\vec{x}} = W. \tag{7}
\]
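As before, a small numerical sketch can confirm the flipped indexing (NumPy again; the sizes, seed, and the choice of components are arbitrary, with 0-based indices):

```python
import numpy as np

# For the row-vector product y = x W (W is D x C this time), the partial
# derivative of y[i] with respect to x[j] equals W[j, i]: the indexing into
# W is the opposite of Example 1.
rng = np.random.default_rng(2)
C, D = 5, 10
W = rng.normal(size=(D, C))
x = rng.normal(size=D)

i, j = 3, 7
eps = 1e-6
x_plus = x.copy()
x_plus[j] += eps

finite_diff = ((x_plus @ W)[i] - (x @ W)[i]) / eps
print(finite_diff, W[j, i])       # agree up to rounding error
```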
3 Dealing with more than two dimensions

Now consider the derivative of ~y with respect to the matrix W itself. Since ~y varies along one index while W varies along two, the full set of partial derivatives is most naturally arranged in a three-dimensional array. To keep things simple, we again start with a single component, say the derivative of ~y3 with respect to a single entry of W such as W7,8. Writing out ~y3 from Example 2 without summation notation gives
\[
\vec{y}_3 = \vec{x}_1 W_{1,3} + \vec{x}_2 W_{2,3} + \dots + \vec{x}_D W_{D,3}. \tag{8}
\]
From this we see that W7,8 plays no role in the computation of ~y3, since it does not appear in Equation 8. In other words,
\[
\frac{\partial \vec{y}_3}{\partial W_{7,8}} = 0.
\]
However, the partials of ~y3 with respect to elements of the 3rd column of W will certainly
be non-zero. For example, the derivative of ~y3 with respect to W2,3 is given by
\[
\frac{\partial \vec{y}_3}{\partial W_{2,3}} = \vec{x}_2, \tag{9}
\]
as can be easily seen by examining Equation 8.
In general, when the index of the ~y component is equal to the second index of W , the
derivative will be non-zero, but will be zero otherwise. We can write:
\[
\frac{\partial \vec{y}_j}{\partial W_{i,j}} = \vec{x}_i,
\]
but the other elements of the 3-d array will be 0. If we let F be the three-dimensional array of derivatives of ~y with respect to W, where
\[
F_{i,j,k} = \frac{\partial \vec{y}_i}{\partial W_{j,k}},
\]
then
\[
F_{i,j,i} = \vec{x}_j,
\]
but all other entries of F are zero.
Finally, if we define a new two-dimensional array G as
\[
G_{i,j} = F_{i,j,i},
\]
we can see that all of the information we need about F can be stored in G, and that the
non-trivial portion of F is really two-dimensional, not three-dimensional.
Representing the important part of derivative arrays in a compact way is critical to
efficient implementations of neural networks.
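For concreteness, here is a sketch (NumPy, with arbitrary sizes and seed) that fills in the three-dimensional array F directly from the rule F[i, j, i] = x[j], verifies it against finite differences, and then extracts the compact two-dimensional array G:

```python
import numpy as np

# Build the 3-d derivative array F[i, j, k] = d y_i / d W[j, k] for the
# row-vector product y = x W, using the fact that F[i, j, i] = x[j] and all
# other entries are zero, then check it against finite differences.
rng = np.random.default_rng(3)
C, D = 4, 6                                # arbitrary sizes
W = rng.normal(size=(D, C))
x = rng.normal(size=D)

F = np.zeros((C, D, C))
for i in range(C):
    for j in range(D):
        F[i, j, i] = x[j]                  # the only nonzero entries

eps = 1e-6
F_numeric = np.zeros_like(F)
for j in range(D):
    for k in range(C):
        W_plus = W.copy()
        W_plus[j, k] += eps
        F_numeric[:, j, k] = (x @ W_plus - x @ W) / eps

print(np.max(np.abs(F - F_numeric)))       # near zero

# The compact two-dimensional array G[i, j] = F[i, j, i] carries all of the
# nonzero information; here every row of G is just a copy of x.
G = np.array([[F[i, j, i] for j in range(D)] for i in range(C)])
print(np.allclose(G, np.tile(x, (C, 1))))  # True
```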
4 Multiple data points

Now suppose that, instead of a single row vector ~x, we have a matrix X with N rows and D columns, where each row is a separate data point, and W is D rows by C columns as before. Then the product
\[
Y = XW
\]
will also be a matrix, with N rows and C columns. Thus, each row of Y will give a row vector associated with the corresponding row of the input X.
Sticking to our technique of writing down an expression for a given component of the
output, we have
\[
Y_{i,j} = \sum_{k=1}^{D} X_{i,k} W_{k,j}.
\]
We can see immediately from this equation that among the derivatives
\[
\frac{\partial Y_{a,b}}{\partial X_{c,d}},
\]
they are all zero unless a = c. That is, since each component of Y is computed using only
the corresponding row of X, derivatives of components between different rows of Y and X
are all zero.
Furthermore, we can see that
\[
\frac{\partial Y_{i,j}}{\partial X_{i,k}} = W_{k,j} \tag{10}
\]
doesn’t depend at all upon which row of Y and X we are comparing.
In fact, the matrix W holds all of these partials as it is; we just have to remember to index into it according to Equation 10 to obtain the specific partial derivative that we want.
If we let Yi,: be the ith row of Y and let Xi,: be the ith row of X, then we see that
\[
\frac{\partial Y_{i,:}}{\partial X_{i,:}} = W.
\]
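A short numerical sketch (NumPy, with arbitrary sizes, seed, and choice of perturbed entry) illustrates both facts: perturbing X[i, k] leaves every other row of Y unchanged and changes row i of Y by W[k, :]:

```python
import numpy as np

# For Y = X W, nudging a single entry X[i, k] should change only row i of Y,
# and the change in Y[i, j] should be W[k, j] (Equation 10).
rng = np.random.default_rng(4)
N, D, C = 3, 6, 4                  # arbitrary sizes
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, C))

i, k = 1, 2                        # arbitrary row of X and column index
eps = 1e-6
X_plus = X.copy()
X_plus[i, k] += eps
dY = (X_plus @ W - X @ W) / eps    # finite-difference change in Y

print(np.max(np.abs(np.delete(dY, i, axis=0))))   # other rows: zero
print(np.max(np.abs(dY[i] - W[k])))               # row i changes by W[k, :]
```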
5 The chain rule in combination with vectors and matrices

To see how the chain rule carries over to this setting, suppose we want the derivative with respect to ~x of a product such as ~y = V W ~x, where V and W are matrices of compatible shapes; the answer should, of course, be the matrix product V W. Let us define the intermediate result
\[
\vec{m} = W\vec{x},
\]
so that ~y = V ~m. Working one component at a time and applying the ordinary scalar chain rule, summing over the components of the intermediate result, gives
\[
\frac{\partial \vec{y}_i}{\partial \vec{x}_j}
= \sum_{k} \frac{\partial \vec{y}_i}{\partial \vec{m}_k}\,\frac{\partial \vec{m}_k}{\partial \vec{x}_j}
= \sum_{k} V_{i,k} W_{k,j},
\]
which is just the component expression for V W, our original answer to the problem.
To summarize, we can use the chain rule in the setting of vector and matrix derivatives
by
• Clearly stating intermediate results and the variables used to represent them,
• Expressing the chain rule for individual components of the final derivatives,
• Summing appropriately over the intermediate results within the chain rule expression.
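To close, here is a sketch (NumPy; the shapes of V, W, and ~x are arbitrary assumptions for the check) that follows exactly these steps numerically: it computes the intermediate result m = W x, forms the finite-difference Jacobian of y = V m with respect to x, and compares it to the product V W:

```python
import numpy as np

# Chain-rule check: for y = V (W x), the Jacobian dy/dx should be V @ W,
# since dy_i/dx_j = sum_k V[i, k] * W[k, j]. Shapes and seed are arbitrary.
rng = np.random.default_rng(5)
C, E, D = 3, 5, 4
V = rng.normal(size=(C, E))
W = rng.normal(size=(E, D))
x = rng.normal(size=D)

def forward(v):
    m = W @ v                      # intermediate result m = W x
    return V @ m                   # y = V m

eps = 1e-6
jacobian = np.zeros((C, D))
for j in range(D):
    x_plus = x.copy()
    x_plus[j] += eps
    jacobian[:, j] = (forward(x_plus) - forward(x)) / eps

print(np.max(np.abs(jacobian - V @ W)))   # near zero: dy/dx = V W
```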