Vector and Matrix Calculus: Herman Kamper 30 January 2013


Vector and Matrix Calculus

Herman Kamper

30 January 2013

1 Introduction

As explained in detail in [1], there are unfortunately multiple competing notations for
the layout of matrix derivatives. This can cause considerable difficulty when consulting several
sources, since different sources may use different conventions. Some sources, for example [2]
(from which I use a lot of identities), even use a mixed layout (according to [1, Notes]). Identities
for both the numerator layout (sometimes called the Jacobian formulation) and the denominator
layout (sometimes called the Hessian formulation) are given in [1], which makes it easy to check
which layout a particular source uses. I will aim to stick to the denominator layout, which seems
to be the most widely used in the field of statistics and pattern recognition (e.g. [3] and [4,
pp. 327–332]). Other useful references concerning matrix calculus include [5] and [6]. In this
document column vectors are assumed in all cases except where specifically stated otherwise.

Table 1: Derivatives of scalars, vector functions and matrices [1, 6].

                         scalar y                     column vector y ∈ R^m      matrix Y ∈ R^(m×n)
scalar x                 scalar ∂y/∂x                 row vector ∂y/∂x ∈ R^m     matrix ∂Y/∂x (only numerator layout)
column vector x ∈ R^n    column vector ∂y/∂x ∈ R^n    matrix ∂y/∂x ∈ R^(n×m)     -
matrix X ∈ R^(p×q)       matrix ∂y/∂X ∈ R^(p×q)       -                          -

2 Definitions

Table 1 indicates the six possible kinds of derivatives when using the denominator layout. Using
this layout notation consistently, we have the following definitions.
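The derivative of a scalar function $f : \mathbb{R}^n \to \mathbb{R}$ with respect to vector $\mathbf{x} \in \mathbb{R}^n$ is

\[
\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(\mathbf{x})}{\partial x_1} \\
\frac{\partial f(\mathbf{x})}{\partial x_2} \\
\vdots \\
\frac{\partial f(\mathbf{x})}{\partial x_n}
\end{bmatrix}
\tag{1}
\]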

This is the transpose of the gradient (some authors simply call this the gradient, irrespective of
whether numerator or denominator layout is used).

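The derivative of a vector function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) & f_2(\mathbf{x}) & \ldots & f_m(\mathbf{x}) \end{bmatrix}^T$
and $\mathbf{x} \in \mathbb{R}^n$, with respect to scalar $x_i$ is

\[
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_i} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f_1(\mathbf{x})}{\partial x_i} & \frac{\partial f_2(\mathbf{x})}{\partial x_i} & \ldots & \frac{\partial f_m(\mathbf{x})}{\partial x_i}
\end{bmatrix}
\tag{2}
\]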

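The derivative of a vector function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) & f_2(\mathbf{x}) & \ldots & f_m(\mathbf{x}) \end{bmatrix}^T$,
with respect to vector $\mathbf{x} \in \mathbb{R}^n$ is

\[
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} \\
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_2} \\
\vdots \\
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial f_1(\mathbf{x})}{\partial x_1} & \frac{\partial f_2(\mathbf{x})}{\partial x_1} & \ldots & \frac{\partial f_m(\mathbf{x})}{\partial x_1} \\
\frac{\partial f_1(\mathbf{x})}{\partial x_2} & \frac{\partial f_2(\mathbf{x})}{\partial x_2} & \ldots & \frac{\partial f_m(\mathbf{x})}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1(\mathbf{x})}{\partial x_n} & \frac{\partial f_2(\mathbf{x})}{\partial x_n} & \ldots & \frac{\partial f_m(\mathbf{x})}{\partial x_n}
\end{bmatrix}
\tag{3}
\]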

This is just the transpose of the Jacobian matrix.
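
The shape convention in (3) can be checked numerically. The following is a minimal sketch using central finite differences and only the Python standard library; the function names and the test function `example_f` are illustrative assumptions, not part of the original notes.

```python
def denominator_layout_derivative(f, x, eps=1e-6):
    """Approximate the derivative of f: R^n -> R^m in denominator layout.

    Returns an n-by-m nested list whose entry [i][j] approximates
    d f_j / d x_i, i.e. the transpose of the Jacobian matrix.
    """
    rows = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        f_plus = f(x_plus)
        f_minus = f(x_minus)
        # Row i holds the derivative of every component of f w.r.t. x_i.
        rows.append([(fp - fm) / (2 * eps) for fp, fm in zip(f_plus, f_minus)])
    return rows


def example_f(x):
    # Hypothetical test function f: R^2 -> R^3 (an assumption for illustration).
    return [x[0] * x[1], x[0] ** 2, 3.0 * x[1]]


D = denominator_layout_derivative(example_f, [2.0, 5.0])
# D has n = 2 rows and m = 3 columns. Analytically the rows are
# [x2, 2*x1, 0] and [x1, 0, 3], evaluated at x = (2, 5).
```

With n = 2 inputs and m = 3 outputs the result is a 2-by-3 array, whereas the Jacobian (numerator layout) would be 3-by-2.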

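The derivative of a scalar function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ with respect to matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is

\[
\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(\mathbf{X})}{\partial X_{11}} & \frac{\partial f(\mathbf{X})}{\partial X_{12}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial X_{1n}} \\
\frac{\partial f(\mathbf{X})}{\partial X_{21}} & \frac{\partial f(\mathbf{X})}{\partial X_{22}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial X_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(\mathbf{X})}{\partial X_{m1}} & \frac{\partial f(\mathbf{X})}{\partial X_{m2}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial X_{mn}}
\end{bmatrix}
\tag{4}
\]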

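Observe that (1) is just a special case of (4) for column vectors. Often (as in [3]) the gradient
notation is used as an alternative to the notation used above, for example:

\[
\nabla_{\mathbf{x}} f(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}
\tag{5}
\]

\[
\nabla_{\mathbf{X}} f(\mathbf{X}) = \frac{\partial f(\mathbf{X})}{\partial \mathbf{X}}
\tag{6}
\]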

3 Identities

3.1 Scalar-by-vector product rule

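If $\mathbf{a} \in \mathbb{R}^m$, $\mathbf{b} \in \mathbb{R}^n$ and $\mathbf{C} \in \mathbb{R}^{m \times n}$ then

\[
\mathbf{a}^T \mathbf{C} \mathbf{b}
= \sum_{i=1}^{m} a_i (\mathbf{C}\mathbf{b})_i
= \sum_{i=1}^{m} a_i \left( \sum_{j=1}^{n} C_{ij} b_j \right)
= \sum_{i=1}^{m} \sum_{j=1}^{n} C_{ij} a_i b_j
\tag{7}
\]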
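
As a small worked illustration of how (7) combines with definition (4), consider treating the matrix as the variable. This example is an addition for illustration; the symbol $\mathbf{X}$ in place of $\mathbf{C}$ is an assumption of the example. Element $(k, l)$ of the derivative picks out the single term of the double sum that contains $X_{kl}$:

```latex
\[
\left[ \frac{\partial\, \mathbf{a}^T \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} \right]_{kl}
= \frac{\partial}{\partial X_{kl}} \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} a_i b_j
= a_k b_l
\quad \Longrightarrow \quad
\frac{\partial\, \mathbf{a}^T \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^T
\]
```

The result is an $m \times n$ matrix, the same shape as $\mathbf{X}$, as (4) requires.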

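Now assume we have vector functions $\mathbf{u} : \mathbb{R}^q \to \mathbb{R}^m$, $\mathbf{v} : \mathbb{R}^q \to \mathbb{R}^n$ and a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. The vector
functions $\mathbf{u}$ and $\mathbf{v}$ are functions of $\mathbf{x} \in \mathbb{R}^q$, but $\mathbf{A}$ is not. We want to find an identity for

\[
\frac{\partial\, \mathbf{u}^T \mathbf{A} \mathbf{v}}{\partial \mathbf{x}}
\tag{8}
\]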

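From (7), we have:

\[
\begin{aligned}
\left[ \frac{\partial\, \mathbf{u}^T \mathbf{A} \mathbf{v}}{\partial \mathbf{x}} \right]_l
= \frac{\partial\, \mathbf{u}^T \mathbf{A} \mathbf{v}}{\partial x_l}
&= \frac{\partial}{\partial x_l} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i v_j \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial}{\partial x_l} \left[ A_{ij} u_i v_j \right] \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \left[ \frac{\partial u_i}{\partial x_l} v_j + u_i \frac{\partial v_j}{\partial x_l} \right] \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial u_i}{\partial x_l} A_{ij} v_j
 + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l}
\end{aligned}
\tag{9}
\]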

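Now we can show (by writing out the elements [Notebook, 2012-05-22]) that:

\[
\begin{aligned}
\left[ \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{A} \mathbf{v} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{A}^T \mathbf{u} \right]_l
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial u_i}{\partial x_l} A_{ij} v_j
 + \sum_{i=1}^{m} \sum_{j=1}^{n} (\mathbf{A}^T)_{ji} \frac{\partial v_j}{\partial x_l} u_i \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial u_i}{\partial x_l} A_{ij} v_j
 + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l}
\end{aligned}
\tag{10}
\]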

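A comparison of (9) and (10) completes the proof that

\[
\frac{\partial\, \mathbf{u}^T \mathbf{A} \mathbf{v}}{\partial \mathbf{x}}
= \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{A} \mathbf{v}
+ \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{A}^T \mathbf{u}
\tag{11}
\]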
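
Identity (11) can be sanity-checked numerically. The sketch below compares a central finite-difference derivative of the scalar u^T A v with the right-hand side of (11), using hand-derived denominator-layout derivatives of u and v. The particular functions `u`, `v` and the matrix `A` are hypothetical choices made only for this illustration (q = 2, m = 2, n = 3).

```python
def matvec(M, vec):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vec)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Hypothetical example functions of x in R^2 (assumptions for illustration).
def u(x):
    return [x[0] * x[1], x[0] + x[1]]          # u: R^2 -> R^2


def v(x):
    return [x[0] ** 2, x[1], x[0] - x[1]]      # v: R^2 -> R^3


def du_dx(x):
    # Denominator layout (q-by-m): entry [l][i] is d u_i / d x_l.
    return [[x[1], 1.0], [x[0], 1.0]]


def dv_dx(x):
    # Denominator layout (q-by-n): entry [l][j] is d v_j / d x_l.
    return [[2.0 * x[0], 0.0, 1.0], [0.0, 1.0, -1.0]]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]         # constant m-by-n matrix


def scalar(x):
    """The scalar u(x)^T A v(x)."""
    return sum(u_i * av_i for u_i, av_i in zip(u(x), matvec(A, v(x))))


def rhs(x):
    """Right-hand side of (11): (du/dx) A v + (dv/dx) A^T u."""
    first = matvec(du_dx(x), matvec(A, v(x)))
    second = matvec(dv_dx(x), matvec(transpose(A), u(x)))
    return [a + b for a, b in zip(first, second)]


def lhs_numeric(x, eps=1e-6):
    """Central-difference approximation of d(u^T A v)/dx."""
    grad = []
    for l in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[l] += eps
        x_minus[l] -= eps
        grad.append((scalar(x_plus) - scalar(x_minus)) / (2 * eps))
    return grad


x0 = [0.7, -1.3]
max_error = max(abs(a - b) for a, b in zip(lhs_numeric(x0), rhs(x0)))
```

For these choices `max_error` should be close to zero, limited only by floating-point rounding in the finite differences.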

3.2 Useful identities from scalar-by-vector product rule

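From (11) it follows, with vectors and matrices $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{d} \in \mathbb{R}^q$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{B} \in \mathbb{R}^{m \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times q}$,
$\mathbf{D} \in \mathbb{R}^{q \times n}$, that

\[
\frac{\partial\, (\mathbf{B}\mathbf{x} + \mathbf{b})^T \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}}
= \frac{\partial (\mathbf{B}\mathbf{x} + \mathbf{b})}{\partial \mathbf{x}} \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d})
+ \frac{\partial (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}} \mathbf{C}^T (\mathbf{B}\mathbf{x} + \mathbf{b})
\tag{12}
\]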
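resulting in the identity:

\[
\frac{\partial\, (\mathbf{B}\mathbf{x} + \mathbf{b})^T \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}}
= \mathbf{B}^T \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d}) + \mathbf{D}^T \mathbf{C}^T (\mathbf{B}\mathbf{x} + \mathbf{b})
\tag{13}
\]

by using the easily verifiable identities:

\[
\frac{\partial (\mathbf{u}(\mathbf{x}) + \mathbf{v}(\mathbf{x}))}{\partial \mathbf{x}}
= \frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}} + \frac{\partial \mathbf{v}(\mathbf{x})}{\partial \mathbf{x}}
\tag{14}
\]

\[
\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}^T
\tag{15}
\]

\[
\frac{\partial \mathbf{a}}{\partial \mathbf{x}} = \mathbf{0}
\tag{16}
\]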
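
As a worked illustration of (13), consider the familiar least-squares cost. This example is an addition; the symbols $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{y} \in \mathbb{R}^m$ are assumptions introduced only for it. Choosing $\mathbf{B} = \mathbf{D} = \mathbf{A}$, $\mathbf{b} = \mathbf{d} = -\mathbf{y}$ and $\mathbf{C} = \mathbf{I}$ (so that $q = m$) in (13) gives:

```latex
\[
\frac{\partial\, (\mathbf{A}\mathbf{x} - \mathbf{y})^T (\mathbf{A}\mathbf{x} - \mathbf{y})}{\partial \mathbf{x}}
= \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{y}) + \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{y})
= 2 \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{y})
\]
```

The result is a column vector in $\mathbb{R}^n$, consistent with the denominator layout.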

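Some other useful special cases of (11):

\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{b}}{\partial \mathbf{x}} = \mathbf{A}\mathbf{b}
\tag{17}
\]

\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}
\tag{18}
\]

\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x} \quad \text{if } \mathbf{A} \text{ is symmetric}
\tag{19}
\]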
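
Identity (18) is easy to verify numerically. The following sketch compares (A + A^T)x with a central finite-difference gradient of the quadratic form; the matrix `A`, the point `x0` and all function names are hypothetical values chosen for illustration.

```python
def quadratic_form(A, x):
    """Compute the scalar x^T A x for a square nested-list matrix A."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_gradient(A, x):
    """Right-hand side of (18): (A + A^T) x."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]


def numeric_gradient(A, x, eps=1e-6):
    """Central-difference approximation of d(x^T A x)/dx."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((quadratic_form(A, x_plus) - quadratic_form(A, x_minus)) / (2 * eps))
    return grad


# A deliberately non-symmetric example matrix (an assumption for illustration).
A = [[2.0, -1.0, 0.5],
     [3.0, 0.0, 1.0],
     [-2.0, 4.0, 1.5]]
x0 = [1.0, -2.0, 0.5]

difference = [a - b for a, b in zip(analytic_gradient(A, x0), numeric_gradient(A, x0))]
```

Because `A` is not symmetric here, replacing `analytic_gradient` with 2Ax would not match the numerical gradient, which illustrates why (19) carries the symmetry condition.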

3.3 Derivatives of determinant

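See [7, p. 374] for the definition of cofactors. Also see [Notebook, 2012-05-22].
We can write the determinant of matrix $\mathbf{X} \in \mathbb{R}^{n \times n}$ as a cofactor expansion along any row $i$:

\[
|\mathbf{X}| = X_{i1} C_{i1} + X_{i2} C_{i2} + \ldots + X_{in} C_{in} = \sum_{j=1}^{n} X_{ij} C_{ij}
\tag{20}
\]

Thus the derivative will be

\[
\begin{aligned}
\left[ \frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} \right]_{kl}
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{i1} C_{i1} + X_{i2} C_{i2} + \ldots + X_{in} C_{in} \right\} \\
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{k1} C_{k1} + X_{k2} C_{k2} + \ldots + X_{kn} C_{kn} \right\}
\quad \text{(can choose $i$ any number, so choose $i = k$)} \\
&= C_{kl}
\end{aligned}
\tag{21}
\]

The last step holds because none of the cofactors $C_{k1}, \ldots, C_{kn}$ depends on the elements of row $k$.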

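Thus (see [7, p. 386])

\[
\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = \operatorname{cofactor} \mathbf{X} = (\operatorname{adj} \mathbf{X})^T
\tag{22}
\]

But we know that the inverse of $\mathbf{X}$ is given by [7, p. 387]

\[
\mathbf{X}^{-1} = \frac{1}{|\mathbf{X}|} \operatorname{adj} \mathbf{X}
\tag{23}
\]

\[
\Rightarrow \operatorname{adj} \mathbf{X} = |\mathbf{X}| \mathbf{X}^{-1}
\tag{24}
\]

which, when substituted into (22), results in the identity

\[
\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = |\mathbf{X}| (\mathbf{X}^{-1})^T
\tag{25}
\]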

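From (25) we can also write

\[
\left[ \frac{\partial \ln |\mathbf{X}|}{\partial \mathbf{X}} \right]_{kl}
= \frac{\partial \ln |\mathbf{X}|}{\partial X_{kl}}
= \frac{1}{|\mathbf{X}|} \frac{\partial |\mathbf{X}|}{\partial X_{kl}}
= \frac{1}{|\mathbf{X}|} \left[ |\mathbf{X}| (\mathbf{X}^{-1})^T \right]_{kl}
\tag{26}
\]

giving the identity

\[
\frac{\partial \ln |\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^T
\tag{27}
\]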
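
Identity (27) can be checked numerically for a small case. The sketch below is restricted to 2-by-2 matrices with positive determinant so that the determinant and inverse have simple closed forms; the example matrix `X0` and the function names are assumptions made for illustration.

```python
import math


def det2(X):
    """Determinant of a 2-by-2 nested-list matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]


def inverse_transposed2(X):
    """Closed-form (X^{-1})^T for a 2-by-2 matrix: the right-hand side of (27)."""
    d = det2(X)
    return [[X[1][1] / d, -X[1][0] / d],
            [-X[0][1] / d, X[0][0] / d]]


def numeric_dlogdet(X, eps=1e-6):
    """Central-difference approximation of d ln|X| / dX, element by element."""
    G = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(2):
        for l in range(2):
            X_plus = [row[:] for row in X]
            X_minus = [row[:] for row in X]
            X_plus[k][l] += eps
            X_minus[k][l] -= eps
            G[k][l] = (math.log(det2(X_plus)) - math.log(det2(X_minus))) / (2 * eps)
    return G


# Example matrix with positive determinant (|X0| = 10), so ln|X0| is defined.
X0 = [[3.0, 1.0], [2.0, 4.0]]

analytic = inverse_transposed2(X0)
numeric = numeric_dlogdet(X0)
max_error = max(abs(analytic[k][l] - numeric[k][l]) for k in range(2) for l in range(2))
```

The off-diagonal entries make the transpose in (27) visible: the derivative with respect to X_12 involves X_21, not X_12.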


References

[1] Matrix calculus. [Online]. Available: calculus

[2] K. B. Petersen and M. S. Pedersen, “The matrix cookbook,” 2008.
[3] A. Ng, Machine Learning. Class notes for CS229, Stanford Engineering Everywhere,
Stanford University, 2008. [Online]. Available:
[4] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley & Sons, 1982.
[5] J. R. Schott, Matrix Analysis for Statistics. New York, NY: John Wiley & Sons, 1996.
[6] T. P. Minka, “Old and new matrix algebra useful for statistics,” 2000. [Online]. Available:
[7] D. G. Zill and M. R. Cullen, Advanced Engineering Mathematics, 3rd ed. Jones and Bartlett,
