Mat Deriv
Dawen Liang
Carnegie Mellon University
[email protected]
1 Introduction
Matrix computation plays an essential role in many machine learning algorithms, and matrix calculus is the most commonly used tool for deriving them. In this note we show, starting from the properties of differential calculus, that they all carry over to matrix calculus¹. In the end, an example on least-square linear regression is presented.
2 Notation
A matrix is represented as a bold upper-case letter, e.g. $\mathbf{X}$, where $\mathbf{X}_{m,n}$ indicates that the numbers of rows and columns are $m$ and $n$, respectively. A vector is represented as a bold lower-case letter, e.g. $\mathbf{x}$, and is an $n \times 1$ column vector throughout this note. An important concept for an $n \times n$ matrix $\mathbf{A}_{n,n}$ is the trace $\mathrm{Tr}(\mathbf{A})$, defined as the sum of the diagonal:
$$\mathrm{Tr}(\mathbf{A}) = \sum_{i=1}^{n} A_{ii} \qquad (1)$$
where $A_{ii}$ denotes the element at the $i$th row and $i$th column.
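For example, for $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $\mathrm{Tr}(\mathbf{A}) = 1 + 4 = 5$.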
3 Properties
The derivative with respect to a matrix is usually referred to as the gradient, denoted $\nabla$. Consider a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$; the gradient of $f(\mathbf{A})$ w.r.t. $\mathbf{A}_{m,n}$ is:
$$\nabla_{\mathbf{A}} f(\mathbf{A}) = \frac{\partial f(\mathbf{A})}{\partial \mathbf{A}} = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{pmatrix}$$
This definition is very similar to the ordinary scalar derivative, and thus a few simple properties hold (each matrix $\mathbf{A}$ below is square, with dimensions matching the vectors):
$$\nabla_{\mathbf{x}}\, \mathbf{b}^T \mathbf{A} \mathbf{x} = \mathbf{A}^T \mathbf{b} \qquad (2)$$
¹Some of the detailed derivations which are omitted in this note can be found at http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{X} \mathbf{A} \mathbf{Y}) = \mathbf{X}^T \mathbf{Y}^T \qquad (3)$$
$$\nabla_{\mathbf{x}}\, \mathbf{x}^T \mathbf{A} \mathbf{x} = \mathbf{A}\mathbf{x} + \mathbf{A}^T \mathbf{x} \qquad (4)$$
$$\nabla_{\mathbf{A}^T} f(\mathbf{A}) = \left(\nabla_{\mathbf{A}} f(\mathbf{A})\right)^T \qquad (5)$$
where superscript T denotes the transpose of a matrix or a vector.
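These identities can be checked numerically with central differences. The sketch below is not part of the original note; it uses NumPy, and the helper num_grad is a hypothetical name introduced here for illustration.
\begin{verbatim}
import numpy as np

def num_grad(f, x, eps=1e-6):
    # Central-difference estimate of the gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

# Eq. (2): grad_x b^T A x = A^T b
assert np.allclose(num_grad(lambda v: b @ A @ v, x), A.T @ b, atol=1e-5)
# Eq. (4): grad_x x^T A x = A x + A^T x
assert np.allclose(num_grad(lambda v: v @ A @ v, x), A @ x + A.T @ x, atol=1e-4)
\end{verbatim}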
Now let us turn to the derivatives of the trace. First of all, a few useful properties of the trace itself:
$$\mathrm{Tr}(\mathbf{A}) = \mathrm{Tr}(\mathbf{A}^T) \qquad (6)$$
$$\mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{Tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \mathrm{Tr}(\mathbf{C}\mathbf{A}\mathbf{B}) \qquad (7)$$
$$\mathrm{Tr}(\mathbf{A} + \mathbf{B}) = \mathrm{Tr}(\mathbf{A}) + \mathrm{Tr}(\mathbf{B}) \qquad (8)$$
which are all easily derived. Note that the second one can be extended to the more general case of an arbitrary number of matrices: the trace is invariant under any cyclic permutation of a matrix product.
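These properties are easy to confirm numerically as well; a minimal NumPy check (again, not part of the original note):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

assert np.isclose(np.trace(A), np.trace(A.T))                  # Eq. (6)
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))    # Eq. (7)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))    # Eq. (7)
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # Eq. (8)
\end{verbatim}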
Thus, for the derivatives,
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T \qquad (9)$$
Proof:
Just expand $\mathrm{Tr}(\mathbf{A}\mathbf{B})$ according to the trace definition (Eq. 1): $\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \sum_i \sum_j A_{ij} B_{ji}$, so $\partial\,\mathrm{Tr}(\mathbf{A}\mathbf{B}) / \partial A_{ij} = B_{ji} = (\mathbf{B}^T)_{ij}$.
Similarly,
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^T\mathbf{C}) = \mathbf{C}\mathbf{A}\mathbf{B} + \mathbf{C}^T\mathbf{A}\mathbf{B}^T \qquad (10)$$
Proof:
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^T\mathbf{C}) = \nabla_{\mathbf{A}}\, \mathrm{Tr}\big(\underbrace{(\mathbf{A}\mathbf{B})}_{u(\mathbf{A})}\, \underbrace{(\mathbf{A}^T\mathbf{C})}_{v(\mathbf{A}^T)}\big)$$
$$= \nabla_{\mathbf{A}:u(\mathbf{A})}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big) + \nabla_{\mathbf{A}:v(\mathbf{A}^T)}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big)$$
$$= (\mathbf{B}\mathbf{A}^T\mathbf{C})^T + \big(\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big)\big)^T = \mathbf{C}^T\mathbf{A}\mathbf{B}^T + \big((\mathbf{C}\mathbf{A}\mathbf{B})^T\big)^T = \mathbf{C}^T\mathbf{A}\mathbf{B}^T + \mathbf{C}\mathbf{A}\mathbf{B}$$
Here we make use of the product rule for derivatives: $(u(x)v(x))' = u'(x)v(x) + u(x)v'(x)$. The notation $\nabla_{\mathbf{A}:u(\mathbf{A})}$ means taking the derivative w.r.t. $\mathbf{A}$ only through $u(\mathbf{A})$; the same applies to $\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}$. The chain rule is used here, and the conversion from $\nabla_{\mathbf{A}:v(\mathbf{A}^T)}$ to $\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}$ is based on Eq. 5.
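Both trace-gradient identities can be verified with the same finite-difference approach; the sketch below re-defines the hypothetical num_grad helper from the earlier check so that it stands alone:
\begin{verbatim}
import numpy as np

def num_grad(f, A, eps=1e-6):
    # Central-difference estimate of the gradient of scalar f at matrix A.
    G = np.zeros_like(A)
    for i in range(A.size):
        E = np.zeros_like(A)
        E.flat[i] = eps
        G.flat[i] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))

# Eq. (9): grad_A Tr(AB) = B^T
B = rng.standard_normal((4, 3))
assert np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-5)

# Eq. (10): grad_A Tr(A B A^T C) = CAB + C^T A B^T
B = rng.standard_normal((4, 4))
C = rng.standard_normal((3, 3))
G = num_grad(lambda M: np.trace(M @ B @ M.T @ C), A)
assert np.allclose(G, C @ A @ B + C.T @ A @ B.T, atol=1e-4)
\end{verbatim}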
4 Example: Least-Square Linear Regression
Assume we have $N$ data points $\{\mathbf{x}^{(i)}, y^{(i)}\}_{1:N}$, and the linear regression function $h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x}$ is parametrized by $\boldsymbol{\theta}$. We can rearrange the data in matrix form:
$$\mathbf{X} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ \vdots \\ (\mathbf{x}^{(N)})^T \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{pmatrix}$$
Then the vector of residuals can be written compactly as
$$\mathbf{X}\boldsymbol{\theta} - \mathbf{y} = \begin{pmatrix} h_{\boldsymbol{\theta}}(\mathbf{x}^{(1)}) - y^{(1)} \\ h_{\boldsymbol{\theta}}(\mathbf{x}^{(2)}) - y^{(2)} \\ \vdots \\ h_{\boldsymbol{\theta}}(\mathbf{x}^{(N)}) - y^{(N)} \end{pmatrix}$$
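A small NumPy sketch (with synthetic data, not from the original note) confirms that $\mathbf{X}\boldsymbol{\theta} - \mathbf{y}$ stacks the per-example residuals $h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - y^{(i)}$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = rng.standard_normal((N, d))               # row i holds (x^(i))^T
theta = rng.standard_normal(d)
y = X @ theta + 0.1 * rng.standard_normal(N)  # synthetic noisy targets

h = lambda x: theta @ x                       # h_theta(x) = theta^T x
residuals = np.array([h(X[i]) - y[i] for i in range(N)])
assert np.allclose(X @ theta - y, residuals)
\end{verbatim}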