Mat Deriv
Dawen Liang
Carnegie Mellon University
[email protected]
1 Introduction
Matrix computation plays an essential role in many machine learning algorithms, and matrix calculus is the most commonly used tool for deriving them. In this note we show, starting from the properties of differential calculus, that they all carry over to matrix calculus¹. In the end, an example on least-square linear regression is presented.
2 Notation
A matrix is represented as a bold upper-case letter, e.g. $\mathbf{X}$, where $\mathbf{X}_{m,n}$ indicates that the numbers of rows and columns are $m$ and $n$, respectively. A vector is represented as a bold lower-case letter, e.g. $\mathbf{x}$, and is an $n \times 1$ column vector throughout this note. An important concept for an $n \times n$ matrix $\mathbf{A}_{n,n}$ is the trace $\mathrm{Tr}(\mathbf{A})$, defined as the sum of the diagonal:
$$\mathrm{Tr}(\mathbf{A}) = \sum_{i=1}^{n} A_{ii} \qquad (1)$$
where $A_{ii}$ denotes the element at the $i$th row and $i$th column.
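For example, for $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $\mathrm{Tr}(\mathbf{A}) = 1 + 4 = 5$.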
3 Properties
The derivative with respect to a matrix is usually referred to as the gradient, denoted $\nabla$. Consider a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$; the gradient of $f(\mathbf{A})$ w.r.t. $\mathbf{A}_{m,n}$ is:
$$\nabla_{\mathbf{A}} f(\mathbf{A}) = \frac{\partial f(\mathbf{A})}{\partial \mathbf{A}} = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{pmatrix}$$
This definition is very similar to the ordinary scalar derivative, and thus a few simple properties hold (each matrix $\mathbf{A}$ below is square, with dimensions matching the vectors):
$$\nabla_{\mathbf{x}}\, \mathbf{b}^T \mathbf{A} \mathbf{x} = \mathbf{A}^T \mathbf{b} \qquad (2)$$
¹Some of the detailed derivations which are omitted in this note can be found at http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{X} \mathbf{A} \mathbf{Y}) = \mathbf{X}^T \mathbf{Y}^T \qquad (3)$$
$$\nabla_{\mathbf{x}}\, \mathbf{x}^T \mathbf{A} \mathbf{x} = \mathbf{A}\mathbf{x} + \mathbf{A}^T \mathbf{x} \qquad (4)$$
$$\nabla_{\mathbf{A}^T} f(\mathbf{A}) = \left(\nabla_{\mathbf{A}} f(\mathbf{A})\right)^T \qquad (5)$$
where superscript T denotes the transpose of a matrix or a vector.
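These identities can be checked numerically with central differences. The sketch below is not part of the original note; it uses NumPy, and the helper num_grad is a hypothetical name introduced here for illustration.
\begin{verbatim}
import numpy as np

def num_grad(f, x, eps=1e-6):
    # Central-difference estimate of the gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

# Eq. (2): grad_x b^T A x = A^T b
assert np.allclose(num_grad(lambda v: b @ A @ v, x), A.T @ b, atol=1e-5)
# Eq. (4): grad_x x^T A x = A x + A^T x
assert np.allclose(num_grad(lambda v: v @ A @ v, x), A @ x + A.T @ x, atol=1e-4)
\end{verbatim}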
Now let us turn to the derivatives of the trace. First of all, a few useful properties of the trace itself:
$$\mathrm{Tr}(\mathbf{A}) = \mathrm{Tr}(\mathbf{A}^T) \qquad (6)$$
$$\mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{Tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \mathrm{Tr}(\mathbf{C}\mathbf{A}\mathbf{B}) \qquad (7)$$
$$\mathrm{Tr}(\mathbf{A} + \mathbf{B}) = \mathrm{Tr}(\mathbf{A}) + \mathrm{Tr}(\mathbf{B}) \qquad (8)$$
which are all easily derived. Note that the second one can be extended to the more general case of an arbitrary number of matrices: the trace is invariant under any cyclic permutation of a matrix product.
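These properties are easy to confirm numerically as well; a minimal NumPy check (again, not part of the original note):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

assert np.isclose(np.trace(A), np.trace(A.T))                  # Eq. (6)
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))    # Eq. (7)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))    # Eq. (7)
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # Eq. (8)
\end{verbatim}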
Thus, for the derivatives,
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T \qquad (9)$$
Proof:
Just expand $\mathrm{Tr}(\mathbf{A}\mathbf{B})$ according to the trace definition (Eq. 1): $\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \sum_i \sum_j A_{ij} B_{ji}$, so $\partial\,\mathrm{Tr}(\mathbf{A}\mathbf{B}) / \partial A_{ij} = B_{ji} = (\mathbf{B}^T)_{ij}$.
Similarly,
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^T\mathbf{C}) = \mathbf{C}\mathbf{A}\mathbf{B} + \mathbf{C}^T\mathbf{A}\mathbf{B}^T \qquad (10)$$
Proof:
$$\nabla_{\mathbf{A}}\, \mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^T\mathbf{C}) = \nabla_{\mathbf{A}}\, \mathrm{Tr}\big(\underbrace{(\mathbf{A}\mathbf{B})}_{u(\mathbf{A})}\, \underbrace{(\mathbf{A}^T\mathbf{C})}_{v(\mathbf{A}^T)}\big)$$
$$= \nabla_{\mathbf{A}:u(\mathbf{A})}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big) + \nabla_{\mathbf{A}:v(\mathbf{A}^T)}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big)$$
$$= (\mathbf{B}\mathbf{A}^T\mathbf{C})^T + \big(\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}\, \mathrm{Tr}\big(u(\mathbf{A})\, v(\mathbf{A}^T)\big)\big)^T = \mathbf{C}^T\mathbf{A}\mathbf{B}^T + \big((\mathbf{C}\mathbf{A}\mathbf{B})^T\big)^T = \mathbf{C}^T\mathbf{A}\mathbf{B}^T + \mathbf{C}\mathbf{A}\mathbf{B}$$
Here we make use of the product rule for derivatives: $(u(x)v(x))' = u'(x)v(x) + u(x)v'(x)$. The notation $\nabla_{\mathbf{A}:u(\mathbf{A})}$ means taking the derivative w.r.t. $\mathbf{A}$ only through $u(\mathbf{A})$; the same applies to $\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}$. The chain rule is used here, and the conversion from $\nabla_{\mathbf{A}:v(\mathbf{A}^T)}$ to $\nabla_{\mathbf{A}^T:v(\mathbf{A}^T)}$ is based on Eq. 5.
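Both trace-gradient identities can be verified with the same finite-difference approach; the sketch below re-defines the hypothetical num_grad helper from the earlier check so that it stands alone:
\begin{verbatim}
import numpy as np

def num_grad(f, A, eps=1e-6):
    # Central-difference estimate of the gradient of scalar f at matrix A.
    G = np.zeros_like(A)
    for i in range(A.size):
        E = np.zeros_like(A)
        E.flat[i] = eps
        G.flat[i] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))

# Eq. (9): grad_A Tr(AB) = B^T
B = rng.standard_normal((4, 3))
assert np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-5)

# Eq. (10): grad_A Tr(A B A^T C) = CAB + C^T A B^T
B = rng.standard_normal((4, 4))
C = rng.standard_normal((3, 3))
G = num_grad(lambda M: np.trace(M @ B @ M.T @ C), A)
assert np.allclose(G, C @ A @ B + C.T @ A @ B.T, atol=1e-4)
\end{verbatim}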
4 Example: Least-Square Linear Regression
Assume we have $N$ data points $\{\mathbf{x}^{(i)}, y^{(i)}\}_{1:N}$, and the linear regression function $h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x}$ is parametrized by $\boldsymbol{\theta}$. We can rearrange the data in matrix form:
$$\mathbf{X} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ \vdots \\ (\mathbf{x}^{(N)})^T \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{pmatrix}$$
Then the vector of residuals can be written compactly as
$$\mathbf{X}\boldsymbol{\theta} - \mathbf{y} = \begin{pmatrix} h_{\boldsymbol{\theta}}(\mathbf{x}^{(1)}) - y^{(1)} \\ h_{\boldsymbol{\theta}}(\mathbf{x}^{(2)}) - y^{(2)} \\ \vdots \\ h_{\boldsymbol{\theta}}(\mathbf{x}^{(N)}) - y^{(N)} \end{pmatrix}$$
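A small NumPy sketch (with synthetic data, not from the original note) confirms that $\mathbf{X}\boldsymbol{\theta} - \mathbf{y}$ stacks the per-example residuals $h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - y^{(i)}$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = rng.standard_normal((N, d))               # row i holds (x^(i))^T
theta = rng.standard_normal(d)
y = X @ theta + 0.1 * rng.standard_normal(N)  # synthetic noisy targets

h = lambda x: theta @ x                       # h_theta(x) = theta^T x
residuals = np.array([h(X[i]) - y[i] for i in range(N)])
assert np.allclose(X @ theta - y, residuals)
\end{verbatim}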