
Matrix Calculus Foundation for Machine Learning

LI Xiucheng

SCSE, Nanyang Technological University


Outline

Background

Matrix Derivative

Matrix Differentiation Rules

Machine Learning Examples

Notation

We denote

• scalars with lower-case, $x$;
• vectors with bold lower-case, $\mathbf{x}$;
• matrices with upper-case, $X$;
• the elements of vectors or matrices with $x_i$ or $X_{ij}$;
• trace as $\operatorname{tr}(X) = \sum_{i=1}^{n} X_{ii}$ for $X \in \mathbb{R}^{n \times n}$;
• determinant as $|X|$ for $X \in \mathbb{R}^{n \times n}$;
• the Hadamard product of matrices as $X \odot Y$;
• vector or matrix inner product with $\langle \cdot, \cdot \rangle$.

Background
Vector and Matrix Product

For any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^n$ and $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{m \times n}$, we define their inner products as

• $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.
• $\langle X, Y \rangle = \operatorname{tr}(X^\top Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} Y_{ij}$.

Remark:

• The second one is also known as the matrix Frobenius inner product.
• The Frobenius inner product is compatible with the vector inner product: when the two matrices degrade to vectors, the Frobenius inner product equals the vector inner product (see the numerical check below).
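
As a quick numerical sanity check (an illustrative sketch, not part of the original slides; the array names are arbitrary), the trace form, the element-wise sum, and the vector inner product can be compared directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

frob_trace = np.trace(X.T @ Y)      # tr(X^T Y)
frob_sum = np.sum(X * Y)            # sum_ij X_ij Y_ij
assert np.isclose(frob_trace, frob_sum)

# Compatibility with the vector inner product: single-column matrices.
x, y = X[:, :1], Y[:, :1]
assert np.isclose(np.trace(x.T @ y), float(x[:, 0] @ y[:, 0]))
```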

Properties of Frobenius inner product

For any $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{m \times n}$, $Z \in \mathbb{R}^{m \times n}$, $a \in \mathbb{R}$,

• $\langle X, Y \rangle = \langle Y, X \rangle$.
• $\langle aX, Y \rangle = \langle X, aY \rangle = a\langle X, Y \rangle$.
• $\langle X + Z, Y \rangle = \langle X, Y \rangle + \langle Z, Y \rangle$.
• $\langle X, Y \odot Z \rangle = \langle X \odot Y, Z \rangle$.
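
These properties can be confirmed on random matrices (an illustrative sketch, not from the slides; `inner` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y, Z = (rng.standard_normal((3, 4)) for _ in range(3))
a = 2.5
inner = lambda U, V: np.trace(U.T @ V)   # Frobenius inner product

assert np.isclose(inner(X, Y), inner(Y, X))                    # symmetry
assert np.isclose(inner(a * X, Y), a * inner(X, Y))            # scaling
assert np.isclose(inner(X + Z, Y), inner(X, Y) + inner(Z, Y))  # additivity
assert np.isclose(inner(X, Y * Z), inner(X * Y, Z))            # Hadamard shift
```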

Properties of Frobenius inner product

Suppose that $A \in \mathbb{R}^{m \times \ell_1}$, $C \in \mathbb{R}^{\ell_1 \times n}$, $B \in \mathbb{R}^{m \times \ell_2}$, $D \in \mathbb{R}^{\ell_2 \times n}$; then we have

• $\langle AC, BD \rangle = \langle B^\top AC, D \rangle = \langle C, A^\top BD \rangle$,
• $\langle AC, BD \rangle = \langle ACD^\top, B \rangle = \langle A, BDC^\top \rangle$.

Remark

• The first two equalities can be summarized as: a factor on the left of one argument moves to the left of the other argument by transposing.
• The last two equalities can be summarized as: a factor on the right of one argument moves to the right of the other argument by transposing (see the check below).
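
A small NumPy check of all four identities on rectangular matrices (an illustrative sketch, not from the slides; shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
m, l1, l2, n = 3, 4, 5, 6
A, C = rng.standard_normal((m, l1)), rng.standard_normal((l1, n))
B, D = rng.standard_normal((m, l2)), rng.standard_normal((l2, n))
inner = lambda U, V: np.trace(U.T @ V)

lhs = inner(A @ C, B @ D)
assert np.isclose(lhs, inner(B.T @ A @ C, D))   # move B left, transposed
assert np.isclose(lhs, inner(C, A.T @ B @ D))   # move A left, transposed
assert np.isclose(lhs, inner(A @ C @ D.T, B))   # move D right, transposed
assert np.isclose(lhs, inner(A, B @ D @ C.T))   # move C right, transposed
```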

Properties of Frobenius inner product

Proof.
The first two equalities follow directly from the definition of the inner product; the last two use the fact that $\operatorname{tr}(XY) = \operatorname{tr}(YX)$ holds for any two matrices $X$, $Y$ such that $X^\top$ has the same size as $Y$. For example, $\langle AC, BD \rangle = \operatorname{tr}(C^\top A^\top BD) = \operatorname{tr}(DC^\top A^\top B) = \langle ACD^\top, B \rangle$.

Matrix Derivative
Matrix Derivative

Let us denote $f = f(X) \in \mathbb{R}$.
First, consider a scalar $x$; we have

$$df = f'(x)\,dx \qquad (1)$$

Similarly, for a vector $\mathbf{x}$, we have

$$df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,dx_i = \langle \nabla_{\mathbf{x}} f, d\mathbf{x} \rangle. \qquad (2)$$

The above form is easy to extend to a matrix as

$$df = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial f}{\partial X_{ij}}\,dX_{ij} = \langle \nabla_X f, dX \rangle. \qquad (3)$$
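
For intuition (an illustrative sketch, not from the slides), take $f(X) = \langle X, X \rangle$, whose differential is $df = 2\langle X, dX \rangle$, so $\nabla_X f = 2X$; a central-difference approximation of $\partial f / \partial X_{ij}$ recovers exactly this matrix:

```python
import numpy as np

def f(X):
    return np.trace(X.T @ X)   # f(X) = <X, X>

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 4))
eps = 1e-6

num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)   # central difference

assert np.allclose(num_grad, 2 * X, atol=1e-5)               # matches grad f = 2X
```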

Matrix Differentiation Rules
Matrix Differentiation Rules

1. $d(X \pm Y) = dX \pm dY$, $\; d(XY) = (dX)Y + X\,dY$, $\; d(X^\top) = (dX)^\top$
2. $d\operatorname{tr}(X) = \operatorname{tr}(dX)$
3. $dX^{-1} = -X^{-1}(dX)X^{-1}$
4. $d|X| = \langle \operatorname{adj}(X)^\top, dX \rangle$, where $\operatorname{adj}(X)$ is the adjugate (classical adjoint) matrix of $X$
5. $d|X| = |X|\,\langle (X^{-1})^\top, dX \rangle$ when $X$ is invertible
6. $d(X \odot Y) = (dX) \odot Y + X \odot dY$
7. $d\sigma(X) = \sigma'(X) \odot dX$, where $\sigma(\cdot)$ is an element-wise function such as the sigmoid.

Remark: (1) implies that $d\langle X, Y \rangle = \langle dX, Y \rangle + \langle X, dY \rangle$. (4) is known as Jacobi's formula; a numerical check of (5) follows.
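
Rule (5) can be checked by finite differences (an illustrative sketch, not from the slides): for an invertible $X$, it says $\partial |X| / \partial X_{ij} = |X|\,(X^{-1})^\top_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
eps = 1e-6

num_grad = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * eps)

# Rule (5): grad |X| = |X| (X^{-1})^T
assert np.allclose(num_grad, np.linalg.det(X) * np.linalg.inv(X).T, atol=1e-4)
```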

Method

The key idea is to use the properties of the inner product and the matrix differentiation rules to bring $df$ into the inner product form

$$df = \langle \nabla_X f, dX \rangle,$$

from which the gradient $\nabla_X f$ can be read off directly.

Machine Learning Examples
Quadratic Function Optimization

$$f(\mathbf{x}) = \langle \mathbf{x}, A\mathbf{x} \rangle.$$

$$\begin{aligned}
df &= \langle d\mathbf{x}, A\mathbf{x} \rangle + \langle \mathbf{x}, d(A\mathbf{x}) \rangle \\
   &= \langle A\mathbf{x}, d\mathbf{x} \rangle + \langle \mathbf{x}, A\,d\mathbf{x} \rangle \\
   &= \langle A\mathbf{x}, d\mathbf{x} \rangle + \langle A^\top \mathbf{x}, d\mathbf{x} \rangle \\
   &= \langle A\mathbf{x} + A^\top \mathbf{x}, d\mathbf{x} \rangle
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{x}} f = A\mathbf{x} + A^\top \mathbf{x}.$$
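
A quick check of this result (an illustrative sketch, not from the slides; $A$ is deliberately non-symmetric so the $A^\top \mathbf{x}$ term matters):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))   # non-symmetric
x = rng.standard_normal(5)
f = lambda v: v @ A @ v           # f(x) = <x, Ax>

eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
assert np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5)
```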

Linear Regression

$$f(\mathbf{w}) = \langle X\mathbf{w} - \mathbf{y}, X\mathbf{w} - \mathbf{y} \rangle.$$

$$\begin{aligned}
df &= \langle d(X\mathbf{w} - \mathbf{y}), X\mathbf{w} - \mathbf{y} \rangle + \langle X\mathbf{w} - \mathbf{y}, d(X\mathbf{w} - \mathbf{y}) \rangle \\
   &= 2\langle X\mathbf{w} - \mathbf{y}, d(X\mathbf{w} - \mathbf{y}) \rangle \\
   &= 2\langle X\mathbf{w} - \mathbf{y}, X\,d\mathbf{w} \rangle = 2\langle X^\top (X\mathbf{w} - \mathbf{y}), d\mathbf{w} \rangle.
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{w}} f = 2X^\top (X\mathbf{w} - \mathbf{y}).$$

And
$$\nabla_{\mathbf{w}} f = 0 \implies \mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}.$$
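
A minimal sketch on synthetic data (not from the slides; the data and names are invented) showing that the derived gradient vanishes at the closed-form solution, which also agrees with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # w* = (X^T X)^{-1} X^T y
grad = 2 * X.T @ (X @ w_star - y)            # grad f = 2 X^T (Xw - y)

assert np.allclose(grad, 0, atol=1e-8)
assert np.allclose(w_star, np.linalg.lstsq(X, y, rcond=None)[0])
```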

Softmax Regression

In softmax regression, $\mathbf{y}$ is a one-hot vector defining the target class distribution, and $\hat{\mathbf{y}} = \operatorname{softmax}(X\mathbf{w})$ is the distribution predicted by the model. The loss function is defined as the cross-entropy between $\mathbf{y}$ and $\hat{\mathbf{y}}$, i.e.,

$$\begin{aligned}
f(\mathbf{w}) &= -\langle \mathbf{y}, \log \operatorname{softmax}(X\mathbf{w}) \rangle \\
  &= -\left\langle \mathbf{y}, \log \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \right\rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} - \mathbf{1} \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle \rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle \, \langle \mathbf{y}, \mathbf{1} \rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle,
\end{aligned}$$

note that $\langle \mathbf{y}, \mathbf{1} \rangle = 1$.
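
The last equality can be verified numerically (an illustrative sketch, not from the slides; here $X\mathbf{w}$ is treated as a vector of class logits and the shapes are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
K, d = 4, 3
X = rng.standard_normal((K, d))   # per-class feature rows (illustrative)
w = rng.standard_normal(d)
y = np.eye(K)[1]                  # one-hot target

z = X @ w                                          # logits Xw
lhs = -y @ np.log(np.exp(z) / np.exp(z).sum())     # -<y, log softmax(Xw)>
rhs = -y @ z + np.log(np.exp(z).sum())             # -<y, Xw> + log<1, exp(Xw)>
assert np.isclose(lhs, rhs)
```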

Softmax Regression

$$f(\mathbf{w}) = -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle.$$

$$\begin{aligned}
df &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \frac{d\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\mathbf{1}}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, \exp(X\mathbf{w}) \odot d(X\mathbf{w}) \right\rangle \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, d(X\mathbf{w}) \right\rangle \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, X\,d\mathbf{w} \right\rangle \\
   &= -\left\langle X^\top \left( \mathbf{y} - \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \right), d\mathbf{w} \right\rangle.
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{w}} f = X^\top \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} - X^\top \mathbf{y}.$$
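
As in the earlier examples, the gradient $X^\top(\operatorname{softmax}(X\mathbf{w}) - \mathbf{y})$ can be compared against central differences (an illustrative sketch, not from the slides; variable names and shapes are assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
K, d = 4, 3
X, w = rng.standard_normal((K, d)), rng.standard_normal(d)
y = np.eye(K)[2]                         # one-hot target

def f(v):
    z = X @ v
    return -y @ z + np.log(np.exp(z).sum())   # cross-entropy loss

p = np.exp(X @ w) / np.exp(X @ w).sum()       # softmax(Xw)
grad = X.T @ (p - y)                          # X^T softmax(Xw) - X^T y

eps = 1e-6
num_grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(num_grad, grad, atol=1e-5)
```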
Estimating the Covariance of a Gaussian Distribution

$$f(\Sigma) = \log |\Sigma| + \frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle.$$

For the first term, by rule (5) and the symmetry of $\Sigma$ (so that $(\Sigma^{-1})^\top = \Sigma^{-1}$),

$$d \log |\Sigma| = |\Sigma|^{-1} d|\Sigma| = \langle \Sigma^{-1}, d\Sigma \rangle.$$

For the second term, using rule (3),

$$\begin{aligned}
d\,\frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle
  &= \frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, (d\Sigma^{-1}) (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (d\Sigma) \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left\langle \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1}, d\Sigma \right\rangle.
\end{aligned}$$

Estimating the Covariance of a Gaussian Distribution

Let $S = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top$; then

$$df = \left\langle \Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1}, d\Sigma \right\rangle.$$

Hence,
$$\nabla_\Sigma f = (\Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1})^\top.$$
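
A finite-difference sketch of this result (illustrative, not from the slides; the data are synthetic and, since $\Sigma$ and $S$ are symmetric here, the transpose is immaterial). Note also that setting the gradient to zero recovers the familiar maximum-likelihood estimate $\Sigma = S$.

```python
import numpy as np

rng = np.random.default_rng(9)
d, N = 3, 200
xs = rng.standard_normal((N, d)) @ np.diag([1.0, 2.0, 0.5])
mu = xs.mean(axis=0)
S = (xs - mu).T @ (xs - mu) / N                  # sample covariance

def f(Sigma):
    diff = xs - mu
    return (np.log(np.linalg.det(Sigma))
            + np.mean(np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)))

Sigma = S + 0.5 * np.eye(d)                      # an arbitrary SPD test point
grad = np.linalg.inv(Sigma) - np.linalg.inv(Sigma) @ S @ np.linalg.inv(Sigma)

eps = 1e-6
num_grad = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        num_grad[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * eps)

assert np.allclose(num_grad, grad.T, atol=1e-4)  # matches (Σ^{-1} - Σ^{-1} S Σ^{-1})^T
```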
