
Matrix Calculus Foundation for Machine Learning

LI Xiucheng

SCSE, Nanyang Technological University


Outline

Background

Matrix Derivative

Matrix Differentiation Rules

Machine Learning Examples

Notation

We denote

• scalars with lower-case, $x$;
• vectors with bold lower-case, $\mathbf{x}$;
• matrices with upper-case, $X$;
• the elements of vectors or matrices with $x_i$ or $X_{ij}$;
• trace as $\operatorname{tr}(X) = \sum_{i=1}^{n} X_{ii}$ for $X \in \mathbb{R}^{n \times n}$;
• determinant as $|X|$ for $X \in \mathbb{R}^{n \times n}$;
• the Hadamard product of matrices as $X \odot Y$;
• vector or matrix inner product with $\langle \cdot, \cdot \rangle$.

Background
Vector and Matrix Product

For any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^n$ and $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{m \times n}$, we define their inner products as

• $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.
• $\langle X, Y \rangle = \operatorname{tr}(X^\top Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} Y_{ij}$.

Remark:

• The second one is also known as the matrix Frobenius inner product.
• The Frobenius inner product is compatible with the vector inner product: when the two matrices degrade to vectors, the Frobenius inner product equals the vector inner product (see the numerical check below).
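
As a quick numerical sanity check (an illustrative sketch, not part of the original slides; the array names are arbitrary), the trace form, the element-wise sum, and the vector inner product can be compared directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

frob_trace = np.trace(X.T @ Y)      # tr(X^T Y)
frob_sum = np.sum(X * Y)            # sum_ij X_ij Y_ij
assert np.isclose(frob_trace, frob_sum)

# Compatibility with the vector inner product: single-column matrices.
x, y = X[:, :1], Y[:, :1]
assert np.isclose(np.trace(x.T @ y), float(x[:, 0] @ y[:, 0]))
```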

Properties of Frobenius inner product

For any $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{m \times n}$, $Z \in \mathbb{R}^{m \times n}$, $a \in \mathbb{R}$,

• $\langle X, Y \rangle = \langle Y, X \rangle$.
• $\langle aX, Y \rangle = \langle X, aY \rangle = a\langle X, Y \rangle$.
• $\langle X + Z, Y \rangle = \langle X, Y \rangle + \langle Z, Y \rangle$.
• $\langle X, Y \odot Z \rangle = \langle X \odot Y, Z \rangle$.
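
These properties can be confirmed on random matrices (an illustrative sketch, not from the slides; `inner` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y, Z = (rng.standard_normal((3, 4)) for _ in range(3))
a = 2.5
inner = lambda U, V: np.trace(U.T @ V)   # Frobenius inner product

assert np.isclose(inner(X, Y), inner(Y, X))                    # symmetry
assert np.isclose(inner(a * X, Y), a * inner(X, Y))            # scaling
assert np.isclose(inner(X + Z, Y), inner(X, Y) + inner(Z, Y))  # additivity
assert np.isclose(inner(X, Y * Z), inner(X * Y, Z))            # Hadamard shift
```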

Properties of Frobenius inner product

Suppose that $A \in \mathbb{R}^{m \times \ell_1}$, $C \in \mathbb{R}^{\ell_1 \times n}$, $B \in \mathbb{R}^{m \times \ell_2}$, $D \in \mathbb{R}^{\ell_2 \times n}$; then we have

• $\langle AC, BD \rangle = \langle B^\top AC, D \rangle = \langle C, A^\top BD \rangle$,
• $\langle AC, BD \rangle = \langle ACD^\top, B \rangle = \langle A, BDC^\top \rangle$.

Remark

• The first two equalities can be summarized as: a factor on the left of one argument moves to the left of the other argument by transposing.
• The last two equalities can be summarized as: a factor on the right of one argument moves to the right of the other argument by transposing (see the check below).
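
A small NumPy check of all four identities on rectangular matrices (an illustrative sketch, not from the slides; shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
m, l1, l2, n = 3, 4, 5, 6
A, C = rng.standard_normal((m, l1)), rng.standard_normal((l1, n))
B, D = rng.standard_normal((m, l2)), rng.standard_normal((l2, n))
inner = lambda U, V: np.trace(U.T @ V)

lhs = inner(A @ C, B @ D)
assert np.isclose(lhs, inner(B.T @ A @ C, D))   # move B left, transposed
assert np.isclose(lhs, inner(C, A.T @ B @ D))   # move A left, transposed
assert np.isclose(lhs, inner(A @ C @ D.T, B))   # move D right, transposed
assert np.isclose(lhs, inner(A, B @ D @ C.T))   # move C right, transposed
```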

Properties of Frobenius inner product

Proof.
The first two equalities follow directly from the definition of the inner product; the last two use the fact that $\operatorname{tr}(XY) = \operatorname{tr}(YX)$ holds for any two matrices $X$, $Y$ such that $X^\top$ has the same size as $Y$. For example, $\langle AC, BD \rangle = \operatorname{tr}(C^\top A^\top BD) = \operatorname{tr}(DC^\top A^\top B) = \langle ACD^\top, B \rangle$.

Matrix Derivative
Matrix Derivative

Let us denote $f = f(X) \in \mathbb{R}$.
First, consider a scalar $x$; we have

$$df = f'(x)\,dx \qquad (1)$$

Similarly, for a vector $\mathbf{x}$, we have

$$df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,dx_i = \langle \nabla_{\mathbf{x}} f, d\mathbf{x} \rangle. \qquad (2)$$

The above form is easy to extend to a matrix as

$$df = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial f}{\partial X_{ij}}\,dX_{ij} = \langle \nabla_X f, dX \rangle. \qquad (3)$$
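
For intuition (an illustrative sketch, not from the slides), take $f(X) = \langle X, X \rangle$, whose differential is $df = 2\langle X, dX \rangle$, so $\nabla_X f = 2X$; a central-difference approximation of $\partial f / \partial X_{ij}$ recovers exactly this matrix:

```python
import numpy as np

def f(X):
    return np.trace(X.T @ X)   # f(X) = <X, X>

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 4))
eps = 1e-6

num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)   # central difference

assert np.allclose(num_grad, 2 * X, atol=1e-5)               # matches grad f = 2X
```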

Matrix Differentiation Rules
Matrix Differentiation Rules

1. $d(X \pm Y) = dX \pm dY$, $\; d(XY) = (dX)Y + X\,dY$, $\; d(X^\top) = (dX)^\top$
2. $d\operatorname{tr}(X) = \operatorname{tr}(dX)$
3. $dX^{-1} = -X^{-1}(dX)X^{-1}$
4. $d|X| = \langle \operatorname{adj}(X)^\top, dX \rangle$, where $\operatorname{adj}(X)$ is the adjugate (classical adjoint) matrix of $X$
5. $d|X| = |X|\,\langle (X^{-1})^\top, dX \rangle$ when $X$ is invertible
6. $d(X \odot Y) = (dX) \odot Y + X \odot dY$
7. $d\sigma(X) = \sigma'(X) \odot dX$, where $\sigma(\cdot)$ is an element-wise function such as the sigmoid.

Remark: (1) implies that $d\langle X, Y \rangle = \langle dX, Y \rangle + \langle X, dY \rangle$. (4) is known as Jacobi's formula; a numerical check of (5) follows.
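
Rule (5) can be checked by finite differences (an illustrative sketch, not from the slides): for an invertible $X$, it says $\partial |X| / \partial X_{ij} = |X|\,(X^{-1})^\top_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
eps = 1e-6

num_grad = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * eps)

# Rule (5): grad |X| = |X| (X^{-1})^T
assert np.allclose(num_grad, np.linalg.det(X) * np.linalg.inv(X).T, atol=1e-4)
```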

Method

The key idea is to use the properties of the inner product and the matrix differentiation rules to bring $df$ into the inner product form

$$df = \langle \nabla_X f, dX \rangle,$$

from which the gradient $\nabla_X f$ can be read off directly.

Machine Learning Examples
Quadratic Function Optimization

$$f(\mathbf{x}) = \langle \mathbf{x}, A\mathbf{x} \rangle.$$

$$\begin{aligned}
df &= \langle d\mathbf{x}, A\mathbf{x} \rangle + \langle \mathbf{x}, d(A\mathbf{x}) \rangle \\
   &= \langle A\mathbf{x}, d\mathbf{x} \rangle + \langle \mathbf{x}, A\,d\mathbf{x} \rangle \\
   &= \langle A\mathbf{x}, d\mathbf{x} \rangle + \langle A^\top \mathbf{x}, d\mathbf{x} \rangle \\
   &= \langle A\mathbf{x} + A^\top \mathbf{x}, d\mathbf{x} \rangle
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{x}} f = A\mathbf{x} + A^\top \mathbf{x}.$$
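
A quick check of this result (an illustrative sketch, not from the slides; $A$ is deliberately non-symmetric so the $A^\top \mathbf{x}$ term matters):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))   # non-symmetric
x = rng.standard_normal(5)
f = lambda v: v @ A @ v           # f(x) = <x, Ax>

eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
assert np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5)
```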

Linear Regression

$$f(\mathbf{w}) = \langle X\mathbf{w} - \mathbf{y}, X\mathbf{w} - \mathbf{y} \rangle.$$

$$\begin{aligned}
df &= \langle d(X\mathbf{w} - \mathbf{y}), X\mathbf{w} - \mathbf{y} \rangle + \langle X\mathbf{w} - \mathbf{y}, d(X\mathbf{w} - \mathbf{y}) \rangle \\
   &= 2\langle X\mathbf{w} - \mathbf{y}, d(X\mathbf{w} - \mathbf{y}) \rangle \\
   &= 2\langle X\mathbf{w} - \mathbf{y}, X\,d\mathbf{w} \rangle = 2\langle X^\top (X\mathbf{w} - \mathbf{y}), d\mathbf{w} \rangle.
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{w}} f = 2X^\top (X\mathbf{w} - \mathbf{y}).$$

And
$$\nabla_{\mathbf{w}} f = 0 \implies \mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}.$$
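
A minimal sketch on synthetic data (not from the slides; the data and names are invented) showing that the derived gradient vanishes at the closed-form solution, which also agrees with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # w* = (X^T X)^{-1} X^T y
grad = 2 * X.T @ (X @ w_star - y)            # grad f = 2 X^T (Xw - y)

assert np.allclose(grad, 0, atol=1e-8)
assert np.allclose(w_star, np.linalg.lstsq(X, y, rcond=None)[0])
```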

Softmax Regression

In softmax regression, $\mathbf{y}$ is a one-hot vector defining the target class distribution, and $\hat{\mathbf{y}} = \operatorname{softmax}(X\mathbf{w})$ is the distribution predicted by the model. The loss function is defined as the cross-entropy between $\mathbf{y}$ and $\hat{\mathbf{y}}$, i.e.,

$$\begin{aligned}
f(\mathbf{w}) &= -\langle \mathbf{y}, \log \operatorname{softmax}(X\mathbf{w}) \rangle \\
  &= -\left\langle \mathbf{y}, \log \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \right\rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} - \mathbf{1} \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle \rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle \, \langle \mathbf{y}, \mathbf{1} \rangle \\
  &= -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle,
\end{aligned}$$

note that $\langle \mathbf{y}, \mathbf{1} \rangle = 1$.
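
The last equality can be verified numerically (an illustrative sketch, not from the slides; here $X\mathbf{w}$ is treated as a vector of class logits and the shapes are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
K, d = 4, 3
X = rng.standard_normal((K, d))   # per-class feature rows (illustrative)
w = rng.standard_normal(d)
y = np.eye(K)[1]                  # one-hot target

z = X @ w                                          # logits Xw
lhs = -y @ np.log(np.exp(z) / np.exp(z).sum())     # -<y, log softmax(Xw)>
rhs = -y @ z + np.log(np.exp(z).sum())             # -<y, Xw> + log<1, exp(Xw)>
assert np.isclose(lhs, rhs)
```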

Softmax Regression

$$f(\mathbf{w}) = -\langle \mathbf{y}, X\mathbf{w} \rangle + \log\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle.$$

$$\begin{aligned}
df &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \frac{d\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\mathbf{1}}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, \exp(X\mathbf{w}) \odot d(X\mathbf{w}) \right\rangle \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, d(X\mathbf{w}) \right\rangle \\
   &= -\langle \mathbf{y}, X\,d\mathbf{w} \rangle + \left\langle \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle}, X\,d\mathbf{w} \right\rangle \\
   &= -\left\langle X^\top \left( \mathbf{y} - \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} \right), d\mathbf{w} \right\rangle.
\end{aligned}$$

Hence,
$$\nabla_{\mathbf{w}} f = X^\top \frac{\exp(X\mathbf{w})}{\langle \mathbf{1}, \exp(X\mathbf{w}) \rangle} - X^\top \mathbf{y}.$$
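
As in the earlier examples, the gradient $X^\top(\operatorname{softmax}(X\mathbf{w}) - \mathbf{y})$ can be compared against central differences (an illustrative sketch, not from the slides; variable names and shapes are assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
K, d = 4, 3
X, w = rng.standard_normal((K, d)), rng.standard_normal(d)
y = np.eye(K)[2]                         # one-hot target

def f(v):
    z = X @ v
    return -y @ z + np.log(np.exp(z).sum())   # cross-entropy loss

p = np.exp(X @ w) / np.exp(X @ w).sum()       # softmax(Xw)
grad = X.T @ (p - y)                          # X^T softmax(Xw) - X^T y

eps = 1e-6
num_grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(num_grad, grad, atol=1e-5)
```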
Estimating the Covariance of a Gaussian Distribution

$$f(\Sigma) = \log |\Sigma| + \frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle.$$

For the first term, by rule (5) and the symmetry of $\Sigma$ (so that $(\Sigma^{-1})^\top = \Sigma^{-1}$),

$$d \log |\Sigma| = |\Sigma|^{-1} d|\Sigma| = \langle \Sigma^{-1}, d\Sigma \rangle.$$

For the second term, using rule (3),

$$\begin{aligned}
d\,\frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle
  &= \frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, (d\Sigma^{-1}) (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left\langle \mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (d\Sigma) \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left\langle \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1}, d\Sigma \right\rangle.
\end{aligned}$$

Estimating the Covariance of a Gaussian Distribution

Let $S = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top$; then

$$df = \left\langle \Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1}, d\Sigma \right\rangle.$$

Hence,
$$\nabla_\Sigma f = (\Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1})^\top.$$
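
A finite-difference sketch of this result (illustrative, not from the slides; the data are synthetic and, since $\Sigma$ and $S$ are symmetric here, the transpose is immaterial). Note also that setting the gradient to zero recovers the familiar maximum-likelihood estimate $\Sigma = S$.

```python
import numpy as np

rng = np.random.default_rng(9)
d, N = 3, 200
xs = rng.standard_normal((N, d)) @ np.diag([1.0, 2.0, 0.5])
mu = xs.mean(axis=0)
S = (xs - mu).T @ (xs - mu) / N                  # sample covariance

def f(Sigma):
    diff = xs - mu
    return (np.log(np.linalg.det(Sigma))
            + np.mean(np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)))

Sigma = S + 0.5 * np.eye(d)                      # an arbitrary SPD test point
grad = np.linalg.inv(Sigma) - np.linalg.inv(Sigma) @ S @ np.linalg.inv(Sigma)

eps = 1e-6
num_grad = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        num_grad[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * eps)

assert np.allclose(num_grad, grad.T, atol=1e-4)  # matches (Σ^{-1} - Σ^{-1} S Σ^{-1})^T
```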
