Linear Algebra Review and Reference
Zico Kolter (Updated by Chuong Do and Tengyu Ma)
April 3, 2019
Contents

1 Basic Concepts and Notation
  1.1 Basic Notation

2 Matrix Multiplication
  2.1 Vector-Vector Products
  2.2 Matrix-Vector Products
  2.3 Matrix-Matrix Products

4 Matrix Calculus
  4.1 The Gradient
  4.2 The Hessian
  4.3 Gradients and Hessians of Quadratic and Linear Functions
  4.4 Least Squares
  4.5 Gradients of the Determinant
  4.6 Eigenvalues as Optimization
1 Basic Concepts and Notation
Linear algebra provides a way of compactly representing and operating on sets of linear
equations. For example, consider the following system of equations:
4x1 − 5x2 = −13
−2x1 + 3x2 = 9.
These are two equations in two variables, so as you know from high school algebra, you
can find a unique solution for x1 and x2 (unless the equations are somehow degenerate, for
example if the second equation is simply a multiple of the first, but in the case above there
is in fact a unique solution). In matrix notation, we can write the system more compactly
as
Ax = b
with

A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix}.
As we will see shortly, there are many advantages (including the obvious space savings)
to analyzing linear equations in this form.
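As a concrete illustration (a NumPy sketch added here for illustration only; the array names are arbitrary), the system above can be solved numerically:

```python
import numpy as np

# The system 4x1 - 5x2 = -13, -2x1 + 3x2 = 9 written as Ax = b.
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# Solve the linear system; the solution is unique since A is non-singular.
x = np.linalg.solve(A, b)
print(x)                      # [3. 5.]  ->  x1 = 3, x2 = 5
print(np.allclose(A @ x, b))  # True: the solution satisfies Ax = b
```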
1.1 Basic Notation

By A ∈ Rm×n we denote a matrix with m rows and n columns, whose entries are real numbers; by x ∈ Rn we denote a vector with n entries.

• We use the notation aij (or Aij , Ai,j , etc.) to denote the entry of A in the ith row and jth column:
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.
• We denote the jth column of A by aj or A:,j :
A = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}.
2 Matrix Multiplication
The product of two matrices A ∈ Rm×n and B ∈ Rn×p is the matrix
C = AB ∈ Rm×p ,
where

C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.
Note that in order for the matrix product to exist, the number of columns in A must equal
the number of rows in B. There are many other ways of looking at matrix multiplication
that may be more convenient and insightful than the standard definition above, and we’ll
start by examining a few special cases.
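As a quick numerical sanity check (an illustrative NumPy sketch, not part of the derivations in these notes), the entry-wise definition of C = AB can be compared against NumPy's built-in matrix product:

```python
import numpy as np

m, n, p = 3, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Entry-wise definition: C[i, j] = sum_k A[i, k] * B[k, j]
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

print(np.allclose(C, A @ B))  # True: matches the built-in matrix product
```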
2.1 Vector-Vector Products

Given two vectors x, y ∈ Rn , the quantity xT y, sometimes called the inner product or dot product of the vectors, is the real number x^T y = \sum_{i=1}^{n} x_i y_i . Observe that inner products are really just a special case of matrix multiplication. Note that it is always the case that xT y = y T x.
Given vectors x ∈ Rm , y ∈ Rn (not necessarily of the same size), xy T ∈ Rm×n is called
the outer product of the vectors. It is a matrix whose entries are given by (xy T )ij = xi yj ,
i.e.,
xy^T \in \mathbb{R}^{m \times n} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.
As an example of how the outer product can be useful, let 1 ∈ Rn denote an n-dimensional
vector whose entries are all equal to 1. Furthermore, consider the matrix A ∈ Rm×n whose
columns are all equal to some vector x ∈ Rm . Using outer products, we can represent A
compactly as,
A = \begin{bmatrix} | & | & & | \\ x & x & \cdots & x \\ | & | & & | \end{bmatrix} = \begin{bmatrix} x_1 & x_1 & \cdots & x_1 \\ x_2 & x_2 & \cdots & x_2 \\ \vdots & \vdots & \ddots & \vdots \\ x_m & x_m & \cdots & x_m \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} = x \mathbf{1}^T .
So far we have been multiplying on the right by a column vector, but it is also possible
to multiply on the left by a row vector. This is written, y T = xT A for A ∈ Rm×n , x ∈ Rm ,
and y ∈ Rn . As before, we can express y T in two obvious ways, depending on whether we
express A in terms of its rows or columns. In the first case, we express A in terms of its
columns, which gives
y^T = x^T A = x^T \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix}
which demonstrates that the ith entry of y T is equal to the inner product of x and the ith
column of A.
Finally, expressing A in terms of rows we get the final representation of the vector-matrix
product,
\begin{aligned}
y^T = x^T A &= \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} \\
&= x_1 \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \end{bmatrix} + x_2 \begin{bmatrix} \text{---}\; a_2^T \;\text{---} \end{bmatrix} + \cdots + x_m \begin{bmatrix} \text{---}\; a_m^T \;\text{---} \end{bmatrix}
\end{aligned}
so we see that y T is a linear combination of the rows of A, where the coefficients for the
linear combination are given by the entries of x.
In the inner-product interpretation of matrix-matrix multiplication, we represented A by rows and B by columns. Alternatively, we can represent A by columns, and B by rows.
This representation leads to a much trickier interpretation of AB as a sum of outer products.
Symbolically,
C = AB = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} \text{---}\; b_1^T \;\text{---} \\ \text{---}\; b_2^T \;\text{---} \\ \vdots \\ \text{---}\; b_n^T \;\text{---} \end{bmatrix} = \sum_{i=1}^{n} a_i b_i^T .
Put another way, AB is equal to the sum, over all i, of the outer product of the ith column
of A and the ith row of B. Since, in this case, ai ∈ Rm and bi ∈ Rp , the dimension of the
outer product ai bTi is m × p, which coincides with the dimension of C. Chances are, the last
equality above may appear confusing to you. If so, take the time to check it for yourself!
Second, we can also view matrix-matrix multiplication as a set of matrix-vector products.
Specifically, if we represent B by columns, we can view the columns of C as matrix-vector
products between A and the columns of B. Symbolically,
C = AB = A \begin{bmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{bmatrix} = \begin{bmatrix} | & | & & | \\ Ab_1 & Ab_2 & \cdots & Ab_p \\ | & | & & | \end{bmatrix}. \quad (2)
Here the ith column of C is given by the matrix-vector product with the vector on the right,
ci = Abi . These matrix-vector products can in turn be interpreted using both viewpoints
given in the previous subsection. Finally, we have the analogous viewpoint, where we represent A by rows, and view the rows of C as the matrix-vector products between the rows of A and B. Symbolically,
C = AB = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} B = \begin{bmatrix} \text{---}\; a_1^T B \;\text{---} \\ \text{---}\; a_2^T B \;\text{---} \\ \vdots \\ \text{---}\; a_m^T B \;\text{---} \end{bmatrix}.
Here the ith row of C is given by the matrix-vector product with the vector on the left,
cTi = aTi B.
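To make these viewpoints concrete, the following NumPy sketch (added for illustration; shapes chosen arbitrarily) verifies the outer-product, column, and row views of AB:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = A @ B

# Sum-of-outer-products view: AB = sum_i (ith column of A)(ith row of B)
outer_sum = sum(np.outer(A[:, i], B[i, :]) for i in range(n))
print(np.allclose(C, outer_sum))             # True

# Column view: the jth column of C is A times the jth column of B.
print(np.allclose(C[:, 0], A @ B[:, 0]))     # True

# Row view: the ith row of C is the ith row of A times B.
print(np.allclose(C[0, :], A[0, :] @ B))     # True
```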
It may seem like overkill to dissect matrix multiplication to such a large degree, especially
when all these viewpoints follow immediately from the initial definition we gave (in about a
line of math) at the beginning of this section. The direct advantage of these various
viewpoints is that they allow you to operate on the level/unit of vectors instead
of scalars. To fully understand linear algebra without getting lost in the complicated manipulation of indices, the key is to operate on concepts at as high a level as possible.¹

¹E.g., if you could write all your math derivations with matrices or vectors, it would be better than doing them with scalar elements.
Virtually all of linear algebra deals with matrix multiplications of some kind, and it is
worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints
presented here.
In addition to this, it is useful to know a few basic properties of matrix multiplication at
a higher level:
• Matrix multiplication is associative: (AB)C = A(BC).
• Matrix multiplication is distributive: A(B + C) = AB + AC.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that AB ≠ BA. (For example, if A ∈ Rm×n and B ∈ Rn×q , the matrix product BA does not even exist if m and q are not equal!)
If you are not familiar with these properties, take the time to verify them for yourself.
For example, to check the associativity of matrix multiplication, suppose that A ∈ Rm×n ,
B ∈ Rn×p , and C ∈ Rp×q . Note that AB ∈ Rm×p , so (AB)C ∈ Rm×q . Similarly, BC ∈ Rn×q ,
so A(BC) ∈ Rm×q . Thus, the dimensions of the resulting matrices agree. To show that
matrix multiplication is associative, it suffices to check that the (i, j)th entry of (AB)C is
equal to the (i, j)th entry of A(BC). We can verify this directly using the definition of
matrix multiplication:
\begin{aligned}
((AB)C)_{ij} &= \sum_{k=1}^{p} (AB)_{ik} C_{kj} = \sum_{k=1}^{p} \left( \sum_{l=1}^{n} A_{il} B_{lk} \right) C_{kj} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} A_{il} B_{lk} C_{kj} = \sum_{l=1}^{n} \sum_{k=1}^{p} A_{il} B_{lk} C_{kj} \\
&= \sum_{l=1}^{n} A_{il} \left( \sum_{k=1}^{p} B_{lk} C_{kj} \right) = \sum_{l=1}^{n} A_{il} (BC)_{lj} = (A(BC))_{ij} .
\end{aligned}
Here, the first and last two equalities simply use the definition of matrix multiplication, the
third and fifth equalities use the distributive property for scalar multiplication over addition,
and the fourth equality uses the commutativity and associativity of scalar addition. This
technique for proving matrix properties by reduction to simple scalar properties will come
up often, so make sure you’re familiar with it.
3.1 The Identity Matrix and Diagonal Matrices
The identity matrix , denoted I ∈ Rn×n , is a square matrix with ones on the diagonal and
zeros everywhere else. That is,
I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
It has the property that for all A ∈ Rm×n ,
AI = A = IA.
Note that in some sense, the notation for the identity matrix is ambiguous, since it does not
specify the dimension of I. Generally, the dimensions of I are inferred from context so as to
make matrix multiplication possible. For example, in the equation above, the I in AI = A
is an n × n matrix, whereas the I in A = IA is an m × m matrix.
A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically
denoted D = diag(d1 , d2 , . . . , dn ), with
D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j \end{cases}
Clearly, I = diag(1, 1, . . . , 1). Recall also that the transpose of a matrix A ∈ Rm×n , written AT ∈ Rn×m , is the matrix whose entries are given by (AT )ij = Aji . The following properties of transposes are easily verified:
• (AT )T = A
• (AB)T = B T AT
• (A + B)T = AT + B T
A square matrix A ∈ Rn×n is symmetric if A = AT and anti-symmetric if A = −AT . It is easy to show that for any matrix A ∈ Rn×n , the matrix A + AT is symmetric and the matrix A − AT is anti-symmetric. From this it follows that any square matrix A ∈ Rn×n can
be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since
A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)
and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out
that symmetric matrices occur a great deal in practice, and they have many nice properties
which we will look at shortly. It is common to denote the set of all symmetric matrices of
size n as Sn , so that A ∈ Sn means that A is a symmetric n × n matrix.

The trace of a square matrix A ∈ Rn×n , denoted tr(A) (or just trA if the parentheses are obviously implied), is the sum of the diagonal elements in the matrix:

\mathrm{tr}A = \sum_{i=1}^{n} A_{ii} .

As described in the CS229 lecture notes, the trace has the following properties (included here for the sake of completeness):
• For A ∈ Rn×n , trA = trAT .
• For A, B ∈ Rn×n , tr(A + B) = trA + trB.
• For A ∈ Rn×n , t ∈ R, tr(tA) = t trA.
• For A, B such that AB is square, trAB = trBA.
• For A, B, C such that ABC is square, trABC = trBCA = trCAB, and so on for the product of more matrices.
As an example of how these properties can be proven, we’ll consider the fourth property
given above. Suppose that A ∈ Rm×n and B ∈ Rn×m (so that AB ∈ Rm×m is a square
matrix). Observe that BA ∈ Rn×n is also a square matrix, so it makes sense to apply the
trace operator to it. To verify that trAB = trBA, note that
\begin{aligned}
\mathrm{tr}AB &= \sum_{i=1}^{m} (AB)_{ii} = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} A_{ij} B_{ji} \right) \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji} = \sum_{j=1}^{n} \sum_{i=1}^{m} B_{ji} A_{ij} \\
&= \sum_{j=1}^{n} \left( \sum_{i=1}^{m} B_{ji} A_{ij} \right) = \sum_{j=1}^{n} (BA)_{jj} = \mathrm{tr}BA .
\end{aligned}
Here, the first and last two equalities use the definition of the trace operator and matrix
multiplication. The fourth equality, where the main work occurs, uses the commutativity
of scalar multiplication in order to reverse the order of the terms in each product, and the
commutativity and associativity of scalar addition in order to rearrange the order of the
summation.
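The trace identities above are also easy to spot-check numerically; the following sketch (illustrative only) verifies trAB = trBA and the cyclic property for a product of three matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# trAB = trBA, even though AB is 3x3 while BA is 4x4.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # True

# Cyclic property: trABC = trBCA = trCAB.
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
```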
3.5 Norms
A norm of a vector ‖x‖ is informally a measure of the “length” of the vector. For example, we have the commonly-used Euclidean or ℓ2 norm,

\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} .
Norms can also be defined for matrices, such as the Frobenius norm,

\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)} .
Many other norms exist, but they are beyond the scope of this review.
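As a small illustration (not part of the derivation above), the following checks the Euclidean and Frobenius norms against their definitions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(5)
A = rng.standard_normal((4, 5))

# Euclidean (l2) norm of a vector.
print(np.isclose(np.linalg.norm(x), np.sqrt(np.sum(x ** 2))))   # True

# Frobenius norm of a matrix, two equivalent ways.
fro = np.linalg.norm(A, 'fro')
print(np.isclose(fro, np.sqrt(np.sum(A ** 2))))                 # True
print(np.isclose(fro, np.sqrt(np.trace(A.T @ A))))              # True
```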
3.6 Linear Independence and Rank
A set of vectors {x1 , x2 , . . . xn } ⊂ Rm is said to be (linearly) independent if no vector can
be represented as a linear combination of the remaining vectors. Conversely, if one vector
belonging to the set can be represented as a linear combination of the remaining vectors,
then the vectors are said to be (linearly) dependent. That is, if
x_n = \sum_{i=1}^{n-1} \alpha_i x_i
for some scalar values α1 , . . . , αn−1 ∈ R, then we say that the vectors x1 , . . . , xn are linearly
dependent; otherwise, the vectors are linearly independent. For example, the vectors

x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 4 \\ 1 \\ 5 \end{bmatrix}, \qquad x_3 = \begin{bmatrix} 2 \\ -3 \\ -1 \end{bmatrix}

are linearly dependent, since x3 = −2x1 + x2 .
The rank of a matrix A ∈ Rm×n , denoted rank(A), is the size of the largest subset of its columns that constitute a linearly independent set (it is a basic fact of linear algebra that the same number results if rows are used instead of columns). One important property of the rank is the following:
• For A ∈ Rm×n , rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank .
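For the three vectors in the example above, a quick NumPy check (added for illustration) confirms the linear dependence and the resulting rank:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 1.0, 5.0])
x3 = np.array([2.0, -3.0, -1.0])

# x3 is a linear combination of x1 and x2.
print(np.allclose(x3, -2 * x1 + x2))        # True

# Stacking the vectors as columns gives a matrix of rank 2 < 3.
X = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(X))             # 2
```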
The inverse of a square matrix A ∈ Rn×n is denoted A−1 , and is the unique matrix such that

A^{-1} A = I = A A^{-1} .

Note that not all square matrices have inverses; for some square matrices A, A−1 may not exist. In particular, we say that A is invertible or non-singular if A−1 exists and non-invertible or singular otherwise. In order for a square matrix A to have an inverse A−1 , A must be full rank. We will
soon see that there are many alternative sufficient and necessary conditions, in addition to
full rank, for invertibility.
The following are properties of the inverse; all assume that A, B ∈ Rn×n are non-singular:
• (A−1 )−1 = A
• (AB)−1 = B −1 A−1
• (A−1 )T = (AT )−1 . For this reason this matrix is often denoted A−T .
As an example of how the inverse is used, consider the linear system of equations, Ax = b
where A ∈ Rn×n , and x, b ∈ Rn . If A is nonsingular (i.e., invertible), then x = A−1 b.
(What if A ∈ Rm×n is not a square matrix? Does this work?)
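In code, one rarely forms A−1 explicitly to solve Ax = b; the sketch below (illustrative only, with arbitrary random data) shows both the inverse-based formula above and the usually preferable np.linalg.solve:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))   # a generic random square matrix is non-singular
b = rng.standard_normal(5)

x_inv = np.linalg.inv(A) @ b      # x = A^{-1} b, the formula above
x_solve = np.linalg.solve(A, b)   # solves Ax = b directly (preferred numerically)

print(np.allclose(x_inv, x_solve))   # True
print(np.allclose(A @ x_solve, b))   # True
```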
A square matrix U ∈ Rn×n is orthogonal if all of its columns are orthonormal, i.e., mutually orthogonal and each of unit Euclidean norm. It follows immediately from this definition that

U^T U = I = U U^T .

In other words, the inverse of an orthogonal matrix is its transpose. Note that if U is not
square — i.e., U ∈ Rm×n , n < m — but its columns are still orthonormal, then U T U = I,
but U U T ≠ I. We generally only use the term orthogonal to describe the previous case,
where U is square.
Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e.,

\|Ux\|_2 = \|x\|_2

for any x ∈ Rn and orthogonal U ∈ Rn×n .
3.9 Range and Nullspace of a Matrix
The span of a set of vectors {x1 , x2 , . . . xn } is the set of all vectors that can be expressed as
a linear combination of {x1 , . . . , xn }. That is,
\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i , \; \alpha_i \in \mathbb{R} \right\} .
It can be shown that if {x1 , . . . , xn } is a set of n linearly independent vectors, where each
xi ∈ Rn , then span({x1 , . . . xn }) = Rn . In other words, any vector v ∈ Rn can be written as
a linear combination of x1 through xn . The projection of a vector y ∈ Rm onto the span
of {x1 , . . . , xn } (here we assume xi ∈ Rm ) is the vector v ∈ span({x1 , . . . , xn }) such that v is as close as possible to y, as measured by the Euclidean norm ‖v − y‖2 . We denote the projection as Proj(y; {x1 , . . . , xn }) and can define it formally as

\mathrm{Proj}(y; \{x_1, \ldots, x_n\}) = \operatorname*{argmin}_{v \in \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2 .
The range (sometimes also called the columnspace) of a matrix A ∈ Rm×n , denoted
R(A), is the span of the columns of A. In other words,
R(A) = {v ∈ Rm : v = Ax, x ∈ Rn }.
Making a few technical assumptions (namely that A is full rank and that n < m), the
projection of a vector y ∈ Rm onto the range of A is given by,
\mathrm{Proj}(y; A) = \operatorname*{argmin}_{v \in \mathcal{R}(A)} \|v - y\|_2 = A (A^T A)^{-1} A^T y .
This last equation should look extremely familiar, since it is almost the same formula we
derived in class (and which we will soon derive again) for the least squares estimation of
parameters. Looking at the definition for the projection, it should not be too hard to
convince yourself that this is in fact the same objective that we minimized in our least
squares problem (except for a squaring of the norm, which doesn’t affect the optimal point)
and so these problems are naturally very connected. When A contains only a single column,
a ∈ Rm , this gives the special case for a projection of a vector onto a line:
\mathrm{Proj}(y; a) = \frac{a a^T}{a^T a} \, y .
The nullspace of a matrix A ∈ Rm×n , denoted N (A), is the set of all vectors that equal
0 when multiplied by A, i.e.,
N (A) = {x ∈ Rn : Ax = 0}.
Note that vectors in R(A) are of size m, while vectors in the N (A) are of size n, so vectors
in R(AT ) and N (A) are both in Rn . In fact, we can say much more. It turns out that
\{ w : w = u + v, \; u \in \mathcal{R}(A^T), \; v \in \mathcal{N}(A) \} = \mathbb{R}^n \quad \text{and} \quad \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\} .
In other words, R(AT ) and N (A) are disjoint subsets that together span the entire space of
Rn . Sets of this type are called orthogonal complements, and we denote this R(AT ) =
N (A)⊥ .
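The projection formula above translates directly into NumPy; the short sketch below (added for illustration, with arbitrary sizes) checks that the residual y − Proj(y; A) is orthogonal to the range of A:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 3
A = rng.standard_normal((m, n))   # assume full column rank (true with probability 1)
y = rng.standard_normal(m)

# Proj(y; A) = A (A^T A)^{-1} A^T y
proj = A @ np.linalg.solve(A.T @ A, A.T @ y)

# The residual is orthogonal to every column of A, i.e., to R(A).
print(np.allclose(A.T @ (y - proj), 0))      # True

# Single-column special case: Proj(y; a) = (a a^T / a^T a) y
a = A[:, 0]
proj_line = np.outer(a, a) @ y / (a @ a)
print(np.allclose(a @ (y - proj_line), 0))   # True: residual is orthogonal to a
```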
3.10 The Determinant
The determinant of a square matrix A ∈ Rn×n , is a function det : Rn×n → R, and is
denoted |A| or det A (like the trace operator, we usually omit parentheses). Algebraically,
one could write down an explicit formula for the determinant of A, but this unfortunately
gives little intuition about its meaning. Instead, we’ll start out by providing a geometric
interpretation of the determinant and then visit some of its specific algebraic properties
afterwards.
Given a matrix
A = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_n^T \;\text{---} \end{bmatrix},
consider the set of points S ⊂ Rn formed by taking all possible linear combinations of the
row vectors a1 , . . . , an ∈ Rn of A, where the coefficients of the linear combination are all
between 0 and 1; that is, the set S is the restriction of span({a1 , . . . , an }) to only those
linear combinations whose coefficients α1 , . . . , αn satisfy 0 ≤ αi ≤ 1, i = 1, . . . , n. Formally,
S = \left\{ v \in \mathbb{R}^n : v = \sum_{i=1}^{n} \alpha_i a_i \ \text{ where } \ 0 \le \alpha_i \le 1, \; i = 1, \ldots, n \right\} .
The absolute value of the determinant of A, it turns out, is a measure of the “volume” of
the set S.3
For example, consider the 2 × 2 matrix,
A = \begin{bmatrix} 1 & 3 \\ 3 & 2 \end{bmatrix}. \quad (4)

Here, the rows of the matrix are

a_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad a_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}.
The set S corresponding to these rows is shown in Figure 1. For two-dimensional matrices,
S generally has the shape of a parallelogram. In our example, the value of the determinant
is |A| = −7 (as can be computed using the formulas shown later in this section), so the area
of the parallelogram is 7. (Verify this for yourself!)
In three dimensions, the set S corresponds to an object known as a parallelepiped (a three-
dimensional box with skewed sides, such that every face has the shape of a parallelogram).
The absolute value of the determinant of the 3 × 3 matrix whose rows define S gives the
three-dimensional volume of the parallelepiped. In even higher dimensions, the set S is an
object known as an n-dimensional parallelotope.
³Admittedly, we have not actually defined what we mean by “volume” here, but hopefully the intuition should be clear enough. When n = 2, our notion of “volume” corresponds to the area of S in the Cartesian plane. When n = 3, “volume” corresponds with our usual notion of volume for a three-dimensional object.
Figure 1: Illustration of the determinant for the 2 × 2 matrix A given in (4). Here, a1 and a2 are vectors corresponding to the rows of A, and the set S corresponds to the shaded region (i.e., the parallelogram). The absolute value of the determinant, |det A| = 7, is the area of the parallelogram.
Algebraically, the determinant satisfies the following three properties (from which all
other properties follow, including the general formula):
1. The determinant of the identity is 1, |I| = 1. (Geometrically, the volume of a unit
hypercube is 1).
2. Given a matrix A ∈ Rn×n , if we multiply a single row in A by a scalar t ∈ R, then the
determinant of the new matrix is t|A|,
\begin{vmatrix} \text{---}\; t a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_n^T \;\text{---} \end{vmatrix} = t |A| .
(Geometrically, multiplying one of the sides of the set S by a factor t causes the volume
to increase by a factor t.)
3. If we exchange any two rows aTi and aTj of A, then the determinant of the new matrix
is −|A|, for example
\begin{vmatrix} \text{---}\; a_2^T \;\text{---} \\ \text{---}\; a_1^T \;\text{---} \\ \vdots \\ \text{---}\; a_n^T \;\text{---} \end{vmatrix} = -|A| .
In case you are wondering, it is not immediately obvious that a function satisfying the above
three properties exists. In fact, though, such a function does exist, and is unique (which we
will not prove here).
Several properties that follow from the three properties above include:
• For A ∈ Rn×n , |A| = |AT |.
• For A, B ∈ Rn×n , |AB| = |A||B|.
• For A ∈ Rn×n , |A| = 0 if and only if A is singular (i.e., non-invertible). (If A is singular
then it does not have full rank, and hence its columns are linearly dependent. In this
case, the set S corresponds to a “flat sheet” within the n-dimensional space and hence
has zero volume.)
• For A ∈ Rn×n and A non-singular, |A−1 | = 1/|A|.
Before giving the general definition for the determinant, we define, for A ∈ Rn×n , A\i,\j ∈ R(n−1)×(n−1) to be the matrix that results from deleting the ith row and jth column from A.
The general (recursive) formula for the determinant is
\begin{aligned}
|A| &= \sum_{i=1}^{n} (-1)^{i+j} a_{ij} |A_{\backslash i,\backslash j}| \qquad \text{(for any } j \in 1, \ldots, n\text{)} \\
&= \sum_{j=1}^{n} (-1)^{i+j} a_{ij} |A_{\backslash i,\backslash j}| \qquad \text{(for any } i \in 1, \ldots, n\text{)}
\end{aligned}
with the initial case that |A| = a11 for A ∈ R1×1 . If we were to expand this formula
completely for A ∈ Rn×n , there would be a total of n! (n factorial) different terms. For this
reason, we hardly ever explicitly write the complete equation of the determinant for matrices
bigger than 3 × 3. However, the equations for determinants of matrices up to size 3 × 3 are
fairly common, and it is good to know them:
\begin{aligned}
|[a_{11}]| &= a_{11} \\
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} &= a_{11} a_{22} - a_{12} a_{21} \\
\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} &= a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} \\
&\qquad - a_{11} a_{23} a_{32} - a_{12} a_{21} a_{33} - a_{13} a_{22} a_{31}
\end{aligned}
The classical adjoint (often just called the adjoint) of a matrix A ∈ Rn×n is denoted adj(A), and defined as

\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \qquad (\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\backslash j,\backslash i}|

(note the switch in the indices A\j,\i ). It can be shown that for any nonsingular A ∈ Rn×n ,

A^{-1} = \frac{1}{|A|} \mathrm{adj}(A) .
While this is a nice “explicit” formula for the inverse of a matrix, we should note that, numer-
ically, there are in fact much more efficient ways of computing the inverse.
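As a numerical check (illustrative only), the snippet below evaluates the determinant of the matrix from (4) and verifies the adjugate formula for the inverse on that example; the helper function adjugate is written here for illustration, it is not a NumPy routine:

```python
import numpy as np

# The 2x2 matrix from equation (4): |A| = 1*2 - 3*3 = -7, so the area of S is 7.
A = np.array([[1.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.det(A))   # approximately -7.0

def adjugate(M):
    """Classical adjoint: adj(M)[i, j] = (-1)**(i+j) * det of M with row j and column i deleted."""
    n = M.shape[0]
    adj = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(M, j, axis=0), i, axis=1)  # delete row j, column i
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

# A^{-1} = adj(A) / |A|
print(np.allclose(np.linalg.inv(A), adjugate(A) / np.linalg.det(A)))   # True
```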
3.11 Quadratic Forms and Positive Semidefinite Matrices
Given a square matrix A ∈ Rn×n and a vector x ∈ Rn , the scalar value xT Ax is called a
quadratic form. Written explicitly, we see that
x^T A x = \sum_{i=1}^{n} x_i (Ax)_i = \sum_{i=1}^{n} x_i \left( \sum_{j=1}^{n} A_{ij} x_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j .
Note that,
x^T A x = (x^T A x)^T = x^T A^T x = x^T \left( \tfrac{1}{2} A + \tfrac{1}{2} A^T \right) x ,
where the first equality follows from the fact that the transpose of a scalar is equal to
itself, and the second equality follows from the fact that we are averaging two quantities
which are themselves equal. From this, we can conclude that only the symmetric part of
A contributes to the quadratic form. For this reason, we often implicitly assume that the
matrices appearing in a quadratic form are symmetric.
We give the following definitions:
• A symmetric matrix A ∈ Sn is positive definite (PD) if for all non-zero vectors
x ∈ Rn , xT Ax > 0. This is usually denoted A ≻ 0 (or just A > 0), and often times the
set of all positive definite matrices is denoted Sn++ .
• A symmetric matrix A ∈ Sn is positive semidefinite (PSD) if for all vectors x ∈ Rn , xT Ax ≥ 0. This is written A ⪰ 0 (or just A ≥ 0), and the set of all positive semidefinite matrices is often denoted Sn+ .
• Likewise, a symmetric matrix A ∈ Sn is negative definite (ND), denoted A ≺ 0 (or
just A < 0) if for all non-zero x ∈ Rn , xT Ax < 0.
• Similarly, a symmetric matrix A ∈ Sn is negative semidefinite (NSD), denoted A ⪯ 0 (or just A ≤ 0), if for all x ∈ Rn , xT Ax ≤ 0.
• Finally, a symmetric matrix A ∈ Sn is indefinite, if it is neither positive semidefinite
nor negative semidefinite — i.e., if there exists x1 , x2 ∈ Rn such that xT1 Ax1 > 0 and
xT2 Ax2 < 0.
It should be obvious that if A is positive definite, then −A is negative definite and vice
versa. Likewise, if A is positive semidefinite then −A is negative semidefinite and vice versa.
If A is indefinite, then so is −A.
One important property of positive definite and negative definite matrices is that they
are always full rank, and hence, invertible. To see why this is the case, suppose that some
matrix A ∈ Rn×n is not full rank. Then some column of A, say the jth, can be expressed as a linear combination of the other n − 1 columns:

a_j = \sum_{i \neq j} x_i a_i ,
for some x1 , . . . , xj−1 , xj+1 , . . . , xn ∈ R. Setting xj = −1, we have
Ax = \sum_{i=1}^{n} x_i a_i = 0 .
But this implies xT Ax = 0 for some non-zero vector x, so A must be neither positive definite
nor negative definite. Therefore, if A is either positive definite or negative definite, it must
be full rank.
Finally, there is one type of positive definite matrix that comes up frequently, and so
deserves some special mention. Given any matrix A ∈ Rm×n (not necessarily symmetric or
even square), the matrix G = AT A (sometimes called a Gram matrix ) is always positive
semidefinite. Further, if m ≥ n (and we assume for convenience that A is full rank), then
G = AT A is positive definite.
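A quick way to see this numerically (an illustrative sketch, with arbitrary sizes) is to evaluate the quadratic form xT Gx = ‖Ax‖22 for many random x, or to inspect the eigenvalues of the symmetric matrix G:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((7, 4))   # arbitrary, neither square nor symmetric
G = A.T @ A                       # Gram matrix

# The quadratic form x^T G x = ||Ax||_2^2 is non-negative for every x.
xs = rng.standard_normal((1000, 4))
quad = np.einsum('ij,jk,ik->i', xs, G, xs)   # x^T G x for each random x
print(quad.min() >= -1e-10)                  # True (non-negative up to round-off)

# Equivalently, all eigenvalues of G are non-negative.
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))   # True
```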
3.12 Eigenvalues and Eigenvectors

Given a square matrix A ∈ Rn×n , we say that λ ∈ C is an eigenvalue of A and x ∈ Cn is the corresponding eigenvector⁴ if

Ax = λx, x ≠ 0.
Intuitively, this definition means that multiplying A by the vector x results in a new vector
that points in the same direction as x, but scaled by a factor λ. Also note that for any
eigenvector x ∈ Cn and scalar c ∈ C, A(cx) = cAx = cλx = λ(cx), so cx is also an
eigenvector. For this reason when we talk about “the” eigenvector associated with λ, we
usually assume that the eigenvector is normalized to have length 1 (this still creates some
ambiguity, since x and −x will both be eigenvectors, but we will have to live with this).
We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair
of A if,
(λI − A)x = 0, x ≠ 0.
But (λI − A)x = 0 has a non-zero solution x if and only if (λI − A) has a non-trivial
nullspace, which is only the case if (λI − A) is singular, i.e.,
|(λI − A)| = 0.
We can now use the previous definition of the determinant to expand the expression |(λI − A)| into a (very large) polynomial in λ of degree n. This polynomial is often called the characteristic polynomial of the matrix A.
We then find the n (possibly complex) roots of this characteristic polynomial and denote
them by λ1 , . . . , λn . These are all the eigenvalues of the matrix A, but we note that they
⁴Note that λ and the entries of x are actually in C, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don’t worry about this technicality for now; you can think of complex vectors in the same way as real vectors.
may not be distinct. To find the eigenvector corresponding to the eigenvalue λi , we simply
solve the linear equation (λi I − A)x = 0, which is guaranteed to have a non-zero solution
because λi I − A is singular (but there could also be multiple or infinite solutions.)
It should be noted that this is not the method which is actually used in practice to nu-
merically compute the eigenvalues and eigenvectors (remember that the complete expansion
of the determinant has n! terms); it is rather a mathematical argument.
The following are properties of eigenvalues and eigenvectors (in all cases assume A ∈ Rn×n
has eigenvalues λ1 , . . . , λn ):
• The trace of A is equal to the sum of its eigenvalues,

\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i .
• The eigenvalues of a diagonal matrix D = diag(d1 , . . . dn ) are just the diagonal entries
d1 , . . . dn .
When A is symmetric (A ∈ Sn ), it can be shown that all of its eigenvalues are real and that its eigenvectors u1 , . . . , un can be chosen to be orthonormal, so that Aui = λi ui for each i. Let Λ = diag(λ1 , . . . , λn ), and let U be the orthonormal matrix that contains the ui ’s as columns:⁶

U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} \quad (5)

so that AU = U Λ. Recalling that an orthonormal matrix U satisfies U U T = I and using the equation above, we have

A = A U U^T = U \Lambda U^T . \quad (6)
Background: representing a vector w.r.t. another basis. Any orthonormal matrix

U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}

defines a new basis (coordinate system) of Rn in the following sense: any vector x ∈ Rn can be represented as a linear combination of u1 , . . . , un with coefficients x̂1 , . . . , x̂n :
x = x̂1 u1 + · · · + x̂n un = U x̂
where in the second equality we use the view of equation (1). Indeed, such an x̂ exists and is unique:
x = U x̂ ⇔ U T x = x̂
In other words, the vector x̂ = U T x can serve as another representation of the vector x w.r.t
the basis defined by U .
⁶Here for notational simplicity, we deviate from the notational convention for columns of matrices in the
previous sections.
“Diagonalizing” matrix-vector multiplication. With the setup above, we will see that left-multiplying by the matrix A can be viewed as left-multiplying by a diagonal matrix w.r.t. the basis of the eigenvectors. Suppose x is a vector and x̂ is its representation w.r.t. the basis of U . Let z = Ax be the matrix-vector product. Now let’s compute the representation of z w.r.t. the basis of U :
Then, again using the fact that U U T = U T U = I and equation (6), we have that

\hat{z} = U^T z = U^T A x = U^T U \Lambda U^T x = \Lambda \hat{x} = \begin{bmatrix} \lambda_1 \hat{x}_1 \\ \lambda_2 \hat{x}_2 \\ \vdots \\ \lambda_n \hat{x}_n \end{bmatrix} .
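The decomposition A = U ΛU T and the diagonalized product ẑ = Λx̂ can be checked directly with np.linalg.eigh; the sketch below (added for illustration) also confirms that trA equals the sum of the eigenvalues, as stated earlier:

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                            # a random symmetric matrix

lam, U = np.linalg.eigh(A)                   # eigenvalues and orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(A, U @ Lam @ U.T))         # True: A = U Lambda U^T
print(np.isclose(np.trace(A), lam.sum()))    # True: trA = sum of eigenvalues

# Matrix-vector multiplication is diagonal in the eigenvector basis.
x = rng.standard_normal(5)
x_hat = U.T @ x                              # representation of x w.r.t. the basis U
z_hat = U.T @ (A @ x)                        # representation of z = Ax w.r.t. the basis U
print(np.allclose(z_hat, lam * x_hat))       # True: z_hat = Lambda x_hat
```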
4. Finally, if A has both positive and negative eigenvalues, say λi > 0 and λj < 0, then it is indefinite. This is because if we let x̂ satisfy x̂i = 1 and x̂k = 0, ∀k ≠ i, then

x^T A x = \sum_{k=1}^{n} \lambda_k \hat{x}_k^2 = \lambda_i > 0 .

Similarly, if we let x̂ satisfy x̂j = 1 and x̂k = 0, ∀k ≠ j, then xT Ax = λj < 0.
An application where eigenvalues and eigenvectors come up frequently is in maximizing some function of a matrix. In particular, for a matrix A ∈ Sn , consider the following maximization problem,

\max_{x \in \mathbb{R}^n} \; x^T A x \qquad \text{subject to } \|x\|_2^2 = 1, \quad (9)

i.e., we want to find the vector (of norm 1) which maximizes the quadratic form. Assuming
the eigenvalues are ordered as λ1 ≥ λ2 ≥ . . . ≥ λn , the optimal value of this optimization
problem is λ1 and any eigenvector u1 corresponding to λ1 is one of the maximizers. (If
λ1 > λ2 , then there is a unique eigenvector corresponding to eigenvalue λ1 , which is the
unique maximizer of the optimization problem (9).)
We can show this by using the diagonalization technique: note that ‖x‖2 = ‖x̂‖2 by equation (3), and using equation (8), we can rewrite the optimization (9) as⁸

\max_{\hat{x} \in \mathbb{R}^n} \; \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^{n} \lambda_i \hat{x}_i^2 \qquad \text{subject to } \|\hat{x}\|_2^2 = 1. \quad (10)

⁸Note that x = U x̂, and therefore constructing x̂ gives an implicit construction of x.
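Empirically, the maximum of the quadratic form over the unit sphere matches the largest eigenvalue; the sketch below (illustrative only) compares random unit vectors against the top eigenvector of a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                              # random symmetric matrix

lam, U = np.linalg.eigh(A)                     # eigenvalues in ascending order
lam_max, u_max = lam[-1], U[:, -1]

# The top eigenvector attains the value lambda_max ...
print(np.isclose(u_max @ A @ u_max, lam_max))  # True

# ... and no random unit vector does better.
xs = rng.standard_normal((2000, 6))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
vals = np.einsum('ij,jk,ik->i', xs, A, xs)     # x^T A x for each unit vector
print(vals.max() <= lam_max + 1e-10)           # True
```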
4 Matrix Calculus
While the topics in the previous sections are typically covered in a standard course on linear
algebra, one topic that does not seem to be covered very often (and which we will use
extensively) is the extension of calculus to the vector setting. Despite the fact that all the
actual calculus we use is relatively trivial, the notation can often make things look much
more difficult than they are. In this section we present some basic definitions of matrix
calculus and provide a few examples.
4.1 The Gradient
Suppose that f : Rm×n → R is a function that takes as input a matrix A of size m × n and
returns a real value. Then the gradient of f (with respect to A ∈ Rm×n ) is the matrix of
partial derivatives, defined as:
\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\[4pt] \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\[2pt] \vdots & \vdots & \ddots & \vdots \\[2pt] \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix},

i.e., an m × n matrix whose (i, j) entry is ∂f (A)/∂Aij .
It is very important to remember that the gradient of a function is only defined if the function
is real-valued, that is, if it returns a scalar value. We can not, for example, take the gradient
of Ax, A ∈ Rn×n with respect to x, since this quantity is vector-valued.
It follows directly from the equivalent properties of partial derivatives that:
• ∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x).
• For t ∈ R, ∇x (t f (x)) = t ∇x f (x).
For example, suppose f : Rm → R is the function f (z) = z T z, so that ∇z f (z) = 2z. But now consider the expression

∇f (Ax)

for some matrix A ∈ Rm×n and vector x ∈ Rn . How should this expression be interpreted? There are at least two possibilities:
1. In the first interpretation, recall that ∇z f (z) = 2z. Here, we interpret ∇f (Ax) as evaluating the gradient at the point Ax, hence,

\nabla f(Ax) = 2(Ax) = 2Ax \in \mathbb{R}^m .
2. In the second interpretation, we consider the quantity f (Ax) as a function of the input
variables x. More formally, let g(x) = f (Ax). Then in this interpretation,
∇f (Ax) = ∇x g(x) ∈ Rn .
Here, we can see that these two interpretations are indeed different. One interpretation yields
an m-dimensional vector as a result, while the other interpretation yields an n-dimensional
vector as a result! How can we resolve this?
Here, the key is to make explicit the variables which we are differentiating with respect
to. In the first case, we are differentiating the function f with respect to its arguments z and
then substituting the argument Ax. In the second case, we are differentiating the composite
function g(x) = f (Ax) with respect to x directly. We denote the first case as ∇z f (Ax) and
the second case as ∇x f (Ax).9 Keeping the notation clear is extremely important (as you’ll
find out in your homework, in fact!).
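To see the two interpretations side by side, here is a small numerical sketch (added for illustration) with f (z) = z T z: evaluating ∇z f at Ax gives an m-dimensional vector, while differentiating g(x) = f (Ax) with respect to x gives an n-dimensional vector (equal to 2AT Ax by the chain rule):

```python
import numpy as np

rng = np.random.default_rng(11)
m, n = 4, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# f(z) = z^T z, so grad_z f(z) = 2z.
grad_z_at_Ax = 2 * (A @ x)            # first interpretation: an m-dimensional vector

# g(x) = f(Ax) = x^T A^T A x, so grad_x g(x) = 2 A^T A x.
grad_x_g = 2 * A.T @ A @ x            # second interpretation: an n-dimensional vector

print(grad_z_at_Ax.shape, grad_x_g.shape)   # (4,) (3,)

# Check grad_x g against a central-difference approximation.
def g(v):
    return (A @ v) @ (A @ v)

eps = 1e-6
num = np.array([(g(x + eps * np.eye(n)[i]) - g(x - eps * np.eye(n)[i])) / (2 * eps)
                for i in range(n)])
print(np.allclose(num, grad_x_g, atol=1e-4))   # True
```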
4.2 The Hessian

Suppose that f : Rn → R is a function that takes a vector in Rn and returns a real number. Then the Hessian matrix with respect to x, written ∇2x f (x) or simply as H, is the n × n matrix of partial derivatives whose entries are given by

(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j} .

Note that the Hessian is always symmetric, since

\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i} .
Similar to the gradient, the Hessian is defined only when f (x) is real-valued.
⁹A drawback to this notation that we will have to live with is the fact that in the first case, ∇z f (Ax), it
appears that we are differentiating with respect to a variable that does not even appear in the expression
being differentiated! For this reason, the first case is often written as ∇f (Ax), and the fact that we are
differentiating with respect to the arguments of f is understood. However, the second case is always written
as ∇x f (Ax).
It is natural to think of the gradient as the analogue of the first derivative for functions
of vectors, and the Hessian as the analogue of the second derivative (and the symbols we
use also suggest this relation). This intuition is generally correct, but there are a few caveats to
keep in mind.
First, for real-valued functions of one variable f : R → R, it is a basic definition that the
second derivative is the derivative of the first derivative, i.e.,
\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x} f(x) \right) .
However, for functions of a vector, the gradient of the function is a vector, and we cannot
take the gradient of a vector — i.e.,
\nabla_x \nabla_x f(x) = \nabla_x \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\[2pt] \frac{\partial f(x)}{\partial x_2} \\[2pt] \vdots \\[2pt] \frac{\partial f(x)}{\partial x_n} \end{bmatrix}
and this expression is not defined. Therefore, it is not the case that the Hessian is the
gradient of the gradient. However, this is almost true, in the following sense: If we look at
the ith entry of the gradient (∇x f (x))i = ∂f (x)/∂xi , and take the gradient with respect to
x we get
\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_i \partial x_1} \\[2pt] \frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\[2pt] \vdots \\[2pt] \frac{\partial^2 f(x)}{\partial x_i \partial x_n} \end{bmatrix} .
If we don’t mind being a little bit sloppy we can say that (essentially) ∇2x f (x) = ∇x (∇x f (x))T ,
so long as we understand that this really means taking the gradient of each entry of (∇x f (x))T ,
not the gradient of the whole vector.
Finally, note that while we can take the gradient with respect to a matrix A ∈ Rm×n , for
the purposes of this class we will only consider taking the Hessian with respect to a vector
x ∈ Rn . This is simply a matter of convenience (and the fact that none of the calculations
we do require us to find the Hessian with respect to a matrix), since the Hessian with respect
to a matrix would have to represent all the partial derivatives ∂ 2 f (A)/(∂Aij ∂Akℓ ), and it is
rather cumbersome to represent this as a matrix.
4.3 Gradients and Hessians of Quadratic and Linear Functions
Now let’s try to determine the gradient and Hessian matrices for a few simple functions. It
should be noted that all the gradients given here are special cases of the gradients given in
the CS229 lecture notes.
For x ∈ Rn , let f (x) = bT x for some known vector b ∈ Rn . Then
f(x) = \sum_{i=1}^{n} b_i x_i ,

so

\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k .
From this we can easily see that ∇x bT x = b. This should be compared to the analogous
situation in single variable calculus, where ∂/(∂x) ax = a.
Now consider the quadratic function f (x) = xT Ax for A ∈ Sn . Remember that
f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j .
To take the partial derivative, we’ll consider the terms including xk and x2k factors separately:
\begin{aligned}
\frac{\partial f(x)}{\partial x_k} &= \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j \\
&= \frac{\partial}{\partial x_k} \left[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \right] \\
&= \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k \\
&= \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i ,
\end{aligned}
where the last equality follows since A is symmetric (which we can safely assume, since it is
appearing in a quadratic form). Note that the kth entry of ∇x f (x) is just the inner product
of the kth row of A and x. Therefore, ∇x xT Ax = 2Ax. Again, this should remind you of
the analogous fact in single-variable calculus, that ∂/(∂x) ax2 = 2ax.
Finally, let’s look at the Hessian of the quadratic function f (x) = xT Ax (it should be
obvious that the Hessian of a linear function bT x is zero). In this case,

\frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} = \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_\ell} \right] = \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^{n} A_{\ell i} x_i \right] = 2 A_{\ell k} = 2 A_{k \ell} .
Therefore, it should be clear that ∇2x xT Ax = 2A, which should be entirely expected (and
again analogous to the single-variable fact that ∂ 2 /(∂x2 ) ax2 = 2a).
To recap,
• ∇x bT x = b
• ∇x xT Ax = 2Ax (if A symmetric)
• ∇2x xT Ax = 2A (if A symmetric)
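These formulas are easy to spot-check with finite differences; the sketch below (illustrative only; the helper numerical_gradient is written here, not a library routine) compares the closed-form gradients against central-difference approximations:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(9)
n = 5
b = rng.standard_normal(n)
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                     # symmetric, as assumed in the formulas above
x = rng.standard_normal(n)

# Gradient of the linear function b^T x is b.
print(np.allclose(numerical_gradient(lambda v: b @ v, x), b, atol=1e-4))              # True

# Gradient of the quadratic form x^T A x is 2Ax.
print(np.allclose(numerical_gradient(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-4))  # True
```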
4.4 Least Squares

Let’s apply the machinery above to derive the least squares solution. Suppose we are given a matrix A ∈ Rm×n (for simplicity we assume A is full rank) and a vector b ∈ Rm , and we want to find a vector x ∈ Rn that makes Ax as close as possible to b, as measured by the square of the Euclidean norm ‖Ax − b‖22 . Using the fact that ‖x‖22 = xT x, we have

\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b .

Taking the gradient with respect to x, and using the properties we derived in the previous section,

\nabla_x \left( x^T A^T A x - 2 b^T A x + b^T b \right) = \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b = 2 A^T A x - 2 A^T b .

Setting this last expression equal to zero and solving for x gives the normal equations

x = (A^T A)^{-1} A^T b .
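In practice one usually solves this problem with a least-squares routine rather than by forming (AT A)−1 explicitly; the sketch below (illustrative only) checks that both approaches agree on a random over-determined problem:

```python
import numpy as np

rng = np.random.default_rng(10)
m, n = 20, 4
A = rng.standard_normal((m, n))        # full column rank with probability 1
b = rng.standard_normal(m)

# Normal equations: x = (A^T A)^{-1} A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solver (more numerically stable in general).
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))              # True
# At the solution the residual is orthogonal to the columns of A.
print(np.allclose(A.T @ (A @ x_normal - b), 0))    # True
```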
4.5 Gradients of the Determinant

Now let’s consider the gradient of the determinant with respect to a matrix A ∈ Rn×n , that is, ∇A |A|. Recalling our earlier discussion of determinants,

|A| = \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\backslash i,\backslash j}| \qquad \text{(for any } j \in 1, \ldots, n\text{)} ,

so

\frac{\partial}{\partial A_{k\ell}} |A| = \frac{\partial}{\partial A_{k\ell}} \sum_{i=1}^{n} (-1)^{i+\ell} A_{i\ell} |A_{\backslash i,\backslash \ell}| = (-1)^{k+\ell} |A_{\backslash k,\backslash \ell}| = (\mathrm{adj}(A))_{\ell k} .
From this it immediately follows from the properties of the adjoint that

\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T} .
Now let’s consider the function f : Sn++ → R, f (A) = log |A|. Note that we have to
restrict the domain of f to be the positive definite matrices, since this ensures that |A| > 0,
so that the log of |A| is a real number. In this case we can use the chain rule (nothing fancy,
just the ordinary chain rule from single-variable calculus) to see that

\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}} ,

from which it follows that

\nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1} ,

where we can drop the transpose in the last expression because A is symmetric. Note the similarity to the single-valued case, where ∂/(∂x) log x = 1/x.
4.6 Eigenvalues as Optimization

Finally, we use matrix calculus to solve an optimization problem in a way that leads directly back to eigenvalues and eigenvectors. Consider the problem of maximizing xT Ax over x ∈ Rn subject to xT x = 1, for a symmetric matrix A ∈ Sn . A standard way of solving optimization problems with equality constraints is by forming the Lagrangian,10 which in this case is given by

\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x ,
where λ is called the Lagrange multiplier associated with the equality constraint. It can be
established that for x∗ to be an optimal point of the problem, the gradient of the Lagrangian
has to be zero at x∗ (this is not the only condition, but it is required). That is,

\nabla_x \mathcal{L}(x, \lambda) = \nabla_x (x^T A x - \lambda x^T x) = 2 A x - 2 \lambda x = 0 .
Notice that this is just the linear equation Ax = λx. This shows that the only points which
can possibly maximize (or minimize) xT Ax assuming xT x = 1 are the eigenvectors of A.
¹⁰Don’t worry if you haven’t seen Lagrangians before, as we will cover them in greater detail later in
CS229.