
4   Singular Value Decomposition (SVD)

The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, A = UDV^T, where the columns of U and V are orthonormal and the matrix D is diagonal with positive real entries. The SVD is useful in many tasks. Here we mention two examples. First, the rank of a matrix A can be read off from its SVD. This is useful when the elements of the matrix are real numbers that have been rounded to some finite precision. Before the entries were rounded the matrix may have been of low rank, but the rounding converted the matrix to full rank. The original rank can be determined by the number of diagonal elements of D not exceedingly close to zero. Second, for a square and invertible matrix A, the inverse of A is V D^{-1} U^T.

To gain insight into the SVD, treat the rows of an n × d matrix A as n points in a d-dimensional space and consider the problem of finding the best k-dimensional subspace with respect to the set of points. Here "best" means minimize the sum of the squares of the perpendicular distances of the points to the subspace. We begin with a special case of the problem where the subspace is 1-dimensional, a line through the origin. We will see later that the best-fitting k-dimensional subspace can be found by k applications of the best-fitting line algorithm.

Finding the best-fitting line through the origin with respect to a set of points {x_i | 1 ≤ i ≤ n} in the plane means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line. The problem is called the best least squares fit. In the best least squares fit, one is minimizing the distance to a subspace. An alternative problem is to find the function that best fits some data. Here one variable y is a function of the variables x_1, x_2, ..., x_d and one wishes to minimize the vertical distance, i.e., distance in the y direction, to the subspace of the x_i, rather than minimize the perpendicular distance to the subspace being fit to the data.

Figure 4.1: The projection of the point x_i onto the line through the origin in the direction of v.

Returning to the best least squares fit problem, consider projecting a point x_i onto a line through the origin. Then

$$ x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2 . $$

See Figure 4.1. Thus

$$ (\text{distance of point to line})^2 = x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 - (\text{length of projection})^2 . $$

To minimize the sum of the squares of the distances to the line, one could minimize $\sum_{i=1}^{n} (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2)$ minus the sum of the squares of the lengths of the projections of the points to the line. However, $\sum_{i=1}^{n} (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2)$ is a constant (independent of the line), so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line. Similarly for best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.
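
As a concrete check of the identity behind this equivalence, here is a minimal sketch (assuming numpy; the points and the direction vector are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # five arbitrary points in 3-dimensional space
v = rng.normal(size=3)
v /= np.linalg.norm(v)           # unit vector along a line through the origin

proj_len = X @ v                 # signed lengths of the projections onto v
dist = np.linalg.norm(X - np.outer(proj_len, v), axis=1)  # perpendicular distances

# squared length of each point = (length of projection)^2 + (distance to line)^2
assert np.allclose((X**2).sum(axis=1), proj_len**2 + dist**2)
```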

4.1   Singular Vectors

We now define the singular vectors of an n × d matrix A. Consider the rows of A as n points in a d-dimensional space. Consider the best fit line through the origin. Let v be a unit vector along this line. The length of the projection of a_i, the i-th row of A, onto v is |a_i · v|. From this we see that the sum of the squared lengths of the projections is |Av|^2. The best fit line is the one maximizing |Av|^2 and hence minimizing the sum of the squared distances of the points to the line. With this in mind, define the first singular vector v_1 of A, which is a column vector, as the best fit line through the origin for the n points in d-space that are the rows of A. Thus

$$ v_1 = \arg\max_{|v|=1} |Av| . $$

The value σ_1(A) = |Av_1| is called the first singular value of A. Note that σ_1^2 is the sum of the squares of the lengths of the projections of the points onto the line determined by v_1.
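
For a small numerical illustration (a sketch assuming numpy; the matrix is an arbitrary example), the first right singular vector returned by a standard SVD routine attains |Av_1| = σ_1, and random unit vectors do not beat it:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 4))            # 10 points in 4-dimensional space

_, s, Vt = np.linalg.svd(A)
v1 = Vt[0]                              # first right singular vector

assert np.isclose(np.linalg.norm(A @ v1), s[0])    # |A v1| = sigma_1

# No random unit vector should do better (up to rounding error).
for _ in range(1000):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= s[0] + 1e-9
```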

The greedy approach to finding the best fit 2-dimensional subspace for a matrix A takes v_1 as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional subspace containing v_1. The fact that we are using the sum of squared distances helps. For every 2-dimensional subspace containing v_1, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto v_1 plus the sum of squared projections along a vector perpendicular to v_1 in the subspace. Thus, instead of looking for the best 2-dimensional subspace containing v_1, look for a unit vector, call it v_2, perpendicular to v_1 that maximizes |Av|^2 among all such unit vectors. Using the same greedy strategy to find the best three and higher dimensional subspaces defines v_3, v_4, ... in a similar manner. This is captured in the following definitions.

There is no a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the best-fit subspaces of every dimension.

The second singular vector, v_2, is defined by the best fit line perpendicular to v_1:

$$ v_2 = \arg\max_{v \perp v_1,\ |v|=1} |Av| . $$

The value σ_2(A) = |Av_2| is called the second singular value of A. The third singular vector v_3 is defined similarly by

$$ v_3 = \arg\max_{v \perp v_1, v_2,\ |v|=1} |Av| $$

and so on. The process stops when we have found v_1, v_2, ..., v_r as singular vectors and

$$ \max_{v \perp v_1, v_2, \ldots, v_r,\ |v|=1} |Av| = 0 . $$
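
Continuing the illustration (again a sketch with numpy and an arbitrary matrix), σ_2 can be approached by searching only over unit vectors perpendicular to v_1:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(10, 4))
_, s, Vt = np.linalg.svd(A)
v1 = Vt[0]

# Sample random directions, project them onto the orthogonal complement of v1,
# normalize, and keep the largest |Av| found.
best = 0.0
for _ in range(20000):
    v = rng.normal(size=4)
    v -= (v @ v1) * v1
    v /= np.linalg.norm(v)
    best = max(best, np.linalg.norm(A @ v))

print(best, s[1])    # best approaches sigma_2 from below
```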

If instead of finding the v_1 that maximized |Av| and then the best fit 2-dimensional subspace containing v_1, we had found the best fit 2-dimensional subspace, we might have done better. This is not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every dimension.

Theorem 4.1 Let A be an n × d matrix where v_1, v_2, ..., v_r are the singular vectors defined above. For 1 ≤ k ≤ r, let V_k be the subspace spanned by v_1, v_2, ..., v_k. Then for each k, V_k is the best-fit k-dimensional subspace for A.

Proof: The statement is obviously true for k = 1. For k = 2, let W be a best-fit 2-dimensional subspace for A. For any basis w_1, w_2 of W, |Aw_1|^2 + |Aw_2|^2 is the sum of squared lengths of the projections of the rows of A onto W. Now, choose a basis w_1, w_2 of W so that w_2 is perpendicular to v_1. If v_1 is perpendicular to W, any unit vector in W will do as w_2. If not, choose w_2 to be the unit vector in W perpendicular to the projection of v_1 onto W. Since v_1 was chosen to maximize |Av_1|^2, it follows that |Aw_1|^2 ≤ |Av_1|^2. Since v_2 was chosen to maximize |Av|^2 over all v perpendicular to v_1, |Aw_2|^2 ≤ |Av_2|^2. Thus

$$ |Aw_1|^2 + |Aw_2|^2 \le |Av_1|^2 + |Av_2|^2 . $$

Hence, V_2 is at least as good as W and so is a best-fit 2-dimensional subspace.

For general k, proceed by induction. By the induction hypothesis, V_{k-1} is a best-fit (k−1)-dimensional subspace. Suppose W is a best-fit k-dimensional subspace. Choose a basis w_1, w_2, ..., w_k of W so that w_k is perpendicular to v_1, v_2, ..., v_{k-1}. Then

$$ |Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_k|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2 + |Aw_k|^2 , $$

since V_{k-1} is an optimal (k−1)-dimensional subspace. Since w_k is perpendicular to v_1, v_2, ..., v_{k-1}, by the definition of v_k, |Aw_k|^2 ≤ |Av_k|^2. Thus

$$ |Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 + |Aw_k|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2 + |Av_k|^2 , $$

proving that V_k is at least as good as W and hence is optimal.

Note that the n-vector Av_i is really a list of lengths (with signs) of the projections of the rows of A onto v_i. Think of |Av_i| = σ_i(A) as the component of the matrix A along v_i. For this interpretation to make sense, it should be true that adding up the squares of the components of A along each of the v_i gives the square of the whole content of the matrix A. This is indeed the case and is the matrix analogy of decomposing a vector into its components along orthogonal directions.

Consider one row, say a_j, of A. Since v_1, v_2, ..., v_r span the space of all rows of A, $a_j \cdot v = 0$ for all v perpendicular to v_1, v_2, ..., v_r. Thus, for each row a_j, $\sum_{i=1}^{r} (a_j \cdot v_i)^2 = |a_j|^2$. Summing over all rows j,

$$ \sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{i=1}^{r} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} \sum_{j=1}^{n} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} |A v_i|^2 = \sum_{i=1}^{r} \sigma_i^2(A) . $$

But $\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{k=1}^{d} a_{jk}^2$, the sum of squares of all the entries of A. Thus, the sum of squares of the singular values of A is indeed the square of the whole content of A, i.e., the sum of squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm of A, denoted ||A||_F, defined as

$$ \|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2} . $$

Lemma 4.2 For any matrix A, the sum of squares of the singular values equals the square of the Frobenius norm. That is, $\sum_i \sigma_i^2(A) = \|A\|_F^2$.

Proof: By the preceding discussion.

A matrix A can be described fully by how it transforms the vectors v_i. Every vector v can be written as a linear combination of v_1, v_2, ..., v_r and a vector perpendicular to all the v_i. Now, Av is the same linear combination of Av_1, Av_2, ..., Av_r as v is of v_1, v_2, ..., v_r. So Av_1, Av_2, ..., Av_r form a fundamental set of vectors associated with A. We normalize them to length one by

$$ u_i = \frac{1}{\sigma_i(A)} A v_i . $$

The vectors u_1, u_2, ..., u_r are called the left singular vectors of A. The v_i are called the right singular vectors. The SVD theorem (Theorem 4.5) will fully explain the reason for these terms. Clearly, the right singular vectors are orthogonal by definition. We now show that the left singular vectors are also orthogonal and that $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$.

Theorem 4.3 Let A be a rank r matrix. The left singular vectors of A, u_1, u_2, ..., u_r, are orthogonal.

Proof: The proof is by induction on r. For r = 1, there is only one u_i, so the theorem is trivially true. For the inductive part, consider the matrix

$$ B = A - \sigma_1 u_1 v_1^T . $$

The implied algorithm in the definition of singular value decomposition applied to B is identical to a run of the algorithm on A for its second and later singular vectors and singular values. To see this, first observe that $B v_1 = A v_1 - \sigma_1 u_1 v_1^T v_1 = 0$. It then follows that the first right singular vector, call it z, of B will be perpendicular to v_1, since if it had a component z_1 along v_1, then $\left| B \frac{z - z_1}{|z - z_1|} \right| = \frac{|Bz|}{|z - z_1|} > |Bz|$, contradicting the arg max definition of z. But for any v perpendicular to v_1, Bv = Av. Thus, the top singular vector of B is indeed a second singular vector of A. Repeating this argument shows that a run of the algorithm on B is the same as a run on A for its second and later singular vectors; this is left as an exercise. Thus, there is a run of the algorithm that finds that B has right singular vectors v_2, v_3, ..., v_r and corresponding left singular vectors u_2, u_3, ..., u_r. By the induction hypothesis, u_2, u_3, ..., u_r are orthogonal.

It remains to prove that u_1 is orthogonal to the other u_i. Suppose not, and for some i ≥ 2, $u_1^T u_i \ne 0$. Without loss of generality assume that $u_1^T u_i > 0$; the proof is symmetric for the case $u_1^T u_i < 0$. Now, for infinitesimally small ε > 0, the vector

$$ A \left( \frac{v_1 + \varepsilon v_i}{|v_1 + \varepsilon v_i|} \right) = \frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1 + \varepsilon^2}} $$

has length at least as large as its component along u_1, which is

$$ u_1^T \left( \frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1 + \varepsilon^2}} \right) = \left( \sigma_1 + \varepsilon \sigma_i u_1^T u_i \right) \left( 1 - \frac{\varepsilon^2}{2} + O(\varepsilon^4) \right) = \sigma_1 + \varepsilon \sigma_i u_1^T u_i - O(\varepsilon^2) > \sigma_1 , $$

a contradiction. Thus, u_1, u_2, ..., u_r are orthogonal.
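
A quick numerical check of Theorem 4.3 and of Lemma 4.2 (a sketch assuming numpy; the matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The left singular vectors are orthonormal: U^T U = I  (Theorem 4.3).
assert np.allclose(U.T @ U, np.eye(4))

# The sum of squared singular values equals the squared Frobenius norm (Lemma 4.2).
assert np.isclose((s**2).sum(), np.linalg.norm(A, 'fro')**2)
```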


4.2   Singular Value Decomposition (SVD)

Let A be an n × d matrix with singular vectors v_1, v_2, ..., v_r and corresponding singular values σ_1, σ_2, ..., σ_r. Then u_i = (1/σ_i) A v_i, for i = 1, 2, ..., r, are the left singular vectors and, by Theorem 4.5, A can be decomposed into a sum of rank one matrices as

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T . $$

We first prove a simple lemma stating that two matrices A and B are identical if Av = Bv for all v. The lemma states that in the abstract, a matrix A can be viewed as a transformation that maps a vector v onto Av.

Lemma 4.4 Matrices A and B are identical if and only if for all vectors v, Av = Bv.

Proof: Clearly, if A = B then Av = Bv for all v. For the converse, suppose that Av = Bv for all v. Let e_i be the vector that is all zeros except for the i-th component, which has value 1. Now Ae_i is the i-th column of A, and thus A = B if for each i, Ae_i = Be_i.

Theorem 4.5 Let A be an n × d matrix with right singular vectors v_1, v_2, ..., v_r, left singular vectors u_1, u_2, ..., u_r, and corresponding singular values σ_1, σ_2, ..., σ_r. Then

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T . $$

Proof: For each singular vector v_j, $A v_j = \sum_{i=1}^{r} \sigma_i u_i v_i^T v_j$. Since any vector v can be expressed as a linear combination of the singular vectors plus a vector perpendicular to the v_i, $A v = \sum_{i=1}^{r} \sigma_i u_i v_i^T v$, and by Lemma 4.4, $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$.

The decomposition is called the singular value decomposition, SVD, of A. In matrix notation, A = UDV^T, where the columns of U and V consist of the left and right singular vectors, respectively, and D is a diagonal matrix whose diagonal entries are the singular values of A. For any matrix A, the sequence of singular values is unique, and if the singular values are all distinct, then the sequence of singular vectors is unique also. However, when some set of singular values are equal, the corresponding singular vectors span some subspace. Any set of orthonormal vectors spanning this subspace can be used as the singular vectors.
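
In code, the matrix form of the decomposition, the rank observation, and the inverse formula from the start of the chapter can all be checked directly (a minimal sketch assuming numpy; the low-rank-plus-rounding construction and the rank threshold are illustrative choices, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(3)
# A 50 x 30 matrix of rank 5 whose entries are then rounded to 3 decimals.
A = np.round(rng.normal(size=(50, 5)) @ rng.normal(size=(5, 30)), 3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = U D V^T, equivalently the sum of rank-one matrices sigma_i u_i v_i^T.
assert np.allclose(A, U @ np.diag(s) @ Vt)
assert np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))))

# Rounding made A technically full rank, but only the diagonal entries of D
# not close to zero count: this should print 5, the rank before rounding.
print(np.sum(s > 1e-3 * s[0]))

# For a square invertible matrix, the inverse is V D^{-1} U^T.
M = rng.normal(size=(4, 4))
Um, sm, Vmt = np.linalg.svd(M)
assert np.allclose(np.linalg.inv(M), Vmt.T @ np.diag(1.0 / sm) @ Um.T)
```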


Figure 4.2: The SVD decomposition of an n × d matrix: A (n × d) = U (n × r) · D (r × r) · V^T (r × d).

4.3   Best Rank k Approximations

There are two important matrix norms, the Frobenius norm denoted ||A||_F and the 2-norm denoted ||A||_2. The 2-norm of the matrix A is given by

$$ \max_{|v|=1} |Av| $$

and thus equals the largest singular value of the matrix.

Let A be an n × d matrix and think of the rows of A as n points in d-dimensional space. The Frobenius norm of A is the square root of the sum of the squared distances of the points to the origin. The 2-norm is the square root of the sum of squared distances to the origin along the direction that maximizes this quantity.

Let

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T $$

be the SVD of A. For k ∈ {1, 2, ..., r}, let

$$ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T $$

be the sum truncated after k terms. It is clear that A_k has rank k. Furthermore, A_k is the best rank k approximation to A when the error is measured in either the 2-norm or the Frobenius norm.

Lemma 4.6 The rows of A_k are the projections of the rows of A onto the subspace V_k spanned by the first k singular vectors of A.

Proof: Let a be an arbitrary row vector. Since the v_i are orthonormal, the projection of the vector a onto V_k is given by $\sum_{i=1}^{k} (a \cdot v_i) v_i^T$. Thus, the matrix whose rows are the projections of the rows of A onto V_k is given by $\sum_{i=1}^{k} A v_i v_i^T$. This last expression simplifies to

$$ \sum_{i=1}^{k} A v_i v_i^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T = A_k . $$
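
The lemma is easy to confirm numerically (a sketch assuming numpy; the matrix and the choice of k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # truncated SVD

# Projecting the rows of A onto the span of the first k right singular vectors
# reproduces the rows of Ak (Lemma 4.6).
P = Vt[:k, :].T @ Vt[:k, :]                     # projection matrix onto V_k
assert np.allclose(A @ P, Ak)
```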

The matrix A_k is the best rank k approximation to A in both the Frobenius norm and the 2-norm. First we show that the matrix A_k is the best rank k approximation to A in the Frobenius norm.

Theorem 4.7 For any matrix B of rank at most k,

$$ \|A - A_k\|_F \le \|A - B\|_F . $$

Proof: Let B minimize $\|A - B\|_F^2$ among all rank k or less matrices. Let V be the space spanned by the rows of B. The dimension of V is at most k. Since B minimizes $\|A - B\|_F^2$, it must be that each row of B is the projection of the corresponding row of A onto V; otherwise, replacing the row of B with the projection of the corresponding row of A onto V does not change V, and hence the rank of B, but would reduce $\|A - B\|_F^2$. Since each row of B is the projection of the corresponding row of A, it follows that $\|A - B\|_F^2$ is the sum of squared distances of rows of A to V. Since A_k minimizes the sum of squared distances of rows of A to any k-dimensional subspace, it follows that $\|A - A_k\|_F \le \|A - B\|_F$.

Next we tackle the 2-norm. We first show that the square of the 2-norm of A − A_k is the square of the (k+1)st singular value of A.

Lemma 4.8 $\|A - A_k\|_2^2 = \sigma_{k+1}^2$.

Proof: Let $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ be the singular value decomposition of A. Then $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ and $A - A_k = \sum_{i=k+1}^{r} \sigma_i u_i v_i^T$. Let v be the top singular vector of A − A_k. Express v as a linear combination of v_1, v_2, ..., v_r. That is, write $v = \sum_{i=1}^{r} \alpha_i v_i$. Then

$$ |(A - A_k) v| = \left| \sum_{i=k+1}^{r} \sigma_i u_i v_i^T \sum_{j=1}^{r} \alpha_j v_j \right| = \left| \sum_{i=k+1}^{r} \alpha_i \sigma_i u_i v_i^T v_i \right| = \left| \sum_{i=k+1}^{r} \alpha_i \sigma_i u_i \right| = \sqrt{\sum_{i=k+1}^{r} \alpha_i^2 \sigma_i^2} . $$

The v maximizing this last quantity, subject to the constraint that $|v|^2 = \sum_{i=1}^{r} \alpha_i^2 = 1$, occurs when $\alpha_{k+1} = 1$ and the rest of the $\alpha_i$ are 0. Thus, $\|A - A_k\|_2^2 = \sigma_{k+1}^2$, proving the lemma.

Finally, we prove that A_k is the best rank k 2-norm approximation to A.

Theorem 4.9 Let A be an n × d matrix. For any matrix B of rank at most k,

$$ \|A - A_k\|_2 \le \|A - B\|_2 . $$

Proof: If A is of rank k or less, the theorem is obviously true since $\|A - A_k\|_2 = 0$. Thus assume that A is of rank greater than k. By Lemma 4.8, $\|A - A_k\|_2^2 = \sigma_{k+1}^2$. Now suppose there is some matrix B of rank at most k such that B is a better 2-norm approximation to A than A_k. That is, $\|A - B\|_2 < \sigma_{k+1}$. The null space of B, Null(B) (the set of vectors v such that Bv = 0), has dimension at least d − k. Let v_1, v_2, ..., v_{k+1} be the first k + 1 singular vectors of A. By a dimension argument, it follows that there exists a z ≠ 0 in

$$ \mathrm{Null}(B) \cap \mathrm{Span}\{v_1, v_2, \ldots, v_{k+1}\} . $$

Scale z so that |z| = 1. We now show that for this vector z, which lies in the space of the first k + 1 singular vectors of A, $|(A - B) z| \ge \sigma_{k+1}$. Hence the 2-norm of A − B is at least σ_{k+1}, contradicting the assumption that $\|A - B\|_2 < \sigma_{k+1}$. First,

$$ \|A - B\|_2^2 \ge |(A - B) z|^2 . $$

Since Bz = 0,

$$ \|A - B\|_2^2 \ge |Az|^2 . $$

Since z is in the Span{v_1, v_2, ..., v_{k+1}},

$$ |Az|^2 = \left| \sum_{i=1}^{n} \sigma_i u_i v_i^T z \right|^2 = \sum_{i=1}^{n} \sigma_i^2 (v_i^T z)^2 = \sum_{i=1}^{k+1} \sigma_i^2 (v_i^T z)^2 \ge \sigma_{k+1}^2 \sum_{i=1}^{k+1} (v_i^T z)^2 = \sigma_{k+1}^2 . $$

It follows that

$$ \|A - B\|_2^2 \ge \sigma_{k+1}^2 , $$

contradicting the assumption that $\|A - B\|_2 < \sigma_{k+1}$. This proves the theorem.
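
Numerically, the truncated SVD and the error characterizations of Lemma 4.8, Theorem 4.7, and Theorem 4.9 can be checked directly (a sketch assuming numpy; the matrix, k, and the competing rank-k matrix B are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(20, 12))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation

# 2-norm error equals sigma_{k+1}  (Lemma 4.8).
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])

# Frobenius error equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2)  (via Lemma 4.2).
assert np.isclose(np.linalg.norm(A - Ak, 'fro'), np.sqrt((s[k:]**2).sum()))

# An arbitrary rank-k matrix B does no better in either norm (Theorems 4.7, 4.9).
B = rng.normal(size=(20, k)) @ rng.normal(size=(k, 12))
assert np.linalg.norm(A - Ak, 2) <= np.linalg.norm(A - B, 2)
assert np.linalg.norm(A - Ak, 'fro') <= np.linalg.norm(A - B, 'fro')
```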

4.4   Power Method for Computing the Singular Value Decomposition

Computing the singular value decomposition is an important branch of numerical analysis in which there have been many sophisticated developments over a long period of time. Here we present an in-principle method to establish that the approximate SVD of a matrix A can be computed in polynomial time. The reader is referred to numerical analysis texts for more details. The method we present, called the Power Method, is conceptually simple. The word power refers to taking high powers of the matrix B = AA^T. If the SVD of A is $\sum_i \sigma_i u_i v_i^T$, then by direct multiplication

$$ B = AA^T = \left( \sum_i \sigma_i u_i v_i^T \right) \left( \sum_j \sigma_j v_j u_j^T \right) = \sum_{i,j} \sigma_i \sigma_j u_i v_i^T v_j u_j^T = \sum_{i,j} \sigma_i \sigma_j u_i (v_i^T v_j) u_j^T = \sum_i \sigma_i^2 u_i u_i^T , $$

since $v_i^T v_j$ is the dot product of the two vectors and is zero unless i = j. [Caution: $u_i u_j^T$ is a matrix and is not zero even for i ≠ j.] Using the same kind of calculation,

$$ B^k = \sum_i \sigma_i^{2k} u_i u_i^T . $$
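
These identities are straightforward to confirm numerically (a sketch assuming numpy; the matrix and the exponent are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B = A @ A.T
# B = sum_i sigma_i^2 u_i u_i^T
assert np.allclose(B, (U * s**2) @ U.T)
# B^k = sum_i sigma_i^{2k} u_i u_i^T, here for k = 3
assert np.allclose(np.linalg.matrix_power(B, 3), (U * s**6) @ U.T)
```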

As k increases, for i > 1, $\sigma_i^{2k}/\sigma_1^{2k}$ goes to zero and B^k is approximately equal to

$$ \sigma_1^{2k} u_1 u_1^T , $$

provided that for each i > 1, σ_i(A) < σ_1(A). This suggests a way of finding σ_1 and u_1, by successively powering B. But there are two issues. First, if there is a significant gap between the first and second singular values of a matrix, then the above argument applies and the power method will quickly converge to the first left singular vector. Suppose there is no significant gap. In the extreme case, there may be ties for the top singular value. Then the above argument does not work. We overcome this problem in Theorem 4.11 below, which states that even with ties, the power method converges to some vector in the span of those singular vectors corresponding to the nearly highest singular values.

A second issue is that computing B^k costs k matrix multiplications when done in a straightforward manner, or O(log k) when done by successive squaring. Instead we compute B^k x, where x is a random unit length vector. Each increase in k requires a matrix-vector product, which takes time proportional to the number of nonzero entries in B. Further saving may be achieved by writing

$$ B^k x = A \left( A^T \left( B^{k-1} x \right) \right) . $$

Now the cost is proportional to the number of nonzero entries in A. Since $B^k x \approx \sigma_1^{2k} u_1 (u_1^T x)$ is a scalar multiple of u_1, u_1 can be recovered from B^k x by normalization.
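
The following is a minimal sketch of the power method as described above (assuming numpy; the iteration count is an arbitrary choice, and no stopping rule or gap check is included, so this is an illustration rather than a robust implementation):

```python
import numpy as np

def power_method(A, iters=500, rng=None):
    """Approximate the top left singular vector u1 and singular value sigma_1
    by repeatedly applying B = A A^T to a random unit vector."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ (A.T @ x)           # one application of B = A A^T, using only A
        x /= np.linalg.norm(x)      # renormalize to avoid overflow/underflow
    sigma1 = np.linalg.norm(A.T @ x)    # |A^T u1| = sigma_1
    return x, sigma1

# Compare against numpy's SVD on an arbitrary test matrix.
rng = np.random.default_rng(6)
A = rng.normal(size=(30, 20))
u1, s1 = power_method(A, rng=rng)
U, s, _ = np.linalg.svd(A)
print(abs(s1 - s[0]))                                                   # should be very small
print(min(np.linalg.norm(u1 - U[:, 0]), np.linalg.norm(u1 + U[:, 0])))  # sign ambiguity in u1
```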


Figure 4.3: The volume of the cylinder of height 1/(20√d) is an upper bound on the volume of the hemisphere below x_1 = 1/(20√d).

We start with a technical lemma needed in the proof of the theorem.

Lemma 4.10 Let (x_1, x_2, ..., x_d) be a unit d-dimensional vector picked at random. The probability that $|x_1| \ge \frac{1}{20\sqrt{d}}$ is at least 9/10.

Proof: We first show that for a vector v picked at random with |v| ≤ 1, the probability that $|v_1| \ge \frac{1}{20\sqrt{d}}$ is at least 9/10. Then we let x = v/|v|. This can only increase the value of |v_1|, so the result follows.

Let $\alpha = \frac{1}{20\sqrt{d}}$. The probability that |v_1| ≥ α equals one minus the probability that |v_1| ≤ α. The probability that |v_1| ≤ α is equal to the fraction of the volume of the unit sphere with |v_1| ≤ α. To get an upper bound on the volume of the sphere with |v_1| ≤ α, consider twice the volume of the unit radius cylinder of height α. The volume of the portion of the sphere with |v_1| ≤ α is less than or equal to 2α V(d−1), where V(d−1) denotes the volume of the (d−1)-dimensional unit ball, and

$$ \operatorname{Prob}(|v_1| \le \alpha) \le \frac{2 \alpha V(d-1)}{V(d)} . $$

Now the volume of the unit radius sphere is at least twice the volume of the cylinder of height $\frac{1}{\sqrt{d-1}}$ and radius $\sqrt{1 - \frac{1}{d-1}}$, or

$$ V(d) \ge \frac{2}{\sqrt{d-1}} \, V(d-1) \left( 1 - \frac{1}{d-1} \right)^{\frac{d-1}{2}} . $$

Using $(1-x)^a \ge 1 - ax$,

$$ V(d) \ge \frac{2}{\sqrt{d-1}} \, V(d-1) \left( 1 - \frac{d-1}{2(d-1)} \right) = \frac{V(d-1)}{\sqrt{d-1}} $$

and

$$ \operatorname{Prob}(|v_1| \le \alpha) \le \frac{2 V(d-1)}{20 \sqrt{d}} \cdot \frac{\sqrt{d-1}}{V(d-1)} = \frac{\sqrt{d-1}}{10 \sqrt{d}} \le \frac{1}{10} . $$

Thus the probability that $|v_1| \ge \frac{1}{20\sqrt{d}}$ is at least 9/10.
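
A quick Monte Carlo check of the lemma (a sketch assuming numpy; the dimension and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
d, trials = 100, 100_000

# Random unit vectors: normalize Gaussian samples.
X = rng.normal(size=(trials, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

frac = np.mean(np.abs(X[:, 0]) >= 1 / (20 * np.sqrt(d)))
print(frac)    # should be well above 0.9; the lemma's bound is not tight
```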

Theorem 4.11 Let A be an n × d matrix and x a random unit length vector. Let V be the space spanned by the left singular vectors of A corresponding to singular values greater than (1 − ε)σ_1. Let k be Ω(ln(n/ε)/ε). Let w be the unit vector after k iterations of the power method, namely,

$$ w = \frac{\left( A A^T \right)^k x}{\left| \left( A A^T \right)^k x \right|} . $$

The probability that w has a component of at least ε perpendicular to V is at most 1/10.

Proof: Let

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T $$

be the SVD of A. If the rank of A is less than n, then complete {u_1, u_2, ..., u_r} into a basis {u_1, u_2, ..., u_n} of n-space. Write x in the basis of the u_i's as

$$ x = \sum_{i=1}^{n} c_i u_i . $$

Since $(AA^T)^k = \sum_{i=1}^{n} \sigma_i^{2k} u_i u_i^T$, it follows that $(AA^T)^k x = \sum_{i=1}^{n} \sigma_i^{2k} c_i u_i$. For a random unit length vector x picked independent of A, the u_i are fixed vectors, and picking x at random is equivalent to picking random c_i. From Lemma 4.10, $|c_1| \ge \frac{1}{20\sqrt{n}}$ with probability at least 9/10.

Suppose that σ_1, σ_2, ..., σ_m are the singular values of A that are greater than or equal to (1 − ε)σ_1 and that σ_{m+1}, ..., σ_n are the singular values that are less than (1 − ε)σ_1. Now

$$ \left| (AA^T)^k x \right|^2 = \left| \sum_{i=1}^{n} \sigma_i^{2k} c_i u_i \right|^2 = \sum_{i=1}^{n} \sigma_i^{4k} c_i^2 \ge \sigma_1^{4k} c_1^2 \ge \frac{1}{400 n} \sigma_1^{4k} , $$

with probability at least 9/10. Here we used the fact that a sum of positive quantities is at least as large as its first element, and the first element is greater than or equal to $\frac{1}{400 n} \sigma_1^{4k}$ because $c_1^2 \ge \frac{1}{400 n}$ with probability at least 9/10.
