Chapter 11
Dimensionality Reduction
There are many sources of data that can be viewed as a large matrix. We
saw in Chapter 5 how the Web can be represented as a transition matrix. In
Chapter 9, the utility matrix was a point of focus. And in Chapter 10 we
examined matrices that represent social networks. In many of these matrix
applications, the matrix can be summarized by finding narrower matrices
that in some sense are close to the original. These narrow matrices have only a
small number of rows or a small number of columns, and therefore can be used
much more efficiently than can the original large matrix. The process of finding
these narrow matrices is called dimensionality reduction.
We saw a preliminary example of dimensionality reduction in Section 9.4.
There, we discussed UV-decomposition of a matrix and gave a simple algorithm
for finding this decomposition. Recall that a large matrix M was decomposed
into two matrices U and V whose product U V was approximately M . The
matrix U had a small number of columns whereas V had a small number of rows,
so each was significantly smaller than M , and yet together they represented
most of the information in M that was useful in predicting ratings of items by
individuals.
In this chapter we shall explore the idea of dimensionality reduction in
more detail. We begin with a discussion of eigenvalues and their use in principal component analysis (PCA). We cover singular-value decomposition, a
more powerful version of UV-decomposition. Finally, because we are always
interested in the largest data sizes we can handle, we look at another form
of decomposition, called CUR-decomposition, which is a variant of singular-value decomposition that keeps the matrices of the decomposition sparse if the
original matrix is sparse.
11.1 Eigenvalues and Eigenvectors of Symmetric Matrices
We shall assume that you are familiar with the basics of matrix algebra: multiplication, transpose, determinants, and solving linear equations, for example. In this section, we define eigenvalues and eigenvectors of symmetric matrices and show how to find them.
11.1.1 Definitions

Let M be a square matrix, let λ be a constant, and let e be a nonzero column vector with the same number of rows as M. Then λ is an eigenvalue of M and e is its corresponding eigenvector if Me = λe. To make the choice of an eigenvector unique, we shall require that every eigenvector be a unit vector and that its first nonzero component be positive.

Example 11.1 : Let

    M = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}

One of the eigenvalues of M is 7, and its corresponding eigenvector is

    \begin{bmatrix} 1/√5 \\ 2/√5 \end{bmatrix}

since

    \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} \begin{bmatrix} 1/√5 \\ 2/√5 \end{bmatrix} = \begin{bmatrix} 7/√5 \\ 14/√5 \end{bmatrix} = 7 \begin{bmatrix} 1/√5 \\ 2/√5 \end{bmatrix}

Also observe that the eigenvector is a unit vector, because (1/√5)² + (2/√5)² = 1/5 + 4/5 = 1. □
11.1.2 Computing Eigenvalues and Eigenvectors
We have already seen one approach to finding an eigenpair (an eigenvalue and its corresponding eigenvector) for a suitable matrix M in Section 5.1: start with any unit vector v of the appropriate length and compute M^i v iteratively until it converges.
Example 11.2 : Let us find the eigenpairs for the matrix of Example 11.1,

    M = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}

Setting the determinant of M − λI equal to 0 gives (3 − λ)(6 − λ) − 4 = 0, that is, λ² − 9λ + 14 = 0, whose roots are λ = 7 and λ = 2. For the principal eigenvalue 7, we must solve

    \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 7 \begin{bmatrix} x \\ y \end{bmatrix}

When we multiply out, we get the two equations

    3x + 2y = 7x
    2x + 6y = 7y
Notice that both of these equations really say the same thing: y = 2x. Thus, a possible eigenvector is

    \begin{bmatrix} 1 \\ 2 \end{bmatrix}

But that vector is not a unit vector, since the sum of the squares of its components is 5, not 1. Thus to get the unit vector in the same direction, we divide each component by √5. That is, the principal eigenvector is

    \begin{bmatrix} 1/√5 \\ 2/√5 \end{bmatrix}

and its eigenvalue is 7. Note that this was the eigenpair we explored in Example 11.1.

For the second eigenpair, we repeat the above with eigenvalue 2 in place of 7. The equation involving the components of e is x = −2y, and the second eigenvector is

    \begin{bmatrix} 2/√5 \\ -1/√5 \end{bmatrix}

Its corresponding eigenvalue is 2, of course. □
11.1.3 Finding Eigenpairs by Power Iteration
We compute the principal eigenvector (the one associated with the largest eigenvalue) first, then modify the matrix to, in effect, remove that eigenvector; the principal eigenvector of the modified matrix is the second eigenvector (the eigenvector with the second-largest eigenvalue) of the original matrix. The process proceeds in that manner, removing each eigenvector as we find it, and then using power iteration to find the principal eigenvector of the matrix that remains.
Let M be the matrix whose eigenpairs we would like to find. Start with any nonzero vector x₀ and then iterate:

    x_{k+1} := \frac{M x_k}{\|M x_k\|}

where ‖N‖ for a matrix or vector N denotes the Frobenius norm; that is, the square root of the sum of the squares of the elements of N. We multiply the current vector x_k by the matrix M until convergence (i.e., ‖x_k − x_{k+1}‖ is less than some small, chosen constant). Let x be x_k for that value of k at which convergence is obtained. Then x is (approximately) the principal eigenvector of M. To obtain the corresponding eigenvalue we simply compute λ₁ = xᵀMx, which is the equation Mx = λx solved for λ, since x is a unit vector.
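As a concrete illustration (not code from the book), here is a minimal sketch of power iteration in Python with NumPy, applied to the matrix of Example 11.2; the starting vector, tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Return (approximate principal eigenvalue, unit eigenvector) of M."""
    x = np.ones(M.shape[0])
    x = x / np.linalg.norm(x)            # start from any unit vector
    for _ in range(max_iter):
        y = M @ x
        y = y / np.linalg.norm(y)        # x_{k+1} = M x_k / ||M x_k||
        if np.linalg.norm(x - y) < tol:  # stop when the vector barely changes
            x = y
            break
        x = y
    eigenvalue = x @ M @ x               # lambda = x^T M x, since x is a unit vector
    return eigenvalue, x

M = np.array([[3.0, 2.0], [2.0, 6.0]])
lam, e = power_iteration(M)
print(lam, e)   # approximately 7.0 and [0.447, 0.894]
```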
Example 11.3 : Take the matrix from Example 11.2:

    M = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}

and let us start with x₀ a vector with 1 for both components. To compute x₁, we multiply M x₀ to get

    \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 8 \end{bmatrix}

The Frobenius norm of the result is √(5² + 8²) = √89 = 9.434, so we divide each component by 9.434 to obtain

    x₁ = \begin{bmatrix} 0.530 \\ 0.848 \end{bmatrix}
If we continue the iteration, the unit vectors converge to approximately x = [0.447, 0.894]ᵀ, and the estimate of the principal eigenvalue is λ₁ = xᵀMx = 6.993. Recall from Example 11.2 that the true principal eigenvalue is 7. Power iteration will introduce small errors due either to limited precision, as was the case here, or due to the fact that we stop the iteration before reaching the exact value of the eigenvector. When we computed PageRank, the small inaccuracies did not matter, but when we try to compute all eigenpairs, inaccuracies accumulate if we are not careful. □
To find the second eigenpair we create a new matrix M* = M − λ₁xxᵀ. Then, use power iteration on M* to compute its largest eigenvalue. The obtained x* and λ* correspond to the second-largest eigenvalue and the corresponding eigenvector of the matrix M.

Intuitively, what we have done is eliminate the influence of a given eigenvector by setting its associated eigenvalue to zero. The formal justification is the following two observations. If M* = M − λxxᵀ, where x and λ are the eigenpair with the largest eigenvalue, then:

1. x is also an eigenvector of M*, and its corresponding eigenvalue is 0. In proof, observe that

    M*x = (M − λxxᵀ)x = Mx − λx(xᵀx) = Mx − λx = 0

   At the second-to-last step we use the fact that xᵀx = 1, because x is a unit vector, and at the last step the fact that Mx = λx.

2. If v and λᵥ are an eigenpair of M other than the first eigenpair (x, λ), then they are also an eigenpair of M*. In proof, observe that

    M*v = (M*)ᵀv = (M − λxxᵀ)ᵀv = Mᵀv − λx(xᵀv) = Mᵀv = λᵥv

   This sequence of equalities needs the following justifications:

(a) M and Mᵀ have the same set of eigenvalues and eigenvectors. Note that if M is symmetric, then M = Mᵀ; but even if M is not symmetric, the statement is true, although we do not prove it here.
(b) The eigenvectors of a matrix are orthogonal. That is, the dot product of any two distinct eigenvectors of a matrix is 0. We do not prove this statement here.
Example 11.4 : Continuing Example 11.3, we compute

    M* = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} - 6.993 \begin{bmatrix} 0.447 \\ 0.894 \end{bmatrix} \begin{bmatrix} 0.447 & 0.894 \end{bmatrix}
       = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} - \begin{bmatrix} 1.399 & 2.797 \\ 2.797 & 5.587 \end{bmatrix}
       = \begin{bmatrix} 1.601 & -0.797 \\ -0.797 & 0.413 \end{bmatrix}

We may find the second eigenpair by processing the matrix above as we did the original matrix M. □
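The whole procedure of Section 11.1.3 fits in a few lines. The sketch below (ours, not the book's) repeats power iteration and deflation to recover both eigenpairs of the running example; the tolerance and iteration limit are arbitrary choices.

```python
import numpy as np

def principal_eigenpair(M, tol=1e-10, max_iter=1000):
    """Power iteration: return (approximate largest eigenvalue, unit eigenvector)."""
    x = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(max_iter):
        y = M @ x
        norm = np.linalg.norm(y)
        if norm == 0:                  # remaining matrix is (numerically) zero
            return 0.0, x
        y = y / norm
        if np.linalg.norm(x - y) < tol:
            x = y
            break
        x = y
    return float(x @ M @ x), x         # eigenvalue estimate is x^T M x

def all_eigenpairs(M, k=None):
    """Find k eigenpairs of a symmetric matrix by power iteration plus deflation."""
    M = M.astype(float).copy()
    k = k if k is not None else M.shape[0]
    pairs = []
    for _ in range(k):
        lam, x = principal_eigenpair(M)
        pairs.append((lam, x))
        M = M - lam * np.outer(x, x)   # M* = M - lambda x x^T removes this eigenpair
    return pairs

M = np.array([[3.0, 2.0], [2.0, 6.0]])
for lam, x in all_eigenpairs(M):
    print(round(lam, 3), np.round(x, 3))   # approximately 7 and 2, as in Examples 11.2-11.4
```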
11.1.4 The Matrix of Eigenvectors

Suppose we have an n × n symmetric matrix M whose eigenvectors, viewed as column vectors, are e₁, e₂, . . . , eₙ. Let E be the matrix whose ith column is eᵢ. Because the eigenvectors of a symmetric matrix are orthonormal, E is an orthonormal matrix; that is, EEᵀ = EᵀE = I, the identity matrix. For the matrix of our running example, the eigenvectors found in Example 11.2 give

    E = \begin{bmatrix} 1/√5 & 2/√5 \\ 2/√5 & -1/√5 \end{bmatrix}

and EEᵀ is therefore

    \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
11.1.5 Exercises for Section 11.1
Exercise 11.1.1 : Find the unit vector in the same direction as the vector
[1, 2, 3].
Exercise 11.1.2 : Complete Example 11.4 by computing the principal eigenvector of the matrix that was constructed in this example. How close to the
correct solution (from Example 11.2) are you?
Exercise 11.1.3 : For any symmetric 3 × 3 matrix

    \begin{bmatrix} a-λ & b & c \\ b & d-λ & e \\ c & e & f-λ \end{bmatrix}

there is a cubic equation in λ that says the determinant of this matrix is 0. In terms of a through f, find this equation.
Exercise 11.1.4 : Find the eigenpairs for the following matrix:

    \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 5 \end{bmatrix}

using the method of Section 11.1.2.
! Exercise 11.1.5 : Find the eigenpairs for the following matrix:
    \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 6 \end{bmatrix}
using the method of Section 11.1.2.
Exercise 11.1.6 : For the matrix of Exercise 11.1.4:
(a) Starting with a vector of three 1s, use power iteration to find an approximate value of the principal eigenvector.
(b) Compute an estimate of the principal eigenvalue for the matrix.
(c) Construct a new matrix by subtracting out the effect of the principal
eigenpair, as in Section 11.1.3.
(d) From your matrix of (c), find the second eigenpair for the original matrix
of Exercise 11.1.4.
(e) Repeat (c) and (d) to find the third eigenpair for the original matrix.
Exercise 11.1.7 : Repeat Exercise 11.1.6 for the matrix of Exercise 11.1.5.
11.2 Principal-Component Analysis
Principal-component analysis, or PCA, is a technique for taking a dataset consisting of a set of tuples representing points in a high-dimensional space and
finding the directions along which the tuples line up best. The idea is to treat
the set of tuples as a matrix M and find the eigenvectors for M M T or M T M .
The matrix of these eigenvectors can be thought of as a rigid rotation in a high-dimensional space. When you apply this transformation to the original data,
the axis corresponding to the principal eigenvector is the one along which the
points are most spread out. More precisely, this axis is the one along which
the variance of the data is maximized. Put another way, the points can best be
viewed as lying along this axis, with small deviations from this axis. Likewise,
the axis corresponding to the second eigenvector (the eigenvector corresponding to the second-largest eigenvalue) is the axis along which the variance of
distances from the first axis is greatest, and so on.
11.2.1 An Illustrative Example
We shall start the exposition with a contrived and simple example. In this
example, the data is two-dimensional, a number of dimensions that is too small
to make PCA really useful. Moreover, the data, shown in Fig. 11.1, has only
four points, and they are arranged in a simple pattern along the 45-degree line
to make our calculations easy to follow. That is, to anticipate the result, the
points can best be viewed as lying along the axis that is at a 45-degree angle,
with small deviations in the perpendicular direction.
Figure 11.1: Four points in a two-dimensional space: (1,2), (2,1), (3,4), and (4,3)

To begin, we represent the points by a matrix M with one row per point and one column per dimension:

    M = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix}
Compute MᵀM, which is

    MᵀM = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix} = \begin{bmatrix} 30 & 28 \\ 28 & 30 \end{bmatrix}
We may find the eigenvalues of the matrix above by solving the equation

    (30 - λ)(30 - λ) - 28 × 28 = 0

The solutions are λ = 58 and λ = 2. For the principal eigenvalue 58, we solve

    \begin{bmatrix} 30 & 28 \\ 28 & 30 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 58 \begin{bmatrix} x \\ y \end{bmatrix}

Multiplying out, we get the two equations

    30x + 28y = 58x
    28x + 30y = 58y
Both equations tell us the same thing: x = y. Thus, the unit eigenvector
corresponding to the principal eigenvalue 58 is
    \begin{bmatrix} 1/√2 \\ 1/√2 \end{bmatrix}
For the second eigenvalue, 2, we perform the same process. Multiply out
    \begin{bmatrix} 30 & 28 \\ 28 & 30 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 2 \begin{bmatrix} x \\ y \end{bmatrix}

to get the two equations

    30x + 28y = 2x
    28x + 30y = 2y
Both equations tell us the same thing: x = −y. Thus, the unit eigenvector corresponding to the eigenvalue 2 is

    \begin{bmatrix} -1/√2 \\ 1/√2 \end{bmatrix}
While we promised to write eigenvectors with their first component positive,
we choose the opposite here because it makes the transformation of coordinates
easier to follow in this case.
Now, let us construct E, the matrix of eigenvectors for the matrix M T M .
Placing the principal eigenvector first, the matrix of eigenvectors is
    E = \begin{bmatrix} 1/√2 & -1/√2 \\ 1/√2 & 1/√2 \end{bmatrix}
Any matrix of orthonormal vectors (unit vectors that are orthogonal to one
another) represents a rotation of the axes of a Euclidean space. The matrix
above can be viewed as a rotation 45 degrees counterclockwise. For example,
let us multiply the matrix M that represents each of the points of Fig. 11.1 by
E. The product is
    ME = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix} \begin{bmatrix} 1/√2 & -1/√2 \\ 1/√2 & 1/√2 \end{bmatrix} = \begin{bmatrix} 3/√2 & 1/√2 \\ 3/√2 & -1/√2 \\ 7/√2 & 1/√2 \\ 7/√2 & -1/√2 \end{bmatrix}
Figure 11.2: Figure 11.1 with the axes rotated 45 degrees counterclockwise
We see the first point, [1, 2], has been transformed into the point

    [3/√2, 1/√2]

If we examine Fig. 11.2, with the dashed line representing the new x-axis, we see that the projection of the first point onto that axis places it at distance 3/√2 from the origin. To check this fact, notice that the point of projection for both the first and second points is [1.5, 1.5] in the original coordinate system, and the distance from the origin to this point is

    √((1.5)² + (1.5)²) = √4.5 = 3/√2
Figure 11.3: The points of Fig. 11.1 in the new coordinate system
The second point, [2, 1], happens by coincidence to project onto the same point of the new x-axis. It is 1/√2 below that axis along the new y-axis, as is confirmed by the fact that the second row in the matrix of transformed points is [3/√2, −1/√2]. The third point, [3, 4], is transformed into [7/√2, 1/√2] and the fourth point, [4, 3], is transformed to [7/√2, −1/√2]. That is, they both project onto the same point of the new x-axis, and that point is at distance 7/√2 from the origin, while they are 1/√2 above and below the new x-axis in the direction of the new y-axis.
11.2.2 Using Eigenvectors for Dimensionality Reduction
From the example we have just worked out, we can see a general principle. If
M is a matrix whose rows each represent a point in a Euclidean space with
any number of dimensions, we can compute M T M and compute its eigenpairs.
Let E be the matrix whose columns are the eigenvectors, ordered as largest
eigenvalue first. Define the matrix L to have the eigenvalues of M T M along
the diagonal, largest first, and 0s in all other entries. Then, since MᵀMe = λe for each eigenvector e and its corresponding eigenvalue λ, it follows that MᵀME = EL.
We observed that M E is the points of M transformed into a new coordinate space. In this space, the first axis (the one corresponding to the largest
eigenvalue) is the most significant; formally, the variance of points along that
axis is the greatest. The second axis, corresponding to the second eigenpair,
is next most significant in the same sense, and the pattern continues for each
of the eigenpairs. If we want to transform M to a space with fewer dimensions, then the choice that preserves the most significance is the one that uses
the eigenvectors associated with the largest eigenvalues and ignores the other
eigenvalues.
That is, let Ek be the first k columns of E. Then M Ek is a k-dimensional
representation of M .
Example 11.6 : Let M be the matrix from Section 11.2.1. This data has only
two dimensions, so the only dimensionality reduction we can do is to use k = 1;
i.e., project the data onto a one dimensional space. That is, we compute M E1
by
    \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix} \begin{bmatrix} 1/√2 \\ 1/√2 \end{bmatrix} = \begin{bmatrix} 3/√2 \\ 3/√2 \\ 7/√2 \\ 7/√2 \end{bmatrix}
The effect of this transformation is to replace the points of M by their projections onto the x-axis of Fig. 11.3. While the first two points project to the
same point, as do the third and fourth points, this representation makes the
best possible one-dimensional distinctions among the points. □
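A short NumPy sketch (ours, not the book's) of the same recipe: form E from the eigenvectors of MᵀM ordered by decreasing eigenvalue, and project the rows of M onto the first k columns. The sign of each projected axis may differ from the text, since an eigenvector is determined only up to sign.

```python
import numpy as np

def pca_project(M, k):
    """Project the rows of M onto the k principal axes (eigenvectors of M^T M)."""
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)   # eigenpairs of the symmetric matrix M^T M
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    E_k = eigvecs[:, order[:k]]                  # first k columns of E
    return M @ E_k                               # points in the new coordinate space

M = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
print(pca_project(M, 1))   # roughly +/-[2.12, 2.12, 4.95, 4.95] (axis sign is arbitrary)
```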
11.2.3 The Matrix of Distances
Let us return to the example of Section 11.2.1, but instead of starting with
M T M , let us examine the eigenvalues of M M T . Since our example M has
more rows than columns, the latter is a bigger matrix than the former, but if
M had more columns than rows, we would actually get a smaller matrix. In
the running example, we have
    MMᵀ = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{bmatrix} = \begin{bmatrix} 5 & 4 & 11 & 10 \\ 4 & 5 & 10 & 11 \\ 11 & 10 & 25 & 24 \\ 10 & 11 & 24 & 25 \end{bmatrix}
Like M T M , we see that M M T is symmetric. The entry in the ith row and
jth column has a simple interpretation; it is the dot product of the vectors
represented by the ith and jth points (rows of M ).
There is a strong relationship between the eigenvalues of MᵀM and MMᵀ. Suppose e is an eigenvector of MᵀM; that is,

    MᵀMe = λe

Multiply both sides of this equation by M on the left. Then

    MMᵀ(Me) = M(λe) = λ(Me)

Thus, as long as Me is not the zero vector 0, it will be an eigenvector of MMᵀ and λ will be an eigenvalue of MMᵀ as well as of MᵀM.

The converse holds as well. That is, if e is an eigenvector of MMᵀ with corresponding eigenvalue λ, then start with MMᵀe = λe and multiply on the left by Mᵀ to conclude that MᵀM(Mᵀe) = λ(Mᵀe). Thus, if Mᵀe is not 0, then λ is also an eigenvalue of MᵀM.

We might wonder what happens when Mᵀe = 0. In that case, MMᵀe is also 0, but e is not 0 because 0 cannot be an eigenvector. However, since 0 = MMᵀe = λe, we conclude that λ = 0.
We conclude that the eigenvalues of M M T are the eigenvalues of M T M
plus additional 0s. If the dimension of M M T were less than the dimension
of M T M , then the opposite would be true; the eigenvalues of M T M would be
those of M M T plus additional 0s.
Figure 11.4: The matrix of eigenvectors for MMᵀ; its columns are unit eigenvectors corresponding to the eigenvalues 58, 2, 0, and 0 (their entries involve 3/√116, 7/√116, and 1/2)
Example 11.7 : The eigenvalues of M M T for our running example must include 58 and 2, because those are the eigenvalues of M T M as we observed in
Section 11.2.1. Since MMᵀ is a 4 × 4 matrix, it has two other eigenvalues, which must both be 0. The matrix of eigenvectors corresponding to 58, 2, 0, and 0 is shown in Fig. 11.4. □
11.2.4 Exercises for Section 11.2

Exercise 11.2.1 : Let M be the matrix of data points

    \begin{bmatrix} 1 & 1 \\ 2 & 4 \\ 3 & 9 \\ 4 & 16 \end{bmatrix}
(a) What are M T M and M M T ?
(b) Compute the eigenpairs for M T M .
! (c) What do you expect to be the eigenvalues of M M T ?
! (d) Find the eigenvectors of M M T , using your eigenvalues from part (c).
! Exercise 11.2.2 : Prove that if M is any matrix, then M T M and M M T are
symmetric.
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
Let M be an m × n matrix, and let the rank of M be r. Recall that the rank of a matrix is the largest number of rows (or equivalently columns) we can choose
for which no nonzero linear combination of the rows is the all-zero vector 0 (we
say a set of such rows or columns is independent). Then we can find matrices
U, Σ, and V as shown in Fig. 11.5 with the following properties:

1. U is an m × r column-orthonormal matrix; that is, each of its columns is a unit vector and the dot product of any two columns is 0.

2. V is an n × r column-orthonormal matrix. Note that we always use V in its transposed form, so it is the rows of Vᵀ that are orthonormal.

3. Σ is a diagonal matrix; that is, all elements not on the main diagonal are 0. The elements of Σ are called the singular values of M.
Figure 11.5: The form of a singular-value decomposition: an m × n matrix M is expressed as the product UΣVᵀ, where U is m × r, Σ is r × r, and Vᵀ is r × n
             Matrix  Alien  Star Wars  Casablanca  Titanic
    Joe         1      1        1          0          0
    Jim         3      3        3          0          0
    John        4      4        4          0          0
    Jack        5      5        5          0          0
    Jill        0      0        0          4          4
    Jenny       0      0        0          5          5
    Jane        0      0        0          2          2

Figure 11.6: Ratings of movies by users

    M = \begin{bmatrix} .14 & 0 \\ .42 & 0 \\ .56 & 0 \\ .70 & 0 \\ 0 & .60 \\ 0 & .75 \\ 0 & .30 \end{bmatrix} \begin{bmatrix} 12.4 & 0 \\ 0 & 9.5 \end{bmatrix} \begin{bmatrix} .58 & .58 & .58 & 0 & 0 \\ 0 & 0 & 0 & .71 & .71 \end{bmatrix} = U Σ Vᵀ

Figure 11.7: SVD of the matrix M of Fig. 11.6
11.3.2 Interpretation of SVD

The key to understanding what the SVD offers is to view the r columns of U, Σ, and V as representing concepts that are hidden in the original matrix M. In the running example of Figs. 11.6 and 11.7 there are two such concepts, "science fiction" and "romance"; U connects people to concepts, V connects movies to concepts, and the diagonal entries of Σ give the strength of each concept.
             Matrix  Alien  Star Wars  Casablanca  Titanic
    Joe         1      1        1          0          0
    Jim         3      3        3          0          0
    John        4      4        4          0          0
    Jack        5      5        5          0          0
    Jill        0      2        0          4          4
    Jenny       0      0        0          5          5
    Jane        0      1        0          2          2

Figure 11.8: The new matrix M′, with ratings for Alien by two additional raters
Example 11.9 : Figure 11.8 is almost the same as Fig. 11.6, but Jill and Jane
rated Alien, although neither liked it very much. The rank of the matrix in
Fig. 11.8 is 3; for example the first, sixth, and seventh rows are independent,
but you can check that no four rows are independent. Figure 11.9 shows the
decomposition of the matrix from Fig. 11.8.
We have used three columns for U, Σ, and V because they decompose a matrix of rank three. The columns of U and V still correspond to concepts. The first is still science fiction and the second is romance. It is harder to explain the third concept, but it matters little, because its weight, as given by the third singular value in Σ, is quite low. □
    \begin{bmatrix} 1 & 1 & 1 & 0 & 0 \\ 3 & 3 & 3 & 0 & 0 \\ 4 & 4 & 4 & 0 & 0 \\ 5 & 5 & 5 & 0 & 0 \\ 0 & 2 & 0 & 4 & 4 \\ 0 & 0 & 0 & 5 & 5 \\ 0 & 1 & 0 & 2 & 2 \end{bmatrix} = \begin{bmatrix} .13 & .02 & .01 \\ .41 & .07 & .03 \\ .55 & .09 & .04 \\ .68 & .11 & .05 \\ .15 & -.59 & -.65 \\ .07 & -.73 & .67 \\ .07 & -.29 & -.32 \end{bmatrix} \begin{bmatrix} 12.4 & 0 & 0 \\ 0 & 9.5 & 0 \\ 0 & 0 & 1.3 \end{bmatrix} \begin{bmatrix} .56 & .59 & .56 & .09 & .09 \\ .12 & -.02 & .12 & -.69 & -.69 \\ .40 & -.80 & .40 & .09 & .09 \end{bmatrix}

Figure 11.9: SVD of the matrix M′ of Fig. 11.8
11.3.3 Dimensionality Reduction Using SVD

Suppose we want a lower-dimensional approximation to a matrix from its SVD. The way to do so is to set the smallest of the singular values to 0. If we set the smallest singular value of the decomposition in Fig. 11.9, which is 1.3, to 0, then the third column of U and the third row of Vᵀ are multiplied only by 0s when we perform the multiplication, so this row and this column may as well not be there. That is, the approximation to M′ obtained by using only the two largest singular values is that shown in Fig. 11.10.
    \begin{bmatrix} .13 & .02 \\ .41 & .07 \\ .55 & .09 \\ .68 & .11 \\ .15 & -.59 \\ .07 & -.73 \\ .07 & -.29 \end{bmatrix} \begin{bmatrix} 12.4 & 0 \\ 0 & 9.5 \end{bmatrix} \begin{bmatrix} .56 & .59 & .56 & .09 & .09 \\ .12 & -.02 & .12 & -.69 & -.69 \end{bmatrix}

Figure 11.10: Dropping the lowest singular value from the decomposition of Fig. 11.9
The resulting matrix is quite close to the matrix M′ of Fig. 11.8. Ideally, the entire difference is the result of making the last singular value be 0. However, in this simple example, much of the difference is due to rounding error caused by the fact that the decomposition of M′ was only correct to two significant digits. □
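For comparison (a sketch of ours, not the book's procedure), a library SVD makes the same truncation a one-liner. We use the matrix of Fig. 11.8 and keep the two largest singular values.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s holds the singular values, descending
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # drop the smallest singular value(s)
print(np.round(s, 1))                              # roughly [12.4, 9.5, 1.3, 0, 0]
print(np.round(approx, 2))                         # close to M, as in Fig. 11.10
print(np.linalg.norm(M - approx))                  # Frobenius norm of the error, about 1.3
```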
11.3.4 Why Zeroing Low Singular Values Works
The choice of the lowest singular values to drop when we reduce the number of
dimensions can be shown to minimize the root-mean-square error between the
original matrix M and its approximation. Since the number of entries is fixed,
and the square root is a monotone operation, we can simplify and compare
the Frobenius norms of the matrices involved. Recall that the Frobenius norm
of a matrix M , denoted kM k, is the square root of the sum of the squares of
the elements of M . Note that if M is the difference between one matrix and
its approximation, then kM k is proportional to the RMSE (root-mean-square
error) between the matrices.
To explain why choosing the smallest singular values to set to 0 minimizes
the RMSE or Frobenius norm of the difference between M and its approximation, let us begin with a little matrix algebra. Suppose M is the product of
three matrices M = P QR. Let mij , pij , qij , and rij be the elements in row i
and column j of M , P , Q, and R, respectively. Then the definition of matrix
multiplication tells us

    m_{ij} = \sum_k \sum_ℓ p_{ik} q_{kℓ} r_{ℓj}      (11.1)

Then

    ‖M‖² = \sum_i \sum_j (m_{ij})² = \sum_i \sum_j \Bigl( \sum_k \sum_ℓ p_{ik} q_{kℓ} r_{ℓj} \Bigr)²      (11.2)
Now, let us examine the case where P , Q, and R are really the SVD of M .
That is, P is a column-orthonormal matrix, Q is a diagonal matrix, and R is
the transpose of a column-orthonormal matrix. That is, R is row-orthonormal;
its rows are unit vectors and the dot product of any two different rows is 0. To
begin, since Q is a diagonal matrix, q_{kℓ} and q_{nm} will be zero unless k = ℓ and n = m. We can thus drop the summations for ℓ and m in Equation 11.2 and set k = ℓ and n = m. That is, Equation 11.2 becomes

    ‖M‖² = \sum_i \sum_j \sum_k \sum_n p_{ik} q_{kk} r_{kj} p_{in} q_{nn} r_{nj}      (11.3)
Next, reorder the summation, so that i is the innermost sum. Equation 11.3 has only two factors p_{ik} and p_{in} that involve i; all other factors are constants as far as summation over i is concerned. Since P is column-orthonormal, we know that \sum_i p_{ik} p_{in} is 1 if k = n and 0 otherwise. That is, in Equation 11.3 we can set k = n, drop the factors p_{ik} and p_{in}, and eliminate the sums over i and n, yielding

    ‖M‖² = \sum_k \sum_j q_{kk} r_{kj} q_{kk} r_{kj}      (11.4)

Finally, since R is row-orthonormal, \sum_j r_{kj} r_{kj} = 1, so Equation 11.4 reduces to ‖M‖² = \sum_k (q_{kk})². That is, the square of the Frobenius norm of a product PQR of this form is the sum of the squares of the diagonal elements of Q. When the decomposition is an SVD, this says that ‖M‖² equals the sum of the squares of the singular values. Setting a singular value to 0 therefore removes exactly its square from this sum, so the total "energy" given up, and hence the RMSE of the approximation, is minimized by choosing the smallest singular values to set to 0.
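The identity just derived is easy to check numerically; the following small sketch (ours) compares the squared Frobenius norm of the matrix of Fig. 11.8 with the sum of the squares of its singular values.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4], [0, 0, 0, 5, 5], [0, 1, 0, 2, 2]], dtype=float)
s = np.linalg.svd(M, compute_uv=False)   # singular values only
print(np.sum(M * M))                     # squared Frobenius norm of M
print(np.sum(s ** 2))                    # equal: sum of squares of the singular values
```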
11.3.5 Querying Using Concepts
In this section we shall look at how SVD can help us answer certain queries
efficiently, with good accuracy. Let us assume for example that we have decomposed our original movie-rating data (the rank-2 data of Fig. 11.6) into the
SVD form of Fig. 11.7. Quincy is not one of the people represented by the
original matrix, but he wants to use the system to know what movies he would
like. He has only seen one movie, The Matrix, and rated it 4. Thus, we can
represent Quincy by the vector q = [4, 0, 0, 0, 0], as if this were one of the rows
of the original matrix.
If we used a collaborative-filtering approach, we would try to compare
Quincy with the other users represented in the original matrix M . Instead,
we can map Quincy into concept space by multiplying him by the matrix V
of the decomposition. We find qV = [2.32, 0].3 That is to say, Quincy is high
in science-fiction interest, and not at all interested in romance.
We now have a representation of Quincy in concept space, derived from, but
different from his representation in the original movie space. One useful thing
we can do is to map his representation back into movie space by multiplying
[2.32, 0] by V T . This product is [1.35, 1.35, 1.35, 0, 0]. It suggests that Quincy
would like Alien and Star Wars, but not Casablanca or Titanic.
Another sort of query we can perform in concept space is to find users similar
to Quincy. We can use V to map all users into concept space. For example,
Joe maps to [1.74, 0], and Jill maps to [0, 5.68]. Notice that in this simple
example, all users are either 100% science-fiction fans or 100% romance fans, so
each vector has a zero in one component. In reality, people are more complex,
and they will have different, but nonzero, levels of interest in various concepts.
In general, we can measure the similarity of users by their cosine distance in
concept space.
Example 11.11 : For the case introduced above, note that the concept vectors
for Quincy and Joe, which are [2.32, 0] and [1.74, 0], respectively, are not the
same, but they have exactly the same direction. That is, their cosine distance
is 0. On the other hand, the vectors for Quincy and Jill, which are [2.32, 0] and
[0, 5.68], respectively, have a dot product of 0, and therefore their angle is 90
degrees. That is, their cosine distance is 1, the maximum possible. □
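These concept-space manipulations are simple matrix–vector products. The sketch below (ours) uses the matrix V from Fig. 11.7; the variable names follow the text, and the cosine computation is the standard formula rather than anything specific to the book.

```python
import numpy as np

V = np.array([[.58, 0], [.58, 0], [.58, 0], [0, .71], [0, .71]])  # V (the transpose of V^T in Fig. 11.7)
q = np.array([4, 0, 0, 0, 0], dtype=float)   # Quincy: rated The Matrix 4, nothing else

concept = q @ V                  # Quincy in concept space: about [2.32, 0]
back = concept @ V.T             # back in movie space: about [1.35, 1.35, 1.35, 0, 0]

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

joe = np.array([1, 1, 1, 0, 0], dtype=float) @ V    # about [1.74, 0]
jill = np.array([0, 0, 0, 4, 4], dtype=float) @ V   # about [0, 5.68]
print(concept, back)
print(cosine_similarity(concept, joe))    # 1.0 -> cosine distance 0 (same direction)
print(cosine_similarity(concept, jill))   # 0.0 -> cosine distance 1 (90 degrees apart)
```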
11.3.6 Computing the SVD of a Matrix
The SVD of a matrix M is strongly connected to the eigenvalues of the symmetric matrices MᵀM and MMᵀ. This relationship allows us to obtain the SVD of M from the eigenpairs of the latter two matrices. To begin the explanation, start with M = UΣVᵀ, the expression for the SVD of M. Then

    Mᵀ = (UΣVᵀ)ᵀ = (Vᵀ)ᵀ Σᵀ Uᵀ = V Σᵀ Uᵀ

Since Σ is a diagonal matrix, transposing it has no effect. Thus, Mᵀ = VΣUᵀ.

Now, MᵀM = VΣUᵀUΣVᵀ. Remember that U is an orthonormal matrix, so UᵀU is the identity matrix of the appropriate size. That is,

    MᵀM = VΣ²Vᵀ

Multiply both sides of this equation on the right by V to get

    MᵀMV = VΣ²VᵀV

Since V is also an orthonormal matrix, we know that VᵀV is the identity. Thus

    MᵀMV = VΣ²      (11.6)

Thus, the same algorithm that computes the eigenpairs for MᵀM gives us the matrix V for the SVD of M itself. It also gives us the singular values for this SVD; just take the square roots of the eigenvalues for MᵀM.

Only U remains to be computed, but it can be found in the same way we found V. Start with

    MMᵀ = UΣVᵀ(UΣVᵀ)ᵀ = UΣVᵀVΣUᵀ = UΣ²Uᵀ

Then by a series of manipulations analogous to the above, we learn that

    MMᵀU = UΣ²

That is, U is the matrix of eigenvectors of MMᵀ.

A small detail needs to be explained concerning U and V. Each of these matrices has r columns, while MᵀM is an n × n matrix and MMᵀ is an m × m matrix. Both n and m are at least as large as r. Thus, MᵀM and MMᵀ should have an additional n − r and m − r eigenpairs, respectively, and these pairs do not show up in U, V, and Σ. Since the rank of M is r, all other eigenvalues will be 0, and these are not useful.
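As an illustration (a sketch under the assumption that M has rank r, not the book's algorithm verbatim), the following code builds V and Σ from the eigenpairs of MᵀM and then obtains each column of U as Mvᵢ/σᵢ, which the argument of Section 11.2.3 shows is an eigenvector of MMᵀ (and it is a unit vector because ‖Mvᵢ‖ = σᵢ).

```python
import numpy as np

def svd_from_eigenpairs(M, r):
    """Build a rank-r SVD of M from the eigenpairs of M^T M (assumes rank(M) = r)."""
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)       # eigenpairs of the symmetric matrix M^T M
    order = np.argsort(eigvals)[::-1][:r]            # keep the r largest eigenvalues
    V = eigvecs[:, order]                            # columns of V are those eigenvectors
    sigma = np.sqrt(np.maximum(eigvals[order], 0))   # singular values = square roots of eigenvalues
    U = (M @ V) / sigma                              # column i is M v_i / sigma_i, an eigenvector of M M^T
    return U, sigma, V

M = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4], [0, 0, 0, 5, 5], [0, 0, 0, 2, 2]], dtype=float)
U, sigma, V = svd_from_eigenpairs(M, r=2)
print(np.round(sigma, 1))                            # about [12.4, 9.5]
print(np.allclose(U @ np.diag(sigma) @ V.T, M))      # True: the decomposition reproduces M
```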
11.3.7 Exercises for Section 11.3
Exercise 11.3.1 : In Fig. 11.11 is a matrix M . It has rank 2, as you can see by
observing that the first column plus the third column minus twice the second
column equals 0.
    M = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \\ 5 & 4 & 3 \\ 0 & 2 & 4 \\ 1 & 3 & 5 \end{bmatrix}

Figure 11.11: Matrix M for Exercise 11.3.1
(a) Compute MᵀM and MMᵀ.

(b) Find the eigenvalues for your matrices of part (a).

(c) Find the eigenvectors for the matrices of part (a).

(d) Find the SVD for the original matrix M from parts (b) and (c).

(e) Set your smaller singular value to 0 and compute the one-dimensional approximation to the matrix M from Fig. 11.11.
(f) How much of the energy of the original singular values is retained by the
one-dimensional approximation?
Exercise 11.3.2 : Use the SVD from Fig. 11.7. Suppose Leslie assigns rating 3
to Alien and rating 4 to Titanic, giving us a representation of Leslie in movie
space of [0, 3, 0, 0, 4]. Find the representation of Leslie in concept space. What
does that representation predict about how well Leslie would like the other
movies appearing in our example data?
! Exercise 11.3.3 : Demonstrate that the rank of the matrix in Fig. 11.8 is 3.
! Exercise 11.3.4 : Section 11.3.5 showed how to guess the movies a person
would most like. How would you use a similar technique to guess the people
that would most like a given movie, if all you had were the ratings of that movie
by a few people?
11.4 CUR Decomposition
There is a problem with SVD that does not show up in the running example
of Section 11.3. In large-data applications, it is normal for the matrix M being
decomposed to be very sparse; that is, most entries are 0. For example, a
matrix representing many documents (as rows) and the words they contain (as
columns) will be sparse, because most words are not present in most documents.
Similarly, a matrix of customers and products will be sparse because most
people do not buy most products.
We cannot deal with dense matrices that have millions or billions of rows
and/or columns. However, with SVD, even if M is sparse, U and V will be dense.⁴ Since Σ is diagonal, it will be sparse, but Σ is usually much smaller than U and V, so its sparseness does not help.
In this section, we shall consider another approach to decomposition, called
CUR-decomposition. The merit of this approach lies in the fact that if M is
sparse, then the two large matrices (called C and R for columns and rows)
analogous to U and V in SVD are also sparse. Only the matrix in the middle (analogous to Σ in SVD) is dense, but this matrix is small so the density does not hurt too much.
not hurt too much.
Unlike SVD, which gives an exact decomposition as long as the parameter r
is taken to be at least as great as the rank of the matrix M , CUR-decomposition
is an approximation no matter how large we make r. There is a theory that
guarantees convergence to M as r gets larger, but typically you have to make r
so large to get, say, within 1%, that the method becomes impractical. Nevertheless, a decomposition with a relatively small value of r has a good probability
of being a useful and accurate decomposition.
⁴ In Fig. 11.7, it happens that U and V have a significant number of 0s. However, that is an artifact of the very regular nature of our example matrix M and is not the case in general.
11.4.1 Definition of CUR

Let M be a matrix of m rows and n columns, and pick a target number of concepts r to be used in the decomposition. A CUR-decomposition of M consists of an m × r matrix C formed from r randomly chosen columns of M, an r × n matrix R formed from r randomly chosen rows of M, and an r × r matrix U built from C and R, such that the product CUR closely approximates M. The matrix U is defined by means of the Moore-Penrose pseudoinverse of the r × r matrix W whose entries are the elements of M at the intersections of the chosen rows and columns; the construction is spelled out in Section 11.4.3.
11.4.2 Choosing Rows and Columns Properly
Recall that the choice of rows and columns is random. However, this choice must be biased so that the more important rows and columns have a better chance of being picked. The measure of importance we must use is the square of the Frobenius norm, that is, the sum of the squares of the elements of the row or column. Let f = \sum_{i,j} m_{ij}², the square of the Frobenius norm of M. Then each time we select a row, the probability p_i with which we select row i is \sum_j m_{ij}²/f. Each time we select a column, the probability q_j with which we select column j is \sum_i m_{ij}²/f.
             Matrix  Alien  Star Wars  Casablanca  Titanic
    Joe         1      1        1          0          0
    Jim         3      3        3          0          0
    John        4      4        4          0          0
    Jack        5      5        5          0          0
    Jill        0      0        0          4          4
    Jenny       0      0        0          5          5
    Jane        0      0        0          2          2

Figure 11.12: The matrix M, repeated from Fig. 11.6
Example 11.12 : Let us reconsider the matrix M from Fig. 11.6, which we
repeat here as Fig. 11.12. The sum of the squares of the elements of M is 243.
The three columns for the science-fiction movies The Matrix, Alien, and Star
Wars each have a squared Frobenius norm of 12 + 32 + 42 + 52 = 51, so their
probabilities are each 51/243 = .210. The remaining two columns each have a
squared Frobenius norm of 42 + 52 + 22 = 45, and therefore their probabilities
are each 45/243 = .185.
The seven rows of M have squared Frobenius norms of 3, 27, 48, 75, 32,
50, and 8, respectively. Thus, their respective probabilities are .012, .111, .198,
.309, .132, .206, and .033. □
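These probabilities are easy to reproduce; here is a small sketch (ours) for the matrix of Fig. 11.12.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4], [0, 0, 0, 5, 5], [0, 0, 0, 2, 2]], dtype=float)

f = np.sum(M * M)                      # square of the Frobenius norm: 243
row_probs = np.sum(M * M, axis=1) / f  # p_i: about [.012, .111, .198, .309, .132, .206, .033]
col_probs = np.sum(M * M, axis=0) / f  # q_j: about [.210, .210, .210, .185, .185]
print(f, np.round(row_probs, 3), np.round(col_probs, 3))
```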
Now, let us select r columns for the matrix C. For each column, we choose
randomly from the columns of M . However, the selection is not with uniform
probability; rather, the jth column is selected with probability qj . Recall that
probability is the sum of the squares of the elements in that column divided by
the sum of the squares of all the elements of the matrix. Each column of C is
chosen independently from the columns of M , so there is some chance that a
column will be selected more than once. We shall discuss how to deal with this
situation after explaining the basics of CUR-decomposition.
Each column of M that we choose for C is scaled: if column j is selected, its elements are divided by √(r q_j); likewise, a selected row i of M has its elements divided by √(r p_i).

Example 11.13 : Let us choose r = 2 and suppose that the first column selected for C is the one for Alien. The column of M for Alien is [1, 3, 4, 5, 0, 0, 0]ᵀ, and we must scale this column by dividing by √(r q₂). Recall from Example 11.12 that the probability associated with the Alien column is .210, so the division is by √(2 × 0.210) = 0.648. To two decimal places, the scaled column for Alien is [1.54, 4.63, 6.17, 7.72, 0, 0, 0]ᵀ. This column becomes the first column of C. The second column of C is constructed by taking the column of M for Casablanca and scaling it in the same manner.
11.4.3 Constructing the Middle Matrix
Finally, we must construct the matrix U that connects C and R in the decomposition. Recall that U is an r × r matrix. We start the construction of U
with another matrix of the same size, which we call W . The entry in row i and
column j of W is the entry of M whose row is the one from which we selected
the ith row of R and whose column is the one from which we selected the jth
column of C.
Example 11.14 : Let us follow the selections of rows and columns made in
Example 11.13. We claim
    W = \begin{bmatrix} 0 & 5 \\ 5 & 0 \end{bmatrix}
The first row of W corresponds to the first row of R, which is the row for Jenny
in the matrix M of Fig. 11.12. The 0 in the first column is there because that
is the entry in the row of M for Jenny and the column for Alien; recall that the
first column of C was constructed from the column of M for Alien. The 5 in the
second column reflects the 5 in M's row for Jenny and column for Casablanca;
the latter is the column of M from which the second column of C was derived.
Similarly, the second row of W is the entries in the row for Jack and columns
for Alien and Casablanca, respectively. □
The matrix U is constructed from W by the Moore-Penrose pseudoinverse described in Section 11.4.1. It consists of taking the SVD of W, say W = XΣYᵀ, and replacing all nonzero elements in the matrix Σ of singular values by their numerical inverses, to obtain the pseudoinverse Σ⁺. Then U = Y(Σ⁺)²Xᵀ.
Example 11.15 : Let us construct U from the matrix W that we constructed
in Example 11.14. First, here is the SVD for W :
    W = \begin{bmatrix} 0 & 5 \\ 5 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

That is, the three matrices on the right are X, Σ, and Yᵀ, respectively. The matrix Σ has no zeros along the diagonal, so each element is replaced by its numerical inverse to get its Moore-Penrose pseudoinverse:

    Σ⁺ = \begin{bmatrix} 1/5 & 0 \\ 0 & 1/5 \end{bmatrix}

Now X and Y are symmetric, so they are their own transposes. Thus,

    U = Y(Σ⁺)²Xᵀ = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1/5 & 0 \\ 0 & 1/5 \end{bmatrix}² \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1/25 \\ 1/25 & 0 \end{bmatrix}

□
11.4.4 The Complete CUR Decomposition

We now have a randomly chosen matrix C of columns, a randomly chosen matrix R of rows, and the matrix U constructed from their intersection W. Their product CUR approximates the original matrix M. For the choices made in Examples 11.13 through 11.15, the decomposition is

    CUR = \begin{bmatrix} 1.54 & 0 \\ 4.63 & 0 \\ 6.17 & 0 \\ 7.72 & 0 \\ 0 & 9.30 \\ 0 & 11.63 \\ 0 & 4.65 \end{bmatrix} \begin{bmatrix} 0 & 1/25 \\ 1/25 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 11.01 & 11.01 \\ 8.99 & 8.99 & 8.99 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0.55 & 0.55 & 0.55 & 0 & 0 \\ 1.67 & 1.67 & 1.67 & 0 & 0 \\ 2.22 & 2.22 & 2.22 & 0 & 0 \\ 2.78 & 2.78 & 2.78 & 0 & 0 \\ 0 & 0 & 0 & 4.10 & 4.10 \\ 0 & 0 & 0 & 5.12 & 5.12 \\ 0 & 0 & 0 & 2.05 & 2.05 \end{bmatrix}
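Putting the pieces together, here is a compact sketch (ours, with an arbitrary random seed and tolerance) of the whole CUR construction: biased sampling of columns and rows, scaling by the square root of r times the selection probability, and the pseudoinverse-based middle matrix.

```python
import numpy as np

def cur_decomposition(M, r, seed=0):
    """Sketch of CUR: sample r columns and r rows with Frobenius-based probabilities."""
    rng = np.random.default_rng(seed)
    f = np.sum(M * M)
    col_p = np.sum(M * M, axis=0) / f
    row_p = np.sum(M * M, axis=1) / f
    cols = rng.choice(M.shape[1], size=r, replace=True, p=col_p)
    rows = rng.choice(M.shape[0], size=r, replace=True, p=row_p)
    C = M[:, cols] / np.sqrt(r * col_p[cols])              # scale each chosen column
    R = M[rows, :] / np.sqrt(r * row_p[rows])[:, None]     # scale each chosen row
    W = M[np.ix_(rows, cols)]                              # intersection of chosen rows and columns
    X, s, Yt = np.linalg.svd(W)
    s_plus = np.where(s > 1e-12, 1.0 / s, 0.0)             # Moore-Penrose pseudoinverse of Sigma
    U = Yt.T @ np.diag(s_plus ** 2) @ X.T                  # U = Y (Sigma+)^2 X^T
    return C, U, R

M = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4], [0, 0, 0, 5, 5], [0, 0, 0, 2, 2]], dtype=float)
C, U, R = cur_decomposition(M, r=2)
print(np.round(C @ U @ R, 2))   # a randomized approximation to M; quality varies with the sample
```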
11.4.5 Eliminating Duplicate Rows and Columns
It is quite possible that a single row or column is selected more than once.
There is no great harm in using the same row twice, although the rank of the
matrices of the decomposition will be less than the number of row and column
choices made. However, it is also possible to combine k rows of R that are each
the same row of the matrix M into a single row of R, thus leaving R with fewer
rows. Likewise, k columns of C that each come from the same column of M
can be combined into one column of C. However, for either rows or columns, the remaining vector should have each of its elements multiplied by √k.
When we merge some rows and/or columns, it is possible that R has fewer
rows than C has columns, or vice versa. As a consequence, W will not be a
square matrix. However, we can still take its pseudoinverse by decomposing it
into W = XΣYᵀ, where Σ is now a diagonal matrix with some all-0 rows or
columns, whichever it has more of. To take the pseudoinverse of such a diagonal
matrix, we treat each element on the diagonal as usual (invert nonzero elements
and leave 0 as it is), but then we must transpose the result.
Example 11.17 : Suppose
    Σ = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 \end{bmatrix}

Then

    Σ⁺ = \begin{bmatrix} 1/2 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1/3 \\ 0 & 0 & 0 \end{bmatrix}

□
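The rule for pseudoinverting such a matrix is easy to check; the sketch below (ours) applies it to the Σ of Example 11.17 and compares the result with NumPy's general-purpose pseudoinverse.

```python
import numpy as np

def diag_pinv(S):
    """Pseudoinverse of a diagonal-like matrix: invert nonzero entries, then transpose."""
    P = np.where(S != 0, 1.0 / np.where(S != 0, S, 1.0), 0.0)  # elementwise inverse of nonzeros
    return P.T

S = np.array([[2.0, 0, 0, 0],
              [0,   0, 0, 0],
              [0,   0, 3, 0]])
print(diag_pinv(S))
print(np.allclose(diag_pinv(S), np.linalg.pinv(S)))   # True
```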
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
Dimensionality Reduction: The goal of dimensionality reduction is to replace a large matrix by two or more other matrices whose sizes are much
smaller than the original, but from which the original can be approximately reconstructed, usually by taking their product.
Using SVD for Dimensionality Reduction: In a complete SVD for a matrix, U and V are typically as large as the original. To use fewer columns
for U and V, delete the columns corresponding to the smallest singular values from U, V, and Σ. This choice minimizes the error in reconstructing the original matrix from the modified U, Σ, and V.
Decomposing Sparse Matrices: Even in the common case where the given
matrix is sparse, the matrices constructed by SVD are dense. The CUR
decomposition seeks to decompose a sparse matrix into sparse, smaller
matrices whose product approximates the original matrix.
CUR Decomposition: This method chooses from a given sparse matrix
a set of columns C and a set of rows R, which play the role of U and
V T in SVD; the user can pick any number of rows and columns. The
choice of rows and columns is made randomly with a distribution that
depends on the Frobenius norm, or the square root of the sum of the
squares of the elements. Between C and R is a square matrix called U
that is constructed by a pseudo-inverse of the intersection of the chosen
rows and columns.