
Algorithms for manifold learning

Lawrence Cayton
[email protected]
June 15, 2005
Abstract

Manifold learning is a popular recent approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data. In this paper, we discuss the motivation, background, and algorithms proposed for manifold learning. Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and a host of variants of these algorithms are examined.

1 Introduction

Many recent applications of machine learning in data mining, computer vision, and elsewhere require deriving a classifier or function estimate from an extremely large data set. Modern data sets often consist of a large number of examples, each of which is made up of many features. Though access to an abundance of examples is purely beneficial to an algorithm attempting to generalize from the data, managing a large number of features, some of which may be irrelevant or even misleading, is typically a burden to the algorithm. Overwhelmingly complex feature sets will slow the algorithm down and make finding global optima difficult. To lessen this burden on standard machine learning algorithms (e.g. classifiers, function estimators), a number of techniques have been developed to vastly reduce the quantity of features in a data set, i.e. to reduce the dimensionality of the data.

Dimensionality reduction has other, related uses in addition to simplifying data so that it can be efficiently processed. Perhaps the most obvious is visualization; if data lies in a 100-dimensional space, one cannot get an intuitive feel for what the data looks like. However, if a meaningful two- or three-dimensional representation of the data can be found, then it is possible to eyeball it. Though this may seem like a trivial point, many statistical and machine learning algorithms have very poor optimality guarantees, so the ability to actually see the data and the output of an algorithm is of great practical interest.

Beyond visualization, a dimensionality reduction procedure may help reveal what the underlying forces governing a data set are. For example, suppose we are to classify an email as spam or not spam. A typical approach to this problem would be to represent an email as a vector of counts of the words appearing in the email. The dimensionality of this data can easily be in the hundreds, yet an effective dimensionality reduction technique may reveal that there are only a few exceptionally telling features, such as the word "Viagra".

There are many approaches to dimensionality reduction based on a variety of assumptions and used in a variety of contexts. We will focus on an approach initiated recently based on the observation that high-dimensional data is often much simpler than the dimensionality would indicate. In particular, a given high-dimensional data set may contain many features that are all measurements of the same underlying cause, and so are closely related. This type of phenomenon is common, for example, when taking video footage of a single object from multiple angles simultaneously. The features of such a data set contain much overlapping information; it would be helpful to somehow get a simplified, non-overlapping representation of the data whose features are identifiable with the underlying parameters that govern the data. This intuition is formalized using the notion of a manifold: the data set lies along a low-dimensional manifold embedded in a high-dimensional space, where the low-dimensional space reflects the underlying parameters and the high-dimensional space is the feature space. Attempting to uncover this manifold structure in a data set is referred to as manifold learning.

1.1 Organization

We begin by looking at the intuitions, basic mathematics, and assumptions of manifold learning in section 2. In section 3, we detail several algorithms proposed for learning manifolds: Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and some variants. These algorithms will be compared and contrasted with one another in section 4. We conclude with future directions for research in section 5.

1.2 Notation and Conventions

Throughout, we will be interested in the problem of reducing the dimensionality of a given data set consisting of high-dimensional points in Euclidean space.

- The high-dimensional input points will be referred to as x_1, x_2, ..., x_n. At times it will be convenient to work with these points as a single matrix X, where the ith row of X is x_i.
- The low-dimensional representations that the dimensionality reduction algorithms find will be referred to as y_1, y_2, ..., y_n. Y is the matrix of these points.
- n is the number of points in the input.
- D is the dimensionality of the input (i.e. x_i ∈ R^D).
- d is the dimensionality of the manifold that the input is assumed to lie on and, accordingly, the dimensionality of the output (i.e. y_i ∈ R^d).
- k is the number of nearest neighbors used by a particular algorithm.
- N(i) denotes the set of the k-nearest neighbors of x_i.
- Throughout, we assume that the (eigenvector, eigenvalue) pairs are ordered by the eigenvalues. That is, if the (eigenvector, eigenvalue) pairs are (v_i, λ_i), for i = 1, ..., n, then λ_1 ≥ λ_2 ≥ ... ≥ λ_n. We refer to v_1, ..., v_d as the top d eigenvectors, and v_{n−d+1}, ..., v_n as the bottom d eigenvectors.

2 Preliminaries

In this section, we consider the motivations for manifold learning and introduce the basic concepts at work. We have already discussed the importance of dimensionality reduction for machine learning; we now look at a classical procedure for this task and then turn to more recent efforts.

2.1 Linear Dimensionality Reduction

Perhaps the most popular algorithm for dimensionality reduction is Principal Components Analysis (PCA). Given a data set, PCA finds the directions (vectors) along which the data has maximum variance, in addition to the relative importance of these directions. For example, suppose that we feed a set of three-dimensional points that all lie on a two-dimensional plane to PCA. PCA will return two vectors that span the plane along with a third vector that is orthogonal to the plane (so that all of R^3 is spanned by these vectors). The two vectors that span the plane will be given a positive weight, but the third vector will have a weight of zero, since the data does not vary along that direction. PCA is most useful when the data lies on or close to a linear subspace. Given this type of data, PCA will find a basis for the linear subspace and allow one to disregard the irrelevant features.

The manifold learning algorithms may be viewed as non-linear analogs to PCA, so it is worth detailing PCA somewhat. Suppose that X ∈ R^{n×D} is a matrix whose rows are D-dimensional data points. We are looking for the d-dimensional linear subspace of R^D along which the data has maximum variance. If in fact the data lies perfectly along a subspace of R^D, PCA will reveal that subspace; otherwise, PCA will introduce some error. The objective function it optimizes is

    max_V var(XV),

where V is an orthogonal D×d matrix. If d = 1, then V is simply a unit-length vector which gives the direction of maximum variance. In general, the columns of V give the d dimensions that we project the original data onto. This cost function admits an eigenvalue-based solution, which we now detail for d = 1. The direction of maximum variation is given by a unit vector v to project the data points onto, which is found as follows:

    max_{‖v‖=1} var(Xv) = max_{‖v‖=1} E(Xv)^2 − (E(Xv))^2     (1)
                        = max_{‖v‖=1} E(Xv)^2                  (2)
                        = max_{‖v‖=1} Σ_i (x_i^⊤ v)^2          (3)
                        = max_{‖v‖=1} Σ_i (v^⊤ x_i)(x_i^⊤ v)   (4)
                        = max_{‖v‖=1} v^⊤ [ Σ_i x_i x_i^⊤ ] v  (5)
                        = max_{‖v‖=1} v^⊤ X^⊤ X v.             (6)

Equality (2) follows from assuming the data is mean-centered, without loss of generality. The last line is a form of Rayleigh's quotient and so is maximized by setting v to the eigenvector of X^⊤X with the largest corresponding eigenvalue. In general, the d-dimensional PCA solution is XV, where V is the D×d matrix whose columns are the top d eigenvectors (i.e. (V_i, λ_i) are (eigenvector, eigenvalue) pairs with λ_1 ≥ λ_2 ≥ ... ≥ λ_d).

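To make the eigenvalue-based solution concrete, here is a minimal numerical sketch of PCA (an illustration in NumPy, not code from the paper); it assumes the rows of X are the data points, as in the notation above:

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto their top d principal components."""
    Xc = X - X.mean(axis=0)                     # mean-center the data, as assumed in equality (2)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)    # eigendecomposition of X^T X (ascending order)
    V = evecs[:, ::-1][:, :d]                   # top d eigenvectors: directions of maximum variance
    return Xc @ V                               # the d-dimensional PCA solution XV

# Example: points lying close to a 2-dimensional plane in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(500, 3))
Y = pca(X, 2)   # recovers a two-dimensional representation
```
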
Despite PCA's popularity, it has a number of limitations. Perhaps the most blatant drawback is the requirement that the data lie on a linear subspace. Returning to the previous example, what if the plane were curled, as it is in figure 1? Though the data is still intuitively two-dimensional, PCA will not correctly extract this two-dimensional structure. The problem is that this data set, commonly referred to as the swiss roll, is a two-dimensional manifold, not a two-dimensional subspace. Manifold learning algorithms essentially attempt to duplicate the behavior of PCA, but on manifolds instead of linear subspaces.

We now briefly review the concept of a manifold and formalize the manifold learning problem.

2.2 Manifolds

Consider the curve shown in figure 2. Note that the curve is in R^3, yet it has zero volume, and in fact zero area. The extrinsic dimensionality, three, is somewhat misleading since the curve can be parameterized by a single variable. One way of formalizing this intuition is via the idea of a manifold: the curve is a one-dimensional manifold because it locally looks like a copy of R^1. Let us quickly review some basic terminology from geometry and topology in order to crystallize this notion of dimensionality.

Figure 1: A curled plane: the swiss roll.

Figure 2: A one-dimensional manifold embedded in three dimensions.

Definition 1. A homeomorphism is a continuous function whose inverse is also a continuous function.

Definition 2. A d-dimensional manifold M is a set that is locally homeomorphic with R^d. That is, for each x ∈ M, there is an open neighborhood around x, N_x, and a homeomorphism f : N_x → R^d. These neighborhoods are referred to as coordinate patches, and the map is referred to as a coordinate chart. The image of the coordinate charts is referred to as the parameter space.

Manifolds are terrifically well-studied in mathematics and the above definition is extremely general. We will be interested only in the case where M is a subset of R^D, where D is typically much larger than d. In other words, the manifold will lie in a high-dimensional space (R^D), but will be homeomorphic with a low-dimensional space (R^d, with d < D).

Additionally, all of the algorithms we will look at impose some smoothness requirements that further constrain the class of manifolds considered.

Definition 3. A smooth (or differentiable) manifold is a manifold such that each coordinate chart is differentiable with a differentiable inverse (i.e., each coordinate chart is a diffeomorphism).

We will be using the term embedding in a number of places, though our usage will be somewhat loose and inconsistent. An embedding of a manifold M into R^D is a smooth homeomorphism from M to a subset of R^D. The algorithms discussed in this paper find embeddings of discrete point-sets, by which we simply mean a mapping of the point set into another space (typically lower-dimensional Euclidean space). An embedding of a dissimilarity matrix into R^d, as discussed in section 3.1.2, is a configuration of points whose interpoint distances match those given in the matrix.

With this background in mind, we proceed to the main topic of this paper.

2.3 Manifold Learning

We have discussed the importance of dimensionality reduction and provided some intuition concerning the inadequacy of PCA for data sets lying along some non-flat surface; we now turn to the manifold learning approach to dimensionality reduction.

We are given data x_1, x_2, ..., x_n ∈ R^D and we wish to reduce the dimensionality of this data. PCA is appropriate if the data lies on a low-dimensional subspace of R^D. Instead, we assume only that the data lies on a d-dimensional manifold embedded into R^D, where d < D. Moreover, we assume that the manifold is given by a single coordinate chart. (All smooth, compact manifolds can be described by a single coordinate chart, so this assumption is not overly restrictive.) We can now describe the problem formally.

Problem: Given points x_1, ..., x_n ∈ R^D that lie on a d-dimensional manifold M that can be described by a single coordinate chart f : M → R^d, find y_1, ..., y_n ∈ R^d, where y_i := f(x_i).

Solving this problem is referred to as manifold learning, since we are trying to learn a manifold from a set of points. A simple and oft-used example in the manifold learning literature is the swiss roll, a two-dimensional manifold embedded in R^3. Figure 3 shows the swiss roll and a learned two-dimensional embedding of the manifold found using Isomap, an algorithm discussed later. Notice that points nearby in the original data set neighbor one another in the two-dimensional data set; this effect occurs because the chart f between these two graphs is a homeomorphism (as are all charts).

Though the extent to which real-world data sets exhibit low-dimensional manifold structure is still being explored, a number of positive examples have been found. Perhaps the most notable occurrence is in video and image data sets. For example, suppose we have a collection of frames taken from a video of a person rotating his or her head. The dimensionality of the data set is equal to the number of pixels in a frame, which is generally very large. However, the images are actually governed by only a couple of degrees of freedom (such as the angle of rotation). Manifold learning techniques have been successfully applied to a number of similar video and image data sets [Ple03].

3 Algorithms

In this section, we survey a number of algorithms proposed for manifold learning. Our treatment goes roughly in chronological order: Isomap and LLE were the first algorithms developed, followed by Laplacian Eigenmaps. Semidefinite Embedding and the variants of LLE were developed most recently.

Figure 3: The swiss roll (top) and the learned two-dimensional representation (bottom).

Before beginning, we note that every algorithm that we will discuss requires a neighborhood-size parameter k. Every algorithm assumes that within each neighborhood the manifold is approximately flat. Typically, one simply runs an algorithm over a variety of choices of neighborhood size and compares the outputs to pick the appropriate k. Other neighborhood schemes, such as picking a neighborhood radius ε, are sometimes used in place of the k-nearest neighbor scheme.

3.1 Isomap

Isomap [TdL00], short for isometric feature mapping, was one of the first algorithms introduced for manifold learning. The algorithm is perhaps the best known and most applied among the multitude of procedures now available for the problem at hand. It may be viewed as an extension of Multidimensional Scaling (MDS), a classical method for embedding dissimilarity information into Euclidean space. Isomap consists of two main steps:

1. Estimate the geodesic distances (distances along a manifold) between points in the input using shortest-path distances on the data set's k-nearest neighbor graph.

2. Use MDS to find points in low-dimensional Euclidean space whose interpoint distances match the distances found in step 1.

We consider each of these steps in turn.

3.1.1 Estimating geodesic distances

Again, we are assuming that the data is drawn from a d-dimensional manifold embedded in R^D and we hope to find a chart back into R^d. Isomap assumes further that there is an isometric chart, i.e. a chart that preserves the distances between points. That is, if x_i, x_j are points on the manifold and G(x_i, x_j) is the geodesic distance between them, then there is a chart f : M → R^d such that

    ‖f(x_i) − f(x_j)‖ = G(x_i, x_j).

Furthermore, it is assumed that the manifold is smooth enough that the geodesic distance between nearby points is approximately linear. Thus, the Euclidean distance between nearby points in the high-dimensional data space is assumed to be a good approximation to the geodesic distance between these points.

For points that are distant in the high-dimensional space, the Euclidean distance between them is not a good estimate of the geodesic distance. The problem is that though the manifold is locally linear (at least approximately), this approximation breaks down as the distance between points increases. Estimating distances between distant points thus requires a different technique. To perform this estimation, the Isomap algorithm first constructs a k-nearest neighbor graph that is weighted by the Euclidean distances. Then, the algorithm runs a shortest-path algorithm (Dijkstra's or Floyd's) and uses its output as the estimates for the remainder of the geodesic distances.

3.1.2 Multidimensional Scaling

Once these geodesic distances are calculated, the Isomap algorithm finds points whose Euclidean distances equal these geodesic distances. Since the manifold is isometrically embedded, such points exist, and in fact, they are unique up to translation and rotation. Multidimensional Scaling [CC01] is a classical technique that may be used to find such points. The basic problem it solves is this: given a matrix D ∈ R^{n×n} of dissimilarities, construct a set of points whose interpoint Euclidean distances match those in D closely. There are numerous cost functions for this task and a variety of algorithms for minimizing these cost functions. We focus on classical MDS (cMDS), since that is what is used by Isomap.

The cMDS algorithm, shown in figure 4, is based on the following theorem, which establishes a connection between the space of Euclidean distance matrices and the space of Gram matrices (inner-product matrices).

Theorem 1. A nonnegative symmetric matrix D ∈ R^{n×n}, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2) H D H, where H := I − (1/n) 1 1^⊤, is positive semidefinite. Furthermore, this B will be the Gram matrix for a mean-centered configuration with interpoint distances given by D.

A proof of this theorem is given in the appendix. The theorem gives a simple technique to convert a Euclidean distance matrix D into an appropriate Gram matrix B. Let X be the configuration that we are after. Since B is its Gram matrix, B = XX^⊤. To get X from B, we spectrally decompose B into UΛU^⊤. Then, X = UΛ^{1/2}.

What if D is not Euclidean? In the application of Isomap, for example, D will generally not be perfectly Euclidean since we are only able to approximate the geodesic distances. In this case, B, as described in the above theorem, will not be positive semidefinite and thus will not be a Gram matrix. To handle this case, cMDS projects B onto the cone of positive semidefinite matrices by setting its negative eigenvalues to 0 (step 3).

Finally, users of cMDS often want a low-dimensional embedding. In general, X := UΛ_+^{1/2} (step 4) will be an n-dimensional configuration. This dimensionality is reduced to d in step 5 of the algorithm by simply removing the n − d trailing columns of X. Doing so is equivalent to projecting X onto its top d principal components, as we now demonstrate.
Let X := UΛ_+^{1/2} be the n-dimensional configuration found in step 4. Then, UΛ_+U^⊤ is the spectral decomposition of XX^⊤. Suppose that we use PCA to project X onto d dimensions. Then, PCA returns XV, where V ∈ R^{n×d} and the columns of V are given by the eigenvectors of X^⊤X: v_1, ..., v_d. Let us compute these eigenvectors:

    X^⊤ X v_i = λ_i v_i
    (UΛ_+^{1/2})^⊤ (UΛ_+^{1/2}) v_i = λ_i v_i
    Λ_+^{1/2} U^⊤ U Λ_+^{1/2} v_i = λ_i v_i
    Λ_+ v_i = λ_i v_i.

Thus, v_i = e_i, where e_i is the ith standard basis vector for R^n, with corresponding eigenvalue [Λ_+]_ii. Then, XV = X[e_1 · · · e_d] = [X]_{n×d}, so the truncation performed in step 5 is indeed equivalent to projecting X onto its first d principal components.

classical Multidimensional Scaling
input: D ∈ R^{n×n} (D_ii = 0, D_ij ≥ 0), d ∈ {1, ..., n}
1. Set B := −(1/2) H D H, where H = I − (1/n) 1 1^⊤ is the centering matrix.
2. Compute the spectral decomposition of B: B = UΛU^⊤.
3. Form Λ_+ by setting [Λ_+]_ij := max{Λ_ij, 0}.
4. Set X := UΛ_+^{1/2}.
5. Return [X]_{n×d}.

Figure 4: Classical MDS.

This equivalence is useful because if a set of dissimilarities can be realized in d dimensions, then all but the first d columns of X will be zeros; moreover, Λ_+ reveals the relative importance of each dimension, since these eigenvalues are the same as those found by PCA. In other words, classical MDS automatically finds the dimensionality of the configuration associated with a set of dissimilarities, where dimensionality refers to the dimensionality of the subspace along which X lies. Within the context of Isomap, classical MDS automatically finds the dimensionality of the parameter space of the manifold, which is the dimensionality of the manifold.

Classical MDS finds the optimal d-dimensional configuration of points for the cost function ‖HDH − HD′H‖²_F, where D′ ranges over all Euclidean distance matrices for d-dimensional configurations. Unfortunately, this cost function makes cMDS tremendously sensitive to noise and outliers; moreover, the problem is NP-hard for a wide variety of more robust cost functions [CD05].

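As an illustration of figure 4, a minimal sketch of classical MDS might look as follows (a NumPy sketch under the assumption that D holds squared dissimilarities; not a reference implementation):

```python
import numpy as np

def cmds(D, d):
    """Classical MDS (figure 4): embed an n x n matrix of squared dissimilarities into R^d."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix H = I - (1/n) 1 1^T
    B = -0.5 * H @ D @ H                        # candidate Gram matrix (Theorem 1)
    evals, evecs = np.linalg.eigh(B)            # spectral decomposition B = U Lam U^T
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder so eigenvalues are decreasing
    evals = np.maximum(evals, 0)                # project onto the PSD cone (step 3)
    X = evecs * np.sqrt(evals)                  # X = U Lam_+^{1/2} (step 4)
    return X[:, :d]                             # keep only the top d columns (step 5)
```
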
3.1.3 Putting it together

The Isomap algorithm consists of estimating geodesic distances using shortest-path distances and then finding an embedding of these distances in Euclidean space using cMDS. Figure 5 summarizes the algorithm.

One particularly helpful feature of Isomap, not found in some of the other algorithms, is that it automatically provides an estimate of the dimensionality of the underlying manifold. In particular, the number of non-zero eigenvalues found by cMDS gives the underlying dimensionality. Generally, however, there will be some noise and uncertainty in the estimation process, so one looks for a tell-tale gap in the spectrum to determine the dimensionality.

Unlike many of the algorithms to be discussed, Isomap has an optimality guarantee [BdSLT00]; roughly, Isomap is guaranteed to recover the parameterization of a manifold under the following assumptions.

1. The manifold is isometrically embedded into R^D.

2. The underlying parameter space is convex, i.e. the distance between any two points in the parameter space is given by a geodesic that lies entirely within the parameter space. Intuitively, the parameter space of the manifold from which we are sampling cannot contain any holes.

3. The manifold is well sampled everywhere.

4. The manifold is compact.

The proof of this optimality begins by showing that, with enough samples, the graph distances approach the underlying geodesic distances. In the limit, these distances will be perfectly Euclidean and cMDS will return the correct low-dimensional points. Moreover, cMDS is continuous: small changes in the input will result in small changes in the output. Because of this robustness to perturbations, the solution of cMDS will converge to the correct parameterization in the limit.

Isomap has been extended in two major directions: to handle conformal mappings and to handle very large data sets [dST03]. The former extension is referred to as C-Isomap and addresses the issue that many embeddings preserve angles, but not distances. C-Isomap has a theoretical optimality guarantee similar to that of Isomap, but requires that the manifold be uniformly sampled.

The second extension, Landmark Isomap, is based on using landmark points to speed up the calculation of interpoint distances and cMDS. Both of these steps require O(n^3) timesteps (the distance calculation can be brought down to O(kn^2 log n) using Fibonacci heaps), which is prohibitive for large data sets. The basic idea is to select a small set of l landmark points, where l > d, but l ≪ n. Next, the approximate geodesic distances from each point to the landmark points are computed (in contrast to computing all n × n interpoint distances, only n × l are computed). Then, cMDS is run on the l × l distance matrix for the landmark points. Finally, the remainder of the points are located using a simple formula based on their distances to the landmarks. This technique brings the time complexity of the two computational bottlenecks of Isomap down considerably: the time complexity of L-Isomap is O(dln log n + l^3) instead of O(n^3).

Isomap
input: x_1, ..., x_n ∈ R^D, k
1. Form the k-nearest neighbor graph with edge weights W_ij := ‖x_i − x_j‖ for neighboring points x_i, x_j.
2. Compute the shortest-path distances between all pairs of points using Dijkstra's or Floyd's algorithm. Store the squares of these distances in D.
3. Return Y := cMDS(D).

Figure 5: Isomap.

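Putting the two steps together, a sketch of the basic (non-landmark) Isomap pipeline of figure 5 might look like this; it leans on scikit-learn's neighbor graph, SciPy's shortest-path routine, and the cmds helper sketched above, so treat it purely as an illustration:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, d, k):
    """Isomap (figure 5): graph geodesics followed by classical MDS."""
    # Step 1: k-nearest neighbor graph weighted by Euclidean distances.
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Step 2: shortest-path distances on the graph approximate geodesic distances.
    geodesic = shortest_path(G, method='D', directed=False)   # 'D' = Dijkstra
    # Step 3: classical MDS on the squared geodesic distances.
    return cmds(geodesic ** 2, d)
```
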
3.2 Locally Linear Embedding

Locally Linear Embedding (LLE) [SR03] was introduced at about the same time as Isomap, but is based on a substantially different intuition. The idea comes from visualizing a manifold as a collection of overlapping coordinate patches. If the neighborhood sizes are small and the manifold is sufficiently smooth, then these patches will be approximately linear. Moreover, the chart from the manifold to R^d will be roughly linear on these small patches. The idea, then, is to identify these linear patches, characterize their geometry, and find a mapping to R^d that preserves this local geometry and is nearly linear. It is assumed that these local patches will overlap with one another so that the local reconstructions will combine into a global one.

As in all of these methods, the number of neighbors chosen is a parameter of the algorithm. Let N(i) be the set of the k-nearest neighbors of point x_i. The first step of the algorithm models the manifold as a collection of linear patches and attempts to characterize the geometry of these linear patches. To do so, it attempts to represent x_i as a weighted, convex combination of its nearest neighbors. The weights are chosen to minimize the following squared error for each i:

    ‖ x_i − Σ_{j∈N(i)} W_ij x_j ‖².

The weight matrix W is used as a surrogate for the local geometry of the patches. Intuitively, W_i reveals the layout of the points around x_i. There are a couple of constraints on the weights: each row must sum to one (equivalently, each point is represented as a convex combination of its neighbors) and W_ij = 0 if j ∉ N(i). The second constraint reflects that LLE is a local method; the first makes the weights invariant to global translations: if each x_i is shifted by a constant vector t, then

    ‖ (x_i + t) − Σ_{j∈N(i)} W_ij (x_j + t) ‖² = ‖ x_i − Σ_{j∈N(i)} W_ij x_j ‖²

(using the fact that the weights sum to one), so W is unaffected. Moreover, W is invariant to global rotations and scalings.

Fortunately, there is a closed-form solution for W, which may be derived using Lagrange multipliers. In particular, the reconstruction weights for each point x_i are given by

    W̃_ij = ( Σ_k C⁻¹_jk ) / ( Σ_{lm} C⁻¹_lm ),

where C is the local covariance matrix with entries C_jk := (x_i − η_j)^⊤ (x_i − η_k), and η_j, η_k are neighbors of x_i. W̃ ∈ R^{n×k} is then transformed into the sparse n × n matrix W: W_ij = W̃_il if x_j is the lth neighbor of x_i, and is 0 if j ∉ N(i).

W_i is a characterization of the local geometry of the manifold around x_i. Since it is invariant to rotations, scalings, and translations, W_i will also characterize the local geometry around x_i for any linear mapping of the neighborhood patch surrounding x_i. We expect that the chart from this neighborhood patch to the parameter space should be approximately linear since the manifold is smooth. Thus, W_i should capture the local geometry of the parameter space as well. The second step of the algorithm finds a configuration in d dimensions (the dimensionality of the parameter space) whose local geometry is characterized well by W. Unlike with Isomap, d must be known a priori or estimated.

Locally Linear Embedding
input: x_1, ..., x_n ∈ R^D, d, k
1. Compute reconstruction weights. For each point x_i, set
   W̃_ij := ( Σ_k C⁻¹_jk ) / ( Σ_{lm} C⁻¹_lm ).
2. Compute the low-dimensional embedding. Let U be the matrix whose columns are the eigenvectors of (I − W)^⊤(I − W) with nonzero accompanying eigenvalues. Return Y := [U]_{n×d}.

Figure 6: Locally Linear Embedding.

The d-dimensional configuration is found by minimizing

    Σ_i ‖ y_i − Σ_j W_ij y_j ‖²    (7)

with respect to y_1, ..., y_n ∈ R^d. Note that this expression has the same form as the error minimized in the first step of the algorithm, but that we are now minimizing with respect to y_1, ..., y_n, rather than W. The cost function (7) may be rewritten as

    Y^⊤ M Y,

where

    M_ij := δ_ij − W_ij − W_ji + Σ_k W_ki W_kj,

and δ_ij := 1 if i = j and 0 otherwise. M may be written more concisely as (I − W)^⊤(I − W). There are a couple of constraints on Y. First, Y^⊤Y = I, which forces the solution to be of rank d. Second, Σ_i Y_i = 0; this constraint centers the embedding on the origin. This cost function is a form of Rayleigh's quotient, so it is minimized by setting the columns of Y to the bottom d eigenvectors of M. However, the bottom (eigenvector, eigenvalue) pair is (1, 0), so the final coordinate of each point is identical. To avoid this degenerate dimension, the d-dimensional embedding is given by the bottom d non-constant eigenvectors.

Conformal Eigenmaps [SS05] is a recent technique that may be used in conjunction with LLE to provide an automatic estimate of d, the dimensionality of the manifold. LLE is summarized in figure 6.

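A sketch of the second step, under the assumption that W is the dense n × n weight matrix built above, might read:

```python
import numpy as np

def lle_embed(W, d):
    """Minimize cost (7): the rows of the returned matrix are y_1, ..., y_n."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)   # M = (I - W)^T (I - W)
    evals, evecs = np.linalg.eigh(M)          # eigenvalues returned in ascending order
    return evecs[:, 1:d + 1]                  # drop the constant bottom eigenvector, keep the next d
```

In practice W is stored as a sparse matrix and only a few eigenvectors are computed, which is part of why LLE scales well (see section 4.3).
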
3.3 Laplacian Eigenmaps

The Laplacian Eigenmaps [BN01] algorithm is based on ideas from spectral graph theory. Given a graph and a matrix of edge weights, W, the graph Laplacian is defined as L := D − W, where D is the diagonal matrix with elements D_ii = Σ_j W_ij. (L is referred to as the unnormalized Laplacian in many places; the normalized version is D^{−1/2} L D^{−1/2}.) The eigenvalues and eigenvectors of the Laplacian reveal a wealth of information about the graph, such as whether it is complete or connected. Here, the Laplacian will be exploited to capture local information about the manifold.

We define a local similarity matrix W which reflects the degree to which points are near to one another. There are two choices for W:

1. W_ij = 1 if x_j is one of the k-nearest neighbors of x_i, and 0 otherwise.

2. W_ij = e^{−‖x_i − x_j‖²/(2σ²)} for neighboring nodes, and 0 otherwise. This is the Gaussian heat kernel, which has an interesting theoretical justification given in [BN03]. On the other hand, using the heat kernel requires manually setting σ, so it is considerably less convenient than the simple-minded weight scheme.

We use this similarity matrix W to find points y_1, ..., y_n ∈ R^d that are the low-dimensional analogs of x_1, ..., x_n. As with LLE, d is a parameter of the algorithm, but it can be automatically estimated by using Conformal Eigenmaps as a post-processor. For simplicity, we first derive the algorithm for d = 1, i.e. each y_i is a scalar.

Intuitively, if x_i and x_j have a high degree of similarity, as measured by W, then they lie very near to one another on the manifold. Thus, y_i and y_j should be near to one another. This intuition leads to the following cost function, which we wish to minimize:

    Σ_ij W_ij (y_i − y_j)².    (8)

This function is minimized by setting y_1 = · · · = y_n = 0; we avoid this trivial solution by enforcing the constraint

    y^⊤ D y = 1.    (9)

This constraint effectively fixes a scale at which the y_i's are placed. Without this constraint, setting y′_i = α y_i, with any constant α < 1, would give a configuration with lower cost than the configuration y_1, ..., y_n.
Note that we could achieve the same result with the constraint y^⊤ y = 1; however, using (9) leads to a simple eigenvalue solution.

Using Lagrange multipliers, it can be shown that the bottom eigenvector of the generalized eigenvalue problem Ly = λDy minimizes (8). However, the bottom (eigenvalue, eigenvector) pair is (0, 1), which is yet another trivial solution (it maps each point to 1). Avoiding this solution, the embedding is given by the eigenvector with the smallest non-zero eigenvalue.

We now turn to the more general case, where we are looking for d-dimensional representatives y_1, ..., y_n of the data points. We will find an n × d matrix Y, whose rows are the embedded points. As before, we wish to minimize the distance between similar points; our cost function is thus

    Σ_ij W_ij ‖y_i − y_j‖² = tr(Y^⊤ L Y).    (10)

In the one-dimensional case, we added a constraint to avoid the trivial solution of all zeros. Analogously, we must constrain Y so that we do not end up with a solution with dimensionality less than d. The constraint Y^⊤ D Y = I forces Y to be of full dimensionality; we could use Y^⊤ Y = I instead, but Y^⊤ D Y = I is preferable because it leads to a closed-form eigenvalue solution. Again, we may solve this problem via the generalized eigenvalue problem Ly = λDy. This time, however, we need the eigenvectors corresponding to the bottom d non-zero eigenvalues. Y is then given by the matrix whose columns are these d eigenvectors.

The algorithm is summarized in figure 7.

Laplacian Eigenmaps
input: x_1, ..., x_n ∈ R^D, d, k, σ
1. Set W_ij := e^{−‖x_i − x_j‖²/(2σ²)} if x_j ∈ N(i), and 0 otherwise.
2. Let U be the matrix whose columns are the eigenvectors of Ly = λDy with nonzero accompanying eigenvalues.
3. Return Y := [U]_{n×d}.

Figure 7: Laplacian Eigenmaps.

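A compact sketch of figure 7 follows, using the Gaussian heat-kernel weights and SciPy's generalized symmetric eigensolver for Ly = λDy; the symmetrization of the neighbor graph is a practical detail not spelled out above:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, d, k, sigma):
    """Laplacian Eigenmaps (figure 7) with Gaussian heat-kernel weights."""
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    A = np.maximum(A, A.T)                                        # symmetrize the neighbor relation
    W = np.where(A > 0, np.exp(-A ** 2 / (2 * sigma ** 2)), 0.0)  # heat-kernel weights
    D = np.diag(W.sum(axis=1))                                    # degree matrix
    L = D - W                                                     # unnormalized graph Laplacian
    evals, evecs = eigh(L, D)                                     # generalized problem L y = lambda D y
    return evecs[:, 1:d + 1]                                      # skip the trivial constant eigenvector
```
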
3.3.1 Connection to LLE

Though LLE is based on a different intuition than Laplacian Eigenmaps, Belkin and Niyogi [BN03] were able to show that the algorithms are equivalent in some situations. The equivalence rests on another motivation for Laplacian Eigenmaps, not discussed here, that frames the algorithm as a discretized version of finding the eigenfunctions of the Laplace-Beltrami operator, L, on a manifold. (In the algorithm presented, we are looking for eigenvectors; the eigenvectors are the eigenfunctions evaluated at the data points.) The authors demonstrate that LLE, under certain assumptions, is finding the eigenfunctions of (1/2) L², which are the same as the eigenfunctions of L. In the proof, they assume that the neighborhood patches selected by LLE are perfectly locally linear (i.e., the neighbors of a point x actually lie on the tangent plane at x). Furthermore, the equivalence is only shown in expectation over locally uniform measures on the manifold.

The connection is an intriguing one, but has not been exploited or even demonstrated on any real data sets. It is not clear how reasonable the assumptions are. These two algorithms are still considered distinct, with different characteristics.

3.4 Semidefinite Embedding

Semidefinite embedding (SDE) [WSS04], yet another algorithm for manifold learning, is based on a physical intuition. Imagine that each point is connected to its nearest neighbors with a rigid rod. Now, we take this structure and pull it as far apart as possible, i.e. we attempt to maximize the distance between points that are not neighbors. If the manifold looks like a curved version of the parameter space, then hopefully this procedure will unravel it properly.

To make this intuition a little more concrete, let us look at a simple example. Suppose we sample some points from a sine curve in R^2. The sine curve is a one-dimensional manifold embedded into R^2, so let us apply the SDE idea to it to try to find a one-dimensional representation. We first attach each point to its two nearest neighbors. We then pull apart the resulting structure as far as possible. Figure 8 shows the steps of the procedure. As the figure shows, we have successfully found a one-dimensional representation of the data set. On the other hand, if the neighbors were chosen a bit differently, the procedure would have failed.

Figure 8: An illustration of the idea behind SDE. The top figure is a few points sampled from a sine curve. The middle figure shows the result of connecting each point to its two neighbors. The bottom shows the result of the pulling step.

With this simple example in mind, we proceed to describe the algorithm more precisely. The SDE algorithm uses a semidefinite program [VB96], which is essentially a linear program with additional constraints that force the variables to form a positive semidefinite matrix. Recall that a matrix is a Gram matrix (an inner-product matrix) if and only if it is positive semidefinite. The SDE algorithm uses a semidefinite program to find an appropriate Gram matrix, then extracts a configuration from the Gram matrix using the same trick employed in classical MDS.

The primary constraint of the program is that distances between neighboring points remain the same, i.e. if x_i and x_j are neighbors, then ‖y_i − y_j‖ = ‖x_i − x_j‖, where y_i and y_j are the low-dimensional representatives of x_i and x_j. However, we must express this constraint in terms of the Gram matrix, B, since this is the variable the semidefinite program works with (and not Y). Translating the constraint is simple enough, though, since

    ‖y_i − y_j‖² = ‖y_i‖² + ‖y_j‖² − 2⟨y_i, y_j⟩
                 = ⟨y_i, y_i⟩ + ⟨y_j, y_j⟩ − 2⟨y_i, y_j⟩
                 = B_ii + B_jj − 2B_ij.

The constraint, then, is

    B_ii + B_jj − 2B_ij = ‖x_i − x_j‖²

for all neighboring points x_i, x_j.

We wish to pull the remainder of the points as far apart as possible, so we maximize the interpoint distances of our configuration subject to the above constraint. The objective function is:

    max Σ_{i,j} ‖y_i − y_j‖² = max Σ_{i,j} (B_ii + B_jj − 2B_ij)
                             = max Σ_{i,j} (B_ii + B_jj)
                             = max tr(B).

The second equality follows from a regularization constraint: since the interpoint distances are invariant to translation, there is an extra degree of freedom in the problem, which we remove by enforcing

    Σ_ij B_ij = 0.

The last equality holds because Σ_{i,j} (B_ii + B_jj) = 2n tr(B), and maximizing a positive multiple of tr(B) is the same as maximizing tr(B).

The semidefinite program may be solved by using any of a wide variety of software packages. The second part of the algorithm is identical to the second part of classical MDS: the eigenvalue decomposition UΛU^⊤ is computed for B and the new coordinates are given by UΛ^{1/2}. Additionally, the eigenvalues reveal the underlying dimensionality of the manifold. The algorithm is summarized in figure 9.

Semidefinite Embedding
input: x_1, ..., x_n ∈ R^D, k
1. Solve the following semidefinite program:
       maximize tr(B)
       subject to: Σ_ij B_ij = 0;
                   B_ii + B_jj − 2B_ij = ‖x_i − x_j‖² (for all neighboring x_i, x_j);
                   B ⪰ 0.
2. Compute the spectral decomposition of B: B = UΛU^⊤.
3. Form Λ_+ by setting [Λ_+]_ij := max{Λ_ij, 0}.
4. Return Y := UΛ_+^{1/2}.

Figure 9: Semidefinite Embedding.

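Solving the semidefinite program of figure 9 requires an SDP solver; the sketch below uses the CVXPY modeling package purely as a convenience (it is not what the SDE authors used) and, as discussed next, is only practical for small n:

```python
import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

def sde(X, k):
    """Semidefinite Embedding (figure 9): learn a Gram matrix B, then factor it."""
    n = X.shape[0]
    G = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    G = np.maximum(G, G.T)                        # symmetric neighbor relation
    B = cp.Variable((n, n), PSD=True)             # B must be positive semidefinite
    constraints = [cp.sum(B) == 0]                # centering constraint: sum_ij B_ij = 0
    for i in range(n):
        for j in range(i + 1, n):
            if G[i, j]:                           # local distances are preserved exactly
                constraints.append(B[i, i] + B[j, j] - 2 * B[i, j]
                                   == np.sum((X[i] - X[j]) ** 2))
    cp.Problem(cp.Maximize(cp.trace(B)), constraints).solve()
    evals, evecs = np.linalg.eigh(B.value)        # extract coordinates as in classical MDS
    evals = np.maximum(evals[::-1], 0)
    return evecs[:, ::-1] * np.sqrt(evals)        # Y = U Lam_+^{1/2}
```
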
One major gripe with SDE is that solving a semidefinite program requires tremendous computational resources. State-of-the-art semidefinite programming packages cannot handle an instance with n ≥ 2000 or so on a powerful desktop machine. This is a rather severe restriction, since real data sets often contain many more than 2000 points. One possible approach to solving larger instances would be to develop a projected subgradient algorithm [Ber99]. This technique resembles gradient descent, but can be used to handle non-differentiable convex functions. This approach has been exploited for a similar problem: finding robust embeddings of dissimilarity information [CD05]. This avenue has not been pursued for SDE.

Instead, there is an approximation known as ℓSDE [WPS05], which uses matrix factorization to keep the size of the semidefinite program down. The idea is to factor the Gram matrix B into QLQ^⊤, where L is an l × l Gram matrix of landmark points and l ≪ n. The semidefinite program only finds L, and then the rest of the points are located with respect to the landmarks. This method was a full two orders of magnitude faster than SDE in some experiments [WPS05]. This approximation to SDE appears to produce high-quality results, though more investigation is needed.

3.5 Other algorithms

We have looked at four algorithms for manifold learning, but there are certainly more. In this section, we briefly discuss two other algorithms.

As discussed above, both the LLE and Laplacian Eigenmaps algorithms are Laplacian-based algorithms. Hessian LLE (hLLE) [DG03] is closely related to these algorithms, but replaces the graph Laplacian with a Hessian estimator (recall that the Hessian of a function is the matrix of its second partial derivatives). Like the LLE algorithm, hLLE looks at locally linear patches of the manifold and attempts to map them to a low-dimensional space. LLE models these patches by representing each x_i as a weighted combination of its neighbors. In place of this technique, hLLE runs principal components analysis on the neighborhood set to get an estimate of the tangent space at x_i (i.e. the linear patch surrounding x_i). The second step of LLE maps these linear patches to linear patches in R^d, using the weights found in the first step to guide the process. Similarly, hLLE maps the tangent space estimates to linear patches in R^d, warping the patches as little as possible. The warping factor is the Frobenius norm of the Hessian matrix. The mapping with minimum norm is found by solving an eigenvalue problem.

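Both hLLE and LTSA (discussed below) begin by estimating the tangent space at each point with a local PCA; a minimal sketch of that shared first step, again assuming a hypothetical `neighbors` array of k-nearest-neighbor indices, is:

```python
import numpy as np

def tangent_bases(X, neighbors, d):
    """Estimate a d-dimensional tangent basis at each point via PCA on its neighborhood."""
    bases = []
    for i in range(X.shape[0]):
        Z = X[neighbors[i]] - X[neighbors[i]].mean(axis=0)  # centered neighborhood patch
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)    # right singular vectors = principal directions
        bases.append(Vt[:d].T)                              # D x d orthonormal basis for the tangent space
    return bases
```
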
The hLLE algorithm has a fairly strong optimality guarantee, stronger, in some sense, than that of Isomap. Whereas Isomap requires that the parameter space be convex and the embedded manifold be globally isometric to the parameter space, hLLE requires only that the parameter space be connected and locally isometric to the manifold. In these cases hLLE is guaranteed to asymptotically recover the parameterization. Additionally, some empirical evidence supporting this theoretical claim has been given: Donoho and Grimes [DG03] demonstrate that hLLE outperforms Isomap on a variant of the swiss roll data set where a hole has been punched out of the surface. Though its theoretical guarantee and some limited empirical evidence make the algorithm a strong contender, hLLE has yet to be widely used, perhaps partly because it is a bit more sophisticated and difficult to implement than Isomap or LLE. It is also worth pointing out that the theoretical guarantees are of a slightly different nature than those of Isomap: whereas Isomap has some finite sample error bounds, the theorems pertaining to hLLE hold only in the continuum limit. Even so, the hLLE algorithm is a viable alternative to many of the algorithms discussed.

Another recent algorithm proposed for manifold learning is Local Tangent Space Alignment (LTSA) [ZZ04]. Similar to hLLE, the LTSA algorithm estimates the tangent space at each point by performing PCA on its neighborhood set. Since the data is assumed to lie on a d-dimensional manifold, and to be locally linear, these tangent spaces will admit a d-dimensional basis. These tangent spaces are aligned to find global coordinates for the points. The alignment is computed by minimizing a cost function that allows any linear transformation of each local coordinate space. LTSA is a relatively recent algorithm that has not found widespread popularity yet; as such, its relative strengths and weaknesses are still under investigation.

There are a number of other algorithms in addition to the ones described here. We have touched on those algorithms that have had the greatest impact on developments in the field thus far.

4 Comparisons

We have discussed a myriad of different algorithms, all of which are for roughly the same task. Choosing between these algorithms is difficult; there is no clear best one. In this section, we discuss the trade-offs associated with each algorithm.

4.1 Embedding type

One major distinction between these algorithms is the type of embedding that they compute. Isomap, Hessian LLE, and Semidefinite Embedding all look for isometric embeddings, i.e. they all assume that there is a coordinate chart from the parameter space to the high-dimensional space that preserves interpoint distances, and they attempt to uncover this chart. In contrast, LLE (and the C-Isomap variant of Isomap) look for conformal mappings, mappings which preserve local angles but not necessarily distances. It remains open whether the LLE algorithm actually can recover such mappings correctly. It is unclear exactly what type of embedding the Laplacian Eigenmaps algorithm computes [BN03]; it has been dubbed proximity preserving in places [WS04]. This breakdown is summarized in the following table.

    Algorithm                  Mapping
    C-Isomap                   conformal
    Isomap                     isometric
    Hessian LLE                isometric
    Laplacian Eigenmaps        unknown
    Locally Linear Embedding   conformal
    Semidefinite Embedding     isometric

This comparison is not particularly practical, since the form of embedding for a particular data set is usually unknown; however, if, say, an isometric method fails, then perhaps a conformal method would be a reasonable alternative.

4.2 Local versus global

Manifold learning algorithms are commonly split into two camps: local methods and global methods. Isomap is considered a global method because it constructs an embedding derived from the geodesic distances between all pairs of points. LLE is considered a local method because the cost function that it uses to construct an embedding only considers the placement of each point with respect to its neighbors. Similarly, Laplacian Eigenmaps and the derivatives of LLE are local methods. Semidefinite embedding, on the other hand, straddles the local/global division: it is a local method in that intra-neighborhood distances are equated with geodesic distances, but a global method in that its objective function considers distances between all pairs of points. Figure 10 summarizes the breakdown of these algorithms.

Figure 10: The local/global split.

The local/global split reveals some distinguishing characteristics between algorithms in the two categories. Local methods tend to characterize the local geometry of manifolds accurately, but break down at the global level. For instance, the points in the embedding found by local methods tend to be well placed with respect to their neighbors, but non-neighbors may be much nearer to one another than they are on the manifold. The reason for these aberrations is simple: the constraints on the local methods do not pertain to non-local points. The behavior of Isomap is just the opposite: the intra-neighborhood distances for points in the embedding tend to be inaccurate, whereas the large interpoint distances tend to be correct. Classical MDS creates this behavior; its cost function contains fourth-order terms of the distances, which exacerbates the differences in scale between small and large distances. As a result, the emphasis of the cost function is on the large interpoint distances, and the local topology of points is often corrupted in the embedding process. SDE behaves more like the local methods in that the distances between neighboring points are guaranteed to be the same before and after the embedding.

The local methods also seem to handle sharp curvature in manifolds better than Isomap. Additionally, shortcut edges, i.e. edges between points that should not actually be classified as neighbors, can have a potentially disastrous effect in Isomap. A few shortcut edges can badly damage all of the geodesic distance approximations. In contrast, this type of error does not seem to propagate through the entire embedding with the local methods.

4.3 Time complexity

The primary computational bottleneck for all of these algorithms, except for semidefinite embedding, is a spectral decomposition. Computing this decomposition for a dense n × n matrix requires O(n^3) time steps. For sparse matrices, however, a spectral decomposition can be performed much more quickly. The global/local split discussed above turns out to be a dense/sparse split as well.

First, we look at the local methods. LLE builds a matrix of weights W ∈ R^{n×n} such that W_ij ≠ 0 only if j is one of i's nearest neighbors. Thus, W is very sparse: each row contains only k nonzero elements, and k is usually small compared to the number of points. LLE performs a sparse spectral decomposition of (I − W)^⊤(I − W). Moreover, this matrix contains some additional structure that may be used to speed up the spectral decomposition [SR03]. Similarly, the Laplacian Eigenmaps algorithm finds the eigenvectors of the generalized eigenvalue problem Ly = λDy. Recall that L := D − W, where W is a weight matrix and D is the diagonal matrix of row-sums of W. All three of these matrices are sparse: W only contains k nonzero entries per row, D is diagonal, and L is their difference.

The two LLE variants described, Hessian LLE and Local Tangent Space Alignment, both require one spectral decomposition per point; luckily, the matrices to be decomposed only contain nk entries, so the decompositions require only O(k^3) time steps. Both algorithms also require a spectral decomposition of an n × n matrix, but these matrices are sparse.

For all of these local methods, the spectral decompositions scale, pessimistically, with O((d + k)n^2), which is a substantial improvement over O(n^3).

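To make the dense/sparse distinction concrete, the bottom eigenvectors of a sparse matrix such as (I − W)^⊤(I − W) can be computed with an iterative solver rather than a full dense decomposition; the following is a SciPy sketch, assuming W is stored as a SciPy sparse matrix, and is not how any of the original implementations were written:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def bottom_eigenvectors(W, d):
    """Bottom d+1 eigenpairs of the sparse matrix M = (I - W)^T (I - W)."""
    n = W.shape[0]
    I = sp.identity(n, format='csr')
    M = (I - W).T @ (I - W)          # stays sparse when W is sparse
    # 'SM' requests the smallest-magnitude eigenvalues; only d+1 eigenpairs are
    # computed, which is far cheaper than a full dense decomposition.
    return eigsh(M, k=d + 1, which='SM')
```
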
The spectral decomposition employed by Isomap, on the other hand, is not sparse. The actual decomposition, which occurs during the cMDS phase of the algorithm, is of B := −(1/2) H D H, where D is the matrix of approximate interpoint squared geodesic distances, and H is a constant matrix. The matrix B ∈ R^{n×n} only has rank approximately equal to the dimensionality of the manifold, but is nevertheless dense, since D has zeros only on the diagonal. Thus the decomposition requires a full O(n^3) timesteps. This complexity is prohibitive for large data sets. To manage large data sets, Landmark Isomap was created. The overall time complexity of Landmark Isomap is the much more manageable O(dln log n + l^3), but the solution is only an approximate one.

Semidefinite embedding is by far the most computationally demanding of these methods. It also requires a spectral decomposition of a dense n × n matrix, but the real bottleneck is the semidefinite program. Semidefinite programming methods are still relatively new and, perhaps as a result, quite slow. The theoretical time complexity of these procedures varies, but standard contemporary packages cannot handle an instance of the SDE program if the number of points is greater than 2000 or so. Smaller instances can be handled, but often require an enormous amount of time. The ℓSDE approximation is a fast alternative to SDE that has been shown to yield a speedup of two orders of magnitude in some experiments [WPS05]. There has been no work relating the time requirements of ℓSDE back to other manifold learning algorithms yet.

To summarize: the local methods (LLE, Laplacian Eigenmaps, and variants) scale well up to very large data sets; Isomap can handle medium size data sets, and the Landmark approximation may be used for large data sets; SDE can handle only fairly small instances, though its approximation can tackle medium to large instances.

4.4 Other issues

One shortcoming of Isomap, first pointed out as a theoretical issue in [BdSLT00] and later examined empirically, is that it requires that the underlying parameter space be convex. Without convexity, the geodesic estimates will be flawed. Many manifolds do not have convex parameter spaces; one simple example is given by taking the swiss roll and punching a hole out of it. More importantly, [DG02] found that many image manifolds do not have this property. None of the other algorithms seem to have this shortcoming.

The only algorithms with theoretical optimality guarantees are Hessian LLE and Isomap. Both of these guarantees are asymptotic ones, though, so it is questionable how useful they are. On the other hand, these guarantees at least demonstrate that hLLE and Isomap are conceptually correct.

Isomap and SDE are the only two algorithms that automatically provide an estimate of the dimensionality of the manifold. The other algorithms require the target dimensionality to be known a priori and often produce substantially different results for different choices of dimensionality. However, the recent Conformal Eigenmaps algorithm provides a patch with which to automatically estimate the dimensionality when using Laplacian Eigenmaps, LLE, or its variants.

5 What's left to do?

Though manifold learning has become an increasingly popular area of study within machine learning and related fields, much is still unknown. One issue is that all of these algorithms require a neighborhood-size parameter k. The solutions found are often very different depending on the choice of k, yet there is little work on actually choosing k. One notable exception is [WZZ04], but these techniques are discussed only for LTSA.

Perhaps the most curious aspect of the field is that there are already so many algorithms for the same task, yet new ones continue to be developed. In this paper, we have discussed a total of nine algorithms, but our coverage was certainly not exhaustive. Why are there so many?

The primary reason for the abundance of algorithms is that evaluating these algorithms is difficult. Though there are a couple of theoretical performance results for these algorithms, the primary evaluation methodology has been to run the algorithm on some (often artificial) data sets and see whether the results are intuitively pleasing. More rigorous evaluation methodologies remain elusive. Performing an experimental evaluation requires knowing the underlying manifolds behind the data and then determining whether the manifold learning algorithm is preserving useful information. But determining whether a real-world data set actually lies on a manifold is probably as hard as learning the manifold; and using an artificial data set may not give realistic results. Additionally, there is little agreement on what constitutes useful information; it is probably heavily problem-dependent. This uncertainty stems from a certain slack in the problem: though we want to find a chart into low-dimensional space, there may be many such charts, some of which are more useful than others. Another problem with experimentation is that one must somehow select a realistic cross-section of manifolds.

Instead of an experimental evaluation, one might establish the strength of an algorithm theoretically. In many cases, however, algorithms used in machine learning do not have strong theoretical guarantees associated with them. For example, Expectation-Maximization can perform arbitrarily poorly on a data set, yet it is a tremendously useful and popular algorithm. A good manifold learning algorithm may well have similar behavior. Additionally, the type of theoretical results that are of greatest utility are finite-sample guarantees. In contrast, the algorithmic guarantees for hLLE and Isomap (the guarantees for Isomap are a bit more useful, as they are able to bound the error in the finite sample case), as well as a number of other algorithms from machine learning, are asymptotic results.

Devising strong evaluation techniques looks to be a difficult task, yet it is absolutely necessary for progress in this area. Perhaps even more fundamental is determining whether real-world data sets actually exhibit a manifold structure. Some video and image data sets have been found to feature this structure, but much more investigation needs to be done. It may well turn out that most data sets do not contain embedded manifolds, or that the points lie along a manifold only very roughly. If the fundamental assumption of manifold learning turns out to be unrealistic for data mining, then it needs to be determined whether these techniques can be adapted so that they are still useful for dimensionality reduction.

The manifold learning approach to dimensionality reduction shows a lot of promise and has had some successes. But the most fundamental questions are still basically open:

- Is the assumption of a manifold structure reasonable?
- If it is, then: how successful are these algorithms at uncovering this structure?

Acknowledgments

Thanks to Sanjoy Dasgupta for valuable suggestions. Thanks also to Henrik Wann Jensen and Virginia de Sa for being on my committee. Figures 1 and 3 were generated using modified versions of code downloaded from the Isomap website (http://isomap.stanford.edu) and the Hessian LLE website (http://basis.stanford.edu/HLLE).

References

[BdSLT00] Mira Bernstein, Vin de Silva, John Langford, and Joshua Tenenbaum. Graph approximations to geodesics on embedded manifolds. Manuscript, Dec 2000.

[Ber99] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[BN01] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2001.

[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, June 2003.

[Bur05] Christopher J.C. Burges. Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.

[CC01] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition, 2001.

[CD05] Lawrence Cayton and Sanjoy Dasgupta. Cost functions for euclidean embedding. Unpublished manuscript, 2005.

[DG02] David L. Donoho and Carrie Grimes. When does isomap recover the natural parametrization of families of articulated images? Technical report, Department of Statistics, Stanford University, 2002.

[DG03] David L. Donoho and Carrie Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591-5596, 2003.

[dST03] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, 2003.

[dST04] Vin de Silva and Joshua B. Tenenbaum. Sparse multidimensional scaling using landmark points. Manuscript, June 2004.

[Mil65] John W. Milnor. Topology from the differentiable viewpoint. The University Press of Virginia, 1965.

[Ple03] Robert Pless. Using isomap to explore video sequences. In Proc. International Conference on Computer Vision, 2003.

[SR03] L. Saul and S. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119-155, 2003.

[SS05] F. Sha and L.K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. Submitted to ICML, 2005.

[TdL00] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.

[VB96] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 1996.

[WPS05] K. Weinberger, B. Packer, and L. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

[WS04] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[WSS04] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the International Conference on Machine Learning, 2004.

[WZZ04] Jing Wang, Zhenyue Zhang, and Hongyuan Zha. Adaptive manifold learning. In Advances in Neural Information Processing Systems, 2004.

[ZZ03] Hongyuan Zha and Zhenyue Zhang. Isometric embedding and continuum isomap. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[ZZ04] Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26(1):313-338, 2004.

A Proof of theorem 1

First, we restate the theorem.

Theorem. A nonnegative symmetric matrix D ∈ R^{n×n}, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2) H D H, where H := I − (1/n) 1 1^⊤, is positive semidefinite. Furthermore, this B will be the Gram matrix for a mean-centered configuration with interpoint distances given by D.

Proof: Suppose that D is a Euclidean distance matrix. Then, there exist x_1, ..., x_n such that ‖x_i − x_j‖² = D_ij for all i, j. We assume without loss of generality that these points are mean-centered, i.e. Σ_i x_ij = 0 for all j. We expand the norm:

    D_ij = ‖x_i − x_j‖² = ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − 2⟨x_i, x_j⟩,    (11)

so

    ⟨x_i, x_j⟩ = −(1/2) ( D_ij − ⟨x_i, x_i⟩ − ⟨x_j, x_j⟩ ).    (12)

Let B_ij := ⟨x_i, x_j⟩. We proceed to rewrite B_ij in terms of D. From (11),

    Σ_i D_ij = ( Σ_i B_ii + Σ_i B_jj − 2 Σ_i B_ij ) = Σ_i (B_ii + B_jj).

The second equality follows because the configuration is mean-centered. It follows that

    (1/n) Σ_i D_ij = (1/n) Σ_i B_ii + B_jj.    (13)

Similarly,

    (1/n) Σ_j D_ij = (1/n) Σ_j B_jj + B_ii,    (14)

and

    (1/n²) Σ_ij D_ij = (2/n) Σ_i B_ii.    (15)

Subtracting (15) from the sum of (14) and (13) leaves B_jj + B_ii on the right-hand side, so we may rewrite (12) as

    B_ij = −(1/2) ( D_ij − (1/n) Σ_i D_ij − (1/n) Σ_j D_ij + (1/n²) Σ_ij D_ij )
         = −(1/2) [HDH]_ij.

Thus, −(1/2) HDH is a Gram matrix for x_1, ..., x_n.

Conversely, suppose that B := −(1/2) HDH is positive semidefinite. Then, there is a matrix X such that XX^⊤ = B. We show that the points given by the rows of X have square interpoint distances given by D. First, we expand B using its definition:

    B_ij = −(1/2) [HDH]_ij = −(1/2) ( D_ij − (1/n) Σ_i D_ij − (1/n) Σ_j D_ij + (1/n²) Σ_ij D_ij ).

Let x_i be the ith row of X. Then,

    ‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2⟨x_i, x_j⟩ = B_ii + B_jj − 2B_ij = D_ij.

The final equality follows from expanding B_ij, B_ii, and B_jj as written above. Thus, we have shown that the configuration x_1, ..., x_n has square interpoint distances given by D, so D is a Euclidean distance matrix.

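As a quick numerical sanity check of the theorem (not part of the original paper), one can verify on a random mean-centered configuration that B = −(1/2)HDH is positive semidefinite and equals the Gram matrix XX^⊤:

```python
import numpy as np

n = 10
rng = np.random.default_rng(1)
X = rng.normal(size=(n, 3))
X -= X.mean(axis=0)                                   # a mean-centered configuration
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared Euclidean distance matrix
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D @ H
assert np.linalg.eigvalsh(B).min() > -1e-9            # B is positive semidefinite
assert np.allclose(B, X @ X.T)                        # B is the Gram matrix of X
```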