Manifold Learning Algorithms
Lawrence Cayton
[email protected]
June 15, 2005
Abstract
Manifold learning is a popular recent approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data. In this paper, we discuss the motivation, background, and algorithms proposed for manifold learning. Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and a host of variants of these algorithms are examined.
1 Introduction
Many recent applications of machine learning in data mining, computer vision, and elsewhere require deriving a classifier or function estimate from an extremely large data set. Modern data sets often consist of a large number of examples, each of which is made up of many features. Though access to an abundance of examples is purely beneficial to an algorithm attempting to generalize from the data, managing a large number of features, some of which may be irrelevant or even misleading, is typically a burden to the algorithm. Overwhelmingly complex feature sets will slow the algorithm down and make finding global optima difficult. To lessen this burden on standard machine learning algorithms (e.g. classifiers, function estimators), a number of techniques have been developed to vastly reduce the quantity of features in a data set, i.e. to reduce the dimensionality of the data.
Dimensionality reduction has other, related uses in addition to simplifying data so that it can be efficiently processed. Perhaps the most obvious is visualization; if data lies in a 100-dimensional space, one cannot get an intuitive feel for what the data looks like. However, if a meaningful two- or three-dimensional representation of the data can be found, then it is possible to "eyeball" it. Though this may seem like a trivial point, many statistical and machine learning algorithms have very poor optimality guarantees, so the ability to actually see the data and the output of an algorithm is of great practical interest.
Beyond visualization, a dimensionality reduction procedure may help reveal the underlying forces governing a data set. For example, suppose we wish to classify an email as spam or not spam. A typical approach to this problem would be to represent an email as a vector of counts of the words appearing in the email. The dimensionality of this data can easily be in the hundreds, yet an effective dimensionality reduction technique may reveal that there are only a few exceptionally telling features, such as the word "Viagra".
There are many approaches to dimensionality reduction based on a variety of assumptions and used in a variety of contexts. We will focus on an approach initiated recently based on the observation that high-dimensional data is often much simpler than the dimensionality would indicate. In particular, a given high-dimensional data set may contain many features that are all measurements of the same underlying cause, and so are closely related. This type of phenomenon is common, for example, when taking video footage of a single object from multiple angles simultaneously. The features of such a data set contain much overlapping information; it would be helpful to somehow get a simplified, non-overlapping representation of the data whose features are identifiable with the underlying parameters that govern the data. This intuition is formalized using the notion of a manifold: the data set lies along a low-dimensional manifold embedded in a high-dimensional space, where the low-dimensional space reflects the underlying parameters and the high-dimensional space is the feature space. Attempting to uncover this manifold structure in a data set is referred to as manifold learning.
1.1 Organization
We begin by looking at the intuitions, basic mathematics, and assumptions of manifold learning in section 2. In section 3, we detail several algorithms proposed for learning manifolds: Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and some variants. These algorithms will be compared and contrasted with one another in section 4. We conclude with future directions for research in section 5.
1.2 Notation and Conventions
Throughout, we will be interested in the problem of reducing the dimensionality of a given data set consisting of high-dimensional points in Euclidean space.

- The high-dimensional input points will be referred to as x_1, x_2, ..., x_n. At times it will be convenient to work with these points as a single matrix X, where the ith row of X is x_i.

- The low-dimensional representations that the dimensionality reduction algorithms find will be referred to as y_1, y_2, ..., y_n. Y is the matrix of these points.

- n is the number of points in the input.

- D is the dimensionality of the input (i.e. x_i ∈ R^D).

- d is the dimensionality of the manifold that the input is assumed to lie on and, accordingly, the dimensionality of the output (i.e. y_i ∈ R^d).

- k is the number of nearest neighbors used by a particular algorithm.

- N(i) denotes the set of the k-nearest neighbors of x_i.

- Throughout, we assume that the (eigenvector, eigenvalue) pairs are ordered by the eigenvalues. That is, if the (eigenvector, eigenvalue) pairs are (v_i, λ_i), for i = 1, ..., n, then λ_1 ≥ λ_2 ≥ ··· ≥ λ_n. We refer to v_1, ..., v_d as the top d eigenvectors, and v_{n−d+1}, ..., v_n as the bottom d eigenvectors.
2 Preliminaries
In this section, we consider the motivations for manifold learning and introduce the basic concepts at work. We have already discussed the importance of dimensionality reduction for machine learning; we now look at a classical procedure for this task and then turn to more recent efforts.
2.1 Linear Dimensionality Reduction
Perhaps the most popular algorithm for dimensionality reduction is Principal Components Analysis (PCA). Given a data set, PCA finds the directions (vectors) along which the data has maximum variance, in addition to the relative importance of these directions. For example, suppose that we feed a set of three-dimensional points that all lie on a two-dimensional plane to PCA. PCA will return two vectors that span the plane along with a third vector that is orthogonal to the plane (so that all of R^3 is spanned by these vectors). The two vectors that span the plane will be given a positive weight, but the third vector will have a weight of zero, since the data does not vary along that direction. PCA is most useful when the data lies on or close to a linear subspace of the ambient space. Given this type of data, PCA will find a basis for that linear subspace and allow one to disregard the irrelevant features.
The manifold learning algorithms may be viewed as non-linear analogs to PCA, so it is worth detailing PCA somewhat. Suppose that X ∈ R^{n×D} is a matrix whose rows are D-dimensional data points. We are looking for the d-dimensional linear subspace of R^D along which the data has maximum variance. If in fact the data lies perfectly along a subspace of R^D, PCA will reveal that subspace; otherwise, PCA will introduce some error. The objective function it optimizes is

    max_V var(XV),

where V is an orthogonal D × d matrix. If d = 1, then V is simply a unit-length vector which gives the direction of maximum variance. In general, the columns of V give the d dimensions that we project the original data onto. This cost function admits an eigenvalue-based solution, which we now detail for d = 1.
The direction of maximum variation is given by a unit vector v onto which the data points are projected, which is found as follows:

    max_{‖v‖=1} var(Xv) = max_{‖v‖=1} [ E(Xv)² − (E Xv)² ]        (1)
                        = max_{‖v‖=1} E(Xv)²                       (2)
                        = max_{‖v‖=1} Σ_i (x_i v)²                 (3)
                        = max_{‖v‖=1} Σ_i (v⊤ x_i⊤)(x_i v)         (4)
                        = max_{‖v‖=1} v⊤ ( Σ_i x_i⊤ x_i ) v        (5)
                        = max_{‖v‖=1} v⊤ X⊤ X v.                   (6)
Equality 2 follows from assuming the data is mean-centered, without loss of generality. The last line is a form of Rayleigh's quotient and so is maximized by setting v to the eigenvector of X⊤X with the largest corresponding eigenvalue. In general, the d-dimensional PCA solution is XV, where V is the D × d matrix whose columns are the top d eigenvectors of X⊤X (i.e. (v_i, λ_i) are (eigenvector, eigenvalue) pairs with λ_1 ≥ λ_2 ≥ ··· ≥ λ_d).
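To make this concrete, the short sketch below computes a d-dimensional PCA projection exactly as described: mean-center X, form X⊤X, and project onto its top d eigenvectors. It is a minimal illustration in NumPy; the function name and example data are my own choices, not part of any reference implementation.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto the top d principal components.

    X : (n, D) array of data points, one point per row.
    d : target dimensionality.
    Returns the (n, d) matrix of projected points XV.
    """
    Xc = X - X.mean(axis=0)               # mean-center, as assumed in equality (2)
    C = Xc.T @ Xc                         # D x D matrix X^T X
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :d]           # top d eigenvectors (largest eigenvalues)
    return Xc @ V

# Example: 3-D points lying on a 2-D plane are recovered in 2 dimensions.
rng = np.random.default_rng(0)
plane = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 3))
Y = pca(plane, d=2)
```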
Despite PCA's popularity, it has a number of limitations. Perhaps the most blatant drawback is the requirement that the data lie on a linear subspace. Returning to the previous example, what if the plane is curled, as it is in figure 1? Though the data is still intuitively two-dimensional, PCA will not correctly extract this two-dimensional structure. The problem is that this data set, commonly referred to as the "swiss roll", is a two-dimensional manifold, not a two-dimensional subspace. Manifold learning algorithms essentially attempt to duplicate the behavior of PCA, but on manifolds instead of linear subspaces.
We now briefly review the concept of a manifold and formalize the manifold learning problem.
2.2 Manifolds
Consider the curve shown in figure 2. Note that the curve is in R^3, yet it has zero volume, and in fact zero area. The extrinsic dimensionality, three, is somewhat misleading since the curve can be parameterized by a single variable. One way of formalizing this intuition is via the idea of a manifold: the curve is a one-dimensional manifold because it locally looks like a copy of R^1. Let us quickly review some basic terminology from geometry and topology in order to crystallize this notion of dimensionality.

Figure 1: A curled plane: the swiss roll.

Figure 2: A one-dimensional manifold embedded in three dimensions.
Definition 1. A homeomorphism is a continuous function whose inverse is also a continuous function.
Definition 2. A d-dimensional manifold M is a set that is locally homeomorphic with R^d. That is, for each x ∈ M, there is an open neighborhood around x, N_x, and a homeomorphism f : N_x → R^d. These neighborhoods are referred to as coordinate patches, and the map is referred to as a coordinate chart. The image of the coordinate charts is referred to as the parameter space.
Manifolds are terrifically well-studied in mathematics and the above definition is extremely general. We will be interested only in the case where M is a subset of R^D, where D is typically much larger than d. In other words, the manifold will lie in a high-dimensional space (R^D), but will be homeomorphic with a low-dimensional space (R^d, with d < D).

Additionally, all of the algorithms we will look at impose some smoothness requirements that further constrain the class of manifolds considered.
Definition 3. A smooth (or differentiable) manifold is a manifold such that each coordinate chart is differentiable with a differentiable inverse (i.e., each coordinate chart is a diffeomorphism).
We will be using the term embedding in a number of places, though our usage will be somewhat loose and inconsistent. An embedding of a manifold M into R^D is a smooth homeomorphism from M to a subset of R^D. The algorithms discussed in this paper find embeddings of discrete point-sets, by which we simply mean a mapping of the point set into another space (typically lower-dimensional Euclidean space). An embedding of a dissimilarity matrix into R^d, as discussed in section 3.1.2, is a configuration of points whose interpoint distances match those given in the matrix.
With this background in mind, we proceed to the
main topic of this paper.
2.3 Manifold Learning
We have discussed the importance of dimensionality reduction and provided some intuition concerning the inadequacy of PCA for data sets lying along some non-flat surface; we now turn to the manifold learning approach to dimensionality reduction.
We are given data x_1, x_2, ..., x_n ∈ R^D and we wish to reduce the dimensionality of this data. PCA is appropriate if the data lies on a low-dimensional subspace of R^D. Instead, we assume only that the data lies on a d-dimensional manifold embedded into R^D, where d < D. Moreover, we assume that the manifold is given by a single coordinate chart.[1] We can now describe the problem formally.

[1] All smooth, compact manifolds can be described by a single coordinate chart, so this assumption is not overly restrictive.
Problem: Given points x_1, ..., x_n ∈ R^D that lie on a d-dimensional manifold M that can be described by a single coordinate chart f : M → R^d, find y_1, ..., y_n ∈ R^d, where y_i := f(x_i).
Solving this problem is referred to as manifold learning, since we are trying to learn a manifold from a set of points. A simple and oft-used example in the manifold learning literature is the swiss roll, a two-dimensional manifold embedded in R^3. Figure 3 shows the swiss roll and a learned two-dimensional embedding of the manifold found using Isomap, an algorithm discussed later. Notice that points nearby in the original data set neighbor one another in the two-dimensional data set; this effect occurs because the chart f between these two graphs is a homeomorphism (as are all charts).
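For readers who want to experiment, here is one common way to generate a swiss roll like the one in figure 3. The particular parameterization (the range of the roll angle, the width of the slab) is an arbitrary choice on my part, not the one used to produce the paper's figures.

```python
import numpy as np

def swiss_roll(n, seed=0):
    """Sample n points from a swiss roll: a 2-D manifold embedded in R^3."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))   # roll angle: the first manifold coordinate
    h = 20 * rng.random(n)                      # height along the roll: the second coordinate
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
    return X, np.column_stack([t, h])           # points in R^3 and their 2-D parameters
```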
Though the extent to which real-world data sets
exhibit low-dimensional manifold structure is still be-
ing explored, a number of positive examples have
been found. Perhaps the most notable occurrence is
in video and image data sets. For example, suppose
we have a collection of frames taken from a video of
a person rotating his or her head. The dimensional-
ity of the data set is equal to the number of pixels
in a frame, which is generally very large. However,
the images are actually governed by only a couple of
degrees of freedom (such as the angle of rotation).
Manifold learning techniques have been successfully
applied to a number of similar video and image data
sets [Ple03].
3 Algorithms
In this section, we survey a number of algorithms proposed for manifold learning. Our treatment goes roughly in chronological order: Isomap and LLE were the first algorithms developed, followed by Laplacian Eigenmaps. Semidefinite Embedding and the variants of LLE were developed most recently.

Figure 3: The swiss roll (top) and the learned two-dimensional representation found by Isomap (bottom).
Before beginning, we note that every algorithm that we will discuss requires a neighborhood-size parameter k. Every algorithm assumes that within each neighborhood the manifold is approximately flat. Typically, one simply runs an algorithm over a variety of choices of neighborhood size and compares the outputs to pick the appropriate k. Other neighborhood schemes, such as picking a neighborhood radius ε, are sometimes used in place of the k-nearest neighbor scheme.
3.1 Isomap
Isomap [TdL00], short for isometric feature mapping, was one of the first algorithms introduced for manifold learning. The algorithm is perhaps the best known and most applied among the multitude of procedures now available for the problem at hand. It may be viewed as an extension of Multidimensional Scaling (MDS), a classical method for embedding dissimilarity information into Euclidean space. Isomap consists of two main steps:

1. Estimate the geodesic distances (distances along the manifold) between points in the input using shortest-path distances on the data set's k-nearest neighbor graph.

2. Use MDS to find points in low-dimensional Euclidean space whose interpoint distances match the distances found in step 1.

We consider each of these steps in turn.
3.1.1 Estimating geodesic distances
Again, we are assuming that the data is drawn from a d-dimensional manifold embedded in R^D and we hope to find a chart back into R^d. Isomap assumes further that there is an isometric chart, i.e. a chart that preserves the distances between points. That is, if x_i, x_j are points on the manifold and G(x_i, x_j) is the geodesic distance between them, then there is a chart f : M → R^d such that

    ‖f(x_i) − f(x_j)‖ = G(x_i, x_j).

Furthermore, it is assumed that the manifold is smooth enough that the geodesic distance between nearby points is approximately linear. Thus, the Euclidean distance between nearby points in the high-dimensional data space is assumed to be a good approximation to the geodesic distance between these points.
For points that are distant in the high-dimensional space, the Euclidean distance between them is not a good estimate of the geodesic distance. The problem is that though the manifold is locally linear (at least approximately), this approximation breaks down as the distance between points increases. Estimating distances between distant points thus requires a different technique. To perform this estimation, the Isomap algorithm first constructs a k-nearest neighbor graph that is weighted by the Euclidean distances. Then, the algorithm runs a shortest-path algorithm (Dijkstra's or Floyd's) and uses its output as the estimates for the remainder of the geodesic distances.
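A minimal sketch of this estimation step is given below: build the k-nearest neighbor graph with Euclidean edge weights, then run a shortest-path routine over it. The use of scipy and the symmetrization of the neighbor graph are my own illustrative choices, not details specified by Isomap.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial import distance_matrix

def estimate_geodesics(X, k):
    """Approximate geodesic distances via shortest paths on the k-NN graph.

    X : (n, D) array of input points.
    k : number of nearest neighbors.
    Returns an (n, n) matrix of estimated geodesic distances.
    """
    E = distance_matrix(X, X)                     # Euclidean distances
    n = X.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(E[i])[1:k + 1]          # k nearest neighbors of x_i
        G[i, nbrs] = E[i, nbrs]                   # keep only neighbor edges
    G = np.maximum(G, G.T)                        # symmetrize the graph (assumed convention)
    # Shortest-path distances stand in for geodesics between non-neighboring points.
    return shortest_path(G, method="D", directed=False)
```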
3.1.2 Multidimensional Scaling
Once these geodesic distances are calculated, the Isomap algorithm finds points whose Euclidean distances equal these geodesic distances. Since the manifold is isometrically embedded, such points exist, and in fact, they are unique up to translation and rotation. Multidimensional Scaling [CC01] is a classical technique that may be used to find such points. The basic problem it solves is this: given a matrix D ∈ R^{n×n} of dissimilarities, construct a set of points whose interpoint Euclidean distances match those in D closely. There are numerous cost functions for this task and a variety of algorithms for minimizing these cost functions. We focus on classical MDS (cMDS), since that is what is used by Isomap.

The cMDS algorithm, shown in figure 4, is based on the following theorem, which establishes a connection between the space of Euclidean distance matrices and the space of Gram matrices (inner-product matrices).
Theorem 1. A nonnegative symmetric matrix D ∈ R^{n×n}, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2) H D H is positive semidefinite, where H := I − (1/n) 1 1⊤.

To get X from B, we spectrally decompose B into U Λ U⊤. Then, X = U Λ^{1/2}.

classical Multidimensional Scaling
input: D ∈ R^{n×n} (D_ii = 0, D_ij ≥ 0), d ∈ {1, ..., n}

1. Set B := −(1/2) H D H, where H = I − (1/n) 1 1⊤ is the centering matrix.

2. Compute the spectral decomposition of B: B = U Λ U⊤.

3. Form Λ_+ by setting [Λ_+]_ij := max{Λ_ij, 0}.

4. Set X := U Λ_+^{1/2}.

5. Return [X]_{n×d}.

Figure 4: Classical MDS.
What if D is not Euclidean? In the application of Isomap, for example, D will generally not be perfectly Euclidean since we are only able to approximate the geodesic distances. In this case, B, as described in the above theorem, will not be positive semidefinite and thus will not be a Gram matrix. To handle this case, cMDS projects B onto the cone of positive semidefinite matrices by setting its negative eigenvalues to 0 (step 3).

Finally, users of cMDS often want a low-dimensional embedding. In general, X := U Λ_+^{1/2} (step 4) will be an n-dimensional configuration. This dimensionality is reduced to d in step 5 of the algorithm by simply removing the n − d trailing columns of X. Doing so is equivalent to projecting X onto its top d principal components, as we now demonstrate.
Let X := U Λ_+^{1/2} be the n-dimensional configuration found in step 4. Projecting X onto its top d principal components amounts to computing XV, where the columns of V are the top d eigenvectors of X⊤X: v_1, ..., v_d. Let us compute these eigenvectors:

    X⊤ X v_i = λ_i v_i
    (U Λ_+^{1/2})⊤ (U Λ_+^{1/2}) v_i = λ_i v_i
    (Λ_+^{1/2} U⊤ U Λ_+^{1/2}) v_i = λ_i v_i
    Λ_+ v_i = λ_i v_i.

Thus, v_i = e_i, where e_i is the ith standard basis vector for R^n, and λ_i = [Λ_+]_ii. Then, XV = X[e_1 ··· e_d] = [X]_{n×d}, so the truncation performed in step 5 is indeed equivalent to projecting X onto its first d principal components.
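The following sketch implements the five steps of figure 4 directly in NumPy. Note that, following the convention used in the proof of Theorem 1 in the appendix, D here is taken to hold squared interpoint dissimilarities; that reading, like the function name, is my interpretation rather than something the paper spells out at this point.

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS (figure 4).

    D : (n, n) matrix of squared interpoint dissimilarities.
    d : target dimensionality.
    Returns the (n, d) embedding [X]_{n x d}.
    """
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1 1^T
    B = -0.5 * H @ D @ H                      # step 1
    eigvals, U = np.linalg.eigh(B)            # step 2 (ascending eigenvalues)
    eigvals, U = eigvals[::-1], U[:, ::-1]    # reorder so the top eigenvalues come first
    lam_plus = np.maximum(eigvals, 0.0)       # step 3: clip negative eigenvalues
    X = U * np.sqrt(lam_plus)                 # step 4: X = U Lambda_+^{1/2}
    return X[:, :d]                           # step 5: keep the leading d columns
```

In an Isomap-style pipeline one would feed in the geodesic estimates from the earlier sketch, e.g. classical_mds(estimate_geodesics(X, k) ** 2, d); squaring the dissimilarities is part of the assumed convention above.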
This equivalence is useful because if a set of dissimilarities can be realized in d dimensions, then all but the first d columns of X will be zeros; moreover, Λ_+ reveals the relative importance of each dimension, since these eigenvalues are the same as those found by PCA. In other words, classical MDS automatically finds the dimensionality of the configuration associated with a set of dissimilarities, where dimensionality refers to the dimensionality of the subspace along which X lies. Within the context of Isomap, classical MDS automatically finds the dimensionality of the parameter space of the manifold, which is the dimensionality of the manifold.
Classical MDS finds the optimal d-dimensional configuration of points for the cost function ‖HDH − HD̂H‖²_F, where D̂ ranges over distance matrices realizable by d-dimensional configurations.

3.2 Locally Linear Embedding

Locally Linear Embedding (LLE) models the manifold as a collection of overlapping, approximately linear patches: each point should be well reconstructed as a weighted combination of its k nearest neighbors. The first step of the algorithm finds reconstruction weights W by minimizing

    Σ_i ‖ x_i − Σ_{j∈N(i)} W_ij x_j ‖².
The weight matrix W is used as a surrogate for the local geometry of the patches. Intuitively, W_i reveals the layout of the points around x_i. There are a couple of constraints on the weights: each row must sum to one (equivalently, each point is represented as a convex combination of its neighbors), and W_ij = 0 if j ∉ N(i). The second constraint reflects that LLE is a local method; the first makes the weights invariant to global translations: if each x_i is shifted by a vector α, then

    ‖ x_i + α − Σ_{j∈N(i)} W_ij (x_j + α) ‖² = ‖ x_i − Σ_{j∈N(i)} W_ij x_j ‖²,

so W is unaffected. Moreover, W is invariant to global rotations and scalings.
Fortunately, there is a closed-form solution for W, which may be derived using Lagrange multipliers. In particular, the reconstruction weights for each point x_i are given by

    W̃_ij = ( Σ_k C⁻¹_jk ) / ( Σ_{lm} C⁻¹_lm ),

where C is the local covariance matrix with entries C_jk := (x_i − η_j)⊤ (x_i − η_k), and η_j, η_k are neighbors of x_i.

W̃ ∈ R^{n×k} is then transformed into the sparse n × n matrix W: W_ij = W̃_il if x_j is the lth neighbor of x_i, and W_ij = 0 if j ∉ N(i).
W_i is a characterization of the local geometry of the manifold around x_i. Since it is invariant to rotations, scalings, and translations, W_i will also characterize the local geometry around x_i for any linear mapping of the neighborhood patch surrounding x_i. We expect that the chart from this neighborhood patch to the parameter space should be approximately linear since the manifold is smooth. Thus, W_i should capture the local geometry of the parameter space as well. The second step of the algorithm finds a configuration in d dimensions (the dimensionality of the parameter space) whose local geometry is characterized well by W. Unlike with Isomap, d must be known a priori or estimated. The d-dimensional configuration is found by minimizing
    Σ_i ‖ y_i − Σ_j W_ij y_j ‖²        (7)

with respect to y_1, ..., y_n ∈ R^d. Note that this expression has the same form as the error minimized in the first step of the algorithm, but that we are now minimizing with respect to y_1, ..., y_n, rather than W.
The cost function (7) may be rewritten as

    Y⊤ M Y,

where

    M_ij := δ_ij − W_ij − W_ji + Σ_k W_ki W_kj,

and δ_ij := 1 if i = j and 0 otherwise. M may be written more concisely as (I − W)⊤ (I − W).
There are a couple of constraints on Y. First, Y⊤Y = I, which forces the solution to be of rank d. Second, Σ_i Y_i = 0; this constraint centers the embedding on the origin. This cost function is a form of Rayleigh's quotient, so it is minimized by setting the columns of Y to the bottom d eigenvectors of M. However, the bottom (eigenvector, eigenvalue) pair is (1, 0), so the final coordinate of each point would be identical. To avoid this degenerate dimension, the d-dimensional embedding is given by the bottom d non-constant eigenvectors.

Conformal Eigenmaps [SS05] is a recent technique that may be used in conjunction with LLE to provide an automatic estimate of d, the dimensionality of the manifold. LLE is summarized in figure 6.

Locally Linear Embedding
input: x_1, ..., x_n ∈ R^D, d, k

1. Compute reconstruction weights. For each point x_i, set W̃_ij := ( Σ_k C⁻¹_jk ) / ( Σ_{lm} C⁻¹_lm ).

2. Compute the low-dimensional embedding. Let U be the matrix whose columns are the eigenvectors of (I − W)⊤ (I − W) with nonzero accompanying eigenvalues. Return Y := [U]_{n×d}.

Figure 6: Locally Linear Embedding.
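Given the weights from the earlier sketch, the embedding step can be written as below; this follows figure 6, using a dense eigendecomposition for clarity rather than the sparse solvers one would use in practice.

```python
import numpy as np

def lle_embedding(W, d):
    """Step 2 of LLE: embed using the bottom non-constant eigenvectors of M.

    W : (n, n) reconstruction-weight matrix (rows sum to one).
    d : target dimensionality.
    Returns the (n, d) embedding Y.
    """
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)     # M = (I - W)^T (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)        # ascending eigenvalues
    # Discard the bottom eigenvector (constant, eigenvalue ~ 0) and keep the next d.
    return eigvecs[:, 1:d + 1]
```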
3.3 Laplacian Eigenmaps
The Laplacian Eigenmaps [BN01] algorithm is based on ideas from spectral graph theory. Given a graph and a matrix of edge weights, W, the graph Laplacian is defined as L := D − W, where D is the diagonal matrix with elements D_ii = Σ_j W_ij.[3] The eigenvalues and eigenvectors of the Laplacian reveal a wealth of information about the graph, such as whether it is complete or connected. Here, the Laplacian will be exploited to capture local information about the manifold.

[3] L is referred to as the unnormalized Laplacian in many places. The normalized version is D^{−1/2} L D^{−1/2}.
We define a local similarity matrix W which reflects the degree to which points are near to one another. There are two choices for W:

1. W_ij = 1 if x_j is one of the k-nearest neighbors of x_i and equals 0 otherwise.

2. W_ij = e^{−‖x_i − x_j‖²/2σ²} for neighboring nodes, and 0 otherwise. This is the Gaussian heat kernel, which has an interesting theoretical justification given in [BN03]. On the other hand, using the heat kernel requires manually setting σ, so it is considerably less convenient than the simple-minded weight scheme.
We use this similarity matrix W to find points y_1, ..., y_n ∈ R^d that are the low-dimensional analogs of x_1, ..., x_n. As with LLE, d is a parameter of the algorithm, but it can be automatically estimated by using Conformal Eigenmaps as a post-processor. For simplicity, we first derive the algorithm for d = 1, i.e. each y_i is a scalar.
Intuitively, if x_i and x_j have a high degree of similarity, as measured by W, then they lie very near to one another on the manifold. Thus, y_i and y_j should be near to one another. This intuition leads to the following cost function, which we wish to minimize:

    Σ_ij W_ij (y_i − y_j)².        (8)

This function is minimized by setting y_1 = ··· = y_n = 0; we avoid this trivial solution by enforcing the constraint

    y⊤ D y = 1.        (9)
This constraint effectively fixes a scale at which the y_i's are placed. Without this constraint, replacing each y_i by αy_i, with any constant α < 1, would give a configuration with lower cost than the configuration y_1, ..., y_n.
Note that we could achieve the same result with the constraint y⊤y = 1; the D-weighted constraint (9) is used because it leads to a closed-form solution: the minimizer is the eigenvector with the smallest nonzero eigenvalue of the generalized eigenvalue problem Ly = λDy. For a d-dimensional embedding, the cost function generalizes to

    (1/2) Σ_ij W_ij ‖ y_i − y_j ‖² = tr(Y⊤ L Y),        (10)

where the ith row of Y is y_i.
In the one-dimensional case, we added a constraint to avoid the trivial solution of all zeros. Analogously, we must constrain Y so that we do not end up with a solution of dimensionality less than d. The constraint Y⊤Y = I could be used instead, but Y⊤DY = I is preferable because it leads to a closed-form eigenvalue solution. Again, we may solve this problem via the generalized eigenvalue problem Ly = λDy. This time, however, we need the bottom d nonzero eigenvalues and eigenvectors. Y is then given by the matrix whose columns are these d eigenvectors.
The algorithm is summarized in figure 7.

Laplacian Eigenmaps
input: x_1, ..., x_n ∈ R^D, d, k, σ.

1. Set W_ij = e^{−‖x_i − x_j‖²/2σ²} if x_j ∈ N(i), and 0 otherwise.

2. Let U be the matrix whose columns are the eigenvectors of the generalized eigenvalue problem Ly = λDy with nonzero accompanying eigenvalues.

3. Return Y := [U]_{n×d}.

Figure 7: Laplacian Eigenmaps.
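A compact sketch of the procedure in figure 7 follows. It uses a dense generalized eigensolver from scipy for brevity; the symmetrization of the neighborhood relation is my own convention, and sigma is the heat-kernel width discussed above.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d, k, sigma):
    """Laplacian Eigenmaps (figure 7), dense illustration.

    X : (n, D) input points; d : output dimensionality;
    k : neighborhood size; sigma : heat-kernel width.
    """
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]
        W[i, nbrs] = np.exp(-dists[i, nbrs] ** 2 / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                  # symmetrize the neighborhood relation
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # unnormalized graph Laplacian
    # Solve the generalized eigenvalue problem L y = lambda D y.
    eigvals, eigvecs = eigh(L, D)
    # Skip the constant eigenvector (eigenvalue 0); keep the next d.
    return eigvecs[:, 1:d + 1]
```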
3.3.1 Connection to LLE
Though LLE is based on a different intuition than Laplacian Eigenmaps, Belkin and Niyogi [BN03] were able to show that the algorithms are equivalent in some situations. The equivalence rests on another motivation for Laplacian Eigenmaps, not discussed here, that frames the algorithm as a discretized version of finding the eigenfunctions[4] of the Laplace-Beltrami operator, ℒ, on a manifold. The authors demonstrate that LLE, under certain assumptions, is finding the eigenfunctions of (1/2)ℒ², which are the same as the eigenfunctions of ℒ. In the proof, they assume that the neighborhood patches selected by LLE are perfectly locally linear (i.e., the neighbors of a point x actually lie on the tangent plane at x). Furthermore, the equivalence is only shown in expectation over locally uniform measures on the manifold.

[4] In the algorithm presented, we are looking for eigenvectors. The eigenvectors are the eigenfunctions evaluated at the data points.

The connection is an intriguing one, but it has not been exploited or even demonstrated on any real data sets. It is not clear how reasonable the assumptions are. These two algorithms are still considered distinct, with different characteristics.
3.4 Semidefinite Embedding

Semidefinite embedding (SDE) [WSS04], yet another algorithm for manifold learning, is based on a physical intuition. Imagine that each point is connected to its nearest neighbors with a rigid rod. Now, we take this structure and pull it as far apart as possible, i.e. we attempt to maximize the distance between points that are not neighbors. If the manifold looks like a curved version of the parameter space, then hopefully this procedure will unravel it properly.

To make this intuition a little more concrete, let us look at a simple example. Suppose we sample some points from a sine curve in R^2. The sine curve is a one-dimensional manifold embedded into R^2, so let us apply the SDE idea to it to try to find a one-dimensional representation. We first attach each point to its two nearest neighbors. We then pull apart the resulting structure as far as possible. Figure 8 shows the steps of the procedure. As the figure shows, we have successfully found a one-dimensional representation of the data set. On the other hand, if the neighbors were chosen a bit differently, the procedure would have failed.

Figure 8: An illustration of the idea behind SDE. The top figure is a few points sampled from a sine curve. The middle figure shows the result of connecting each point to its two neighbors. The bottom shows the result of the pulling step.

With this simple example in mind, we proceed to describe the algorithm more precisely. The SDE algorithm uses a semidefinite program [VB96], which is essentially a linear program with additional constraints that force the variables to form a positive semidefinite matrix. Recall that a matrix is a Gram matrix (an inner-product matrix) if and only if it is positive semidefinite. The SDE algorithm uses a semidefinite program to find an appropriate Gram matrix, then extracts a configuration from the Gram matrix using the same trick employed in classical MDS.
The primary constraint of the program is that distances between neighboring points remain the same, i.e. if x_i and x_j are neighbors, then ‖y_i − y_j‖ = ‖x_i − x_j‖, where y_i and y_j are the low-dimensional representatives of x_i and x_j. However, we must express this constraint in terms of the Gram matrix B, since this is the variable the semidefinite program works with (and not Y). Translating the constraint is simple enough, though, since

    ‖y_i − y_j‖² = ‖y_i‖² + ‖y_j‖² − 2⟨y_i, y_j⟩
                 = ⟨y_i, y_i⟩ + ⟨y_j, y_j⟩ − 2⟨y_i, y_j⟩
                 = B_ii + B_jj − 2B_ij.
The constraint, then, is

    B_ii + B_jj − 2B_ij = ‖x_i − x_j‖²

for all neighboring points x_i, x_j.
We wish to pull the remainder of the points as far apart as possible, so we maximize the interpoint distances of our configuration subject to the above constraint. The objective function is

    max Σ_{i,j} ‖y_i − y_j‖² = max Σ_{i,j} (B_ii + B_jj − 2B_ij)
                             = max Σ_{i,j} (B_ii + B_jj)
                             = max tr(B).
The second equality follows from a regularization constraint: since the interpoint distances are invariant to translation, there is an extra degree of freedom in the problem, which we remove by enforcing

    Σ_ij B_ij = 0.
The semidefinite program may be solved by using any of a wide variety of software packages. The second part of the algorithm is identical to the second part of classical MDS: the spectral decomposition B = UΛU⊤ is computed, and the embedding is given by Y := UΛ_+^{1/2}. The algorithm is summarized in figure 9.

Semidefinite Embedding
input: x_1, ..., x_n ∈ R^D, k

1. Solve the following semidefinite program:
       maximize tr(B)
       subject to: Σ_ij B_ij = 0;
                   B_ii + B_jj − 2B_ij = ‖x_i − x_j‖² (for all neighboring x_i, x_j);
                   B ⪰ 0.

2. Compute the spectral decomposition of B: B = UΛU⊤.

3. Form Λ_+ by setting [Λ_+]_ij := max{Λ_ij, 0}.

4. Return Y := UΛ_+^{1/2}.

Figure 9: Semidefinite Embedding.

Solving the semidefinite program is computationally demanding, and a landmark approximation has been proposed to speed it up: the program is posed only over an l × l Gram matrix L of landmark points, with l much smaller than n. The semidefinite program only finds L, and then the rest of the points are located with respect to the landmarks. This method was a full two orders of magnitude faster than SDE in some experiments [WPS05]. This approximation to SDE appears to produce high-quality results, though more investigation is needed.
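To make the optimization step concrete, here is a small sketch of the semidefinite program in figure 9 using cvxpy as the modeling layer. cvxpy, its default solver, and the double-counting of symmetric neighbor pairs are incidental choices of mine, not part of the algorithm's specification.

```python
import numpy as np
import cvxpy as cp

def sde_gram_matrix(X, k):
    """Solve the SDE semidefinite program (step 1 of figure 9) for the Gram matrix B."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    B = cp.Variable((n, n), PSD=True)         # B must be positive semidefinite
    constraints = [cp.sum(B) == 0]            # centering: sum_ij B_ij = 0
    for i in range(n):
        for j in np.argsort(dists[i])[1:k + 1]:
            # Preserve distances between neighboring points.
            constraints.append(B[i, i] + B[j, j] - 2 * B[i, j] == dists[i, j] ** 2)
    cp.Problem(cp.Maximize(cp.trace(B)), constraints).solve()
    return B.value

# The embedding is then recovered from B exactly as in classical MDS (steps 2-4 of figure 9).
```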
3.5 Other algorithms
We have looked at four algorithms for manifold learning, but there are certainly more. In this section, we briefly discuss two other algorithms.

As discussed above, both the LLE and Laplacian Eigenmaps algorithms are Laplacian-based algorithms. Hessian LLE (hLLE) [DG03] is closely related to these algorithms, but replaces the graph Laplacian with a Hessian[5] estimator. Like the LLE algorithm, hLLE looks at locally linear patches of the manifold and attempts to map them to a low-dimensional space. LLE models these patches by representing each x_i as a weighted combination of its neighbors. In place of this technique, hLLE runs principal components analysis on the neighborhood set to get an estimate of the tangent space at x_i (i.e. the linear patch surrounding x_i). The second step of LLE maps these linear patches to linear patches in R^d, using the weights found in the first step to guide the process. Similarly, hLLE maps the tangent space estimates to linear patches in R^d, warping the patches as little as possible. The "warping" factor is the Frobenius norm of the Hessian matrix. The mapping with minimum norm is found by solving an eigenvalue problem.

[5] Recall that the Hessian of a function is the matrix of its partial second derivatives.

The hLLE algorithm has a fairly strong optimality guarantee, stronger, in some sense, than that of Isomap. Whereas Isomap requires that the parameter space be convex and the embedded manifold be globally isometric to the parameter space, hLLE requires only that the parameter space be connected and locally isometric to the manifold. In these cases, hLLE is guaranteed to asymptotically recover the parameterization. Additionally, some empirical evidence supporting this theoretical claim has been given: Donoho and Grimes [DG03] demonstrate that
hLLE outperforms Isomap on a variant of the swiss roll data set where a hole has been punched out of the surface. Though its theoretical guarantee and some limited empirical evidence make the algorithm a strong contender, hLLE has yet to be widely used, perhaps partly because it is a bit more sophisticated and difficult to implement than Isomap or LLE. It is also worth pointing out that the theoretical guarantees are of a slightly different nature than those of Isomap: whereas Isomap has some finite sample error bounds, the theorems pertaining to hLLE hold only in the continuum limit. Even so, the hLLE algorithm is a viable alternative to many of the algorithms discussed.
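The hLLE description above, and the LTSA description that follows, both rely on estimating the tangent space at a point by running PCA on its neighborhood. A minimal version of that shared first step is sketched below; the alignment and Hessian-estimation steps that distinguish the two algorithms are not shown.

```python
import numpy as np

def local_tangent_basis(X, i, k, d):
    """Estimate a d-dimensional tangent basis at point x_i from its k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dists)[1:k + 1]
    Z = X[nbrs] - X[nbrs].mean(axis=0)           # center the neighborhood
    # The top d right singular vectors span the estimated tangent space at x_i.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:d].T                              # (D, d) orthonormal basis
```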
Another recent algorithm proposed for manifold learning is Local Tangent Space Alignment (LTSA) [ZZ04]. Similar to hLLE, the LTSA algorithm estimates the tangent space at each point by performing PCA on its neighborhood set. Since the data is assumed to lie on a d-dimensional manifold and to be locally linear, these tangent spaces admit a d-dimensional basis. These tangent spaces are then aligned to find global coordinates for the points. The alignment is computed by minimizing a cost function that allows any linear transformation of each local coordinate space. LTSA is a relatively recent algorithm that has not found widespread popularity yet; as such, its relative strengths and weaknesses are still under investigation.

There are a number of other algorithms in addition to the ones described here. We have touched on those algorithms that have had the greatest impact on developments in the field thus far.
4 Comparisons
We have discussed a myriad of different algorithms, all of which are aimed at roughly the same task. Choosing between these algorithms is difficult; there is no clear "best" one. In this section, we discuss the trade-offs associated with each algorithm.
4.1 Embedding type
One major distinction between these algorithms is the type of embedding that they compute. Isomap, Hessian LLE, and Semidefinite Embedding all look for isometric embeddings, i.e. they all assume that there is a coordinate chart from the parameter space to the high-dimensional space that preserves interpoint distances, and they attempt to uncover this chart. In contrast, LLE (and the C-Isomap variant of Isomap) looks for conformal mappings, which preserve local angles but not necessarily distances. It remains open whether the LLE algorithm can actually recover such mappings correctly. It is unclear exactly what type of embedding the Laplacian Eigenmaps algorithm computes [BN03]; it has been dubbed "proximity preserving" in places [WS04]. This breakdown is summarized in the following table.

Algorithm                   Mapping
C-Isomap                    conformal
Isomap                      isometric
Hessian LLE                 isometric
Laplacian Eigenmaps         unknown
Locally Linear Embedding    conformal
Semidefinite Embedding      isometric

This comparison is not particularly practical, since the form of embedding for a particular data set is usually unknown; however, if, say, an isometric method fails, then perhaps a conformal method would be a reasonable alternative.
4.2 Local versus global
Manifold learning algorithms are commonly split into two camps: local methods and global methods. Isomap is considered a global method because it constructs an embedding derived from the geodesic distances between all pairs of points. LLE is considered a local method because the cost function that it uses to construct an embedding only considers the placement of each point with respect to its neighbors. Similarly, Laplacian Eigenmaps and the derivatives of LLE are local methods. Semidefinite Embedding, on the other hand, straddles the local/global division: it is a local method in that intra-neighborhood distances are equated with geodesic distances, but a global method in that its objective function considers distances between all pairs of points. Figure 10 summarizes the breakdown of these algorithms.

Figure 10: The local/global split.

The local/global split reveals some distinguishing characteristics between algorithms in the two categories. Local methods tend to characterize the local geometry of manifolds accurately, but break down at the global level. For instance, the points in an embedding found by a local method tend to be well placed with respect to their neighbors, but non-neighbors may be much nearer to one another than they are on the manifold. The reason for these aberrations is simple: the constraints on the local methods do not pertain to non-local points. The behavior of Isomap is just the opposite: the intra-neighborhood distances for points in the embedding tend to be inaccurate, whereas the large interpoint distances tend to be correct. Classical MDS creates this behavior; its cost function contains fourth-order terms of the distances, which exacerbates the differences in scale between small and large distances. As a result, the emphasis of the cost function is on the large interpoint distances, and the local topology of points is often corrupted in the embedding process. SDE behaves more like the local methods in that the distances between neighboring points are guaranteed to be the same before and after the embedding.
The local methods also seem to handle sharp curvature in manifolds better than Isomap. Additionally, "shortcut edges", edges between points that should not actually be classified as neighbors, can have a potentially disastrous effect in Isomap. A few shortcut edges can badly damage all of the geodesic distance approximations. In contrast, this type of error does not seem to propagate through the entire embedding with the local methods.
4.3 Time complexity
The primary computational bottleneck for all of these algorithms, except for semidefinite embedding, is a spectral decomposition. Computing this decomposition for a dense n × n matrix requires O(n³) time. For sparse matrices, however, a spectral decomposition can be performed much more quickly. The global/local split discussed above turns out to be a dense/sparse split as well.

First, we look at the local methods. LLE builds a matrix of weights W ∈ R^{n×n} such that W_ij ≠ 0 only if j is one of i's nearest neighbors. Thus, W is very sparse: each row contains only k nonzero elements, and k is usually small compared to the number of points. LLE performs a sparse spectral decomposition of (I − W)⊤(I − W).
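The sparsity argument can be seen directly in code: M = (I − W)⊤(I − W) has only a handful of nonzeros per row, and its bottom eigenvectors can be obtained with a sparse eigensolver. The scipy routines and the 'SM' mode below are one reasonable choice on my part; practical implementations often use shift-invert or specialized solvers instead.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_lle_embedding(W, d):
    """LLE step 2 using sparse arithmetic; W is an (n, n) scipy.sparse weight matrix."""
    n = W.shape[0]
    IW = sp.identity(n, format="csr") - W.tocsr()
    M = (IW.T @ IW).tocsc()                    # sparse: roughly k nonzeros per row of W
    # 'SM' asks for the smallest-magnitude eigenvalues of the sparse matrix M.
    vals, vecs = eigsh(M, k=d + 1, which="SM")
    order = np.argsort(vals)
    return vecs[:, order][:, 1:d + 1]          # drop the constant eigenvector
```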
Appendix: Proof of Theorem 1

Suppose first that D is a Euclidean distance matrix, i.e. there is a configuration x_1, ..., x_n (taken to be mean-centered, without loss of generality) with Gram matrix B such that

    D_ij = ‖x_i − x_j‖² = B_ii + B_jj − 2B_ij.        (12)

Summing (12) over i,

    Σ_i D_ij = Σ_i B_ii + Σ_i B_jj − 2 Σ_i B_ij = Σ_i (B_ii + B_jj).
The second equality follows because the configuration is mean-centered.
It follows that

    (1/n) Σ_i D_ij = (1/n) Σ_i B_ii + B_jj.        (13)

Similarly,

    (1/n) Σ_j D_ij = (1/n) Σ_j B_jj + B_ii, and        (14)

    (1/n²) Σ_ij D_ij = (2/n) Σ_i B_ii.        (15)
Subtracting (15) from the sum of (14) and (13) leaves B_jj + B_ii on the right-hand side, so we may rewrite (12) as

    B_ij = −(1/2) [ D_ij − (1/n) Σ_i D_ij − (1/n) Σ_j D_ij + (1/n²) Σ_ij D_ij ]
         = −(1/2) [HDH]_ij.

Thus, −(1/2) HDH is a Gram matrix for x_1, ..., x_n.
Conversely, suppose that B := −(1/2) HDH is positive semidefinite. Then, there is a matrix X such that XX⊤ = B, and, expanding as above,

    B_ij = −(1/2) [ D_ij − (1/n) Σ_i D_ij − (1/n) Σ_j D_ij + (1/n²) Σ_ij D_ij ]
         = −(1/2) [ d²_ij − d̄²_i − d̄²_j + d̄² ],

where d²_ij := D_ij and d̄²_i, d̄²_j, d̄² denote the corresponding row, column, and grand means of D.
Let x_i be the ith row of X. Then,

    ‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2⟨x_i, x_j⟩
                 = B_ii + B_jj − 2B_ij
                 = D_ij.

The final equality follows from expanding B_ij, B_ii, and B_jj as written above. Thus, we have shown that the configuration x_1, ..., x_n has squared interpoint distances given by D, so D is a Euclidean distance matrix.
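As a quick sanity check of Theorem 1 and the double-centering identity used in the proof, the following snippet builds a squared-distance matrix from random points and verifies that B = −(1/2)HDH is positive semidefinite and reproduces the squared distances. The data and tolerances are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared interpoint distances

n = D.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D @ H                                      # the matrix from Theorem 1

assert np.min(np.linalg.eigvalsh(B)) > -1e-8              # B is positive semidefinite
# B is a Gram matrix: B_ii + B_jj - 2 B_ij recovers the squared distances.
recovered = np.diag(B)[:, None] + np.diag(B)[None, :] - 2 * B
assert np.allclose(recovered, D, atol=1e-8)
```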