Clustering With Gradient Descent
1 Performance
We explored SVD. Note that in practice, you should not compute $XX^T$, as this squares all of the singular values, leading to greater numerical instability. The built-in scipy.linalg.svd has the same runtime as the explicit computation, $O(dn^2 + d^3)$, but modifies the algorithm so that the results are more stable.
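To see why (a minimal sketch on a small random matrix, not from the notes): the eigenvalues of $XX^T$ are exactly the squared singular values of $X$, so forming $XX^T$ exaggerates the gap between large and small singular values.

```python
import numpy as np
from scipy.linalg import svd

# Small random d x n matrix, purely for illustration.
rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.standard_normal((d, n))

# Preferred: compute the SVD of X directly.
U, s, Vt = svd(X, full_matrices=False)

# The eigenvalues of X X^T are the squared singular values of X,
# which is why working with X X^T is less stable.
eigvals = np.linalg.eigvalsh(X @ X.T)
print(np.allclose(np.sort(s**2), eigvals))  # True
```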
We also explored k-means clustering. Again, our objective is
$$\text{Minimize}_{\mu_1, \dots, \mu_k} \sum_{i=1}^{n} \min_{1 \le j \le k} \|x_i - \mu_j\|^2.$$
Note that Lloyd's algorithm might find local minima, and the complexity per iteration is $O(ndk + nd)$.
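For reference, here is a minimal NumPy sketch of one Lloyd iteration (the function name and array shapes are my own choices); the assignment step accounts for the $O(ndk)$ term and the centroid update for the $O(nd)$ term.

```python
import numpy as np

def lloyd_step(X, mu):
    """One iteration of Lloyd's algorithm. X: (n, d) data, mu: (k, d) centroids."""
    # Assignment step, O(ndk): squared distance from every point to every centroid.
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    labels = dists.argmin(axis=1)
    # Update step, O(nd): each centroid becomes the mean of its assigned points
    # (left unchanged if no point is assigned to it).
    new_mu = mu.copy()
    for j in range(mu.shape[0]):
        if np.any(labels == j):
            new_mu[j] = X[labels == j].mean(axis=0)
    return new_mu, labels
```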
Finally, we explored spectral clustering, where our goal was to
$$\text{Minimize } \mathrm{cut}(V_1, V_2) \quad \text{s.t. } |V_1| = |V_2|.$$
This is equivalent to
$$\text{Minimize } \tfrac{1}{4} v^T L v \quad \text{s.t. } v_i \in \{-1, +1\},\ \mathbf{1}^T v = 0.$$
The solution to the latter form can be approximated using the solution to
$$\text{Minimize } \tfrac{1}{4} v^T L v \quad \text{s.t. } \|v\| = \sqrt{n},\ \mathbf{1}^T v = 0.$$
This relaxation always gives a response, but it only gives a good answer if clusters actually exist in your data. Its solution is the eigenvector of $L$ corresponding to the second-smallest eigenvalue; since $L$ is an $n \times n$ matrix, the runtime is $O(n^3)$.
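For concreteness, a minimal sketch of the relaxed solution on a toy graph (the adjacency matrix is my own example): the sign pattern of the second eigenvector recovers the two clusters.

```python
import numpy as np

# Toy adjacency matrix: two triangles {0,1,2} and {3,4,5} joined by one edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # dense eigendecomposition, O(n^3)
v = eigvecs[:, 1]                     # eigenvector for the second-smallest eigenvalue
print((v > 0).astype(int))            # signs split the graph into the two triangles
```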
In this note, we will see how gradient descent can be applied to each of these problems, giving us an iterative algorithm for each.
2 SVD
Recall the SVD $X = U \Sigma V^T$. We can recover it from the objective
$$\text{Minimize}_{A,\, B}\ \|X - A^T B\|_F^2,$$
where the above objective has the solutions $A = \Sigma^{1/2} U^T$, $B = \Sigma^{1/2} V^T$. Let us rewrite the objective function using the column vectors of $A$ and the column vectors of $B$:
$$A = \begin{bmatrix} a_1 & \cdots & a_d \end{bmatrix}, \qquad B = \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix}, \qquad a_i, b_j \in \mathbb{R}^r,$$
$$\text{Minimize} \sum_{i=1}^{d} \sum_{j=1}^{n} \left( X_{ij} - a_i^T b_j \right)^2.$$
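As a quick sanity check (a minimal sketch on a small random matrix of my own), the SVD factors reproduce $X$ entrywise as $a_i^T b_j$:

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.standard_normal((d, n))

U, s, Vt = svd(X, full_matrices=False)  # X = U diag(s) V^T
A = np.diag(np.sqrt(s)) @ U.T           # A = Sigma^{1/2} U^T, shape (r, d)
B = np.diag(np.sqrt(s)) @ Vt            # B = Sigma^{1/2} V^T, shape (r, n)

print(np.allclose(X, A.T @ B))                  # X == A^T B
print(np.allclose(X[2, 3], A[:, 2] @ B[:, 3]))  # X_ij == a_i^T b_j
```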
Take the gradient of the objective, w.r.t. $a_i$ and w.r.t. $b_j$, to obtain the following:
$$\nabla_{a_i} = -2 \sum_{j=1}^{n} (X_{ij} - a_i^T b_j)\, b_j, \qquad \nabla_{b_j} = -2 \sum_{i=1}^{d} (X_{ij} - a_i^T b_j)\, a_i.$$
We can fit this to stochastic gradient descent (sketched in code below). Fix a step size $t \neq 0$ and initialize $A$, $B$.
1. Sample an entry $X_{ij}$.
2. Set $e = X_{ij} - a_i^T b_j$.
3. Update $a_i \leftarrow a_i + t e b_j$ and $b_j \leftarrow b_j + t e a_i$ (the constant factor from the gradient is absorbed into $t$).
4. Repeat.
Computing $e$, updating $a_i$, and updating $b_j$ each take $O(r)$, so the complexity per iteration is $O(r)$. There are $nd$ total entries, but interestingly, $O((n + d)r)$ iterations often suffice. The proof turns out to be quite involved, so we will omit it.
What if most entries are missing? We can run SGD on the fully-observed entries to achieve
matrix completion.
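Here is a minimal NumPy sketch of these updates (the function name, the initialization scale, and the default arguments are my own choices); passing an observation mask handles the matrix-completion case, and a mask of all True values recovers the fully-observed case.

```python
import numpy as np

def sgd_factorize(X, mask, r, t=0.01, iters=100000, seed=0):
    """SGD for X ~= A^T B over the observed entries indicated by mask."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    A = 0.1 * rng.standard_normal((r, d))   # columns a_1, ..., a_d
    B = 0.1 * rng.standard_normal((r, n))   # columns b_1, ..., b_n
    obs = np.argwhere(mask)                 # indices of observed entries
    for _ in range(iters):
        i, j = obs[rng.integers(len(obs))]  # 1. sample an observed entry
        e = X[i, j] - A[:, i] @ B[:, j]     # 2. residual
        a_i = A[:, i].copy()
        A[:, i] += t * e * B[:, j]          # 3. a_i <- a_i + t e b_j
        B[:, j] += t * e * a_i              #    b_j <- b_j + t e a_i
    return A, B

# Fully-observed usage: A, B = sgd_factorize(X, np.ones(X.shape, dtype=bool), r=3)
```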
3 K-means Clustering
To apply stochastic gradient descent, fix a step size $t$ and initialize the centroids $\mu_1, \dots, \mu_k$; then repeat the following (sketched in code below).
1. Pick a random data point $x_i$.
2. Find the centroid $\mu_j$ closest to $x_i$.
3. Update $\mu_j \leftarrow (1 - t)\mu_j + t x_i$.
As it turns out, $\nabla \min_i f_i(x) = \nabla f_{i^*}(x)$, where $i^*$ is the index attaining the minimum. In other words, the gradient of the minimum of the $f_i$ is the gradient of the particular $f_i$ achieving the minimum, which is why only the closest centroid gets updated.
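A minimal NumPy sketch of this procedure (the function name, default step size, and iteration count are my own choices):

```python
import numpy as np

def sgd_kmeans(X, k, t=0.05, iters=10000, seed=0):
    """Stochastic-gradient k-means: nudge the closest centroid toward a random point."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)  # initialize from data
    for _ in range(iters):
        x = X[rng.integers(n)]                    # 1. pick a random point
        j = ((mu - x) ** 2).sum(axis=1).argmin()  # 2. find the closest centroid
        mu[j] = (1 - t) * mu[j] + t * x           # 3. pull it toward x
    return mu
```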
4 Spectral Clustering
Our goal is to get rid of our constraints; otherwise, the efficiency of SGD decreases. We then run gradient descent. Recall that
$$L_{ij} = \begin{cases} \sum_{k} a_{ik} & i = j \\ -a_{ij} & \text{otherwise,} \end{cases}$$
where $a_{ij}$ is the weight of edge $(i, j)$ in the adjacency matrix.
This is called the projected gradient algorithm (for a derivation, see http://www.stats.ox.ac.uk/~lienart/blog_opti_pgd.html). In short: take a gradient descent step on $\tfrac{1}{4} v^T L v$, project $v$ back onto the ball $\|v\| = \sqrt{n}$, and repeat. The complexity per iteration is proportional to the number of nonzero entries in $L$. If $L$ is extremely sparse, the weight update for projected gradient is extremely fast. On the other hand, an extremely dense graph will see a runtime approaching $O(n^2)$, while a graph in which the number of neighbors of any node is bounded by a constant $k$ gives a runtime of $O(n)$.
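A minimal sketch of this loop (the function name and defaults are my own choices; the factor of 1/2 from the gradient is folded into the step size):

```python
import numpy as np

def projected_gradient_fiedler(L, t=0.01, iters=5000, seed=0):
    """Projected gradient descent on (1/4) v^T L v with ||v|| = sqrt(n), 1^T v = 0.

    L may be a dense array or a scipy.sparse matrix; each product L @ v costs
    O(number of nonzero entries of L).
    """
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    v = rng.standard_normal(n)
    for _ in range(iters):
        v = v - t * (L @ v)                     # gradient step
        v = v - v.mean()                        # stay on the hyperplane 1^T v = 0
        v = np.sqrt(n) * v / np.linalg.norm(v)  # project onto the ball ||v|| = sqrt(n)
    return v

# e.g. np.sign(projected_gradient_fiedler(L)) splits the toy graph from earlier.
```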
How do you choose $t$? Decrease $t$ until you no longer get NaNs. How do you choose a stopping criterion? Run for 1-200 epochs. After a number of epochs, set $t \leftarrow \beta t$ where $\beta < 1$; common values are $\beta = 0.9, 0.8, 0.1$.
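For concreteness, one way such a decay schedule might look (the particular numbers are arbitrary placeholders):

```python
t = 0.01            # initial step size
beta = 0.9          # decay factor, beta < 1
for epoch in range(200):
    # ... run one epoch of SGD updates with step size t ...
    if (epoch + 1) % 20 == 0:
        t *= beta   # shrink the step size every 20 epochs
```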
We also have familiar algorithms for the nonnegative version of the factorization problem from the SVD section, in which the entries of $A$ and $B$ are constrained to be nonnegative (the objective is stated after the algorithm below). For each weight update, simply take the maximum of 0 and the new value. For example, we could modify the iterative algorithm for SVD to be the following (sketched in code below):
1. Sample an entry $X_{ij}$.
2. Set $e = X_{ij} - a_i^T b_j$.
3. Update $a_i \leftarrow \max(0,\, a_i + t e b_j)$ and $b_j \leftarrow \max(0,\, b_j + t e a_i)$, with the maximum taken elementwise.
4. Repeat.
This is projected stochastic gradient descent for the problem
$$\text{Minimize } \|X - A^T B\| \quad \text{s.t. } A \ge 0,\ B \ge 0.$$
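A minimal sketch of one such clipped update (the helper name and signature are my own; it reuses the column layout of $A$ and $B$ from the SVD section):

```python
import numpy as np

def nmf_sgd_step(X, A, B, i, j, t=0.01):
    """One projected-SGD update for min ||X - A^T B|| subject to A, B >= 0."""
    e = X[i, j] - A[:, i] @ B[:, j]                       # residual for the sampled entry
    a_i = A[:, i].copy()
    A[:, i] = np.maximum(0.0, A[:, i] + t * e * B[:, j])  # clip the update at zero
    B[:, j] = np.maximum(0.0, B[:, j] + t * e * a_i)
    return A, B
```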
Now consider the problem of ranking webpages. Define
$$H_{ij} = \begin{cases} 1 & i \text{ links to } j \\ 0 & \text{otherwise.} \end{cases}$$
We can take the out-degree $n_i$ and normalize each row: $H_{ij} \leftarrow \frac{1}{n_i} H_{ij}$. Take $\vec{p}$ to be the vector of probabilities of landing on each page. The eigenvector satisfying $p = H^T p$, i.e., the stationary distribution of this normalized link matrix, gives us the most popular webpages on the internet.
$$\text{Maximize } p^T H p \quad \text{s.t. } p^T p = 1.$$
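A minimal power-iteration sketch for this ranking (the 4-page link matrix is a toy example of my own; with the row convention above, the iteration uses $H^T$):

```python
import numpy as np

# Toy link matrix: H[i, j] = 1 if page i links to page j.
H = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = H / H.sum(axis=1, keepdims=True)  # divide row i by the out-degree n_i
p = np.ones(4) / 4                    # start from the uniform distribution
for _ in range(100):
    p = M.T @ p                       # one step toward the stationary distribution
    p = p / p.sum()                   # keep p a probability vector
print(p)                              # larger entries correspond to more popular pages
```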