Unsupervised Learning
(a.k.a. Clustering)
Marcello Pelillo
University of Venice, Italy
Artificial Intelligence
a.y. 2018/19
The “classical” clustering problem
Given:
• a set of n “objects” = an edge-weighted graph G
• an n × n matrix A of pairwise similarities
Goal: Partition the vertices of G into maximally homogeneous groups (i.e.,
clusters).
For a review see, e.g., A. K. Jain, "Data clustering: 50 years beyond K-means,”
Pattern Recognition Letters 31(8):651-666, 2010.
Basic ideas of grouping in humans:
The Gestalt school
[Figure: examples of perceptual grouping, after Koffka.]
Source: K. Grauman
Segmentation as clustering
– Initialize:
Pick K random points as cluster centers
– Alternate:
1. Assign data points to closest cluster center
2. Change the cluster center to the average of its assigned points
Note: Ensure that every cluster has at least one data point. Possible techniques for doing this include
supplying empty clusters with a point chosen at random from points far from their cluster centers.
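A minimal NumPy sketch of the procedure just described (the function name, the fixed iteration count, and the re-seeding rule for empty clusters are illustrative choices, not prescribed by the slides):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm on an (n, d) data matrix X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize: pick K random points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each data point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # Empty cluster: re-seed with a point far from its current center
                centers[j] = X[dists.min(axis=1).argmax()]
            else:
                centers[j] = members.mean(axis=0)
    return labels, centers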
K-means clustering: Example
Initialization:
Pick K random points as
cluster centers
Iterative Step 1:
Assign data points to
closest cluster center
Iterative Step 2:
Change the cluster center to
the average of the assigned
points
Final output
K-means minimizes the within-cluster sum of squared distances:
∑_{i ∈ clusters} ∑_{j ∈ elements of the i-th cluster} ‖ x_j − µ_i ‖²
• Pros
– Very simple method
– Efficient
• Cons
– Converges to a local minimum
of the error function
– Need to pick K
– Sensitive to initialization
– Sensitive to outliers
– Only finds “spherical” clusters
Images as graphs
[Figure: an image represented as a graph; w_ij is the affinity between pixels i and j.]
Source: D. Sontag
Measuring affinity
affinity(x_i, x_j) = exp( − dist(x_i, x_j)² / (2σ²) )
Scale affects affinity
• Small σ: group only nearby points
• Large σ: group far-away points
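As a sketch, the Gaussian affinity above can be computed for all pairs at once (the helper name is illustrative):

import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """W[i, j] = exp(-dist(x_i, x_j)^2 / (2 sigma^2)) for rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))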
Eigenvector-based clustering
Let us represent a cluster using a vector x whose k-th entry captures the
participation of node k in that cluster. If a node does not participate in a
cluster, the corresponding entry is zero.
We want to maximize the total intra-cluster affinity x'Ax over the unit sphere ‖x‖ = 1; the maximizer is the eigenvector of the affinity matrix A associated with its largest eigenvalue.
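A short sketch of this idea, assuming a symmetric affinity matrix A (the participation threshold is an illustrative choice):

import numpy as np

def leading_eigenvector_cluster(A, threshold=1e-3):
    """Return the indices of nodes participating in the strongest cluster of A."""
    eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues, A symmetric
    x = np.abs(eigvecs[:, -1])             # leading eigenvector, sign fixed
    return np.where(x > threshold)[0]      # nodes with non-negligible participation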
More than two segments
• Two options
– Recursively split each side to get a tree, continuing till the
eigenvalues are too small
– Use the other eigenvectors
Clustering by eigenvectors: Algorithm
cut(A, B) = ∑_{i∈A} ∑_{j∈B} w(i, j)
[Figure: a graph partitioned into two sets A and B; the cut is the total weight of the edges crossing between them.]
Bad news
Minimizing the cut favors highly unbalanced clusters (often with isolated vertices)
Graph terminology
Degree of a node: d_i = ∑_j w(i, j)
Volume of a set: vol(A) = ∑_{i∈A} d_i
Ncut(A, B) = cut(A, B) · ( 1/vol(A) + 1/vol(B) )
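A small sketch that evaluates cut(A, B) and Ncut(A, B) directly from a symmetric weight matrix W (function and argument names are illustrative):

import numpy as np

def ncut_value(W, in_A):
    """W: symmetric weight matrix; in_A: boolean mask marking the vertices in A."""
    in_B = ~in_A
    cut = W[np.ix_(in_A, in_B)].sum()   # total weight of edges crossing the cut
    vol_A = W[in_A, :].sum()            # vol(A) = sum of the degrees of nodes in A
    vol_B = W[in_B, :].sum()
    return cut * (1.0 / vol_A + 1.0 / vol_B)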
The (unnormalized) graph Laplacian: L = D − W
Key fact: for every vector f,  f' L f = ½ ∑_{i,j} w(i, j) (f_i − f_j)² ≥ 0
Properties
• L is symmetric (since W is assumed symmetric) and positive semi-definite:
  f' L f ≥ 0
  for all vectors f (by the key fact above)
• Smallest eigenvalue of L is 0; corresponding eigenvector is 1
• Thus eigenvalues are: 0 = λ1 ≤ λ2 ≤ ... ≤ λn
• Random-walk normalization: L_rw = D⁻¹ L = I − D⁻¹ W
• Symmetric normalization: L_sym = D^(−1/2) L D^(−1/2) = I − D^(−1/2) W D^(−1/2)
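A sketch constructing the three Laplacians from a weight matrix W (assuming all degrees are strictly positive):

import numpy as np

def laplacians(W):
    d = W.sum(axis=1)                     # node degrees
    D = np.diag(d)
    L = D - W                             # unnormalized Laplacian
    L_rw = np.diag(1.0 / d) @ L           # random-walk normalization: I - D^{-1} W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt   # symmetric normalization
    return L, L_rw, L_sym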
Define the discrete indicator vector x by:
x_i = +1 if i ∈ A,   x_i = −1 if i ∈ B
Rayleigh quotient
It can be shown that:
min_x Ncut(x) = min_y  y'(D − W) y / (y' D y)
This is NP-hard!
Ncut as an eigensystem
If we relax the constraint that y be a discrete-valued vector and allow it to
take on real values, the problem
min_y  y'(D − W) y / (y' D y)
is equivalent to:
the generalized eigenvalue problem:  (D − W) y = λ D y
P = D−1 W
1. Given a weighted graph G = (V, E, w), summarize the information into
matrices W and D
2. Solve (D − W)y = λDy for eigenvectors with the smallest eigenvalues
3. Use the eigenvector with the second smallest eigenvalue to bipartition the
graph by finding the splitting point such that Ncut is minimized
4. Decide if the current partition should be subdivided by checking the stability of the cut and making sure Ncut is below a prespecified threshold
Note. The approach is computationally wasteful; only the second eigenvector is used, whereas
the next few small eigenvectors also contain useful partitioning information.
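A sketch of steps 1-3 of the algorithm, using SciPy's generalized symmetric eigensolver; for simplicity it thresholds the second eigenvector at zero instead of searching for the Ncut-minimizing splitting point:

import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Bipartition a graph with symmetric weight matrix W via the Ncut relaxation."""
    d = W.sum(axis=1)
    D = np.diag(d)
    # Solve (D - W) y = lambda D y; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = eigh(D - W, D)
    y = eigvecs[:, 1]        # eigenvector of the second smallest eigenvalue
    return y >= 0            # simple splitting point (the slides search over thresholds)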
Ncut: More than 2 clusters
Approach #2: Using first k eigenvectors
Spectral clustering
Four 1D Gaussian clusters with increasing variance and the corresponding eigenvalues of Lrw (von Luxburg, 2007).
References
• J. Shi and J. Malik, Normalized cuts and image segmentation. IEEE
Transactions on Pattern Analysis and Machine Intelligence 22(8): 888-905
(2000).
Examples:
• clustering micro-array gene expression data
• clustering documents into topic categories
• perceptual grouping
• segmentation of images with transparent surfaces
References:
• N. Jardine and R. Sibson. The construction of hierarchic and non-hierarchic classifications. Computer Journal 11:177-184, 1968.
• A. Banerjee, C. Krumpelman, S. Basu, R. J. Mooney, and J. Ghosh. Model-based overlapping clustering. KDD 2005.
• K. A. Heller and Z. Ghahramani. A nonparametric Bayesian approach to modeling overlapping clusters. AISTATS 2007.
The Symmetry Assumption
«Similarity has been viewed by both philosophers and psychologists as a prime example of a symmetric relation. Indeed, the assumption of symmetry underlies essentially all theoretical treatments of similarity.»
Amos Tversky
“Features of similarity,” Psychol. Rev. (1977)
Internal criterion
all “objects” inside a cluster should be highly similar to each other
External criterion
all “objects” outside a cluster should be highly dissimilar to the ones inside
The Notion of “Gestalt”
«In most visual fields the contents of particular areas “belong together” as circumscribed units from which their surroundings are excluded.»
«In gestalt theory the word “Gestalt” means any segregated whole.»
W. Köhler (1929)
Data Clustering:
Old vs. New
By answering the question “what is a cluster?” we get a novel way of
looking at the clustering problem.
Clustering_old(V,A,k)
V1,V2,...,Vk <- My_favorite_partitioning_algorithm(V,A,k)
return V1,V2,...,Vk
−−−−−−
Clustering_new(V,A)
V1,V2,...,Vk <- Enumerate_all_clusters(V,A)
return V1,V2,...,Vk
Enumerate_all_clusters(V,A)
repeat
Extract_a_cluster(V,A)
until all clusters have been found
return the clusters found
A Special Case:
Binary Symmetric Affinities
[Figure: clusterings of the same binary-affinity graph obtained with NCut and with the new approach.]
Advantages of the New Approach
Need a partition?
Partition_into_clusters(V,A)
repeat
Extract_a_cluster
remove it from V
until all vertices have been clustered
What is Game Theory?
“The central problem of game theory was posed by von
Neumann as early as 1926 in Göttingen. It is the following:
If n players, P1,…, Pn, play a given game Γ, how must the ith
player, Pi, play to achieve the most favorable result for himself?”
Harold W. Kuhn
Lectures on the Theory of Games (1953)
1944, 1947: John von Neumann and Oskar Morgenstern publish Theory of Games and
Economic Behavior.
1950−1953: In four papers John Nash made seminal contributions to both non-cooperative
game theory and to bargaining theory.
1972−1982: John Maynard Smith applies game theory to biological problems thereby
founding “evolutionary game theory.”
[Table: payoff matrix of a two-player normal-form game, with Player 1's strategies as rows and Player 2's as columns; e.g., the row “Low” contains the payoff pairs (2, 2), (5, 4), and (12, 3).]
Δ = { x ∈ Rⁿ : x_i ≥ 0 for all i = 1, …, n, and ∑_{i=1}^{n} x_i = 1 }
Nash Equilibria
The expected payoff of strategy y against strategy x is:
y' A x = ∑_i ∑_j a_ij y_i x_j
A strategy x is a (symmetric) Nash equilibrium if
x' A x ≥ y' A x
for all strategies y (x is a best reply to itself).
Theorem (Nash, 1951). Every finite normal-form game admits a mixed-strategy Nash equilibrium.
Evolution and the Theory of Games
Assumptions:
• A large population of individuals belonging to the same species which compete for a particular limited resource
• This kind of conflict is modeled as a symmetric two-player game, the players being pairs of randomly selected population members
• Players do not behave “rationally” but act according to a pre-programmed behavioral pattern (pure strategy)
• Reproduction is assumed to be asexual
• Utility is measured in terms of Darwinian fitness, or reproductive success
The (average) weighted degree of vertex i with respect to a set S is:
awdeg_S(i) = (1/|S|) ∑_{j∈S} a_ij
Moreover, if j ∉ S, we define:
φ_S(i, j) = a_ij − awdeg_S(i)
Intuitively, φ_S(i, j) measures the similarity between vertices j and i, with respect to the (average) similarity between vertex i and its neighbors in S.
Assigning Weights to Vertices
w_S(i) = 1                                               if |S| = 1
w_S(i) = ∑_{j ∈ S−{i}} φ_{S−{i}}(j, i) · w_{S−{i}}(j)    otherwise

The total weight of S:
W(S) = ∑_{i∈S} w_S(i)
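A direct, naive sketch of the recursive definitions above (it follows the formulas literally and is exponential in |S|, so it is meant only as a worked example on small sets; names are illustrative):

import numpy as np

def awdeg(A, S, i):
    """Average affinity between node i and the nodes in S."""
    return A[i, list(S)].mean()

def phi(A, S, i, j):
    """phi_S(i, j) = a_ij - awdeg_S(i), defined for j outside S."""
    return A[i, j] - awdeg(A, S, i)

def w(A, S, i):
    """Weight of node i with respect to the set S (recursive definition)."""
    if len(S) == 1:
        return 1.0
    R = S - {i}
    return sum(phi(A, R, j, i) * w(A, R, j) for j in R)

def total_weight(A, S):
    """W(S) = sum of w_S(i) over i in S."""
    return sum(w(A, S, i) for i in S)

For instance, total_weight(A, {0, 1, 2}) evaluates W(S) on the subgraph induced by the first three vertices.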
Interpretation
Theorem (Torsello, Rota Bulò and Pelillo, 2006). Evolutionarily stable strategies of the clustering game with affinity matrix A are in one-to-one correspondence with dominant sets.
Dominant-set clustering
• To get a single dominant-set cluster use, e.g., replicator dynamics (but see Rota Bulò, Pelillo and Bomze, CVIU 2011, for faster dynamics), which yields:
ẋ_i = x_i [ (Ax)_i − x' A x ]
Theorem (Nachbar, 1990; Taylor and Jonker, 1978). A point x ∈ Δ is a Nash equilibrium if and only if x is the limit point of a replicator-dynamics trajectory starting from the interior of Δ.
Furthermore, if x ∈ Δ is an ESS, then it is an asymptotically stable equilibrium point for the replicator dynamics.
Doubly Symmetric Games
In a doubly symmetric (or partnership) game, the payoff matrix A is
symmetric (A = AT).
x_i(t+1) = x_i(t) · (A x(t))_i / ( x(t)' A x(t) )
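A sketch of the discrete-time dynamics above, iterated from the barycenter of the simplex; it assumes a nonnegative, symmetric affinity matrix with zero diagonal (the stopping criterion is an illustrative choice):

import numpy as np

def replicator_dynamics(A, n_iters=10000, tol=1e-10):
    """Iterate x_i <- x_i (Ax)_i / (x'Ax), starting from the simplex barycenter."""
    n = len(A)
    x = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)
        if np.abs(x_new - x).sum() < tol:   # converged
            return x_new
        x = x_new
    return x   # the support of x is the extracted cluster; x'Ax its cohesiveness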
Measuring the Degree of Cluster
Membership
The components of the converged vector give us a measure of the participation of the corresponding vertices in the cluster, while the value of the objective function provides a measure of the cohesiveness of the cluster.
Application to Image Segmentation
For the sake of comparison, in the experiments we used the same similarities
used in Shi and Malik’s normalized-cut paper (PAMI 2000).
Partition_into_dominant_sets(G)
    repeat
        find a dominant set
        remove it from the graph
    until all vertices have been clustered
To find a single dominant set we used replicator dynamics (but see Rota
Bulò, Pelillo and Bomze, CVIU 2011, for faster game dynamics).
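A sketch of this peel-off strategy, reusing the replicator_dynamics routine sketched earlier (the participation threshold is an illustrative choice):

import numpy as np

def partition_into_dominant_sets(A, threshold=1e-5):
    """Repeatedly extract a dominant set and remove it from the graph."""
    remaining = np.arange(len(A))
    clusters = []
    while len(remaining) > 0:
        x = replicator_dynamics(A[np.ix_(remaining, remaining)])
        members = remaining[x > threshold]   # vertices participating in the cluster
        if len(members) == 0:                # safeguard against an empty extraction
            members = remaining
        clusters.append(members)
        remaining = np.setdiff1d(remaining, members)
    return clusters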
Intensity Segmentation Results
[Figure: intensity-based segmentations obtained with dominant sets vs. NCut.]
Texture Segmentation Results
[Figure: texture-based segmentations obtained with dominant sets vs. NCut.]
In a nutshell…
The game-theoretic/dominant-set approach:
• does not require a priori knowledge of the number of clusters (since it extracts them sequentially)