2013 IEEE Conference on Computer Vision and Pattern Recognition
Cartesian k-means
Mohammad Norouzi
David J. Fleet
Department of Computer Science
University of Toronto
{norouzi,fleet}@cs.toronto.edu
Abstract
A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and run-time cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on large-scale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10).
1. Introduction and background

Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {S_i}_{i=1}^k, mapping points x in each region into region-specific representatives {c_i}_{i=1}^k, known as centers. As such, a quantizer, q(x), can be expressed as

$q(x) = \sum_{i=1}^{k} \mathbb{1}(x \in S_i)\, c_i$,   (1)

where 1(·) is the usual indicator function.

The quality of a quantizer is expressed in terms of expected distortion, a common measure of which is squared error ‖x − q(x)‖_2^2. In this case, given centers {c_i}, the region to which a point is assigned with minimal distortion is obtained by Euclidean nearest neighbor (NN) search. The k-means algorithm can be used to learn centers from data.

To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 2^64), as storing the centers is untenable.

This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), one key idea is compositionality. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number of parameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for style-content separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the encoding cost of k centers down from k to between log_2 k and √k. A model parameter controls the trade-off between model fidelity and compactness.

We formulate two related algorithms, Orthogonal k-means (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The ok-means algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of ok-means, and can be viewed as a generalization of PQ. (Footnote 1: A very similar generalization of PQ was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4].)

We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10.
2. k-means
Given a dataset of n p-dim points, D ≡ {x_j}_{j=1}^n, the k-means algorithm partitions the n points into k clusters, and represents each cluster by a center point. Let C ∈ R^{p×k} be a matrix whose columns comprise the k cluster centers, i.e., C = [c_1, c_2, ..., c_k]. The k-means objective is to minimize the within-cluster squared distances:
$\ell_{k\text{-means}}(C) = \sum_{x \in D} \min_i \|x - c_i\|_2^2 = \sum_{x \in D} \min_{b \in H_{1/k}} \|x - Cb\|_2^2$   (2)

where H_{1/k} ≡ {b | b ∈ {0,1}^k and ‖b‖_1 = 1}, i.e., b is a binary vector comprising a 1-of-k encoding. Lloyd's k-means algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b's.
The k-means model is simple and intuitive, using NN
search to assign points to centers. The assignment of points
to centers can be represented with a log_2 k-bit index per data
point. The cost of storing the centers grows linearly with k.
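For concreteness, a minimal NumPy sketch of Lloyd's algorithm in the matrix notation of (2) follows; it is an illustration only (the function name and interface are ours, not part of the paper).

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Lloyd's algorithm. X is p x n (columns are data points).
    Returns C (p x k centers) and assignments (length-n index vector)."""
    p, n = X.shape
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(n, size=k, replace=False)].astype(np.float64)  # init centers from data
    for _ in range(n_iters):
        # Assignment step: nearest center per point (the 1-of-k vector b in Eq. 2)
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)        # k x n squared distances
        assign = d2.argmin(axis=0)
        # Update step: each center becomes the mean of its assigned points
        for i in range(k):
            pts = X[:, assign == i]
            if pts.shape[1] > 0:
                C[:, i] = pts.mean(axis=1)
    return C, assign
```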
3. Orthogonal k-means with 2^m centers
With a compositional model one can represent cluster centers more efficiently. One such approach is to reconstruct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ H_m ≡ {0,1}^m, and C ∈ R^{p×m}. As such, each cluster center is the sum of a subset of the columns of C. There are 2^m possible subsets, and therefore k = 2^m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially.
While efficient in representing cluster centers, the approach is problematic, because solving

$\hat{b} = \operatorname{argmin}_{b \in H_m} \|x - Cb\|_2^2$   (3)

is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2^m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m.
One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b′ ∈ H′_m ≡ {−1, 1}^m. Then,

$\|x - Cb'\|_2^2 = x^T x + b'^T C^T C\, b' - 2 x^T C b'$.   (4)

For diagonal C^T C, the middle quadratic term on the RHS becomes trace(C^T C), independent of b′. As a consequence, when C has orthogonal columns,

$\operatorname{argmin}_{b' \in H'_m} \|x - Cb'\|_2^2 = \operatorname{sgn}(C^T x)$,   (5)

where sgn(·) is the element-wise sign function.
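As a quick sanity check of (5), the following illustrative NumPy snippet (our own construction, with an arbitrary random C) confirms that sgn(C^T x) coincides with the brute-force minimizer over all 2^m sign vectors when the columns of C are orthogonal.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p, m = 8, 4
Q, _ = np.linalg.qr(rng.standard_normal((p, m)))   # orthonormal columns
C = Q * rng.uniform(0.5, 2.0, size=m)              # orthogonal columns with unequal scales
x = rng.standard_normal(p)

# Closed-form encoding from Eq. (5)
b_closed = np.sign(C.T @ x)

# Brute force over all 2^m vertices of {-1,+1}^m
best = min(itertools.product([-1, 1], repeat=m),
           key=lambda b: np.sum((x - C @ np.array(b)) ** 2))

assert np.array_equal(b_closed, np.array(best))
```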
Figure 1. Two quantizations of 2D data (red ×'s) by ok-means. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b′ | b′ ∈ H′_2}. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {µ + RDb′ | b′ ∈ H′_2}.
To reduce quantization error further we can also introduce an offset, denoted µ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):

$\ell_{ok\text{-means}}(C, \mu) = \sum_{x \in D} \min_{b' \in H'_m} \|x - \mu - Cb'\|_2^2$.   (6)

(Footnote 2: While closely related to ITQ, we use the term ok-means to emphasize the relationship to k-means.)

Clearly, with a change of variables, b′ = 2b − 1, we can define new versions of µ and C, with identical loss, for which the unknowns are binary, but in {0, 1}^m.
The ok-means quantizer encodes each data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and µ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ R^{p×m} has orthonormal columns (R^T R = I_m), and D is diagonal and positive definite. The goal of learning is to find the parameters, R, D, and µ, which minimize quantization error.
Fig. 1 depicts a small set of 2D data points (red x’s) and
two possible quantizations. Fig. 1 (left) depicts the vertices
of a 2-cube with C = I_2 and zero translation. The cluster
regions are simply the four quadrants of the 2D space. The
distances between data points and cluster centers, i.e., the
quantization errors, are relatively large. By comparison,
Fig. 1 (right) shows how a transformed 2-cube, the full
model, can greatly reduce quantization errors.
3.1. Learning ok-means
To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x_1, x_2, ..., x_n] ∈ R^{p×n}. Let B′ ∈ {−1, 1}^{m×n} denote the corresponding cluster assignment coefficients. Our goal is to find the assignment coefficients B′ and the transformation parameters, namely, the rotation R, scaling D, and translation µ, that minimize

$\ell_{ok\text{-means}}(B', R, D, \mu) = \|X - \mu 1^T - RDB'\|_F^2$   (7)
$\qquad\qquad\qquad\qquad\;\;\;\, = \|X' - RDB'\|_F^2$   (8)

where ‖·‖_F denotes the Frobenius norm, X′ ≡ X − µ1^T, R is constrained to have orthonormal columns (R^T R = I_m), and D is a positive diagonal matrix.

Like k-means, coordinate descent is effective for optimizing (8). We first initialize µ and R, and then iteratively optimize ℓ_ok-means with respect to B′, D, R, and µ:

• Optimize B′ and D: With straightforward algebraic manipulation, one can show that

$\ell_{ok\text{-means}} = \|R^T X' - DB'\|_F^2 + \|R_\perp^T X'\|_F^2$,   (9)

where the columns of R_⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R_⊥] is orthogonal). It follows that, given X′ and R, we can optimize the first term in (9) to solve for B′ and D. Here, DB′ is the least-squares approximation to R^T X′, where R^T X′ and DB′ are m×n. Further, the ith row of DB′ can only contain elements from {−d_i, +d_i} where d_i = D_ii. Wherever the corresponding element of R^T X′ is positive (negative) we should put a positive (negative) value in DB′. The optimal d_i is determined by the mean absolute value of the elements in the ith row of R^T X′:

$B' \leftarrow \operatorname{sgn}(R^T X')$   (10)
$D \leftarrow \operatorname{Diag}(\operatorname{mean}_{\text{row}}(\operatorname{abs}(R^T X')))$   (11)

• Optimize R: Using the original objective (8), find R that minimizes ‖X′ − RA‖_F^2, subject to R^T R = I_m, and A ≡ DB′. This is equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. (In particular, by adding p − m rows of zeros to the bottom of D, DB′ becomes p×n. Then R is square and orthogonal and can be estimated with SVD. But since DB′ is degenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.)

• Optimize µ: Given R, B′ and D, the optimal µ is given by the column average of X − RDB′:

$\mu \leftarrow \operatorname{mean}_{\text{column}}(X - RDB')$   (12)
3.2. Approximate nearest neighbor search

One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. First, storage costs for the centers are reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]).

Generally, in terms of a generic quantizer q(·), there are two natural ways to estimate the distance between two vectors, v and u [7]. Using the Symmetric quantizer distance (SQD), ‖v − u‖ is approximated by ‖q(v) − q(u)‖. Using the Asymmetric quantizer distance (AQD), only one of the two vectors is quantized, and ‖v − u‖ is estimated as ‖v − q(u)‖. While SQD might be slightly faster to compute, AQD incurs lower quantization errors, and hence is more accurate.

For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD.

In the ok-means model, the quantization of an input x is straightforwardly shown to be

$q_{ok}(x) = \mu + RD\, \operatorname{sgn}(R^T (x - \mu))$.   (13)

The corresponding m-bit cluster index is sgn(R^T (x − µ)). Given two indices, namely b′_1, b′_2 ∈ {−1, +1}^m, the symmetric ok-means quantizer distance is

$\mathrm{SQD}_{ok}(b'_1, b'_2) = \|\mu + RDb'_1 - \mu - RDb'_2\|_2^2 = \|D(b'_1 - b'_2)\|_2^2$.   (14)

In effect, SQD_ok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b′_1 and b′_2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αI_m for α ∈ R+), then SQD_ok becomes a constant multiple of the usual Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5].

The ok-means AQD between a feature vector x and a cluster index b′ is defined as

$\mathrm{AQD}_{ok}(x, b') = \|x - \mu - RDb'\|_2^2 = \|R^T x' - Db'\|_2^2 + \|R_\perp^T x'\|_2^2$,   (15)

where x′ = x − µ. For ANN search, in comparing distances from query x to a dataset of binary indices, the second term on the RHS of (15) is irrelevant, since it does not depend on b′. Without this term, AQD_ok becomes a form of asymmetric Hamming (AH) distance between R^T x′ and b′. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model.
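The quantities in (13)-(15) are simple to compute; the sketch below is an illustrative NumPy rendering (variable names are ours), with D stored as the vector of diagonal entries.

```python
import numpy as np

def encode_ok(x, R, D, mu):
    """m-bit ok-means index of x, as a +/-1 vector (Eq. 13)."""
    return np.sign(R.T @ (x - mu))

def sqd_ok(b1, b2, D):
    """Symmetric distance (Eq. 14): a weighted Hamming distance over differing bits."""
    return np.sum((D * (b1 - b2)) ** 2)

def aqd_ok(x, b, R, D, mu):
    """Asymmetric distance (Eq. 15), dropping the constant out-of-subspace term."""
    z = R.T @ (x - mu)
    return np.sum((z - D * b) ** 2)
```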
Figure 2. Euclidean ANN retrieval results (Recall@R) for different methods and distance functions on the 1M SIFT dataset, with 64-bit encoding (k = 2^64); curves compare PQ (AD), ok-means (AH), ITQ (AH), ok-means (H), and ITQ (H).
3.3. Experiments with ok-means
Following [7], we report ANN search results on 1M
SIFT, a corpus of 128D SIFT descriptors with disjoint sets
of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database,
and the test points are queries. The number of bits m is
typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is
a random rotation of the first m principal directions of the
training data, and µ is the mean of the data.
For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the
ground-truth Euclidean NN is found in the R retrieved
items. Fig. 2 shows the Recall@R plots for ok-means with
Hamming (H) ≈ SQDok and asymmetric Hamming (AH)
≡ AQDok distance, vs. ITQ [5] and PQ [7]. The PQ
method exploits a more complex asymmetric distance function denoted AD ≡ AQDck (defined in Sec. 6.1). Note first
that ok-means improves upon ITQ, with both Hamming and
asymmetric Hamming distances. This is due to the scaling parameters (i.e., D) in ok-means. If one is interested in efficient retrieval with Hamming distance, ok-means is preferred over ITQ. But better results are obtained with the asymmetric distance function.
Fig. 2 also shows that PQ achieves superior recall rates.
This stems from its joint encoding of multiple feature dimensions. In ok-means, each bit represents a partition of the
feature space into two clusters, separated by a hyperplane.
The intersection of m orthogonal hyperplanes yields 2^m regions. Hence we obtain just two clusters per dimension,
and each dimension is encoded independently. In PQ, by
comparison, multiple dimensions are encoded jointly, with
arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different dimensions. We next
extend ok-means to jointly encode multiple dimensions.
4. Cartesian k-means
In the Cartesian k-means (ck-means) model, each region center is expressed parametrically as an additive combination of multiple subcenters. Let there be m sets of subcenters, each with h elements. (Footnote 3: While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities.) Let C^(i) be a matrix whose columns comprise the elements of the ith subcenter set; C^(i) ∈ R^{p×h}. Finally, assume that each cluster center, c, is the sum of exactly one element from each subcenter set:

$c = \sum_{i=1}^{m} C^{(i)} b^{(i)}$,   (16)

where b^(i) ∈ H_{1/h} is a 1-of-h encoding.

Figure 3. Depiction of Cartesian quantization on 4D data, with the first (last) two dimensions sub-quantized on the left (right). The Cartesian k-means quantizer, denoted q_ck, combines the sub-quantizations in subspaces, and produces a 4D reconstruction.
As a concrete example (see Fig. 3), suppose we are given 4D inputs, x ∈ R^4, and we split each datum into m = 2 parts:

$z^{(1)} = [\, I_2 \;\; 0 \,]\, x$, and $z^{(2)} = [\, 0 \;\; I_2 \,]\, x$.   (17)

Then, suppose we quantize each part, z^(1) and z^(2), separately. As depicted in Fig. 3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4×5 matrices C^(1) and C^(2),

$C^{(1)} = \begin{bmatrix} d_1 \; d_2 \; d_3 \; d_4 \; d_5 \\ 0_{2\times 5} \end{bmatrix}$, $\quad C^{(2)} = \begin{bmatrix} 0_{2\times 5} \\ d'_1 \; d'_2 \; d'_3 \; d'_4 \; d'_5 \end{bmatrix}$,

we obtain a model (16) that provides 5^2 possible centers with which to quantize the data.
More generally, the total number of model centers is k = h^m. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is h^m, the number of subcenters is only mh. The model provides a super-linear number of centers with a linear number of parameters.
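As a small illustration of this parameterization (with made-up random subcenters), a center is reconstructed by summing one column from each C^(i), so mh stored subcenters index h^m distinct centers:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, h = 4, 2, 5
subcenters = [rng.standard_normal((p, h)) for _ in range(m)]   # the C^(i)'s

def center(indices):
    """Reconstruct a cluster center from one subcenter index per set (Eq. 16)."""
    return sum(C_i[:, j] for C_i, j in zip(subcenters, indices))

print(center((3, 1)))                                 # one of h**m = 25 possible centers
print(h ** m, "centers from", m * h, "stored subcenters")
```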
The learning objective for Cartesian k-means is

$\ell_{ck\text{-means}}(C) = \sum_{x \in D} \min_{\{b^{(i)}\}_{i=1}^m} \Big\| x - \sum_{i=1}^{m} C^{(i)} b^{(i)} \Big\|_2^2$   (18)

where b^(i) ∈ H_{1/h}, and C ≡ [C^(1), ..., C^(m)] ∈ R^{p×mh}. If we let b^T ≡ [b^{(1)T}, ..., b^{(m)T}], then the second sum in (18) can be expressed succinctly as Cb.
The key problem with this formulation is that the minimization of (18) with respect to the b^(i)'s is intractable. Nevertheless, motivated by orthogonal k-means (Sec. 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal; i.e.,

$C^{(i)T} C^{(j)} = 0_{h \times h} \quad \forall\, i, j,\; i \neq j$.   (19)

Each subcenter matrix C^(i) spans a linear subspace of R^p, and the linear subspaces for different subcenter sets do not intersect. Hence, (19) implies that only the subcenters in C^(i) can explain the projection of x onto the C^(i) subspace.

In the example depicted in Fig. 3, the input features are simply partitioned (17), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that C ≡ [C^(1) C^(2)] is block diagonal, with 2×5 blocks, denoted D^(1) and D^(2). The quantization error therefore becomes

$\|x - Cb\|_2^2 = \left\| \begin{bmatrix} z^{(1)} \\ z^{(2)} \end{bmatrix} - \begin{bmatrix} D^{(1)} & 0 \\ 0 & D^{(2)} \end{bmatrix} \begin{bmatrix} b^{(1)} \\ b^{(2)} \end{bmatrix} \right\|_2^2 = \|z^{(1)} - D^{(1)} b^{(1)}\|_2^2 + \|z^{(2)} - D^{(2)} b^{(2)}\|_2^2$.   (20)

In words, the squared quantization error is the sum of the squared errors on the subspaces. One can therefore solve for the binary coefficients of the subcenter sets independently.

In the general case, assuming (19) is satisfied, C can be expressed as a product RD, where R has orthonormal columns, and D is block diagonal; i.e., C = RD where

$R = [R^{(1)}, \cdots, R^{(m)}]$, and $D = \operatorname{blockdiag}(D^{(1)}, \ldots, D^{(m)})$,

and hence C^(i) = R^(i) D^(i). With s_i ≡ rank(C^(i)), it follows that D^(i) ∈ R^{s_i×h} and R^(i) ∈ R^{p×s_i}. Clearly, Σ_i s_i ≤ p, because of the orthogonality constraints.

Replacing C^(i) with R^(i) D^(i) in the RHS of (18), we find

$\|x - Cb\|_2^2 = \sum_{i=1}^{m} \|z^{(i)} - D^{(i)} b^{(i)}\|_2^2 + \|R_\perp^T x\|_2^2$,   (21)

where z^(i) ≡ R^{(i)T} x, and R_⊥ is the orthogonal complement of R. This shows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set.

To find the b^(i) that minimizes ‖z^(i) − D^(i) b^(i)‖_2^2, we perform NN search with z^(i) against the h s_i-dim vectors in D^(i). This entails a cost of O(h s_i). Fortunately, all the elements of b can be found very efficiently, in O(hs), where s ≡ Σ_i s_i. If we also include the cost of rotating x to obtain each z^(i), the total encoding cost is O(ps + hs), i.e., O(p^2 + hp). Alternatively, one could perform NN search in the p-dim C^(i)'s to find the b^(i)'s, which costs O(mhp). Table 1 summarizes the models in terms of their number of centers, index size, and cost of storage and encoding.

method   | #centers | #bits   | cost                  | cost (s)
---------|----------|---------|-----------------------|---------------------------
ok-means | 2^m      | m       | O(mp)                 | O(mp)
ck-means | h^m      | m log h | O(p^2+hp) or O(mhp)   | O(ps+hs) or O(ps+mhs)
k-means  | k        | log k   | O(kp)                 | O(ps+ks)

Table 1. A summary of ok-means, ck-means, and k-means in terms of number of centers, number of bits needed for indices (i.e., log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m.
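Under the orthogonality constraint (19), encoding reduces to m small NN searches in the projected subspaces, as the following illustrative NumPy sketch shows (the list-of-matrices layout and names are our own assumptions):

```python
import numpy as np

def encode_ck(x, R_list, D_list):
    """Encode x as m subcenter indices (Eq. 21).
    R_list[i]: p x s_i with orthonormal columns; D_list[i]: s_i x h subcenter projections."""
    code = []
    for R_i, D_i in zip(R_list, D_list):
        z_i = R_i.T @ x                                  # project onto the ith subspace
        d2 = ((D_i - z_i[:, None]) ** 2).sum(axis=0)     # distances to the h subcenters
        code.append(int(d2.argmin()))                    # 1-of-h assignment, stored as an index
    return code   # total cost O(ps + hs), with s = sum_i s_i
```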
4.1. Learning ck-means

We can re-write the ck-means objective (18) in matrix form with the Frobenius norm; i.e.,

$\ell_{ck\text{-means}}(R, D, B) = \|X - RDB\|_F^2$   (22)

where the columns of X and B comprise the data points and the subcenter assignment coefficients. The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In practice, we also let each rotation matrix R^(i) have the same number of columns, i.e., s_i = s/m. The outputs are the matrices {R^(i)} and {D^(i)} that provide a local minimum of (22).

Learning begins with the initialization of R and D, followed by iterative coordinate descent in B, D, and R:

• Optimize B and D: With R fixed, the objective is given by (21), where ‖R_⊥^T X‖_F^2 is constant. Given data projections Z^(i) ≡ R^{(i)T} X, to optimize for B and D we perform one step of k-means for each subcenter set:

  – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B^(i):
    $B^{(i)} \leftarrow \operatorname{argmin}_{B^{(i)}} \|Z^{(i)} - D^{(i)} B^{(i)}\|_F^2$

  – Update:
    $D^{(i)} \leftarrow \operatorname{argmin}_{D^{(i)}} \|Z^{(i)} - D^{(i)} B^{(i)}\|_F^2$

• Optimize R: Placing the D^(i)'s along the diagonal of D, as in (20), and concatenating the B^(i)'s as rows of B, i.e., B^T = [B^{(1)T}, ..., B^{(m)T}], the optimization of R reduces to the orthogonal Procrustes problem:
    $R \leftarrow \operatorname{argmin}_{R} \|X - RDB\|_F^2$.

In experiments below, R ∈ R^{p×p}, and rank(C) ≤ p is unconstrained. For high-dimensional data where rank(X) ≪ p, for efficiency it may be useful to constrain rank(C).
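One coordinate-descent iteration of this procedure might look like the following simplified NumPy sketch, assuming R ∈ R^{p×p} and equal-sized subspaces with p divisible by m; it is an illustration of the steps above, not the authors' implementation.

```python
import numpy as np

def ckmeans_iteration(X, R, D_list, h):
    """One pass of ck-means coordinate descent (Sec. 4.1).
    X: p x n data, R: p x p orthogonal, D_list[i]: s x h subcenter projections."""
    m = len(D_list)
    p, n = X.shape
    s = p // m
    Z = R.T @ X
    B_rows = []
    for i, D_i in enumerate(D_list):
        Z_i = Z[i * s:(i + 1) * s, :]                       # s x n projections for set i
        # Assignment: nearest subcenter per point
        d2 = (Z_i ** 2).sum(0)[None, :] - 2 * D_i.T @ Z_i + (D_i ** 2).sum(0)[:, None]
        idx = d2.argmin(axis=0)
        # Update: each subcenter becomes the mean of its assigned projections
        for j in range(h):
            sel = idx == j
            if sel.any():
                D_i[:, j] = Z_i[:, sel].mean(axis=1)
        B_i = np.zeros((h, n))
        B_i[idx, np.arange(n)] = 1.0                        # 1-of-h encoding
        B_rows.append(B_i)
    # Procrustes update of R given the block-diagonal D and stacked B
    DB = np.vstack([D_i @ B_i for D_i, B_i in zip(D_list, B_rows)])   # p x n
    U, _, Vt = np.linalg.svd(X @ DB.T)
    R = U @ Vt
    return R, D_list
```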
5. Relations and related work
As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary
hashing [5], and PQ for vector quantization [7]. It is instructive to specify these relationships in some detail.
Product Quantization vs. Cartesian k-means.
PQ first splits the input vectors into m disjoint sub-vectors,
and then quantizes each sub-vector separately using a subquantizer. Thus PQ is a special case of ck-means where
the rotation R is not optimized; rather, R is fixed in
both learning and retrieval. This is important because the
sub-quantizers act independently, thereby ignoring inter-subspace statistical dependence. Thus the selection of subspaces is critical for PQ [7, 8]. Jégou et al. [8] suggest using
PCA, followed by random rotation, before applying PQ to
high-dimensional vectors. Finding the rotation to minimize
quantization error is mentioned in [8], but it was considered
too difficult to estimate.
Here we show that one can find a rotation to minimize
the quantization error. The ck-means learning algorithm optimizes sub-quantizers in the inner loop, but they interact in
an outer loop that optimizes the rotation (Sec. 4.1).
Iterative Quantization vs. Orthogonal k-means.
ITQ [5] is a variant of locality-sensitive hashing, mapping
data to binary codes for fast retrieval. To extract m-bit
codes, ITQ first zero-centers the data matrix to obtain X ′ .
PCA is then used for dimensionality reduction, from p down
to m dimensions, after which the subspace representation is
randomly rotated. The composition of PCA and the random
rotation can be expressed as W^T X′, where W ∈ R^{p×m}. ITQ then solves for the m×m rotation matrix, R, that minimizes

$\ell_{ITQ}(B', R) = \|R^T W^T X' - B'\|_F^2$, s.t. $R^T R = I_{m\times m}$,   (23)

where B′ ∈ {−1, +1}^{m×n}.
ITQ rotates the subspace representation of the data to
match the binary codes, and then minimizes quantization
error within the subspace. By contrast, ok-means maps the
binary codes into the original input space, and then considers both the quantization error within the subspace and the
out-of-subspace projection error. A key difference is the inclusion of R⊥ X ′ 2f in the ok-means objective (9). This is
important since one can often reduce quantization errors by
projecting out significant portions of the feature space.
Another key difference between ITQ and ok-means is
the inclusion of non-uniform scaling in ok-means. This is
important when the data are not isotropic, and contributes
to the marked improvement in recall rates in Fig. 2.
Orthogonal k-means vs. Cartesian k-means.
We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. To this end, let C^(i) = [c_1^(i) c_2^(i)], and let b^(i) = [b_1^(i), b_2^(i)]^T be the selection vector for the ith subcenter matrix. Since b^(i) is a 1-of-2 encoding ((0, 1)^T or (1, 0)^T), it follows that

$b_1^{(i)} c_1^{(i)} + b_2^{(i)} c_2^{(i)} = \frac{c_1^{(i)} + c_2^{(i)}}{2} + b'_i\, \frac{c_1^{(i)} - c_2^{(i)}}{2}$,   (24)

where b′_i ≡ b_1^(i) − b_2^(i) ∈ {−1, +1}. With the following setting of the ok-means parameters,

$\mu = \sum_{i=1}^{m} \frac{c_1^{(i)} + c_2^{(i)}}{2}$, and $C = \Big[ \frac{c_1^{(1)} - c_2^{(1)}}{2}, \ldots, \frac{c_1^{(m)} - c_2^{(m)}}{2} \Big]$,

it should be clear that Σ_i C^(i) b^(i) = µ + Cb′, where b′ ∈ {−1, +1}^m, and b′_i is the ith bit of b′, used in (24). Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions.

6. Experiments

6.1. Euclidean ANN search

Euclidean ANN is a useful task for comparing quantization techniques. Given a database of ck-means indices, and a query, we use the Asymmetric and Symmetric ck-means quantizer distances, denoted AQD_ck and SQD_ck. The AQD_ck between a query x and a binary index b ≡ [b^{(1)T}, ..., b^{(m)T}]^T, derived in (21), is

$\mathrm{AQD}_{ck}(x, b) = \sum_{i=1}^{m} \|z^{(i)} - D^{(i)} b^{(i)}\|_2^2 + \|R_\perp^T x\|_2^2$.   (25)

Here, ‖z^(i) − D^(i) b^(i)‖_2^2 is the distance between the ith projection of x, i.e., z^(i), and a subcenter projection from D^(i) selected by b^(i). Given a query, these distances for each i, and all h possible values of b^(i), can be pre-computed and stored in a query-specific h×m lookup table. Once created, the lookup table is used to compare all database points to the query. So computing AQD_ck entails m lookups and additions, somewhat more costly than Hamming distance. The last term on the RHS of (25) is irrelevant for NN search. Since PQ is a special case of ck-means with pre-defined subspaces, the same distances are used for PQ (cf. [7]).

The SQD_ck between binary codes b_1 and b_2 is given by

$\mathrm{SQD}_{ck}(b_1, b_2) = \sum_{i=1}^{m} \|D^{(i)} b_1^{(i)} - D^{(i)} b_2^{(i)}\|_2^2$.   (26)

Since b_1 and b_2 are 1-of-h encodings, an m×h×h lookup table can be created to store all pairwise sub-distances.
Figure 4. Euclidean NN recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets (each with 64-bit encoding, k = 2^64). Curves compare ck-means (AD), PQ (AD), ck-means (SD), PQ (SD), ok-means (AH), and ITQ (AH); the dashed curves use symmetric distance. (AH ≡ AQD_ok, SD ≡ SQD_ck, AD ≡ AQD_ck)
While the cost of computing SQD_ck is the same as that of AQD_ck, SQD_ck could also be used to estimate the distance between the indexed database entries, for diversifying the retrieval results, or to detect near duplicates, for example.
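To illustrate the lookup-table computation behind (25), the following sketch (our own; the dense-array layout and names are assumptions) precomputes the query-specific h×m table and scores every database code with m lookups and additions.

```python
import numpy as np

def aqd_ck_scores(x, R, D, codes):
    """Asymmetric ck-means distances from query x to all database codes (Eq. 25,
    dropping the constant out-of-subspace term).
    R: p x p orthogonal; D: m x s x h subcenter projections; codes: n x m integer indices."""
    m, s, h = D.shape
    Z = (R.T @ x).reshape(m, s)                       # per-subspace projections z^(i)
    # Query-specific table: table[i, j] = ||z^(i) - D^(i)[:, j]||^2
    table = ((Z[:, :, None] - D) ** 2).sum(axis=1)    # m x h
    # Each database point then costs m lookups and additions
    return table[np.arange(m)[None, :], codes].sum(axis=1)   # length-n scores
```

Building the table costs O(hs) per query, independent of the database size n, after which each stored code is scored with m table lookups.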
Datasets. We use the 1M SIFT dataset, as in Sec. 3.3, along
with two others, namely, 1M GIST (960D features) and 1B
SIFT, both comprising disjoint sets of training, base and
test vectors. 1M GIST has 500K training, 1M base, and 1K
query vectors. 1B SIFT has 100M training, 1B base, and
10K query points. In each case, recall rates are averaged
over queries in the test set for a database populated from the
base set. For expedience, we use only the first 1M training
points for the 1B SIFT experiments.
Parameters. In ANN experiments below, for both ck-means and PQ, we use m = 8 and h = 256. Hence the number of clusters is k = 256^8 = 2^64, so 64 bits are used as database indices. Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower.
Initialization. To initialize the D^(i)'s for learning, as in k-means, we simply begin with random samples from the set of Z^(i)'s (see Sec. 4.1). To initialize R we consider the
different methods that Jégou et al. [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1)
natural: sub-vectors comprise consecutive dimensions, (2)
structured: dimensions with the same index modulo 8 are
grouped, and (3) random: random permutations are used.
For PQ in the experiments below, we use the orderings
that produced the best results in [7], namely, the structured
ordering for 960D GIST, and the natural ordering for 128D
SIFT. For learning ck-means, R is initialized to the identity for the SIFT corpora. For 1M GIST, where the PQ ordering is
significant, we consider all three orderings to initialize R.
Results. Fig. 4 shows Recall@R plots for ck-means and
PQ [7] with symmetric and asymmetric distances (SD ≡
SQDck and AD ≡ AQDck ) on the 3 datasets. The horizontal axis represents the number of retrieved items, R, on
a log-scale. The results consistently favor ck-means. On
the high-dimensional GIST data, ck-means with AD significantly outperforms other methods; even ck-means with SD performs on par with PQ with AD. On 1M SIFT, the Recall@10 numbers for PQ and ck-means, both using AD, are 59.9% and 63.7%. On 1B SIFT, the Recall@100 numbers are 56.5% and 64.9%. As expected, with increasing dataset size, the difference between methods is more significant.

Figure 5. PQ and ck-means results on 1M GIST (64-bit encoding), using natural (1), structured (2), and random (3) orderings to define the (initial) subspaces; all curves use asymmetric distance (AD).
In 1B SIFT, each feature vector is 128 bytes, hence a
total of 119 GB. Using any method in Fig. 4 (including ck-means) to index the database into 64 bits, this storage cost
reduces to only 7.5 GB. This allows one to work with much
larger datasets. In the experiments we use linear scan to find
the nearest items according to quantizer distances. For NN
search using 10K SIFT queries on 1B SIFT this takes about
8 hours for AD and AH and 4 hours for Hamming distance
on a 2 × 4-core computer. Search can be sped up significantly by using a coarse initial quantization and an inverted file structure for AD and AH, as suggested by [7, 1], and by using the multi-index hashing method of [13] for Hamming distance. In this paper we did not implement these efficiencies as we focus primarily on the quality of quantization.
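The storage figures above follow directly from the code size; a quick back-of-the-envelope check (our own arithmetic, with 1 GB taken as 2^30 bytes):

```python
n = 10 ** 9                                      # 1B SIFT vectors
raw = n * 128                                    # 128 bytes per uncompressed SIFT descriptor
coded = n * 8                                    # one 64-bit (8-byte) ck-means/PQ index per vector
print(raw / 2 ** 30, "GB raw")                   # ~119.2 GB
print(coded / 2 ** 30, "GB of 64-bit codes")     # ~7.45 GB, matching the ~7.5 GB quoted above
```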
Fig. 5 compares ck-means to PQ when R in ck-means
is initialized using the 3 orderings of [7]. It shows that ck-means is superior in all cases. Interestingly, it also
shows that despite the non-convexity of the optimization objective, ck-means learning tends to find similarly good encodings under different initial conditions. Finally, Fig. 6
compares the methods under different code dimensionality.
Figure 6. PQ and ck-means results on 1M GIST using different numbers of bits for encoding (64, 96, and 128). In all cases asymmetric distance is used.

6.2. Learning visual codebooks

While the ANN search tasks above require too many clusters for k-means, it is interesting to compare k-means and ck-means on a task with a moderate number of clusters. To this end we consider codebook learning for bag-of-words models [3, 10]. We use ck-means with m = 2 and h = √k, and hence k centers. The main advantage of ck-means here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate k-means [14] and hierarchical k-means [12]. Here we only compare to exact k-means.

CIFAR-10 [9] comprises 50K training and 10K test images (32×32 pixels). Each image is one of 10 classes (airplane, bird, car, cat, deer, dog, frog, horse, ship, and truck). We use publicly available code from Coates et al. [2], with changes to the codebook learning and cluster assignment modules. Codebooks are built on 6×6 whitened color image patches. One histogram is created per image quadrant, and a linear SVM is applied to 4k-dim feature vectors.

Recognition accuracy rates on the test set for different models and k are given in Table 2. Despite having fewer parameters, ck-means performs on par with or better than k-means. This is consistent for different initializations of the algorithms. Although k-means has higher fidelity than ck-means, ck-means, with fewer parameters, may be less susceptible to overfitting. Table 2 also compares with the approach of [19], where PQ without learning a rotation is used for clustering features. As expected, learning the rotation has a significant impact on recognition rates, outperforming all three initializations of PQ.

Codebook                 | Accuracy
-------------------------|---------
PQ (k = 40^2)            | 75.9%
ck-means (k = 40^2)      | 78.2%
k-means (k = 1600) [2]   | 77.9%
PQ (k = 64^2)            | 78.2%
ck-means (k = 64^2)      | 79.7%
k-means (k = 4000) [2]   | 79.6%

Table 2. Recognition accuracy on the CIFAR-10 test set using different codebook learning algorithms.
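For this codebook setting (m = 2, h = √k), cluster assignment per patch costs O(√k · p) rather than O(k · p). An illustrative sketch of that assignment step, assuming two learned subcenter-projection matrices D1 and D2 of size s×h each:

```python
import numpy as np

def assign_codeword(x, R, D1, D2):
    """Map a patch descriptor x to one of k = h*h codewords using two
    sub-quantizers of size h = sqrt(k) each (the Sec. 6.2 setting, m = 2)."""
    h = D1.shape[1]
    s = D1.shape[0]
    z = R.T @ x
    i1 = ((D1 - z[:s, None]) ** 2).sum(0).argmin()    # O(h) = O(sqrt(k)) search
    i2 = ((D2 - z[s:, None]) ** 2).sum(0).argmin()    # O(h) search
    return i1 * h + i2                                # joint codeword index in [0, k)
```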
7. Conclusions

We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the cluster centers such that the number of centers is super-linear in the number of parameters. The method is also shown to be a generalization of the ITQ algorithm and Product Quantization. In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. An implementation of the method is available at https://github.com/norouzi/ckmeans.
References
[1] A. Babenko and V. Lempitsky. The inverted multi-index.
CVPR, 2012.
[2] A. Coates, H. Lee, and A. Ng. An analysis of single-layer
networks in unsupervised feature learning. AISTATS, 2011.
[3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray.
Visual categorization with bags of keypoints. ECCV Workshop Statistical Learning in Computer Vision, 2004.
[4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013.
[5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011.
[6] A. Gordo and F. Perronnin. Asymmetric distances for binary
embeddings. CVPR, 2011.
[7] H. Jégou, M. Douze, and C. Schmid. Product quantization
for nearest neighbor search. IEEE Trans. PAMI, 2011.
[8] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation.
CVPR, 2010.
[9] A. Krizhevsky. Learning multiple layers of features from
tiny images. MSc Thesis, Univ. Toronto, 2009.
[10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of
features: Spatial pyramid matching for recognizing natural
scene categories. CVPR, 2006.
[11] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans.
IT, 28(2):129–137, 1982.
[12] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006.
[13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming space with multi-index hashing. CVPR, 2012.
[14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial
matching. CVPR, 2007.
[15] P. Schönemann. A generalized solution of the orthogonal
Procrustes problem. Psychometrika, 31, 1966.
[16] J. Sivic and A. Zisserman. Video Google: A text retrieval
approach to object matching in videos. ICCV, 2003.
[17] J. Tenenbaum and W. Freeman. Separating style and content
with bilinear models. Neural Comp., 12:1247–1283, 2000.
[18] A. Torralba, K. Murphy, and W. Freeman. Sharing visual
features for multiclass and multiview object detection. IEEE
T. PAMI, 29(5):854–869, 2007.
[19] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering
for fast construction of unbiased visual vocabulary. The Era
of Interactive Media, 2012.