Relative Attribute Guided Dictionary Learning: Mohammadreza Babaee, Thomas Wolf, Gerhard Rigoll
with the regularization parameter λ1. In order to take the relative attributes into account, the objective function can be extended with an additional term L(X):

\[
\langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 L(X) \tag{2}
\]

As additional information, the strength of M predefined attributes, the so-called relative attributes [11], is available for the input signals. These attributes, in contrast to binary labels, represent the strength of a property instead of its presence. The idea in learning relative attributes, assuming there are M attributes A = {a_m}, is to learn M ranking functions r_m.

The work of Parikh et al. [11] provides the convenient RankSVM function, which returns the ranking vector w_m for a set of input images and their relative ordering. This information can further be used in the objective function in Eq. (2).

The relative attribute representations of the signals are collected in a matrix Q with the elements q_im = r_m(y_i), which contain the strength of the (relative) attributes of all signals in Y. In order to find the transformation of Y into Q, we apply the RankSVM function known from [11] to the original input signals and obtain the weighting matrix W = [w_1; w_2; ...; w_M]. The objective is to find a matrix A that transforms the sparse representations of the signals into their corresponding relative attribute representations Q, with a minimum distance between w_m y_i and a_m x_i:

\[
\arg\min_{A} \|Q - AX\|_2^2 = \arg\min_{A} \|WY - AX\|_2^2 . \tag{6}
\]
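To make the mapping concrete, the following minimal NumPy sketch shows how the weighting matrix W and the attribute targets Q = WY of Eq. (6) could be assembled once the ranking vectors w_m are available. It assumes the ranking vectors have already been learned (e.g. with RankSVM [11]); the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def attribute_targets(Y, ranking_vectors):
    """Stack ranking vectors into W and map the signals into the
    relative-attribute space: Q = W Y, with q_{mi} = w_m^T y_i.

    Y               : (d, N) array, one signal per column.
    ranking_vectors : list of M arrays of length d (one w_m per attribute),
                      assumed to be learned beforehand, e.g. with RankSVM [11].
    """
    W = np.vstack([np.asarray(w).reshape(1, -1) for w in ranking_vectors])  # (M, d)
    Q = W @ Y                                                               # (M, N)
    return W, Q
```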
This attribute term is combined with the dictionary learning objective of Eq. (2) in Eqs. (7) and (8), the latter containing the term

\[
\lambda_2 \|W D X - A X\|_2^2 . \tag{8}
\]

The third term of Eq. (8) is minimized if A = WD. This information can be used to eliminate A from Eq. (7) to arrive at the final objective function

\[
\langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 \|W (Y - DX)\|_2^2 . \tag{9}
\]

In order to find a closed-form solution for (9) and to reduce the computational complexity, the term $\|X\|_1$ is replaced with $\|X\|_2^2$. This can be justified (as in [10]) because the goal is to learn a discriminative dictionary and not to obtain sparse signals. However, once the dictionary is learned, a sparse representation can be obtained by orthogonal matching pursuit [17]. The final objective function is shown in Eq. (10):

\[
\langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_2^2 + \lambda_2 \|W (Y - DX)\|_2^2 . \tag{10}
\]
This equation is not a jointly convex optimization problem, so X and D are optimized alternately. The update rules for D and X are found by differentiating the objective function in (11) and setting the derivatives to zero:

\[
O = \|Y - DX\|_2^2 + \lambda_1 \|X\|_2^2 + \lambda_2 \|W (Y - DX)\|_2^2 \tag{11}
\]

\[
\frac{\partial O}{\partial D} = -2 (Y - DX) X^\top - 2 \lambda_2 W^\top W (Y - DX) X^\top = 0
\;\Rightarrow\; D = Y X^\top (X X^\top)^{-1} \tag{12}
\]

\[
\frac{\partial O}{\partial X} = -2 D^\top (Y - DX) + 2 \lambda_1 X - 2 \lambda_2 D^\top W^\top (W Y - W D X) = 0
\;\Rightarrow\; X = \big(D^\top D + \lambda_1 I + \lambda_2 D^\top W^\top W D\big)^{-1} \big(D^\top Y + \lambda_2 D^\top W^\top W Y\big) \tag{13}
\]

The complete procedure is summarized in Algorithm 1.

Algorithm 1 Relative Attribute Guided Dictionary Learning
Require: Original signals Y, sets of ordered (O_m) and unordered (S_m) images
Ensure: Dictionary D
1: W ← RankSVM(Y, O_m, S_m)
2: D_init ← rndperm(Y)
3: D, X ← KSVD(D_init, Y)
4: for i = 0 to convergence do
5:   D ← Y X^⊤ (X X^⊤)^{-1}
6:   D ← normcol(D)
7:   X ← (D^⊤ D + λ1 I + λ2 D^⊤ W^⊤ W D)^{-1} (D^⊤ Y + λ2 D^⊤ W^⊤ W Y)
8: end for
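The core loop of Algorithm 1 follows directly from the closed-form updates (12) and (13). The NumPy sketch below is a minimal illustration under two assumptions: the ranking matrix W is already given (step 1 of Algorithm 1), and the K-SVD initialization of step 3 is replaced by a simple random-column initialization. It is not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def rag_dictionary_learning(Y, W, n_atoms, lam1, lam2, n_iter=50, seed=0):
    """Sketch of the alternating updates of Algorithm 1 (Eqs. (12)-(13)).

    Y : (d, N) matrix of training signals, one column per signal.
    W : (M, d) matrix of stacked ranking vectors (assumed given).
    """
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    # D_init <- rndperm(Y): random training signals as initial atoms
    D = Y[:, rng.choice(N, size=n_atoms, replace=False)].copy()
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)   # normcol
    I = np.eye(n_atoms)
    WtW = W.T @ W                                       # (d, d)
    for _ in range(n_iter):
        # X update, Eq. (13): closed form, since ||X||_1 was relaxed to ||X||_2^2
        A = D.T @ D + lam1 * I + lam2 * D.T @ WtW @ D   # (K, K)
        B = D.T @ Y + lam2 * D.T @ WtW @ Y              # (K, N)
        X = np.linalg.solve(A, B)
        # D update, Eq. (12), followed by column normalization (normcol)
        D = Y @ X.T @ np.linalg.inv(X @ X.T)
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, X
```

In practice one would stop the loop once the change in the objective O of Eq. (11) falls below a threshold rather than after a fixed number of iterations.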
4.1. Test Datasets

In order to assess the quality of the learned dictionary, we propose a clustering task for three publicly available datasets, namely Public Figure Face (PubFig) [1], Outdoor Scene Recognition (OSR) [2] and Shoes [3]. These sets have been chosen since they are the only ones known to us with annotated relative attributes. Some sample images of each dataset are presented in Figure 2. Clustering is chosen over classification since the proposed algorithm has no information about the ground-truth labels of the data and therefore no classifier can be trained.

Fig. 2. Example images from the PubFig, OSR and Shoes datasets.

Additionally, tests were conducted to find the optimal values for λ1 and λ2. To this end, different fixed values were chosen for λ1 while iterating over candidates for λ2. The chosen values are λ1 = 0.01 and λ2 = 1 for the PubFig dataset, λ1 = 0.1 and λ2 = 0.01 for the OSR dataset, and λ2 = 0.001 for the Shoes dataset.
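For illustration only, the following sketch shows how such a two-dimensional sweep over λ1 and λ2 could be run. The candidate grids, the variables Y_train, labels_train and n_classes, the selection by nMI, and the reuse of the rag_dictionary_learning sketch given after Algorithm 1 are all assumptions, not the authors' protocol.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical candidate grids; the paper only states that fixed values of
# lambda_1 were combined with a sweep over lambda_2 candidates.
lam1_grid = [0.001, 0.01, 0.1, 1.0]
lam2_grid = [0.001, 0.01, 0.1, 1.0, 10.0]

best_params, best_nmi = None, -np.inf
for lam1, lam2 in itertools.product(lam1_grid, lam2_grid):
    # rag_dictionary_learning is the sketch given after Algorithm 1 above.
    D, X = rag_dictionary_learning(Y_train, W, n_atoms=80, lam1=lam1, lam2=lam2)
    pred = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X.T)
    nmi = normalized_mutual_info_score(labels_train, pred)
    if nmi > best_nmi:
        best_params, best_nmi = (lam1, lam2), nmi
```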
4.2. Evaluation Metrics

To measure the discriminative quality of the learned representation, the k-means algorithm [18] is applied and the accuracy (AC) and the normalized Mutual Information (nMI) metrics [19] are computed. Furthermore, the sparse representations of the signals are obtained according to Eq. (14) with the help of the OMP-Box Matlab toolbox [17], where the reconstruction error from the training phase is chosen as ε. Those signals can then be used for the clustering.
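A sketch of this evaluation pipeline is given below, using scikit-learn and SciPy in place of the OMP-Box Matlab toolbox, so it is an illustrative approximation of the setup rather than the exact one; the helper names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """AC: best one-to-one matching between clusters and classes (Hungarian)."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((labels_pred == c) & (labels_true == k))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / labels_true.size

def evaluate(D, Y_test, labels_true, n_classes, eps):
    """Sparse-code the signals over D with OMP (tolerance eps), cluster the
    codes with k-means [18], and report AC and nMI [19]."""
    omp = OrthogonalMatchingPursuit(tol=eps, fit_intercept=False)
    omp.fit(D, Y_test)                     # D: (d, K) dictionary, Y_test: (d, N)
    X = omp.coef_.T                        # (K, N) sparse codes, one per signal
    pred = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X.T)
    ac = clustering_accuracy(labels_true, pred)
    nmi = normalized_mutual_info_score(labels_true, pred)
    return ac, nmi
```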
4.3. Results

As a benchmark for the results, different supervised and unsupervised (discriminative) dictionary learning techniques are used, namely (1) KSVD [8] and (2) SRC [6] as unsupervised techniques, and (3) LC-KSVD [9], (4) FDDL [16] and (5) SVGDL [10] as supervised techniques. The results were compared by their performance with full label information for varying dictionary sizes. Figure 3 shows the behavior of the algorithms for an increasing dictionary size with the complete training data available. The dictionary sizes used were [16, 40, 80, 120, 160, 240] for the PubFig and OSR datasets and [20, 50, 100, 140] for the Shoes dataset, which corresponds to [2, 5, 10, 15, 20, 30] and [2, 5, 10, 14] atoms per class, respectively. The number of atoms per class is constrained by the partition of the data into training and testing sets (for the Shoes dataset, one class only includes 14 training samples). One should notice that the FDDL algorithm cannot use all training data, since the dictionary size restricts the number of training samples it can use. Therefore, only in the last test case does the algorithm use the complete training information. The results show that for the proposed algorithm the accuracy increases with the dictionary size, up to values exceeding those of the compared algorithms. However, for the OSR and Shoes datasets and an increasing dictionary size, SVGDL and FDDL produce comparable results.

Additionally, the runtime of the training phase was analyzed, where the proposed algorithm outperformed all contestants with an average training time of 1.75 s.

Fig. 3. Clustering results for all three datasets for increasing dictionary sizes. The first and second columns represent the accuracy (AC) and the normalized MI (nMI), respectively. The first, second, and third rows show the results for the PubFig, OSR, and Shoes datasets, respectively.
5. CONCLUSION

We have presented a novel discriminative dictionary learning algorithm that uses relative attributes as semantic information instead of binary labels. The relative attributes provide much richer semantic information to improve the discriminative property of the dictionary and, eventually, the sparse representation of the input signal. Instead of using the relative attributes of the images directly, we use the learned ranking functions in the learning process. The ranking functions transform the original features into a relative attribute space, and therefore we aim to transform the sparse signal linearly into this attribute space. This is achieved by adding an additional loss term to the objective function of a standard dictionary learning problem. The new method was applied to three different datasets for face, scene and object recognition. The results are not only promising and comparable to modern DDL approaches; the method also has a low computational cost for learning the dictionary, outperforming all of the compared approaches in training time. For future work, one could design a classifier based on relative attributes to investigate the benefit of this semantic information for classification tasks.
6. REFERENCES

[1] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Attribute and simile classifiers for face verification," in IEEE International Conference on Computer Vision (ICCV), Oct 2009.

[2] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[3] A. Kovashka, D. Parikh, and K. Grauman, "WhittleSearch: Image search with relative attribute feedback," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012.

[4] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML). ACM, 2009, pp. 689–696.

[5] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer Science & Business Media, 2010.

[6] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 2, pp. 210–227, 2009.

[7] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1999.

[8] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," Signal Processing, IEEE Transactions on, vol. 54, no. 11, pp. 4311–4322, 2006.

[9] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2010, pp. 2691–2698.

[10] S. Cai, W. Zuo, L. Zhang, X. Feng, and P. Wang, "Support vector guided dictionary learning," in Computer Vision–ECCV 2014, pp. 624–639. Springer, 2014.

[11] D. Parikh and K. Grauman, "Relative attributes," in Computer Vision (ICCV), IEEE International Conference on. IEEE, 2011, pp. 503–510.

[12] K. Engan, S. O. Aase, and J. Husoy, "Frame based signal compression using method of optimal directions (MOD)," in Circuits and Systems, 1999. ISCAS'99. Proceedings of the 1999 IEEE International Symposium on. IEEE, 1999, vol. 4, pp. 1–4.

[13] I. Ramirez, P. Sprechmann, and G. Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2010, pp. 3501–3508.

[14] S. Gao, I.-H. Tsang, and Y. Ma, "Learning category-specific dictionary and shared dictionary for fine-grained image categorization," Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 623–634, 2014.

[15] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2011, pp. 1697–1704.

[16] M. Yang, D. Zhang, and X. Feng, "Fisher discrimination dictionary learning for sparse representation," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 543–550.

[17] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit," CS Technion, vol. 40, no. 8, pp. 1–15, 2008.

[18] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Oakland, CA, USA, 1967, vol. 1, pp. 281–297.

[19] M. Babaee, R. Bahmanyar, G. Rigoll, and M. Datcu, "Farness preserving non-negative matrix factorization," in Image Processing (ICIP), IEEE International Conference on. IEEE, 2014, pp. 3023–3027.