
RELATIVE ATTRIBUTE GUIDED DICTIONARY LEARNING

Mohammadreza Babaee, Thomas Wolf, Gerhard Rigoll

Institute for Human-Machine Communication, TU München


Theresienstraße 90, 80333 München, Germany

ABSTRACT

A discriminative dictionary learning algorithm is proposed to find sparse signal representations, using relative attributes as the available semantic information. In contrast, existing (discriminative) dictionary learning (DDL) approaches mostly utilize binary label information to enhance the discriminative property of the signal reconstruction residual, the sparse coding vectors, or both. Compared to binary attributes or labels, relative attributes contain richer semantic information, since the data is annotated with the attributes' strength. In this paper we use the relative attributes of the training data indirectly to learn a discriminative dictionary. More precisely, we incorporate a ranking function for the attributes into the dictionary learning process. In order to assess the quality of the obtained signals, we apply k-means clustering and measure the clustering performance. Experimental results on three datasets, namely PubFig [1], OSR [2] and Shoes [3], confirm that the proposed approach outperforms state-of-the-art label-based dictionary learning algorithms.

Index Terms— Relative Attributes, Dictionary Learning, Clustering

1. INTRODUCTION

The concept of sparse coding has become very popular in many fields of engineering such as signal analysis and processing [4], clustering and classification [5], and face recognition [6]. The idea behind a sparse representation is to approximate a signal by a linear combination of a small set of elements from a so-called over-complete dictionary. The coding vector specifying the linear combination is the sparse representation of the original input signal.

Considering a set of n input signals Y ∈ R^{p×n}, the goal is to find a dictionary D = [d_1, d_2, ..., d_k] ∈ R^{p×k} and the sparse representation X ∈ R^{k×n} such that Y ≈ DX, where the term over-complete indicates k > p, i.e., more atoms than signal dimensions. Dictionaries can either be predefined, e.g., in the form of wavelets [7], or be learned from observations [8, 9, 10]. Many approaches have also been developed to impose discriminative capabilities onto the dictionary learning process. However, those methods use binary label information to acquire discriminative behavior. In this work we present an approach that utilizes relative attributes instead of binary labels to enhance the discriminative property of the dictionary. Relative attributes, as described in [11], represent the strength of a set of predefined attributes rather than only their presence. This way of describing often seems more natural to humans. For example, is a gazelle a big animal? That is hard to say. In the context of relative attributes one can say a gazelle is bigger than a cat but smaller than an elephant. Just as previous discriminative dictionary learning approaches use binary label information to enhance their discriminative capabilities, we incorporate relative attributes into the dictionary learning process to provide it with semantic information.

In Section 2, related work in the field of dictionary learning and relative attributes is presented. Section 3 describes the proposed algorithm, and Section 4 presents the experiments and results. The paper concludes with a discussion and summary in Section 5.

2. RELATED WORK

The first approaches in the field of (reconstructive) dictionary learning are the K-SVD algorithm [8] and the Method of Optimal Directions (MOD) [12], where no semantic information is used in the learning process. A further example of the use of sparse representations is sparse representation based classification (SRC) [6], in which the dictionary is built directly from the training data. A large subfield of dictionary learning is discriminative dictionary learning (DDL), where either the discriminative property of the signal reconstruction residual or the discriminative property of the sparse representation itself is enhanced. Approaches with a focus on the reconstruction residual include the work of Ramirez et al. [13], which uses a structured incoherence term to find independent sub-dictionaries for each class, and the work of Gao et al. [14], where sub-dictionaries for the individual classes are learned together with a dictionary shared across all classes.

Methods aiming at discriminative coding vectors learn the dictionary and a classifier simultaneously. In the work of Zhang et al. [9], the K-SVD algorithm is extended by a linear classifier. Jiang et al. [15] included an additional discriminative regularizer to arrive at the so-called label consistent K-SVD (LC-KSVD) algorithm.


Both of these algorithms show good results for classification and face recognition tasks. The approach of Yang et al. [16] combines the two types of DDL by taking the discriminative capabilities of both the reconstruction residual and the sparse representation into account: class-specific sub-dictionaries are learned while discriminative coding vectors are maintained by applying the Fisher discrimination criterion. In the recent work of Cai et al. [10], the so-called Support Vector Guided Dictionary Learning (SVGDL) algorithm is presented, where the discrimination term consists of a weighted summation over squared distances between pairs of coding vectors. The algorithm automatically assigns non-zero weights to critical vector pairs (the support vectors), leading to good generalization performance in pattern recognition tasks.

2.1. Background

For the general problem formulation we assume Y = [y_1, y_2, ..., y_n] to be the set of p-dimensional input signals, each belonging to one of C (hidden) classes, X = [x_1, x_2, ..., x_n] to be their corresponding k-dimensional sparse representations, and D ∈ R^{p×k} to be the dictionary. We formulate the dictionary learning problem as

    \langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1,    (1)

with the regularization parameter λ_1. In order to take the relative attributes into account, the objective function can be extended with an additional term L(X):

    \langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 L(X).    (2)
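To make the formulation concrete, the base objective of Eq. (1) can be evaluated in a few lines. The following is a minimal numpy sketch; the function and variable names are ours, chosen to match the notation above:

    # Value of the objective in Eq. (1); numpy only.
    # Y: p x n signals, D: p x k dictionary, X: k x n coding vectors.
    import numpy as np

    def dictionary_objective(Y, D, X, lam1):
        """||Y - DX||_2^2 + lambda_1 * ||X||_1 as in Eq. (1)."""
        residual = np.linalg.norm(Y - D @ X) ** 2  # squared Frobenius norm
        sparsity = np.abs(X).sum()                 # l1 penalty on the codes
        return residual + lam1 * sparsity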
As additional information, the strengths of M predefined attributes, the so-called relative attributes [11], are available for the input signals. Those attributes, in contrast to binary labels, represent the strength of a property instead of its presence. The idea in learning relative attributes, assuming there are M attributes A = {a_m}, is to learn M ranking functions w_m for m = 1, ..., M. The predicted relative attributes are then computed by

    r_m(y_i) = w_m^\top y_i,    (3)

such that the maximum number of the following constraints is satisfied:

    \forall (i,j) \in O_m: \; w_m^\top y_i > w_m^\top y_j,    (4)
    \forall (i,j) \in S_m: \; w_m^\top y_i \approx w_m^\top y_j,    (5)

whereby O_m = {(i, j)} is a set of ordered signal pairs with signal i having a stronger presence of attribute a_m than signal j, and S_m = {(i, j)} is a set of unordered pairs where signals i and j have about the same presence of attribute a_m. The work of Parikh et al. [11] provides the convenient RankSVM function that returns the ranking vector w_m for a set of input images and their relative ordering. This information can further be used in the objective function in Eq. (2).
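Once the ranking vectors w_m are learned, evaluating Eq. (3) for all signals is a single matrix product. The sketch below assumes the w_m are stacked row-wise into a matrix W; the pairwise RankSVM training itself is left to the formulation in [11]:

    # Predicted relative attribute strengths, Eq. (3); numpy only.
    # W: M x p matrix stacking the learned ranking vectors w_m row-wise.
    # Y: p x n matrix with one input signal y_i per column.
    import numpy as np

    def predict_strengths(W, Y):
        """M x n matrix of predictions r_m(y_i) = w_m^T y_i."""
        return W @ Y

    def ordered_pair_satisfied(W, Y, m, i, j):
        """Check one constraint of Eq. (4): attribute m stronger in i than in j."""
        R = W @ Y
        return R[m, i] > R[m, j]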
3. RELATIVE ATTRIBUTE GUIDED DICTIONARY LEARNING

The RankSVM function maps an original input signal y_i to a point q_i in a so-called relative attribute space. Additionally, we assume that there exists a linear transformation A that maps the sparse signal x_i to the same point q_i (see Figure 1 and Eq. (6)).

Fig. 1. Illustration of the signal transformations between the original feature space (Y), the ranked attribute space (Q) and the sparse signal space (X). The goal is to transform x_i and x_j as close as possible to q_i and q_j.

First, we define the matrix Q ∈ R^{M×n} with elements q_{mi} = r_m(y_i), which contains the strengths of the (relative) attributes of all signals in Y. In order to find the transformation of Y into Q, we apply the RankSVM function known from [11] to the original input signals and obtain the weighting matrix W = [w_1^T; w_2^T; ...; w_M^T]. The objective is to find a matrix A that transforms the sparse representations of the signals into their corresponding relative attribute representations Q with minimum distance between w_m^T y_i and a_m^T x_i:

    \arg\min_A \|Q - AX\|_2^2 = \arg\min_A \|WY - AX\|_2^2.    (6)

Using Eq. (6) in Eq. (2) as a loss term, we get the formulation

    \langle D, X \rangle = \arg\min_{D,X,A} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 \|WY - AX\|_2^2.    (7)

From the first part of the equation we can see that Y ≈ DX. If Y in the loss term for the relative attributes is approximated by DX, the equation becomes

    \langle D, X \rangle = \arg\min_{D,X,A} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 \|WDX - AX\|_2^2.    (8)

The third term of Eq. (8) is minimized if A = WD. This information can be used to eliminate A from Eq. (7) to arrive


at the final objective function

    \langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 \|W(Y - DX)\|_2^2.    (9)

In order to find a closed-form solution for (9) and to reduce the computational complexity, the term \|X\|_1 is replaced with \|X\|_2^2. This can be justified (as in [10]) because the goal is to learn a discriminative dictionary and not to obtain sparse signals. However, once the dictionary is learned, a sparse representation can be obtained by orthogonal matching pursuit [17]. The final objective function is shown in Eq. (10):

    \langle D, X \rangle = \arg\min_{D,X} \|Y - DX\|_2^2 + \lambda_1 \|X\|_2^2 + \lambda_2 \|W(Y - DX)\|_2^2.    (10)

This equation is not a jointly convex optimization problem, so X and D are optimized alternately. The update rules for D and X are found by differentiating the objective function in (11) and setting the derivatives to zero:

    O = \|Y - DX\|_2^2 + \lambda_1 \|X\|_2^2 + \lambda_2 \|W(Y - DX)\|_2^2.    (11)

    \frac{\partial O}{\partial D} = -2(Y - DX)X^\top - 2\lambda_2 W^\top (WY - WDX)X^\top = 0

Since I + λ_2 W^⊤ W is positive definite, this condition reduces to (Y − DX)X^⊤ = 0 and yields

    \Rightarrow D = Y X^\top (X X^\top)^{-1}.    (12)

    \frac{\partial O}{\partial X} = -2D^\top (Y - DX) + 2\lambda_1 X - 2\lambda_2 D^\top W^\top (WY - WDX) = 0

    \Rightarrow X = (D^\top D + \lambda_1 I + \lambda_2 D^\top W^\top W D)^{-1} (D^\top Y + \lambda_2 D^\top W^\top W Y).    (13)
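The two closed-form updates translate directly into code. Below is a minimal numpy sketch of Eqs. (12) and (13); the small ridge term added before the inversion is our own numerical safeguard, not part of the derivation:

    # Alternating closed-form updates, Eqs. (12) and (13); numpy only.
    import numpy as np

    def update_dictionary(Y, X, eps=1e-8):
        """Eq. (12): D = Y X^T (X X^T)^{-1}, then column-wise L2 normalization."""
        k = X.shape[0]
        D = Y @ X.T @ np.linalg.inv(X @ X.T + eps * np.eye(k))
        return D / np.linalg.norm(D, axis=0, keepdims=True)  # normcol step

    def update_codes(Y, D, W, lam1, lam2):
        """Eq. (13): X = (D^T D + l1 I + l2 D^T W^T W D)^{-1} (D^T Y + l2 D^T W^T W Y)."""
        k = D.shape[1]
        G = W.T @ W                                    # W^T W, computed once
        lhs = D.T @ D + lam1 * np.eye(k) + lam2 * D.T @ G @ D
        rhs = D.T @ Y + lam2 * D.T @ G @ Y
        return np.linalg.solve(lhs, rhs)               # solve instead of explicit inverse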

The complete algorithm works as follows. Initially, the RankSVM [11] function is used to learn the ranking matrix W from the original input data Y and their relative ordering (i.e., the sets O_m, S_m). The initial dictionary D and the sparse representation of the data are obtained by first building a dictionary from randomly chosen input signals and then applying the K-SVD algorithm [8]. Afterwards, the dictionary and the sparse representation are optimized alternately until convergence: we first optimize D with respect to the current representation X, then X is updated based on the new dictionary D, and so forth. In order to avoid scaling issues that may affect the convergence, the dictionary is L2-normalized column-wise. The structure of the algorithm is shown in Algorithm 1.

Algorithm 1 Relative Attribute Guided Dictionary Learning
Require: Original signals Y, sets of ordered (O_m) and unordered (S_m) image pairs
Ensure: Dictionary D
1: W ← RankSVM(Y, O_m, S_m)
2: D_init ← rndperm(Y)
3: D, X ← KSVD(D_init, Y)
4: while not converged do
5:    D ← Y X^⊤ (X X^⊤)^{-1}
6:    D ← normcol(D)
7:    X ← (D^⊤D + λ_1 I + λ_2 D^⊤W^⊤WD)^{-1} (D^⊤Y + λ_2 D^⊤W^⊤WY)
8: end while
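Read as code, the overall procedure might look as follows. This sketch reuses update_dictionary and update_codes from above and simplifies the initialization: the dictionary is seeded with random training columns, and the K-SVD warm start of step 3 is replaced by one code update (W is assumed to come from a RankSVM implementation):

    # Sketch of Algorithm 1 with a simplified initialization; numpy only.
    import numpy as np

    def ragdl(Y, W, k, lam1, lam2, n_iter=50, seed=0):
        """Relative attribute guided dictionary learning; returns D and X."""
        rng = np.random.default_rng(seed)
        cols = rng.choice(Y.shape[1], size=k, replace=False)
        D = Y[:, cols] / np.linalg.norm(Y[:, cols], axis=0, keepdims=True)
        X = update_codes(Y, D, W, lam1, lam2)       # stand-in for the K-SVD init
        for _ in range(n_iter):                     # fixed budget instead of a test
            D = update_dictionary(Y, X)             # Eq. (12) + normcol
            X = update_codes(Y, D, W, lam1, lam2)   # Eq. (13)
        return D, X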
4. EXPERIMENTS

4.1. Test Datasets

In order to assess the quality of the learned dictionary, we pose a clustering task on three publicly available datasets, namely Public Figure Face (PubFig) [1], Outdoor Scene Recognition (OSR) [2] and Shoes [3]. These sets were chosen since they are the only ones known to us with annotated relative attributes. Some sample images of each dataset are presented in Figure 2. Clustering is chosen over classification since the proposed algorithm has no information about the ground-truth labels of the data, and therefore no classifier can be trained.

Fig. 2. Example images from the PubFig, OSR and Shoes datasets.

a) The PubFig dataset contains 772 images of 8 different identities, described by 512-dimensional GIST [2] features, and is split into 241 training images and 531 test images.

b) The OSR set consists of 2688 images from 8 categories, again described by 512-dimensional GIST [2] features, split into 240 training and 2488 testing images.

c) The Shoes dataset contains 14658 images of 10 different types. Out of this set, 240 images were used for training and 1579 for testing. The images are described by 960-dimensional GIST [2] features.


Additionally, tests were conducted to find the optimal values for λ_1 and λ_2. For this purpose, different fixed values were chosen for λ_1 while iterating over candidates for λ_2. The chosen values are λ_1 = 0.01 and λ_2 = 1 for the PubFig dataset, λ_1 = 0.1 and λ_2 = 0.01 for the OSR dataset, and λ_1 = 0.001 and λ_2 = 0.1 for the Shoes dataset.
4.2. Evaluation Metrics

In order to quantify the clustering capabilities of the sparse representation, the k-means algorithm [18] is applied and the accuracy (AC) and normalized mutual information (nMI) metrics [19] are computed. The sparse representation is obtained from the learned dictionary by approximating X in the error-constrained sparse coding problem given by Eq. (14) with the help of the OMP-Box Matlab toolbox [17], where the reconstruction error from the training phase is chosen as ε. These signals can then be used for the clustering.

    \hat{X} = \arg\min_X \|X\|_0 \quad \text{s.t.} \quad \|Y - DX\|_2^2 \leq \varepsilon    (14)
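A possible realization of this evaluation protocol is sketched below with scikit-learn and scipy, substituting scikit-learn's OMP for the OMP-Box Matlab toolbox and the Hungarian algorithm for the cluster-to-class assignment behind the AC metric; these library choices are ours, not the paper's:

    # Sketch: sparse coding of test signals and clustering evaluation.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans
    from sklearn.linear_model import OrthogonalMatchingPursuit
    from sklearn.metrics import normalized_mutual_info_score

    def evaluate_clustering(D, Y_test, labels, n_clusters, tol):
        """labels: ground-truth classes coded 0..n_clusters-1."""
        # Error-constrained OMP in the spirit of Eq. (14); tol bounds the residual.
        omp = OrthogonalMatchingPursuit(tol=tol, fit_intercept=False)
        omp.fit(D, Y_test)                    # one OMP problem per signal column
        X_hat = omp.coef_.T                   # k x n sparse representations
        pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_hat.T)
        nmi = normalized_mutual_info_score(labels, pred)
        # AC: best one-to-one matching between clusters and classes (Hungarian).
        cm = np.zeros((n_clusters, n_clusters), dtype=int)
        for p, t in zip(pred, labels):
            cm[p, t] += 1
        rows, cols = linear_sum_assignment(-cm)
        ac = cm[rows, cols].sum() / len(labels)
        return ac, nmi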
4.3. Results

As a benchmark, different supervised and unsupervised (discriminative) dictionary learning techniques are used, namely (1) KSVD [8] and (2) SRC [6] as unsupervised techniques, and (3) LC-KSVD [15], (4) FDDL [16] and (5) SVGDL [10] as supervised techniques. The algorithms were compared by their performance with full label information for varying dictionary sizes. Figure 3 shows the behavior of the algorithms for increasing dictionary size with the complete training data available.

Fig. 3. Clustering results on all three datasets for increasing dictionary sizes. The first and second columns show the accuracy (AC) and the normalized mutual information (nMI), respectively; the first, second and third rows show the results on the PubFig, OSR and Shoes datasets, respectively (compared methods: SRC, KSVD, LC-KSVD, FDDL, SVGDL, proposed).

The dictionary sizes used were [16, 40, 80, 120, 160, 240] for the PubFig and OSR datasets and [20, 50, 100, 140] for the Shoes dataset, which corresponds to [2, 5, 10, 15, 20, 30] and [2, 5, 10, 14] atoms per class. The number of atoms per class is constrained by the partition of the data into training and testing (for the Shoes dataset, one class includes only 14 training samples). One should note that the FDDL algorithm cannot use all training data, since the dictionary size restricts the number of usable training samples; therefore, the algorithm uses the complete training information only in the last test case. The results show that the accuracy of the proposed algorithm increases with the dictionary size, up to values exceeding the compared algorithms. However, on the OSR and Shoes datasets, SVGDL and FDDL produce comparable results for increasing dictionary sizes.

Additionally, the runtime of the training phase was analyzed, where the proposed algorithm outperformed all competitors with an average training time of 1.75 s.
5. CONCLUSION

We have presented a novel discriminative dictionary learning algorithm that uses relative attributes as semantic information instead of binary labels. Relative attributes provide much richer semantic information to improve the discriminative property of the dictionary and, eventually, of the sparse representation of the input signals. Instead of using the relative attributes of the images directly, we use the learned ranking functions in the learning process. The ranking functions transform the original features into a relative attribute space, and we therefore aim to transform the sparse signals linearly into this attribute space. This is achieved by adding an additional loss term to the objective function of a standard dictionary learning problem. The new method was applied to three different datasets for face, scene and object recognition. The results are not only comparable to modern DDL approaches; the method also has a low computational cost for learning the dictionary, outperforming all of the compared approaches. For future work, one could design a classifier based on relative attributes to investigate the benefit of this semantic information for classification tasks.


6. REFERENCES

[1] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Attribute and simile classifiers for face verification," in IEEE International Conference on Computer Vision (ICCV), Oct. 2009.

[2] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[3] A. Kovashka, D. Parikh, and K. Grauman, "WhittleSearch: Image search with relative attribute feedback," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012.

[4] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML). ACM, 2009, pp. 689–696.

[5] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer Science & Business Media, 2010.

[6] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.

[7] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1999.

[8] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[9] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2691–2698.

[10] S. Cai, W. Zuo, L. Zhang, X. Feng, and P. Wang, "Support vector guided dictionary learning," in Computer Vision – ECCV 2014, pp. 624–639. Springer, 2014.

[11] D. Parikh and K. Grauman, "Relative attributes," in IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 503–510.

[12] K. Engan, S. O. Aase, and J. Husoy, "Frame based signal compression using method of optimal directions (MOD)," in Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 1999, vol. 4, pp. 1–4.

[13] I. Ramirez, P. Sprechmann, and G. Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 3501–3508.

[14] S. Gao, I.-H. Tsang, and Y. Ma, "Learning category-specific dictionary and shared dictionary for fine-grained image categorization," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623–634, 2014.

[15] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 1697–1704.

[16] M. Yang, D. Zhang, and X. Feng, "Fisher discrimination dictionary learning for sparse representation," in IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 543–550.

[17] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit," CS Technion, vol. 40, no. 8, pp. 1–15, 2008.

[18] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 1967, vol. 1, pp. 281–297.

[19] M. Babaee, R. Bahmanyar, G. Rigoll, and M. Datcu, "Farness preserving non-negative matrix factorization," in IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 3023–3027.


