Shizhi Chen
August 2013
VISUAL CLASSIFICATION BY MULTIPLE FEATURE
FUSION AND LARGE-SCALE LEARNING
by
Shizhi Chen
Abstract
Robust visual classification underpins numerous practical applications, such as augmented reality, personal robotics, and medical image analysis. The main challenges for visual classification are accuracy and efficiency, in terms of both computation and memory. This dissertation presents efforts to address these challenges.
To improve accuracy, we propose a new feature representation, EigenMap, and a novel multiple-kernel learning framework, margin-constrained multiple-kernel learning. EigenMap utilizes kernel density estimation to approximate the location probability of features in an image. Hence, the proposed feature representation incorporates both the appearance and the spatial information of local regions.
A popular methodology is to exploit the complementarity of multiple feature types, either by simple concatenation or by multiple-kernel learning (MKL). Simple concatenation of feature vectors from different feature types requires manual feature selection and careful parameter tuning. On the other hand, MKL can automatically combine different feature types to achieve better performance. However, MKL tends to select only the most discriminative feature type and to ignore other, less discriminative feature types, which may provide complementary information for visual classification.
In order to better utilize complementary features with less discriminative power, we propose a margin-constrained multiple-kernel learning (MCMKL) method by extending MKL with margin constraints and a dimensionally normalized kernel. The proposed MCMKL method learns weights of different feature types, i.e., base features, according to their discriminative power. Unlike conventional MKL, MCMKL incorporates less discriminative base features by assigning them smaller weights when constructing the optimal combined kernel, so that the complementary information of multiple features can be fully exploited to improve the accuracy of visual classification.
To improve the efficiency of visual classification, especially for large-scale classification with a large number of categories, a typical approach is to use one-vs-all linear Support Vector Machines (SVM). However, its complexity increases linearly with the number of categories. On the other hand, nearest-neighbor (NN) based classifiers can naturally handle large numbers of categories and do not require a learning step. But their inferior performance compared with learning-based classifiers, as well as their expensive computation and memory costs, has hindered these classifiers on large numbers of categories.
We propose a novel classifier scheme, the Discriminative Hierarchical K-Means Tree (D-HKTree), which combines the advantages of both NN-based and learning-based classifiers for large-scale classification. It incorporates a learning-based classifier into an NN-based classifier, and extends the NN-based classifier to large-scale datasets with significantly lower computational cost and memory usage. At the same time, the D-HKTree still achieves state-of-the-art classification accuracies on several challenging datasets.
The main contributions of this dissertation include a new spatially encoded object representation and novel classifier frameworks to improve both the accuracy and the efficiency of visual classification. The proposed EigenMap object representation incorporates spatial context into the popular bag of words model without manually partitioning an image into a set of sub-regions. We also extend the multiple-kernel learning framework with margin constraints and a dimensionally normalized kernel, in order to maximize the joint discriminative power of multiple complementary features. Finally, we propose a novel classifier scheme, the Discriminative Hierarchical K-Means Tree (D-HKTree), to take advantage of both learning-based and nearest-neighbor-based classifiers for large-scale visual classification.
Acknowledgment
I would like to express my deepest appreciation to my advisor, Professor YingLi Tian, for her guidance, patience and understanding. She not only taught me to become a good researcher in the computer vision field, but also encouraged me to apply the research to benefit our society and help people with special needs. This dissertation would not have been possible without her consistent support and encouragement.
I would like to thank my dissertation committee members, Professor
Zhigang Zhu, Professor Fred Moshary and Dr. Rogerio Schmidt Feris for
their valuable comments and suggestions on revising the final dissertation.
I would also like to thank my colleagues and friends in the CCNY Media Lab, including Xiaodong Yang, Chucai Yi, Chenyang Zhang, Yang Xian, Carol Mazuera, Ze Ye, Hangrong Pan, Long Tian, Baiyu Xi, Shuai Yuan, Shuihua Wang, Michael Quintian, Faiz Hasanuzzaman, and many others for their friendship and support. Special thanks go to Xiaodong and Chucai for discussions of many research topics and their constructive comments on revising this dissertation. I also enjoyed the collaboration with Xiaodong on the large-scale learning part of this dissertation.
Furthermore, I would like to acknowledge, with much appreciation, NOAA CREST, which has provided financial support for my PhD study.
Finally, I could not have made it this far without the unbounded love and continuous support from my family. I would like to express my deepest gratitude to my parents for their dedication and love. I also want to say thank you to my wife, Man Ting Wong, for her patience and understanding throughout my PhD study. I appreciate my wonderful children, Shuyi Chen and Shuhui Chen. They have given me so much joy and the experience of being a parent.
1 Introduction
1.1 Applications and Challenges of Visual Classification
1.2 Improving Classification Accuracy
1.2.1 Spatially Encoded Object Representation
1.2.2 Multiple Features Fusion
1.3 Improving Classification Efficiency for Large-scale Learning
1.4 Dissertation Organization
2 Related Work
2.1 Local Features for Object Representation
2.1.1 Feature Extraction
(A) Interest Point Detector
(B) Interest Point Descriptor
2.1.2 Object Representation
(A) Bag of Words Representation
(B) Bag of Words in Spatial Context
2.2 Multiple-Feature Fusion
2.2.1 Direct Concatenation
2.2.2 Single-Kernel Learning
2.2.3 Multiple-Kernel Learning (MKL)
2.3 Large-scale Learning
2.3.1 Discriminative Learning
2.3.2 Nearest Neighbor based Classifier
Bibliography
Introduction
Robust visual classification has been an active research area [22, 28, 40, 44] in the computer vision field. It remains a driving force for many practical applications, e.g., human-computer interaction (HCI), surveillance, augmented reality, personal robotics, and medical applications. Nevertheless, two main challenges, accuracy and efficiency, have to be overcome in order for visual classification to support a wide range of applications.
Figure 1: Sample images of objects and scenes show large intra-class variations, including occlusion, viewing direction changes, and scale changes.
Related Work
One of the most popular interest point detectors is the Harris corner detector [42, 79], which is based on the second moment matrix

$A(x, y) = \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$. (2-1)
$\nabla^2_{norm} L(x, y, \sigma) = \sigma^2 (L_{xx} + L_{yy})$, (2-4)

where $L_{xx}$ and $L_{yy}$ are the second derivatives of $L$ at pixel $(x, y)$ in the x and y directions. The interest point at the detected scale can be defined from the scale-space maximum or minimum response of the scale-normalized Laplacian, as shown in Equation (2-5).
Following the study [54], since the Gaussian kernel satisfies the diffusion equation over scale, the DoG approximation can be derived from Equations (2-6) and (2-7) below.
$\nabla^2_{norm} L(x, y, \sigma) = \sigma^2 \nabla^2 L(x, y, \sigma) \propto L(x, y, k\sigma) - L(x, y, \sigma)$, (2-7)
where k is a constant. Hence, the interest point can also be defined from the scale-space maximum or minimum response of the DoG on the right-hand side of Equation (2-7). The SIFT (Scale Invariant Feature Transform) detector has shown very robust performance, as it is invariant to scale, rotation, and changes in illumination.
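As an illustration of this kind of scale-space interest point detection, the following is a minimal sketch using OpenCV's SIFT implementation; it assumes opencv-python 4.4 or later (where SIFT is exposed as cv2.SIFT_create) and a hypothetical image path, and is not part of the method proposed in this dissertation.

```python
# Sketch: detect scale-space (DoG) interest points with OpenCV's SIFT.
# Assumes opencv-python >= 4.4 and a hypothetical input image path.
import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
sift = cv2.SIFT_create(nfeatures=500)                     # keep the 500 strongest keypoints

# detectAndCompute returns the keypoints (location, scale, orientation)
# and their 128-dimensional SIFT descriptors.
keypoints, descriptors = sift.detectAndCompute(image, None)

for kp in keypoints[:5]:
    print("location:", kp.pt, "scale:", kp.size, "orientation:", kp.angle)
```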
To speed up the detection, the Hessian-matrix-based Speeded Up Robust Feature (SURF) detector [4] simplifies the Gaussian filter using box filters, as shown in Figures 2(e) and 2(f). Figure 2 shows the box filter approximation of the 2-dimensional (2-D) Gaussian function and its derivatives. Figures 2(a), (b) and (c) show the 1-D Gaussian function, its first derivative, and its second derivative, respectively.

Figure 2(d) shows the approximated 2-D Gaussian function G(x, y) with a standard deviation of 1.2. Figures 2(e) and 2(f) show the box filter approximations of the second-order Gaussian derivatives. The Hessian matrix is defined as

$H(x, y, \sigma) = \begin{bmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{bmatrix}$, (2-8)
where $D_{yy}$ denotes the convolution of the box filter in Figure 2(f), normalized by the filter size, with the original image $I(x, y)$. Similarly, $D_{xx}$ and $D_{xy}$ are the convolutions of the corresponding box filters with the original image. The convolution with a box filter can be efficiently computed by integral image techniques [90]. Finally, the interest point and its scale can be found by non-maximum suppression of the approximated Hessian response over scale space.
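The integral image technique mentioned above can be illustrated with a small sketch: after one pass over the image, the sum of any rectangular box filter region can be evaluated with four array lookups. The function names below are illustrative, not from any particular library.

```python
# Sketch: integral image for constant-time box filter sums (names are illustrative).
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, padded so ii[y, x] is the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four lookups in the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.random.rand(480, 640)
ii = integral_image(img)
# The box sum matches the direct summation, but costs O(1) per box.
assert np.isclose(box_sum(ii, 10, 20, 60, 90), img[10:60, 20:90].sum())
```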
Figure 5: Haar wavelet filters used to compute the horizontal and vertical responses, where the light region is +1 and the dark region is −1.
Figure 8: Pipeline to construct Bag of Words (BOW) model. (a) Visual dictionary
construction; (b) image or object representation using BOW model [112].
the bicycle. The BOW model ignores the spatial relationship between the different components of the bicycle, which, however, is very important for humans to recognize an object.
Despite the loss of spatial information, the BOW object representation
has been successfully applied in computer vision applications, such as image
retrieval and visual classification [26, 80, 96, 97, 106].
Unlike words in text, local features exhibit much more variation. Hence, nearby features in the feature space are grouped together to form a cluster, which is represented by a prototype feature, i.e., a visual word. The collection of visual words from all clusters of the training data forms a visual dictionary, or codebook. Figure 8 shows the pipeline of the BOW model, including visual dictionary construction in Figure 8(a) and object representation in Figure 8(b).
We first extract local features, e.g., SURF features, from a set of training images or objects. Figure 8(a) shows local features projected onto a two-dimensional space and marked in green. Then similar local features are clustered together and represented by a prototype feature, or visual word, marked in red in Figure 8(a). The visual dictionary is formed by collecting all visual words from the feature clusters, i.e., 3 visual words in Figure 8(a). A popular choice is the unsupervised K-Means clustering algorithm [57].
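A minimal sketch of this dictionary construction step, assuming scikit-learn is available and that local descriptors (e.g., 64-dimensional SURF vectors) have already been stacked into one array; the random array below is only a placeholder for real descriptors.

```python
# Sketch: build a visual dictionary by K-Means clustering of local descriptors.
import numpy as np
from sklearn.cluster import KMeans

descriptors = np.random.rand(10000, 64)   # stacked local descriptors from training images (placeholder)
codebook_size = 3                         # number of visual words (3 as in Figure 8(a))

kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
kmeans.fit(descriptors)

codebook = kmeans.cluster_centers_        # each row is one visual word (prototype feature)
print("visual dictionary shape:", codebook.shape)
```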
To construct the object representation, local features are also extracted from the input image. Then coding and pooling operators are used to generate the final BOW representation, as illustrated in Figure 8(b). In the coding step, local features are encoded using the visual words in the dictionary. The simplest coding method is vector quantization, i.e., hard assignment. The hard assignment of a local feature assigns a weight of 1 to the nearest-neighbor visual word, while all other visual words are assigned a weight of 0. With the distance

$dist(x_i, d_k) = \| x_i - d_k \|_2^2$, (2-12)

Equation (2-13) shows the hard assignment code $\alpha_{i,k}$ for the ith local feature $x_i$ in an image:

$\alpha_{i,k} = \begin{cases} 1 & \text{if } k = \arg\min_{j} dist(x_i, d_j) \\ 0 & \text{otherwise} \end{cases}$, (2-13)

where $d_k$ is the kth visual word in the codebook, and K is the codebook size, i.e., the total number of visual words in the dictionary.
However, hard assignment of a local feature usually incurs a large quantization error, since it assigns weight to only one visual word. By considering the soft, probabilistic version of coding shown in Equation (2-14), soft assignment normally achieves better performance in visual classification [36, 37].
$\alpha_{i,j} = \dfrac{e^{-\beta \, dist(x_i, d_j)}}{\sum_{k=1}^{K} e^{-\beta \, dist(x_i, d_k)}}$. (2-14)
$dist(x_i, d_k) = \begin{cases} \| x_i - d_k \|_2^2 & \text{if } d_k \in NN_n(x_i) \\ \infty & \text{otherwise} \end{cases}$, (2-15)
$\alpha_i = \arg\min_{\alpha_m} \left\| x_i - \sum_{k=1}^{K} \alpha_{m,k} \, d_k \right\|_2^2 + \lambda \, \| \alpha_m \|_1$, (2-16)

where $\| \alpha_m \|_1$ is the L1 norm of the code vector $\alpha_m$, and $\lambda$ controls the sparsity of $\alpha_m$.
$p_k = \frac{1}{I} \sum_{i=1}^{I} \alpha_{i,k}$, (2-17)

$p_k = \max_{i=1,\ldots,I} \alpha_{i,k}$. (2-18)
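To make the coding and pooling steps concrete, the sketch below implements the hard assignment of Equation (2-13), the soft assignment of Equation (2-14), and the average and max pooling of Equations (2-17) and (2-18); the variable names and placeholder arrays are illustrative only.

```python
# Sketch: BOW coding (hard and soft assignment) and pooling; names are illustrative.
import numpy as np

def encode_and_pool(features, codebook, beta=1.0, pooling="max"):
    # Squared Euclidean distances dist(x_i, d_k) between local features and visual words.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # shape (I, K)

    # Hard assignment (Equation (2-13)): weight 1 for the nearest word, 0 elsewhere.
    hard = np.zeros_like(dists)
    hard[np.arange(len(features)), dists.argmin(axis=1)] = 1.0

    # Soft assignment (Equation (2-14)): normalized exponential of negative distances.
    soft = np.exp(-beta * dists)
    soft /= soft.sum(axis=1, keepdims=True)

    codes = soft
    # Pooling over all local features in the image.
    if pooling == "max":                  # Equation (2-18)
        return codes.max(axis=0)
    return codes.mean(axis=0)             # Equation (2-17)

features = np.random.rand(200, 64)        # local descriptors of one image (placeholder)
codebook = np.random.rand(50, 64)         # 50 visual words (placeholder)
print(encode_and_pool(features, codebook).shape)   # (50,) BOW representation
```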
$K(f_c^i, f_c^j) = \exp\left( -\dfrac{\| f_c^i - f_c^j \|_2^2}{2\sigma_c^2} \right)$, (2-20)

where $f_c^i$ and $f_c^j$ are the concatenated feature vectors of the ith and jth samples.

$K(f_c^i, f_c^j) = \exp\left( -\dfrac{\| [f_1^i, f_2^i, \ldots, f_N^i] - [f_1^j, f_2^j, \ldots, f_N^j] \|_2^2}{2\sigma_c^2} \right)$. (2-21)
$K(f_c^i, f_c^j) = \exp\left( -\dfrac{\sum_{k=1}^{N} \| f_k^i - f_k^j \|_2^2}{2\sigma_c^2} \right)$. (2-23)

$K(f_c^i, f_c^j) = \prod_{k=1}^{N} \exp\left( -\dfrac{\| f_k^i - f_k^j \|_2^2}{2\sigma_c^2} \right)$. (2-24)
We can recognize that the product term in Equation (2-24) is the RBF kernel for the kth feature type, $K(f_k^i, f_k^j)$. Then the kernel matrix of the concatenated feature vector can be written as

$K(f_c^i, f_c^j) = \prod_{k=1}^{N} K(f_k^i, f_k^j)$. (2-25)

From Equations (2-24) and (2-25), we can see that all feature types contribute to the final kernel matrix, i.e., the concatenated feature vector's kernel matrix. If some feature types are less discriminative, they will degrade the final kernel matrix $K(f_c^i, f_c^j)$, which in turn lowers the final classification accuracy.
The relative weights with which single feature types contribute to the final kernel matrix can be approximated based on Equation (2-23). If we assume $(f_{k,m}^i - f_{k,m}^j)^2$ is statistically the same for each dimension of every single feature type, the contribution of each feature type to the exponent in Equation (2-23), and hence its relative weight in the final kernel matrix, is approximately proportional to its feature dimension.
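The following sketch numerically checks the relationship expressed in Equations (2-21), (2-24) and (2-25): the RBF kernel of the concatenated feature vector equals the product of the per-feature-type RBF kernels evaluated with the same $\sigma_c$. The feature dimensions are arbitrary placeholders.

```python
# Sketch: RBF kernel of concatenated features vs. product of per-type kernels.
import numpy as np

rng = np.random.default_rng(0)
sigma_c = 2.0

# Two samples, each described by N = 3 feature types of different dimensions.
feats_i = [rng.normal(size=d) for d in (10, 32, 5)]
feats_j = [rng.normal(size=d) for d in (10, 32, 5)]

def rbf(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Equation (2-21): kernel of the concatenated feature vectors.
k_concat = rbf(np.concatenate(feats_i), np.concatenate(feats_j), sigma_c)

# Equations (2-24)/(2-25): product of the per-feature-type RBF kernels.
k_product = np.prod([rbf(fi, fj, sigma_c) for fi, fj in zip(feats_i, feats_j)])

assert np.isclose(k_concat, k_product)
print(k_concat, k_product)
```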
$w \cdot x + b = 0$, (2-26)

$w \cdot x + b = +1$, (2-27)

$w \cdot x + b = -1$. (2-28)
Figure 10: Hyper-planes of a linear SVM. Features in orange are support vectors with zero training error. Features in red are support vectors with nonzero error.

The distance between these two hyper-planes is called the margin, which is $\frac{2}{\|w\|}$ as shown in Figure 10. A larger margin represents better generalization of the learned model. The features in orange are support vectors, which lie on the two support vector hyper-planes.
Any training feature of a class that lies beyond the class' support vector hyper-plane, toward the other class, introduces training error. The error is proportional to the distance between the feature and its support vector hyper-plane. As an example, the features in red in Figure 10 are features with training error. To quantify the training error, we define
$w \cdot x_i + b \leq -1 + \xi_i$ for $y_i = -1$, (2-30)

$\xi_i \geq 0 \quad \forall i$, (2-31)

where $\xi_i$ is the slack variable of the ith training feature. The SVM training then minimizes

$f = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i$, (2-32)

subject to

$y_i (w \cdot x_i + b) - 1 + \xi_i \geq 0$. (2-33)
$L_P = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} \mu_i \xi_i$, (2-34)

where

$\alpha_i \geq 0$, (2-35)

$\mu_i \geq 0$, (2-36)

$\alpha_i \left[ y_i (x_i \cdot w + b) - 1 + \xi_i \right] = 0$, (2-37)

$\mu_i \xi_i = 0$. (2-38)
$\dfrac{\partial L_P}{\partial w} = w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0$, (2-39)

$\dfrac{\partial L_P}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0$, (2-40)

$\dfrac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$. (2-41)
$L_P = \sum_{i=1}^{l} \alpha_i + \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i \, x_i \cdot w + \sum_{i=1}^{l} \xi_i (C - \alpha_i - \mu_i) - b \sum_{i=1}^{l} \alpha_i y_i$. (2-42)

By substituting Equations (2-40) and (2-41), we can remove the last two terms in Equation (2-42):

$L_P = \sum_{i=1}^{l} \alpha_i + \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i \, x_i \cdot w$. (2-43)
$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$, (2-44)

subject to

$0 \leq \alpha_i \leq C$, (2-45)

$\sum_{i=1}^{l} \alpha_i y_i = 0$. (2-46)
Note that the constraint of Equation (2-45) can be obtained from Equations (2-35), (2-36) and (2-41). Equation (2-46) is the same as Equation (2-40).
We discuss the value of $\alpha_i$ in three different scenarios. The first is when the training feature $x_i$ does not lie on or beyond its support vector hyper-plane, i.e., $y_i (x_i \cdot w + b) - 1 > 0$. This scenario corresponds to the green features in Figure 10. There is no training error for the feature, i.e., $\xi_i = 0$. From Equation (2-37), we know $\alpha_i = 0$.

The second scenario is when the training feature $x_i$ lies on its support vector hyper-plane, i.e., $y_i (x_i \cdot w + b) - 1 = 0$. This corresponds to the orange features in Figure 10. There is no training error in this scenario, i.e., $\xi_i = 0$. Therefore, $y_i (x_i \cdot w + b) - 1 + \xi_i = 0$ and $\alpha_i$ is not uniquely determined by Equation (2-37): some training features on the support vector hyper-plane have non-zero $\alpha_i$ and some have $\alpha_i = 0$.

The last scenario is when the training feature $x_i$ lies beyond its support vector hyper-plane, i.e., $y_i (x_i \cdot w + b) - 1 < 0$. This corresponds to the red features in Figure 10, which have nonzero training error, i.e., $\xi_i > 0$. From Equation (2-38), $\mu_i = 0$, and therefore, from Equation (2-41), $\alpha_i = C$.
$w = \sum_{i=1}^{N_s} \alpha_i y_i x_i$, (2-47)

$b = \dfrac{1}{|\{i \mid 0 < \alpha_i < C\}|} \sum_{\{i \mid 0 < \alpha_i < C\}} \left( y_i - x_i \cdot w \right)$, (2-48)

$b = \dfrac{1}{|\{i \mid 0 < \alpha_i < C\}|} \sum_{\{i \mid 0 < \alpha_i < C\}} \left( y_i - \sum_{j=1}^{N_s} \alpha_j y_j \, x_i \cdot x_j \right)$, (2-49)

where $N_s$ is the number of support vectors. During the testing phase, the predicted class label score $y_t$ of the testing feature $x_t$ is

$y_t = w \cdot x_t + b = \sum_{i=1}^{N_s} \alpha_i y_i \, x_i \cdot x_t + b$. (2-50)
$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$, (2-51)

subject to

$0 \leq \alpha_i \leq C$, (2-52)

$\sum_{i=1}^{l} \alpha_i y_i = 0$. (2-53)

The testing phase of Equation (2-50) becomes

$y_t = \sum_{i=1}^{N_s} \alpha_i y_i K(x_i, x_t) + b$, (2-54)

where

$b = \dfrac{1}{|\{i \mid 0 < \alpha_i < C\}|} \sum_{\{i \mid 0 < \alpha_i < C\}} \left( y_i - \sum_{j=1}^{N_s} \alpha_j y_j K(x_i, x_j) \right)$. (2-55)
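As a concrete counterpart to the dual formulation in Equations (2-51) to (2-55), the following is a minimal sketch that trains an SVM on a precomputed RBF kernel matrix with scikit-learn, which exposes the dual coefficients $\alpha_i y_i$ and the intercept b; the data are synthetic placeholders, not from the experiments in this dissertation.

```python
# Sketch: kernel SVM with a precomputed kernel matrix (scikit-learn); synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
y_train = (rng.normal(size=100) > 0).astype(int)
X_test = rng.normal(size=(10, 20))

def rbf_kernel(A, B, gamma=0.05):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

K_train = rbf_kernel(X_train, X_train)   # l x l kernel matrix used in Equation (2-51)
K_test = rbf_kernel(X_test, X_train)     # kernel values K(x_t, x_i) used in Equation (2-54)

svm = SVC(C=1.0, kernel="precomputed")
svm.fit(K_train, y_train)

# dual_coef_ holds alpha_i * y_i for the support vectors; intercept_ is b.
print("number of support vectors:", len(svm.support_))
print("predictions:", svm.predict(K_test))
```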
weight for the kth base feature. The kernel combination K opt approximates the
best trade-off between the discriminative power and the invariance for a
specific application.
Equations (2-56) to (2-58) show the objective cost function f and the
constraints for the multiple-kernel learning (MKL) proposed in [88].
$\min_{w, \xi_i, d_k} \; f = \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i + \sum_{k} \sigma_k d_k$, (2-56)

subject to

$y_i (w \cdot \Phi(x_i) + b) - 1 + \xi_i \geq 0$, (2-57)

$\xi_i \geq 0 \;\; \forall i; \quad d_k \geq 0 \;\; \forall k; \quad A d \geq p$, (2-58)

where $\Phi(x_i)$ is the combined feature corresponding to $K_{opt}$ in a high dimensional space for sample $x_i$, as shown in Equation (2-59). Equivalently, $K_{opt}$ can be expressed as in Equation (2-60), where $\Phi_k(x_i) \cdot \Phi_k(x_j)$ forms the kth base kernel $K_k$:

$K_{opt}(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, (2-59)

$K_{opt}(x_i, x_j) = \sum_{k} d_k \, \Phi_k(x_i) \cdot \Phi_k(x_j) = \sum_{k} d_k K_k(x_i, x_j)$. (2-60)

When the feature weights $d_k$ are fixed, the term $\sum_{k} \sigma_k d_k$ is simply a constant.
$\max_{\alpha_i} \; L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_{opt}(x_i, x_j) + \sum_{k} \sigma_k d_k$, (2-61)

subject to

$0 \leq \alpha_i \leq C; \quad \sum_{i} \alpha_i y_i = 0$. (2-62)
$d_k^{new} = d_k^{old} - \dfrac{\partial L_D}{\partial d_k}$. (2-64)

These two iteration steps are repeated until convergence or until the maximum number of iterations is reached. The final weights of the base features can then be determined.
Then we train SVM classifiers using the optimal combined kernel according to the final weights of the base features. A new sample $x_t$ is assigned a class label according to the sign of Equation (2-65):

$y_t = \sum_{i=1}^{N_s} \alpha_i y_i \sum_{k} d_k K_k(x_i, x_t) + b$. (2-65)
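The alternating optimization described above can be sketched as follows, under simplifying assumptions: for fixed weights $d_k$ the combined kernel is fed to a standard SVM solver (scikit-learn's SVC here), and the weights are then updated by a projected gradient step. The base kernels, step size, and constraint handling (reduced to clipping and renormalization) are illustrative; dedicated MKL solvers handle the constraints of Equation (2-58) exactly.

```python
# Simplified MKL sketch: alternate between SVM training on the combined kernel
# and a projected gradient step on the base-kernel weights d_k.
import numpy as np
from sklearn.svm import SVC

def mkl_train(base_kernels, y, C=1.0, sigma=1e-3, step=0.01, iters=20):
    M = len(base_kernels)
    d = np.ones(M) / M                                   # initial feature weights
    for _ in range(iters):
        K_opt = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K_opt, y)

        # Recover alpha_i * y_i for all training samples (zero for non-support vectors).
        alpha_y = np.zeros(len(y))
        alpha_y[svm.support_] = svm.dual_coef_[0]

        # Gradient of the dual w.r.t. d_k (cf. Equations (2-61) and (4-7)):
        # sigma_k - 1/2 * sum_ij alpha_i alpha_j y_i y_j K_k(x_i, x_j).
        grad = np.array([sigma - 0.5 * alpha_y @ Kk @ alpha_y for Kk in base_kernels])

        d = np.clip(d - step * grad, 0.0, None)          # projected gradient step (d_k >= 0)
        d /= d.sum()                                      # keep the weights normalized
    return d

# Toy example with two base kernels (linear kernels of two feature types).
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(60, 5)), rng.normal(size=(60, 30))
y = (rng.normal(size=60) > 0).astype(int)
weights = mkl_train([X1 @ X1.T, X2 @ X2.T], y)
print("learned base feature weights:", weights)
```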
that the samples from the ignored subset $S_y^0$ do not participate in the binary classification problem. An additional coloring variable is also introduced for each class label, i.e., $\mu_n \in \{-1, 0, +1\}$, where n ranges from 1 to N, indicating which subset the nth class label belongs to. Then, the optimization problem at each node becomes minimizing
$f = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} |\mu_{y_i}| \, \xi_i - A \sum_{i=1}^{l} |\mu_{y_i}|$, (2-66)

subject to

$\mu_{y_i} (w \cdot x_i + b) \geq 1 - \xi_i$, (2-68)

$\xi_i \geq 0$, (2-69)

$-B \leq \sum_{n=1}^{N} \mu_n \leq B$. (2-70)
$f = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \left[ \delta(\mu_{y_i} - 1) \, \xi_i^{+} + \delta(\mu_{y_i} + 1) \, \xi_i^{-} \right] - A \sum_{i=1}^{l} |\mu_{y_i}|$, (2-72)

$f = C \sum_{n=1}^{N} \sum_{i \in \{i \mid y_i = n\}} \left[ \delta(\mu_n - 1) \, \xi_i^{+} + \delta(\mu_n + 1) \, \xi_i^{-} \right] - A \sum_{n=1}^{N} \sum_{i \in \{i \mid y_i = n\}} |\mu_n|$, (2-73)

$f = C \sum_{n=1}^{N} \sum_{i \in \{i \mid y_i = n\}} \left[ \delta(\mu_n - 1) \, \xi_i^{+} + \delta(\mu_n + 1) \, \xi_i^{-} - \frac{A}{C} |\mu_n| \right]$. (2-74)

Since C is a constant, it can be dropped from the objective function. Let $f_n$ represent the cost function of a class label n, i.e.,

$f_n = \sum_{i \in \{i \mid y_i = n\}} \left[ \delta(\mu_n - 1) \, \xi_i^{+} + \delta(\mu_n + 1) \, \xi_i^{-} - \frac{A}{C} |\mu_n| \right]$, (2-75)

$f = \sum_{n=1}^{N} f_n$, (2-76)

subject to

$\mu_n \in \{-1, 0, +1\} \;\; \forall n$, (2-77)

$-B \leq \sum_{n=1}^{N} \mu_n \leq B$. (2-78)
Since $\mu_n$ can only take three discrete values, as shown in Equation (2-77), a single class label's cost function $f_n$, which depends on $\mu_n$, can only take three values too. Minimizing Equation (2-76) is therefore equivalent to selecting the minimum $f_n$ over the three possible $\mu_n$ values for each class label. Suppose $\hat{\mu}_n$ gives the minimum $f_n$; then $\sum_{n=1}^{N} \hat{\mu}_n$ needs to satisfy the constraint of Equation (2-78).

If $\sum_{n=1}^{N} \hat{\mu}_n$ satisfies Equation (2-78), then $\hat{\mu}_n$ is the solution. If $\sum_{n=1}^{N} \hat{\mu}_n > B$, then there are too many class labels in the positive subset $S_y^{+}$, and some class labels' $\hat{\mu}_n$ should be decreased. This can be done by calculating the increase of $f_n$, i.e., $\Delta f_n$, per unit decrease in $\mu_n$ for each class label, sorting the $\Delta f_n$, selecting the minimum $\Delta f_n$, and decreasing its corresponding $\mu_n$. The process is repeated until $\sum_{n=1}^{N} \hat{\mu}_n$ satisfies Equation (2-78). If $\sum_{n=1}^{N} \hat{\mu}_n < -B$, the same approach is adopted, except that $\mu_n$ is increased.
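A sketch of the greedy adjustment described above, under the simplifying assumption that the three possible costs $f_n(\mu_n)$ for each class are already known; the cost table below is a random placeholder rather than values from a real node.

```python
# Sketch: choose coloring variables mu_n in {-1, 0, +1} to minimize sum_n f_n
# subject to -B <= sum_n mu_n <= B, by greedy adjustment of the unconstrained optimum.
import numpy as np

def assign_colors(costs, B):
    """costs[n] maps each mu in (-1, 0, +1) to the class cost f_n(mu)."""
    mus = [-1, 0, +1]
    mu = np.array([min(mus, key=lambda m: costs[n][m]) for n in range(len(costs))])
    while abs(mu.sum()) > B:
        direction = -1 if mu.sum() > B else +1            # push the total back toward [-B, B]
        best_n, best_delta = None, None
        for n in range(len(costs)):
            new_mu = mu[n] + direction
            if new_mu not in mus:
                continue
            delta = costs[n][new_mu] - costs[n][mu[n]]     # increase in f_n per unit change
            if best_delta is None or delta < best_delta:
                best_n, best_delta = n, delta
        mu[best_n] += direction                            # apply the cheapest adjustment
    return mu

rng = np.random.default_rng(0)
costs = [{m: rng.random() for m in (-1, 0, +1)} for _ in range(12)]   # placeholder costs
print(assign_colors(costs, B=2))
```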
By Bayes’s theorem,
p(Q | C)p(C)
p(C | Q) = . (2-83)
p(Q)
Since $p(Q)$ is constant over all class labels C, and we assume a uniform prior, i.e., $p(C)$ is constant, Equation (2-84) becomes

$\hat{C} = \arg\max_{C} \; p(Q \mid C)$. (2-85)
$p(Q \mid C) = \prod_{i=1}^{N} p(d_i \mid C)$, (2-86)

$\hat{C} = \arg\max_{C} \; \log \left( \prod_{i=1}^{N} p(d_i \mid C) \right)$. (2-87)

Therefore,

$\hat{C} = \arg\max_{C} \; \sum_{i=1}^{N} \log p(d_i \mid C)$. (2-88)
Next, approximating $p(d_i \mid C)$ with a Parzen window estimator [72, 74] gives

$p(d_i \mid C) = \frac{1}{L} \sum_{j=1}^{L} K(d_i - d_j^C)$. (2-89)

Equivalently,

$\hat{C} = \arg\min_{C} \; \sum_{i=1}^{N} \left\| d_i - NN_C(d_i) \right\|_2^2$. (2-94)
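The decision rule of Equation (2-94) reduces to summing, per class, the squared distance from each query descriptor to its nearest neighbor among that class's training descriptors. A minimal sketch, assuming scikit-learn's NearestNeighbors and placeholder descriptor arrays:

```python
# Sketch of the NBNN decision rule (Equation (2-94)); descriptor arrays are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# One nearest-neighbor index per class, built from that class's training descriptors.
class_descriptors = {c: rng.normal(size=(500, 128)) for c in ("class_a", "class_b")}
nn_index = {c: NearestNeighbors(n_neighbors=1).fit(d) for c, d in class_descriptors.items()}

def nbnn_classify(query_descriptors):
    scores = {}
    for c, index in nn_index.items():
        dists, _ = index.kneighbors(query_descriptors)   # distance to NN_C(d_i) per descriptor
        scores[c] = np.sum(dists ** 2)                    # sum_i ||d_i - NN_C(d_i)||^2
    return min(scores, key=scores.get)                    # argmin over classes

query = rng.normal(size=(300, 128))                       # descriptors of one test image
print(nbnn_classify(query))
```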
3.1 Summary
3.2 Method
3.2.1 Overview
Figure 16: Flowchart of the proposed approach using EigenMap representation in
visual scene classification.
$s = \Phi^T (m - \bar{m})$, (3-1)

$\eta = [s, \mu]$, (3-2)
3.2.5 Classifier
3.3 Experiments
Figure 21: Sample images of the Natural Scene database, which has 13 categories.
3.3.1 Databases
Figure 22: Comparison with the Bag of Words (BOW) model and the Latent Dirichlet Allocation (LDA) model on (a) the UIUC Sport Scene database and (b) the Natural Scene database.
We divide the dataset of each category into five subsets. The images of one subset are used as the testing set, while the images from the remaining four subsets are used as the training set. The process is repeated five times, with each of the five subsets used as the testing data once. All experimental results reported in this dissertation are the average accuracy over the five repetitions.
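A sketch of this five-fold protocol using scikit-learn's StratifiedKFold, which splits each category into five subsets as described above; the classifier and feature matrix are placeholders for the actual features and models used in the experiments.

```python
# Sketch: five-fold cross validation protocol; features and labels are placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

X = np.random.rand(200, 512)                 # image-level feature vectors (placeholder)
y = np.random.randint(0, 8, size=200)        # category labels (placeholder)

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LinearSVC().fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# The reported result is the average accuracy over the five repetitions.
print("average accuracy:", np.mean(accuracies))
```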
(A) Comparison with the Bag of Words Model and the LDA Model

For each feature type, i.e., texture, shape, color, the Harris corner with the SIFT descriptor, and the uniform grid interest points with the SIFT descriptor, we compare the proposed EigenMap approach with the Bag of Words (BOW) model [26] and the topic discovery model [52]. More specifically, we employ Latent Dirichlet Allocation (LDA) [9], similar to the approach proposed by Li et al. [52]. All methods are evaluated under the same experimental setup.
The detailed comparisons over different feature types are shown in Figures 22(a) and 22(b) for the UIUC Sport Scene database and the Natural Scene database, respectively. On the UIUC Sport Scene database, the EigenMap model outperforms the standard Bag of Words model by an average of 4.8% over all feature types. It also outperforms the LDA model by an average of 15.3%.

We observe an even larger performance improvement over the other two models on the Natural Scene database. The proposed approach improves
Figure 23: Comparison with the BOW and LDA models using the five-fold cross validation results of the uniform grid interest point features on (a) the UIUC Sport Scene database and (b) the Natural Scene database.
Figure 25: Comparison with the state of the art reported by Fei-Fei and Perona [30], Lazebnik et al. [49], Li et al. [52], and Wang et al. [93] on (a) the UIUC Sport Scene database and (b) the Natural Scene database.
and the Natural Scene database. The true positive rate for each category is shown in the last column next to the corresponding confusion matrix.

All four confusion matrices are generated using the uniform grid interest points with the SIFT descriptor, where the rows are the ground truth and the columns are the classified categories. As we can see from the confusion matrices of the EigenMap and the BOW, the EigenMap approach achieves higher performance on most of the scene categories.

In the UIUC Sport Scene dataset, the most confusion occurs between the "Rowing" and the "Sailing" categories, since both sport scenes have very similar backgrounds containing water. In the Natural Scene dataset, the most confusion occurs between the bedroom and the living room scene images.
Figure 25(a) and 25(b) show the detailed comparison with the state-of-the-art performance on the UIUC Sport Scene dataset [52, 93] and the Natural Scene dataset [30, 49], respectively. The results are directly cited from their papers. From Figure 25, the proposed EigenMap model achieves state-of-the-art performance on both the UIUC Sport Scene database and the Natural Scene database.

Figure 26: The effect of the number of eigenvectors used in the PCA projection on the classification performance over (a) the UIUC Sport Scene database and (b) the Natural Scene database. Note that we used the uniform grid interest points with the SIFT descriptor for both databases.
3.4 Discussion
We have proposed a novel EigenMap representation of a scene image,
which can not only incorporates the spatial information with the appearance
features, but also integrates both local features and their global
correspondences effectively. The EigenMap model has been evaluated on
two public databases for scene image classification and outperforms both the
standard Bag of Words model and the LDA model. The proposed model also
achieves a state-of-the-art performance on both datasets with small feature
dimension.
Chapter 4
4.1 Summary
4.2 Method
the kth base kernel. Its objective function is shown in Equations (2-56) to (2-58) in Section 2.2.3. We repeat the equations here for convenience.
$\min_{w, \xi_i, d_k} \; f = \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i + \sum_{k} \sigma_k d_k$, (4-1)

subject to

$y_i (w \cdot \Phi(x_i) + b) - 1 + \xi_i \geq 0$, (4-2)

$\xi_i \geq 0 \;\; \forall i; \quad d_k \geq 0 \;\; \forall k; \quad A d \geq p$, (4-3)

where $\Phi(x_i)$ satisfies

$K_{opt}(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. (4-4)

The corresponding dual problem is

$\max_{\alpha_i} \; L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_{opt}(x_i, x_j) + \sum_{k} \sigma_k d_k$, (4-5)

subject to

$0 \leq \alpha_i \leq C; \quad \sum_{i} \alpha_i y_i = 0$. (4-6)
The optimization is carried out by two alternating steps: (1) fixing the feature weights $d_k$ and solving Equation (4-5) with a standard SVM solver; (2) fixing $\alpha_i$ and updating the feature weights $d_k$ with projected gradient descent as

$\dfrac{\partial L_D}{\partial d_k} = \sigma_k - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_k(x_i, x_j)$, (4-7)

$d_k^{new} = d_k^{old} - \dfrac{\partial L_D}{\partial d_k}$. (4-8)
Figure 27: (a) The hyper-plane of base feature "a" has a large separation margin between the solid-dot class and the triangle class; (b) the hyper-plane of base feature "b" has a small margin between the solid-dot class and the triangle class.
Therefore, the separation margin of each base feature in its high dimensional space provides a rough measurement of the base feature's discriminative power. Nevertheless, these rough measurements can effectively guide MKL when searching for the optimal feature combination. The separation margin of each base feature can be calculated using Equation (4-9), as the inverse square root of its own objective cost function.
$m_k = \dfrac{2}{\|w_k\|} \approx \dfrac{2}{\sqrt{f_k}} = \dfrac{2}{\sqrt{\frac{1}{2}\|w_k\|^2 + C\sum_{i} \xi_i + \sigma_k d_k}}$. (4-9)
After obtaining the separation margin $m_k$ for each base feature, we select one base feature as the reference base feature, which has feature weight $d_s$ and margin $m_s$. The weight $d_k$ of the kth base feature is constrained during training to a range with lower bound $LB_k$ and upper bound $UB_k$, according to the margin ratio between $m_s$ and $m_k$:

$LB_k \leq d_k \leq UB_k$. (4-10)

$LB_k$ and $UB_k$ are calculated as in Equation (4-11), and the additional weight constraints of Equation (4-10) are enforced during the multiple-kernel learning:

$LB_k = \left( \dfrac{m_k}{m_s} \right)^n d_s; \quad UB_k = \left( \dfrac{m_k}{m_s} \right)^n d_s \, (1 + \delta)$. (4-11)
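A small sketch of the bound computation in Equation (4-11), assuming the per-feature margins $m_k$ have already been estimated from each base feature's SVM objective via Equation (4-9); the values of n, $\delta$, $d_s$, and the margins below are illustrative placeholders.

```python
# Sketch: margin-constrained weight bounds of Equation (4-11); margins are assumed given.
import numpy as np

def weight_bounds(margins, ref_index=0, d_s=1.0, n=2.0, delta=0.2):
    m_s = margins[ref_index]                      # margin of the reference base feature
    ratio = (np.asarray(margins) / m_s) ** n      # (m_k / m_s)^n
    lower = ratio * d_s                           # LB_k
    upper = ratio * d_s * (1.0 + delta)           # UB_k
    return lower, upper

margins = [0.8, 0.3, 0.5]                         # separation margins m_k of three base features
LB, UB = weight_bounds(margins)
print("lower bounds:", LB)
print("upper bounds:", UB)
```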
$K(x_i, x_j) = \exp\left( -\gamma \sum_{q=1}^{D} (x_{i,q} - x_{j,q})^2 \right)$, (4-12)

where $x_i$ and $x_j$ are the ith and jth samples, $x_{i,q}$ and $x_{j,q}$ are the qth elements of their feature vectors, and D is the sample's feature dimension. $\gamma$ is the RBF kernel parameter, which determines the mapping from a low dimensional feature space L to a high dimensional space H.

Assuming that $(x_{i,q} - x_{j,q})^2$ is statistically the same for each dimension, the kernel value decreases as the feature dimension increases at a fixed $\gamma$, as shown in Equation (4-12). Hence, Equation (4-12) suggests an inverse relationship between the optimal $\gamma$ and the feature dimension. This intuition is confirmed in our experiments.
In MKL fusion, base features from different modalities may have significantly different feature dimensions, which results in very different optimal $\gamma$ values for each base feature. Therefore, MKL cannot utilize the maximum discriminative power of all base features from different modalities at the same time.

We can treat $\gamma$ as a feature selection parameter in MKL, which selects only a few base features at a time. This intuition also explains the observation reported in [88] that MKL tends to select only the few most discriminative base features. Therefore, MKL cannot take full advantage of all feature types from multiple modalities.
Based on these observations, we propose a dimensionally normalized RBF kernel (DNRBF), which is defined in Equation (4-13):

$K(x_i, x_j) = \exp\left( -\dfrac{\gamma}{D} \sum_{q=1}^{D} (x_{i,q} - x_{j,q})^2 \right)$. (4-13)
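A minimal implementation sketch of the dimensionally normalized RBF kernel of Equation (4-13); the two feature matrices are placeholders with deliberately different dimensions to illustrate the effect of the normalization.

```python
# Sketch: dimensionally normalized RBF (DNRBF) kernel of Equation (4-13).
import numpy as np

def dnrbf_kernel(A, B, gamma=1.0):
    """K(x_i, x_j) = exp(-(gamma / D) * sum_q (x_iq - x_jq)^2), with D the feature dimension."""
    D = A.shape[1]
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-(gamma / D) * sq_dists)

rng = np.random.default_rng(0)
low_dim = rng.normal(size=(5, 20))      # e.g., a low-dimensional location feature
high_dim = rng.normal(size=(5, 2000))   # e.g., a high-dimensional HOG-based feature

# With the same gamma, both kernels stay in a comparable range despite the very
# different feature dimensions, which is the point of the normalization.
print(dnrbf_kernel(low_dim, low_dim, gamma=1.0).mean())
print(dnrbf_kernel(high_dim, high_dim, gamma=1.0).mean())
```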
The Active Shape Model (ASM) [24, 94] is first applied to track 53 facial landmark points, including brows, eyes, nose, mouth, and face contour, as shown in Figure 29(a). Then we locate the corresponding positions of the facial points in the Motion History Image (MHI), as shown in Figure 29(b). The next step is to extract Image-HOG and MHI-HOG features on the original video frames and the corresponding MHI images, respectively.
We use 48-by-48-pixel patches, with 6 and 8 orientation bins for the Image-HOG and the MHI-HOG features, respectively.

Figure 29: (a) Facial landmark point tracking; (b) Motion History Image.
To extract body gesture features, we first track both hands and the head in an expression video. The head position is simply the center point of the facial points from the ASM model (see Figure 30(b)). To track the hands, we apply skin color detection [47] followed by the removal of the face regions, which have already been tracked by the ASM model, as shown in Figures 30(a) and 30(b).

In addition to the positions of the head and hands, we also calculate the motion areas (i.e., the numbers of motion pixels in the MHI image) within the detected head and hand regions. Figure 30(c) shows the head and hand regions in an MHI image.
We further extract Image-HOG and MHI-HOG features in the hand regions by uniformly sampling interest points. Then a bag of words representation with a codebook size of 80 is used to describe the distribution of the Image-HOG and MHI-HOG features of the hand regions. Finally, we perform PCA to reduce their feature dimensions.
4.4 Experiments
Figure 33: The average performance of the top 12 ranks, obtained by sweeping the kernel parameter log2(γ) from −15 to 8, for each of the three methods, i.e., concatenation (cvpr4HB'11), MKL, and MCMKL.
We randomly divide the videos into three subsets. Two subsets are used for training, and the remaining subset is used for testing. No video appears in both training and testing, but the same subject may appear in both due to the random selection process.
MCMKL outperforms the other two methods over all of the top 12 ranks, as shown in Figure 33. If we look at the rank 1 comparison in Figure 34, we can see that the proposed MCMKL achieves better performance than the concatenation method. Note that the five base features have been carefully selected, and the parameters, e.g., the PCA projection dimensions, are also carefully chosen for the concatenation fusion method in [23]. On the other hand, MCMKL combines those feature vectors automatically, and it still outperforms the concatenation fusion method by an average of 1.3%.

Our proposed MCMKL outperforms the traditional MKL method by an average recognition rate of 5.7% on these base features, as shown in Figure 34(a). Figure 34(b) shows the performance comparison over the three testing subsets.
Figure 35: Comparison of the average feature weight distributions of the 5 base features, i.e., face, location (loc), motion area (MA), and both hands' Image-HOG (imgHOG) and MHI-HOG (mhiHOG) features, for the MKL and MCMKL methods.

Then we take the mean feature weight of the 28 models over each base feature, followed by proper normalization.
Figure 35 shows the distribution of the average feature weights over the 5 base feature types for the MKL and MCMKL methods. As expected, MKL selects only the most discriminative base feature, i.e., the face feature. More specifically, it assigns more than 98% of the total feature weight to the face feature. The MKL method ignores all of the gesture features, i.e., location, motion area, etc., even though these gesture features have been shown to provide complementary information to the face feature [23].
As shown in Figure 35, the proposed MCMKL obtains a more reasonable feature weight distribution. Similar to MKL, it recognizes the face feature as the most discriminative base feature by assigning it the largest feature weight of 48%. At the same time, it also incorporates the other, less discriminative gesture features according to their discriminative power.
Figure 35 verifies the effectiveness of the proposed margin constraints and the dimensionally normalized RBF kernel (DNRBF). It is obvious that the additional constraints on the feature weights, according to the separation margin of each base feature, enforce the model to assign small weights to the less discriminative base features. However, it may be less intuitive how the DNRBF contributes to a more reasonable feature weight assignment.
Before we provide such intuition, we examine the relationship between the optimal $\gamma$ value and the feature dimension experimentally. We select three base features, i.e., the facial points' MHI-HOG, the facial points' Image-HOG, and the location feature. We then manually vary the PCA dimension of the Image-HOG and MHI-HOG, or the number of normalization frames of the location feature, so that their feature dimensions gradually increase. Then we use the SVM's 5-fold cross validation to find the optimal kernel parameter $\gamma$ for each of the three base features at the selected feature dimension. Figure 36 suggests an inverse relationship between the optimal $\gamma$ value and the feature dimension, which verifies our analysis in Section 4.2.3.
In the experiments of the last section, the most discriminative feature, i.e., the face feature, has an optimal $\gamma$ of $2^{-15}$. Since the other, less discriminative gesture features have much smaller feature dimensions, as we can see from Table 2, their optimal $\gamma$ values are much larger. Therefore, at a $\gamma$ of $2^{-15}$, the other gesture features have almost no discriminative power, since their optimal $\gamma$ values are very far away from $2^{-15}$. Consequently, the MKL method assigns almost zero feature weights to the other gesture features.

After we perform the dimensional normalization as in Equation (4-13), the optimal $\gamma$ values become very close for different base features.
Figure 37: Exploring contamination from a noisy feature. The average performance of the top 10 ranks, obtained by sweeping the kernel parameter $\gamma$, for each of the three methods, i.e., concatenation (cvpr4HB'11), MKL, and MCMKL.
4.5 Discussion
5.1 Summary
$\log p(d_i \mid C) = \frac{1}{L_C} \sum_{j=1}^{L_C} K(d_i - d_j^C)$, (5-2)

$K(d_i - d_j^C) = B_f \, U\!\left(1 - \dfrac{\| d_i - d_j^C \|}{r_i}\right)$, (5-3)

$\log p(d_i \mid C) = \dfrac{B_f}{L_C} \sum_{j=1}^{L_C} U\!\left(1 - \dfrac{\| d_i - d_j^C \|}{r_i}\right)$. (5-4)
As illustrated in Figure 40(b), we further approximate the unit step function with the feature space boundary defined by the leaf node in which the query feature falls:

$U\!\left(1 - \dfrac{\| d_i - d_j^C \|}{r_i}\right) \cong \delta\!\left(f_i, LEAF(d_j^C)\right)$, (5-5)

where $f_i$ denotes the leaf node of the query feature $d_i$, and $\delta(f_i, LEAF(d_j^C))$ equals 1 when the features $d_i$ and $d_j^C$ fall within the same feature space partition defined by the leaf node $f_i$. If we substitute Equation (5-5) into Equation (5-4), it becomes
$\log p(d_i \mid C) = \dfrac{B_f}{L_C} \sum_{j=1}^{L_C} \delta\!\left(f_i, LEAF(d_j^C)\right)$. (5-6)
Note that the right hand side of Equation (5-6) does not depend on the query feature $d_i$, except through its leaf node $f_i$. Therefore, we can pre-compute the right hand side of Equation (5-6) for each class label C and store the results as the label histogram associated with the leaf node, denoted as $LH(f_i, C)$.

The summation term in Equation (5-6) is the number of training features from category C that fall within the feature space partition of the leaf node. $L_C$ corresponds to the total number of training features in category C. $B_f$ is an L1-norm constant. Substituting Equation (5-6) into Equation (5-1), we obtain the L-HKTree classification rule:
$\hat{C} = \arg\max_{C} \; \sum_{i=1}^{N} \dfrac{B_f}{L_C} \sum_{j=1}^{L_C} \delta\!\left(f_i, LEAF(d_j^C)\right) = \arg\max_{C} \; \sum_{i=1}^{N} LH\!\left(LEAF(d_i), C\right)$. (5-7)
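A sketch of the classification rule in Equation (5-7), assuming the hierarchical k-means tree has already been built and each leaf stores a pre-computed label histogram LH(leaf, C); the tree traversal is abstracted into a placeholder lookup function, and the histograms are random stand-ins.

```python
# Sketch of the L-HKTree decision rule (Equation (5-7)); the tree and the
# pre-computed leaf label histograms are placeholders.
import numpy as np

num_classes = 4
num_leaves = 16

rng = np.random.default_rng(0)
# LH[leaf, C]: pre-computed label histogram stored at each leaf node (placeholder values).
leaf_label_histograms = rng.random(size=(num_leaves, num_classes))

def leaf_of(descriptor):
    """Stand-in for traversing the hierarchical k-means tree to a leaf node."""
    return int(abs(hash(descriptor.tobytes())) % num_leaves)

def l_hktree_classify(query_descriptors):
    # Accumulate the label histograms of the leaves reached by the query descriptors.
    votes = np.zeros(num_classes)
    for d in query_descriptors:
        votes += leaf_label_histograms[leaf_of(d)]
    return int(np.argmax(votes))                 # argmax_C sum_i LH(LEAF(d_i), C)

query = rng.normal(size=(300, 128))              # local descriptors of one test image
print("predicted class:", l_hktree_classify(query))
```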
$Y = \sum_{k=1}^{M} \left[ W_k^1 H_k, W_k^2 H_k, \ldots, W_k^C H_k, \ldots, W_k^T H_k \right]$, (5-9)

$Y = \sum_{k=1}^{M} H_k W_k$, (5-10)
5.3 Experiments
Figure 41: Comparing memory usage over different NN-based classifiers as the scale
of the dataset increases.
Figure 41 compares the memory requirements of different NN-based classifiers as the scale of the dataset increases. Since the memory usage of conventional NN-based classifiers is directly related to the training data size, we can estimate their memory usage from the training data size. As for the D-HKTree, the memory usage is estimated from the size of the L-HKTree. As shown in this figure, the memory usage of the D-HKTree increases only slightly from Caltech 101 to Caltech 256 and then SUN. However, the memory consumption of NBNN and local NBNN grows significantly as the dataset scale increases. For example, the memory requirement is around 100GB for both NBNN and local NBNN under our experimental settings on the SUN dataset, while the memory usage of the D-HKTree is only 6GB.
Figure 42: Comparison of the tradeoff between accuracy and relative computational
complexity to other hierarchical SVM based methods for large-scale data, i.e. Gao
[34], Griffin [38], Marszalek [59], on (a) SUN dataset; (b) Caltech 256 dataset; Note
that the results from the other three methods are directly estimated from the plots in
the paper [34].
There is very little work [85, 107] on combining learning-based and NN-based classifiers to take advantage of both classifier types. Table 5 compares the D-HKTree with two other methods that hybridize both classifier types, i.e., SVM-KNN [107] and the NBNN Kernel [85]. As shown in this table, the D-HKTree significantly outperforms SVM-KNN and the NBNN Kernel by 11% and 8% in accuracy, respectively. Note that the NBNN Multi-Kernel actually combines the NBNN Kernel with other kernels of different feature types instead of a single feature type. Nevertheless, the D-HKTree still obtains the highest accuracy, as shown in Table 5.
Table 5: Comparison to the hybrid classifiers, i.e., SVM-KNN [107] and NBNN Kernel [85], on the Caltech 101 dataset.
SVM-KNN [107] selects the k nearest neighbor images of a query image and trains a local multi-class SVM classifier to predict the query image's label. Since the training data are closely coupled with the nearest neighbor results, the NN-based and SVM-based classifiers cannot complement each other very well, which may explain the significant performance decrease compared with the D-HKTree.
5.4 Discussion
112. Van. Digital Image. 2013 Toyota Sienna LE for Sale. Web. 1 May
2013. <http://www.usedcarsgroup.com/wappingersfalls-ny/2013-
toyota-sienna-5tdkk3dc1ds360034.html>.
113. Dolphin. Digital Image. Animal Rights Group Rescues Dolphin Off
Cape Cod. Web. 1 May 2013.
<http://www.opposingviews.com/i/animal-rights-group-rescues-
dolphins-off-cape-cod>
114. Dolphin. Digital Image. Mysterious Dolphin Deaths Continue in
Gulf of Mexico. Web. 1 May 2013.
<http://www.banderasnews.com/1301/nb-
mysteriousdolphindeaths.htm>
115. Tiger. Digital Image. Wallpapers for resolution of 1600x1200. Web.
1 May 2013.
<http://im05.coolwallpapers.org/resolution/1600x1200/162>
116. Watermelon. Digital Image. Geometry Shapes with an example.
Web. 1 May 2013. <http://quizlet.com/18009319/geometry-shapes-
with-an-example-flash-cards/>
117. Orange. Digital Image. Strategy. Web. 1 May 2013.
<http://stratkomuncut.com/category/strategy/>.
118. Strawberry. Digital Image. Marietta College. Web. 1 May 2013.
<https://secure.www.alumniconnections.com/olc/pub/MRO/event/sho
wEventForm.jsp?form_id=144710>.
119. Pink lily flower. Digital Image. Full of Life. Web. 1 May 2013.
<http://banishloneliness.org/2012/09/27/non-attachment-v-
loneliness/>.
120. Rose. Digital Image. Flickr. Web. 1 May 2013.
<http://www.flickr.com/photos/niceflowers48/5021857417/in/photostr
eam/>.
121. Lion. Digital Image. African Lion King. Web. 1 May 2013.
<http://www.wallpapersshop.net/wallpaper/african-lion-king/>.
122. Blue whale. Digital Image. Reflection at Point Lookout. Web. 1
May 2013.
<http://www.reflectionsatpointlookout.com/?_escaped_fragment_=gal
lery>.
123. Office partition. Digital Image. Singapore Office Partitions. Web. 1
May 2013. <http://detail.en.china.cn/provide/1001809044.html>.
124. Office. Digital Image. Portillo Cleaning Services. Web. 1 May
2013. <http://portillocleaninglck.com/services.html>.
125. Office. Digital Image. Prestigious Cambridge Location. Web. 1 May
2013. <http://geekoffices.com/properties/>.
126. Office furniture. Digital Image. World Class Wooden Furniture.
Web. 1 May 2013. <http://wooden-furnitures.co.in/>.
127. G. Kim, C. Faloutsos, M. Hebert, “Unsupervised Modeling of
Object Categories Using Link Analysis Techniques”, CVPR, 2008.
128. T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, J.
Yagnik, “Fast, Accurate Detection of 100,000 Object Classes on a
Single Machine”, CVPR, 2013.
My Publication List
Conference