Towards Large-Scale Face Recognition Based on Videos
Meltem Yalcin, Hakan Cevikalp, Hasan Serhan Yavuz
Eskisehir Osmangazi University
Machine Learning and Computer Vision Lab, Eskisehir, Turkey
{yalcinmeltem26, hakan.cevikalp, hsyavuz}@gmail.com
Abstract
This paper introduces a new method to find the most important samples for classification in image sets to speed-up
the classification phase and reduce the storage space for
large-scale face recognition tasks that use image sets obtained from face videos. We approximate the image sets
with the kernelized convex hulls and show that it is sufficient to use only the samples that participate to shape
the image set boundaries in this setting. To find those important samples that form the image set boundaries in the
feature space, we employed the kernelized Support Vector
Data Description (SVDD) method which finds a compact
hypersphere that fits the image set samples best. Then, we
show that these kernelized hypersphere models can also be
used to model image sets for classification purposes. Lastly,
we introduce ESOGU-285 (ESkisehir OsmanGazi University) Face Videos database that includes 285 people since
the most popular video datasets used for set based recognition methods include either a few amount of people or
large amount of people with just a few (or single) video collections. The experimental results on small sized standard
datasets and our new larger sized dataset show that the proposed method greatly improves the testing times of the classification system (we obtained speed-ups up to a factor of 10
in ESOGU Face Videos dataset) without a significant drop
in accuracies.
1. Introduction
Face/object recognition based on image sets has been attracting more attention in recent years owing to the fact that
collecting a set of images for recognizing people/objects
becomes increasingly convenient and easy with the popularization of video cameras and cell phone cameras. In this
setup, the user supplies a set of images of the same unknown
individual rather than supplying a single query image. In
general, the gallery also contains a set of images for each
known individual, so the system must recover the individual
whose gallery set is the best match for given query set. As
a result, the image set recognition task naturally arises in a
wide range of contexts including video-based recognition,
surveillance, and personal albums. The query and gallery
sets may contain large variations in pose, illumination, and
scale. For example, even if the images were taken on the
same occasion, they may come from different viewpoints or
from face tracking in surveillance video over several minutes.
Recognition methods using image sets generally outperform the ones for single instance based recognition, both
because they incorporate information about the variability
of the individual’s appearance and because they allow the
decision process to be based on comparisons of the most
similar pairs of query and gallery images - or on local models based on these. Moreover, in many applications, image
sets are also the most natural form of the input to the system since obtaining image sets does not generally require
cooperation from the individuals. Recognition based on
image sets offers these great advantages, but at the same
time it poses new challenges since the traditional classification methods such as Support Vector Machines (SVMs) or
k-nearest neighbor (k-NN) cannot be used directly in this
setup.
Existing classification methods using image sets differ
in the ways in which they represent the image sets and
compute the distances (or similarity) between them. Some
methods [1,15] used parametric probability distributions to
model image sets, and Kullback-Leibler divergence is used
to measure the similarity between these distributions. However, as noted in [18,2], these methods are not very robust
when the test sets have only weak statistical relationships to
the training ones. Nonparametric methods [20, 6, 8, 2, 18]
use different models to approximate image sets. Yamaguchi et al. [20] used linear subspaces to model image
sets and they used canonical angles between subspaces to
measure the similarity between them. Cevikalp and Triggs
[2] used linear/kernelized affine/convex hull models to approximate image sets and geometric distances (distances of
closest approach) between these models are used to measure the similarity. This method can be seen as enhance1 26
ment of nearest neighbor classification that attempts to reduce its sensitivity to random variations in sample placement by “filling in the gaps” around the examples. Although still based on the closest-point idea, classification
method using affine/convex hull models replaces point-topoint or point-to-model comparisons with training-model
to test-model ones. This methodology offers a number of
attractive properties: the model for each individual can be
fitted independently; computing distances between models
is straightforward due to the convexity; resistance to outliers can be incorporated by using robust fitting to estimate
convex models.
After introduction of affine/convex hull models, different variants of these models have been proposed [9,21]. For
example, SANP (Sparse Approximated Nearest Points) [9]
methodology extended the affine hull method by enforcing
the sparsity of samples used for affine hull combination, and
reported good accuracies. However, this method is very
complex in the sense that it requires setting 3 design parameters beside the affine hull model parameters. It is also
slow since one has to solve a complex optimization problem that includes minimization of L1 norm of some vectors,
which makes it unsuitable for real-time applications as verified in our experiments. In a similar manner, [21] used
regularized affine hull (RAH) models to represent image
sets where L2-norms of affine hull combination coefficients
are minimized during computing the smallest distances between affine hulls. Although this requires solving a much
easier optimization problem compared to [9], it is still not
suitable for real-time applications. More recently, new extensions of these methods used so-called collaborative representations for affine/convex hull models [19, 22]. The basic difference is that they model all gallery sets as a single
affine/convex hull and then query set is classified by using
the reconstruction residuals computed from only individual
gallery sets. However, as we show below, these methods
are bound to fail for large-scale applications. Other methods using sparse models for image set based recognition can
be found in [5,4,3]. Most of the mentioned methods above
have kernelized versions that can be used to estimate nonlinear face models.
There are also many methods that seek to build nonlinear approximations of the manifold of face appearances,
typically embedding local linearity within a globally nonlinear model. For instance, Fan and Yeung [6] use hierarchical clustering to discover local structures and approximate each local structure with a linear subspace. Wang
et al. [18] follow a similar approach and they use nearest
neighbor clustering to find the local structures forming the
nonlinear manifold. Wang and Chen [17] extends MMD
method as the manifold discriminant analysis (MDA) to improve the between-manifold distances. Cevikalp and Triggs
[2] use spectral clustering to find the local structures and
model the local structures with affine subspaces. Hadid and
Pietikainen [8] apply k-means clustering to find local structures and model each local structure with the cluster center.
All these methods were inspired by the nonlinear manifold
modeling approach of Roweis and Saul [13], but they replace the locally affine models with different models as described above.
Our Contributions:We consider large-scale face recognition applications using image sets collected from videos in
this paper. We first discuss main challenges that will be
encountered for such applications. Then, we question suitability of the existing methods in the literature for largescale applications and then propose an efficient method that
will make large-scale image set based recognition feasible.
To this end, we propose a method to find the most essential
samples in image sets for classification to reduce the image
set samples. SVDD method, which finds a compact kernelized hypersphere that best fits the image set samples, is used
to determine the most essential samples. In addition, we
show that the kernelized hypersphere models can be used
for set based face recognition. It should be noted that most
popular video datasets used for set based recognition methods are not large-scale and they include only few person
classes. Therefore, to test the proposed method, we developed a new video dataset, called ESOGU Face Videos, that
includes 2280 videos belonging to 285 individuals. The total number of frames is about 764 K. Although this dataset
cannot be considered as large-scale data, it was still sufficient to show that the most recent face recognition methods
using image sets have serious drawbacks related to computational complexity or representation of image sets.
2. Challenges for Large-Scale Face Recognition Based on Image Sets
One of the biggest challenge of the large-scale set based
face recognition systems will be related to saving all data
on a computer. Since even short videos may include hundreds of frames, the face detectors will return many face images for a single person. Thus, one needs to find a sophisticated technique to reduce the amount of original data without causing significant drop in recognition performance.
Reducing techniques using random selections can significantly decrease the performance as reported in the results
of experiments, that is carried out with different number
of face images [4, 3, 9]. The second challenge will be to
choose good models to represent image sets. The most popular video datasets such as Honda/UCSD [10] or YouTube
Celebrities [11] used for set based recognition have videos
of people with different poses including frontal, left or right
profiles and poses between those, thus the face images in
a set construct a nonlinear manifold which is locally linear. Although methods that approximate these nonlinear
27
image sets with a single linear/affine subspace or linear
convex hulls produce very high accuracies on current data
sets, the performances of these methods will drop as the
number of people in the gallery set increases since these
models will seriously overestimate the true extent of the
classes and introduce large overlapping regions between image sets, cf., illustration in Fig. 1. Thus, only kernelized
versions of these methods or methods that build nonlinear
face manifolds using linear models will give satisfactory
performance. The last challenge will be the real-time performance of the recognition system. An efficient system
must return the individual (among the thousands of people)
whose face image set is the best match for the given query
set in a reasonable time.
to this combined convex hull will become zero and the coefficients which will be used for computing residuals will
be almost random. For affine hull models, the situation gets
even worse since three independent face images (not sets)
are enough to span all two-dimensional plane. Another
problem would be related to the computational difficulty.
Some collaborative representation based methods as in [19]
require taking inverse of matrices with size (n × n), where
n is the total number of images in the gallery. Large-scale
applications result in very large matrices that would be very
impractical to fit them into the memory, not to mention the
difficulty of taking their inverses. Our experiments confirm
this fact since we failed to implement some collaborative
representation based methods because of memory issues on
our moderate sized video dataset.
(a)
Figure 1. In large-scale applications, using either linear affine or
convex hull models for representation of image sets causes large
overlapping regions between these linear models. In this example,
for affine hull model, all image sets span entire 2D plane thus it is
impossible to separate all these sets; for convex hull model, most
neighboring image sets have overlaps and it is only possible to
separate furthest image sets.
(b)
Figure 2. Comparisons of small and large-scale scenarios for set
based recognition. In (a), the number of the image sets belonging
to different people in gallery is small so one can model all gallery
sets as a single convex hull and find the distance from the convex
hull of the query set to this hull. But, when the number of people
is increased, the query sets will be typically inside the convex hull
formed by combining all image sets in the gallery as illustrated in
(b). In such cases, the distances will become zero and the coefficients which will be used for computing residuals will be almost
random. This will cause the collaborative model classifier to fail.
3. Proposed Method
In set based face recognition, the methods using the
so-called joint or collaborative representations report very
good accuracies on small sized datasets but they will likely
to fail in large scale applications. In these methods, all
gallery sets belonging to different individuals are approximated with a single combined affine/convex hull, and query
set is classified by using the reconstruction residuals that
only come from individual gallery sets. We adopt the illustration given in [22] to show how these methods get weak
in large-scale applications. In case of a few image sets, one
can model all image sets with a single convex hull and find
the distances from the query sets to this convex hull as illustrated in Fig. 2 (a). But, when the number of person classes
is large as in Fig. 2 (b), query sets will be typically inside
the combined large convex hull built from all gallery sets.
As a result, all distances from the convex hulls of query sets
In the proposed method, we approximate image sets with
the kernelized convex hulls as in [2] since convex hulls are
tighter models compared to affine hulls, and they provide
better localization in large-scale applications. Let the face
image samples be xck ∈ IRd , where c = 1, . . . , C indexes
C image sets (individuals) and k = 1, . . . , nc indexes nc
samples of image set c. Let φ(.) be the implicit feature
space embedding and k(xi , xj ) = hφ(xi ), φ(xj )i be the
corresponding kernel function, where h.i denotes the feature space inner product. A kernelized convex hull of samples xck is defined as
Hcconvex = {φ(x) =
P nc
k=1
αck φ(xck )|
P nc
k=1
αck = 1, 0 ≤ αck ≤ 1} .
(1)
If we set the upper bound on the convex combination coefficients to values, U , smaller than 1, several samples need to
28
P nc
be activated to ensure k=1
αck = 1, giving a more compact convex approximation that lies strictly inside the convex hull of the samples. This trick provides more robustness
against to outliers during computation of the distances between convex hulls.
Given two compact kernelized convex hulls, the geometric distance between them can be found by solving the
following constrained convex quadratic programming (QP)
problem,
arg min kΦ(Xi )αi − Φ(Xj )αj k
2
αi ,αj
s.t.
ni
X
k=1
αik =
nj
X
SVDD method aims to find a closed boundary around the
data. To this end, it finds a compact bounding hypersphere
where the most of data samples lie in that hypersphere.
of a point set
n The bounding hypersphere
o
d
xk ∈ IR |k = 1, . . . , n is characterized by its center s and radius r. These can be found by solving the
quadratic programming problem
!
X
2
r +γ
ξk
arg min
(3)
s, r≥0, ξ≥0
k
s.t. kxk − sk2 ≤ r2 + ξk , k = 1, . . . , n,
αj k̃ = 1, 0 ≤ αik , αj k̃ ≤ U,
or its dual
k̃=1
(2)
where Φ(Xi ) = [φ(xi1 ), . . . , φ(xini )] represents the matrix whose columns are the mapped samples of set i, and αi
is a vector containing the corresponding αik coefficients.
It should be noted that the objective function of (2) can be
written as α⊤ Kα by setting Φ(X) = [Φ(Xi ) − Φ(Xj )]
αi
⊤
and α ≡ ( α
j ), where K = Φ(X) Φ(X). This problem is
closely related to the classical SVM classifier formulation,
which finds a separating hyperplane between two convex
hulls based on exactly the same pair of closest points. Thus,
the same problem can also be solved by training an SVM
that separates the query set from the given gallery one as
explained in [2].
It is well-known that the solution returned by an SVM
classifier is sparse and completely determined by the samples that are near the decision boundaries where the rival
class samples approach to each other (these samples are
called the support vectors), and all other samples far from
these regions do not contribute to the solution. Therefore,
the support vectors are the “essential” training points for
classification and the goal of the SVM training is to discover them. If we generalize this rule for arbitrary convex sets, the geometric distances between them will always
be determined by the samples in the vicinity of image sets’
outer boundaries. Thus, if we can find the samples forming
the image set boundaries in the feature space, we can ignore
the remaining samples. This will greatly reduce the required
disk storage space because only the relevant data needed for
image set classification will be saved, and will significantly
improve the testing speed since one has to solve smaller
sized QP problems.
Kernelized one-class classifiers [16,14] can be used to
determine the samples that shape the image set boundaries.
Both methods yield to the similar results for certain kernel
function types such as the Gaussian kernels, but their solutions are different for the linear case if the data is not preprocessed to have unit norm. Therefore, we prefer to use
the SVDD method of Tax and Duin [16] since the geometrical intuition behind the method is very similar to our goal.
X
X
αk kxk k2
αk αl hxk , xl i −
arg min
α
k
k,l
s.t.
X
(4)
αk = 1, ∀k 0 ≤ αk ≤ γ.
k
The αk are Lagrange multipliers and γ ∈ [1/n, 1] is the
ceiling parameter that can be set to a value less than one
to reduce the influence of outliers. The objective function
is convex so a global minimum exists. In the kernelized
case, we have to replace all inner products hxk , xl i with
kernel evaluations k(xk , xl ). The dual formulation typically yields a sparse solution in terms of the support vectors
(samples that correspond to the nonzero Lagrange multipliers), and they come from object class boundaries when
nonlinear kernel functions are used as illustrated in Fig. 3.
In our setup, we solve the QP problem (4) for each image
set in the gallery offline and keep only these support vectors
for each set. During testing, we run the same algorithm for
the query set and compute the kernelized convex hull distances by using these reduced image sets that contain only
the essential support vectors. In this way, the amount of
data is greatly reduced without any significant decrease in
the accuracy, and the testing time is greatly shortened since
smaller sized QP problems are solved. We call the proposed
method as Reduced Convex Hull based Image Set Distance
(RCHISD) method.
It should be noted that we can find the center of the kernelized hypersphere model of the c-th class by using the
nonzero α∗ coefficients returned by the QP solver as follows
X
sc =
αk∗ xck .
(5)
k
The corresponding radius is rc = ||xck −sc || for any xck for
which 0 < αk∗ < γ. Therefore, we can also find the most
similar gallery set to the given query by using the distances
between the kernelized hypersphere models. The geometric
distance between two kernelized hyperpsheres, hsc and hsq
29
(characterized by their center and radius), is given as
d(hsc , hsq ) = ||sc − sq || − (rc + rq ),
(6)
where
||sc − sq || =
qP
i,j
αci αcj hxci , xcj i − 2
P
i,k
αci αqk hxci , xqk i +
P
k,l
αqk αql hxqk , xql i.
One needs to use a few support vectors that correspond to
the nonzero Lagrange multipliers to compute the above distance, thus this computation is too fast. In our experiments,
we also used hypersphere models for image set classification and compared these results to the ones obtained by the
kernelized convex hulls. To the best our knowledge, this is
the first time of the use of the hypersphere models for image
set classification.
Figure 4. Support vectors (red circles) returned by SVDD algorithm for different parameters of Gaussian kernel width: (a) kernel
width is set to 1.5, (b) kernel width is set to 2.0, (c) kernel width
is set to 2.5, (d) kernel width is set to 3.5, (e) kernel width is set
to 4.5, (f) kernel width is set to 5.0. Note that less support vectors
are returned as the kernel width is increased.
4. Experiments
Figure 3. Nonlinearly distributed data and the support vectors
(shown with red circles around the data samples) returned by
SVDD using a Gaussian kernel. Support vectors come from the
object boundaries when the Gaussian kernel width is set properly.
In the kernelized case of the proposed method, there are
two parameters: the ceiling parameter (γ) and the Gaussian
kernel width which define the reduced sets. As we mention
in the experimental study section, the results are not very
sensitive to γ, but the Gaussian kernel width highly affects
the number of reduced elements in the set. So, the Gaussian kernel width parameter is important for determining
the size of the reduced image sets. The smaller values of
this parameter return more support vectors whereas larger
values return less support vectors as demonstrated in Fig.
4. In this example, we gradually increase the value of the
Gaussian kernel width from 1.5 to 5.0 and plot the returned
support vectors with the red circles around the data samples.
The more data yields better accuracies but the less data is
faster during testing. When the parameter is adjusted properly, lower amount of data are used to represent the image
sets collected from videos, which results in less amount of
storage and better classification speeds.
In this study, we tested recognition accuracies and testing speeds of the proposed methods on two popular small
sized databases CMU MoBo [7] and Honda/UCSD [10] and
our new larger sized ESOGU Face Videos data set. To allow comparison with the literature on various datasets, we
followed the simple protocol of [2, 21, 18, 19]: the face images detected from video frames were histogram equalized
but no further pre-processing such as alignment or background removal was performed on them. In our database,
we tested both gray-level values and local binary pattern
(LBP) features for classification. For affine hull methods,
subspace dimensions are set by retaining enough leading
eigenvectors to account for 98% of the overall energy in
the eigen-decomposition. For all kernelized methods we
used the Gaussian kernels and the Gaussian kernel width is
determined based on experiments using randomly selected
subsets of image sets. We compared the proposed method
RCHISD (Reduced Convex Hull based Image Set Distance)
to the linear/kernelized affine hull method (AHISD) [2],
linear/kernelized convex hull method (CHISD) [2], Mutual Subspace Method (MSM) [20], SANP [9], Regularized
Nearest Points (RNP) [21], Collaboratively Regularized
Nearest Points(CRNP) [19], Manifold-Manifold Distance
(MMD) [18], and Self-Regularized Nonnegative Adaptive
Distance Metric Learning (SRN-ADML) [12]. In addition
to these methods, we also tested linear/kernelized bounding
hypersphere (HS) models for image set classification.
4.1. Experiments on MoBo Data Set
The MoBo (Motion of Body) data set contains 96 image
sequences of 24 individuals walking on a treadmill. The
images were collected from multiple cameras under four
different walking situations: slow walking, fast walking, in30
cline walking, and carrying a ball. Each image set includes
both frontal and profile views of the subjects faces. We used
LBP feature sets from [2].
As in [2], we randomly select one image set from each
class for the gallery and used the remaining 3 for testing.
This was repeated 10 times and we report averages of the
classification rates over the 10 runs. Table 1 shows the accuracies and testing times for all tested methods. Testing
time shows the amount of time spent to classify a single
test set on the average. We tested the kernelized reduced
convex hull classifiers for different ceiling parameter γ values changing between 0.10 and 1, and all of them returned
same accuracies. So, we conclude that the results are not
very sensitive to this parameter and we fix it to γ = 0.20 for
all of the experiments. The linear/kernelized convex hull
models and SANP achieve the best results and reducing the
image sets samples by using SVDD does not impact the accuracy significantly, but improves the testing time. When
reduced image sets are tested by computing pair-wise distances independently, testing time of a query is approximately 4 times faster compared to using full image sets. The
linear hypersphere method is the worst performing method,
but it is one of the fastest methods. Similarly, the accuracy
of the kernelized hypersphere method is also low compared
to other kernelized methods, but it is the fastest one among
all tested kernelized methods. Regarding the reduced set
size, the total number of face images in all sets is 48789
and it is reduced to 7098 by using SVDD method without
a significant drop in accuracy. Kernel AHISD and SANP
methods are the worst methods in terms of testing time. It
should be noted there is not a significant accuracy difference
between the linear and kernelized versions of all methods
except for HS classifier since the number of the people in
the dataset is small.
4.2. Experiments on Honda/UCSD Data Set
The Honda/UCSD data set [10] was collected for videobased face recognition. It consists of 59 video sequences
involving 20 individuals. Each sequence contains approximately 300-500 frames. Twenty sequences were set aside
for training, leaving the remaining 39 for testing. The
detected faces were resized to gray-scale images and histogram equalized, and the resulting pixel values were used
as features. Table 2 shows the accuracies and testing times
for all tested methods. We tested the kernelized convex hull
classifiers for different ceiling parameter values changing
between 0.10 and 1, and all of them returned the same accuracy of 100%. The kernelized convex hull models achieve
the best results and reducing the image sets samples by
using SVDD does not impact the accuracy but improves
the testing time. When reduced image sets are tested by
computing pair-wise distances independently, testing time
of a query is 2 times faster compared to using full image
sets. The linear hypersphere method is the worst performing
method, but it is one of the fastest methods. Similarly, the
accuracy of the kernelized hypersphere method is also low
compared to other kernelized methods, but it is the fastest
one among all tested kernelized methods.
Table 2. Classification Rates (%) and Testing Times on the
Honda/UCSD Dataset.
Method
Linear AHISD
Linear CHISD
Linear HS
MSM
SANP
RNP
CRNP
SRN-ADML
MMD
Kernel AHISD
Kernel CHISD
Kernel HS
Kernel RCHISD
Table 1. Classification Rates (%) and Testing Times on the MoBo
Dataset.
Method
Linear AHISD
Linear CHISD
Linear HS
MSM
SANP
RNP
CRNP
SRN-ADML
MMD
Kernel AHISD
Kernel CHISD
Kernel HS
Kernel RCHISD
Accuracy
95.3 ± 2.6
98.1 ± 0.9
71.9 ± 4.7
92.4 ± 1.9
98.1 ± 0.9
93.8 ± 2.7
97.4 ± 0.8
95.3 ± 1.6
94.7 ± 2.3
96.4 ± 2.5
98.1 ± 0.9
87.8 ± 2.8
97.3 ± 1.3
Testing Time (sec)
32.0 sec
25.6 sec
0.6 sec
9.2 sec
40.2 sec
11.3 sec
15.8 sec
30.0 sec
10.6 sec
87.3 sec
32.8 sec
5.8 sec
8.3 sec
Accuracy
97.4
97.4
59.0
97.4
97.4
100
100
97.4
100
97.4
100
94.9
100
Testing Time (sec)
1.6 sec
5.1 sec
0.6 sec
2.14 sec
16.7 sec
5.4 sec
2.6 sec
6.18 sec
7.11 sec
14.2 sec
7.6 sec
2.8 sec
3.7 sec
4.3. Experiments on ESOGU-285 Face Videos
Dataset
ESOGU Face Videos dataset includes videos of 285 people captured in two sessions separated by at least three
weeks. In each session, we captured four short videos
with four different scenarios for each person. In the first
scenario, the subjects are asked to make free head movements under normal illumination conditions similar to video
recordings in Honda/UCSD. In the second one, the subjects
pretend to talk on a cell phone with free head movements.
31
The third and the last videos include recordings of free head
movements when the subjects are illuminated from the right
and left, respectively. Some frames from videos are shown
in Fig. 5. We manually cropped the faces using a semiautomatic annotation tool. We used both (40 × 30) graylevel and LBP values as visual features.
We used the image sets captured in the first session as
gallery set and the sets captured in the second session for
testing. The recognition accuracies and testing times are
given in Table 3. We could not implement CRNP because
of memory issues since it requires to operate on matrices
with size n × n, and n = 410251 is the number of frames in
the gallery (In table 2, OOM indicates the “out of memory”
problem). We would like to point out that SANP method
is very slow for gray-level values. However, the best accuracy for LBP values is obtained by SANP followed by
Kernel CHISD and MMD whereas the Kernel CHISD is
the best performing method alone for gray-level features.
The total number of frames in both gallery and query sets
is 764006 and it is reduced to 104716 for LBP and 149520
for gray level features by using kernelized SVDD. Our proposed Kernel RCHISD method, which uses the reduced
sets, achieves similar results to Kernel CHISD, but it is approximately 10 times faster for LBP and 6 times faster for
gray-level features. The linear hypersphere is again one of
the fastest methods but it has the worst recognition accuracy.
As opposed to the results on small sized datasets, there
is a big difference between classification accuracies of linear methods and their kernelized counterparts especially for
gray-level values. More precisely, both kernelized convex
hull and affine hull models achieve much higher accuracies
than linear methods for gray-level values as expected. In
a similar manner, the kernelized affine hull model significantly outperforms the linear affine hull model for LBP
features, but there is not a significant performance difference between the linear and kernelized convex hull models since convex hull is a much tighter model compared to
affine hulls (however we should expect a difference if we
further increase the number of people). These results also
indicate that LBP features are more discriminative features
compared to gray-level values and they yield to more compact face manifolds.
5. Conclusion
In this work, we developed image set based classification
methods which use the reduced image samples in each set
to lessen the required storage space and speed-up the testing process for large-scale face recognition applications. To
this end, we showed that we need to keep only the image
sets samples that form the image set boundaries when kernelized convex hulls were used to approximate image sets.
SVDD method, which finds a compact hypersphere that fits
Figure 5. Some frames selected from videos captured in each session. The first row shows the recording of free head movements
without illumination, the second row shows the recording of phone
call, and the third and the last rows show the recordings of free
head movements when the subjects are illuminated from the right
and left directions.
the image set samples best, is used to determine the samples
forming image set boundaries. Experimental results verify
that reducing image set samples via SVDD greatly improves
the testing time without a significant drop in accuracy.
Another contribution of the study is the investigation of
the suitability of the hypersphere models for approximating image sets. Experiments show that hypersphere models yield to lower accuracies compared to affine or convex
hull models, but they are extremely faster. Therefore, these
models can be used to return the nearest approximate candidates of the gallery sets to the given query set quickly, and
then a more accurate similarity search can be done among
the returned candidate sets by using affine/convex hull approximations.
Lastly, it was shown that accuracies of methods using
linear models to approximate image sets drop as the number of classes is increased. Especially it has been observed
a significant accuracy drop when looser linear models such
as affine hulls and linear subspaces are used to approximate
image sets. We also verified that some recently proposed
collaborative methods and methods using sparse models
cannot be applied to large-scale data (even to moderate size
data) due to the memory or speed problems. It should also
be noted that the proposed methods are not limited with face
recognition. They can be used in other visual recognition
problems where each example is represented by a set of images.
Acknowledgments: This work was funded by the Scientific and Technological Research Council of Turkey
(TUBİTAK) under Grant numbers EEEAG-114E014 and
EEEAG-113E118.
32
Table 3. Classification Rates (%) and Testing Times on the ESOGU-285 Face Videos Dataset.
Grayscale Values
LBP Features
Methods
Accuracy Testing Time Accuracy Testing Time
Linear AHISD
44.30
22.0 sec
66.75
180.0 sec
Linear CHISD
55.09
179.6 sec
76.58
390.1 sec
Linear HS
29.04
3.9 sec
39.47
0.8 sec
MSM
50.08
2.3 sec
69.56
5.1 sec
SANP
51.92
29771 sec
79.12
564.6 sec
RNP
46.66
1731.7 sec
51.92
2205.3 sec
CRNP
OOM
OOM
SRN-ADML
45.35
364.6 sec
68.42
380.2 sec
MMD
52.02
7.2 sec
77.63
30.4 sec
Kernel AHISD
62.11
2015.0 sec
76.05
4369.0 sec
Kernel CHISD
62.19
233.3 sec
77.63
480.4 sec
Kernel HS
43.68
61.9 sec
49.39
12.9 sec
Kernel RCHISD
61.23
39.7 sec
75.36
46.1 sec
References
[1] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and
T. Darrell. Face recognition with image sets using manifold
density divergence. In CVPR, 2005. 1
[2] H. Cevikalp and B. Triggs. Face recognition based on image
sets. In CVPR, 2010. 1, 2, 3, 4, 5, 6
[3] S. Chen, C. Sanderson, M. T. Harandi, and B. C. Lovell. Improved image set classification via joint sparse approximated
nearest subspaces. In CVPR, 2013. 2
[4] Y.-C. Chen, V. M. Patel, S. Shekhar, R. Chellappa, and P. J.
Phillips. Video-based face recognition via joint sparse representation. In Automatic Face and Gesture Recognition Conference, 2013. 2
[5] Z. Cui, H. Chang, S. Shan, B. Ma, and X. Chen. Joint sparse
representation for video-based face recognition. Neurocomputing, 135:306–312, 2014. 2
[6] W. Fan and D.-Y. Yeung. Locally linear models on face appearance manifolds with application to dual-subspace based
classification. In CVPR, 2006. 1, 2
[7] R. Gross and J. Shi. The cmu motion of body (mobo)
database. Technical report, Robotics Institute, Carnegie Mellon University, 2001. 5
[8] A. Hadid and M. Pietikainen. From still image to videobased face recognition: an experimental analysis. In International Conference on Automatic Face and Gesture Recognition, 2004. 1, 2
[9] Y. Hu, A. S. Mian, , and R. Owens. Face recognition using sparse approximated nearest points between image sets.
IEEE Transactions on PAMI, 34(3):1992–2004, 2012. 2, 5
[10] K. C. Lee, J. Mo, M. H. Yang, and D. Kriegman. Videobased face recognition using probabilistic appearance manifolds. In CVPR, 2003. 2, 5, 6
[11] V. P. M. Kim, S. Kumar and H. Rowley. Face tracking and
recognition with visual constraints in real-world videos. In
IEEE Conf. Comput. Vis. Pattern Recognit., 2008. 2
[12] A. Mian, Y. Hu, R. Hartley, and R. Owens. Image set based
face recognition using self-regularized non-negative coding
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
33
and adaptive distance metric learning. IEEE Transactions on
Image Processing, 22:5252–5262, 2013. 5
S. Roweis and L. Saul. Nonlinear dimensionality reduction
by locally linear embedding. Science, 290:2319–2323, 2000.
2
B. Schölkopf, J. Platt, and R. Williamson. Estimating the
support of a high-dimensional distribution. Neural Computation, 13:1443–1471, 2001. 4
G. Shakhnarovich, J. W. Fisher, and T. Darrell. Face recognition from long-term observations. In ECCV, pages 851–868,
2002. 1
D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54:45–66, 2004. 4
R. Wang and X. Chen. Manifold discriminant analysis. In
IEEE Conf. Comput. Vis. Pattern Recognit., 2009. 2
R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold
distance with application to face recognition based on image
sets. In CVPR, 2008. 1, 2, 5
Y. Wu, M. Minoh, and M. Mukunoki. Collaboratively regularized nearest points for set based recognition. In BMVC,
2013. 2, 3, 5
O. Yamaguchi, K. Fukui, and K.-I. Maeda. Face recognition
using temporal image sequence. In International Symposium
of Robotics Research, pages 318–323, 1998. 1, 5
M. Yang, P. Zhu, L. V. Gool, and L. Zhang. Face recognition
based on regularized nearest points between image sets. In
Automatic Face and Gesture Recognition Conference, 2013.
2, 5
P. Zhu, W. Zuo, L. Zhang, S. C.-K. Shiu, and D. Zhang. Image set-based collaborative representation for face recognition. IEEE Transactions on Information Forensics and Security, 9:1120–1132, 2014. 2, 3