Multimodal Age and Gender Classification Using Ear and Profile Face Images
A common approach to transfer learning is to perform fine-tuning. In this approach, the parameters of the pretrained model are used to initialize the CNN models' weights instead of using random initial values. These model weights are then further updated using the target dataset. During fine-tuning, depending on the similarity between the target dataset domain and the pretrained dataset domain, and on the size of the target dataset, some layers' weights can be frozen or all layers' weights can be updated.

Since it has been shown [18] that transferring a pretrained model from a closer domain leads to better performance, we have benefited from a large-scale ear dataset, named the Multi-PIE ear dataset [6]. As the name implies, this dataset was constructed by running an automatic ear detector on the profile and close-to-profile face images of the Multi-PIE face dataset [8]. This way, 17183 ear images of 205 subjects were detected [6]. In this work, we have extended the Multi-PIE ear dataset [6] and named it the Multi-PIE extended-ear dataset. The coordinates of the detected ear images of the Multi-PIE ear dataset [6] have been shared (A). We have used these ear coordinates and, additionally, we have executed our improved ear detection algorithm on the other images of the Multi-PIE dataset that were not listed in the Multi-PIE ear dataset. This way, we have acquired new ear images in addition to the ones in the Multi-PIE ear dataset [6] and reached 39185 ear images. The coordinates of this extended-ear dataset will be shared (B) as well. Compared to the Multi-PIE ear dataset, this is a significant increase in dataset size. For training, we have first initialized the CNN models with the parameters of pretrained CNN models that were trained on the ImageNet dataset. Then, by utilizing the Multi-PIE extended-ear dataset, we have updated the CNN models and adapted them to the ear domain. Afterwards, we have benefited from these adapted pretrained models and further updated them by training on the target ear datasets. Experimental results have validated that this learning strategy, that is, performing domain adaptation via an intermediate fine-tuning step and thereby using CNN models adapted to the ear domain instead of directly using CNN models trained on a generic image classification dataset, helped to improve the performance of both the age and gender classification tasks.

(A) https://github.com/iremeyiokur/multipie_ear_dataset
(B) https://github.com/iremeyiokur/multipie_extended_ear_dataset
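As a rough illustration, the staged fine-tuning described above (ImageNet pretraining, adaptation on the Multi-PIE extended-ear dataset, then fine-tuning on the target task) can be sketched in PyTorch as below. The Adam optimizer, epoch counts, class counts, and the placeholder data loaders are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def fine_tune(model, loader, num_classes, epochs=25, lr=1e-4):
    """Replace the classifier head and update all weights on the given data."""
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # ResNet-style head
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Placeholder loaders standing in for the real ear-domain and target datasets.
ear_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                      torch.randint(0, 10, (8,))), batch_size=4)
target_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                         torch.randint(0, 5, (8,))), batch_size=4)

# Stage 1: start from ImageNet-pretrained weights.
model = models.resnet50(pretrained=True)
# Stage 2: adapt the model to the ear domain on the Multi-PIE extended-ear dataset.
model = fine_tune(model, ear_loader, num_classes=10, epochs=1)
# Stage 3: further fine-tune on the target task, e.g. the five age groups.
model = fine_tune(model, target_loader, num_classes=5, epochs=1)
```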
Figure 3. Visualization of the employed data fusion approaches. (a) shows the original ear and profile face images, (b) shows their side-by-side concatenation, which is spatial fusion, and (c) shows the average of the profile face and ear images, that is, intensity fusion.
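A minimal NumPy/OpenCV sketch of the two pixel-level operations shown in Figure 3 follows; the fixed 224x224 output size and the equal-weight average are assumptions made for illustration.

```python
import cv2
import numpy as np

def spatial_fusion(profile, ear, out_size=224):
    """Side-by-side concatenation of profile face and ear images (Figure 3b)."""
    half_w = out_size // 2
    left = cv2.resize(profile, (half_w, out_size))   # OpenCV expects (width, height)
    right = cv2.resize(ear, (half_w, out_size))
    return np.concatenate([left, right], axis=1)      # H x W x C, with W = out_size

def intensity_fusion(profile, ear, out_size=224):
    """Pixel-wise average of profile face and ear images (Figure 3c)."""
    p = cv2.resize(profile, (out_size, out_size)).astype(np.float32)
    e = cv2.resize(ear, (out_size, out_size)).astype(np.float32)
    return ((p + e) / 2.0).astype(np.uint8)
```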
Method    Formula
Basic     c = s[0]
d2s       c = s[0] - s[1]
d2sr      c = 1 - (s[1] / s[0])
avg-diff  c = (1 / (M-1)) * sum_{i=1}^{M-1} (s[0] - s[i])
diff1     c = sum_{i=1}^{M-1} (s[i-1] - s[i]) / i

Table 1. Confidence score calculation methods for score-based fusion. In the formulas, c represents the confidence score of the corresponding CNN model and s is an array that contains the class probabilities sorted from high to low, with M classes in total.
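The confidence measures of Table 1 translate directly into code. The sketch below assumes s is a list of class probabilities already sorted in descending order, as in the table's definition; how the resulting per-model confidences are then combined across the two models is not shown here.

```python
def basic(s):
    return s[0]

def d2s(s):
    return s[0] - s[1]

def d2sr(s):
    return 1.0 - s[1] / s[0]

def avg_diff(s):
    m = len(s)
    return sum(s[0] - s[i] for i in range(1, m)) / (m - 1)

def diff1(s):
    return sum((s[i - 1] - s[i]) / i for i in range(1, len(s)))

# Example: softmax output of one model for the five age groups, sorted high to low.
probs = [0.52, 0.23, 0.13, 0.08, 0.04]
print(d2s(probs), avg_diff(probs))
```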
Figure 4. Sample images from the FERET dataset [20]. The first column contains the original images, the second column contains sample detected ear images, and the last column contains detected profile face images.

4. Experimental Results

In this section, we provide information about the used datasets, the experimental setup and implementation details, and our results, respectively.
4.1. Datasets
Multi-PIE Extended-ear Dataset contains 39185 ear images. This dataset was constructed by running an automatic ear detector on the profile and close-to-profile face images of the Multi-PIE face dataset [8]. As explained in Section 3.4, this dataset was used to adapt the pretrained CNN models to the ear domain.

FERET [20] is one of the most well-known face datasets. Sample images from FERET are shown in Figure 4. For our experiments, we used both ear and profile face images. We utilized the dlib library [15] to detect profile faces. In order to detect the ear regions, we ran an off-the-shelf ear detector [2]. This way, we obtained 1397 ear images.

UND-F [31] contains both 2D and 3D ear images. In this work, we only used the 2D ear images. There are 942 profile face images that belong to 302 different subjects. We ran the aforementioned off-the-shelf ear detector [2] to crop the ear regions from the profile face images, and we utilized the dlib library [15] for detecting the profile faces. This dataset was employed to benchmark gender classification accuracy, since it was also used in previous work on gender classification from ear images [16, 33].

UND-J2 [31] is another dataset that was utilized to benchmark gender classification accuracy in previous work [14, 16]. It includes 2430 profile face images. As in the UND-F dataset, we detected the ear regions by applying the off-the-shelf ear detector [2] and employed dlib [15] to detect the profile faces.
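For all three datasets the same preprocessing idea applies: profile faces are detected with dlib and ears with the off-the-shelf detector of [2]. The following is only a rough sketch of what the face-cropping step could look like; it assumes dlib's CNN face detector with the publicly available mmod_human_face_detector.dat weights, since the exact dlib detector configuration and the interface of the ear detector [2] are not given in this excerpt.

```python
import dlib

# Assumed setup: dlib's CNN face detector, which also handles non-frontal views.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def crop_profile_faces(image_path):
    """Return cropped face regions detected in one profile image."""
    img = dlib.load_rgb_image(image_path)
    crops = []
    for det in detector(img, 1):          # 1 = upsample the image once before detection
        r = det.rect
        crops.append(img[r.top():r.bottom(), r.left():r.right()])
    return crops

# Ear regions would be cropped analogously, using the ear detector of [2].
```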
Model      Data     Age Acc.   Gender Acc.
VGG-16     Ear      60.97%     97.56%
VGG-16     Profile  65.73%     95.81%
ResNet-50  Ear      60.97%     98.00%
ResNet-50  Profile  62.37%     94.05%

Table 2. Unimodal age and gender classification results on the FERET dataset [20].

Model      Fusion   Age Acc.   Gender Acc.
VGG-16     A-1      61.83%     92.33%
ResNet-50  A-1      57.49%     91.63%
VGG-16     A-2      67.59%     99.11%
ResNet-50  A-2      62.71%     99.11%
VGG-16     A-3      62.05%     93.03%
ResNet-50  A-3      58.53%     92.33%
VGG-16     B        67.28%     98.16%
ResNet-50  B        66.44%     97.56%
VGG-16     C-1      63.76%     97.90%
ResNet-50  C-1      63.06%     98.00%
VGG-16     C-2      63.06%     97.90%
ResNet-50  C-2      62.02%     98.00%
VGG-16     C-3      63.06%     97.90%
ResNet-50  C-3      62.02%     98.00%
VGG-16     C-4      63.76%     97.90%
ResNet-50  C-4      63.06%     98.00%
VGG-16     C-5      63.76%     97.90%
ResNet-50  C-5      63.06%     98.00%

Table 3. Age and gender classification results of the three different fusion methods explained in Section 3.3. In the Fusion column, A, B, and C correspond to the data, feature, and score fusion methods, respectively. For method A, A-1, A-2, and A-3 denote channel fusion, spatial fusion, and intensity fusion, respectively. For method C, C-1 to C-5 denote the different confidence score calculation methods presented in Table 1: C-1 uses the first formula of Table 1, C-2 the second, and so on.

4.2. Implementation Details

In order to obtain domain-adapted pretrained deep models, we have performed fine-tuning on the Multi-PIE extended-ear dataset with the VGG-16 [26] and ResNet-50 [9] CNN models. For this, we have split the Multi-PIE extended-ear dataset into training, validation, and test sets: 80% of the images have been assigned to the training set, and the remaining 20% has been divided evenly between the validation and test sets. The same training, validation, and test percentages are also applied in the age and gender classification experiments on the FERET, UND-F, and UND-J2 datasets.
For the age classification task, we have five different age groups based on the following age ranges: 18-28, 29-38, 39-48, 49-58, and 59-68+. These classes have been selected according to the previous work on age classification from ear images [30]; according to [30], the changes in the ear are observable between these age groups. The age classes include 419, 435, 316, 169, and 58 ear and profile face images, respectively, and the numbers of images for training, validation, and test are 1110, 144, and 143, respectively. Note that we have split the data into train-validation-test sets with respect to the data distribution per class.
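The per-class (stratified) 80/10/10 split described above can be reproduced, for example, with scikit-learn. The ratios follow the text, while the placeholder data and the use of train_test_split are illustrative choices.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 1397 FERET samples and their age-group
# labels (0-4); in practice these would be the image paths and annotations.
image_paths = [f"img_{i:04d}.png" for i in range(1397)]
age_labels = [i % 5 for i in range(1397)]

# 80% training, then the remaining 20% split evenly into validation and test;
# stratifying keeps the per-class distribution in every split.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, age_labels, test_size=0.2, stratify=age_labels, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)
```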
In all experiments, we have set the learning rate to 0.0001 and dropped it by a factor of 0.1 every 25 epochs. We have also used L2 regularization with a coefficient of 0.001 and center loss [29] with a coefficient of 0.1, as mentioned in Section 3.1. Moreover, to prevent overfitting, we have applied dropout with a 75% drop probability during training; no neurons are dropped in the test phase. While we have used a batch size of 32 for the unimodal experiments, we have chosen a batch size of 16 for the multimodal experiments due to memory constraints.
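In PyTorch terms, these hyperparameters roughly correspond to the configuration sketched below. Only the numeric values come from the text; the choice of SGD, the stand-in model, and the exact way the center loss is combined with the softmax loss are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model: in the paper this is a VGG-16 or ResNet-50 backbone; here only the
# dropout (p = 0.75, active in training mode only) before the classifier matters.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.75), nn.Linear(3 * 224 * 224, 5))

# Learning rate 1e-4 with L2 regularization (weight decay) of 1e-3.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-3)

# Drop the learning rate by a factor of 0.1 every 25 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)

cross_entropy = nn.CrossEntropyLoss()

def total_loss(logits, features, labels, center_loss_fn):
    """Softmax (cross-entropy) loss combined with center loss [29], weighted by 0.1."""
    return cross_entropy(logits, labels) + 0.1 * center_loss_fn(features, labels)
```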
4.3. Unimodal Results

In this section, we present the unimodal experiments, which are based on profile face and ear images separately. We have fine-tuned the VGG-16 [26] and ResNet-50 [9] CNN models for the age and gender classification tasks on the FERET dataset [20]. The obtained results are presented in Table 2. In this table, the first column contains the used CNN model, the second column indicates the data, which can be ear or profile face images, and the last two columns show the test accuracies for age and gender classification.

According to Table 2, we have achieved the best age classification result, 65.73% accuracy, with the VGG-16 model using profile face images. While the ResNet-50 performance on profile face images is slightly lower than that of VGG-16, both models have achieved similar accuracy with ear images. The best gender classification result, 98.00%, is obtained with ear images using the ResNet-50 CNN model, and the VGG-16 result is very close to it. Although age classification performance is better with profile face images, ear images are found to be more useful than profile face images for gender classification. All these results indicate that ear and profile face images contain useful cues for age and gender classification. However, age classification needs further investigation because of its relatively low performance compared to that of gender classification.

4.4. Multimodal Results

In this section, we present the multimodal and multi-task age and gender classification experiments using the VGG-16 and ResNet-50 CNN architectures. Table 3 shows the age and gender classification results based on the different fusion methods. The first column contains the name of the employed CNN model and the second column indicates the applied fusion method.

All these results indicate that the spatial fusion and feature-based fusion methods are effective multimodal fusion strategies for extracting soft biometric traits from profile face and ear images.
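Of the three strategies in Table 3, feature-based fusion (method B) combines the representations of the two modalities before classification. The sketch below is one plausible way to wire this up with two ResNet-50 backbones; the fused feature dimensionality and the placement of the age and gender heads are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureFusionNet(nn.Module):
    """Two-stream network: one backbone per modality, fused by concatenation."""

    def __init__(self, num_age_classes=5, num_gender_classes=2):
        super().__init__()
        self.ear_net = models.resnet50(pretrained=True)
        self.face_net = models.resnet50(pretrained=True)
        feat_dim = self.ear_net.fc.in_features          # 2048 for ResNet-50
        self.ear_net.fc = nn.Identity()                  # keep only the features
        self.face_net.fc = nn.Identity()
        # Separate heads for the two tasks on top of the concatenated features.
        self.age_head = nn.Linear(2 * feat_dim, num_age_classes)
        self.gender_head = nn.Linear(2 * feat_dim, num_gender_classes)

    def forward(self, ear_img, face_img):
        fused = torch.cat([self.ear_net(ear_img), self.face_net(face_img)], dim=1)
        return self.age_head(fused), self.gender_head(fused)

# Example forward pass with dummy inputs.
net = FeatureFusionNet()
age_logits, gender_logits = net(torch.randn(2, 3, 224, 224),
                                torch.randn(2, 3, 224, 224))
```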
Method/Model           Dataset   Data     Accuracy
Gender Classification
Distance+KNN [7]       Internal  Ear      90.42%
GoogLeNet [30]         Internal  Ear      94.00%
BoF+SVM [33]           UND-F     Ear      91.78%
HIS+SVM [16]           UND-F     Ear      92.94%
HIS+SVM [16]           UND-J2    Ear      91.92%
Gabor+Voting [14]      UND-J2    Ear      89.49%
BoF+SVM [33]           UND-F     Profile  95.43%
BoF+SVM [33]           UND-F     Multi    97.65%
Ours                   FERET     Multi    99.11%
Ours                   UND-F     Multi    100%
Ours                   UND-J2    Multi    99.79%
Age Classification
Yaman et al. [30]      Internal  Ear      52.00%
Yaman et al.* [30]     FERET     Ear      58.53%
Ours                   FERET     Multi    67.59%

Table 4. Comparison of the proposed methods with previous works. The first part contains gender classification results and the second part presents age classification results. According to the results, we have achieved state-of-the-art results for both age and gender classification. We have implemented the method proposed in [30] and, to have a fair comparison, tested it also on the FERET dataset [20]; the result of this method on the FERET dataset is marked with the * symbol.

4.5. Comparison with State-of-the-art

In Table 4, we compare the age and gender classification performance of our proposed method with previous works. We have achieved state-of-the-art performance on both tasks; however, the dataset used in this work is not the same as the one used in the previous work [30] for the age classification task. Because of that, we have implemented the previous work on age classification [30], followed the same strategy, and presented the obtained result with the * symbol in the table. Besides, we have conducted gender classification experiments on the UND-F and UND-J2 datasets [31], which are benchmark datasets for gender classification from ear images. As a result, we have obtained state-of-the-art results on both datasets with our spatial fusion method and feature-based fusion method, achieving 100% accuracy on the UND-F dataset [31] and 99.79% accuracy on the UND-J2 dataset [31]. For age classification, we have obtained 67.59% accuracy with spatial fusion, which is around 9% better than the accuracy achieved by the method presented in [30].