Abstract: Hand gesture recognition is one of the most effective modes of interaction between humans
and computers due to being highly flexible and user-friendly. A real-time hand gesture recognition
system should aim to develop a user-independent interface with high recognition performance.
Nowadays, convolutional neural networks (CNNs) show high recognition rates in image classification
problems. Due to the unavailability of large numbers of labeled static hand gesture images, it is
a challenging task to train deep CNNs such as AlexNet, VGG-16 and ResNet from scratch.
Therefore, inspired by CNN performance, an end-to-end fine-tuning method of a pre-trained CNN
model with score-level fusion technique is proposed here to recognize hand gestures in a dataset
with a low number of gesture images. The effectiveness of the proposed technique is evaluated using
leave-one-subject-out cross-validation (LOO CV) and regular CV tests on two benchmark datasets.
A real-time American Sign Language (ASL) recognition system is developed and tested using the
proposed technique.

Keywords: ASL; fine-tuning; hand gesture recognition; pre-trained CNN; real-time gesture recognition; score fusion

Citation: Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022, 22, 706. https://doi.org/10.3390/s22030706

1. Introduction
color images, assuming the hand region contains the majority area of the image frame.
In such cases, the segmentation of the hand is difficult if the hand is surrounded by the
human face or body and the background color is similar to human skin color [5]. The depth
threshold technique is applied on the depth image of a Kinect sensor to segment the hand
region from the background [7,14]. In these cases, the hand is assumed to be the closest object
in front of the Kinect sensor [9]. Moreover, the depth image is free from background variations
and human noise [15]. After that, several feature extraction techniques are applied to the
segmented hand to extract the semantic information for the input gesture image. Then, the
gestures are recognized using different classifiers.
In the literature, many researchers have applied hand-crafted feature extraction techniques,
such as shape descriptors and spatiotemporal features [16], for the recognition of hand gestures.
However, these features perform well only in a specific environment, and their performance
degrades under varied dataset conditions [17]. Nowadays, deep learning techniques are used to
overcome these limitations; in particular, convolutional neural network (CNN) [17] and stacked
denoising autoencoder [18] architectures have been used. However, training a CNN from scratch
is a challenging task for the following reasons [19]: (1) a huge number of labeled image samples
is required to train the CNN effectively; (2) high memory resources are required to train the
CNN, otherwise training remains slow; and (3) the training of the CNN sometimes suffers from
convergence issues, which require repetitive adjustment of the CNN layers and learning
hyperparameters. Therefore, the development of a CNN-based model from scratch is tedious and
time-consuming.
To overcome this issue, a transfer learning technique is adopted for datasets with few image
samples. In this technique, pre-trained CNN models such as AlexNet [20], VGG [21],
GoogLeNet [22] and ResNet [23], which have been trained on large labeled datasets,
are fine-tuned on the target dataset.
Therefore, an efficient and accurate hand gesture recognition model is highly essential
for the recognition of hand gestures in real-time applications. To develop such a recognition
model, a score-level fusion technique between two fine-tuned CNNs, AlexNet [20]
and VGG-16 [21], is proposed in this work. The contributions of this work are as follows:
• An end-to-end fine-tuning of the deep CNNs such as AlexNet and VGG-16 is per-
formed on the training gesture samples of the target dataset. Then, the score-level
fusion technique is applied between the output scores of the fine-tuned deep CNNs.
• The recognition accuracy is evaluated on two publicly available benchmark American
Sign Language (ASL) datasets with a large number of gesture classes.
• A real-time gesture recognition system is developed using the proposed technique
and tested in subject-independent mode.
The rest of the paper is organized as follows. In Section 2, recent works on hand
gesture recognition techniques are reviewed. The methodology of the proposed work on
pre-trained CNNs is discussed in Section 3. Section 4 demonstrates the standard dataset
and validation techniques used to evaluate the performance of the proposed technique.
The detailed experimental results and analysis are presented in Section 5, whereas real-time
implementation of the proposed technique is presented in Section 6. Finally, the paper is
concluded in Section 7.
2. Related Works
In this section, a detailed literature survey of recent techniques for vision-based hand
gesture recognition is presented. The study includes the recognition of hand gestures based
on RGB cameras and depth sensors using machine learning and deep learning techniques.
3. Proposed Methodology
An overview of the proposed hand gesture recognition system is shown in Figure 1.
As shown in the figure, the recognition of static hand gesture images is achieved by the
following steps: data acquisition, pre-processing, and recognition of hand gestures using the
proposed technique.
[Figure 1 block diagram: the input depth map is passed through hand segmentation and filtering, followed by depth colorization; the colorized image is fed to the fine-tuned AlexNet and VGG-16 networks, whose normalized output scores (Score 1 and Score 2) are combined by score fusion, and a decision step outputs the predicted class.]
Figure 1. The proposed framework for the recognition of static hand gesture images.
The step-wise operation details of static hand gesture recognition are as follows:
Table 1. Analysis of different sensors used for the recognition of hand gestures.
3.2. Preprocessing
The objective of this step is to segment the hand region from the hand gesture image
frame and to resize it into the pre-trained CNN’s input image size. The color and depth
map images are obtained from the Kinect depth camera, as shown in Figure 2. Of the two
inputs, only the depth map image is considered for the recognition of static hand gestures.
Depth thresholding is used for segmentation of the hand region from the depth map. An
empirically determined value of 10 cm [14] is chosen as a depth threshold value to segment
the hand from the background as shown in Figure 2c. The maximum-area-based filtering
technique is used to find the hand region and remove the noise section of the segmented
image, shown as a bounding box in Figure 2c. Following this, the bounding box region is
cropped from the segmented image. Both pre-trained CNNs operate with three-channel input
images. Therefore, the cropped hand gesture image is first normalized to a single-channel
image in the range [0, 255] using (1):

$$
D(x,y) =
\begin{cases}
\dfrac{\max(D) - D(x,y)}{\max(D) - \min(D)} \times 255, & \text{if } D(x,y) \neq 0 \\[4pt]
0, & \text{if } D(x,y) = 0
\end{cases}
\tag{1}
$$

where D denotes the depth values in the depth map image, (x, y) are the pixel indices in the
depth map, and max(D) and min(D) are the maximum and minimum depth values in the depth
map. Conversion from a single channel to three channels is performed by applying a jet color
map [31] to the single-channel cropped hand image. The hand-segmented image is then resized
according to the input image size of the pre-trained CNNs AlexNet and VGG-16. Therefore,
all images in the dataset are resized to a resolution of 227 × 227 × 3 for fine-tuning the
pre-trained AlexNet, and to 224 × 224 × 3 for fine-tuning the pre-trained VGG-16.
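A minimal sketch of this preprocessing step is given below, assuming OpenCV and NumPy; the function name, the interpretation of the 10 cm threshold as a band beyond the closest depth value, and the use of cv2.applyColorMap for jet colorization are illustrative assumptions rather than the authors' exact implementation.

```python
import cv2
import numpy as np

def preprocess_depth_frame(depth_mm, input_size):
    """Segment the hand from a depth map (values in mm) and prepare a
    three-channel image for a pre-trained CNN. Illustrative sketch only."""
    # Depth thresholding: keep pixels within 10 cm of the closest point,
    # assuming the hand is the object nearest to the sensor.
    nearest = depth_mm[depth_mm > 0].min()
    mask = ((depth_mm > 0) & (depth_mm < nearest + 100)).astype(np.uint8)

    # Maximum-area filtering: keep only the largest connected component.
    _, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    mask = (labels == largest).astype(np.uint8)

    # Crop the bounding box of the hand region.
    x, y, w, h = cv2.boundingRect(mask)
    hand = np.where(mask > 0, depth_mm, 0)[y:y + h, x:x + w].astype(np.float32)

    # Normalize nonzero depths to [0, 255] as in Equation (1).
    nz = hand > 0
    d_max, d_min = hand[nz].max(), hand[nz].min()
    norm = np.zeros(hand.shape, dtype=np.uint8)
    norm[nz] = ((d_max - hand[nz]) / (d_max - d_min + 1e-6) * 255).astype(np.uint8)

    # Jet colorization to obtain a three-channel image, then resize.
    colored = cv2.applyColorMap(norm, cv2.COLORMAP_JET)
    return cv2.resize(colored, input_size)  # (227, 227) for AlexNet, (224, 224) for VGG-16
```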
Figure 2. Simulation result of the preprocessing step on the HUST-ASL dataset: (a) Color image of the
RGB-D pair. (b) Depth map of the corresponding color image. (c) Localization of the hand from the
depth map using depth thresholding and noise removal. (d) Hand-segmented image resized
according to the pre-trained CNN input size.
The end-to-end fine-tuning of the pre-trained CNNs is performed as shown in Figure 3. As shown
in the figure, the fine-tuning process of the pre-trained AlexNet is carried out on the HUST-ASL
dataset. In this process, the last fully connected layer of the pre-trained AlexNet is replaced
with a layer of 34 nodes, equal to the number of classes in the dataset. Then, the model is
fine-tuned according to the hyperparameter settings.
[Figure 3 diagram: the parameters of the ImageNet-trained AlexNet are transferred; the fully connected layers FC6 and FC7 (4096 nodes each) are retained, and the 1000-node FC8 layer is replaced with a 34-node layer matching the target gesture classes.]
Figure 3. Fine-tuning process using pre-trained AlexNet on target hand gesture dataset.
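A hedged sketch of this fine-tuning step is shown below using PyTorch/torchvision; the framework, the hyperparameter values (epochs, learning rate, momentum) and the data-loader argument are assumptions for illustration, not the authors' reported configuration. Both the convolutional and fully connected layers remain trainable, matching the end-to-end fine-tuning described above.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 34  # number of gesture classes in the HUST-ASL dataset

def build_finetune_model(arch="alexnet", num_classes=NUM_CLASSES):
    """Load an ImageNet pre-trained CNN and replace its last fully connected
    layer (1000 nodes) with a new layer sized to the target gesture classes."""
    if arch == "alexnet":
        model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    else:  # "vgg16"
        model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model

def finetune(model, train_loader, epochs=10, lr=1e-4, device="cuda"):
    """End-to-end fine-tuning: all layers (weights and biases) are updated."""
    model = model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```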
3.4. Normalization
In general, normalization of the output score is performed to decrease the score
variabilities among the different models and to put both models’ score values on the same
scale. Therefore, the output scores from the two fine-tuned CNNs are put into the interval
[0, 1] using the min-max normalization technique [35]. The normalized score of s (s ∈ S) is
denoted as s′ and is calculated using (2):

$$
s' = \frac{s - \min(S)}{\max(S) - \min(S)}
\tag{2}
$$

where S is the set of raw output scores obtained from a fine-tuned CNN model, and min(S)
and max(S) are the minimum and maximum values in S, respectively.
The notations S1 and S2 denote the normalized score vectors of the fine-tuned AlexNet and
VGG-16 models, respectively. A weight value w, searched over the interval [0, 1] with a
grid-search algorithm [36], is assigned to the score vectors before fusion. The optimal
weight value is found to be 0.5 for both datasets using this search.
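The normalization and fusion steps can be sketched as follows; the weighted-sum form of the fusion, the grid step size, and the use of a validation split to select w are assumptions consistent with, but not identical to, the description above.

```python
import numpy as np

def min_max_normalize(scores):
    """Min-max normalization of a raw score vector to [0, 1], as in Equation (2)."""
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def fuse_scores(s1, s2, w):
    """Weighted score-level fusion of two normalized score vectors (assumed
    weighted-sum form): w * AlexNet score + (1 - w) * VGG-16 score."""
    return w * min_max_normalize(s1) + (1.0 - w) * min_max_normalize(s2)

def grid_search_weight(val_s1, val_s2, val_labels, step=0.05):
    """Pick the fusion weight in [0, 1] that maximizes validation accuracy."""
    best_w, best_acc = 0.0, -1.0
    for w in np.arange(0.0, 1.0 + step, step):
        preds = [np.argmax(fuse_scores(a, v, w)) for a, v in zip(val_s1, val_s2)]
        acc = np.mean(np.array(preds) == np.array(val_labels))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w

# The predicted class is the argmax of the fused score vector, e.g.:
# predicted = np.argmax(fuse_scores(alexnet_scores, vgg16_scores, w=0.5))
```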
4. Experimental Evaluation
4.1. Benchmark Datasets
The effectiveness of the proposed technique is evaluated using two publicly available
benchmark static hand gesture datasets. Detailed information on the datasets is provided
in the following subsections.
In the regular CV test, gesture samples from the same subject can appear in both the
training and testing processes. Hence, this CV test is user-biased. Therefore, the model
performance using the regular CV test is higher than with the LOO CV test. The confusion
matrices of the test gesture samples in the MU dataset and HUST dataset using the LOO
CV test are shown in Figures 5 and 6, respectively. The most confused gesture poses are '6'
and 'w' in the MU dataset. A total of 52.9% of gesture pose '6' samples are misclassified as
gesture pose 'w', and 48.6% of gesture pose 'w' samples are misclassified as gesture pose '6',
as shown in Figure 5. The visual similarity between poses '6' and 'w' is shown in Figure 7a.
The position of the thumb in these poses is difficult to distinguish even with the human eye.
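For reference, the leave-one-subject-out protocol can be expressed compactly; the sketch below uses scikit-learn's LeaveOneGroupOut with per-sample subject identifiers and a user-supplied train_and_score callback, all of which are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loo_cv_accuracy(samples, labels, subjects, train_and_score):
    """Leave-one-subject-out CV: each subject's gestures are held out in turn,
    and the model is trained on the remaining subjects only.

    `train_and_score(train_idx, test_idx)` is assumed to fine-tune the CNNs on
    the training indices and return the accuracy on the held-out subject."""
    logo = LeaveOneGroupOut()
    accuracies = []
    for train_idx, test_idx in logo.split(samples, labels, groups=subjects):
        accuracies.append(train_and_score(train_idx, test_idx))
    return float(np.mean(accuracies)), accuracies
```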
Table 2. Comparison of mean accuracy of the fine-tuned CNNs and their score fusion for both CV
tests on the two standard datasets. The weights and biases of the pre-trained CNNs are fine-tuned
end-to-end.
Figure 4. The subject-wise comparison of recognition accuracy in the LOO CV test on both datasets
used: (a) MU dataset; (b) HUST ASL dataset.
Figure 5. Confusion matrix of MU dataset with test gesture samples in LOO CV test.
Figure 6. Confusion matrix of HUST dataset with test gesture samples in LOO CV test.
Figure 7. Similar gesture poses of the MU dataset. (a) Most confused gesture poses of the MU
dataset, '6' and 'w'. (b) Static ASL gesture poses without any fingers held out: '0', 'a', 'e', 'm',
'n', 'o', 's' and 't'.
Table 3. Comparison of proposed technique with earlier techniques using LOO CV test on MU dataset.
Table 4. Comparison of proposed technique with earlier techniques using holdout CV test on
MU dataset.
The time required to preprocess an input gesture image is 0.0969 s. The fine-tuned CNNs take
0.4236 s to generate the individual scores, and finally, the recognition of the gesture using the
score fusion technique takes 0.0014 s. Thus, the total time to recognize a gesture pose using the
proposed technique is 0.5219 s.
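The way these stages compose in a real-time loop can be sketched as follows; the wrapper functions reuse the earlier illustrative sketches (preprocess_depth_frame, fuse_scores), the tensor conversion omits any ImageNet mean/std normalization, and the Kinect capture API is not shown, so this is an assumption-laden outline rather than the deployed system.

```python
import time
import numpy as np
import torch

def cnn_scores(model, img_bgr, device="cuda"):
    """Forward pass of a fine-tuned CNN on a preprocessed BGR image;
    returns the raw class-score vector."""
    x = torch.from_numpy(img_bgr[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        return model(x.unsqueeze(0).to(device)).squeeze(0).cpu().numpy()

def recognize_frame(depth_mm, alexnet, vgg16, w=0.5):
    """Compose the timed stages: preprocessing, two CNN forward passes and
    score fusion; preprocess_depth_frame and fuse_scores are the earlier sketches."""
    t0 = time.perf_counter()
    img_a = preprocess_depth_frame(depth_mm, (227, 227))   # AlexNet input size
    img_v = preprocess_depth_frame(depth_mm, (224, 224))   # VGG-16 input size
    t1 = time.perf_counter()
    s1, s2 = cnn_scores(alexnet, img_a), cnn_scores(vgg16, img_v)
    t2 = time.perf_counter()
    predicted = int(np.argmax(fuse_scores(s1, s2, w)))
    t3 = time.perf_counter()
    return predicted, {"preprocess": t1 - t0, "cnn": t2 - t1, "fusion": t3 - t2}
```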
Table 6. Error analysis on ASL hand gesture datasets without any fingers held out.
Some examples of real-time recognition of hand gesture poses are illustrated in Figure 9.
The figure shows the correctly recognized ASL gesture poses '4', '7', 'd' and 'i'.
Figure 9. The hand region is segmented from the input depth map, and the recognized gesture
pose is displayed in the figure. (a–d) show real-time detections of gesture poses '4', '7', 'd' and
'i', respectively, using the proposed method.
7. Conclusions
This paper has introduced a score-level fusion technique between two fine-tuned CNNs for the
recognition of vision-based static hand gestures. The proposed network eliminates the need for
illumination normalization, rotation correction and hand region segmentation as pre-processing
steps for the color-image MU dataset. Due to the depth thresholding technique, the segmentation
of the hand region remains straightforward even in the presence of human noise and complex
backgrounds. The experimental results show that the HGR performance of the proposed technique
is superior to that of earlier works on two benchmark datasets. Moreover, the proposed technique
is able to distinguish the majority of closely related gesture poses accurately, which improves the
overall recognition performance. For the HUST-ASL dataset, the LOO CV test performance is
limited because a few gesture poses are captured with out-of-plane rotation. The proposed
technique is also used to recognize ASL gesture poses in real time. In future work, specific
shape-based feature extraction techniques from different views of the gesture pose may be
introduced into the current HGR system to handle out-of-plane gesture poses.
Author Contributions: Conceptualization, J.P.S. and A.J.P.; methodology, S.S. and J.P.S.; Software,
A.J.P.; validation, J.P.S., A.J.P. and S.S.; formal analysis, A.J.P.; investigation, J.P.S.; resources, P.P.; data
acquisition, J.P.S., and S.S.; writing—original draft preparation, J.P.S. and A.J.P.; writing—review and
editing, J.P.S. and A.J.P.; biographies, A.J.P.; visualization, P.P. and A.J.P.; supervision, P.P.; project
administration, P.P.; funding acquisition, P.P.; proofreading, A.J.P. and J.P.S. All authors have read
and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets are utilized in this work. The datasets are
available at: MU dataset: https://www.massey.ac.nz/~albarcza/gesture_dataset2012.html; HUST-
ASL dataset: http://mc.eistar.net/UpLoadFiles/File/hust_asl_dataset.zip.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Mitra, S.; Acharya, T. Gesture Recognition: A Survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2007, 37, 311–324.
[CrossRef]
2. Wachs, J.P.; Kölsch, M.; Stern, H.; Edan, Y. Vision-based hand-gesture applications. Commun. ACM 2011, 54, 60–71. [CrossRef]
3. McNeill, D. Hand and Mind; De Gruyter Mouton: Berlin, Germany, 2011.
4. Pugeault, N.; Bowden, R. Spelling it out: Real-time ASL fingerspelling recognition. In Proceedings of the 2011 IEEE International
Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1114–1119.
5. Sharma, A.; Mittal, A.; Singh, S.; Awatramani, V. Hand Gesture Recognition using Image Processing and Feature Extraction
Techniques. Procedia Comput. Sci. 2020, 173, 181–190. [CrossRef]
6. Lian, S.; Hu, W.; Wang, K. Automatic user state recognition for hand gesture based low-cost television control system. IEEE Trans.
Consum. Electron. 2014, 60, 107–115. [CrossRef]
7. Ren, Z.; Yuan, J.; Meng, J.; Zhang, Z. Robust Part-Based Hand Gesture Recognition Using Kinect Sensor. IEEE Trans. Multimed.
2013, 15, 1110–1120. [CrossRef]
8. Wang, C.; Liu, Z.; Chan, S.C. Superpixel-Based Hand Gesture Recognition With Kinect Depth Camera. IEEE Trans. Multimed.
2015, 17, 29–39. [CrossRef]
9. Feng, B.; He, F.; Wang, X.; Wu, Y.; Wang, H.; Yi, S.; Liu, W. Depth-Projection-Map-Based Bag of Contour Fragments for Robust
Hand Gesture Recognition. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 511–523. [CrossRef]
10. Pisharady, P.K.; Saerbeck, M. Recent methods and databases in vision-based hand gesture recognition: A review. Comput. Vis.
Image Underst. 2015, 141, 152–165. [CrossRef]
11. Suarez, J.; Murphy, R.R. Hand gesture recognition with depth images: A review. In Proceedings of the 2012 IEEE RO-MAN:
The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 9–13 September 2012.
[CrossRef]
12. Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced Computer Vision with Microsoft Kinect Sensor: A Review. IEEE Trans. Cybern.
2013, 43, 1318–1334. [CrossRef] [PubMed]
13. Modanwal, G.; Sarawadekar, K. Towards hand gesture based writing support system for blinds. Pattern Recognit. 2016, 57, 50–60.
[CrossRef]
14. Plouffe, G.; Cretu, A.M. Static and Dynamic Hand Gesture Recognition in Depth Data Using Dynamic Time Warping. IEEE Trans.
Instrum. Meas. 2016, 65, 305–316. [CrossRef]
15. Sharma, P.; Anand, R.S. Depth data and fusion of feature descriptors for static gesture recognition. IET Image Process. 2020,
14, 909–920. [CrossRef]
16. Patil, A.R.; Subbaraman, S. A spatiotemporal approach for vision-based hand gesture recognition using Hough transform and
neural network. Signal, Image Video Process. 2019, 13, 413–421. [CrossRef]
17. Tao, W.; Leu, M.C.; Yin, Z. American Sign Language alphabet recognition using Convolutional Neural Networks with multiview
augmentation and inference fusion. Eng. Appl. Artif. Intell. 2018, 76, 202–213. [CrossRef]
18. Oyedotun, O.K.; Khashman, A. Deep learning in vision-based static hand gesture recognition. Neural Comput. Appl. 2017,
28, 3941–3951. [CrossRef]
19. Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional Neural Networks for
Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312. [CrossRef]
20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
22. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015; pp. 1–9. [CrossRef]
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
24. Chevtchenko, S.F.; Vale, R.F.; Macario, V.; Cordeiro, F.R. A convolutional neural network with feature fusion for real-time hand
posture recognition. Appl. Soft Comput. 2018, 73, 748–766. [CrossRef]
25. Lee, D.L.; You, W.S. Recognition of complex static hand gestures by using the wristband-based contour features. IET Image
Process. 2018, 12, 80–87. [CrossRef]
26. Chevtchenko, S.F.; Vale, R.F.; Macario, V. Multi-objective optimization for hand posture recognition. Expert Syst. Appl. 2018,
92, 170–181. [CrossRef]
27. Fang, L.; Liang, N.; Kang, W.; Wang, Z.; Feng, D.D. Real-time hand posture recognition using hand geometric features and fisher
vector. Signal Process. Image Commun. 2020, 82, 115729. [CrossRef]
28. Barbhuiya, A.A.; Karsh, R.K.; Jain, R. CNN based feature extraction and classification for sign language. Multimed. Tools Appl.
2021, 80, 3051–3069. [CrossRef]
29. Dadashzadeh, A.; Targhi, A.T.; Tahmasbi, M.; Mirmehdi, M. HGR-Net: A fusion network for hand gesture segmentation and
recognition. IET Comput. Vis. 2019, 13, 700–707. [CrossRef]
30. Guo, L.; Lu, Z.; Yao, L. Human-machine interaction sensing technology based on hand gesture recognition: A review. IEEE Trans.
Hum.-Mach. Syst. 2021, 51, 300–309. [CrossRef]
31. Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust RGB-D object
recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg,
Germany, 28 September–2 October 2015; pp. 681–687.
32. Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional Neural Network with Data Augmentation for SAR Target Recognition. IEEE
Geosci. Remote Sens. Lett. 2016, 13, 364–368. [CrossRef]
33. Han, D.; Liu, Q.; Fan, W. A new image classification method using CNN transfer learning and web data augmentation. Expert
Syst. Appl. 2018, 95, 43–56. [CrossRef]
34. Akcay, S.; Kundegorski, M.E.; Willcocks, C.G.; Breckon, T.P. Using Deep Convolutional Neural Network Architectures for Object
Classification and Detection Within X-Ray Baggage Security Imagery. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2203–2215.
[CrossRef]
35. He, M.; Horng, S.J.; Fan, P.; Run, R.S.; Chen, R.J.; Lai, J.L.; Khan, M.K.; Sentosa, K.O. Performance evaluation of score level fusion
in multimodal biometric systems. Pattern Recognit. 2010, 43, 1789–1800. [CrossRef]
36. Taheri, S.; Toygar, Ö. Animal classification using facial images with score-level fusion. IET Comput. Vis. 2018, 12, 679–685.
[CrossRef]
37. Barczak, A.; Reyes, N.; Abastillas, M.; Piccio, A.; Susnjak, T. A new 2D static hand gesture colour image dataset for ASL gestures.
Res. Lett. Inf. Math. Sci. 2011, 15, 12–20.