
2019 5th International Conference for Convergence in Technology (I2CT)

Pune, India. Mar 29-31, 2019

Indian Sign Language converter using Convolutional Neural Networks
Nishi Intwala, Arkav Banerjee, Meenakshi and Nikhil Gala
Mukesh Patel School of Technology Management and Engineering (Mumbai Campus)
SVKM’s NMIMS University
Mumbai- 400056, Maharashtra, India
[email protected], [email protected], [email protected], [email protected]

Abstract – People with hearing and speech impairments face many difficulties while communicating with the general public. Being a minority, the sign language used by them is not known to a majority of people. In this paper, an Indian sign language converter was developed using a Convolutional Neural Network algorithm with the aim of classifying the 26 letters of the Indian Sign Language into their equivalent alphabet letters by capturing a real time image of each sign and converting it to its text equivalent. First, a database was created in various backgrounds, and various image pre-processing techniques were used to make the database ready for feature extraction. After feature extraction, the images were fed into the CNN using Python. Several real time images were tested to find the accuracy and efficiency. The results showed a 96% accuracy for the testing images and an accuracy of 87.69% for real time images.

Index Terms – Indian sign language, Convolutional Neural Networks, Transfer learning, GrabCut algorithm.

I. INTRODUCTION

People with speech and hearing impairments are at a disadvantage when it comes to having conversations with people who communicate normally on a daily basis. The main method of communication for people with a speech disability is through gestures. These gestures are compiled together to create a sign language, which differs from region to region. The population of India is 132.42 crores, out of which 19 lakh people have a speech disability. They communicate via the Indian sign language, which was developed in 2001 with 1600 words. It is essential that we develop ways to make communication easy for them and convert their gesture based language into the oral language used on a day to day basis. In this paper, we present an Indian Sign Language alphabet converter which converts a gesture into its equivalent letter in the English alphabet. For this project, we have used machine learning, specifically an image classifier called MobileNet, which uses a convolutional neural network. We have trained this classifier to give us the optimum accuracy.

Fig. 1: The Indian sign language letters

II. RELATED WORKS

We have referred to papers which have converted sign languages using different algorithms. One of these papers is by Muttaki Hasan, Tanvir Hossain Sajib and Mrinmoy Dey, who used SVM as their classifier and HOG (Histogram of Oriented Gradients) for feature extraction of certain words in the Bengali sign language. In that paper, the recognized expressions are converted into their respective audio outputs [5]. The images are first preprocessed using binary thresholding and then segmented by cropping the hand gesture with the help of the OpenCV library. The next step is to train and test the classifier, with the testing set containing 64 of the 320 images. Finally, the output from the classifier is converted into audio with the help of a TTS engine in the Python library. They achieved an accuracy of 86.53%.
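For readers unfamiliar with that kind of pipeline, the following is a minimal sketch of a HOG-plus-SVM classifier in the spirit of [5]; the image size, HOG parameters, SVM kernel and the path/label arguments are illustrative assumptions, not details taken from [5].

```python
# Illustrative HOG + SVM pipeline in the spirit of [5]; all parameters and
# the train/test argument names are hypothetical placeholders.
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(path, size=(128, 128)):
    """Read a gesture image, binarize it and return its HOG descriptor."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.resize(gray, size)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return hog(binary, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

def train_hog_svm(train_paths, train_labels, test_paths, test_labels):
    """Train a linear SVM on HOG features and return it with its test accuracy."""
    X_train = np.array([hog_features(p) for p in train_paths])
    X_test = np.array([hog_features(p) for p in test_paths])
    clf = SVC(kernel="linear")
    clf.fit(X_train, train_labels)
    return clf, clf.score(X_test, test_labels)
```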
In a paper by Pranali Loke, Juilee Paranjpe, Sayali Bhabal and Ketan Kanere, a sign language converter using an artificial neural network on MATLAB is proposed. The method, which would create an Android application for sign language conversion, is only proposed and not implemented [4]. They chose hand gesture recognition to recognize the gestures and convert them to natural language, proposing to use the HSV (hue, saturation and value) model for hand tracking and segmentation. For classification they make use of a neural network. Using an Android application, the images are captured and used as input to the neural network. The hand gesture is matched with its respective hand gesture on MATLAB and the resultant converted text is sent back to the user's device, making it a system which will aid people who do not know sign language.



Another paper that we have referred to is written by Malladi Sai Phani Kumar, Veerapalli Lathasree and S.N. Karishma. In this paper they have used images as their input and have segmented the hand gestures from the background with the help of the GrabCut algorithm; however, they propose using a novel contour method as the segmentation process instead of the GrabCut algorithm [9]. They concluded that the novel contour method gave better results in terms of segmenting the image compared to the GrabCut algorithm, which required multiple iterations; the average number of iterations needed to completely segment the image was three. However, the novel contour method did not work well with abstract backgrounds. This paper was mainly focused on comparing two methods of segmentation. In our case, we have used only one iteration of the GrabCut algorithm, as we want the process to be automated and a single iteration is enough to provide us with good results. Since our backgrounds are abstract, the GrabCut algorithm provides us with better segmentation.

III. NEURAL NETWORKS

A Convolutional Neural Network (CNN) is one of the algorithms used in machine learning. CNNs are similar to artificial neural networks, meaning they have nodes or neurons connected via weighted links which produce an output in accordance with the given input. The main difference is that convolutional networks are better suited to visual classification tasks such as images. Regular neural networks consist of a hidden layer which is connected to the previous input layer, and an output layer where the classification output is given. However, regular neural networks cannot handle huge amounts of data. Hence, for a large number of images, convolutional neural networks are more efficient [10].

Convolutional neural networks have a 3D arrangement of neurons, meaning that the layers have height, width and depth. Here, the neurons of one layer are not connected to all the neurons of the adjacent layer; instead, each neuron is connected only to a small region of the previous layer. The output layer of the network converts the image into a single vector along the depth dimension.

In our project we have used transfer learning, in which we use a pre-trained convolutional neural network model instead of training a convolutional neural network from scratch. The pre-trained model is selected such that its problem statement is similar to the problem statement of the user. The size of and similarity between the datasets play an important role, which means that in order to classify images, the pre-trained neural network model must have been trained on images and not on any other kind of data set.

The pre-trained model we use, MobileNet, is a convolutional neural network, and we retrain only its final layer. The layer just before the final classification layer is called the bottleneck. First, the bottleneck values for each image are calculated. Every image is used numerous times during the training step, so the calculations for each image from the previous layers are cached as bottlenecks and reused. During training, the cached bottlenecks of the images are fed to the final layer, which makes its predictions; these are compared with the actual label of the image, and the weights of the final layer are optimized using the backpropagation process.
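As a rough sketch of this transfer-learning setup, the snippet below freezes a pre-trained MobileNet, caches its bottleneck features, and trains only a new final softmax layer for the 26 classes. It uses tf.keras for illustration; the optimizer, epoch count and the placeholder arrays `train_images`/`train_labels` are assumptions rather than the paper's actual training script.

```python
# Hedged sketch of retraining only the final layer of MobileNet (tf.keras).
# Training settings and placeholder arrays are assumptions, not the paper's code.
import tensorflow as tf

IMG_SIZE = 224
NUM_CLASSES = 26  # one class per ISL letter

# Pre-trained MobileNet without its classification head; frozen so that it
# only produces the "bottleneck" features described above.
base = tf.keras.applications.MobileNet(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False, pooling="avg", weights="imagenet")
base.trainable = False

def compute_bottlenecks(images):
    """Run images through the frozen layers once and cache the results,
    so the expensive part is not repeated every training epoch."""
    x = tf.keras.applications.mobilenet.preprocess_input(
        tf.cast(images, tf.float32))
    return base.predict(x)

# Only this final softmax layer is trained by backpropagation.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),  # MobileNet bottleneck width
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
head.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])

# Example usage with hypothetical arrays standing in for the 52,000 images:
# bottlenecks = compute_bottlenecks(train_images)   # shape (N, 1024)
# head.fit(bottlenecks, train_labels, epochs=10, batch_size=32)
```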
IV. FRAMEWORK

Fig. 2: Implementation Process

A. Dataset creation
Data collection is the main aspect of machine learning that makes training of the algorithm possible. Usually, data gathering and processing consumes most of the time involved in the whole machine learning process, and accuracy improves as more data is gathered. For our project we used a 720p HD webcam to take the pictures in our dataset. The dataset consists of 26 classes, one for each letter of the English alphabet, with an Indian sign equivalent to every letter. We captured 2000 pictures of each sign, giving a total of 52,000 images for the 26 letters. The images have been taken in different backgrounds. However, since we accumulated abundant data, the computational time also increased. The dataset was created with the help of MATLAB 2017b software.
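The dataset itself was collected with MATLAB 2017b, but for illustration a comparable capture loop can be written with OpenCV in Python; the folder layout, key binding and 720p resolution request below are assumptions.

```python
# Illustrative OpenCV capture loop (the paper's dataset was collected in MATLAB).
import os
import cv2

def capture_sign(letter, target=2000, out_dir="dataset"):
    """Save webcam frames for one sign class until `target` images exist."""
    os.makedirs(os.path.join(out_dir, letter), exist_ok=True)
    cam = cv2.VideoCapture(0)
    cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)   # request 720p frames
    cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    count = 0
    while count < target:
        ok, frame = cam.read()
        if not ok:
            break
        cv2.imshow("capture", frame)
        # Press 'c' to keep the current frame for this letter
        if cv2.waitKey(1) & 0xFF == ord("c"):
            cv2.imwrite(os.path.join(out_dir, letter,
                                     f"{letter}_{count:04d}.jpg"), frame)
            count += 1
    cam.release()
    cv2.destroyAllWindows()

# capture_sign("A")  # repeated for each of the 26 letters
```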
B. Image Cropping
After the dataset was created, we applied image cropping, a process in which an image is cut down to a specific area of interest. In our case, the hand gestures were the crux of the image, and hence we used image cropping to retain only those portions of the image. This is necessary to eliminate any effect of the background on the hand gestures while training these images for classification, and it also reduces the computational time while training the classifier. We applied image cropping with the help of MATLAB 2017b software.
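A minimal sketch of such a crop, assuming a fixed, hypothetical rectangle around the hand region (the paper performed this step in MATLAB 2017b):

```python
# Minimal cropping sketch; the rectangle values are hypothetical.
def crop_hand_region(image, x=300, y=100, w=600, h=500):
    """Keep only the portion of the frame that contains the hand gesture."""
    return image[y:y + h, x:x + w]

# cropped = crop_hand_region(cv2.imread("dataset/A/A_0000.jpg"))
```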

2
C. Image Resizing
Image resizing changes the total number of pixels in a digital image. We resized each image to a resolution of 224x224 px using Python. This was done to reduce the computational time as well as to obtain a uniform dataset for training.

D. Image Flipping
A flipped or reversed image is the image produced by mirror-reversal of the original across the vertical or horizontal axis. Image flipping is needed because the webcam automatically mirrors the images while capturing them; flipping across the vertical axis is therefore required to regain the original photo.
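A small sketch of these two steps with OpenCV, continuing from the cropped image above; the paper used Python for resizing, and the exact flip call shown here is an illustrative equivalent.

```python
# Sketch of the resizing and flipping steps using OpenCV.
import cv2

def resize_and_unflip(image):
    """Resize to the 224x224 px input used for training and undo the
    webcam's mirror effect by flipping across the vertical axis."""
    resized = cv2.resize(image, (224, 224))
    return cv2.flip(resized, 1)  # flipCode=1 flips around the vertical axis

# processed = resize_and_unflip(cropped)
```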

E. Training the classifier
Once the dataset has been pre-processed, it is ready to be fed into an image classifier. We have used MobileNet as our classifier, which adopts a convolutional neural network to classify the images.

F. Segmentation
Selecting good features is critical in any object recognition system, and segmentation is one of the feature extraction techniques used for images [2]. We have made use of the GrabCut algorithm for segmentation of our real-time images. The algorithm works as follows. Initially, a rectangle has to be specified marking the region of interest in the image, which makes the region outside this rectangle the background. Once a rectangle has been specified, the algorithm tries to label the foreground and the background automatically in terms of colour statistics by iterating over the image multiple times. The computer labels the foreground and background pixels. A Gaussian Mixture Model (GMM) is applied, which clusters the unknown pixels in terms of colour statistics, after which they become part of either the foreground or the background. A graph is then generated from the pixel distribution, with the pixels as nodes; two additional nodes, a source node and a sink node, are added to this graph [9]. The source node is connected to the foreground pixels while the sink node is connected to the background pixels. Each pixel is connected to these two nodes by edges which have a certain weight associated with them. The weight between two pixels is determined by the edge information, or pixel similarity: a large difference in pixel colour results in a low weight on the edge between the two pixels. The graph is then cut into two parts separating the source and the sink node with a minimum cost function [11]; after the cut is made, all the pixels connected to the source node become foreground and all those connected to the sink node become background. There may be some areas which have been classified wrongly, in which case the user has to hard label these regions again. Although the segmentation accuracy is not 100% in our case and some regions are marked wrongly, the result provided by the algorithm is sufficient to give a good classification result. Hard labelling has been avoided in order to obtain a quicker result and so that the whole process can be conducted automatically.

Fig. 3: GrabCut Algorithm
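One way to run this rectangle-initialized, single-iteration GrabCut step is OpenCV's cv2.grabCut, sketched below; whether the paper used this exact call is not stated, and the rectangle coordinates and file name are assumptions.

```python
# Minimal single-iteration, rectangle-initialized GrabCut with OpenCV.
import cv2
import numpy as np

def segment_hand(image, rect=(50, 50, 400, 400)):
    """Run one GrabCut iteration and black out the estimated background."""
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # background GMM parameters
    fgd_model = np.zeros((1, 65), np.float64)  # foreground GMM parameters
    cv2.grabCut(image, mask, rect, bgd_model, fgd_model,
                1, cv2.GC_INIT_WITH_RECT)
    # Keep pixels labelled as definite or probable foreground
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                  1, 0).astype("uint8")
    return image * fg[:, :, np.newaxis]

# segmented = segment_hand(cv2.imread("realtime_capture.jpg"))
```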
V. RESULTS

Figure 4 depicts the pre-processing applied on the dataset, namely resizing, cropping and flipping. As mentioned before, these pre-processing techniques have been applied only to the dataset and not to the real time images.

Fig. 4: Pre-processing

For the real time images, we segmented the images with the help of the GrabCut algorithm, as shown in Fig. 5. As can be seen, the main area of interest, i.e. the hand, has not been segmented completely by the algorithm; however, the result is sufficient to provide accurate classifications.

Fig. 5: Segmentation

After obtaining a good accuracy of 96% in the testing phase, we further applied the algorithm on 20 real time images for each sign. Since there were a total of 26 signs, the total number of real time images taken into consideration was 20×26 = 520.

TABLE I.

Sign Character    Accuracy    Result
A                 95%         Good
B                 90%         Good
C                 95%         Good
D                 85%         Moderate
E                 85%         Moderate
F                 80%         Moderate
G                 95%         Good
H                 90%         Good
I                 100%        Excellent
J                 80%         Moderate
K                 75%         Poor
L                 95%         Good
M                 80%         Moderate
N                 85%         Moderate
O                 90%         Good
P                 85%         Moderate
Q                 80%         Moderate
R                 75%         Poor
S                 85%         Moderate
T                 75%         Poor
U                 100%        Excellent
V                 100%        Excellent
W                 95%         Good
X                 90%         Good
Y                 85%         Moderate
Z                 90%         Good

In Table I, we considered an accuracy of 100% to be excellent, an accuracy from 90% to 99% to be good, an accuracy from 80% to 89% to be moderate, and an accuracy below 80% to be poor. We observed that signs I, U and V were classified perfectly, while signs K and R were classified poorly. We also observed that out of these 520 images, 456 were classified correctly. Hence we obtained an accuracy of 87.69% for the real time images.
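As a quick consistency check, the per-sign accuracies in Table I, at 20 real time images per sign, reproduce the reported overall figure:

```python
# Consistency check of the reported real-time accuracy from Table I.
per_sign_accuracy = {
    "A": 95, "B": 90, "C": 95, "D": 85, "E": 85, "F": 80, "G": 95, "H": 90,
    "I": 100, "J": 80, "K": 75, "L": 95, "M": 80, "N": 85, "O": 90, "P": 85,
    "Q": 80, "R": 75, "S": 85, "T": 75, "U": 100, "V": 100, "W": 95, "X": 90,
    "Y": 85, "Z": 90,
}
images_per_sign = 20
correct = sum(round(a / 100 * images_per_sign) for a in per_sign_accuracy.values())
total = images_per_sign * len(per_sign_accuracy)
print(correct, total, f"{100 * correct / total:.2f}%")  # 456 520 87.69%
```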

TABLE II. CONFUSION MATRIX FOR THE 26 INDIAN SIGN LETTERS

Table II depicts the confusion matrix for the 26 Indian sign letters. The 520 real time images have been considered in the confusion matrix.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a method for the recognition and classification of the 26 Indian sign language letters using CNNs. We observed that MobileNet is an effective approach for the classification of large amounts of data. We considered previous work on the recognition of sign language and came to the conclusion that MobileNet would be efficient enough to classify the hand signs with a high accuracy. However, letters H and J are dynamic gestures in the Indian Sign Language, whereas our method is applicable only to static gesture recognition; hence, these letters have been denoted by a single frame. In the future, we would like to capture the real time images with a cell phone camera instead of the webcam and get the output on the cell phone itself, since MobileNet is especially suited for mobile operation [10]. Better segmentation approaches could also be used in order to completely segment the hand from the rest of the image.

REFERENCES

[1] R.M. Gurav, P.K. Kadbe, “Real time finger tracking and contour detection for gesture recognition using OpenCV,” International Conference on Industrial Instrumentation and Control 2015 (ICIC 2015), pp. 974-977, 2015.
[2] Pinaki Pratim Acharjya, Ritaban Das and Dibyendu Ghoshal, “Study and Comparison of Different Edge Detectors for Image Segmentation,” Global Journal of Computer Science and Technology Graphics & Vision, Volume 12, Issue 13, Version 1.0, 2012.
[3] Y. Ramadevi, T. Sridevi, B. Poornima, B. Kalyani, “Segmentation and object recognition using edge detection techniques,” International Journal of Computer Science & Information Technology (IJCSIT), Vol. 2, No. 6, December 2010.
[4] Pranali Loke, Juilee Paranjpe, Sayli Bhabal, Ketan Kanere, “Indian Sign Language Converter System Using An Android App,” International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2017, 978-1-5090-5686-6/17.
[5] Muttaki Hasan, Tanvir Hossain Sajib and Mrinmoy Dey, “A Machine Learning Based Approach for the Detection and Recognition of Bangla Sign Language,” IEEE, 978-1-5090-5421-3/16.
[6] Farhad Yasir, P.W.C. Prasad and Abir Alsadoon, “SIFT Based Approach on Bangla Sign Language Recognition,” IEEE 8th International Workshop on Computational Intelligence and Applications, November 6-7, 2015.
[7] S. Karishma, V. Lathasree, “Fusion of skin color detection and background subtraction for hand gesture segmentation,” International Journal of Engineering Research and Technology, vol. 3, no. 2, 2014.
[8] M. K. Ahuja and A. Singh, “Static vision-based hand gesture recognition using principal component analysis,” IEEE 3rd International Conference on MOOCs, Innovation and Technology in Education 2015 (MITE 2015), IEEE, 2015, pp. 402–406.
[9] Malladi Sai Phani Kumar, Veerapalli Lathasree and S.N. Karishma, “Novel Contour Based Detection and GrabCut Segmentation for Sign Language Recognition,” International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), IEEE, 978-1-5090-4442-9/17.
[10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv:1704.04861v1 [cs.CV], 17 Apr 2017.
[11] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics (TOG), vol. 23, ACM, 2004, pp. 309–314.
