Group No. 19A: Sign Language Recognition
B. E. Computer Engineering
by
Supervisor:
TABLE OF CONTENTS

1 INTRODUCTION
1.1 Description
1.2 Problem Formulation
1.3 Proposed Solution
1.4 Scope of the Project
2 REVIEW OF LITERATURE
2.1 Overview
3 SYSTEM ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Specific Requirements
3.4 Use Case Diagram and Description
4 ANALYSIS MODELING
4.1 Activity Diagram
4.2 Functional Modeling
4.3 Timeline Chart
5 DESIGN
5.1 Architectural Design
6 IMPLEMENTATION
6.1 Algorithms/Methods Used
7 TESTING
7.1 Test Cases
8 RESULTS AND DISCUSSION
9 CONCLUSION AND FUTURE WORK
9.1 Conclusion
9.2 Future Scope
Appendix
Literature Cited
Acknowledgements
LIST OF FIGURES

3.4.1 Use Case Diagram
LIST OF TABLES
Chapter 1
Introduction
1.1 Description
Sign language is a language used by deaf and hard-of-hearing people in which meaning is conveyed through the simultaneous orientation and movement of hand shapes rather than through acoustically conveyed sound patterns.
People with hearing and speech impairments rely on sign language interpreters to communicate. However, finding experienced and qualified interpreters for their day-to-day affairs throughout their lives is very difficult and often unaffordable.
Sign language is the basic means of communication for those with hearing and vocal disabilities, who face difficulties in their day-to-day lives. We aim to develop a system that would ease this difficulty in communication. Sign language consists of making shapes or movements with the hands relative to the head or other body parts, along with certain facial cues. A recognition system therefore has to identify head and hand orientation or movement, facial expression and even body pose. We propose the design of a basic yet extensible system that can recognize the static and dynamic gestures of American Sign Language, specifically the letters a-z (where j and z are dynamic, involving hand movement, while the rest are static). American Sign Language was chosen since it is among the most widely used sign languages.
Over 100 million people - more than 1% of the world’s population - are unable to hear. Being
deaf from birth or childhood, many of these people use sign language as their primary form
of communication.
There are several hundred sign languages around the world and these also have their own
dialects. One of the most common of these is American Sign Language (ASL). More than
500,000 people use ASL in the US alone, and millions more use it worldwide.
Most hearing people do not realize that written English is only the second language of people who are born deaf. Although they can settle most everyday matters in writing, there are official situations in which the cooperation of a sign language interpreter is required, since they prefer to communicate in their first language, sign language.
There is an undeniable communication problem between the Deaf community and the hearing
majority. Innovations in automatic sign language recognition try to tear down this
communication barrier. Our contribution considers a recognition system using the Microsoft
Kinect, convolutional neural networks (CNNs) and GPU acceleration. Instead of constructing
complex handcrafted features, CNNs are able to automate the process of feature construction.
The main purpose would be to accommodate a dialogue between signers and non-signers. This would be beneficial in emergency situations where information needs to be exchanged quickly, such as a conversation between a physician and their patient. The vocabulary can be extended over time as new words are added to expand the existing dataset.
Chapter 2
Review of Literature
2.1 Overview of the Literature
The review of literature focuses on the techniques used for gesture recognition and on the color spaces used to detect different colors in the surrounding environment.
The survey in this paper provides a good overview of the various color models used for detection. It reviews the different color models along with the mathematical representation of each, their advantages and disadvantages, a comparison between them, and their suitable application areas. We will use an RGB-to-HSV color conversion algorithm, as HSV is better suited for such operations; however, the conversion from RGB to HSV takes time for higher-resolution images or higher frame-rate applications. [1]
Another approach describes skin detection in the HSV color space. To detect skin in an RGB image, the image is first converted to HSV, since HSV corresponds more closely to how humans perceive color. Skin is detected using Hue values in the range of roughly 6 to 38 together with a mixture of different filters. The next step is thresholding, where non-skin pixels are assigned the value 0 and skin pixels the value 1. Dilation and erosion with a 5x5 kernel are then used to soften the skin mask, followed by a median filter with a 3x3 kernel to smooth the image. After this, only the skin region appears as white pixels and all other pixels are black. The OpenCV library was used for image processing and the system was developed in the C language. [2]
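For illustration, the following is a minimal Python/OpenCV sketch of such a skin-detection pipeline. The Hue range of 6 to 38 follows the cited approach; the saturation and value bounds and the exact kernel sizes are assumptions made for the sake of the example.

    import cv2
    import numpy as np

    def skin_mask(bgr_frame):
        # convert OpenCV's default BGR frame to HSV
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        # keep pixels with Hue roughly between 6 and 38 (saturation/value bounds are assumed)
        lower = np.array([6, 40, 60], dtype=np.uint8)
        upper = np.array([38, 255, 255], dtype=np.uint8)
        mask = cv2.inRange(hsv, lower, upper)     # non-skin pixels -> 0, skin pixels -> 255
        # soften the mask: dilation and erosion with a 5x5 kernel, then a 3x3 median filter
        kernel = np.ones((5, 5), np.uint8)
        mask = cv2.dilate(mask, kernel)
        mask = cv2.erode(mask, kernel)
        mask = cv2.medianBlur(mask, 3)
        return mask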
Image filtering algorithms are needed to filter out noise, which is the main focus of this paper. Filtering is required to reduce noise and improve the visual quality of the image, and the paper gives a detailed explanation of the various filtering techniques. The proposed system will use mean filtering to remove noise; preserving edges during filtering is crucial, as we need to detect the contours of the hand. [3]
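As a small illustration (the 3x3 kernel size is an assumption), the mean filter chosen here, alongside the median filter used in the previous approach, can be applied in OpenCV as follows:

    import cv2
    import numpy as np

    frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)  # stand-in for a webcam frame
    mean_smoothed = cv2.blur(frame, (3, 3))      # mean (box) filter over each 3x3 neighbourhood
    median_smoothed = cv2.medianBlur(frame, 3)   # median filter, which keeps edges sharper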
This system implements adaptive boosting for hand detection and a Haar cascade classifier algorithm to train the classifier. It uses the HSV color model for background subtraction and noise removal, and the convex hull algorithm for drawing contours around the palm and for fingertip detection. A laptop webcam with 480p resolution was used to capture the stream. OpenCV and C++ were used to implement the system. [4]
A boundary detection algorithm for objects was proposed in this paper. The algorithm finds a detailed boundary that includes the object's outer border, also known as the 1-component border, as well as the hole border between a hole and the 1-component directly surrounding it. This can be adapted to detect the convex and concave parts of the hand when extracting contours. [5]
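As a hedged sketch of how these ideas fit together in OpenCV (whose findContours routine follows the border-following approach of [5]), the hand contour, its convex hull and the convexity defects between fingers can be obtained as follows; the synthetic mask and variable names are illustrative only.

    import cv2
    import numpy as np

    # synthetic binary mask standing in for the skin mask produced earlier
    mask = np.zeros((240, 320), dtype=np.uint8)
    cv2.circle(mask, (160, 120), 60, 255, -1)

    # [-2] picks the contour list under both the OpenCV 3.x and 4.x return signatures
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    hand = max(contours, key=cv2.contourArea)         # take the largest contour as the hand
    hull = cv2.convexHull(hand)                       # convex outline around the palm
    hull_idx = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull_idx)    # concave regions between fingers (may be None)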
Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. They still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, and all the tips and tricks developed for learning regular Neural Networks still apply. [6]
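As a minimal sketch of such a network in the Keras API shipped with TensorFlow (one of the tools listed in Chapter 3), the following defines a small CNN; the layer sizes, the 64x64 grayscale input and the 26-class output for the letters a-z are illustrative assumptions, not the exact architecture used in this project.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),     # fully-connected layer
        layers.Dense(26, activation="softmax"),   # class scores for the letters a-z
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])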
Chapter 3
System Analysis
3.1 Functional Requirements
➢ Software Requirements
○ Programming: Python, OpenCV, TensorFlow, NumPy
○ Operating System: Ubuntu 16.04
3.3.2 Economic Feasibility
➢ Hardware Requirements
➢ Software Requirements
3.4 Use Case Diagram and Description
Use case diagrams are usually referred to as behavior diagrams, used to describe a set of actions that a system (the subject) should or can perform in collaboration with one or more external users (actors) of the system. Each use case should provide some observable and valuable result to the actors or other stakeholders of the system.
Use case diagrams are twofold: they are behavior diagrams, as they describe the behavior of the system, and also structure diagrams, as a special case of class diagrams where the classifiers are restricted to be either actors or use cases related to each other by associations.
Below is a detailed study of the use case diagram:
Use Case description for “Process Video”
Use Case: Process on Received Video
Goal in Context: Display the meaning according to tracking and gesture recognition
Chapter 4
Analysis Modeling
4.1 Activity Diagram
In the activity diagram, the system is expressed as a sequence of activities. First, video is captured and split into frames. Each image is then passed to the CNN for processing: if the images are training images, the CNN filters are trained; otherwise the CNN outputs a feature vector. The feature vector is then used to generate RNN training data if it belongs to the training set, or for prediction otherwise. This ensemble of classifiers produces the converted text.
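As a brief, hedged sketch of the first activity (the file name and frame size are assumptions), the video can be captured and split into frames with OpenCV as follows:

    import cv2

    def video_to_frames(path, size=(64, 64)):
        # read a video file frame by frame; passing 0 instead of a path would use the webcam
        frames = []
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.resize(frame, size))
        cap.release()
        return frames

    frames = video_to_frames("gesture.mp4")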
4.2 Functional Modeling
Context Level DFD/ Level 0 DFD
Level 1 DFD
In the level 1 data flow diagram, the gesture recognition module is explained in further detail. The camera provides a live feed of the user's actions. Operations are performed on the feed to enhance the hand movements. The hand movement is recognized and sent to the gesture tracking module. The gesture tracking module checks for the gesture in the pre-trained network of gestures and their respective meanings. The gesture control module maps the gesture to its meaning. Through these processes the gesture is recognized and its meaning is displayed.
Level 2 DFD
In the level 2 data flow diagram, the three important processes are explained:
Process 1.1 - the video is captured via the camera and sent for processing. Convolution operations are performed on the feed to enhance the CNN filters. The primary focus is the hand of the user, which is recognized by training the CNN over a number of samples.
Process 1.2 - the filtered images are stitched back into a sequence to generate RNN data using the pooled CNN features; the gestures are recognized and tracking takes place.
Process 1.3 - the tracked gesture is obtained and mapped to its meaning via the pre-trained LSTM network. The meaning is then displayed to the user.
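A minimal sketch of the sequence classifier described in Process 1.3 is given below, assuming 2048-dimensional CNN feature vectors, 40 frames per gesture and 26 gesture classes; these numbers are assumptions, and the single LSTM layer reflects the observation in Chapter 8 that one layer was sufficient.

    from tensorflow.keras import layers, models

    NUM_GESTURES = 26   # assumed number of gesture classes

    seq_model = models.Sequential([
        layers.LSTM(256, input_shape=(40, 2048)),   # 40 frames x 2048-d CNN features per gesture
        layers.Dense(NUM_GESTURES, activation="softmax"),
    ])
    seq_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])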
Chapter 5
Design
5.1 Architectural Design
Chapter 7
Testing
7.1 Test Cases
Chapter 8
Results and Discussion
We then used the same Convolutional Neural Network on image sequences, which we would later process as time sequences, but it was not learning the gestures; instead it was learning the people and their faces. We then tried changing a few hyperparameters to push the network towards learning the gestures (54% accuracy).
This was still not good, as the data was not clean and contained other gestures in between the current gesture data. We then sifted through the images and cleaned the data to contain only the gestured part of each video. This was a tremendous improvement (61% accuracy), as the network was starting to learn and identify the gestures.
For processing the image sequences as a time series, we first tried 3D Convolutional Networks, which did not work due to memory constraints. To tackle this we tried batching the sequences, which still did not work, and the memory constraints remained. We then implemented LSTM networks and found that a single-layer LSTM network is more than enough and does the job well.
We later tried retraining the Inception V3 model, as it already had pre-trained weights and updating those weights would be much more efficient than training an entire model from scratch. We found that the model was learning the gestured parts and giving a decent accuracy (63%). After passing its features through the LSTM network, the accuracy came out to about 81%, and we could identify the gestures live fairly well.
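A hedged sketch of this final pipeline is given below: per-frame features are taken from the global-average-pooled output of Inception V3 and then classified with a single-layer LSTM. The sequence handling, layer sizes and class count are assumptions for illustration, not the exact configuration used.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.applications.inception_v3 import preprocess_input
    from tensorflow.keras import layers, models

    # frame-level feature extractor: ImageNet weights, 2048-d pooled output per frame
    extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

    def frames_to_features(frames):
        # frames: list of RGB images resized to 299x299, Inception V3's expected input size
        batch = preprocess_input(np.array(frames, dtype=np.float32))
        return extractor.predict(batch)           # shape: (num_frames, 2048)

    # sequence classifier over the per-frame features
    classifier = models.Sequential([
        layers.LSTM(256, input_shape=(None, 2048)),   # variable-length gesture sequences
        layers.Dense(26, activation="softmax"),       # assumed number of gesture classes
    ])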
The current solution is the best because it correctly identifies the gestures by extracting their features and correctly recognizing the sequence of frames in which each gesture takes place.
Chapter 9
Conclusion and Future Work
This project holds immense potential in terms of real-world applications and can be used as a platform for the development of solutions to a number of problems. The future scope of this project includes:
● Increasing the accuracy of the current system: this can be done in two ways. Firstly, by increasing the size of the training dataset to include more variations of the characters. Accuracy can also be improved by training a deeper network; however, this will require a machine with a very high configuration.
● Script recognition engine: The base framework and training approach used in this
project can be used to create an application that learns to recognize handwritten
scripts given a sizeable dataset to learn from.
Appendix
Pooling: Convolutional networks may include local or global pooling layers, which combine
the outputs of neuron clusters at one layer into a single neuron in the next layer. For example,
max pooling uses the maximum value from each of a cluster of neurons at the prior layer.
Another example is average pooling, which uses the average value from each of a cluster of
neurons at the prior layer.
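As a small numerical illustration of these two pooling operations (the input values are arbitrary), 2x2 max pooling and average pooling with stride 2 can be computed as follows:

    import numpy as np

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 2],
                  [7, 2, 9, 0],
                  [3, 4, 1, 8]], dtype=float)

    # split the 4x4 input into non-overlapping 2x2 tiles
    blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)
    max_pooled = blocks.max(axis=(2, 3))    # [[6., 4.], [7., 9.]]
    avg_pooled = blocks.mean(axis=(2, 3))   # [[3.75, 2.25], [4.0, 4.5]]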
Fully connected: Fully connected layers connect each neuron in a layer to every neuron in
another layer. It is in principle the same as the traditional multi-layer perceptron neural
network (MLP).
Weights: CNNs share weights in convolutional layers, which means that the same filter
(weights bank) is used for each receptive field in the layer; this reduces memory footprint and
improves performance.
Literature Cited
[1] Ibraheem, N.A., Hasan, M.M., Khan, R.Z. and Mishra, P.K., 2012. Understanding color
models: a review. ARPN Journal of Science and Technology, 2(3), pp.265-275.
[2] Oliveira, V.A. and Conci, A., 2009. Skin Detection using HSV color space. In H. Pedrini,
& J. Marques de Carvalho, Workshops of Sibgrapi (pp. 1-2).
[3] Chandel, R. and Gupta, G., 2013. Image filtering algorithms and techniques: A review.
International Journal of Advanced Research in Computer Science and Software Engineering,
3(10).
[4] Gurav, R.M. and Kadbe, P.K., 2015, May. Real time finger tracking and contour detection
for gesture recognition using OpenCV. In Industrial Instrumentation and Control (ICIC),
2015 International Conference on (pp. 974-977). IEEE.
[5] Suzuki, S., 1985. Topological structural analysis of digitized binary images by border
following. Computer vision, graphics, and image processing, 30(1), pp.32-46.
[6] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language Recognition
Using Convolutional Neural Networks. In: Agapito L., Bronstein M., Rother C. (eds)
Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer
Science, vol 8925. Springer, Cham.
[7] Andrej Karpathy. Stanford university cs231n: Convolutional neural networks for visual
recognition. http://cs231n.stanford.edu/, March 24 2017. [Online].
[8] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper
with convolutions. In CVPR, 2015.
Acknowledgements
First and foremost, we would like to thank our supervisors Mrs. Sridari Iyer and Mrs. Priya Chaudhary for their valuable guidance and advice rendered to us throughout the course of this work.
Besides, an honorable mention goes to our fellow classmates for their understanding and
support in completing this report. We would also like to show gratitude to our project
coordinator Mrs. Vincy Joseph for sharing her pearls of wisdom with us during the course of
this research.
We are immensely grateful to them, although any errors are our own and should not tarnish
the reputations of these esteemed persons.
We would also like to thank the college, the principal and the brother for giving us an opportunity to work on this project.