Group No19A Sign Language Recognition


Sign Language Recognition

Submitted in partial fulfillment of the requirements


of the degree of

B. E. Computer Engineering

by

Gaurav Chemburkar 11 (142017)


Priyaank Chhadwa 13 (142033)
Ashish Mishra 64 (142027)

Supervisor:

Ms. Sridari Iyer


(Assistant Professor)

ST. FRANCIS INSTITUTE OF TECHNOLOGY


Mount Poinsur, S.V.P Road, Borivali (W), Mumbai-400103
UNIVERSITY OF MUMBAI
2017-2018
Abstract
Sign language is a method of communication which uses various hand gestures and
movements. Understanding these gestures can be postulated as a pattern recognition problem.
Humans use different kinds of gestures and motions to convey different messages to other
humans. This project represents a framework for a human computer interface capable of
recognizing said gestures from sign language and providing a text output representing the
meaning of the gesture. The proposed system will use convolutional neural networks and
long short term memory networks to identify and learn the gestures which will help to
minimize the communication barrier between signers and non-signers.

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Description
1.2 Problem Formulation
1.3 Proposed Solution
1.4 Scope of the Project
2 REVIEW OF LITERATURE
2.1 Overview
3 SYSTEM ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Specific Requirements
3.4 Use Case Diagram and Description
4 ANALYSIS MODELING
4.1 Activity Diagram
4.2 Functional Modeling
4.3 Timeline Chart
5 DESIGN
5.1 Architectural Design
6 IMPLEMENTATION
6.1 Algorithms/Methods used
7 TESTING
7.1 Test cases
8 RESULTS AND DISCUSSION
9 CONCLUSION AND FUTURE WORK
9.1 Conclusion
9.2 Future Scope
Appendix
Literature Cited
Acknowledgements

LIST OF FIGURES

Fig. No.   Figure Caption
3.4.1      Use Case Diagram
4.1.1      Activity Diagram
4.2.1      Context Level DFD
4.2.2      Level 1 DFD
4.2.3      Level 2 DFD
4.3.1      Timeline for Semester VII and VIII
5.1.1      Project Flow Diagram

LIST OF TABLES

Table No.  Table Title
3.4.1      Use Case Description 1
3.4.2      Use Case Description 2
3.4.3      Use Case Description 3
3.4.4      Use Case Description 4

Chapter 1

Introduction

1.1 Description

Sign language is a language used by deaf and speech-impaired people. It uses the
simultaneous orientation and movement of hand shapes instead of acoustically conveyed
sound patterns.

Deaf and speech-impaired people often rely on sign language interpreters to communicate.
However, finding experienced and qualified interpreters for day-to-day affairs throughout
their lives is very difficult and often unaffordable.

Sign language is the basic means of communication for people with hearing and speech
disabilities, who face difficulties in their day-to-day lives. We aim to develop a system
that eases this difficulty in communication. Sign language consists of making shapes or
movements with the hands with respect to the head or other body parts, along with certain
facial cues. A complete recognition system would thus have to identify the head and hand
orientation or movements, the facial expression and even the body pose. We propose the
design of a basic yet extensible system that is able to recognize static and dynamic
gestures of American Sign Language, specifically the letters a-z (where j and z are dynamic,
involving hand movement, while the rest are static). American Sign Language was chosen
because it is one of the most widely used sign languages.

1.2 Problem Formulation

Over 100 million people - more than 1% of the world’s population - are unable to hear. Being
deaf from birth or childhood, many of these people use sign language as their primary form
of communication.

There are several hundred sign languages around the world and these also have their own
dialects. One of the most common of these is American Sign Language (ASL). More than
500,000 people use ASL in the US alone, and millions more use it worldwide.

Most hearing people don’t know that written English is only the second language of people
who are born deaf. Although they can settle almost everything in writing, there are official
situations in which the cooperation of a sign language interpreter is necessary, because
they prefer communicating in their first language – sign language.

There is an undeniable communication problem between the Deaf community and the hearing
majority. Innovations in automatic sign language recognition try to tear down this
communication barrier. Our contribution considers a recognition system using the Microsoft
Kinect, convolutional neural networks (CNNs) and GPU acceleration. Instead of constructing
complex handcrafted features, CNNs are able to automate the process of feature construction.

1.3 Proposed Solution


We aim to reduce the gap that exists between signers and non-signers by using Convolutional
Neural Nets (CNNs). The first step is to extract features from the frame sequences,
resulting in a representation consisting of one or more feature vectors. This is carried
out by a convolutional neural network trained on the reduced dataset, and it helps the
computer to differentiate between possible classes of actions. The second step is video
classification of the gestures. This is done by a Long Short Term Memory (LSTM) network, a
type of Recurrent Neural Network (RNN).

1.4 Scope of the Project

The main purpose is to enable a dialogue between signers and non-signers. This would be
beneficial in emergency situations that require a quick exchange of information, such as a
conversation between a physician and a patient. The vocabulary can be extended over time as
new words are added to the existing dataset.

Chapter 2

Review of Literature
2.1 Overview of the literature

The review of literature focuses on techniques used for gesture recognition and the color
spaces used while detecting different colors in the surrounding environment.

The survey in [1] provides a good overview of the various color models used for detection.
It reviews the different color models and the mathematical representation of each, along
with their advantages, disadvantages, comparison and suitable application areas. We will
use an RGB to HSV color conversion algorithm, as HSV is better suited for such operations.
One drawback is that the conversion from RGB to HSV takes time for higher-resolution images
or applications with higher frame rates.
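The conversion mentioned above can be illustrated with a minimal sketch for a single pixel. It uses Python's standard colorsys module rather than the project's own code; the helper name rgb_to_hsv_pixel and the choice to express hue in degrees are assumptions made for illustration.

```python
import colorsys

def rgb_to_hsv_pixel(r, g, b):
    """Convert one 8-bit RGB pixel to (hue in degrees, saturation, value)."""
    # colorsys works on floats in [0, 1] and returns hue as a fraction of a turn.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

pure_red = rgb_to_hsv_pixel(255, 0, 0)
pure_green = rgb_to_hsv_pixel(0, 255, 0)
```

Because the conversion is applied to every pixel, its cost grows with the image resolution and the frame rate, which is the drawback noted above.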

The approach in [2] describes skin detection in the HSV color space. To detect skin in an
RGB image, the image is first converted to HSV, which corresponds more closely to how humans
perceive color. Skin pixels are then selected using Hue values ranging from 6 to 38,
together with a mixture of different filters. The next step is thresholding, where non-skin
pixels are assigned the value 0 and skin pixels the value 1. Dilation and erosion with a
5x5 kernel, followed by a median filter with a 3x3 kernel, are used to soften the skin
image. After these steps, only the skin region appears as white pixels and all other pixels
are black. The OpenCV library was used for image processing and the system was developed
in C.
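The thresholding and morphological steps described above can be sketched with NumPy alone. This is not the code from [2]: the function names are hypothetical, the hue band is assumed to be expressed in degrees, and the median filter is left out to keep the example short.

```python
import numpy as np

HUE_MIN, HUE_MAX = 6, 38  # hue band quoted in the text; the scale (degrees) is an assumption

def skin_mask(hue):
    """Threshold a 2-D hue array: 1 for skin-coloured pixels, 0 otherwise."""
    return ((hue >= HUE_MIN) & (hue <= HUE_MAX)).astype(np.uint8)

def _windows(mask, k):
    """Return every k x k neighbourhood of a zero-padded binary mask."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    return np.lib.stride_tricks.sliding_window_view(padded, (k, k))

def dilate(mask, k=5):
    """A pixel becomes 1 if any pixel in its k x k neighbourhood is 1."""
    return _windows(mask, k).max(axis=(2, 3))

def erode(mask, k=5):
    """A pixel stays 1 only if every pixel in its k x k neighbourhood is 1."""
    return _windows(mask, k).min(axis=(2, 3))

# Synthetic hue image: a skin-coloured square with a one-pixel gap in the middle.
hue = np.full((9, 9), 100.0)
hue[2:7, 2:7] = 20.0
hue[4, 4] = 100.0

mask = skin_mask(hue)
closed = erode(dilate(mask))  # dilation followed by erosion fills the small gap
```

Dilating first and eroding afterwards fills small gaps inside the skin region while keeping its outline at roughly the original size.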

Image filtering algorithms, which are needed to filter out noise, are the main focus of the
paper in [3]. Filtering is required to reduce noise and improve the visual quality of the
image. The paper gives a detailed explanation of the various filtering techniques. The
proposed system will use mean filtering, as it removes noise while preserving the edges,
which is crucial because we need to detect the contours of the hand.
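A mean filter of the kind referred to above can be sketched as follows. The 3x3 kernel size and the edge-replicating border handling are assumptions; the report does not state which settings were used.

```python
import numpy as np

def mean_filter(image, k=3):
    """Replace each pixel by the mean of its k x k neighbourhood (borders replicated)."""
    pad = k // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(2, 3))

noisy = np.zeros((5, 5))
noisy[2, 2] = 9.0              # a single noisy pixel
smoothed = mean_filter(noisy)  # the spike is spread over its 3 x 3 neighbourhood
```

The kernel size trades noise suppression against sharpness: a larger window averages away more noise but also softens the contours that later steps rely on.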

The system in [4] uses adaptive boosting for hand detection and the Haar cascade classifier
algorithm to train the classifier. It uses the HSV color model for background subtraction
and noise removal, and the convex hull algorithm for drawing contours around the palm and
for fingertip detection. A laptop webcam with a resolution of 480p was used to capture the
stream. The system was implemented using OpenCV and C++.

A boundary detection algorithm for objects was proposed in [5]. The algorithm finds a
detailed boundary that includes the object’s outer border, also known as the 1-component.
It also finds the hole border between a hole and the 1-component directly surrounding it.
This can be adapted to detect the convex and concave parts of the hand in order to detect
its contours.
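To make the distinction between an outer border and a hole border concrete, the following simplified sketch marks every foreground pixel that touches the background. It is not the border-following algorithm of [5], which traces borders in order and records their hierarchy; the function name and the use of 4-connectivity are assumptions made for illustration.

```python
import numpy as np

def boundary_pixels(mask):
    """Mark foreground pixels that touch the background through a 4-neighbour."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, mode="constant", constant_values=False)
    # A pixel is interior when its up, down, left and right neighbours are all foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

blob = np.ones((5, 5), dtype=np.uint8)
blob[2, 2] = 0                  # a one-pixel hole in the middle
border = boundary_pixels(blob)  # outer ring plus the pixels around the hole
```

In this example the 16 pixels on the outer ring form the outer border, and the 4 pixels adjacent to the hole form the hole border.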

Convolutional Neural Networks are very similar to ordinary neural networks: they are made
up of neurons that have learnable weights and biases. Each neuron receives some inputs,
performs a dot product and optionally follows it with a non-linearity. The whole network
still expresses a single differentiable score function: from the raw image pixels on one
end to class scores at the other. They still have a loss function (e.g. SVM/Softmax) on the
last (fully-connected) layer, and the usual techniques for training regular neural networks
still apply. [6]
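The score function and the Softmax loss mentioned above can be written out for a single fully-connected layer. The weights, the input and the number of classes below are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def cross_entropy(scores, true_class):
    """Softmax loss: negative log-probability assigned to the correct class."""
    return -np.log(softmax(scores)[true_class])

# Illustrative values only: 3 classes, 2 input features.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
bias = np.zeros(3)
x = np.array([2.0, 1.0])

scores = weights @ x + bias  # one dot product per class
probs = softmax(scores)
loss = cross_entropy(scores, true_class=2)
```

The loss shrinks as the probability assigned to the correct class approaches 1, which is what training tries to achieve by adjusting the weights and biases.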

Chapter 3

System Analysis
3.1 Functional Requirements

1. Gesture Recognition: Software should automatically recognize the gesture through
the video input.
2. Authentic representation: Software should give out the correct meaning of the
gesture.
3. Cross platform support: Software should run on as many platforms as possible.

3.2 Non- Functional Requirements


1. Availability: The software should be available at all times.
2. Reliability: The software should provide accurate meaning of gestures.
3. Scalability: The software should be able to handle all the basic gestures.
4. Maintainability: The software should be coded in a way which is easily readable
and maintainable.

3.3 Specific Requirements (Hardware and software requirements)


3.3.1 Technical Feasibility
➢ Hardware Requirements
o Processor: 2.6 GHz or faster
o RAM: 4 GB or higher
o HDD: 2 GB available disk space
o GPU: NVIDIA GTX 1060 or higher or any AMD equivalent card
o Camera: At least 5MP

➢ Software Requirements
○ Programming: Python, OpenCV, TensorFlow, NumPy
○ Operating System: Ubuntu 16.04

3.3.2 Economic Feasibility

➢ Hardware Requirements

Requirement   Details            Cost
Processor     Intel i7 6700K     23000/-
RAM           16GB               8000/-
HDD           500GB              3000/-
GPU           NVIDIA GTX 1070    30000/-
Web Cam       Inbuilt            -

Total Cost: 64000/-

➢ Software Requirements

Requirement   Version          Cost
OpenCV        2.4 or higher    Open source
TensorFlow    1.0              Open source
Python        3.6              Open source

3.4 Use- Case Diagram and description

Fig 3.4.1 Use Case Diagram

Use case diagrams are usually referred to as behavior diagrams used to describe a set of
actions that some systems (subjects) should or can perform in collaboration with one or more
external users (actors) of the system. Each use case should provide some observable and
valuable result to the actors or other stakeholders of the system.
Use case diagrams are twofold - they are both behavior diagrams, as they describe the
behavior of the system, and also structure diagrams - as a special case of class diagrams
where classifiers are restricted to be either actors or use cases related to each other with
association.

Use case diagrams are used to specify:


● External requirements, required usage of a system under design or analysis (subject) -
what the system is supposed to do.
● The functionality offered by the subject - what the system can do
● Requirements the specified subject poses on its environment - by defining how the
environment should interact with the subject so that it will be able to perform its
services.

Below is a detailed study of the use case diagram:

Use Case description for “Perform Gesture”

Use Case          Track hands
Primary Actor     User
Goal in Context   Allows the user to track finger movements
Preconditions     User performs hand gestures and video processing is active
Trigger           On tracking user could perform gestures
Scenario          User performs gestures
Priority          Essential for displaying the meaning of gestures
Secondary Actor   Computer
Exception         Video feed cannot be tracked

Table 3.4.1 Use Case Description 1

Use Case description for “Capture Video”

Use Case          Video processing
Primary Actor     Webcam
Goal in Context   Webcam captures live feed
Preconditions     User must have a webcam
Trigger           On capturing live feed, track movements
Scenario          Live feed is captured and operations are performed
Priority          Every frame captured should be sent for processing
Secondary Actor   User
Exception         The software breaks

Table 3.4.2 Use Case Description 2

Use Case description for “Process Video”

Use Case          Process on Received Video
Primary Actor     Webcam
Goal in Context   Operations are performed on live feed
Preconditions     User must have a webcam
Trigger           On gesturing, tracking begins
Scenario          Webcam captures live video which is converted into images.
                  Operations are performed to enhance related features and gesture
                  movements
Priority          Every frame must be captured
Exception         The software breaks

Table 3.4.3 Use Case Description 3

Use Case description for “Display Gesture Meaning”

Use Case          Display gesture meaning
Primary Actor     Computer
Goal in Context   Display the meaning according to tracking and gesture recognition
Preconditions     User must have a webcam
Trigger           The sign is recognized
Scenario          Enhanced images are passed to the network for recognition
Priority          Every gesture must be recognized
Secondary Actor   User, webcam
Exception         Gesture cannot be recognized by the software

Table 3.4.4 Use Case Description 4

Chapter 4

Analysis Modeling

4.1 Activity Diagram

Fig 4.1.1 Activity Diagram

In the activity diagram, the system is expressed as various activities in step format.

First, the video is captured and split into frames. Each frame is then passed to the CNN for
processing. If the frames are training images, the CNN filters are trained; otherwise the
CNN outputs a feature vector. If the feature vector belongs to the training set, it is used
to generate training data for the RNN; otherwise it is used for prediction. Together, these
classifiers produce the converted text.
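One step in this flow, turning per-frame feature vectors into sequence data for the RNN, can be sketched as follows. The function name to_sequences, the sequence length and the feature size are assumptions made for illustration; the report does not state the values used.

```python
import numpy as np

def to_sequences(frame_features, seq_len):
    """Group per-frame feature vectors into non-overlapping sequences of seq_len frames.

    Frames left over at the end that do not fill a whole sequence are dropped.
    """
    feats = np.asarray(frame_features)
    usable = (len(feats) // seq_len) * seq_len
    return feats[:usable].reshape(-1, seq_len, feats.shape[-1])

# Made-up sizes: 10 frames with 4 features each, grouped into sequences of 3 frames.
features = np.arange(10 * 4, dtype=float).reshape(10, 4)
sequences = to_sequences(features, seq_len=3)
```

Each resulting sequence has the shape (frames, features), which is the form a recurrent network expects for a single sample.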

4.2 Functional Modeling
Context Level DFD/ Level 0 DFD

Fig 4.2.1 Level 0 DFD


In this level 0 data flow diagram, the whole system is represented with the help of input,
processing and output. The input to the gesture recognition system is the live feed from the
camera which contains the gestures performed by the user. The camera provides the frames
which can be mapped to their corresponding meanings.

Level 1 DFD

Fig 4.2.2 Level 1 DFD

In the level 1 data flow diagram, the gesture recognition module is explained in further
detail. The camera provides a live feed of the user's actions. Operations are performed to
enhance the hand movements. The hand movement is recognized and sent to the gesture tracking
module. The gesture tracking module checks for the gesture in the pre-trained network of
gestures and their respective meanings. The gesture control module maps the gesture to its
meaning. Through these processes the gesture is recognized and its meaning is displayed.
Level 2 DFD

Fig 4.2.3 Level 2 DFD

In the level 2 data flow diagram, the three important processes are explained:

Process 1.1 - The video is captured via the camera and sent for processing. Convolution
operations are performed on the feed to train the CNN filters. The primary focus is the hand
of the user. The hand is recognized by training the CNN over a number of samples.

Process 1.2 - The filtered images are stitched back into a video, and the pooled features
of the CNN are used to generate the RNN data. The gestures are recognized and tracking takes
place.

Process 1.3 - The tracked gesture is obtained and mapped to its meaning via the pre-trained
LSTM network. The meaning is then displayed to the user.

4.3 Timeline Chart and Gantt Chart

Fig 4.3.1 Timeline Chart

Chapter 5

Design
5.1 Architectural Design

Fig 5.1.1 Project Flow Diagram

Image in RGB: The captured image is in RGB format.

CNN: A convolutional layer consists of a set of learnable filters (or kernels), which have a
small receptive field but extend through the full depth of the input volume. During the
forward pass, each filter is convolved across the width and height of the input volume,
computing the dot product between the entries of the filter and the input and producing a
2-dimensional activation map of that filter.
Activation: The activation function applies a non-linearity to the linear combination of
inputs, so that the network can model non-linear relationships.
Pooling: Max pooling takes the maximum value within each cell of a defined grid.
LSTM: The LSTM is responsible for remembering values over arbitrary time intervals.
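The convolution and activation stages can be illustrated with a small NumPy sketch. It handles a single-channel image and one hand-picked filter, whereas the filters in a real network span the full input depth and are learned; ReLU is assumed as the activation because the report does not name one.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and take a dot product at each position.

    Like most CNN libraries, this does not flip the kernel (cross-correlation).
    """
    kh, kw = kernel.shape
    windows = np.lib.stride_tricks.sliding_window_view(image, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def relu(x):
    """ReLU activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])  # responds to dark-to-bright steps from left to right
activation_map = relu(conv2d_valid(image, edge_kernel))
```

The activation map is non-zero only in the column where the image steps from dark to bright, which shows how a filter acts as a detector for one local pattern.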

Chapter 7

Testing
7.1 Test Cases

Chapter 8

Results and Discussions


We started out by building our own Convolutional Neural Network and testing it on a static
gesture database. The results were very good, with an accuracy of about 91%.

We then used the same Convolutional Neural Network for image sequences, which we would
later process as time sequences. However, instead of learning the gestures, the network was
learning the people and faces. We then changed a few hyperparameters to try to make the
network learn the gestures (54%).

This was not good either, as the data was not clean and contained other gestures in between
the frames of the current gesture. We then sifted through the images and cleaned the data so
that it contained only the gestured part of each video. This was a clear improvement (61%),
as the network was starting to learn and identify the gestures.

For processing the image sequences as a time series, we tried using 3D Convolutional
Networks, which did not work due to memory constraints. To tackle this we tried batching the
sequences, but the memory constraints remained. We then implemented LSTM networks and found
that a single-layer LSTM network is sufficient for this task.
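What a single LSTM layer computes over a sequence can be shown with a minimal NumPy forward pass. This is a sketch of the standard LSTM equations, not the project's TensorFlow code; the sizes, the random weights and the gate ordering inside the stacked matrices are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(sequence, W, U, b):
    """Run one LSTM layer over a (time, features) sequence and return the final hidden state.

    W: (4*hidden, features), U: (4*hidden, hidden), b: (4*hidden,)
    Assumed gate order in the stacked matrices: input, forget, output, candidate.
    """
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state, the layer's memory
    for x_t in sequence:
        z = W @ x_t + U @ h + b
        i = sigmoid(z[:hidden])                # how much new information to write
        f = sigmoid(z[hidden:2 * hidden])      # how much old memory to keep
        o = sigmoid(z[2 * hidden:3 * hidden])  # how much memory to expose
        g = np.tanh(z[3 * hidden:])            # candidate values
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
features, hidden, steps = 8, 4, 5  # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * hidden, features))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
final_state = lstm_forward(rng.normal(size=(steps, features)), W, U, b)
```

In a classifier, the final hidden state would typically be passed to a fully connected softmax layer with one output per gesture class.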

We later tried retraining the Inception V3 model, as it already had pre-trained weights and
updating those weights would be much more efficient than training an entire model from
scratch. We found that the model was learning the gestured parts and giving a decent
accuracy (63%). After passing its output through the LSTM network, the accuracy came out to
be about 81%, and we could identify the gestures live fairly well.

The current solution performed best among the approaches we tried because it extracts the
features of each frame and also recognizes the sequence in which the gesture takes place.

Chapter 9

Conclusion and Future work


9.1 Conclusion
The report describes the implementation of a system for recognizing sign language. The
input consists of a video of the sign to be recognized. The video is recorded, preprocessed
and then sent to the server. The server returns the meaning of the sign contained in the
video as output. Such a system will be an invaluable asset to the many institutions that
depend solely on manual translators.

9.2 Future Scope

This project holds immense potential in terms of real world applications and can be used as
a platform for the development of solutions to a number of problems. The future scope of
this project includes:
● Increasing the accuracy of the current system: This can be done in two ways. Firstly, by
increasing the size of the training dataset to include more variations of characters.
Secondly, by training a deeper model. However, this will require a machine with a very
high configuration.
● Script recognition engine: The base framework and training approach used in this
project can be used to create an application that learns to recognize handwritten
scripts, given a sizeable dataset to learn from.

Appendix

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers
(often with a subsampling step) and then followed by one or more fully connected layers as in
a standard multilayer neural network. The architecture of a CNN is designed to take
advantage of the 2D structure of an input image (or other 2D input such as a speech signal).
This is achieved with local connections and tied weights followed by some form of pooling
which results in translation invariant features. Another benefit of CNNs is that they are easier
to train and have many fewer parameters than fully connected networks with the same
number of hidden units.

Back-propagation: Back-propagation is a method used in artificial neural networks to
calculate the error contribution of each neuron after a batch of data (in image recognition,
multiple images) is processed. This is used by an enveloping optimization algorithm to adjust
the weight of each neuron, completing the learning process for that case.
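The idea can be seen on the smallest possible case: one tanh neuron, one training example and one gradient-descent update. All numbers below, including the learning rate, are arbitrary illustrative values.

```python
import numpy as np

def forward(w, b, x):
    """Output of a single tanh neuron."""
    return np.tanh(w @ x + b)

def loss_and_grads(w, b, x, target):
    """Squared-error loss for one neuron and its gradients via the chain rule."""
    y = forward(w, b, x)
    loss = 0.5 * (y - target) ** 2
    dloss_dy = y - target
    dy_dz = 1.0 - y ** 2      # derivative of tanh
    dz = dloss_dy * dy_dz     # error signal at the neuron's input
    return loss, dz * x, dz   # loss, dloss/dw, dloss/db

w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])
target = 0.8
learning_rate = 0.5           # arbitrary choice

loss_before, grad_w, grad_b = loss_and_grads(w, b, x, target)
w_new = w - learning_rate * grad_w  # the optimizer step that uses the gradients
b_new = b - learning_rate * grad_b
loss_after, _, _ = loss_and_grads(w_new, b_new, x, target)
```

In a multi-layer network the same chain rule is applied layer by layer, passing the error signal backwards from the output to the earlier layers.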

Pooling: Convolutional networks may include local or global pooling layers, which combine
the outputs of neuron clusters at one layer into a single neuron in the next layer. For example,
max pooling uses the maximum value from each of a cluster of neurons at the prior layer.
Another example is average pooling, which uses the average value from each of a cluster of
neurons at the prior layer.
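The two pooling variants can be compared on a small feature map. The sketch below assumes non-overlapping 2x2 clusters and a map whose sides are multiples of the cluster size; the function name pool2d is hypothetical.

```python
import numpy as np

def pool2d(feature_map, size, mode="max"):
    """Pool non-overlapping size x size blocks; both sides must be multiples of size."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")

feature_map = np.array([[1, 3, 2, 0],
                        [5, 7, 1, 1],
                        [0, 2, 4, 4],
                        [2, 0, 4, 8]], dtype=float)
max_pooled = pool2d(feature_map, 2, "max")   # [[7, 2], [2, 8]]
avg_pooled = pool2d(feature_map, 2, "mean")  # [[4, 1], [1, 5]]
```

Both variants shrink the 4x4 map to 2x2; max pooling keeps the strongest response in each cluster, while average pooling summarizes the whole cluster.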

Fully connected: Fully connected layers connect each neuron in a layer to every neuron in
another layer. It is in principle the same as the traditional multi-layer perceptron neural
network (MLP).

Weights: CNNs share weights in convolutional layers, which means that the same filter
(weights bank) is used for each receptive field in the layer; this reduces memory footprint and
improves performance.
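The effect of weight sharing on the parameter count can be checked with simple arithmetic. The layer sizes below are illustrative and not taken from the report: a 32x32 RGB input, sixteen 3x3 filters, and a fully connected layer producing the same number of output activations.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter; the same filter is reused at every position."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Every unit has its own weight for every input, plus one bias."""
    return (inputs + 1) * units

# Illustrative sizes: 32 x 32 x 3 input, 16 output maps of 32 x 32 (same padding assumed).
conv_count = conv_params(3, 3, 3, 16)                  # 448
dense_count = dense_params(32 * 32 * 3, 32 * 32 * 16)  # 50,348,032
```

With these sizes the convolutional layer needs 448 parameters, against roughly 50 million for the fully connected alternative.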

Literature Cited

[1] Ibraheem, N.A., Hasan, M.M., Khan, R.Z. and Mishra, P.K., 2012. Understanding color
models: a review. ARPN Journal of Science and Technology, 2(3), pp.265-275.

[2] Oliveira, V.A. and Conci, A., 2009. Skin Detection using HSV color space. In H. Pedrini,
& J. Marques de Carvalho, Workshops of Sibgrapi (pp. 1-2).

[3] Chandel, R. and Gupta, G., 2013. Image filtering algorithms and techniques: A review.
International Journal of Advanced Research in Computer Science and Software Engineering,
3(10).

[4] Gurav, R.M. and Kadbe, P.K., 2015, May. Real time finger tracking and contour detection
for gesture recognition using OpenCV. In Industrial Instrumentation and Control (ICIC),
2015 International Conference on (pp. 974-977). IEEE.

[5] Suzuki, S., 1985. Topological structural analysis of digitized binary images by border
following. Computer vision, graphics, and image processing, 30(1), pp.32-46.

[6] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language Recognition
Using Convolutional Neural Networks. In: Agapito L., Bronstein M., Rother C. (eds)
Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer
Science, vol 8925. Springer, Cham.

[7] Andrej Karpathy. Stanford university cs231n: Convolutional neural networks for visual
recognition. http://cs231n.stanford.edu/, March 24 2017. [Online].

[8] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper
with convolutions. CVPR, 2015.

Acknowledgements

First and foremost, we would like to thank our supervisors Mrs. Sridari Iyer and Mrs. Priya
Chaudhary for their valuable guidance and advice rendered to us throughout the work process
of this term paper.

Besides, an honorable mention goes to our fellow classmates for their understanding and
support in completing this report. We would also like to show gratitude to our project
coordinator Mrs. Vincy Joseph for sharing her pearls of wisdom with us during the course of
this research.

We are immensely grateful to them, although any errors are our own and should not tarnish
the reputations of these esteemed persons.

We would also like to thank the college, principal and brother for giving us an opportunity
to work on this project.
