17BIS0101 - Capstone Project Report


Communicator Assistance System for Blind-Deaf-Dumb

People using CAN

Submitted in partial fulfilment of the requirements for the degree of

Bachelor of Technology
in

Electronics and Communication Engineering with


Specialization in IOT and Sensors

By

A.T. Ruthvik Srinivasa Deekshitulu

17BIS0101

Under the guidance of

Prof. Surya Prakash

School of Electronics Engineering,


VIT, Vellore.

May 2021
DECLARATION

I hereby declare that the thesis entitled “Communicator Assistance System for Blind-Deaf-
Dumb People using CAN” submitted by me, for the award of the degree of Bachelor of
Technology in Electronics and Communication Engineering with specialization in IoT and Sensors to
VIT, Vellore is a record of bonafide work carried out by me under the supervision of Prof.
Surya Prakash.
I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.

Place: Vellore Signature of the Candidate

Date: 15 May 2021 (A.T. Ruthvik Srinivasa Deekshitulu)

CERTIFICATE

This is to certify that the thesis entitled “Communicator Assistance System for Blind-Deaf-
Dumb People using CAN” submitted by A.T. Ruthvik Srinivasa Deekshitulu (17BIS0101),
School of Electronics Engineering, VIT, Vellore, for the award of the degree of Bachelor of
Technology in Electronics and Communication Engineering, is a record of bonafide work
carried out by him under my supervision during the period 21-12-2020 to 31-05-2021, as
per the VIT, Vellore code of academic and research ethics.

The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The thesis fulfils the requirements and regulations of the University
and in my opinion, meets the necessary standards for submission.

Place: VIT Vellore Signature of the Guide

Date: 15 May 2021 (Prof. Surya Prakash)

Internal Examiner External Examiner

Prof. Dr. SASIKUMAR P

Head of Department

Electronics and Communication Engineering

ACKNOWLEDGEMENTS

With immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to
my supervisor Prof. Surya Prakash, School of Electronics Engineering, VIT, Vellore; without
his motivation and continuous encouragement, this research would not have been
successfully completed.

I am grateful to the Chancellor of VIT, Vellore, Dr. G. Viswanathan, the Vice Presidents, and the
Vice Chancellor for motivating me to carry out research at VIT, Vellore, and also for
providing me with the infrastructural facilities and many other resources needed for my
research.

I express my sincere thanks to Dr. Kittur Harish Mallikarjun, Dean, School of Electronics
Engineering, VIT, Vellore for his kind words of support and encouragement.

I would also like to take this opportunity to express my humble gratitude to the HoD, Dr.
Sasikumar P. His constant guidance and willingness to share practical knowledge helped me
understand the practical and theoretical working of this project and its manifestations in
great depth.

Lastly, I would like to thank my parents for always motivating me and showing me the
right path from the very beginning.

A.T.Ruthvik Srinivasa Deekshitulu


17BIS0101

EXECUTIVE SUMMARY

In this thesis we discuss the development of a communication method between blind and
deaf and/or dumb people using Deep Learning techniques, mainly Convolutional Neural Networks and
Computer Vision. Deaf and/or dumb people use sign language for essential
communication, while blind people use braille script to read and their voice for
communication. This thesis aims to develop a communicator assistance model for
the interaction among differently-abled people such as the blind and the deaf and/or dumb. A
Communicator Assistance Network (CAN) has been modeled to detect sign language
from a deaf and/or dumb person through a webcam with the help of OpenCV and construct
words which can be converted to speech for blind people; the whole process is
reversed for responding to the conversation. The proposed CNN, working along with other
techniques like Speech-to-Text and Text-to-Speech, achieves a training accuracy of
99%, with batches of size 20 running through 10 epochs, taking 1344 steps to finish an
epoch. Our model can be used where interaction with or between challenged people is
more likely to happen, such as in schools for specially challenged people or at the workplace, as this
application is an asset to them in communicating with others and expressing their voice.

TABLE OF CONTENTS

DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
EXECUTIVE SUMMARY
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 Objective
1.2 Motivation
1.3 Background

2 PROJECT DESCRIPTION AND GOAL
2.1 Description
2.2 Goals

3 TECHNICAL SPECIFICATIONS
3.1 Software Requirements (Python)
3.2 Hardware Requirements
3.3 Libraries Used

4 METHODOLOGY AND DETAILS
4.1 State of the Art
4.1.1 CNN
4.1.2 Hand Sign Recognition
4.1.3 Speech Recognition
4.2 Methodology
4.2.1 Dataset Features
4.2.2 Preprocessing
4.2.3 Model
4.2.3.1 Architecture
4.2.3.2 Compilation and Fitting
4.2.3.3 Predictions
4.2.4 Text to Speech
4.2.5 Speech to Text
4.2.6 Text to SignSlides
4.3 Flow Chart
4.4 Codes and Standards
4.4.1 PEP 8
4.5 Constraints, Alternatives and Tradeoff
4.5.1 Constraints
4.5.2 Alternatives
4.5.2.1 GANs

5 SCHEDULE, TASKS AND MILESTONES
5.1 Project Timeline
5.2 Tasks
5.3 Milestones

6 PROJECT DEMONSTRATION

7 RESULTS AND DISCUSSION

8 SUMMARY/CONCLUSION

9 REFERENCES

LIST OF FIGURES

Figure 4-1(a). Images before preprocessing
Figure 4-1(b). Images after preprocessing
Figure 4-2. Network architecture
Figure 4-3(a). Training phase graphs, accuracy vs. epoch
Figure 4-3(b). Training phase graphs, loss vs. epoch
Figure 4-4. Flowchart of project
Figure 6-1(a). Proposed model sign prediction for various letters, predicting ‘U’
Figure 6-1(b). Proposed model sign prediction for various letters, predicting ‘L’
Figure 6-2(a). Text to Speech API with different functions, taking input
Figure 6-2(b). Text to Speech API with different functions, saying and saving the text
Figure 6-2(c). Text to Speech API with different functions, stopping the function
Figure 6-3(a). Speech to Text API with different functions, listening to the audio
Figure 6-3(b). Speech to Text API with different functions, recognizing the audio
Figure 6-3(c). Speech to Text API with different functions, stopping the loop

LIST OF ABBREVIATIONS

CNN Convolutional Neural Network


HMM Hidden Markov Model

GMM Gaussian Mixture Model


GAN Generative Adversarial Network

STT Speech to Text

TTS Text to Speech

HSR Hand Sign Recognition

CAN Communicator Assistance Network

RGB Red-Green-Blue

GUI Graphical User Interface

ISL Indian Sign Language

ASL American Sign Language

1 INTRODUCTION

1.1 OBJECTIVE
Our objective is to develop a communication method between blind and deaf and/or dumb people
using Deep Learning techniques, mainly Convolutional Neural Networks and Computer
Vision. Communication is an integral part of our daily life for interacting with others, but it is
challenging for differently-abled people such as the deaf and/or dumb, who use sign language for
essential communication. This sign language is still not adequate for blind people; they use
braille script to read and their voice for communication. This thesis aims to develop
a communicator assistance model for the interaction among differently-abled people such as
the blind and the deaf and/or dumb. A neural network model, SignNet, has been designed to detect
sign language from a deaf and/or dumb person through a webcam with the help of OpenCV
and construct words which can be converted to speech for blind people; the whole process
is reversed for responding to the conversation. Our neural network works with an
accuracy of 99%, with batches of size 20 running through 10 epochs, taking 1344 steps to
finish an epoch. Our model can be used where interaction with or between challenged
people is more likely to happen, such as in schools for specially challenged people or at the workplace,
as this application is an asset to them in communicating with others and expressing their
voice.

1.2 MOTIVATION
According to a World Health Organization (WHO) study/survey, 15% of people in the
world live with disabilities. With the assistance of Deep Learning and modern production
technology, it is now possible to explore and focus on the problems that were neglected in
earlier days due to the lack of supportive technologies. Deaf and/or dumb people communicate with
the use of sign language, whereas blind people use braille script for reading and/or writing and
their voice for communication. So we want to develop a communication assistance system
between these two groups of disabled people.

Gesture recognition is useful not only for communication among disabled persons; it can also be
applied in medicine, chemistry, robotics, and many more fields. It also helps build new and
more natural human-machine interface methods. Real-time identification of hand movements
consists of recognizing a specific gesture performed by the hand at any instant, with no
perceivable pause. The gesture is a central feature of human-to-human contact. The role of
visual cues in spoken language, however, is not well understood. On the other hand, sign
language offers a simple structure with a given inventory and grammatical rules regulating
joint manual articulation (movement, shape, direction, position of articulation) and facial
expression (eye gaze, eyebrows, mouth, head orientation).

1.3 BACKGROUND
In many implementations such as voice-enabled computers, navigation systems, and
usability for the visually disabled, synthesizing artificial human speech from text, widely
known as text-to-speech (TTS), is an important part. Fundamentally, without having visual
interfaces, it facilitates human-technology interaction. Modern TTS systems are based on
complicated, multi-stage pipelines of processing, each of which may depend on heuristics
and hand-engineered features. Because of this difficulty, it can be very labor-intensive and
challenging to build modern TTS systems [8]. Artificial neural networks (ANNs) have been
used to model Hidden Markov Model (HMM) speech recognizers' state emission
probabilities since the early ‘90s. While conventional Gaussian mixture model (GMM)-
HMMs model context-dependence through related context-dependent states (e.g. CART-
clustered crossword triphones), ANN-HMMs have never been specifically used to do so [10].
Recent developments in deep learning indicate that the end-to-end speech-to-text translation
paradigm is a compelling approach for advancing the field of speech translation [11].

Now, an assistance model is possible for conversation among deaf and/or dumb
and/or blind people. This model has two main sub-models: one model recognizes the
hand gesture and converts it into text, while the other sub-model converts this text to
speech. Plenty of research work has been done in these areas; still, a precise combined
model which can assist deaf, dumb, and blind people simultaneously is not available.

2 PROJECT DESCRIPTION AND GOAL

2.1 DESCRIPTION
The aim of this project is to develop a communication method between blind and deaf and/or dumb
people using Deep Learning techniques, mainly Convolutional Neural Networks and Computer
Vision. A Communicator Assistance Network (CAN) has been modeled to detect sign
language from a deaf and/or dumb person through a webcam with the help of OpenCV and
construct words which can be converted to speech for blind people; the whole process
is reversed for responding to the conversation. The proposed neural network works
with an accuracy of 98%, with batches of size 20 running through 15 epochs, taking 1459
steps to finish an epoch.

First, the hand gestures of the deaf/dumb person are detected using a webcam. The sign
language is then predicted and words are constructed from the gestures. These words are
converted to an audio file and stored as an mp3. The saved audio file is played to the blind person.

After hearing what the deaf/dumb person said, the blind person responds to the audio
with his voice. This speech is converted to text using a speech recognition
technique. The converted text is then displayed to the deaf/dumb person on a monitor in
the form of a slideshow of hand gestures.

2.2 GOALS
● Develop a communication assistance system between blind and deaf and/or dumb people.
● Enhance our model and make it capable of working with a massive number of signs.
● Improve the accuracy of the current model.

3 TECHNICAL SPECIFICATIONS

3.1 SOFTWARE REQUIREMENTS (PYTHON)


In this project, Python is used to build, process, explore and plot useful data and information.
Python has an extensive number of standard libraries with high-level data structures and
new functions. Python's simple syntax and its ability to process and analyse a huge
volume of data make it easier to build prototypes for machine learning. The version of
Python used is 3.8.1.

3.2 HARDWARE REQUIREMENTS


As for the hardware, we will be using a system with 16 GB of virtual memory and a 2 GB GDDR5
video card for hardware acceleration, to improve training time and model
performance.

3.3 LIBRARIES USED


Pandas: this library is used for data manipulation and analysis. This library provides a data
frame to store data which can be easily retrieved and worked upon with a minimum number
of commands. Version: 1.0.3

Numpy: this library is used to process multidimensional arrays of data sets and perform
various mathematical operations on them. Concatenation, merging, and converting to arrays
are some of the operations it can perform with ease. Version: 1.18.3

Matplotlib: this library is used for plotting and creating figures. It makes it easy to plot many
kinds of graphs, such as bar charts, line plots, histograms, etc. Version: 3.2.1

Keras: this is a neural network library built in Python. It is used for developing deep
neural network models with a few lines of code. Version: 2.3.0

Tensorflow: TensorFlow is a free and open-source software library for machine learning. It
can be used across a range of tasks but has a particular focus on training and inference of
deep neural networks. TensorFlow is a symbolic math library based on dataflow and
differentiable programming.

OpenCV: OpenCV (Open Source Computer Vision Library) is a library of programming
functions mainly aimed at real-time computer vision. The library is cross-platform and free
for use under the open-source Apache 2 License. Since 2011, OpenCV has featured GPU
acceleration for real-time operations.

Tkinter: Tkinter is a Python binding to the Tk GUI toolkit. It is the standard Python interface
to the Tk GUI toolkit, and is Python's de facto standard GUI. Tkinter is included with
standard Linux, Microsoft Windows and Mac OS X installs of Python. The name Tkinter
comes from Tk interface.

Random: the random module is used to generate random numbers in Python. These are
pseudo-random numbers, as the sequence of numbers generated depends on the seed; if the
seed value is the same, the sequence will be the same.

OS: the OS module in Python provides functions for interacting with the operating system.
It comes under Python's standard utility modules and provides a portable way of using
operating system-dependent functionality. The os and os.path modules include many
functions to interact with the file system.

Time: The Python time module provides many ways of representing time in code, such as
objects, numbers, and strings. It also provides functionality other than representing time,
like waiting during code execution and measuring the efficiency of your code.

pyttsx3: pyttsx3 is a text-to-speech conversion library in Python. Unlike alternative libraries,
it works offline and is compatible with both Python 2 and 3.

gTTS: the Google Text-to-Speech API, commonly known as the gTTS API. gTTS is a very easy-to-use
tool which converts the entered text into audio that can be saved as an mp3 file.

speech_recognition: a library for performing speech recognition, with support for several
engines and APIs, online and offline. Supported engines include CMU Sphinx (works offline),
Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice
Recognition, Houndify API, and IBM Speech to Text.

4 METHODOLOGY AND DETAILS

4.1 STATE OF THE ART

4.1.1 CNN
A Convolutional Neural Network (CNN) is a special form of Artificial Neural Network (ANN)
that contains repeated sets of neurons applied across the space of an image.
These sets of neurons are referred to as 2D convolutional kernels, repeatedly applied over all
the patches of an image. This helps to learn various meaningful features from every
subspace of the whole space (image). The reason CNNs are chosen for image
classification is to exploit spatial or temporal invariance in recognition. In terms of effectiveness,
CNNs have outperformed previously used techniques, such as Support Vector
Machines (SVM) and K-Nearest Neighbors (K-NN), in classifying and learning from images. Two
major components of a CNN that differ from regular neural networks are:

• Convolution Layer

• Pooling Layer

A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activation values to another through a differentiable function. The first layer is the
convolution layer. It can consist of one or more kernels, and its output is passed through a
non-linearity, i.e. the Rectified Linear Unit (ReLU).

The pooling layer is the way to reduce the spatial dimensions. It adds spatial
invariance to the CNN model and thus helps avoid overfitting. There are several kinds of pooling, such as
max pooling, min pooling, sum pooling, etc. The fully connected layer is the last step of a CNN
and connects the CNN to the output layer; the expected number of outputs is constructed
in this layer. A minimal sketch of this pattern is shown below.
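The convolution → pooling → fully connected pattern described above can be written in a few lines of Keras. The following is an illustrative sketch only, with assumed layer sizes; it is not the project's final architecture (that is given in section 4.2.3.1).

```python
# Minimal sketch of the convolution -> pooling -> fully connected pattern;
# layer sizes here are illustrative, not the project's final model.
from tensorflow.keras import layers, models

toy_cnn = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu",
                  input_shape=(128, 128, 1)),   # convolution kernel + ReLU non-linearity
    layers.MaxPooling2D((2, 2)),                # pooling layer reduces the spatial dimension
    layers.Flatten(),                           # roll the feature maps into a single vector
    layers.Dense(10, activation="softmax"),     # fully connected layer sized to the output classes
])
toy_cnn.summary()
```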

4.1.2 HAND SIGN RECOGNITION


Hand sign recognition (HSR) is a daunting activity because of the difficulties with normal
cameras in detecting and monitoring hands and the limitations of conventional manually
selected features.[5]

A comprehensive real-time hand posture recognition method based on a Kinect has been
established by Tang et al. (2013) [5]. Tests validate that the proposed device operates rapidly
and reliably and reaches an identification accuracy as high as 98.12%, which can also be used
for sign language recognition.
A hybrid CNN-HMM approach for sign recognition has been proposed by Koller et al.
(2016) [7] that is consistent with Bayesian concepts, combining the powerful discriminatory
abilities of CNNs with the sequence modeling capabilities of HMMs which improve the
results.
A universal real-time hand gesture identification model is recommended by Edison and
Marco (2019) [4]. The electromyographic (EMG) signal is measured on the forearm, and an
autoencoder and classifier are used to extract features through an artificial feed-forward neural
network. They obtained an overall identification accuracy of 85.08 ± 15.21%, with an average
reaction time of 3 ± 1 ms. The accuracy can be increased with the use of recurrent neural
networks, which can predict the present gesture based on previous history.
To help people who do not speak, Ahmed et al. (2019) [2] built a deep learning model for
Bangla speech based on convolutional neural networks. It classifies digits based on hand
signals with 92% precision on validation results, ensuring a highly trustworthy framework.
There is room for tool upgradation, such as taking advantage of more detailed hand-structure
recording devices like Leap Motion or Xbox Kinect, which can significantly increase the
tool's efficiency.
Hand signs are the language that creates communication channels among deaf and/or dumb
people, and recognizing signs precisely using artificial intelligence requires a large dataset.
The few datasets available are not sufficient to cover the full vocabulary, so
Dongxu Li et al. (2020) [1] introduce Word-Level American Sign Language
(WLASL), a modern large-scale video dataset featuring over 2000 terms signed by more than 100 signers,
which, in terms of vocabulary size and the number of examples per class, is
the largest publicly accessible ASL dataset. To improve word-level sign language recognition
algorithms on such a large dataset, more sophisticated learning algorithms, such as few-shot
learning, are needed. Later it can be used to enable grammatical- and narrative-level
computer interpretation of signs. Evaluating WLASL with different models shows that pose-
based and appearance-based models perform comparably, reaching up to 62.63% top-10
accuracy on 2,000 words/glosses, which indicates the validity and the challenges
of the dataset.

Coster et al. (2020) [3] combined end-to-end feature learning with Convolutional Neural
Networks and feature extraction with OpenPose for human keypoint estimation to achieve a
precision of 74.7 percent on a 100-class vocabulary.

4.1.3 SPEECH RECOGNITION


A novel way of phoneme boundary detection with a deep neural network, using the
connectionist temporal classification loss for the segmentation model, is recommended by
Sercan et al. (2017) [8]. They introduced a version of WaveNet for the audio synthesis model that
needs fewer parameters and trains faster, achieving speed-ups of up to 400x over existing
implementations.
A text-to-speech (TTS) strategy based on deep convolutional neural networks (CNN),
without the use of any recurrent units, has been introduced by Tachibana et al. (2018) [9];
it is much faster than RNN-based methods due to its high parallelizability and can be
adequately trained overnight (15 hours).
Seide et al. (2011) [10] apply the recently proposed Context-Dependent Deep Neural-
Network Hidden Markov Model to speech-to-text transcription, scaled up to large speech-
to-text transcription data sets. It merges conventional artificial neural networks and HMMs
with standard tied-state triphones and deep-belief-network pre-training.

4.2 METHODOLOGY

4.2.1 DATASET FEATURES


The dataset consists of 20,000 unprocessed RGB images of hand sign samples, separated into
folders whose names are the sign classes. Our dataset consists of both numeric and alphabetic
classes. The dataset was gathered from Kaggle, a popular Data Science community site. All
the images are 400x400 pixels and will be resized to 128x128 for faster processing.

4.2.2 PREPROCESSING
First, the resized (128x128) dataset is split into three parts: train (70%), validation
(10%) and test (20%). We preprocess the images in the train dataset, as normal
colour images have a lot of features which we do not need. So, we turn them into
grayscale images and add some Gaussian blur with a 5x5 kernel. These smoothened images
are then used to detect edges through Canny edge detection. Next, we dilate the images, which
increases the brightness of the images by taking the neighborhood maximum as the
structuring element passes over the image; with binary images, dilation connects areas that are
separated by spaces smaller than the structuring element, adds pixels to the perimeter
of each shape, and highlights the desired features. We then erode the images to remove
unwanted noise and finally rescale the image array values from the (0, 255) to the (0, 1) range for
better performance while training. A sketch of this pipeline is given below.
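The following OpenCV sketch follows the pipeline described above; the Canny thresholds and the structuring-element size are assumptions, since they are not stated in the text.

```python
# Sketch of the preprocessing pipeline: grayscale -> 5x5 Gaussian blur -> Canny
# edges -> dilate -> erode -> rescale to [0, 1]. Thresholds/kernel size assumed.
import cv2
import numpy as np

def preprocess(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)    # drop colour information
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)           # 5x5 Gaussian blur
    edges = cv2.Canny(blurred, 50, 150)                   # Canny edge detection (assumed thresholds)
    kernel = np.ones((3, 3), np.uint8)                    # structuring element (assumed size)
    dilated = cv2.dilate(edges, kernel, iterations=1)     # connect nearby edge fragments
    eroded = cv2.erode(dilated, kernel, iterations=1)     # remove leftover noise
    resized = cv2.resize(eroded, (128, 128))              # match the network input size
    return resized.astype("float32") / 255.0              # rescale (0, 255) -> (0, 1)
```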

Next, we use the Keras ImageDataGenerator API, which helps in rescaling, augmenting,
zooming, and rotating images randomly. This processed data is then used for the model.
Hand gesture images with and without preprocessing are shown in figure 4-1; a sketch of the
generator setup follows the figure.

Figure 4-1. Dataset features: (a) without preprocessing [12], and (b) with preprocessing
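A possible way to set up the generator is sketched below; the directory path, augmentation values, and class-mode choice are assumptions rather than values stated in the report.

```python
# Sketch of the Keras ImageDataGenerator usage described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,        # rescale pixel values
    rotation_range=10,        # light random rotation (assumed value)
    zoom_range=0.1,           # light random zoom (assumed value)
)
train_flow = train_gen.flow_from_directory(
    "dataset/train",          # hypothetical path: one sub-folder per sign class
    target_size=(128, 128),
    color_mode="grayscale",
    batch_size=20,            # batch size stated in the report
    class_mode="sparse",      # integer labels, matching sparse categorical cross-entropy
)
```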

4.2.3 MODEL
The model is wrapped in a class that takes the training data in its initializer and has a
predict_words function that returns the predicted words. Our model is created with the Keras
API, which uses TensorFlow-GPU as the backend.

4.2.3.1 ARCHITECTURE
The data initially enters a block of two convolutional layers with 64 filters each, a 5x5 kernel,
and the ReLU activation function. It is then passed to a max pooling layer with a pooling size of
2x2. This is followed by another set of 32-filter convolutional layers with a 3x3 kernel and a max
pooling layer with a similar configuration. We also use a dropout layer with a probability of 50%.
The data finally enters the flatten layer, which rolls out all the data into a single vector, and then
the dense hidden layers with 512, 256 and 256 units, all with the ReLU activation function. At the
output layer, we have a dense layer with a number of units equal to the number of labels and the
softmax activation function. The model architecture diagram is represented in figure 4-2, and a
sketch of the corresponding code follows the figure.

Figure 4-2. Network Architecture
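The following Keras sketch follows the architecture described above; where the text is ambiguous (for example, the exact repetition of the convolutional blocks and the number of output classes), the choices below are assumptions.

```python
# Sketch of the architecture described in section 4.2.3.1.
from tensorflow.keras import layers, models

NUM_CLASSES = 38  # number of sign labels (see section 8); adjust to the dataset

model = models.Sequential([
    layers.Conv2D(64, (5, 5), activation="relu", input_shape=(128, 128, 1)),
    layers.Conv2D(64, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.5),                      # 50% dropout
    layers.Flatten(),                         # roll out into a single vector
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```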

4.2.3.2 COMPILATION AND FITTING


For compilation, we used the Adam optimizer with a learning rate of 0.001, sparse categorical
cross-entropy as the loss, and accuracy as our metric. For fitting, the data flows into the model
from the ImageDataGenerator in batches of 20, over 15 epochs. Our fitted model is saved in a
file called “SignNet.h5,” and our logs are saved in “history.csv.” This HDF5 file is loaded into
our main program for using the model. A sketch of these steps is given below.
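A sketch of the compile, fit, and save steps follows; here 'train_flow' and 'val_flow' are assumed to be the ImageDataGenerator iterators over the training and validation splits (the latter is not shown earlier and is an assumption).

```python
# Sketch of compilation, fitting, and saving the model and logs.
import pandas as pd
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(train_flow, validation_data=val_flow, epochs=15)

model.save("SignNet.h5")                                           # fitted model (HDF5)
pd.DataFrame(history.history).to_csv("history.csv", index=False)   # training logs
```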

Figure 4-3. Training phase graphs: (a) accuracy vs. epoch, and (b) loss vs. epoch

4.2.3.3 PREDICTIONS
For predicting the word, we use a function named getClassName, which takes image data
from a real-time webcam video stream where a frame is sampled at 2-second intervals to
predict the character from the sign shown by the person. Each frame is processed and sent to
our model, and np.argmax is used on the prediction to find the index of the character
shown. This character is concatenated to our string ‘text’, which is sent to the next step. A
sketch of this loop is given below.
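The following sketch illustrates this loop; the label ordering in 'class_names' and the key-handling logic are assumptions, and 'preprocess' is the pipeline sketched in section 4.2.2.

```python
# Sketch of the real-time prediction loop described above.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("SignNet.h5")
class_names = list("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # illustrative label order

def get_class_name(frame):
    """Preprocess one webcam frame and return the predicted character."""
    processed = preprocess(frame)                    # pipeline sketched in section 4.2.2
    batch = processed.reshape(1, 128, 128, 1)
    probabilities = model.predict(batch, verbose=0)
    return class_names[int(np.argmax(probabilities))]

text = ""
capture = cv2.VideoCapture(0)
while True:
    ok, frame = capture.read()
    if not ok:
        break
    text += get_class_name(frame)
    cv2.imshow("CAN - live feed", frame)
    if cv2.waitKey(2000) & 0xFF in (ord("q"), 27):   # sample every 2 s; 'q' or Esc ends capture
        break
capture.release()
cv2.destroyAllWindows()
```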

4.2.4 TEXT TO SPEECH

The predicted string ‘text’ is sent to a function that uses the pyttsx3 module and its API to
convert it to speech so it can be heard by the blind person. Moreover, we also use Google’s
Text-to-Speech API, gTTS, to convert the text into an mp3 file and save it for future
reference. A sketch of this step is shown below.
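A minimal sketch of this step follows; the output file name is a placeholder.

```python
# Sketch of the text-to-speech step: speak offline with pyttsx3, save a copy with gTTS.
import pyttsx3
from gtts import gTTS

def text_to_speech(text):
    engine = pyttsx3.init()                        # offline engine, spoken aloud to the blind person
    engine.say(text)
    engine.runAndWait()
    gTTS(text=text, lang="en").save("reply.mp3")   # online gTTS copy saved for future reference
```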

4.2.5 SPEECH TO TEXT

After the blind person hears, through this conversion, what the other person said, he responds
with his answer in speech. This speech is recognized using the Python speech_recognition
module. Here the response speech from the blind person is turned into text and returned for
the next step, as sketched below.
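A minimal sketch of this step, using the Google Web Speech recognizer from the speech_recognition package, is given below.

```python
# Sketch of the speech-to-text step: listen on the microphone and return text.
import speech_recognition as sr

def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # calibrate for background noise
        audio = recognizer.listen(source)             # record the blind person's reply
    try:
        return recognizer.recognize_google(audio)     # online Google recognizer
    except sr.UnknownValueError:
        return ""                                     # speech was unintelligible
```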

4.2.6 TEXT TO SIGNSLIDES

After the response is converted to text, the text is split into characters. With the help of
‘labels.csv’, which holds the path to the class folder for each character, these characters are
used to create a slideshow-style reply by displaying the first image of each respective
character's class for 2 seconds. This concludes the two-way communication between the two
people. The program then restarts after the deaf and/or dumb person responds with his sign.
A sketch of this step is given below.
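The sketch below assumes that 'labels.csv' has columns named 'label' and 'path'; the actual column names used in the project are not stated in the text.

```python
# Sketch of the text-to-SignSlides step: show the first image of each character's class.
import os

import cv2
import pandas as pd

label_paths = pd.read_csv("labels.csv", index_col="label")["path"].to_dict()

def text_to_signslides(text):
    for character in text.upper():
        folder = label_paths.get(character)
        if folder is None:                            # skip spaces / unknown characters
            continue
        first_image = sorted(os.listdir(folder))[0]   # first image in the class folder
        slide = cv2.imread(os.path.join(folder, first_image))
        cv2.imshow("SignSlides", slide)
        cv2.waitKey(2000)                             # hold each slide for 2 seconds
    cv2.destroyAllWindows()
```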

The process flow chart in figure 4-4 describes the working model of the Communicator
Assistance System for blind and deaf/dumb people.

4.3 FLOW CHART

Figure 4-4. Process flow chart for the communicator assistance system

4.4 CODES AND STANDARDS

4.4.1 PEP 8
Since the entire project is based on the Python language, the PEP 8 coding style is adopted.
Commenting regularly and updating comments whenever required, giving spaces before and
after operators, and uniform naming of classes and functions are some of the prominent
features of this particular coding style.

4.5 CONSTRAINTS, ALTERNATIVES AND TRADEOFF

4.5.1 CONSTRAINTS
● Making use of the pandas DataFrame involved a lot of conversions from arrays and lists to
DataFrames; operations that are not supported by lists and Series are supported by
the pandas DataFrame.
● When trained for twenty or more epochs, the neural network overfits the training data,
which hampers prediction accuracy. Therefore, the neural network was run for different
numbers of epochs and then a particular number was fixed.
● Our model recognizes way faster than we can change the signs, so we need to
improve our model to give us some time to change our hand signs.
● Our current model is only usable with fingerspelling signs, but in general ASL has
about 33,000 signs for basic daily usage.
● Our model is also very resource-heavy and requires high-spec systems to train.
● Our model works with static, high-resolution image frames, so it is hard to use with
low-resolution input and to recognize signs that require movement (video type), such as
J and Z.
● Our model is only usable for conversation in English, while the world has many
different languages.

4.5.2 ALTERNATIVES

4.5.2.1 GANS
Generative adversarial networks (GANs) are attracting growing interest in the deep learning
community. GANs have been applied to various domains such as computer vision, natural
language processing, time series synthesis, and semantic segmentation. GANs belong to the
family of generative models in machine learning. Compared to other generative models, e.g.
variational autoencoders, GANs offer advantages such as the ability to handle sharp
estimated density functions, efficient generation of desired samples, elimination of
deterministic bias, and good compatibility with internal neural architectures. These
properties have allowed GANs to enjoy great success, especially in the field of computer
vision, e.g. plausible image generation, image-to-image translation, image super-resolution
and image completion.

5 SCHEDULE, TASKS AND MILESTONES

5.1 PROJECT TIMELINE:


a. Upgradation of the laptop as per the requirements of the project, as mentioned in the
technical specifications.
b. An extensive literature survey was done in the first month to identify a problem
statement and to get insights into the existing techniques and methodologies and
their shortcomings.
c. For the next two weeks, image pre-processing of hand signs was done.
d. In the second month, the model was designed and a Convolutional Neural Network
was modeled to detect the sign language.
e. In the third month, the code implementation was carried out.
f. In the following three weeks, testing and improvement of the code were done as
per the guidance given by our guide.
g. In the final month, the draft of the report was prepared and further work was done
for the betterment of the neural network model.

5.2 TASKS:
a. To collect a relevant and usable dataset that contains sufficient data for training.
b. To build a neural network for the communicator assistance system.
c. To clean the data by extracting useful information, checking for redundancy, and
filling missing values, making it fit for processing.
d. To research deep learning and statistical models and study their application
in this project.
e. Implementing and fine-tuning deep learning models for pre-processing and
image processing.
f. Implementing and fine-tuning a statistical model to get the desired accuracy for
better assistance.
g. Validation of all the models using the same validation data, and comparing and
choosing the best model.

5.3 MILESTONES
a. Successfully identified a dataset of hand sign images that is relevant to the
project and contains a sufficient number of entries for training.
b. Processed the data to suit the needs of this project.
c. Researched deep learning and statistical models that can be made use of for
forecasting future data.
d. Successfully modeled and implemented the neural network for communicator
assistance with the desired accuracy.

6 PROJECT DEMONSTRATION

Figure 6-1. Proposed model sign prediction for various letters: (a) predicting ‘U’, and (b)
predicting ‘L’

Here we can see that the model predicts the signs ‘U’ and ‘L’ in a real-time live feed with an
accuracy of almost 100%, as shown in figure 6-1. We continue to show signs like this, and the
characters get added to our ‘text’ string until we press the ‘q’ or ‘Esc’ key; then the ‘text’
is sent to the text_to_speech function.

Our Text to Speech API takes a string input, converts it to audio, and also saves it as an
.mp3 file for logs. It stops when the input string is ‘end’, as demonstrated in figure 6-2.

Our Speech to Text API takes the blind person’s response, recognizes it, and converts it to
text for the other communicator. It stops its loop when it hears ‘stop’, as demonstrated in
figure 6-3.

Figure 6-2. Text to Speech API with different functions: (a) taking input, (b) saying and saving
the text, and (c) stopping the function

Figure 6-3. Speech to Text API with different functions: (a) listening to the audio, (b) recognizing
the audio, and (c) stopping the loop

This sentence is then converted to SignSlides and shown to the deaf/dumb person,
who responds again.

7 RESULTS AND DISCUSSION


As can be observed from the demonstration above, our project works well on a system with
the required specifications. After testing in a few different conditions, such as varying
environment setups and background colours, we found that it works best in places
with evenly coloured backgrounds and environments where the human is more easily
recognizable (non-red colours).

Sometimes forming sentences is harder, as ISL (Indian Sign Language) uses both hands
instead of one hand like other sign languages (for example, ASL (American Sign Language)). Also,
our speech recognition API is not capable of handling long sentences, so the conversation should
be broken down into smaller sentences.

8 SUMMARY/CONCLUSION
Our objective was to develop a communication method between blind and deaf and/or dumb people
using Deep Learning techniques, mainly Convolutional Neural Networks and Computer
Vision. We created a model with 99% accuracy to recognize signs and convert them to
speech, and to convert speech to text and then to signs.

Our model works well with the given 38 classes; in general, established sign languages like
ASL (American Sign Language) have over 33,000 signs, so our work covers only a limited
number of classes.

If our model is enhanced and made capable of working with a massive number of signs, then
it can be used in schools for the physically handicapped, where communication is necessary, and
can also be used in public places to interact with people who don’t understand sign language.

9 REFERENCES
[1] Li, Dongxu, et al. "Word-level deep sign language recognition from video: A new large-
scale dataset and methods comparison." The IEEE Winter Conference on Applications of
Computer Vision. 2020.
[2] Ahmed, Shahjalal, et al. "Hand sign to Bangla speech: a deep learning in vision based
system for recognizing hand sign digits and generating Bangla speech." arXiv preprint
arXiv:1901.05613 (2019).
[3] De Coster, Mathieu, Mieke Van Herreweghe, and Joni Dambre. "Sign language
recognition with transformer networks." 12th International Conference on Language
Resources and Evaluation. 2020.
[4] Chung, Edison A., and Marco E. Benalcázar. "Real-Time Hand Gesture Recognition
Model Using Deep Learning Techniques and EMG Signals." 2019 27th European Signal
Processing Conference (EUSIPCO). IEEE, 2019.
[5] Tang, Ao, et al. "A real-time hand posture recognition system using deep neural
networks." ACM Transactions on Intelligent Systems and Technology (TIST) 6.2 (2015):
1-23.
[6] Jung, Seokwoo, et al. "Real-time Traffic Sign Recognition system with deep
convolutional neural network." 2016 13th International Conference on Ubiquitous
Robots and Ambient Intelligence (URAI). IEEE, 2016.
[7] Koller, Oscar, et al. "Deep sign: hybrid CNN-HMM for continuous sign language
recognition." Proceedings of the British Machine Vision Conference 2016. 2016.
[8]  Arik, Sercan O., et al. "Deep voice: Real-time neural text-to-speech." arXiv preprint
arXiv:1702.07825 (2017).
[9]  Tachibana, Hideyuki, Katsuya Uenoyama, and Shunsuke Aihara. "Efficiently trainable
text-to-speech system based on deep convolutional networks with guided attention." 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018.
[10]  Seide, Frank, Gang Li, and Dong Yu. "Conversational speech transcription using
context-dependent deep neural networks." Twelfth annual conference of the international
speech communication association. 2011.
[11] Bahar, Parnia, Tobias Bieschke, and Hermann Ney. "A comparative study on end-to-
end speech to text translation." 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2019.
[12] Vaishnavi Sonawane, Indian Sign Language Dataset, Kaggle.
https://www.kaggle.com/vaishnaviasonawane/indian-sign-language-dataset [Accessed:
12:56 hrs, 13/05/2021].

