17BIS0101 - Capstone Project Report
Communicator Assistance System for Blind-Deaf-Dumb People using CAN

Bachelor of Technology
in
Electronics and Communication Engineering with specialization in IoT and Sensors

By
A. T. Ruthvik Srinivasa Deekshitulu (17BIS0101)

May 2021
DECLARATION
I hereby declare that the thesis entitled “Communicator Assistance System for Blind-Deaf-Dumb People using CAN”, submitted by me for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering with specialization in IoT and Sensors to VIT, Vellore, is a record of bonafide work carried out by me under the supervision of Prof. Surya Prakash.
I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.
CERTIFICATE
This is to certify that the thesis entitled “Communicator Assistance System for Blind-Deaf-Dumb People using CAN”, submitted by A. T. Ruthvik Srinivasa Deekshitulu (17BIS0101), School of Electronics Engineering, VIT, Vellore, for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering, is a record of bonafide work carried out by him under my supervision during the period 21-12-2020 to 31-05-2021, as per the VIT, Vellore code of academic and research ethics.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The thesis fulfils the requirements and regulations of the University
and in my opinion, meets the necessary standards for submission.
Head of Department
ACKNOWLEDGEMENTS
With immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to my supervisor, Prof. Surya Prakash, School of Electronics Engineering, VIT, Vellore; without his motivation and continuous encouragement, this research would not have been successfully completed.
I am grateful to the Chancellor of VIT, Vellore, Dr. G. Viswanathan, the Vice Presidents, and the Vice Chancellor for motivating me to carry out research at VIT, Vellore, and also for providing me with the infrastructural facilities and many other resources needed for my research.
I express my sincere thanks to Dr. Kittur Harish Mallikarjun, Dean, School of Electronics
Engineering, VIT, Vellore for his kind words of support and encouragement.
I would also like to take this opportunity to express my humble gratitude to the HoD, Dr. Sasikumar P. His constant guidance and willingness to share practical knowledge helped me understand the practical and theoretical working of this project and its manifestations in great depth.
Lastly, I would like to thank my parents for always motivating me and showing me the right path from the very beginning.
EXECUTIVE SUMMARY
In this thesis we discuss the development of a communication method between blind and deaf and/or dumb people using deep learning techniques, mainly Convolutional Neural Networks and computer vision. Deaf and/or dumb people use sign language for essential communication, while blind people use braille script to read and their voice to communicate. This work aims to develop a communicator assistance model for interaction among differently-abled people, namely the blind and the deaf and/or dumb. A Communicator Assistance Network (CAN) has been modeled to detect the sign language of a deaf and/or dumb person through a webcam with the help of OpenCV and to construct words which can then be converted to speech for the blind person; the whole process is reversed for responding to the conversation. The proposed CNN, working together with techniques such as Speech_to_Text and Text_to_Speech, achieves a training accuracy of 99%, with batches of size 20 running through 10 epochs and 1344 steps per epoch. Our model can be used wherever interaction with or between challenged people is likely to happen, such as in schools for specially challenged people or in the workplace, as this application is an asset to them in communicating with others and expressing their voice.
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
EXECUTIVE SUMMARY
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Objective
1.2 Motivation
1.3 Background
2 PROJECT DESCRIPTION AND GOAL
2.1 Description
2.2 Goals
3 TECHNICAL SPECIFICATIONS
3.1 Python 3
4 METHODOLOGY AND DETAILS
4.1.1 CNN
4.2 METHODOLOGY
4.2.2 PREPROCESSING
4.2.3 MODEL
4.2.3.1 ARCHITECTURE
4.2.3.3 PREDICTIONS
4.2.4 TEXT_TO_SPEECH
4.2.5 SPEECH_TO_TEXT
4.2.6 TEXT_TO_SIGNSLIDES
4.3 FLOWCHART
4.4.1 PEP 8
4.5 CONSTRAINTS, ALTERNATIVES AND TRADEOFFS
4.5.1 CONSTRAINTS
4.5.2 ALTERNATIVES
4.5.2.1 GANs
5 SCHEDULE, TASKS AND MILESTONES
5.2 Tasks
5.3 Milestones
6 PROJECT DEMONSTRATION
7 RESULTS, DISCUSSIONS
8 SUMMARY
9 REFERENCES
LIST OF FIGURES
Figure 4-1. Dataset features without and with preprocessing
Figure 4-3. Training phase graphs: accuracy vs. epoch and loss vs. epoch
Figure 4-4. Process flow chart for the communicator assistance system
Figure 6-1. Proposed model sign prediction for various letters
Figure 6-2. Text to Speech API with different functions
Figure 6-3. Speech to text API with different functions
LIST OF ABBREVIATIONS
RGB Red-Green-Blue
1 INTRODUCTION
1.1 OBJECTIVE
Our objective is to develop a communication method between blind and deaf and/or dumb people using deep learning techniques, mainly Convolutional Neural Networks and computer vision. Communication is an integral part of our daily life for interacting with others, but it is challenging for differently-abled people; the deaf and/or dumb use sign language for essential communication. This sign language is still not adequate for blind people, who use braille script to read and their voice for communication. This work aims to develop a communicator assistance model for interaction among differently-abled people, namely the blind and the deaf and/or dumb. A neural network model, SignNet, has been designed to detect the sign language of a deaf and/or dumb person through a webcam with the help of OpenCV and to construct words which can then be converted to speech for the blind person; the whole process is reversed for responding to the conversation. Our neural network works with an accuracy of 99%, with batches of size 20 running through 10 epochs and 1344 steps per epoch. Our model can be used wherever interaction with or between challenged people is likely to happen, such as in schools for specially challenged people or in the workplace, as this application is an asset to them in communicating with others and expressing their voice.
1.2 MOTIVATION
According to a World Health Organization (WHO) study, 15% of people in the world live with disabilities. With the assistance of deep learning and mainstream production technology, it is now possible to explore and focus on problems that were neglected in earlier days due to the lack of supportive technologies. Deaf and/or dumb people communicate with the use of sign language, whereas blind people use braille script for reading and/or writing and their voice for communication. We therefore want to develop a communication assistance system between these two groups of disabled people.
Gesture recognition is useful not only for communication among disabled people but also in medicine, chemistry, robotics, and many other fields, and it helps build new and more natural human-machine interface methods. Real-time identification of hand movements consists of recognizing a specific gesture performed by the hand at any instant, with no perceivable pause. Gesture is a central feature of human-to-human contact, although the role of visual cues in spoken language is not well understood. Sign language, on the other hand, offers a simple structure with a given inventory and grammatical rules regulating joint manual articulation (movement, hand shape, direction, place of articulation) and facial expression (eye gaze, eyebrows, mouth, head orientation).
1.3 BACKGROUND
In many implementations such as voice-enabled computers, navigation systems, and
usability for the visually disabled, synthesizing artificial human speech from text, widely
known as text-to-speech (TTS), is an important part. Fundamentally, without having visual
interfaces, it facilitates human-technology interaction. Modern TTS systems are based on
complicated, multi-stage pipelines of processing, each of which may depend on heuristics
and hand-engineered features. Because of this difficulty, it can be very labor-intensive and
challenging to build modern TTS systems [8]. Artificial neural networks (ANNs) have been
used to model Hidden Markov Model (HMM) speech recognizers' state emission
probabilities since the early ‘90s. While conventional Gaussian mixture model (GMM)-
HMMs model context-dependence through related context-dependent states (e.g. CART-
clustered crossword triphones), ANN-HMMs have never been specifically used to do so [10].
Recent developments in deep learning indicate that the end-to-end speech-to-text translation paradigm is a compelling direction for the field of speech translation [11].
An assistance model is now possible for conversation among deaf and/or dumb and/or blind people. This model has two main sub-models: one recognizes the hand gesture and converts it into text, while the other converts this text to speech. Plenty of research work has been done in these areas; still, a precise combined model that can assist deaf, dumb, and blind people simultaneously is not yet available.
2 PROJECT DESCRIPTION AND GOAL
2.1 DESCRIPTION
This project aims to develop a communication method between blind and deaf and/or dumb people using deep learning techniques, mainly Convolutional Neural Networks and computer vision. A Communicator Assistance Network (CAN) has been modeled to detect the sign language of a deaf and/or dumb person through a webcam with the help of OpenCV and to construct words which can be converted to speech for the blind person; the whole process is reversed for responding to the conversation. The proposed neural network works with an accuracy of 98%, with batches of size 20 running through 15 epochs and 1459 steps per epoch.
First, the hand gestures of the dumb/deaf person are detected using a webcam. The sign language is then predicted and words are constructed from the gestures. These words are converted to an audio file and stored as an mp3. The saved audio file is played to the blind person. After hearing what the dumb/deaf person said, the blind person responds to the audio with his voice. This speech is converted to text using a speech recognition technique. The converted text is then displayed to the dumb/deaf person on a monitor in the form of a slide show of hand gestures.
2.2 GOALS
● Develop a communication assistance system between the blind and the deaf and/or dumb.
● Enhance our model and make it capable of working with massive numbers of signs.
● Improve the accuracy of the current model.
3 TECHNICAL SPECIFICATIONS
Numpy: this library is used to process multidimensional arrays of data and to perform various mathematical operations on them. Concatenation, merging, and conversion to arrays are some of the operations it performs with ease. Version: 1.18.3
Matplotlib: this library is used for plotting and creating figures. It makes it easy to plot any kind of graph, such as bar charts, line plots, and histograms. Version: 3.2.1
Keras: this is a neural network library built in Python. It is used for developing deep neural network models with few lines of code. Version: 2.3.0
TensorFlow: TensorFlow is a free and open-source software library for machine learning. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. TensorFlow is a symbolic math library based on dataflow and differentiable programming.
OpenCV: OpenCV (Open Source Computer Vision Library) is a library of programming functions mainly aimed at real-time computer vision, released for use under the open-source Apache 2 License. Since 2011, OpenCV has featured GPU acceleration for real-time operations.
Tkinter: Tkinter is a Python binding to the Tk GUI toolkit. It is the standard Python interface
to the Tk GUI toolkit, and is Python's de facto standard GUI. Tkinter is included with
standard Linux, Microsoft Windows and Mac OS X installs of Python. The name Tkinter
comes from Tk interface.
Random: the random module generates pseudo-random numbers in Python. The numbers are pseudo-random because the sequence generated depends on the seed; if the seed value is the same, the sequence will be the same.
OS: the os module in Python provides functions for interacting with the operating system and comes under Python's standard utility modules. It provides a portable way of using operating-system-dependent functionality; the os and os.path modules include many functions to interact with the file system.
Time: The Python time module provides many ways of representing time in code, such as
objects, numbers, and strings. It also provides functionality other than representing time,
like waiting during code execution and measuring the efficiency of your code.
gTTS: the Google Text-to-Speech API, commonly known as gTTS, is a very easy-to-use tool that converts entered text into audio, which can be saved as an mp3 file.
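As a minimal illustration of how gTTS is used (the text and file name below are placeholder values, not taken from the project code):

from gtts import gTTS

# Convert a placeholder string to speech and save it as an mp3 file.
tts = gTTS(text="hello world", lang="en")
tts.save("output.mp3")  # the saved file can later be played back to the listener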
4 METHODOLOGY AND DETAILS
4.1.1 CNN
A Convolutional Neural Network (CNN) is a special form of Artificial Neural Network (ANN) that contains repeated sets of neurons applied across the space of an image. These sets of neurons are referred to as 2D convolutional kernels and are repeatedly applied over all patches of an image. This helps the network learn meaningful features from every subspace of the whole space (image). CNNs are chosen for image classification because they exploit spatial or temporal invariance in recognition; in terms of effectiveness, CNNs have outperformed previously used techniques such as Support Vector Machines (SVM) and K-Nearest Neighbors (K-NN) in classifying and learning from images. Two major components of a CNN that differ from regular neural networks are:
• Convolution Layer
• Pooling Layer
A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of activation values to another through a differentiable function. The first layer is the convolution layer; it can consist of one or more kernels, and its output is passed through a non-linearity, i.e., the Rectified Linear Unit (ReLU).
The pooling layer is the way to reduce the spatial dimension. It adds spatial invariance to the CNN model and thus helps avoid overfitting. There are several kinds of pooling, such as max pooling, min pooling, and sum pooling. The fully connected layer is the last step of a CNN and connects it to the output layer; the expected number of outputs is constructed in this layer.
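The Keras sketch below illustrates these building blocks (a convolution layer with ReLU, a max-pooling layer, a flatten layer, and a fully connected output layer). The layer sizes and class count here are purely illustrative and are not the project model, which is detailed in section 4.2.3.1.

from tensorflow.keras import layers, models

# Illustrative stack: convolution + ReLU, max pooling, flatten, dense output.
demo = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    layers.MaxPooling2D(pool_size=(2, 2)),   # halves the spatial dimensions
    layers.Flatten(),                        # rolls the feature maps into a vector
    layers.Dense(10, activation="softmax"),  # one unit per class (illustrative count)
])
demo.summary()  # prints how each layer changes the output shape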
A comprehensive real-time hand posture recognition method based on a Kinect was established by Tang et al. (2015) [5]. Tests validate that the proposed system operates rapidly and reliably and reaches an identification accuracy as high as 98.12%, which makes it usable for sign language recognition as well.
A hybrid CNN-HMM approach for sign recognition has been proposed by Koller et al.
(2016) [7] that is consistent with Bayesian concepts, combining the powerful discriminatory
abilities of CNNs with the sequence modeling capabilities of HMMs which improve the
results.
A universal real-time hand gesture identification model is recommended by Chung and Benalcázar (2019) [4]. The electromyographic (EMG) signal is measured on the forearm, and an autoencoder and a classifier based on an artificial feed-forward neural network are used to extract features and classify gestures. They obtained an overall identification accuracy of 85.08 ± 15.21%, with an average response time of 3 ± 1 ms. The accuracy could be increased by using recurrent neural networks to predict the present gesture based on the previous history.
For people who cannot speak, Ahmed et al. (2019) [2] built a deep learning model for Bangla speech based on convolutional neural networks. It classifies digits from hand signs with 92% precision on validation results, ensuring a highly trustworthy framework. There is room for upgrading the tool, for example by taking advantage of more detailed hand-structure recording devices such as Leap Motion or the Xbox Kinect, which can significantly increase the tool's efficiency.
Hand signs are the language that creates a communication channel among deaf and/or dumb people, and recognizing signs precisely with artificial intelligence requires a large dataset. The few datasets available are not sufficient to cover the full vocabulary, so Dongxu Li et al. (2020) [1] introduced the Word-Level American Sign Language (WLASL) dataset, a large-scale video dataset featuring over 2000 terms performed by more than 100 signers; in terms of vocabulary size and the number of examples per class, it is the largest publicly accessible ASL dataset. To improve word-level sign language recognition algorithms on such a large dataset, more sophisticated learning algorithms, such as few-shot learning, are needed. Later it can be used to enable computer interpretation of signs at the grammatical and narrative levels. Experiments on WLASL with different models show that pose-based and appearance-based models perform comparably, reaching up to 62.63% top-10 accuracy on 2,000 words/glosses, which indicates both the validity and the challenges of the dataset.
De Coster et al. (2020) [3] combined end-to-end feature learning with Convolutional Neural Networks and feature extraction with OpenPose for human keypoint estimation to achieve an accuracy of 74.7 percent on a 100-class vocabulary.
4.2 METHODOLOGY
4.2.2 PREPROCESSING
First, the resized (128x128) dataset is split into three parts: a train (70%), a validation (10%), and a test (20%) set. We preprocess the images in the train dataset, as normal color images have many features we do not need. We therefore turn them into grayscale images and apply a Gaussian blur with a 5x5 kernel. These smoothened images are used to detect edges through Canny edge detection. We then dilate the images, which increases their brightness by taking the neighborhood maximum as the structuring element passes over the image; with binary images, dilation connects areas separated by gaps smaller than the structuring element, adds pixels to the perimeter of each object, and highlights the desired features. We then erode the images to remove unwanted noise and finally rescale the image array values from the (0, 255) range to the (0, 1) range for better performance while training.
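A sketch of this preprocessing chain using OpenCV is given below; the 5x5 Gaussian kernel follows the text, while the Canny thresholds, the 3x3 structuring element, and the iteration counts are assumed values chosen for illustration.

import cv2
import numpy as np

def preprocess(img):
    img = cv2.resize(img, (128, 128))                  # resize to 128x128
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # drop colour information
    blur = cv2.GaussianBlur(gray, (5, 5), 0)           # 5x5 Gaussian smoothing
    edges = cv2.Canny(blur, 50, 100)                   # Canny edge detection (assumed thresholds)
    kernel = np.ones((3, 3), np.uint8)                 # assumed structuring element
    dilated = cv2.dilate(edges, kernel, iterations=1)  # connect nearby edge segments
    eroded = cv2.erode(dilated, kernel, iterations=1)  # remove unwanted noise
    return eroded.astype("float32") / 255.0            # rescale (0, 255) to (0, 1)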
Now we use the Keras ImageDataGenerator API, which helps in rescaling, augmenting, zooming, and rotating images randomly. This processed data is then used for the model.
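One possible configuration of the ImageDataGenerator for this step is sketched below; the directory name and the augmentation parameter values are assumptions for illustration rather than the project's exact settings.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,       # rescale pixel values to the (0, 1) range
    rotation_range=10,       # small random rotations
    zoom_range=0.1,          # random zoom
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
)
train_flow = train_gen.flow_from_directory(
    "data/train",            # assumed training-data directory
    target_size=(128, 128),
    color_mode="grayscale",
    batch_size=20,
    class_mode="categorical",
)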
Hand gesture images with and without preprocessing are shown in Figure 4-1.
Figure 4-1. Dataset features (a) without preprocessing [12], and (b) with preprocessing
4.2.3 MODEL
The model is turned into a class that takes in training data as an initializer and has a
predict_words function that returns predicted words. Our model is created with Keras APIs
which use TensorFlow-GPU as the backend.
4.2.3.1 ARCHITECTURE
The data initially enters a fully connected layer that has two convolutional layers with 64
units with activation function ‘RELU’ and a filter size of 5x5. Then it is introduced to a Max
pooling layer with a pooling size of 2x2. Then we will have another set of 32-unit
convolutional layers with a 3x3 filter kernel and max pooling filters with similar
architectures. We will also use a Dropout layer of the probability of 50%. And the data finally
13
enters the flatten layer that roll out all the data into a single vector. The data enters the Dense
hidden layers with the number of units as 512-to-256-to-256 all with activation function -
‘RELU.’ At the output layer, we have a Dense layer with a number of units equal to the
number of labels, as well as the ‘Softmax' activation function. The model architecture
diagram is represented in figure 2.
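A Keras sketch reconstructed from this description is given below. It follows the stated layer sizes and takes the number of classes as 38 (from the summary chapter), but it should be read as an approximation of the architecture rather than the exact project code; the compile settings are assumptions.

from tensorflow.keras import layers, models

num_classes = 38  # assumed from the 38 classes mentioned in the summary
model = models.Sequential([
    layers.Conv2D(64, (5, 5), activation="relu", input_shape=(128, 128, 1)),
    layers.Conv2D(64, (5, 5), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.5),                    # 50% dropout
    layers.Flatten(),                       # roll the feature maps into a single vector
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
# Optimizer and loss are assumed, not stated in the text.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])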
Figure 4-3. Training phase graphs, (a) accuracy vs. epoch, and (b) loss vs. epoch
4.2.3.3 PREDICTIONS
For predicting words we use a function named getClassName, which takes image data from a real-time webcam video stream; a frame is captured every 2 seconds and used to predict the character from the sign shown by the person. The frame is processed and sent to our model, and np.argmax is applied to the model output to find the index of the character shown. This character is concatenated to our string ‘text’, which is sent to the next step.
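A sketch of this prediction loop is shown below. It reuses the preprocess function sketched in section 4.2.2; the labels list, the window name, and the ‘q’ exit key are assumptions for illustration.

import cv2
import numpy as np

def get_class_name(model, labels, frame):
    # Preprocess one webcam frame and return the predicted character.
    x = preprocess(frame).reshape(1, 128, 128, 1)
    probs = model.predict(x)
    return labels[int(np.argmax(probs))]  # index of the most probable class

def predict_words(model, labels):
    cap = cv2.VideoCapture(0)  # real-time webcam video stream
    text = ""
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("Sign", frame)                   # show the live feed
        text += get_class_name(model, labels, frame)
        if cv2.waitKey(2000) & 0xFF == ord("q"):    # wait 2 s between frames; 'q' stops
            break
    cap.release()
    cv2.destroyAllWindows()
    return text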
The predicted string ‘text’ is sent to a function that uses the pyttsx3 module and its API to convert it to speech so it can be heard by the blind person. Moreover, we also use Google's Text-to-Speech API, gTTS, to convert the text into an mp3 audio file and save it for future reference.
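A sketch of this step is given below; the mp3 file name is an assumed placeholder.

import pyttsx3
from gtts import gTTS

def text_to_speech(text):
    engine = pyttsx3.init()
    engine.say(text)       # speak the predicted string aloud for the blind person
    engine.runAndWait()
    # Also save the same text as an mp3 file for future reference (assumed file name).
    gTTS(text=text, lang="en").save("conversation_log.mp3")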
After hearing the converted speech of what the other person said, the blind person responds with his answer verbally. This speech is recognized using the Speech Recognition API in the Python SpeechRecognition module: the response speech from the blind person is turned into text and returned for the next step.
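A sketch of this step with the SpeechRecognition module is shown below; the Google Web Speech recognizer is used here as one possible backend.

import speech_recognition as sr

def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)          # record the blind person's reply
    try:
        return recognizer.recognize_google(audio)  # convert the recording to text
    except sr.UnknownValueError:
        return ""                                  # nothing intelligible was heard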
After the response is converted to text, the text is split into characters. With the help of ‘labels.csv’, which holds the path to the class folder for each character, these characters are used to create a slideshow-style reply by displaying the first image in the class of each respective character for 2 seconds. This concludes the two-way communication between the two people. The program then restarts after the deaf and/or dumb person responds with his signs.
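A sketch of this step is given below; the column names in labels.csv (‘character’ and ‘path’) are assumptions made for illustration.

import os
import cv2
import pandas as pd

def text_to_signslides(text, labels_csv="labels.csv"):
    table = pd.read_csv(labels_csv)                  # maps each character to its class folder
    lookup = dict(zip(table["character"], table["path"]))
    for ch in text.upper():
        folder = lookup.get(ch)
        if folder is None:                           # skip spaces and unknown characters
            continue
        first_image = sorted(os.listdir(folder))[0]  # first image of the respective class
        cv2.imshow("SignSlides", cv2.imread(os.path.join(folder, first_image)))
        cv2.waitKey(2000)                            # hold each sign for 2 seconds
    cv2.destroyAllWindows()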
The process flow chart is shown in Figure 4-4, which describes the working model of the Communicator Assistance System for Blind-Deaf/Dumb People.
Figure 4-4. Process flow chart for the communicator assistance system
4.4.1 PEP 8
Since the entire project is based on the Python language, the PEP 8 coding style is adopted. Commenting regularly and updating comments whenever required, giving spaces before and after operators and variables, and uniform naming of classes and functions are some of the prominent features of this coding style.
4.5 CONSTRAINTS, ALTERNATIVES AND TRADEOFFS
4.5.1 CONSTRAINTS
● Making use of the pandas DataFrame involved many conversions from arrays and lists to DataFrames, since operations that are not supported by lists and Series are supported by the pandas DataFrame.
● When trained for twenty or more epochs, the neural network overfits the training data, which hampers prediction accuracy. Therefore, the neural network was run for different numbers of epochs and a particular number was then fixed.
● Our model recognizes signs faster than we can change them, so we need to improve it to give the signer some time to change hand signs.
● Our current model is only usable with fingerspelling signs, whereas ASL has about 33,000 signs for basic daily usage.
● Our model is also very resource-heavy and requires high-spec systems to train.
● Our model works with static, high-resolution image frames, so it is hard to use with low-resolution input, and it cannot recognize signs such as J and Z that require movement.
● Our model is only usable for conversation in English, while the world has many different languages.
4.5.2 ALTERNATIVES
4.5.2.1 GANS
Generative adversarial networks (GANs) are attracting growing interest in the deep learning community. GANs have been applied to various domains such as computer vision, natural language processing, time series synthesis, and semantic segmentation. GANs belong to the family of generative models in machine learning. Compared to other generative models, e.g., variational autoencoders, GANs offer advantages such as the ability to handle sharp estimated density functions, efficient generation of desired samples, elimination of deterministic bias, and good compatibility with the internal neural architecture. These properties have allowed GANs to enjoy great success, especially in the field of computer vision, e.g., plausible image generation, image-to-image translation, image super-resolution, and image completion.
5 SCHEDULE, TASKS AND MILESTONES
5.2 TASKS:
a. To collect relevant and usable dataset that contains sufficient data for training.
b. To build a neural network for the communicator assistance system.
c. To clean data by extracting useful information, redundancy checking, and filling
missing values and to make it fit for processing.
d. To research deep learning and statistical models and to study their application
in this project.
e. Implementing and fine-tuning of deep learning models for pre-processing and
image processing.
f. Implementing and fine-tuning a statistical model to get the desired accuracy for
better assistance.
g. Validation of all the models using the same validation data and to compare and
choose the best model.
5.3 MILESTONES
a. Successfully identified a dataset of hand sign images that is relevant to the
project and contains a sufficient number of entries for training.
b. Processed the data to suit the needs of this project.
c. Researched on deep learning and statistical models that can be made use of for
forecasting future data.
d. Successfully modeled and implemented the neural network for communicator
assistance with desired accuracy.
6 PROJECT DEMONSTRATION
Figure 6-1. Proposed model sign prediction for various letters, (a) predicting ‘U’, and (b) predicting ‘L’
Here we can see that the model predicts the signs ‘U’ and ‘L’ in a real-time live feed with an accuracy of almost 100%, as shown in Figure 6-1. We continue to show signs like this, and the characters get added to our ‘text’ string until we press the ‘q’ or ‘esc’ key; the ‘text’ is then sent to the text_to_speech function.
Our text-to-speech API takes a string input, converts it to audio, and also saves it as an .mp3 file for logs. It stops when the input string is ‘end’, as demonstrated in Figure 6-2.
Our speech-to-text API takes the blind person's response, recognizes it for the other communicator, and converts it to text. It stops its loop when it hears ‘stop’, as demonstrated in Figure 6-3.
Figure 6-2. Text to Speech API with different functions, (a) taking input, (b) saying and saving the text, and (c) stopping the function
Figure 6-3. Speech to text API with different functions, (a) listening to the audio, (b) recognizing the audio, and (c) stopping the loop
This sentence will be converted to SignSlides and shown to the deaf/dumb challenged person, who responds again.
Sometimes forming sentences is tougher, as ISL (Indian Sign Language) uses both hands instead of one like some other sign languages (for example, ASL (American Sign Language)). Also, our speech recognition API is not capable of holding long sentences, so the conversation should be broken down into smaller sentences.
8 SUMMARY/CONCLUSION
Our objective was to develop a communication method between blind and deaf and/or dumb people using deep learning techniques, mainly Convolutional Neural Networks and computer vision. We created a model with 99% accuracy to recognize signs and convert them to speech, and to convert speech to text and then to signs.
Our model works well with the given 38 classes, whereas established sign languages such as ASL (American Sign Language) have over 33,000 signs, so our work covers only a limited set of classes.
If our model is enhanced and made capable of working with massive numbers of signs, it can be used in schools for specially challenged students where communication is necessary, and also in public places to interact with people who do not understand sign language.
9 REFERENCES
[1] Li, Dongxu, et al. "Word-level deep sign language recognition from video: A new large-
scale dataset and methods comparison." The IEEE Winter Conference on Applications of
Computer Vision. 2020.
[2] Ahmed, Shahjalal, et al. "Hand sign to Bangla speech: a deep learning in vision based
system for recognizing hand sign digits and generating Bangla speech." arXiv preprint
arXiv:1901.05613 (2019).
[3] De Coster, Mathieu, Mieke Van Herreweghe, and Joni Dambre. "Sign language
recognition with transformer networks." 12th International Conference on Language
Resources and Evaluation. 2020.
[4] Chung, Edison A., and Marco E. Benalcázar. "Real-Time Hand Gesture Recognition
Model Using Deep Learning Techniques and EMG Signals." 2019 27th European Signal
Processing Conference (EUSIPCO). IEEE, 2019.
[5] Tang, Ao, et al. "A real-time hand posture recognition system using deep neural
networks." ACM Transactions on Intelligent Systems and Technology (TIST) 6.2 (2015):
1-23.
[6] Jung, Seokwoo, et al. "Real-time Traffic Sign Recognition system with deep
convolutional neural network." 2016 13th International Conference on Ubiquitous
Robots and Ambient Intelligence (URAI). IEEE, 2016.
[7] Koller, Oscar, et al. "Deep sign: hybrid CNN-HMM for continuous sign language
recognition." Proceedings of the British Machine Vision Conference 2016. 2016.
[8] Arik, Sercan O., et al. "Deep voice: Real-time neural text-to-speech." arXiv preprint
arXiv:1702.07825 (2017).
[9] Tachibana, Hideyuki, Katsuya Uenoyama, and Shunsuke Aihara. "Efficiently trainable
text-to-speech system based on deep convolutional networks with guided attention." 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018.
[10] Seide, Frank, Gang Li, and Dong Yu. "Conversational speech transcription using
context-dependent deep neural networks." Twelfth annual conference of the international
speech communication association. 2011.
[11] Bahar, Parnia, Tobias Bieschke, and Hermann Ney. "A comparative study on end-to-
end speech to text translation." 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2019.
[12] Vaishnavi Sonawane, Indian Sign Language Dataset, https://www.kaggle.com/vaishnaviasonawane/indian-sign-language-dataset [Accessed: 12:56 HRS, 13/05/2021].