Real Time Sign Language Interpreter Report


ML-BASED REAL-TIME SIGN LANGUAGE

INTERPRETER
A Major Project Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology
in
Internet of Things

Submitted by
Aditya Nema 0108IO201005
Ajinkya Balwant Soley 0108IO201006
Deepanshu Dixit 0108IO201018
Kushagra Shrivastava 0108IO201028
Shalini Sharma 0108IO201056

Project Guide:
Dr. Shailendra Kumar Shrivastava

Department of Information Technology


Samrat Ashok Technological Institute
Vidisha, Madhya Pradesh (India)
27 November 2023
Certificate

Department of Information Technology


Samrat Ashok Technological Institute, Vidisha

It is certified that the work contained in the project report entitled “ML-BASED
REAL-TIME SIGN LANGUAGE INTERPRETER” by the following students has been
carried out under my supervision and that this work has not been submitted elsewhere for
a degree.

Aditya Nema 0108IO201005


Ajinkya Balwant Soley 0108IO201006
Deepanshu Dixit 0108IO201018
Kushagra Shrivastava 0108IO201028
Shalini Sharma 0108IO201056

Date:                                        Dr. Shailendra Kumar Shrivastava

This project report entitled “ML-BASED REAL-TIME SIGN LANGUAGE INTERPRETER”
submitted by the group is approved for the degree of Bachelor of Technology.
The viva-voce examination has been held on .

Project Coordinator                          Examiner(s)
Declaration

SATI Vidisha
27 November 2023

We declare that this written submission represents our ideas in our own words and where
others’ ideas or words have been included, We have adequately cited and referenced the
original sources. We declare that we have properly and accurately acknowledged all
sources used in the production of this report. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented or fabricated or
falsified any idea/data/fact/source in our submission. We understand that any violation of
the above will be a cause for disciplinary action by the Institute and can also evoke penal
action from the sources which have thus not been properly cited or from whom proper
permission has not been taken when needed.

Aditya Nema 0108IO201005
Ajinkya Balwant Soley 0108IO201006
Deepanshu Dixit 0108IO201018
Kushagra Shrivastava 0108IO201028
Shalini Sharma 0108IO201056
Acknowledgements

We would like to extend our sincere gratitude to everyone who was involved in this engi-
neering project. We appreciate the dedication and hard work of our team members, our
project coordinator Prof. Ramratan Ahirwal and project guide, Dr. Shailendra Kumar
Shrivastava, who have been instrumental in helping us reach our goals.
We are thankful for the valuable guidance of Dr. Vipin Patait and Prof. Rashi Kumar
which was key in the completion of this project. We are thankful for the valuable guid-
ance and assistance provided by our supervisors and mentors.
We are also grateful for the support and encouragement of our family and friends, which
helped us throughout this project. Lastly, we thank all those who have helped us through
their advice and constructive feedback.
We feel fortunate to have had such a strong support system throughout this journey and
look forward to its continued support.

Aditya Nema 0108IO201005


Ajinkya Balwant Soley 0108IO201006
Deepanshu Dixit 0108IO201018
Kushagra Shrivastava 0108IO201028
Shalini Sharma 0108IO201056

Abstract

The "ML-based Real-time Indian Sign Language Interpreter" project aims to develop an
innovative system that facilitates seamless communication between individuals with hear-
ing impairments and the broader community. Leveraging machine learning (ML) tech-
niques, this real-time interpreter is specifically designed for the Indian Sign Language
(ISL).
The system employs a combination of computer vision and deep learning algorithms to
recognize and interpret gestures made in ISL. A robust dataset of diverse sign gestures is
utilized to train the model, allowing it to adapt and accurately interpret signs performed
by users in real-time. The incorporation of neural networks enhances the system’s ability
to generalize and comprehend variations in signing styles and contexts.
This project proposes a machine learning (ML) based real-time Indian sign language in-
terpreter. The interpreter would use a camera to capture the signer’s hand gestures, and
then use ML to translate those gestures into spoken or written text. The interpreter would
be designed to be accurate, efficient, and user-friendly.
The interpreter would be implemented using a deep learning model. The model would
be trained on a dataset of Indian sign language. The model would be able to recognize a
variety of hand gestures, including single-hand gestures and two-hand gestures.
The interpreter would be evaluated using a variety of metrics, including accuracy, speed,
and user satisfaction. The interpreter would be compared to other existing sign language
interpreters, both human and machine.
The results of this project would have a significant impact on the lives of deaf and hard-
of-hearing people in India. The interpreter would provide them with a new way to com-
municate with the hearing world.

Table of Contents

Acknowledgements

Abstract

List of Figures

1 Introduction
    1.1 Project Scope

2 Literature Review

3 Problem Formulation and Proposed Solution
    3.1 Objectives
    3.2 Develop a Real Time Sign Language Interpreter
        3.2.1 Comparing Different Methodologies
        3.2.2 Best Fit Model
    3.3 Work Done
        3.3.1 Training Data Generator Configuration
        3.3.2 Usage of CNN
        3.3.3 Training
        3.3.4 Prediction

4 Results and Discussion

5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work

A Appendix
    A.1 Appendix 1
    A.2 Appendix 2

References

List of Figures

3.1 MobileNet V1
3.2 Convolutional Neural Network
3.3 Load and Preprocess the Dataset
3.4 Training Data configuration
3.5 Model Compilation
3.6 Setting up Epochs
3.7 Plotting the points
3.8 Plotted points on the graph
3.9 Display the Results
4.1 Plot for Loss and Accuracy
Chapter 1

Introduction

Communication is a fundamental aspect of human interaction, playing a pivotal role in
shaping societies and fostering connections. For individuals with hearing impairments,
however, the ability to communicate seamlessly can be a significant challenge. Accord-
ing to a UN report, 80% of deaf individuals face literacy and oral language difficulties,
underlining the pressing need for innovative solutions to enhance the lives of those with
hearing impairments. The World Health Organization (WHO) further emphasizes that
over 5% of the world’s population, comprising 430 million people, requires rehabilitation
for hearing loss, a number projected to reach 700 million by 2050.

The significance of early exposure to sign language becomes evident when consid-
ering the challenges faced by children with hearing loss. The National Library of
Medicine[1] highlights the critical "golden period of learning," wherein the age of detec-
tion of hearing loss and the subsequent use of hearing aids significantly impact language
development. Learning sign language in the early stages of life not only enhances
linguistic growth but also nurtures cognitive and social development, empowering these
children to interact effectively and excel academically. It extends its positive influence to
families, fostering improved communication and comprehension.

The "ML-based Real-time Indian Sign Language Interpreter" project emerges as a


transformative response to the communication challenges faced by the deaf and hard-of-
hearing community in India. In recent years, advancements in machine learning (ML)
have opened up new possibilities for developing assistive technologies. This project
represents a cutting-edge application of ML techniques, specifically tailored for the
intricacies of the Indian Sign Language (ISL). The Indian context introduces unique
complexities in sign language, necessitating a specialized approach to ensure accurate
and context-aware interpretation.

At its core, the project leverages a fusion of computer vision and ML algorithms
to create a real-time sign language interpreter. The choice of these technologies is
driven by the need for the system to not only recognize static signs but also dynamically
interpret the fluid and nuanced gestures inherent in ISL. The project’s success hinges on
the robustness of the dataset used for training, encompassing a wide spectrum of sign
gestures, thereby enabling the model to adapt and generalize effectively.

The interpreter’s functionality involves capturing hand gestures through a camera
and employing ML to translate these gestures into spoken or written text. The system is
meticulously designed to prioritize accuracy, efficiency, and user-friendliness, addressing
the practical challenges faced by users in real-world scenarios.

The deep learning model, a central component of the project, undergoes compre-
hensive training on the diverse ISL dataset. This training equips the model to recognize
an extensive repertoire of hand gestures, ranging from single-hand expressions to more
complex two-hand gestures. The project’s evaluation framework is multifaceted, encom-
passing metrics such as accuracy, speed, and user satisfaction. Comparative analyses with
existing sign language interpreters, both human and machine-based, provide insights into
the system’s performance and potential areas of improvement.

Beyond the technological intricacies, the project’s significance lies in its potential
societal impact. By providing a reliable and efficient means of communication, the
interpreter seeks to enhance the quality of life for individuals within the deaf and
hard-of-hearing community in India. It endeavors to break down communication barriers,
fostering greater inclusivity and enabling meaningful participation in a world that is
increasingly reliant on spoken and written language.

This introduction lays the foundation for a comprehensive exploration of the method-
ologies, results, and implications of the "ML-based Real-time Indian Sign Language
Interpreter" project, underscoring its potential to bring about positive and transformative
change in the lives of its users.

1.1 Project Scope


The ML-based Real-time Indian Sign Language Interpreter project encompasses a com-
prehensive and detailed approach, strategically addressing the communication challenges
faced by the deaf and hard-of-hearing community in India. The scope is outlined below,
covering each stage from data collection to deployment:

Data Collection and Preprocessing:

• Gather a diverse dataset of ISL gestures, representing various signing styles and
contexts.

• Ensure inclusivity by incorporating single-hand gestures, two-hand gestures, and non-manual markers.

• Preprocess the data to eliminate noise, normalize hand positions, and enhance fea-
ture extraction.

Feature Extraction:

• Develop robust feature extraction techniques using computer vision algorithms.

• Identify hand positions, finger movements, and hand orientations to capture relevant
information.

• Extract features that encompass the spatial arrangement of fingers, palm orientation,
and relative motion between hands.

Machine Learning Model Design and Training:

• Design a deep learning model architecture suitable for sign language recognition.

• Explore Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) or their combinations for effective gesture classification.

• Train the deep learning model on the preprocessed dataset, optimizing hyperparam-
eters for high accuracy and generalization.

Real-time Gesture Recognition and Translation:

• Implement a real-time processing pipeline for capturing, feature extraction, and classification of gestures.

• Integrate the trained deep learning model into the real-time processing pipeline for
efficient gesture recognition.

• Translate recognized gestures into corresponding spoken or written text using a text-to-speech or text synthesis engine.

System Integration and User Interface:

• Develop a user-friendly interface for capturing hand gestures and displaying trans-
lated text or speech.

• Integrate the real-time gesture recognition and translation modules into the user
interface.

• Ensure the system is responsive and adaptable to variations in lighting, backgrounds, and signing styles.

Evaluation and Refinement:

• Evaluate the system’s performance using metrics such as accuracy, speed, and user
satisfaction.

• Conduct comparative analyses with existing sign language interpreters, both human
and machine.

• Refine the system based on evaluation results, focusing on improving accuracy, robustness, and user experience.

Deployment and Dissemination:

• Deploy the system as a standalone application, a web-based service, or a mobile app.

• Develop comprehensive documentation and training materials to facilitate user adoption and understanding.

• Disseminate the system through relevant channels, including deaf communities, educational institutions, and healthcare organizations.

Societal Impact and Ethical Considerations:

• Aim to revolutionize sign language education, enhancing accessibility and inclusivity for the deaf and hard-of-hearing community in India.

• Strive to empower individuals by breaking down communication barriers and enabling meaningful participation in various aspects of life.

• Implement measures to protect user privacy, ensuring compliance with ethical stan-
dards and data protection regulations.

• Ensure cultural sensitivity in the design and deployment of the interpreter, respect-
ing the nuances of ISL and the diverse communities it serves.
Chapter 2

Literature Review

According to Michele Friedner’s research, some parents of deaf children in India might
see Sign Language as causing problems in their families. They worry that their deaf chil-
dren, who use Indian Sign Language (ISL), spend more time with other deaf people and
less with their hearing family members. This can strain family relationships, especially
when parents don’t know sign language well. Learning sign language themselves could
help bridge this gap.[2]

In India, the use of sign language in education is not common. Historically, some
people believed that deafness was a punishment for past sins. Deaf individuals even faced
legal restrictions, like being denied the right to inherit property. Due to these beliefs, deaf
education hasn’t been a priority in Indian society. Even though students in Deaf schools
naturally use ISL to communicate, it’s not formally recognized or encouraged by school
authorities.[3]

To change this, there’s a plan to introduce ISL classes and interpreter training pro-
grams. This started with the creation of an ISL Cell at the National Institute for the
Hearing Handicapped (NIHH) in 2001[4]. So far, it has been successful, but many people
still don’t know about ISL. To help deaf individuals communicate better with the broader
community, they are working on developing sign language interpreters.

There is a significant communication gap between the deaf and hearing populations in
India, primarily due to the lack of sign language knowledge and the limited availability of
interpreters [5]. Although efforts are underway to develop sign language recognition
systems, real-time recognition remains a substantial challenge. The cited work introduces
an innovative approach that utilizes convolutional neural networks, data augmentation,
batch normalization, dropout, stochastic pooling, and the diffGrad optimizer to recognize
static signs in the Indian sign language alphabet. With remarkable training and validation
accuracy exceeding 99%, this method surpasses the performance of previous systems,
offering promise for more effective communication solutions.

The 3D sign language recognition problem is a complex and intriguing challenge
within human action recognition. The study[6] focuses on constructing color-coded
topographical descriptors from joint distances and angles, termed JDTD and JATD,
and proposes a two-stream CNN architecture for classification. By integrating distance
and angular features, the model demonstrates enhanced performance in predicting
spatiotemporal discriminative features. Comparative analysis using diverse datasets un-
derscores the model’s competitiveness against state-of-the-art baseline action recognition
frameworks.

The importance of Indian Sign Language (ISL) for communication with the hearing
impaired is recognized by the RPwD Act 2016 in India [7]. The Act emphasizes the need
for sign language interpreters in government organizations and public sector undertak-
ings. The paper presents a deep learning-based methodology for ISL static alphabet
recognition using Convolutional Neural Networks (CNN), achieving an impressive
accuracy of 98.64% —outperforming many existing methods.

According to a paper presented at OCEANS 2019 - Marseille [8], in the context of marine
aquaculture, the authors propose an innovative approach for real-time classification and identification
of marine animals using a combination of an embedded system and deep learning tech-
niques. The identification of video and image data captured by underwater cameras has
been undertaken by either humans or computers, yet achieving real-time processing has
proven challenging. In response to this limitation, a proposed methodology for efficient
real-time classification merges an embedded system with deep learning, employing the
MobileNetV2 architecture and transfer learning. Initially, marine animal images are
gathered by an underwater robot equipped with an embedded device. Subsequently,
a MobileNetV2 model, rooted in convolutional neural network (CNN) principles and
tailored to the marine animal images, is formulated to meet real-time processing
requirements. Further enhancement is achieved through transfer learning, refining the
classification capabilities. The model is then trained using the collected marine animal
images, and once trained, it can be downloaded onto the embedded device to facilitate
real-time classification of marine animal images underwater. To assess the efficacy of this
proposed method, experiments comparing InceptionV3 and MobileNetV1 models are
conducted, with a focus on identification accuracy rates and average classification times.
The findings highlight that the MobileNetV2 model, coupled with transfer learning,
outperforms the other considered models in real-time marine animal image classification.

According to a paper from the 2019 6th International Conference on Image and Signal
Processing and their Applications (ISPA) [9], in the realm of human-computer interaction, hand
gestures provide a natural and versatile means for various applications. However,
challenges such as the intricate nature of gesture patterns, variations in hand size, diverse
hand postures, and fluctuating environmental lighting can impact the effectiveness of
hand gesture recognition algorithms. The recent integration of deep learning has signif-
icantly elevated the capabilities of image recognition systems, with deep convolutional
neural networks (CNNs) showcasing superior performance in image representation and
classification when compared to traditional machine learning methods. This literature
survey focuses on a comparative analysis of two techniques for American Sign Language
hand gesture recognition. The first technique employs a proposed deep-convolution
neural network, while the second incorporates transfer learning using the pre-trained
MobileNetV2 model. Both models undergo training and testing with 1815 segmented
images characterized by color and a black background, encompassing static hand
gestures from five volunteers with variations in scale, lighting, and noise. The outcomes
reveal that the proposed CNN model attains an impressive classification accuracy of
98.9%, demonstrating a 2% enhancement over the CNN model enriched through transfer
learning techniques, which achieved 97.06%.

According to a paper published in the International Journal of Innovative Science and
Research Technology [10], a gesture is a form of sign language that incorporates the
movement of the hands or face to indicate an idea, opinion, or emotion. Sign language is
a way for deaf and mute persons to communicate with others by using gestures. Deaf and
mute persons are familiar with sign language since it is widely used in their community,
while the general public is less familiar. Hand gestures have been increasingly popular
because they let deaf and mute people communicate with others. Many of these forms
of communication, however, are still limited to specialized applications and costly
hardware. As a result, we look at a simpler technique that uses fewer resources, such as
a personal computer with a web camera that accomplishes our goal. The gestures are
captured as images through a webcam and image processing is done to extract the hand
shape. The interpretation of images is carried out using a LeNet-5 Convolutional Neural
Network architecture.

According to a paper in the Procedia Computer Science journal [11], to address the problem
of low gesture-image recognition rates, the authors propose a transfer-learning-based image
recognition method called MobileNet-RF, which combines the MobileNet convolutional
network with a Random Forest classifier to further improve recognition accuracy.
The method first transfers the model architecture and weight files of MobileNet
to gesture images, trains the model and extracts image features, and then classifies the
features extracted by convolutional network through the Random Forest model, and
finally obtains the classification results. The test results on the Sign Language Digital
dataset, Sign Language Gesture Image dataset and Fingers dataset showed that the
recognition rate was significantly improved compared with Random Forest, Logistic
Regression, Nearest Neighbor, XGBoost, VGG, Inception and MobileNet.

According to a publication in the 2019 42nd International Convention on Information
and Communication Technology, Electronics and Microelectronics (MIPRO) [12], the
popularity of Python is growing, especially in the field of data science. Consequently,
there is an increasing number of free libraries available for usage. The aim of this review
paper is to describe and compare the characteristics of different data mining and big data
analysis libraries in Python. There is currently no paper dealing with the subject and de-
scribing pros and cons of all these libraries. Here we consider more than 20 libraries and
separate them into six groups: core libraries, data preparation, data visualization, machine
learning, deep learning and big data. Beside functionalities of a certain library, important
factors for comparison are the number of contributors developing and maintaining the
library and the size of the community. Bigger communities mean larger chances for
easily finding a solution to a certain problem. The authors recommend: pandas for data
preparation; Matplotlib, seaborn or Plotly for data visualization; scikit-learn for machine
learning; TensorFlow, Keras and PyTorch for deep learning; and Hadoop Streaming and
PySpark for big data.
Chapter 3

Problem Formulation and Proposed


Solution

3.1 Objectives
1. Develop a Real-Time Indian Sign Language Interpreter:
Our foremost objective is to create a real-time Indian Sign Language (ISL) interpreter,
thereby enabling instantaneous communication between ISL users and individuals
unfamiliar with sign language. Real-time communication is essential for meaningful
conversations and interactions, allowing for seamless exchanges without undue delays.
To make this communication accessible and effective, our system will feature a user-
friendly interface that can adapt to diverse ISL expressions and signing styles. It will
be designed to be intuitive and easy to use, ensuring that both ISL speakers and those
who are not proficient in sign language can interact smoothly. Moreover, the system’s
robustness and adaptability will be central to its development, enabling it to function
reliably under various environmental conditions and for a wide range of users, ultimately
ensuring inclusivity in communication.

2. Expanding the Reach of Sign Language:


The project’s core functionality is centered around the accurate interpretation of ISL
gestures. Our system will be trained to recognize a comprehensive spectrum of ISL
signs, ensuring that it can facilitate detailed and nuanced communication. However, we
recognize that effective communication extends beyond sign language alone. To cater to a
broader audience and enhance accessibility, the project will go beyond interpretation and
offer outputs in both text and speech formats. By providing this multi-modal approach,
we aim to serve the diverse communication preferences of individuals, whether they are
sign language users, those who prefer text-based interactions, or individuals who rely on
spoken language. This comprehensive approach ensures that the project is not limited to
a single mode of communication and can be widely used.

3. Community Driven Innovation:


Ethical considerations are at the heart of our project. We prioritize the responsible
handling of user data and interactions, respecting privacy, consent, and dignity. Our
system will incorporate safeguards to protect the sensitive data of users, ensuring their
trust and complying with legal and ethical standards. In addition to ethical considerations,
we also explore the potential of releasing the project as open source. By adopting an
open-source approach, we promote transparency and encourage contributions from the
developer community. This approach fosters a collaborative environment, allowing for
continuous improvement and innovation, and aligns with the principles of inclusive,
responsible technology development. It also demonstrates our commitment to creating a
tool that is accessible, accountable, and community-driven.

By achieving these objectives, our project aims to create a valuable tool that em-
powers individuals who use the Indian Sign Language and promotes more effective
communication and understanding across a broader spectrum of society.

3.2 Develop a Real Time Sign Language Interpreter


3.2.1 Comparing Different Methodologies

1. MobileNets:

MobileNet represents a pioneering advancement in deep learning architecture, specifically
crafted for mobile and embedded vision applications. Its distinguishing feature
lies in its ability to achieve a commendable balance between computational efficiency
and accuracy, making it a preferred choice for real-time, resource-constrained scenarios.
Unlike traditional deep neural networks, MobileNet introduces depthwise separable
convolutions, a novel technique that significantly reduces computational demands
without compromising performance in tasks like image classification.

MobileNet’s lightweight design is characterized by its streamlined structure, making it
particularly well-suited for deployment on devices with limited computational
resources. The architecture’s adaptability to mobile platforms ensures efficient execution,
making it a valuable asset for applications demanding real-time responsiveness.

Figure 3.1: MobileNet V1

In the Real-Time Sign Language Interpreter project, MobileNet is strategically chosen
for its efficiency in handling live video input, contributing to swift and accurate gesture
recognition. Its role in this project underscores its capacity to enhance accessibility and
inclusivity for individuals with hearing impairments by enabling real-time interpretation
of Indian Sign Language gestures on mobile and embedded devices. MobileNet’s
lightweight design not only aligns with the project’s computational constraints but also
facilitates seamless integration into a user-friendly interface, ensuring a responsive and
practical tool for bridging communication gaps within the hearing-impaired community.
This adaptability positions MobileNet as a powerful catalyst in advancing the project’s
mission of fostering effective and inclusive communication through innovative technol-
ogy solutions.

Advantages:
a. Lightweight and Efficient Design: MobileNet is specifically designed to be lightweight
and efficient, making it suitable for applications where computational resources are
limited. Its architecture allows for faster processing without compromising performance.
b. Reduced Model Size and Parameters: MobileNet has a smaller number of parameters
compared to deeper architectures, resulting in a reduced model size. This is advantageous
for scenarios with limited storage capacity and facilitates quicker model deployment.
c. Well-Suited for Real-Time Applications on Resource-Constrained Devices: The
efficient design of MobileNet, both in terms of model size and computational require-
ments, makes it well-suited for real-time applications, especially on devices with limited
resources such as mobile phones and edge devices.

Disadvantages:
a. May Sacrifice Some Accuracy Compared to Deeper Architectures: Due to its
lightweight design, MobileNet may sacrifice a small amount of accuracy compared to
deeper and more complex architectures like traditional CNNs. In scenarios where achiev-
ing the highest possible accuracy is paramount, a trade-off between model efficiency and
precision might need consideration.
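
For illustration only, the sketch below shows how a MobileNet backbone could be set up for this task through transfer learning with the Keras Applications API. This is not the architecture finally adopted by the project (Section 3.2.2 selects a plain CNN), and the input size, class count, and classifier head are assumptions chosen to make the comparison concrete.

# Hedged sketch: MobileNet backbone with transfer learning for gesture
# classification. Input shape, class count, and head layers are assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet

NUM_CLASSES = 36              # assumed number of gesture classes
INPUT_SHAPE = (224, 224, 3)   # MobileNet's standard RGB input size

# Load ImageNet-pretrained weights without the classification head
base = MobileNet(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
base.trainable = False        # freeze the backbone for transfer learning

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])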

2. Convolutional Neural Network (CNN):

A Convolutional Neural Network (CNN) is a deep learning architecture designed for
processing and analyzing visual data, making it particularly effective for tasks such as
image recognition and computer vision. It employs specialized layers to automatically
and adaptively learn hierarchical representations of features directly from the input data.

CNNs are structured to mimic the visual processing performed by the human brain.
Through the use of convolutional layers, the network learns to identify patterns and
features within the input, allowing it to recognize complex structures in images. These
networks have shown remarkable success in tasks like object recognition, image classifi-
cation, and, in the context of the Real-Time Sign Language Interpreter, capturing spatial
details of hand gestures.

The hierarchical architecture of CNNs enables them to automatically learn and ex-
tract increasingly abstract and complex features as the data passes through successive
layers. Non-linear activation functions, pooling layers, and fully connected layers
contribute to the network’s ability to understand and classify visual information.

CNNs are known for their parameter sharing and spatial hierarchies, making them
well-suited for tasks where the spatial arrangement of features is crucial. This character-
istic is particularly valuable in recognizing the unique hand configurations and positions
associated with sign language gestures.

In the Real-Time Sign Language Interpreter project, the CNN serves as a key component
for extracting essential spatial information from live video input, contributing to the
accurate and real-time interpretation of Indian Sign Language gestures. By harnessing the
power of convolutional layers, the CNN precisely identifies intricate patterns, shapes, and
orientations of hands, enabling it to discern the rich vocabulary of Indian Sign Language.

Figure 3.2: Convolutional Neural Network

This spatial understanding is crucial for the interpreter to recognize not only static
hand shapes but also dynamic movements, ensuring a comprehensive interpretation of
gestures. The CNN’s role extends beyond mere recognition; it actively contributes to the
system’s adaptability, allowing it to handle diverse signing styles, lighting conditions, and
backgrounds, ultimately enhancing the interpreter’s robustness in real-world scenarios.
The integration of CNN within the hybrid model showcases its versatility, making it an
indispensable tool for fostering inclusive communication for individuals with hearing
impairments.

Advantages:
a. Can Capture Complex Hierarchical Features: CNNs are designed to automatically
learn hierarchical features from raw pixel values. This ability is crucial for tasks like
image classification where the model needs to understand patterns at various levels of
abstraction. In sign language interpretation, capturing the hierarchical features of hand
gestures, finger movements, and spatial relationships is essential for accurate recognition.
b. Flexible Architecture for Customization: CNNs offer a flexible architecture that allows
customization to suit the specific characteristics of your dataset. You can design the
network with multiple convolutional layers, pooling layers, and fully connected layers
to capture and process the unique features of sign language gestures. This flexibility is
advantageous when tailoring the model to the intricacies of the task.
c. Well-Suited for Tasks Requiring High Accuracy: CNNs are known for their ability
to achieve high accuracy in image classification tasks when trained on large and diverse
datasets. In sign language interpretation, where precision is crucial for meaningful
communication, a CNN can be trained to recognize subtle variations in gestures, leading
to higher overall accuracy.

Disadvantages:
a. Larger Model Size: CNNs can have a larger number of parameters, leading to a larger
model size compared to more lightweight architectures. This can be a disadvantage in
scenarios with limited storage or when deploying the model on resource-constrained
devices, as it might require more memory and storage space.

3.2.2 Best Fit Model

In creating an effective sign language interpreter, we’ve chosen Convolutional
Neural Networks (CNNs) over MobileNets because we’re focused on getting the most
accurate results. Sign language is intricate, and to interpret it well, our model needs to
understand the subtle details and nuances in hand movements. CNNs are really good at
learning from diverse datasets, making them a perfect match for our goal of accurately
recognizing these subtle variations in sign language expressions.

Our decision is also influenced by the size of our dataset, which is quite big with
lots of labeled examples covering 36 different classes. CNNs excel in handling such large
and varied datasets. Their flexibility allows us to customize the model to pick up on the
unique features of sign language gestures, making it better at making precise predictions.

After comparing CNNs and MobileNets, it’s clear that CNNs are the better choice
for our project. While MobileNets are more efficient, we’ve found that the detailed
features CNNs can capture significantly contribute to the accuracy we’re aiming for in
sign language interpretation. So, our decision to go with CNNs is well-thought-out and
aligns with our commitment to achieving excellence in accuracy and precision for our
real-time sign language interpreter.

3.3 Work Done


The code sets up a data generator using Keras for image classification tasks. ImageDataGenerator
is a class in Keras that performs real-time data augmentation on images; in this project it is
used to generate batches of augmented image data.

Figure 3.3: Load and Preprocess the Dataset

import numpy as np

Purpose: NumPy is a library for numerical operations in Python. It supports large, multi-dimensional arrays and matrices, along with mathematical functions.

import seaborn as sns

Purpose: Seaborn is a statistical data visualization library based on Matplotlib. It provides a high-level interface for drawing informative statistical graphics.

from keras.preprocessing.image import load_img, img_to_array

Purpose: Keras is a high-level neural networks API. The functions load_img and img_to_array are used for loading images and converting them to arrays, respectively.

import matplotlib.pyplot as plt

Purpose: Matplotlib is a plotting library for Python. It provides a variety of static, animated, and interactive plots.

import os

Purpose: The os module provides interaction with the operating system. It is used for navigating the file system and specifying file paths.

from google.colab import drive

Purpose: Google Colab is a cloud-based platform. The drive module is used to mount Google Drive, enabling access to files stored in Google Drive.

from keras.preprocessing.image import ImageDataGenerator

Purpose: ImageDataGenerator is a Keras class for real-time data augmentation during neural network training. It generates batches of augmented images to improve model generalization.

3.3.1 Training Data Generator Configuration

Figure 3.4: Training Data configuration

This configures the data generator for the training set. It specifies the directory containing
the training images, the target size, the color mode (grayscale in this case), the batch size,
the class mode (categorical, indicating a classification task), and whether to shuffle the data
after each epoch. The data generator for the validation set is configured similarly; the key
difference is that shuffle is set to False, so the order of the validation images is not shuffled
during training.
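
Since Figures 3.3 and 3.4 show this configuration only as screenshots, a minimal sketch of equivalent code is given below. The directory path, the 48x48 target size, the batch size of 32, and the 20% validation split are assumptions taken from parameters quoted elsewhere in this report, not a verbatim copy of the project code.

# Hedged sketch of the training/validation generator configuration.
# The dataset path "dataset/train" is hypothetical.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(validation_split=0.2)

train_generator = datagen.flow_from_directory(
    "dataset/train",
    target_size=(48, 48),
    color_mode="grayscale",
    batch_size=32,
    class_mode="categorical",
    subset="training",
    shuffle=True,               # reshuffle the training data every epoch
)

validation_generator = datagen.flow_from_directory(
    "dataset/train",
    target_size=(48, 48),
    color_mode="grayscale",
    batch_size=32,
    class_mode="categorical",
    subset="validation",
    shuffle=False,              # keep the validation order fixed
)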

3.3.2 Usage of CNN

Import Libraries: - Import layers, models, and optimizer from Keras.


Number of Classes: - Define the number of classes for the classification task (29 in this
case).
Define Model Architecture: - Create a Sequential model.
- Add the first convolutional layer with 64 filters, batch normalization, ReLU activation,
max-pooling, and dropout.
- Add subsequent convolutional layers with increasing filter size.
- Flatten the output of the convolutional layers.
- Add fully connected (dense) layers with batch normalization, ReLU activation, and dropout.
- The last layer has units equal to the number of classes with softmax activation.

Figure 3.5: Model Compilation

Compile the Model: - Choose the Adam optimizer with a learning rate of 0.0001.
- Set the loss function to categorical crossentropy (suitable for multi-class classification).
- Choose accuracy as the evaluation metric.
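
Figure 3.5 shows the compilation step as an image; the following sketch reconstructs the model definition and compilation as described above. The exact filter counts and dropout rates are assumptions consistent with the bullet points, not the project’s verbatim configuration.

# Hedged sketch of the CNN architecture and compilation described above,
# assuming 48x48 grayscale inputs.
from keras import layers, models
from keras.optimizers import Adam

num_classes = 29  # number of classes stated in the text

model = models.Sequential([
    # First convolutional block: 64 filters, batch norm, ReLU, pooling, dropout
    layers.Conv2D(64, (3, 3), padding="same", input_shape=(48, 48, 1)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    # Subsequent blocks with increasing filter counts
    layers.Conv2D(128, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(256, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    # Classifier head
    layers.Flatten(),
    layers.Dense(512),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])

# Adam optimizer with a learning rate of 0.0001, categorical crossentropy loss,
# and accuracy as the evaluation metric
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])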

3.3.3 Training

Figure 3.6: Setting up Epochs

Set Number of Epochs: - Specify the number of training epochs (10 in this case).
Import Callback: - Import the ModelCheckpoint callback from Keras. This callback is
used to save the model weights during training.
Configure ModelCheckpoint Callback:
- Create a ModelCheckpoint instance to save the best model weights based on validation accuracy.
- Monitor validation accuracy, set verbose mode, and save only the best weights.
Create Callbacks List: - Create a list containing the ModelCheckpoint callback for later
use during model training.
Start Training: - Use the ‘fit‘ method on the model to train it.
- Provide the training data generator (‘train_generator‘) and the number of steps per
epoch.
- Specify the number of epochs, validation data generator (‘validation_generator‘), and
the number of validation steps.
- Include the callbacks list for additional functionality during training, such as saving the
best model weights.
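
Figure 3.6 presents this training setup as a screenshot; a minimal sketch of equivalent code follows. The checkpoint file name and the step counts are assumptions, and the model and generators are those defined in the earlier sketches.

# Hedged sketch of the checkpoint callback and the training loop.
from keras.callbacks import ModelCheckpoint

epochs = 10  # number of training epochs stated in the text

# Save only the weights of the model that performs best on validation accuracy
checkpoint = ModelCheckpoint(
    "best_model.h5",            # hypothetical output file name
    monitor="val_accuracy",
    verbose=1,
    save_best_only=True,
)
callbacks_list = [checkpoint]

history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // validation_generator.batch_size,
    callbacks=callbacks_list,
)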

Figure 3.7: Plotting the points

Set Plotting Parameters:


- Set the figure size to 20x10 inches.
Create Subplots:
- Create a subplot with 1 row and 2 columns (two plots side by side).
- Set the overall title to ’Optimizer: Adam’ with a font size of 10.
Plot Loss:
- In the first subplot (left), set the y-axis label to ’Loss’ with a font size of 16.
- Plot the training loss and validation loss from the training history.
- Add labels and a legend to distinguish between training and validation loss.
- Place the legend in the upper right corner.
Plot Accuracy:
- In the second subplot (right), set the y-axis label to ’Accuracy’ with a font size of 16.
- Plot the training accuracy and validation accuracy from the training history.
- Add labels and a legend to distinguish between training and validation accuracy.
- Place the legend in the lower right corner.

Figure 3.8: Plotted points on the graph

Show Plot:
- Use the ‘show‘ method to display the generated plot.
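
Figures 3.7 and 3.8 show the plotting code and its output as images; the sketch below reproduces the described steps, assuming the History object returned by fit() is available as history.

# Hedged sketch of the loss/accuracy plots described above.
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plt.suptitle("Optimizer : Adam", fontsize=10)

# Left subplot: training vs. validation loss
plt.subplot(1, 2, 1)
plt.ylabel("Loss", fontsize=16)
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.legend(loc="upper right")

# Right subplot: training vs. validation accuracy
plt.subplot(1, 2, 2)
plt.ylabel("Accuracy", fontsize=16)
plt.plot(history.history["accuracy"], label="Training Accuracy")
plt.plot(history.history["val_accuracy"], label="Validation Accuracy")
plt.legend(loc="lower right")

plt.show()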

3.3.4 Prediction

Figure 3.9: Display the Results

Define Categories:
-Create a list named CATEGORIES containing the labels for the different classes in your
sign language dataset.
Prepare Function:
-Define a function named prepare that takes a file path as input.
-Set the desired image size (IMG_SIZE) to 48 pixels.
-Read the image using OpenCV (cv2) and convert it to grayscale.
-Resize the image to the specified size.
-Reshape the image array to the format expected by the model ((-1, IMG_SIZE, IMG_SIZE, 1)).
Load the Model:
-Load the pre-trained sign language interpreter model (full.model) using TensorFlow’s
Keras API.
Make Predictions:
-Use the prepare function to preprocess an input image for prediction.
-Pass the preprocessed image to the loaded model to obtain predictions for each class.
-The model outputs a probability distribution over the classes, and the class with the high-
est probability is considered the predicted class.
Display Results:
-Print or visualize the prediction results, showing the predicted sign language class for the
input image.
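
Figure 3.9 shows the prediction code only as a screenshot; a self-contained sketch of the described steps follows. The model file name (full.model) comes from the text, while the sample image path and the category list are assumptions used for illustration.

# Hedged sketch of the prediction pipeline: preprocess an image, load the
# trained model, and report the most probable class.
import cv2
import numpy as np
import tensorflow as tf

# Placeholder label list (letters and digits); the real list must match the
# class order used during training.
CATEGORIES = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [str(d) for d in range(10)]
IMG_SIZE = 48

def prepare(filepath):
    # Read the image, convert to grayscale, resize, and reshape for the model
    img = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
    return img.reshape(-1, IMG_SIZE, IMG_SIZE, 1)

model = tf.keras.models.load_model("full.model")        # pre-trained interpreter model

prediction = model.predict(prepare("sample_sign.jpg"))  # hypothetical test image
predicted_class = CATEGORIES[int(np.argmax(prediction))]
print("Predicted sign:", predicted_class)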
Chapter 4

Results and Discussion

In the comprehensive evaluation of our sign language interpreter model across 50 epochs,
we have witnessed a commendable trajectory of improvement. The training process,
consisting of 696 batches per epoch, demonstrates the model’s ability to learn and
generalize effectively.

Starting with the initial epoch, the model exhibited a training accuracy of 27.86%
and a validation accuracy of 34.33%. Over subsequent epochs, these values witnessed
significant enhancements, reflecting the model’s progressive refinement. By the final
epoch, the training accuracy reached an impressive 99.84%, while the validation accuracy
attained a noteworthy 99.99%.

Figure 4.1: Plot for Loss and Accuracy

As seen in Figure 4.1, the corresponding loss metrics also tell a compelling
story. The training loss started at 2.4948 and steadily decreased over epochs, reaching
a minimal value of 0.0052 in the final epoch. Similarly, the validation loss showcased a
consistent downward trend, culminating in a minimal value of 0.00016967.

The validation accuracy, a critical measure of the model’s ability to generalize to
new data, showcased outstanding results. Throughout the training process, the validation
accuracy consistently improved, achieving 100% accuracy in the last several epochs.
This signifies the model’s robustness in accurately classifying sign language gestures,
even when confronted with previously unseen data.

The notable aspect of this training journey is the early indication of model profi-
ciency. As early as the second epoch, the validation accuracy surpassed the 69% mark,
and subsequent epochs witnessed substantial leaps in performance. The model’s ability
to learn intricate patterns from the dataset is evident in the steady climb of accuracy and
the concurrent decline in loss values.

In conclusion, the results of this training process instill confidence in the efficacy
of the Convolutional Neural Network (CNN) architecture for our sign language inter-
preter model. The impressive accuracy values, coupled with the diminishing loss metrics,
underscore the model’s capability to precisely interpret a diverse range of sign language
gestures. As we move forward, the emphasis will be on continued evaluation, potential
fine-tuning, and the application of this trained model in real-world scenarios.
Chapter 5

Conclusion and Future Work

5.1 Conclusion
In conclusion, the journey to develop a robust and accurate sign language interpreter has
been guided by a deliberate selection of Convolutional Neural Networks (CNNs) over
MobileNets. Our primary driver has been the unwavering commitment to achieving the
highest possible accuracy in sign language recognition. Understanding the intricate and
nuanced nature of sign language gestures, we recognized the need for a model capable
of grasping complex hierarchical features for precise interpretation. CNNs, renowned
for their prowess in learning from diverse datasets, seamlessly align with our quest to
attain superior accuracy by discerning the subtle variations inherent in sign language
expressions.

The significance of our decision is further underscored by the scale of our dataset,
comprising a substantial number of labeled instances across 36 distinct classes. This
sizable dataset provides an opportune landscape for CNNs to leverage their inherent flex-
ibility and adaptability. Through comprehensive training, we harness the customization
potential of CNNs, constructing a model finely tuned to the unique characteristics of
sign language gestures. This not only enhances the model’s ability to generalize but also
empowers it to make precise predictions, crucial for effective sign language interpretation.

Our rigorous comparison of CNNs and MobileNets culminated in a clear conclusion:
CNNs emerge as the superior fit for our project. While MobileNets offer
efficiency, the nuanced features captured by CNNs proved to be indispensable for
achieving our paramount goal of accurate sign language interpretation. The meticulous
balance between computational efficiency and accuracy considerations favored CNNs,
solidifying our strategic choice.

As we strive for excellence in accuracy and precision in our real-time sign lan-
guage interpreter, the decision to employ CNNs is rooted in a thoughtful alignment with
the unique demands of our project. The adaptability, customization, and feature-capturing
capabilities of CNNs position them as the optimal choice to meet the specific intricacies
of sign language recognition. Looking ahead, our commitment to refining and advancing
our sign language interpreter underscores our dedication to making a meaningful impact
in accessibility and communication for the hearing-impaired community. The journey
may have concluded, but the impact of our decision to embrace CNNs resonates in the
potential of a more inclusive and accessible future.

5.2 Future Work


Moving forward, our project envisions a significant expansion of its capabilities to truly
become a comprehensive tool for facilitating communication. The core focus on accurate
interpretation of Indian Sign Language (ISL) gestures lays the foundation for a more
inclusive and accessible communication system. While we have made strides in training
our system to recognize a wide array of ISL signs, our future work involves extending its
reach beyond sign language alone.

Our commitment to inclusivity prompts us to enhance the project’s functionality
by incorporating outputs in both text and speech formats. Recognizing that effective
communication takes various forms, we aim to cater to the diverse preferences of
individuals. This multi-modal approach ensures that our project becomes a versatile tool,
not limited to a single mode of communication. Whether users prefer sign language,
text-based interactions, or spoken language, our system aims to provide a seamless and
inclusive experience.

Ethical considerations form the cornerstone of our project’s development. As we
move forward, our focus on responsible data handling, privacy, consent, and dignity
remains unwavering. We are dedicated to implementing robust safeguards to protect user
data, ensuring trust and compliance with legal and ethical standards. Beyond ethical
considerations, we are actively exploring the possibility of releasing our project as open
source.

Adopting an open-source approach aligns with our values of transparency, collaboration,
and community-driven innovation. By making our project open source, we
invite contributions from the developer community, creating a collaborative space for
continuous improvement and innovation. This approach underscores our commitment
to developing a tool that is not only technologically advanced but also accessible,
accountable, and responsive to the needs of the community it serves.

Looking ahead, the future work for our project involves a holistic approach. We
strive to break down communication barriers by embracing multiple modes of expression
and by upholding the highest ethical standards. The journey ahead involves refining our
system, expanding its capabilities, and fostering a community-driven ecosystem that
encourages innovation and inclusivity.
Appendix A

Appendix

A.1 Appendix 1
In this appendix, we provide a detailed overview of the architecture used for training
our sign language interpreter model and the key parameters associated with the training
process.

Model Architecture:
The sign language interpreter is built on a Convolutional Neural Network (CNN) archi-
tecture, leveraging its capability to capture intricate patterns and hierarchical features
crucial for sign language recognition. The model comprises multiple convolutional
layers followed by max-pooling layers to extract essential features from input images.
Subsequently, fully connected layers and a softmax layer are employed for classification
across the diverse set of sign language gestures.

Training Parameters:

1. Epochs: The model underwent training for 50 epochs, indicating the number of
times the entire dataset was processed.

2. Optimizer: The Adam optimizer was employed, known for its effectiveness in op-
timizing the model’s weights during training.

3. Loss Function: Categorical crossentropy, suitable for multi-class classification
tasks, was used to measure the difference between predicted and actual class distributions.


4. Learning Rate: A default learning rate of 0.001 was utilized to control the step size
during optimization.

5. Batch Size: The training data was divided into batches, and each batch contained
32 images. This facilitated more efficient updates to the model’s weights during
training.

6. Validation Split: A validation split of 20% was applied, ensuring that a portion of
the training data was reserved for validation, allowing us to monitor the model’s
performance on unseen data.

A.2 Appendix 2
In this appendix, we present comprehensive details about the dataset used for training
and testing our sign language interpreter model.

Dataset Overview:
Our dataset consists of a diverse collection of images representing 26 letters of the
alphabet, numbers 0-9, and additional classes for symbols such as ’del,’ ’nothing,’ and
’space.’ Each class encompasses approximately 3,000 labeled images, resulting in a
substantial dataset with 36 classes in total.

Data Augmentation:
To enhance the model’s robustness and generalization, data augmentation techniques
were applied during training. These techniques include random rotations, horizontal
flips, and zooming, providing the model with a more varied set of training examples.
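
A minimal sketch of the augmentation settings described above is shown below; the exact parameter values are assumptions, since the report names only the transformation types.

# Hedged sketch of the data augmentation configuration.
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,       # random rotations
    horizontal_flip=True,    # random horizontal flips
    zoom_range=0.1,          # random zooming
)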

Image Preprocessing:
Images were preprocessed to ensure uniformity in dimensions, converting them to
grayscale and resizing to a fixed dimension of 48x48 pixels. This standardization allows
for consistent input to the model during training and inference.

Class Distribution:
Maintaining a balanced class distribution is crucial for preventing biases in the model.
The approximately 3,000 images per class contribute to a well-distributed dataset,
ensuring that the model is exposed to an adequate number of examples for each sign
language gesture.

These appendices aim to provide a transparent and comprehensive understanding
of the technical aspects and dataset characteristics that underpin the development of our
sign language interpreter model.
References

[1] R. Bhadauria, S. Nair, and D. Pal, “A survey of deaf mutes,” Medical Journal Armed Forces India, vol. 63, no. 1, pp. 29–32, 2007.

[2] M. Friedner, “Sign language as virus: Stigma and relationality in urban India,” Medical Anthropology, vol. 37, no. 5, pp. 359–372, 2018.

[3] M. Miles, “Studying responses to disability in South Asian histories: Approaches personal, prakrital and pragmatical,” 2001.

[4] U. Zeshan, M. N. Vasishta, and M. Sethna, “Implementation of Indian Sign Language in educational settings,” Asia Pacific Disability Rehabilitation Journal, vol. 16, no. 1, pp. 16–40, 2005.

[5] U. Nandi, A. Ghorai, M. M. Singh, C. Changdar, S. Bhakta, and R. Kumar Pal, “Indian Sign Language alphabet recognition system using CNN with diffGrad optimizer and stochastic pooling,” Multimedia Tools and Applications, vol. 82, no. 7, pp. 9627–9648, 2023.

[6] E. K. Kumar, P. Kishore, M. T. K. Kumar, and D. A. Kumar, “3D sign language recognition with joint distance and angular coded color topographical descriptor on a 2-stream CNN,” Neurocomputing, vol. 372, pp. 40–54, 2020.

[7] C. Sruthi and A. Lijiya, “Signet: A deep learning based Indian Sign Language recognition system,” in 2019 International Conference on Communication and Signal Processing (ICCSP). IEEE, 2019, pp. 0596–0600.

[8] X. Liu, Z. Jia, X. Hou, M. Fu, L. Ma, and Q. Sun, “Real-time marine animal images classification by embedded system based on MobileNet and transfer learning,” in OCEANS 2019 - Marseille, 2019, pp. 1–5.

[9] K. Bousbai and M. Merah, “A comparative study of hand gestures recognition based on MobileNetV2 and ConvNet models,” in 2019 6th International Conference on Image and Signal Processing and their Applications (ISPA), 2019, pp. 1–6.

[10] S. Vishwanath and S. S. Yawer, “Sign language interpreter using computer vision and LeNet-5 convolutional neural network architecture,” International Journal of Innovative Science and Research Technology, ISSN 2456-2165, 2021.

[11] F. Wang, R. Hu, and Y. Jin, “Research on gesture image recognition method based on transfer learning,” Procedia Computer Science, vol. 187, pp. 140–145, 2021.

[12] I. Stančin and A. Jović, “An overview and comparison of free Python libraries for data mining and big data analysis,” in 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 2019, pp. 977–982.
