INTRODUCTION
DEEP LEARNING
Deep learning is one of the most talked-about technologies today and has firmly put down roots in a wide range of industries that are investing in Artificial Intelligence, Big Data and Analytics. For example, Google uses deep learning in its voice and image recognition algorithms, while Netflix and Amazon use it to understand the behaviour of their customers. Researchers at MIT are even attempting to predict future events using deep learning. Deep learning can be considered a subset of machine learning; it is a field in which computer algorithms learn and improve on their own by examining data. While machine learning uses simpler concepts, deep learning works with artificial neural networks, which are designed to imitate how humans think and learn. Artificial neural networks are inspired by biological neurons, the cells of the brain. Until recently, neural networks were limited by computing power and were therefore limited in complexity.
Deep learning has advanced image classification, language translation and speech recognition, and it can be applied to almost any pattern recognition problem without human intervention. Deep learning models are able to identify the relevant features themselves, requiring only a little guidance from the programmer, and they are very helpful in tackling the problem of dimensionality. Deep learning algorithms are used especially when there is a huge number of inputs and outputs.
Neural networks are layers of nodes, much as the human brain is made up of neurons. Nodes within individual layers are connected to adjacent layers. The network is said to be deeper the more layers it has. A single neuron in the human brain receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and are assigned corresponding weights. A node with a heavier weight exerts more effect on the next layer of nodes. The final layer compiles the weighted inputs to produce an output. Deep learning systems require powerful hardware because they process large amounts of data and involve many complex mathematical calculations. Even with such advanced hardware, however, training a neural network can take weeks.
Deep learning systems require large amounts of data to return accurate results; accordingly, information is fed to them as huge data sets. When processing the data, artificial neural networks are able to classify it using the answers to a series of binary true or false questions involving highly complex mathematical calculations. Deep learning takes this one step further: because of deep neural networks, it automatically discovers the features that are important for classification, whereas in machine learning these features had to be defined manually.
The first advantage of deep learning over machine learning is that it does not require a separate feature extraction step. Before deep learning was used, traditional machine learning methods such as Decision Trees, SVM, the Naïve Bayes Classifier and Logistic Regression were the norm. These algorithms are also called flat algorithms; flat here means that they cannot normally be applied directly to raw data (such as .csv files, images, text, etc.). A pre-processing step called feature extraction is needed. The result of feature extraction is a representation of the given raw data that these classic machine learning algorithms can then use to perform a task. Feature extraction is usually quite complex and requires detailed knowledge of the problem domain. This pre-processing layer must be adapted, tested and refined over several iterations for optimal results. In an artificial neural network, the feature extraction step is already part of the process: during training, this step is also optimized by the network to obtain the best possible abstract representation of the input data. This means that deep learning models require little to no manual effort to perform and optimize the feature extraction process.
FEED-FORWARD NEURAL NETWORK
The feed-forward neural network is the most basic type of neural network, in which control flows from the input layer towards the output layer. Such a network may have a single layer or only one hidden layer; because the data moves in only one direction, there are no feedback loops or connections in the network. There can be multiple hidden layers, depending on the kind of data being handled, and the number of hidden layers is known as the depth of the neural network. A deeper network can learn more complex functions. The input layer first provides the neural network with data, and the output layer then makes predictions on that data based on a series of functions. The ReLU function is the most commonly used activation function in deep neural networks.
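As an illustration only (not part of the project's implementation), a minimal feed-forward network with one hidden layer and ReLU activation could be sketched in Keras as follows; the input size, layer width and number of classes are assumed values chosen for the example.

# Sketch of a feed-forward network with a ReLU hidden layer.
# Input size, hidden width and class count are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),            # 10 input features (assumed)
    tf.keras.layers.Dense(32, activation='relu'),  # hidden layer with ReLU
    tf.keras.layers.Dense(3, activation='softmax') # output layer: 3 classes (assumed)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()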
RADIAL BASIS FUNCTION NETWORK
This kind of neural network generally has more than one layer, preferably two. Radial basis networks are often used in power restoration systems to restore power in the shortest possible time and avoid blackouts. A popular type of feed-forward network is the radial basis function (RBF) network. It has two layers, not counting the input layer, and differs from a multilayer perceptron in the way the hidden units perform their computations. Each hidden unit essentially defines a particular point in input space, and its output, or activation, for a given instance depends on the distance between its point and the instance, which is itself another point. The closer these two points, the stronger the activation.
The parameters that such a network learns are the centres and widths of the RBFs and the weights used to form the linear combination of the outputs obtained from the hidden layer. An important advantage over multilayer perceptrons is that the first group of parameters can be determined independently of the second group and still produce accurate classifiers. One way to determine the first group of parameters is to use clustering: the simple k-means clustering algorithm can be applied, clustering each class independently to obtain k basis functions for each class. The second group of parameters is then learned while keeping the first group fixed. This amounts to learning a simple linear classifier using an approach such as linear or logistic regression. If there are far fewer hidden units than training instances, this can be done very quickly.
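A rough sketch of this two-stage idea (k-means to fix the RBF centres, then a simple linear classifier on the RBF activations) is given below; the toy data, the number of basis functions and the width are assumptions for illustration only, and for brevity the sketch clusters all the data at once rather than each class independently.

# Sketch: RBF network trained in two stages.
# Stage 1: choose centres with k-means.  Stage 2: fit a logistic regression
# on the Gaussian RBF activations.  Data, k and width are assumed values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # toy inputs (assumed)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)      # toy labels (assumed)

k, width = 10, 1.0
centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_features(X, centres, width):
    # Activation of each hidden unit is a Gaussian of its distance to the centre
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

clf = LogisticRegression(max_iter=1000).fit(rbf_features(X, centres, width), y)
print(clf.score(rbf_features(X, centres, width), y))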
The limitation of RBF networks is that they give every attribute the same weight, because all attributes are treated equally in the distance computation, unless attribute weight parameters are included in the overall optimization process. Therefore, unlike multilayer perceptrons, they cannot deal effectively with irrelevant attributes. Support vector machines share the same issue. Support vector machines with Gaussian kernels (i.e., "RBF kernels") are a particular kind of RBF network in which one basis function is centred on each training instance, all basis functions have the same width, and the outputs are combined linearly by computing the maximum-margin hyperplane. The result is that only some of the RBFs have a nonzero weight: the ones that correspond to the support vectors.
MULTILAYER PERCEPTRON
This type of network has more than three layers and is used to classify data that is not linearly separable. Such networks are widely used for speech recognition and other machine learning applications. The multilayer perceptron is also known as the MLP. It consists of fully connected dense layers, which transform any input dimension to the desired dimension. A multilayer perceptron is a neural network that has multiple layers; to create a neural network, we combine neurons so that the outputs of some neurons are the inputs of other neurons.
Consider, for example, a network with three inputs and thus three input nodes, a hidden layer with three nodes, and an output layer that gives two outputs and therefore has two output nodes. The nodes in the input layer take the input and forward it for further processing: each input node forwards its output to each of the three nodes in the hidden layer, and in the same way the hidden layer processes the information and passes it to the output layer. Every node in the multilayer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
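The sigmoid formula referred to here is sigmoid(x) = 1 / (1 + e^(-x)); the small illustrative snippet below (not taken from the project code) shows how it squashes arbitrary real values into the range (0, 1).

# Sigmoid activation: maps any real value into the open interval (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]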
CONVOLUTIONAL NEURAL NETWORK
The CNN is a variation of the multilayer perceptron. A CNN can contain more than one convolution layer, and because it contains convolution layers the network can be very deep with fewer parameters. CNNs are very effective for image recognition and for identifying different image patterns. It is assumed that the reader knows the basic concepts of neural networks.
When it comes to machine learning, artificial neural networks perform really well. They are used in various classification tasks involving images, audio and words. Different types of neural networks are used for different purposes: for predicting a sequence of words we use recurrent neural networks, more precisely an LSTM, while for image classification we use convolutional neural networks. In this section, we describe the basic building blocks of a CNN. Before diving into the convolutional neural network, let us first revisit some concepts of a regular neural network, which has three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The number of
neurons in this layer is equal to the total number of features in our data (number of
pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers depending upon our model and data size. Each hidden layer can have a different number of neurons, which is generally greater than the number of features. The output from each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer, followed by the addition of learnable biases and an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like
sigmoid or softmax which converts the output of each class into the probability score of
each class.
The data is then fed into the model and the output from each layer is obtained; this step is called feed-forward. We then calculate the error using an error function; some common error functions are cross-entropy, squared loss error, etc. After that, we back-propagate through the model by calculating the derivatives. This step, called back-propagation, is used to minimize the loss.
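For concreteness, a small made-up example (not from the original text) of the two error functions mentioned above, evaluated for a single prediction:

# Illustrative values only: one-hot target versus a softmax output.
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot target
y_pred = np.array([0.1, 0.7, 0.2])   # network's softmax output

cross_entropy = -np.sum(y_true * np.log(y_pred))   # about 0.357
squared_loss = np.sum((y_true - y_pred) ** 2)      # 0.14
print(cross_entropy, squared_loss)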
RECURRENT NEURAL NETWORK
An RNN is a type of neural network in which the output of a particular neuron is fed back as an input to the same node. This helps the network predict the output. Such a network is useful for maintaining a small amount of memory, which is very useful for developing chatbots and text-to-speech technologies. The output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus the RNN came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is the hidden state, which remembers some information about a sequence.
An RNN has a "memory" that remembers information about what has been calculated so far. It uses the same parameters for each input, since it performs the same task on all the inputs or hidden layers to produce the output. This reduces the number of parameters, unlike other neural networks. An RNN works as follows: it converts independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the complexity of increasing parameters, and it memorizes each previous output by giving it as input to the next hidden layer. Hence these layers can be joined together, such that the weights and biases of all the hidden layers are the same, into a single recurrent layer.
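A minimal Keras sketch of such a recurrent layer for next-word prediction is shown below; the vocabulary size, sequence length and layer dimensions are assumptions made purely for illustration.

# Sketch: a recurrent layer applies the same weights at every time step and
# carries a hidden state forward.  All sizes below are illustrative assumptions.
import tensorflow as tf

vocab_size, seq_len = 1000, 10
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.SimpleRNN(128),                          # hidden state acts as memory
    tf.keras.layers.Dense(vocab_size, activation='softmax')  # next-word probabilities
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()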
MODULAR NEURAL NETWORK
A modular neural network is made up of several neural network models that are linked together via an intermediary. Modular neural networks allow for more complex management and handling of more basic neural network systems. In this case, the multiple neural networks act as modules, each solving a portion of the problem. An integrator is responsible for dividing the problem into multiple modules as well as integrating the answers of the modules to create the system's final output. Modular neural networks have been studied in various forms since the 1980s. According to the idea of ensemble learning, a collection of "simple" or "weak" learners can outperform a single deep learning model. Modular neural networks, in general, allow engineers to expand the possibilities of employing these technologies and push the limits of what neural networks can do. Each network is converted into a module that may be freely combined with modules of different sorts.
Factors leading to Modular Neural Network's development
Reducing model complexity: Controlling the degrees of freedom of the system is one
method to minimize training time.
Data fusion and prediction averaging: Network committees may be thought of as
composite systems consisting of comparable parts.
Combination of techniques: As a building block, more than one method or network
class can be utilized.
Learning several tasks at the same time: Trained modules can be transferred
between systems that are built for various tasks.
Robustness and incrementality: The integrated network may be fault-tolerant and
develop progressively.
PIXEL RESTORATION - The concept of zooming into videos beyond their actual resolution was unrealistic until deep learning came into play. In 2017, Google Brain researchers trained a deep learning network to take very low resolution images of faces and predict the person's face from them. This method was known as Pixel Recursive Super Resolution. It enhances the resolution of photos significantly, pinpointing prominent features just well enough for personal identification.
NEWS AGGREGATION AND FAKE NEWS DETECTION - Deep Learning allows you
to customize news depending on the readers’ persona. Neural Networks help develop
classifiers that can detect fake and biased news and remove it from your feed.
ROBOTICS - Deep Learning is heavily used for building robots to perform human-like
tasks. Robots powered by Deep Learning use real-time updates to sense obstacles in their
path and pre-plan their journey instantly. Boston Dynamics robots react to people when
someone pushes them around, they can unload a dishwasher, get up when they fall, and do
other tasks as well.
SELF-DRIVING CARS - Deep learning is the force that is bringing autonomous driving to life. Millions of data sets are fed to a system to build a model, to train the machines to learn, and then to test the results in a safe environment. The Uber Artificial Intelligence Labs in Pittsburgh is not only working on making driverless cars commonplace but also on integrating several smart features, such as food delivery, into driverless cars. The major concern for autonomous car developers is handling unprecedented scenarios. A regular cycle of testing and implementation, typical of deep learning algorithms, ensures safe driving with more and more exposure to millions of scenarios. Data from cameras, sensors and geo-mapping is helping create succinct and sophisticated models that navigate through traffic and identify paths, signage, pedestrian-only routes, and real-time elements such as traffic volume and road blockages.
ADVANTAGE OF DEEP LEARNING
COMPANY PROFILE
Wavtech Solution, as a leading IT solution and service provider, provides innovative information technology-enabled solutions and services to meet the demands arising from social transformation, shaping new lifestyles for individuals and creating value for society.
Wavtech Solution has world-leading product engineering capabilities, ranging from consultation, design, R&D and integration to the testing of embedded software, in the fields of automotive electronics, smart devices, digital home products, and IT products. The software provided by Wavtech Solution runs in a number of globally renowned brands. It particularly offers services that include application development & maintenance, ERP implementation & consulting, testing, performance engineering, software localization & globalization, IT infrastructure, BPO, IT education & training, etc. Sticking to its business philosophy and brand commitment of "Beyond Technology", Wavtech Solution is dedicated to providing innovative information technologies to drive the sustainable development of society, as well as becoming a company that is well recognized and respected by employees, shareholders, customers, and society.
OBJECTIVE
The process of turning the user's signs and motions into text is referred to as sign
language recognition. It helps persons who are unable to communicate with the broader
public. The motion is mapped to relevant text in the training data using image processing
techniques and neural networks, and so raw images/videos are turned into text that can
be read and comprehended. Dumb persons are frequently denied access to normal
communication with other members of society. It has been found that they find it difficult
to connect with normal people with their gestures at times, as only a few of them are
recognised by the majority of people. Because people with hearing loss or who are deaf are
unable to communicate verbally, they must rely on some form of visual communication the
majority of the time. In the deaf and dumb community, sign language is the major mode
of communication. It has syntax and vocabulary much like any other language, but it
communicates through visual means. The issue arises when people who are deaf or dumb
try to communicate with others using these sign language grammars. This is due to the
fact that most people are unaware of these grammar rules. As a result, it has been observed that an impaired person's communication is limited to his or her family or the deaf community. The increasing public acceptance of and funding for international projects emphasises the necessity of sign language. For this community, a computer-based solution is in high demand in this age of technology. Some steps toward this goal include teaching a computer to recognise speech, facial emotions, and human gestures.
Nonverbally communicated information is referred to as gestures. At any given time, a human
can make an infinite number of gestures. Computer vision researchers are particularly
interested in human gestures since they are received through vision. The goal of the project is
to create an HCI that can detect human motions. The conversion of these motions into
machine language necessitates the use of a complicated programming procedure.
CHAPTER 2
LITERATURE SURVEY
INTRODUCTION
A literature review surveys books, scholarly articles, and any other sources relevant
to a particular issue, area of research, or theory, and by so doing, provides a description,
summary, and critical evaluation of these works in relation to the research problem being
investigated. A literature survey, or literature review, is a proof essay of sorts. It is a
study and review of relevant literature materials. Literature reviews are designed to
provide an overview of sources you have explored while researching a particular topic and
to demonstrate to your readers how your research fits within a larger field of study.
Various literature surveys have been conducted by analysing papers in the areas of data mining, machine learning, neural networks and deep learning, to gain insight into the ongoing research progress in this area.
RELATED WORKS
Shagun Katoch (2022) [1] A basic human need is the capacity for communication
and self-expression. However, our viewpoints and the ways in which we interact with
people might differ significantly from those of individuals around us depending on
factors such as our upbringing, education, culture, and other factors. Additionally, it is
crucial to make sure that we are understood in the manner in which we want. Despite
this, regular people have little trouble engaging with one another and expressing
themselves through voice, gestures, body language, reading, writing, and talking, all of
which are commonly utilised by them. However, individuals with speech impediment only
use sign language, which makes it more challenging for them to interact with the majority
of people. This suggests the need for software that can identify sign language and
translate it into spoken or written language and vice versa. However, the availability, price, and usability of such systems are limited. The development of automatic sign language recognition systems is largely the result of scholars from several nations working on these sign language recognizers.
Hamzah Luqman (2022) [2] The main form of communication for those who have
hearing loss is sign language. This language relies heavily on non-manual motions and hand
articulations. Recognition of sign language has gained popularity recently. In this study, we
present a trainable deep learning network that can efficiently capture the spatiotemporal
information from a limited number of sign frames for isolated sign language detection. Three
networks, the dynamic motion network (DMN), the accumulative motion network (AMN), and the sign recognition network (SRN), combine to form our proposed hierarchical sign learning module. In addition, we provide a method for addressing the variance in the sign samples produced by different signers by extracting essential postures. These crucial postures help the DMN stream acquire the spatiotemporal details relevant to the signs. We also provide a cutting-edge method for encapsulating both static and dynamic information about sign motions in a single frame. The key postures of the sign are fused in the forward and backward directions to produce an accumulative video motion frame, preserving the sign's spatial and temporal information.
Shikhar Sharma (2021) [3] The communication between a person from the impaired
community with a person who does not understand sign language could be a tedious task.
Sign language is the art of conveying messages using hand gestures. Recognition of dynamic
hand gestures in American Sign Language (ASL) became a very important challenge that
is still unresolved. In order to resolve the challenges of dynamic ASL recognition, a more
advanced successor of the Convolutional Neural Networks (CNNs) called 3-D CNNs is
employed, which can recognize the patterns in volumetric data like videos. The CNN is
trained for classification of 100 words on Boston ASL (Lexicon Video Dataset) LVD
dataset with more than 3300 English words signed by 6 different signers. 70% of the dataset
is used for Training while the remaining 30% dataset is used for testing the model. The
proposed work outperforms the existing state-of-the-art models in terms of precision (3.7%),
recall (4.3%), and f-measure (3.9%). The computing time (0.19 seconds per frame) of the
proposed work shows that the proposal may be used in real-time applications.
Hamzah Luqman (2022) [4] In order to communicate and engage with persons who
have hearing impairments, as well as for applications involving human-machine
interaction, sign language depends on the visual movements of human body parts. In recent
years, this discipline has drawn increasing interest, and a number of study findings
encompassing a range of topics, including sign acquisition, segmentation, recognition,
translation, and language structures, have been observed. This work presents a thorough,
current review of the state-of-the-art literature on automated sign language processing.
With an emphasis on acquisition tools, readily accessible databases, and recognition
approaches for finger spelling signs, isolated sign words, and continuous phrase recognition
systems, the study offers a taxonomy and overview of the body of knowledge and research
activities. It explores several relevant difficulties and highlights current advancements such
as deep machine learning and multimodal techniques. The goal of this survey is to provide information for junior researchers and developers working on sign language gesture recognition and related systems, as well as to identify distinctive features, the current state of the field, and potential future directions that could lead to further advancements.
Noha Sarhan (2022) [5] In this study, we suggest multi-phase fine-tuning of deep networks for sign language recognition (SLR) rather than only standard object identification. By fine-tuning the network's weights over numerous stages, it expands on the fruitful concept of transfer learning. Layers are trained in steps by gradually unfreezing layers for training, starting at the top of the network. This training strategy suits SLR because there is a lack of training data and there are significant differences from the datasets typically utilised for pre-training in this application. Our tests demonstrate that multi-phase fine-tuning can achieve much higher accuracy in a smaller number of training epochs than earlier fine-tuning methods. A key question in transfer learning is how many layers to fine-tune in order to take advantage of the generality of the lower layers' features while still allowing the network to fit the target task. The authors suggested a sequential fine-tuning approach, starting by merely adjusting the weights in the final fully connected layer and gradually adding more layers. Utilizing GoogLeNet, one of the most widely used network architectures, this was applied to transfer learning from the field of object recognition to SLR.
CHAPTER 3
SYSTEM ANALYSIS
EXISTING SYSTEM
Sign language is widely used by people who are deaf and dumb as a medium for communication. A sign language is composed of various gestures formed by different shapes of the hand, its movements and orientations, as well as facial expressions. There are around 466 million people worldwide with hearing loss, and 34 million of these are children. 'Deaf' people have very little or no hearing ability; they use sign language for communication. People use different sign languages in different parts of the world, and compared to spoken languages they are far fewer in number. In the existing system, the lack of datasets, along with the variation of sign language with locality, has resulted in restrained efforts at finger gesture detection. The existing project aims at taking a basic step in bridging the communication gap between normal people and deaf and dumb people using Indian Sign Language. An effective extension of this project to words and common expressions may not only let deaf and dumb people communicate faster and more easily with the outer world, but also provide a boost to developing autonomous systems for understanding and aiding them. Indian Sign Language lags behind its American counterpart because research in this field is hampered by the lack of standard datasets.
DISADVANTAGES
ADVANTAGES
SYSTEM SPECIFICATION
HARDWARE SPECIFICATION
SOFTWARE SPECIFICATION
SOFTWARE DESCRIPTION
Python's developers strive to avoid premature optimization and reject patches to non-critical parts of CPython that would offer marginal increases in speed at the cost of clarity. When speed is important, a Python programmer can move time-critical functions to extension modules written in languages such as C, or use PyPy, a just-in-time compiler. Cython is also available, which translates a Python script into C and makes direct C-level API calls into the Python interpreter. An important goal of Python's developers is keeping it fun to use. This is reflected in the language's name, a tribute to the British comedy group Monty Python, and in occasionally playful approaches to tutorials and reference materials, such as examples that refer to spam and eggs (from a famous Monty Python sketch) instead of the standard foo and bar.
A common neologism in the Python community is pythonic, which can have a wide
range of meanings related to program style. To say that code is pythonic is to say that it uses
Python idioms well, that it is natural or shows fluency in the language, that it conforms with
Python's minimalist philosophy and emphasis on readability. In contrast, code that is difficult
to understand or reads like a rough transcription from another programming language is called
unpythonic. Users and admirers of Python, especially those considered knowledgeable or
experienced, are often referred to as Pythonists, Pythonistas, and Pythoneers. Python is an
interpreted, object-oriented, high-level programming language with dynamic semantics. Its
high-level built in data structures, combined with dynamic typing and dynamic binding, make
it very attractive for Rapid Application Development, as well as for use as a scripting or glue
language to connect existing components together. Python's simple, easy to learn syntax
emphasizes readability and therefore reduces the cost of program maintenance. Python
supports modules and packages, which encourages program modularity and code reuse. The
Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed. Often, programmers fall
in love with Python because of the increased productivity it provides. Since there is no
compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is
easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter
discovers an error, it raises an exception. When the program doesn't catch the exception, the
interpreter prints a stack trace. A source level debugger allows inspection of local and global
variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a
line at a time, and so on. The debugger is written in Python itself, testifying to Python's
introspective power. On the other hand, often the quickest way to debug a program is to add a
few print statements to the source: the fast edit-test-debug cycle makes this simple approach
very effective.
Python’s initial development was spearheaded by Guido van Rossum in the late
1980s. Today, it is developed by the Python Software Foundation. Because Python is a
multiparadigm language, Python programmers can accomplish their tasks using different
styles of programming: object oriented, imperative, functional or reflective. Python can be
used in Web development, numeric programming, game development, serial port access and
more.
There are two attributes that make development time in Python faster than in other
programming languages:
1. Python is an interpreted language, which precludes the need to compile code before
executing a program because Python does the compilation in the background. Because
Python is a high-level programming language, it abstracts many sophisticated details
from the programming code. Python focuses so much on this abstraction that its code
can be understood by most novice programmers.
2. Python code tends to be shorter than comparable code in other languages. Although Python offers fast development times, it lags slightly in terms of execution time. Compared to fully compiled languages such as C and C++, Python programs execute more slowly. Of course, with the processing speeds of computers these days, the speed differences are usually only observed in benchmarking tests, not in real-world operations. In most cases, Python is already included in Linux distributions and on Mac OS X machines.
PYCHARM
PyCharm is one of the most popular IDEs for the Python scripting language. This section gives an introduction to PyCharm and explains its features. PyCharm offers some of the best features to its users and developers in the following aspects.
ALGORITHM USED
CNN ALGORITHM
CNN (Convolutional Neural Network) is a deep learning algorithm that is
primarily used for image processing and computer vision tasks. The algorithm is based
on a type of neural network architecture that has several convolutional layers, which
are responsible for extracting meaningful features from the input image.
The CNN (Convolutional Neural Network) algorithm typically involves the following
steps:
Input image: The input image is fed into the CNN algorithm.
Convolution: In this step, the input image is convolved with a set of learnable filters.
Each filter extracts a specific feature from the image, such as edges or textures.
Activation function: The result of the convolution operation is then passed through a
non-linear activation function, such as the Rectified Linear Unit (ReLU), to introduce
non-linearity into the model.
Pooling: In this step, the output of the activation function is downsampled using a
pooling operation, such as max pooling or average pooling. This helps to reduce the
spatial dimensionality of the input and makes the model more efficient.
Fully Connected Layers: After several convolutional and pooling layers, the output is
then flattened and fed into a series of fully connected layers. These layers perform the
classification or regression task by mapping the features extracted from the input
image to the target output.
Output layer: Finally, the output layer computes the final prediction, which could be
a classification probability distribution or a continuous value.
Optimization: The optimizer updates the model's weights using backpropagation to
minimize the loss function.
Repeat: The above steps are repeated multiple times until the model converges to the
optimal weights that minimize the loss function and improve the accuracy of the
predictions.
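Putting these steps together, a hedged Keras sketch of such a pipeline is given below; the input image size, filter counts and layer widths are assumptions made for illustration and are not the project's actual configuration, although the 26 output classes match the alphabet classification described later.

# Sketch of the CNN pipeline described above: convolution + ReLU, pooling,
# flattening, fully connected layers, a softmax output, and Adam optimization.
# Image size, filter counts and dense widths are assumed values.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),               # grayscale input image (assumed)
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),  # convolution + activation
    tf.keras.layers.MaxPooling2D((2, 2)),                   # pooling / downsampling
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),           # fully connected layer
    tf.keras.layers.Dense(26, activation='softmax')          # 26 alphabet classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])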
CHAPTER 6
PROJECT DESCRIPTION
PROBLEM DEFINITION
The problem of sign language recognition involves developing a system that can
accurately interpret and understand sign language gestures and translate them into written or
spoken language. This can be a challenging task because sign language is a complex and
expressive visual language that involves hand gestures, body language, and facial expressions.
To address this problem, a deep learning model can be developed that can accurately
recognize and interpret sign language gestures. The model should be trained on a large
dataset of sign language gestures, using techniques such as convolutional neural networks
(CNNs) and recurrent neural networks (RNNs). The deep learning model should be able to
recognize the various nuances and subtleties of sign language gestures, such as the speed
and direction of hand movements, facial expressions, and body language. The model
should also be able to handle variations in sign language dialects and regional differences.
Additionally, the model should be able to handle real-time sign language recognition,
which requires high-speed processing and low latency. This can be achieved by
optimizing the model architecture and using hardware accelerators such as graphics
processing units (GPUs). Overall, the development of a deep learning model for sign
language recognition has the potential to provide a valuable tool for individuals who
use sign language as their primary mode of communication. The model can help bridge
the communication gap between hearing and deaf individuals and promote inclusivity and
accessibility for all.
PROJECT OVERVIEW
HAND IMAGE ACQUISITION
In daily life, hand gestures are a natural communication method, used mostly among people who have difficulty in speaking or hearing. However, a human-computer interaction system based on gestures has many application scenarios. In this module, we capture hand images from a real-time camera; the inbuilt camera can be connected to the system. Gesture recognition has been a hot topic for decades, and nowadays two methods are primarily used to perform it. One is based on professional, wearable electromagnetic devices, such as special gloves. The other utilizes computer vision. The former is mainly used in the film industry; it performs well but is costly and unusable in some environments. The latter involves image processing. However, the performance of gesture recognition based directly on the features extracted by image processing is relatively limited. The hand image is captured from the web camera, whose purpose is to capture the human-generated hand gesture and store its image in memory; a Python package is used to store the image in memory.
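A minimal sketch of this acquisition step with OpenCV (the same library used in the appendix code) is shown below; the window name and the key used to stop capture are illustrative choices, not the project's actual settings.

# Sketch: capture hand images from the built-in web camera and keep the
# latest frame in memory; press 'q' to stop.  Window/key choices are assumptions.
import cv2

cap = cv2.VideoCapture(0)          # open the default (inbuilt) camera
while cap.isOpened():
    ret, frame = cap.read()        # frame now holds the hand image in memory
    if not ret:
        break
    cv2.imshow('Hand image acquisition', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()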
BINARIZATION
Background subtraction is one of the major tasks in the field of computer
vision and image processing whose aim is to detect changes in image sequences. Background
subtraction is any technique which allows an image's foreground to be extracted for further
processing (object recognition etc.). Many applications do not need to know everything about
the evolution of movement in a video sequence, but only require the information of changes in
the scene, because an image's regions of interest are objects (humans, cars, text etc.) in its
foreground. After the stage of image preprocessing (which may include image denoising, post
processing like morphology etc.) object localization is required which may make use of this
technique. Foreground detection separates the changes taking place in the foreground from the background. It is a set of techniques that typically analyse video sequences in real time, recorded with a stationary camera. All detection techniques are based on modelling the background of the image, i.e. setting the background and detecting which changes occur. Defining the background can be very difficult when it contains shapes, shadows, and moving objects, and in defining it we assume that the stationary objects may vary in colour and intensity over time. The scenarios where these techniques apply tend to be very diverse: there can be highly variable sequences, such as images with very different lighting, interiors, exteriors, quality, and noise. In addition to processing in real time, systems need to be able to adapt to these changes. We implement the techniques to extract the foreground from the background image using a binarization approach that assigns separate values to background and foreground pixels, and foreground pixels are identified in real-time environments.
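As a hedged illustration of this binarization step (not the project's exact code), the foreground can be separated by converting a captured frame to grayscale and assigning one value to foreground pixels and another to background pixels; the file name is hypothetical and the threshold is chosen automatically here.

# Sketch: binarization of a captured frame; foreground pixels become 255 (white)
# and background pixels become 0 (black).  'hand.jpg' is a hypothetical frame.
import cv2

frame = cv2.imread('hand.jpg')                              # hypothetical captured frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)                    # denoise before thresholding
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu picks the threshold
cv2.imwrite('hand_binary.jpg', binary)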
Artificial Neural Networks (ANN) can learn and therefore can be trained to recognize
patterns, find solutions, forecast future events and classify data. CNN is well documented to
be used for traffic related tasks. Neural Networks learning and behavior is dependent on the
way its individual computing elements are connected and by the strengths of these
connections or weights. These weights can be adjusted automatically by training the
network according to a specified learning rule until it performs the desired task correctly.
CNN is a supervised learning method i.e. a machine learning algorithm that uses known
dataset also known as training dataset. These known parameters help CNN to make
predictions. Input data along with their response values are the fundamental components of a
training dataset. In order to have higher predictive power and the ability to generalize for
several new datasets, the best way is to use larger training datasets. The fingers can be
classified by using the convolutional neural network algorithm. The network is trained with backpropagation, a common method of training artificial neural networks so as to minimize the objective function. Backpropagation is a supervised learning method and a generalization of the delta rule. It requires a dataset of the desired output for many inputs, making up the training set, and it is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop).
SIGN RECOGNITION
Sign language is a well-structured code of gestures, where every gesture has a meaning assigned to it. Sign language is the only means of communication for deaf people. With the advancement of science and technology, many techniques have been developed not only to minimize the problems of deaf people but also to apply them in different fields. From the classification of sign features, the signs are labelled with an improved accuracy rate, and the corresponding alphabet letters are displayed.
FLOW DIAGRAM
LEVEL 0
LEVEL 1
The symbol shown by the user is converted to binary values. These binary values are compared with the alphabets, and the corresponding alphabet is displayed. Gestures are detected from the finger regions with the help of deep learning, and each sign is labelled accordingly.
A system architecture or systems architecture is the conceptual model that defines the
structure, behavior, and more views of a system. An architecture description is a formal
description and representation of a system, organized in a way that supports reasoning about
the structures and behaviors of the system. System architecture can comprise system
components, the externally visible properties of those components, the relationships (e.g. the
behavior) between them. It can provide a plan from which products can be procured, and
systems developed, that will work together to implement the overall system. There have been
efforts to formalize languages to describe system architecture; collectively these are called
architecture description languages (ADLs).
SYSTEM IMPLEMENTATION
7.1 IMPLEMENTATION
8.1 CONCLUSION
The ability to look, listen, talk, and respond appropriately to events is one of the most valuable gifts a human being can have. However, some unfortunate people are denied this opportunity. People get to know one another by sharing their ideas, thoughts, and experiences with those around them. There are several ways to accomplish this, the best of which is the gift of "speech": through speech, everyone can very persuasively convey their thoughts and understand each other. Our initiative intends to close the gap by including a low-cost computer in the communication chain, allowing sign language to be captured, recognised, and translated into speech for the benefit of speech- and hearing-impaired individuals. An image processing technique is employed in this project to recognise the hand-made gestures. This application presents a modern integrated system planned for hearing-impaired people. The camera-based zone of interest can aid in the user's data collection. Each action will be significant in its own right.
FUTURE ENHANCEMENT
Despite having only average accuracy, our system is still well matched with existing systems, given that it can perform recognition at the stated accuracy with larger vocabularies and without aids such as gloves or hand markings. In future, we can extend the framework to implement various deep learning algorithms to recognize the signs and deploy them in real-time applications. In future, the processing speed can also be increased so that sign input can be captured and displayed as sentences.
CHAPTER 9
APPENDICES
SOURCE CODE
import numpy as np
import cv2 as cv
def calc_landmark_list(image, landmarks):
    image_width, image_height = image.shape[1], image.shape[0]
    landmark_point = []
    for landmark in landmarks.landmark:  # normalized landmarks -> pixel coordinates
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)
        landmark_point.append([landmark_x, landmark_y])
    return landmark_point
import cv2 as cv
def draw_landmarks(image, landmark_point):
    # Draws the hand skeleton: black lines (thickness 6) between connected
    # landmark points and filled white circles on every landmark.
    if len(landmark_point) > 0:
        # Connection lines between adjacent landmark points (one joint pair is
        # shown; the full skeleton repeats this cv.line call for each pair).
        cv.line(image, tuple(landmark_point[5]), tuple(landmark_point[6]),
                (0, 0, 0), 6)
        for index, landmark in enumerate(landmark_point):
            radius = 8 if index in (4, 8, 12, 16, 20) else 5  # larger dots on fingertips
            cv.circle(image, (landmark[0], landmark[1]), radius,
                      (255, 255, 255), -1)
    return image
def draw_info_text(image, handedness, hand_sign_text):
    info_text = handedness.classification[0].label[0:]
    if hand_sign_text != "":
        info_text = info_text + ':' + hand_sign_text
    cv.putText(image, info_text, (10, 60), cv.FONT_HERSHEY_SIMPLEX,
               1.0, (196, 255, 255), 2, cv.LINE_AA)
    return image
import copy
import itertools

def pre_process_landmark(landmark_list):
    temp_landmark_list = copy.deepcopy(landmark_list)
    # Convert to coordinates relative to the wrist landmark (index 0)
    base_x, base_y = 0, 0
    for index, landmark_point in enumerate(temp_landmark_list):
        if index == 0:
            base_x, base_y = landmark_point[0], landmark_point[1]
        temp_landmark_list[index][0] = temp_landmark_list[index][0] - base_x
        temp_landmark_list[index][1] = temp_landmark_list[index][1] - base_y
    # Flatten to a one-dimensional list
    temp_landmark_list = list(
        itertools.chain.from_iterable(temp_landmark_list))
    # Normalize by the maximum absolute value
    max_value = max(list(map(abs, temp_landmark_list)))
    def normalize_(n):
        return n / max_value
    temp_landmark_list = list(map(normalize_, temp_landmark_list))
    return temp_landmark_list
import csv

def logging_csv(number, mode, landmark_list):
    csv_path = 'model/keypoint_classifier/keypoint.csv'
    with open(csv_path, 'a', newline="") as f:
        writer = csv.writer(f)
        writer.writerow([number, *landmark_list])
    return
import numpy as np
import tensorflow as tf

class KeyPointClassifier(object):
    def __init__(
        self,
        model_path='model/keypoint_classifier/keypoint_classifier.tflite',
        num_threads=1,
    ):
        self.interpreter = tf.lite.Interpreter(model_path=model_path,
                                               num_threads=num_threads)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

    def __call__(
        self,
        landmark_list,
    ):
        input_details_tensor_index = self.input_details[0]['index']
        self.interpreter.set_tensor(
            input_details_tensor_index,
            np.array([landmark_list], dtype=np.float32))
        self.interpreter.invoke()
        output_details_tensor_index = self.output_details[0]['index']
        result = self.interpreter.get_tensor(output_details_tensor_index)
        result_index = np.argmax(np.squeeze(result))
        return result_index
import mediapipe as mp
import cv2
import numpy as np
import uuid
import os
'''import subprocess as sp
programName = "notepad.exe"
#fileName = "sms.txt"
#sp.Popen([programName, fileName])
sp.Popen([programName])'''
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
with mp_hands.Hands(min_detection_confidence=0.8,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # BGR 2 RGB
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        # Flip on horizontal
        image = cv2.flip(image, 1)
        # Set flag
        image.flags.writeable = False
        # Detections
        results = hands.process(image)
        image.flags.writeable = True
        # RGB 2 BGR
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        # print(results)
        # Rendering results
        if results.multi_hand_landmarks:
            break
cap.release()
cv2.destroyAllWindows()
import csv
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
dataset = 'model/keypoint_classifier/keypoint.csv'
model_save_path = 'keypoint_classifier_new.h5'
NUM_CLASSES = 26

# Load the logged key points: column 0 is the class label, the remaining
# columns are the 21 normalized (x, y) landmark coordinates.
X_dataset = np.loadtxt(dataset, delimiter=',', dtype='float32',
                       usecols=list(range(1, (21 * 2) + 1)))
y_dataset = np.loadtxt(dataset, delimiter=',', dtype='int32', usecols=(0,))
print(len(X_dataset))
print(len(y_dataset))
print(y_dataset)
print(X_dataset.shape)

train_ratio = 0.80
test_ratio = 0.20
X_train, X_test, y_train, y_test = train_test_split(
    X_dataset, y_dataset, train_size=train_ratio, random_state=RANDOM_SEED)
model = tf.keras.models.Sequential([
    tf.keras.layers.Input((21 * 2, )),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

# Checkpoint and early-stopping callbacks used during training (patience assumed)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    model_save_path, verbose=1, save_weights_only=False)
es_callback = tf.keras.callbacks.EarlyStopping(patience=20, verbose=1)

hist = model.fit(X_train, y_train, epochs=500, batch_size=128,
                 validation_data=(X_test, y_test),
                 callbacks=[cp_callback, es_callback])
import matplotlib.pyplot as plt
model.save(model_save_path,include_optimizer=False)
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
import numpy as np
import time
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    import itertools
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy
    if cmap is None:
        cmap = plt.get_cmap('Blues')
    plt.figure(figsize=(20, 20))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.savefig('model/keypoint_classifier/confusion_matrix.png')
from tensorflow.keras.models import load_model
from sklearn.metrics import confusion_matrix, classification_report

model = load_model('model/keypoint_classifier/keypoint_classifier_new.h5')
pred_labels = []
start_time = time.time()
pred_probabs = model.predict(X_test)
end_time = time.time()
for pred_probab in pred_probabs:
    pred_labels.append(list(pred_probab).index(max(pred_probab)))
cm = confusion_matrix(y_test, np.array(pred_labels))
print('\nClassification Report')
print('---------------------------')
print(classification_report(y_test, np.array(pred_labels)))
import csv
import copy
import cv2 as cv
import mediapipe as mp
import numpy as np
def main():
args = get_args()
cap_device = args.device
cap_width = args.width
cap_height = args.height
use_static_image_mode = args.use_static_image_mode
min_detection_confidence = args.min_detection_confidence
min_tracking_confidence = args.min_tracking_confidence
cap = cv.VideoCapture(cap_device)
cap.set(cv.CAP_PROP_FRAME_WIDTH, cap_width)
cap.set(cv.CAP_PROP_FRAME_HEIGHT, cap_height)
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=use_static_image_mode,
    max_num_hands=1,
    min_detection_confidence=min_detection_confidence,
    min_tracking_confidence=min_tracking_confidence,
)
keypoint_classifier = KeyPointClassifier()
# Label file path assumed to sit alongside the other model assets
with open('model/keypoint_classifier/keypoint_classifier_label.csv',
          encoding='utf-8-sig') as f:
    keypoint_classifier_labels = csv.reader(f)
    keypoint_classifier_labels = [
        row[0] for row in keypoint_classifier_labels]
while True:
    key = cv.waitKey(10)
    if key == 27:  # ESC key quits
        break
    ret, image = cap.read()
    if not ret:
        break
    image = cv.flip(image, 1)  # mirror display
    debug_image = copy.deepcopy(image)
    # print(debug_image.shape)
    # cv.imshow("debug_image",debug_image)
    image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
    image.flags.writeable = False
    results = hands.process(image)
    image.flags.writeable = True
    if results.multi_hand_landmarks is not None:
        for hand_landmarks, handedness in zip(results.multi_hand_landmarks,
                                              results.multi_handedness):
            # print(hand_landmarks)
            landmark_list = calc_landmark_list(debug_image, hand_landmarks)
            pre_processed_landmark_list = pre_process_landmark(landmark_list)
            hand_sign_id = keypoint_classifier(pre_processed_landmark_list)
            debug_image = draw_info_text(
                debug_image,
                handedness,
                keypoint_classifier_labels[hand_sign_id])
cap.release()
cv.destroyAllWindows()
Figure 9.1 Coding
REFERENCES
[1] Arpita Halder, Real-time Vernacular Sign Language Recognition using MediaPipe and
Machine Learning, 2021
[2] Hamzah Luqman, An Efficient Two-Stream Network for Isolated Sign Language
Recognition Using Accumulative Video Motion, 2022
[3] Hamzah Luqman, A comprehensive survey and taxonomy of sign language research, 2022
[5] Kil-Houm Park, An integrated mediapipe-optimized GRU model for Indian sign language
recognition, 2022
[6] Noha Sarhan, Multi-phase Fine-Tuning: A New Fine-Tuning Approach for Sign
Language Recognition, 2022
[8] Rahaf Abdulaziz Alawwad, Arabic Sign Language Recognition using Faster R-CNN, 2021
[9] Shagun Katoch, Indian Sign Language recognition system using SURF with SVM and
CNN, 2022
[10] Shikhar Sharma, ASL-3DCNN: American Sign Language recognition technique using 3-
D convolutional neural networks, 2021