
DEEP LEARNING AND APPLICATIONS

Unit IV Generative Modelling 9 Periods


Generative Modelling: Generative adversarial network. Zero Shot
Learning. Applications. Overview of MIL, Highway Network, Fractal
Network, Siamese Net.
The term generative AI refers to a group of approaches that apply ML and neural networks to big data sets with the aim of discovering recurring patterns. From this learned information, the model then produces new and sometimes human-like outputs. For example, a generative AI model designed for fiction writing can generate new stories with familiar features such as plot, setting, characters, and themes.
How Does Generative AI Work?
Generative AI utilizes deep learning methods, especially artificial neural networks inspired by the human brain. These models process large data sets to derive common patterns and structures, getting smarter as they are fed more information. The more content a generative AI model processes, the more believable and human-like its outputs become.
The functioning of Generative Artificial Intelligence (Generative AI) is based
on intricate mechanisms which employ sophisticated deep learning and neural
network algorithms. Here is an in-depth look at how generative AI works:
1. Data Collection and Training:
o Datasets: Generative AI models start off by training on big, relevant data sets that give them a view of the type of content they should generate. For example, a text-based model is developed using huge sets of texts, while an image generator is created using multitudinous image sets.
o Learning Patterns: During training, the model detects complex patterns, structures and relationships in the data. This is a procedure of looking for similarities, regularities and conspicuous items in the data provided.
2. Neural Networks:
o Architecture: The majority of generative AI models are based on neural networks that mimic the functioning of the human brain. The transformer has been the most popular architecture and continues to perform well for tasks such as NLP and image creation.
o Layers and Nodes: Neural networks comprise several layers of interconnected nodes, or neurons, that process and transform information. The layers give the model a means of representing information in an ordered manner.
3. Deep Learning:
o Complex Computations: Deep learning is a specialization of machine learning in which a neural network carries out progressively more sophisticated calculations on the input data over successive layers. This helps the model locate and acquire complex features.
o Training Iterations: The model goes through successive training cycles that tune its inner parameters (weights and biases) after each iteration by comparing the model-generated output with the actual data.
4. Generative Process:
o Prompt Input: Generation starts with a prompt from the user, who supplies the initial information that guides content creation. Depending on the type of system, this prompt may be a text question, an initial image, or any other applicable input.
o Pattern Recognition: Generative AI works by exploiting the patterns observed in the training data to generate results that cohere with the supplied prompt. For instance, if trained on text it forms words and paragraphs, and if trained on pictures it produces images.
5. Refinement with User Feedback:
o Iterative Improvement: Generative AI models can be refined on an iterative basis. Users give feedback on what the model produces, allowing it to learn, improve and produce content that better suits users’ needs.
6. Scaling with More Data:
o Sophistication with Data Volume: The growing sophistication of generative AI is made possible by the input of large sets of raw data. Over time, as the model receives and processes more data, it becomes highly effective at producing lifelike and diverse outputs.
7. Multimodal Capabilities (Optional):
o Multimodal Models: A sub-category of generative AI models, called multimodal models, can handle multiple modes of information, including text, images, and audio. In doing so, they can produce finer and richer outputs.
8. Deployment:
o Integration into Applications: After training, a generative AI model can be integrated into different applications, such as chatbots, creative writing tools, or coding assistants, giving individuals the opportunity to interact with the AI and utilize its content-generation capabilities.
Types of Generative AI Models
There are several groups of generative AI models, such as transformer-based models, GANs, VAEs and multimodal models. Each type is meant for a particular job: generating text, creating images, or processing several types of data at once.
o Transformer-Based Models: These learn, from big data sets, how sequential pieces of information such as words or sentences relate to one another. In NLP systems, they are good at comprehending the syntax and semantics of language.
Examples: GPT-3, Google Bard.
o Generative Adversarial Networks (GANs): These are composed of two neural networks, a generator and a discriminator, operating in tandem but against one another. The generator creates realistic data, and the discriminator assesses whether that data is real or not. This series of adversarial steps yields better and better outputs over time.
Examples: DALL-E, Midjourney.
o Variational Autoencoders (VAEs): Two networks, namely an encoder and a decoder, are used in VAEs to compress and generate data. The encoder compresses the raw data into a simpler, compressed form, and the decoder reconstructs from that compressed information something similar to, but different from, the original (a minimal sketch appears after this list). Examples: VAEs are widely used in image generation applications.
o Multimodal Models: Such models can manipulate several kinds of data at once, including text, pictures and sound. This capacity makes them capable of producing more intricate and sophisticated results.
Examples: GPT-4 and DALL-E 2, both created by OpenAI.
o Attention-Based Models: The attention mechanism lets a model concentrate on certain segments of the input information while generating an output. The mechanism makes the model better able to capture fine-grained detail and complex relational information. Examples: most modern transformer models feature attention mechanisms.
o Recurrent Neural Networks (RNNs): These are built to handle sequential input via a hidden state that takes into account data from prior steps. They are not frequently leveraged for generative AI today, but were critical in earlier language-modelling work. Examples: earlier recurrent language models, which most recent systems no longer use.
o Autoregressive Models: In autoregressive models, each successive output is produced taking the preceding elements into account. This staged, sequential generation process supports flexible development of dynamic, context-appropriate content.
Examples: PixelRNN, PixelCNN.
o Large Language Models: The primary purpose of these models is to understand human language and generate it artificially. Transformer architectures are frequently applied, and the models can be fine-tuned for different language-oriented tasks.
Examples: GPT-3, ChatGPT.
o Sparse Attention Models: Sparse attention models reduce computational overhead by attending to only a subset of the elements within the input data rather than all of them. Because of this selectivity they attend to the required aspects only and are more efficient in some operations. Examples: several sparse-attention variants of transformer-based models.
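As referenced in the VAE entry above, here is a minimal, illustrative PyTorch sketch of the encoder/decoder idea. The layer sizes, data_dim, and latent_dim are assumptions chosen for demonstration, not any particular published model.

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Encoder compresses x into a latent code; decoder reconstructs from it."""
    def __init__(self, data_dim=784, latent_dim=16):   # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(data_dim, 2 * latent_dim)  # outputs mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, data_dim), nn.Sigmoid())

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: sample a latent code, then decode it.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

Sampling different z values from the learned latent space is what lets the decoder produce outputs similar to, but not identical to, the training data.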
Generative AI models exist for various needs such as natural language understanding, text generation, image synthesis, and multimodal content creation. The choice of model depends on the use case and the kind of content to be generated; each type serves unique purposes, from language tasks to image synthesis, catering to diverse generative AI applications.
Benefits of Generative AI Extend Across Industries:
o Lower Labour Costs: The automation of routine tasks requires fewer man-hours and therefore reduces costs for businesses.
o Greater Operational Efficiency: Through the use of generative AI, businesses can make processes efficient and get rid of errors, improving performance across operational elements.
o Insights into Business Processes: This technology also allows the accumulation and examination of huge volumes of information, yielding useful insights for improving working efficiency. Such a data-based approach enables organizations to find what to change in order to improve organizational effectiveness.
o Empowering Professionals and Content Creators: Generative AI tools
provide a plethora of benefits for professionals and content creators,
aiding in various aspects of their work:
o Idea Creation: As part of the brainstorming and ideation process,
generative AI offers fresh angles and concepts that may inspire
innovation.
o Content Planning and Scheduling: Generative AI helps professionals plan content production and schedule it for consistent output.
o Search Engine Optimization (SEO): Optimizing AI-generated content for SEO helps increase visibility on the web and attract targeted traffic.
o Marketing and Audience Engagement: Generative AI enables the development of customized, compelling material, supporting better marketing approaches and heightened user involvement.
o Research Assistance: Generative AI is beneficial in research tasks, where professionals can easily obtain critical knowledge to guide their work.
o Editing Support: Generative AI can help in the editing process by offering recommendations for improvement, ultimately facilitating the final step of content refinement.
o Time Savings: A notable benefit is the short time it takes to complete repetitive and time-consuming errands. This saves time, allowing professionals and creators to concentrate on tasks that require human intelligence.
The productivity gains are great, but manual supervision and scrutiny of generative AI models must remain high. Generative AI should be utilized responsibly, considering factors such as controlling bias and enforcing ethics, which will help exploit its full capabilities within various professions. The combination of human skills and generative AI capabilities will revolutionize workflows as currently practiced by industries and drive superior creativity and strategic outcomes.
What is a Generative Adversarial Network (GAN)?
A GAN is a deep learning architecture consisting of two neural networks that compete with each other in a zero-sum game framework. Generative Adversarial Networks (GANs) are extremely powerful kinds of neural networks employed for unsupervised learning. The two competing networks jointly analyse the variation within a data set, and the GAN generates new data that resembles the known training data distribution.
GANs are a method for generative modelling that uses deep learning building blocks such as the CNN (Convolutional Neural Network). Generative modelling is an unsupervised learning task that automatically discovers and learns the patterns in input data, so that the model can produce new examples that plausibly could have come from the original dataset.
GANs train a generative model by framing the problem as a supervised learning problem with two sub-models. GANs have two components:
1. Generator: The generator is typically a deconvolutional (transposed-convolution) neural network. It generates new, synthetic data resembling real-world examples. The main aim of the generator is to produce output that is mistaken for real data.
2. Discriminator: The discriminator is typically a convolutional neural network. It compares generated samples against real-world examples to classify images as real or fake. The main aim of the discriminator is to identify the artificial output it receives.
How does a GAN work?
Generative Adversarial Networks (GANs) can be broken down into three parts, discussed below:
o Generative: Learn a generative model, i.e. a probabilistic model that describes how the data are generated.
o Adversarial: The training of the model is conducted in an adversarial setting.
o Networks: Use deep neural networks as the Artificial Intelligence (AI) algorithms for training.
As noted above, a GAN has two neural networks: the Generator and the Discriminator. The generator produces false or artificial data (images, audio, video, etc.), while the discriminator tries to identify that artificial data. These two neural networks compete during the training phase. The generator and discriminator steps are repeated many times, and with each repetition the results improve on the previous ones. The diagram below visualizes this process.

The generative model captures the data distribution and is trained to maximise the probability of the discriminator making a mistake, so that the discriminator cannot tell which data is real and which is artificial. The discriminator, on the other side, tries to determine which data came from the training set and which came from the generator. The GAN objective is a minimax game on the value function V(D, G): the discriminator works to maximize V(D, G), while the generator works to minimize it, i.e. min_G max_D V(D, G).
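The minimax dynamic above can be made concrete with a short PyTorch sketch. This is a minimal illustration, not a production GAN: the MLP architectures, latent_dim, and learning rates are assumptions chosen only to show the alternating discriminator/generator updates.

import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784   # illustrative sizes
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: maximize V(D, G) by scoring real data as 1, fakes as 0.
    fake = G(torch.randn(n, latent_dim)).detach()
    loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: minimize V(D, G) by pushing D to label fakes as real.
    fake = G(torch.randn(n, latent_dim))
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()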
What are the types of Generative Adversarial Network or GAN model?
There are various types of GAN models, or Generative Adversarial Network models. These are given below:
1. Conditional GAN:
The Conditional GAN (CGAN) is a deep learning method in which conditional parameters are added to the network. When we feed data "X" together with a condition into the generator, it generates the corresponding data. In the CGAN, labels are also given as input to help the discriminator distinguish the fake data from the real data.
2. Vanilla GAN:
This is one of the simplest types of GAN. In the vanilla GAN, the Generator and the Discriminator are simple multilayer perceptrons, and the algorithm is correspondingly simple: it optimizes a mathematical objective using stochastic gradient descent.
3. Laplacian Pyramid GAN:
The Laplacian Pyramid GAN (LAPGAN) is built on the Laplacian pyramid, a linearly invertible image representation consisting of a set of band-pass images spaced an octave apart, plus a low-frequency residual. The LAPGAN uses multiple generators and discriminators, one at each level of the Laplacian pyramid, and produces high-quality images. The input image is first repeatedly down-sampled to build the pyramid; then, at each layer of the pyramid, the image is up-scaled and refined with noise by a conditional GAN (CGAN), until the image regains its original size.
4. Super Resolution GAN:
The Super Resolution GAN (SRGAN) produces high-resolution images. In this GAN, a deep neural network is used together with an adversarial network. The SRGAN is very useful for upscaling low-resolution images into high-resolution images while minimizing reconstruction errors.
5. Deep Convolutional GAN:
The last type of Generative Adversarial Network is the Deep Convolutional GAN (DCGAN), one of the most powerful and most popular GAN variants. It uses ConvNets in place of multilayer perceptrons, implemented without max pooling, and its layers are not fully connected. A minimal sketch follows.
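A minimal sketch of the DCGAN pattern, assuming 32x32 single-channel images: strided (transposed) convolutions and batch normalization replace max pooling and fully connected layers. The exact channel counts are illustrative.

import torch.nn as nn

# Generator: transposed convolutions upsample a latent vector to a 32x32 image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 1x1 -> 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 4x4 -> 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 8x8 -> 16x16
    nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),                          # 16x16 -> 32x32
)

# Discriminator: strided convolutions downsample (no max pooling, no FC layers).
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),                           # 32x32 -> 16x16
    nn.Conv2d(32, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),      # 16x16 -> 8x8
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),    # 8x8 -> 4x4
    nn.Conv2d(128, 1, 4, 1, 0), nn.Sigmoid(),                               # 4x4 -> 1x1 score
)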
Zero-shot learning
Zero-shot learning (ZSL) is a machine learning scenario in which an AI
model is trained to recognize and categorize objects or concepts without having
seen any examples of those categories or concepts beforehand.
Most state-of-the-art deep learning models for classification or regression are
trained through supervised learning, which requires many labelled examples of
relevant data classes. Models “learn” by making predictions on a labelled
training dataset; data labels provide both the range of possible answers and the
correct answers (or ground truth) for each training example. “Learning,” here,
means adjusting model weights to minimize the difference between the model’s
predictions and that ground truth. This process requires enough labelled samples
for many rounds of training and updates.
While powerful, supervised learning is impractical in some real-world
scenarios. Annotating large amounts of data samples is costly and time-
consuming, and in cases like rare diseases and newly discovered species,
examples may be scarce or non-existent. Consider image recognition tasks:
according to one study, humans can recognize approximately 30,000
individually distinguishable object categories [1]. It’s not feasible, in terms of
time, cost and computational resources, for artificial intelligence models to
remotely approach human capabilities if they must be explicitly trained on
labelled data for each class.
The need for machine learning models to be able to generalize quickly to a large
number of semantic categories with minimal training overhead has given rise
to n-shot learning: a subset of machine learning that also includes few-shot
learning (FSL) and one-shot learning. Few-shot learning typically uses transfer
learning and meta learning-based methods to train models to quickly recognize
new classes with only a few labelled training examples—or, in one-shot
learning, a single labelled example.
Zero-shot learning, like all n-shot learning, refers not to any specific algorithm
or neural network architecture, but to the nature of the learning problem itself:
in ZSL, the model is not trained on any labelled examples of the unseen classes
it is asked to make predictions on post-training.
This problem setup doesn’t account for whether that class was present (albeit
unlabelled) in training data. For example, some large language models
(LLMs) are well-suited for ZSL tasks, as they are pre-trained through self-
supervised learning on a massive corpus of text that may contain incidental
references to or knowledge about unseen data classes. Without labeled
examples to draw upon, ZSL methods all rely on the use of such auxiliary
knowledge to make predictions.
Given its versatility and wide range of use cases, zero-shot learning has become
an increasingly notable area of research in data science, particularly in the fields
of computer vision and natural language processing (NLP).

Generalized zero-shot learning (GZSL)


In a conventional ZSL setting, the model is tested on a dataset containing samples only from unseen classes of data. While useful for developing and validating zero-shot methodologies, this doesn’t reflect the most common real-world conditions: generalized zero-shot learning (GZSL) refers to the zero-shot learning problem in which the data points the model is tasked with classifying might belong to either unseen classes or seen classes, i.e. classes the model has already “learned” from labelled examples.
GZSL must overcome an additional challenge: the tendency of classifiers to bias predictions towards the classes seen in training over the unseen classes they have not yet been exposed to. As such, GZSL often requires additional techniques to mitigate that bias.
How does zero-shot learning work?
In the absence of any labeled examples of the categories the model is being
trained to learn, zero-shot learning problems make use of auxiliary information:
textual descriptions, attributes, embedded representations or other semantic
information relevant to the task at hand.
Rather than directly modeling the decision boundaries between classes, zero-shot learning techniques typically output a probability vector representing the likelihood that a given input belongs to certain classes. GZSL methods may add a preliminary discriminator that first determines whether the sample belongs to a seen class or a new class, then proceeds accordingly.
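As a concrete, illustrative PyTorch sketch of this idea: given an input embedding from any pretrained encoder, and one auxiliary embedding per class (e.g. encoded textual descriptions or attribute vectors), zero-shot classification reduces to comparing embeddings and returning a probability vector. The function name and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def zero_shot_probs(input_embedding, class_embeddings, temperature=0.07):
    """Probability vector over classes the model never saw labeled examples of.

    input_embedding:  (d,) vector from a pretrained encoder.
    class_embeddings: (num_classes, d) auxiliary knowledge, e.g. encoded
                      class descriptions or attribute vectors.
    """
    x = F.normalize(input_embedding, dim=0)
    c = F.normalize(class_embeddings, dim=1)
    similarities = c @ x                      # cosine similarity to each class
    return F.softmax(similarities / temperature, dim=0)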

Applications of Zero-Shot Learning


Image Classification:
 ZSL allows models to recognize and classify images without having seen
examples from all possible categories during training. For instance, a
model trained on images of dogs and cats can classify a horse, which it
never saw during training, by drawing on auxiliary descriptions of horses.
Natural Language Processing (NLP):
 In NLP, ZSL can be used for tasks like sentiment analysis or text
classification where the model can understand and categorize texts into
unseen classes based on textual descriptions.
Object Detection:
 ZSL is applied in object detection systems to identify objects in images
that were not part of the training set, enhancing the flexibility of detection
systems in real-world scenarios.
Recommendation Systems:
 By leveraging user profiles and item descriptions, ZSL can suggest
products or content that a user has not previously interacted with but is
likely to be interested in.
Robotics:
 ZSL helps robots adapt to new tasks or environments by understanding
the task descriptions and learning to perform actions without prior direct
experience.
Medical Diagnosis:
 ZSL can aid in medical image analysis by diagnosing diseases based on
descriptions and symptoms even if the model has not been specifically
trained on certain conditions.
Speech Recognition:
 In speech recognition, ZSL can allow systems to understand and process
new words or phrases that were not included in the training dataset.
Text-to-Image Synthesis:
 ZSL can generate images from textual descriptions, enabling the creation
of visuals for concepts or objects not previously depicted in any dataset.
Gaming:
 In video games, ZSL can enable character or environment recognition
without prior exposure, allowing for more dynamic interactions based on
player behavior or actions.
Social Media Analytics:
 ZSL can analyze and categorize user-generated content (like posts and
comments) into new categories based on trends and linguistic cues,
enhancing content moderation and analysis.

MIL (Multiple Instance Learning)


Overview
Multiple-Instance (MI) learning is an extension of the standard supervised
learning setting. In standard supervised learning, the input consists of a set of
labeled instances each described by an attribute vector. The learner then induces
a concept that relates the label of an instance to its attributes. In MI learning, the
input consists of labeled examples (called “bags”) consisting of multisets of
instances, each described by an attribute vector, and there are constraints that
relate the label of each bag to the unknown labels of each instance. The MI
learner then induces a concept that relates the label of a bag to the attributes
describing the instances in it. This setting contains supervised learning as a
special case: if each bag contains exactly one instance, it reduces to a standard
supervised learning problem.
Representation
In the standard MIL assumption, negative bags are said to contain only negative
instances, while positive bags contain at least one positive instance. Positive
instances are labeled in the literature as witnesses.

An intuitive example for MIL is a situation where several people each have a key chain containing several keys. Some of these people are able to enter a certain room, and some aren’t. The task is then to predict whether a certain key, or a certain key chain, can get you into that room.
To solve this, we need to find the exact key that is common to all the “positive” key chains (say, the green key). We can then correctly classify an entire key chain: positive if it contains the required key, negative if it doesn’t. A minimal sketch of this max-style bag classifier follows.
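Here is a minimal PyTorch sketch of this logic under the standard MIL assumption, where the bag score is the maximum instance score (the bag is positive iff it contains "the key"). The two-layer instance scorer is a hypothetical choice; any instance-level model could take its place.

import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    """Standard MIL assumption: a bag is positive iff its best instance is."""
    def __init__(self, in_dim):
        super().__init__()
        # Hypothetical instance-level scorer (illustrative architecture).
        self.instance_score = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, bag):                    # bag: (num_instances, in_dim)
        scores = self.instance_score(bag)      # one score per instance ("key")
        return torch.sigmoid(scores.max())     # bag probability from the witness

Replacing the max with an average or a count-based pooling accommodates problems where a single instance is not enough, as discussed next.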
This standard assumption can be slightly modified to accommodate problems
where positive bags cannot be identified by a single instance, but by its
accumulation. For example, in the classification of desert, sea and beach
images, images of beaches contain both sand and water segments. Several
positive instances are required to distinguish a “beach” from “desert”/”sea”.
Characteristics of Multiple Instance Learning Problems
 Task/Prediction: Instance Level vs Bag Level
In some applications, like object localization in images (in content-based retrieval, for instance), the objective is not to classify bags but to classify individual instances; the bag label only indicates the presence of the target entity somewhere in the image.
Note that the bag-level classification performance of a method is often not representative of its instance-level classification performance. For example, in a negative bag a single false positive causes the whole bag to be misclassified, whereas in a positive bag an extra positive instance does not change the label and therefore does not affect the loss at bag level.
Bag Composition
Most existing MIL methods assume that positive and negative instances are
sampled independently from a positive and a negative distribution. This is often
not the case, due to the co-occurrence of several relations:
i) Intra-Bag Similarities
The instances belonging to the same bag share similarities that instances from other bags do not. In computer vision applications, it is likely that all segments of an image share similarities related to the capture conditions (e.g. illumination). Another source is overlapping patches produced by the extraction process.
(Figure: the problem of ambiguous negative classes in Multiple Instance Learning, where the positive concept can be marginally represented in a negative bag.)
ii) Instance Co-Occurrence
Instances co-occur in bags when they share a semantic relation. This type of correlation happens when the subject of a picture is more likely to be seen in some environments than in others, or when some objects are often found together.

(Figure: the Multiple Instance Learning concept for image classification, where an image with a bear actually contains multiple concepts involving the background, such as the grass.)

Instance and Bag Structure


In some problems, there is an underlying structure (spatial, temporal, relational, causal) between instances in bags, or even between bags. For example, when a bag represents a video sequence – say, identifying the frames of a video where a cat appears, knowing only that there is a cat somewhere in that video – all frames or patches are temporally and spatially ordered.
Label Ambiguity
Label Noise
Some MIL algorithms, especially those working under the standard MIL assumption, rely heavily on the correctness of bag labels. In practice, there are many situations where positive instances may be found in negative bags, due to labeling errors or inherent noise. For example, in computer vision applications it is difficult to guarantee that negative images contain no positive patches: an image showing a house may contain flowers, but is unlikely to be annotated as a flower image.

Label noise also occurs when different bags have different densities of positive events. For instance, consider an audio recording (R1) of 10 seconds containing only 1 second of the tagged event, and another recording (R2) of the same duration in which the tagged event is present for a total of 5 seconds. R1 is a weaker representation of the event than R2.

Different Label Spaces


It is possible to extract patches from negative images that fall into the positive concept region. In the example below, some patches extracted from the image of a white tiger fall into another concept region because they are visually similar to it.
(Figure: examples of label ambiguity in the Multiple Instance Learning domain, e.g. zebra and tiger stripes being confused with a cake texture.)

Highway network
In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous artificial neural networks. It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by Long Short-Term Memory (LSTM) recurrent neural networks. The advantage of a Highway Network over common deep neural networks is that it solves, or partially prevents, the vanishing gradient problem, leading to networks that are easier to optimize. The gating mechanisms facilitate information flow across many layers ("information highways"). Highway Networks have been used in text sequence labelling and speech recognition tasks.

The model has two gates in addition to the transform function H(W_H, x): the transform gate T(W_T, x) and the carry gate C(W_C, x). These two gates are non-linear transfer functions (by convention, the sigmoid function). The H(W_H, x) function can be any desired transfer function.

The carry gate is defined as C(W_C, x) = 1 − T(W_T, x), while the transform gate is simply a gate with a sigmoid transfer function.

The structure of a hidden layer follows the equation:

y = H(W_H, x) · T(W_T, x) + x · C(W_C, x)

Plain Network :

 Before talking about Highway Networks, let’s start with a plain network consisting of L layers, where the l-th layer (omitting the layer index) computes:

y = H(W_H, x)

 where x is the input, W_H is the weight, H is the transform function followed by an activation function, and y is the output. For the i-th unit:

y_i = H_i(x)

 We compute y_i and pass it to the next layer.


Highway network :

 In a highway network, two non-linear transforms T and C are introduced:

y = H(W_H, x) · T(W_T, x) + x · C(W_C, x)

 where T is the Transform Gate and C is the Carry Gate.

 In particular, C = 1 − T:

y = H(W_H, x) · T(W_T, x) + x · (1 − T(W_T, x))

 We have the following conditions for particular T values:

y = x when T = 0
y = H(W_H, x) when T = 1

 When T = 0, we pass the input through as output directly, creating an information highway. That is why it is called a Highway Network!

 When T = 1, we use the non-linearly transformed input as the output.

 Here, in contrast to the i-th unit in the plain network, the authors introduce the block concept. For the i-th block, there is a block state H_i(x) and a transform gate output T_i(x). The corresponding block output is:

y_i = H_i(x) · T_i(x) + x_i · (1 − T_i(x))

 which is connected to the next layer.

 Formally, T(x) is the sigmoid function:

T(x) = σ(W_T x + b_T)
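A minimal PyTorch sketch of one highway layer following the equations above. The negative transform-gate bias initialization, which starts the layer in "carry" mode, follows common practice for highway networks; the ReLU choice for H is an assumption, since H can be any transfer function.

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(W_H, x) * T(W_T, x) + x * (1 - T(W_T, x)),  i.e. C = 1 - T."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)    # transform function (any transfer function)
        self.T = nn.Linear(dim, dim)    # transform gate, sigmoid by convention
        self.T.bias.data.fill_(-2.0)    # start carry-biased: early layers pass x through

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return torch.relu(self.H(x)) * t + x * (1 - t)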

Fractal Network

Fractal networks, or FractalNet, leverage a recursive architecture to enhance deep learning models. The ultimate aim is to build deeper networks without significantly increasing the number of parameters, improving efficiency and performance.
FractalNet is a type of convolutional neural network that eschews residual connections in favour of a "fractal" design. It involves repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. These networks contain interacting sub-paths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers. FractalNet thus introduces a hierarchical, recursive approach to constructing neural networks, where smaller modules are repeated at various scales, allowing increased depth and complexity without a linear increase in parameters (see the sketch after the advantages list below).

 Although FractalNet contains no residual pass-through, its short sub-paths serve a role similar to ResNet’s skip connections, facilitating better gradient flow during training and addressing issues like vanishing gradients in very deep networks.
 The fractal modules can be thought of as building blocks that can easily be stacked or scaled, allowing for more flexible architecture design.

Advantages :

 The recursive nature helps in learning more abstract and complex features.
 It can achieve comparable or superior performance with fewer parameters than traditional deep networks.
 The architecture helps stabilize training in deeper networks.
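As referenced above, here is a simplified PyTorch sketch of the fractal expansion rule f_1(x) = conv(x), f_{k+1}(x) = join(f_k(f_k(x)), conv(x)). The real FractalNet also uses drop-path regularization and wider multi-column joins; this stripped-down version, with averaging as the join and an illustrative conv block, only demonstrates the recursive structure.

import torch.nn as nn

def conv_block(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                         nn.BatchNorm2d(ch), nn.ReLU())

class FractalBlock(nn.Module):
    """f_1 = conv;  f_{k+1}(x) = join(f_k(f_k(x)), conv(x)), join = average."""
    def __init__(self, ch, k):
        super().__init__()
        self.short_path = conv_block(ch)
        self.long_path = (nn.Sequential(FractalBlock(ch, k - 1),
                                        FractalBlock(ch, k - 1))
                          if k > 1 else None)

    def forward(self, x):
        if self.long_path is None:
            return self.short_path(x)
        # No pass-through: both joined signals go through filters and nonlinearities.
        return (self.short_path(x) + self.long_path(x)) / 2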

Siamese Networks :
A Siamese Neural Network is a class of neural network architectures that contain two or more identical subnetworks. ‘Identical’ here means they have the same configuration with the same parameters and weights; parameter updating is mirrored across the sub-networks. The network is used to find the similarity of its inputs by comparing their feature vectors, which makes it useful in many applications.

Traditionally, a neural network learns to predict among multiple classes. This poses a problem when we need to add or remove classes from the data: we then have to update the neural network and retrain it on the whole dataset. Deep neural networks also need a large volume of data to train on. SNNs, on the other hand, learn a similarity function, so we can train them to tell whether two images are the same (which we will do here). This enables us to classify new classes of data without training the network again.

A real-world application of Siamese Networks is in face recognition and


signature verification tasks. Imagine a company implementing an automated
face-based attendance system. With just one image of each employee
available, traditional CNNs would struggle to classify thousands of
employees precisely. Enter the Siamese network, excelling in precisely this
kind of scenario.
The Architecture of Siamese Networks

The Siamese network design comprises two identical subnetworks, each


processing one of the inputs. Initially, the inputs undergo processing through
a convolutional neural network (CNN), which extracts significant features
from the provided images. These subnetworks then generate encoded
outputs, often through a fully connected layer, resulting in a condensed
representation of the input data.

The CNN consists of two branches with a shared feature extraction component, composed of layers for convolution, batch normalization, and ReLU activation, followed by max pooling and dropout layers. The final segment is the FC layer, which maps the extracted features to the final classification outcomes: a linear layer followed by a sequence of ReLU activations and a series of consecutive operations (convolution, batch normalization, ReLU activation, max pooling, and dropout). The forward function guides the inputs through both branches of the network.

The Differencing layer serves to identify similarities between inputs and


amplify distinctions among dissimilar pairs, accomplished using the
Euclidean Distance function:

Distance(x₁, x₂) = ∥f(x₁) – f(x₂)∥₂

In this context,
 x₁, x₂ are the two inputs.
 f(x) represents the output of the encoding.
 Distance denotes the distance function.
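A minimal PyTorch sketch of this architecture: one shared encoder f(.) processes both inputs, and the differencing layer computes the Euclidean distance defined above. The convolutional encoder shown, sized for 28x28 grayscale inputs, is an illustrative assumption.

import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared encoder f(.): both inputs pass through the SAME weights,
        # so parameter updates are automatically mirrored across the branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 7 * 7, 128),   # assumes 28x28 inputs
        )

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)
        return torch.norm(f1 - f2, p=2, dim=1)   # Distance(x1, x2) = ||f(x1) - f(x2)||2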

This property enables the network to acquire effective data representations and apply them to fresh, unseen samples. Consequently, the network generates an encoding, often summarized as a similarity score, that aids in class differentiation.

The network’s architecture is depicted in the accompanying figure. Notably, this network operates as a one-shot classifier, negating the need for many examples per class.

Loss Functions Used in Siamese Networks

A loss function is a mathematical tool to gauge the dissimilarity between the


anticipated and actual output within a machine-learning model, given a
specific input. When training a model, the aim is to minimize this loss
function by adjusting the model’s parameters.

Numerous loss functions cater to diverse problem types. For instance, mean
squared error is apt for regression challenges, while cross-entropy loss suits
classification tasks.

Binary Cross-Entropy Loss

Binary cross-entropy loss proves valuable for binary classification tasks,


where the objective is to predict between two possible outcomes. In the
context of a Siamese network, the aim is to classify an image as either
“similar” or “dissimilar” to another.

This function quantifies the disparity between the predicted probability of the positive class and the actual outcome. Within the Siamese network, the predicted probability pertains to the likelihood of image similarity, while the actual outcome is binary: 1 for similar images and 0 for dissimilar ones.
The function is formulated as the negative logarithm of the true class likelihood:
−(y·log(p) + (1−y)·log(1−p))

Here,

 y signifies the true label.


 p signifies the predicted probability.

Training a model with binary cross-entropy loss strives to minimize this


function by parameter adjustment. Through such minimization, the model
gains proficiency in accurate class prediction.

Contrastive Loss

Contrastive Loss differentiates image pairs by employing distance as a similarity measure. This function proves advantageous when the number of training instances per class is limited. It’s important to note that contrastive loss requires pairs of positive and negative training samples. A visualization of this loss is provided in the accompanying figure.
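Since the figure is not reproduced here, one common formulation of the loss (conventions vary on which label value means "similar") can be written down directly. For a pair with label y (1 for similar, 0 for dissimilar), distance D = ∥f(x₁) − f(x₂)∥₂ and margin m:

L = y·D² + (1−y)·max(0, m − D)²

Similar pairs are pulled together by the D² term, while dissimilar pairs are pushed apart until their distance exceeds the margin m.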
