
Machine Learning Techniques (KCS 055)

UNIT-V
REINFORCEMENT LEARNING–Introduction to Reinforcement Learning , Learning Task, Example of
Reinforcement Learning in Practice, Learning Models for Reinforcement – (Markov Decision process , Q
Learning - Q Learning function, Q Learning Algorithm ), Application of Reinforcement Learning,
Introduction to Deep Q Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation,
Genetic Programming, Models of Evolution and Learning, Applications.

Introduction to Reinforcement Learning


Reinforcement Learning is a part of machine learning in which agents are self-trained through a reward and
punishment mechanism. It is about taking the best possible action or path to gain maximum reward and
minimum punishment through observations in a specific situation. The reward acts as a signal for positive and
negative behavior. Essentially, an agent (or several) is built that can perceive and interpret the
environment in which it is placed; furthermore, it can take actions and interact with that environment.



Reinforcement learning is a sub-branch of Machine Learning that trains a model to return an optimum
solution for a problem by taking a sequence of decisions by itself.

We model an environment after the problem statement. The model interacts with this environment and
comes up with solutions all on its own, without human interference. To push it in the right direction, we
simply give it a positive reward if it performs an action that brings it closer to its goal or a negative
reward if it goes away from its goal.

To understand reinforcement learning better, consider a dog that we have to house train. Here, the dog
is the agent and the house, the environment.

We can get the dog to perform various actions by offering incentives such as dog biscuits as a reward.

The dog will follow a policy to maximize its reward and hence will follow every command and might
even learn a new action, like begging, all by itself.

The dog will also want to run around and play and explore its environment. This quality of a model is
called Exploration. The tendency of the dog to maximize rewards is called Exploitation. There is always a
tradeoff between exploration and exploitation, as exploration actions may lead to lesser rewards.



The three main sub-branches of machine learning, supervised, unsupervised, and reinforcement learning, differ
mainly in the feedback available during training: labelled examples, unlabelled data, and reward signals from
an environment, respectively.

Important Terms in Reinforcement Learning

 Agent: The model that is being trained via reinforcement learning

 Environment: The training situation in which the agent operates and which it must learn to optimize for

 Action: All possible steps that can be taken by the agent

 State: The current position/condition of the agent, as returned by the environment

 Reward: Points given to the agent to appraise an action and to help the model move in the right
direction



 Policy: A policy determines how an agent will behave at any time. It acts as a mapping between the
present State and the Action to be taken

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement-learning in ML, which are:

1. Value-based:
The value-based approach aims to find the optimal value function, which is the
maximum value at a state under any policy. Therefore, the agent expects the long-term
return at any state s under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply a policy
such that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different
for each environment.



Reinforcement Learning Workflow

 Create the Environment


 Define the reward
 Create the agent
 Train and validate the agent
 Deploy the policy

Types of Reinforcement learning


1. Positive Reinforcement
2. Negative Reinforcement

Positive Reinforcement:

Positive reinforcement means adding something to increase the tendency that the expected
behavior will occur again. It has a positive impact on the behavior of the agent and increases the
strength of the behavior.

This type of reinforcement can sustain changes for a long time, but too much positive reinforcement
may lead to an overload of states, which can diminish the results.

Negative Reinforcement:

Negative reinforcement is the opposite of positive reinforcement: it increases the
tendency that the specific behavior will occur again by removing or avoiding a negative condition.

Depending on the situation and behavior it can be more effective than positive reinforcement, but it
provides only enough reinforcement to meet the minimum required behavior.

Elements of Reinforcement Learning


There are four main elements of Reinforcement Learning, which are given below:

1. Policy



2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy defines the way an agent behaves at a given time. It maps the
perceived states of the environment to the actions to be taken in those states. The policy is the core element of
RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a
lookup table, whereas in other cases it may involve general computation such as a search process. It can
be a deterministic or a stochastic policy:

For deterministic policy: a = π(s)


For stochastic policy: π(a | s) = P[At = a | St = s]

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the
environment sends an immediate signal to the learning agent, and this signal is known as a reward
signal. These rewards are given according to the good and bad actions taken by the agent. The agent's
main objective is to maximize the total number of rewards for good actions. The reward signal can
change the policy, such as if an action selected by the agent leads to low reward, then the policy may
change to select other actions in the future.

3) Value Function: The value function gives information about how good the situation and action are
and how much reward an agent can expect. A reward indicates the immediate signal for each good and
bad action, whereas a value function specifies the good state and action for the future. The value
function depends on the reward as, without reward, there could be no value. The goal of estimating
values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will
behave. Such as, if a state and an action are given, then a model can predict the next state and reward.

Active and Passive RL techniques


Passive Learning

As the goal of the (passive) agent is to evaluate how good a fixed policy π is, the agent needs to learn the
expected utility Uπ(s) for each state s. This can be done in three ways.

1. Direct Utility Estimation


In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions
that continue until the agent reaches the terminal state). Each trial gives a sample value for every state
visited, and the agent estimates the utility of a state as the running average of these sample values.
The main drawback is that this method wrongly assumes that state utilities are independent, while in
reality they depend on one another (they are Markovian). It is also slow to converge.



Suppose we have a 4x3 grid as the environment, in which the agent can move Left, Right, Up or
Down (the set of available actions). As an example of a run, suppose each non-terminal state carries a
reward of -0.04 and the goal state a reward of +1; a run that reaches the goal from (1,1) in 7 steps then gives

Total reward starting at (1,1) = 1 − 7 × 0.04 = 0.72
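A minimal sketch of direct utility estimation in Python, assuming trials are recorded as lists of (state, reward) pairs; the particular run below is an illustrative trajectory on the 4x3 grid, not one taken from the notes.

```python
# Direct utility estimation: average the observed reward-to-go for each
# state over a set of recorded trials.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Estimate U^pi(s) as the running average of sampled returns."""
    totals = defaultdict(float)   # sum of returns observed for each state
    counts = defaultdict(int)     # number of samples for each state
    for trial in trials:          # trial = list of (state, reward) pairs
        g = 0.0                   # accumulate reward-to-go from the end
        returns = []
        for state, reward in reversed(trial):
            g = reward + gamma * g
            returns.append((state, g))
        for state, g in returns:
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# One illustrative run from (1,1) to the +1 terminal state in 7 steps,
# with a step reward of -0.04 (assumed values):
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
         ((3, 3), -0.04), ((3, 2), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
print(direct_utility_estimate([trial])[(1, 1)])  # approximately 0.72
```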

2. Adaptive Dynamic Programming (ADP)

ADP is a smarter method than Direct Utility Estimation, as it runs trials to learn the model of the
environment and estimates the utility of a state as the reward for being in that state plus the
expected discounted utility of the next state:

Uπ(s) = R(s) + γ Σs' P(s' | s, π(s)) Uπ(s')

where R(s) = reward for being in state s, P(s' | s, π(s)) = transition model, γ = discount factor and
Uπ(s') = utility of being in state s'.

It can be solved using the value iteration algorithm. The algorithm converges fast but can become quite
costly to compute for large state spaces. ADP is a model-based approach and requires the transition
model of the environment. A model-free alternative is Temporal Difference Learning.

3. Temporal Difference Learning (TD)


TD learning does not require the agent to learn the transition model. The update occurs
between successive states, and the agent only updates the states that are directly affected:

Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))

where α = the learning rate, which determines the convergence to the true utilities.


While ADP adjusts the utility of s with all its successor states, TD learning adjusts it with that of
a single successor state s’. TD is slower in convergence but much simpler in terms of
computation.
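A minimal sketch of the passive TD update above, assuming utilities are stored in a Python dictionary and the learning rate is fixed.

```python
# Passive TD update for a single observed transition s -> s' with reward R(s).

def td_update(U, s, s_prime, reward, alpha=0.1, gamma=1.0):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U.setdefault(s, 0.0)
    U.setdefault(s_prime, 0.0)
    U[s] += alpha * (reward + gamma * U[s_prime] - U[s])
    return U

U = {}
U = td_update(U, s=(1, 1), s_prime=(1, 2), reward=-0.04)
print(U[(1, 1)])  # moved a small step toward R(s) + gamma * U(s')
```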

Active Learning



1. ADP with exploration function

As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected
utility of each state and update its policy. This can be done using a passive ADP agent and then
learning optimal actions with value or policy iteration. However, this approach results in a greedy
agent. Hence, we use an approach that gives higher weight to unexplored actions and lower
weight to actions with lower utilities, through an exploration function f(u, n) that increases with
the expected value u and decreases with the number of tries n. A simple choice consistent with
these definitions is

f(u, n) = R+ if n < Ne, and u otherwise,

where R+ is an optimistic reward estimate and Ne is the number of times we want an agent to be forced
to pick an action in every state. The exploration function converts a passive agent into an active one.
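A minimal sketch of the exploration function described above; the particular values chosen for R+ and Ne are illustrative assumptions.

```python
# Optimistic exploration function f(u, n): return an optimistic reward R+
# until an action has been tried at least Ne times, then fall back to the
# current utility estimate u.

R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed)
N_E = 5        # how many times every action should be tried (assumed)

def exploration_fn(u, n, r_plus=R_PLUS, n_e=N_E):
    """Increases with expected value u, decreases with number of tries n."""
    return r_plus if n < n_e else u

print(exploration_fn(u=0.3, n=2))  # 2.0 -> still exploring this action
print(exploration_fn(u=0.3, n=9))  # 0.3 -> exploit the learned estimate
```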

2. Q-Learning

Q-learning is a TD learning method which does not require the agent to learn the transition
model; instead, it learns the Q-value function Q(s, a).

Q-values can be updated using the following equation:

Q(s, a) ← Q(s, a) + α (R(s) + γ maxa' Q(s', a') − Q(s, a))

The next action can be selected using the following policy:

a = argmaxa' f(Q(s', a'), N(s', a'))

where f is the exploration function defined above and N(s', a') counts how many times action a' has been
tried in state s'.



Again, this is simpler to compute but slower than ADP.

Markov Decision Process

A Markov Decision Process (MDP) is the framework used in Reinforcement Learning to map a current
state to an action, where the agent continuously interacts with the environment to produce new solutions
and receive rewards.

The Markov property states that the future is independent of the past, given the present. This
means that, given the present state, the next state can be predicted easily, without the need
for the previous states.

This property is used by a Markov Decision Process to get the next action in our machine learning
model. A Markov Decision Process (MDP) uses:

 A set of States (S)


 A set of Models
 A set of all possible actions (A)
 A reward function that depends on the state and action R( S, A )
 A policy which is the solution of MDP



The policy of a Markov Decision Process aims to maximize the reward at each state. The Agent
interacts with the Environment and takes an Action while it is in one State in order to reach the next
State. We base the action on the maximum Reward returned.
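As an illustration of these components, a tiny MDP can be written out explicitly; the states, actions, probabilities and rewards below are made up for the example.

```python
# A tiny MDP: states, actions, a transition model P(s' | s, a) and a
# reward function R(s, a). All concrete values are illustrative.

states = ["low_battery", "high_battery"]
actions = ["search", "recharge"]

# transition[s][a] = list of (probability, next_state)
transition = {
    "high_battery": {"search":   [(0.8, "high_battery"), (0.2, "low_battery")],
                     "recharge": [(1.0, "high_battery")]},
    "low_battery":  {"search":   [(0.6, "low_battery"), (0.4, "high_battery")],
                     "recharge": [(1.0, "high_battery")]},
}

# reward[s][a] = immediate reward R(s, a)
reward = {
    "high_battery": {"search": 2.0, "recharge": 0.0},
    "low_battery":  {"search": 1.0, "recharge": 0.0},
}

# A policy maps each state to an action (one possible solution of the MDP).
policy = {"high_battery": "search", "low_battery": "recharge"}
print(reward["high_battery"][policy["high_battery"]])  # 2.0
```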

Reinforcement learning working process

For the working process, we consider two main things:

o Environment: It can be anything such as a room, maze, football ground, etc.


o Agent: An intelligent agent such as an AI robot.

Applications of Reinforcement Learning

1. Robotics: RL is used in robot navigation, robo-soccer, walking, juggling, etc.

2. Control: RL can be used for adaptive control, such as factory processes, admission control in
telecommunication, and helicopter piloting.



3. Game Playing: RL can be used in game playing such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing chemical reactions.
5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, the robots use deep
reinforcement learning to pick goods and put them in containers.
7. Finance Sector: RL is currently used in the finance sector for evaluating trading strategies.

Reinforcement Learning Algorithms

Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The most
commonly used algorithms are:

Q-Learning:

1. Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman
equation.
2. The main objective of Q-learning is to learn a policy which can inform the agent
what actions should be taken, and under what circumstances, to maximize the reward.
3. It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
4. The goal of the agent in Q-learning is to maximize the value of Q.
5. The value of Q can be derived from the Bellman equation. Consider the Bellman
equation given below:
6. V(s) = maxa [R(s, a) + γV(s')]
7. Q-learning is an off-policy RL algorithm which uses temporal difference learning. Temporal
difference learning methods are a way of comparing temporally successive predictions.
8. It learns the value function Q(s, a), which tells how good it is to take action "a" in a particular
state "s".
9. The working of Q-learning can be summarized as follows: initialize the Q-table, choose an action,
perform it, measure the reward, and update the Q-value using the Bellman equation, repeating
until the Q-values converge (a minimal sketch follows).
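A minimal sketch of tabular Q-learning following the loop in step 9, assuming a toy "chain" environment (states 0 to 4, goal at state 4, reward +1 for reaching it) and illustrative hyperparameters.

```python
# Tabular Q-learning on a small made-up chain environment.

import random
from collections import defaultdict

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                       # move left or move right

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = defaultdict(float)                   # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    for t in range(100):                 # cap episode length
        # epsilon-greedy action selection (random tie-breaking)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, act)] for act in ACTIONS)
            a = random.choice([act for act in ACTIONS if Q[(s, act)] == best])
        s_next, r, done = step(s, a)
        # Q-learning update: off-policy, bootstraps on max over next actions
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
        if done:
            break

# Greedy policy learned from the Q-table (+1 means "move right")
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```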



 State Action Reward State Action (SARSA):

 SARSA stands for State Action Reward State Action; it is an on-policy temporal difference
learning method. An on-policy control method selects the action for each state while learning,
using a specific policy.
 The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all pairs
of (s, a).
 The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, the
maximum reward for the next state is not required for updating the Q-value in the table.
 In SARSA, the new action and reward are selected using the same policy that determined the
original action.
 SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed after taking the action
s' and a': new state-action pair
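To make the difference concrete, here is a minimal sketch of the two update rules side by side; the dictionary-based Q-table and the default hyperparameters are assumptions for illustration.

```python
# Q-learning uses the max over next actions (off-policy); SARSA uses the
# action a_next actually chosen by the current policy (on-policy).

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q.get((s_next, an), 0.0) for an in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```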

 Deep Q Neural Network (DQN):

 As the name suggests, DQN is Q-learning using neural networks.


 For a big state-space environment, defining and updating a Q-table becomes a challenging and
complex task.
 To solve this issue, we can use the DQN algorithm, where, instead of defining a Q-table, a neural
network approximates the Q-values for each action and state.
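A minimal DQN-style sketch, assuming PyTorch is available; the state size, number of actions and hyperparameters are illustrative, and a full agent would also need experience replay and a separate target network.

```python
# A small network maps a state vector to one Q-value per action, and one
# gradient step moves Q(s, a) toward the TD target r + gamma * max_a' Q(s', a').

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # illustrative sizes

q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One fake transition (s, a, r, s') just to show the shapes involved.
s = torch.randn(1, STATE_DIM)
a = torch.tensor([0])
r = torch.tensor([1.0])
s_next = torch.randn(1, STATE_DIM)

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a)
with torch.no_grad():                                     # TD target
    target = r + GAMMA * q_net(s_next).max(dim=1).values

loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```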



Different Machine Learning Tasks

1. Data preprocessing
2. Exploratory data analysis
3. Feature engineering/selection
4. Regression
5. Classification
6. Clustering
7. Multivariate querying
8. Density estimation
9. Dimension reduction
10. Model/algorithm selection
11. Testing and matching

Genetic Algorithm

A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's theory of evolution
in Nature."

Genetic Algorithms are being widely used in different real-world applications, for example, Designing
electronic circuits, code-breaking, image processing, and artificial creativity.

Basic terminologies

o Population: The population is the subset of all possible or probable solutions which can solve the
given problem.
o Chromosomes: A chromosome is one of the solutions in the population for the given problem,
and a collection of genes makes up a chromosome.
o Gene: A gene is an element of a chromosome; a chromosome is divided into different genes.
o Allele: An allele is the value given to a gene within a particular chromosome.
o Fitness Function: The fitness function is used to determine an individual's fitness level in the
population, i.e., the ability of an individual to compete with other individuals. In every
iteration, individuals are evaluated based on their fitness function.
o Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring
better than the parents. Genetic operators play a role in changing the genetic composition of
the next generation.
o Selection



After calculating the fitness of every individual in the population, a selection process is used to determine
which of the individuals in the population will get to reproduce and create the offspring that will form
the next generation.

Types of selection methods available:

o Roulette wheel selection


o Tournament selection
o Rank-based selection

So, now we can define a genetic algorithm as a heuristic search algorithm to solve optimization
problems. It is a subset of evolutionary algorithms, which is used in computing. A genetic algorithm uses
genetic and natural selection concepts to solve optimization problems.

Working Process

The genetic algorithm works on the evolutionary generational cycle to generate high-quality solutions.
These algorithms use different operations that either enhance or replace the population to give an
improved fit solution.

It basically involves five phases to solve the complex optimization problems, which are given as below:

o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination

1. Initialization

The process of a genetic algorithm starts by generating the set of individuals, which is called population.
Here each individual is the solution for the given problem. An individual contains or is characterized by a
set of parameters called Genes. Genes are combined into a string and generate chromosomes, which is
the solution to the problem. One of the most popular techniques for initialization is the use of random
binary strings.
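A minimal sketch of initialization with random binary strings; the population size and chromosome length are illustrative choices.

```python
# Initialization: generate a population of random binary chromosomes.

import random

POP_SIZE, CHROM_LEN = 6, 8   # illustrative sizes

def init_population(pop_size=POP_SIZE, chrom_len=CHROM_LEN):
    return [[random.randint(0, 1) for _ in range(chrom_len)]
            for _ in range(pop_size)]

population = init_population()
print(population[0])  # e.g. [1, 0, 1, 1, 0, 0, 1, 0]
```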



2. Fitness Assignment

The fitness function is used to determine how fit an individual is, i.e., the ability of an individual to
compete with other individuals. In every iteration, individuals are evaluated based on their fitness
function. The fitness function provides a fitness score to each individual. This score further determines
the probability of being selected for reproduction: the higher the fitness score, the greater the chance
of being selected for reproduction.

3. Selection

The selection phase involves selecting individuals for the reproduction of offspring. All the
selected individuals are then arranged in pairs of two for reproduction. These individuals then
transfer their genes to the next generation.

There are three types of Selection methods available, which are:

o Roulette wheel selection


o Tournament selection
o Rank-based selection
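A minimal sketch of roulette wheel selection from the list above; the fitness function used here (counting 1-bits, the classic OneMax problem) is an illustrative assumption.

```python
# Roulette wheel selection: each individual is selected with probability
# proportional to its fitness score.

import random

def fitness(chromosome):
    return sum(chromosome)              # number of 1s in the binary string

def roulette_wheel_select(population):
    scores = [fitness(ind) for ind in population]
    total = sum(scores)
    pick = random.uniform(0, total)     # spin the wheel
    running = 0.0
    for individual, score in zip(population, scores):
        running += score
        if running >= pick:
            return individual
    return population[-1]

population = [[1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 1]]
parent = roulette_wheel_select(population)
print(parent)
```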

4. Reproduction

After the selection process, the creation of a child occurs in the reproduction step. In this step, the
genetic algorithm uses two variation operators that are applied to the parent population. The two
operators involved in the reproduction phase are given below:

o Crossover: Crossover plays the most significant role in the reproduction phase of the genetic
algorithm. In this process, a crossover point is selected at random within the genes. Then the
crossover operator swaps the genetic information of the two parents from the current generation to
produce a new individual representing the offspring.



The genes of the parents are exchanged among themselves until the crossover point is reached. These
newly generated offspring are added to the population. This process is also called recombination.
Types of crossover styles available:
o One-point crossover
o Two-point crossover
o Uniform crossover
o Inheritable Algorithms crossover
o Mutation
The mutation operator inserts random genes into the offspring (new child) to maintain the
diversity of the population. It can be done by flipping some bits in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances diversification.
Types of mutation styles available (a code sketch of crossover and mutation follows this list):
o Flip bit mutation
o Gaussian mutation
o Exchange/Swap mutation
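A minimal sketch of one-point crossover and flip-bit mutation on binary chromosomes; the mutation rate is an illustrative assumption.

```python
# One-point crossover and flip-bit mutation for binary chromosomes.

import random

def one_point_crossover(parent1, parent2):
    point = random.randint(1, len(parent1) - 1)     # random crossover point
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def flip_bit_mutation(chromosome, rate=0.05):
    # flip each gene independently with the given (assumed) probability
    return [1 - gene if random.random() < rate else gene
            for gene in chromosome]

p1, p2 = [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]
c1, c2 = one_point_crossover(p1, p2)
c1 = flip_bit_mutation(c1)
print(c1, c2)
```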

5. Termination

After the reproduction phase, a stopping criterion is applied as a base for termination. The algorithm
terminates after the threshold fitness solution is reached. It will identify the final solution as the best
solution in the population.



Advantages of Genetic Algorithm
o Genetic algorithms have good parallel capabilities.
o It helps in optimizing various problems such as discrete functions, multi-objective
problems, and continuous functions.
o It provides a solution for a problem that improves over time.
o A genetic algorithm does not need derivative information.

Limitations of Genetic Algorithms


o Genetic algorithms are not efficient algorithms for solving simple problems.
o It does not guarantee the quality of the final solution to a problem.
o Repetitive calculation of fitness values may generate some computational challenges.

Gradient Descent

Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th century. Gradient
Descent is one of the most commonly used iterative optimization algorithms in machine
learning, used to train machine learning and deep learning models. It helps in finding the local minimum
of a function.

The best way to define the local minimum or local maximum of a function using gradient descent is as
follows:

o If we move towards a negative gradient or away from the gradient of the function at the current
point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.

Cost-function
The cost function is defined as the measurement of the difference or error between actual values and
expected values at the current position, expressed as a single real number.
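A minimal sketch of gradient descent moving against the gradient of a simple cost function J(w) = (w − 3)^2; the starting point and learning rate are illustrative.

```python
# Plain gradient descent on J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0                 # initial guess
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * gradient(w)   # move against the gradient

print(round(w, 4))      # approaches 3.0, the minimum of J(w)
```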

Self Organizing Maps – Kohonen Maps


SOM is a type of Artificial Neural Network inspired by biological models of neural systems
from the 1970s. It follows an unsupervised learning approach and trains its network through a
competitive learning algorithm. SOM is used as a clustering and mapping (or dimensionality reduction)
technique to map multidimensional data onto a lower-dimensional space, which allows people to reduce
complex problems for easier interpretation. SOM has two layers: one is the Input layer and the other is
the Output layer. A typical architecture of a Self Organizing Map has two clusters in the output layer and
n input features for each sample.



Algorithm

The steps involved are:


1. Weight initialization
2. For 1 to N number of epochs
3. Select a training example
4. Compute the winning vector
5. Update the winning vector
6. Repeat steps 3, 4, 5 for all training examples.
7. Clustering the test sample
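A minimal sketch of these steps with NumPy, assuming two output clusters and four input features; for simplicity only the winning weight vector is updated (a neighbourhood of size zero), and the data are random toy samples.

```python
# A simplified SOM: winner selection by smallest Euclidean distance,
# update of the winning weight vector only.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_clusters = 4, 2
weights = rng.random((n_clusters, n_features))        # 1. weight initialization

data = rng.random((20, n_features))                   # toy training samples
alpha = 0.5                                           # learning rate

for epoch in range(10):                               # 2. for N epochs
    for x in data:                                    # 3. select a training example
        distances = np.linalg.norm(weights - x, axis=1)
        winner = int(np.argmin(distances))            # 4. compute the winning vector
        weights[winner] += alpha * (x - weights[winner])   # 5. update it
    alpha *= 0.9                                      # decay the learning rate

test = rng.random(n_features)                         # 7. cluster a test sample
print(int(np.argmin(np.linalg.norm(weights - test, axis=1))))
```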
What is Occam’s razor?

Occam’s razor is a law of parsimony popularly stated as (in William’s words) “Plurality must never be
posited without necessity”. Alternatively, as a heuristic, it can be viewed as, when there are multiple
hypotheses to solve a problem, the simpler one is to be preferred. It is not clear as to whom this
principle can be conclusively attributed to, but William of Occam’s (c. 1287 – 1347) preference for
simplicity is well documented. Hence this principle goes by the name, “Occam’s razor”. This often
means cutting off or shaving away other possibilities or explanations, thus “razor” appended to the
name of the principle. It should be noted that these explanations or hypotheses should lead to the
same result.
Relevance of Occam’s razor.
There are many instances that favor a simpler approach, either as an inductive bias or as a constraint to
begin with. Some of them are:
 Studies suggest that preschoolers are sensitive to simpler explanations during
their initial years of learning and development.
 Preference for a simpler approach and explanations to achieve the same goal is seen in various
facets of sciences; for instance, the parsimony principle applied to the understanding of evolution.
 In theology, ontology, epistemology, etc this view of parsimony is used to derive various
conclusions.
 Variants of Occam’s razor are used in knowledge discovery.
The paired sample t-test
The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used
to determine whether the mean difference between two sets of observations is zero. In a paired
sample t-test, each subject or entity is measured twice, resulting in pairs of observations. Common
applications of the paired sample t-test include case-control studies or repeated-measures designs.



Suppose you are interested in evaluating the effectiveness of a company training program. One approach
you might consider would be to measure the performance of a sample of employees before and after
completing the program, and analyze the differences using a paired sample t-test.
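A minimal sketch of this analysis with SciPy's ttest_rel; the before/after scores are made-up numbers.

```python
# Paired sample t-test: the same employees measured before and after a
# training program. A small p-value suggests the mean difference is not zero.

from scipy import stats

before = [72, 68, 75, 80, 64, 70, 77, 69]
after  = [78, 70, 79, 83, 68, 75, 80, 72]

t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```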

Lazy learner:

1. Just stores the data set without learning from it


2. Starts classifying data only when it receives test data
3. So it takes less time for learning and more time for classifying data

Eager learner:

1. When it receives the data set, it starts learning (building a model) right away


2. It does not wait for test data to learn
3. So it takes more time for learning and less time for classifying data



Performing Convolution on a Matrix

We have a 6x6 matrix which represents our image and a 3x3 matrix which represents the kernel.
To perform convolution, we overlap the kernel on the image matrix, multiply each of its elements with the
corresponding element of the image matrix, and sum the products:

3×1 + 0×0 + 1×(−1) + 1×1 + 5×0 + 8×(−1) + 2×1 + 7×0 + 2×(−1) = −5


Similarly, we traverse the kernel through the whole image matrix and fill in the entries of the output image
matrix.
We can observe that the output image matrix is smaller than the original image matrix, so to deal
with that problem we will use the concept of padding.
Padding
To perform padding, we add extra cells symmetrically around the original image matrix.



Calculating Size of Output Image Matrix
If the input image matrix is of size n x n, the kernel is of size f x f, and we have defined padding as p,
then the size of the output image matrix will be (n + 2p − f + 1) x (n + 2p − f + 1).
If we want the output size to be the same as the input size, then the condition is:
n + 2p − f + 1 = n
2p = f − 1
p = (f − 1) / 2
So if we want our output image to have the same shape as the input image, we will use padding
p = (f − 1)/2 for our input image, where f is the size of our kernel.
Stride
This is another concept which is associated with performing convolutions

If stride = 1, the kernel shifts by one unit to traverse the whole input image.



If stride = 2, the kernel shifts by 2 units while traversing the input image.
Given an input image matrix of size n x n, a kernel of size f x f, padding p and stride s,
the shape of the final output image matrix will be
math.floor((n + 2p − f)/s + 1) along each dimension.
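A minimal sketch that computes the output size for a few of the cases discussed above.

```python
# Output size of a convolution for input n x n, kernel f x f, padding p,
# stride s: floor((n + 2p - f) / s + 1) along each dimension.

import math

def conv_output_size(n, f, p=0, s=1):
    return math.floor((n + 2 * p - f) / s + 1)

print(conv_output_size(n=6, f=3))        # 4 : 6x6 image, 3x3 kernel, no padding
print(conv_output_size(n=6, f=3, p=1))   # 6 : "same" padding p = (f - 1) / 2
print(conv_output_size(n=7, f=3, s=2))   # 3 : stride 2
```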

Question 1:

n = 7; f = 3
Output = n − f + 1 = 7 − 3 + 1 = 5
1 0 0 1 1 0 1
0 0 1 1 1 0 1
1 1 1 0 1 0 1
1 1 0 1 0 0 0
1 0 1 0 1 1 0
0 1 1 0 1 1 1
0 1 1 1 1 1 1

7x7



1 0 0
0 1 1
1 1 0

3x3

n=7
f=3
Output=n-f+1

Ans:
4 4 3 3 3
4 2 3 3 2
3 3 3 1 3
3 4 2 4 3
4 3 4 4 5

Question 2
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

5x5
1 0 1
0 1 0
1 0 1

3x3
Stride =1
Ans :
4 3 4
2 4 3
2 3 4
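The two worked answers can be checked with a short cross-correlation routine (no kernel flip, stride 1, no padding), written here with NumPy.

```python
# Verify the two worked convolution examples above.

import numpy as np

def conv2d(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1), dtype=int)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the overlapped window and sum
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

img1 = np.array([[1,0,0,1,1,0,1],[0,0,1,1,1,0,1],[1,1,1,0,1,0,1],
                 [1,1,0,1,0,0,0],[1,0,1,0,1,1,0],[0,1,1,0,1,1,1],
                 [0,1,1,1,1,1,1]])
k1 = np.array([[1,0,0],[0,1,1],[1,1,0]])
print(conv2d(img1, k1))      # 5x5 matrix matching the answer to Question 1

img2 = np.array([[1,1,1,0,0],[0,1,1,1,0],[0,0,1,1,1],
                 [0,0,1,1,0],[0,1,1,0,0]])
k2 = np.array([[1,0,1],[0,1,0],[1,0,1]])
print(conv2d(img2, k2))      # 3x3 matrix matching the answer to Question 2
```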

