UNIT-V
REINFORCEMENT LEARNING: Introduction to Reinforcement Learning, Learning Task, Example of
Reinforcement Learning in Practice, Learning Models for Reinforcement (Markov Decision Process,
Q-Learning - Q-Learning function, Q-Learning Algorithm), Application of Reinforcement Learning,
Introduction to Deep Q-Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation,
Genetic Programming, Models of Evolution and Learning, Applications.
We model an environment after the problem statement. The model interacts with this environment and
comes up with solutions all on its own, without human interference. To push it in the right direction, we
simply give it a positive reward if it performs an action that brings it closer to its goal or a negative
reward if it goes away from its goal.
To understand reinforcement learning better, consider a dog that we have to house-train. Here, the dog
is the agent and the house is the environment.
We can get the dog to perform various actions by offering incentives such as dog biscuits as a reward.
The dog will follow a policy to maximize its reward and hence will follow every command, and it might
even learn a new action, like begging, all by itself.
The dog will also want to run around, play and explore its environment. This quality of a model is
called exploration. The tendency of the dog to maximize rewards is called exploitation. There is always a
tradeoff between exploration and exploitation, as exploratory actions may lead to lower rewards.
Agent: The agent is the model that is being trained via reinforcement learning.
Environment: The training situation that the model must optimize for is called its environment.
Reward: To help the model move in the right direction, it is given rewards (points) to
appraise its actions.
There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function, which is the
maximum value attainable at a state under any policy. The agent therefore learns the long-term
return expected at any state s under policy π.
2. Policy-based:
The policy-based approach finds the optimal policy for the maximum future reward
without using the value function. In this approach, the agent tries to apply a policy
such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) in any given state.
o Stochastic: In this policy, a probability distribution determines the action produced.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no single
solution or algorithm for this approach because the model representation is different
for each environment.
Positive Reinforcement:
Positive reinforcement means adding something to increase the likelihood that the expected
behavior will occur again. It impacts the behavior of the agent positively and increases the
strength of that behavior.
This type of reinforcement can sustain changes for a long time, but too much positive reinforcement
may lead to an overload of states, which can diminish the results.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it increases the
likelihood that a specific behavior will occur again by removing or avoiding a negative condition.
It can be more effective than positive reinforcement, depending on the situation and behavior, but it
provides reinforcement only to meet the minimum required behavior.
1) Policy: A policy defines the way the agent behaves at a given time. It is a mapping from the
perceived states of the environment to the actions to be taken in those states, and it is the core
element of a reinforcement learning agent.
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the
environment sends an immediate signal to the learning agent, and this signal is known as a reward
signal. These rewards are given according to the good and bad actions taken by the agent. The agent's
main objective is to maximize the total reward it receives for good actions. The reward signal can
change the policy; for example, if an action selected by the agent leads to a low reward, the policy may
change to select other actions in the future.
3) Value Function: The value function gives information about how good a situation or action is
and how much reward an agent can expect. A reward indicates the immediate desirability of each good or
bad action, whereas the value function specifies which states and actions are good in the long run. The value
function depends on the reward because, without reward, there could be no value. The goal of estimating
values is to achieve more rewards.
4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will
behave. For example, if a state and an action are given, the model can predict the next state and reward.
As the goal of a passive agent is to evaluate how good the given policy is, the agent needs to learn the
expected utility Uπ(s) for each state s. This can be done in three ways: direct utility estimation,
adaptive dynamic programming (ADP), and temporal difference (TD) learning.
ADP (Adaptive Dynamic Programming) is a smarter method than Direct Utility Estimation, as it runs trials
to learn the model of the environment and estimates the utility of a state as the sum of the reward for
being in that state and the expected discounted utility of the next state:

Uπ(s) = R(s) + γ Σs' P(s'|s, π(s)) Uπ(s')

where R(s) = reward for being in state s, P(s'|s, π(s)) = transition model, γ = discount factor, and
Uπ(s') = utility of being in state s'.
It can be solved using the value-iteration algorithm. The algorithm converges fast but can become quite
costly to compute for large state spaces. ADP is a model-based approach and requires the transition
model of the environment. A model-free approach is Temporal Difference (TD) Learning.
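As a concrete illustration, here is a minimal Python sketch that evaluates the equation above by repeated sweeps over a toy three-state problem. The states, rewards, and transition model are made-up placeholders, not part of the original material.

# A minimal sketch of policy evaluation on a toy 3-state chain (illustrative values).
GAMMA = 0.9
STATES = ["s0", "s1", "s2"]
R = {"s0": -0.04, "s1": -0.04, "s2": 1.0}          # reward for being in each state
# P[s] = list of (next_state, probability) pairs under the fixed policy pi(s)
P = {
    "s0": [("s1", 0.8), ("s0", 0.2)],
    "s1": [("s2", 0.8), ("s0", 0.2)],
    "s2": [("s2", 1.0)],                            # absorbing state
}

U = {s: 0.0 for s in STATES}
for _ in range(100):                                # sweep until the values stabilise
    U = {s: R[s] + GAMMA * sum(p * U[s2] for s2, p in P[s]) for s in STATES}

print(U)   # estimated utility U_pi(s) of each state under the fixed policy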
Active Learning
As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected
utility of each state and update its policy accordingly. This can be done using a passive ADP agent and
then, using value or policy iteration, it can learn optimal actions. But this approach results in a greedy
agent. Hence, we use an approach that gives higher weights to unexplored actions and lower
weights to actions with lower utilities:

U+(s) ← R(s) + γ maxa f( Σs' P(s'|s, a) U+(s'), N(s, a) )

where f(u, n) is the exploration function, which increases with the expected value u and decreases
with the number of tries n, for example f(u, n) = R+ if n < Ne, and u otherwise.
R+ is an optimistic estimate of the reward and Ne is the number of times we want an agent to be forced to pick
an action in every state. The exploration function converts a passive agent into an active one.
2. Q-Learning
Q-learning is a TD learning method which does not require the agent to learn the transition
model; instead, it learns a Q-value function Q(s, a), the expected utility of taking action a in state s.
The Markov Decision Process (MDP) is the framework used in reinforcement learning to map a current
state to an action, where the agent continuously interacts with the environment to produce new
solutions and receive rewards.
The Markov property states that the future is independent of the past, given the present. This
means that, given the present state, the next state can be predicted without needing
the history of previous states.
This property is used by the Markov Decision Process to choose the next action in our machine learning
model. A Markov Decision Process (MDP) uses the following ingredients (sketched in code below):
o A set of states S
o A set of actions A
o A transition model P(s'|s, a)
o A reward function R
o A discount factor γ
o A policy π (the solution of the MDP), mapping states to actions
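A minimal sketch, assuming a toy three-state environment (all names and numbers are illustrative, not from the original notes), showing how these ingredients can be represented and how the Markov property lets the next state be sampled from the current state and action alone:

import random

# Illustrative MDP ingredients; the transition model depends only on (state, action).
mdp = {
    "states": ["A", "B", "C"],
    "actions": ["left", "right"],
    # P[(state, action)] -> list of (next_state, probability)
    "P": {
        ("A", "right"): [("B", 1.0)],
        ("A", "left"):  [("A", 1.0)],
        ("B", "right"): [("C", 0.7), ("A", 0.3)],
        ("B", "left"):  [("A", 1.0)],
        ("C", "right"): [("C", 1.0)],
        ("C", "left"):  [("B", 1.0)],
    },
    "R": {"A": 0.0, "B": 0.0, "C": 1.0},   # reward for being in each state
    "gamma": 0.9,
}

def step(state, action):
    """Sample the next state using only (state, action) -- the Markov property."""
    next_states, probs = zip(*mdp["P"][(state, action)])
    return random.choices(next_states, probs)[0]

print(step("B", "right"))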
Applications of Reinforcement Learning include games and chemistry (for example, optimizing chemical reactions).
Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The most
commonly used algorithms are:
Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal
difference learning methods are a way of comparing temporally successive predictions.
o It learns the value function Q(s, a), which indicates how good it is to take action "a" at a particular
state "s".
o After each step, the Q-value is updated using the rule
Q(s, a) ← Q(s, a) + α [r + γ maxa' Q(s', a') − Q(s, a)],
where α is the learning rate and γ the discount factor.
o The working of Q-learning can be summarized as: initialize the Q-table, choose an action (for
example, ε-greedy), perform it, observe the reward and the next state, update the Q-value, and
repeat until the values converge, as in the sketch below.
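A minimal tabular Q-learning sketch in Python; the environment object env and its reset/step/actions interface are assumed placeholders (not a real library API), and the hyperparameter values are illustrative.

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def q_learning(env, episodes=500):
    Q = {}                                            # Q[(state, action)] -> value
    def q(s, a): return Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: q(s, x))
            s2, r, done = env.step(a)
            # off-policy update: uses the max over next actions
            best_next = max(q(s2, x) for x in env.actions)
            Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * best_next - q(s, a))
            s = s2
    return Q

The key point is the max over next actions in the update, which is what makes Q-learning off-policy.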
SARSA stands for State-Action-Reward-State-Action; it is an on-policy temporal difference
learning method. An on-policy control method selects the action for each state while learning,
using a specific policy.
The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all
state-action pairs (s, a).
The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA
does not use the maximum reward of the next state when updating the Q-value in the table.
In SARSA, the new action and reward are selected using the same policy that determined the
original action.
SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed after taking the original action
s' and a': new state-action pair.
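For contrast with the Q-learning sketch above, here is a minimal SARSA sketch; again env and its interface are assumed placeholders and the hyperparameters are illustrative. The difference is that the update uses Q(s', a') of the action actually chosen by the policy, not maxa' Q(s', a').

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def epsilon_greedy(Q, s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa(env, episodes=500):
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, env.actions)         # action chosen by the policy
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, env.actions)   # next action from the SAME policy
            # on-policy update: uses Q(s', a') of the action actually taken
            Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (
                r + GAMMA * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0))
            s, a = s2, a2
    return Q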
1. Data Preprocessing
2. Exploratory data analysis
3. Feature engineering/selection
4. Regression
5. Classification
6. Clustering
7. Multivariate Querying
8. Density estimation
9. Dimension reduction
10. Model/algorithm selection
11.Testing and matching
Genetic Algorithm
A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's theory of evolution
in Nature."
Genetic Algorithms are being widely used in different real-world applications, for example, Designing
electronic circuits, code-breaking, image processing, and artificial creativity.
Basic terminologies
o Population: Population is the subset of all possible or probable solutions, which can solve the
given problem.
o Chromosomes: A chromosome is one of the solutions in the population for the given problem,
and a collection of genes makes up a chromosome.
o Gene: A gene is an element of a chromosome; a chromosome is divided into genes.
o Allele: Allele is the value provided to the gene within a particular chromosome.
o Fitness Function: The fitness function is used to determine the individual's fitness level in the
population. It means the ability of an individual to compete with other individuals. In every
iteration, individuals are evaluated based on their fitness function.
o Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring
better than the parents. Genetic operators play a role in changing the genetic composition of
the next generation.
o Selection: Selection is the process of choosing, based on fitness, the individuals that will pass
their genes on to the next generation.
So, now we can define a genetic algorithm as a heuristic search algorithm used to solve optimization
problems. It is a subset of evolutionary algorithms used in computing. A genetic algorithm uses the
concepts of genetics and natural selection to solve optimization problems.
Working Process
The genetic algorithm works on the evolutionary generational cycle to generate high-quality solutions.
These algorithms use different operations that either enhance or replace the population to give an
improved fit solution.
It basically involves five phases to solve complex optimization problems, which are given below:
o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination
1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, which is called the population.
Here each individual is a solution to the given problem. An individual is characterized by a
set of parameters called genes. Genes are joined into a string to form a chromosome, which represents
a solution to the problem. One of the most popular techniques for initialization is the use of random
binary strings.
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, that is, the ability of an individual to
compete with other individuals. In every iteration, individuals are evaluated based on their fitness
function. The fitness function provides a fitness score to each individual. This score determines
the probability of being selected for reproduction: the higher the fitness score, the greater the chances of
being selected for reproduction.
3. Selection
The selection phase involves selecting individuals for the reproduction of offspring. The
selected individuals are arranged in pairs of two for reproduction, and these individuals then
transfer their genes to the next generation.
4. Reproduction
After the selection process, the creation of offspring occurs in the reproduction step. In this step, the
genetic algorithm uses two variation operators that are applied to the parent population. The two
operators involved in the reproduction phase are given below:
o Crossover: Crossover plays the most significant role in the reproduction phase of the genetic
algorithm. In this process, a crossover point is selected at random within the genes. The
crossover operator then swaps the genetic information of two parents from the current generation
to produce a new individual representing the offspring.
o Mutation: The mutation operator inserts random genes into the offspring (for example, by flipping
some bits of the chromosome) with a low probability, in order to maintain diversity in the population
and prevent premature convergence.
5. Termination
After the reproduction phase, a stopping criterion is applied as a basis for termination. The algorithm
terminates once the threshold fitness solution is reached, and the best solution in the population is
reported as the final solution.
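A minimal sketch of this whole cycle, applied to the toy "OneMax" problem of maximizing the number of 1s in a binary string; the problem, parameter values, and helper names are illustrative choices, not from the original notes.

import random

LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 100, 0.01

def fitness(chrom):                       # fitness assignment: count the 1s
    return sum(chrom)

def select(pop):                          # tournament selection of one parent
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):                    # single-point crossover
    point = random.randint(1, LENGTH - 1)
    return p1[:point] + p2[point:]

def mutate(chrom):                        # flip each gene with small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

# 1. Initialization: random binary strings
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):            # 5. Termination: fixed generation budget
    # 3./4. Selection and reproduction build the next generation
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
    if max(fitness(c) for c in population) == LENGTH:
        break                             # threshold fitness reached

print(max(population, key=fitness))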
Gradient Descent
Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th century (1847). Gradient
descent is one of the most commonly used iterative optimization algorithms in machine learning; it is
used to train machine learning and deep learning models by minimizing a cost function.
The best way to define the local minimum or local maximum of a function using gradient descent is as
follows:
o If we move towards a negative gradient or away from the gradient of the function at the current
point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.
Cost-function
The cost function is defined as the measurement of the difference or error between the actual values and
the expected values at the current position, expressed as a single real number.
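A minimal gradient-descent sketch on the made-up one-parameter cost function J(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and starting point are illustrative.

# Gradient descent on the toy cost J(w) = (w - 3)^2.
learning_rate = 0.1
w = 0.0                                   # initial guess

for step in range(100):
    gradient = 2 * (w - 3)                # gradient of the cost at the current point
    w = w - learning_rate * gradient      # move against (away from) the gradient

print(w)   # converges towards w = 3, where the cost is lowest

Moving against the gradient drives w towards 3, the minimum of the cost, matching the first bullet point above.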
Occam’s razor is a law of parsimony popularly stated (in William of Occam’s words) as “Plurality must never be
posited without necessity”. Alternatively, as a heuristic, it can be viewed as: when there are multiple
hypotheses that solve a problem, the simpler one is to be preferred. It is not clear to whom this
principle can be conclusively attributed, but William of Occam’s (c. 1287 – 1347) preference for
simplicity is well documented. Hence this principle goes by the name “Occam’s razor”. It often
means cutting off or shaving away other possibilities or explanations, hence the “razor” appended to the
name of the principle. It should be noted that these explanations or hypotheses should lead to the
same result.
Relevance of Occam’s razor.
There are many instances that favor a simpler approach, either as an inductive bias or as a constraint to
begin with. Some of them are:
Studies of learning and development have suggested that preschoolers are sensitive to simpler
explanations during their initial years of learning and development.
A preference for simpler approaches and explanations to achieve the same goal is seen in various
facets of science; for instance, the parsimony principle applied to the understanding of evolution.
In theology, ontology, epistemology, etc., this view of parsimony is used to derive various
conclusions.
Variants of Occam’s razor are used in knowledge discovery.
The paired sample t-test
The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used
to determine whether the mean difference between two sets of observations is zero. In a paired
sample t-test, each subject or entity is measured twice, resulting in pairs of observations. Common
applications of the paired sample t-test include case-control studies or repeated-measures designs.
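A minimal sketch of running a paired-sample t-test in Python with SciPy; the before/after measurements are made-up numbers standing in for a repeated-measures design.

from scipy import stats

# Two measurements per subject (illustrative values), e.g. before and after treatment.
before = [72, 88, 64, 74, 67, 70, 78, 81]
after  = [70, 85, 61, 73, 65, 66, 77, 79]

t_stat, p_value = stats.ttest_rel(before, after)   # tests H0: mean difference = 0
print(t_stat, p_value)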
Lazy learner: A lazy learner simply stores the training data and delays generalization until a query is made,
comparing each new instance against the stored examples at prediction time (k-Nearest Neighbours is a typical
example), in contrast to eager learners, which build a general model during training.
Suppose we have a 6x6 matrix which represents our image and a 3x3 matrix which represents the kernel.
To perform convolution, we overlap the kernel on the image matrix, multiply each element of the kernel with
the corresponding element of the image patch beneath it, and sum the products to get one output value.
If stride = 1, the kernel shifts by one unit at a time to traverse the whole input image.
Question 1: Convolve the 7x7 input image below with a 3x3 kernel at stride 1.
Input image (7x7):
1 0 0 1 1 0 1
0 0 1 1 1 0 1
1 1 1 0 1 0 1
1 1 0 1 0 0 0
1 0 1 0 1 1 0
0 1 1 0 1 1 1
0 1 1 1 1 1 1
Kernel size: 3x3, so n = 7, f = 3 and the output size is n - f + 1 = 7 - 3 + 1 = 5 (a 5x5 output).
Answer (5x5):
4 4 3 3 3
4 2 3 3 2
3 3 3 1 3
3 4 2 4 3
4 3 4 4 5
Question 2: Convolve the 5x5 input image below with the given 3x3 kernel at stride 1.
Input image (5x5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
Kernel (3x3):
1 0 1
0 1 0
1 0 1
Stride = 1, so the output size is n - f + 1 = 5 - 3 + 1 = 3 (a 3x3 output).
Answer (3x3):
4 3 4
2 4 3
2 3 4
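As a check, here is a short NumPy sketch that reproduces Question 2 by sliding the 3x3 kernel over the 5x5 image at stride 1; it performs the element-wise multiply-and-sum operation described above.

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

n, f = image.shape[0], kernel.shape[0]
out = np.zeros((n - f + 1, n - f + 1), dtype=int)   # output size = n - f + 1 = 3
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # multiply the 3x3 patch by the kernel element-wise and sum the products
        out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)

print(out)   # [[4 3 4], [2 4 3], [2 3 4]] -- matches the answer above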