Unit-5: Genetic Algorithms, Reinforcement Learning, Markov Decision Processes and Q-Learning


Unit-5

Genetic Algorithms
A genetic algorithm is a search heuristic that is inspired by Charles
Darwin’s theory of natural evolution.
This algorithm reflects the process of natural selection where the fittest
individuals are selected in order to produce offspring of the next
generation.
Notion of Natural Selection
The process of natural selection starts with the selection of the fittest
individuals from a population.
They produce offspring which inherit the characteristics of the parents
and will be added to the next generation.
If the parents have better fitness, their offspring tend to be fitter than the
parents and have a better chance of surviving. This process keeps iterating,
and in the end a generation of the fittest individuals is found.
This notion can be applied for a search problem.
Basic Terminologies
Population − A subset of all the possible
(encoded) solutions to the given problem. A GA
population is analogous to a human population, except
that its members are candidate solutions rather than
human beings.
Chromosomes − A chromosome is one such solution
to the given problem.
Gene − A gene is one element position of a
chromosome.
Allele − It is the value a gene takes for a particular
chromosome.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves
the generation using the following operators:
1) Selection Operator: The idea is to give preference to
the individuals with good fitness scores and allow them to
pass their genes to successive generations.
2) Crossover Operator: This represents mating between
individuals. Two individuals are selected using the selection
operator, and crossover sites are chosen randomly. The genes at
these crossover sites are then exchanged, creating a completely
new individual (offspring).
3) Mutation Operator: The key idea is to insert random
genes into offspring to maintain diversity in the
population and avoid premature convergence. A minimal
sketch of crossover and mutation is shown below.
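The following is an illustrative sketch (not from the original slides) of single-point crossover and bit-flip mutation on bit-string chromosomes; the chromosome length, crossover point, and mutation rate are arbitrary choices for the example.

import random

def crossover(parent1, parent2):
    # Single-point crossover: pick a random site and swap the tails
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chromosome, rate=0.01):
    # Bit-flip mutation: flip each gene with a small probability
    return [1 - g if random.random() < rate else g for g in chromosome]

p1 = [1, 1, 1, 1, 1, 1]
p2 = [0, 0, 0, 0, 0, 0]
c1, c2 = crossover(p1, p2)
print(mutate(c1), mutate(c2))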
GA Algorithm
1) Randomly initialize a population P
2) Determine fitness of population
3) Until convergence repeat:
a) Select parents from population
b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
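The loop above can be sketched in a few lines of Python. This is illustrative only: the fitness function (count of 1-bits, i.e. OneMax), population size, chromosome length, and mutation rate are assumptions made for the example.

import random

def fitness(chrom):
    # Example fitness (assumed): number of 1-bits in the chromosome
    return sum(chrom)

def select(pop):
    # Tournament selection: prefer the fitter of two random individuals
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(pop_size=20, length=10, generations=50, mutation_rate=0.05):
    # 1) Randomly initialize a population P
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                        # 3) repeat until budget is used
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(pop), select(pop)           # a) select parents
            point = random.randint(1, length - 1)       # b) crossover
            child = p1[:point] + p2[point:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                    # c) mutation
            new_pop.append(child)
        pop = new_pop                                   # d) fitness recomputed on use
    return max(pop, key=fitness)

print(evolve())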
Difference between Genetic Algorithms and Traditional
Algorithms
A search space is the set of all possible solutions to the problem. A
traditional algorithm maintains only a single candidate solution at a time,
whereas a genetic algorithm works with a whole population of candidate
solutions spread over the search space.
Traditional algorithms need more information in order to perform a
search, whereas genetic algorithms need only an objective function
to calculate the fitness of an individual.
Traditional algorithms cannot easily be run in parallel, whereas genetic
algorithms can, because the fitness of each individual is calculated
independently.
Traditional algorithms produce only one result in the end,
whereas genetic algorithms can return multiple good solutions
from different generations.
Reinforcement Learning
Reinforcement Learning is a feedback-based ML technique
in which an agent learns to behave in an environment by
performing the actions and seeing the results of actions.
For each good action, the agent gets positive feedback, and
for each bad action, the agent gets negative feedback or
penalty.
In Reinforcement Learning, the agent learns automatically
from feedback, without any labeled data,
unlike supervised learning.
RL solves a specific type of problem where decision making
is sequential, and the goal is long-term, such as game-
playing, robotics, etc.
Elements of Reinforcement Learning
Agent: An entity that can perceive/explore the environment
and act upon it.
Environment: The situation in which the agent is present or by
which it is surrounded. In RL we generally assume a stochastic
environment, which means it is random in nature.
Action: Actions are the moves taken by an agent within the
environment.
State: State is a situation returned by the environment after
each action taken by the agent.
Reward: Feedback returned to the agent from the
environment to evaluate the agent's action.
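These elements fit together in the basic agent-environment interaction loop. The sketch below is illustrative only: the states, actions, transition rule, and reward rule are invented for the example.

import random

def step(state, action):
    # A toy stochastic environment: random transition, simple reward rule
    next_state = random.choice(["s1", "s2"])
    reward = 1 if (state == "s2" and action == "a6") else 0
    return next_state, reward

def policy(state):
    # The agent's strategy: map each state to an action
    return {"s1": "a1", "s2": "a6"}[state]

state = "s1"
for _ in range(5):                            # agent-environment loop
    action = policy(state)                    # agent chooses an action
    state, reward = step(state, action)       # environment returns new state and reward
    print(action, state, reward)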
Elements of Reinforcement Learning
Policy: A policy is the strategy the agent applies to choose its next
action based on the current state.
A policy π is a function that takes a state "S" as input (S2 in
our example) and returns an action "a" (a6 in our example),
i.e., π(s) → a
or, in our example, π(S2) → a6.
Value Function : Whereas the reward signal indicates what’s
good for the agent at the moment (e.g. taking action “a6”
right now when the agent is on S2), a value function specifies
what is good in the long run.
The value of a state is the total amount of reward an agent
can expect to accumulate over the future, starting from that state.
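As a small illustration: a policy can be written as a simple mapping, and a state value as a discounted sum of future rewards. The policy entries other than π(S2) → a6 and the reward sequence are made-up values; γ = 0.9 matches the discount factor used later in this unit.

GAMMA = 0.9                                 # discount factor (same value as used later)

# A policy maps states to actions, e.g. pi(S2) -> a6
pi = {"S1": "a1", "S2": "a6", "S3": "a2"}   # entries other than S2 are hypothetical
print(pi["S2"])                             # -> a6

# Value of a state: discounted sum of future rewards from that state
rewards = [0, 0, 1]                         # hypothetical rewards at t+1, t+2, t+3
value = sum(GAMMA ** t * r for t, r in enumerate(rewards))
print(value)                                # 0.81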
How does Reinforcement Learning Work?
Let's take an example of a maze environment that the
agent needs to explore. Consider the below image:
In the above image, the agent is at the very first block
of the maze. The maze consists of an S6 block,
which is a wall, S8, a fire pit, and S4, a diamond block.
If the agent reaches the S4 block, it gets a +1 reward; if it
reaches the fire pit, it gets a -1 reward. It can take four
actions: move up, move down, move left, and move right.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will
then get the +1 reward.
The agent will try to remember the preceding steps it has
taken to reach the final step. To memorize the steps, it assigns a
value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by
assigning the value 1 to each previous block.
But what will the agent do if it starts moving from a block
that has blocks with value 1 on both sides?
Consider the below diagram:

Hence, to solve this problem, we will use the Bellman
equation, which is the main concept behind reinforcement learning.
Bellman Equation
The Bellman equation was introduced by the
mathematician Richard Ernest Bellman in 1953.
It is associated with dynamic programming and is used to
calculate the value of a decision problem at a certain point in
terms of the values of the states that follow it.
Key elements used in the Bellman equation are:
The action performed by the agent is referred to as "a".
The state the agent is in is "s", and the state reached by performing the
action is "s'".
The reward/feedback obtained for each good and bad action is
"R."
A discount factor is Gamma "γ."
The Bellman equation can be written as:
V(s) = max_a [ R(s,a) + γ V(s') ]   (the maximum is taken over the possible actions a)
Where,
V(s)= value calculated at a particular point.
R(s,a) = Reward at a particular state s by performing an action.
γ = Discount factor
If the discount rate is close to 0, future rewards won’t count for
much in comparison to immediate rewards.
In contrast, if the discount rate is close to 1, rewards that are far
in the future will be almost as important as immediate rewards.
V(s') = the value of the next state s'.
So now, using the Bellman equation, we will find the value at each
state of the given environment.
We will start from the block next to the target block.
For the 1st block: V(s3) = max[R(s,a) + γV(s')]; here V(s') = 0,
because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For the 2nd block:
V(s2) = max[R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1,
and R(s,a) = 0, because there is no reward at this state.
V(s2) = max[0.9 × 1] => V(s2) = max[0.9] => V(s2) = 0.9
For the 3rd block:
V(s1) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and
R(s,a) = 0, because there is no reward at this state either.
V(s1) = max[0.9 × 0.9] => V(s1) = max[0.81] => V(s1) = 0.81
For the 4th block:
V(s5) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and
R(s,a) = 0, because there is no reward at this state either.
V(s5) = max[0.9 × 0.81] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For the 5th block:
V(s9) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and
R(s,a) = 0, because there is no reward at this state either.
V(s9) = max[0.9 × 0.73] => V(s9) = max[0.657] => V(s9) ≈ 0.66
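The same backward calculation can be reproduced with a few lines of Python. This is illustrative only; the path, rewards, and discount factor are taken from the example above, and the exact values differ slightly from the rounded ones on the slides.

GAMMA = 0.9

# Backward sweep along the path S3 <- S2 <- S1 <- S5 <- S9.
# R is the immediate reward for the best action in each state (1 only next to the diamond).
path = ["s3", "s2", "s1", "s5", "s9"]
R = {"s3": 1, "s2": 0, "s1": 0, "s5": 0, "s9": 0}

V = {}
next_value = 0.0                      # no further state after reaching the diamond
for s in path:
    V[s] = R[s] + GAMMA * next_value  # Bellman backup: V(s) = R(s,a) + γ V(s')
    next_value = V[s]

# Approximately: s3=1.0, s2=0.9, s1=0.81, s5=0.729, s9=0.6561
# (the slides round the last two to 0.73 and 0.66)
print(V)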
Learning Models in RL
Markov Decision Process
Q-Learning Algorithm
Deep Q Learning
Markov Decision Process
An MDP is a mathematical framework used to model most Reinforcement
Learning problems with discrete actions.
With the Markov Decision Process, an agent can arrive at an
optimal policy for maximum rewards over time.
The Markov Property
The Markov Property states that :
“Future is Independent of the past given the present”
E.g : In a Chess game, the players only focus on the current
state and do not need to remember past actions or states.
Mathematically, we can express this statement as:
P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], …, S[t] ]
where S[t] denotes the current state of the agent and S[t+1] denotes the next state.
Markov Property
It says that if the agent is present in the current
state S1, performs an action a1, and moves to the
state S2, then the state transition from S1 to S2
depends only on the current state; future actions
and states do not depend on past actions,
rewards, or states.
Or, in other words, as per the Markov Property, the
current state transition does not depend on any past
action or state.
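A minimal sketch of how such a transition model P(s' | s, a) might be represented and sampled; the state names, actions, and probabilities here are invented for illustration.

import random

# Transition probabilities P(s' | s, a); values are illustrative only
P = {
    ("s1", "a1"): {"s2": 0.8, "s1": 0.2},
    ("s2", "a6"): {"s3": 1.0},
}

def next_state(s, a):
    # The next state depends only on the current state and action (Markov property)
    states, probs = zip(*P[(s, a)].items())
    return random.choices(states, weights=probs)[0]

print(next_state("s1", "a1"))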
Q-Learning Algorithm
Q-learning is a popular model-free reinforcement learning
algorithm based on the Bellman equation.
Model-free means that the agent does not learn or use a model of the
environment's dynamics (its transition probabilities); instead, it learns
directly by trial and error, from the rewards it receives while
interacting with the environment.
The main objective of Q-learning is to learn a policy that tells
the agent which action should be taken in which circumstances to
maximize the reward.
The goal of the agent in Q-learning is to maximize the value of Q.
Q stands for quality in Q-learning, which means it specifies the
quality of an action taken by the agent.
Q-Learning algorithm
Q-Learning algorithm works like this:
Initialize all Q-values, e.g., with zeros
Choose an action a in the current state s based on the
current best Q-value
Perform this action a and observe the outcome (new
state s’).
Measure the reward R after this action
Update Q(s, a) with an update rule derived from
the Bellman equation.
Repeat steps 2 to 5 until the learning no longer improves
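A compact sketch of these steps follows. The hyper-parameter values, the state/action names in the usage lines, and the epsilon-greedy action choice are assumptions made for illustration.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration (assumed)
Q = defaultdict(float)                  # 1) initialize all Q-values to zero

def choose_action(state, actions):
    # 2) epsilon-greedy: usually pick the current best Q-value, sometimes explore
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    # 5) Q-learning update derived from the Bellman equation
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example single step (state names, actions, and reward are hypothetical):
a = choose_action("s1", ["up", "down"])
update("s1", a, 0, "s2", ["up", "down"])
print(dict(Q))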
Q-Learning Algorithm
An example of Q-learning is an Advertisement
recommendation system.
Flow Chart for Q-Learning
CONTD…

In the above image, we can see an agent that has three value
options: V(s1), V(s2), and V(s3). As this is an MDP, the agent only
cares about the current state and the next state. The agent can go
in any direction (up, left, or right), so it needs to decide where
to go to follow the optimal path. Here the agent moves on a
probability basis and changes state. But if we want exact moves,
we need to make some changes in terms of the Q-value. Consider
the image on the next slide:
CONTD…

Q represents the quality of the actions at each state. So instead
of using a value at each state, we will use a pair of state and
action, i.e., Q(s, a). The Q-value specifies which action is more
lucrative than the others, and according to the best Q-value, the
agent takes its next move. The Bellman equation can be used
to derive the Q-value.
On performing an action, the agent gets a reward R(s, a) and
ends up in a certain next state s', so the Q-value equation
becomes:
Q(s, a) = R(s, a) + γ max_a' Q(s', a')
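As a quick numeric check of this equation (the immediate reward and the Q-values of the next state are made-up numbers; γ = 0.9 as in the earlier example):

GAMMA = 0.9
R = 0                                 # immediate reward for (s, a), assumed
Q_next = {"a1": 0.5, "a2": 1.0}       # hypothetical Q-values of the next state s'
Q_sa = R + GAMMA * max(Q_next.values())
print(Q_sa)                           # 0.9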
Deep Q-Learning
The Q-Learning approach is practical for very small
environments and quickly loses its feasibility when the
number of states and actions in the environment increases.
The solution to this leads us to Deep Q-
Learning, which uses a deep neural network to approximate
the Q-values.
Deep Q Learning uses the Q-learning idea and takes it one
step further.
Instead of using a Q-table, we use a Neural Network that
takes a state and approximates the Q-values for each action
based on that state.
The basic working step for Deep Q-Learning is that the
initial state is fed into the neural network and it returns the
Q-value of all possible actions as an output.
The difference between Q-Learning and Deep Q-Learning
can be illustrated as follows:
Deep Q-Learning
We do this because a classic Q-table is not very scalable.
It might work for simple games, but in a more complex game
with dozens of possible actions and game states, the Q-table
soon becomes too large to store and update efficiently.
So now we use a Deep Neural Network that gets the state as
input, and produces different Q values for each action.
Then again we choose the action with the highest Q-value.
The learning process is still the same with the iterative update
approach, but instead of updating the Q-Table, here we update
the weights in the neural network so that the outputs are
improved.
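A minimal sketch of such a Q-network, assuming PyTorch is available; the state dimension, number of actions, and hidden-layer size are arbitrary choices for the example.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2           # assumed sizes for the example

# The network takes a state and outputs one Q-value per action
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

state = torch.rand(1, STATE_DIM)      # a dummy state
q_values = q_net(state)               # Q-value for each action in this state
action = q_values.argmax(dim=1)       # pick the action with the highest Q-value
print(q_values, action)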
THANK YOU
