Unit-5 Genetic Reinforcement Markov Q-Learning
Genetic Algorithms
A genetic algorithm is a search heuristic inspired by Charles
Darwin’s theory of evolution by natural selection.
This algorithm reflects the process of natural selection where the fittest
individuals are selected in order to produce offspring of the next
generation.
Notion of Natural Selection
The process of natural selection starts with the selection of fittest
individuals from a population.
They produce offspring which inherit the characteristics of the parents
and will be added to the next generation.
If parents have better fitness, their offspring will be better than parents
and have a better chance at surviving. This process keeps on iterating
and at the end, a generation with the fittest individuals will be found.
This notion can be applied to a search problem.
Basic Terminologies
Population − It is a subset of all the possible
(encoded) solutions to the given problem. The
population for a GA is analogous to a population of
human beings, except that instead of human beings we
have candidate solutions.
Chromosomes − A chromosome is one such solution
to the given problem.
Gene − A gene is one element position of a
chromosome.
Allele − It is the value a gene takes for a particular
chromosome.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves
the generation using the following operators –
1) Selection Operator: The idea is to give preference to
the individuals with good fitness scores and allow them to
pass their genes to successive generations.
2) Crossover Operator: This represents mating between
individuals. Two individuals are selected using the selection
operator, and crossover sites are chosen randomly. The
genes at these crossover sites are then exchanged, creating
completely new individuals (offspring). For example, see the
sketch after the mutation operator below.
3) Mutation Operator: The key idea is to insert random
genes into the offspring to maintain diversity in the
population and avoid premature convergence. For example,
see the sketch below.
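As an illustration (not from the slides), here is a minimal sketch of single-point crossover and per-gene mutation on bit-string chromosomes; the bit-string encoding and the mutation rate of 0.05 are assumptions made for the example:

import random

def single_point_crossover(parent1, parent2):
    # Pick a random crossover site so that both parents contribute genes
    site = random.randint(1, len(parent1) - 1)
    # Exchange the genes beyond the crossover site to create two offspring
    return parent1[:site] + parent2[site:], parent2[:site] + parent1[site:]

def mutate(chromosome, mutation_rate=0.05):
    # Flip each gene with a small probability to maintain diversity in the population
    return [1 - gene if random.random() < mutation_rate else gene
            for gene in chromosome]

# Example: crossover of two 6-gene chromosomes, then mutation of one child
child1, child2 = single_point_crossover([1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0])
mutated_child = mutate(child1)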
GA Algorithm
1) Randomly initialize population p
2) Determine fitness of population
3) Until convergence repeat:
a) Select parents from population
b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
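Putting these steps together, a minimal sketch of the loop could look like the following; the fitness function (counting 1s, the classic OneMax toy problem), the population size, and the fixed number of generations used as a stand-in for convergence are illustrative assumptions, and single_point_crossover and mutate are the helpers sketched above:

import random

def fitness(chromosome):
    # Illustrative fitness: the number of 1s in the bit string (OneMax toy problem)
    return sum(chromosome)

def run_ga(pop_size=20, genes=10, generations=50, mutation_rate=0.05):
    # 1) Randomly initialize population p
    population = [[random.randint(0, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    # 3) Repeat for a fixed number of generations (a simple stand-in for convergence)
    for _ in range(generations):
        # a) Select parents: keep the fitter half of the current population
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        # b) Crossover to generate the new population
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            children.extend(single_point_crossover(p1, p2))
        # c) Perform mutation on the new population
        population = [mutate(child, mutation_rate) for child in children[:pop_size]]
    # d) The fitness of the final population determines the best individual returned
    return max(population, key=fitness)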
Difference between Genetic Algorithms and Traditional
Algorithms
A search space is the set of all possible solutions to the problem. A
traditional algorithm maintains only one solution, whereas a genetic
algorithm works with several solutions in the search space at the
same time.
Traditional algorithms need more information in order to perform a
search, whereas genetic algorithms need only one objective function
to calculate the fitness of an individual.
Traditional algorithms cannot work in parallel, whereas genetic
algorithms can (the fitness of each individual is calculated
independently).
Traditional algorithms can only generate one result in the end,
whereas genetic algorithms can generate multiple optimal results
from different generations.
Reinforcement Learning
Reinforcement Learning is a feedback-based ML technique
in which an agent learns to behave in an environment by
performing the actions and seeing the results of actions.
For each good action, the agent gets positive feedback, and
for each bad action, the agent gets negative feedback or
penalty.
In Reinforcement Learning, the agent learns automatically
from feedback, without any labeled data,
unlike supervised learning.
RL solves a specific type of problem where decision making
is sequential, and the goal is long-term, such as game-
playing, robotics, etc.
Elements of Reinforcement Learning
Agent: An entity that can perceive/explore the environment
and act upon it.
Environment: The situation or surroundings in which the
agent is present. In RL, we assume a stochastic
environment, which means it is random in nature.
Action: Actions are the moves taken by an agent within the
environment.
State: State is a situation returned by the environment after
each action taken by the agent.
Reward : A feedback returned to the agent from the
environment to evaluate the action of the agent.
Elements of Reinforcement Learning
Policy: Policy is a strategy applied by the agent for the next
action based on the current state.
A policy π is a function that takes a state “s” as input (S2 in
our example) and returns an action “a” (a6 in our example).
i.e.: π(s) → a
Or, in our example, π(S2) → a6
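As a toy illustration (all state and action names other than S2 and a6 are placeholders, not from the slides), a deterministic policy can be represented as a simple lookup table from states to actions:

# A deterministic policy as a lookup table: pi(s) -> a
policy = {"S1": "a3", "S2": "a6", "S3": "a1"}

def pi(state):
    return policy[state]

pi("S2")  # returns "a6", matching the example above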
Value Function : Whereas the reward signal indicates what’s
good for the agent at the moment (e.g. taking action “a6”
right now when the agent is on S2), a value function specifies
what is good in the long run.
The value of a state is the total amount of reward an agent
can expect to accumulate over the future, starting from that
state.
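In standard notation (not written out on the slide), the value of a state s is the expected discounted sum of future rewards, where γ (with 0 ≤ γ < 1) is a discount factor:

V(s) = E[ R[t+1] + γ·R[t+2] + γ²·R[t+3] + … | S[t] = s ]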
How does Reinforcement Learning Work?
Let's take an example of a maze environment that the
agent needs to explore. Consider the below image:
In the above image, the agent is at the very first block
of the maze. The maze consists of an S6 block,
which is a wall, an S8 block, which is a fire pit, and an
S4 block, which is a diamond.
If the agent reaches the S4 block, it gets a +1 reward; if it
reaches the fire pit, it gets a -1 reward. It can take four
actions: move up, move down, move left, and move right.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will
then get the +1 reward.
The agent will try to remember the preceding steps it has
taken to reach the final step. To memorize the steps, it assigns
a value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by
assigning the value 1 to each previous block.
But what will the agent do if it starts moving from a block
that has blocks with value 1 on both sides?
Consider the below diagram:
S[t] denotes the current state of the agent and s[t+1] denotes
the next state.
Markov Property
It says that "If the agent is present in the current
state S1, performs an action a1, and moves to the
state S2, then the state transition from S1 to S2
depends only on the current state and action; future
actions and states do not depend on past actions,
rewards, or states."
Or, in other words, as per the Markov property, the
current state transition does not depend on any past
action or state.
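In symbols (standard notation, not from the slides), the Markov property states that the next state depends only on the current state and action, not on the earlier history:

P( S[t+1] | S[t], A[t] ) = P( S[t+1] | S[1], A[1], …, S[t], A[t] )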
Markov Decision Process
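In brief, a Markov Decision Process is commonly described by a tuple (S, A, P, R, γ): a set of states S, a set of actions A, transition probabilities P(s' | s, a) that satisfy the Markov property above, a reward function R, and a discount factor γ. Q-learning, discussed next, learns how to act in such a process.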
Q-Learning Algorithm
Q-learning is a popular model-free reinforcement learning
algorithm based on the Bellman equation.
Model-free means that the agent does not need a model of the
environment to predict its response.
Instead of relying on such a model, it learns from trial and
error, using the rewards it receives.
The main objective of Q-learning is to learn a policy that
informs the agent what actions should be taken, and under
what circumstances, in order to maximize the reward.
The goal of the agent in Q-learning is to maximize the value of Q.
Q stands for quality in Q-learning, which means it specifies the
quality of an action taken by the agent.
Q-Learning algorithm
Q-Learning algorithm works like this:
1) Initialize all Q-values, e.g., with zeros.
2) Choose an action a in the current state s based on the
current best Q-value.
3) Perform this action a and observe the outcome (new
state s’).
4) Measure the reward R after this action.
5) Update Q with an update formula that is called
the Bellman equation (see the update rule after this list).
6) Repeat steps 2 to 5 until the learning no longer improves.
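The update in step 5 is usually written as follows, where α is the learning rate and γ is the discount factor (neither value is fixed by the slides):

Q(s, a) ← Q(s, a) + α [ R + γ max_a' Q(s', a') - Q(s, a) ]

A minimal sketch of the loop in Python, assuming a small environment object with reset() and step(action) methods that return the new state, the reward, and a done flag (these names are assumptions made for this example, not part of the slides):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Step 1: initialize all Q-values with zeros
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 2: choose an action (epsilon-greedy around the current best Q-value)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            # Steps 3 and 4: perform the action, observe the new state and the reward
            s_next, r, done = env.step(a)
            # Step 5: Bellman update toward the reward plus the best next Q-value
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q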
Q-Learning Algorithm
An example of Q-learning is an Advertisement
recommendation system.
Flow Chart for Q-Learning
CONTD…
In the above image, we can see that the agent has three
value options: V(s1), V(s2), and V(s3). Since this is an MDP, the
agent only cares about the current state and the future state.
The agent can move in any direction (up, left, or right), so it
needs to decide where to go to follow the optimal path. Here
the agent moves on a probability basis and changes its state.
But if we want some exact moves, we need to make some
changes in terms of Q-values. Consider the image on the next slide: