Unit-5 MLA


N.B.K.R.

INSTITUTE OF SCIENCE AND TECHNOLOGY::VIDYANAGAR


(AUTONOMOUS)
*****
Department of Computer Science and Engineering
III B.TECH II SEMESTER, March 2024
(R20 Regulations)

Subject: 20CS3201 - MACHINE LEARNING APPLICATIONS


UNIT-V
Reinforcement Learning: What is Reinforcement Learning, How Reinforcement Learning works
with Example, Characteristics of Reinforcement Learning, Learning Models of Reinforcement
Learning-Markov Decision Process, Q-Learning, and Implementation of Q-learning with Python
Frameworks, Real Time Applications of Reinforcement Learning.

1. What is Reinforcement Learning?


 Reinforcement learning is an area of Machine Learning. It is about taking suitable action to
maximize reward in a particular situation.

 Reinforcement Learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and observing the results of those
actions. For each good action, the agent gets positive feedback, and for each bad action, the
agent gets negative feedback or a penalty.

 In Reinforcement Learning, the agent learns automatically from feedback without any
labeled data, unlike supervised learning.

 Since there is no labeled data, the agent is bound to learn from its experience only.

 RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.

 The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by collecting the maximum
positive reward.

 The agent learns through trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of
machine learning method where an intelligent agent (computer program) interacts with the
environment and learns to act within it." How a robotic dog learns the movement of its
arms is an example of reinforcement learning.

 It is a core part of Artificial Intelligence, and many AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns from its
own experience without any human intervention.

 Example: Suppose there is an AI agent present within a maze environment, and its goal is to
find the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent changes, and it also receives a reward or
penalty as feedback.
 The agent continues doing these three things (take action, change state or remain in the same
state, and get feedback), and by doing so, it learns and explores the environment.

 The agent learns which actions lead to positive feedback (rewards) and which actions lead
to negative feedback (penalty). As a positive reward, the agent gets a positive point, and as a
penalty, it gets a negative point.
Terms used in Reinforcement Learning

o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume
a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: The situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the action of
the agent.
o Policy: A strategy applied by the agent to decide the next action based on the current
state.
o Value: The expected long-term return with the discount factor, as opposed to the short-
term reward.
o Q-value: It is mostly similar to the value, but it takes one additional parameter, the current
action (a).

Key Features of Reinforcement Learning


o In RL, the agent is not instructed about the environment and what actions need to be taken.
o It is based on a trial-and-error process.
o The agent takes the next action and changes states according to the feedback of the previous
action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it in order to get the maximum
positive reward.

Approaches to implement Reinforcement Learning


There are mainly three ways to implement reinforcement-learning in ML, which are:
1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum
value at a state under any policy. Therefore, the agent expects the long-term return at any
state(s) under policy π.
2. Policy-based:
The policy-based approach tries to find the optimal policy for the maximum future rewards without
using the value function. In this approach, the agent tries to apply a policy such that the action
performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.
Elements of Reinforcement Learning

There are four main elements of Reinforcement Learning, which are given below:

 Policy
 Reward Signal
 Value Function
 Model of the environment

1) Policy: A policy can be defined as the way an agent behaves at a given time. It maps the
perceived states of the environment to the actions taken in those states. The policy is the core element
of RL, as it alone can define the behavior of the agent. In some cases, it may be a simple function
or a lookup table, whereas, in other cases, it may involve general computation such as a search process. It
could be a deterministic or a stochastic policy:

For deterministic policy: a = π(s)


For stochastic policy: π(a | s) = P[At = a | St = s]
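As a small illustration (not from the syllabus text), both kinds of policy can be sketched in Python; the state and action names below are only placeholders:

import random

# Deterministic policy: each state maps to exactly one action, a = π(s)
deterministic_policy = {"s1": "right", "s2": "up", "s3": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: each state gives a probability distribution over actions, π(a|s)
stochastic_policy = {"s1": {"right": 0.8, "up": 0.2}}

def act_stochastic(state):
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return random.choices(actions, weights=weights)[0]

print(act_deterministic("s1"))   # always 'right'
print(act_stochastic("s1"))      # 'right' about 80% of the time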

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state,
the environment sends an immediate signal to the learning agent, and this signal is known as
a reward signal. These rewards are given according to the good and bad actions taken by the agent.
The agent's main objective is to maximize the total reward received for good actions. The reward
signal can change the policy; for example, if an action selected by the agent leads to a low reward, then the
policy may change to select other actions in the future.

3) Value Function: The value function gives information about how good the situation and action are
and how much reward an agent can expect. A reward indicates the immediate signal for each good
and bad action, whereas a value function specifies the good state and action for the future. The
value function depends on the reward as, without reward, there could be no value. The goal of
estimating values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will
behave. Such as, if a state and an action are given, then a model can predict the next state and reward.

The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations. Approaches that
solve RL problems with the help of a model are termed model-based approaches.
Comparatively, an approach without using a model is called a model-free approach.

2. How does Reinforcement Learning Work?


To understand the working process of the RL, we need to consider two main things:
o Environment: It can be anything such as a room, maze, football ground, etc.
o Agent: An intelligent agent, such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the below
image:

 In the above image, the agent is at the very first block of the maze. The maze consists of
an S6 block, which is a wall, S8, which is a fire pit, and S4, which is a diamond block.
 The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it
gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four
actions: move up, move down, move left, and move right.
 The agent can take any path to reach the final point, but it needs to do so in the fewest
possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; then it will get the +1
reward.

The agent will try to remember the preceding steps it has taken to reach the final step. To
memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous
block. But what will the agent do if it starts from a block that has a value-1 block on both
sides? Consider the below diagram:

It will be difficult for the agent to decide whether it should go up or down, as each block has the
same value. So, the above approach is not suitable for the agent to reach the destination. Hence,
to solve the problem, we will use the Bellman equation, which is the main concept behind
reinforcement learning.

The Bellman Equation

The Bellman equation was introduced by the mathematician Richard Ernest Bellman in the year
1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is
used to calculate the value of a decision problem at a certain point by including the values of
previous states.

It is a way of calculating the value functions in dynamic programming or environment that leads to
modern reinforcement learning.
The key elements used in the Bellman equation are:
o The action performed by the agent is referred to as "a."
o The state that occurs by performing the action is "s."
o The reward/feedback obtained for each good and bad action is "R."
o The discount factor is gamma, "γ."
The Bellman equation can be written as:

V(s) = max [R(s,a) + γV(s')]

Where,

V(s) = the value calculated at a particular state.

R(s,a) = the reward obtained at state s by performing action a.

γ = the discount factor.

V(s') = the value of the next state s'.

In the above equation, we take the max over the possible actions because the agent always tries to find the
optimal solution.
So now, using the Bellman equation, we will find value at each state of the given environment. We
will start from the block, which is next to the target block.
For 1st block:
V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For 2nd block:
V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward
at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward
at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For 4th block:
V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no
reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For 5th block:
V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no
reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
Consider the below image:

Now, we will move further to the 6th block, and here the agent may change its route because it always
tries to find the optimal path. So now, let's consider the block next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box (the wall), it will feel a bump; if it
moves to the fire pit, it will get the -1 reward. But here we are considering only positive rewards, so
the agent will move upwards only. The complete block values will be calculated using this
formula. Consider the below image:
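As a small illustrative sketch (assuming the reward and path from the maze example above), the same backward calculation V(s) = max [R(s,a) + γV(s')] can be reproduced in Python:

gamma = 0.9

# Blocks on the chosen path, starting from the block next to the diamond and
# walking backwards to the start: S3, S2, S1, S5, S9.
# Only the move out of S3 (into the diamond block S4) carries a reward of +1.
path = ["s3", "s2", "s1", "s5", "s9"]
reward = {"s3": 1, "s2": 0, "s1": 0, "s5": 0, "s9": 0}

V = {}
next_value = 0          # there is no further state beyond the diamond
for s in path:
    V[s] = reward[s] + gamma * next_value   # V(s) = R(s,a) + gamma * V(s')
    next_value = V[s]

print({s: round(v, 4) for s, v in V.items()})
# {'s3': 1, 's2': 0.9, 's1': 0.81, 's5': 0.729, 's9': 0.6561}

The exact values 0.729 and 0.6561 round to the 0.73 and 0.66 shown in the walkthrough above.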

Types of Reinforcement learning


There are mainly two types of reinforcement learning, which are:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
 Positive reinforcement means adding something to increase the tendency that the
expected behavior will occur again.
 It has a positive impact on the behavior of the agent and increases the strength of that behavior.
 This type of reinforcement can sustain changes for a long time, but too much positive
reinforcement may lead to an overload of states, which can diminish the results.
Negative Reinforcement:
 Negative reinforcement is the opposite of positive reinforcement, as it increases
the tendency that a specific behavior will occur again by avoiding a negative condition.
 It can be more effective than positive reinforcement depending on the situation and behavior,
but it provides reinforcement only to meet the minimum required behavior.
How to represent the agent state?
 We can represent the agent state using the Markov state, which contains all the required
information from the history. A state St is a Markov state if it satisfies the condition:
 P[St+1 | St] = P[St+1 | S1, ....., St]
 The Markov state follows the Markov property, which says that the future is independent of
the past and can be defined using only the present.
 RL works on fully observable environments, where the agent can observe the
environment and act on the new state.
3. Characteristics of reinforcement learning
 There is no supervisor, only a real-valued reward signal
 The feedback (reward signal) is not instantaneous; it is delayed by many time steps
 Sequential decision making is needed to reach a goal, so time plays a crucial role in
reinforcement problems (the IID assumption on the data does not hold here)
 The agent's actions affect the subsequent data it receives
4. Learning Models of Reinforcement Learning-Markov Decision Process
There are two important learning models in reinforcement learning:
 Markov Decision Process
 Q learning
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the
environment is completely observable, then its dynamics can be modeled as a Markov process. In an
MDP, the agent constantly interacts with the environment and performs actions; at each action, the
environment responds and generates a new state.

MDP is used to describe the environment for RL, and almost all RL problems can be
formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
o A finite set of states S
o A finite set of actions A
o The reward Ra received after transitioning from state S to state S' due to action a
o The transition probability Pa of moving from state S to state S' due to action a
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
Markov Property:
It says that "If the agent is present in the current state S1, performs an action a1 and move to the
state s2, then the state transition from s1 to s2 only depends on the current state and future action
and states do not depend on past actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does not depend on any past
action or state. Hence, MDP is an RL problem that satisfies the Markov property. Such as in a Chess
game, the players only focus on the current state and do not need to remember past actions or
states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we consider only
the finite MDP.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ....., St that satisfies the
Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state space S
and a transition function P. These two components (S and P) define the dynamics of the system.
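As a rough sketch (the states, transitions, and rewards below are illustrative, not taken from the textbook), the four MDP elements (S, A, Pa, Ra) can be written down in Python as plain dictionaries:

# S: finite set of states, A: finite set of actions
S = ["s1", "s2", "s3", "s4"]
A = ["up", "down", "left", "right"]

# Pa: P[(s, a)] maps each possible next state s' to its probability
P = {("s1", "right"): {"s2": 1.0},
     ("s2", "right"): {"s3": 1.0},
     ("s3", "right"): {"s4": 1.0}}

# Ra: reward received after transitioning from s to s' due to action a
R = {("s3", "right", "s4"): 1.0}   # reaching the diamond block gives +1

def step(s, a):
    """Pick the most likely next state and return it with its reward."""
    s_next = max(P[(s, a)], key=P[(s, a)].get)
    return s_next, R.get((s, a, s_next), 0.0)

print(step("s3", "right"))   # ('s4', 1.0)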
According To Text Book:
5. Learning Models of Reinforcement Learning- Q-Learning

Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The
main algorithms used are:

o Q-Learning:

o Q-learning is an off-policy RL algorithm, which is used for temporal difference
learning. Temporal difference learning methods compare
temporally successive predictions.

o It learns the value function Q(s, a), which tells how good it is to take action "a" at a
particular state "s."

o The below flowchart explains the working of Q- learning:


State Action Reward State Action (SARSA):

o SARSA stands for State Action Reward State Action, which is an on-
policy temporal difference learning method. The on-policy control method selects the
action for each state while learning, using a specific policy.

o The goal of SARSA is to calculate Qπ(s, a) for the selected current policy π
and all pairs of (s, a).

o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-
learning, SARSA does not use the maximum Q-value of the next state when updating
the Q-value in the table.

o In SARSA, the new action and reward are selected using the same policy that
determined the original action.

o SARSA is named after the quintuple Q(s, a, r, s', a'), where
s: original state
a: original action
r: reward observed while following the states
s' and a': new state, action pair.
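As a minimal sketch of the difference between the two update targets described above (assuming a Q-table that maps each state to an array of action values, like the one built later in this unit), only the target term differs:

import numpy as np

# Q-learning (off-policy): the target uses the best action in the next state.
def q_learning_target(Q, reward, next_state, gamma):
    return reward + gamma * np.max(Q[next_state])

# SARSA (on-policy): the target uses the action a' actually selected in the
# next state by the same epsilon-greedy policy that chose the original action.
def sarsa_target(Q, reward, next_state, next_action, gamma):
    return reward + gamma * Q[next_state][next_action]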

Deep Q Neural Network (DQN):

o As the name suggests, DQN is Q-learning using neural networks.

o For an environment with a big state space, it is a challenging and complex task to define
and update a Q-table.

o To solve such an issue, we can use the DQN algorithm, where, instead of defining a Q-
table, a neural network approximates the Q-values for each action and state.
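A rough sketch of this idea using PyTorch (the layer sizes and the state/action dimensions are placeholders, not part of the syllabus): a small network takes the state as input and outputs one Q-value per action, replacing the Q-table.

import torch
import torch.nn as nn

state_dim, num_actions = 4, 2          # placeholder sizes

# The network approximates Q(s, a) for all actions of a given state at once.
q_network = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions))

state = torch.rand(1, state_dim)       # a dummy state vector
q_values = q_network(state)            # one Q-value per action
greedy_action = q_values.argmax(dim=1) # pick the action with the highest Q-value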

Now, we will expand the Q-learning.

Q-Learning Explanation:

o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman
equation.

o The main objective of Q-learning is to learn a policy that tells the agent what action
should be taken in which circumstances to maximize the reward.

o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.

o The goal of the agent in Q-learning is to maximize the value of Q.

o The Q-value can be derived from the Bellman equation. Consider the Bellman
equation given below:
In the equation, we have various components, including the reward, the discount factor (γ), the transition
probability, and the end state s'. But no Q-value appears yet, so first consider the below image:

In the above image, we can see there is an agent who has three value options, V(s1), V(s2), V(s3). As
this is an MDP, the agent only cares about the current state and the future state. The agent can go in any
direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will
take a move on a probability basis and change the state. But if we want exact moves, we
need to make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will
use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the
others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be
used to derive the Q-value.

On performing an action, the agent will get a reward R(s, a) and will end up in a certain next state s',
so the Q-value equation will be:

Q(s, a) = R(s, a) + γ Σs' P(s, a, s') V(s')

Hence, we can say that V(s) = max [Q(s, a)].

The above formula is used to estimate the Q-values in Q-learning.
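In practice, the Q-values are learned iteratively with a learning rate α. The standard tabular Q-learning update, which the Python code in the next section implements, is:

Q(s, a) ← Q(s, a) + α [R(s, a) + γ max Q(s', a') − Q(s, a)]

where the max is taken over the actions a' available in the next state s', and α (0 < α ≤ 1) controls how strongly each new experience changes the stored Q-value.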

What is 'Q' in Q-learning?

The Q stands for quality in Q-learning, which means it specifies the quality of an action taken by the
agent.

Q-table:

A Q-table or matrix is created while performing Q-learning. The table is indexed by the state and action
pair, i.e., [s, a], and its values are initialized to zero. After each action, the table is updated, and the q-
values are stored within it.

The RL agent uses this Q-table as a reference table to select the best action based on the q-values.
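As a small illustration (the sizes here are only examples), such a Q-table for an environment with 9 states and 4 actions can be created and queried with NumPy:

import numpy as np

num_states, num_actions = 9, 4            # example sizes for the maze
Q = np.zeros((num_states, num_actions))   # Q-table initialized to zero

state = 0
best_action = np.argmax(Q[state])         # action with the highest q-value for this state

# After taking an action and observing a reward, the entry Q[state, action]
# is updated (for example with the update rule shown above) and stored back.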

According to Textbook:
6. Implementation of Q-learning with Python Frameworks,

pip install gym


Before starting with the example, you will need some helper code in order to visualize the working of the
algorithm. There are two helper files which need to be downloaded into the working directory. One
can find the files here.

Step #1 : Import required libraries.


import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys
from collections import defaultdict
from windy_gridworld import WindyGridworldEnv
import plotting
matplotlib.style.use('ggplot')
Step #2 : Create gym environment.
env = WindyGridworldEnv()
Step #3 : Make the epsilon-greedy policy.

def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
    """
    Creates an epsilon-greedy policy based
    on a given Q-function and epsilon.

    Returns a function that takes the state
    as an input and returns the probabilities
    for each action in the form of a numpy array
    of length of the action space (set of possible actions).
    """
    def policyFunction(state):
        Action_probabilities = np.ones(num_actions,
                                       dtype=float) * epsilon / num_actions
        best_action = np.argmax(Q[state])
        Action_probabilities[best_action] += (1.0 - epsilon)
        return Action_probabilities

    return policyFunction
Step #4 : Build Q-Learning Model.

def qLearning(env, num_episodes, discount_factor=1.0,
              alpha=0.6, epsilon=0.1):
    """
    Q-Learning algorithm: Off-policy TD control.
    Finds the optimal greedy policy while improving
    following an epsilon-greedy policy"""

    # Action value function
    # A nested dictionary that maps
    # state -> (action -> action-value).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Create an epsilon greedy policy function
    # appropriately for environment action space
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    # For every episode
    for ith_episode in range(num_episodes):

        # Reset the environment and pick the first action
        state = env.reset()

        for t in itertools.count():

            # get probabilities of all actions from current state
            action_probabilities = policy(state)

            # choose action according to
            # the probability distribution
            action = np.random.choice(np.arange(
                len(action_probabilities)),
                p=action_probabilities)

            # take action and get reward, transit to next state
            next_state, reward, done, _ = env.step(action)

            # Update statistics
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # TD Update
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            # done is True if episode terminated
            if done:
                break

            state = next_state

    return Q, stats
Step #5 : Train the model.
Q, stats = qLearning(env, 1000)
Step #6 : Plot important statistics.
plotting.plot_episode_stats(stats)
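Note that the code above assumes the classic Gym API. If a newer Gym or Gymnasium release is used (the custom WindyGridworldEnv helper may still follow the old interface), env.reset() returns a pair and env.step() returns five values, so those two calls would need a small adjustment, for example:

state, _ = env.reset()
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated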

Conclusion:
We see in the Episode Reward over Time plot that the episode reward progressively increases over
time and ultimately levels out at a high value per episode, which indicates that the agent has
learned to maximize its total reward earned in an episode by behaving optimally at every state.

7. Applications of Reinforcement Learning

1. Robotics:
RL is used in robot navigation, Robo-soccer, walking, juggling, etc.
2. Control:
RL can be used for adaptive control, such as factory processes and admission control in
telecommunication; helicopter piloting is another example of reinforcement learning.
3. Game Playing:
RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry:
RL can be used for optimizing chemical reactions.
5. Business:
RL is now used for business strategy planning.
6. Manufacturing:
In various automobile manufacturing companies, robots use deep reinforcement
learning to pick goods and put them in containers.
7. Finance Sector:
RL is currently used in the finance sector for evaluating trading strategies.
Here are some further applications of Reinforcement Learning:
 Robotics for industrial automation
 Business strategy planning
 Machine learning and data processing
 Training systems that provide custom instruction and materials according
to the requirements of students
 Aircraft control and robot motion control
Why use Reinforcement Learning?
Here are the prime reasons for using Reinforcement Learning:
 It helps you to find which situations need an action.
 It helps you to discover which action yields the highest reward over a longer period.
 Reinforcement Learning also provides the learning agent with a reward function.
 It also allows the agent to figure out the best method for obtaining large rewards.

Difference between Reinforcement Learning and Supervised Learning


Reinforcement Learning and Supervised Learning are both part of machine learning, but the two
types of learning are far apart from each other. RL agents interact with the environment, explore
it, take actions, and get rewarded, whereas supervised learning algorithms learn from a labeled
dataset and, on the basis of the training, predict the output.
The difference table between RL and supervised learning is given below:

Reinforcement Learning | Supervised Learning

RL works by interacting with the environment. | Supervised learning works on an existing dataset.

The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns things under the supervision of a guide.

No labeled dataset is present. | A labeled dataset is present.

No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.

RL helps to take decisions sequentially. | In supervised learning, decisions are made when the input is given.

Parameters | Reinforcement Learning | Supervised Learning

Decision style | Reinforcement learning helps you to take your decisions sequentially. | In this method, a decision is made on the input given at the beginning.

Works on | Works on interacting with the environment. | Works on examples or given sample data.

Dependency on decision | In the RL method, the learning decisions are dependent. Therefore, you should give labels to all the dependent decisions. | In supervised learning, the decisions are independent of each other, so labels are given for every decision.

Best suited | Supports and works better in AI, where human interaction is prevalent. | It is mostly operated with an interactive software system or applications.

Example | Chess game | Object recognition

Challenges of Reinforcement Learning


Here are the major challenges you will face while doing Reinforcement Learning:
 Feature/reward design, which can be very involved
 Parameters may affect the speed of learning.
 Realistic environments can have partial observability.
 Too much reinforcement may lead to an overload of states, which can diminish the results.
 Realistic environments can be non-stationary.

Prepared by Mrs. C. Sonika
