Unit - 5
Reinforcement Learning
1. N-step returns
2. TD(λ) algorithm
3. Need for generalization in practice
4. Linear function approximation and
geometric view
5. Linear TD(λ)
6. Tile coding
7. Control with function approximation
8. Policy search
9. Policy gradient methods
10. Experience replay
11. Fitted Q iteration
12. Case studies
N-step returns
The N-step return looks N steps ahead and then bootstraps from the current value estimate:
G_t^(N) = R_{t+1} + γ R_{t+2} + ... + γ^(N-1) R_{t+N} + γ^N V(S_{t+N})
where:
● R_{t+1}, R_{t+2}, ..., R_{t+N} are the rewards received at time steps t+1 to t+N.
● V(S_{t+N}) is the estimated value of the state S_{t+N} (the state N time steps ahead).
● γ is the discount factor, which determines the importance of future rewards.
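As a minimal computational sketch of this definition (the function name n_step_return and the toy numbers are illustrative, not from any particular library):

def n_step_return(rewards, v_bootstrap, gamma):
    """N-step return from rewards [R_{t+1}, ..., R_{t+N}] and the
    bootstrap value V(S_{t+N}) of the state N steps ahead."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r                     # discounted sum of observed rewards
    g += (gamma ** len(rewards)) * v_bootstrap    # bootstrap term
    return g

# Example: a 3-step return with gamma = 0.9 and V(S_{t+3}) = 2.0
print(n_step_return([1.0, 0.0, 1.0], v_bootstrap=2.0, gamma=0.9))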
Scope of N-step Returns:
● N-step returns can provide a balance between short-term and
long-term rewards. They allow for a deeper look into the
consequences of actions taken in the environment.
● They can help improve the stability of learning algorithms: relative to full Monte Carlo returns they reduce variance, and relative to one-step TD estimates they reduce bias, which can give a more accurate estimation of the value of states or state-action pairs.
Choosing N:
● The choice of N in N-step returns is a hyperparameter. Selecting the
right value for N depends on the specific problem and the
characteristics of the environment.
● Smaller values of N (e.g., N=1 or N=2) are more similar to TD(0)
methods and focus on shorter-term rewards.
● Larger values of N (e.g., N=3 or more) consider a more extended sequence of rewards and behave more like Monte Carlo methods, trading lower bias for higher variance in the value estimates.
TD(λ) algorithm
TD(λ) improves over the offline λ-return algorithm in three ways. First, it updates the weight vector on every step of an episode rather than only at the end, so its estimates may improve sooner. Second, its computations are distributed evenly over time rather than all occurring at the end of the episode. Third, it can be applied to continuing problems, not just episodic ones. This section presents the semi-gradient version of TD(λ) with function approximation.
The eligibility trace is a short-term memory, typically lasting less time than the length of an episode. Eligibility traces assist in the learning process; their only consequence is that they affect the weight vector, and the weight vector in turn determines the estimated value.
Linear function approximation
With linear function approximation, values are represented as a weighted sum of features:
- V(s) ≈ θᵀ * φ(s)
- Q(s, a) ≈ θᵀ * φ(s, a)
Where:
- V(s) is the estimated value of state s.
- Q(s, a) is the estimated value of taking action a in state s.
- θ is a weight vector.
- φ(s) and φ(s, a) are feature vectors that represent state s or state-action
pair (s, a).
- For state s, the weight vector θ can be updated using the n-step return:
  θ ← θ + α (Gₙ - V(s)) ∇θ V(s)
Where:
- Gₙ is the n-step return, a sum of discounted rewards plus an n-step estimate of the value function.
- V(s) is the current estimate of the value of state s.
- α is the learning rate.
- ∇θ V(s) is the gradient of the value function with respect to the weight vector θ (for a linear approximator this is simply φ(s)).
The update rule involves computing the temporal difference error (TD
error) as (Gₙ - V(s)), and then using this error to update the weight vector
θ. The update of θ is proportional to the TD error and the gradient of the
value function with respect to the weights.
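A minimal sketch of this update for a linear approximator, where the gradient ∇θ V(s) is just the feature vector φ(s); the function name and toy values are illustrative assumptions, not part of a specific library:

import numpy as np

def n_step_linear_update(theta, phi_s, g_n, alpha):
    """Update theta for V(s) = theta^T phi(s) toward the n-step return g_n.
    For a linear approximator, grad_theta V(s) = phi(s)."""
    v_s = theta @ phi_s          # current estimate V(s)
    td_error = g_n - v_s         # (G_n - V(s))
    return theta + alpha * td_error * phi_s

# Example: 4 features, learning rate 0.1
theta = np.zeros(4)
phi_s = np.array([1.0, 0.0, 0.5, 0.0])
theta = n_step_linear_update(theta, phi_s, g_n=1.5, alpha=0.1)
print(theta)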
Geometric View
1. Value Function Space:
 - Think of value functions as points (vectors) in a high-dimensional space, with one dimension for each state.
2. Basis Vectors:
 - Consider each basis function or feature as a basis vector in this high-dimensional space.
 - Each feature can be thought of as a dimension in the space.
 - Linear function approximation can then be viewed as projecting the true value function onto the subspace spanned by these feature vectors.
Linear TD(λ)
Linear TD(λ), often referred to as TD(lambda) or Temporal
Difference lambda, is a reinforcement learning algorithm that
combines linear function approximation with eligibility traces to
estimate the value function (V(s) or Q(s, a)) in Markov Decision
Processes (MDPs). It is an extension of the TD(0) algorithm, which
uses a one-step TD error, and it offers a way to balance between
one-step and multi-step updates. Here's an overview of Linear TD(λ):
Key Components:
1. Linear Function Approximation:
 - The value function is represented as a linear combination of features, V(s) ≈ θᵀ φ(s).
2. TD Error:
 - The TD error for Linear TD(λ) at time step t is computed as:
 δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
 - It represents the difference between the TD target (the reward received at time step t+1 plus the discounted estimated value of S_{t+1}) and the estimated value of the current state S_t.
3. Eligibility Traces:
- Linear TD(λ) uses eligibility traces to keep track of the eligibility of each feature for updates. Eligibility traces are a way of assigning credit to features that contributed to the TD error. With accumulating traces, the trace vector is updated as z_t = γ λ z_{t-1} + φ(S_t), starting from z = 0.
4. Weight Updates:
- The weight vector theta is updated based on the eligibility trace and
TD error at each time step:
θ ← θ + α δ_t z_t
where α is the learning rate.
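Putting the TD error, eligibility trace, and weight update together, a minimal single-step sketch of semi-gradient Linear TD(λ) (function name and toy values are illustrative):

import numpy as np

def linear_td_lambda_step(theta, z, phi_s, phi_s_next, reward,
                          alpha, gamma, lam, terminal=False):
    """One step of semi-gradient TD(lambda) with linear features and an
    accumulating eligibility trace z."""
    v_s = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = reward + gamma * v_next - v_s      # TD error delta_t
    z = gamma * lam * z + phi_s                # z_t = gamma*lambda*z_{t-1} + phi(S_t)
    theta = theta + alpha * delta * z          # theta <- theta + alpha*delta_t*z_t
    return theta, z

# Example with 3 features
theta, z = np.zeros(3), np.zeros(3)
theta, z = linear_td_lambda_step(theta, z,
                                 phi_s=np.array([1.0, 0.0, 0.0]),
                                 phi_s_next=np.array([0.0, 1.0, 0.0]),
                                 reward=1.0, alpha=0.1, gamma=0.9, lam=0.8)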
Tile coding
Overview
1. Tiling: Tile coding divides the continuous state space into a set of
overlapping regions or "tiles." Each tile represents a discrete region of the
continuous state space.
4. Combining Tilings: The active tiles from multiple tilings are often combined
to create a binary feature vector that represents the state. This feature vector
is used in learning algorithms.
Considerations:
2. Tile Coding Library: There are libraries and tools available that can assist in
creating and managing tile codings, helping to set up the tiling parameters.
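A minimal sketch of tile coding for a one-dimensional state; real systems typically use multi-dimensional tilings or a dedicated library, and the parameters below are illustrative:

import numpy as np

def tile_features(x, low, high, n_tilings=4, tiles_per_tiling=8):
    """Binary feature vector for a scalar state x in [low, high], built
    from n_tilings overlapping tilings, each offset by a fraction of
    the tile width."""
    features = np.zeros(n_tilings * tiles_per_tiling)
    tile_width = (high - low) / tiles_per_tiling
    for t in range(n_tilings):
        offset = (t / n_tilings) * tile_width         # shift each tiling
        index = int((x - low + offset) / tile_width)
        index = min(index, tiles_per_tiling - 1)      # clip at the boundary
        features[t * tiles_per_tiling + index] = 1.0  # one active tile per tiling
    return features

# A state of 0.37 activates one tile in each of the 4 tilings.
print(tile_features(0.37, low=0.0, high=1.0))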
Policy search
Policy search is a class of reinforcement learning methods that focuses on directly
optimizing a parameterized policy (policy being a mapping from states to actions) in
order to find an optimal or near-optimal policy for a given task. Unlike value-based
methods that estimate the value function and derive policies from it, policy search
methods work by iteratively adjusting policy parameters to maximize expected rewards.
Policy search methods are particularly useful in situations where the state and action
spaces are large, continuous, or unknown, making it challenging to use traditional
value-based methods like Q-learning. Here are some key concepts and approaches
related to policy search:
1. Policy Parameterization:
- In policy search, the policy is often parameterized by a set of learnable parameters,
usually denoted as θ. These parameters determine the behavior of the policy.
2. Objective Function:
- The goal of policy search is to find the optimal policy parameters (θ) that maximize
the expected return or cumulative reward. This is typically formulated as an objective
function, such as the expected return or the expected sum of rewards. The objective is to
maximize this function.
6. Sample Efficiency:
- One of the challenges with policy search is sample efficiency. It can require a
significant number of samples (interactions with the environment) to find a good policy,
making it impractical in situations where data collection is costly or risky.
7. Types of Policies:
- Policy search methods can be used with various types of policies, including
deterministic policies (output a single action) and stochastic policies (output a distribution
over actions).
8. Applications:
- Policy search has been applied in various domains, including robotics, autonomous
vehicles, game playing, and natural language processing. It's particularly useful in
situations where the agent has a lot of degrees of freedom, complex actions, or an
unknown environment model.
Each of these methods has its own strengths and weaknesses, and the choice of
algorithm depends on the specific requirements of the task and the available resources.
Policy search methods continue to be an active area of research in reinforcement
learning and artificial intelligence.
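As a minimal sketch of the policy search idea, the following derivative-free hill-climbing loop perturbs the policy parameters and keeps changes that improve the measured return. The evaluate function, which would run one or more episodes with the parameterized policy, is an assumed stand-in rather than a real environment interface:

import numpy as np

def hill_climb_policy_search(evaluate, n_params, n_iters=100, noise=0.1):
    """Simple derivative-free policy search: perturb theta and keep the
    perturbation if the estimated return improves."""
    theta = np.zeros(n_params)
    best_return = evaluate(theta)
    for _ in range(n_iters):
        candidate = theta + noise * np.random.randn(n_params)
        candidate_return = evaluate(candidate)
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta, best_return

# Toy stand-in objective (no real environment): maximize -||theta - target||^2
target = np.array([0.5, -0.3])
theta, ret = hill_climb_policy_search(lambda th: -np.sum((th - target) ** 2),
                                      n_params=2)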
Policy gradient methods
Policy gradient methods are a family of policy search algorithms that optimize the policy parameters by gradient ascent on the expected return. Key concepts:
1. Policy Parameterization:
- In policy gradient methods, the policy is often parameterized by a set of
learnable parameters, typically denoted as θ. These parameters determine
the behavior of the policy.
2. Objective Function:
- The primary goal of policy gradient methods is to find the optimal policy
parameters (θ) that maximize the expected return or cumulative reward.
This is typically formulated as the expected sum of rewards (or return), and
the objective is to maximize this function.
3. Stochastic Policies:
- Many policy gradient methods use stochastic policies, which means the
policy generates a probability distribution over actions. The optimization
aims to increase the likelihood of selecting actions that lead to higher
expected returns.
4. Policy Optimization:
- Policy optimization is the process of finding the optimal policy
parameters θ. Common optimization techniques include gradient ascent,
where the policy parameters are updated in the direction of the gradient of
the expected return with respect to θ.
6. Sample Efficiency:
- One of the challenges with policy gradient methods is sample efficiency. They often require a significant number of samples (interactions with the environment) to find a good policy.
8. Types of Policies:
- Policy gradient methods can be used with various types of policies,
including deterministic policies (output a single action) and stochastic
policies (output a distribution over actions).
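A minimal REINFORCE-style sketch for a linear-softmax stochastic policy; the gamma^t scaling of the classic algorithm is omitted for brevity, and function names and toy data are illustrative:

import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities from linear preferences theta[a] . phi(s)."""
    prefs = theta @ phi_s
    prefs = prefs - prefs.max()          # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, alpha, gamma):
    """Gradient-ascent update from one episode, given as a list of
    (phi_s, action, reward) tuples."""
    g = 0.0
    for phi_s, action, reward in reversed(episode):
        g = reward + gamma * g                   # return from this step onward
        probs = softmax_policy(theta, phi_s)
        grad_log_pi = -np.outer(probs, phi_s)    # d/dtheta of log pi(a|s)
        grad_log_pi[action] += phi_s
        theta = theta + alpha * g * grad_log_pi
    return theta

# Toy usage: 2 actions, 3 features, one short episode.
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, 0.0]), 0, 1.0),
           (np.array([0.0, 1.0, 0.0]), 1, 0.0)]
theta = reinforce_update(theta, episode, alpha=0.01, gamma=0.99)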
Experience replay
Experience replay is a fundamental technique used in reinforcement
learning, particularly in the context of deep reinforcement learning, to
improve the stability, efficiency, and effectiveness of training algorithms.
Experience replay involves storing and randomly sampling experiences from
an agent's interactions with the environment. These experiences, often
represented as tuples of (state, action, reward, next state), are stored in a
replay buffer, and they are sampled and replayed during training. The main
idea behind experience replay is to break the temporal correlations between
consecutive samples and make the learning process more data-efficient and
stable.
1. Interaction: The agent interacts with the environment and collects experiences as (state, action, reward, next state) tuples.
2. Replay Buffer Storage: These experiences are stored in the replay buffer.
3. Sampling: During training, mini-batches of experiences are sampled at random from the replay buffer.
4. Learning: The agent updates its policy or value function using the experiences in the mini-batch, typically through gradient-based methods.
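A minimal replay buffer sketch (the class name and default capacity are illustrative):

import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    samples uniform random mini-batches for training."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)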
Fitted Q iteration
Fitted Q-Iteration (FQI) is an iterative reinforcement learning algorithm that is used for
approximating the optimal Q-function in a model-free manner. The algorithm is designed
for environments with large or continuous state-action spaces where traditional tabular
methods are impractical. Fitted Q-Iteration leverages function approximation to estimate
the Q-values more efficiently.
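A minimal Fitted Q-Iteration sketch, assuming scikit-learn is available for the regressor and that transitions have already been collected as (state, action, reward, next_state, done) tuples with vector-valued states; names and hyperparameters are illustrative:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, n_actions, gamma, n_iterations=20):
    """Iteratively fit a regressor Q_{k+1}(s, a) to the targets
    r + gamma * max_a' Q_k(s', a') computed from a fixed batch of data."""
    states = np.array([t[0] for t in transitions])
    actions = np.array([[t[1]] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    X = np.hstack([states, actions])     # regression inputs: (state, action)
    model = None
    for _ in range(n_iterations):
        if model is None:
            targets = rewards            # first iteration: Q_0 = 0
        else:
            # One-step lookahead with the previous Q estimate.
            q_next = np.column_stack([
                model.predict(np.hstack([next_states,
                                         np.full((len(next_states), 1), a)]))
                for a in range(n_actions)])
            targets = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model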
Case studies
Case 1: Watson’s Daily-Double Wagering
Why was the TD-Gammon method of self-play not used to learn the
critical value function v̂? Learning from self-play in Jeopardy! would not have worked very well because Watson was so different from any human
contestant. Self-play would have led to exploration of state space regions
that are not typical for play against human opponents, particularly human
champions. In addition, unlike backgammon, Jeopardy! is a game of
imperfect information because contestants do not have access to all the
information influencing their opponents’ play. In particular, Jeopardy!
contestants do not know how much confidence their opponents have for
responding to clues in the various categories. Self-play would have been
something like playing poker with someone who is holding the same cards
that you hold.