Unit 1
Reinforcement Learning
The equation contains several components: the reward, the discount factor (γ), the transition probability, and the end state s'. However, no Q-value appears in it yet, so first consider the image below:
In the image above, the agent has three value options: V(s1), V(s2), and V(s3). Because this is an MDP, the agent only cares about the current state and the future state. The agent can move in any direction (Up, Left, or Right) and must decide which way leads to the optimal path. Here the agent moves on a probabilistic basis and changes its state accordingly. If we want the agent to choose specific moves, however, we need to work in terms of Q-values. Consider the image below:
Q stands for the quality of the actions at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive the Q-value. For performing an action the agent receives a reward R(s, a) and also ends up in some state, and combining these gives the Q-value equation.
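Assuming transition probabilities P(s' | s, a) over the possible next states, its standard form is

Q(s, a) = R(s, a) + γ Σ_s' P(s' | s, a) max_a' Q(s', a')

that is, the immediate reward plus the discounted, probability-weighted value of the best action available in the resulting state s'.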
Non-Deterministic Rewards:
In a non-deterministic reward environment, the reward function is not fixed and can
vary randomly. This means that the same state-action pair can receive different
rewards in different episodes. To address this issue, Q-learning algorithms can use
techniques such as:
1. Epsilon-greedy exploration: The agent chooses the action with the highest expected value with probability (1 - ε) and a random action with probability ε (a short sketch follows this list).
2. Entropy regularization: The agent adds a term to the objective function that
encourages exploration by maximizing the entropy of the action distribution.
3. Reward bootstrapping: The agent uses a bootstrapping technique to estimate the
reward distribution and adjust its policy accordingly.
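As a rough illustration of the first technique above, here is a minimal epsilon-greedy action-selection sketch in Python using NumPy; the Q-table shape, the state index, and the value of ε are placeholder assumptions, not part of any specific library or of the text above.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        # Explore: choose an action uniformly at random.
        return int(rng.integers(n_actions))
    # Exploit: choose the action with the highest estimated Q-value in this state.
    return int(np.argmax(q_table[state]))

# Toy example: 5 states, 3 actions, all Q-values initialized to zero.
q_table = np.zeros((5, 3))
action = epsilon_greedy_action(q_table, state=0, epsilon=0.1)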
Non-Deterministic Actions:
In a non-deterministic action environment, the agent's actions can have multiple
possible outcomes, each with a certain probability. To address this issue, Q-learning
algorithms can use techniques such as:
1. Stochastic policy: The agent maintains a stochastic policy that specifies the probability of taking each action in each state (see the sketch after this list).
2. Episodic memory: The agent stores experiences in an episodic memory and uses
them to learn a more robust policy.
3. Trajectory-based methods: The agent learns to predict trajectories of states and
actions rather than individual actions.
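As a rough sketch of the first technique in this list, the Q-values for a state can be turned into action probabilities with a softmax; the temperature parameter and the toy Q-table below are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_values, temperature=1.0):
    """Convert the Q-values of one state into a probability distribution over actions."""
    z = (q_values - np.max(q_values)) / temperature  # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def sample_action(q_table, state, temperature=1.0):
    """Sample an action from the stochastic (softmax) policy for the given state."""
    probs = softmax_policy(q_table[state], temperature)
    return int(rng.choice(len(probs), p=probs))

# Toy example: 4 states, 2 actions; higher temperature spreads probability more evenly.
q_table = np.zeros((4, 2))
action = sample_action(q_table, state=0, temperature=0.5)

A lower temperature makes this policy closer to greedy action selection, while a higher temperature makes it explore more.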
Challenges in Non-Deterministic Environments:
1. Exploration-exploitation trade-off: The agent must balance exploration, to gather more information about the environment, against exploitation, to maximize rewards (a simple ε-decay schedule is sketched after this list).
2. Uncertainty: The agent must deal with uncertainty in rewards and actions, which can
lead to overestimation or underestimation of value functions.
3. Convergence issues: Non-deterministic environments can lead to convergence issues
in Q-learning algorithms, especially when using techniques like epsilon-greedy
exploration.
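To make the first challenge concrete, one common and simple remedy is to decay ε over the course of training, so the agent explores heavily early on and exploits more later; the schedule and constants below are illustrative assumptions, not values from the text.

import math

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_rate=1e-3):
    """Exponentially decay epsilon from eps_start toward eps_end as training progresses."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay_rate * step)

# Early steps explore almost always; by step 5000 the agent mostly exploits.
print(epsilon_at(0))      # about 1.0
print(epsilon_at(5000))   # about 0.06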
Popular Algorithms for Non-Deterministic Environments:
1. Deep Q-Networks (DQN): A popular algorithm for non-deterministic environments that uses a neural network to approximate the Q-value (action-value) function (a minimal sketch follows this list).
2. Actor-Critic Methods: Algorithms that combine policy-based and value-based
methods to learn both the policy and value function in non-deterministic
environments.
3. PPO (Proximal Policy Optimization): An algorithm that constrains each policy update with a clipped, trust-region-style objective to optimize the policy in non-deterministic environments.
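To give a feel for the first algorithm, below is a minimal DQN-style sketch written with PyTorch (a library not mentioned in the text); the network sizes, hyperparameters, and the single hand-made transition are illustrative assumptions, not a faithful reproduction of the published DQN.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP that maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Toy setup: 4-dimensional states, 2 actions.
q_net = QNetwork(state_dim=4, n_actions=2)
target_net = QNetwork(state_dim=4, n_actions=2)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# One hand-made transition (s, a, r, s') standing in for a replay-buffer batch.
state = torch.randn(1, 4)
action = torch.tensor([[0]])
reward = torch.tensor([[1.0]])
next_state = torch.randn(1, 4)

# TD target: r + gamma * max_a' Q_target(s', a').
with torch.no_grad():
    target = reward + gamma * target_net(next_state).max(dim=1, keepdim=True).values

# Update the online network toward the target for the action that was taken.
q_value = q_net(state).gather(1, action)
loss = nn.functional.mse_loss(q_value, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

In a full agent this update would be repeated over mini-batches sampled from a replay buffer, with the target network copied from the online network at regular intervals.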