PA4
python autograder.py -q q2
See the autograder tutorial in Project 0 for more information about using the autograder.
The code for this project contains the following files, which are available in a zip archive:
crawler.py: The crawler code and test harness. You will run this but not edit it.
Evaluation: Your code will be autograded for technical correctness. Please do not change the
names of any provided functions or classes within the code, or you will wreak havoc on the
autograder. However, the correctness of your implementation -- not the autograder's
judgements -- will be the final judge of your score. If necessary, we will review and grade
assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class
for logical redundancy. If you copy someone else's code and submit it with minor changes, we
will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to
submit your own work only; please don't let us down. If you do, we will pursue the strongest
consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course
staff for help. Office hours, section, and the discussion forum are there for your support; please
use them. If you can't make our office hours, let us know and we will schedule more. We want
these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't
know when or how to help unless you ask.
MDPs
To get started, run Gridworld in manual control mode, which uses the arrow keys:
python gridworld.py -m
You will see the two-exit layout from class. The blue dot is the agent. Note that when you
press up, the agent only actually moves north 80% of the time. Such is the life of a Gridworld
agent!
You can control many aspects of the simulation. A full list of options is available by running:
python gridworld.py -h
If you run Gridworld without manual control, the default agent moves randomly; you should see it bounce around the grid until it happens upon an exit. Not the finest hour for an AI agent.
Note: The Gridworld MDP is such that you first must enter a pre-terminal state (the double
boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends
(in the true terminal state called TERMINAL_STATE, which is not shown in the GUI). If you run
an episode manually, your total return may be less than you expected, due to the discount rate
(-d to change; 0.9 by default).
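As a rough illustrative calculation (the exact exponent depends on how many transitions precede the reward), with the default discount of 0.9 a +1 exit reward collected after five discounting steps contributes only about 0.9^5 ≈ 0.59 to the return, which is why a manually played episode can come in well under the undiscounted reward.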
Look at the console output that accompanies the graphical output (or use -t for all text). You
will be told about each transition the agent experiences (to turn this off, use -q).
As in Pacman, positions are represented by (x,y) Cartesian coordinates and any arrays are
indexed by [x][y], with 'north' being the direction of increasing y, etc. By default, most
transitions will receive a reward of zero, though you can change this with the living reward
option (-r).
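As a small, hypothetical illustration of this indexing convention (the grid size and values here are made up, not taken from the project code):

width, height = 3, 3
values = [[0.0 for y in range(height)] for x in range(width)]   # indexed as values[x][y]
values[0][0] = 1.0    # south-west corner: smallest x and smallest y
values[2][2] = 5.0    # north-east corner: largest x (east) and largest y (north)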
Value iteration computes k-step estimates of the optimal values, V_k. In addition to running value iteration, implement the following methods for ValueIterationAgent using V_k:
• computeActionFromValues(state) computes the best action according to the value
function given by self.values.
• computeQValueFromValues(state, action) returns the Q-value of the (state, action)
pair given by the value function given by self.values.
These quantities are all displayed in the GUI: values are numbers in squares, Q-values are
numbers in square quarters, and policies are arrows out from each square.
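As a rough sketch of how these two methods could fit together, assuming self.mdp provides getPossibleActions, getTransitionStatesAndProbs(state, action), and getReward(state, action, nextState), and that self.values is a util.Counter and self.discount the discount factor set in __init__ (check the provided MDP class for the exact method names, and treat this as an outline rather than the reference implementation):

def computeQValueFromValues(self, state, action):
    # Q(s, a) = sum over s' of T(s, a, s') * [ R(s, a, s') + discount * V(s') ]
    qValue = 0.0
    for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
        reward = self.mdp.getReward(state, action, nextState)
        qValue += prob * (reward + self.discount * self.values[nextState])
    return qValue

def computeActionFromValues(self, state):
    # If there are no legal actions (e.g. the terminal state), there is no policy: return None.
    actions = self.mdp.getPossibleActions(state)
    if not actions:
        return None
    return max(actions, key=lambda a: self.computeQValueFromValues(state, a))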
Important: Use the "batch" version of value iteration where each vector V k is computed from a
fixed vector Vk-1 (like in lecture), not the "online" version where one single weight vector is
updated in place. This means that when a state's value is updated in iteration k based on the
values of its successor states, the successor state values used in the value update computation
should be those from iteration k-1 (even if some of the successor states had already been
updated in iteration k). The difference is discussed in Sutton & Barto in the 6th paragraph of
chapter 4.1.
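A minimal sketch of the batch sweep, under the same assumptions as the sketch above (and assuming util is imported and self.iterations holds the number of iterations passed to the agent); the key point is that V_k is accumulated in a separate counter and only swapped into self.values after the full pass over the states, so every update in iteration k reads the untouched values from iteration k-1:

for _ in range(self.iterations):
    newValues = util.Counter()               # will become V_k
    for state in self.mdp.getStates():
        actions = self.mdp.getPossibleActions(state)
        if actions:                          # terminal states keep the default value of 0
            newValues[state] = max(
                self.computeQValueFromValues(state, a) for a in actions)
    self.values = newValues                  # V_k replaces V_{k-1} only after the sweep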
Note: A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards (i.e. you return π_{k+1}). Similarly, the Q-values will also reflect one more reward than the values (i.e. you return Q_{k+1}). You should return the synthesized policy π_{k+1}.
Hint: Use the util.Counter class in util.py, which is a dictionary with a default value of
zero. Methods such as totalCount should simplify your code. However, be careful
with argMax: the actual argmax you want may be a key not in the counter!
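For instance (a toy illustration of the argMax pitfall, not project code):

import util

qValues = util.Counter()    # missing keys default to 0
qValues['west'] = -1.0
qValues['east'] = -2.0
qValues.argMax()            # returns 'west', yet an action that was never stored
                            # (implicit Q-value 0) would actually be better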
Note: Make sure to handle the case when a state has no available actions in an MDP (think
about what this means for future rewards).
To grade your implementation, run the autograder:
python autograder.py -q q1
The following command loads your ValueIterationAgent, which will compute a policy and
execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You
should find that the value of the start state ( V(start), which you can read off of the GUI) and
the empirical resulting average reward (printed after the 10 rounds of execution finish) are
quite close.
Hint: On the default BookGrid, running value iteration for 5 iterations should give you this
output:
You will now write a Q-learning agent, which does very little on construction, but instead learns
by trial and error from interactions with the environment through its update(state,
action, nextState, reward) method. A stub of a Q-learner is specified
in QLearningAgent in qlearningAgents.py, and you can select it with the option '-a q'.
For this question, you must implement
the update, computeValueFromQValues, getQValue,
and computeActionFromQValues methods.
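The core of update is the standard Q-learning rule. A sketch, assuming the agent stores its Q-values in a util.Counter keyed by (state, action) and that self.alpha and self.discount are the learning rate and discount set up by the provided base class (the stub may organize this differently, so adapt the names to your own code):

def update(self, state, action, nextState, reward):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [ r + discount * max_a' Q(s',a') ]
    sample = reward + self.discount * self.computeValueFromQValues(nextState)
    self.qValues[(state, action)] = (
        (1 - self.alpha) * self.qValues[(state, action)] + self.alpha * sample)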
Note: For computeActionFromQValues, you should break ties randomly for better behavior.
The random.choice() function will help. In a particular state, actions that your
agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the
actions that your agent has seen before have a negative Q-value, an unseen action may be
optimal.
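One way to implement the random tie-breaking is sketched below, assuming the provided base class offers a getLegalActions(state) helper (check the file that defines the base class for the exact name); again, this is an outline rather than the reference implementation:

import random

def computeActionFromQValues(self, state):
    actions = self.getLegalActions(state)
    if not actions:
        return None                          # no legal actions, e.g. a terminal state
    bestValue = max(self.getQValue(state, a) for a in actions)
    bestActions = [a for a in actions if self.getQValue(state, a) == bestValue]
    return random.choice(bestActions)        # break ties randomly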
With the Q-learning update in place, you can watch your Q-learner learn under manual control,
using the keyboard:
python gridworld.py -a q -k 5 -m
Recall that -k will control the number of episodes your agent gets to learn. Watch how the
agent learns about the state it was just in, not the one it moves to, and "leaves learning in its
wake." Hint: to help with debugging, you can turn off noise by using the --noise
0.0parameter (though this obviously makes Q-learning less interesting). If you manually steer
Pacman north and then east along the optimal path for four episodes, you should see the
following Q-values:
Grading: We will run your Q-learning agent and check that it learns the same Q-values and
policy as our reference implementation when each is presented with the same set of examples.
To grade your implementation, run the autograder:
python autograder.py -q q4