Programming Assignment P5
|
Fall 2012
|
Implementing MDP algorithms
Playing cat and mouse
Exploring optimal behavior
The code we are providing
The code that you need to write
What to turn in
What you will be graded on
In this assignment, you will use some of the algorithms for computing optimal policies in Markov decision processes (MDP's) to help a mouse escape from a cat while trying to find cheese to eat.
The main components of the assignment are the following:
The first part of the assignment is to implement three of the algorithms that were discussed in lecture as well as R&N (sections 17.2 and 17.3): value iteration, policy iteration and policy evaluation. Your implementations should work for any (reasonable) MDP, not just the ones provided. See below for a description of where you should write your code.
The second part of the assignment is to use your code to compute optimal policies for a mouse who lives with a cat in the following grid world:
The mouse (m) can occupy any of the 31 blank squares. The cat (C) also can occupy any square, except for square (6,1) which is the mouse's hole (too small for the cat to squeeze in). There is cheese (z) that usually can be found in one of the following three squares: (2,3), (4,1) and (7,3). Occasionally, the cheese disappears altogether. Thus, this MDP has 31*30*4=3720 states.
The cat and the mouse can each move one square in any direction -- vertically, horizontally or diagonally. They also can choose not to move at all. Thus, there are nine possible moves from each square. If an action is attempted that causes the creature to bump into a wall, then it simply stays where it is.
In this problem, we will always take the point of view of the mouse. When the mouse is on the cheese, it receives a reward of +1. When the cat is on the mouse, it (the mouse) receives a reward of -21 (ouch). When the cat is on the mouse, and the mouse is on the cheese, the reward is -20. All other configurations have a reward of 0. Thus, the mouse is trying to eat cheese while simultaneously avoiding the cat.
We will consider three different cats, each with a distinct personality. The first cat, poor thing, is blind and oblivious, and simply wanders randomly around its environment choosing randomly among its nine available actions at every step. The second cat is hungry, alert and unrelenting. This cat always heads straight for the mouse following the shortest path possible. Thus, after the mouse makes its move, the cat chooses the action that will move it as close as possible to the mouse's new position. (If there is a tie among the cat's best available options, the cat chooses randomly among these equally good best actions.) However, when the mouse is in its hole and out of sight, the cat reverts to aimless (i.e., random) wandering. The third cat is also alert, but has a more sporting disposition, and therefore follows a combination of these two strategies: half the time, it wanders aimlessly, and half the time, it heads straight for the mouse. Machine-readable (and compressed) descriptions of the three MDP's corresponding to these three cats are provided in the files oblivious.mdp.gz, unrelenting.mdp.gz and sporting.mdp.gz.
If there is cheese available, it will always be in exactly one of the three locations listed above. At every time step, the cheese remains where it is with 90% probability, and with 10% probability, it vanishes altogether. Once it has vanished, the cheese remains hidden at each step with probability 70%, and with 30% probability, it reappears randomly in one of the three designated locations.
States are encoded as six tuples, the first two numbers indicating the position of the mouse, the second two numbers the position of the cat, and the last two numbers the position of the cheese. Thus, 4:2:2:4:7:3 indicates, as depicted in the figure above, that the mouse is in (4,2), the cat is in (2,4), and the cheese is in (7,3). When there is no cheese, the last two coordinates are empty; for instance, 4:2:2:4:: would encode the same state as above with the cheese missing.
The cat and the mouse alternate moves. However, in encoding the MDP, we collapse both moves into a single state transition. In addition, the cheese, when it vanishes or reappears, does so simultaneously with the mouse's move. For instance, from the configuration above, if the mouse moves to (5,2) and the cat responds by moving to (3,3), while the cheese vanishes, this all would be encoded as a single transition from state 4:2:2:4:7:3 to 5:2:3:3::. Actions in the MDP refer to actions that the mouse can make; the cat's actions are effectively "hard-wired" into the dynamics of the MDP itself.
For each of the three cats, your job will be to compute the mouse's optimal policy, i.e., the action that should be taken at each state to maximize the mouse's expected discounted reward, where we fix the discount factor (gamma) to be 0.95. You can then watch the cat and the mouse go at it using an animator that we are providing.
Once you have everything working, you should take some time to observe and understand the mouse's behavior in each of the three cases. Then answer the following:
We are providing a class called Mdp that loads a description of an MDP from a file whose format is described shortly. The class has a constructor taking a file name as argument that reads in the MDP description from the file. Each state is represented by an integer in the range 0 (inclusive) to numStates (exclusive), where numStates is the total number of states. Actions are represented in a similar fashion. The stateName and actionName arrays convert these integers back to strings in the obvious fashion. The reward array gives the immediate reward of each state. For efficiency, state transition probabilities are represented in a sparse fashion to take advantage of the fact that most states have a relatively small number of possible successors. In particular, the array nextState[s][a] represents a list of possible next states which may follow state s under action a. The understanding here is that the probability of transitioning to any other state is zero. There is a corresponding list of probabilities transProb[s][a]. Thus, transProb[s][a][i] is the probability of transitioning from s to nextState[s][a][i] under action a.
Each line of an MDP description file has one of three possible formats. The first possibility is that the line has the simple form
<state>
where <state> is any name (no white space). Such a line indicates that <state> is the start state of the MDP.
The second possibility is that the line has the form
<state> <reward>
where <state> is any name, and <reward> is a number. Such a line indicates that the reward in <state> is <reward>.
The last possibility is that the line has the form
<from_state> <action> <to_state> <prob>
where the first three tokens are names, and <prob> is a nonnegative number. Such a line indicates that the probability of transitioning from <from_state> to <to_state> under action <action> is <prob>. Multiple lines of this form with the same <from_state> and <action> can be combined. Thus,
u a v 0.3 u 0.4 w 0.1
is equivalent to the three lines
u a v 0.3
u a u 0.4
u a w 0.1
Lines of these forms can appear in the file in any order. However, if a reward is given more than once for the same state, then the last reward value given is assigned to that state. On the other hand, if more than one probability is given for the same state-action-state triple, then the sum of these probabilities is assigned to this transition. If the probabilities of transitioning out of a given state under a given action do not add up to one, then these numbers are renormalized so that they do. However, if this sum is zero, an exception is thrown. States that are not given a reward in the description file are assumed to have a reward of zero; likewise, transitions that do not appear in the file are assumed to have zero probability. An exception is also thrown if no start state is declared (using the first form above); if more than one such line appears, then the last one is taken to define the start state.
For instance, the following encodes an MDP with five states, -2, -1, 0, +1 and +2. There are two actions, L(eft) and R(ight) which cause the MDP to move to the next state to the left or right 90% of the time, or to move in the opposite direction 10% of the time (with movement impeded at the end points -2 and +2). The start state is 0. The reward is 1 in 0, -1 in -2, and -2 in +2. The start state is 0.
0
0 1
0 L -1 0.9 +1 0.1
0 R +1 0.9 -1 0.1
-1 L -2 0.9 0 0.1
-1 R 0 0.9 -2 0.1
-2 -1
-2 L -2 0.9 -1 0.1
-2 R -1 0.9 -2 0.1
+1 R +2 0.9 0 0.1
+1 L 0 0.9 +2 0.1
+2 -2
+2 R +2 0.9 +1 0.1
+2 L +1 0.9 +2 0.1
If you want to try out or test your program on MDP's of your own invention (which is highly recommended for debugging), you can do so by creating files of this form. Alternatively, you can create a subclass of the Mdp class, and write code that fills in the public fields with a description of your MDP.
Because the files being provided for the cat and mouse MDP's are quite large, we are providing them in a compressed (gzipped) form. Note that the Mdp class will automatically read a file that has been gzipped, provided that it ends with a ".gz" suffix. In other words, there is no need to uncompress the gzipped files that we are providing prior to running your code on them. (Of course, the code will also work on files that have not been compressed.) In addition, because these MDP's are somewhat large, you may need to increase the memory available to java (the option for doing this is system dependent, but it often has the form -Xmx256m which increases the memory to 256 megabytes).
The class CatMouseAnimator can be used to animate a given policy on one of the cat-and-mouse MDP's (and will surely crash if provided with any other kind of MDP). More precisely, the animator will animate a sequence of states provided to it by an object implementing the MdpSimulator interface; typically, these state sequences will be generated according to a fixed policy using the FixedPolicySimulator class.
The method animateGuiOnly invokes a graphical user interface (GUI) for this purpose showing the behavior of the cat and mouse in a graphical depiction of their grid-world, similar to the one above. Buttons are provided for starting or stopping the animation (at a speed that can be controlled), or for stepping through the animation one step at a time.
The method animateGuiAndPrint invokes the GUI as well, but also prints the state of the MDP at every step. The printed form of each state looks something like the following:
+===============+
| . . . . . . . |
| . C . . . . |
| . . . . . . z |
| . . . m . |
| . . . . . _ |
+===============+
As in the GUI, the letters m, C and z indicate the position of the mouse, cat and cheese. The mouse's hole is indicated with an underscore. In both the printed and GUI depictions, the mouse vanishes when the cat is on top of it; likewise, the cheese can be obscured by either the cat or the mouse (although the mouse changes from lower to upper case M when it is eating).
A final method animatePrintOnly prints the successive states of the MDP, but does not invoke the GUI.
The class RunCatMouse consists only of a main that loads an MDP from a file, finds and prints the optimal policy (using both value and policy iteration) and runs the animator (depending on command line arguments -- see the documentation). Before starting the animator, four columns are printed following each state: the optimal action under the policy found using value iteration; the optimal action under the policy found using policy iteration; the optimal utility as estimated using value iteration; and the estimated utility of the optimal policy using policy evaluation. For instance, on the simple five-state MDP given above, the output should look something like this:
Optimal policies:
0 L L 8.169018054714 8.169018054714
-1 R R 7.557867069632 7.557867069632
+1 L L 7.442544317667 7.442544317667
-2 R R 6.035332977387 6.035332977387
+2 L L 4.821409272492 4.821409272492
In this case, as expected, the optimal policy says to move to the center (right if in -1 or -2, and left if in +1 or +2), and to move left (away from the larger penalty state +2) if in the center.
(Note that your own results, even for a correct implementation, may differ slightly from the utilities that are given here, but should be accurate up to the numerical precision specified below.)
You might well decide to modify or replace the RunCatMouse class. However, your code should still work properly when run using the provided version.
To debug your code, you will surely want to run it on small MDP's. There are a number of "sanity checks" that can be done to see if your code is working. For instance, value iteration and policy iteration should output identical policies (up to possible ties between equally good actions). In addition, the utility of the optimal policy must satisfy the Bellman equations.
All code and data files can be obtained all at once from this zip file. Data is included in the data subdirectory.
Documentation on the provided Java classes is available here.
As a general reminder, you should not modify any of the code that we are providing, other than template files, or unless specifically allowed in the instructions. You also should not change the "signature" of any of the methods or constructors that you are required to write, i.e., the parameters of the method (or constructor), their types, and the type of the value returned by the method.
Your job is to fill in the constructors in the three classes ValueIteration, PolicyIteration and PolicyEvaluation. The constructor for ValueIteration takes as input an MDP and a discount factor; the constructed class then includes a field policy containing the optimal policy (an array mapping states to actions), and another field utility (an array mapping states to real numbers). The other two classes are set up similarly.
You may wish to add other fields that are returned as part of the class, such as the number of iterations until convergence. You also may add other constructors or methods, provided that these are in addition to the ones given in the template files (which are the ones that will be automatically tested). In particular, your code must work properly when run with the main provided in RunCatMouse.java.
Your implementation of policy evaluation should compute the
utility of a given policy for every possible starting state s. Unlike the
description of policy evaluation given in R&N, the version that you are asked to
write should not take as input a starting estimate of the utility. (As
noted above, you may add a constructor or method that takes such a starting
estimate, but this must be in addition to the constructor that we are providing
in the template file.) Your implementation should compute an estimate of the
utility of the given policy that is at least as accurate as specified below.
As discussed in class and in R&N, there are at least two techniques for
evaluating a policy. In the first approach, we use methods from linear algebra
to solve a set of linear equations that are a simplified form of the Bellman
equations. In the second approach, we use an iterative procedure to solve these
equations that is essentially a simplified form of value iteration; thus,
successive estimates of the utility are repeatedly plugged into the (simplified) Bellman
equations to obtain new estimates, a process that terminates when the change in
the estimates is small enough to guarantee that the utility estimates will meet
the specified accuracy requirements. Because the linear-algebraic approach
is fairly expensive, you should use the second approach on this assignment.
Your policy-iteration code should make use of your implementation of policy
evaluation, on each iteration obtaining an exact (up to the specified accuracy)
estimate of the utility of the current policy.
To clarify how this relates to the presentation of policy iteration in R&N:
The book describes both "standard" policy iteration, and a modification called
"modified" policy iteration. The distinction is in how policy evaluation is
implemented -- in the standard version, policies are evaluated exactly, while in
the modified version, they are only approximately evaluated using just a few
steps of simplified value iteration. For this assignment, your implementation
should be a version of standard policy iteration in the sense that you should
evaluate policies (nearly) exactly. However, to do so, you should be using the
iterative approach described above; this amounts to a procedure similar to
modified policy iteration except that, in evaluating a policy, instead of
running for a fixed number of simplified value iteration steps, your code should run until the
change in successive approximations of the policy's utility is sufficiently
small.
The utilities computed by your implementation of value iteration and policy evaluation should, at every state, be within 10-8 (1e-8) of the true utility for that state. In other words, if utility is the array of (estimated) utility values returned by ValueIteration, then for every state s, utility[s] must be within 10-8 of the true optimal utility at state s. Likewise, for PolicyEvaluation, utility[s] must be within 10-8 of the utility of state s under the given policy.
The final code that you turn in should not write anything to standard output or standard error. Your code must work properly when run with the provided classes on the provided MDP's.
Using this DropBox link, you should turn in the following:
We will automatically test your code on MDP's other than those provided with this assignment. Your grade will depend largely on getting the right answers. In addition, your code should be sensibly designed for reasonable efficiency (for instance, it should run in under a minute on a fast machine on an MDP like those provided), and it should not terminate prematurely with an exception (unless given bad data, of course); code that does otherwise risks getting no credit for the automatic testing portion of the grade. As usual, you should follow good programming practices, including documenting your code enough for the TA to be able to read and understand it.
Your program report should be should be clear, complete but concise, correct, thoughtful, critical, and perceptive.
This programming assignment will be worth about 75 points (about 45 based on automatic testing of your code, implementing the correct algorithms, proper documentation and good programming practice, and 30 for the program report).
Thanks to Dale Schuurmans from whom the main idea for this assignment was borrowed.