Homework #6
Fall 2010
As usual, it is strongly preferred that you submit the written parts of this assignment in hard copy as described on the assignments page. However, for this homework only, if it is impossible for you to turn in hard copy because you are traveling and out of the Princeton area over break, you may choose instead to submit the written parts of the assignment using either of the options described below. In either case, please keep in mind that the written exercises must be prepared separately from the program report so that they can be graded separately.
In this assignment, you will use some of the algorithms for computing optimal policies in Markov decision processes (MDP's) to help a mouse escape from a cat while trying to find cheese to eat.
The main components of the assignment are the following:
The first part of the assignment is to implement three of the algorithms that were discussed in lecture as well as R&N (sections 17.2 and 17.3): value iteration, policy iteration and policy evaluation. Your implementations should work for any (reasonable) MDP, not just the ones provided. See below for a description of where you should write your code.
The second part of the assignment is to use your code to compute optimal policies for a mouse who lives with a cat in the following grid world:
The mouse (m) can occupy any of the 31 blank squares. The cat (C) also can occupy any square, except for square (6,1) which is the mouse's hole (too small for the cat to squeeze in). There is cheese (z) that usually can be found in one of the following three squares: (2,3), (4,1) and (7,3). Occasionally, the cheese disappears altogether. Thus, this MDP has 31*30*4=3720 states.
The cat and the mouse can each move one square in any direction -- vertically, horizontally or diagonally. They also can choose not to move at all. Thus, there are nine possible moves from each square. If an action is attempted that causes the creature to bump into a wall, then it simply stays where it is.
In this problem, we will always take the point of view of the mouse. When the mouse is on the cheese, it receives a reward of +1. When the cat is on the mouse, it (the mouse) receives a reward of -21 (ouch). When the cat is on the mouse, and the mouse is on the cheese, the reward is -20. All other configurations have a reward of 0. Thus, the mouse is trying to eat cheese while simultaneously avoiding the cat.
We will consider three different cats. The first cat, poor thing, is blind and oblivious, and simply wanders randomly around its environment, choosing randomly among its nine available actions at every step. The second cat is hungry, alert and unrelenting. This cat always heads straight for the mouse following the shortest path possible. Thus, after the mouse makes its move, the cat chooses the action that will move it as close as possible to the mouse's new position. (If there is a tie among the cat's best available options, the cat chooses randomly among these equally good best actions.) However, when the mouse is in its hole and out of sight, the cat reverts to aimless (i.e., random) wandering. The third cat is also alert, but has a more sporting disposition, and therefore follows a combination of these two strategies: half the time, it wanders aimlessly, and half the time, it heads straight for the mouse. Machine-readable (and compressed) descriptions of the three MDP's corresponding to these three cats are provided in the files oblivious.mdp.gz, unrelenting.mdp.gz and sporting.mdp.gz.
If there is cheese available, it will always be in exactly one of the three locations listed above. At every time step, the cheese remains where it is with 90% probability, and with 10% probability, it vanishes altogether. Once it has vanished, the cheese remains hidden at each step with probability 70%, and with 30% probability, it reappears randomly in one of the three designated locations.
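As an aside, the cheese dynamics above form a two-state Markov chain (present/absent) all on their own; iterating that chain shows that, in the long run, cheese is available three quarters of the time. A minimal sketch (the class and method names here are invented for illustration, not part of the provided code):

```java
// Sketch: the cheese's presence evolves as a two-state Markov chain,
// staying present with probability 0.9 and reappearing (from absent)
// with probability 0.3. Iterating the chain gives its long-run behavior.
public class CheeseChain {
    static double stationaryPresent() {
        double present = 1.0, absent = 0.0;  // start with cheese present
        for (int t = 0; t < 1000; t++) {
            double p = 0.9 * present + 0.3 * absent;
            double a = 0.1 * present + 0.7 * absent;
            present = p;
            absent = a;
        }
        return present;  // approaches 0.75 (= 0.3 / (0.1 + 0.3))
    }

    public static void main(String[] args) {
        System.out.println(stationaryPresent());
    }
}
```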
States are encoded as six-tuples, the first two numbers indicating the position of the mouse, the second two the position of the cat, and the last two the position of the cheese. Thus, 4:2:2:4:7:3 indicates, as depicted in the figure above, that the mouse is in (4,2), the cat is in (2,4), and the cheese is in (7,3). When there is no cheese, the last two coordinates are empty; for instance, 4:2:2:4:: would encode the same state as above with the cheese missing.
The cat and the mouse alternate moves. However, in encoding the MDP, we collapse both moves into a single state transition. In addition, the cheese, when it vanishes or reappears, does so simultaneously with the mouse's move. For instance, from the configuration above, if the mouse moves to (5,2) and the cat responds by moving to (3,3), while the cheese vanishes, this all would be encoded as a single transition from state 4:2:2:4:7:3 to 5:2:3:3::. Actions in the MDP refer to actions that the mouse can make; the cat's actions are effectively "hard-wired" into the dynamics of the MDP itself.
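If you ever need to take state names apart yourself, note that the empty cheese coordinates make naive string splitting error-prone. A hypothetical helper (not one of the provided classes) might parse state names like this:

```java
// Sketch: splitting a state name such as "4:2:2:4:7:3" into its six
// fields. The -1 limit to split is important: without it, Java discards
// trailing empty strings, so "4:2:2:4::" would yield only four fields.
public class StateName {
    static String[] fields(String name) {
        String[] parts = name.split(":", -1);
        if (parts.length != 6)
            throw new IllegalArgumentException("bad state name: " + name);
        return parts;
    }

    public static void main(String[] args) {
        String[] s = fields("4:2:2:4:7:3");
        System.out.println("mouse (" + s[0] + "," + s[1]
                           + "), cat (" + s[2] + "," + s[3] + ")");
    }
}
```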
For each of the three cats, your job will be to compute the mouse's optimal policy, i.e., the action that should be taken at each state to maximize the mouse's expected discounted reward, where we fix the discount factor (gamma) to be 0.95. You can then watch the cat and the mouse go at it using an animator that we are providing.
Once you have everything working, you should take some time to observe and understand the mouse's behavior in each of the three cases. Then briefly write up your observations in 3-6 paragraphs (all together, not per case). As on previous homeworks, you can make observations that are either anecdotal or quantitative or both. Think about questions like the following suggestions: How is the mouse behaving in each case and why does this behavior make sense? What sorts of intelligent behavior appear to be emerging? Is the behavior what you would have expected, or are there aspects of the mouse's behavior that you found surprising? What happens if you take the mouse's optimal policy for cat A and have the mouse follow that policy when playing against cat B? Since there are three cats, there are nine possibilities here. You might construct a matrix showing the utility (say, averaged over all possible starting states), for all nine possibilities. Which policy should the mouse follow if it doesn't know which cat it's up against? Why?
In addition to exploring the mouse's optimal behavior, you should also explore and discuss the performance of the algorithms themselves. For instance, how many iterations does it take to reach convergence? (There is more than one way to measure convergence.) How is performance affected if we stop iterating one of the algorithms before we reach convergence? What happens if we vary gamma? Which algorithm seems faster and why?
We are providing a class called Mdp that loads a description of an MDP from a file whose format is described shortly. The class has a constructor taking a file name as argument that reads in the MDP description from the file. Each state is represented by an integer in the range 0 (inclusive) to numStates (exclusive), where numStates is the total number of states. Actions are represented in a similar fashion. The stateName and actionName arrays convert these integers back to strings in the obvious fashion. The reward array gives the immediate reward of each state. For efficiency, state transition probabilities are represented in a sparse fashion to take advantage of the fact that most states have a relatively small number of possible successors. In particular, the array nextState[s][a] represents a list of possible next states which may follow state s under action a. The understanding here is that the probability of transitioning to any other state is zero. There is a corresponding list of probabilities transProb[s][a]. Thus, transProb[s][a][i] is the probability of transitioning from s to nextState[s][a][i] under action a.
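With this sparse layout, the expected-utility sum in a Bellman backup only ever touches the listed successors. As a rough sketch (the array fields follow the description above, but the sweep method and the two-state MDP in demo() are invented for illustration):

```java
// Sketch: one value-iteration sweep over the sparse representation
// described above (reward, nextState, transProb laid out as in the Mdp
// class). Only listed successors contribute; all others have probability 0.
public class BellmanSweep {

    // newU[s] = R(s) + gamma * max_a sum_i transProb[s][a][i] * u[nextState[s][a][i]]
    static double[] sweep(double[] reward, int[][][] nextState,
                          double[][][] transProb, double gamma, double[] u) {
        double[] newU = new double[reward.length];
        for (int s = 0; s < reward.length; s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < nextState[s].length; a++) {
                double q = 0.0;
                for (int i = 0; i < nextState[s][a].length; i++)
                    q += transProb[s][a][i] * u[nextState[s][a][i]];
                best = Math.max(best, q);
            }
            newU[s] = reward[s] + gamma * best;
        }
        return newU;
    }

    // Toy two-state MDP: state 0 (reward 1) loops on itself; state 1
    // (reward 0) moves deterministically to state 0. With gamma = 0.9 the
    // utilities converge to 10 and 9.
    static double[] demo() {
        double[] reward = {1.0, 0.0};
        int[][][] nextState = {{{0}}, {{0}}};       // one action per state
        double[][][] transProb = {{{1.0}}, {{1.0}}};
        double[] u = new double[2];
        for (int t = 0; t < 500; t++)
            u = sweep(reward, nextState, transProb, 0.9, u);
        return u;
    }

    public static void main(String[] args) {
        double[] u = demo();
        System.out.println(u[0] + " " + u[1]);
    }
}
```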
Each line of an MDP description file has one of three possible formats. The first possibility is that the line has the simple form
<state>
where <state> is any name (no white space). Such a line indicates that <state> is the start state of the MDP.
The second possibility is that the line has the form
<state> <reward>
where <state> is any name, and <reward> is a number. Such a line indicates that the reward in <state> is <reward>.
The last possibility is that the line has the form
<from_state> <action> <to_state> <prob>
where the first three tokens are names, and <prob> is a nonnegative number. Such a line indicates that the probability of transitioning from <from_state> to <to_state> under action <action> is <prob>. Multiple lines of this form with the same <from_state> and <action> can be combined. Thus,
u a v 0.3 u 0.4 w 0.1
is equivalent to the three lines
u a v 0.3
u a u 0.4
u a w 0.1
Lines of these forms can appear in the file in any order. However, if a reward is given more than once for the same state, then the last reward value given is assigned to that state. On the other hand, if more than one probability is given for the same state-action-state triple, then the sum of these probabilities is assigned to this transition. If the probabilities of transitioning out of a given state under a given action do not add up to one, then these numbers are renormalized so that they do. However, if this sum is zero, an exception is thrown. States that are not given a reward in the description file are assumed to have a reward of zero; likewise, transitions that do not appear in the file are assumed to have zero probability. An exception is also thrown if no start state is declared (using the first form above); if more than one such line appears, then the last one is taken to define the start state.
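The renormalization rule can be pictured with a small sketch (the normalize helper below is hypothetical, not part of the provided Mdp class; duplicate triples are assumed to have been summed already):

```java
// Sketch of the renormalization rule described above: once probabilities
// for a given (from_state, action) pair have been summed per destination,
// rescale them so they total one, throwing if the total is zero (as the
// loader is described to do).
public class Renormalize {
    static double[] normalize(double[] probs) {
        double sum = 0.0;
        for (double p : probs) sum += p;
        if (sum == 0.0)
            throw new IllegalArgumentException("zero total transition probability");
        double[] out = new double[probs.length];
        for (int i = 0; i < probs.length; i++) out[i] = probs[i] / sum;
        return out;
    }

    public static void main(String[] args) {
        // 0.3 + 0.4 + 0.1 = 0.8, so each entry is scaled by 1/0.8,
        // giving approximately 0.375, 0.5 and 0.125.
        double[] p = normalize(new double[] {0.3, 0.4, 0.1});
        System.out.println(p[0] + " " + p[1] + " " + p[2]);
    }
}
```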
For instance, the following encodes an MDP with five states: -2, -1, 0, +1 and +2. There are two actions, L(eft) and R(ight), which cause the MDP to move to the next state to the left or right 90% of the time, or to move in the opposite direction 10% of the time (with movement impeded at the end points -2 and +2). The start state is 0. The reward is 1 in 0, -1 in -2, and -2 in +2.
0
0 1
0 L -1 0.9 +1 0.1
0 R +1 0.9 -1 0.1
-1 L -2 0.9 0 0.1
-1 R 0 0.9 -2 0.1
-2 -1
-2 L -2 0.9 -1 0.1
-2 R -1 0.9 -2 0.1
+1 R +2 0.9 0 0.1
+1 L 0 0.9 +2 0.1
+2 -2
+2 R +2 0.9 +1 0.1
+2 L +1 0.9 +2 0.1
If you want to try out or test your program on MDP's of your own invention (which is highly recommended for debugging), you can do so by creating files of this form. Alternatively, you can create a subclass of the Mdp class, and write code that fills in the public fields with a description of your MDP.
Because the files being provided for the cat and mouse MDP's are quite large, we are providing them in a compressed (gzipped) form. Note that the Mdp class will automatically read a file that has been gzipped, provided that it ends with a ".gz" suffix. In other words, there is no need to uncompress the gzipped files that we are providing prior to running your code on them. (Of course, the code will also work on files that have not been compressed.) In addition, because these MDP's are somewhat large, you may need to increase the memory available to Java (the option for doing this is system dependent, but it often has the form -Xmx256m, which increases the memory to 256 megabytes).
The class CatMouseAnimator can be used to animate a given policy on one of the cat-and-mouse MDP's (and will surely crash if provided with any other kind of MDP). More precisely, the animator will animate a sequence of states provided to it by an object implementing the MdpSimulator interface; typically, these state sequences will be generated according to a fixed policy using the FixedPolicySimulator class.
The method animateGuiOnly invokes a graphical user interface (GUI) for this purpose showing the behavior of the cat and mouse in a graphical depiction of their grid-world, similar to the one above. Buttons are provided for starting or stopping the animation (at a speed that can be controlled), or for stepping through the animation one step at a time.
The method animateGuiAndPrint invokes the GUI as well, but also prints the state of the MDP at every step. The printed form of each state looks something like the following:
+===============+
| . . . . . . . |
| . C . . . . |
| . . . . . . z |
| . . . m . |
| . . . . . _ |
+===============+
As in the GUI, the letters m, C and z indicate the positions of the mouse, cat and cheese. The mouse's hole is indicated with an underscore. In both the printed and GUI depictions, the mouse vanishes when the cat is on top of it; likewise, the cheese can be obscured by either the cat or the mouse (although the mouse changes from a lowercase m to an uppercase M when it is eating).
A final method animatePrintOnly prints the successive states of the MDP, but does not invoke the GUI.
The class RunCatMouse consists only of a main that loads an MDP from a file, finds and prints the optimal policy (using both value and policy iteration) and runs the animator (depending on command line arguments -- see the documentation). Before starting the animator, four columns are printed following each state: the optimal action under the policy found using value iteration; the optimal action under the policy found using policy iteration; the optimal utility as estimated using value iteration; and the estimated utility of the optimal policy using policy evaluation. For instance, on the simple five-state MDP given above, the output should look something like this:
Optimal policies:
0 L L 8.169018052871904 8.169018052871904
-1 R R 7.5578670677920545 7.5578670677920545
+1 L L 7.442544315827016 7.442544315827016
-2 R R 6.035332975545492 6.035332975545492
+2 L L 4.821409270650342 4.821409270650342
In this case, as expected, the optimal policy says to move to the center (right if in -1 or -2, and left if in +1 or +2), and to move left (away from the larger penalty state +2) if in the center.
You might well decide to modify or replace the RunCatMouse class. However, your code should still work properly when run using the provided version.
To debug your code, you will surely want to run it on small MDP's. There are a number of "sanity checks" that can be done to see if your code is working. For instance, value iteration and policy iteration should output identical policies (up to possible ties between equally good actions). In addition, the utility of the optimal policy must satisfy the Bellman equations.
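The Bellman-equation check can be automated along the following lines (a sketch only: the array layout follows the Mdp description, but the method name and the toy MDP in main are invented):

```java
// Sketch: given a candidate utility array u, compute the largest violation
// of U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s'). For the true
// optimal utilities this residual should be numerically zero.
public class BellmanCheck {
    static double maxResidual(double[] reward, int[][][] nextState,
                              double[][][] transProb, double gamma, double[] u) {
        double worst = 0.0;
        for (int s = 0; s < reward.length; s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < nextState[s].length; a++) {
                double q = 0.0;
                for (int i = 0; i < nextState[s][a].length; i++)
                    q += transProb[s][a][i] * u[nextState[s][a][i]];
                best = Math.max(best, q);
            }
            worst = Math.max(worst, Math.abs(u[s] - (reward[s] + gamma * best)));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Toy two-state MDP: state 0 (reward 1) loops on itself; state 1
        // (reward 0) jumps to state 0. With gamma = 0.9, (10, 9) is the
        // exact fixed point, so the residual is zero.
        double[] reward = {1.0, 0.0};
        int[][][] nextState = {{{0}}, {{0}}};
        double[][][] transProb = {{{1.0}}, {{1.0}}};
        System.out.println(maxResidual(reward, nextState, transProb, 0.9,
                                       new double[] {10.0, 9.0}));
    }
}
```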
All code and data files can be obtained from this directory, or all at once from this zip file. Data is included in the data subdirectory.
Documentation on the provided Java classes is available here.
As a general reminder, you should not modify any of the code that we are providing, other than template files, or unless specifically allowed in the instructions. You also should not change the "signature" of any of the methods or constructors that you are required to write, i.e., the parameters of the method (or constructor), their types, and the type of the value returned by the method.
Your job is to fill in the constructors in the three classes ValueIteration, PolicyIteration and PolicyEvaluation. The constructor for ValueIteration takes as input an MDP and a discount factor; the constructed class then includes a field policy containing the optimal policy (an array mapping states to actions), and another field utility (an array mapping states to real numbers). The other two classes are set up similarly.
You may wish to add other fields that are returned as part of the class, such as the number of iterations until convergence. You also may add other constructors or methods, provided that these are in addition to the ones given in the template files (which are the ones that will be automatically tested). In particular, your code must work properly when run with the main provided in RunCatMouse.java.
Note that, unlike the description of policy evaluation given in R&N, the version that you are asked to write does not take as input a starting estimate of the utility at every state. As noted above, you may add a constructor or method that takes such a starting estimate, but this must be in addition to the constructor that we are providing in the template file. Also, to evaluate a policy, you should use the iterative approach described in class, which is the same as what R&N calls "modified policy iteration." (Even though the version in the book is only run for a fixed number of iterations, I would recommend instead checking the difference between successive approximations, as in value iteration.)
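In that spirit, iterative evaluation of a fixed policy might be sketched as follows (the array layout follows the Mdp description, but the class name, the stopping tolerance and the toy MDP in main are our own choices, not the required interface):

```java
// Sketch: iterative policy evaluation. Repeatedly apply
// U(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') and stop when the
// largest change between successive approximations falls below a
// tolerance, as in value iteration, rather than running a fixed number
// of iterations.
public class IterativePolicyEval {
    static double[] evaluate(double[] reward, int[][][] nextState,
                             double[][][] transProb, int[] policy,
                             double gamma, double tol) {
        double[] u = new double[reward.length];
        double delta;
        do {
            delta = 0.0;
            double[] newU = new double[u.length];
            for (int s = 0; s < u.length; s++) {
                int a = policy[s];  // the fixed policy picks the action
                double sum = 0.0;
                for (int i = 0; i < nextState[s][a].length; i++)
                    sum += transProb[s][a][i] * u[nextState[s][a][i]];
                newU[s] = reward[s] + gamma * sum;
                delta = Math.max(delta, Math.abs(newU[s] - u[s]));
            }
            u = newU;
        } while (delta > tol);
        return u;
    }

    public static void main(String[] args) {
        // Toy two-state MDP: state 0 (reward 1) loops; state 1 (reward 0)
        // moves to state 0. The utilities approach 10 and 9 for gamma = 0.9.
        double[] u = evaluate(new double[] {1.0, 0.0},
                              new int[][][] {{{0}}, {{0}}},
                              new double[][][] {{{1.0}}, {{1.0}}},
                              new int[] {0, 0}, 0.9, 1e-12);
        System.out.println(u[0] + " " + u[1]);
    }
}
```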
The final code that you turn in should not write anything to standard output or standard error. Your code must work properly when run with the provided classes on the provided MDP's.
Using this DropBox link, you should turn in the following:
In addition, you should turn in your program report exploring the cat-and-mouse environments as described above. This written work should be submitted in hard copy in the manner explained on the assignments page.
We will automatically test your code on MDP's other than those provided with this assignment. Your grade will depend largely on getting the right answers. In addition, your code should be efficient enough to run reasonably fast (under a minute or two on a fast machine on an MDP like those provided), and should not terminate prematurely with an exception (unless given bad data, of course); code that does otherwise risks getting no credit for the automatic testing portion of the grade. As usual, you should follow good programming practices, including documenting your code.
Your program report should be clear, concise, thoughtful and perceptive.
This programming assignment will be worth about 75 points (about 45 based on automatic testing of your code, implementing the correct algorithms, proper documentation and good programming practice, and 30 for the program report).
Thanks to Dale Schuurmans from whom the main idea for this assignment was borrowed.