Homework #6
Fall 2003
In this assignment, you will use some of the algorithms for computing optimal policies in Markov decision processes (MDP's) to help a mouse escape from a cat while trying to find cheese to eat.
The main components of the assignment are the following:
The first part of the assignment is to implement three of the algorithms that were discussed in lecture as well as R&N (sections 17.2 and 17.3): value iteration, policy iteration and policy evaluation. Your implementations should work for any (reasonable) MDP, not just the ones provided. See below for a description of where you should write your code.
The second part of the assignment is to use your code to compute optimal policies for a mouse who lives with a cat in the following grid world:
The mouse (M) can occupy any of the 31 blank squares. The cat (C) also can occupy any square, except for square (6,1) which is the mouse's hole (too small for the cat to squeeze in). There is cheese (z) in squares (2,3) and (7,3) that never moves. Thus, this MDP has 31*30=930 states.
The cat and the mouse can each move one square in any direction -- vertically, horizontally or diagonally. They also can choose not to move at all. Thus, there are nine possible moves from each square. If an action is attempted that causes the creature to bump into a wall, then it simply stays where it is.
In this problem, we will always take the point of view of the mouse. When the mouse is on the cheese, it receives a reward of 1. When the cat is on the mouse, it (the mouse) receives a reward of -11. When the cat is on the mouse, and the mouse is on the cheese, the reward is -10. All other configurations have a reward of 0. Thus, the mouse is trying to eat cheese while simultaneously avoiding the cat.
We will consider three different cats. The first cat, poor thing, is blind and oblivious, and simply wanders randomly around its environment, choosing randomly among its nine available actions at every step. The second cat is hungry, alert and unrelenting. This cat always heads straight for the mouse, following the shortest path possible. Thus, after the mouse makes its move, the cat chooses the action that will move it as close as possible to the mouse's new position. (If there is a tie among the cat's best available options, the cat chooses randomly among these equally good best actions.) However, when the mouse is in its hole and out of sight, the cat reverts to aimless (i.e., random) wandering. The third cat is also alert, but has a more sporting disposition, and therefore follows a combination of these two strategies: half the time, it wanders aimlessly, and half the time, it heads straight for the mouse. Machine-readable descriptions of the three MDP's corresponding to these three cats are provided in the files oblivious.mdp, unrelenting.mdp and sporting.mdp.
States are encoded as four-tuples, the first two numbers indicating the position of the mouse, and the second two the position of the cat. Thus, 2:2:5:2 indicates that the mouse is in (2,2) and the cat is in (5,2). The cat and the mouse alternate moves. However, in encoding the MDP, we collapse both moves into a single state transition. For instance, from the configuration above, if the mouse moves to (3,1) and the cat responds by moving to (4,1), this would be encoded as a single transition from state 2:2:5:2 to 3:1:4:1. Actions in the MDP refer to actions that the mouse can make; the cat's actions are effectively "hard-wired" into the dynamics of the MDP itself.
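If you find it useful in your write-up to group states by the mouse's or the cat's position, one way to recover the coordinates from a state name is sketched below. This is only a hypothetical helper (parseState is not part of the provided code), and it simply assumes the colon-separated four-tuple naming shown above.

// Hypothetical helper: split a state name such as "2:2:5:2" into its four
// coordinates. The first two numbers are the mouse's position and the last
// two are the cat's, as described above.
static int[] parseState(String stateName) {
    String[] parts = stateName.split(":");
    int[] coords = new int[4];
    for (int i = 0; i < 4; i++)
        coords[i] = Integer.parseInt(parts[i]);
    return coords;
}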
For each of the three cats, your job will be to compute the mouse's optimal policy, i.e., the action that should be taken at each state to maximize the mouse's expected discounted reward, where we fix the discount factor (gamma) to be 0.95. You can then watch the cat and the mouse go at it using a primitive animator that we are providing.
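For reference, the quantity being maximized is the standard expected discounted reward, and the optimal utilities satisfy the usual Bellman equation (this is the textbook formulation from R&N with rewards attached to states, not something specific to this assignment):

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s, \pi\right], \qquad
U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s'),$$

with gamma = 0.95, and with the optimal policy choosing a maximizing action in each state.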
Once you have everything working, you should take some time to understand the mouse's behavior in each of the three cases. Then briefly write up your observations in 2-4 paragraphs (all together, not per case). As on HW#4, you can make observations that are either anecdotal or quantitative or both. Think about questions like the following: How is the mouse behaving in each case and why does this behavior make sense? Is it what you would have expected, or are there aspects of the mouse's behavior that you found surprising? What happens if you take the mouse's optimal policy for cat A and have the mouse follow that policy when playing against cat B? Since there are three cats, there are nine possibilities here. Construct a matrix showing the utility (say, averaged over all possible starting states) for all nine possibilities. Which policy should the mouse follow if it doesn't know which cat it's up against? Why? You may also wish to explore the performance of the algorithms themselves. For instance, how many iterations does it take to reach convergence?
We are providing a class called Mdp that loads a description of an MDP from a file whose format is described shortly. The class has a constructor taking a file name as argument that reads in the MDP description from the file. Each state is represented by an integer in the range 0 (inclusive) to numStates (exclusive), where numStates is the total number of states. Actions are represented in a similar fashion. The stateName and actionName arrays convert these integers back to strings in the obvious fashion. The reward array gives the immediate reward of each state. For efficiency, state transition probabilities are represented in a sparse fashion to take advantage of the fact that most states have a relatively small number of possible successors. In particular, the array nextState[s][a] represents a list of possible next states which may follow state s under action a. The understanding here is that the probability of transitioning to any other state is zero. There is a corresponding list of probabilities transProb[s][a]. Thus, transProb[s][a][i] is the probability of transitioning from s to nextState[s][a][i] under action a.
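To make the sparse representation concrete, here is a minimal sketch of a one-step lookahead ("Bellman backup") written directly against arrays of the shape described above. The method name and the exact array types (int[][][] for nextState, double[][][] for transProb) are assumptions on our part; check the provided documentation for the actual declarations.

// Sketch: expected one-step value of taking action a in state s, given the
// current utility estimates U, using the sparse successor lists. Only the
// listed successors have nonzero probability.
static double backup(int s, int a, double[] reward, double gamma,
                     int[][][] nextState, double[][][] transProb, double[] U) {
    double expected = 0.0;
    for (int i = 0; i < nextState[s][a].length; i++)
        expected += transProb[s][a][i] * U[nextState[s][a][i]];
    return reward[s] + gamma * expected;
}

Both value iteration and policy iteration spend essentially all of their time in loops of this form, so the sparse representation is what keeps them fast on the 930-state cat-and-mouse MDP's.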
Each line of an MDP description file has one of two possible formats. The first possibility is that the line has the form
<state> <reward>
where <state> is any name (no white space), and <reward> is a number. Such a line indicates that the reward in <state> is <reward>.
The other possibility is that the line has the form
<from_state> <action> <to_state> <prob>
where the first three tokens are names, and <prob> is a nonnegative number. Such a line indicates that the probability of transitioning from <from_state> to <to_state> under action <action> is <prob>. Multiple lines of this form with the same <from_state> and <action> can be combined. Thus,
u a v 0.3 u 0.4 w 0.1
is equivalent to the three lines
u a v 0.3
u a u 0.4
u a w 0.1
Lines of these forms can appear in the file in any order. However, if a reward is given more than once for the same state, then the last reward value given is assigned to that state. On the other hand, if more than one probability is given for the same state-action-state triple, then the sum of these probabilities is assigned to this transition. If the probabilities of transitioning out of a given state under a given action do not add up to one, then these numbers are renormalized so that they do. (For instance, the probabilities 0.3, 0.4 and 0.1 in the example above sum to 0.8, so they would be renormalized to 0.375, 0.5 and 0.125.) However, if this sum is zero, an exception is thrown. States that are not given a reward in the description file are assumed to have a reward of zero; likewise, transitions that do not appear in the file are assumed to have zero probability. The first state to be assigned a reward is assumed to be the start state.
For instance, the following encodes an MDP with five states, -2, -1, 0, +1 and +2. There are two actions, L and R, which move one state down or up (respectively) 90% of the time, and leave the state unchanged 10% of the time. Executing R in state +2 or L in state -2 has no effect. The reward is 1 in state 0, -1 at the endpoints and 0 elsewhere. The start state is 0.
0 1
0 L -1 0.9 0 0.1
0 R +1 0.9 0 0.1
-1 L -2 0.9 -1 0.1
-1 R 0 0.9 -1 0.1
-2 -1
-2 L -2 0.9 -2 0.1
-2 R -1 0.9 -2 0.1
+1 R +2 0.9 +1 0.1
+1 L 0 0.9 +1 0.1
+2 -1
+2 R +2 0.9 +2 0.1
+2 L +1 0.9 +2 0.1
If you want to try out or test your program on MDP's of your own invention, you can do so by creating files of this form. Alternatively, you can add a new constructor to the Mdp class that will fill in the public fields with a description of your MDP.
The class CatMouseAnimator can be used to "animate" a policy on one of the cat-and-mouse MDP's. When the method animate is called with a policy, the policy is followed on the MDP provided at the time of construction. Each shot of the animation looks something like the following:
+===============+
| . . . . . . . |
| C . . . . . |
| . z . . . . z |
| . . . . M |
| . . . . . _ |
+===============+
The letters M, C and z indicate the position of the mouse, cat and cheese. The mouse's hole is indicated with an underscore. The mouse vanishes when the cat is on top of it; likewise, the cheese can be obscured by either the cat or the mouse.
The class CatMouse consists only of a main that loads an MDP, finds and prints the optimal policy, and runs the animator (unless the second command-line argument is omitted). Before starting the animator, four columns are printed following each state name: the optimal action under the policy found using value iteration; the optimal action under the policy found using policy iteration; the optimal utility as estimated using value iteration; and the estimated utility of that optimal policy as computed using policy evaluation. You will probably want to modify or replace this file.
To debug your code, you will surely want to run it on small MDP's. There are a number of "sanity" checks that can be done to see if your code is working. For instance, value iteration and policy iteration should output identical policies (up to possible ties between equally good actions). In addition, the utility of the optimal policy must satisfy the Bellman equation.
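For example, the Bellman check can be automated: given the utility array returned by value iteration, verify that every state's utility equals its best one-step backup to within a small tolerance. The sketch below makes the same assumptions about array types as before, and the method name and tolerance are ours.

// Sketch of a sanity check: does
//   U[s] = reward[s] + gamma * max_a sum_i transProb[s][a][i] * U[nextState[s][a][i]]
// hold (approximately) for every state?
static boolean satisfiesBellman(double[] U, double[] reward, double gamma,
                                int[][][] nextState, double[][][] transProb,
                                double tol) {
    for (int s = 0; s < U.length; s++) {
        double best = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < nextState[s].length; a++) {
            double q = 0.0;
            for (int i = 0; i < nextState[s][a].length; i++)
                q += transProb[s][a][i] * U[nextState[s][a][i]];
            best = Math.max(best, reward[s] + gamma * q);
        }
        if (Math.abs(U[s] - best) > tol) return false;
    }
    return true;
}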
Your job is to fill in the constructors in the three classes ValueIter, PolicyIter and PolicyEval. The constructor for ValueIter takes as input an MDP and a discount factor; the constructed class then includes a field policy containing the optimal policy (an array mapping states to actions), and another field utility (an array mapping states to real numbers). The other two classes are set up similarly. You may wish to add other fields that are returned as part of the class, such as the number of iterations until convergence.
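To make the expected behavior of ValueIter concrete, here is a minimal sketch of the core value-iteration loop, written against plain arrays rather than the provided Mdp class. The method name, the array types, and the stopping rule (iterate until the largest change in utility drops below some epsilon) are all our own choices, not taken from the provided code; in your actual ValueIter constructor you would store the final utilities and greedy actions in the utility and policy fields instead of returning them.

// Sketch of value iteration over the sparse representation described above.
// Assumes every state has at least one action with a nonempty successor list.
static int[] valueIteration(double[] reward, double gamma,
                            int[][][] nextState, double[][][] transProb,
                            double epsilon) {
    int numStates = reward.length;
    double[] U = new double[numStates];   // current utility estimates
    int[] policy = new int[numStates];    // greedy action for each state
    double maxChange;
    do {
        double[] newU = new double[numStates];
        maxChange = 0.0;
        for (int s = 0; s < numStates; s++) {
            double best = Double.NEGATIVE_INFINITY;
            int bestAction = 0;
            for (int a = 0; a < nextState[s].length; a++) {
                // One-step backup of action a using the sparse successor lists.
                double q = 0.0;
                for (int i = 0; i < nextState[s][a].length; i++)
                    q += transProb[s][a][i] * U[nextState[s][a][i]];
                q = reward[s] + gamma * q;
                if (q > best) { best = q; bestAction = a; }
            }
            newU[s] = best;
            policy[s] = bestAction;
            maxChange = Math.max(maxChange, Math.abs(newU[s] - U[s]));
        }
        U = newU;
    } while (maxChange > epsilon);
    return policy;
}

Policy iteration can reuse the same inner backup: repeatedly evaluate the current policy (which is exactly what PolicyEval computes) and then improve it greedily, stopping when the policy no longer changes.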
The final code that you turn in should not write anything to standard output or standard error.
All code and data files can be obtained from this directory, or all at once from this tar file. Documentation is available here.
Using whiteboard, you should turn in your code, including the completed ValueIter, PolicyIter and PolicyEval classes along with any other code you modified.
In addition, you should turn in your write-up exploring the cat-and-mouse environments as described above. This written work should be handed in with the written exercises.
We will automatically test your code on MDP's other than those provided with this assignment. Your grade will depend largely on getting the right answers. In addition, your code should be efficient enough to run reasonably fast (easily under a minute on an MDP like those provided on a 2.4GHz machine), and should not terminate prematurely with an exception (unless given bad data, of course); code that does otherwise risks getting no credit for the automatic testing portion of the grade. As usual, you should follow good programming practices, including documenting your code.
Your write-up should be clear, concise, thoughtful and perceptive.
This programming assignment will be worth about 70 points (about 35 based on automatic testing of your code and on implementing the correct algorithms, 10 for documentation and good programming practice, and 25 for the write-up).
Thanks to Dale Schuurmans from whom this assignment was liberally borrowed.