Your program should work as follows. At each time step, it should select an action according to its exploration policy, execute that action, observe the resulting state and reward, update its models of P(s'|s,a) and R(s'|s,a), push the predecessors of state s onto the priority queue, and perform n steps of prioritized sweeping.
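Here is a minimal sketch of that loop; the helper functions named here (ChooseAction, ExecuteAction, UpdateModel, PushPredecessors, PrioritizedSweep) are placeholders for whatever the provided code calls the corresponding routines:

    // Sketch of one step of the agent; helper names are illustrative only.
    void AgentStep(MDP &mdp, int s, int nSweeps)
    {
        int a = ChooseAction(mdp, s);            // exploration policy
        int sPrime;
        double r;
        ExecuteAction(mdp, s, a, sPrime, r);     // observe s' and reward r
        UpdateModel(mdp, s, a, sPrime, r);       // update P(s'|s,a) and R(s'|s,a)
        PushPredecessors(mdp, s);                // predecessors of s onto the queue
        PrioritizedSweep(mdp, nSweeps);          // n steps of prioritized sweeping
    }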
I had a discussion with Valentina Bayer, and we came up with the following suggestion for how to implement the counters necessary for representing the models. The idea is to add two fields:
    int ResultInfo::n;
    int SuccessorInfo::n;

To explain the meaning of these two counters, recall that each MDP defines two arrays indexed by state: one to store a list of successor information and one to store a list of predecessor information. Let s be a state, and let SI be an instance of the SuccessorInfo class for this state whose action is a (i.e., SI.action = a). SuccessorInfo::n will store the number of times that action a has been executed in state s. Now each SuccessorInfo record contains a list of ResultInfo records, one for each result state that has been observed so far. Let RI be one such record for a result state s'. Then ResultInfo::n will count the number of times that the environment made a transition to s' after executing a in s.
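For concreteness, here is one way the two classes might look with the new fields; the field names other than n are assumptions and should be matched to the class definitions in the provided code:

    #include <list>

    // Sketch of the model data structures; only the n fields are new.
    class ResultInfo {
    public:
        int    state;         // result state s'
        double probability;   // current estimate of P(s'|s,a)
        double reward;        // current estimate of R(s'|s,a)
        int    n;             // times s' has been observed after executing a in s
    };

    class SuccessorInfo {
    public:
        int                   action;   // action a
        int                   n;        // times a has been executed in state s
        std::list<ResultInfo> results;  // one record per observed result state
    };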
To keep the model up-to-date, we must keep the two fields ResultInfo::probability and ResultInfo::reward current. We can do this by using incremental updates as follows. Every time we execute an action, we will find the matching SuccessorInfo record (or create a new one, if this is the first time we have executed this action in this state). Then, we will increment SuccessorInfo::n. Next, we observe the resulting state s' and immediate reward r, and find the corresponding ResultInfo record (or create a new one, if we have never made a transition from s and a to s' before). We can increment ResultInfo::n and then do the following updates:
    probability = (probability * (SuccessorInfo::n - 1) + 1) / SuccessorInfo::n;
    reward = ((reward * (ResultInfo::n - 1)) + r) / ResultInfo::n;

Note that, for the probabilities to continue summing to one, every other ResultInfo record in this SuccessorInfo's list should also be rescaled by (SuccessorInfo::n - 1) / SuccessorInfo::n. With these updates, the existing code for prioritized sweeping should work without any changes, because the probability and reward fields will always be up-to-date.
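Putting the pieces together, a sketch of the whole model update for one observed transition might look like this, building on the structures sketched above (the FindOrCreate helpers are assumptions; new records are assumed to start with n, probability, and reward equal to zero):

    // Sketch of the model update for one observed transition (s, a, s', r).
    void UpdateModel(MDP &mdp, int s, int a, int sPrime, double r)
    {
        SuccessorInfo &SI = FindOrCreateSuccessor(mdp, s, a);   // assumed helper
        SI.n++;
        ResultInfo &RI = FindOrCreateResult(SI, sPrime);        // assumed helper
        RI.n++;
        // Rescale every ResultInfo record so the probabilities still sum to one...
        for (std::list<ResultInfo>::iterator it = SI.results.begin();
             it != SI.results.end(); ++it)
            it->probability = it->probability * (SI.n - 1) / SI.n;
        // ...then credit the new observation to the matching record.  This is
        // equivalent to probability = (probability * (SI.n - 1) + 1) / SI.n.
        RI.probability += 1.0 / SI.n;
        RI.reward = (RI.reward * (RI.n - 1) + r) / RI.n;
    }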
We also need to consider how to update the inverse model (the PredecessorInfo records). One approach is to look up state s', search its predecessor list for the record whose state is s and whose action is a, and copy the updated probability field into that record.
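A sketch of that lookup, assuming the PredecessorInfo class stores a state, an action, and a probability field (adapt the names to the provided code):

    // Sketch: propagate the updated probability into the inverse model.
    void UpdatePredecessor(MDP &mdp, int s, int a, int sPrime, double newProb)
    {
        // mdp.predecessors[sPrime] is assumed to be the list of
        // PredecessorInfo records for state s'.
        std::list<PredecessorInfo> &preds = mdp.predecessors[sPrime];
        for (std::list<PredecessorInfo>::iterator it = preds.begin();
             it != preds.end(); ++it) {
            if (it->state == s && it->action == a) {
                it->probability = newProb;   // copy the updated estimate
                return;
            }
        }
        // If no matching record exists yet, create one here instead.
    }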
There is an opportunity for a round-off error every time we make one of these updates, so we should probably change probability and reward to be double-precision instead of single-precision floating point numbers.
Your program should permit the following to be specified at the command line (or interactively):
Every time your program completes a trial, it should print out the number of primitive actions (which you should keep track of using the global variable ActionCounter) and the number of episodes. This will make it possible for you to create a plot like that of Figure 6.11.
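For example, each trial's output could be a single line with two columns, which is easy to plot (EpisodeCounter is an assumed variable; ActionCounter is the global specified above):

    // Two columns per trial: total primitive actions so far, episodes completed.
    printf("%d %d\n", ActionCounter, EpisodeCounter);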
Turn in the following: