Your program should work as follows. At each time, it should select an action according to its exploration policy, execute the action, observe the resulting state and reward, update its models of P(s'|s,a) and R(s'|s,a), push the predecessors of state s onto the priority queue, and perform n steps of prioritized sweeping.
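The per-step loop described above can be sketched as a skeleton like the following. Every function body here is a stub and all names except ActionCounter (which the assignment mentions) are assumptions; the point is only the order of operations:

```cpp
#include <cstdio>

static int ActionCounter = 0;                 // global count of primitive actions

int selectAction(int s) { return 0; }         // stub: exploration policy
int executeAction(int s, int a, double* r) {  // stub: environment step
    *r = 0.0; return s;
}
void updateModel(int s, int a, int sPrime, double r) {} // stub: P(s'|s,a), R(s'|s,a)
void pushPredecessors(int s) {}               // stub: seed the priority queue
void prioritizedSweep(int nSteps) {}          // stub: n prioritized-sweeping backups

// One time step of the agent, in the order the assignment specifies.
void step(int& s, int nSweeps) {
    int a = selectAction(s);
    ++ActionCounter;
    double r;
    int sPrime = executeAction(s, a, &r);
    updateModel(s, a, sPrime, r);
    pushPredecessors(s);
    prioritizedSweep(nSweeps);
    s = sPrime;
}
```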
I had a discussion with Valentina Bayer, and we came up with the following suggestion for how to implement the counters necessary for representing the models. The idea is to add two fields:
    int ResultInfo::n;
    int SuccessorInfo::n;

To explain the meaning of these two counters, recall that each MDP defines two arrays indexed by state: one to store a list of successor information and one to store a list of predecessor information. Let s be a state, and let SI be an instance of the SuccessorInfo class for this state whose action is a (i.e., SI.action = a). SuccessorInfo::n will store the number of times that action a has been executed in state s. Each SuccessorInfo record contains a list of ResultInfo records, one for each result state that has been observed so far. Let RI be one such record for a result state s'. Then ResultInfo::n will count the number of times that the environment made a transition to s' after executing a in s.
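In code, the two record types might look like the following sketch. Only the two n counters are specified in the handout; the other field and type names are assumptions:

```cpp
#include <vector>

// One entry per observed result state s' of a (state, action) pair.
struct ResultInfo {
    int state;          // the result state s'
    int n;              // times the environment moved to s' after (s, a)
    double probability; // current estimate of P(s'|s,a)
    double reward;      // current estimate of R(s'|s,a)
};

// One entry per action a that has been executed in a given state s.
struct SuccessorInfo {
    int action;                      // the action a
    int n;                           // times a has been executed in s
    std::vector<ResultInfo> results; // one ResultInfo per observed s'
};
```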
To keep the model up-to-date, we must keep the two fields, ResultInfo::probability and ResultInfo::reward, up-to-date. We can do this by using incremental updates as follows. Every time we execute an action, we will find the matching SuccessorInfo record (or create a new one, if this is the first time we have executed this action in this state). Then, we will increment SuccessorInfo::n. Next, we observe the resulting state s' and immediate reward r, and find the corresponding ResultInfo record (or create a new one, if we have never made a transition from s and a to s' before). We can increment ResultInfo::n and then do the following updates:

    probability = (probability * (SuccessorInfo::n - 1) + 1) / SuccessorInfo::n;
    reward = ((reward * (ResultInfo::n - 1)) + r) / ResultInfo::n;

With these updates, the existing code for prioritized sweeping should work without any changes, because the probability and reward fields will always be up-to-date.
We also need to consider how to update the inverse model (the PredecessorInfo records). One approach is to look up the state s', search its predecessor list for state s and action a, and copy the updated probability field into that list.
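The copy-based approach might look like this sketch; the PredecessorInfo layout and the function name are assumptions, not part of the handout:

```cpp
#include <vector>

// One entry per (s, a) pair observed to lead to a given state s'.
struct PredecessorInfo {
    int state;          // predecessor state s
    int action;         // action a taken in s
    double probability; // copy of the updated P(s'|s,a)
};

// preds is the predecessor list stored for state s'.
void updateInverseModel(std::vector<PredecessorInfo>& preds,
                        int s, int a, double newProbability) {
    for (PredecessorInfo& PI : preds) {
        if (PI.state == s && PI.action == a) {
            PI.probability = newProbability; // copy the updated field
            return;
        }
    }
    // First observed transition from (s, a) to s': add a new record.
    preds.push_back({s, a, newProbability});
}
```

An alternative design choice would be to store a pointer into the forward model instead of a copy, which avoids the extra search at the cost of tighter coupling between the two lists.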
There is an opportunity for round-off error every time we make one of these updates, so we should probably change the probability and reward fields to be double precision instead of single precision floating point numbers.
Your program should permit the following to be specified at the command line (or interactively):
Every time your program completes a trial, it should print out the number of primitive actions (which you should keep track of using the global variable ActionCounter) and the number of episodes. This will make it possible for you to create a plot like that of Figure 6.11.
Turn in the following: