Your program should work as follows. At each time step, it should select an action according to its exploration policy, execute that action, observe the resulting state and reward, update its models of P(s'|s,a) and R(s'|s,a), push the predecessors of state s onto the priority queue, and perform n steps of prioritized sweeping.
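Here is a minimal sketch of that loop; the helper functions named here (ChooseAction, ExecuteAction, UpdateModel, PushPredecessors, PrioritizedSweep) are placeholders for whatever the provided code calls the corresponding routines:

    // Sketch of one step of the agent; helper names are illustrative only.
    void AgentStep(MDP &mdp, int s, int nSweeps)
    {
        int a = ChooseAction(mdp, s);            // exploration policy
        int sPrime;
        double r;
        ExecuteAction(mdp, s, a, sPrime, r);     // observe s' and reward r
        UpdateModel(mdp, s, a, sPrime, r);       // update P(s'|s,a) and R(s'|s,a)
        PushPredecessors(mdp, s);                // predecessors of s onto the queue
        PrioritizedSweep(mdp, nSweeps);          // n steps of prioritized sweeping
    }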
I had a discussion with Valentina Bayer, and we came up with the following suggestion for how to implement the counters necessary for representing the models. The idea is to add two fields:
    int ResultInfo::n;
    int SuccessorInfo::n;

To explain the meaning of these two counters, recall that each MDP defines two arrays indexed by state: one to store a list of successor information and one to store a list of predecessor information. Let s be a state, and let SI be an instance of the SuccessorInfo class for this state whose action is a (i.e., SI.action = a). SuccessorInfo::n will store the number of times that action a has been executed in state s. Now each SuccessorInfo record contains a list of ResultInfo records, one for each result state that has been observed so far. Let RI be one such record for a result state s'. Then ResultInfo::n will count the number of times that the environment made a transition to s' after executing a in s.
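For concreteness, here is one way the two classes might look with the new fields; the field names other than n are assumptions and should be matched to the class definitions in the provided code:

    #include <list>

    // Sketch of the model data structures; only the n fields are new.
    class ResultInfo {
    public:
        int    state;         // result state s'
        double probability;   // current estimate of P(s'|s,a)
        double reward;        // current estimate of R(s'|s,a)
        int    n;             // times s' has been observed after executing a in s
    };

    class SuccessorInfo {
    public:
        int                   action;   // action a
        int                   n;        // times a has been executed in state s
        std::list<ResultInfo> results;  // one record per observed result state
    };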
To keep the model up-to-date, we must keep the two fields ResultInfo::probability and ResultInfo::reward current. We can do this by using incremental updates as follows. Every time we execute an action, we will find the matching SuccessorInfo record (or create a new one, if this is the first time we have executed this action in this state). Then, we will increment SuccessorInfo::n. Next, we observe the resulting state s' and immediate reward r, and find the corresponding ResultInfo record (or create a new one, if we have never made a transition from s and a to s' before). We can increment ResultInfo::n and then do the following updates:
    probability = (probability * (SuccessorInfo::n - 1) + 1) / SuccessorInfo::n;
    reward = ((reward * (ResultInfo::n - 1)) + r) / ResultInfo::n;

Note that, for the probabilities to continue summing to one, every other ResultInfo record in this SuccessorInfo's list should also be rescaled by (SuccessorInfo::n - 1) / SuccessorInfo::n. With these updates, the existing code for prioritized sweeping should work without any changes, because the probability and reward fields will always be up-to-date.
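Putting the pieces together, a sketch of the whole model update for one observed transition might look like this, building on the structures sketched above (the FindOrCreate helpers are assumptions; new records are assumed to start with n, probability, and reward equal to zero):

    // Sketch of the model update for one observed transition (s, a, s', r).
    void UpdateModel(MDP &mdp, int s, int a, int sPrime, double r)
    {
        SuccessorInfo &SI = FindOrCreateSuccessor(mdp, s, a);   // assumed helper
        SI.n++;
        ResultInfo &RI = FindOrCreateResult(SI, sPrime);        // assumed helper
        RI.n++;
        // Rescale every ResultInfo record so the probabilities still sum to one...
        for (std::list<ResultInfo>::iterator it = SI.results.begin();
             it != SI.results.end(); ++it)
            it->probability = it->probability * (SI.n - 1) / SI.n;
        // ...then credit the new observation to the matching record.  This is
        // equivalent to probability = (probability * (SI.n - 1) + 1) / SI.n.
        RI.probability += 1.0 / SI.n;
        RI.reward = (RI.reward * (RI.n - 1) + r) / RI.n;
    }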
We also need to consider how to update the inverse model (the PredecessorInfo records). One approach is to look up state s', search its predecessor list for the record whose state is s and whose action is a, and copy the updated probability field into that record.
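A sketch of that lookup, assuming the PredecessorInfo class stores a state, an action, and a probability field (adapt the names to the provided code):

    // Sketch: propagate the updated probability into the inverse model.
    void UpdatePredecessor(MDP &mdp, int s, int a, int sPrime, double newProb)
    {
        // mdp.predecessors[sPrime] is assumed to be the list of
        // PredecessorInfo records for state s'.
        std::list<PredecessorInfo> &preds = mdp.predecessors[sPrime];
        for (std::list<PredecessorInfo>::iterator it = preds.begin();
             it != preds.end(); ++it) {
            if (it->state == s && it->action == a) {
                it->probability = newProb;   // copy the updated estimate
                return;
            }
        }
        // If no matching record exists yet, create one here instead.
    }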
There is an opportunity for a round-off error every time we make one of these updates, so we should probably change probability and reward to be double-precision instead of single-precision floating point numbers.
Your program should permit the following to be specified at the command line (or interactively):
Every time your program completes a trial, it should print out the number of primitive actions (which you should keep track of using the global variable ActionCounter) and the number of episodes. This will make it possible for you to create a plot like that of Figure 6.11.
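For example, each trial's output could be a single line with two columns, which is easy to plot (EpisodeCounter is an assumed variable; ActionCounter is the global specified above):

    // Two columns per trial: total primitive actions so far, episodes completed.
    printf("%d %d\n", ActionCounter, EpisodeCounter);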
Turn in the following: