This is a combined written and programming assignment.
1. Consider the following decision-making problem. We want to send an important package to Washington, D.C. We can choose among three companies (US Postal Service, Federal Express, United Parcel Service), and for each company, we can choose between two levels of service. Each level of service has a cost, and it yields a probability distribution over the number of days it will take the package to reach Washington. This information is summarized in the following table.
                                   Probability of arrival (days)
Company          Class      Cost    1     2     3     4     >4
US Mail          First      0.32   0.0   0.2   0.4   0.2   0.2
                 Express    5.00   0.0   0.3   0.5   0.1   0.1
Federal Express  Overnight  7.50   0.8   0.1   0.0   0.0   0.1
                 2nd Day    5.00   0.1   0.6   0.1   0.1   0.1
UPS              Overnight  7.00   0.7   0.2   0.05  0.01  0.04
                 2nd Day    4.00   0.0   0.8   0.15  0.01  0.04

Utility functions (utility of arrival on day 1, 2, 3, 4, >4):
U1   100   50   10   0   -100
U2   100    0    0   0      0
U3    20   20   20   0      0
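To make the intended computation concrete, here is a small sketch in Lisp. The decision criterion, expected utility of the arrival-time distribution minus cost, is my reading of the problem, and it assumes utilities and dollar costs are on commensurable scales:

(defun expected-utility (probs utils cost)
  ;; Sum p_i * U(day_i) over the five arrival outcomes, then subtract cost.
  (- (reduce #'+ (mapcar #'* probs utils)) cost))

;; Example: UPS 2nd Day under U1 = (100 50 10 0 -100):
;; 0.8*50 + 0.15*10 + 0.04*(-100) - 4.00 = 37.5 - 4.00 = 33.5
(expected-utility '(0.0 0.8 0.15 0.01 0.04) '(100 50 10 0 -100) 4.00)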
2. The code that comes with the textbook includes implementations of value iteration and policy iteration. To load and test these routines, do the following:
(load "/usr/local/classes/cs/cs430/code/aima.lisp")
(aima-load 'uncertainty)
(test 'uncertainty)
This code applies value iteration to compute an optimal policy for the problem shown in Figure 17.1 in the text.
Modify the 4x3 MDP so that the rewards in states (4 2) and (4 3) are -10 and +10. (See the global variable *4x3-R-data* in 4x3-mdp.lisp.) We will call the new MDP *4x3-mdp-10*.
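(If you would rather patch the MDP at the REPL than edit the file, something along the following lines may work. This is only a sketch: it assumes mdp-rewards returns the reward hash table keyed by state, as the gethash calls later in this handout suggest, and it destructively modifies tables shared with *4x3-mdp*, so reload the file afterward to recover the original.)

;; Hypothetical REPL alternative to editing *4x3-R-data* in 4x3-mdp.lisp.
(setq *4x3-mdp-10* *4x3-mdp*)  ; same structure under a new name
(setf (gethash '(4 2) (mdp-rewards *4x3-mdp-10*)) -10)
(setf (gethash '(4 3) (mdp-rewards *4x3-mdp-10*)) 10)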
Run value iteration and then compute the greedy policy and print it out:

(setq *u* (value-iteration *4x3-mdp-10*))
(setq *p* (greedy-policy *u* (mdp-model *4x3-mdp-10*) (mdp-rewards *4x3-mdp-10*)))
(print-policy *p*)

Turn in a trace of this.
3. Run policy iteration on the unmodified *4x3-mdp*. Notice that I have installed a cutoff of 1000 iterations on the value-determination function (in dp.lisp).
Policy iteration starts with a greedy policy based on an initial state
value function that is equal to the reward function. Because the
reward is the same in every state (except for the two goal states),
there are many ties, and these are broken randomly. Hence, every time
you run policy iteration, you will get a different initial policy.
Use the time function to compute the average time of five runs of policy iteration. Check the resulting policy in each run to determine whether it is optimal.
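For the timing runs, something along these lines should do it (I'm assuming here that policy-iteration, like value-iteration above, takes the MDP structure as its single argument and returns the resulting policy):

;; Time five runs; average the reported times by hand, and print each
;; resulting policy so you can check whether it is optimal.
(dotimes (i 5)
  (print-policy (time (policy-iteration *4x3-mdp*))))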
A better algorithm than policy iteration is something called Modified Policy Iteration (MPI). In MPI, the value-determination phase is only permitted to run for a small number of iterations, and then a policy-improvement step is performed. Change the limit in value-determination to be 100, and compute the average time of five runs of policy iteration. Check the resulting policy in each run to see whether it is optimal.
Change the limit to be 10, and compute the average time of five runs of policy iteration. Check the resulting policies.
Change the limit to be 1, and compute the average time of five runs of policy iteration. Check the resulting policies.
What is your conclusion?
Compare the CPU time of the best configuration of policy iteration with the CPU time of value iteration. Which is better in this domain?
Repeat this analysis for *4x3-mdp-10*.
Turn in your answers to each of the above questions.
If you are interested in understanding the code that the authors have written, I've included a few notes of explanation here. The authors make extensive use of hash tables to represent P(S'|S,a), R(S'|S,a), and U(S). Lisp provides very nice hash table support.
(make-hash-table :test #'equal). This creates a hash table and indicates that two hash keys should be treated as identical if they are equal to each other. This means that you can use an S-expression as a hash key. Lisp has a built-in hash function sxhash that converts an arbitrary S-expression into a hash code with the property that two equal expressions will have identical hash values.
(gethash key table). This is used to look up a hash table value given a key. It returns nil if the key is not found in the table; a second return value indicates whether the key was actually present.
(setf (gethash key table) value). This is used to store a value in a hash table.
(maphash function table). This loops through all entries stored in the hash table and calls the given function on each entry. The function must take two arguments, which will be bound to the hash key and the corresponding value.
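Putting these operations together, here is a small self-contained example that uses S-expressions (state coordinates such as (4 2)) as keys:

(let ((u (make-hash-table :test #'equal)))
  (setf (gethash '(4 3) u) 10)    ; store U((4 3)) = 10
  (setf (gethash '(4 2) u) -10)   ; store U((4 2)) = -10
  (print (gethash '(4 3) u))      ; prints 10
  (print (gethash '(1 1) u))      ; prints NIL -- key not present
  (maphash #'(lambda (key value)  ; called once per entry
               (format t "U(~A) = ~A~%" key value))
           u))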
The authors also use the do construct. This is a very complicated and messy construct, so I'll just explain their particular use of it. The following code appears in dp.lisp:
(do ((iteration 0 (+ iteration 1)))
    ((< max-delta epsilon) Unew)
  (setq max-delta 0)
  (rotatef Uold Unew) ;;; switch contents; then we will overwrite Unew
  (format t "Iteration ~D~%" iteration)
  (maphash #'(lambda (s u)
               (unless (sink? s M)
                 (setf (gethash s Unew)
                       (+ (gethash s R)
                          (if (gethash s M)
                              (apply #'max
                                     (mapcar #'(lambda (a) (q-value a s Uold M R))
                                             (actions s M)))
                              0))))
               (setq max-delta (max max-delta (abs (- (gethash s Unew) u)))))
           Uold))

Here is an explanation of this code.
The do is equivalent to

(loop for iteration from 0
      do (cond ((< max-delta epsilon) (return Unew))
               (t ...)))

The first list after the do is an iteration construct similar to the for statement in C. The variable iteration is initialized to zero, and after each pass through the loop it is incremented by 1.
The second list after the do is the termination test. When the condition (< max-delta epsilon) is true, the do terminates and returns Unew.
The remainder of the do form is the body of the loop.
(rotatef a b) swaps the values of a and b.
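For example, this returns (2 1):

(let ((a 1) (b 2))
  (rotatef a b)
  (list a b))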
The expression that begins with the maphash is particularly exciting. This implements one iteration of value iteration. Let's analyze it from the inside out.
(apply #'max (mapcar #'(lambda (a) (q-value a s Uold M R)) (actions s M)))
(actions s M) returns a list of the available actions in state s. For each action, we compute the expected value of that action (this is performed by the function q-value). The mapcar gathers these into a list, to which we apply the function max to compute the largest element in the list. If I were writing this code, I would have written:
(loop for action in (actions s M)
      maximize (q-value action s Uold M R))

Let's call the result of this expression the BEST-VALUE.
The surrounding fragment of code is

(setf (gethash s Unew)
      (+ (gethash s R)
         (if (gethash s M) BEST-VALUE 0)))

The (gethash s M) is a check to see whether s has any available actions. If not, then the value 0 is used. Otherwise, the BEST-VALUE computed above is used. To this, we add the immediate reward (gethash s R) to obtain a new value for U(s), and store it using (setf (gethash s Unew) ...).
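In other words, each pass of the maphash performs the familiar value-iteration update Unew(s) = R(s) + max over actions a of Q(a, s, Uold) at every non-sink state.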