CS430/530: Homework Assignment 5 (Due November 17, 2000)

This is a combined written and programming assignment.

1. Consider the following decision-making problem. We want to send an important package to Washington, D.C. We can choose one of three companies (US Postal Service, Federal Express, United Parcel Service), and for each company, one of two levels of service. Each level of service has a cost (in dollars), and it yields a probability distribution over the number of days the package will take to reach Washington. This information is summarized in the following table.


                              Probability of arrival (days)
Company  Class       Cost     1     2     3     4     >4

US Mail  First       0.32     0.0   0.2   0.4   0.2   0.2
         Express     5.00     0.0   0.3   0.5   0.1   0.1

Federal  Overnight   7.50     0.8   0.1   0.0   0.0   0.1
Express  2nd Day     5.00     0.1   0.6   0.1   0.1   0.1

UPS      Overnight   7.00     0.7   0.2   0.05  0.01  0.04
         2nd Day     4.00     0.0   0.8   0.15  0.01  0.04

Utility Functions:
                     U1      100    50    10     0   -100
                     U2      100     0     0     0      0
                     U3       20    20    20     0      0

At the bottom of the table are three different utility functions. Answer the following questions:

  1. Show a decision diagram for this problem.
  2. For each of the utility functions, compute the expected utility of each possible action and indicate which action maximizes the expected utility.
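
If you want a way to check your hand computations, here is a minimal Lisp sketch (my own helper, not part of the AIMA code) that computes the expected utility of one option from its row of arrival probabilities and one of the utility rows; how to combine this with the shipping cost is up to you to decide from the problem statement.

(defun expected-utility (probs utils)
  ;; Sum of probability * utility over the five outcomes (1, 2, 3, 4, >4 days).
  (reduce #'+ (mapcar #'* probs utils)))

;; Example: US Mail, First class, under utility function U1:
;; (expected-utility '(0.0 0.2 0.4 0.2 0.2) '(100 50 10 0 -100))
;; => (0.0)(100) + (0.2)(50) + (0.4)(10) + (0.2)(0) + (0.2)(-100) = -6.0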

2. The code that comes with the textbook includes implementations of value iteration and policy iteration. To load and test these routines, do the following:

(load "/usr/local/classes/cs/cs430/code/aima.lisp")
(aima-load 'uncertainty)
(test 'uncertainty)

This code applies value iteration to compute an optimal policy for the problem shown in Figure 17.1 in the text.

Modify the 4x3 MDP so that the rewards in states (4 2) and (4 3) are -10 and +10, respectively. (See the global variable *4x3-R-data* in 4x3-mdp.lisp.) We will call the new MDP *4x3-mdp-10*. Run value iteration and then compute the greedy policy and print it out:

(setq *u* (value-iteration *4x3-mdp-10*))
(setq *p* (greedy-policy *u*
                         (mdp-model *4x3-mdp-10*)
                         (mdp-rewards *4x3-mdp-10*)))
(print-policy *p*)        
Turn in a trace of this.
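
If you would rather patch the rewards at the Lisp prompt than edit 4x3-mdp.lisp, a sketch along the following lines may work; it assumes mdp-rewards returns a hash table keyed by states written as lists such as (4 2), which you should verify against the actual code. Note that it modifies the reward table in place, so reload the code to get the original *4x3-mdp* back before starting question 3.

(defvar *4x3-mdp-10* *4x3-mdp*)   ; NOTE: this is only an alias; the setfs below change *4x3-mdp* too
(setf (gethash '(4 2) (mdp-rewards *4x3-mdp-10*)) -10)
(setf (gethash '(4 3) (mdp-rewards *4x3-mdp-10*)) +10)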

3. Run policy iteration on the unmodified *4x3-mdp*. Notice that I have installed a cutoff of 1000 iterations on the value-determination function (in dp.lisp). Policy iteration starts with a greedy policy based on an initial state value function that is equal to the reward function. Because the reward is the same in every state (except for the two goal states), there are many ties, and these are broken randomly. Hence, every time you run policy iteration, you will get a different initial policy.

Use the time function to compute the average time of five runs of policy iteration. Check the resulting policy in each run to determine whether it is optimal.
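
One way to collect the timings, assuming policy-iteration takes the MDP as its argument the same way value-iteration does in question 2 (check its definition in dp.lisp if this differs):

;; Time five runs; look at each printed policy to judge whether it is optimal.
(dotimes (run 5)
  (setq *p* (time (policy-iteration *4x3-mdp*)))
  (print-policy *p*))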

A better algorithm than policy iteration is something called Modified Policy Iteration (MPI). In MPI, the value determination phase is only permitted to run for a small number of iterations, and then a policy improvement step is performed. Change the limit in value-determination to be 100, and compute the average time of five runs of policy iteration. Check the resulting policy in each run to see whether it is optimal.

Change the limit to be 10, and compute the average time of five runs of policy iteration. Check the resulting policies.

Change the limit to be 1, and compute the average time of five runs of policy iteration. Check the resulting policies.

What is your conclusion?

Compare the CPU time of the best configuration of policy iteration with the CPU time of value iteration. Which is better in this domain?
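
For the value-iteration side of the comparison, the same time wrapper can be applied to the call already used in question 2:

(time (value-iteration *4x3-mdp*))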

Repeat this analysis for *4x3-mdp-10*.

Turn in your answers to each of the above questions.


Reading the Code

If you are interested in understanding the code that the authors have written, I've included a few notes of explanation here. The authors make extensive use of hash tables to represent the transition model P(S'|S,a), the rewards R(S), and the utilities U(S). Lisp provides very nice hash table support.
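
As a generic Common Lisp illustration (not taken from the AIMA files), the three hash-table operations the code leans on look like this:

(let ((U (make-hash-table :test #'equal)))   ; states such as (4 3) are lists, so use an EQUAL table
  (setf (gethash '(3 3) U) 0.0)              ; store a value for a state
  (setf (gethash '(4 3) U) 1.0)
  (print (gethash '(4 3) U))                 ; look it up => 1.0
  (maphash #'(lambda (s u)                   ; visit every (state, value) pair
               (format t "U~A = ~A~%" s u))
           U))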

The authors also use the do construct. This is a very complicated and messy construct, so I'll just explain their particular use of it. The following code appears in dp.lisp:

(do ((iteration 0 (+ iteration 1)))
    ((< max-delta epsilon) Unew)
  (setq max-delta 0)
  (rotatef Uold Unew) ;;; switch contents; then we will overwrite Unew
  (format t "Iteration ~D~%" iteration)
  (maphash
   #'(lambda (s u)
       (unless (sink? s M)
         (setf (gethash s Unew)
               (+ (gethash s R)
                  (if (gethash s M)
                      (apply #'max
                             (mapcar #'(lambda (a) (q-value a s Uold M R))
                                     (actions s M)))
                      0))))
       (setq max-delta (max max-delta (abs (- (gethash s Unew) u)))))
   Uold))
Here is an explanation of this code. The do is equivalent to
(loop for iteration from 0
      do
      (cond ((< max-delta epsilon)
             (return Unew))
            (t
             ...)))
The first list after the do is an iteration construct similar to the for statement in C. The variable iteration is initialized to zero and after each pass through the loop, it is incremented by 1.

The second list after the do is the termination test. When the condition (< max-delta epsilon) is true, the do terminates and returns Unew.

The remainder of the body of the do is the body of the loop.
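
As a standalone illustration of those three parts (variable clauses, termination clause, body), here is a tiny do that sums the integers 0 through 4:

(do ((i 0 (+ i 1))          ; variable, initial value, step form
     (sum 0 (+ sum i)))     ; several variables can be stepped in parallel
    ((= i 5) sum)           ; termination test and result form => returns 10
  (format t "i = ~D~%" i))  ; the body, executed once per pass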

(rotatef a b) swaps the values of a and b.

The expression that begins with the maphash is particularly exciting. This implements one iteration of value iteration. Let's analyze it from the inside out.

(apply #'max
       (mapcar
          #'(lambda (a) (q-value a s Uold M R))
          (actions s M)))
(actions s M) returns a list of the available actions in state s. For each action, we compute the expected value of that action (this is performed by the function q-value). The mapcar gathers these into a list to which we apply the function max to compute the largest element in the list. If I were writing this code, I would have written:
(loop for action in (actions s M)
   maximize (q-value action s Uold M R))
Let's call the result of this expression the BEST-VALUE. The surrounding fragment of code is
 (setf (gethash s Unew)
          (+ (gethash s R)
             (if (gethash s M)
                 BEST-VALUE
                 0)))
The (gethash s M) is a check to see whether s has any available actions. If not, the value 0 is used; otherwise, the BEST-VALUE computed above is used. To this we add the immediate reward (gethash s R) to obtain a new value for U(s), which is stored using (setf (gethash s Unew) ...). In other words, each pass of the maphash performs the value-iteration backup Unew(s) = R(s) + max over actions a of Q(a, s, Uold) for every non-sink state s, while max-delta records the largest change in value across states so that the do's termination test can detect convergence.