Reinforcement Learning
Reinforcement learning problems involve an agent interacting with
an environment. The agent must learn about the environment and must
also discover how to act optimally in that environment. Hence, there
is both a statistical component (learning about the environment) and a
computational component (deciding how to act). In reinforcement
learning, the environment is typically modeled as a controllable
Markov process, so the agent must solve a Markov decision problem
(MDP). Reinforcement learning can therefore be viewed as the study of
tractable approximation algorithms for solving MDPs.
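The agent-environment loop above can be sketched with tabular Q-learning, a canonical approximation algorithm for MDPs. The two-state chain below is a made-up toy problem, not one from this page:

```python
import random

# Tabular Q-learning on a made-up two-state chain: from state 0, action
# "right" reaches terminal state 1 with reward 1; "left" stays at 0 with
# reward 0.
def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    actions = ("left", "right")
    Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != 1:
            # epsilon-greedy exploration of the environment
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2 = 1 if a == "right" else 0
            r = 1.0 if s2 == 1 else 0.0
            future = 0.0 if s2 == 1 else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

The learned Q-values capture both components of the problem: the agent estimates the environment's behavior statistically while the greedy policy read off Q converges to the optimal action, "right".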
There are two main challenges in reinforcement learning research: (a)
scaling up to large problems and (b) handling partially-observable
Markov decision problems (where the agent cannot sense the entire
state of the environment). In addition to addressing these fundamental
problems, we are also interested in finding innovative applications of
RL.
Large Problems
- Hierarchical reinforcement learning. We have developed the
MAXQ value function decomposition for hierarchical reinforcement
learning, and we are currently experimenting with it in the
Kanfer-Ackerman air traffic control task.
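The MAXQ decomposition expresses the value of invoking a subtask a inside a parent task i as Q(i, s, a) = V(a, s) + C(i, s, a), where V(a, s) is the value of completing subtask a from state s and C(i, s, a) is the completion value of the parent afterwards. A minimal sketch, with a hypothetical two-action task graph (all names and numbers are illustrative, not from the ATC task):

```python
# Hypothetical task graph: "root" may invoke primitive subtasks "go" or
# "grab".  The values and completion values below are made up.
children = {"root": ["go", "grab"]}
V_prim = {("go", "s0"): -1.0, ("grab", "s0"): -2.0}            # primitive values
C = {("root", "s0", "go"): 5.0, ("root", "s0", "grab"): 3.0}   # completion values

def V(task, s):
    """Value of completing `task` starting in state s."""
    if task not in children:          # primitive action
        return V_prim[(task, s)]
    return max(Q(task, s, a) for a in children[task])

def Q(parent, s, a):
    """MAXQ decomposition: Q(i, s, a) = V(a, s) + C(i, s, a)."""
    return V(a, s) + C[(parent, s, a)]
```

The recursion lets the root's value be assembled from subtask values plus completion values, which is what makes the hierarchical representation compact and lets subtask values be shared across parents.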
- Value function approximation. At the heart of most
reinforcement learning algorithms is the value function. To scale up
reinforcement learning to large problems, this value function must be
represented compactly. Existing representations using neural networks
are compact but slow to train. We are exploring other approaches to
value function approximation including methods based on fitted
regression trees (Wang and Dietterich, 2002) and support vector machines
(Dietterich and Wang, 2002).
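To illustrate the regression-tree approach, here is a minimal sketch of fitted value iteration using a depth-1 regression tree (a stump) as the compact value representation. The toy chain environment and the fitting loop are illustrative stand-ins, not the algorithms from the cited papers:

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

# Fitted value iteration on a toy 1-D chain: the state drifts by +0.1
# (capped at 0.9) and states reaching >= 0.5 earn reward 1.
gamma = 0.9
V = lambda x: 0.0                       # initial value estimate
xs = [i / 10 for i in range(10)]
for _ in range(30):
    # Bellman backup targets at the sampled states, then refit the tree
    targets = [(1.0 if min(x + 0.1, 0.9) >= 0.5 else 0.0)
               + gamma * V(min(x + 0.1, 0.9)) for x in xs]
    V = fit_stump(xs, targets)
```

Even this crude one-split representation learns that states near the rewarding region are worth more than distant ones, while storing only a threshold and two leaf values instead of a table.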
- RL with Large Action Spaces. In many application
problems, the number of actions can be large. We are exploring
efficient methods for learning and acting in such spaces. One method
is based on smoothing the probability transition functions of similar
actions during learning (Dietterich, Busquets, Lopez de Mantaras, and
Sierra, submitted). We are applying this in a robot navigation
problem.
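The smoothing idea can be sketched as follows: transition counts observed for one action are partially shared with similar actions, so rarely-tried actions borrow statistics from their neighbors. The action names, counts, and mixing weight below are made up for illustration:

```python
from collections import Counter

# Observed transition counts: (action, state) -> Counter of next states.
counts = {
    ("turn_30", "s0"): Counter({"s1": 8, "s0": 2}),   # well-explored action
    ("turn_35", "s0"): Counter({"s0": 1}),            # barely explored
}
# Which actions are considered similar to which (hypothetical).
similar = {"turn_35": ["turn_30"], "turn_30": ["turn_35"]}

def smoothed_p(action, s, s2, w=0.5):
    """P(s2 | s, action) after mixing in counts from similar actions with weight w."""
    c = Counter(counts.get((action, s), Counter()))
    for a2 in similar.get(action, []):
        for nxt, n in counts.get((a2, s), Counter()).items():
            c[nxt] += w * n
    total = sum(c.values())
    return c[s2] / total if total else 0.0
```

With w = 0 the barely-explored action assigns zero probability to reaching s1; with smoothing it inherits a reasonable estimate from its well-explored neighbor, which is the point of sharing statistics across similar actions.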
- Reinforcement Learning Algorithms for Combinatorial
Optimization. One area where very large MDPs arise is in
complex optimization problems. We have pioneered the application of
reinforcement learning to such problems, particularly with our work in
job-shop scheduling. Our approach is to learn a heuristic evaluation
function for guiding a search through a search space. In
combinatorial optimization, the search space is deterministic, but it
may have a very large branching factor and a very deep search depth.
We are currently exploring reinforcement learning algorithms
customized for such problems. Our motivating applications are (a)
job-shop scheduling, (b) segmentation of natural resource images, and
(c) determination of protein structure from NMR spectra.
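The learned-evaluation-function approach can be sketched on a toy sequencing problem: states are partial job sequences, and a greedy search follows whichever successor the evaluation function prefers. The three-job instance and the hand-fixed stand-in for a learned heuristic below are illustrative, not the job-shop scheduler from the papers:

```python
# Toy scheduling instance: job -> (duration, weight).  Illustrative only.
jobs = {"A": (3, 1.0), "B": (1, 2.0), "C": (2, 1.5)}

def h(seq):
    """Stand-in for a learned evaluation function: weighted completion
    time accumulated by the partial sequence so far."""
    t, cost = 0, 0.0
    for j in seq:
        d, w = jobs[j]
        t += d
        cost += w * t
    return cost

def greedy_search():
    """Deterministic search guided by the evaluation function: at each
    node, expand all successors and follow the one h prefers."""
    seq = []
    while len(seq) < len(jobs):
        nxt = min((j for j in jobs if j not in seq), key=lambda j: h(seq + [j]))
        seq.append(nxt)
    return seq

seq = greedy_search()
```

A purely greedy descent like this is myopic; the research interest is in learning evaluation functions accurate enough that shallow search through such deterministic, high-branching-factor spaces still finds good solutions.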
- Intelligent exploration. Most reinforcement learning methods
rely on uninformed (or minimally-informed) search procedures to
explore the environment. In many practical settings, however, the
system designer can provide guidance in the form of an initial
policy or a planning system. The policy or the planner is typically
suboptimal (otherwise, there would be no need to build the
reinforcement learning system). The research question is how to
exploit a suboptimal policy to guide reinforcement learning while
still retaining the ability to find the optimal policy.
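One simple way to exploit a supplied policy while retaining optimality is to draw exploratory actions from that policy but keep off-policy Q-learning updates, so the learned greedy policy can still surpass the prior. A minimal sketch on a made-up two-action chain (not a method from the text):

```python
import random

# With probability p the agent follows the supplied prior policy;
# otherwise it acts greedily on its own Q.  Off-policy updates let the
# learned policy improve beyond the prior.
def guided_q_learning(prior, episodes=800, alpha=0.5, gamma=0.9, p=0.3, seed=1):
    rng = random.Random(seed)
    actions = ("left", "right")
    Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
    for _ in range(episodes):
        s = 0
        for _ in range(5):                       # episode step cap
            if rng.random() < p:
                a = prior(s)                     # guidance from the prior
            else:                                # greedy with random tie-breaking
                a = max(actions, key=lambda act: (Q[(s, act)], rng.random()))
            s2, r = (1, 1.0) if a == "right" else (0, 0.0)
            done = s2 == 1
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s = s2
    return Q

# Even a prior that always recommends the inferior action does not trap
# the learner: Q-learning still discovers that "right" is better.
Q = guided_q_learning(prior=lambda s: "left")
```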
Partially-Observable MDPs
- Cost-observable MDPs. A cost-observable MDP is a problem in
which the agent can perceive the entire state of the environment, but
it must pay a cost for each piece of sensor data. This can be viewed
as a special form of partially-observable MDP in which the sense
actions do not change the state of the environment (Zubek and
Dietterich, 2001; Zubek and Dietterich, 2000).
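The sensing tradeoff can be sketched as a one-step value-of-information computation: pay the observation cost only when the expected gain from acting on the true state exceeds it. The beliefs, utilities, and costs below are illustrative numbers, not from the cited papers:

```python
# Agent's belief over the hidden state, and the utility of each action
# in each state (all numbers hypothetical).
belief = {"s_good": 0.6, "s_bad": 0.4}
utility = {
    "s_good": {"go": 10.0, "stay": 0.0},
    "s_bad":  {"go": -10.0, "stay": 0.0},
}

def expected_utility(action):
    return sum(p * utility[s][action] for s, p in belief.items())

# Act blind: pick the action that is best on average under the belief.
blind = max(utility["s_good"], key=expected_utility)
v_blind = expected_utility(blind)

# Sense first: observe the true state, then act optimally in it.
v_informed = sum(p * max(utility[s].values()) for s, p in belief.items())

def worth_sensing(cost):
    """Pay for the observation only if its expected value exceeds its cost."""
    return v_informed - cost > v_blind
```

Here the observation is worth up to 4 units of expected utility, so a cheap sensor should be used and an expensive one skipped; a cost-observable MDP extends this one-step reasoning over whole trajectories.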
- Cognitive modeling. We are exploring the extent to which human
skill acquisition can be modeled by reinforcement learning. Our
motivating application is the Kanfer-Ackerman air traffic control
simulation. This application includes partial observability and
hierarchical structure.
A good source of tutorial information on Reinforcement Learning is the
RL Archive at Michigan State
University.
References
- Dietterich, T. G., Busquets, D., Lopez de Mantaras, R., Sierra,
C. (submitted). Action Refinement in Reinforcement Learning by
Probability Smoothing.
- Busquets, D., Lopez de Mantaras, R., Sierra, C., and Dietterich,
T. G. (submitted). Reinforcement learning for landmark-based robot
navigation.
- Dietterich, T. G. and Wang, X. (to appear). Batch value
function approximation via support vectors. In Dietterich, T. G.,
Becker, S., Ghahramani, Z. (Eds.) Advances in Neural Information
Processing Systems 14. Cambridge, MA: MIT Press.
- Wang, X. and Dietterich, T. G. (to appear). Stabilizing value
function approximation with the BFBP algorithm. In Dietterich, T. G.,
Becker, S., Ghahramani, Z. (Eds.) Advances in Neural Information
Processing Systems 14. Cambridge, MA: MIT Press.
- Zubek, V. B., Dietterich, T. G. (2001). Two Heuristics for
Solving POMDPs Having a Delayed Need to Observe. To appear in
Proceedings of the IJCAI Workshop on Planning under Uncertainty and
Incomplete Information. August 6, 2001. Seattle, WA.
- Dietterich, T. G. (2000). An Overview of MAXQ Hierarchical
Reinforcement Learning. To appear in B. Y. Choueiry and T. Walsh
(Eds.) Proceedings of the Symposium on Abstraction, Reformulation
and Approximation SARA 2000, Lecture Notes in Artificial
Intelligence. New York: Springer-Verlag.
- Zubek, V. B. and Dietterich, T. G. (2000) A POMDP Approximation
Algorithm that Anticipates the Need to Observe. To appear in
Proceedings of the Pacific Rim Conference on Artificial Intelligence
(PRICAI-2000); Lecture Notes in Computer Science. New York:
Springer-Verlag.
- Dietterich, T. G. (2000). Hierarchical reinforcement
learning with the MAXQ value function decomposition.
Journal of Artificial Intelligence Research, 13, 227-303.
- Wang, X., Dietterich, T. G. (1999). Efficient Value Function
Approximation Using Regression Trees. In Proceedings of the
IJCAI Workshop on Statistical Machine Learning for Large-Scale
Optimization, Stockholm, Sweden.
- Dietterich, T. G. (2000). State abstraction in MAXQ
hierarchical reinforcement learning. In Advances in Neural
Information Processing Systems, 12. S. A. Solla, T. K. Leen, and
K.-R. Muller (eds.), 994-1000, MIT Press.
- Zhang, W., Dietterich, T. G. (submitted). Solving
Combinatorial Optimization Tasks by Reinforcement Learning: A General
Methodology Applied to Resource-Constrained Scheduling.
- Zhang, W., Dietterich, T. G., (1996). High-Performance
Job-Shop Scheduling With A Time-Delay TD(λ) Network. In
D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.) Advances in
Neural Information Processing Systems, 8, 1024-1030.
- Zhang, W., Dietterich, T. G., (1995). A Reinforcement
Learning Approach to Job-shop Scheduling. In Proceedings of IJCAI-95.
- Zhang, W., Dietterich, T. G., (1995). Value Function
Approximations and Job-Shop Scheduling. In J. A. Boyan, A. W. Moore,
and R. S. Sutton (Eds.) Proceedings of the Workshop on Value
Function Approximation. Carnegie-Mellon University, School of
Computer Science, Report Number CMU-CS-95-206.
- Dietterich, T. G., Flann, N. S., (1995). Explanation-based
Learning and Reinforcement Learning: A Unified View. In
Proceedings of the 12th International Conference on Machine Learning
(pp. 176-184). Tahoe City, CA. San Francisco: Morgan Kaufmann.
A longer version appeared in Machine Learning.
- Dietterich, T. G. (1991) Knowledge compilation:
Bridging the gap between specification and implementation.
IEEE Expert, 6 (2) 80--82.
- Cerbone, G., Dietterich, T. G., (1990) Inductive and
numerical methods in knowledge compilation. Proceedings of
CRIB-90. Menlo Park, CA: Price Waterhouse.
- Flann, N. S., and Dietterich, T. G., (1989)
A study of explanation-based methods for inductive learning.
Machine Learning, 4 (2), 187--226.
- Dietterich, T. G., and Bennett, J. S., (1986). The Test
Incorporation Hypothesis and the Weak Methods, Rep. No. TR 86-30-4,
Department of Computer Science, Oregon State University, Corvallis,
OR.
- Dietterich, T. G., and Bennett, J. S. (1986). The Test
Incorporation Theory of Problem Solving, In Proceedings of the
Workshop on Knowledge Compilation, Department of Computer Science,
Oregon State University, Corvallis, OR.
OSU Real-Time Skill Acquisition Group