ACM Computing Surveys 28A(4), December 1996, http://www.cs.orst.edu/~tgd/ml-strategic-directions.html. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.

Machine Learning

Thomas G. Dietterich

Oregon State University, Department of Computer Science
303 Dearborn Hall, Corvallis, OR 97331-3202 USA
tgd@cs.orst.edu, http://www.cs.orst.edu/~tgd

Abstract: Machine Learning studies methods for developing computer systems that can learn from experience. These methods are appropriate in situations where it is impossible to anticipate at design time exactly how a computer system should behave. The ultimate behavior of the system is determined both by the framework provided by the programmer and the subsequent input/output experiences of the program. This position paper briefly reviews the major methods studied in machine learning and then describes important open problems and research directions.
Categories and Subject Descriptors: I.2.6 [Learning]; I.5 [Pattern Recognition]
General Terms: Supervised learning, reinforcement learning, temporal-difference learning, cognitive architectures.

1 Introduction

The fundamental goal of machine learning is to make computer systems that can adapt and learn from their experience. As such, machine learning is an alternative to traditional software engineering that replaces human-intensive hand-engineering with data-intensive software construction and testing. Machine learning methods are appropriate whenever hand-engineering of software is difficult and yet data is available for analysis by learning algorithms. This includes situations where humans are unable to introspect (e.g., computer vision, speech recognition, motor control), situations where human expertise is lacking (e.g., human genome program, drug design), situations where the task changes frequently (e.g., airline seat sales strategies), and situations where a program must be customized for individual users (e.g., information filtering, user interfaces). Because machine learning is inherently stochastic, it may not be appropriate for tasks where completely error-free performance must be guaranteed (e.g., life-support control in intensive care).

Learning tasks can generally be divided into one-shot decision tasks (classification, prediction) and sequential decision tasks (control, optimization, planning). One-shot decision tasks are usually formulated as supervised learning tasks, where the learning algorithm is given a set of input-output pairs. The input describes the information available for making the decision and the output describes the correct decision. For example, in handwritten digit recognition, the input would be an image of the handwritten digit and the output would be the identity of the digit.

Sequential decision tasks are usually formulated as reinforcement learning tasks, where the learning algorithm is part of an agent that interacts with an "external environment". At each point, the agent observes the current state of the environment (or some aspects of the environment). It then selects and executes some action, which usually causes the environment to change state. The environment then provides some feedback (e.g., immediate reward) to the agent. The goal of the learning process is to learn an action-selection policy that will maximize the long-term rewards received by the agent. An example of a sequential task is the control of a manufacturing facility, where the current state includes a list of orders to be fulfilled and the current status of the facility, and the actions change the settings of various manufacturing machines to maximize the profit produced by the facility. The immediate reward includes the cost of each action and the revenue received when an order is completed.

2 Fundamental Research Problems

Machine learning algorithms can generally be arranged along a spectrum from general ("off-the-shelf") methods, which attempt to perform well for a wide variety of problems, to custom methods designed for particular tasks. For example, methods for learning decision trees, rule sets, probabilistic networks, and feed-forward neural networks are very general, and have been applied to many different problems. Methods for speech recognition based on hidden Markov models have a more limited range of applicability. And methods for inferring binding site geometry in drug design are extremely application-specific.

Many fundamental research problems concern this general-to-specific spectrum.

What is the space of general methods? The algorithms for trees, rules, bayes nets and neural nets have all been developed in the last 10-15 years. There is no reason to believe that these four general methods exhaust the possibilities. What other convenient representational formalisms need to be explored?
What are the limits of general methods? Are our current algorithms for each of these general methods essentially optimal? Or is there substantial room for improvement?
What theoretical and engineering principles can guide the development of application-specific algorithms? The development of such algorithms is currently very expensive and time-consuming.
Can we find convenient ways to acquire, represent, and incorporate application-specific knowledge into general learning methods? Rather than constructing application-specific algorithms manually, it would be much better if we could construct them automatically by instantiating some general learning methods. This process is still not well understood, although there are several promising initial efforts.

In sequential decision-making tasks, there are several additional challenging fundamental questions.

In what situations is it better to represent and learn a policy directly, rather than first learning a value function? A policy specifies what action should be chosen in each state; a value function gives the expected long-term reward of each state. The best action can be selected by choosing the action that will lead to the state with the highest value according to the value function. Many methods---including genetic programming, genetic algorithms, and parameterized policies---search directly in policy space. Other methods---temporal difference learning, Q learning, and real-time dynamic programming---construct value functions instead.
What properties should an ideal value function approximator have?
What is the best way to resolve the tradeoff between exploration (i.e., search for better policies) and exploitation (i.e., execution of the current policy)?
Can we find ways to solve large partially-observable Markov decision problems?

3 Interdisciplinary Connections

A strength of research in machine learning is that in the past decade, a strong link has been forged between theoretical and experimental research, with overlapping attendence at conferences, publication in the same journals, and even individual papers that combine theoretical analysis with experiment. All of the fundamental problems listed above require a concerted attack using both theoretical and experimental methods.

Similar links are beginning to be established across disciplines, with connections to statistics, operations research, control, robotics, signal processing, and many other fields. These links need to be strengthened and expanded to include stronger links with practitioners in industry and in application fields in science and engineering. Graduate students need to have training in at least one application field if they are going to apply machine learning methods effectively in that field.

There is currently a serious lack of communication between academic research and industrial applications. Academic research is rarely packaged and delivered in a way that is accessible to industrial researchers (e.g., the fields of statistics and databases have been much more successful in this regard). Furthermore, many fundamental and practical problems arising in applications are not being addressed by academic research. One symptom of this is that few applications papers are being accepted for publication at conferences. Several people are attempting to rectify this problem. But other mechanisms must also be found for widening the communication from industry to academe.

Surprisingly, the links between machine learning research and the rest of artificial intelligence and computer science are not very strong. This is due in large part to the very different background and training required. Machine learning requires a strong background in continuous mathematics, numerical computing, probability, and statistics, whereas most of computer science has been focusing on discrete mathematics, and many departments of computer science do not require any background in probability theory!

Data-intensive methods are beginning to appear in parts of AI including computer vision and corpus-based natural language processing. However, for the most part, machine learning has focused on one-shot decision problems, which constitute a rather small portion of general intelligent behavior. It is time to consider how machine learning should be integrated into the architectures of intelligent agents. Should there be a single, general weak learning component (e.g., chunking in SOAR), or a large number of specialized learners? Should learning for perception and motor control be separate from more "cognitive" tasks? How can knowledge be transferred from one task to another within the lifetime of an agent? Concrete experimental settings are required to push this research forward.

4 Concluding Remarks

In conclusion, let me list some of the applications areas where there are opportunities for major improvements through the application of machine learning methods:

Machine perception. Approaches that learn to perform feature extraction, segmentation, and classification show promise of higher performance and greater ease of implementation than traditional systems built out of separate hand-built components.
Large-scale data analysis and information management. Businesses and scientists are gathering data faster than ever before. Scaled-up machine learning methods are critical for finding patterns and relationships in these data sets.
Information filtering and retrieval. Machine learning methods can form models of the information needs of users and of the various available sources of information. By combining these models, the explosion in online information can be managed to help users find only the information they need.
Control and optimization. Adaptive control methods will permit larger scale optimization (e.g., over entire enterprises) of manufacturing, planning, scheduling, inventory management, and logistics. They will also enable new kinds of applications such as smart buildings and smart vehicles.
Software engineering. Existing applications packages (such as statistical packages, image processing workbenches, user interface toolkits) provide modules that can be assembled to solve particular problems. Can this assembly be automated so that users do not need to search through the space of possible combinations?
Model building. Many disciplines in science and engineering construct complex models in order to support monitoring, diagnosis, repair, prediction, extrapolation, and so on. This is a mixture of programming (to construct the model) and adaptation (to modify the model and calibrate it to data). Modelers need better tools for constructing, calibrating, and managing their models.

This list is not exhaustive; there are many other fruitful areas for research (e.g., studying the dynamics of interactions among multiple learning agents, applications in cognitive psychology, etc.). Indeed, I believe that the next decade will see a revolution in the way we develop software. I believe we must move away from software that is pre-engineered, pre-compiled, and pre-optimized toward software that is adaptive and learns from its experiences. Unless software becomes self-assembling and adaptive, every computer user will need to have his or her own personal software engineer, and the cost of providing such services will severely limit the growth of the computer industry.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

Last modified: Sat Nov 16 12:03:20 1996

Tom Dietterich <tgd@cs.orst.edu>