Thomas G. Dietterich
Professor and Director of Intelligent Systems
School of Electrical Engineering and Computer Science
1148 Kelley Engineering Center
Oregon State University
Corvallis, Oregon 97331-5501
E-mail: tgd@cs.orst.edu
Phone: +1-541-737-5559
Office: KEC 2067
PGP Public Key
(Last updated December 20, 2012.)
Page Contents:
Research
Prospective Students
Publications
Talks
CV
Software
Students and Staff
Course Materials
Bio Sketch
Conferences
"If you invent a breakthrough in artificial intelligence,
so machines can learn," Mr. Gates responded, "that is worth
10 Microsofts." (Quoted in NY Times, Monday March 3, 2004)
The focus of my research is machine learning: How can we make
computer systems that adapt and learn from their experience? How can
we combine machine learning with other advances in AI to build
Integrated Intelligent Systems? How can we combine human knowledge
with massive data sets to expand scientific knowledge and build more
useful computer applications? My laboratory combines research on
machine learning and AI fundamentals with applications to problems in
science and engineering.
- Scientific Projects
- Ecosystem Informatics and Computational
Sustainability: Oregon State University is a leader in combining
computer science and the ecological sciences to build the new
discipline of Ecosystem Informatics. Ecosystem Informatics studies
methods for collecting, analyzing, and visualizing data on the
structure and function of ecosystems. It is an instance of an
important new direction in science: Data Exploration Science (see Jim
Gray's 2003 KDD talk).
Oregon State is also part of the NSF Expedition in
Computational Sustainability jointly with Cornell University,
Bowdoin College, Howard University, and the Conservation Fund. This
effort seeks to develop novel computational methods to address
problems in ecosystem science and sustainable management of the
biosphere.
My group is involved in many Ecosystem Informatics and
Computational Sustainability activities:
- Machine Learning for Species Distribution. One of the
central goals of ecology is to understand and predict the
distribution of species (including the bugs that we are studying
in the Insect Identification project). Given a data set that
records observations of the presence (or absence) of multiple
species at multiple locations, we wish to develop models that can
predict their presence/absence elsewhere. We are interested not
only in static distribution models, but also in process models
that capture the temporal and spatial dynamics of species distributions
(e.g., bird migration, flight times of moths, return of salmon,
spread of invasive species, survival of endangered species, etc.).
Our species distribution team includes faculty members (Matt
Betts, myself, and Weng-Keen Wong), post-docs Rebecca Hutchinson
and Selina Chu, and graduate students Arwen Lettkeman and Liping Liu.
We collaborate very closely with the Cornell Laboratory of
Ornithology and with the DataONE Datanet. In particular, we are
studying methods for dealing with the many shortcomings of the
citizen science data collected by the Lab of Ornithology in their
eBird project, including (a) partial detection, (b) a wide range of
birder expertise, and (c) a highly biased spatial distribution of
observations.
- BirdCast. Another special case of species distribution
modeling is understanding bird migration. With the Lab of
Ornithology, we are developing methods for reconstructing and
predicting bird migration across North America. Our goal is to
provide daily forecasts of bird migration by combining eBird
reports, weather radar, acoustic monitoring of flight calls, and
weather forecasts. The project web site is
available here.
- Finding Swallow Roosts from Doppler Radar. One
particular instance of species distribution modeling is
understanding the dynamics of tree swallow behavior. Former
postdoc Dan Sheldon is developing algorithms for analyzing Doppler
Radar to identify swallow roosts and understand swallow migration.
- Approximate Optimization for Bio-Economic Models. Many
sustainability applications require solving large spatio-temporal
optimization problems under uncertainty. We are collaborating
with economists Jo
Albers and Claire
Montgomery on methods for approximate solution of
spatio-temporal optimization problems involving land management
for wildfire control and counter-measures for controlling invasive
species.
- Project TAHMO: Deployment, Cleaning, and Analysis of Sensor
Network Data. We are part of Project TAHMO, which seeks to
construct and deploy a network of 20,000 hydro-meteorological
stations in Africa. We are developing algorithms for sensor
placement, data cleaning, recovery from damaged sensors, and
analysis of the resulting data. We are building on our previous
work with Ethan Dereszynski on dynamic Bayesian network models for
sensor data cleaning.
- Arthropod Identification. Our current understanding of
complex ecosystems is limited by a lack of data. One particularly
useful kind of data is population counts of "bugs" (small
arthropods that live in soils, lakes, streams, and the ocean).
The BugID project seeks to develop devices for
capturing, imaging, and sorting bugs combined with general image
processing/machine learning/pattern recognition tools for counting
and classifying them. We hope to transform the ability of
scientists to measure the health of forests, streams, and
estuaries. More generally, we are interested in developing a wide
range of novel instruments for expanding the quality, quantity,
and spatio-temporal resolution of ecologically-relevant data. Our
research also contributes to computer vision and object
recognition more generally.
- NIPS 2012
Posner Lecture: Challenges for Machine Learning in
Computational Sustainability.
- ICML 2011 Tutorial
on Machine Learning in Ecology and Ecosystem
Management
- Intelligent Desktop Assistants. We have been involved in two
large efforts to develop intelligent assistants for the computer desktop.
- TaskTracer. When you come into work in the morning,
you don't want to say to your computer "I want to run Word", but
rather, "I want to work on my CS534 homework". In other words,
you would like a user interface that was organized around your
projects and activities rather than around application programs,
files, folders, etc. You would also like all of your information
in one place rather than scattered across the local file system,
network file systems, web sites, email folders, calendar,
contacts, etc. TaskTracer extends the Windows UI to provide
exactly this functionality. This research is supported by Google
and Intel (with previous support under the DARPA CALO project).
OSU News Service story. Project Web Site.
- CALO. The goal of the CALO project
was to develop an AI personal assistant that can help you find
relevant documents, prepare for meetings, keep track of what is
going on during meetings, and autonomously execute tasks such as
arranging travel, scheduling meetings, executing administrative
workflows (e.g., purchasing and staffing), and so on. Our work on
CALO focused on developing methods for integrating multiple,
separately-engineered components into a single learning and
reasoning system. We also prototyped a novel system that
employs programming-by-demonstration to define new learning tasks
for CALO to solve autonomously. We are currently editing a book
describing the results of the CALO project.
- Next Generation Phenomics. An important goal in biology
is to reconstruct the tree of life. As part of
the Project AVATOL team, we are
developing computer vision and machine learning methods to
automatically discover and score phenotype characters (features)
from images of biological specimens. These scores can then be
combined with other information (e.g., genetic sequences,
functional measurements) to reconstruct phylogenetic trees.
Phenomic information is particularly valuable for sets of
closely-related species (where DNA differences may not reflect
functional differences) and for extinct species known only through
fossil specimens.
The computer science challenges involve learning to score
known characters, which typically include shape, texture, color,
and topological features of specimens, from weakly-labeled data
and discovering new characters that are shared across some
taxonomic groups but not others.
- Learning via Interaction. Statistical machine learning
has focused primarily on learning from observational data. Human
learning often involves learning from demonstrations, explanations,
and feedback. I am interested in combining all of these methods to
develop intelligent systems that can learn both autonomously and
through interaction with human users and coaches. We have three
efforts in this direction:
- AI for Computer Games. The AI components of most computer
games are hand-authored rule-based systems. We are studying
methods for developing game AI agents via machine learning and
planning. One approach is to apply reinforcement learning to
automatically learn an opponent for games. Another approach is to
teach the game AI by demonstrating how to play the game. A third
method is to provide "coaching feedback" to the learning system.
A fourth approach is to transfer knowledge learned on one game to
rapidly construct an AI agent for a new game. We use the RTS
games Wargus and StarCraft as our
experimental platforms. Wargus is based on the Stratagus game engine.
StarCraft is a commercial game from Blizzard Entertainment. This
work is funded by ARO under a MURI grant.
- End-User Debugging of Learned Programs. We are starting
to see end-user applications that incorporate machine learning
components (e.g., adaptive spam filters, adaptive email
management, adaptive user interfaces). How can we empower end
users to get these learning systems to behave properly? For
example, how can end users define new features, provide advice on
feature relevance, and yet also understand that no learning system
can be perfect? How can a learning system explain itself to the
end user? Funding provided by the National Science Foundation.
- Fundamental Machine Learning and Artificial Intelligence Research
- Machine Reading and Deep Reading. In collaboration
with researchers at BBN, CMU, University of Washington, ISI, and
UMass, we are studying methods for extracting knowledge from text
to support inference. Our focus is on learning rules (e.g., Horn
clauses) and scripts (e.g., logical hidden Markov models) from
noisy and incomplete training data extracted from reading text.
Funding provided by the DARPA Machine Reading and DEFT programs.
- Anomaly Detection. An important capability for AI
systems is to be able to detect when an input situation is
unusual. For example, anomaly detection can allow machine
learning systems to detect when an input case is very different
from the training data and hence could lead to extrapolation and
poor performance. Anomaly detection methods are also important
for detecting novel failures in sensor networks and novel attacks
on computer systems. We are developing a range of algorithms for
anomaly detection under the DARPA ADAMS program.
- Flexible Latent Variable Modeling. Many problems in
machine learning require learning models of hidden ("latent")
processes. Such latent variable models can be easily represented
using graphical models. However, such models are typically
expressed using parametric probability distributions, which limits
their ability to adapt to the complexity of the process and the
amount of data. Our research seeks to integrate flexible machine
learning methods (such as boosted regression trees) into latent
process models. Postdoc Rebecca Hutchinson and graduate student
Liping Liu are developing an R package that integrates boosted
regression trees into certain latent variable models common in
species distribution modeling.
- Learning Individual Models from Aggregate Data.
Most data in ecology (and other fields) records information in
aggregated form (e.g., population counts, census figures). Often,
we wish to fit models of individual behavior using such aggregated
data. One example is the problem of predicting bird migration
from eBird counts. Former postdoc Dan Sheldon has developed a new
formalism, the Collective Graphical Model, that directly transforms
individual models to aggregate models that can then be easily
linked to aggregated data.
- Evidence Trees. We have developed a new approach to
supervised learning in which ensembles of tree classifiers are
applied not to make classification decisions but to select which
training data points provide evidence relevant to making a
decision or prediction. This evidence can then be input to a
second-level decision making process, which could be another
classifier or some form of kernel density estimation. We are
exploring this in the context of computer vision and the Arthropod
Identification project.
- Sample-Efficient Algorithms for Solving Spatial Markov
Decision Processes. We are developing algorithms for solving
MDPs in which the state consists of a landscape of patches, and
each patch has its own state. This means that the state space is
enormous. At each time step, an action must be specified for each
patch, so the action space is also enormous. In these problems,
phenomena (such as fires, infections, species spread) propagate
spatially, so we cannot treat the patches as independent. Our
algorithms seek exact (or bounded approximate) solutions for small
problem instances, and satisfactory solutions for real-sized
problem instances. Our algorithms optimize both expected
(discounted) cumulative reward and also risk-sensitive reward
measures. The transition probabilities for these models are
specified in the form of an expensive simulator which can be
invoked with a given state and action to obtain a sample of the
resulting state and the reward. An important goal is to minimize
the number of calls to the simulator.
- Software Engineering Methods and Tools for Machine
Learning Systems. Creating and maintaining a successful
deployed machine learning system is still largely an art that
requires a Ph.D. The goal of this research is to develop software
engineering methods and tools for creating and maintaining
deployed machine learning systems.
- Reviews, tutorials, and
books. I have written several review articles and tutorials on
machine learning.
My group meets on Tuesdays 3-5pm in KEC 3114. Each week, one member
gives a one-hour presentation on their work, and then everyone
provides 5-minute reports on their progress. There is also an AI
Seminar scheduled from time to time (contact Alan Fern for details).
I am NOT accepting any new students or postdocs for Fall 2013. The following is
advice for students applying in future years.
If you are seeking a research career in machine learning, data mining,
artificial intelligence and related areas, and you have a strong
background in mathematics and programming, please read my Information for Prospective Students
page. To see what courses I expect my Ph.D. students to take, please
see Recommended Courses for Ph.D. Students in
Machine Learning.
Professional Service, Journals, and Book Series
Entrepreneurial Activities
- I am a co-founder of Strands
(formerly MyStrands; formerly MusicStrands), a recommendation company.
- I am a co-founder of Smart Desktop. Smart Desktop
is now part of Decho, Inc., which is a "cloud
computing" effort of EMC.
Decho is commercializing technology developed as part of the
TaskTracer system.
- I am a co-founder of BigML. The
goal of this startup is to develop large scale cloud-based machine
learning services.
- Andrew Emmott, Graduate Student.
- Selina Chu, Postdoc.
- Mark Crowley, Postdoc.
- Rebecca Hutchinson, Postdoc.
- Jesse Hostetler, Graduate Student.
- Jed Irvine, Software Developer.
- Arwen Lettkeman, Graduate Student.
- Liping Liu, Graduate Student.
- Sean McGregor, Graduate Student.
- Michael Slater, Project Manager.
- Majid Alkaee Taleghan, Graduate Student.
- Shahed Sorower, Graduate Student.
- Pat Sullivan, Assistant and Grants Coordinator.
Former Students and Staff
- Hussein Almuallim,
Oil and Energy Professional, Calgary, Canada.
- Eric Altendorf, Google.
- Adam Ashenfelter, BigML, Inc., Corvallis, Oregon.
- Ghulum Bakiri, Department of Computer Science, Bahrain University
- Christian Baumberger. Master's student in Biomedical Engineering at
the University of Bern.
- Xinlong Bao. Google Pittsburgh.
- Brian Breck.
- Waranun Bunjongsat.
- Giuseppe Cerbone. Independent Information Services Professional, Milan, Italy.
- Martha Chamberlin.
- Hei Chan.
- Richard Charon.
- Eric Chown, Full Professor, Bowdoin College.
- Dan Corpron
- Diane Damon, Damon Consulting, Portland, OR.
- Ethan Dereszynski, Research Scientist, WebTrends, Portland, OR.
- Phuoc Do, Wiavia.
- Nicholas Flann, Associate Professor, Utah State University.
- Greg Foltz.
- Dan Forrest.
- Tony Fountain, Scientist
- Ashit Gandhi, Founder and Vice-President, Prism Gem, LLC - The Art of Diamond Coloring.
- Colin Gerety, Fort Collins, CO.
- Brandon Harvey, Enterprise Systems Integrator at University of Oregon.
- Guohua Hao, Data Scientist at Dataminr.
- Hermann Hild, President, SMI Cognitive Software GmbH.
- Saket Joshi, Postdoc (CI Fellow), Oregon State University.
- Varad Joshi, Senior Software Engineering Manager at Arris, Portland, OR.
- Caroline Koff, Hewlett-Packard Corporation, Fort Collins, CO.
- Victoria Keiser, Research Programmer, CMU. Master's Thesis (PDF).
- Michael Kelm, Research Scientist, Siemens Healthcare.
- Eun Bae Kong, Professor, Computer Science, Chungnam National University, South Korea
- Bill Langford, Research Associate at RMIT, Melbourne, Australia.
- Junyuan Lin, VMWare, Seattle.
- Dragos Margineantu, The Boeing Company.
- Gonzalo Martinez, Assistant Professor, Autonomous University of Madrid.
- Prafulla Mishra, Software Development Manager at eBay.
- Avis Ng.
- Soumya Ray, Assistant Professor, Case Western Reserve University.
- Angelo Restificar, Principal Machine Learning Engineer, eBay, San Jose.
- Ritchey Ruff, Senior SDET, Microsoft.
- Dan Sheldon, Assistant Professor, University of Massachusetts, Amherst.
- Jianqiang Shen. Research Scientist, PARC. Doctoral dissertation.
- Rongkun Shen. Post-doc, Oregon Health and Science University, Portland.
- Michael Shindler, HULU.
- Shriprakash Sinha. Ph.D. student TU Delft.
- Simone Stumpf. Lecturer, City University London.
- Tao Sun, Graduate Student at UMass Amherst.
- Dan Vega, Senior Software Engineer at Valley Inception, LLC.
- Mark Vulfson. Microsoft Corporation.
- Kiri Wagstaff, Researcher at JPL.
- Xin Wang, Senior Scientist at Intelius.
- Dietrich Wettschereck. Recommind.com.
- Pengcheng Wu.
- Michael Wynkoop, Qualcomm.
- Qing Yao, College of Informatics and Electronics. Zhejiang Sci-Tech University. Hangzhou, China.
- Wei Zhang, The Boeing Company.
- Wei Zhang. Google. Doctoral Dissertation (PDF).
- Valentina Zubek, Boehringer Ingelheim.
- CS519/GEO599: Principles of
Ecosystem Informatics, 2004-2005.
- CS 534, Spring 2005, Machine
Learning.
- CS430, Fall 2003, Introduction to
Artificial Intelligence
- CS539, Fall 2003, Seminar: Probabilistic
Relational Models
- CS 533, Applied Artificial
Intelligence for Engineers.
- CS 539, Winter 2000, Selected Topics in
Artificial Intelligence: Probabilistic Agents
- CS 430/530, Fall 1999, Artificial Intelligence
Programming Techniques.
- CS 519, Fall 1996. Research Methods
in Computer Science.
- CS 450/550, Winter 1996, Introduction to Computer Graphics.
Machine Learning Resources
My Family's Musical Activities