Thomas G. Dietterich
Distinguished Professor and Director of Intelligent Systems
School of Electrical Engineering and Computer Science
1148 Kelley Engineering Center
Oregon State University
Office: KEC 2067
PGP Public Key
(Last updated August 16, 2014.)
Students and Staff
"If you invent a breakthrough in artificial intelligence,
so machines can learn," Mr. Gates responded, "that is worth
10 Microsofts." (Quoted in NY Times, Monday March 3, 2004)
The focus of my research is machine learning (and the associated
areas of Data Science and Big Data). How can we make
computer systems that adapt and learn from their experience? How can
we combine machine learning with other advances in AI to build
Integrated Intelligent Systems? How can we combine human knowledge
with massive data sets to expand scientific knowledge and build more
useful computer applications? My laboratory combines research on
machine learning and AI fundamentals with applications to problems in
science and engineering.
- Scientific Projects
- Ecosystem Informatics and Computational
Sustainability: Oregon State University is a leader in combining
computer science and the ecological sciences to build the new
discipline of Ecosystem Informatics. Ecosystem Informatics studies
methods for collecting, analyzing, and visualizing data on the
structure and function of ecosystems. It is an instance of an
important new direction in science: Data Exploration Science (see Jim
Oregon State is also part of the Institute for
Computational Sustainability led by Cornell University.
This effort seeks to develop novel computational methods to address
problems in ecosystem science and sustainable management of the
My group is involved in many Ecosystem Informatics and
Computational Sustainability activities:
- Machine Learning for Species Distribution. One of the
central goals of ecology is to understand and predict the
distribution of species (including the bugs that we are studying
in the Insect Identification project). Given a data set that
records observations of the presence (or absence) of multiple
species at multiple locations, we wish to develop models that can
predict their presence/absence elsewhere. We are interested not
only in static distribution models, but also in process models
that capture the temporal and spatial dynamics of species distributions
(e.g., bird migration, flight times of moths, return of salmon,
spread of invasive species, survival of endangered species, etc.).
Our species distribution team includes faculty members (Matt
Betts, myself, and Weng-Keen Wong), post-docs Rebecca Hutchinson
and Selina Chu, and graduate students Arwen Lettkeman and Liping Liu.
We collaborate very closely with
the Cornell Laboratory of
Ornithology and with
the DataONE Datanet. In
particular, we are studying methods for dealing with the many
shortcomings of the citizen science data collected by the Lab of
Ornithology in their
eBird project including (a) partial
detection, (b) wide range of birder expertise, and (c) highly biased
spatial distribution of observations.
- BirdCast. Another special case of species distribution
modeling is understanding bird migration. With the Lab of
Ornithology, we are developing methods for reconstructing and
predicting bird migration across North America. Our goal is to
understand what signals birds use to decide when to migrate and to
provide daily forecasts of bird migration by combining eBird
reports, weather radar, acoustic monitoring of flight calls, and
weather forecasts. The project web site is
- Approximate Optimization for Bio-Economic Models. Many
sustainability applications require solving large spatio-temporal
optimization problems under uncertainty. We are collaborating
with economists Jo
Albers and Claire
Montgomery on methods for approximate solution of
spatio-temporal optimization problems involving land management
for wildfire control and counter-measures for controlling invasive species.
- Project TAHMO: Deployment, Cleaning, and Analysis of Sensor
Network Data. We are part of
Project TAHMO, which seeks to
construct and deploy a network of 20,000 hydro-meteorological
stations in Africa. We are developing algorithms for sensor
placement, data cleaning, recovery from damaged sensors, and
analysis of the resulting data. We are building on our previous
work with Ethan Dereszynski on dynamic Bayesian network models for
sensor data cleaning.
- Arthropod Identification. Our current understanding of
complex ecosystems is limited by a lack of data. One particularly
useful kind of data is population counts of "bugs" (small
arthropods that live in soils, lakes, streams, and the ocean).
The BugID project seeks to develop devices for
capturing, imaging, and sorting bugs combined with general image
processing/machine learning/pattern recognition tools for counting
and classifying them. We hope to transform the ability of
scientists to measure the health of forests, streams, and
estuaries. More generally, we are interested in developing a wide
range of novel instruments for expanding the quality, quantity,
and spatio-temporal resolution of ecologically-relevant data. Our
research also contributes to computer vision and object
recognition more generally.
- NIPS 2012
Posner Lecture: Challenges for Machine Learning in
- ICML 2011 Tutorial
on Machine Learning in Ecology and Ecosystem
- Intelligent Desktop Assistants. We have been involved in two
large efforts to develop intelligent assistants for the computer desktop.
- TaskTracer. When you come into work in the morning,
you don't want to say to your computer "I want to run Word", but
rather, "I want to work on my CS534 homework". In other words,
you would like a user interface that was organized around your
projects and activities rather than around application programs,
files, folders, etc. You would also like all of your information
in one place rather than scattered across the local file system,
network file systems, web sites, email folders, calendar,
contacts, etc. TaskTracer extends the Windows UI to provide
exactly this functionality. This research is supported by DARPA
with previous support from Google, Intel, and the DARPA CALO project.
News Service story. Project Web
- CALO. The goal of the CALO project
was to develop an AI personal assistant that can help you find
relevant documents, prepare for meetings, keep track of what is
going on during meetings, and autonomously execute tasks such as
arranging travel, scheduling meetings, executing administrative
workflows (e.g., purchasing and staffing), and so on. Our work on
CALO focused on developing methods for integrating multiple,
separately-engineered components into a single learning and
reasoning system. We also prototyped a novel system that
employs programming-by-demonstration to define new learning tasks
for CALO to solve autonomously. We are currently editing a book
describing the results of the CALO project.
- Next Generation Phenomics. An important goal in biology
is to reconstruct the tree of life. As part of
the Project AVATOL team, we are
developing computer vision and machine learning methods to
automatically discover and score phenotype characters (features)
from images of biological specimens. These scores can then be
combined with other information (e.g., genetic sequences,
functional measurements) to reconstruct phylogenetic trees.
Phenomic information is particularly valuable for sets of
closely-related species (where DNA differences may not reflect
functional differences) and for extinct species known only through
fossils. The computer science challenges involve learning to score
known characters, which typically include shape, texture, color,
and topological features of specimens, from weakly-labeled data
and discovering new characters that are shared across some
taxonomic groups but not others.
- Fundamental Machine Learning and Artificial Intelligence Research
- Machine Reading and Deep Reading. In collaboration
with researchers at BBN, CMU, University of Washington, ISI, and
UMass, we are studying methods for extracting knowledge from text
to support inference. Our focus is on learning rules (e.g., Horn
clauses) and scripts (e.g., logical hidden Markov models) from
noisy and incomplete training data extracted from reading text.
Funding provided by the DARPA Machine Reading and DEFT programs.
- Anomaly Detection. An important capability for AI
systems is to be able to detect when an input situation is
unusual. For example, anomaly detection can allow machine
learning systems to detect when an input case is very different
from the training data and hence could lead to extrapolation and
poor performance. Anomaly detection methods are also important
for detecting novel failures in sensor networks and novel attacks
on computer systems. We are developing a range of algorithms for
anomaly detection under the DARPA ADAMS program.
- Flexible Latent Variable Modeling. Many problems in
machine learning require learning models of hidden ("latent")
processes. Such latent variable models can be easily represented
using graphical models. However, such models are typically
expressed using parametric probability distributions, which limits
their ability to adapt to the complexity of the process and the
amount of data. Our research seeks to integrate flexible machine
learning methods (such as boosted regression trees) into latent
process models. Postdoc Rebecca Hutchinson and graduate student
Liping Liu are developing an R package that integrates boosted
regression trees into certain latent variable models common in
species distribution modeling.
- Learning Individual Models from Aggregate Data.
Most data in ecology (and other fields) records information in
aggregated form (e.g., population counts, census figures). Often,
we wish to fit models of individual behavior using such aggregated
data. One example is the problem of predicting bird migration
from eBird counts. Former Postdoc Dan Sheldon has developed a new
formalism, the Collective Graphical Model, that directly transforms
individual models to aggregate models that can then be easily
linked to aggregated data.
- Sample-Efficient Algorithms for Solving Spatial Markov
Decision Processes. We are developing algorithms for solving
MDPs in which the state consists of a landscape of patches, and
each patch has its own state. This means that the state space is
enormous. At each time step, an action must be specified for each
patch, so the action space is also enormous. In these problems,
phenomena (such as fires, infections, species spread) propagate
spatially, so we cannot treat the patches as independent. Our
algorithms seek exact (or bounded approximate) solutions for small
problem instances, and satisfactory solutions for real-sized
problem instances. Our algorithms optimize both expected
(discounted) cumulative reward and also risk-sensitive reward
measures. The transition probabilities for these models are
specified in the form of an expensive simulator which can be
invoked with a given state and action to obtain a sample of the
resulting state and the reward. An important goal is to minimize
the number of calls to the simulator.
- Software Engineering Methods and Tools for Machine
Learning Systems. Creating and maintaining a successful
deployed machine learning system is still largely an art that
requires a Ph.D. The goal of this research is to develop software
engineering methods and tools for creating and maintaining
deployed machine learning systems.
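The simulator-based setting described under Sample-Efficient Algorithms for Solving Spatial Markov Decision Processes can be illustrated with a small sketch. This is not the group's algorithm or simulator: the toy fire-spread model, the names `toy_fire_simulator`, `CountingSimulator`, `estimate_return`, and the two policies are all hypothetical and exist only for illustration. The sketch shows the interface the text describes (a simulator invoked with a state and a per-patch action that returns a sampled next state and reward), plus a counter that makes the number of simulator calls visible, since that is the quantity to minimize.

```python
import random
from typing import Callable, Tuple

State = Tuple[int, ...]   # one entry per patch: 0 = healthy, 1 = burning
Action = Tuple[int, ...]  # one entry per patch: 0 = do nothing, 1 = suppress
Policy = Callable[[State], Action]
Simulator = Callable[[State, Action, random.Random], Tuple[State, float]]


def toy_fire_simulator(state: State, action: Action,
                       rng: random.Random) -> Tuple[State, float]:
    """Hypothetical stand-in for an expensive simulator: one sampled transition
    on a line of patches (assumed dynamics, chosen only for illustration)."""
    n = len(state)
    next_state = []
    for i in range(n):
        if action[i] == 1:
            next_state.append(0)  # a suppressed patch ends the step unburned
            continue
        neighbor_burning = any(
            state[j] == 1 for j in (i - 1, i + 1) if 0 <= j < n
        )
        if state[i] == 1:
            next_state.append(1)  # untreated fires keep burning
        elif neighbor_burning and rng.random() < 0.5:
            next_state.append(1)  # fire spreads from a burning neighbor
        else:
            next_state.append(0)
    # Reward: -1 per burning patch after the step, -0.2 per suppression action.
    reward = -1.0 * sum(next_state) - 0.2 * sum(action)
    return tuple(next_state), reward


class CountingSimulator:
    """Wraps a simulator and counts how many times it is invoked."""

    def __init__(self, simulate: Simulator, seed: int = 0) -> None:
        self._simulate = simulate
        self._rng = random.Random(seed)
        self.calls = 0

    def step(self, state: State, action: Action) -> Tuple[State, float]:
        self.calls += 1
        return self._simulate(state, action, self._rng)


def estimate_return(sim: CountingSimulator, policy: Policy, start: State,
                    horizon: int, gamma: float, rollouts: int) -> float:
    """Monte Carlo estimate of expected discounted cumulative reward."""
    total = 0.0
    for _ in range(rollouts):
        state, discount, ret = start, 1.0, 0.0
        for _ in range(horizon):
            state, reward = sim.step(state, policy(state))
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / rollouts


def do_nothing(state: State) -> Action:
    return tuple(0 for _ in state)


def suppress_burning(state: State) -> Action:
    return tuple(state)  # act exactly on the patches that are burning


if __name__ == "__main__":
    start = (0, 0, 1, 0, 0)
    sim = CountingSimulator(toy_fire_simulator, seed=0)
    for name, policy in [("do nothing", do_nothing),
                         ("suppress burning", suppress_burning)]:
        value = estimate_return(sim, policy, start,
                                horizon=5, gamma=0.95, rollouts=200)
        print(name, round(value, 3), "simulator calls so far:", sim.calls)
```

Plain Monte Carlo evaluation like this spends `rollouts * horizon` simulator calls per policy, which is the cost a sample-efficient method would try to reduce. Even with five patches there are 2^5 joint actions per step, which hints at why the action space becomes enormous on realistic landscapes.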
- Reviews, tutorials, and
books. I have written several review articles and tutorials on
My group meets Mondays 3-4pm in KEC 2057. Each week, one member gives
a one-hour presentation of a paper relevant to their work. There are
two other reading groups meeting this quarter. I'm participating in
the group organized by Alan Fern on the topic of Monte Carlo Tree
Search and MDPs. We meet Thursdays 3-4pm in KEC 2057. There is also an
AI Seminar scheduled from time to time (contact Alan Fern for
I am NOT accepting any new students or postdocs for Fall 2014. The following is
advice for students applying in future years.
If you are seeking a research career in machine learning, data mining,
artificial intelligence and related areas, and you have a strong
background in mathematics and programming, please read my Information for Prospective Students
page. To see what courses I expect my Ph.D. students to take, please
see Recommended Courses for Ph.D. Students in
If you are interested in robotics, I encourage you to visit
the Robotics Team
Pages to learn more about our excellent robotics program.
Professional Service, Journals, and Book Series
- I am a co-founder of Strands
(formerly MyStrands; formerly MusicStrands), a recommendation company.
- I am a co-founder of Smart Desktop. Smart Desktop
is now part of Decho, Inc., which is a "cloud
computing" effort of EMC.
Decho is commercializing technology developed as part of the
- I am a co-founder and Chief Scientist of BigML. The
goal of this startup is to develop large scale cloud-based machine
- Andrew Emmott, Graduate Student.
- Mark Crowley, Postdoc.
- Rebecca Hutchinson, Postdoc.
- Jesse Hostetler, Graduate Student.
- Jed Irvine, Software Developer.
- Arwen Lettkeman, Graduate Student.
- Liping Liu, Graduate Student.
- Sean McGregor, Graduate Student.
- Michael Slater, Project Manager.
- Majid Alkaee Taleghan, Graduate Student.
- Shahed Sorower, Graduate Student.
- Pat Sullivan, Assistant and Grants Coordinator.
Former Students and Staff
- Hussein Almuallim,
Oil and Energy Professional, Calgary, Canada.
- Eric Altendorf, Google.
- Adam Ashenfelter, BigML, Inc., Corvallis, Oregon.
- Ghulum Bakiri, Department of Computer Science, Bahrain University
Baumberger. Master Student in Biomedical Engineering at University
- Xinlong Bao. Google Pittsburgh.
- Brian Breck.
- Waranun Bunjongsat.
- Giuseppe Cerbone. Independent Information Services Professional, Milan, Italy.
- Martha Chamberlin.
- Hei Chan.
- Richard Charon.
- Eric Chown, Full Professor, Bowdoin College.
- Selina Chu, JPL,
- Dan Corpron
- Diane Damon, Damon Consulting, Portland, OR.
Dereszynski, Research Scientist, WebTrends, Portland, OR.
- Phuoc Do, Wiavia.
Flann Associate Professor, Utah State University
- Greg Foltz.
- Dan Forrest.
- Tony Fountain, Scientist
- Ashit Gandhi, Founder and Vice-President, Prism Gem, LLC - The Art of Diamond Coloring.
- Colin Gerety, Fort Collins, CO.
- Brandon Harvey, Enterprise Systems Integrator at University of Oregon.
- Guohua Hao, Data Scientist at Dataminr.
- Hermann Hild, President, SMI Cognitive Software GmbH.
- Saket Joshi, Postdoc (CI Fellow), Oregon State University.
Joshi, Senior Software Engineering Manager at Arris, Portland, OR.
- Caroline Koff, Hewlett-Packard Corporation, Fort Collins, CO.
Keiser, Research Programmer, CMU. Masters Thesis (PDF).
- Michael Kelm, Research Scientist, Siemens Healthcare.
- Eun Bae Kong, Professor, Computer Science, Chungnam National University, South Korea
- Bill Langford, Research Associate at RMIT, Melbourne, Australia.
Lin, VMware, Seattle.
- Dragos Margineantu, The Boeing Company.
Martinez, Assistant Professor, Autonomous University of Madrid.
- Prafulla Mishra, Software Development Manager at eBay.
- Avis Ng.
- Soumya Ray, Assistant Professor, Case Western Reserve University.
- Angelo Restificar, Principal Machine Learning Engineer, eBay, San Jose.
- Ritchey Ruff, Senior SDET, Microsoft.
- Dan Sheldon, Assistant Professor, University of Massachusetts, Amherst.
- Jianqiang Shen. Research Scientist, PARC. Doctoral dissertation.
- Rongkun Shen.
Post-doc, Oregon Health and Science University, Portland.
- Michael Shindler, Hulu.
- Shriprakash Sinha. Ph.D. student TU Delft.
Stumpf. Senior Lecturer, City University London.
- Tao Sun, Graduate Student at UMass Amherst.
- Dan Vega, Senior Software Engineer at Valley Inception, LLC.
- Mark Vulfson. Microsoft Corporation.
- Kiri Wagstaff, Researcher at
- Xin Wang, Senior Scientist at Intelius.
- Dietrich Wettschereck. Recommind.com.
- Pengcheng Wu.
- Michael Wynkoop, Qualcomm.
- Qing Yao, College of Informatics and Electronics. Zhejiang Sci-Tech University. Hangzhou, China.
- Wei Zhang, The Boeing Company.
Zhang. Google. Doctoral Dissertation (PDF).
- Valentina Zubek, Boehringer Ingelheim.
- CS519/GEO599: Principles of
Ecosystem Informatics, 2004-2005.
- CS 534, Spring 2005, Machine
- CS430, Fall 2003, Introduction to
- CS539, Fall 2003, Seminar: Probabilistic
- CS 533, Applied Artificial
Intelligence for Engineers.
- CS 539, Winter 2000, Selected Topics in
Artificial Intelligence: Probabilistic Agents
- CS 430/530, Fall 1999, Artificial Intelligence
- CS 519, Fall 1996. Research Methods
in Computer Science.
- CS 450/550, Winter 1996, Introduction to Computer Graphics.
Machine Learning Resources
My Family's Musical Activities
Tom Dietterich, email@example.com