Information & Data Management and Analytics (IDEA) Laboratory

The vast amounts of digital data now available, i.e., Big Data, can bring about tremendous advancements in various areas of science and technology. We investigate the principles of, and build systems for, usable, effective, and efficient management, learning, and reasoning over very large data sets.
  • Email: termehca [at]
  • Address: 3053 Kelley Engineering Center, Corvallis, OR 97330-5501
Cape Perpetua

Recent News

  • We will present our preliminary results on learning to join large tables efficiently at SIGMOD-aiDM 2020. In our system, the scan operators collaboratively learn efficient join strategies.
  • We will present our work on effective and efficient learning over large and noisy data without any cleaning and preprocessing at SIGMOD 2020. Our system enables users to learn over many datasets that could not be used before due to the prohibitively time-consuming efforts to clean them.
  • Our paper on the data interaction game received an ACM SIGMOD Research Highlight Award.
  • We present our results on significantly improving the effectiveness of answering inexact queries, e.g., keyword queries, over large databases at SSDBM. The larger a database is, the more non-relevant answers the database system returns, because a larger database contains more non-relevant answers for a query. A larger database, however, contains answers to more queries. We find subsets of the database that are sufficiently small so that the database system returns mostly relevant answers. We also develop techniques to send each query to a sufficiently large subset of the database that contains its answers.
  • A couple of new manuscripts:
    • In the first one, we show how to learn accurate models directly over heterogeneous and dirty data without cleaning them.
    • In the second one, we present a graph search algorithm that adapts to the evolution in data representation.
  • We present the fundamental ideas behind our VDBMS system, which usably manages large-scale variable and heterogeneous data, at VLDB-Poly 2018.
  • Ben discusses the bases of autonomous entity integration at VLDB-Poly 2018.
  • We have a couple of papers in the VLDB Journal 2018: 1) Yodsawalai has the paper Cost-Effective Conceptual Design Using Taxonomies, which addresses the tradeoff between the usability and overhead of organizing data in a structured form; and 2) Jose publishes the paper Logically Scalable and Efficient Relational Learning, which extends his work on designing efficient learning algorithms that are robust against the logical representations of the data.
  • Jose demonstrates CastorX, a system that efficiently learns over multiple heterogeneous databases using novel sampling techniques, at VLDB 2018. He presented a summary of its fundamental ideas at SIGMOD-DEEM 2018.
  • Ben presents his work on helping humans and large-scale data sources progressively and automatically develop a mutual language for effective communication via reinforcement learning at SIGMOD 2018. His paper was selected as one of the best papers of the conference.
  • Jose demonstrates AutoMode, a system that automatically sets the language bias for learning systems over relational data at ICDE 2018.
  • People usually believe that to get effective results for vague queries, e.g., ambiguous keyword queries, data systems have to spend a lot of time and explore many potential answers in the data. We present a lightning talk on how to query large databases both effectively and efficiently using caching techniques at ICDE 2018.

Current projects

  • CHARM: Autonomous Communication of Humans and Information Sources

    One can gain invaluable insights by integrating and analyzing available data sources, such as traditional data systems, sensors, and social media. Data sources must also interact with each other to provide the information needed for many important queries and analyses. Unfortunately, different data sources express information in different forms. Humans also have their own ways of expressing their information and needs. Hence, humans and data sources cannot communicate effectively, which keeps many valuable insights out of our reach. CHARM aims at designing algorithms and systems that enable information sources and humans to automatically develop an effective mutual understanding and common language through interaction. Check out the project webpage for publications and more information.
  • READY: Representation Independent Data Analytics

    The output of data analytics algorithms depends heavily on the structure and representation of their input data. To use current database analytics algorithms, users have to find the desired representation for these algorithms and transform (wrangle) their data into these representations. These tasks are hard and time-consuming, and they are major obstacles to unlocking the value of data. READY aims at developing algorithms that return the desired results no matter how their input data is represented. Check out the project webpage for more information.

Selected awards

  • ACM SIGMOD Highlight Award, 2018.
  • SIGMOD best papers selection, 2018.
  • Distinguished PC member of SIGMOD 2017.
  • Best student paper award, ICDE 2011.
  • Yahoo! Key Scientific Challenges Award, 2011.
  • ICDE best papers selection, 2011.
