Active Learning for Exploratory Clustering

Project Overview

Data clustering is a widely used tool for organizing data into coherent groups that reflect the underlying structure of the data. In many applications, incorporating domain knowledge into clustering can improve both the quality and the utility of the results. Unfortunately, users who are not data mining experts currently lack effective means of providing such input to guide clustering. Against this background, we seek to develop a novel class of algorithms that use active learning strategies to interactively elicit information from users to drive clustering.

An important aim of this work is to identify types of input, e.g., must-link and cannot-link constraints, that are both informative and easy to elicit interactively from users, and that improve the quality and utility of clustering results. The study is driven by, and evaluated on, exploratory data analysis tasks that arise in several application domains: (1) ecosystem informatics, e.g., exploratory analysis of in-field bird recordings; (2) human-computer interaction (HCI), e.g., analysis of HCI data to understand user behavior; and (3) plant genomics. The work is carried out in collaboration with scientists who have expertise in each of these domains.
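To make the idea of pairwise constraints concrete, here is a minimal Python sketch. It is an illustration only, not the project's algorithm: the function names, the toy data, and the query heuristic (ask the user about the closest pair of points that the current clustering places in different clusters) are all assumptions introduced for this example. A must-link constraint says two items belong in the same cluster; a cannot-link constraint says they belong in different clusters.

```python
import math
from itertools import combinations


def violated_constraints(labels, must_link, cannot_link):
    """Return the user constraints that a cluster assignment breaks."""
    violations = []
    for i, j in must_link:
        if labels[i] != labels[j]:  # should share a cluster but do not
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:  # should be separated but are not
            violations.append(("cannot-link", i, j))
    return violations


def next_query(points, labels, asked):
    """Hypothetical heuristic: pick the closest not-yet-asked pair that the
    current clustering separates, since such boundary pairs are the ones
    the clustering is least sure about."""
    best = None
    for i, j in combinations(range(len(points)), 2):
        if (i, j) in asked or labels[i] == labels[j]:
            continue
        d = math.dist(points[i], points[j])
        if best is None or d < best[0]:
            best = (d, i, j)
    return None if best is None else (best[1], best[2])


# Toy data (made up for illustration): five points on a line.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (5.0, 0.0)]
labels = [0, 0, 0, 1, 1]   # current cluster assignment
must_link = [(2, 3)]       # user: items 2 and 3 belong together
cannot_link = [(3, 4)]     # user: items 3 and 4 must be separated

violations = violated_constraints(labels, must_link, cannot_link)
query = next_query(points, labels, asked=set(must_link) | set(cannot_link))
print(violations)
print(query)
```

In an interactive loop, the answer to each query would be added to the must-link or cannot-link set, and the data would be re-clustered to reduce the number of violations.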

Improved tools for interactive exploratory data analysis benefit a broad range of applications, including most areas of science, where such analysis plays an increasingly important role in extracting knowledge from data. For example, in ecological informatics, such tools can help scientists better understand the impact of environmental changes on bird species, which in turn can help develop better methods for managing ecosystems.



Collaborative Projects


This work is supported by NSF Award 1055113. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).