Kagan Tumer's Publications



Robust Order Statistics Based Ensembles for Distributed Data Mining. K. Tumer and J. Ghosh. In H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, pp. 185–210, AAAI/MIT Press, 2000.

Abstract

Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated until now, each classifier is trained on data taken or resampled from a common data set, or randomly selected partitions thereof, and thus experiences similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical model of how decision boundaries are affected by order statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that selecting the median, the maximum, and in general the ith order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and empirically show that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to several heterogeneous distributed data mining situations, especially when it is not practical or feasible to pool all the data in a common data warehouse before attempting to analyze it.
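The combiners described above can be sketched in a few lines. The following is a minimal illustration, not the chapter's exact formulation: the trim combiner is shown as a symmetric trimmed mean and the spread combiner as a fixed equal-weight average of the extreme order statistics, whereas the chapter derives the actual linear coefficients; the function names and the toy scores are invented for this example.

```python
import numpy as np

def order_statistic_combine(outputs, k):
    """k-th order statistic (0-based, ascending) of each class score
    across classifiers. outputs: shape (n_classifiers, n_classes)."""
    return np.sort(outputs, axis=0)[k]

def trim_combine(outputs, trim=1):
    """Trimmed mean: drop the `trim` lowest and highest scores per
    class, then average the rest (robust to outlier classifiers)."""
    s = np.sort(outputs, axis=0)
    return s[trim:len(s) - trim].mean(axis=0)

def spread_combine(outputs):
    """Spread combiner sketch: equal-weight combination of the minimum
    and maximum ordered outputs (illustrative choice of weights)."""
    s = np.sort(outputs, axis=0)
    return 0.5 * (s[0] + s[-1])

# Posterior estimates from three classifiers for three classes;
# the third classifier is an outlier (e.g. trained on poor data).
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])

median = order_statistic_combine(scores, k=1)  # per-class medians
print(median, "-> predicted class", median.argmax())
```

With the outlier present, a plain average would inflate class 2's score, while the median combiner still predicts class 0, which is the robustness property the chapter analyzes.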

Download

[PDF] (191.6kB)

BibTeX Entry

@incollection{tumer-ghosh_dpkd00,
        author={K. Tumer and J. Ghosh},
        title = {Robust Order Statistics Based Ensembles for Distributed Data Mining},
        booktitle = {Advances in Distributed and Parallel Knowledge Discovery},
	editor = {H. Kargupta and P. Chan},
	pages = {185--210},
	publisher = {AAAI/MIT Press},
	abstract={Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated until now, each classifier is trained on data taken or resampled from a common data set, or randomly selected partitions thereof, and thus experiences similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance.
In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical model of how decision boundaries are affected by order statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that selecting the median, the maximum, and in general the ith order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and empirically show that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to several heterogeneous distributed data mining situations, especially when it is not practical or feasible to pool all the data in a common data warehouse before attempting to analyze it.},
	bib2html_pubtype = {Book Chapters},
	bib2html_rescat = {Classifier Ensembles},
        year={2000}
}

Generated by bib2html.pl (written by Patrick Riley) on Wed Apr 01, 2020 17:39:43