discriminant analysis; large and high-dim datasets with MCLUST

Chris Fraley (fraley@stat.washington.edu)
Tue, 18 May 1999 15:35:51 -0700 (PDT)

Issues:

1. discriminant analysis (supervised classification) with MCLUST
2. MCLUST applied to large data sets
3. MCLUST applied to high-dimensional data

This posting addresses several recent inquiries concerning the use of MCLUST
for large or high-dimensional data sets. The MCLUST documentation (README
and technical report) has been updated to reflect these issues.

------------------------------------------
Discriminant analysis and large data sets
------------------------------------------

First, it should be noted that the function estep() in MCLUST can be used
for discriminant analysis (supervised classification). It accepts as input
the parameters of a Gaussian mixture (means, covariances, and mixing
proportions) together with a model specification, and returns the
conditional probabilities of group membership, which can be converted to a
classification if desired.
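
As a minimal sketch of this (the argument names, the modelid value, and
the form of the returned object are assumptions here; the exact estep()
interface is documented in the MCLUST README):

   # Illustrative only: argument names and return value are assumptions.
   # mu, sigma, prob are the mixture means, covariances, and mixing
   # proportions estimated from the training data; "VVV" denotes the
   # unconstrained model.
   z <- estep(newdata, mu = mu, sigma = sigma, prob = prob, modelid = "VVV")
   # z[i,k] is the conditional probability that observation i belongs to
   # group k; assign each observation to its most probable group.
   classification <- max.col(z)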

Large data sets can be classified by first clustering a subset of the data,
and then classifying the remaining observations by discriminant analysis
(as was done, for example, with the MRI brain scan image in Banfield and
Raftery, Biometrics 49, 1993). Within MCLUST, the function emclust() can be
run on a subset of the data to find the clusters, and the conditional
probabilities for the optimal model obtained via summary(). The function
mstep() can then be invoked to obtain the associated maximum likelihood
parameters. New observations are then classified using estep() with these
parameters as input.
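
A rough sketch of the whole strategy follows; the function names are those
described above, but the argument names and the components extracted from
each result (z, mu, sigma, prob) are assumptions, so check the README for
the exact interfaces.

   # Sketch only: argument names and result components are assumptions.
   # x is the full data matrix (observations in rows).
   sub <- x[sample(1:nrow(x), size = 1000), ]  # cluster a subset first
   fit <- emclust(sub)                         # model-based clustering via EM
   z   <- summary(fit, sub)$z                  # conditional probs, best model
   par <- mstep(sub, z)                        # associated ML parameters
   # classify all observations (a model specification matching the chosen
   # model would also be supplied to estep)
   zall <- estep(x, mu = par$mu, sigma = par$sigma, prob = par$prob)
   classification <- max.col(zall)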

emclust() and emclust1() also include a provision for using a subsample of
size k of the data in the hierarchical clustering phase before applying EM
to the full data set. This strategy is often adequate for data sets that
are large, but not extremely so.
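
Roughly, this looks like the following; the argument name used for the
subsample ("samp") is hypothetical, so consult the emclust() help file or
the README for the actual name.

   # Hypothetical illustration: "samp" is a stand-in for however emclust()
   # accepts the subsample used in the hierarchical phase.
   samp <- sample(1:nrow(x), size = 500)  # indices of the subsample
   fit  <- emclust(x, samp = samp)        # hierarchical phase on x[samp,],
                                          # EM on all of x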

---------------------
High-dimensional data
---------------------

Models in which the orientation is allowed to vary between clusters (EEV, VEV,
and VVV in the current version of MCLUST) have O(d^2) parameters per cluster,
where d is the dimension of the data. For this reason, MCLUST may not work
well or may otherwise be inefficient for these models when applied to
high-dimensional data. It may still be possible to analyze such data with
MCLUST by restricting to models with fewer parameters (the spherical models
EI and VI, and the constant-variance model EEE in MCLUST), or by applying a
dimension-reduction technique such as principal components.
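
For instance, principal component scores can be computed with the standard
S function princomp() and a few leading components clustered in place of
the original variables; the number of components kept and the emclust()
call below are illustrative.

   # Sketch: reduce dimension with principal components before clustering.
   pc <- princomp(x)        # standard S principal components
   xr <- pc$scores[, 1:5]   # keep, say, the first 5 component scores
   fit <- emclust(xr)       # then cluster in the reduced space
   # Alternatively, stay in the original space but restrict attention to
   # the more parsimonious models (EI, VI, EEE).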

Note that none of the methods currently in MCLUST can handle data sets in
which the number of observations is smaller than the data dimension.