Re: BNs from Decision Trees

Michael Jordan (jordan@cs.berkeley.edu)
Wed, 23 Sep 1998 12:50:36 -0700 (PDT)

To augment Stefano's email, the "hierarchical mixture of experts"
(HME) is a probabilistic decision tree, in which the decisions
(the nonterminal nodes) are modeled as having a probabilistic
dependence on the covariate (the "input vector"), via a multinomial
logit model. (Think of this as a generalized form of the noisy-OR model,
for a discrete node with more than two states. It provides a "soft",
multiway split of the input space. In the case of binary trees
the decisions are binary and the "multinomial logit" model reduces
to logistic regression.)
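
(To make the gating model concrete, here is a rough sketch in my
own notation, not copied from any of the papers below: at a
nonterminal with branches j = 1, ..., K, the probability of
taking branch j given the input x is

   g_j(x) = \exp(v_j^T x) / \sum_k \exp(v_k^T x),

with one parameter vector v_j per branch.)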

At the leaf nodes the HME allows a wide variety of classification or
regression models; typically a "generalized linear model" (GLIM) is
used. The overall likelihood of the tree is a (conditional) mixture
of the probability distributions at the leaves.
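
(Again only a sketch in my notation: for a two-level tree with
top-level gates g_i(x), second-level gates g_{j|i}(x), and a GLIM
P_{ij}(y|x) at leaf (i,j), the conditional likelihood is

   P(y|x) = \sum_i g_i(x) \sum_j g_{j|i}(x) P_{ij}(y|x).)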

For balanced trees the HME reduces straightforwardly to a Bayes net.
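
(A sketch of that correspondence, still in my notation: introduce
one latent multinomial variable per level, say z_1 and z_2 for a
two-level tree, and factor

   P(y, z_1, z_2 | x) = P(z_1|x) P(z_2|z_1,x) P(y|z_1,z_2,x);

this is a small DAG with x a parent of every node, z_1 a parent
of z_2, and z_1 and z_2 parents of y. Summing out z_1 and z_2
recovers the mixture above.)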

The early work on the HME model focused on parameter estimation, and an
EM algorithm was developed. The reference is:

Jordan, M. I., and Jacobs, R. A., 1994, Hierarchical mixtures
of experts and the EM algorithm, {\em Neural Computation, 6},
181--214.
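
(Very roughly, and again only as a sketch in my notation rather
than a restatement of the paper: the E-step computes posterior
responsibilities for each leaf,

   h_{ij}(x,y) \propto g_i(x) g_{j|i}(x) P_{ij}(y|x),

and the M-step refits each gating network and each leaf GLIM by
responsibility-weighted maximum likelihood, e.g. via IRLS; see
the paper for the exact updates.)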

Subsequent work focused on model selection for the HME. References
include:

Fritsch, J., Finke, M., and Waibel, A., 1997, Adaptively growing
hierarchical mixtures of experts, in {\em Advances in Neural Information
Processing Systems 9}, (M. Mozer, M. Jordan, and T. Petsche, Eds.),
Cambridge, MA: MIT Press.

Jacobs, R. A., Peng, F., and Tanner, M. A., 1997, A Bayesian
approach to model selection in hierarchical mixtures-of-experts
architectures, {\em Neural Networks, 10}, 231--241.

Kang, K., Oh, J-H., 1997, Statistical mechanics of the mixture
of experts, in {\em Advances in Neural Information Processing
Systems 9}, (M. Mozer, M. Jordan, and T. Petsche, Eds.),
Cambridge, MA: MIT Press.

Ramamurti, V., and Ghosh, J., 1996, Structural adaptation in
mixture of experts, in {\em Proceedings of the 13th International
Conference on Pattern Recognition}. Los Alamitos, CA: IEEE
Computer Society Press.

Saito, K., and Nakano, R., 1996, A constructive learning
algorithm for an HME, in {\em Proceedings of the IEEE International
Conference on Neural Networks}.

Waterhouse, S. R., and Robinson, A. J., 1996, Constructive
algorithms for hierarchical mixtures of experts, in {\em Advances
in Neural Information Processing Systems 8}, (Touretzky, D.S.,
Mozer, M.C., and Hasselmo, M.E., Eds.), Cambridge, MA: MIT
Press.

Waterhouse, S., MacKay, D., and Robinson, T., 1996,
Bayesian methods for mixtures of experts, in {\em Advances
in Neural Information Processing Systems 8}, (Touretzky, D.S.,
Mozer, M.C., and Hasselmo, M.E., Eds.), Cambridge, MA: MIT
Press.

There is also a growing literature on approximation and estimation
results for HMEs, including the following two papers:

Jiang, W., and Tanner, M. A., 1998, Hierarchical mixtures-of-experts
for exponential family regression models with generalized linear
mean functions: Approximation and maximum likelihood estimation,
in {\em Proceedings of the Fourteenth Conference on Uncertainty
in Artificial Intelligence}, (G. F. Cooper and S. Moral, Eds.),
San Francisco: Morgan Kaufmann.

Zeevi, A., Meir, R., and Maiorov, V., 1998, Error bounds
for functional approximation and estimation using mixtures of
experts. {\em IEEE Transactions on Information Theory, 44},
1010--1025.

Finally, the "hidden Markov decision tree" (HMDT) that Stefano referred
to is a different beast; it is a time series model. It is essentially
a Markov version of an HME in which the decision at a given nonterminal
depends on the decision at that nonterminal at the previous moment
in time. Fitting an HMDT to data is more complex than fitting an HME;
it requires variational or sampling methods.
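
(A rough sketch of that dependency structure, in my notation and
not taken verbatim from the paper: writing z_t^{(i)} for the
decision at level i at time t, the HME gate
P(z_t^{(i)} | z_t^{(i-1)}, x_t) is replaced by something of the
form P(z_t^{(i)} | z_t^{(i-1)}, z_{t-1}^{(i)}, x_t), so each level
carries its own Markov chain. The coupling between these chains
is what makes exact inference hard and motivates the variational
or sampling methods just mentioned.)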

Mike