At the leaf nodes the HME allows a wide variety of classification or
regression models; typically a "generalized linear model" (GLIM) is
used. The overall likelihood of the tree is a (conditional) mixture
of the probability distributions at the leaves.
For balanced trees the HME reduces straightforwardly to a Bayes net:
each nonterminal becomes a latent multinomial "decision" variable whose
distribution is conditioned on the input and on the decisions above it.
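For concreteness, here is a toy numerical sketch of that conditional
mixture for a depth-2 HME with softmax gates and linear-Gaussian leaf
experts (one kind of GLIM); all of the names, shapes, and parameter
values below are my own illustrative choices:

    import numpy as np

    def softmax(z):
        # numerically stable softmax over a vector of gate scores
        e = np.exp(z - z.max())
        return e / e.sum()

    def hme_likelihood(x, y, top_W, sub_W, expert_W, sigma=1.0):
        # P(y | x) for a depth-2 HME: softmax gates weight the
        # Gaussian densities produced by the linear leaf experts
        lik = 0.0
        g_top = softmax(top_W @ x)              # P(i | x)
        for i, gi in enumerate(g_top):
            g_sub = softmax(sub_W[i] @ x)       # P(j | i, x)
            for j, gj in enumerate(g_sub):
                mean = expert_W[i, j] @ x       # GLIM with identity link
                p = (np.exp(-0.5 * ((y - mean) / sigma) ** 2)
                     / (sigma * np.sqrt(2.0 * np.pi)))
                lik += gi * gj * p              # conditional mixture
        return lik

    rng = np.random.default_rng(0)
    d = 3
    x = rng.normal(size=d)
    top_W = rng.normal(size=(2, d))             # top-level gate
    sub_W = rng.normal(size=(2, 2, d))          # one gate per subtree
    expert_W = rng.normal(size=(2, 2, d))       # one GLIM per leaf
    print(hme_likelihood(x, 0.5, top_W, sub_W, expert_W))

A classification HME is the same picture with, say, Bernoulli or
multinomial leaf models in place of the Gaussians.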
The early work on the HME model focused on parameter estimation, and an
EM algorithm was developed; a rough sketch of one EM iteration follows
the reference below:
Jordan, M. I., and Jacobs, R. A., 1994, Hierarchical mixtures
of experts and the EM algorithm, {\em Neural Computation, 6},
181--214.
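To make the E and M steps concrete, here is a rough sketch of one EM
iteration for the simpler flat (one-level) mixture of experts with
linear-Gaussian experts and a fixed, shared noise scale; the
hierarchical case runs the same computations down the tree, and the
paper solves the weighted M-step problems with IRLS rather than the
single gradient step used here. The names are mine:

    import numpy as np

    def em_step(X, y, gate_W, expert_W, sigma=1.0, lr=0.1):
        # E-step: posterior responsibility h[t, i] of expert i for
        # case t (the shared Gaussian normalizer cancels in the ratio)
        logits = X @ gate_W.T
        g = np.exp(logits - logits.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)           # gate probabilities
        means = X @ expert_W.T
        p = np.exp(-0.5 * ((y[:, None] - means) / sigma) ** 2)
        h = g * p + 1e-12
        h /= h.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted least squares per expert
        new_expert_W = np.empty_like(expert_W)
        for i in range(expert_W.shape[0]):
            w = h[:, i]
            A = X.T @ (w[:, None] * X) + 1e-8 * np.eye(X.shape[1])
            new_expert_W[i] = np.linalg.solve(A, X.T @ (w * y))
        # gate: one gradient step on the expected complete-data
        # log-likelihood, whose gradient is (h - g)' X
        new_gate_W = gate_W + lr * (h - g).T @ X / len(y)
        return new_gate_W, new_expert_W

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = np.where(X[:, 0] > 0, X @ [1.0, 0.0, 0.0], X @ [0.0, -1.0, 0.0])
    gate_W = rng.normal(size=(2, 3))
    expert_W = rng.normal(size=(2, 3))
    for _ in range(50):
        gate_W, expert_W = em_step(X, y, gate_W, expert_W)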
Subsequent work focused on model selection for the HME. References
include:
Fritsch, J., Finke, M., and Waibel, A., 1997, Adaptively growing
hierarchical mixtures of experts, in {\em Advances in Neural Information
Processing Systems 9}, (M. Mozer, M. Jordan, and T. Petsche, Eds.),
Cambridge, MA: MIT Press.
Jacobs, R. A., Peng, F., and Tanner, M. A., 1997, A Bayesian
approach to model selection in hierarchical mixtures-of-experts
architectures, {\em Neural Networks, 10}, 231--241.
Kang, K., and Oh, J.-H., 1997, Statistical mechanics of the mixture
of experts, in {\em Advances in Neural Information Processing
Systems 9}, (M. Mozer, M. Jordan, and T. Petsche, Eds.),
Cambridge, MA: MIT Press.
Ramamurti, V., and Ghosh, J., 1996, Structural adaptation in
mixture of experts, in {\em Proceedings of the 13th International
Conference on Pattern Recognition}. Los Alamitos, CA: IEEE
Computer Society Press.
Saito, K., and Nakano, R., 1996, A constructive learning
algorithm for an HME, in {\em Proceedings of the IEEE International
Conference on Neural Networks}.
Waterhouse, S. R., and Robinson, A. J., 1996, Constructive
algorithms for hierarchical mixtures of experts, in {\em Advances
in Neural Information Processing Systems 8}, (D. S. Touretzky,
M. C. Mozer, and M. E. Hasselmo, Eds.), Cambridge, MA: MIT
Press.
Waterhouse, S., MacKay, D., and Robinson, T., 1996,
Bayesian methods for mixtures of experts, in {\em Advances
in Neural Information Processing Systems 8}, (D. S. Touretzky,
M. C. Mozer, and M. E. Hasselmo, Eds.), Cambridge, MA: MIT
Press.
There is also a growing literature on approximation and estimation
results for HMEs, including the following two papers:
Jiang, W., and Tanner, M. A., 1998, Hierarchical mixtures-of-experts
for exponential family regression models with generalized linear
mean functions: Approximation and maximum likelihood estimation,
in {\em Proceedings of the Fourteenth Conference on Uncertainty
in Artificial Intelligence}, (G. F. Cooper and S. Moral, Eds.),
San Francisco: Morgan Kaufmann.
Zeevi, A., Meir, R., and Maiorov, V., 1998, Error bounds
for functional approximation and estimation using mixtures of
experts. {\em IEEE Transactions on Information Theory, 44},
1010--1025.
Finally, the "hidden Markov decision tree" (HMDT) that Stefano referred
to is a different beast; it is a time series model. It is essentially
a Markov version of an HME in which the decision at a given nonterminal
depends on the decision at that nonterminal at the previous moment
in time. Fitting an HMDT to data is more complex than fitting an HME;
it requires variational or sampling methods.
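To illustrate the Markov dependence (my notation, not taken from any
of the papers above): each nonterminal carries a transition matrix
over its own branches, so the branch decision at time t is drawn
conditionally on the decision taken at time t-1. In the full model the
decisions are hidden and also depend on the inputs, which is why exact
inference is intractable:

    import numpy as np

    def gate_sequence_prob(decisions, init, trans):
        # P(d_1, ..., d_T) for the branch decisions at one nonterminal:
        # d_t depends on d_{t-1} through a node-specific transition
        # matrix, unlike the static HME, where each decision is made
        # anew at every time step
        p = init[decisions[0]]
        for prev, cur in zip(decisions[:-1], decisions[1:]):
            p *= trans[prev, cur]
        return p

    # a binary nonterminal that tends to stick with its previous branch
    init = np.array([0.5, 0.5])
    trans = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
    print(gate_sequence_prob([0, 0, 1, 1], init, trans))  # 0.5*0.9*0.1*0.8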
Mike