Summary of responses on "Learning BNs with hidden variables"

Frank Wittig (fwittig@cs.uni-sb.de)
Thu, 15 Apr 1999 11:52:39 +0000

Dear colleagues,

Here is a summary of the answers to a query that I posted to the
UAI list two weeks ago. Many thanks to all who responded.

Frank Wittig

----------------------------------------------------------------
A. ORIGINAL QUERY:

Subject: Learning BNs with hidden variables

...

We have been using the Adaptive Probabilistic Networks method
of Binder, Koller, Russell and Kanazawa (1997) to learn Bayesian
networks with hidden variables. (The context is the modeling of
unobservable properties of computer users.) We use
Polak-Ribiere's method for choosing the direction of the next
hill-climbing step.
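
For concreteness, the Polak-Ribiere update of the search
direction looks roughly as follows (a minimal sketch, not our
actual implementation; the gradient vectors are assumed to come
from whatever routine computes the log-likelihood gradient):

    import numpy as np

    def polak_ribiere_direction(grad_new, grad_old, dir_old):
        # grad_new, grad_old: log-likelihood gradients at the current
        # and previous parameter vectors; dir_old: previous direction.
        beta = grad_new.dot(grad_new - grad_old) / grad_old.dot(grad_old)
        beta = max(0.0, beta)  # common restart safeguard ("PR+")
        return grad_new + beta * dir_old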

Although the results so far have been reasonable, the learned
nets seem to be local optima that could be improved upon. Also,
the algorithm seems to concentrate its learning mainly on the
CPTs that link a hidden variable with its observable children,
leaving the CPT for the hidden variable largely unchanged, even
when this initial CPT does not represent a particularly promising
choice.

We'd be interested to hear of methods that others have used to
get good results with this type of algorithm - for example, by
specifying constraints that the learned network ought to satisfy.

We are also interested in experiences with implementations of
the EM algorithm for learning Bayesian networks and how these or
similar problems have been solved in that context.

...

Binder, J., Koller, D., Russell, S., & Kanazawa, K. (1997).
Adaptive probabilistic networks with hidden variables.
Machine Learning, 29, 213-244.

----------------------------------------------------------------
B. SUMMARY OF RESPONSES:

1. Almost all respondents said that, when learning Bayesian
networks using either gradient-based methods or implementations
of the EM algorithm, they had observed the same or similar
problems to those we described. Generally, EM is preferred as a
learning technique because it is simpler, faster and more robust
than gradient-based algorithms:

Jim Myers (myers29@erols.com):

"The reason I used stochastic search is to avoid getting "stuck"
at the nearest local optimum. The results were pretty good, but
there's still a good bit of work to do as this is a very
difficult problem."

Daniel Nikovski (Daniel_Nikovski@gs147.sp.cs.cmu.edu):

"I have had the same experience with the mentioned algorithm of
Binder, Koller, Russell and Kanazawa - it usually converges to
local maxima of the log-likelihood, which are far from optimal. I
compared it with EM on the problem of learning transition
probabilities between hidden state nodes in temporal
probabilistic networks - more details can be found in:

Nikovski, D. (1998). Learning stationary temporal probabilistic
networks. Conference on Automated Learning and Discovery, Pittsburgh,
June 11-13, 1998. http://www.cs.cmu.edu/~danieln/conald.ps"

Kevin Murphy (murphyk@cs.berkeley.edu):

"It is widely known that EM is simpler, faster and numerically
more robust than gradient descent (at least, that has been my
experience). Since you have already implemented the Binder et
al. method, it would be trivial to convert your code to do EM."
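
As an illustration of how simple the EM update is in this
setting, the M-step for a single CPT amounts to normalizing the
expected counts collected during the E-step (a generic sketch;
the names and the pseudo-count are ours, not from any particular
implementation):

    import numpy as np

    def em_update_cpt(expected_counts, pseudo_count=1e-3):
        # expected_counts[j, k]: expected number of training cases in
        # which the parents take their j-th configuration and the child
        # takes its k-th value, accumulated by running inference over
        # the (partially observed) data during the E-step.
        # The small pseudo-count keeps parent configurations with no
        # expected support well defined.
        counts = expected_counts + pseudo_count
        return counts / counts.sum(axis=1, keepdims=True)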

Stuart Russell (russell@cs.berkeley.edu):

"EM is a lot simpler than the advanced methods and has fewer
"knobs", so if you plan to implement one or the other you may
prefer to use EM. Some people (eg Aalborg) have proposed a
combination approach using each in the region of parameter space
for which it converges faster."

2. One response addressed the mathematical basis for this
behavior:

Matthew Brand (brand@merl.com):

"In probabilistic models with hidden variables, the energy or
error surface of the likelihood function typically has a local
optimum near every initialization of the hidden-variable CPTs.
Consequently maximum-likelihood re-estimation will often
terminate with a model whose parameters are changed little from
initialization. This paragraph from a recent paper explains why:

`The principle of maximum likelihood (ML) is not valid for small
data sets; in many learning tasks, the training data is rarely
large enough to "wash out" sampling artifacts (e.g., noise)
that obscure the data-generating mechanism's essential
regularities. This is an acute problem in hidden-variable models
because most of the parameters only witness a subset of the data
and the parameter space is riddled with local optima. Many of
these optima arise because the energy surface has d-fold symmetry
for permutationally equivalent parameterizations of the model.
The data's expected sufficient statistics are a superposition of
an exponential number of different interpretations, each
associated with a different choice of hidden variable values, and
any two interpretations may have opposite log-likelihood
gradients that "pull" toward opposite but permutationally
equivalent equilibria. Consequently ML-estimated hidden-variable
models are often both over-fit and under-fit, with poor
predictive power and modest generalization. The problem worsens
with "interior" hidden variables because diffusion of context
increases the uncertainty in the expected sufficient statistics
with each step away from the observables. This problem is
historically overlooked because even poorly fit models can
provide sufficient statistical coverage of the data to license
crude inference tasks such as classification. But well-fit
models can support substantially more powerful inference tasks.'

The paper goes on to show how to regularize the problem with
entropy-minimizing parameter estimators."

Matthew Brand. Inferring hidden state from video. Mitsubishi
Electric Research Labs technical report 99-XX (in review).
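
Brand's paper derives its own estimators; the basic idea of
trading likelihood against entropy can be illustrated for a
single multinomial roughly as follows (a numerical sketch with an
objective and names chosen by us, not taken from the paper):

    import numpy as np
    from scipy.optimize import minimize

    def entropic_map(expected_counts, z=1.0):
        # MAP estimate of one multinomial under an entropic prior
        # P(theta) proportional to exp(-z * H(theta)): maximize
        #   sum_i w_i*log(theta_i) + z * sum_i theta_i*log(theta_i)
        # over the probability simplex, which biases the estimate
        # toward lower-entropy (sparser) parameters than plain ML.
        w = np.asarray(expected_counts, dtype=float)
        n = len(w)

        def neg_objective(theta):
            theta = np.clip(theta, 1e-12, 1.0)
            return -(w @ np.log(theta) + z * (theta @ np.log(theta)))

        simplex = {'type': 'eq', 'fun': lambda t: t.sum() - 1.0}
        result = minimize(neg_objective, np.full(n, 1.0 / n),
                          bounds=[(0.0, 1.0)] * n,
                          constraints=[simplex])
        return result.x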

3. Learning `meaningful' CPTs for hidden variables seems to be an
interesting issue for future work. Several respondents made this point:

Kevin Murphy:

"In the HMM community, it is often found that most of the
"action" is in learning the link to the observable node, and not
between the hidden nodes, since the observables affect the
likelihood much more. This is the correct thing to do if you want
to maximize the likelihood of the observed data. However, if you
care about learning interesting structure between the hidden
nodes, you should perhaps use a different objective function,
although it is not clear what that function should be. Remember
also that, if a node is always hidden, it might be
unidentifiable, and hence you cannot expect its CPT to be
"meaningful". This is a question I am also interested in, but I
do not know of any solutions."

Stuart Russell:

"Depending on the exact structure of your network, the CPT of the
hidden variable may be just fine. Consider the fact that the network
will be just as good if you flip the semantics of the hidden variable.
E.g., you think the variable is called Alive, but the learning
algorithm settles on a set of CPTs such that the variable is in fact
representing Dead. To you, the CPTs will look wrong, but the fit to
the data is identical."
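
This relabeling invariance is easy to check numerically:
permuting the states of a hidden variable, together with the
corresponding rows of every CPT that mentions it, leaves the
likelihood of the observations unchanged. A toy sketch with one
hidden and one observed binary variable (names and numbers are
purely illustrative):

    import numpy as np

    # Toy network: hidden H -> observed O, both binary.
    p_h = np.array([0.7, 0.3])               # P(H)
    p_o_given_h = np.array([[0.9, 0.1],      # P(O | H=0)
                            [0.2, 0.8]])     # P(O | H=1)

    def likelihood(p_h, p_o_given_h, obs):
        # P(O = obs) with the hidden variable marginalized out.
        return float(p_h @ p_o_given_h[:, obs])

    # "Flip the semantics" of H: swap its two states in every CPT
    # that mentions it (here, P(H) and the rows of P(O | H)).
    p_h_flipped = p_h[::-1]
    p_o_given_h_flipped = p_o_given_h[::-1, :]

    for obs in (0, 1):
        assert np.isclose(
            likelihood(p_h, p_o_given_h, obs),
            likelihood(p_h_flipped, p_o_given_h_flipped, obs))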

"One could also consider "annotating" the network with QPN-style
influences, which would also constrain the solution found."

----------------------------------------------------------------
C. REFERENCES GIVEN IN THE RESPONSES:

Matthew Brand. Inferring hidden state from video. Mitsubishi
Electric Research Labs technical report 99-XX (in review).

Friedman, N. (1997). Learning Bayesian Networks in the Presence
of Missing Values and Hidden Variables, ML97

Friedman, N. (1999). The Bayesian Structural EM Algorithm, UAI99

Geiger, D., Heckerman D. and Meek, C. (1996). Asymptotic Model
Selection for Directed Graphs with Hidden Variables, UAI96

Lauritzen, S. (1996). The EM Algorithm for Graphical Association
Models with Missing Data. Computational Statistics and Data
Analysis, 19, pp. 191-201

Meila, M. and Jordan, M. (1998). Estimating Dependency Structure as a
Hidden Variable, NIPS 10

Nikovski, D. (1998). Learning stationary temporal probabilistic
networks. Conference on Automated Learning and Discovery,
Pittsburgh, June 11-13, http://www.cs.cmu.edu/~danieln/conald.ps

Thiesson, B. (1995). Accelerated quantification of Bayesian
networks with incomplete data. In Proceedings of the First
International Conference on Knowledge Discovery and Data Mining
(ed. Fayyad, U.M. and Uthurusamy, R.), pp. 306-311. AAAI Press

Thiesson, B. (1997). Score and information for recursive
exponential models with incomplete data. In Proceedings of the
Thirteenth Conference on Uncertainty in Artificial Intelligence
(ed. Geiger, D. and Shenoy, P.P.), pp. 453-463. Morgan Kaufmann
Publishers

-- 
==================================================================
Frank Wittig                Email: fwittig@cs.uni-sb.de
University of Saarbruecken  WWW:   http://w5.cs.uni-sb.de/~fwittig
P.O. Box 15 11 50           Phone: +49 681 302 4135   
D-66041 Saarbruecken        Fax:   +49 681 302 4136   
==================================================================