Hello,
I just joined this list and read last July's discussion on the definition of
Bayesian network. This has been a sore point for me since the late 80's, so
here's a late vote against any choice for Bayesian network that just leads to
a DAG representation of a joint probability distribution. It seems to me
that a DAG-style definition for Bayesian network is OK (but not great) for
initial modeling of a joint probability distribution, but is inadequate for
analysis or general solution algorithms. So, I suggest dropping the various
DAG-related definitions and instead define the Bayesian network as a
construct easily capable of carrying through from modeling to solution.
Take the hint from relational database theory. There, the relational
database structure (a set of tables whose natural join forms a larger table)
does everything. It provides initial models and is closed under the primary
solution operations of select, project, and join. A generalization for
probability is a set of compatible partially observed conditional
distributions that multiply to give a larger partially observed conditional
distribution. Such sets can model joint probability distributions and are
closed under the solution operations of observation (select), marginalization
(project), multiplication of all densities along directed hyperpaths from
some conditioned to some conditioning variables (join), and normalization
(unique to probability relations). (If ratios are considered, any
multiplication or division is fair game.) Because such sets address the
whole field so well, they should be the primary objects of consideration and,
I think, deserve to be called Bayesian networks. DAG-style Bayesian networks
do have a place in drawing concise pictures, but that should be about it.
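To make the database hint concrete, here is a rough Python sketch (my own
toy code, nothing standard): a "table" maps assignments (sorted tuples of
(variable, value) pairs) to density values, and each of the four solution
operations above is a few lines.

    def select(table, var, value):
        # Observation: keep only the rows where var == value.
        return {a: p for a, p in table.items() if dict(a).get(var) == value}

    def project(table, keep):
        # Marginalization: sum out every variable not in `keep`.
        out = {}
        for a, p in table.items():
            key = tuple((v, x) for v, x in a if v in keep)
            out[key] = out.get(key, 0.0) + p
        return out

    def join(t1, t2):
        # Multiplication: natural join on shared variables, multiplying
        # the two density values wherever the rows agree.
        out = {}
        for a1, p1 in t1.items():
            for a2, p2 in t2.items():
                d1, d2 = dict(a1), dict(a2)
                if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                    out[tuple(sorted({**d1, **d2}.items()))] = p1 * p2
        return out

    def normalize(table):
        z = sum(table.values())
        return {a: p / z for a, p in table.items()}

    # p(A) and p(B|A) join into p(A, B); Bayes' rule is then just
    # select, project, normalize:
    pA = {(("A", 0),): 0.4, (("A", 1),): 0.6}
    pB_A = {(("A", 0), ("B", 0)): 0.9, (("A", 0), ("B", 1)): 0.1,
            (("A", 1), ("B", 0)): 0.2, (("A", 1), ("B", 1)): 0.8}
    pAB = join(pA, pB_A)                                        # p(A, B)
    post = normalize(project(select(pAB, "B", 1), {"A"}))       # p(A | B=1)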
The problem is that a regular DAG-style Bayesian network
1) has exactly one conditioned variable per conditional, and
2) represents an overall joint distribution that is not conditioned and has
no observations.
1) is annoying, but doesn't limit the network's modeling power. 2) is indeed
limiting. It makes Bayesian networks unable to represent the processing
state of algorithms (even Pearl's "messages" are tack-ons to the Bayesian
network representation). So algorithms extend or dump Bayesian networks and
move on to using messages or to using potentials (generalized distributions,
etc.) collected in acyclic hypergraphs and other structures.
The interesting probabilistic quantity in most any analysis/algorithm is the
partially observed conditional distribution (and sometimes a ratio of them).
Not knowing an accepted name for this thing, I'll call it a POCD. It's not
exactly a potential, as that drops both the conditioned/conditioning
distinction and the observed variables. The density form of a POCD might
look like:
p(W, X=x | Y, Z=z)
or more generally something like:
p({W : W in WW}, {X=x_X : X in XX, with x_X in Range(X) given}
| {Y : Y in YY}, {Z=z_Z : Z in ZZ, with z_Z in Range(Z) given})
with WW, XX, YY, and ZZ disjoint sets of random variables.
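If it helps, here is one way (again my own sketch; the field names are made
up) to record a POCD as a data structure: a potential's table plus exactly
the information a bare potential elides.

    from dataclasses import dataclass

    @dataclass
    class POCD:
        conditioned: frozenset   # WW: free variables left of the bar
        observed: dict           # XX: observations {X: x_X} left of the bar
        conditioning: frozenset  # YY: free variables right of the bar
        given: dict              # ZZ: observations {Z: z_Z} right of the bar
        table: dict              # density values over the free variables

    # p(W, X=x | Y, Z=z) becomes:
    # POCD(conditioned=frozenset({"W"}), observed={"X": x},
    #      conditioning=frozenset({"Y"}), given={"Z": z}, table=...)

The table field by itself is a potential; the other four fields are exactly
the elided information mentioned next.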
Computationally, it is fine to use potentials instead of POCDs, as long as
something (the whole network, set, database, etc. or internal state or
intrinsic structure of an algorithm) maintains whatever elided information is
necessary to produce correct results.
A rigorous definition (not attempted here) of POCDs and compatible sets of
POCDs would deal with probability spaces, distributions and the existence of
decompositions or perhaps a measure over which all densities are absolutely
continuous. It might also generalize to set membership constraints rather
than point observations.
Defining a Bayesian network as a set of compatible POCDs that multiply to
form an overall POCD (or an equivalent definition) allows:
- The Bayesian network structure to be closed under the set of operations
used in solutions.
- Potentials (most useful ones at least) to be recognized as
elided/abstracted versions of POCDs and the Bayesian network to be
composed of such POCDs/potentials.
- Acyclic hypergraphs (and other solution structures) to be recognized as
particular Bayesian networks (and the polytrees of the DAG-style Bayesian
network as ready-made acyclic hypergraphs).
- A DAG-style Bayesian network decorated with Pearl messages to be
recognized as a collection of small Bayesian networks. Similarly for an
acyclic hypergraph decorated with its messages.
- The elimination of the awkward identification, sometimes made, of a
  random variable with a "node" that represents both the random variable
  and one distinguished conditional distribution involving it.
Random variables and POCDs would be distinct objects (e.g., the nodes and
edges of a directed hypergraph or the two types of nodes in a directed
bipartite graph).
- Multiple conditioned random variables to participate in one POCD and
observations to be associated with a POCD. All just like standard
mathematical notation. The former benefit means not having to split
quantities like P(A, B, C) into some arbitrarily chosen permutation of
  P(A)P(B|A)P(C|A,B). Just use P(A, B, C) and be done with it. The latter
  means that things like likelihoods, posterior probabilities, and
  combinations can be directly represented as parts or the whole of a
Bayesian network.
- The elimination of the need to build up from DAGs through moral graphs,
cliques, etc. to arrive at hypergraph concepts. Just read off the
relevant hypergraphs (even directed ones) from the set of POCDs. Directed
hypergraphs and hypergraphs (not DAGs) are the clean abstractions for
probability (and relational, constraint, possibility, etc.) computations.
Going from DAGs to hypergraphs is simply a distraction (though possibly an
enjoyable mathematical diversion). Just start with (directed)
hypergraphs.
DAG-style Bayesian networks are a nice shorthand to use in some drawings and
GUIs. But, treating DAG-style Bayesian networks as more than derivative
shorthand diverts attention from the core concepts. A more verbose
hypergraph representation (say, a bipartite directed graph with nodes being
either random variables or POCDs) maps mathematical modeling and solution
syntax 1-1 onto graph components. It deserves top billing.
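Reading that bipartite directed graph off a set of POCDs really is a
mechanical read-off. A sketch (my own toy code again):

    def bipartite_graph(pocds):
        # pocds: list of (conditioned_vars, conditioning_vars) pairs.
        # Edges run conditioning var -> POCD node -> conditioned var.
        edges = []
        for i, (conditioned, conditioning) in enumerate(pocds):
            f = ("pocd", i)
            for y in conditioning:
                edges.append((("var", y), f))
            for w in conditioned:
                edges.append((f, ("var", w)))
        return edges

    # The chain p(A) p(B|A) p(C|A,B) reads off as:
    g = bipartite_graph([({"A"}, set()),
                         ({"B"}, {"A"}),
                         ({"C"}, {"A", "B"})])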
Bayesian network is such a cool name that you would hope to really know
something when you learn what it is, but instead you only know the
conditional probability chain rule. To do useful work, you have to learn
many things outside the Bayesian network. With the proposed definition,
once you learn the Bayesian network and how to operate on it (not something
else), you know quite a bit.
Suggestions/comments/criticism are welcome. I follow this field hardly at
all, so I could be missing or belaboring things that are really obvious,
trivial, or out in left field. I am certainly missing relevant references
and previous discussions of this topic.
Doug