[UAI] Definition of Bayesian Network (again)

From: Doug Morgan (dougrm@sprynet.com)
Date: Fri Nov 09 2001 - 09:52:24 PST

    Hello,

    I just joined this list and read last July's discussion on the definition of
    Bayesian network. This has been a sore point for me since the late 80's, so
    here's a late vote against any definition of Bayesian network that just leads to
    a DAG representation of a joint probability distribution. It seems to me
    that a DAG-style definition for Bayesian network is OK (but not great) for
    initial modeling of a joint probability distribution, but is inadequate for
    analysis or general solution algorithms. So, I suggest dropping the various
    DAG-related definitions and instead define the Bayesian network as a
    construct easily capable of carrying through from modeling to solution.

    Take the hint from relational database theory. There, the relational
    database structure (a set of tables whose natural join yields a larger table)
    does everything. It provides initial models and is closed under the primary
    solution operations of select, project, and join. A generalization for
    probability is a set of compatible partially observed conditional
    distributions that multiply to give a larger partially observed conditional
    distribution. Such sets can model joint probability distributions and are
    closed under the solution operations of observation (select), marginalization
    (project), multiplication of all densities along directed hyperpaths from
    some conditioned to some conditioning variables (join), and normalization
    (unique to probability relations). (If ratios are considered, any
    multiplication or division is fair game.) Because such sets address the
    whole field so well, they should be the primary objects of consideration and,
    I think, deserve to be called Bayesian networks. DAG-style Bayesian networks
    do have a place in drawing concise pictures, but that should be about it.
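
    To make the closure concrete, here is a minimal Python sketch (a
    throwaway representation of my own, not any standard library): a
    distribution is a table mapping assignments over a tuple of variable
    names to nonnegative numbers, and the four operations mirror select,
    project, join, and normalize.

      def observe(table, names, evidence):
          # select: keep only the rows consistent with evidence, a {var: value} dict
          return {row: p for row, p in table.items()
                  if all(row[names.index(v)] == x for v, x in evidence.items())}

      def marginalize(table, names, keep):
          # project: sum out every variable not listed in keep
          idx = [names.index(v) for v in keep]
          out = {}
          for row, p in table.items():
              key = tuple(row[i] for i in idx)
              out[key] = out.get(key, 0.0) + p
          return out

      def multiply(t1, v1, t2, v2):
          # join: pointwise product over rows that agree on shared variables
          out = {}
          for r1, p1 in t1.items():
              for r2, p2 in t2.items():
                  if all(r1[v1.index(v)] == r2[v2.index(v)]
                         for v in v1 if v in v2):
                      extra = tuple(r2[v2.index(v)] for v in v2 if v not in v1)
                      out[r1 + extra] = p1 * p2  # output variables: v1 then the rest of v2
          return out

      def normalize(table):
          # unique to probability relations: rescale so the values sum to one
          z = sum(table.values())
          return {row: p / z for row, p in table.items()}

      # e.g., join p(A) and p(B|A), observe B=1, project onto A, normalize:
      pA  = {(0,): 0.6, (1,): 0.4}
      pBA = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
      joint     = multiply(pA, ("A",), pBA, ("A", "B"))
      posterior = normalize(marginalize(observe(joint, ("A", "B"), {"B": 1}),
                                        ("A", "B"), ("A",)))

    The point is only the closure: every intermediate result above is again a
    table of the same kind, so the set never has to leave its own
    representation.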

    The problem is that a regular DAG-style Bayesian network

     1) has exactly one conditioned variable per conditional, and
     2) represents an overall joint distribution that is not conditioned and has
        no observations.

    1) is annoying, but doesn't limit the network's modeling power. 2) is indeed
    limiting. It makes Bayesian networks unable to represent the processing
    state of algorithms (even Pearl's "messages" are tack-ons to the Bayesian
    network representation). So algorithms extend or dump Bayesian networks and
    move on to using messages or to using potentials (generalized distributions,
    etc.) collected in acyclic hypergraphs and other structures.

    The interesting probabilistic quantity in most any analysis/algorithm is the
    partially observed conditional distribution (and sometimes a ratio of them).
    Not knowing an accepted name for this thing, I'll call it a POCD. It's not
    exactly a potential, as that drops both the conditioned/conditioning
    distinction and the observed variables. The density form of a POCD might
    look like:

    p(W, X=x | Y, Z=z)

    or more generally something like:

    p({W : W in WW}, {X=x_X : X in XX and x_X in Range(X) is given}
      | {Y : Y in YY}, {Z=z_Z : Z in ZZ and z_Z in Range(Z) is given})

    with WW, XX, YY, and ZZ disjoint sets of random variables.
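
    Spelled out as a data structure (a hypothetical sketch; all the names are
    mine), a POCD is just a numeric table plus the bookkeeping that a bare
    potential drops, namely the roles of the four disjoint variable sets:

      from dataclasses import dataclass

      @dataclass
      class POCD:
          conditioned:  frozenset  # WW: free conditioned variables
          observed:     dict       # XX: observed conditioned vars, X -> x_X
          conditioning: frozenset  # YY: free conditioning variables
          given:        dict       # ZZ: observed conditioning vars, Z -> z_Z
          table:        dict       # density values over the free variables

          def __post_init__(self):
              # WW, XX, YY, and ZZ must be disjoint, as in the notation above
              groups = [self.conditioned, frozenset(self.observed),
                        self.conditioning, frozenset(self.given)]
              assert sum(map(len, groups)) == len(frozenset().union(*groups))

    Forgetting everything but table turns a POCD into the usual potential,
    which is exactly the elision the next paragraph allows.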

    Computationally, it is fine to use potentials instead of POCDs, as long as
    something (the whole network, set, database, etc. or internal state or
    intrinsic structure of an algorithm) maintains whatever elided information is
    necessary to produce correct results.

    A rigorous definition (not attempted here) of POCDs and compatible sets of
    POCDs would deal with probability spaces, distributions and the existence of
    decompositions, or perhaps a dominating measure with respect to which all
    the distributions are absolutely continuous (so that densities exist). It
    might also generalize to set membership constraints rather
    than point observations.

    Defining a Bayesian network as a set of compatible POCDs that multiply to
    form an overall POCD (or an equivalent definition) allows:

     - The Bayesian network structure to be closed under the set of operations
       used in solutions.

     - Potentials (most useful ones at least) to be recognized as
       elided/abstracted versions of POCDs and the Bayesian network to be
       composed of such POCDs/potentials.

     - Acyclic hypergraphs (and other solution structures) to be recognized as
       particular Bayesian networks (and the polytrees of the DAG-style Bayesian
       network as ready-made acyclic hypergraphs).

     - A DAG-style Bayesian network decorated with Pearl messages to be
       recognized as a collection of small Bayesian networks. Similarly for an
       acyclic hypergraph decorated with its messages.

     - The elimination of the awkward identification, sometimes made, of a
       random variable with a "node" that represents both the random variable
       and one distinguished conditional distribution involving it.
       Random variables and POCDs would be distinct objects (e.g., the nodes and
       edges of a directed hypergraph or the two types of nodes in a directed
       bipartite graph).

     - Multiple conditioned random variables to participate in one POCD and
       observations to be associated with a POCD. All just like standard
       mathematical notation. The former benefit means not having to split
       quantities like P(A, B, C) into some arbitrarily chosen permutation of
       P(A)P(B|A)P(C|A,B). Just use P(A, B, C) and be done with it. The latter
       means that things like likelihoods, posterior probabilities, and their
       combinations can be directly represented as parts of, or the whole of, a
       Bayesian network.

     - The elimination of the need to build up from DAGs through moral graphs,
       cliques, etc. to arrive at hypergraph concepts. Just read off the
       relevant hypergraphs (even directed ones) from the set of POCDs, as in
       the sketch after this list. Directed
       hypergraphs and hypergraphs (not DAGs) are the clean abstractions for
       probability (and relational, constraint, possibility, etc.) computations.
       Going from DAGs to hypergraphs is simply a distraction (though possibly an
       enjoyable mathematical diversion). Just start with (directed)
       hypergraphs.
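
    As a sketch of that read-off (hypothetical code; each POCD is abbreviated
    to a name plus its conditioned and conditioning variable sets), the
    bipartite directed graph falls straight out of the POCD set, with no
    moralization or triangulation in between:

      def bipartite_graph(pocds):
          # two node types, variables and POCDs; edges run variable -> POCD
          # for conditioning variables and POCD -> variable for conditioned ones
          edges = []
          for name, conditioned, conditioning in pocds:
              edges += [(v, name) for v in sorted(conditioning)]
              edges += [(name, v) for v in sorted(conditioned)]
          return edges

      # the DAG-style factorization P(A)P(B|A)P(C|A,B), read as three POCDs:
      print(bipartite_graph([("p(A)",     {"A"}, set()),
                             ("p(B|A)",   {"B"}, {"A"}),
                             ("p(C|A,B)", {"C"}, {"A", "B"})]))

    Each POCD node together with its incident variables is one directed
    hyperedge, so the same read-off gives the directed hypergraph directly.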

    DAG-style Bayesian networks are a nice shorthand to use in some drawings and
    GUIs. But, treating DAG-style Bayesian networks as more than derivative
    shorthand diverts attention from the core concepts. A more verbose
    hypergraph representation (say, a bipartite directed graph with nodes being
    either random variables or POCDs) maps mathematical modeling and solution
    syntax 1-1 onto graph components. It deserves top billing.

    Bayesian network is such a cool name that you would hope to really know
    something when you learn what it is, but instead you only know the
    conditional probability chain rule. To do useful work, you have to learn
    many things outside the Bayesian network. With the proposed definition,
    once you learn the Bayesian network and how to operate on it (not something
    else), you know quite a bit.

    Suggestions/comments/criticism are welcome. I follow this field hardly at
    all, so I could be missing or belaboring things that are really obvious,
    trivial, or out in left field. I am certainly missing relevant references
    and previous discussions of this topic.

    Doug


