Re: "information gained" sometimes != "entropy reduction" ??

Stephen M. Omohundro (om3@worldnet.att.net)
Wed, 12 Aug 1998 05:58:09 +0000

> Esteemed colleagues,
>
> I have a brief question concerning terminology, this time
> about "information."
>
> As a pleasant learning exercise, I am reinventing the wheel of
> Bayesian network inference. As one of the subsidiary outputs, I am
> planning to compute the difference in entropy between the posterior
> for some variable before a certain evidence item is introduced and
> the entropy of the posterior of the same variable after the evidence.
>
> Now what we'll see, I imagine, is that evidence usually
> reduces the entropy of the posterior, and I believe it is consistent
> with conventional terminology to say "reduction of entropy == gain
> of information" -- so many bits per item of evidence.
>
> But I know there is no guarantee that the posterior will have less
> entropy after the evidence is introduced. (I often have that feeling
> of "now I am more confused than before!")
>
> In this scenario, where is the "information gain"? In absorbing the
> evidence, something is gained -- but what? What is the quantity
> (if there is one) that's always increased by absorbing evidence?
>
> I can, of course, leave the word "information" out of the picture and
> refer simply to "change of entropy". But "information" is so suggestive
> and attractive -- I would rather use it if I can.
>
> Your comments are greatly appreciated.
>
> Regards,
> Robert Dodier

There's nothing special about Bayes' nets here; the same issue arises
with simple probabilistic inference. Consider two binary variables X
and Y with a joint distribution:

p(Y=0,X=0)=0     p(Y=1,X=0)=.5
p(Y=0,X=1)=.25   p(Y=1,X=1)=.25
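
(If you want to check the numbers below yourself, here is a minimal
Python sketch; the dictionary `joint` and the helper `entropy` are
just names I picked for illustration, nothing standard:)

  import math

  # Joint distribution over (Y, X) from the table above.
  joint = {
      (0, 0): 0.0,  (1, 0): 0.5,
      (0, 1): 0.25, (1, 1): 0.25,
  }

  def entropy(probs):
      # Shannon entropy in bits; terms with p=0 contribute nothing.
      return -sum(p * math.log2(p) for p in probs if p > 0)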

Consider X to be the "evidence" variable. Before we know the value of
X, the marginal distribution over Y is obtained by summing over the
possible X values:

p(Y=0)=.25 p(Y=1)=.75

This has an entropy of -.25*lg(.25)-.75*lg(.75), which is about .81 bits.
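
(In the sketch, the marginal and its entropy come out the same way:)

  # Marginalize out X: p(Y=y) = sum over x of p(Y=y, X=x).
  p_y = {y: sum(p for (yy, x), p in joint.items() if yy == y) for y in (0, 1)}
  print(p_y)                    # {0: 0.25, 1: 0.75}
  print(entropy(p_y.values()))  # about 0.811 bits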

If we now learn that X=1, we find that p(Y|X=1)=p(Y,X=1)/p(X=1)

p(Y=0|X=1)=.5 p(Y=1|X=1)=.5

This has an entropy of -.5*lg(.5)-.5*lg(.5)=1 bit, so the entropy of Y
has *increased*, contrary to intuition. (On the other hand, if we had
learned that X=0, it would have decreased to 0 bits.)
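
(The sketch shows both outcomes, conditioning on each value of X in
turn; `conditional_on_x` is again just an illustrative helper:)

  def conditional_on_x(x):
      # p(Y | X=x): take the slice of the joint at X=x and renormalize.
      p_x = sum(p for (y, xx), p in joint.items() if xx == x)
      return {y: joint[(y, x)] / p_x for y in (0, 1)}

  print(entropy(conditional_on_x(1).values()))  # 1.0 bit  -- entropy went up
  print(entropy(conditional_on_x(0).values()))  # 0.0 bits -- entropy went down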

There are two processes involved here: marginalization and
conditioning. Marginalization "forgets" the values of some variables
and so loses information. Conditioning gains information about the
variable being conditioned on, but this only becomes apparent when you
look at the full joint space. To see this, compute the change in
entropy of the joint distribution on the full state space. The initial
entropy of the joint distribution p(Y,X) is:

-.5*lg(.5)-.25*lg(.25)-.25*lg(.25) = .5*1+.25*2+.25*2 = 1.5 bits
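
(The sketch agrees, applying the same entropy helper to the full joint
table:)

  print(entropy(joint.values()))  # 1.5 bits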

The distribution on the full space after learning that X=1 looks like:

p(Y=0,X=0)=0    p(Y=1,X=0)=0
p(Y=0,X=1)=.5   p(Y=1,X=1)=.5

This has an entropy of -.5*lg(.5)-.5*lg(.5)=1 bit, so the entropy of
the distribution on the full space has indeed decreased.
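
(To check that in the sketch, zero out the X=0 cells and renormalize
over the whole (Y,X) space:)

  p_x1 = sum(p for (y, x), p in joint.items() if x == 1)
  posterior_joint = {(y, x): (p / p_x1 if x == 1 else 0.0)
                     for (y, x), p in joint.items()}
  print(entropy(posterior_joint.values()))  # 1.0 bit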

So you can see that the culprit is the marginalization step used to
compute the initial marginal distribution over Y: it loses more
information than is gained when we learn that X=1.

Stephen Omohundro