There's nothing special about Bayes' nets here; the same issue arises
with simple probabilistic inference. Consider two binary variables X
and Y with a joint distribution:
p(Y=0,X=0)=0 p(Y=1,X=0)=.5
p(Y=0,X=1)=.25 p(Y=1,X=1)=.25
Consider X to be the "evidence" variable. Before we know the value of
X, the marginal distribution over Y is obtained by summing over the
possible X values:
p(Y=0)=.25 p(Y=1)=.75
This has an entropy of -.25*lg(.25)-.75*lg(.75)=.81 bits
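As a quick numeric check, here is a short Python sketch (variable names are illustrative, not from the original post) that marginalizes out X and computes the entropy of Y:

```python
from math import log2

# Joint distribution from the example: p[(y, x)]
p = {(0, 0): 0.0,  (1, 0): 0.5,
     (0, 1): 0.25, (1, 1): 0.25}

def entropy(dist):
    """Shannon entropy in bits; 0*lg(0) terms are taken as 0."""
    return sum(-q * log2(q) for q in dist if q > 0)

# Marginalize out X to get p(Y)
p_y = [sum(p[(y, x)] for x in (0, 1)) for y in (0, 1)]
print(p_y)                      # [0.25, 0.75]
print(round(entropy(p_y), 2))   # 0.81
```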
If we now learn that X=1, we find that p(Y|X=1)=p(Y,X=1)/p(X=1)
p(Y=0|X=1)=.5 p(Y=1|X=1)=.5
This has an entropy of -.5*lg(.5)-.5*lg(.5)=1 bit, so the entropy of Y
has *increased*, contrary to intuition. (On the other hand, if we had
learned that X=0, it would have decreased to 0 bits.)
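The two conditional entropies can be verified with a small sketch (the helper `conditional` is mine, introduced just for illustration):

```python
from math import log2

# Joint distribution from the example: p[(y, x)]
p = {(0, 0): 0.0,  (1, 0): 0.5,
     (0, 1): 0.25, (1, 1): 0.25}

def entropy(dist):
    """Shannon entropy in bits; 0*lg(0) terms are taken as 0."""
    return sum(-q * log2(q) for q in dist if q > 0)

def conditional(p, x):
    """p(Y | X=x) by restricting to the column X=x and renormalizing."""
    px = sum(p[(y, x)] for y in (0, 1))
    return [p[(y, x)] / px for y in (0, 1)]

print(entropy(conditional(p, 1)))  # 1.0 -- entropy of Y went up
print(entropy(conditional(p, 0)))  # 0.0 -- entropy of Y went down
```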
There are two processes involved here: marginalization and
conditioning. Marginalization "forgets" the values of some variables
and so loses information. Conditioning gains information about the
variable being conditioned on when viewed on the full joint space. To
see this, compute the change in entropy of the joint distribution on
the full state space. The initial entropy of the joint distribution
p(Y,X) is:
-.5*lg(.5)-.25*lg(.25)-.25*lg(.25)=.5*1+.5*2=1.5 bits
The distribution on the full space after learning that X=1 looks like:
p(Y=0,X=0)=0 p(Y=1,X=0)=0
p(Y=0,X=1)=.5 p(Y=1,X=1)=.5
This has an entropy of -.5*lg(.5)-.5*lg(.5)=1 bit so the entropy of
the distribution on the full space has indeed decreased.
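The before/after entropies on the full space can be checked the same way (again a sketch; conditioning is implemented here by zeroing out the X=0 cells and renormalizing):

```python
from math import log2

# Joint distribution from the example: p[(y, x)]
p = {(0, 0): 0.0,  (1, 0): 0.5,
     (0, 1): 0.25, (1, 1): 0.25}

def entropy(dist):
    """Shannon entropy in bits; 0*lg(0) terms are taken as 0."""
    return sum(-q * log2(q) for q in dist if q > 0)

h_before = entropy(p.values())                # 1.5 bits

# Condition on X=1 over the full space: zero the X=0 cells, renormalize
px1 = sum(v for (y, x), v in p.items() if x == 1)
p_after = {k: (v / px1 if k[1] == 1 else 0.0) for k, v in p.items()}
h_after = entropy(p_after.values())           # 1.0 bit

print(h_before, h_after)  # 1.5 1.0
```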
So the culprit is the marginalization step used to compute the initial
marginal distribution over Y: it loses more information than is gained
when we learn that X=1.
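This accounting can be made exact in a few lines: the apparent rise in Y's entropy equals the gap between what marginalization discards and what conditioning recovers (a sketch; the bookkeeping variable names are mine):

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits; 0*lg(0) terms are taken as 0."""
    return sum(-q * log2(q) for q in dist if q > 0)

h_joint = entropy([0.5, 0.25, 0.25])   # H(Y,X)   = 1.5 bits
h_marg  = entropy([0.25, 0.75])        # H(Y)     ~ 0.811 bits
h_cond  = entropy([0.5, 0.5])          # H(Y|X=1) = 1.0 bit

lost_by_marginalization = h_joint - h_marg   # ~ 0.689 bits
gained_by_conditioning  = h_joint - h_cond   # = 0.5 bits

# The "increase" in Y's entropy is exactly the difference:
print(round(h_cond - h_marg, 3))                                    # 0.189
print(round(lost_by_marginalization - gained_by_conditioning, 3))   # 0.189
```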
Stephen Omohundro