Re: "information gained" sometimes != "entropy reduction" ??

Tom Kane (tom@icbl.hw.ac.uk)
Wed, 12 Aug 1998 14:05:13 +0100

Hi Robert,

Sorry this turned out to be such a long ramble... but here are some
thoughts and comments...

When I was examining Entropy for my PhD thesis, I related
it to information by saying that I would use the traditional
entropy function as a

"measure of the information content within a probability
distribution".

I would suggest that you not be surprised when information
you gain about a probability distribution causes the entropy
function to increase. I guess a lawyer would say
that if you end up more confused after consulting a witness
who gives you useful information, it was because your initial
assumptions were flawed.

For example, if I am to bet on the number of dots on the topmost
face of a dice thrown by my brother Patrick, and I believe Patrick
recently bought a biased dice from Harpers Bazzar that will show
the 6 face 5 times out of 6, then I will believe the 6 to appear
with a probability of 5/6. Being a good Bayesian-Entropist, since
I don't have any other information about the other faces, I will
assign the remaining 1/6th probability equally amongst the other
5 possibilities. So, I end up with

P(6) = 5/6
P(1..5) = 1/30
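
In case it helps to see that bit of arithmetic written down, here is
a tiny Python sketch of how I spread the probability (the names are
just my own, nothing standard):

  from fractions import Fraction

  # Prior belief: the 6 face gets 5/6, and the leftover 1/6 is spread
  # evenly over the other five faces.
  prior = {6: Fraction(5, 6)}
  leftover = 1 - prior[6]           # 1/6 left over
  for face in range(1, 6):
      prior[face] = leftover / 5    # 1/30 each

  print(prior)                      # 6 -> 5/6, every other face -> 1/30
  print(sum(prior.values()))        # 1, so it is a proper distribution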

And I am pretty certain that Patrick is going to throw a 6. If
a friend were to be presented with that probability distribution
he would probably say something along the lines of "cripes...
you seem pretty sure of yourself..." He may say to me,
"how did you get to be so certain?", and the minute I tell him
that my probability is conditioned on an assumption, he
has to accept the validity of my conditional assumption, or reject it.
And I become more explicit about my probability for the 6-face:

P(6 in the light of this dice being a Harpers Bazzar 6-special)
= 5/6

Say my friend chooses to help me investigate this possibility.
We break into Patrick's home, steal his dice and test it.
After throwing the dice 30,000 times, we see the 6 face 5,000
times. So, we return the dice and leave, with a little more
information. We'll take our lead from Bernoulli's limit
theorem, and decide that for us 30,000 was pretty close to
infinity, and assume the probability of the 6 face is
5,000/30,000 = 1/6.

P(6 in the light of Patrick using the dice we tested) = 1/6
P(1..5) = 5/6 divided by 5 = 1/6
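
Purely as a sketch again (my own throwaway names, plain Python), the
frequency estimate and the re-spread of the remainder look like this:

  from fractions import Fraction

  throws, sixes = 30000, 5000
  p_six = Fraction(sixes, throws)       # 5000/30000 = 1/6
  tested = {6: p_six}
  for face in range(1, 6):
      tested[face] = (1 - p_six) / 5    # (5/6) / 5 = 1/6 each

  print(tested)                         # every face comes out at 1/6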

My friend knows I have egg on my face, I know I have egg on
my face, and the entropy function also knows I have egg on
my face.

The entropy (log base 2) of the first distribution is 1.04, and
of the second is 2.6, quite a massive rise in entropy. But
what happened here? Well, I was forced to revisit my
assumption to test how valid it was. (As an aside
I would say that this is a valuable, time-honoured process amongst
Bayesians going all the way back to Laplace).
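
If anyone fancies checking those two figures, a quick Python sketch
(just the usual -sum(p*log2(p)) formula, nothing from my thesis code)
gives the same numbers:

  from math import log2

  def entropy(dist):
      # Shannon entropy in bits, skipping any zero probabilities.
      return -sum(p * log2(p) for p in dist if p > 0)

  biased  = [5/6] + [1/30] * 5    # my Harpers Bazzar assumption
  uniform = [1/6] * 6             # what the 30,000 throws suggested

  print(round(entropy(biased), 2))    # 1.04
  print(round(entropy(uniform), 2))   # 2.58, i.e. log2(6), the 2.6 above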

I am sorry this note meandered so, Robert, and I suppose it's
because it's a nice sunny day outside and I can't concentrate
on anything else... all I wanted to do was to chat about an
example where entropy rose dramatically, and it was my fault.

Another thing related to your
question about information content comes to mind.

In my thesis I introduced a notion of specificity,
which I attached to an entropy measure for a probability
distribution. It is a value between 0 and 1 and is defined as

Specificity(PD) = Entropy(PD) - MinimumPossibleEntropy
                  -----------------------------------------------
                  MaximumPossibleEntropy - MinimumPossibleEntropy

In most cases the MinimumPossibleEntropy is 0; this would
correspond to certainty that the dice will always show
the same face.

The MaximumPossibleEntropy is when 1/6th probability is assigned to
all possibilities equally, and comes to log2(6), roughly 2.6.

The specificity of the first distribution is:
1.04/2.6 = 0.4
and of the second distribution is:
2.6/2.6 = 1.
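
And the same sort of throwaway sketch for the specificity formula
above (again my own names, and taking the MinimumPossibleEntropy to
be 0, as in the usual case):

  from math import log2

  def entropy(dist):
      return -sum(p * log2(p) for p in dist if p > 0)

  def specificity(dist, min_entropy=0.0):
      # (Entropy - MinimumPossibleEntropy) / (MaximumPossibleEntropy - MinimumPossibleEntropy),
      # taking the maximum to be the entropy of the uniform distribution, log2(n).
      max_entropy = log2(len(dist))
      return (entropy(dist) - min_entropy) / (max_entropy - min_entropy)

  biased  = [5/6] + [1/30] * 5
  uniform = [1/6] * 6

  print(round(specificity(biased), 1))    # 0.4
  print(round(specificity(uniform), 1))   # 1.0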

The closer the specificity is to 0, the more informative the distribution is.
That is, the more it highlights one possibility over the mass of many
possibilities.

The interesting thing about specificity is that when there is no information
about which face will show and I disperse the
probability equally amongst all the possibilities, the specificity
is 1. This specificity rule picks up all instances where
probability is distributed equally among all possibilities
(e.g. "Is there life on Mars?", 2 choices, 0.5 each;
"Which finger is going to be the next one jammed in the door?",
10 choices, 0.1 each; etc.) and gives
them a specificity of 1, which essentially means I have no
information content at all about this probability distribution.
So, I would recommend that alongside any probability distribution
goes its specificity measure.

End of ramble, except to say that I am all for being up front
about your evidence and assumptions, and applying the entropy
function over the complete sample space. If anyone would like
to see more info on this, I would be happy to send it on.

All the Best,
Tom Kane.