Next: Cross entropy
Up: Probability and information
Previous: Data-intensive grocery selection
We've in fact already seen the definition of entropy, but to see that requires a slight change of point of view. Instead
of the scenario with the djinn, imagine watching a sequence
of symbols go past on a ticker-tape. You have seen
the symbols $s_1, s_2, \ldots, s_{i-1}$ and you are waiting for $s_i$ to arrive. You ask yourself the
following question:

How much information will I gain when I see $s_i$?

Another way to express the same thing is:

How predictable is $s_i$ from its context?
The way to answer this is to enumerate the possible next symbols,
which we'll call $w_1, w_2, \ldots, w_k$. On the basis of the context $s_1, \ldots, s_{i-1}$
we have estimates of the probabilities $P(w_1), P(w_2), \ldots, P(w_k)$, where

$\sum_{j=1}^{k} P(w_j) = 1$

Each such outcome $w_j$ will gain us $-\log_2 P(w_j)$ bits
of information. To answer our question we need the sum over
all the outcomes, weighted by their probability:

$H = -\sum_{j=1}^{k} P(w_j) \log_2 P(w_j)$
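As a small sketch of the computation (the probabilities below are invented for illustration, not taken from the notes), the weighted sum can be coded up directly in Python:

    import math

    def entropy(probs):
        # Each outcome with probability p contributes -log2(p) bits of
        # surprise, weighted by how often it turns up.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Four possible next symbols with made-up probability estimates.
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits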
This is the formula which we used to choose questions for the decision
tree. But now the scenario is more passive. Each time we see a symbol
we are more or less surprised, depending on which symbol turns up.
Large information gain goes with extreme surprise. If you can reliably
predict the next symbol from context, you will not be surprised, and
the information gain will be low. The entropy will be highest when
you know least about the next symbol, and lowest when you know most.
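To put numbers on this (again with invented distributions), compare a sharply peaked distribution, where the next symbol is nearly certain, with a uniform one over the same four symbols:

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    peaked  = [0.97, 0.01, 0.01, 0.01]   # next symbol almost certain
    uniform = [0.25, 0.25, 0.25, 0.25]   # nothing known about the next symbol

    print(entropy(peaked))   # about 0.24 bits
    print(entropy(uniform))  # 2.0 bits, the maximum for four outcomes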
A good language model is one
which provides reliable predictions. It therefore tends to
minimize entropy. In the next section we develop the formal apparatus for using
cross entropy to evaluate language models.
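As a rough preview of that apparatus (the per-symbol probabilities below are invented, not produced by any real model), a model that reliably assigns high probability to the symbols that actually turn up on the tape costs far fewer bits per symbol on average:

    import math

    # Hypothetical probabilities two models assign to the symbols
    # actually observed on the tape.
    reliable_model   = [0.6, 0.5, 0.7, 0.4, 0.8]
    unreliable_model = [0.1, 0.2, 0.05, 0.1, 0.2]

    def average_bits(probs):
        # Average surprise, in bits per symbol, over the observed sequence.
        return sum(-math.log2(p) for p in probs) / len(probs)

    print(average_bits(reliable_model))    # about 0.78 bits per symbol
    print(average_bits(unreliable_model))  # about 3.12 bits per symbol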
Chris Brew
8/7/1998