Make sure you understand conditional probability
expressions like
P(Wn = holmes | Wn-1 = sherlock)
and the difference between this and
P(Wn-1 = sherlock | Wn = holmes)
The second expression is a clear case: it is the probability
that Wn-1 is ``sherlock'' given that Wn is ``holmes'', and,
because more than one word can precede ``holmes'', it isn't 1.
You may be confused about why anyone would care
about
P(Wn-1 = sherlock | Wn = holmes)
in which case you should remember the possibility that
you are reading the text backwards from end to beginning!
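To make the asymmetry concrete, here is a small Python sketch (the mini-corpus is invented purely for illustration) that estimates both conditional probabilities from bigram counts:

```python
from collections import Counter

# Invented mini-corpus, just for illustration.
words = ("sherlock holmes said that sherlock holmes "
         "and mycroft holmes were brothers").split()

bigrams = Counter(zip(words, words[1:]))
unigrams = Counter(words)

# P(Wn = holmes | Wn-1 = sherlock) = count(sherlock holmes) / count(sherlock)
p_forward = bigrams[("sherlock", "holmes")] / unigrams["sherlock"]

# P(Wn-1 = sherlock | Wn = holmes) = count(sherlock holmes) / count(_ holmes)
holmes_as_second = sum(c for (_, w2), c in bigrams.items() if w2 == "holmes")
p_backward = bigrams[("sherlock", "holmes")] / holmes_as_second

print(p_forward)   # 1.0: "sherlock" is always followed by "holmes" here
print(p_backward)  # 0.666...: "mycroft" can also precede "holmes"
```

In this toy corpus every ``sherlock'' is followed by ``holmes'', so the forward probability is 1, but since ``mycroft'' can also precede ``holmes'', the backward probability is not.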
You should also be familiar with the idea of joint probability
P(Wn = holmes, Wn-1 = sherlock)
which is just the probability of the two events occurring together.
And you should be aware that the joint probability can be
factored in two ways:

P(Wn = holmes, Wn-1 = sherlock)
   = P(Wn-1 = sherlock | Wn = holmes) P(Wn = holmes)
   = P(Wn = holmes | Wn-1 = sherlock) P(Wn-1 = sherlock)

The second expression is for people
reading in the ordinary way, and the first is for those
of us who read backwards (don't do this at home - especially
with crime novels).
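As a sanity check, here is a Python sketch, using made-up counts, showing that both ways of factoring the joint probability recover the same number:

```python
# Made-up counts, for illustration only.
n_total = 10000      # total word positions considered
n_sherlock = 40      # occurrences of "sherlock" (as Wn-1)
n_holmes = 60        # occurrences of "holmes" (as Wn)
n_both = 35          # occurrences of the bigram "sherlock holmes"

p_joint = n_both / n_total
p_forward = n_both / n_sherlock   # P(Wn = holmes | Wn-1 = sherlock)
p_backward = n_both / n_holmes    # P(Wn-1 = sherlock | Wn = holmes)

# Both factorizations recover the same joint probability:
assert abs(p_backward * (n_holmes / n_total) - p_joint) < 1e-12
assert abs(p_forward * (n_sherlock / n_total) - p_joint) < 1e-12
print(p_joint)  # 0.0035
```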
The usual form of Bayes' theorem is

P(Wn = holmes | Wn-1 = sherlock)
   = P(Wn-1 = sherlock | Wn = holmes) P(Wn = holmes) / P(Wn-1 = sherlock)

This lets people who were fed the text backwards
convert their knowledge into a form which is useful for
prediction when working forwards.
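In Python, with hypothetical numbers, the backwards reader's quantities plug straight into Bayes' theorem:

```python
# Hypothetical quantities a backwards reader might have estimated.
p_sherlock_given_holmes = 35 / 60   # P(Wn-1 = sherlock | Wn = holmes)
p_holmes = 60 / 10000               # P(Wn = holmes)
p_sherlock = 40 / 10000             # P(Wn-1 = sherlock)

# Bayes' theorem turns them into the forward prediction probability:
p_holmes_given_sherlock = (p_sherlock_given_holmes * p_holmes) / p_sherlock
print(p_holmes_given_sherlock)  # 0.875
```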
Of course there are variations which apply
to all kinds of situations more realistic than this one.
The general point
is that all this algebra lets you work with information which
is relatively easy to get in order to infer things which you can
count less reliably or not at all.
See the example about twins below to get more of an intuition
about this.
Chris Brew
8/7/1998