Make sure you understand conditional probability
expressions like
P(Wn = holmes | Wn-1 = sherlock)
and the difference between this and
P(Wn-1 = sherlock | Wn = holmes)
The second expression is a clear case: it is the probability
that Wn-1 is ``sherlock'' given that Wn is ``holmes'', and,
because more than one word can precede ``holmes'', it isn't 1.
You may be confused about why anyone would care
about
P(Wn-1 = sherlock | Wn = holmes)
in which case you should remember the possibility that
you are reading the text backwards from end to beginning!
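To make the asymmetry concrete, here is a small Python sketch (the mini-corpus is invented purely for illustration) that estimates both conditional probabilities from bigram counts:

```python
from collections import Counter

# Invented mini-corpus, just for illustration.
words = ("sherlock holmes said that sherlock holmes "
         "and mycroft holmes were brothers").split()

bigrams = Counter(zip(words, words[1:]))
unigrams = Counter(words)

# P(Wn = holmes | Wn-1 = sherlock) = count(sherlock holmes) / count(sherlock)
p_forward = bigrams[("sherlock", "holmes")] / unigrams["sherlock"]

# P(Wn-1 = sherlock | Wn = holmes) = count(sherlock holmes) / count(_ holmes)
holmes_as_second = sum(c for (_, w2), c in bigrams.items() if w2 == "holmes")
p_backward = bigrams[("sherlock", "holmes")] / holmes_as_second

print(p_forward)   # 1.0: "sherlock" is always followed by "holmes" here
print(p_backward)  # 0.666...: "mycroft" can also precede "holmes"
```

In this toy corpus every ``sherlock'' is followed by ``holmes'', so the forward probability is 1, but since ``mycroft'' can also precede ``holmes'', the backward probability is not.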
You should also be familiar with the idea of joint probability
P(Wn = holmes, Wn-1 = sherlock)
which is just the probability of the two events occurring together.
And you should be aware that the joint probability can be
factored in two ways:

P(Wn = holmes, Wn-1 = sherlock)
   = P(Wn-1 = sherlock | Wn = holmes) P(Wn = holmes)
   = P(Wn = holmes | Wn-1 = sherlock) P(Wn-1 = sherlock)

The second expression is for people
reading in the ordinary way, and the first is for those
of us who read backwards (don't do this at home - especially
with crime novels).
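As a sanity check, here is a Python sketch, using made-up counts, showing that both ways of factoring the joint probability recover the same number:

```python
# Made-up counts, for illustration only.
n_total = 10000      # total word positions considered
n_sherlock = 40      # occurrences of "sherlock" (as Wn-1)
n_holmes = 60        # occurrences of "holmes" (as Wn)
n_both = 35          # occurrences of the bigram "sherlock holmes"

p_joint = n_both / n_total
p_forward = n_both / n_sherlock   # P(Wn = holmes | Wn-1 = sherlock)
p_backward = n_both / n_holmes    # P(Wn-1 = sherlock | Wn = holmes)

# Both factorizations recover the same joint probability:
assert abs(p_backward * (n_holmes / n_total) - p_joint) < 1e-12
assert abs(p_forward * (n_sherlock / n_total) - p_joint) < 1e-12
print(p_joint)  # 0.0035
```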
The usual form of Bayes' theorem is

P(Wn = holmes | Wn-1 = sherlock)
   = P(Wn-1 = sherlock | Wn = holmes) P(Wn = holmes) / P(Wn-1 = sherlock)

This lets people who were fed the text backwards
convert their knowledge into a form which is useful for
prediction when working forwards.
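In Python, with hypothetical numbers, the backwards reader's quantities plug straight into Bayes' theorem:

```python
# Hypothetical quantities a backwards reader might have estimated.
p_sherlock_given_holmes = 35 / 60   # P(Wn-1 = sherlock | Wn = holmes)
p_holmes = 60 / 10000               # P(Wn = holmes)
p_sherlock = 40 / 10000             # P(Wn-1 = sherlock)

# Bayes' theorem turns them into the forward prediction probability:
p_holmes_given_sherlock = (p_sherlock_given_holmes * p_holmes) / p_sherlock
print(p_holmes_given_sherlock)  # 0.875
```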
Of course there are variations which apply
to all kinds of situations more realistic than this one.
The general point
is that all this algebra lets you work with information which
is relatively easy to get in order to infer things which you can
count less reliably or not at all.
See the example about twins below to get more of an intuition
about this.
Chris Brew
8/7/1998