The point of this section is to bring out the issues in
statistical language modelling in a very simple
context. Language identification is relatively easy, but
demanding enough to serve as an illustration.
The same principles apply to speech recognition and
part-of-speech tagging, but there is more going on in
those applications, which can be distracting.
The following few
pages are based on Dunning's paper on Statistical Language
Identification, which is strongly recommended.
Figure 8.1: Language strings to identify
It is obvious from the examples in figure 8.1
(the first Spanish, the second English, the third French)
that you do not need comprehension to
identify different human languages. But it is not
immediately clear how to do it mechanically.
Various less satisfactory alternatives
are reviewed in the paper.
Dunning asks the following questions:

- Q: How simple can the program be?
  A: A small program based on statistical principles.

- Q: What does it need to learn?
  A: No hand-coded linguistic knowledge is needed: only
  training data, plus the assumption that texts are
  made of bytes.

- Q: How much training data is needed?
  A: A few thousand words of sample text from each language
  suffices; ideally about 50 Kbytes.

- Q: How much test data?
  A: 10 characters work; 500 characters work very well.

- Q: Can it generalize?
  A: No. If trained on French, English and Spanish, it
  thinks German is English.
No linguistically motivated heuristics are needed beyond the
assumption that we have a probabilistic (low-order Markov) process
generating characters.
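The whole approach can be sketched in a few lines. This is a minimal illustration, not Dunning's implementation: the training strings below are tiny toy samples (real training needs a few thousand words per language, as noted above), the model is a character bigram (first-order Markov) model, and the additive smoothing constant is an arbitrary choice.

```python
import math
from collections import defaultdict

def train_bigram_model(text, alpha=0.5):
    """Count character bigrams in text; return a smoothed log P(c2 | c1)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for c1, c2 in zip(text, text[1:]):
        counts[c1][c2] += 1
        totals[c1] += 1
    vocab = 256  # assume a byte-sized alphabet, as in the text above

    def logprob(c1, c2):
        # Additive smoothing: unseen bigrams get a small nonzero probability.
        return math.log((counts[c1][c2] + alpha) / (totals[c1] + alpha * vocab))
    return logprob

def score(model, text):
    """Log-likelihood of text under a bigram model."""
    return sum(model(c1, c2) for c1, c2 in zip(text, text[1:]))

def identify(models, text):
    """Pick the language whose model gives the test string highest likelihood."""
    return max(models, key=lambda lang: score(models[lang], text))

# Toy training samples (illustrative only, far smaller than the 50 Kbytes
# recommended above).
models = {
    "english": train_bigram_model(
        "the quick brown fox jumps over the lazy dog and then "
        "the other thing that they thought"),
    "spanish": train_bigram_model(
        "el perro y el gato están en la casa con la niña que "
        "habla español todos los días según ellos"),
}
print(identify(models, "this is the thing"))  # → english
```

Note that the model makes no use of linguistic structure: distinctive character sequences such as "th" in English, or "ñ" in Spanish, carry enough statistical weight on their own.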