The point of this section is to bring out the issues in
statistical language modelling in a very simple
context. Language identification is relatively easy, but
demanding enough to serve as an illustration.
The same principles apply to speech recognition and
part-of-speech tagging, but there is more going on in
those applications, which can be distracting.
The following few
pages are based on Dunning's paper on Statistical Language
Identification, which is strongly recommended.
Figure 8.1: Language strings to identify
It is obvious from the examples in figure 8.1
(the first Spanish, the second English, the third French)
that you do not need comprehension to
identify different human languages. But it is not
immediately clear how to do it mechanically.
Various less satisfactory alternatives
are reviewed in the paper.
Dunning asks the following questions:

- Q: How simple can the program be?
  A: A small program based on statistical principles.

- Q: What does it need to learn?
  A: No hand-coded linguistic knowledge is needed: only
  training data, plus the assumption that texts are
  made of bytes.

- Q: How much training data is needed?
  A: A few thousand words of sample text from each language
  suffices; ideally about 50 Kbytes.

- Q: How much test data?
  A: 10 characters work; 500 characters work very well.

- Q: Can it generalize?
  A: No. If trained on French, English and Spanish, it
  thinks German is English.
No linguistically motivated heuristics are needed beyond the
assumption that we have a probabilistic (low-order Markov) process
generating characters.
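The whole approach can be sketched in a few lines. This is a minimal illustration, not Dunning's implementation: the training strings below are tiny toy samples (real training needs a few thousand words per language, as noted above), the model is a character bigram (first-order Markov) model, and the additive smoothing constant is an arbitrary choice.

```python
import math
from collections import defaultdict

def train_bigram_model(text, alpha=0.5):
    """Count character bigrams in text; return a smoothed log P(c2 | c1)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for c1, c2 in zip(text, text[1:]):
        counts[c1][c2] += 1
        totals[c1] += 1
    vocab = 256  # assume a byte-sized alphabet, as in the text above

    def logprob(c1, c2):
        # Additive smoothing: unseen bigrams get a small nonzero probability.
        return math.log((counts[c1][c2] + alpha) / (totals[c1] + alpha * vocab))
    return logprob

def score(model, text):
    """Log-likelihood of text under a bigram model."""
    return sum(model(c1, c2) for c1, c2 in zip(text, text[1:]))

def identify(models, text):
    """Pick the language whose model gives the test string highest likelihood."""
    return max(models, key=lambda lang: score(models[lang], text))

# Toy training samples (illustrative only, far smaller than the 50 Kbytes
# recommended above).
models = {
    "english": train_bigram_model(
        "the quick brown fox jumps over the lazy dog and then "
        "the other thing that they thought"),
    "spanish": train_bigram_model(
        "el perro y el gato están en la casa con la niña que "
        "habla español todos los días según ellos"),
}
print(identify(models, "this is the thing"))  # → english
```

Note that the model makes no use of linguistic structure: distinctive character sequences such as "th" in English, or "ñ" in Spanish, carry enough statistical weight on their own.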