Next: Motivations for the scientific
Up: The history
Previous: The Rosetta stone
conducted a heroic feat of social engineering by
organising 5000 Prussian analysts to count letter occurrences in 11
million words of text, using this as the basis of a treatise on
spelling rules. It is worth considering the logistics of doing this
in 1897. It now takes a matter of minutes to obtain similar data from
the large corpora of text which are available to us. Taking a
508,219-word sample of the British National Corpus,
we can use locally available tools (described later) to get the
results in table 2.1 for the frequencies of
letter pairs within words.
Table 2.1: Letter-letter pairs in a sample of the British National Corpus
For comparison, table 2.2 contains the top 30 pairs in the
New Testament (180,404 words).
Table 2.2: Letter-letter pairs in the complete New Testament
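The kind of count behind these tables can be sketched in a few lines of Python. This is not the tool used to prepare the tables (those tools are described later); it is a minimal illustration under stated assumptions. A "word" is taken to be a maximal run of ASCII letters, and the names letter_pair_counts, top_pairs and fold_case are hypothetical. The fold_case switch makes the treatment of capital letters an explicit choice, since the text does not say how the tables handled them.

```python
from collections import Counter
import re


def letter_pair_counts(text, fold_case=True):
    """Count adjacent letter pairs that occur inside words.

    Assumption: a word is a maximal run of ASCII letters, so digits,
    punctuation and whitespace all act as word boundaries.
    """
    if fold_case:
        # Treat "Th" and "th" as the same pair. Whether the tables
        # in the text did this is not stated; it is a choice made here.
        text = text.lower()
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text):
        # Pair each letter with its successor in the same word, so that
        # no pair ever spans a word boundary.
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


def top_pairs(text, n=30, fold_case=True):
    """Return the n most frequent letter pairs with their counts."""
    return letter_pair_counts(text, fold_case).most_common(n)


if __name__ == "__main__":
    # A made-up sentence, used only to show the call; not corpus data.
    sample = "The theory then held that the other path was there."
    for pair, count in top_pairs(sample, n=5):
        print(pair, count)
```

Calling top_pairs with fold_case=False keeps "Th" and "th" as separate pairs, which is one way to explore how capital letters affect the counts.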
Much of the potential of data-intensive linguistics arises from the
ease with which counts of this kind can now be obtained. Much of the
work lies in deciding what inferences to draw from such data.
Has anything changed since the New Testament version
in question was written? If so, what was it that changed? Spelling
conventions? Patterns of word usage? Perhaps there are lots of
proper names in the New Testament. What exactly happened to the
capital letters when we prepared the table? Was that what we
wanted to happen? All these questions deserve to be answered.
But we won't answer them now ...
Chris Brew
8/7/1998