Parsing is the process of associating sentences with nested ``phrase
markers''. This goes beyond the flat annotations produced by
part-of-speech taggers, but stops short of full semantic
representations. We have already seen the benefits of part-of-speech
tagging as an aid to the more refined formulation of corpus queries.
We also saw the limitations of flat annotations: the two
sentences whose tree diagrams are shown in figure 11.1
are different in meaning, yet have the same sequence of pre-terminal
labels. A part-of-speech tagger has no means of telling the
difference, but given an appropriate grammar a parser will be
able to
1. Determine that there are multiple analyses.
2. (Maybe) venture an opinion about which analysis is the more likely.
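To make the first point concrete, here is a minimal sketch of a toy
grammar under which a single string receives two analyses. The grammar,
the sentence and the use of the NLTK toolkit are all illustrative
assumptions of ours, not material from figure 11.1:

    import nltk

    # An illustrative toy grammar with a classic prepositional-phrase
    # attachment ambiguity (not the grammar behind figure 11.1).
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | NP PP | 'I'
        VP -> V NP | VP PP
        PP -> P NP
        Det -> 'the'
        N  -> 'man' | 'telescope'
        V  -> 'saw'
        P  -> 'with'
    """)

    parser = nltk.ChartParser(grammar)

    # The chart parser enumerates every analysis the grammar licenses:
    # one in which the PP modifies 'man', one in which it modifies 'saw'.
    for tree in parser.parse("I saw the man with the telescope".split()):
        print(tree)

Both analyses are printed; nothing in the grammar itself says which is
the more likely.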
In the example of figure 11.1 it is pretty clear that both analyses
correspond to sensible meanings. Unfortunately, when we move to larger
grammars it becomes much harder, and eventually impossible, to ensure
that this nice property stays true.
Figure 11.1: Two sentences built on the same words
Very often the best we can do (or the most we can afford)
is to provide a grammar which ``covers'' the data at the expense of
allowing a large number of spurious parses.
Depending on the sophistication of the grammar, typical real-world
sentences may receive hundreds, thousands or millions of analyses,
most of which stretch our powers of interpretation to the limit. For
example, figure 11.2 gives three readings of one sentence: two of them
are sensible, but the last is hard to interpret. Charniak points out
that you can just about manage it if you think ``biscuits'' is a good
name for a dog.
Figure 11.2: Three sentences built on the same words
But crucially, he also points out that the rule which seems to be to
blame for this over-generation, namely

    np → np np

is a perfectly reasonable rule for things like ``college principal'' or ``Wall
Street Journal''.
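The cost of keeping the rule is easy to demonstrate: since it is
recursive, a sequence of n nouns receives every possible binary
bracketing, and the number of bracketings grows as the Catalan
numbers. A small sketch, again using NLTK as an illustrative
assumption:

    import nltk

    # A grammar containing nothing but the problematic compounding rule.
    grammar = nltk.CFG.fromstring("""
        NP -> NP NP
        NP -> 'noun'
    """)

    parser = nltk.ChartParser(grammar)

    # Count the analyses of noun sequences of increasing length.
    # The counts follow the Catalan numbers: 1, 1, 2, 5, 14, 42, ...
    for n in range(1, 7):
        print(n, len(list(parser.parse(['noun'] * n))))

With six nouns there are already 42 analyses; with sixteen there would
be millions.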
If you are committed to working within a purely ``possibilistic''
framework you will, at the very least, have to do some careful
work in order to block the application of the problematic rule in some
contexts while allowing it in others. This is the problem of
controlling over-generation, and it is frequently very serious.
On the other hand, given the right application,
you may not care very much about over-generation.
GSEARCH has no immediate need of statistical help
in its business of finding interesting pieces of
text: for its purposes the mere existence of a parse
is sufficient, since the expectation is that the reader will
in any case inspect the output. In this application it may not
be necessary to show the reader any structural information at all,
still less to choose the correct analysis.
Let us nevertheless assume that we do need to
take on the problem of rampant ambiguity.
The danger of over-generation may be reduced in a number of ways:
- Complicate the grammar.
- Complicate the parser by giving it special inference mechanisms
designed to control the over-generation.
- Introduce an extra, supervisory component capable of rejecting
unwelcome parses which would otherwise be accepted by the parser (a
sketch of this option, using a probabilistic grammar, appears at the
end of this section).
All of these options add complexity in one way or another. Linguistics
is difficult enough without having to orchestrate a conspiracy of
complicated grammars, complicated machines and complicated
interactions between the components of the system.
Experience shows that the purely symbolic approach becomes very
difficult when one begins to move from carefully delineated problems
in theoretical linguistics to large-scale tasks involving the
full beauty (and horror) of natural language as it is actually used in
the real world. Statistical methods are no panacea, and large-scale
tasks tend to stay hard no matter what, but they do make it
much more practical to at least begin the work which is needed if we
are to process naturally-occurring language data.
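As a taste of what the statistical route offers, here is a minimal
sketch of the third option above: a probabilistic grammar whose rule
weights let a parser rank the competing analyses instead of merely
enumerating them. The weights here are invented for illustration, and
NLTK is again an assumption on our part:

    import nltk

    # Each rule carries a probability; the probabilities for rules
    # sharing a left-hand side must sum to one. The numbers are invented.
    grammar = nltk.PCFG.fromstring("""
        S  -> NP VP       [1.0]
        NP -> Det N       [0.5]
        NP -> NP PP       [0.3]
        NP -> 'I'         [0.2]
        VP -> V NP        [0.6]
        VP -> VP PP       [0.4]
        PP -> P NP        [1.0]
        Det -> 'the'      [1.0]
        N  -> 'man'       [0.5]
        N  -> 'telescope' [0.5]
        V  -> 'saw'       [1.0]
        P  -> 'with'      [1.0]
    """)

    # The Viterbi parser returns only the single most probable analysis.
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse("I saw the man with the telescope".split()):
        print(tree.prob(), tree)

In this sketch the supervisory component has been folded into the
parser itself, which is one common design choice; the same scores
could equally well drive a separate component that reranks the output
of an ordinary parser.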
Chris Brew
8/7/1998