Parsing is the process of associating sentences with nested ``phrase
markers''. This goes beyond the flat annotations produced by
part-of-speech taggers, but stops short of full semantic
representations. We have already seen the benefits of part-of-speech
tagging as an aid to the more refined formulation of corpus queries.
We also saw the limitations of flat annotations: the two
sentences whose tree diagrams are shown in figure 11.1
are different in meaning, yet have the same sequence of pre-terminal
labels. A part-of-speech tagger has no means of telling the
difference, but given an appropriate grammar a parser will be
able to
1. Determine that there are multiple analyses.
2. (Maybe) venture an opinion about which analysis is the more likely.
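To make the first point concrete, here is a minimal sketch of a toy
grammar under which a single string receives two analyses. The grammar,
the sentence and the use of the NLTK toolkit are all illustrative
assumptions of ours, not material from figure 11.1:

    import nltk

    # An illustrative toy grammar with a classic prepositional-phrase
    # attachment ambiguity (not the grammar behind figure 11.1).
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | NP PP | 'I'
        VP -> V NP | VP PP
        PP -> P NP
        Det -> 'the'
        N  -> 'man' | 'telescope'
        V  -> 'saw'
        P  -> 'with'
    """)

    parser = nltk.ChartParser(grammar)

    # The chart parser enumerates every analysis the grammar licenses:
    # one in which the PP modifies 'man', one in which it modifies 'saw'.
    for tree in parser.parse("I saw the man with the telescope".split()):
        print(tree)

Both analyses are printed; nothing in the grammar itself says which is
the more likely.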
In the example of figure 11.1 it is pretty clear that both analyses
correspond to sensible meanings. Unfortunately, when we move to larger
grammars it becomes much harder, and eventually impossible, to ensure
that this nice property stays true.
Figure 11.1: Two sentences built on the same words
Very often the best we can do (or the most we can afford)
is to provide a grammar which ``covers'' the data at the expense of
allowing a large number of spurious parses.
Depending on the sophistication of the grammar, typical real-world
sentences may receive hundreds, thousands or millions of analyses,
most of which stretch our powers of interpretation to the limit. For
example, figure 11.2 gives three readings of one sentence: two of them
are sensible, but the last is hard to interpret. Charniak points out
that you can just about manage it if you think ``biscuits'' is a good
name for a dog.
Figure 11.2: Three sentences built on the same words
But crucially, he also points out that the rule which seems to be to
blame for this over-generation, namely

    np → np np

is a perfectly reasonable rule for things like ``college principal'' or ``Wall
Street Journal''.
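The cost of keeping the rule is easy to demonstrate: since it is
recursive, a sequence of n nouns receives every possible binary
bracketing, and the number of bracketings grows as the Catalan
numbers. A small sketch, again using NLTK as an illustrative
assumption:

    import nltk

    # A grammar containing nothing but the problematic compounding rule.
    grammar = nltk.CFG.fromstring("""
        NP -> NP NP
        NP -> 'noun'
    """)

    parser = nltk.ChartParser(grammar)

    # Count the analyses of noun sequences of increasing length.
    # The counts follow the Catalan numbers: 1, 1, 2, 5, 14, 42, ...
    for n in range(1, 7):
        print(n, len(list(parser.parse(['noun'] * n))))

With six nouns there are already 42 analyses; with sixteen there would
be millions.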
If you are committed to working within a purely ``possibilistic''
framework you will, at the very least, have to do some careful
work in order to block the application of the problematic rule in some
contexts while allowing it in others. This is the problem of
controlling over-generation, and it is frequently very serious.
On the other hand, given the right application,
you may not care very much about over-generation.
GSEARCH has no immediate need of statistical help
in its business of finding interesting pieces of
text: for its purposes the mere existence of a parse
is sufficient, since the expectation is that the reader will
in any case inspect the output. In this application it may not
be necessary to show the reader any structural information at all,
still less to choose the correct analysis.
Let us nevertheless assume that we do need to
take on the problem of rampant ambiguity.
The danger of over-generation may be reduced in a number of ways:
- Complicate the grammar.
- Complicate the parser by giving it special inference mechanisms
designed to control the over-generation.
- Introduce an extra, supervisory component capable of rejecting
unwelcome parses which would otherwise be accepted by the parser (a
sketch of this option, using a probabilistic grammar, appears at the
end of this section).
All of these options add complexity in one way or another. Linguistics
is difficult enough without having to orchestrate a conspiracy of
complicated grammars, complicated machines and complicated
interactions between the components of the system.
Experience shows that the purely symbolic approach becomes very
difficult when one begins to move from carefully delineated problems
in theoretical linguistics to large-scale tasks involving the
full beauty (and horror) of natural language as it is actually used in
the real world. Statistical methods are no panacea, and large-scale
tasks tend to stay hard no matter what, but they do make it
much more practical to at least begin the work which is needed if we
are to process naturally-occurring language data.
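As a taste of what the statistical route offers, here is a minimal
sketch of the third option above: a probabilistic grammar whose rule
weights let a parser rank the competing analyses instead of merely
enumerating them. The weights here are invented for illustration, and
NLTK is again an assumption on our part:

    import nltk

    # Each rule carries a probability; the probabilities for rules
    # sharing a left-hand side must sum to one. The numbers are invented.
    grammar = nltk.PCFG.fromstring("""
        S  -> NP VP       [1.0]
        NP -> Det N       [0.5]
        NP -> NP PP       [0.3]
        NP -> 'I'         [0.2]
        VP -> V NP        [0.6]
        VP -> VP PP       [0.4]
        PP -> P NP        [1.0]
        Det -> 'the'      [1.0]
        N  -> 'man'       [0.5]
        N  -> 'telescope' [0.5]
        V  -> 'saw'       [1.0]
        P  -> 'with'      [1.0]
    """)

    # The Viterbi parser returns only the single most probable analysis.
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse("I saw the man with the telescope".split()):
        print(tree.prob(), tree)

In this sketch the supervisory component has been folded into the
parser itself, which is one common design choice; the same scores
could equally well drive a separate component that reranks the output
of an ordinary parser.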
Chris Brew
8/7/1998