Partial Parsing via Finite-State Cascades

Steven Abney
(Received 3 March 2002)
Abstract
Finite state cascades represent an attractive architecture for parsing unrestricted text.
Deterministic parsers specified by finite state cascades are fast and reliable. They can be
extended at modest cost to construct parse trees with finite feature structures. Finally,
such deterministic parsers need not trade accuracy for speed; they may in fact
be more accurate than exhaustive-search stochastic context-free parsers.
[Figure: the finite-state cascade analysis of "the woman in the lab coat
thought you were sleeping". Transduction Ti maps level Li-1 to level Li.]

    L3   S                                        S
    L2   NP             PP             VP         NP     VP
    L1   NP        P    NP             VP         NP     VP
    L0   D    N    P    D    N    N    V-tns      Pron   Aux   V-ing
         the  woman in  the  lab  coat thought    you    were  sleeping
syntactic labels into a word stream. By contrast, in a finite state cascade, spans of
input elements are reduced to single elements in each transduction, as in traditional
parsing. As an illustration, consider the regular cascade (1).
(1)  T1 : { NP → D? N∗ N
            VP → V-tns | Aux V-ing }
     T2 : { PP → P NP }
     T3 : { S → PP∗ NP PP∗ VP PP∗ }
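To make the cascade concrete, here is a toy rendering in Python. The real system compiles the patterns into deterministic finite automata; plain regular expressions over a space-separated tag string stand in for them in this sketch, and the promotion of Pron to NP (visible in the figure but not spelled out in (1)) is added as an assumption.

```python
import re

# Toy rendering of regular cascade (1): each level rewrites spans of the
# previous level's output to single category symbols.
LEVELS = [
    # T1: NP -> D? N* N ; VP -> V-tns | Aux V-ing ; Pron -> NP (assumed, per figure)
    [(r'\b(D )?(N )*N\b', 'NP'),
     (r'\bV-tns\b|\bAux V-ing\b', 'VP'),
     (r'\bPron\b', 'NP')],
    # T2: PP -> P NP
    [(r'\bP NP\b', 'PP')],
    # T3: S -> PP* NP PP* VP PP*
    [(r'\b(PP )*NP (PP )*VP( PP)*\b', 'S')],
]

def cascade(tags):
    """Run levels T1..T3 over a part-of-speech sequence; return the top level."""
    level = ' '.join(tags)
    for patterns in LEVELS:
        for pat, cat in patterns:
            level = re.sub(pat, cat, level)
    return level.split()

# The example sentence's tag sequence reduces to two S chunks:
# cascade("D N P D N N V-tns Pron Aux V-ing".split())  ->  ['S', 'S']
```

As in the figure, material that matches no pattern (here, a lone NP) simply survives to the next level unchanged; this is the "containment of ambiguity" behavior rather than failure.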
The symbols “[NP ”, “n=”, “[N1 ”, and “]” represent actions. The pattern (2) trans-
lates to a finite transducer in which the boldface symbols (D, A, N, V) are input
symbols, and the actions are output symbols. I hasten to emphasize the differ-
ence between this transducer and the transductions Ti that we discussed earlier.
The transductions Ti are computed using level recognizers. Let us call transducers
such as that generated by the extended expression (2) internal transducers. Inter-
nal transducers do not replace level recognizers. Extended patterns such as (2) are
stripped of actions and compiled into automata whose union yields level recogniz-
ers, just as before. Level recognizer Ti is called with Li−1 as input, and creates
Li as output. After a phrase associated with pattern p is recognized, the internal
transducer for pattern p is used to flesh out the phrase with features and internal
structure.
Further, unlike in unification grammars, recognized phrases are not rejected be-
cause of unification failures. Indeed, there is no unification per se: features are
represented as bit vectors, and feature assignment involves bitwise operations on
those vectors.1 For example, the case feature of the German word der might be
represented as 1000 0110 0000 0100, meaning that it is either masc. sg. nom.,
fem. sg. gen., fem. sg. dat., or pl. gen. Case features are combined by bitwise “and.”
If Mann is 1011 0000 0000 0000, then der Mann ends up being 1000 0000 0000
0000, i.e., unambiguously masc. sg. nom. If Haus is 0000 0000 1011 0000, then
der Haus ends up as all zeros. But it is not for that reason rejected as a phrase.
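A minimal sketch of this scheme in Python, using the bit patterns from the example; the nibble layout and the helper name are illustrative, not taken from the paper.

```python
# German case/gender agreement as bit vectors, per the example in the text.
# Each bit stands for one reading; a set bit means that reading is possible.
DER  = 0b1000_0110_0000_0100   # masc.sg.nom | fem.sg.gen | fem.sg.dat | pl.gen
MANN = 0b1011_0000_0000_0000   # masc.sg readings of "Mann"
HAUS = 0b0000_0000_1011_0000   # neut.sg readings of "Haus"

def combine(*features):
    """Feature combination by bitwise AND: a reading survives iff every word allows it."""
    result = 0xFFFF
    for f in features:
        result &= f
    return result

# der Mann -> unambiguously masc.sg.nom; der Haus -> all zeros, yet still an NP.
assert combine(DER, MANN) == 0b1000_0000_0000_0000
assert combine(DER, HAUS) == 0
```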
After a phrase is recognized, we run the internal transducer on it. That is, we
associate states of the internal transducer with the positions in the input spanned
by the recognized phrase, requiring of course that transitions respect input symbols
and that a final state be reached at the end of the phrase. In general, the transducer
is not deterministic. We do a backtracking search that postpones epsilon transitions
whenever possible, and accepts the first path that successfully reaches a final state
at the end of the recognized phrase.
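The backtracking regime can be sketched as a depth-first search that tries input-consuming arcs before epsilon arcs. The transition encoding and names below are illustrative; the paper does not specify a data structure.

```python
def run_internal_transducer(transitions, finals, start, symbols):
    """Depth-first, backtracking run of a possibly nondeterministic transducer.

    transitions: state -> list of (input_symbol_or_None, action_or_None, next_state),
    where None as input marks an epsilon arc.  Consuming arcs are tried before
    epsilon arcs, so epsilons are postponed whenever possible; the first path
    reaching a final state at the end of the phrase wins.  Assumes no epsilon
    cycles.  Returns the accepting path's (position, action) pairs, or None.
    """
    def search(state, pos, actions):
        if pos == len(symbols) and state in finals:
            return actions
        # stable sort: consuming arcs first, epsilon arcs last
        arcs = sorted(transitions.get(state, []), key=lambda arc: arc[0] is None)
        for sym, action, nxt in arcs:
            step = [(pos, action)] if action is not None else []
            if sym is None:                                  # epsilon: no input consumed
                found = search(nxt, pos, actions + step)
            elif pos < len(symbols) and symbols[pos] == sym:
                found = search(nxt, pos + 1, actions + step)
            else:
                continue
            if found is not None:
                return found
        return None

    return search(start, 0, [])

# A tiny transducer for NP -> D? N with actions: the determiner arc is skipped
# via an epsilon arc when no D is present.
T = {0: [('D', 'd=', 1), (None, None, 1)], 1: [('N', 'n=', 2)]}
# run_internal_transducer(T, {2}, 0, ['D', 'N'])  ->  [(0, 'd='), (1, 'n=')]
# run_internal_transducer(T, {2}, 0, ['N'])       ->  [(0, 'n=')]
```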
By associating states of the internal transducer with input positions, we also
associate actions with input positions. A right-bracket action at position i means
“create a new current phrase ending at i”. A left-bracket action at i means “the
new current phrase begins at i”. An assignment action f= at i means “copy feature
f from the input symbol or old phrase at i to the current phrase”.
Given these interpretations, actions cannot simply be executed left to right.
Rather, we must create the innermost phrase first, do the feature assignments
for which it is the target, then drop it into the input and repeat with the next innermost
phrase. Fortunately, the order in which actions are to be executed can be determined
at compile time, so that we do not have to sort the actions at run time.
3 Evaluation
The speed of finite state cascades has already been mentioned. The speed of the
current implementation (Cass2) is quite sensitive to the number of levels: the parser
is about 2/3 faster with a two-level grammar than with a nine-level grammar. With
nine levels, the parser runs at about 1600 words/second (w/s) on a Sparcstation
ELC, or an estimated 1300 w/s on a Sparcstation 1. By comparison, parser speeds
reported in the literature range from less than one w/s to nearly 3000 w/s. I have
attempted to adjust the following figures for hardware, taking a Sun4/Sparcstation
1 Those features with a fixed set of possible values could of course be folded into phrase
categories, avoiding online computation. But folding features into categories often blows
up the grammar size unacceptably, in which case online computation as described here
is advisable.
Automata have been proposed that use bit-vectors as states (Blank 1989; Kornai 1996).
The approach taken here is less sophisticated, in that feature bit-vectors are used only
to pass information up the parse tree, rather than being used as states of the parsing
automaton.
1 as standard.2 Traditional chart parsers run at less than 1 w/s (Tacitus: ≈0.12
w/s (Hobbs et al. 1992)). “Skimming” parsers run at 20–50 w/s (Fastus: 23 w/s
(Appelt et al. 1992), Scisor: ≈30 w/s (Jacobs 1990), Clarit: ≈50 w/s (Evans et al.
1991)). Deterministic parsers can be more than an order of magnitude faster (CG:
410 w/s (Voutilainen 1993), Fidditch: 1200 w/s (Hindle, p.c.), Cass2: 1300–2300
w/s, Copsy: ≈2700 w/s (Schwarz 1990)). Cass2 is as fast as any parser in this class,
with the possible exception of Copsy, for which the hardware adjustment is highly
uncertain.
In measuring parser accuracy, there is a tension between evaluation with re-
spect to some application (e.g., does parsing improve accuracy of speech recogni-
tion/information retrieval/etc.?) and evaluation with respect to a linguistic defini-
tion of “phrase.” I propose to resolve this tension by distinguishing between the
accuracy of the parser at performing the linguistic task it was designed for, and the
utility of the linguistic task for particular applications.
In measuring accuracy, it is additionally important to distinguish between the
task definition, which typically involves human linguistic judgments, and “test
data,” which is a record of a particular person’s judgments. For example, a style-
book is a task definition, and a treebank is a record of judgments. Generally, a task
definition leaves ample room for interjudge variance—though the degree to which
interjudge variance compromises the validity of treebank-based evaluations is rarely
appreciated. To reduce interjudge variance, we may make the task definition more
detailed, but that also generally makes the task definition more arbitrary and less
useful for comparison of parsers. For broad usefulness, it is in my opinion better to
develop a collection of small, specialized “benchmarks” rather than a single large
stylebook, so that evaluation participants can pick and choose those tasks that best
match their systems’ actual design criteria.
We have performed a preliminary evaluation of the parser described here. I wrote
a brief stylebook defining “chunks”, intended as a benchmark for one part of the
output of a partial (or full-scale) parser. I hand-labelled a random sample of corpus
positions with the category and end-point of the chunk (if any) starting at that
corpus position. This permits us to estimate correlation between my judgments
and the parser’s “judgments” without creating a large treebank. A second human
judge (Marc Light) performed the same task, to permit us to gauge interjudge
reliability. The results are given in Table 3.
What is immediately striking is that the difference between parser and human
performance on this test is barely significant. That fact would not have emerged if
my mark-up had simply been accepted as the standard. Clearly, there is room for
improvement in both the grammar (to improve parser accuracy) and stylebook (to
reduce interjudge variance).
2 The estimates I used for hardware coefficients are as follows: Sparcstation 1: 1.0
(Fidditch, Scisor, Clarit; hardware not specified in the Scisor and Clarit citations,
so I have assumed a Sparcstation 1), Siemens BS2000: 1.0 (Copsy; estimate is highly
uncertain), Sparcstation ELC: 1.25 (Cass2), Sparcstation 2: 1.7 (Tacitus, Fastus;
hardware not explicitly given in the Tacitus citation, so I have assumed a
Sparcstation 2), Sparcstation 10: 3.8 (CG).
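Concretely, the adjustment divides a reported speed by the machine's coefficient to get a Sparcstation-1-equivalent figure; for instance, Cass2's 1600 w/s on an ELC becomes 1600 / 1.25 = 1280, the roughly 1300 w/s quoted in the text. A sketch:

```python
# Normalizing reported parser speeds to Sparcstation-1-equivalent words/second,
# using the hardware coefficients from this footnote.
COEFF = {
    "Sparcstation 1": 1.0,
    "Siemens BS2000": 1.0,    # highly uncertain
    "Sparcstation ELC": 1.25,
    "Sparcstation 2": 1.7,
    "Sparcstation 10": 3.8,
}

def normalized_speed(words_per_second, hardware):
    """Reported speed divided by the machine's coefficient."""
    return words_per_second / COEFF[hardware]

# Cass2: 1600 w/s on an ELC -> 1280.0, i.e. the ~1300 w/s figure in the text.
```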
Table 3. Chunk judgments: the parser (cass2) and a second human judge (marc),
each compared against my mark-up (the standard).

                                    cass2    marc
    sample size           N          1000
    answers^a in common   X           921     934
    chunks in test        t           390     381
    chunks in standard    s           394
    chunks in common      x           343     348

^a The sample consists of corpus positions; X is a random variable whose values
(the "answers") are "chunk of category c and length k" or "no chunk". Per-word
accuracy is the percentage of correct answers. Precision and recall consider
only the subsample in which there is actually a chunk in the test or standard,
respectively.
^b The plus-minus figures represent a 95% confidence interval, using a normal
approximation to the binomial.
By way of closing remarks, I would like to address the motivation for partial parsing.
The common view is that a parser such as that described here trades off accuracy
for speed, compared to an exhaustive-search parser. But under certain reasonable
assumptions about English, partial parsers—in particular, parsers using the longest-
match rule—may be not only faster but also more accurate than exhaustive-search
parsers—in particular, stochastic context-free parsers. Consider the grammar
(3)  S → b A B | c C A | d B D
     A → a | aa
     B → a | aa
     C → a | aa | aaa
     D → a | aa
Grammar (3) generates a finite language. Assume that each parse tree occurs with
equal frequency, with the exception that a longest-match rule resolves ambiguities.
That is, the parse trees (4) are excluded, inasmuch as, for each parse in (4), there
is an alternative parse in which the middle child covers more of the input.
(4) [S b [A a] [B a a]]
[S c [C a] [A a a]]
[S c [C a a] [A a a]]
[S d [B a] [D a a]]
If our training corpus contains each parse tree with equal frequency, excluding the
parse trees (4), the maximum-likelihood estimate for rule probabilities is as follows:
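The estimates can be recovered by enumeration: list every tree licensed by (3), discard those excluded by longest match as in (4), and take relative frequencies of rule uses. A sketch (the tree encoding is illustrative):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Expansions of the pre-terminals in grammar (3), and the three S rules,
# each identified by its initial terminal marker.
EXPANSIONS = {"A": ["a", "aa"], "B": ["a", "aa"], "C": ["a", "aa", "aaa"], "D": ["a", "aa"]}
S_RULES = [("b", "A", "B"), ("c", "C", "A"), ("d", "B", "D")]

trees = []
for marker, first, second in S_RULES:
    for w1, w2 in product(EXPANSIONS[first], EXPANSIONS[second]):
        # Longest match: discard a parse if some alternative parse of the
        # same string gives the middle (first) child a longer yield.
        total = len(w1) + len(w2)
        if any(len(v1) > len(w1) and len(v1) + len(v2) == total
               for v1 in EXPANSIONS[first] for v2 in EXPANSIONS[second]):
            continue
        trees.append((marker, first, w1, second, w2))

# One occurrence of each tree; count rule uses and normalize per left-hand side.
counts = Counter()
for marker, first, w1, second, w2 in trees:
    counts[("S", marker)] += 1       # S rule, keyed by its terminal marker
    counts[(first, w1)] += 1
    counts[(second, w2)] += 1

totals = Counter()
for (lhs, _), n in counts.items():
    totals[lhs] += n
probs = {rule: Fraction(n, totals[rule[0]]) for rule, n in counts.items()}
```

Over the ten licensed trees this yields, for example, P(S → bAB) = 3/10, P(A → a) = 4/7, P(B → a) = 1/2, P(C → aaa) = 1/2, and P(D → aa) = 1/3.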
5 Conclusion
I have presented a technique, finite state cascades, for producing fast, robust parsers
for unrestricted text. The technique has been applied to English and German, and
is being used in a project for inducing subcategorization frames and selectional
restrictions in these languages. The parser consists of a pipeline of finite state rec-
ognizers. Key concepts are easy-first parsing, islands of certainty, and containment
of ambiguity. Finite state cascades can be extended to include feature assignment
and output of “linguistic” structure at little cost in efficiency.
I have also drawn some distinctions that I believe are important for evaluation,
though not widely appreciated: in particular, the distinction between accuracy
and utility, and the distinction between task specification and test data, along with
the importance of measuring and controlling interjudge variance. To the latter end,
I propose using a collection of benchmarks instead of a single stylebook in order to
control interjudge variance while maintaining breadth of relevance.
References
Abney, Steven. 1990a. Rapid incremental parsing with repair. In Proceedings of the 6th
New OED Conference: Electronic Text Research, pp. 1–9. Waterloo, Ontario: University
of Waterloo.
Abney, Steven. 1990b. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol
Tenny (eds), Principle-Based Parsing. Kluwer Academic Publishers, 1991.
Appelt, Douglas E. et al., 1992. SRI international: FASTUS system MUC-4 test results
and analysis. In Proceedings, Fourth Message Understanding Conference (MUC-4), pp.
143–147. San Mateo, CA. Morgan Kaufmann.
Blank, Glenn. 1989. A Finite and Real Time Processor for Natural Language. Communi-
cations of the ACM, 32(11): 1174–1189.
Evans, D., Ginther-Webster, K., Hart, M., Lefferts, R., and Monarch, I. 1991. Automatic
indexing using selective NLP and first-order thesauri. In Proc. of RIAO 91 (Barcelona),
pp. 624–643.
Hobbs, Jerry R. et al. 1992. SRI International: Description of the FASTUS system used
for MUC-4. In Proceedings, Fourth Message Understanding Conference (MUC-4), pp.
268–275, San Mateo, CA. Morgan Kaufmann.
Jacobs, Paul S. 1990. To parse or not to parse: Relation-driven text skimming. In COLING
90, vol. 2, pp. 194–198.
Kornai, András. 1996. Vectorized Finite State Automata. In Proc. of the ECAI 96 Work-
shop, Extended Finite State Models of Language (Prague), pp. 36–41.
Koskenniemi, Kimmo. 1990. Finite-state parsing and disambiguation. In COLING-90,
pp. 229–232.
Koskenniemi, Kimmo, Tapanainen, Pasi and Voutilainen, Atro. 1992. Compiling and using
finite-state syntactic rules. In COLING-92, pp. 156–162.
Lesk, Michael. 1978. Lex: a lexical analysis program generator. In UNIX Programming
Utilities and Libraries. Publisher unknown.
Roche, Emmanuel. 1993. Analyse Syntaxique Transformationnelle du Francais par Trans-
ducteurs et Lexique-Grammaire. PhD thesis, Université Paris 7.
Schwarz, Christoph. 1990. Automatic syntactic analysis of free text. JASIS, 41(6): 408–
417.
Voutilainen, Atro. 1993. NPtool, a detector of English noun phrases. In Proceedings of
the Workshop on Very Large Corpora, pp. 48–57.