), lengthened syllables (=), and a short pause (..). While prosodic annotation provides a more accurate rendering of the flow of speech than punctuation does (cf. section 3.6.1.5), inserting the necessary prosodic markup can increase transcription time significantly, in some cases doubling or tripling the time of transcription. To minimize transcription time, the ICE project decided to mark only two lengths of pause, short and long, and to dispense with any other prosodic markup.
Although SGML-conformant systems of annotation provide important descriptive information about a text, they can pose numerous difficulties for both corpus compilers and corpus users. First of all, because much markup has to be manually inserted, if a corpus compiler wants to exploit the full possibilities of the TEI system, for instance, he or she will need considerable resources to hire assistants to insert the markup. There are tools that can assist in the insertion of markup. To assist in the annotation of ICE texts, the ICE Markup Assistant was developed (Quinn and Porter 1996: 65-7). This program uses a series of WordPerfect macros to insert ICE markup in texts. These macros are inserted either manually or automatically, depending upon how easy it is to predict where a given tag should be inserted. For instance, because overlapping segments of speech occur randomly throughout a conversation, their occurrence cannot be predicted and they therefore have to be inserted manually. Conversely, in a written text, text unit tags can be inserted automatically at sentence boundaries, and then those that are erroneously marked can be post-edited manually. A range of SGML resources can be found on the W3C website: http://www.w3.org/MarkUp/SGML/.
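The automatic-insertion step can be illustrated with a short script. The sketch below assumes a simple regular-expression heuristic for sentence boundaries (the tag name and function are illustrative, not part of the actual ICE Markup Assistant), and it shows why post-editing remains necessary: a boundary is wrongly detected after an abbreviation such as "Dr.".

import re

def tag_text_units(text):
    # A crude heuristic: a text unit ends at ., ? or ! followed by
    # whitespace and a capital letter. Boundaries detected after
    # abbreviations (e.g. "Dr. Smith") are errors to be post-edited.
    units = re.split(r'(?<=[.?!])\s+(?=[A-Z])', text.strip())
    return '\n'.join('<#>' + u + '</#>' for u in units)

print(tag_text_units("The corpus was tagged. Dr. Smith then parsed it."))
# <#>The corpus was tagged.</#>
# <#>Dr.</#>
# <#>Smith then parsed it.</#>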
Another way to minimize the amount of time it takes to annotate texts is to use a reduced system of annotation. The ICE Project reduced the amount of structural markup required in ICE texts to the most essential markup for those ICE teams lacking the resources to insert all of the ICE markup that had been developed (Meyer 1997). Likewise, there exists a version of TEI called TEI Lite, which contains a minimal set of TEI-conformant markup (Burnard and Sperberg-McQueen 1995). In determining how much markup should be included in a text, it is useful to adopt Burnard's (1998) 'Chicago pizza' metaphor. One can view markup as toppings on a pizza: the particular
toppings (i.e. markup) that are added depend upon both what the corpus
compiler sees as important to annotate and what resources are available to
insert this annotation. Although some object to this view (Cook (1995), for instance, views annotation used in spoken texts not as an embellishment, an extra topping, but as crucial to the interpretation of a speech event), realistically the corpus compiler has to draw a line in terms of how detailed the annotation of a text should be.
While texts annotated with structural markup greatly facilitate the automatic
analysis of a corpus, the user wishing to browse through a text (particularly a
spoken text) will find it virtually unreadable: the text will be lost among the
markup. The British component of ICE (ICE-GB) works around this problem
by enabling the user to select how much markup he or she wishes to see in a
text: all, none, or only some. Likewise, some concordancing programs, such
as MonoConc Pro 2.0 (cf. section 5.3.2), can display markup or turn it off.
One of the future challenges in corpus linguistics is the development of tools that provide an easy interface to corpora containing significant amounts of annotation. The introduction of XML markup into corpora offers potential help in this area: since web browsers are designed to read XML-annotated documents, they will be able to convert a corpus into the kinds of documents displayed by browsers on websites.[2]

[2] Cf. Edwards (1993) for further information on how markup can be made more readable, particularly markup used to annotate spoken texts.
The discussion in this section might lead one to think that there has been
an endless proliferation of markup systems in corpus linguistics to annotate
corpora with structural markup. However, it is important to realize that all of
these systems have one important similarity: they are all SGML-conformant.
The ICE system was developed to annotate ICE documents, the TEI system a
wider range of spoken and written corpora as well as various kinds of documents
commonly found in the humanities, and XML various corpora for possible
distribution on the World Wide Web. Thus, all of these markup conventions are
simply instantiations of SGML put to differing uses.
4.2 Tagging a corpus
In discussing the process of tagging a corpus, it is important, first of all, to distinguish a tagset (a group of symbols representing various parts of speech) from a tagger (a software program that inserts the particular tags making up a tagset). This distinction is important because tagsets differ in the number and types of tags that they contain, and some taggers can insert more than one type of tagset. In the example below, the Brill tagger (described in Brill 1992), adapted for use in the AMALGAM Tagging Project, has been used
to assign part-of-speech designations to each word in the sentence I'm doing the work:[3]
I/PRON(pers,sing)
'm/V(cop,pres,encl)
doing/V(montr,ingp)
the/ART(def)
work/N(com,sing)
./PUNC(per)
In this example, which contains tags from the ICE tagset (Greenbaum and Ni 1996), each word is assigned to a major word class: I to the class of pronouns, 'm and doing to the class of verbs, the to the class of articles; work to the class of nouns; and the end stop to the class of punctuation. In parentheses following each major word class are designations providing more specific information about the word: I is a personal pronoun that is singular; 'm is the enclitic (i.e. contracted) present-tense form of the copula be;[4] doing is an -ing participle that is monotransitive (i.e. takes a single object); the is a definite article; work is a common noun; and the specific punctuation mark that is used is a period. The manner in which the ICE tagset has been developed follows Leech's (1997: 25-6) suggestion that those creating tagsets strive for 'Conciseness', 'Perspicuity' (making tag labels as readable as possible), and 'Analysability' (ensuring that tags are decomposable into their logical parts, with some tags, such as noun, occurring hierarchically above more specific tags, such as singular or present tense).
Over the years, a number of different tagging programs have been developed to insert a variety of different tagsets. The first tagging program was designed in the early 1970s by Greene and Rubin (1971) to assign part-of-speech labels to the Brown Corpus. Out of this program arose the various versions of the CLAWS program, developed at the University of Lancaster initially to tag the LOB Corpus (Leech, Garside, and Atwell 1983) and subsequently to tag the British National Corpus (Garside, Leech, and Sampson 1987; Garside and Smith 1997). The TOSCA team at the University of Nijmegen has developed a tagger that can insert two tagsets: the TOSCA tagset (used to tag the Nijmegen Corpus) and the ICE tagset (Aarts, van Halteren, and Oostdijk 1996). The AUTASYS Tagger can also be used to insert the ICE tagset as well as the LOB tagset
(Fang 1996). The Brill Tagger is a multi-purpose tagger that can be trained
to insert whatever tagset the user is working with into texts in English or any other language. These taggers do not exhaust the number of taggers that have been created, but provide an overview of the common taggers that have been developed to annotate the types of corpora that corpus linguists typically work with.

[3] The example provided here is in a vertical format generated by the AMALGAM tagging project. This project accepts examples to be tagged by e-mail, and will tag the examples with most of the major tagsets, such as those created to tag the LOB, Brown, and ICE corpora. For more information on this project and on how the Brill Tagger was trained to insert the various tagsets, see Atwell et al. (2000) as well as the project website: http://agora.leeds.ac.uk/amalgam/amalgam/amalghome.htm. The UCREL research centre at the University of Lancaster also provides online tagging with CLAWS4 at: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html.

[4] In the current ICE tagset, 'm would be tagged as aux (prog, pres, encl), that is, as a present-tense progressive enclitic auxiliary.
Taggers are of two types: rule-based or probabilistic. In a rule-based tagger, tags are inserted on the basis of rules of grammar written into the tagger. One of the earlier rule-based taggers was the TAGGIT program, designed by Greene and Rubin (1971) to tag the Brown Corpus and described in detail in Francis (1979: 198-206). The first step in the tagging process, Francis (1979) notes, is to look up a given word in the program's lexicon; if the word is found, it is assigned however many tags are associated with it in the lexicon. If after this search the word is not found, an attempt is made to match the ending of the word with a list of suffixes and the word-class tags associated with the suffixes. If this search also fails to find a match, the word is arbitrarily assigned three tags: singular or mass noun, verb (base form), or adjective, the three form-class designations that the majority of words in English fall into.
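This lexicon-then-suffix-then-default cascade can be approximated with NLTK's backoff taggers. The sketch below is not TAGGIT itself: the lexicon is learned from the tagged Brown Corpus rather than hand-built, and a single default tag stands in for TAGGIT's three-way default assignment.

import nltk
from nltk.corpus import brown  # requires nltk.download('brown')

train = brown.tagged_sents(categories='news')

# Step 3: words unknown to both lexicon and suffix list get a default tag.
default = nltk.DefaultTagger('NN')
# Step 2: otherwise, guess a tag from the word's final three letters.
suffix = nltk.AffixTagger(train, affix_length=-3, backoff=default)
# Step 1: first, look the whole word up in a lexicon learned from the corpus.
lexicon = nltk.UnigramTagger(train, backoff=suffix)

print(lexicon.tag('The ships are sailing .'.split()))
# e.g. [('The', 'AT'), ('ships', 'NNS'), ('are', 'BER'), ('sailing', 'VBG'), ('.', '.')]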
Of the words reaching this stage of analysis, 61 percent will have one tag, and 51 percent of the remaining words will have suffixes associated with one tag. The remaining words will have more than one tag and thus are candidates for disambiguation. Initially, this is done automatically by a series of 'context frame rules' that look at the context in which the word occurs. For instance, in the sentence The ships are sailing, the word ships will have two tags: plural noun and third-person-singular verb. The context frame rules will note that ships occurs following an article, and will therefore remove the verb tag and assign the plural noun tag to this word. Although the context frame rules can disambiguate a number of tags, the process is complicated, as Francis (1979: 202) observes, by the fact that many of the words surrounding a word with multiple tags will themselves have multiple tags. Consequently, 23 percent of the remaining tags had to be manually disambiguated: analysts had to look at each example and decide which tag was most appropriate, a process that itself was subject to error and inconsistency and that led to further post-editing.
Although rule-based taggers have been largely superseded by probability-based taggers, there is one current rule-based tagger, EngCG-2 (cf. Samuelsson and Voutilainen 1997), that has been designed to overcome some of the problems inherent in early rule-based taggers like TAGGIT. In particular, rules in EngCG-2 have wider application than in TAGGIT and are able to 'refer up to sentence boundaries (rather than the local context alone)' (Voutilainen 1999: 18). This capability has greatly improved the accuracy of EngCG-2 over TAGGIT.
Because rule-based taggers rely on rules written into the tagger, most taggers developed since TAGGIT have been probabilistic in nature: they assign tags based on the statistical likelihood that a given tag will occur in a given context. Garside and Smith (1997: 104) give the example of the construction the run beginning a sentence. Because run is preceded by the determiner the, there is a high probability that run is a noun rather than a verb. The advantage of probabilistic taggers is that they can be trained on corpora and over time develop very accurate probabilities. However, even though probabilistic and some rule-based taggers such as EngCG-2 can achieve accuracy rates exceeding 95 percent, the remaining inaccuracies can be more extensive than one might think.
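The contextual probabilities involved can be estimated directly from a tagged corpus. The following sketch uses NLTK and the Brown Corpus to compare how often a noun versus a base-form verb follows an article; a real probabilistic tagger such as CLAWS combines far more context than the single preceding tag used here.

import nltk
from nltk.corpus import brown  # requires nltk.download('brown')

# Count how often each tag follows each other tag in running text.
tags = [tag for sent in brown.tagged_sents(categories='news')
        for (word, tag) in sent]
following = nltk.ConditionalFreqDist(nltk.bigrams(tags))

# After an article (AT), nouns vastly outnumber base-form verbs, so a
# word like "run" occurring after "the" is best tagged as a noun.
print(following['AT']['NN'], following['AT']['VB'])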
Kennedy (1996: 253) notes that such rates of accuracy involve averaging: computing the success rate of tagging by combining figures for constructions (such as the definite article the) that have high frequencies and that can be tagged accurately almost all the time with other constructions (such as the word once, which can be an adverbial or a subordinating conjunction) that can be tagged accurately only 80-85 percent of the time. If a corpus is not post-edited after tagging is done (as was the case with the initial release of the British National Corpus), the rate of inaccurate tagging for certain words can be quite high.
For instance, after the Wellington Corpus of Written New Zealand English was
tagged with the CLAWS tagger, Kennedy (1996: 255) found that more than
20 percent of the instances of once were wrongly tagged. In the sentence Once
implemented, it will be with us, once was tagged erroneously as an adverbial
rather than a subordinating conjunction. In other instances, adverbial uses of
once were tagged as conjunctions. Multiplied over a large corpus, error rates
as high as 20 percent can lead to considerable inaccuracy, a point that users of
corpora need to be aware of, especially if they intend to do an automatic analysis
of the corpus without actually looking carefully at the data being analyzed.
Because of the time and effort it takes to post-edit a tagged corpus, in the
future we are more likely to see corpora that are released for use without any
post-editing.
Because taggers cannot 'sanitize' the data they have to work with, it is easy to understand why they cannot tag with complete accuracy the words of any human language. Human language is full of 'unusual and idiosyncratic phenomena' (Smith 1997: 147) that taggers have to account for. For example, consider how the AMALGAM tagger analyzed the sentence What's he want to prove?
what/PRON(nom)
's/V(cop,pres,encl)
he/PRON(pers,sing)
want/V(montr,pres)
to/PRTCL(to)
prove/V(montr,infin)
?/PUNC(qm)
The full uncontracted form of this sentence would be What does he want to prove? However, because 's is usually a contracted form of is, the tagger has wrongly tagged 's as the present-tense enclitic form of the copula be. To deal with problematic cases of this type, those designing the CLAWS tagger have
developed a series of 'corpus patches' (Smith 1997: 145-7): context-sensitive rules that take the output of data tagged with CLAWS and attempt to correct persistently problematic cases that have been identified by previous tagging. However, even with these rules, there remains a 2 percent rate of error in the tagging.
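A corpus patch amounts to a context-sensitive search-and-replace over tagger output. The sketch below is illustrative only (the rule, the function, and the replacement tag are ours, not CLAWS's internal format): it retags 's as a form of do when a subject pronoun and a bare verb follow.

def patch_enclitic_s(tagged):
    # In "What's he want to prove?", 's abbreviates "does", not "is":
    # the following subject pronoun plus base-form verb gives this away.
    for i, (word, tag) in enumerate(tagged):
        if (word == "'s" and tag.startswith('V(cop')
                and i + 2 < len(tagged)
                and tagged[i + 1][1].startswith('PRON')
                and tagged[i + 2][1].startswith('V(')):
            tagged[i] = (word, 'AUX(do,pres,encl)')
    return tagged

sentence = [('what', 'PRON(nom)'), ("'s", 'V(cop,pres,encl)'),
            ('he', 'PRON(pers,sing)'), ('want', 'V(montr,pres)')]
print(patch_enclitic_s(sentence))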
The tagsets used to annotate corpora are as varied as the taggers used to insert them. For instance, the Brown Corpus tagset contains seventy-seven tags (Francis and Kučera 1982), the ICE tagset 262 tags (Greenbaum and Ni 1996: 93).[5]
A listing of tags for the auxiliary verb do illustrates the manner in which these two tagsets differ. The Brown tagset contains only three tags for do that specify the main surface structure forms that this verb takes:

Tag   Form
DO    do
DOD   did
DOZ   does
The ICE tagset, on the other hand, contains eight tags for do that provide a more exhaustive listing of the various forms that this auxiliary has (Greenbaum and Ni 1996: 101):

Tag                     Form
Aux (do, infin)         Do sit down
Aux (do, infin, neg)    Don't be silly
Aux (do, past)          Did you know that?
Aux (do, past, neg)     You didn't lock the door
Aux (do, present)       I do like you
Aux (do, pres, encl)    What's he want to prove?
Aux (do, pres, neg)     You just don't understand
Aux (do, pres, procl)   D'you like ice-cream?
The differences in the tagsets reflect differing conceptions of English grammar. The Brown tagset with its three forms of do is based on a more traditional view of the forms that this auxiliary takes, a view that is quite viable because this tagset was designed to apply to a corpus of edited written English. The ICE tagset, in contrast, is based on the view of grammar articulated in Quirk et al. (1985). Not only is this grammar very comprehensive, taking into consideration constructions with low frequencies in English, but it is one of the few reference grammars of English to be based on speech as well as writing. Consequently, the ICE tagset is not just detailed but intended to account for forms of do that would be found in speech, such as the proclitic form D'you. The more recent Longman Grammar of Spoken and Written English is also based on a corpus, the Longman Spoken and Written English Corpus, that was tagged by a tagger whose tagset was sensitive to distinctions made in speech and writing (cf. Biber
et al. 1999: 35-8). Although it might seem that a larger tagset would lead to more frequent tagging errors, research has shown that just the opposite is true: the larger the tagset, the greater the accuracy of tagging (Smith 1997: 140-1; León and Serrano 1997: 154-7).

[5] The number of tags for the ICE tagset is approximate. As Greenbaum and Ni (1996: 93) emphasize, new constructions are always being encountered, causing additional tags to be added to the tagset.
4.3 Parsing a corpus
Tagging has become a very common practice in corpus linguistics, largely because taggers have evolved to the point where they are highly accurate: many taggers can automatically tag a corpus (with no human intervention) at accuracy rates exceeding 95 percent. Parsing programs, on the other hand, have much lower accuracy rates (70-80 percent at best, depending upon how correctness of parsing is defined [Leech and Eyes 1997: 35 and 51, note 3]), and they require varying levels of human intervention.
Because tagging and parsing are such closely integrated processes, many parsers have taggers built into them. For instance, the Functional Dependency Grammar of English (the EngFDG Parser) contains components that assign not only syntactic functions to constituents but part-of-speech tags to individual words (Voutilainen and Silvonen 1996). The TOSCA Parser has similar capabilities (Aarts, van Halteren, and Oostdijk 1996). Like taggers, parsers can be either probabilistic or rule-based, and the grammars that underlie them reflect particular conceptions of grammar, even specific grammatical theories, resulting in a variety of different parsing schemes, that is, different systems of grammatical annotation that vary both in detail and in the types of grammatical constructions that are marked (cf. the AMALGAM MultiTreebank for examples of differing parsing schemes: http://www.scs.leeds.ac.uk/amalgam/amalgam/multi-tagged.html).
Although there is an ongoing debate among linguists in the field of natural language processing concerning the desirability of probabilistic vs. rule-based parsers, both kinds of parsers have been widely used to parse corpora. Proponents of probabilistic parsers have seen them as advantageous because they are able to parse 'rare or aberrant kinds of language, as well as more regular, run-of-the-mill types of sentence structures' (Leech and Eyes 1997: 35). This capability is largely the result of the creation of treebanks, which aid in the training of parsers. Treebanks, such as the Lancaster Parsed Corpus and the Penn Treebank, are corpora containing sentences that have been either wholly or partially parsed, and a parser can make use of the already parsed structures in a treebank to parse newly encountered structures and improve the accuracy of the parser. The example below contains a parsed sentence from the Lancaster Parsed Corpus:
A01 2
[S[N a AT move NN [Ti[Vi to TO stop VB Vi][N \0Mr NPT Gaitskell NP N][P from IN [Tg[Vg nominating VBG Vg][N any DTI more AP labour NN life NN peers NNS N]Tg]P]Ti]N][V is BEZ V][Ti[Vi to TO be BE made VBN Vi][P at IN [N a AT meeting NN [Po of INO [N labour NN \0MPs NPTS N]Po]N]P][N tomorrow NR N]Ti] . . S]
The first line of the example indicates that this is the second sentence from sample A01 (the press reportage genre) of the LOB Corpus, sections of which (mainly shorter sentences) are included in the treebank. Open and closed brackets mark the boundaries of constituents: [S marks the opening of the sentence, S] the closing; the [N preceding a move marks the beginning of a noun phrase, N] following to stop its ending. Other constituent boundaries marked in the sentence include Ti (the to-infinitive clause to stop Gaitskell from . . .), Vi (the non-finite infinitive clause to stop), and Vg (the non-finite -ing participle clause nominating). Within each of these constituents, every word is assigned a part-of-speech tag: a, for instance, is tagged AT, indicating it is an article; move is tagged NN, indicating it is a singular common noun; and so forth.[6]

[6] A full listing of the labels in the above example can be found in Garside, Leech, and Váradi (1992).
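Once a treebank is machine-readable, its annotation can be queried programmatically. The sketch below uses NLTK's Tree class on a simplified, Penn-style rebracketing of part of the example (the Lancaster format above places labels after, not before, their constituents, so it would need its own reader):

from nltk import Tree

parse = Tree.fromstring(
    '(N (AT a) (NN move) (Ti (TO to) (VB stop) (N (NPT Mr) (NP Gaitskell))))')

print(parse.leaves())  # ['a', 'move', 'to', 'stop', 'Mr', 'Gaitskell']
for np in parse.subtrees(lambda t: t.label() == 'N'):
    print(' '.join(np.leaves()))  # prints each noun phrase in the tree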
Although many treebanks have been released and are available for linguistic analysis, their primary purpose is to train parsers to increase their accuracy. To create grammatically analyzed corpora intended more specifically for linguistic analysis, the TOSCA Group at Nijmegen University developed the TOSCA Parser, a rule-based parser that was used to parse the Nijmegen Corpus and sections of the British component of ICE (ICE-GB). As the parse tree taken from ICE-GB in figure 4.1 illustrates, the grammar underlying the TOSCA Parser (described in detail in Oostdijk 1991) recognizes three levels of description: functions, categories, and features.
Figure 4.1 Parse tree from ICE-GB
The first level of description, functions, is specified both within the clause and the phrase. For instance, Heavy rainfall is functioning as subject (SU) in the main clause in which it occurs; in the noun phrase itself, Heavy is functioning as an adjective premodifier (AJP). Categories are represented at both the phrase and word level: Heavy rainfall is a noun phrase (NP), rainfall a noun (N) that is also head of the noun phrase (NPHD). Features describe various characteristics of functions or categories. For instance, the noun rainfall is a common noun (com) that is singular (sing).[7]

[7] A listing of function, category, and feature labels used in the parsing of ICE-GB can be found in the Quick Guide to the ICE-GB Grammar (http://www.ucl.ac.uk/english-usage/ice-gb/grammar.htm).
The TOSCA Parser provides such detailed grammatical information because it was designed primarily not to create a treebank to be used to increase the accuracy of the parser but to produce a grammatically annotated corpus that yields 'databases containing detailed information that may be used by linguists . . . and [that] allows for the testing of linguistic hypotheses that have been formulated in terms of the formal grammar' (Oostdijk 1991: 64). And, indeed, the two corpora parsed by the TOSCA Parser, the Nijmegen Corpus and ICE-GB, can be used with software programs that extract information from the corpora: the Linguistic Database (LDB) Program for the Nijmegen Corpus (van Halteren and van den Heuvel 1990) and the ICE Corpus Utility Program (ICECUP) for ICE-GB (cf. section 5.1.3).
While the development of taggers has 'flourished in recent years' (Fang 1996: 110), parsers are still in a state of development. Taggers such as the Brill Tagger or CLAWS4 are readily available, and can be easily run on a corpus, once the corpus is in a form the tagger can accept. The output of the tagger will have to be post-edited, a process that can take some time. Nevertheless, tagging a corpus is currently a fairly common practice in corpus linguistics. Parsing, in contrast, is a much more involved undertaking, largely because of the complexity of structures that a parser is required to analyze, especially if spoken as well as written data is being parsed, and the subsequent increase in the amount of post-editing of the parsed output that needs to be done.
A parser has a much greater range of structures to analyze than a tagger: not just individual words but phrases and clauses as well. And because phrases and clauses are considerably complex in structure, there are numerous constructions, such as those that are coordinated, that are notoriously difficult to parse correctly. Figure 4.2 contains the parsed output of the sentence The child broke his arm and his wrist and his mother called a doctor after it was submitted to the EngFDG Parser.[8]

[8] Cf. Tapanainen and Järvinen (1997) for a more detailed discussion of the EngFDG Parser and the grammatical annotation it inserts into parsed texts such as those in figure 4.2. This is a rule-based parser influenced by the view of grammar articulated in dependency grammars, that is, grammars such as those described in Mel'čuk (1987) that instead of grouping constituents hierarchically (as parsers based on phrase structure grammars do), break each word down into its constituent parts, noting dependency relationships between constituents. The parsed output in figure 4.2 was obtained by using the demonstration version of the EngFDG parser available on the Conexor website at: http://www.conexor.fi/testing.html#1.
0
1 The the det:>2 @DN> DET SG/PL
2 child child subj:>3 @SUBJ N NOM SG
3 broke break main:>0 @+FMAINV V PAST
4 his he attr:>5 @A> PRON PERS MASC GEN SG3
5 arm arm obj:>3 @OBJ N NOM SG
6 and and @CC CC
7 his he attr:>8 @A> PRON PERS MASC GEN SG3
8 wrist wrist @SUBJ N NOM SG @OBJ N NOM SG @PCOMPL-S N NOM SG @A> N NOM SG
9 and and cc:>3 @CC CC
10 his he attr:>11 @A> PRON PERS MASC GEN SG3
11 mother mother subj:>12 @SUBJ N NOM SG
12 called call cc:>3 @+FMAINV V PAST
13 a a det:>14 @DN> DET SG
14 doctor doctor obj:>12 @OBJ N NOM SG

Figure 4.2 Parsed output from the EngFDG Parser
The example in figure 4.2 contains the coordinator and conjoining two noun phrases, his arm and his wrist, as well as two main clauses. Because coordinators such as and can be used to conjoin phrases as well as clauses, it can be difficult for parsers to distinguish phrasal from clausal coordination. In the example in figure 4.2, the parser is unable to determine whether his wrist is coordinated with his arm or his mother. Therefore, his wrist is assigned two function labels: subject (@SUBJ) to indicate that his wrist and his mother are coordinated noun phrases functioning as subject of the second clause; and object (@OBJ), which is actually the correct parse, to indicate that his arm and his wrist are coordinated noun phrases functioning as object of the verb broke in the first clause. The difficulty that parsers have with coordinated constructions is further compounded by the fact that coordination is so common (the one-million-word ICE-GB contains over 20,000 instances of and), increasing the likelihood of multiple parses for coordinated constructions.
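The ambiguity can be made concrete as two competing bracketings; the sketch below uses conventional phrase-structure labels rather than the EngFDG dependency notation of figure 4.2.

from nltk import Tree

# Correct reading: "his arm and his wrist" is the coordinated object of "broke".
correct = Tree.fromstring(
    '(S (S (NP the child) (VP broke (NP (NP his arm) (CC and) (NP his wrist))))'
    ' (CC and) (S (NP his mother) (VP called (NP a doctor))))')

# Competing reading the parser must also consider and rule out:
# "his wrist and his mother" as a coordinated subject of "called".
competing = Tree.fromstring(
    '(S (S (NP the child) (VP broke (NP his arm))) (CC and)'
    ' (S (NP (NP his wrist) (CC and) (NP his mother)) (VP called (NP a doctor))))')

correct.pretty_print()  # draw the correct analysis as an ASCII tree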
An additional level of complexity is introduced if a parser is used to parse spoken as well as written data. In this case, the parser has to deal with the sheer ungrammaticality of speech, which poses major challenges for parsers, especially those that are rule-based. For instance, the utterance below, taken from
ICE-USA, is typical of the ungrammatical nature of speech: it has numerous false starts, leading to a series of partially formed sentences and clauses:

but anyway you were saying Peggy that well I was asking you the general question what um how when you've taken courses in the linguistics program have you done mainly just textbook stuff or have you actually this is actual hardcore linguistic scholarship (cited in Greenbaum 1992: 174)
In a rule-based parser, it is impossible to write rules for utterances containing repetitions, since such utterances have no consistent, predictable structure and thus conform to no linguistic rule. In addition, while a written text will contain clearly delineated sentences marked by punctuation, speech does not. Therefore, with examples such as the above, it is difficult to determine exactly what should be submitted to the parser as a 'parse unit': the stretch of language the parser is supposed to analyze (Gerald Nelson, personal communication).
The increased complexity of structures that a parser has to analyze leads to multiple parses in many cases, making the process of disambiguation more complex than it is with the output of a tagger. With tagging, disambiguation involves selecting one tag from a series of tags assigned to an individual word. With parsing, in contrast, disambiguation involves not single words but higher-level constituents; that is, instead of being faced with a situation where it has to be decided whether a word is a noun or a verb or an adjective, the analyst has to select the correct parse tree from any number of different parse trees generated by the parser. In using the TOSCA Parser, for instance, Willis (1996: 238) reports that after running the parser on a sentence, if a single correct parse tree was not generated, he had to select from a dozen to (on some occasions) 120 different parse trees. There is software to help in this kind of post-editing, such as ICETree (described at: http://www.ucl.ac.uk/english-usage/ice-gb/icetree/download.htm), a general editing program that was designed to aid in the post-editing of corpora tagged and parsed for the ICE project, and the Nijmegen TOSCA Tree editor (described at: http://lands.let.kun.nl/TSpublic/tosca/page4.html), which can also be used to post-edit the output of the TOSCA Parser. But still, any time there is manual intervention in the process of creating a corpus, the amount of time required to complete the corpus multiplies exponentially.
To increase the accuracy of parsing and therefore reduce the amount of post-editing that is required, parsers take differing approaches. The TOSCA Parser requires a certain amount of manual pre-processing of a text before it is submitted to the parser: syntactic markers are inserted around certain problematic constructions (cf. Oostdijk 1991: 263-7; Quinn and Porter 1996: 71-4), and spoken texts are 'normalized'. To ensure that coordinated structures are correctly parsed, their beginnings and ends are marked. In the sentence The child broke his arm and his wrist and his mother called a doctor (the example cited above), markers would be placed around his arm and his wrist to indicate that these two noun phrases are coordinated, and around the two main clauses that are coordinated to indicate that the coordinator and following wrist conjoins two clauses. Other problematic constructions that had to be marked prior to parsing were noun phrase postmodifiers, adverbial noun phrases, appositive noun phrases, and vocatives.
To successfully parse instances of dysfluency in speech, the parser requires that, in essence, an ungrammatical utterance be made grammatical: markup has to be manually inserted around certain parts of an utterance that both tells the parser to ignore this part of the utterance and creates a grammatically well-formed structure to parse. In the example below, the speaker begins the utterance with two repetitions of the expression can I before finally completing the construction with can we:

can I can I can we take that again (ICE-GB S1A-001016)

To normalize this construction, markup is inserted around the two instances of can I (<-> and </->) that tells the parser to ignore these two expressions and parse only can we take that again, which is grammatically well formed:

<sent> <}> <-> Can I can I </-> <=> can we </=> </}> take that again <$?>
Arguably, normalization compromises the integrity of a text, especially if the
individual doing the normalization has to make decisions about how the text
is to be edited. And because normalization must be done manually, it can be
time-consuming as well. But normalization has advantages too. In constructions
such as the above, if the repetitions are not annotated, they will be included in
any lexical analysis of the corpus, leading to inaccurate word counts, since can,
for instance, will be counted three times, even though it is really being used
only once. In addition, even though repetitions and other features of speech are
excluded from analysis, if they are properly annotated, they can be recovered
if, for instance, the analyst wishes to study false starts.
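A sketch of how such markup might be handled at analysis time (the <-> and <=> tags follow the ICE example above; the regular expressions and function names are ours):

import re

utterance = '<-> Can I can I </-> <=> can we </=> take that again'

def normalized(text):
    # Drop the spans the parser is told to ignore, unwrap the replacements:
    # "can" is then counted once in word counts, not three times.
    text = re.sub(r'<->.*?</->', ' ', text)
    text = re.sub(r'</?=>', ' ', text)
    return ' '.join(text.split())

def false_starts(text):
    # Recover the ignored material, e.g. for a study of false starts.
    return re.findall(r'<->\s*(.*?)\s*</->', text)

print(normalized(utterance))    # can we take that again
print(false_starts(utterance))  # ['Can I can I']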
Even though parsing a corpus is a formidable task, there do exist a number of parsed corpora available for linguistic research, though many of them are relatively short by modern standards. In addition to the parsed corpora described above (the Nijmegen Corpus, 130,000 words; the Lancaster Parsed Corpus, 140,000 words; and ICE-GB, one million words), there are also the Lancaster/IBM Spoken English Corpus (52,000 words), the Polytechnic of Wales Corpus (65,000 words), the Penn-Helsinki Parsed Corpus of Middle English (1.3 million words), and the Susanne Corpus (128,000 words). The Penn Treebank is larger (3,300,000 words) than all of these corpora, but aside from the texts taken from the Brown Corpus, this corpus is not balanced, consisting primarily of reportage from Dow Jones news reports. The ENGCG Parser (an earlier version of the EngFDG Parser) has been used to parse sections of the Bank of English Corpus (over 100 million words as of 1994), with the goal of parsing the entire corpus (cf. Järvinen 1994 for details).
4.4 Other types of tagging and parsing
The previous sections described the tagging of words and the parsing of
constituents into syntactic units. While these processes are very well established
in the field of corpus linguistics, there are other less common types of tagging
and parsing as well.
4.4.1 Semantic tagging
Semantic tagging involves annotating a corpus with markup that specifies various features of meaning in the corpus. Wilson and Thomas (1997) describe a number of systems of semantic tagging that have been employed. In each of these systems, words in corpora are annotated with various schemes denoting their meanings. For instance, in one scheme, each word is assigned a 'semantic field tag' (Wilson and Thomas 1997: 61): a word such as cheeks is given the tag 'Body and Body Parts', a word such as lovely the tag 'Aesthetic Sentiments', and so forth. Schemes of this type can be useful for the creation of dictionaries, and for developing systems that do content analysis, that is, that search documents for particular topics so that users interested in those topics can have automatic analysis of a large database of documents.
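The basic mechanism is a lexicon lookup, as in the sketch below (the first two field labels are from Wilson and Thomas's examples; the third is invented, and real systems must also handle ambiguous words and multiword units):

FIELDS = {
    'cheeks': 'Body and Body Parts',
    'lovely': 'Aesthetic Sentiments',
    'gazed': 'Sensory: Sight',  # invented label, for illustration
}

def semantic_tag(words):
    return [(w, FIELDS.get(w.lower(), 'Unmatched')) for w in words]

print(semantic_tag('He gazed at her lovely cheeks'.split()))
# [('He', 'Unmatched'), ('gazed', 'Sensory: Sight'), ...,
#  ('lovely', 'Aesthetic Sentiments'), ('cheeks', 'Body and Body Parts')]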
4.4.2 Discourse tagging
In addition to annotating the meaning of words in corpora, semantic systems of tagging have also looked at such semantic phenomena as anaphora: the chaining together of co-referential links in a text. This is a type of discourse tagging, whereby features of a text are annotated so that analysts can recover the discourse structure of the text. Rocha (1997) describes a system he developed to study topics in conversations. In this system, Rocha (1997) developed an annotation scheme that allowed him to classify each anaphor he encountered in a text and mark various characteristics of it. For instance, each anaphor receives annotation specifying the type of anaphor it is: personal pronouns are marked as being either subject pronouns (SP) or object pronouns (OP); demonstratives as (De); possessives as (PoP); and so forth (Rocha 1997: 269). Other information is also supplied, such as whether the antecedent is explicit (ex) or implicit (im). Rocha (1997) used this scheme to compare anaphor resolution in English and Portuguese, that is, the chain of links that is necessary to ultimately resolve the reference of an anaphor in a text. Rocha's (1997: 277) preliminary results show that most anaphors are personal pronouns, and their reference is predominantly explicit.
4.4.3 Problem-oriented tagging
De Haan (1984) coined the term 'problem-oriented tagging' to describe a method of tagging that requires the analyst to define the tags to be used
and to assign them manually to the constructions to be analyzed. For instance, Meyer (1992) used this method to study appositions in the Brown, London-Lund, and Survey of English Usage corpora. Each apposition that Meyer (1992) identified was assigned a series of tags. An apposition such as one of my closest friends, John in the sentence I called one of my closest friends, John would be assigned various tag values providing such information as the syntactic form of the apposition, its syntactic function, whether the two units of the apposition were juxtaposed or not, and so forth. Within each of these general categories was a range of choices that were assigned numerical values. For instance, Meyer (1992: 136-7) found that the appositions in the corpora he analyzed had seventy-eight different syntactic forms. The above apposition had the form of an indefinite noun phrase followed by a proper noun, a form that was assigned the numerical value of (6). By assigning numerical values to tags, the tags could be subjected to statistical analysis.
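In present-day terms, a problem-oriented tag is simply a hand-filled data record per construction, ready for statistical software. A sketch (the field names and all code values except form 6 are invented for illustration):

from dataclasses import dataclass

@dataclass
class Apposition:
    example: str
    form: int         # e.g. 6 = indefinite noun phrase + proper noun
    function: int     # analyst-defined code for syntactic function
    juxtaposed: bool

data = [Apposition('one of my closest friends, John', 6, 1, True)]

# Numeric codes make frequency counts and statistical tests straightforward.
print(sum(1 for a in data if a.form == 6))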
Because problem-oriented tagging has to be done manually, it can be very time-consuming. However, there is a software program, PC Tagger, that can be used to expedite this type of tagging (Meyer and Tenney 1993). The advantage of problem-oriented tagging is that the analyst can define the tags to be used and is not constrained by someone else's tagset. In addition, the tagging can be quite detailed and permit grammatical studies that could not be carried out on a corpus that has only been lexically tagged.
The term problem-oriented tagging suggests that this type of analysis involves mainly tagging. But it is really a process of parsing as well: constituents larger than the word are assigned syntactic labels.
4.5 Conclusions
In recent years, corpus linguists of all persuasions have been actively involved in developing systems of annotation for corpora. The Linguistic Data Consortium (LDC) at the University of Pennsylvania, for instance, has a whole web page of links to various projects involved in either annotating corpora or developing tools to annotate them ('Linguistic Annotation': http://www.ldc.upenn.edu/annotation/). This is an important development in corpus linguistics, since an annotated corpus provides the corpus user with a wealth of important information.
But while it is important that systems of annotation are developed, it is equally important that corpus linguists develop tools that help reduce the amount of time it takes to annotate corpora and that help users understand the complex conventions underlying many annotation systems. The Text Encoding Initiative (TEI) has developed a comprehensive system of structural markup. However, it is very time-consuming to insert the markup, and many users may have difficulty understanding how the TEI system actually works. Likewise, parsing a corpus is a very involved process that requires the corpus creator not just to spend
considerable time post-editing the output of the parser but to have a conceptual
understanding of the grammar underlying the parser. Because the process of
annotating corpora involves the manual intervention of the corpus creator at
various stages of annotation, in the foreseeable future, annotating a corpus will
continue to be one of the more labor-intensive parts of creating a corpus.
Study questions
1. Why is it necessary for a corpus in ASCII text format to contain structural
markup? What kinds of features would users of such a corpus not be able to
recover if the samples within the corpus did not contain structural markup?
2. How would a probabilistic tagger determine that in the sentence The child
likes to play with his friends, the word play is a verb rather than a noun?
3. Make up a short sentence and submit it to the online taggers below:
CLAWS part-of-speech tagger:
http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
EngCG-2:
http://www.conexor.fi/testing.html#1
Do the two taggers assign similar or different part-of-speech labels to your sample sentence? Briefly summarize the similarities and differences.
4. Why is parsing a much more difficult and involved process than tagging?
5 Analyzing a corpus
The process of analyzing a completed corpus is in many respects similar to the process of creating a corpus. Like the corpus compiler, the corpus analyst needs to consider such factors as whether the corpus to be analyzed is lengthy enough for the particular linguistic study being undertaken and whether the samples in the corpus are balanced and representative. The major difference between creating and analyzing a corpus, however, is that while the creator of a corpus has the option of adjusting what is included in the corpus to compensate for any complications that arise during the creation of the corpus, the corpus analyst is confronted with a fixed corpus, and has to decide whether to continue with an analysis if the corpus is not entirely suitable, or find a new corpus altogether.
This chapter describes the process of analyzing a completed corpus. It begins with a discussion of how to frame a research question so that, from the start, the analyst has a clear hypothesis to test in a corpus and avoids the common complaint voiced about corpus-based analyses: that many such analyses do little more than simply count linguistic features in a corpus, paying little attention to the significance of the counts. The next sections describe the process of doing a corpus analysis: how to determine whether a given corpus is appropriate for a particular linguistic analysis, how to extract grammatical information relevant to the analysis, how to create data files for recording the grammatical information taken from the corpus, and how to determine the appropriate statistical tests for analyzing the information in the data files that have been created.
To illustrate the process of analyzing a corpus, this chapter will focus on a specific corpus analysis: the study of a particular grammatical construction, termed the 'pseudo-title' (Bell 1988), in various components of the International Corpus of English (ICE). Pseudo-titles are constructions such as linguist in linguist Noam Chomsky that occur in positions in which we normally find titles like Professor or Doctor. However, unlike titles, pseudo-titles do not function as honorifics (markers of deference and respect) but instead provide descriptive information about the proper nouns that they precede (e.g. that Noam Chomsky is a linguist). The case study approach is taken in this chapter to avoid using disparate and unrelated examples to illustrate the process of doing a corpus analysis and to demonstrate that each stage of corpus analysis is related.[1]

[1] For additional case studies, see Biber, Conrad, and Reppen (1998).
5.1 The pseudo-title in English: framing a research question
To begin to frame a research question about pseudo-titles, it is first of all necessary to understand the history of the construction and how it has evolved in English.
Pseudo-titles are thought to have originated in American English, specifically in journalistic writing. Quirk et al. (1985: 276, note) refer to pseudo-titles as characteristic of 'Timestyle', a style of journalistic writing popularized by Time magazine. Although pseudo-titles may have originated in American press writing, they can currently be found in the press writing of both British and New Zealand English. However, as Rydén (1975), Bell (1988), and Meyer (1992) have documented, while pseudo-titles are not stylistically marked in American English (they occur widely in all types of American newspapers and magazines), they are stigmatized in British English: their use is avoided in more formal newspapers such as The Times and the Guardian, and they are found mainly in newspapers associated with what is now termed 'tabloid journalism'.
The stigma against pseudo-titles in British English is well documented in style guides for British newspapers. For instance, The Independent Style Manual remarks that pseudo-titles should be avoided because 'people's jobs are not titles' (p. 6). In newspapers prohibiting the use of pseudo-titles, the information contained in the pseudo-title is instead expressed in an equivalent appositional construction (e.g. the linguist Noam Chomsky or Noam Chomsky, a linguist). But while style guides may prohibit certain usages such as the pseudo-title, it is an open question whether practice follows prescription, and a well-designed corpus study can be used to examine the relationship between prescription and practice. Bell (1988), for instance, studied 3,500 instances of pseudo-titles taken from a large corpus of British, American, and New Zealand newspapers. Using this corpus, Bell (1988) was able to confirm that while pseudo-titles were quite common in American press writing, they were absent from British newspapers that prohibited their usage. Additionally, Bell (1988) was able to document that New Zealand press writing had moved away from the British norm for pseudo-titles to the American norm, that is, that over the years, pseudo-titles had gained in prominence and acceptability in New Zealand press writing. This change in usage norms, Bell (1988: 326-7) argues, reflects the fact that the colonial influence of British culture on New Zealand culture had been weakening in recent years, and that New Zealand English was 'refocusing towards the United States . . .', a change that has led New Zealand English closer to the American norm for pseudo-titles. The size of Bell's (1988) corpus also enabled him to study the linguistic structure of pseudo-titles, to document, for instance, the extent to which newspapers favored pseudo-titles over equivalent appositional structures, and to show that there were certain linguistic structures that favored or disfavored the use of a pseudo-title. For instance, he
found that pseudo-titles favored minimal postmodification. Consequently, a construction such as lawyer Frederick Smith was more common than ?lawyer for Hale, Brown, and Jones Frederick Smith.
To evaluate whether Bell's (1988) study of pseudo-titles uses a corpus to answer valid research questions, and to determine whether his study warrants further corpus studies of pseudo-titles, it is important to understand the kinds of linguistic evidence that corpora are best able to provide. Since the Chomskyan revolution of the 1950s, linguistics has regarded itself as a science in which empirical evidence is used to advance linguistic theory. For the Chomskyan linguist, this evidence was often obtained through introspection: the gathering of data based on the linguist's own intuitions. Because many corpus linguists have found this kind of evidence limiting (cf. section 1.1), they have turned to the linguistic corpus as a better source of empirical evidence: for real rather than contrived examples of linguistic constructions and for statistical information on how frequently a given linguistic construction occurs in a corpus.
In turning to the corpus for evidence, however, many corpus linguists have regarded the gathering of evidence as a primary goal: in some corpus studies, it is not uncommon to see page after page of tables with statistical information on the frequency of grammatical constructions, with little attention paid to the significance of the frequencies. Corpus-based research of this nature, Aarts (2001: 7) notes,

invariably elicits a 'so what' response: so what if we know that there are 435 instances of the conjunction because in a particular category of written language, whereas there are only 21 instances in conversations? So what if we are told that subordination is much more common in women's speech than in men's speech?
To move beyond simply counting features in a corpus, it is imperative before undertaking a corpus analysis to have a particular research question in mind, and to regard the analysis of a corpus as both qualitative and quantitative research: research that uses statistical counts or linguistic examples to test a clearly defined linguistic hypothesis.

In using a corpus to study pseudo-titles, Bell (1988) has obviously done more than simply count the number of pseudo-titles in his corpus. He has used a corpus, for instance, to describe the particular linguistic structures that pseudo-titles favor, and he has used frequency counts to document the increased prevalence of pseudo-titles in New Zealand press writing. Because Bell's (1988) study suggests that pseudo-titles may be spreading beyond British and American English to other regional varieties of English, his findings raise additional questions worthy of further investigation of corpora:

(a) To what extent have pseudo-titles spread to other varieties of English, and in these varieties, is the American norm followed or the British norm?
(b) Do pseudo-titles have the same structure in these varieties that they have in British, American, and New Zealand English?
(c) To what extent do newspapers in these varieties prefer pseudo-titles over equivalent appositional structures?

The first step in addressing these questions is to determine whether suitable corpora exist for answering them.

5.2 Determining whether a corpus is suitable for answering a particular research question
Bell's (1988) study of pseudo-titles was based on a large corpus of British, American, and New Zealand press writing that Bell assembled himself. Although this is the ideal way to ensure that the corpus being used is appropriate for the analysis being conducted, this approach greatly increases the work that the corpus analyst must do, since creating one's own corpus prior to analyzing it is obviously a large undertaking. It is therefore most desirable to work with a corpus already available, not just to decrease the work time but to add to the growing body of work that has been based on a given corpus. The Brown Corpus, for instance, has been around for close to forty years and has become a type of benchmark for corpus work: a corpus on which a large amount of research has been conducted over the years, research that has generated results that can be compared and replicated on other corpora, and that has added greatly to our knowledge of the types of written English included in the Brown Corpus.
To study the spread of pseudo-titles to various regional varieties of English,
it is necessary not only to find corpora of differing varieties of English but to
make sure that the corpora selected are as comparable as possible, that is, that
they contain texts that have been collected during a similar time period, that
represent similar genres, that are taken from a variety of different sources, and
that are of the appropriate length for studying pseudo-title usage. In other words,
the goal is to select corpora that are as similar as possible so that extraneous
variables (such as texts covering different time periods) are not introduced into
the study that may skew the results.
Since pseudo-titles have been extensively studied in British, American, and
New Zealand English, it might appear that corpora of these varieties need not
be sought, since analyzing these varieties again will simply repeat what others
have already done. However, there are important reasons why corpora of British,
American, and New Zealand English ought to be included in the study. First,
it is always desirable to have independent confirmation of results obtained in previous studies. Such confirmation ensures that the results have validity beyond the initial study from which they were obtained. Second, the previous studies of pseudo-titles were conducted on very different kinds of corpora. Bell's (1988) study was conducted on a privately created corpus; Meyer's (1992) study was based on the Brown Corpus and the London Corpus, corpora consisting of samples of written British and American English collected as far back as the late
1950s. Comparing very different corpora increases the potential that extraneous variables will influence the results. For instance, because Bell's (1988) study suggests that pseudo-title usage seems to be increasing over time, it is important to study corpora collected during a similar time period so that the variable of time does not distort the results.
There are several well-established corpora that potentially qualify as resources for studying pseudo-title usage: the Brown and LOB corpora of American and British English, respectively (and their updated counterparts, the Freiburg Brown Corpus [FROWN] and the Freiburg LOB Corpus [FLOB]); the Wellington Corpus of New Zealand English; and the Kolhapur Corpus of Indian English. Because each of these corpora has a similar design (e.g. comparable written genres divided into 2,000-word samples), they are very suitable for comparison. However, the Wellington Corpus contains texts collected at a later date than the other corpora: 1986-90 as opposed to 1961. FLOB and FROWN contain texts from 1991. These corpora thus introduce a time variable that is less than desirable. If these corpora are eliminated from consideration, only three corpora remain (Brown, LOB, and Kolhapur), and only one of these corpora, the Kolhapur Corpus of Indian English, represents a variety of English in which pseudo-titles have never before been studied. It is therefore obvious that a different set of corpora is necessary if any information is to be obtained on the spread of pseudo-titles to varieties of English other than American and British English.
A better choice of corpora to consider are those included in the International Corpus of English (ICE). The regional components of ICE contain not just identical genres but samples of speech and writing collected during a similar time-frame (1990-present). In addition, ICE corpora offer numerous regional varieties: not just British, American, and New Zealand English but Philippine, Singaporean, Jamaican, and East African English.[2] The ICE corpora therefore offer a potentially useful dataset for studying pseudo-titles. But to determine the full potential of the ICE corpora, two further points must be considered: whether the components of ICE selected for analysis provide a representative sampling of the kind of writing that pseudo-titles occur in, and, if the components do, whether the samples selected from the corpora are of the appropriate length to find a sufficient number of pseudo-titles and corresponding appositives.

[2] The ICE Project contains other varieties of English than these, but these were the only varieties available at the time this study was conducted.
In his study of pseudo-titles, Bell (1988: 327) restricted his analysis to pseudo-titles occurring in 'hard news' reporting and excluded from consideration other types of press writing, such as editorials or feature writing. Because each ICE corpus contains twenty samples of press reportage as well as twenty samples of broadcast news reports, it will be quite possible to base an analysis of pseudo-titles on the kinds of press reportage that Bell (1988) analyzed in his study. The next question, however, is whether the analysis should be based on all
forty texts from each ICE corpus, or on some subset of these texts. A number of considerations will affect choices of this nature. First of all, if the construction to be analyzed has to be extracted by manual analysis, then it becomes imperative to use as short a corpus as possible, since extensive manual analysis will add to the length of time a given study will take. As will be demonstrated in section 5.3, while pseudo-titles can be automatically extracted from the British component of ICE because it is fully tagged and parsed, they must be retrieved manually from the other components, since, unlike, say, regular verbs in English, which have verb suffixes such as -ing or -ed, pseudo-titles have no unique marker that can be searched for. Therefore, to minimize the analysis time, it is advisable to examine only a subset of the eligible texts in the various ICE components.
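Because there is no unique marker, the best an automatic search can do in an untagged corpus is overgenerate candidates for manual inspection. A sketch (the regular expression is ours and deliberately rough):

import re

# Candidate pseudo-titles: a bare lower-case word directly before what looks
# like a personal name, with no determiner in front. Verbs before names also
# match ("met Noam Chomsky"), so every hit must be checked by hand.
CANDIDATE = re.compile(
    r'\b(?<!the )(?<!The )(?<!an )([a-z]+) ([A-Z][a-z]+ [A-Z][a-z]+)')

text = ('I called linguist Noam Chomsky, and later the linguist '
        'Morris Halle returned my call.')
for m in CANDIDATE.finditer(text):
    print(m.group(1), '->', m.group(2))
# linguist -> Noam Chomsky   (the determiner blocks "the linguist Morris Halle")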
One possibility is to analyze ten samples of press reporting and ten samples of broadcast news from each ICE corpus. Another possibility is to analyze only the twenty samples of press reportage from each corpus, or only the twenty samples of broadcast news. Ultimately, it was decided to restrict the sample to the twenty samples of press reportage, primarily because broadcast news is a mixed category, containing not just scripted monologues but interviews as well. Since interviews are not strictly 'hard news' reporting, they are less likely to contain pseudo-titles.[3]
Moreover, since the goal of the study is to determine how
widespread pseudo-titles are in press reportage, it is quite sufcient to attempt
to investigate as many different newspapers as possible from the various ICE
corpora, to determine how common pseudo-title usage is.
Having restricted the number of texts to be examined, it is next necessary to examine the individual samples from each component of ICE to ensure that the samples represent many different newspapers rather than a few newspapers. In other words, if a given component of ICE contains twenty samples taken from only five different newspapers, the samples to be investigated are less than representative, since they provide information only on how five different newspapers use pseudo-titles. With a sample this small, it will be difficult to make strong generalizations about the usage of pseudo-titles within the press writing of a given variety of English. Table 5.1 lists the number of different newspapers represented in the samples selected for study from the various components of ICE.
Table 5.1 Number of different newspapers included within the various ICE components

Component         Total
Great Britain        15
United States        20
New Zealand          12
Philippines          10
Jamaica               3
East Africa           3
Singapore             2

As table 5.1 illustrates, the components vary considerably in terms of the number of different newspapers that are represented. While ICE-USA contains twenty different newspapers, ICE-East Africa contains only three. The relatively few newspapers represented in some of the components is in part a reflection of the fact that in an area such as East Africa, in which English is a second language, there exist only a few English-language newspapers. Three newspapers from this region is therefore a representative sample. Still, the results obtained from such a small sample will have to be viewed with some caution.
Because all ICE corpora contain 2,000-word samples, it is also important to consider whether text fragments, rather than entire texts, will yield enough examples of pseudo-titles and, additionally, whether pseudo-titles serve some kind of discourse function that would necessitate analyzing a whole newspaper article rather than simply a short segment of it.[4] One way to answer these questions is to do a pilot study of a few randomly selected samples taken from each ICE component. Table 5.2 lists the number of pseudo-titles and corresponding appositive structures occurring in nine samples from various randomly selected ICE corpora.

[4] Many ICE samples are actually lengthier than 2,000 words. However, the text of a sample beyond 2,000 words is marked as extra-corpus text and for this study was excluded from analysis.
Table 5.2 The number of pseudo-titles and corresponding appositives in selected newspapers from various ICE corpora

                           Pseudo-title   Equivalent appositive   Total
Independent (ICE-GB)             9                                    9
Guardian (ICE-GB)                7                                    7
Press (ICE-NZ)                  13                                   13
Daily Nation (ICE-EA)            4                 9                 13
NY Times (ICE-US)                2                 6                  8
Jamaica Herald (ICE-JA)          2                 4                  6
Phil. Star (ICE-PH)              4                 1                  5
Manila Standard (ICE-PH)        11                 0                 11
Dominion (ICE-NZ)               16                 0                 16

Table 5.2 demonstrates that even though the frequency of the various structures being investigated differs in the nine newspapers, the structures do occur with some regularity in the 2,000-word samples: from as many as sixteen times in one sample from ICE-New Zealand to as few as five times in one sample from ICE-Philippines. Of course, statistical tests will ultimately have to be applied to determine the true significance of these numbers. But at this stage, it seems evident that pseudo-titles are such a common grammatical construction that adequate numbers of them can be found in short 2,000-word excerpts.
One drawback of analyzing text fragments is that certain discourse features cannot be studied in them, because only partial texts (rather than whole texts) are available for analysis. In particular, if one is investigating a construction that contributes to the structure of a given text, then a text fragment will prove insufficient for carrying out such a study. Because pseudo-titles serve mainly
to provide descriptive information about individuals, they play no role in the overall structure or organization of a text. Therefore, it is perfectly justifiable to study pseudo-title usage in the 2,000-word excerpts occurring in samples from ICE corpora, particularly if the goal is to study the structure and frequency of pseudo-titles and corresponding appositives in various national varieties of English.
5.3 Extracting information from a corpus
After a research question has been framed and a corpus selected to analyze, it is next necessary to plan out exactly what kinds of grammatical information will be extracted from the corpus, to determine how this information will be coded and recorded, and to select the appropriate software that can most efficiently assist in finding the construction being investigated in the corpus being studied.
5.3.1 Defining the parameters of a corpus analysis
The first step in planning a corpus analysis is the development of a clear working definition of the grammatical construction(s) to be studied. If a corpus analysis is begun without an adequate definition of terms, the analyst runs the risk of introducing too much inconsistency into his/her analysis. In the case of pseudo-titles, this becomes an especially crucial issue because research has shown that there is not always a clear boundary between a full title and a pseudo-title. If some kind of working definition of each kind of title is not determined prior to analysis, it is highly likely that as the analysis is conducted what is counted as a pseudo-title will vary from instance to instance, and as a consequence, considerable inconsistency will result that will seriously compromise the integrity of the findings. Fortunately, Bell (1988) has specified quite precisely the kinds of semantic categories into which full titles can be categorized, and his categories can be used to distinguish what counts as a title from what counts as a pseudo-title:
Professional (Doctor, Professor)
Political (President, Chancellor, Senator)
Religious (Bishop, Cardinal, Mother)
Honors (Dame, Earl, Countess)
Military (General, Corporal)
Police (Commissioner, Constable, Detective-Sergeant)
Foreign (Monsieur, Senorita) (Bell 1988: 329)
In the present study, then, a pseudo-title will be defined as any construction containing an initial unit (such as city employee in city employee Mark Smith) that provides descriptive information about an individual and that cannot be classified into any of the above semantic classes.
Because pseudo-titles will be compared to corresponding appositives, it is also necessary to have a working definition of the types of appositives that will be considered equivalent to a pseudo-title. Bell (1988) claims that pseudo-titles and equivalent appositives are related through a process he terms 'determiner deletion'. Thus, for Bell (1988: 328), a pseudo-title such as fugitive financier Robert Vesco would be derived from an appositive containing the determiner the in the first unit: the fugitive financier Robert Vesco. However, if one's goal is to study the relationship between the use of pseudo-titles and all possible equivalent appositives, then a wider range of appositive structures needs to be investigated, largely because there exist corresponding structures where the unit of the appositive equivalent to the pseudo-title occurs after, not before, the noun phrase it is related to, and in such constructions a determiner is optional: Robert Vesco, [the] fugitive financier. In the current study, then, a corresponding appositive will include any appositive that is related to a pseudo-title through the process of 'systematic correspondence' (Quirk et al. 1985: 57f.), a process that specifies that two constructions are equivalent if they contain roughly the same lexical content and have the same meaning. Although constructions such as fugitive financier Robert Vesco and Robert Vesco, fugitive financier certainly differ in emphasis and focus, they are close enough in form and meaning that they can be considered equivalent structures.
Of course, a working definition is just that: a preliminary definition of a grammatical category subject to change as new data are encountered. And as anyone who has conducted a corpus study knows, once real data are examined, considerable complications can develop. As the study of pseudo-titles was conducted, examples were found that required modification of the initial definition (cf. section 5.3.3 for a discussion of the actual methods that were employed to locate pseudo-titles). For instance, the construction former Vice President Dan Quayle (ICE-USA) contains an initial unit, former Vice President, that is semantically a Political designation and therefore qualifies as a title; moreover, Vice President is capitalized, a common orthographic characteristic of titles. But the inclusion of the adjective former before Vice President suggests that the focus here is less on honoring Dan Quayle and more on his previous occupation as Vice President of the United States. Consequently, this construction (and others like it) was counted as a pseudo-title, even though Vice President has semantic characteristics normally associated with titles.
Decisions of this nature are common in any corpus analysis that involves the close examination of data, and there are basically two routes that the corpus analyst can take to decide how to classify problematic data. As was done above, if enough evidence exists to place a construction into one category rather than another, then the construction can be classified as 'x' rather than 'y', even though the construction may have characteristics of both 'x' and 'y'. Alternatively, the analyst can create an ad hoc category in which constructions that cannot be neatly classified are placed. Either approach is acceptable, provided that whichever decision is made about the classification of a particular grammatical construction is consistently applied throughout a given corpus analysis, and that the decisions that are made are explicitly discussed in the article or book in which the results of the study are reported. Moreover, if an ad hoc category is created during a corpus analysis, once the analysis is completed, all examples in the category should be examined to determine whether they have any characteristics in common that might lead to additional generalizations.
5.3.2 Coding and recording grammatical information
Once key terms and concepts have been defined, it is next necessary to decide what grammatical information needs to be extracted from the corpus being examined to best answer the research questions being investigated, and to determine how this information can best be coded and recorded. To find precise answers to the various linguistic questions posed about pseudo-titles in section 5.1, the following grammatical information will need to be obtained:
Information A: To determine how widespread pseudo-title usage has become, and to reveal which newspapers permit the use of pseudo-titles and which prefer using only equivalent appositive constructions, it will be necessary to obtain frequency counts of the number of pseudo-titles and equivalent appositives occurring in the various regional varieties of ICE being investigated.
Information B: To determine how common pseudo-titles are in newspapers that allow their use, it will be important to examine the role that style plays in the use of pseudo-titles and equivalent appositives, and to ascertain whether, given the choice between one construction and the other, a particular newspaper will choose a pseudo-title over a corresponding appositive construction. To study stylistic choices, three types of correspondence relationships will be coded: those in which an appositive can be directly converted into a pseudo-title:

(1) Durk Jager, executive vice president (ICE-USA)
    executive vice president Durk Jager

those in which only a determiner needs to be deleted for the appositive to be converted into a pseudo-title:

(2) the Organising Secretary, Mr Stephen Kalonzo Musyoka (ICE-East Africa)
    Organising Secretary Mr Stephen Kalonzo Musyoka
and those in which only part of the appositive can be converted into a pseudo-title:

(3) Peter Houliston, Counsellor and Programme Director of Development at the Canadian High Commission (ICE-Jamaica)
    Counsellor and Programme Director Peter Houliston
Information C: To investigate why only part of the appositive can form a pseudo-title in constructions such as (3), it is necessary to study the linguistic considerations that make conversion of the entire unit of an appositive to a pseudo-title stylistically awkward. Bell (1988: 336) found that an appositive construction was favored (4a and 5a) over an equivalent pseudo-title (4b and 5b) if the unit that could potentially become a pseudo-title contained either a genitive noun phrase (e.g. bureau's in 4a) or postmodification of considerable complexity (e.g. . . . of the CIA station in Rome in the 1970s in 5a):

(4) a. the bureau's litigation and prosecution division chief Osias Baldivino (ICE-Philippines)
    b. ?*bureau's litigation and prosecution division chief Osias Baldivino

(5) a. Ted Shackley, deputy chief of the CIA station in Rome in the 1970s (ICE-GB:W2C-010 #62:1)
    b. ?deputy chief of the CIA station in Rome in the 1970s Ted Shackley
Information D: It was decided to record the length of a pseudo-title after a casual inspection of the data suggested that in the Philippine component there was greater tolerance for lengthy pseudo-titles (6) than for what seemed to be the norm in the other varieties: pseudo-titles of only a few words in length (7).

(6) Salamat and Presidential Adviser on Flagship Projects in Mindanao Robert Aventajado (ICE-Philippines)

(7) Technology editor Kenneth James (ICE-Singapore)
Often after a corpus analysis has begun, new information will be discovered that is of interest to the study being conducted and that really needs to be recorded. Usually such discoveries can be easily integrated into the study, provided that the analyst has kept an accurate record of the constructions that have already been recorded, and has coded the data in a manner that allows for the introduction of new grammatical information into the coding scheme.
Data can be coded and recorded in numerous ways. Because the goal of the current study is to record specific information about each construction being studied, it was decided to use a type of manual coding (or tagging) called 'problem-oriented tagging' (De Haan 1984; cf. also section 4.4.3). This method of tagging allows the analyst to record detailed information about each grammatical construction under investigation. Table 5.3 outlines the coding system used for recording the grammatical information described in (A)-(D) above.
The coding scheme in table 5.3 allows for every pseudo-title or equivalent appositive construction to be described with a series of six numbers that not only label the construction as being a pseudo-title or appositive but also describe the regional variety in which the construction was found; the particular sample from which it was taken; the type of correspondence relationship existing between an appositive and potential pseudo-title; the form of the construction; and the length of the pseudo-title or of the unit of the apposition that could potentially be a pseudo-title. Coding the data this way will ultimately allow for the results to be viewed from a variety of different perspectives. For instance, because each construction is given a number identifying the regional variety in which it occurred, it will be possible to compare pseudo-title usage in, say, British English and Singapore English. Likewise, each sample represents a different newspaper. Therefore, by recording the sample from which a construction was taken, it will be possible to know which newspapers permit pseudo-title usage and which do not, and the extent to which newspapers use pseudo-titles and equivalent appositives similarly or differently.
According to the scheme in table 5.3, a construction such as financial adviser David Innes (ICE-GB:W2C-009 #41:2) would be coded in the following manner:

Country: Great Britain               6
Sample: W2C-009                      9
Type: pseudo-title                   1
Correspondence relationship: N/A     4
Form: simple NP                      1
Length: two words                    2
The advantage of using a coding scheme such as this is that, by assigning a series of numerical values to each construction being studied, the results can be easily exported into any statistical program (cf. section 5.4 for a discussion of the kinds of statistical programs into which quantitative data such as this can be exported).
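To make this concrete, the following is a minimal sketch (not part of the original study) of how codes of this kind might be recorded and exported; the field names and values follow table 5.3, but the Python code, file name, and data layout are illustrative assumptions:

    import csv

    # Each coded construction is a sequence of six numeric values,
    # following the scheme in table 5.3.
    FIELDS = ["country", "sample", "type", "correspondence", "form", "length"]

    # 'financial adviser David Innes' (ICE-GB:W2C-009 #41:2):
    # Great Britain (6), W2C-009 (9), pseudo-title (1),
    # correspondence N/A (4), simple NP (1), two words (2)
    coded = [(6, 9, 1, 4, 1, 2)]

    # Write the codes to a CSV file that a statistical package
    # (SPSS, SAS, etc.) can import directly.
    with open("pseudo_titles.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(coded)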
The disadvantage of using numerical sequences of this nature is that they increase the likelihood that errors will occur during the coding process. If, for instance, a pseudo-title is mistakenly tagged as an apposition, it is difficult for the analyst to recognize this mistake when each of these categories is given the arbitrary numbers '1' and '2', respectively. Moreover, if each variable is coded with an identical sequence of numbers, it will be quite difficult to determine
Table 5.3 Coding scheme for study of pseudo-titles and equivalent appositives

Country             Sample number   Type               Correspondence relationship   Form                             Length
US (1)              W2C 001 (1)     Pseudo-title (1)   Total equivalence (1)         Simple NP (1)                    One word (1)
Philippines (2)     W2C 002 (2)     Appositive (2)     Determiner deletion (2)       Genitive NP (2)                  Two words (2)
East Africa (3)     W2C 003 (3)                        Partial equivalence (3)       Multiple post-modification (3)   Three words (3)
Jamaica (4)         W2C 004 (4)                        N/A (4)                                                        Four words (4)
New Zealand (5)     W2C 005 (5)                                                                                       Five words (5)
Great Britain (6)   etc.                                                                                              Six or more words (6)
Singapore (7)
Figure 5.1 PC-Tagger pop-up menu
whether a '2' assigned for Variable 1, for instance, is an error when the same numerical value is used for each of the other variables being studied. To reduce the incidence of errors, pseudo-titles and equivalent appositives were coded with a software program called PC Tagger (described in detail in Meyer and Tenney 1993). This program displays both the actual variables being studied in a particular corpus analysis and the values associated with each variable (cf. figure 5.1).

Figure 5.1 contains a pseudo-title selected from ICE-Philippines, MILF legal counsel Ferdausi Abbas, and a pop-up window containing in the left column the 'Tag Names' (i.e. variables) and in the right column 'Tag Values'. Because the length variable is selected, only the values for this variable appear. Since the pseudo-title is three words in length, the value 'three' is selected. The program produces as output a separate file that converts all values to the numbers in table 5.3 associated with the values, and that can be exported to a statistical program. Because the analyst is working with the actual names of tags and their values, the results can be easily checked and corrected.
An alternative to using a program such as PC Tagger is to use a system containing codes that are mnemonic. For instance, the CHAT system (Codes for the Human Analysis of Transcripts), developed within the CHILDES Project (Child Language Data Exchange System; cf. section 1.3.8), contains numerous mnemonic symbols for coding differing kinds of linguistic data. To describe the morphological characteristics of the phrase our family, the codes '1P', 'POSS', and 'PRO' are used to describe our as a first-person-plural possessive pronoun; the codes 'COLL' and 'N' are used to describe family as a collective noun (CHAT Manual: http://childes.psy.cmu.edu/pdf/chat.pdf). For the current study, a pseudo-title, for instance, could be coded as 'PT' and a corresponding apposition as 'AP'. And even though a system such as this lacks numerical values, the mnemonic tags can easily be converted into numbers (e.g. all instances of 'PT' can be searched for and replaced with the value '1').
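As a rough illustration (a sketch using the hypothetical 'PT'/'AP' codes suggested above, not the actual CHAT format), such a conversion takes only a few lines:

    # Map mnemonic tags to the numeric values used in table 5.3.
    TYPE_CODES = {"PT": 1, "AP": 2}

    tags = ["PT", "AP", "PT", "PT"]          # mnemonic codes assigned by hand
    numeric = [TYPE_CODES[t] for t in tags]  # -> [1, 2, 1, 1]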
5.3.3 Locating relevant constructions for a particular
corpus analysis
The greatest amount of work in any corpus study will be devoted to locating the particular construction(s) being studied, and then assigning to these constructions the particular linguistic values being investigated in the study. In the past, when corpora were not available in computer-readable form, the analyst had to extract grammatical information by hand, a tedious process that involved reading through printed texts, manually recording grammatical information, and copying out examples. Now that corpora exist in computer-readable form, it is possible to reduce the time it takes to conduct a corpus analysis by using software programs that can automate (to varying degrees) the extraction of grammatical information. In conducting a grammatical analysis of a corpus, the analyst can either learn a programming language such as Perl and then write scripts to extract the relevant grammatical information, or use a general-purpose software application, such as a concordancing program (described in detail below), that has been developed for use on any corpus and that can perform many common tasks (e.g. word searches) associated with any corpus analysis. Both approaches to corpus analysis have advantages and disadvantages, and neither approach will guarantee that the analyst retrieves precisely what is being sought: just about any corpus analysis will involve sorting through the data manually to eliminate unwanted constructions and to organize the data in a manner suitable to the analysis.
There exist a number of programming languages, such as Python, Visual Basic, or Perl, that can be powerful tools for analyzing corpora. Fernquest (2000) has written a number of Perl scripts that can, for instance, search a corpus and extract from it lines containing a specific phrase; calculate word frequencies, sorting words either alphabetically or by frequency of occurrence; and count bigrams in a text (i.e. two-word sequences), organizing the bigrams by frequency. Perl scripts are quite commonly available (cf., for instance, Melamud's 1996 file of Perl scripts) and can be modified to suit the specific grammatical analysis being conducted. Of course, much of what can be done with many Perl scripts can be more easily accomplished with, say, a good concordancing program. But as Sampson (1998) quite correctly observes, if those analyzing corpora have programming capabilities, they do not have to rely on software produced by others, which may not meet their needs.
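The kind of task such scripts perform is easy to reproduce in any scripting language. The following sketch (written in Python rather than Perl, and assuming a hypothetical raw text file corpus.txt) counts the bigrams in a corpus and lists the most frequent ones:

    from collections import Counter
    import re

    # Read a raw (untagged) corpus and reduce it to lower-case word tokens.
    with open("corpus.txt") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    # Count two-word sequences (bigrams) and print the twenty most frequent.
    bigrams = Counter(zip(words, words[1:]))
    for (w1, w2), freq in bigrams.most_common(20):
        print(freq, w1, w2)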
If one is using someone else's scripts, using the scripts on a corpus is not that difficult. But writing new scripts requires programming skills that are probably beyond the capabilities of the average corpus linguist. For these individuals, there exist a number of very useful software programs that require no programming skills and that can extract much useful information from corpora: the Linguistic Data Base (LDB) for analyzing the Nijmegen Corpus (cf. section 4.3); ICECUP for analyzing the British component of ICE (described below); Sara for analyzing the British National Corpus (cf. Aston and Burnard 1998); and a variety of different concordancing programs available for use on PCs, Macintoshes, Unix workstations, and even the World Wide Web (cf. appendix 2 for a listing of concordancing programs and also Hockey 2000: 49-65 for a survey of some of the more common programs). Of course, as is the case with any new software program, the user will need to experiment with the program both to learn how to use it and to determine whether it can do what the user needs to accomplish in a given corpus analysis. Thus, before any corpus analysis is begun in earnest, the corpus analyst will want to experiment with more than one concordancing program to ensure that the best program is used for a given corpus analysis.
The ability of such programs to retrieve grammatical information depends crucially upon how easy it is to construct a search for a given grammatical construction. In a lexical corpus (a corpus containing raw text and no grammatical annotation), it is easiest to construct searches for particular strings of characters: a particular word or group of words, for instance, or classes of words containing specific prefixes or suffixes (e.g. the derivational morpheme un-, or a verb inflection such as -ing). In corpora that have been tagged or parsed, it is possible to search for particular tags and therefore retrieve the particular grammatical constructions (common nouns, adverbs, etc.) associated with the tags. However, even in tagged and parsed corpora, it may be difficult to automatically extract precisely what the analyst is studying. Therefore, just about any corpus study will at some stage involve the manual analysis of data. Pseudo-titles provide a good illustration of the strengths and limitations of the various tools that have been developed for corpus analysis.
A pseudo-title such as linguist Noam Chomsky consists of a common noun followed by a proper noun. Because nouns are open-class items, they are a very broad category. Consequently, it is highly likely that every pseudo-title discovered in a given corpus will contain a unique series of common and proper nouns. This characteristic of pseudo-titles makes it very difficult to use the most common corpus tool, the concordancing program, to automatically retrieve pseudo-titles.
The most common capability of any concordancing program is to conduct simple searches for words, groups of words, or words containing prefixes and suffixes; to display the results of a given search in KWIC (key word in context) format; and to calculate the frequency of the item being searched for in the corpus in which it occurs. Figure 5.2 contains an example concordancing window displaying the results of a search for all instances of the conjunction but in the fully tagged and parsed British component of ICE (ICE-GB). This corpus can be searched by a text retrieval program, ICECUP, that has concordancing capabilities.

Figure 5.2 Concordancing window for the conjunction but in ICE-GB
In figure 5.2, the conjunction but is displayed in KWIC format: all instances of but are vertically aligned for easy viewing, and displayed in the text unit (cf. section 4.1) in which they occur. In ICECUP (and most other concordancing programs) the context can be expanded to include, for instance, not just a single text unit but groups of text units occurring before and after the search item. At the bottom of the screen is a figure (5,909) specifying how many instances of but were found in the corpus.
While all concordancing programs can perform simple word searches, some have more sophisticated search capabilities. For instance, some programs can search not just for strings of characters but for lemmas (e.g. all the forms of the verb be) if a corpus has been lemmatized, that is, if, for example, all instances of be (am, was, are, etc.) have been linked through the process of lemmatization. In a corpus that has been lemmatized, if one wants information on the frequency of all forms of be in the corpus, it is necessary simply to search for be. In an unlemmatized corpus, one would have to search for every form of be separately.[5]

[5] Yasumasa Someya has produced a file of lemmatized verb forms for English that can be integrated into WordSmith, saving the analyst the time of creating such a file manually. The file can be downloaded at http://www.liv.ac.uk/ms2928/wordsmith/index.htm.
Many concordancing programs can also perform wild card searches. That is, a search can be constructed that finds not just strings of characters occurring in an exact sequence but strings that may have other words intervening between them. For instance, a search for the correlative coordinator not just . . . but, as in The movie was not just full of excessive violence but replete with foul language, will have to be constructed so that it ignores the words that occur between just and but. To conduct such a search in a program such as MonoConc Pro 2.0 (http://www.athel.com/mono.html#monopro), the number of words between the two parts of the search expression can be set, specifying that the search should find all instances of not just . . . but where, say, up to twenty words can occur between just and but. Wild card searches can additionally find parts of words: in WordSmith (http://www.liv.ac.uk/ms2928/wordsmith/index.htm), searching for *ing will yield all instances of words in a corpus ending in -ing. And wild card searches (as well as searches in general) can be extended to search for words tagged in a specific way in a given corpus, or a particular word tagged in a specific manner (e.g. all instances of like tagged as a conjunction).
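Both kinds of wild card search can be approximated with regular expressions; the following sketch illustrates the idea (the sample strings are drawn from the examples above, and the patterns are simplified assumptions rather than the actual query syntax of any concordancer):

    import re

    text = ("The movie was not just full of excessive violence "
            "but replete with foul language.")

    # 'not just ... but' with up to twenty intervening words:
    pattern = re.compile(r"\bnot just((?:\W+\w+){0,20}?)\W+but\b")
    match = pattern.search(text)
    if match:
        print("intervening words:", match.group(1).strip())

    # The equivalent of searching for *ing: every word ending in -ing.
    sample = "allowing us to glimpse the staging by Francisco Negrin"
    print(re.findall(r"\b\w+ing\b", sample))   # ['allowing', 'staging']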
Because pseudo-titles (and many other grammatical constructions) do not have unique lexical content, a simple lexical search will not retrieve any instances of pseudo-titles in a corpus. However, if pseudo-titles are studied in corpora that have been tagged or parsed, it is at least possible to narrow the range of structures generated in a search.
In a tagged or parsed corpus, it is possible to search not just for strings of characters but for actual grammatical categories, such as nouns, verbs, noun phrases, verb phrases, and so forth. Consequently, a tagged or parsed corpus will allow for a greater range of structures to be recovered in a given search. For instance, a concordancing program can be used to find possessive nouns by searching for the strings 's or s'. Although a search for these strings will certainly recover numerous possessive nouns, it will also retrieve many unwanted constructions, such as contractions (e.g. John's leaving). If possessives are tagged, however, it will be possible to search for the possessive tag, in a spoken or written text, and to recover only possessives, not extraneous constructions such as contractions. In ICE-GB, possessives are assigned the tag 'genm' (genitive marker), and a search for all constructions containing this tag turned up all the instances of possessives in ICE-GB in a couple of seconds. Other programs, such as WordSmith or MonoConc Pro 2.0, can also be used to search for tags.
Because in a tagged corpus only individual words are annotated, it can be difficult to construct searches for grammatical constructions having different part-of-speech configurations. Examples (8)-(10) below contain tagged instances of three pseudo-titles in ICE-GB:

(8) Community <N(com,sing):1/3> liaison <N(com,sing):2/3> officer <N(com,sing):3/3> John <N(prop,sing):1/2> Hambleton <N(prop,sing):2/2> (ICE-GB:W2C-009 #109:7)

(9) Liberal <N(com,sing):1/3> Democrats <N(com,sing):2/3> leader <N(com,sing):3/3> Cllr <N(prop,sing):1/3> John <N(prop,sing):2/3> Hammond <N(prop,sing):3/3> (ICE-GB:W2C-009 #59:3)
(10) 59-year-old <ADJ(ge)> caretaker <N(com,sing)> Rupert <N(prop,sing):1/2> Jones <N(prop,sing):2/2> : <PUNC(col)> (ICE-GB:W2C-011 #67:2)
While the pseudo-titles in (8)-(10) consist of common and proper nouns, there are some notable differences: (8) and (10) contain two proper nouns, (9) three proper nouns; (10) begins with an adjective, (8) and (9) do not. There is thus no sequence of common nouns and proper nouns to be searched for that uniquely defines a pseudo-title. One could search for part of the pseudo-title that each of the constructions has in common (e.g. a proper noun preceded by a common noun). But such a search is likely to turn up many structures other than pseudo-titles. When such a search was conducted on ICE-GB, a noun phrase followed by a vocative (11) was retrieved, as was an adverbial noun phrase followed by a proper noun (12):

(11) That's rubbish Aled (ICE-GB:S1A-068 #319:1:C)

(12) This time Hillier does find his man (ICE-GB:S2A-018 #120:1:A)
To retrieve particular grammatical constructions, it is therefore more desirable
to use a parsed corpus rather than a tagged corpus because a parsed corpus will
contain annotation describing higher-level grammatical constructions.
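The over-generation problem is easy to demonstrate. In the sketch below, a simplified word_TAG notation (an invented format, much cruder than ICE-GB's actual markup) is searched for a common noun immediately followed by a proper noun; the search finds the genuine pseudo-title but also the false hit in (12):

    import re

    # Simplified tagged text in an invented word_TAG format.
    tagged = ("linguist_NCOM Noam_NPROP Chomsky_NPROP "
              "this_DET time_NCOM Hillier_NPROP does_VAUX find_V his_PRO man_NCOM")

    # A common noun directly followed by a proper noun.
    pattern = re.compile(r"(\w+)_NCOM (\w+)_NPROP")
    for common, proper in pattern.findall(tagged):
        print(common, proper)
    # -> linguist Noam   (a genuine pseudo-title)
    # -> time Hillier    (a false hit, from the adverbial in (12))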
Of the various ICE corpora being used to study pseudo-titles, only ICE-GB has been parsed. To search ICE-GB for instances of pseudo-titles and corresponding appositives, it is first necessary to determine how such constructions have been parsed. Figure 5.3 contains a parse tree from ICE-GB for the pseudo-title general manager Graham Sunderland.

Figure 5.3 Parse tree for a sample pseudo-title in ICE-GB
The pseudo-title in figure 5.3 has been parsed into two noun phrases, the first containing two common nouns and the second two proper nouns. The second noun phrase has the feature 'appos', indicating that in ICE-GB pseudo-titles are considered a type of apposition.

Figure 5.4 contains a parse tree for what is considered an equivalent appositive structure, Allen Chase, head of strategic exposure for NatWest Bank.
Figure 5.4 Parse tree for a sample appositive in ICE-GB
Like the pseudo-title in figure 5.3, the construction in figure 5.4 has the form of two noun phrases containing various other structures. But unlike the pseudo-title in figure 5.3, the appositive in figure 5.4 contains the feature 'appos' for both noun phrases, and additionally contains a parse marker for the mark of punctuation, the comma, that separates the two noun phrases.

There are two ways to search for pseudo-titles and appositives in ICE-GB. Two separate searches can be constructed: one searching for two noun phrases each having the feature 'appos', and a second searching for two noun phrases with only the second one containing the feature 'appos'. Punctuation is irrelevant to both searches, as the search tool for ICE-GB, ICECUP, can be instructed to ignore punctuation. A more economical way to search for both pseudo-titles and equivalent appositives is to construct a search containing one structure that both constructions have in common: a single noun phrase containing proper nouns that have the feature 'appos'.
To find all instances of proper nouns annotated with the feature 'appos', ICECUP requires that a fuzzy tree fragment (FTF) be constructed, that is, a partial tree structure that can serve as the basis for a search that will retrieve all structures in ICE-GB that fit the description of the FTF. Figure 5.5 contains an FTF that will find all instances of proper nouns with the feature 'appos'.

Figure 5.5 FTF for appositive proper nouns

This search was restricted to the press reportage section of ICE-GB, and in a matter of seconds all of the relevant constructions were retrieved and displayed in KWIC format. Because the other components of ICE used in the study were not tagged or parsed, pseudo-titles in these components had to be identified manually, a process that took several days.
5.4 Subjecting the results of a corpus study to statistical analysis
After information is extracted from a corpus (or series of corpora), through either manual analysis or the use of a text retrieval program, it is necessary to subject this information to some kind of statistical analysis. This analysis can be quite simple and involve no more than providing frequency counts of the particular constructions being investigated. However, if the goal of one's analysis is to determine whether similarities or differences exist in a corpus, it then becomes necessary to apply specific statistical tests (such as the chi-square) to determine whether the differences or similarities are statistically significant, that is, whether they are real or merely the result of chance.
Because many modern-day corpus linguists have been trained as linguists, not statisticians, it is not surprising that they have been reluctant to use statistics in their studies. Many corpus linguists come from a tradition that has provided them with ample background in linguistic theory and the techniques of linguistic description, but little experience in statistics. And as they begin doing analyses of corpora, they find themselves practicing their linguistic tradition within the realm of numbers, the discipline of statistics, which many corpus linguists find foreign and intimidating. As a consequence, many corpus linguists have chosen not to do any statistical analysis of the studies they conduct, and to work instead mainly with frequency counts. For instance, the study of coordination ellipsis described in section 1.2 was based solely on frequency differences in the samples of speech and writing that were analyzed, and no statistical tests were applied to determine whether these differences were statistically significant.
Recently, however, many corpus linguists have begun to take the role of statistics in corpus analysis much more seriously. A number of books have been written, such as Woods, Fletcher, and Hughes (1986), Kretzschmar and Schneider (1996), and Oakes (1998), that discuss statistical analysis of language data in considerable detail. Some text analysis programs, such as WordSmith, have statistical capabilities built into them, so that as corpus linguists study particular linguistic constructions (e.g. collocations) they can perform statistical analyses to see whether the results they obtain are significant. And many of the commonly available statistical programs, such as SAS, SPSS, SYSTAT, or G, can now be run on PCs or Macintoshes and have very friendly user interfaces (cf. Kretzschmar 2000 for a review of the student version of SPSS, and a listing of other statistical packages that can be run on a PC or Macintosh).
If the corpus linguist finds any of these resources intimidating, he or she can always consult a specialist in the area of statistics for advice. However, if this route is taken, it is important to realize that statistics is a vast and complex field of inquiry, and different academic disciplines will approach the use of statistical tests from differing perspectives: mathematicians, for instance, use statistics for different purposes than social scientists. Since corpus linguists are interested in studying linguistic variation within a corpus, it is most appropriate for them to follow the methodology of statistical analysis used in the social sciences.

However one obtains information on the use of statistics in corpus analysis, it is important to realize that by conducting a more rigorous statistical evaluation of their results, corpus linguists can not only be more confident about the results they obtain but may even gain new insights into the linguistic issues they are investigating.
The statistical analysis of data is essentially a three-part process that involves:

1. evaluating the corpus from which the results are to be obtained to determine its suitability for statistical analysis;
2. running the appropriate statistical tests;
3. interpreting the results, and finding linguistic motivations for them.

Whether the latter part of step 3 (finding linguistic motivations for the results) is done first, in a series of hypotheses tested out in a corpus, or last depends upon the type of study one is doing, a point that will be discussed in greater detail in section 5.4.1. But following all of these steps is the best way to ensure the validity and soundness of one's statistical analysis of a corpus.
5.4.1 Judging the suitability of a corpus for statistical analysis and determining the appropriate statistical tests to apply
Prior to conducting the study of pseudo-titles described in section 5.1, great care was taken in selecting a corpus that would yield valid results. Because pseudo-title usage has changed over time, texts were selected from a similar time-frame to ensure that the results reflected current usage trends. Even though only text fragments were used, a pilot study was conducted to ensure that sufficient numbers of pseudo-titles could be found in 2,000-word samples. Had texts from different time periods been mixed, or had the samples not turned up enough examples of pseudo-titles, it would have been difficult to trust the results that were obtained, and considerable time would have been wasted conducting a study that for all practical purposes was invalid. The more the analyst is able to keep extraneous variables out of a corpus study, the more he or she will be able to trust the validity of the results obtained.
Many times, however, the analyst does not have the advantage of basing an analysis on a corpus that is methodologically 'pure'. This is particularly the case when using corpora developed in earlier eras, when those creating corpora did not have a pilot corpus to guide their designs (Biber 1993: 256) or the benefit of the years of experience that now exist in the design and creation of corpora. Also, the analyst may wish to do an analysis in an area where no good corpora are available. However, even if all or part of a corpus is not designed as ideally as the analyst would like, it is still possible to analyze the corpus and make generalizations based on the results that are obtained. Woods, Fletcher, and Hughes (1986: 55-6) advise that analysts accept the results of each study, 'in the first place, as though any sampling had been carried out in a theoretically correct fashion . . . [and] then look at the possibility that they may have been distorted by the way the sample was, in fact, obtained'. If there are problems with the sample, it is advisable 'to attempt to foresee some of the objections that might be made about the quality of that material and either attempt to forestall criticism or admit openly to any serious defects'.
Ideally, it would be most desirable to analyze only fully representative corpora, so that it is not necessary to make the concessions that Woods, Fletcher, and Hughes (1986) advocate. But as long as the analyst is clear about the kind of corpus that was examined, open about the variables that might have affected the results, and sensitive to criticisms that may be leveled against the results, it is perfectly acceptable to base a linguistic analysis on a less than ideal corpus.
After determining whether the corpus being studied is suitable to the analysis being undertaken, the analyst next needs to consider what statistical tests are appropriate and, after applying the tests, whether any significant results obtained have some linguistic motivation: statistical significance is meaningless if it has no linguistic motivation. In deciding which statistical tests to apply, and how the results of such tests are best evaluated, it is useful to make the distinction that Biber (1988) makes between 'macroscopic' and 'microscopic' analyses. In macroscopic analyses, Biber (1988: 61) observes, the goal is to determine 'the overall dimensions of variation in a language', that is, to investigate spoken vs. written English, or a variety of genres of English, and to determine which linguistic constructions define and distinguish these kinds of English. Macroscopic analyses are in a sense inductive. In his analysis of linguistic variation in speech and writing, although Biber (1988: 65-75) pre-selected the corpora his analysis would be based on and the particular linguistic constructions he investigated, he did not initially go through the descriptive statistics that he generated, based on the occurrence of the constructions in the corpora, and attempt to determine which results looked meaningful and which did not before applying further statistical tests. Instead, he simply did a factor analysis of the results to determine which constructions tended to co-occur in specific genres and which did not. And it was not until he found that passives and nominalizations, for instance, tended to co-occur that he attempted to find a functional motivation for the results (Biber 1988: 80): that passives and nominalizations occurred together in genres (such as academic prose) in which abstract information predominated (Biber 1988: 119).
In microscopic analyses, on the other hand, the purpose is to undertake a detailed description of the communicative functions of particular linguistic features: the analyst takes a particular grammatical construction, such as the relative clause, and attempts to study its function in various genres or linguistic contexts. Microscopic analyses are typically deductive rather than inductive: the analyst begins with a series of hypotheses and then proceeds to confirm or disconfirm them in the corpus being investigated. Of course, there will always be results that are unanticipated, but these results can be handled by proposing new hypotheses or reformulating old ones. Since the majority of corpus analyses involve microscopic rather than macroscopic analyses, the remainder of this chapter will focus on the various kinds of statistical tests that were applied to the results of the study of pseudo-titles detailed in section 5.3.
5.5 The statistical analysis of pseudo-titles

A good corpus study, as was argued in section 5.1, combines qualitative and quantitative research methods. In any corpus analysis, the balance between these methods will vary. This section explores this balance with respect to the various hypotheses proposed in section 5.2 concerning the use and structure of pseudo-titles, and demonstrates how these hypotheses can be confirmed or disconfirmed through simple exploration of the various samples of ICE included in the study as well as through the application of various statistical tests.
5.5.1 Exploring a corpus
Previous research on pseudo-titles has documented their existence in American, British, and New Zealand press reportage, and demonstrated that, because their usage was stigmatized, certain newspapers (particularly in the British press) prohibited their use. To determine whether pseudo-titles have spread to the other varieties of English represented in ICE and whether their usage is stigmatized in these varieties, it is only necessary to examine examples from the various samples to see whether a given newspaper allows pseudo-title usage or not. And in simply examining examples in a corpus, one is likely to encounter interesting data. In the case of pseudo-titles, for instance, it was found that certain newspapers that prohibited pseudo-title usage did in fact contain pseudo-titles, a finding that reflects the well-known fact that practice does not always follow prescription.
Table 5.4 lists the number of newspapers containing or not containing pseudo-titles in the various regional components of ICE investigated. As table 5.4 illustrates, pseudo-titles have spread to all of the regional varieties of English investigated, and it is only in Great Britain that many newspapers prohibited the usage of pseudo-titles.
Table 5.4 Number of newspapers containing pseudo-titles in various ICE components

Country         Newspapers without pseudo-titles   Newspapers with pseudo-titles   Total
Great Britain                  7                                  8                  15
United States                  1                                 19                  20
New Zealand                    1                                 11                  12
Philippines                    0                                 10                  10
Jamaica                        0                                  3                   3
East Africa                    0                                  3                   3
Singapore                      0                                  2                   2
Totals                         9 (14%)                           56 (86%)            65 (100%)
Although the results displayed in table 5.4 reveal specific trends in usage, there were some interesting exceptions to these trends. In ICE-USA, after further investigation, it was found that the one US newspaper that did not contain any pseudo-titles, the Cornell Chronicle, actually did allow pseudo-titles. It just so happened that the 2,000-word sample included in ICE did not have any pseudo-titles. This finding reveals an important limitation of corpora: the samples included within them do not always contain the full range of usages existing in the language, and it is often necessary to look beyond the corpus itself for additional data. In the case of the Cornell Chronicle, this meant looking at additional samples of press reportage from the newspaper. In other cases, it may be necessary to supplement corpus findings with elicitation tests (cf. Greenbaum 1984; de Mönnink 1997): tests that ask individuals to identify constructions as acceptable or not, or that seek information from individuals about language usage. For pseudo-titles, one might give newspaper editors and writers a questionnaire that elicits their attitudes towards pseudo-titles, and asks them whether they consider certain borderline cases of pseudo-titles (such as former President George Bush; cf. section 5.3.1) to actually be pseudo-titles, and so forth.
Two newspapers whose style manuals prohibited pseudo-title usage actually contained pseudo-titles: the New York Times and one British newspaper, the Guardian, contained pseudo-titles in sports reportage. This suggests that at least in these two newspapers the prohibition against pseudo-titles sometimes does not extend to less formal types of writing. In addition, the New York Times contained in its news reportage instances of so-called borderline cases of pseudo-titles (see above). Some of the British-influenced varieties of English contained a mixture of British and American norms for pseudo-title usage. Example (13) (taken from ICE-East Africa) begins with a pseudo-title, Lawyer Paul Muite, but two sentences later contains a corresponding apposition, a lawyer, Ms Martha Njoka, that contains features of British English: a title, Ms, before the name in the second part of the appositive, and no punctuation marking the title as abbreviated. In American press writing, typically an individual's full name would be given without any title, and if a title were used, it would end in a period (Ms.) marking it as an abbreviation.

(13) Lawyer Paul Muite and his co-defendants in the LSK contempt suit wound up their case yesterday and accused the Government of manipulating courts through proxies to silence its critics . . . Later in the afternoon, there was a brief drama in court when a lawyer, Ms Martha Njoka, was ordered out after she defied the judge's directive to stop talking while another lawyer was addressing the court. (ICE-East Africa)
Exploring a corpus qualitatively allows the analyst to provide descriptive information about the results that cannot be presented strictly quantitatively. But because this kind of discussion is subjective and impressionistic, it is better to devote the bulk of a corpus study to supporting qualitative judgements about a corpus with quantitative information.
5.5.2 Using quantitative information to support qualitative statements
In conducting a microscopic analysis of data, it is important not to become overwhelmed by the vast amount of statistical information that such a study will be able to generate, but to focus instead on using statistical analysis to confirm or disconfirm the particular hypotheses one has set out to test. In the process of doing this, it is very likely that new and unanticipated findings will be discovered: a preliminary study of pseudo-titles, for instance, led to the discovery that the length of pseudo-titles varied by national variety, a discovery that will be described in detail below.

One of the most common ways to begin testing hypotheses is to use the 'cross tabulation' capability found in any statistical package. This capability allows the analyst to arrange the data in particular ways to discover associations between two or more of the variables being focused on in a particular study. In the study of pseudo-titles, each construction was assigned a series of tags associated with six variables, such as the regional variety the construction was found in, and whether the construction was a pseudo-title or a corresponding apposition (cf. section 5.3.2). To begin investigating how pseudo-titles and corresponding appositives were used in the regional varieties of ICE being studied, a cross tabulation of the variables 'country' and 'type' was generated. This cross tabulation yielded the results displayed in table 5.5. Because so few examples of pseudo-titles and equivalent appositives were found in ICE-East Africa, ICE-Singapore, and ICE-Jamaica, the cross tabulations in table 5.5 (and elsewhere in this section) were restricted to the four ICE varieties (ICE-USA, ICE-Philippines, ICE-New Zealand, and ICE-Great Britain) from which a sufficient number of examples could be taken.
The results of the cross tabulation in table 5.5 yield raw numbers and percentages which suggest various trends. In ICE-USA, ICE-Phil, and ICE-NZ, more pseudo-titles than corresponding appositives were used, though ICE-Phil and ICE-NZ have a greater percentage of pseudo-titles than does ICE-USA. In ICE-GB, just the opposite occurs: more corresponding appositives than pseudo-titles were used, a finding reflecting the fact that there is a greater stigma against pseudo-titles in British press reportage than in the reportage of the other varieties.

Table 5.5 The frequency of pseudo-titles and corresponding appositives in the national varieties of ICE

Country   Pseudo-title   Appositive   Total
USA         59 (54%)      51 (46%)    110 (100%)
Phil        83 (69%)      38 (31%)    121 (100%)
NZ          82 (73%)      31 (27%)    113 (100%)
GB          23 (23%)      78 (77%)    101 (100%)
Total      247           198          445
When comparing results from different corpora, in this case differing components of ICE, it is very important to compare corpora of similar length. If corpora of varying length are compared and the results are not normalized, then the comparisons will be distorted and misleading. For instance, if one were to count and then compare the number of pseudo-titles in one corpus of 40,000 words and another of 50,000 words, the results would be invalid, since a 50,000-word corpus is likely to contain more pseudo-titles than a 40,000-word corpus simply because it is longer. This may seem like a fairly obvious point, but in conducting comparisons of the many different corpora that now exist, the analyst is likely to encounter corpora of varying length: corpora such as Brown or LOB are one million words in length and contain 2,000-word samples; the London-Lund Corpus is approximately 500,000 words in length and contains 5,000-word samples; and the British National Corpus is 100 million words long and contains samples of varying length. Moreover, often the analyst will wish to compare his or her results with the results of someone else's study, a comparison that is likely to be based on corpora of differing lengths.
To enable comparisons of corpora that differ in length, Biber, Conrad, and Reppen (1998: 263-4) provide a convenient formula for normalizing frequencies. Using this formula, to calculate the number of pseudo-titles occurring per 1,000 words in the four varieties of ICE in table 5.5, one simply divides the number of pseudo-titles (247) by the length of the corpus in which they occurred (80,000 words) and multiplies this number by 1,000:

(247 / 80,000) × 1,000 = 3.0875

The choice of norming to 1,000 words is arbitrary, but as larger numbers and corpora are analyzed, it becomes more advisable to norm to a higher figure (e.g. occurrences per 10,000 words).
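The calculation is simple enough to wrap in a small function; a sketch (the function name is an illustrative choice):

    def normalize(count, corpus_size, per=1000):
        # Normalize a raw frequency to occurrences per 'per' words
        # (Biber, Conrad, and Reppen 1998: 263-4).
        return count / corpus_size * per

    # 247 pseudo-titles in 80,000 words of press reportage:
    print(normalize(247, 80000))         # 3.0875 per 1,000 words
    print(normalize(247, 80000, 10000))  # 30.875 per 10,000 words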
Although the percentages in table 5.5 suggest various differences in how
pseudo-titles and corresponding appositives are used, without applying any
statistical tests there is no way to know whether the differences are real or due
to chance. Therefore, in addition to considering percentage differences in the
data, it is important to apply statistical tests to the results to ensure that any
claims made have validity. The most common statistical test for determining
whether differences are significant or not is the t-test, or analysis of variance.
However, because linguistic data do not typically have normal distributions, it
is more desirable to apply what are termed non-parametric statistical tests:
tests that make no assumptions about whether the data on which they are being
applied have a normal or non-normal distribution.
Data that are normally distributed will yield a bell curve: most cases will
be close to the mean, and the remaining cases will fall off quickly in fre-
quency on either side of the curve. To understand why linguistic data are not
normally distributed, it is instructive to examine the occurrence of pseudo-titles
Table 5.6 Frequency of occurrence of pseudo-titles in the samples from
ICE components
1 2 3 4 5 6 7 8 9 10 Total
USA 2 3 0 10 16 2 3 15 1 7 59
Phil 15 11 9 6 4 8 6 15 5 4 83
NZ 24 13 4 10 6 5 4 6 10 0 82
GB 0 0 0 0 8 3 3 2 0 7 23
Minimum Maximum Average Standard deviation Kurtosis Skewness
0 24 6.175 5.514026 −2.97711 1.147916
in the forty texts that were examined (cf. table 5.6), and the various statistical
measurements that calculate whether a distribution is normal or not.
As the gures in table 5.6 indicate, the distribution of pseudo-titles across
the forty samples was quite varied. Many samples contained no pseudo-titles;
one sample contained 24. The average number of pseudo-titles per sample was
around six. The standard deviation indicates that 68 percent of the samples fell
within about 5.5 points below or above this average; that is, 68 percent of the
samples contained between roughly one and 11 pseudo-titles. But the real signs
that the data are not normally distributed are the figures for kurtosis and skewness.
If the data were normally distributed, the figures for kurtosis and skewness
would be 0 (or at least close to 0). Kurtosis measures the extent to which
a distribution deviates from the normal bell curve: whether the distribution is
clustered around a certain point in the middle (positive kurtosis), or whether the
distribution is clustered more around the ends of the curve (negative kurtosis).
Skewness measures how asymmetrical a distribution is: the extent to which
scores pile up above or below the mean. Both of the scores for kurtosis and
skewness are very high: a negative kurtosis of −2.97711 indicates that scores are
clustering very far from the mean (the curve is relatively flat), and the figure
of 1.147916 for skewness indicates a marked asymmetry, with most samples
falling below the mean and a long tail of higher values. Figure 5.6 illustrates the flat and skewed nature of the curve. The
horizontal axis plots the various number of pseudo-titles found in each of the
forty texts, and the vertical axis the number of texts having a given frequency.
The resultant curve is clearly not a bell curve.
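For readers who wish to reproduce such diagnostics, the sketch below uses Python's scipy to compute the same descriptive statistics from the counts in table 5.6. (This is my own illustration, not the book's procedure: the book used SPSS, whose bias-corrected formulas can yield somewhat different kurtosis and skewness figures than scipy's defaults.)

```python
from scipy import stats

# Pseudo-title counts in the forty press samples (table 5.6)
counts = [2, 3, 0, 10, 16, 2, 3, 15, 1, 7,    # USA
          15, 11, 9, 6, 4, 8, 6, 15, 5, 4,    # Phil
          24, 13, 4, 10, 6, 5, 4, 6, 10, 0,   # NZ
          0, 0, 0, 0, 8, 3, 3, 2, 0, 7]       # GB

print(stats.describe(counts))              # n, min/max, mean, variance, skewness, kurtosis
print(stats.skew(counts, bias=False))      # bias-corrected skewness, as in SPSS
print(stats.kurtosis(counts, bias=False))  # bias-corrected excess kurtosis
```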
Because most linguistic data behave the way that the data in table 5.6 do, it is
more desirable to apply non-parametric statistical tests to the results, and one of
the more commonly applied tests of this nature in linguistics is the chi-square.
The chi-square statistic is very well suited to the two-way cross tabulation in
table 5.5: the dependent variable (i.e. the variable that is constant, in this case the
country) is typically put in the left-hand column, and the independent variable
[Figure 5.6 Pseudo-title frequency across samples. Horizontal axis: frequency per text (1–25); vertical axis: number of texts with this frequency]
(i.e. the variable that changes, in this case the type: whether the construction
is a pseudo-title or corresponding apposition) is in the top-most row. Table 5.7
presents the results of a chi-square analysis of the data in table 5.5.
In essence, the chi-square test calculates the extent to which the distribution
in a given dataset either confirms or disconfirms the null hypothesis: in this
case, whether or not there are differences in the distribution of pseudo-titles and
equivalent appositives in the four regional varieties of ICE being compared. To
perform this comparison, the chi-square test compares observed frequencies
in a given dataset with expected frequencies (i.e. the frequencies one would
expect to find if there were no differences in the distribution of the data). The
higher the chi-square value, the more signicant the differences are.
The application of the chi-square test to the frequencies in table 5.5 yielded
a value of 65.686. To interpret this number accurately, one first of all needs to
know the degrees of freedom in a given dataset (i.e. the number of data points
Table 5.7 Chi-square results for differences in the distribution of pseudo-titles
and corresponding appositives in the samples from ICE components
Statistical test Value Degrees of freedom Significance level
Chi-square 65.686 3 p < .000
that may vary). Since table 5.5 contains four rows and two columns, the degrees
of freedom can be calculated using the formula below:

(4 − 1) × (2 − 1) = 3 degrees of freedom

With three degrees of freedom, the chi-square value of 65.686 is significant at
less than the .000 level.
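The same omnibus test can be run outside a statistical package such as SPSS. A minimal sketch with Python's scipy (my illustration; the frequencies are those of table 5.5):

```python
from scipy.stats import chi2_contingency

# Table 5.5: pseudo-titles vs. corresponding appositives by variety
observed = [[59, 51],   # USA
            [83, 38],   # Phil
            [82, 31],   # NZ
            [23, 78]]   # GB

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)   # roughly 65.69 with 3 degrees of freedom, p < .0001
```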
While it is generally accepted that any level below .05 indicates statistical
significance, it is quite common for more stringent significance levels to be
employed (e.g. p ≤ .001). Because the significance level for the data in table 5.7
is considerably below either of these levels, it can be safely and reliably assumed
that there are highly significant differences in the distributions of pseudo-titles
and appositives across the four varieties of English represented in the table.
The chi-square test applied to the data in table 5.5 simply suggests that
there are differences in the use of pseudo-titles in the four national varieties
of English being investigated. The chi-square test says nothing about differ-
ences between the individual varieties (e.g. whether ICE-USA differs from
ICE-NZ). To be more precise about how the individual varieties differ from
one another, it is necessary to compare the individual varieties themselves in
a series of 2 × 2 chi-square tables. However, in examining a single dataset as
exhaustively as this, it is important to adjust the level that must be reached for
statistical significance because, as Sigley (1997: 231) observes, "If . . . many
tests are performed on the same data, there is a risk of obtaining spuriously
significant results." This adjustment can be made using the Bonferroni correc-
tion, which determines the appropriate signicance level by dividing the level
of signicance used in a given study by the number of different statistical tests
applied to the dataset. The Bonferroni-corrected critical value for the ICE data
being examined is given below and is based on the fact that to compare all
four ICE components individually, six different chi-square tests will have to be
performed:
.05 (significance level) / 6 (number of tests performed) = .0083 (corrected value)
Table 5.8 contains the results of the comparison, from most significant differ-
ences down to least significant differences.
The results in table 5.8 illustrate some notable differences in the use of
pseudo-titles and equivalent appositives in the various national varieties. First
of all, the use of these constructions in ICE-GB is very different from their
use in the other varieties: the levels of significance are very high and reflect
the deeply ingrained stigma against the use of pseudo-titles in British press re-
portage, a stigma that does not exist in the other varieties. Second, even though
Table 5.8 Comparison of the distribution of pseudo-titles and corresponding
appositives in individual ICE components
Countries Statistical test Degrees of freedom Value6 Significance level
NZ and GB Chi-square 1 50.938 p < 0.0001
Phil and GB Chi-square 1 44.511 p < 0.0001
US and GB Chi-square 1 19.832 p < 0.0001
US and NZ Chi-square 1 7.796 p = .005
US and Phil Chi-square 1 4.830 p = .028 (non-sig.)
Phil and NZ Chi-square 1 .273 p = .601 (non-sig.)
pseudo-titles may have originated in American press reportage, their use is
more widespread in ICE-NZ and ICE-Phil, though with the Bonferroni correc-
tion the values fall just short of the level of significance needed to indicate a
difference between ICE-USA and ICE-Phil. Finally, there were no significant differences
between ICE-NZ and ICE-Phil. These results indicate that pseudo-title usage
is widespread, even in British-influenced varieties such as New Zealand En-
glish, and that there is a tendency for pseudo-titles to be used more widely than
equivalent appositives in those varieties other than British English into which
they have been transplanted from American English.
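As a sketch of how such a battery of pairwise tests might be run programmatically (my own illustration, using Python's scipy rather than SPSS), the following applies Yates' correction to each 2 × 2 table, as footnote 6 below recommends, and checks each p-value against the Bonferroni-corrected threshold:

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Pseudo-title / appositive counts per variety (table 5.5)
counts = {"USA": (59, 51), "Phil": (83, 38), "NZ": (82, 31), "GB": (23, 78)}
alpha = 0.05 / 6   # Bonferroni correction for six pairwise comparisons

for a, b in combinations(counts, 2):
    # correction=True applies Yates' correction in 2 x 2 tables
    chi2, p, dof, _ = chi2_contingency([counts[a], counts[b]], correction=True)
    verdict = "significant" if p < alpha else "non-significant"
    print(f"{a} vs. {b}: chi-square = {chi2:.3f}, p = {p:.4f} ({verdict})")
```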
While the chi-square statistic is very useful for evaluating corpus data, it
does have its limitations. If the analyst is dealing with fairly small numbers
resulting in either empty cells or cells with low frequencies, then the reliability
of chi-square is reduced. Table 5.9 lists the correspondence relationships for
appositives in the four varieties of English examined.
Three of the cells in the category of total equivalence contain fewer than
five occurrences, making the chi-square statistic invalid for the data in table
5.9. One way around this problem is to combine variables in a principled
manner to increase the frequency for a given cell and thus make the results
Table 5.9 Correspondence relationships for appositives in the samples from
ICE components
Country Total equivalence Determiner deletion Partial equivalence Total
USA 1 (2%) 14 (28%) 36 (71%) 51 (101%)
Phil 1 (2.6%) 8 (21%) 29 (76%) 38 (100%)
NZ 0 (0%) 13 (42%) 18 (58%) 31 (100%)
GB 8 (10%) 22 (28%) 48 (62%) 78 (100%)
Total 10 (5%) 57 (29%) 131 (66%) 198 (100%)
6 In a 2 × 2 chi-square table, as Sigley (1997: 226) observes, the distribution is binomial rather
than continuous. It is therefore customary in a 2 × 2 table to use Yates' correction rather than the
normal Pearson chi-square value.
Table 5.10 Correspondence relationships for appositives in the samples from
ICE components (with combined cells)
Country Total equivalence/determiner deletion Partial equivalence Total
USA 15 (29%) 36 (71%) 51 (100%)
Phil 9 (24%) 29 (76%) 38 (100%)
NZ 13 (42%) 18 (58%) 31 (100%)
GB 30 (39%) 48 (62%) 78 (101%)
Total 67 (34%) 131 (66%) 198 (100%)
Statistical test Value Degrees of freedom Significance level
Chi-square 3.849 3 p = .278
of the chi-square statistic more valid. As was noted in section 5.1, one rea-
son for recording the particular correspondence relationship for an apposi-
tive was to study the stylistic relationship between pseudo-titles and various
types of equivalent appositives: to determine, for instance, whether a newspa-
per prohibiting pseudo-titles relied more heavily than those newspapers allow-
ing pseudo-titles on appositives related to pseudo-titles by either determiner
deletion (e.g. the acting director, Georgette Smith → acting director Geor-
gette Smith) or total equivalence (Georgette Smith, acting director → acting
director Georgette Smith). Because these two correspondence relationships in-
dicate similar stylistic choices, it is justifiable to combine the results for both
choices to increase the frequencies and make the chi-square test for the data
more valid.
Table 5.10 contains the combined results for the categories of total equiv-
alence and determiner deletion. This results in cells with high enough fre-
quencies to make the chi-square test valid. The results indicate, however, that
there was really no difference between the four varieties in terms of the cor-
respondence relationships that they exhibited: the chi-square value (3.849) is
relatively low and as a result the significance level (.278) is above the level
necessary for statistical significance.
It was expected that ICE-GB would contain more instances of appositives ex-
hibiting either total equivalence or determiner deletion, since in general British
newspapers do not favor pseudo-titles and would therefore favor alternative ap-
positive constructions. And indeed the newspapers in ICE-GB did contain more
instances. But the increased frequencies are merely a consequence of the fact
that, in general, the newspapers in ICE-GB contained more appositives than
the other varieties. Each variety followed a similar trend and contained fewer
appositives related by total equivalence or determiner deletion and more related
by partial equivalence. These findings call into question Bell's (1988) notion
Table 5.11 The length of pseudo-titles in the various components of ICE
Country 1–4 words 5 or more words Total
USA 57 (97%) 2 (3%) 59 (100%)
Phil 71 (86%) 12 (15%) 83 (101%)
NZ 66 (81%) 16 (20%) 82 (101%)
GB 23 (100%) 0 (0%) 23 (100%)
Total 217 (88%) 30 (12%) 247 (100%)
Statistical test Value Degrees of freedom Significance level
Chi-square 12.005 3 p = .007
Likelihood ratio 15.688 3 p = .001
of determiner deletion, since overall such constructions were not that com-
mon and whether a newspaper allowed or disallowed pseudo-titles had little
effect on the occurrence of appositives related by determiner deletion. Having
a greater effect on the particular correspondence relation that was found was
whether the appositive contained a genitive noun phrase or some kind of post-
modification, structures that led to a partial correspondence with a pseudo-title
and that occurred very commonly in all varieties.
While combining values for variables can increase cell values, often such
a strategy does not succeed simply because so few constructions occur in a
particular category. In such cases, it is necessary to select a different statistical
test to evaluate the results. To record the length of a pseudo-title or appositive,
the original coding system had six values: one word in length, two words, three
words, four words, ve words, and six or more words. It turned out that this
coding scheme was far too delicate and made distinctions that simply did not
exist in the data: many cells had frequencies that were too low to apply the chi-
square test. And combining categories, as is done in table 5.11, still resulted
in two cells with frequencies lower than five, making the chi-square results for
this dataset invalid.
In cases like this, it is necessary to apply a different statistical test: the log-
likelihood (or G²) test. Dunning (1993: 65–6) has argued that, in general, this
test is better than the chi-square test because it can be applied to "very much
smaller volumes of text . . . [and enable] comparisons to be made between the
significance of the occurrences of both rare and common phenomena." Dunning
(1993: 62–3) notes that the chi-square test was designed to work with larger
datasets that have items that are more evenly distributed, not with corpora
containing what he terms "rare events" (e.g. two instances in ICE-USA of
pseudo-titles lengthier than five words). Applied to the data in table 5.11, the
log-likelihood test (termed the "likelihood ratio" in SPSS parlance) confirmed
that the length of pseudo-titles varied by variety.
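scipy exposes the same statistic through its power-divergence option, so the likelihood-ratio test on table 5.11 can be sketched as follows (my illustration; the text's analyses were run in SPSS):

```python
from scipy.stats import chi2_contingency

# Table 5.11: length of pseudo-titles by variety
observed = [[57, 2],    # USA: 1-4 words, 5 or more words
            [71, 12],   # Phil
            [66, 16],   # NZ
            [23, 0]]    # GB

# lambda_="log-likelihood" selects the G-squared statistic
g2, p, dof, _ = chi2_contingency(observed, lambda_="log-likelihood")
print(g2, dof, p)   # roughly 15.7 with 3 degrees of freedom, p = .001
```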
The results of the log-likelihood test point to a clear trend in table 5.11:
that lengthier pseudo-titles occur more frequently in ICE-Phil and NZ than in
ICE-USA and GB. In fact, ICE-GB had no pseudo-titles lengthier than five
Table 5.12 The length of appositives in the various components of ICE
Country 1–4 words 5 or more words Total
USA 22 (43%) 29 (57%) 51 (100%)
Phil 14 (37%) 24 (63%) 38 (100%)
NZ 14 (45%) 17 (55%) 31 (100%)
GB 32 (41%) 46 (59%) 78 (100%)
Total 82 (41%) 116 (59%) 198 (100%)
Statistical test Value Degrees of freedom Significance level
Chi-square .574 3 p = .902
words, and ICE-USA had only two instances. These findings are reflected in
the examples in (14) and (15), which contain pseudo-titles lengthier than five
words that occurred predominantly in newspapers in ICE-Phil and ICE-NZ.
(14) a. Salamat and Presidential Adviser on Flagship Projects in Mindanao Robert
Aventajado (ICE-Philippines)
b. Time Magazine Asia bureau chief Sandra Burton (ICE-Philippines)
c. Marikina Metropolitan Trial Court judge Alex Ruiz (ICE-Philippines)
d. MILF Vice Chairman for Political Affairs Jadji Murad (ICE-Philippines)
e. Autonomous Region of Muslim Mindanao police chief Damming Unga (ICE-
Philippines)
(15) a. Oil and Gas planning and development manager Roger O'Brien (ICE-NZ)
b. New Plymouth Fire Services deputy chief fire officer Graeme Moody (ICE-NZ)
c. corporate planning and public affairs executive director Graeme Wilson (ICE-
NZ)
d. Federated Gisborne-Wairoa provincial president Richard Harris (ICE-NZ)
e. Wesley and former New Zealand coach Chris Grinter (ICE-NZ)
The pseudo-title is a relatively new and evolving structure in English. There-
fore, it is to be expected that its usage will show variation, in this case in the
length of pseudo-titles in the various components of ICE under investigation.
The appositive, on the other hand, is a well-established construction in English,
and if the length of appositives is considered, there were no differences between
the varieties, as is illustrated in table 5.12. Table 5.12 demonstrates that it is
more normal for appositives to be lengthier, and that while ICE-GB has more
appositives than the other varieties, the proportion of appositives of varying
lengths is similar to the other varieties.
One reason for the general difference in length of appositives and pseudo-
titles is that there is a complex interaction between the form of a given pseudo-
title or appositive and its length. In other words, three variables are interacting:
type (pseudo-title or appositive), form (simple noun phrase, genitive noun
phrase, noun phrase with post-modification), and length (one to four words
or ve words or more). Table 5.13 provides a cross tabulation of all of these
variables.
Table 5.13 The form and length of pseudo-titles and corresponding
appositives
Type Form 1–4 words 5 or more words Total
PT Simple NP 216 (90%) 23 (10%) 239 (100%)
Gen. NP 0 (0%) 0 (0%) 0 (0%)
Post. Mod. 1 (13%) 7 (87%) 8 (100%)
Total 217 (88%) 30 (12%) 247 (100%)
Appos Simple NP 52 (84%) 10 (16%) 62 (100%)
Gen. NP 18 (67%) 9 (33%) 27 (100%)
Post. Mod. 12 (11%) 97 (89%) 109 (100%)
Total 82 (41%) 116 (59%) 198 (100%)
A chi-square analysis of the trends in table 5.13 would be invalid not only
because some of the cells have values lower than five but because the chi-square
test cannot pinpoint specifically which variables are interacting. To determine
what the interactions are, it is more appropriate to conduct a loglinear analysis
of the results.
A loglinear analysis considers interactions between variables: whether, for
instance, there is an interaction between type, form, and length; between
type and form; between form and length; and so forth. In setting up
a loglinear analysis, one can either investigate a predetermined set of associa-
tions (i.e. only those associations that the analyst thinks exist in the data), or
base the analysis on a saturated model: a model that considers every possible
interaction the variables would allow. The drawback of a saturated model, as
Oakes (1998: 38) notes, is that "because it includes all the variables and inter-
actions required to account for the original data, there is a danger that we will
select a model that is too good . . . [and that finds] spurious relationships." That
is, when all interactions are considered, it is likely that some of the significant
interactions found will be coincidental. Thus, it is important to find
linguistic motivations for any significant associations that are found.
Because only three variables were being compared, it was decided to use a
saturated model to investigate associations. This model generated the following
potential associations:
(16) a. type*form*length
b. type*form
c. type*length
d. form*length
e. form
f. type
g. length
Table 5.14 Associations between various variables
K Degrees of freedom Likelihood ratio Probability Probability
3 2 .155 .9254 .9300
2 7 488.010 .0000 .0000
1 11 825.593 .0000 .0000
Likelihood ratio and chi-square tests were conducted to determine whether there
was a significant association between all three variables (16a), and between all
possible combinations of two-way interactions (16b–d). In addition, the vari-
ables were analyzed individually to determine the extent to which they affected
the three- and two-way associations in (16a–d). The results are presented in table
5.14.
The first line in table 5.14 demonstrates that there were no associations
between the three variables: the likelihood ratio score had a probability well
above .05. On the other hand, there were significant associations between the
two-way and one-way variables.
To determine which of these associations were strongest, a procedure called
backward elimination was applied to the results. This procedure works in a
step-by-step manner, at each stage removing from the analysis an association
that is least strong and then testing the remaining associations to see which is
strongest. This procedure produced the two associations in table 5.15 as being
the strongest of all the associations tested. Interpreted in conjunction with the
frequency distributions in table 5.13, the results in table 5.15 suggest that
while appositives are quite diverse in their linguistic form, pseudo-titles are
not. Even though a pseudo-title and corresponding appositive have roughly the
same meaning, a pseudo-title is mainly restricted to being a simple noun phrase
that is, in turn, relatively short in length. In contrast, the unit of an appositive
corresponding to a pseudo-title can be not just a simple noun phrase but a
genitive noun phrase or a noun phrase with post-modication as well.
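The two associations in table 5.15 can be approximated outside a dedicated loglinear routine by computing likelihood-ratio statistics on the corresponding two-way tables collapsed from table 5.13. The sketch below is my own illustration in Python; SPSS's backward elimination computes partial associations, so the match to table 5.15 is approximate, though the values come out very close to those reported:

```python
from scipy.stats import chi2_contingency

# Two-way tables collapsed from table 5.13
# columns: simple NP, genitive NP, NP with post-modification
type_by_form = [[239, 0, 8],      # pseudo-titles
                [62, 27, 109]]    # appositives
length_by_form = [[268, 18, 13],  # 1-4 words
                  [33, 9, 104]]   # 5 or more words

for name, table in [("type*form", type_by_form),
                    ("length*form", length_by_form)]:
    g2, p, dof, _ = chi2_contingency(table, lambda_="log-likelihood")
    print(f"{name}: G2 = {g2:.3f}, df = {dof}, p = {p:.4f}")
```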
These linguistic differences are largely a consequence of the fact that the
structure of a pseudo-title is subject to the principle of end-weight (Quirk
et al. 1985: 1361–2). This principle stipulates that heavier constituents are best
placed at the end of a structure, rather than at the beginning of it. A pseudo-
title will always come at the start of the noun phrase in which it occurs. The
lengthier and more complex the pseudo-title, the more unbalanced the noun
Table 5.15 Strongest associations between variables
Degrees of freedom Likelihood ratio Probability
Type*form 2 246.965 .0000
Length*form 2 239.067 .0000
phrase will become. Therefore, pseudo-titles typically have forms (e.g. simple
noun phrases) that are short and non-complex structures, though as table
5.11 illustrated, usage does vary by national variety. In contrast, an appositive
consists of two units, one of which corresponds to a pseudo-title. Because this
unit is independent of the proper noun to which it is related – in speech it
occupies a separate tone unit, in writing it is separated by a comma from the
proper noun to which it is in apposition – it is not subject to the end-weight
principle. Consequently, the unit of an appositive corresponding to a pseudo-
title has more forms of varying lengths.
The loglinear analysis applied to the data in table 5.13 is very similar to the
logistic regression models used in the Varbrul programs: IVARB (for MS-DOS)
and GoldVarb (for the Macintosh) (cf. Sankoff 1987; Sigley 1997: 238–52).
These programs have been widely used in sociolinguistics to test the interaction
of linguistic variables. For instance, Tagliamonte and Lawrence (2000) used
GoldVarb to examine which of seven linguistic variables favored the use of
three linguistic forms to express the habitual past: a simple past-tense verb,
used to, or would. Tagliamonte and Lawrence (2000: 336) found, for instance,
that the type of subject used in a clause signicantly affected the choice of verb
form: the simple past was used if the subject was a second-person pronoun,
used to was used if the subject was a first-person pronoun, and would was
used if the subject was a noun phrase with a noun or third-person pronoun as
head.
Although the Varbrul programs have been used primarily in sociolinguistics
to study the application of variable rules, Sigley (1997) demonstrates the value
of the programs for corpus analysis as well in his study of relative clauses in
the Wellington Corpus of New Zealand English. The advantage of the Varbrul
programs is that they were designed specifically for use in linguistic analy-
sis and are thus easier to use than generic statistical packages, such as SPSS
or SAS. But it is important to realize that these statistical packages, as the
loglinear analysis in this section demonstrated, can replicate the kinds of statis-
tical analyses done by the Varbrul programs. Moreover, as Oakes (1998: 151)
demonstrates, these packages can perform a range of additional statistical anal-
yses quite relevant to the concerns of corpus linguists, from those, such as the
Pearson product-moment correlation, that test correlations between variables to regres-
sion tests, which test the effects that independent variables have on dependent
variables.
5.6 Conclusions
To conduct a corpus analysis effectively, the analyst needs to plan out
the analysis carefully. It is important, first of all, to begin the process with a very
clear research question in mind, so that the analysis involves more than simply
counting linguistic features. It is next necessary to select the appropriate
corpus for analysis: to make sure, for instance, that it contains the right kinds
of texts for the analysis and that the corpus samples to be examined are lengthy
enough. And if more than one corpus is to be compared, the corpora must
be comparable, or the analysis will not be valid. After these preparations are
made, the analyst must find the appropriate software tools to conduct the study,
code the results, and then subject them to the appropriate statistical tests. If
all of these steps are followed, the analyst can rest assured that the results
obtained are valid and the generalizations that are made have a solid linguistic
basis.
Study questions
1. What is the danger of beginning a corpus analysis without a clearly thought-
out research question in mind?
2. How does the analyst determine whether a given corpus is appropriate for
the corpus analysis to be conducted?
3. What kinds of analyses are most efficiently conducted with a concordancing
program?
4. What kinds of information can be found in a tagged or parsed corpus that
cannot be found in a lexical corpus?
5. The data in the table below are adapted from a similar table in Meyer (1996:
38) and contain frequencies for the distribution of phrasal (e.g. John and
Mary) and clausal (e.g. We went to the store and our friends bought some
wine) coordination in various samples of speech and writing from the Inter-
national Corpus of English. Go to the web page given below at Georgetown
University and use the Web Chi-square Calculator on the page to deter-
mine whether there is a difference between speech and writing in the distribu-
tion of phrasal and clausal coordination: http://www.georgetown.edu/cball/
webtools/web_chi.html.
Syntactic structures in speech and writing
Medium Phrases Clauses Total
Speech 168 (37%) 289 (63%) 457 (100%)
Writing 530 (77%) 154 (23%) 684 (100%)
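For readers without access to the web calculator, the same test can also be run locally; here is a sketch with Python's scipy (my own illustration, not part of the original exercise):

```python
from scipy.stats import chi2_contingency

# Phrasal vs. clausal coordination in speech and writing (study question 5)
observed = [[168, 289],   # speech
            [530, 154]]   # writing

chi2, p, dof, _ = chi2_contingency(observed)  # Yates' correction applied (2 x 2)
print(chi2, dof, p)
```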
6 Future prospects in corpus linguistics
In describing the complexity of creating a corpus, Leech (1998: xvii)
remarks that a great deal of spadework has to be done before the research
results [of a corpus analysis] can be harvested. Creating a corpus, he comments,
always takes twice as much time, and sometimes ten times as much effort
because of all the work that is involved in designing a corpus, collecting texts,
and annotating them. And then, after a given period of time, Leech (1998: xviii)
continues, the corpus becomes out of date, requiring the corpus creator to
discard the concept of a static corpus of a given length, and to continue to collect
and store corpus data indefinitely into the future . . . The process of analyzing a
corpus may be easier than the description Leech (1998) gives above of creating
a corpus, but still, many analyses have to be done manually, simply because we
do not have the technology that can extract complex linguistic structures from
corpora, no matter how extensively they are annotated. The challenge in corpus
linguistics, then, is to make it easier both to create and analyze a corpus. What
is the likelihood that this will happen?
Planning a corpus. As more and more corpora have been created, we have
gained considerable knowledge of how to construct a corpus that is balanced and
representative and that will yield reliable grammatical information. We know,
for instance, that what we plan to do with a corpus greatly determines how it is
constructed: vocabulary studies necessitate larger corpora, grammatical studies
(at least of relatively frequently occurring grammatical constructions) shorter
corpora. The British National Corpus is the culmination of all the knowledge
we have gained since the 1960s about what makes a good corpus.
But while it is of prime importance to descriptive corpus linguistics to create
valid and representative corpora, in the field of natural language processing this
is an issue of less concern. Obviously, the two fields have different interests: it
does not require a balanced and representative corpus to train a parser or speech-
recognition system. But it would greatly benefit the field of corpus linguistics if
descriptive corpus linguists and more computationally oriented linguists and en-
gineers worked together to create corpora. The British National Corpus is a good
example of the kind of corpus that can be created when descriptive linguists,
computational linguists, and the publishing industry cooperate. The TalkBank
Project at Carnegie Mellon University and the University of Pennsylvania is a
multi-disciplinary effort designed to organize varying interest groups engaged
in the computational study of human and animal communication. One of the
interest groups, Linguistic Exploration, deals with the creation and annotation of
corpora for purposes of linguistic research. The Michigan Corpus of Academic
Spoken English (MICASE) is the result of a collaborative effort involving both
linguists and digital librarians at the University of Michigan. Cross-disciplinary
efforts such as these integrate the linguistic and computational expertise that
exists among the various individuals creating corpora, they help increase the
kinds and types of corpora that are created, and they make best use of the limited
resources available for the creation of corpora.
Collecting and computerizing written texts. Because so many written texts
are now available in computerized formats in easily accessible media, such as
the World Wide Web, the collection and computerization of written texts has
become much easier than in the past. It is no longer necessary for every written
text to be typed in by hand or scanned with an optical scanner and then the
scanning errors corrected. If texts are gathered from the World Wide Web, it
is still necessary to strip them of html formatting codes. But this process can
be automated with software that removes such markup from texts. Creating
a corpus of written texts is now an easy and straightforward enterprise. The
situation is more complicated for those creating corpora containing texts from
earlier periods of English: early printed editions of books are difficult to scan
optically with any degree of accuracy; manuscripts that are handwritten need
to be manually retyped and the corpus creator must sometimes travel to the
library where the manuscript is housed. This situation might be eased in coming
years. There is an increased interest both among historical linguists and literary
scholars in computerizing texts from earlier periods of English. As projects
such as the Canterbury Project are completed, we may soon see an increased
number of computerized texts from earlier periods.
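As promised above, here is a minimal sketch of the kind of markup-stripping routine the paragraph describes, using only Python's standard library (the class and its name are my own illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, discarding all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><p>A <b>sample</b> text.</p></body></html>")
print(extractor.text())   # A sample text.
```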
Collecting and computerizing spoken texts. While it is now easier to prepare
written texts for inclusion in a corpus, there is little hope for making the col-
lection and transcription of spoken texts easier. For the foreseeable future, it
will remain an arduous task to find people who are willing to be recorded, to
make recordings, and to have the recordings transcribed. There are advantages
to digitizing spoken samples and using specialized software to transcribe them,
but still the transcriber has to listen to segments of speech over and over again
to achieve an accurate transcription. Advances in speech recognition might au-
tomate the transcription of certain kinds of speech (e.g. speeches and perhaps
broadcast news reports), but no software will be able to cope with the dysflu-
ency of a casual conversation. Easing the creation of spoken corpora remains
one of the great challenges in corpus linguistics, a challenge that will be with
us for some time in the future.
Copyright restrictions. Obtaining the rights to use copyrighted material has
been a perennial problem in corpus linguistics. The first release of the British
National Corpus could not be obtained by anyone outside the European Union
because of restrictions placed by copyright holders on the distribution of certain
written texts. The BNC Sampler and second release of the entire corpus have
no distribution restrictions but only because the problematic texts are not in-
cluded in these releases. The distribution of ARCHER has been on hold because
it has not been possible to obtain copyright permission for many of the texts
included in the corpus. As a result, access to the corpus is restricted to those
who participated in the actual creation of the corpus a method of distribution
that does not violate copyright law. It is unlikely that this situation will ease
in the future. While texts are more widely available in electronic form, partic-
ularly on the World Wide Web, getting permission to use these texts involves
the same process as getting permission for printed texts, and current trends in
the electronic world suggest that access to texts will be more restrictive in the
future, not less. Therefore, the problem of copyright restrictions will continue
to haunt corpus linguists for the foreseeable future.
Annotating texts with structural markup. The development of SGML-based
annotation systems has been one of the great advances in corpus linguistics,
standardizing the annotation of many features of corpora so that they can be
unambiguously transferred from computer to computer. The Text Encoding
Initiative (TEI) has provided a system of corpus annotation that is both detailed
and flexible, and the introduction of XML (the successor to HTML) to the field
of corpus linguistics will eventually result in corpora that can be made available
on the World Wide Web. There exist tools to help in the insertion of SGML-
conformant markup to corpora, and it is likely that such tools will be improved
in the future. Nevertheless, much of this annotation has to be inserted manually,
requiring hours of work on the part of the corpus creator. We will have much
better annotated corpora in the future, but it will still be a major effort to insert
this annotation into texts.
Tagging and parsing. Tagging is now a standard part of corpus creation, and
taggers are becoming increasingly accurate and easy to use. There will always be
constructions that will be difficult to tag automatically and will require human
intervention to correct, but tagged corpora should become more widespread
in the future – we may even see the day when every corpus released has been
tagged.
Parsing is improving too, but has a much lower accuracy than tagging. There-
fore, much human intervention is required to correct a parsed text, and we have
not yet reached the point where a team not responsible for designing a parser
can use it effortlessly. One reason the British component of ICE (ICE-GB) took
nearly ten years to complete was that considerable effort had to be expended
correcting the output of the parsing of the corpus, particularly the spoken part.
At some time in the future, parsing may be as routine as tagging, but because a
parser has a much more difcult job than a tagger, we have some time to wait
before parsed corpora will be widely available.
Text analysis. The most common text analysis program for corpora, the con-
cordancer, has become an established fixture for the analysis of corpora. There
are many such programs available for use on PCs, Macintoshes, and even the
World Wide Web. Such programs are best for retrieving sequences of strings
Future prospects in corpus linguistics 141
(such as words), but many can now search for particular tags in a corpus, and if
a corpus contains file header information, some concordancing programs can
sort les so that the analyst can specify what he or she wishes to analyze in a
given corpus: journalistic texts, for instance, but not other kinds of texts.
More sophisticated text analysis programs, such as ICECUP, are rare, and
it is not likely that we will see as many programs of this nature in the future
as concordancers. And a major problem with programs such as ICECUP and
many concordancers is that they were designed to work on a specic corpus
computerized in a specic format. Consequently, ICECUP works only on the
British component of the International Corpus of English, and Sara on the
BNC (though there are plans to extend the use of Sara to other corpora as well).
The challenge is to systemize the design of corpora and concordancers so that
any concordancer can work on any corpus. Of course, it is highly likely that
the next generation of corpus linguists will have a much better background in
programming. Thus, these corpus linguists will be able to use their knowledge
of languages such as Perl or Visual Basic to write specic scripts to analyze
texts, and as these scripts proliferate, they can be passed from person to person
and perhaps make obsolete the need for specic text analysis programs to be
designed.
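To give a flavor of what such a script might look like, here is a minimal keyword-in-context concordancer of my own (in Python rather than Perl or Visual Basic; the function and its defaults are illustrative only):

```python
import re

def concordance(text, keyword, width=30):
    """Print a keyword-in-context (KWIC) line for every match in `text`."""
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"{left:>{width}}[{m.group()}]{right:<{width}}")

sample = "The corpus was tagged. A corpus analysis requires a corpus."
concordance(sample, "corpus")
```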
Corpus linguistics has been one of the more exciting methodological devel-
opments in linguistics since the Chomskyan revolution of the 1950s. It reflects
changing attitudes among many linguists as to what constitutes an adequate
empirical study of language, and it has drawn upon recent developments in
technology to make feasible the kinds of empirical analyses of language that cor-
pus linguists wish to undertake. Of course, doing a corpus analysis will always
involve work – more work than sitting in one's office or study and making up the
data for a particular analysis – but doing a corpus analysis properly will always
have its rewards and will help us advance the study of human language, an area
of study that linguists of all persuasions would agree we still know relatively
little about.
Appendix 1
Corpus resources
Cross references to resources listed in this table are indicated in boldface. The
various resources are alphabetized by acronym or full name, depending upon which
usage is most common.
The publisher has used its best endeavors to ensure that the URLs for external websites
referred to in this book are correct and active at the time of going to press. However, the
publisher has no responsibility for the websites and can make no guarantee that a site
will remain live or that the content is or will remain appropriate.
Resource Description Availability
American National Corpus Currently in progress; is
intended to contain spoken
and written texts that
model as closely as
possible the texts in the
British National Corpus
Project website:
http://www.cs.vassar.edu/
ide/anc/
American Publishing
House for the Blind Corpus
25 million words of edited
written American English
Originally created by IBM;
described in Fillmore
(1992)
ARCHER (A
Representative Corpus of
English Historical
Registers)
1.7 million words
consisting of various
genres of British and
American English covering
the period 1650–1990
In-house corpus (due to
copyright restrictions); an
expanded version,
ARCHER II, is underway
Bank of English Corpus 415 million words of
speech and writing (as of
October 2000); texts are
continually added
Collins-Cobuild:
http://titania.cobuild.collins.
co.uk/boe info.html
Bergen Corpus of London
Teenage English (COLT)
500,000-word corpus of
the speech of London
teenagers from various
boroughs; available online
or as part of the British
National Corpus
Project website:
http://www.hd.uib.no/colt/
Birmingham Corpus 20 million words of written
English
Evolved into Bank of
English Corpus
British National Corpus
(BNC)
100 million words of
samples of varying length
containing spoken (10
million words) and written
(90 million words) British
English
BNC website:
http://info.ox.ac.uk/bnc/
index.html
British National Corpus
(BNC) Sampler
2 million words of speech
and writing representing
184 samples taken from the
British National Corpus
BNC website:
http://info.ox.ac.uk/bnc/
getting/sampler.html
Brown Corpus One million words of
edited written American
English; created in 1961;
divided into 2,000-word
samples from various
genres (e.g. press
reportage, fiction,
government documents)
See: ICAME CD-ROM
Cambridge International
Corpus
100 million words of
varying amounts of spoken
and written British and
American English, with
additional texts being
added continuously
CUP website:
http://uk.cambridge.org/elt/
reference/cic.htm
Cambridge Learners'
Corpus
10 million words of student
essay exams, with
additional texts being
added continuously
CUP website:
http://uk.cambridge.org/elt/
reference/clc.htm
Canterbury Project Ultimate goal of the project
is to make available in
electronic form all versions
of the Canterbury Tales
and to provide an interface
to enable, for instance,
comparisons of the various
versions
Project website:
http://www.cta.dmu.ac.uk/
projects/ctp/
Chemnitz Corpus A parallel corpus of
English and German
translations
Project website:
http://www.tu-chemnitz.de/
phil/english/real/
transcorpus/index.htm
Child Language Data
Exchange System
(CHILDES) Corpus
Large multi-lingual
database of spoken
language from children and
adults engaged in first or
second language
acquisition
http://childes.psy.cmu.edu/
See also: MacWhinney
(2000)
Corpora Discussion List
(made available through
the Norwegian Computing
Centre for the Humanities)
Internet discussion list for
issues related to corpus
creation, analysis, tagging,
parsing, etc.
http://www.hit.uib.no/
corpora/welcome.txt
Corpus of Early
English Correspondence
Two versions of English
correspondence: the full
version (2.7 million words)
and a sampler version
(450,000 words)
Project website:
http://www.eng.helsinki.fi/
doe/projects/ceec/
corpus.htm
Sampler version available
on ICAME CD-ROM
Corpus of Middle English
Prose and Verse
A large collection of
Middle English texts
available in electronic
format
Project website:
http://www.hti.umich.edu/c/
cme/about.html
Corpus of Spoken
Professional English
Approximately 2 million
words taken from spoken
transcripts of academic
meetings and White House
press conferences
Athelstan website:
http://www.athel.com/
cpsa.html
The Electronic Beowulf A digital version of the Old
English poem Beowulf that
can be searched
Project website:
http://www.uky.edu/
kiernan/eBeowulf/
guide.htm
English–Norwegian
Parallel Corpus
A parallel corpus of
English and Norwegian
translations: 30 samples of
fiction and 20 samples of
non-fiction in the original
and in translation
Project website:
http://www.hf.uio.no/iba/
prosjekt/
The Expert Advisory
Group on Language
Engineering Standards
(EAGLES)
Has developed A Corpus
Encoding Standard
containing guidelines for
the creation of corpora
Project website:
http://www.cs.vassar.edu/
CES/
FLOB (Freiburg–
Lancaster–Oslo–
Bergen) Corpus
One million words of
edited written British
English published in 1991;
divided into 2,000-word
samples in varying genres
intended to replicate the
LOB Corpus
See: ICAME CD-ROM
FROWN (Freiburg–Brown)
Corpus
One million words of
edited written American
English published in 1991;
divided into 2,000-word
samples in varying genres
intended to replicate the
Brown Corpus
See: ICAME CD-ROM
Helsinki Corpus Approximately 1.5 million
words of Old, Middle, and
Early Modern English
divided into samples of
varying length
See: ICAME CD-ROM
Helsinki Corpus of Older
Scots
Approximately 400,000
words of transcribed
speech (recorded in the
1970s) from four rural
dialects in England and
Ireland
See: ICAME CD-ROM
Hong Kong University of
Science and Technology
Learner Corpus
25 million words of learner
English written by
first-year university
students whose native
language is Chinese
Contact:
John Milton, Project
Director, [email protected]
ICAME Bibliography Extensive bibliography of
corpus-based research
created by Bengt Altenberg
(Lund University, Sweden)
ICAME website:
1989:
http://www.hd.uib.no/icame
/icame-bib2.txt
1990–8:
http://www.hd.uib.no/icame
/icame-bib3.htm
ICAME CD-ROM 20 different corpora (e.g.
Brown, LOB, Helsinki) in
various computerized
formats (DOS, Windows,
Macintosh and Unix)
ICAME website:
http://www.hit.uib.no/icame
/cd/
International Corpus of
English (ICE)
A variety of million-word
corpora (600,000 words of
speech, 400,000 words of
writing) representing the
various national varieties
of English (e.g. American,
British, Irish, Indian, etc.)
Three components now
complete:
Great Britain:
http://www.ucl.ac.uk/
english-usage/ice-
gb/index.htm
East Africa: http://www.tu-
chemnitz.de/phil/english/
real/eafrica/corpus.htm
New Zealand:
http://www.vuw.ac.nz/lals/
corpora.htm#The New
Zealand component of the
International
ICECUP (ICE Corpus
Utility Program)
Text retrieval software for
use with ICE-GB
Survey of English Usage
website:
http://www.ucl.ac.
uk/english-usage/ice-gb
/icecup.htm
ICE-GB (British
component of the
International Corpus of
English)
One million words of
spoken and written British
English fully tagged and
parsed
See: International Corpus
of English
International Corpus of
Learner English (ICLE)
Approximately 2 million
words of written English
composed by non-native
speakers of English from
14 different linguistic
backgrounds
Project website:
http://www.fltr.ucl.ac.be/FL
TR/GERM/ETAN/CECL/
introduction.html
Lampeter Corpus Approximately 1.1 million
words of Early Modern
English tracts and
pamphlets taken from
various genres (e.g.
religion, politics) from the
period 1640–1740;
contains complete texts,
not text samples
Project website:
http://www.tu-
chemnitz.de/phil/english/
real/lampeter/
lamphome.htm
Lancaster Corpus A precursor to the
million-word
Lancaster–Oslo–Bergen
(LOB) Corpus of edited
written British English
See: Lancaster–Oslo–
Bergen (LOB) Corpus
Lancaster/IBM Spoken
English Corpus
53,000 words of spoken
British English (primarily
radio broadcasts); available
in various formats,
including a parsed version
See: ICAME CD-ROM
Lancaster Parsed Corpus A parsed corpus containing
approximately 140,000
words from various genres
in the Lancaster–Oslo–
Bergen (LOB) Corpus
See: ICAME CD-ROM
Lancaster–Oslo–Bergen
(LOB) Corpus
One million words of
edited written British
English published in 1961
and divided into 2,000-
word samples; modeled
after the Brown Corpus
See: ICAME CD-ROM
Linguistic Data
Consortium (LDC)
For an annual fee, makes
available to members a
variety of spoken and
written corpora of English
and many other languages
LDC website:
http://www.ldc.upenn.edu/
London Corpus The original corpus of
spoken and written British
English first created in the
1960s by Randolph Quirk
at the Survey of English
Usage, University College
London; sections of the
spoken part are included in
the London–Lund Corpus
Can be used on-site at the
Survey of English Usage:
http://www.ucl.ac.uk/
english-usage/home.htm
London–Lund Corpus Approximately 500,000
words of spoken British
English from various
genres (e.g. spontaneous
dialogues, radio
broadcasts) that has been
prosodically transcribed
See: ICAME CD-ROM
Longman–Lancaster
Corpus
A corpus available in
orthographic form that
contains approximately 30
million words of written
English taken from various
varieties of English
world-wide
Longman website:
http://www.longman-
elt.com/dictionaries/corpus/
lclonlan.html
Longman Learners' Corpus 10 million words of writing
by individuals from around
the world learning English
as a second or foreign
language
Longman website:
http://www.longman-
elt.com/dictionaries/corpus/
lclearn.html
The Longman Spoken and
Written English Corpus
(LSWE)
Approximately 40 million
words of samples of
spoken and written British
and American English
Described in Biber et al.
(1999)
Map Task Corpus Digitized transcriptions of
individuals engaged in
task-oriented dialogues
in which one speaker helps
another speaker replicate a
route on a map
Project website:
http://www.hcrc.ed.ac.uk/
maptask/
Michigan Corpus of
Academic Spoken English
(MICASE)
Various types of spoken
American English recorded
in academic contexts: class
lectures and discussions,
tutorials, dissertation
defenses
Searchable on the Web:
http://www.hti.umich.edu/
m/micase/
Nijmegen Corpus A 130,000-word parsed
corpus of written English
TOSCA Research website:
http://lands.let.kun.nl/
TSpublic/tosca/
research.html
The Northern Ireland
Transcribed Corpus of
Speech
400,000 words of
interviews with individuals
speaking Hiberno-English
from various regions of
Northern Ireland
See Kirk (1992)
Penn–Helsinki Parsed
Corpus of Middle English
1.3 million words of parsed
Middle English taken from
55 samples found in the
Helsinki Corpus
Project website:
http://www.ling.upenn.edu/
mideng/
Penn Treebank (Releases I
and II)
A heterogeneous collection
of speech and writing
totaling approximately 4.9
million words; sections
have been tagged and
parsed
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC95T7.html
Polytechnic of Wales
Corpus
A 65,000-word parsed
corpus of the speech of
children ages 6–12
conversing in playgroups
of three and in interviews
with adults
See: ICAME CD-ROM
Santa Barbara Corpus of
Spoken American English
Large corpus containing
samples of varying length
of different kinds of spoken
American English:
spontaneous dialogues,
monologues, speeches,
radio broadcasts, etc.
Project website:
http://www.linguistics.ucsb.
edu/research/sbcorpus/
default.htm
First release of corpus can
be purchased from the
Linguistic Data
Consortium (LDC):
http://www.ldc.upenn.edu/
Catalog/LDC2000S85.html
Susanne Corpus 130,000 words of written
English based on various
genres in the Brown
Corpus that have been
parsed and marked up
based on an annotation
scheme developed for the
project
Project website:
http://www.cogs.susx.ac.uk/
users/geoffs/RSue.html
Switchboard Corpus 2,400 telephone
conversations between two
speakers from various
dialect regions in the
United States; topics of
conversations were
suggested beforehand
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC97S62.html
TalkBank Project Cross-disciplinary effort to
use computational tools to
study human and animal
communication
Project website:
http://www.talkbank.org/
Tampere Corpus A corpus proposed to
consist of various kinds of
scientic writing for
specialized and
non-specialized audiences
Described in Norri and
Kytö (1996)
Text Encoding Initiative
(TEI)
Has developed standards
for the annotation of
electronic documents
Project website:
http://www.tei-c.org/
TIMIT Acoustic-Phonetic
Continuous Speech Corpus
Various speakers from
differing dialects of
American English reading
ten sentences containing
phonetically varied sounds
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC93S1.html
TIPSTER Corpus Collection of various kinds
of written English, such as
Wall Street Journal and
Associated Press news
stories; intended for
research in information
retrieval
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC93T3A.html
Wellington Corpus One million words of
written New Zealand
English divided into genres
that parallel the Brown and
LOB corpora but that were
collected between 1986
and 1990
See: ICAME CD-ROM
York Corpus 1.5 million words taken
from sociolinguistic
interviews with speakers of
York English
See Tagliamonte (1998)
Appendix 2
Concordancing programs
PC/Macintosh-based programs
Conc (for the Macintosh)
John Thomson
available from Summer Institute of Linguistics
http://www.indiana.edu/letrs/help-services/QuickGuides/about-conc.html
Concordancer for Windows
Zdenek Martinek in collaboration with Les Siegrist
http://www.ifs.tu-darmstadt.de/sprachlit/wconcord.htm
Corpus Presenter
Raymond Hickey
http://www.uni-essen.de/lan300/corpus presenter.htm
Corpus Wizard
Kobe Phoenix Lab, Japan
http://www2d.biglobe.ne.jp/htakashi/software/CWNE.HTM
Lexa
Raymond Hickey
Available from Norwegian Computing Centre for the Humanities
http://www.hd.uib.no/lexainf.html
MonoConc Pro 2.0
Athelstan
http://www.athel.com/mono.html#monopro
ParaConc (for multilingual corpora)
Michael Barlow
http://www.athel.com/
Sara
British National Corpus
http://info.ox.ac.uk/bnc/sara/client.html
Tact
Centre for Computing in the Humanities, University of Toronto
http://www.chass.utoronto.ca:8080/cch/tact.html
WordCruncher (now called Document Explorer)
Hamilton-Locke, Inc. (an older version is also available on the ICAME
CD-ROM, 2nd edn.)
http://hamilton-locke.com/DocExplorer/Index.html
WordSmith
Mike Scott
Oxford University Press
http://www.oup.com/elt/global/catalogue/multimedia/wordsmithtools3/
Web-based programs
CobuildDirect
http://titania.cobuild.collins.co.uk/direct info.html
KWiCFinder
http://miniappolis.com/KWiCFinder/KWiCFinderHome.html
The Michigan Corpus of Academic Spoken English (MICASE)
http://www.hti.umich.edu/micase/
Sara
Online version of the British National Corpus
http://sara.natcorp.ox.ac.uk
TACTWeb
http://kh.hd.uib.no/tactweb/homeorg.htm
References
Aarts, Bas (1992) Small Clauses in English: The Nonverbal Types. Berlin and New York:
Mouton de Gruyter.
(2001) Corpus Linguistics, Chomsky, and Fuzzy Tree Fragments. In Mair and Hundt
(2001). 5–13.
Aarts, Bas and Charles F. Meyer (eds.) (1995) The Verb in Contemporary English.
Cambridge University Press.
Aarts, Jan and Willem Meijs (eds.) (1984) Corpus Linguistics: Recent Developments
in the Use of Computer Corpora. Amsterdam: Rodopi.
Aarts, Jan, Pieter de Haan, and Nelleke Oostdijk (eds.) (1993) English Language Cor-
pora: Design, Analysis, and Exploitation. Amsterdam: Rodopi.
Aarts, Jan, Hans van Halteren, and Nelleke Oostdijk (1996) The TOSCA Analysis
System. In Koster and Oltmans (1996). 181–91.
Aijmer, Karin and Bengt Altenberg (eds.) (1991) English Corpus Linguistics. London:
Longman.
Altenberg, Bengt and Marie Tapper (1998) The Use of Adverbial Connectors in
Advanced Swedish Learners' Written English. In Granger (1998). 80–93.
Ammon, U., N. Dittmar, and K. J. Mattheier (eds.) (1987) Sociolinguistics: An Interna-
tional Handbook of the Science of Language and Society, vol. 2. Berlin: de Gruyter.
Aston, Guy and Lou Burnard (1998) The BNCHandbook: Exploring the British National
Corpus with SARA. Edinburgh University Press.
Atwell, E., G. Demetriou, J. Hughes, A. Schiffrin, C. Souter, and S. Wilcock (2000)
A Comparative Evaluation of Modern English Corpus Grammatical Annotation
Schemes. ICAME Journal 24. 7–23.
Barlow, Michael (1999) MonoConc 1.5 and ParaConc. International Journal of Corpus
Linguistics 4 (1). 319–27.
Bell, Alan (1988) The British Base and the American Connection in New Zealand Media
English. American Speech 63. 326–44.
Biber, Douglas (1988) Variation Across Speech and Writing. New York: Cambridge
University Press.
(1990) Methodological Issues Regarding Corpus-based Analyses of Linguistic
Variation. Literary and Linguistic Computing 5. 257–69.
(1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8.
241–57.
(1995) Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge
University Press.
Biber, Douglas, Edward Finegan, and Dwight Atkinson (1994) ARCHER and its
Challenges: Compiling and Exploring a Representative Corpus of English
Historical Registers. In Fries, Tottie, and Schneider (1994). 1–13.
Biber, Douglas, Susan Conrad, and Randi Reppen (1998) Corpus Linguistics:
Investigating Language Structure and Language Use. Cambridge University
Press.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan
(1999) The Longman Grammar of Spoken and Written English. London: Longman.
Biber, Douglas and Jená Burges (2000) Historical Change in the Language Use of
Women and Men: Gender Differences in Dramatic Dialogue. Journal of English
Linguistics 28 (1). 21–37.
Blachman, Edward, Charles F. Meyer, and Robert A. Morris (1996) The UMB Intelligent
ICE Markup Assistant. In Greenbaum (1996a). 54–64.
Brill, Eric (1992) A Simple Rule-Based Part-of-Speech Tagger. Proceedings of the 3rd
Conference on Applied Natural Language Processing. Trento, Italy.
Burnard, Lou (1995) The Text Encoding Initiative: An Overview. In Leech, Myers, and
Thomas (1995). 69–81.
(1998) The Pizza Chef: A TEI Tag Set Selector. http://www.hcu.ox.ac.uk/TEI/pizza.
html.
Burnard, Lou and C. M. Sperberg-McQueen (1995) TEI Lite: An Introduction to Text
Encoding for Interchange. http://www.tei-c.org/Lite/index.html.
Burnard, Lou and Tony McEnery (eds.) (2000) Rethinking Language Pedagogy from a
Corpus Perspective. Frankfurt: Peter Lang.
Chafe, Wallace (1994) Discourse, Consciousness, and Time. Chicago: University of
Chicago Press.
(1995) Adequacy, User-Friendliness, and Practicality in Transcribing. In Leech,
Myers, and Thomas (1995). 54–61.
Chafe, Wallace, John Du Bois, and Sandra Thompson (1991) Towards a New Corpus of
American English. In Aijmer and Altenberg (1991). 64–82.
Chomsky, Noam (1995) The Minimalist Program. Cambridge, MA: MIT Press.
Coates, Jennifer (1983) The Semantics of the Modal Auxiliaries. London: Croom
Helm.
Collins, Peter (1991a) The Modals of Obligation and Necessity in Australian English.
In Aijmer and Altenberg (1991). 145–65.
(1991b) Cleft and Pseudo-Cleft Constructions in English. Andover: Routledge.
Composition of the BNC. http://info.ox.ac.uk/bnc/what/balance.html.
Cook, Guy (1995) Theoretical Issues: Transcribing the Untranscribable. In Leech,
Myers, and Thomas (1995). 35–53.
Corpus Encoding Standard (2000) http://www.cs.vassar.edu/CES/122.
Crowdy, Steve (1993) Spoken Corpus Design. Literary and Linguistic Computing 8.
259–65.
Curme, G. (1947) English Grammar. New York: Harper and Row.
Davies, Mark (2001) Creating and Using Multi-million Word Corpora from Web-based
Newspapers. In Simpson and Swales (2001). 58–75.
de Haan, Pieter (1984) Problem-Oriented Tagging of English Corpus Data. In Aarts and
Meijs (1984). 123–39.
Du Bois, John, Stephan Schuetze-Coburn, Susanna Cumming, and Danae Paolino (1993)
Outline of Discourse Transcription. In Edwards and Lampert (1993). 45–89.
Dunning, Ted (1993) Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics 19 (1). 61–74.
Eckman, Fred (ed.) (1977) Current Themes in Linguistics. Washington, DC: John Wiley.
Edwards, Jane (1993) Principles and Contrasting Systems of Discourse Transcription.
In Edwards and Lampert (1993). 3–31.
Edwards, Jane and Martin Lampert (eds.) (1993) Talking Data. Hillside, NJ: Lawrence
Erlbaum.
Ehlich, Konrad (1993) HIAT: A Transcription System for Discourse Data. In Edwards
and Lampert (1993). 123–48.
Elsness, J. (1997) The Perfect and the Preterite in Contemporary and Earlier English.
Berlin and New York: Mouton de Gruyter.
Fang, Alex (1996) AUTASYS: Automatic Tagging and Cross-Tagset Mapping. In
Greenbaum (1996a). 110–24.
Fernquest, Jon (2000) Corpus Mining: Perl Scripts and Code Snippets. http://www.
codearchive.com/home/jon/program.html.
Fillmore, Charles (1992) Corpus Linguistics or Computer-Aided Armchair Linguistics.
In Svartvik (1992). 35–60.
Finegan, Edward and Douglas Biber (1995) That and Zero Complementisers in Late
Modern English: Exploring ARCHER from 1650–1990. In Aarts and Meyer (1995).
241–57.
Francis, W. Nelson (1979) A Tagged Corpus – Problems and Prospects. In Greenbaum,
Leech, and Svartvik (1979). 192–209.
(1992) Language Corpora B.C. In Svartvik (1992). 17–32.
Francis, W. Nelson and H. Kučera (1982) Frequency Analysis of English Usage: Lexicon
and Grammar. Boston: Houghton Mifflin.
Fries, Udo, Gunnel Tottie, and Peter Schneider (eds.) (1994) Creating and Using English
Language Corpora. Amsterdam: Rodopi.
Garside, Roger, Geoffrey Leech, and Geoffrey Sampson (1987) The Computational
Analysis of English. London: Longman.
Garside, Roger, Geoffrey Leech, and Tamás Váradi (1992) Lancaster Parsed Corpus.
Manual to accompany the Lancaster Parsed Corpus. http://khnt.hit.uib.no/
icame/manuals/index.htm.
Garside, Roger, Geoffrey Leech, and Anthony McEnery (eds.) (1997) Corpus Annota-
tion. London: Longman.
Garside, Roger and Nicholas Smith (1997) A Hybrid Grammatical Tagger: CLAWS 4.
In Garside, Leech, and McEnery (1997). 102–121.
Gavioli, Laura (1997) Exploring Texts through the Concordancer: Guiding the Learner.
In Anne Wichmann, Steven Fligelstone, Tony McEnery, and Gerry Knowles (eds.)
(1997) Teaching and Language Corpora. London: Longman. 83–99.
Gillard, Patrick and Adam Gadsby (1998) Using a Learners' Corpus in Compiling ELT
Dictionaries. In Granger (1998). 159–71.
Granger, Sylvianne (1993) International Corpus of Learner English. In Aarts, de Haan,
and Oostdijk (1993). 57–71.
(1998) Learner English on Computer. London: Longman.
Greenbaum, Sidney (1973) Informant Elicitation of Data on Syntactic Variation. Lingua
31. 201–12.
(1975) Syntactic Frequency and Acceptability. Lingua 40. 99–113.
(1984) Corpus Analysis and Elicitation Tests. In Aarts and Meijs (1984). 195–201.
(1992) A New Corpus of English: ICE. In Svartvik (1992). 171–79.
(ed.) (1996a) Comparing English Worldwide: The International Corpus of English.
Oxford: Clarendon Press.
(1996b) The Oxford English Grammar. Oxford: Oxford University Press.
Greenbaum, Sidney, Geoffrey Leech, and Jan Svartvik (eds.) (1979) Studies in English
Linguistics. London: Longman.
Greenbaum, Sidney and Charles F. Meyer (1982) Ellipsis and Coordination: Norms and
Preferences. Language and Communication 2. 137–49.
Greenbaum, Sidney and Jan Svartvik (1990) The London–Lund Corpus of Spoken
English. In Svartvik (1990). 11–45.
Greenbaum, Sidney and Ni Yibin (1996) About the ICE Tagset. In Greenbaum (1996a).
92–109.
Greenbaum, Sidney, Gerald Nelson, and Michael Weizman (1996) Complement Clauses
in English. In Thomas and Short (1996). 76–91.
Greene, B. B. and G. M. Rubin (1971) Automatic Grammatical Tagging. Technical
Report. Department of Linguistics, Brown University.
Hadley, Gregory (1997) Sensing the Winds of Change: An Introduction to Data-Driven
Learning. http://web.bham.ac.uk/johnstf/winds.htm.
Haegeman, Liliane (1987) Register Variation in English: Some Theoretical Observa-
tions. Journal of English Linguistics 20 (2). 230–48.
(1991) Introduction to Government and Binding Theory. Oxford: Blackwell.
Halteren, Hans van and Theo van den Heuvel (1990) Linguistic Exploitation of Syntactic
Databases. The Use of the Nijmegen Linguistic DataBase Program. Amsterdam:
Rodopi.
Haselrud, V. and Anna-Brita Stenström (1995) Colt: Mark-up and Trends. Hermes 13.
55–70.
Hasselgård, Hilde (1997) Sentence Openings in English and Norwegian. In Ljung (1997).
3–20.
Hickey, Raymond, Merja Kytö, Ian Lancashire, and Matti Rissanen (eds.) (1997) Tracing
the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop.
Amsterdam: Rodopi.
Hockey, Susan (2000) Electronic Texts in the Humanities. Oxford: Oxford University
Press.
Järvinen, Timo (1994) Annotating 200 Million Words: The Bank of English Project.
Proceedings of COLING 94, Kyoto, Japan. http://www.lingsoft.fi/doc/engcg/
Bank-of-English.html.
Jespersen, Otto (1909–49) A Modern English Grammar on Historical Principles.
Copenhagen: Munksgaard.
Johansson, Stig and Knut Hofland (1994) Towards an English–Norwegian Parallel
Corpus. In Fries, Tottie, and Schneider (1994). 25–37.
Johansson, Stig and Jarle Ebeling (1996) Exploring the English–Norwegian Parallel
Corpus. In Percy, Meyer, and Lancashire (1996).
Johns, Tim F. (1994) From Printout to Handout: Grammar and Vocabulary Teaching in
the Context of Data-driven Learning. In Odlin (1994). 293–313.
Kalton, Graham (1983) Introduction to Survey Sampling. Beverly Hills, CA: Sage.
Kennedy, Graeme (1996) Over Once Lightly. In Percy, Meyer, and Lancashire (1996).
253–62.
Kettemann, Bernhard (1995) On the Use of Concordancing in ELT. TELL&CALL 4.
4–15.
Kirk, John (1992) The Northern Ireland Transcribed Corpus of Speech. In Leitner
(1992). 65–73.
Koster, C. and E. Oltmans (eds.) (1996) Proceedings of the First AGFL Workshop.
Nijmegen: CSI.
Kretzschmar, William A., Jr. (2000) Review of SPSS Student Version 9.0 for Windows.
Journal of English Linguistics 28 (3). 311–13.
Kretzschmar, William A., Jr. and E. Schneider (1996) Introduction to Quantitative Anal-
ysis of Linguistic Survey Data. Los Angeles: Sage.
Kretzschmar, William A., Jr., Charles F. Meyer, and Dominique Ingegneri (1997) Uses
of Inferential Statistics in Corpus Linguistics. In Ljung (1997). 167–77.
Kytö, M. (1991) Variation and Diachrony, with Early American English in Focus. Stud-
ies on can/may and shall/will. University of Bamberg Studies in English
Linguistics 28. Frankfurt am Main: Peter Lang.
(1996) Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding
Conventions and Lists of Source Texts. 3rd edn. Department of English: University
of Helsinki.
Labov, W. (1972) The Transformation of Experience in Narrative Syntax. In Language
in the Inner City. Philadelphia: University of Pennsylvania Press. 354–96.
Landau, Sidney (1984) Dictionaries: The Art and Craft of Lexicography. New York:
Charles Scribner.
Leech, Geoffrey (1992) Corpora and Theories of Linguistic Performance. In Svartvik
(1992). 105–22.
(1997) Grammatical Tagging. In Garside, Leech, and McEnery (1997). 19–33.
(1998) Preface. In Granger (1998). xiv–xx.
Leech, Geoffrey, Roger Garside, and Eric Atwell (1983) The Automatic Grammatical
Tagging of the LOB Corpus. ICAME Journal 7. 13–33.
Leech, Geoffrey, Greg Myers, and Jenny Thomas (eds.) (1995) Spoken English on Com-
puter. Harlow, Essex: Longman.
Leech, Geoffrey and Elizabeth Eyes (1997) Syntactic Annotation: Treebanks. In Garside,
Leech, and McEnery (1997). 34–52.
Leitner, Gerhard (ed.) (1992) New Directions in English Language Corpora. Berlin:
Mouton de Gruyter.
León, Fernando Sánchez and Amalio F. Nieto Serrano (1997) Retargeting a Tagger. In
Garside, Leech, and McEnery (1997). 151–65.
Ljung, Magnus (ed.) (1997) Corpus-based Studies in English. Amsterdam: Rodopi.
MacWhinney, Brian (1996) The CHILDES System. American Journal of Speech-
Language Pathology 5. 5–14.
(2000) The CHILDES Project: Tools for Analyzing Talk. 3rd edn., vol. 1: Transcription
Format and Programs, vol. 2: The Database. Mahwah, NJ: Erlbaum.
Mair, Christian (1990) Infinitival Complement Clauses in English. Cambridge Uni-
versity Press.
(1995) Changing Patterns of Complementation, and Concomitant Grammaticalisa-
tion, of the Verb Help in Present-Day British English. In Aarts and Meyer (1995).
258–72.
Mair, Christian and Marianne Hundt (eds.) (2001) Corpus Linguistics and Linguistic
Theory. Amsterdam: Rodopi.
Maniez, François (2000) Corpus of English Proverbs and Set Phrases. Message posted
on the Corpora List, 24 January. http://www.hit.uib.no/corpora/2000-1/0057.html.
Marcus, M., B. Santorini, and M. Marcinkiewicz (1993) Building a Large Annotated
Corpus of English: The Penn Treebank. Computational Linguistics 19. 314–30.
Markus, Manfred (1997) Normalization of Middle English Prose in Practice. In Ljung
(1997). 211–26.
Mel'čuk, Igor A. (1987) Dependency Syntax: Theory and Practice. Albany: State Uni-
versity of New York Press.
Melamed, Dan (1996) 170 General Text Processing Tools (Mostly in PERL5).
http://www.cis.upenn.edu/~melamed/genproc.html.
Meurman-Solin, Anneli (1995) A New Tool: The Helsinki Corpus of Older Scots (1450–
1700). ICAME Journal 19. 49–62.
Meyer, Charles F. (1992) Apposition in Contemporary English. Cambridge University
Press.
(1995) Coordination Ellipsis in Spoken and Written American English. Language
Sciences 17. 241–69.
(1996) Coordinate Structures in the British and American Components of the Inter-
national Corpus of English. World Englishes 15. 29–41.
(1997) Minimal Markup for ICE Texts. ICE NEWSLETTER 25. http://www.cs.umb.
edu/~meyer/icenews2.html.
(1998) Studying Usage on the World Wide Web. http://www.cs.umb.edu/
~meyer/usage.html.
Meyer, Charles F. and Richard Tenney (1993) Tagger: An Interactive Tagging Program.
In Souter and Atwell (1993). 302–12.
Meyer, Charles F., Edward Blachman, and Robert A. Morris (1994) Can You See Whose
Speech Is Overlapping? Visible Language 28 (2). 110–33.
Milton, John and Robert Freeman (1996) Lexical Variation in the Writing of Chinese
Learners of English. In Percy, Meyer, and Lancashire (1996). 121–31.
Mindt, Dieter (1995) An Empirical Grammar of the English Verb. Berlin: Cornelsen.
Mönnink, Inga de (1997) Using Corpus and Experimental Data: A Multimethod
Approach. In Ljung (1997). 227–44.
Murray, Thomas E. and Carmen Ross-Murray (1992) On the Legality and Ethics of
Surreptitious Recording. Publication of the American Dialect Society 76. 1–75.
(1996) Under Cover of Law: More on the Legality of Surreptitious Recording.
Publication of the American Dialect Society 79. 1–82.
Nelson, Gerald (1996) Markup Systems. In Greenbaum (1996a). 36–53.
Nevalainen, Terttu (2000) Gender Differences in the Evolution of Standard English:
Evidence from the Corpus of Early English Correspondence. Journal of English
Linguistics 28 (1). 38–59.
Nevalainen, Terttu, and Helena Raumolin-Brunberg (eds.) (1996) Sociolinguistics and
Language History: Studies Based on the Corpus of Early English Correspondence.
Amsterdam: Rodopi.
Newmeyer, Frederick (1998) Language Form and Language Function. Cambridge,
MA: MIT Press.
Nguyen, Long, Spyros Matsoukas, Jason Devenport, Daben Liu, Jay Billa, Francis
Kubala, and John Makhoul (1999) Further Advances in Transcription of Broadcast
News. Proceedings of the 6th European Conference on Speech Communication and
Technology, vol. 2. Edited by G. Olaszy, G. Nemeth, K. Erdohegy (aka EuroSpeech
'99). European Speech Communication Association (ESCA). 667–70.
Norri, Juhani and Merja Kytö (1996) A Corpus of English for Specific Purposes: Work
in Progress at the University of Tampere. In Percy, Meyer, and Lancashire (1996).
159–69.
Oakes, Michael P. (1998) Statistics for Corpus Linguistics. Edinburgh University Press.
Odlin, Terrence (ed.) (1994) Perspectives on Pedagogical Grammar. New York:
Cambridge University Press.
Ooi, Vincent (1998) Computer Corpus Lexicography. Edinburgh University Press.
Oostdijk, Nelleke (1991) Corpus Linguistics and the Automatic Analysis of English.
Amsterdam: Rodopi.
Pahta, Päivi and Saara Nevanlinna (1997) Re-phrasing in Early English. Expository
Apposition with an Explicit Marker from 1350 to 1710. In Rissanen, Kytö, and
Heikkonen (1997). 121–83.
Percy, Carol, Charles F. Meyer, and Ian Lancashire (eds.) (1996) Synchronic Corpus
Linguistics. Amsterdam: Rodopi.
Porter, Nick and Akiva Quinn (1996) Developing the ICE Corpus Utility Program. In
Greenbaum (1996a). 79–91.
Powell, Christina and Rita Simpson (2001) Collaboration between Corpus Linguists and
Digital Librarians for the MICASE Web Search Interface. In Simpson and Swales
(2001). 32–47.
Prescott, Andrew (1997) The Electronic Beowulf and Digital Restoration. Literary and
Linguistic Computing 12. 185–95.
Quinn, Akiva and Nick Porter (1996) Annotation Tools. In Greenbaum (1996a). 65–78.
Quirk, Randolph (1992) On Corpus Principles and Design. In Svartvik (1992). 457–69.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik (1972) A Gram-
mar of Contemporary English. London: Longman.
(1985) A Comprehensive Grammar of the English Language. London: Longman.
Renouf, Antoinette (1987) Corpus Development. In Sinclair (1987). 1–40.
Rissanen, Matti (1992) The Diachronic Corpus as a Window to the History of English.
In Svartvik (1992). 185–205.
(2000) The World of English Historical Corpora: from Cædmon to the Computer Age.
Journal of English Linguistics 28 (1). 7–20.
Rissanen, Matti, Merja Kytö, and Kirsi Heikkonen (eds.) (1997) English in Transition:
Corpus-based Studies in English Linguistics and Genre Styles. Topics in English
Linguistics 23. Berlin and New York: Mouton de Gruyter.
Robinson, Peter (1998) New Methods of Editing, Exploring, and Reading The Canter-
bury Tales. http://www.cta.dmu.ac.uk/projects/ctp/desc2.html.
Rocha, Marco (1997) A Probabilistic Approach to Anaphora Resolution in Dialogues
in English. In Ljung (1997). 261–79.
Rydén, Mats (1975) Noun-Name Collocations in British Newspaper Language. Studia
Neophilologica 67. 14–39.
Sampson, Geoffrey (1998) Corpus Linguistics User Needs. Message posted to the
Corpora List, 29 July. http://www.hd.uib.no/corpora/1998-3/0030.html.
Samuelsson, Christer and Atro Voutilainen (1997) Comparing a Linguistic and a
Stochastic Tagger. Proceedings of the 35th Annual Meeting of the Association for
Computational Linguistics and the 8th Conference of the European Chapter of the
Association for Computational Linguistics. Madrid: Association for Computational
Linguistics. 246–53.
Sánchez, Aquilino and Pascual Cantos (1997) Predictability of Word Forms (Types)
and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the
CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish. Inter-
national Journal of Corpus Linguistics 2. 259–80.
Sanders, Gerald (1977) A Functional Typology of Elliptical Coordinations. In Eckman
(1977). 241–70.
Sankoff, David (1987) Variable Rules. In Ammon, Dittmar, and Mattheier (1987).
984–97.
Schmied, Josef (1996) Second-Language Corpora. In Greenbaum (1996a). 182–96.
Schmied, Josef and Hildegard Schäfer (1996) Approaching Translationese Through
Parallel and Translation Corpora. In Percy, Meyer, and Lancashire (1996).
41–55.
Schmied, Josef, and Claudia Claridge (1997) Classifying Text- or Genre-Variation in
the Lampeter Corpus of Early Modern English Texts. In Hickey, Kytö, Lancashire,
and Rissanen (1997). 119–35.
Sigley, Robert J. (1997) Choosing Your Relatives: Relative Clauses in New Zealand
English. Unpublished PhD thesis. Wellington: Department of Linguistics, Victoria
University of Wellington.
Simpson, Rita, Bret Lucka, and Janine Ovens (2000) Methodological Challenges of
Planning a Spoken Corpus with Pedagogical Outcomes. In Burnard and McEnery
(2000). 43–9.
Simpson, Rita and John Swales (eds.) (2001) Corpus Linguistics in North America. Ann
Arbor: University of Michigan Press.
Sinclair, John (ed.) (1987) Looking Up: An Account of the COBUILD Project. London:
Collins.
(1991) Corpus, Concordance, Collocation. Oxford University Press.
(1992) Introduction. BBC English Dictionary. London: HarperCollins. xxiii.
Smith, Nicholas (1997) Improving a Tagger. In Garside, Leech, and McEnery (1997).
137–50.
Souter, Clive and Eric Atwell (eds.) (1993) Corpus-Based Computational Linguistics.
Amsterdam: Rodopi.
Sperberg-McQueen, C. M. and Lou Burnard (eds.) (1994a) Guidelines for Electronic
Text Encoding and Interchange (TEI P3). http://etext.virginia.edu/TEI.html.
(1994b) A Gentle Introduction to SGML. In Guidelines for Electronic Text En-
coding and Interchange (TEI P3). http://etext.lib.virginia.edu/bin/tei-tocs?div=
DIV1&id=SG.
Stenström, Anna-Brita and Gisle Andersen (1996) More Trends in Teenage Talk: A
Corpus-Based Investigation of the Discourse Items cos and innit. In Percy, Meyer,
and Lancashire (1996). 189–203.
Svartvik, J. (ed.) (1990) The London–Lund Corpus of Spoken English. Lund University
Press.
(1992) Directions in Corpus Linguistics. Berlin: Mouton de Gruyter.
Svartvik, Jan and Randolph Quirk (eds.) (1980) A Corpus of English Conversation. Lund
University Press.
Tagliamonte, Sali (1998) Was/Were Variation across the Generations: View from the
City of York. Language Variation and Change 10 (2). 153–91.
Tagliamonte, Sali and Helen Lawrence (2000) I Used to Dance, but I Don't Dance
Now: The Habitual Past in English. Journal of English Linguistics 28 (4).
324–53.
Tannen, D. (1989) Talking Voices: Repetition, Dialogue, and Imagery in Conversational
Discourse. Cambridge University Press.
Tapanainen, Pasi and Timo Järvinen (1997) A Non-projective Dependency Parser.
http://www.conexor.fi/anlp97/anlp97.html [also published in Procs. ANLP-97.
ACL. Washington, DC].
The Independent Style Manual, 2nd edn. (1988) London: The Independent.
The Spoken Component of the BNC. http://info.ox.ac.uk/bnc/what/spok_design.html.
The Written Component of the BNC. http://info.ox.ac.uk/bnc/what/writ_design.html.
Thomas, Jenny and Mick Short (eds.) (1996) Using Corpora for Language Research.
London: Longman.
Thompson, Henry S., Anne H. Anderson, and Miles Bader (1995) Publishing a Spoken
and Written Corpus on CD-ROM: The HCRC Map Task Experience. In Leech,
Myers, and Thomas (1995). 168–80.
Tottie, G. (1991) Negation in English Speech and Writing. A Study in Variation. San
Diego: Academic Press.
Voutilainen, Atro (1999) A Short History of Tagging. In Hans van Halteren (ed.)
Syntactic Wordclass Tagging. Dordrecht: Kluwer.
Voutilainen, Atro and Mikko Silvonen (1996) A Short Introduction to ENGCG.
http://www.lingsoft.fi/doc/engcg/intro.
Wheatley, B., G. Doddington, C. Hemphill, J. Godfrey, E. C. Holliman, J. McDaniel, and
D. Fisher (1992) Robust Automatic Time Alignment of Orthographic Transcriptions
with Unconstrained Speech. Proceedings of ICASSP-92, vol. 1. 533–6.
Willis, Tim (1996) Analysing the Lancaster/IBM Spoken English Corpus (SEC) Using
the TOSCA Analysis System (for ICE): Some Impressions from a User. In Percy,
Meyer, and Lancashire (1996). 237–51.
Wilson, A. and Tony McEnery (1994) Teaching and Language Corpora. Technical
Report. Department of Modern English Language and Linguistics, University of
Lancaster.
Wilson, Andrew and Jenny Thomas (1997) Semantic Annotation. In Garside, Leech,
and McEnery (1997). 53–65.
Woods, A., P. Fletcher, and A. Hughes (1986) Statistics in Language Studies. Cambridge
University Press.
Index
Aarts, Bas, 4, 102
adequacy, 2–3, 10–11
age, 49–50
Altenberg, Bengt, 26–7
AMALGAM Tagging Project, 86–7, 89
American National Corpus, 24, 84, 142
American Publishing House for the Blind
Corpus, 17, 142
analyzing a corpus, 100
determining suitability, 103–7, 107t
exploring a corpus, 123–4
extracting information: defining parameters,
107–9; coding and recording, 109–14,
112t; locating relevant constructions,
114–19, 116f, 118f
framing research question, 101–3
future prospects, 140–1
see also pseudo-titles (corpus analysis
case study); statistical analysis
anaphors, 97
annotation, 98–9
future prospects, 140
grammatical markup, 81
parsing, 91–6, 98, 140
part-of-speech markup, 81
structural markup, 68–9, 81–6
tagging, 86–91, 97–8, 111, 117–18, 140
types, 81
appositions, 42, 98
see also pseudo-titles (corpus analysis
case study)
ARCHER (A Representative Corpus of
English Historical Registers), 21,
22, 79 n6, 140, 142
Aston, Guy, 19
AUTASYS Tagger, 87
Bank of English Corpus, 15, 96, 142
BBC English Dictionary, 15, 16–17
Bell, Alan, 100, 101–3, 104, 108, 110, 131
Bergen Corpus of London Teenage English
see COLT Corpus
Biber, Douglas, 10, 19–20, 22, 32, 33,
36, 39–40, 41, 42, 52, 78, 121,
122, 126
Biber, Douglas, et al. (1999) 14
Birmingham Corpus, 15, 142
Blachman, Edward, 76–7
BNC see British National Corpus
Brill, Eric, 86
Brill Tagger, 86–8
British National Corpus (BNC), 143
annotation, 84
composition, 18, 31t, 34, 36, 38, 40–1, 49
copyright, 139–40
planning, 30–2, 33, 43, 51, 138
record keeping, 66
research using, 15, 36
speech samples, 59
tagging, 87
time-frame, 45
British National Corpus (BNC) Sampler,
139–40, 143
Brown Corpus, xii, 1, 143
genre variation, 18
length, 32
research using, 6, 9–10, 12, 42, 98, 103
sampling methodology, 44
tagging, 87, 90
time-frame, 45
see also FROWN (Freiburg–Brown)
Corpus
Burges, Jená, 52
Burnard, Lou, 19, 82, 84, 85–6
Cambridge International Corpus, 15, 143
Cambridge Learners' Corpus, 15, 143
Canterbury Project, 79, 143
Cantos, Pascual, 33 n2
Chafe, Wallace, 3, 32, 52, 72, 85
CHAT system (Codes for the Human Analysis
of Transcripts), 26, 113–14
Chemnitz Corpus, 23, 143
CHILDES (Child Language Data Exchange
System) Corpus, xiii, 26, 113, 144
Chomsky, Noam, 2, 3
CIA (contrastive interlanguage analysis), 26
CLAN software programs, 26
CLAWS tagger, 25, 87, 89–90
Coates, Jennifer, 12, 13
collecting data
general considerations, 55–6
record keeping, 64–6
speech samples, 56; broadcasts, 61;
future prospects, 139; microphones, 60;
natural speech, 56–8, 59; permission,
57; problems, 60–1; recording, 58–9;
sample length, 57–8; tape recorders, 59–60
writing samples: copyright, 38, 61–2, 79 n6,
139–40; electronic texts, 63–4; future
prospects, 139; sources, 62–4
see also sampling methodology
Collins, Peter, xii–xiii
Collins COBUILD English Dictionary, 15
Collins COBUILD Project, 14, 15
COLT Corpus (Bergen Corpus of London
Teenage English), xiii–xiv, 18, 49, 142
competence vs. performance, 4
computerizing data
directory structure, 67, 68f
file format, 66–7
markup, 67, 68–9 see also annotation
speech, see speech, computerizing
written texts, 78–80, 139
concordancing programs
KWIC format, 115–16, 116f
for language learning, 27–8
lemma searches, 116
programs, 115, 117, 150–1
with tagged or parsed corpus, 117–18
uses, 16, 86, 114
wild card searches, 116–17
Conrad, Susan, 126
contrastive analysis, 22–4
contrastive interlanguage analysis (CIA), 26
Cook, Guy, 72, 86
copyright, 38, 44, 57, 61–2, 79 n6, 139–40
Corpora Discussion List, 144
corpus (corpora)
balanced, xii
construction see planning corpus
construction
definitions, xi–xii
diachronic, 46
historical, 20–2, 37–8, 46, 51, 78–9
learner, 26–7
monitor, 15
multi-purpose, 36
parallel, 22–4
parsed, 96
resources, 142–9
special-purpose, 36
synchronic, 45–6
corpus linguistics, xi, xiii–xiv, 1–2, 3–4
Corpus of Early English Correspondence, 22,
37, 144
Corpus of Middle English Prose and Verse, 144
Corpus of Spoken Professional English, 71,
144
corpus-based research, 11
contrastive analysis, 22–4
grammatical studies, 11–13
historical linguistics, 20–2
language acquisition, 26–7
language pedagogy, 27–8
language variation, 17–20
lexicography, 14–17
limitations, 124
natural language processing (NLP), xiii,
24–6
reference grammars, 13–14
translation theory, 22–4
Crowdy, Steve, 43, 59
Curme, G., 13
data-driven learning, 27–8
de Haan, Pieter, 97–8
descriptive adequacy, 2, 3
diachronic corpora, 46
dialect variation, 51–2
dictionaries, 14–17
Du Bois, John, 32, 52, 85
Dunning, Ted, 132
EAGLES Project see Expert Advisory Group
on Language Engineering Standards, The
Ebeling, Jarle, 23
education, 50
Ehlich, Konrad, 77
Electronic Beowulf, The, 21, 144
electronic texts, 63–4
elliptical coordination
frequency, 7, 12
functional analysis, 6–11
genres, 6, 9–10
position, 6–7
repetition in speech, 9
serial position effect, 7–8, 8t
speech vs. writing, 8–9
suspense effect, 7–8, 8t
empty categories, 4–5
ENGCG Parser, 96
EngCG-2 tagger, 88
EngFDG parser, 91, 93–4, 93–4 n8, 96
English–Norwegian Parallel Corpus, 23,
62, 144
ethnographic information, 65–6
see also sociolinguistic variables
Expert Advisory Group on Language
Engineering Standards, The (EAGLES),
xi, 84, 144
explanatory adequacy, 2, 3, 10–11
Extensible Markup Language see XML
Eyes, Elizabeth, 91
Fernquest, Jon, 114
Fillmore, Charles, 4, 17
Finegan, Edward, 22
Fletcher, P., 121–2
FLOB (Freiburg–Lancaster–Oslo–Bergen)
Corpus, 21, 45, 145
frame semantics, 17
Francis, W. Nelson, 1, 88
FROWN (Freiburg–Brown) Corpus, 21, 145
FTF see fuzzy tree fragments
functional descriptions of language
elliptical coordination, 6–11, 8t, 12
repetition in speech, 9
voice, 5–6
fuzzy tree fragments (FTF), 119, 119f
Gadsby, Adam, 27
Garside, Roger, 88–9
Gavioli, Laura, 28
gender, 18, 22, 48–9
generative grammar, 1, 3–5
genre variation, 18, 19–20, 31t, 34–8, 35t, 40–2
Gillard, Patrick, 27
government and binding theory, 4–5
grammar
generative, 1, 3–5
universal, 2–3
Grammar Safari, 28
grammars, reference, 13–14
grammatical markup see parsers
grammatical studies, 11–13
Granger, Sylvianne, 26
Greenbaum, Sidney, 7, 14, 22, 35t, 64, 75, 95
Greene, B. B., 87, 88
Haegeman, Liliane, 2–3, 4–5, 6
Hasselgård, Hilde, 23
Helsinki Corpus, 145
composition, 20–1, 38
planning, 46
research using, 22, 37, 51
symbols system, 67
Helsinki Corpus of Older Scots, 145
historical corpora, 20–2, 37–8, 46, 51, 78–9
see also ARCHER; Helsinki Corpus
Hofland, Knut, 23
Hong Kong University of Science and
Technology (HKUST) Learner Corpus,
26, 145
Hughes, A., 121–2
ICAME Bibliography, 145
ICAME CD-ROM, 67, 145
ICE (International Corpus of English), 146
annotation, 82–3, 84, 85, 87, 90
composition, 34, 35t, 36, 38, 39, 40–2, 104
computerizing data, 72, 73
copyright, 38, 44
criteria, 50
record keeping, 66
regional components, 104, 105–6, 106t,
110, 123, 124
research using, 6, 9 see also pseudo-titles
(corpus analysis case study)
sampling, 44, 56
time-frame, 45
see also ICECUP; ICE-GB; ICE-USA
ICE Markup Assistant, 85, 86
ICE Tree, 95
ICECUP (ICE Corpus Utility Program), 19,
116, 119, 146
ICE-East Africa, 106, 106t, 107t, 110,
123t, 124
ICE-GB, 146
annotation, 25, 83–4, 86, 92–3, 92f, 96,
117–18, 118f, 140
composition, 106t
computerizing data, 73
criteria, 50
record keeping, 64–5
research using, 14, 19, 115–16, 116f
see also pseudo-titles (corpus analysis
case study)
ICE-Jamaica, 106t, 107t, 110, 123t
ICE-New Zealand, 106, 106t, 107t, 123t,
125, 130–3
ICE-Philippines, 106, 106t, 107t, 110, 123t,
125, 130–3
ICE-Singapore, 106t, 110, 123t
ICE-USA
composition, 53, 106t
computerizing data, 70, 71, 73–4, 79
copyright, 62
criteria, 46–7
directory structure, 67–8, 68f
length, 32–3
record keeping, 64, 65
research using see pseudo-titles (corpus
analysis case study)
sampling, 58, 60–1
ICLE see International Corpus of Learner
English
Ingegneri, Dominique, 42–3
International Corpus of English see ICE
International Corpus of Learner English
(ICLE), 26, 27, 146
Jespersen, Otto, xii, 13
Johansson, Stig, 23
Kalton, Graham, 43
Kennedy, Graeme, 89
Kettemann, Bernhard, 27–8
Kirk, John, 52
Kolhapur Corpus of Indian English, 104
Kretzschmar, William A., Jr., 42–3
Kučera, Henry, 1
KWIC (key word in context), 115–16, 116f
Kytö, M., 37
Kytö, Merja, 42
Labov, W., 9
Lampeter Corpus, 38, 146
Lancaster Corpus, 12, 147
see also LOB (Lancaster–Oslo–Bergen)
Corpus
Lancaster–Oslo–Bergen Corpus see LOB
(Lancaster–Oslo–Bergen) Corpus
Lancaster Parsed Corpus, 91–2, 96, 147
Lancaster/IBM Spoken English Corpus, 96,
147
Landau, Sidney, 16
language acquisition, 26–7, 47
language pedagogy, 27–8
language variation, 3, 17–20
dialect variation, 51–2
genre variation, 18, 19–20, 31t, 34–8,
35t, 40–2
sociolinguistic variables, 18–19, 22, 48–53
style-shifting, 19
Lawrence, Helen, 52, 136
LDB see Linguistic Database Program
LDC see Linguistic Data Consortium
learner corpora, 26–7
Leech, Geoffrey, xi, 4, 87, 91, 138
lemmas, 16, 116, 116 n5
length
of corpus, 32–4, 126
of text samples, 38–40
lexicography, 14–17
Linguistic Data Consortium (LDC), 24,
98, 147
Linguistic Database (LDB) Program, 93, 115
linguistic theory
adequacy, 2–3, 10–11
corpus linguistics, xi, xiii–xiv, 1–2, 3–4
generative grammar, 1, 3–5
government and binding theory, 4–5
minimalist theory, 3
see also functional descriptions of language
LOB (Lancaster–Oslo–Bergen) Corpus, 12,
14–15, 19, 39, 45, 87, 147
see also FLOB
(Freiburg–Lancaster–Oslo–Bergen)
Corpus
London Corpus, 12, 13, 50, 103, 147
London–Lund Corpus, 147
annotation, 82
composition, 53
names in, 75
research using, 12, 19, 39, 42, 98
Longman Dictionary of American English, 15
Longman Dictionary of Contemporary
English, 15
Longman Essential Activator, 27
Longman–Lancaster Corpus, 12, 148
Longman Learners' Corpus, 26, 27, 148
Longman Spoken and Written English Corpus,
The (LSWE), 14, 90, 148
LSWE see Longman Spoken and Written
English Corpus, The
Mair, Christian, 4–5
Map Task Corpus, 59, 148
markup, 67, 68–9
see also annotation
Markus, Manfred, 78, 79
Mel'čuk, Igor A., 93 n8
Meyer, Charles F., 6, 7, 28, 42–3, 76–7,
98, 101, 103
Michigan Corpus of Academic Spoken
English (MICASE), 148, 151
composition, 36, 53
computerization, 69, 72–3
planning, 139
record keeping, 65
Mindt, Dieter, 12–13
minimalist theory, 3
modal verbs, xii–xiii, 12–13
monitor corpora, 15
Morris, Robert A., 76–7
multi-purpose corpora, 36
Murray, James A. H., 16
native vs. non-native speakers, 46–8
natural language processing (NLP), xiii, 24–6
Nelson, Gerald, 22, 83
Nevalainen, Terttu, 22, 51
Nguyen, Long et al., 70
Nijmegen Corpus, 87, 92, 93, 96, 148
NLP see natural language processing
Norri, Juhani, 42
Northern Ireland Transcribed Corpus
of Speech, The, 52, 148
noun phrases, 5–6, 13–14
null-subject parameter, 2–3
Oakes, Michael P., 134, 136
observational adequacy, 2
Ooi, Vincent, 15
Oostdijk, Nelleke, 93
Oxford English Dictionary (OED), 16
ParaConc, 24, 150
parallel corpora, 22–4
parsed corpora, 96
parsers
probabilistic, 91
rule-based, 92–4, 93–4 n8, 95–6
parsing a corpus
accuracy, 91, 95
complexity, 93–5
disambiguation, 95
future prospects, 140
manual pre-processing, 95–6
normalization, 96
parsers, 91–4, 95
post-editing, 95
problem-oriented tagging, 97–8
speech, 94–5, 96
treebanks, 91–2
part-of-speech markup see taggers
PC Tagger, 98, 113, 113f
Penn–Helsinki Parsed Corpus of Middle
English, 96, 148
Penn Treebank, xii, xiii, 25, 37, 91, 96, 149
Perl programming language, 114
planning corpus construction, 30, 44–5, 53
British National Corpus, 30–2, 33, 43, 51,
138
future prospects, 138–9
genres, 31t, 34–8, 35t, 40–2
length of text samples, 38–40
native vs. non-native speakers, 46–8
number of texts, 40–3
overall length, 32–4
range of speakers and writers, 40–2, 43–4
sociolinguistic variables, 18–19, 48–53
time-frame, 45–6
Polytechnic of Wales Corpus, 96, 149
pro-drop, 2–3
programming languages, 114
pseudo-titles (corpus analysis case study),
101–2
determining suitability, 103–7, 107t
extracting information: defining parameters,
107–9; coding and recording, 109–14,
112t, 113f; locating relevant
constructions, 115, 117–19, 118f, 119f
framing research question, 101–3
statistical analysis: exploring a corpus,
123–4; using quantitative information,
125–36, 125t, 127t, 128f, 128t, 130t,
131t, 132t, 133t, 134t, 135t
Quirk, Randolph et al., 13–14, 101, 108, 135
record keeping, 64–6
reference grammars, 13–14
Reppen, Randi, 126
research see analyzing a corpus; corpus-based
research
Rissanen, Matti, 20, 21, 37–8, 46, 51
Rocha, Marco, 97
Rubin, G. M., 87, 88
Rydén, Mats, 101
sampling methodology
non-probability sampling, 44, 45
probability sampling, 43–4
sampling frames, 42–3
see also speech samples
Sampson, Geoffrey, 114
Sánchez, Aquilino, 33 n2
Sanders, Gerald, 6–7
Santa Barbara Corpus of Spoken American
English, 32, 44, 69, 71, 85, 149
Sara concordancing program, 18–19,
150, 151
scanners, 79
Schäfer, Hildegard, 23–4
Schmied, Josef, 23–4, 47
SEU see Survey of English Usage
SGML (Standard Generalized Markup
Language), 82–5, 86
Sigley, Robert J., 105 n3, 129, 136
Sinclair, John, 14, 15
small clauses, 4
Smith, Nicholas, 88–9, 90
social contexts and relationships, 52–3
sociolinguistic variables, 18–19
age, 49–50
dialect, 51–2
education, 50
gender, 18, 22, 48–9
social contexts and relationships, 52–3
see also ethnographic information
software programs, 18–19, 24, 26, 115
Someya, Yasumasa, 114
special-purpose corpora, 36
speech, computerizing
background noise, 75
detail, 71–2
extra-linguistic information, 72
future prospects, 139
iconicity and speech transcription, 75–8
lexicalized expressions, 72–3
linked expressions, 73
names of individuals, 75
partially uttered words, 73
principles, 72
punctuation, 74–5
repetitions, 73–4
speech-recognition programs, 24–6, 70–1
transcription programs, 69–70
transcription time, 71
unintelligible speech, 74
vocalized pauses, 72
speech, repetition in, 9
speech samples, 56
broadcasts, 61
future prospects, 139
microphones, 60
natural speech, 56–8, 59
parsing, 94–5, 96
permission, 57
problems, 60–1
recording, 58–9
sample length, 57–8
tape recorders, 59–60
see also planning corpus construction;
speech, computerizing
speech-recognition programs, 24–6, 70–1
Sperberg-McQueen, C. M., 82
Standard Generalized Markup Language see
SGML
statistical analysis, 119–21
backward elimination, 135
Bonferroni correction, 129
chi-square test, 127–32, 127t, 134, 135
cross tabulation, 125
degrees of freedom, 129
determining suitability, 121–2
exploring a corpus, 123–4
frequency counts, 120
frequency normalization, 126
kurtosis, 127, 127t
length of corpora, 126
linguistic motivation, 122
log-likelihood (G²) test, 132, 135
loglinear analysis, 134, 136
macroscopic analysis, 122
non-parametric tests, 126, 127
normal distribution, 126–7
programs, 120, 136
pseudo-titles (case study), 123–36, 125t,
127t, 128f, 128t, 130t, 131t, 132t,
133t, 134t, 135t
saturated models, 134
skewness, 127, 127t
using quantitative information, 125–36
Varbrul programs, 136
structural markup, 81
detail, 85–6
display, 86
intonation, 85
SGML, 82–5, 86
TEI, 67, 84, 85, 86, 98, 149
timing, 68–9
XML, 84, 86
style-shifting, 19
Survey of English Usage (SEU), 42, 98
Susanne Corpus, 96, 149
Svartvik, Jan, 75
Switchboard Corpus, 25, 149
synchronic corpora, 45–6
taggers, 86–90
accuracy, 89–90
probabilistic, 88–9
rule-based, 88
tagging a corpus, 86–91
accuracy, 89–90, 91
disambiguation, 88
discourse tagging, 97
future prospects, 140
limitations, 117–18
post-editing, 89
problem-oriented tagging, 97–8, 111
semantic tagging, 97
taggers, 86–90
tagsets, 86, 87, 90–1
see also SGML; Text Encoding Initiative
(TEI)
TAGGIT, 88
Tagliamonte, Sali, 52, 136
tagsets, 86, 87, 90–1
TalkBank Project, 138, 149
Tampere Corpus, 42, 150
Tannen, D., 9
Tapper, Marie, 26–7
Text Encoding Initiative (TEI), 67, 84, 85, 86,
98, 150
text samples
length, 38–40
number, 40–3
that, 22
Thomas, Jenny, 97
Thompson, Sandra, 32, 52, 85
time-frame, 45–6
TIMIT Acoustic-Phonetic Continuous Speech
Corpus, 24–5, 150
TIPSTER Corpus, 25, 150
TOSCA parser, 91, 92–3, 92f, 95–6
TOSCA tagset, 87
TOSCA Tree editor, 95
transcription programs, 69–70
translation theory, 22–4
tree editors, 95
treebanks, 91–2
see also Penn Treebank
universal grammar, 2–3
Varbrul programs, 136
verb complements, 4–5
verb phrases, 13
verbs, modal, xii–xiii, 12–13
voice, 5–6
Weizman, Michael, 22
Wellington Corpus, 89, 136, 150
Willis, Tim, 95
Wilson, Andrew, 97
Woods, A., 121–2
World Wide Web, 28, 63–4, 79–80,
140, 142–51
written texts
collecting data, 61–4, 139
computerizing, 78–80, 139
copyright, 38, 44, 61–2, 79 n6, 139–40
electronic texts, 63–4
see also planning corpus construction
XML (Extensible Markup Language), 84, 86
York Corpus, 52, 150