3 Corpora Annotation - Documento 1
Some people (notably John Sinclair — see chapter 1) prefer not to engage in corpus
annotation: for them, the unannotated corpus is the 'pure' corpus they want to investigate
— the corpus without adulteration with information which is suspect, possibly
reflecting the predilections, or even the errors, of the annotator. For others, annotation is
a means to make a corpus much more useful — an enrichment of the original raw
corpus. From this perspective, probably a majority view, adding annotation to a corpus
is giving 'added value', which can be used for research by the individual or team that
carried out the annotation, but which can also be passed on to others who may find it
useful for their own purposes. For example, POS-tagged versions of major English
language corpora such as the Brown Corpus, the LOB Corpus and the British National
Corpus have been distributed widely throughout the world for those who would like to
make use of the tagging, as well as of the original 'raw' corpus. In this chapter, I will
assume that such annotation is a benefit, so long as it is done well, with an eye to the
standards that ought to apply to such work.
Apart from part-of-speech (POS) tagging, there are other types of annotation,
corresponding to different levels of linguistic analysis of a corpus or text — for
example:
phonetic annotation
e.g. adding information about how a word in a spoken corpus was pronounced.
prosodic annotation
e.g. adding information (again in a spoken corpus) about prosodic features such as
stress, intonation and pauses.
syntactic annotation
e.g. adding information about how a given sentence is parsed, in terms of syntactic
analysis into such units as phrases and clauses.
semantic annotation
e.g. adding information about the semantic category of words — the noun cricket as a
term for a sport and as a term for an insect belong to different semantic categories,
although there is no difference in spelling or pronunciation.
pragmatic annotation
e.g. adding information about the kinds of speech act (or dialogue act) that occur in a
spoken dialogue — thus the utterance okay on different occasions may be an
acknowledgement, a request for feedback, an acceptance, or a pragmatic marker
initiating a new phase of discussion.
discourse annotation
e.g. adding information about anaphoric links in a text, for example connecting the
pronoun them and its antecedent the horses in: I'll saddle the horses and bring them
round. [an example from the Brown corpus]
stylistic annotation
e.g. adding information about speech and thought presentation (direct speech, indirect
speech, free indirect thought, etc.)
lexical annotation
adding the identity of the lemma of each word form in a text — i.e. the base form of the
word, such as would occur as its headword in a dictionary (e.g. lying has the lemma
LIE).
(For further information on such kinds of annotation, see Garside et al. 1997.) In fact, it
is possible to think up untold kinds of annotation that might be useful for specific kinds
of research. One example is dysfluency annotation: those working on spoken data may
wish to annotate a corpus of spontaneous speech for dysfluencies such as false starts,
repeats, hesitations, etc. (see Lickley, no date). Another illustration comes from an
area of corpus research which has flourished in the last ten years: the creation and study
of learner corpora (Granger 1998). Such corpora, consisting of writing (or speech)
produced by learners of a second language, may be annotated with 'error tags' indicating
where the learner has produced errors, and what kinds of errors these are (Granger et al.
2002).
3. Why annotate?
As I have already indicated, annotation is undertaken to give 'added value' to the corpus.
A glance at some of the advantages of an annotated corpus will help us to think about
the standards of good practice these corpora require.
Manual examination of a corpus
What has been built into the corpus in the form of annotations can also be extracted
from the corpus again, and used in various ways. For example, one of the main uses of
POS tagging is to enhance the use of a corpus in making dictionaries. Thus
lexicographers, searching through a corpus by means of a concordancer, will want to be
able to distinguish separate (verb) from separate (adjective), and if this distinction is
already signalled in the corpus by tags, the separation can be automatic, without the
painstaking search through hundreds or thousands of examples that might otherwise be
necessary. Equally, a grammarian wanting to examine the use of progressive aspect in
English (is working, has been eating, etc.) can simply search, using appropriate search
software, for sequences of BE (any form of the lemma) followed — allowing for certain
possibilities of intervening words — by the ing-form of a verb.
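As a rough illustration of such a search, the sketch below scans a line of word_TAG text for a form of BE followed, within a small window, by an ing-form. The sample sentence, the tag names (tags beginning with VB for forms of BE, VVG for the ing-form) and the function name find_progressives are assumptions made for the example, not the conventions of any particular corpus or concordancer.

```python
# Hypothetical word_TAG sample; the tag names are illustrative assumptions.
TAGGED = (
    "She_PPHS1 is_VBZ working_VVG and_CC he_PPHS1 "
    "has_VHZ been_VBN quietly_RR eating_VVG"
)


def find_progressives(tagged_text, max_gap=2):
    """Return (BE form, ing-form) pairs.

    A token tagged VB* counts as a form of BE; the first VVG token found
    after it, with at most max_gap intervening tokens, completes the pair.
    """
    tokens = [tok.rsplit("_", 1) for tok in tagged_text.split()]
    hits = []
    for i, (word, tag) in enumerate(tokens):
        if not tag.startswith("VB"):
            continue
        # Look ahead over the permitted window for the first ing-form.
        for next_word, next_tag in tokens[i + 1 : i + 2 + max_gap]:
            if next_tag == "VVG":
                hits.append((word, next_word))
                break
    return hits


print(find_progressives(TAGGED))
```

The "certain possibilities of intervening words" are modelled here only crudely, as a token count; a real query would restrict the intervening material to adverbs, negatives and the like.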
Similarly, if a corpus has been annotated in advance, this will help in many kinds of
automatic processing or analysis. For example, corpora which have been POS-tagged
can automatically yield frequency lists or frequency dictionaries with grammatical
classification. Such listings will treat leaves (verb) and leaves (noun) as different words,
to be listed and counted separately, as for most purposes they should be. Another
important case is automatic parsing, i.e. the automatic syntactic analysis of a text or a
corpus: the prior tagging of a text can be seen as a first stage of syntactic analysis from
which parsing can proceed with greater success. Thirdly, consider the case of speech
synthesis: if a text is to be read aloud by a speech synthesiser, as in the case of the
'talking books' service provided for the blind, the synthesiser needs to have the
information that a particular instance of sow is a noun (= female pig) rather than a verb
(as in to sow seeds), because this makes a difference to the word's pronunciation.
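The frequency-list case can be sketched in a few lines. The example assumes whitespace-separated word_TAG tokens; the sample sentence and its tags are invented for illustration, and tagged_frequencies is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical tagged sample; the tag names are illustrative assumptions.
TAGGED = (
    "The_AT leaves_NN2 fall_VV0 when_CS she_PPHS1 leaves_VVZ "
    "and_CC the_AT leaves_NN2 rot_VV0"
)


def tagged_frequencies(tagged_text):
    """Count (lower-cased word, tag) pairs in word_TAG text."""
    counts = Counter()
    for token in tagged_text.split():
        word, tag = token.rsplit("_", 1)
        counts[(word.lower(), tag)] += 1
    return counts


freq = tagged_frequencies(TAGGED)
for (word, tag), n in freq.most_common():
    print(f"{word}\t{tag}\t{n}")
```

Because the counting key is the pair of word and tag, leaves as a noun and leaves as a verb are listed and counted separately.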
Re-usability of annotations
Some people may say that annotating a corpus in advance is not needed for the above
cases: automatic processing could itself include the analysis of such features as part of
speech, so it is unnecessary thereafter to preserve a copy of the corpus with the built-in information
about word class. This argument may work for some cases, but generally the annotation
is far more useful if it is preserved for future use. The fact is that linguistic annotation
cannot be done both automatically and with full accuracy: because of the complex and
ambiguous nature of language, even a relatively simple annotation task such as POS-tagging
can only be done automatically with between 95% and 98% accuracy. This is far from ideal, and
to obtain an optimally tagged corpus, it is necessary to undertake manual work, often on
a large scale. The automatically tagged corpus afterwards has to be post-edited by a
team of human beings, who may spend thousands of hours on it. The result of such
work, if it makes the corpus more useful, should be built into a tagged version of the
corpus, which can then be made available to any people who want to use the tagging as
a springboard for their own research. In practice, such corpora as the LOB Corpus and
the BNC Sampler Corpus have been manually post-edited and the tagging has been used
by thousands of people. The BNC itself — all 100 million words of it — has been
automatically tagged but has not been manually post-edited, as the expense of
undertaking this task would be prohibitive. But the percentage of error — 2% — is
small enough to be discounted for many purposes. So my conclusion is that — as long
as the annotation provided is a kind useful to many users — an annotated corpus gives
'value added' because it can be readily shared by others, apart from those who originally
added the annotation. In short, an annotated corpus is a sharable resource, an example of
the electronic resources increasingly relied on for research and study in the humanities
and social sciences.
Multi-functionality
If we take the re-usability argument one step further, we note that annotation often has
many different purposes or applications: it is multi-functional. This has already been
illustrated in the case of POS tagging: the same information about the grammatical class
of words can be used for lexicography, for parsing, for frequency lists, for speech
synthesis, and for many other applications. People who build corpora are familiar with
the idea that no one in their right mind would offer to predict the future uses of a corpus
— future uses are always more variable than the originator of the corpus could have
imagined! The same is true of an annotated corpus: the annotations themselves spark off
a whole new range of uses which would not have been practicable unless the corpus had
been annotated.
However, this multi-functionality argument does not always score points for annotated
corpora. There is a contrary argument that the annotations are more useful, the more
they are designed to be specific to a particular application.
What I have said above about the usefulness of annotated corpora, of course, depends
crucially on whether the annotation has been well planned and well carried out. It is
important, then, to recommend a set of standards of good practice to be observed by
annotators wherever possible.
The annotations are added as an 'optional extra' to the corpus. It should always be easy
to separate the annotations from the raw corpus, so that the raw corpus can be retrieved
exactly in the form it had before the annotations were added. This is common sense: not
all users will find the annotations useful, and annotation should never result in any loss
of information about the original corpus data.
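A minimal sketch of this separability, assuming the simple word_TAG convention and assuming that a tag is a run of capital letters and digits: tags are attached to each whitespace-delimited token, then stripped again, and the result is compared with the raw text. The function names and the sample text are hypothetical.

```python
import re

# Assumption: a tag is a run of capitals/digits after the final underscore
# of a token, immediately followed by whitespace or the end of the text.
TAG_SUFFIX = re.compile(r"_[A-Z0-9]+(?=\s|$)")


def add_tags(raw_text, tags):
    """Attach one tag to each whitespace-delimited token, leaving whitespace untouched."""
    tag_iter = iter(tags)
    return re.sub(r"\S+", lambda m: f"{m.group(0)}_{next(tag_iter)}", raw_text)


def strip_tags(tagged_text):
    """Remove the tag suffixes, recovering the text as it was before tagging."""
    return TAG_SUFFIX.sub("", tagged_text)


raw = "Paula saw the  horses\nand left"
tags = ["NP1", "VVD", "AT", "NN2", "CC", "VVD"]

tagged = add_tags(raw, tags)
print(tagged)
print(strip_tags(tagged) == raw)
```

The round trip preserves spacing and line breaks exactly. It would fail for a raw token that itself ends in an underscore followed by capitals, which is one reason a real scheme needs a delimiter that cannot occur in the raw corpus.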
Lou Burnard (in chapter 3) emphasises the need to provide adequate documentation
about the corpus and its constituent texts. For similar reasons, it is important to provide
explicit and detailed documentation about the annotations in an annotated corpus.
Documentation to be provided about annotations should include the following, so that
users will know precisely what they're getting:
Mention any computer tools used, and any phases of revision resulting in new releases,
etc.
What annotation scheme was applied?
By coding scheme, I mean the set of symbolic conventions employed to represent the
annotations themselves, as distinct from the original corpus. Again, I will devote a
separate section to this (Section 5).
It might be thought that annotators will always proclaim the excellence of their
annotations. However, although some aspects of 'goodness' or quality elude judgement,
others can be measured with a degree of objectivity: accuracy and consistency are two
such measures. Annotators should supply what information they can on the quality of
the annotation. (see further Section 8 below.)
This and the following maxims are more open to debate. Any type of annotation
presupposes a typology — a system of classification — for the phenomena being
represented. But linguistics, like most academic disciplines, is sadly lacking in
agreement about the categories to be used in such description. Different terminologies
abound, and even the use of a single term, such as verb phrase, is notoriously a prey to
competing theories. Even an apparently simple matter, such as defining word classes
(POS), is open to considerable disagreement. Against this background, it might be
suggested that corpus annotation cannot be usefully attempted: there is no absolute
'God's truth' view of language or 'gold standard' annotation against which the decision to
call word x a noun and word y a verb can be measured.
De facto standards encapsulate what people have found to work in the past, which
argues that they should be adopted by people undertaking a new research project, to
support a growing consensus in the community. However, often a new project breaks
new ground, for example with a different kind of data, a different language, a different
purpose from those of previous projects. It would clearly be a recipe for stagnation if we
were to coerce new projects into following exactly the practices of earlier ones.
Nevertheless it makes sense for new projects to respect the outcomes of earlier projects,
and only to depart from their practices where this can be justified. In 8 below, I will
refer to some of the incipient standards for different kinds of annotation and mark-up.
These can only be presented tentatively, however, as the practice of corpus annotation is
continually evolving.
In the early 1990s, the European Union launched an initiative under the name of
EAGLES (Expert Advisory Groups on Language Engineering Standards) with the goal
of encouraging standardisation of practices for natural language processing in academia
and industry, particularly but not exclusively in the EU. One group of 'experts' set to
work on corpora, and from this and later initiatives there emerged various documents
specifying guidelines (or provisional standards) for corpus annotation. In the following
sections, I will refer to the EAGLES documents where appropriate.
But before focussing on annotation schemes and the linguistic categories they
incorporate, it will be helpful to touch briefly on the encoding of annotations — that is,
the actual symbolic representations used. This means we are for the moment
concentrating on how annotations are outwardly manifested — for example, what you
see when you inspect a corpus file on your computer screen — rather than what their
meaning is, in linguistic terms.
As an example, I have already mentioned one very simple device, the underscore
symbol, to signal the attachment of a POS tag to a word, as in Paula_NP1. The
presentation of the tag itself may be complex or simple. Here, for convenience, the
category of 'singular proper noun' is represented by a sequence of three characters, N for
noun, P for proper (noun), and 1 for singular.
One basic requirement is that the POS tag (or any other annotation device) should be
unambiguous in representing what it stands for. Another requirement, useful for
everyday purposes such as reading a concordance on a screen, is brevity: the three
characters, in this case, concisely signal the three distinguishing grammatical features of
the NP1 category. A third requirement, more useful in some contexts than in others, is
that the annotation device should be transparent to the human reader rather than opaque.
The example NP1 is at least to some degree intelligible, and is less mystifying than it
would be if some arbitrary sequence of symbols, say Q!@, had been chosen.
The type of tag illustrated above originated with the earliest corpus to be POS-tagged
(in 1971), the Brown Corpus. More recently, since the early 1990s, there has been a far-
reaching trend to standardize the representation of all phenomena of a corpus, including
annotations, by the use of a standard mark-up language — normally one of the series of
related languages SGML, HTML, and XML (see Lou Burnard, chapter 3). One
advantage of using these languages for encoding features in a text is that they provide a
general means of interchange of documents, including corpora, between one user or
research site and another. In this sense, SGML/HTML/XML have developed into a
world-wide standard which can be applied to any language, to spoken as well as to
written language, and to languages of different historical periods. Furthermore, text
encoded in these mark-up languages can be efficiently parsed or validated, enabling the
annotator to check whether there are any ill-formed traits in the mark-up, which would
signal errors or omissions. Yet another advantage is that, as time progresses, tools of
various kinds are being developed to facilitate the processing of texts encoded in these
languages. One example is the set of tools developed at the Human Communication
Research Centre, Edinburgh, for supporting linguistic annotation using XML (Carletta
et al. 2002).
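The kind of checking mentioned above can be illustrated with a general-purpose XML parser. The sketch uses Python's standard xml.etree.ElementTree module, which tests well-formedness only; validation against a DTD or schema would need additional tooling. The element names s and w and the type attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments: the second one omits a closing </w> tag.
good = '<s><w type="NP1">Paula</w> <w type="VVD">smiled</w></s>'
bad = '<s><w type="NP1">Paula <w type="VVD">smiled</w></s>'

print(is_well_formed(good))
print(is_well_formed(bad))
```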
However, one drawback of these mark-up languages is that they tend to be more
'verbose' than the earlier symbolic conventions used, for example, for the Brown and
LOB corpora. In this connection we can compare the LOB representation Paula_NP1
(Johansson 1986) with the SGML representation to be found in the BNC (first released
in 1995): <w NP1>Paula, or the even more verbose version if a closing tag is added, as
required by XML: <w type="NP1">Paula</w>. In practice, this verbosity can be
avoided by a conversion routine which could produce an output, if required, as simple
as the LOB one Paula_NP1. This, however, would require a further step of processing
which may not be easy to manage for the technically less adept user.
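A conversion routine of this kind might look roughly as follows. The sketch assumes the XML form shown above, with each word in a w element carrying its tag in a type attribute, wrapped in a hypothetical s element; xml_to_underscore is an invented name, and a real corpus would need more care over punctuation, spacing and nested elements.

```python
import xml.etree.ElementTree as ET


def xml_to_underscore(xml_text):
    """Convert <w type="TAG">word</w> elements to word_TAG tokens."""
    root = ET.fromstring(xml_text)
    # iter("w") visits every <w> element in document order.
    return " ".join(f"{w.text}_{w.get('type')}" for w in root.iter("w"))


# Hypothetical sentence wrapper <s>; only the <w> form comes from the text above.
sample = '<s><w type="NP1">Paula</w> <w type="VVD">smiled</w></s>'
print(xml_to_underscore(sample))
```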
Within the overall framework of SGML, different co-existing encoding standards have
been proposed or implemented: notably, the CDIF standard used for the mark-up of the
BNC (see Burnard 1995) and the CES recommended as an EAGLES standard (Ide
1996). One further drawback of the SGML/XML approach to encoding is that it
assumes, by default, that annotation has a 'parsable' hierarchical tree structure, which
does not allow cross-cutting brackets as in <x ...> ... <y ...> ... </x> ... </y>. Any corpus
of spoken data, in particular, is likely to contain such cross-bracketing, for example in
the cross-cutting of stretches of speech which need to be marked for different levels of
linguistic information — such phenomena as non-fluencies, interruptions, turn overlaps,
and grammatical structure are prone to cut across one another in complex ways.
This difficulty can be overcome within SGML/XML, although not without adding
considerably to the complexity of the mark-up — for example, by copious use of
pointer devices (in the BNC) or by the use of so-called stand-off annotation (Carletta et
al. 2002).
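The idea behind stand-off annotation can be sketched without any XML at all: the tokens are stored once, and each layer of annotation is kept separately as spans that point at token positions, so spans from different layers are free to cross. The layer names, labels and sample utterance below are invented for illustration and do not follow any published scheme.

```python
from itertools import combinations

tokens = ["so", "I", "I", "went", "to", "the", "uh", "station"]

# Each layer lists (start, end, label) spans over token indices, end exclusive.
# The layers live apart from the tokens, so spans may overlap freely.
standoff = {
    "dysfluency": [(1, 3, "repeat"), (6, 7, "filled-pause")],
    "syntax": [(2, 8, "clause"), (4, 8, "PP")],
    "turn-overlap": [(5, 8, "overlap-with-B")],
}


def spans_covering(layers, index):
    """Return (layer, label) for every span that covers the given token index."""
    return [
        (layer, label)
        for layer, spans in layers.items()
        for start, end, label in spans
        if start <= index < end
    ]


def crossing_pairs(layers):
    """Find pairs of spans that cross-cut, i.e. overlap without one nesting in the other."""
    flat = [
        (start, end, f"{layer}:{label}")
        for layer, spans in layers.items()
        for start, end, label in spans
    ]
    crossings = []
    for (s1, e1, n1), (s2, e2, n2) in combinations(flat, 2):
        if s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1:
            crossings.append((n1, n2))
    return crossings


print(spans_covering(standoff, 6))
print(crossing_pairs(standoff))
```

Here the repeated pronoun straddles the start of the clause, a configuration that a single tree of nested elements could not represent directly.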
CHILDES ('child language data exchange system') is likely to be the first choice not
only for those working on child language corpora, but also for those working in related
fields such as second language acquisition and code-switching. As the name suggests,
CHILDES is neither a corpus nor a coding scheme in itself, but it provides both, operating
as a service which pools together the data of many researchers all over the world, using
common coding and annotation schemes, and common software including annotation software.
6. Annotation manual
Although annotation manuals often build up piecemeal in this way, for the present
purpose we should see them as completed documents intended for corpus users. They
can be thought of as consisting of two sections — (a) a list of annotation devices and (b)
a specification of annotation practices — which I will illustrate, as before, using the
familiar case of a POS tagging scheme (for an example, see Johansson 1986 for the
LOB Corpus, or Sampson 1995, Ch. 3, for the SUSANNE Corpus).
This list acts as a glossary — a convenient first port of call for people trying to make
sense of the annotations. For POS tagging, the first thing to list is the tagset — i.e., the
list of symbols used for representing different POS categories. Such tagsets vary in size,
from about 30 tags to about 270 tags. The tagset can be listed together with a simple
definition and exemplification of what the tag means:
NN1 singular common noun (e.g. book, girl)
NN2 plural common noun (e.g. books, girls)
NP1 singular proper noun (e.g. Susan, Cairo)
etc.
The last of these, (c), is the most important: the guidelines on how to annotate particular
pieces of text can be elaborated almost ad infinitum. Taking again the example of POS
tagging, consider what this means with a particular tag such as NP1 (singular proper
noun). In the automatic tagging process, a dictionary that matches words to tags can
make a large majority of such decisions without human intervention. But problems
arise, as always, with 'grey areas' that the manual must attempt to specify. For example,
should New York be tagged as one example of NP1 or two? Should the tag NP1 apply
to [the] Pope, [the] Renaissance, Auntie, Gold (in Gold Coast), Fifth (in Fifth Avenue),
T and S (in T S Eliot), Microsoft and Word in Microsoft Word? If not, what alternative
tags should be applied to these cases? The manual should if possible answer such
questions in a principled way, so that consistency of annotation practices between
different texts and different annotators can be ensured and verified. But inevitably some
purely arbitrary distinctions have to be made. Languages suffer to varying extents from
ambiguity of word classifications, and in a language like English, a considerable
percentage of words have to be tagged variably according to their context of occurrence.
Other languages have different problems: for example, in German the initial capital is
used for common nouns as well as proper nouns, and cannot be used as a criterion for
NP1. In Chinese, there is no signal of proper noun status such as capital letters in
alphabetic languages. Indeed, more broadly considered, the whole classification of parts
of speech in the Western tradition is of doubtful validity for languages like Chinese.
In this section I will briefly list and comment on some previous work in developing
provisional de facto standards (see 4 above) of good practice for different levels of
linguistic annotation. The main message here is that anyone starting to undertake
annotation of a corpus at a particular level should take notice of previous work which
might provide a model for new work. There are two caveats, however: (a) these are only
a few of the references that might be chased up, and (b) most of these references are for
English. If you are thinking of annotating a corpus of another language, especially one
which corpus linguistics has neglected up to now, it makes sense to hunt down any
work going forward on that language, or on a closely related language. For this purpose,
grammars, dictionaries and other linguistic publications on the language should not be
neglected, even if they belong to the pre-corpus age.
The 'Brown Family' of corpora (consisting of the Brown Corpus, the LOB
Corpus, the Frown Corpus and the FLOB Corpus) makes use of a family of
similar tagging practices, originated at Brown University and further developed
at Lancaster. The two tagsets (C5 and C7) used for the tagging of the British
National Corpus are well known (see Garside et al. 1997: 254-260).
An EAGLES document which recommends flexible 'standard' guidelines for EU
languages is to be found in Leech and Wilson (1994), revised and abbreviated in
Leech and Wilson (1999).
Note that POS tagging schemes are often part of parsing schemes, to be
considered under the next heading.
Syntactic annotation
Prosodic annotation
The standard system for annotating prosody (stress, intonation, etc.) is ToBI (=
Tones and Break Indices), which comes with its own speech-processing
platform. Its phonological model originated with Pierrehumbert (1980). The
system is partially automated, but needs to be substantially adapted for fresh
languages and dialects.
ToBI is well supported by dedicated software and a committed research
community. On the other hand, it has met with criticism, and two alternative
annotation systems worth examining are INTSINT (see Hirst 1991) and TSM —
tonetic stress marks (see Knowles et al. 1996).
For a survey of prosodic annotation of dialogue, see Grice et al. (2000: 39-54).
Pragmatic/Discourse annotation
For corpus annotation, it is difficult to draw a line between pragmatics and discourse
analysis.
An international Discourse Resource Initiative (DRI) came up with some
recommendations for the analysis of spoken discourse at the level of dialogue
acts (= speech acts) and at higher levels such as dialogue transactions,
constituting a kind of 'grammar' of discourse. These were set out in the DAMSL
manual (= Dialog Act Markup in Several Layers) (Allen and Core 1997).
Other influential schemes are those of TRAINS, VERBMOBIL, the Edinburgh
Map Task Corpus, and SPAAC (Leech and Weisser 2003). These all focus on
practical task-oriented dialogue. One exceptional case is the Switchboard
DAMSL annotation project (Stolcke et al. 2000), applied to telephone
conversational data.
Discourse can also be analysed at the level of anaphoric relations (e.g. pronouns
and their antecedents; see Garside et al. 1997: 66-84).
A survey of pragmatic annotation is provided in Grice et al. (2000: 54-67).
A European project MATE (= Multi-level annotation, tools engineering) has
tackled the issue of standardization in developing tools for corpus annotation,
and more specifically for dialogue annotation, developing a workbench and an
evaluation of various schemes, investigating their applicability across languages
(http://mate.nis.sdu.dk/).
There is less to say about other levels of annotation mentioned in 2 above, either
because they are less challenging or have been less subject to efforts of standardization.
Examples particularly worth notice are:
phonetic annotation
stylistic annotation
Semino and Short (2003) have developed a detailed annotation scheme for modes of
speech and thought representation — one area of considerable interest in stylistics. This
has been applied to a varied corpus of literary and non-literary texts.
In section 4 I mentioned that the quality or 'goodness' of annotation was one important
— though rather unclear — criterion to be sought for in annotation. Reverting to the
POS-tagging example once again, we may distinguish two quite different ideas of
quality. The first refers to the linguistic realism of the categories. It would be possible
to invent tags which were easy to apply automatically with 100% accuracy — e.g. by
arbitrarily dividing a dictionary into 100 parts and assigning a set of 100 tags to words
in the dictionary according to their alphabetical order — but these tags would be useless
for any serious linguistic analysis. Hence we have to make sure that our tagset is well
designed to bring together in one category words which are likely to have psychological
and linguistic affinity, i.e. are similar in terms of their syntactic distribution, their
morphological form, and/or their semantic interpretation.
A second, less abstract, notion of quality refers not to the tagset, but to the accuracy and
consistency with which it is applied.
Accuracy refers to the percentage of words (i.e. word tokens) in a corpus which are
correctly tagged. Allowing for ambiguity in tag assignment, this is sometimes divided
into two categories — precision and recall — see van Halteren (1999: 81-86).
Recall is the extent to which all correct annotations are found in the output of the
tagger.
Precision is the extent to which incorrect annotations are rejected from the output.
The obvious question to ask here is: what is meant by 'correct'? The answer is:
'correctness' is defined by what the annotation scheme allows or disallows — and this is
an added reason why the annotation scheme has to be specific in detail, and has to
correspond as closely as possible with linguistic realities recognized as such.
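The two measures can be made concrete with a small calculation. The sketch assumes one common reading of them for a tagger that may leave more than one tag on an ambiguous token: recall is the share of tokens whose correct tag appears in the output, and precision is the share of output tags that are correct. The data and the function name are invented, and the 'correct' tags simply stand for whatever the annotation scheme prescribes.

```python
def precision_recall(gold, output):
    """Compute (precision, recall) for possibly ambiguous tagger output.

    gold:   list with the one correct tag per token.
    output: list with the set of tags the tagger left on each token.
    """
    if len(gold) != len(output):
        raise ValueError("gold and output must cover the same tokens")
    # A token counts as found if its correct tag is among the output tags.
    correct = sum(1 for tag, proposed in zip(gold, output) if tag in proposed)
    total_proposed = sum(len(proposed) for proposed in output)
    recall = correct / len(gold)
    precision = correct / total_proposed
    return precision, recall


# Hypothetical four-token sample: one token left ambiguous, one mis-tagged.
gold = ["AT", "NN2", "VV0", "RR"]
output = [{"AT"}, {"NN2", "VVZ"}, {"VV0"}, {"JJ"}]

precision, recall = precision_recall(gold, output)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

When the tagger outputs exactly one tag per token, the two figures coincide and reduce to simple accuracy.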
For example, automatic taggers can achieve tagging as high as 98% correct. However,
this is not as good as it could be, so the automatic tagging is often followed by a post-
editing stage in which human analysts correct any mistakes in the automatic tagging, or
resolve any ambiguities.
The first question here is: is it possible for hand-editors to achieve 100% accuracy?
Most people will find this unlikely, because of the unpredictable peculiarities of
language that crop up in a corpus, and because of the failure of even the most detailed
annotation schemes to deal with all eventualities. Perhaps between 99% and 99.5%
accuracy might be the best that can be achieved, given that unclear and unprecedented
cases are bound to arise. Nevertheless, 99.5% accuracy achieved with the help of a
human post-editor would still be preferable to 96% or 97% as the result of just
automatic tagging. Accuracy is therefore one criterion of quality in POS-tagging, and
indeed in any annotation task.
A second question that may be asked is: how consistently has the annotation task been
performed? One way to test this in POS tagging is to have two human annotators post-
edit the same piece of automatically-tagged text, and to determine in what percentage of
cases they agree with one another. The more this consistency measure (called inter-
rater agreement) approaches 100%, the higher the quality of the annotation. (Accuracy
and consistency are obviously related: if both raters achieve 100% accuracy, it is
inevitable that they achieve 100% consistency.)
In the early days of POS-tagging evaluation, it was feared that up to 5% of words would
be so uncertain in their word class that a high degree of accuracy and of consistency
could not be achieved. However, this is too pessimistic: Baker (1997) and Voutilainen
and Järvinen (1995) have shown how scores not far short of 100% can be attained for
both measures.
\[ K = \frac{P(A) - P(E)}{1 - P(E)} \]
"where P(A) is the proportion of times that the coders agree and P(E) is the proportion of
times that we would expect them to agree by chance." (Carletta 1996: 4).
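A worked example may help. The sketch below computes P(A), P(E) and K for two hypothetical coders tagging the same ten tokens. It assumes that chance agreement is estimated from each coder's own distribution of tags; another variant pools the two coders' distributions, and the formula above does not in itself settle which is intended.

```python
from collections import Counter


def kappa(coder_a, coder_b):
    """Return (P(A), P(E), K) for two equally long lists of labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: proportion of items given the same label.
    p_a = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, assuming each coder labels independently
    # according to their own label frequencies.
    dist_a, dist_b = Counter(coder_a), Counter(coder_b)
    labels = dist_a.keys() | dist_b.keys()
    p_e = sum((dist_a[label] / n) * (dist_b[label] / n) for label in labels)
    # Note: K is undefined when P(E) equals 1 (both coders use a single label).
    return p_a, p_e, (p_a - p_e) / (1 - p_e)


# Hypothetical data: the coders agree on 8 of 10 tokens.
coder_a = ["NN1"] * 5 + ["VV0"] * 5
coder_b = ["NN1"] * 4 + ["VV0"] * 5 + ["NN1"]

p_a, p_e, k = kappa(coder_a, coder_b)
print(f"P(A) = {p_a:.2f}, P(E) = {p_e:.2f}, K = {k:.2f}")
```

With only two equally frequent tags, half of the observed agreement would be expected by chance, so 80% raw agreement corresponds to a considerably lower K.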
It is not necessary to have special software. You can annotate the text using a
general-purpose text editor or word processor. But this means the job has to be
done by hand, which risks being slow and prone to error.
For some purposes, particularly if the corpus is large and is to be made available
for general use, it is important to have the annotation validated. That is, the
vocabulary of annotation is controlled and is allowed to occur only in
syntactically valid ways. A validating tool can be written from scratch, or can
use macros for word processors or editors.
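A validating tool written from scratch can be very small. The sketch below checks that every token has the word_TAG form and that its tag belongs to a controlled tagset. The three tags are those exemplified in section 6; a real tagset would be far larger, and the function name, token pattern and error messages are assumptions made for the example.

```python
import re

# Illustrative tagset only: the three tags exemplified in section 6.
TAGSET = {"NN1", "NN2", "NP1"}

# Assumption: a valid token is word_TAG, with TAG made of capitals and digits.
TOKEN = re.compile(r"^(?P<word>\S+)_(?P<tag>[A-Z0-9]+)$")


def validate(tagged_text, tagset=TAGSET):
    """Return a list of (position, token, problem) for every invalid token."""
    errors = []
    for position, token in enumerate(tagged_text.split(), start=1):
        match = TOKEN.match(token)
        if match is None:
            errors.append((position, token, "not in word_TAG form"))
        elif match.group("tag") not in tagset:
            errors.append((position, token, "unknown tag"))
    return errors


print(validate("Susan_NP1 books_NN2 girl_NN1"))
print(validate("Susan_NP1 books_NNX girl"))
```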
If you decide to use XML-compliant annotation, this means that you have the
option to make use of the increasingly available XML editors. An XML editor,
in conjunction with a DTD or schema, can do the job of enforcing well-
formedness or validity without any programming of the software, although a
high degree of expertise with XML will come in useful.
Special tagging software has been developed for large projects — for example
the CLAWS tagger and Template Tagger used for the Brown Family of corpora
and the BNC. Such programs or packages can be licensed for your own
annotation work. (For CLAWS, see the UCREL website
http://www.comp.lancs.ac.uk/ucrel/.)
There are tagsets which come with specific software — e.g. the C5, C7 and C8
tagsets for CLAWS, and CHAT for the CHILDES system, which is the de facto
standard for language acquisition data.
There are more general architectures for handling texts, language data and
software systems for building and annotating corpora. The most prominent
example of this is GATE ('general architecture for text engineering',
http://gate.ac.uk), developed at the University of Sheffield.
© Geoffrey Leech 2004. The right of Geoffrey Leech to be identified as the Author of
this Work has been asserted by him in accordance with the Copyright, Designs and
Patents Act 1988.
All material supplied via the Arts and Humanities Data Service is protected by
copyright, and duplication or sale of all or any part of it is not permitted, except that
material may be duplicated by you for your personal research use or educational
purposes in electronic or print form. Permission for any other use must be obtained from
the Arts and Humanities Data Service.
Electronic or print copies may not be offered, whether for sale or otherwise, to any third
party.