Corpus Linguistics
The word "corpus", derived from the Latin word meaning "body", may be used to refer to
any text in written or spoken form. However, in modern Linguistics this term is used to
refer to large collections of texts which represent a sample of a particular variety or use of
language(s) that are presented in machine readable form. Other definitions, broader or
stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony
McEnery and Andrew Wilson or read more about different kinds of corpora in the
Systematic Dictionary of Corpus Linguistics.
Computer-readable corpora can consist of raw text only, i.e. plain text with no additional
information. Many corpora have been provided with some kind of linguistic information,
here called mark-up or annotation.
Types of corpora
There are many different kinds of corpora. They can contain written or spoken
(transcribed) language, modern or old texts, texts from one language or several
languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist
of extracts of varying length. The kind of texts included and the combination of different
texts vary between different corpora and corpus types.
'General corpora' consist of general texts, texts that do not belong to a single text type,
subject field, or register. An example of a general corpus is the British National Corpus.
Some corpora contain texts that are sampled (chosen from) a particular variety of a
language, for example, from a particular dialect or from a particular subject area. These
corpora are sometimes called 'Sublanguage Corpora'.
Corpora can consist of texts in one language (or language variety) only or of texts in
more than one language. If the texts are the same in all languages, i.e. translations, the
corpus is called a Parallel Corpus. A Comparable Corpus is a collection of "similar" texts
in more than one language or variety. One of the main advantages of computer-readable
corpora is that they make it possible to get data quickly and easily and also to have this
data presented in a format suitable for analysis.
Corpus linguistics is, however, not simply a matter of obtaining language data through
the use of computers. Corpus linguistics is the study and analysis of data obtained from a
corpus. The main task of the corpus linguist is not to find the data but to analyse it.
Computers are useful, and sometimes indispensable, tools in this process.
Learn more
If you want to learn more about corpora and corpus linguistics you can use the links
below. On the Background page you can follow the development of corpus linguistics
through presentations of some central corpora/kinds of corpora. On the Working with
Corpora page you will find information about things to think about when you want to use
corpora for language learning or research. Use the Tutorial to learn how to make corpus
searches and analyse the results, or go straight to the Search Engine to make online
searches in a number of corpora.
Background
The use of collections of text in language study is not a new idea. In the Middle Ages
work began on making lists of all the words in particular texts, together with their
contexts - what we today call concordancing. Other scholars counted word frequencies
in single texts or in collections of texts and produced lists of the most frequent
words. Areas where corpora were used include language acquisition, syntax, semantics,
and comparative linguistics, among others. Even if the term 'corpus linguistics' was not
used, much of the work was similar to the kind of corpus-based research we do today,
with one great exception - they did not use computers.
You can learn more about early corpus linguistics HERE (external link). We will move
on to look at some important stages in the development of corpus linguistics by focusing
on some central corpora. The presentation below is not an extensive account of all
corpora or every stage, but merely meant to help you get familiar with some key corpora
and concepts.
Big is beautiful?
In 1995 another large corpus was released: the British National Corpus (BNC). This
corpus consists of some 100 million words. Like the BoE it contains both written and
spoken material, but unlike the BoE, the BNC is finite - no more texts were added to it after
its completion. The BNC texts were selected according to carefully pre-defined selection
criteria, with targets set for the amount of text to be included from different text types
(learn more HERE). The texts have been encoded with mark-up providing information
about the texts, authors, and speakers.
Specialized corpora
Historical corpora
The use of collections of text in the study of language is, as we have seen, not a new
invention. Among those involved in historical linguistics were some who soon saw the
potential usefulness of computerised historical corpora. A diachronic corpus with English
texts from different periods was compiled at the University of Helsinki. The Helsinki
Corpus of English Texts contains texts from the Old, Middle and Early Modern English
periods, 1.5 million words in total.
Another historical corpus is the recently released Lampeter Corpus of Early Modern
English Tracts. This collection consists of "[P]amphlets and tracts published in the
century between 1640 and 1740" from six different domains. The Lampeter Corpus can
be seen as one example of a corpus covering a more specialized area.
International/multilingual Corpora
As we have seen above, there is a great variety of corpora in English. So far much corpus
work has indeed concerned the English language, for various reasons. There are,
however, a growing number of corpora available in other languages as well. Some of
them are monolingual corpora - collections of text from one language. Here the Oslo
Corpus of Bosnian Texts and the Contemporary Portuguese Corpus can be mentioned as
two examples.
A number of multilingual corpora also exist. Many of these are parallel corpora; corpora
with the same text in several languages. These corpora are often used in the field of
Machine Translation. The English-Norwegian Parallel Corpus is one example, the
English-Turkish Aligned Parallel Corpora another.
The Linguistic Data Consortium (LDC) holds a collection of telephone conversations in
various languages: CALLFRIEND and CALLHOME.
Other
The increased availability and use of the Internet have made it possible to find great
amounts of text readily available in electronic format. Apart from all the web pages
containing information of different kinds, it is also possible to find whole collections of
text. Among these collections can be mentioned all the on-line newspapers and journals
(example), and sites where whole books can be found on-line (example). Other examples
yet include dictionaries and word-lists of various kinds.
Although these collections may not be considered corpora for one reason or another (see
definition of corpus), they can be analysed with corpus linguistic tools and methods. This
is an area which has not yet been explored in detail, although some attempts have been
made at using the Internet as one big corpus.
Further information about collections of text available on the Internet can be found on the
Related Sites page.
Ongoing projects
Others
The number and diversity of corpus-related research projects and groups are great. Below
is a small sample to give you an understanding of the scope and variety. You can find
more information by following the links on the Related Sites page.
AMALGAM Automatic Mapping Among Lexico-Grammatical Annotation
"an attempt to create a set of mapping algorithms to map between the main
tagsets and phrase structure grammar schemes used in various research corpora"
(home page)
The Canterbury Tales Project
"aims to make available ... full transcripts of the ... Canterbury Tales" (home
page).
CSLU: The Center for Spoken Language Understanding
"a multidisciplinary center for research in the area of spoken language
understanding" (home page).
ETAP : Creating and annotating a parallel corpus for the recognition of
translation equivalents
This project, run at the University of Uppsala, Sweden, aims to develop a
computerized multilingual corpus based on Swedish source text with translations
into Dutch, English, Finnish, French, German, Italian and Spanish. (home page)
TELRI
TELRI is an initiative, funded by the European Commission, meant to facilitate
work in the field of Natural Language Processing (NLP) by, among other things,
supplying various language resources. Read more on the home page.
What next?
The interest in computerised corpora and corpus linguistics is growing. More and more
universities offer courses in corpus linguistics and/or use corpora in their teaching and
research. The corpora being compiled are many and diverse, and corpora are used in
many projects. It is not possible to go into detail and present all the corpora, all
the courses, and all the projects here. This has been meant as a brief introduction. More
information can be found by browsing the net and reading journals and books. The
electronic mailing list Corpora can be a good starting point for someone who wishes to
learn about what is going on within the field of corpus linguistics at the moment.
Using corpora
To be able to use corpora successfully in linguistic study or research there are some areas
that you may want to look into.
The corpus
The kind of corpus you use and the kind of texts included in it are factors
that affect what results you get. Read more about
o choosing a corpus and
o knowing your corpus
The tools
There are a number of tools (computer programs) available to use with
corpora. The basic functions are usually to search the corpus and display
the hits, with different options for what kinds of searches can be made
and how the hits can be displayed. For a presentation of what
corpus handling tools can do, click HERE (link to be added). For a list of
software to use with corpora, use this link.
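To give a concrete feel for the most basic of these functions, below is a minimal keyword-in-context (KWIC) concordancer sketched in Python. It is only an illustration of the idea - the sample text and the window of five words on each side are our own choices, not features of any particular program.

# A minimal keyword-in-context (KWIC) concordancer.
def kwic(text, keyword, context=5):
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip('.,;:!?"') == keyword.lower():
            left = ' '.join(words[max(0, i - context):i])
            right = ' '.join(words[i + 1:i + 1 + context])
            print(f"{left:>40}  {word}  {right}")

sample = ("The corpus was searched by the students and the corpus "
          "gave many hits. A second corpus gave far fewer hits.")
kwic(sample, "corpus")

Each hit is printed on its own line with the keyword aligned in the middle, which is essentially what dedicated concordancing software does on a much larger scale.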
The know-how
It is not difficult to search a corpus and find examples of various kinds
once you know how to use your tool. In the tutorial you are introduced to
the W3-Corpora search engine and shown how you can use it on various
corpora for a number of research tasks. The illustrations and comments
will provide you with examples of the kinds of questions that are useful to
ask when you are working with corpora.
Which corpus should I choose?
The choice of corpus is very important for the kind of results you will get and what the
results can tell you. When deciding which corpus to use, there are certain points that are
worth considering.
In the List of Corpora you will find corpora of various kinds under certain sub-headings
(spoken, historical, etc.).
What is available?
A very important question to consider when setting out to make a corpus-based study is
'what is available?'. There is a number of corpora, but not all of them are
publically available
readily available
Publically available corpora are those which anyone can use for free. Most corpora are
not publically available. Some are available to anyone who buys a copy of it or a licence
to use it, which may vary in cost between a few ponds (to cover administrative costs) to
several hundred pounds. Some corpora are not available to anyone but their owners, and
therefore not possible to obtain.
By readily available we here mean corpora which are ready to be used at once. What is
readily available varies between different institutions. Some have corpora installed on
their network, or stored on CD-ROMs. These are then available to anyone who has access
to that network/CD-ROM and knows how to use the corpus. Other institutions do not have
access to any corpora, or not to the corpora that are needed for the particular task/study.
When this is the case, the options are to try to get access to the corpus, or to use some
other data or method.
Getting a corpus usually means acquiring it (buying, downloading, compiling), installing
it, and finding the right tools to use with it. This can be a time-consuming, complicated
and costly procedure. Some corpora can be accessed online, freely or at a cost. You will
find a list of such corpora here.
Tools
There are a number of different programs and search engines available for use with
corpora, and some are presented on the 'tools' page (to be added).
Knowing your corpus
Something about corpus compilation
Combining texts into a corpus is called compiling a corpus. There are various ways of
doing this, depending on what kind of corpus you want to create and on what resources
(time, money, knowledge) you have at your disposal.
Even if you are not compiling your own corpus, it is important to know something about
corpus compilation when you use a corpus. Using a corpus means using a selection of texts to
represent the language. How the corpus has been compiled is of utmost importance for
the results you get when using it. What texts are included, how these are marked up, the
proportions of different text types, the size of the various texts, how the texts have been
selected, etc. are all important issues.
Imagine cutting a newspaper up into slips of paper with one word on each, putting the
slips in a bowl, and drawing two samples of 100 slips. Frequent words, such as 'the',
would probably turn up in both samples, whereas an infrequent word might be
found in only one of the samples (at most). Words that occur infrequently would not
necessarily be evenly distributed across the two samples.
Now imagine that you divide the newspaper into sections (or classify its content into
categories/text types) before cutting it up, and then put the cuttings in different bowls. By
picking your paper slips from the different bowls you can influence the composition of
your sample. You can choose to take slips from only one bowl or from several, in equal or
different proportions. If there is a difference in the language in the bowls, there will be a
difference in the language on the slips and that will affect your sample correspondingly.
You can easily see that if you were to take 100 slips of paper from the 'sports' bowl and
100 slips from the 'editorial' bowl, you would probably find a larger number of the word
football in the sample taken from the 'sports' bowl than from the 'editorial'.
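The effect is easy to simulate. The short Python sketch below builds a made-up 'newspaper' in which 'the' is frequent and 'football' is rare, then draws two random samples of 100 slips; the word counts are invented purely for illustration.

import random

# A made-up 'newspaper' cut into one-word slips:
# 'the' is very frequent, 'football' is rare.
newspaper = ['the'] * 500 + ['football'] * 5 + ['match'] * 495

for n in (1, 2):
    sample = random.sample(newspaper, 100)
    print(f"sample {n}: 'the' x{sample.count('the')}, "
          f"'football' x{sample.count('football')}")

Running it a few times shows that 'the' turns up at similar rates in both samples, while 'football' often appears in only one of them, or in neither.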
Corpus compilation
We can use the image above to give a (simplified) description of how a corpus can be
created. (We will not go into any practical issues here - this is merely intended to give
you an understanding of why it is important to know the corpus you use.) If we imagine
the language as a whole as the newspaper, we can say that the words on the slips of
paper are texts (bits of spoken or written language). You create (compile) a corpus by
selecting texts from the language. The composition of the corpus depends on the kind of
texts you use, how you have selected them, and in what proportions. If you have divided
your paper into sections you can decide to use more texts from one section, to use texts
from one section only, to use a set proportion of texts from each section, to use a set
number of texts from each section, etc. What kind of bowls you use will also make a
difference - will you have bowls for various text types (reviews, editorials, news
reportage, etc.), or sort the cuttings according to author specification (age of author, sex,
education, etc.)? Perhaps sorted according to the time when they were written, intended
reader, colour of print? How do you classify the texts? If you look at the slips before you
select/discard them, the composition of your sample/corpus will reflect the choices you
made (for example, you may choose to select texts which contain some particular
feature/construction/vocabulary item, irrespective of what section they come from).
Disclaimer
The image of the language as a newspaper perhaps gives the impression that 'the
language as a whole' is a well-defined and generally agreed upon notion, something that
is concrete and possible to quantify. This is far from the case. We should not forget
that language is not a confined, closed entity but a very difficult notion to
define, quantitatively or qualitatively. Try to decide, for example, how much language
you use in a day. Do you then count only the language you produce or all the language
you come into contact with? What are the proportions of written and spoken language?
Should the spoken language you hear on the radio (actively listening or just overhearing)
be counted differently from the spoken language directed to you? Does it make a
difference if you talk/write to several people or just to one? What is language spoken to a
dictaphone or answering machine? Would a shopping list be counted as language? What
about a diagram you make/see as an illustration to a text (spoken or written)? Etc.
When compiling a corpus, you do not only have to take into account how you define
language - you also have to decide what proportions of different varieties of language you
want to include in your corpus. Once that is settled, you have to acquire the texts.
Articles from newspapers and books can be easy enough to get hold of,
and transcripts/scripts of certain radio and TV programs as well. How do you get the
more personal writings like letters and diaries, though? And records of personal
conversations, confessions, information given in confidence, etc.? Moreover, as many
corpus compilers can testify, much time and effort has to be spent on legal issues such as
obtaining permission to use the texts and making sure that no copyright agreement is
broken.
Summary
When you think of what we have described above, it is easy to understand why it is
important to know something about how a corpus is compiled and what kind of text
sample are included. Among the issues that have to be considered, then, by both corpus
compilors and corpus users are:
the language sampled (what kind of newspaper has been used?)
the size of the corpus (how many pieces of paper were taken from the newspaper
bowl?)
kind of text included (from which bowls was the sample taken?)
the proportions of different text types (how many slips of paper from each bowl?)
If the corpus consists of samples from a particular variety of language (from the 'sports'
bowl, for example) you will find that it may be very different from another sample taken
from another bowl. Moreover, it is important to know about the size of the corpus and the
size/number of samples making up the corpus. If you have a big corpus (a large
proportion of the newspaper) you may be able to find even rare words. In a small sample
you have a bigger chance of missing something (think of all the words you don't get if
you take only ten slips from the newspaper bowl, for example). If the corpus consists of a
large part of one particular bowl, you get a good picture of that particular bowl. It may or
may not be different from a sample from another bowl. If you have a corpus of the same size
but consisting of several small samples from different bowls, you will have a broader
corpus (covering more areas). The samples from each bowl are still small, however, so you
may not be able to say much about the language in any one bowl.
Among the practical matters that have to be solved by the compiler are:
how can the texts be obtained? Where do they exist? (in books, on the WWW, etc)
do you need permission to use the texts?
do you need to process the material to include it (transcribe, code, convert files,
etc)?
how can the texts be converted to the format you want them in (made
electronically readable by scanning, keying-in, converting files to the right format,
etc.)?
Though the user of the corpus does not have to make decisions about these practical
matters, there are other issues that are important for the user to be aware of. Among those
are, for example:
permission to use the corpus.
Some corpora are only available to licence holders or for particular purposes
(such as non-commercial academic research, teaching, personal use, etc)
permission to reproduce text.
You may be permitted to use the texts as long as you do not quote them or publish
them.
format of the texts.
Some texts may be available only in particular formats that cannot be read by a
usual word processor, for example.
software.
A number of programs and search engines have been developed for use on
corpora in general or on specific corpora. A basic knowledge of, and access to,
some of these tools may be necessary in order to make use of the corpus.
Annotated Corpora
Apart from the pure text, a corpus can also be provided with additional linguistic
information, called 'annotation'. This information can be of various kinds, such as
prosodic, semantic or historical annotation. The most common form of annotated corpus
is the grammatically tagged one. In a grammatically tagged corpus, the words have been
assigned a word class label (part-of-speech tag). The Brown Corpus, the LOB Corpus and
the British National Corpus (BNC) are examples of grammatically annotated corpora.
The LLC Corpus has been prosodically annotated. The Susanne Corpus is an example of
a parsed corpus, a corpus that has been syntactically analysed and annotated.
Annotated corpora constitute a very useful tool for research. In the Tutorial you can find
examples of how to make use of the annotation when searching a corpus.
Further information about corpus annotation and annotated corpora can be found, for
example, in the book Corpus Annotation: Linguistic Information from Computer Text
Corpora (external link), or by using the following links:
Types of annotation
Certain kinds of linguistic annotation, which involve the attachment of special codes to
words in order to indicate particular features, are often known as "tagging" rather than
annotation, and the codes which are assigned to features are known as "tags". These
terms will be used in the sections which follow:
Part of Speech annotation
Lemmatisation
Parsing
Semantics
Discoursal and text linguistic annotation
Phonetic transcription
Prosody
Problem-oriented tagging
Part-of-speech Annotation
This is the most basic type of linguistic corpus annotation - the aim being to assign to
each lexical unit in the text a code indicating its part of speech. Part-of-speech annotation
is useful because it increases the specificity of data retrieval from corpora, and also forms
an essential foundation for further forms of analysis (such as syntactic parsing and
semantic field annotation). Part-of-speech annotation also allows us to distinguish
between homographs.
Click here for an example of part-of-speech annotation.
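For a rough illustration of what a part-of-speech tagger produces, the freely available NLTK toolkit can be used (a sketch only - NLTK is not one of the taggers discussed below); its default tagger assigns Penn Treebank tags:

import nltk

# One-off model downloads: nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger')
words = nltk.word_tokenize("Claudia sat on a stool.")
print(nltk.pos_tag(words))
# e.g. [('Claudia', 'NNP'), ('sat', 'VBD'), ('on', 'IN'),
#       ('a', 'DT'), ('stool', 'NN'), ('.', '.')]

The tagged output makes it possible, for example, to retrieve 'stool' only when it is used as a noun, which is exactly the gain in specificity described above.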
Part-of-speech annotation was one of the first types of annotation to be performed on
corpora and is the most common today. One reason for this is that it is a task that can
be carried out to a high degree of accuracy by a computer. Greene and Rubin (1971)
achieved a 71% accuracy rate of correctly tagged words with their early part-of-speech
tagging program (TAGGIT). In the early 1980s the UCREL team at Lancaster University
reported a success rate of 95% using their program CLAWS.
Read about idiomatic tags and the tagging of contracted forms in Corpus Linguistics,
chapter 2, pages 40-42.
Lemmatisation
Lemmatisation is closely allied to the identification of parts-of-speech and involves the
reduction of the words in a corpus to their respective lexemes. Lemmatisation allows the
researcher to extract and examine all the variants of a particular lexeme without having to
input all the possible variants, and to produce frequency and distribution information for
the lexeme. Although accurate software has been developed for this purpose (Beale
1987), lemmatisation has not been applied to many of the more widely available corpora.
However, the SUSANNE corpus does contain lemmatised forms of the corpus words,
along with other information. See the example below - the fourth column contains the
lemmatised words:
N12:0510g   PPHS1m   He        he
N12:0510h   VVDv     studied   study
N12:0510i   AT       the       the
N12:0510j   NN1c     problem   problem
N12:0510k   IF       for       for
N12:0510m   DD221    a         a
N12:0510n   DD222    few       few
N12:0510p   NNT2     seconds   second
N12:0520a   CC       and       and
N12:0520b   VVDv     thought   think
N12:0520c   IO       of        of
N12:0520d   AT1      a         a
N12:0520e   NNc      means     means
N12:0520f   IIb      by        by
N12:0520g   DDQr     which     which
N12:0520h   PPH1     it        it
N12:0520i   VMd      might     may
N12:0520j   VB0      be        be
N12:0520k   VVNt     solved    solve
N12:0520m   YF       +.        -
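To get a feel for automatic lemmatisation, NLTK's WordNet lemmatizer can be used (a sketch only - this is not the software of Beale (1987) or the SUSANNE tools). Note that the correct lexeme often depends on knowing the part of speech first, which is why lemmatisation is closely allied to part-of-speech identification:

from collections import Counter
from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet') once

lemmatize = WordNetLemmatizer().lemmatize
words = "He studied the problem for a few seconds and solved it".lower().split()
# Crude: treats every word as a verb; real systems consult the POS tags first.
lemmas = [lemmatize(w, pos='v') for w in words]
print(lemmas)           # 'studied' -> 'study', 'solved' -> 'solve'
print(Counter(lemmas))  # frequency information per lexeme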
Parsing
Parsing involves the procedure of bringing basic morphosyntactic categories into
high-level syntactic relationships with one another. This is probably the most commonly
encountered form of corpus annotation after part-of-speech tagging. Parsed corpora are
sometimes known as treebanks. This term alludes to the tree diagrams or "phrase
markers" used in parsing. For example, the sentence "Claudia sat on a stool" (BNC)
might be represented by a tree diagram with the sentence node S at the top, branching
into its constituent phrases.
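The same structure can also be written as a labelled bracketing and displayed with, for example, NLTK's Tree class; the constituent labels below are a simplification of our own, not the BNC's annotation:

from nltk import Tree

# "Claudia sat on a stool" as a labelled bracketing.
tree = Tree.fromstring(
    "(S (NP Claudia) (VP (V sat) (PP (P on) (NP (Det a) (N stool)))))")
tree.pretty_print()  # draws the phrase marker with S at the top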
In depth: You might want to read about full parsing, skeleton parsing, and constraint
grammar by following this link.
Because automatic parsing (via computer programs) has a lower success rate than
part-of-speech annotation, it is often either post-edited by human analysts or carried out by hand
(although possibly with the help of parsing software). The disadvantage of manual
parsing, however, is inconsistency, especially where more than one person is parsing or
editing the corpus, which can often be the case on large projects. The solution is more
detailed guidelines, but even then ambiguities can occur where more than one
interpretation is possible.
Parsing: in depth
Not all parsing systems are the same. The two main differences are:
The number of constituent types which a system employs.
The way in which constituent types are allowed to combine with each other.
However, despite these differences, the majority of parsing schemes are based on a form
of context-free phrase structure grammar. Within this system an important distinction
must be made between full parsing and skeleton parsing.
Full parsing aims to provide as detailed an analysis of the sentence structure as possible,
while skeleton parsing is a less detailed approach which tends to use a less finely
distinguished set of syntactic constituent types and ignores, for example, the internal
structure of certain constituent types. The two examples below show the differences.
Full parsing:
[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns
the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&]
heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn
teamed_VBN Vn] [R up_RP R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_,
[JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS
Np]P]Tn]Fr]Ns] ._. S]
Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh-word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participle phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular
Skeleton Parsing
[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1
N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1
victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2
[P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0
seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N
scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT
V]Fr]N]P]V]S+] ._.
Constraint grammar
It is not always the case that a corpus is parsed using context-free phrase structure
grammar. For example, the Birmingham Bank of English has been part-of-speech tagged
and parsed using constraint grammar, in which each word form is followed by a line of
analysis, as in the extract below:
"<have>"
"have" V INF @-FMAINV
"<maintained>"
"maintain" PCP2 @-FMAINV
"<present>"
"present" A ABS @AN>
"<boundaries>"
"boundary" N NOM PL @OBJ
"<intact>"
"intact" A ABS @PCOMPL-O
On the line next to each word are three (or sometimes more) pieces of information. The
first item in double quotes is the lemma of that word; following that is a part-of-speech
code (which can include more than one string, e.g. N NOM PL); and at the right-hand end
of the line is a tag indicating the grammatical function of the word. These begin with a @
and stand for:
@+FMAINV   finite main verb
@-FMAINV   non-finite main verb
@AN>       premodifying adjective
@CC        coordinator
@DN>       determiner
@GN>       premodifying genitive
@INFMARK>  infinitive marker
@NN>       premodifying noun
@OBJ       object
@PCOMPL-O  object complement
@PCOMPL-S  subject complement
@QN>       premodifying quantifier
@SUBJ      subject
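Since each analysis line has this fixed shape (lemma, part-of-speech code, function tag), it is straightforward to take apart with a few lines of Python; the helper below is hypothetical, written just for this record format:

def parse_cg_line(line):
    # '"boundary" N NOM PL @OBJ' -> ('boundary', 'N NOM PL', '@OBJ')
    parts = line.split()
    lemma = parts[0].strip('"')
    function = parts[-1] if parts[-1].startswith('@') else None
    pos = ' '.join(parts[1:-1] if function else parts[1:])
    return lemma, pos, function

print(parse_cg_line('"boundary" N NOM PL @OBJ'))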
Semantics
Two types of semantic annotation can be identified:
1. The marking of semantic relationships between items in the text, for example
the agents or patients of particular actions. This type of annotation has scarcely
begun to be widely applied at the time of writing, although some forms of parsing
capture much of its import.
2. The marking of semantic features of words in the text, essentially the
annotation of word senses in one form or another. This has quite a long history,
dating back to the 1960s.
There is no universal agreement about which semantic features ought to be annotated - in
fact, in the past much of the annotation was motivated by social-scientific theories of, for
instance, social interaction. However, Sedelow and Sedelow (1969) made use of Roget's
Thesaurus - in which words are organised into general semantic categories.
The example below (Wilson, forthcoming) is intended to give the reader an idea of the
types of categories used in semantic tagging:
And        00000000
the        00000000
soldiers   23241000
platted    21072000
a          00000000
crown      21110400
of         00000000
thorns     13010000
and        00000000
put        21072000
it         00000000
on         00000000
his        00000000
head       21030000
and        00000000
they       00000000
put        21072000
on         00000000
him        00000000
a          00000000
purple     31241100
robe       21110321

The codes stand for the following categories:

00000000   Low content word (and, the, a, of, on, his, they, etc.)
13010000   Plant life in general
21030000   Body and body parts
21072000   Object-oriented physical activity (e.g. put)
21110321   Men's clothing: outer clothing
21110400   Headgear
23241000   War and conflict: general
31241100   Colour
The semantic categories are represented by 8-digit numbers. The scheme above is based on
that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three
top-level categories, which are themselves subdivided, and so on.
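At its simplest, this kind of semantic tagging can be pictured as a lexicon lookup. The Python sketch below reuses a few of the codes from the example above, with every unknown word falling back to the 'low content' code; real semantic taggers additionally disambiguate word senses in context:

# Toy semantic tagger: look each word up in a category lexicon.
LEXICON = {
    'soldiers': '23241000',  # war and conflict: general
    'thorns':   '13010000',  # plant life in general
    'head':     '21030000',  # body and body parts
    'crown':    '21110400',  # headgear
    'put':      '21072000',  # object-oriented physical activity
}

def semantic_tags(words):
    return [(w, LEXICON.get(w.lower(), '00000000')) for w in words]

print(semantic_tags("and put it on his head".split()))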
Discourse tags
Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags".
They included categories such as:
"apologies" e.g. sorry, excuse me
"greetings" e.g. hello
"hedges" e.g. kind of, sort of thing
"politeness" e.g. please
"responses" e.g. really, that's right
Despite their potential role in the analysis of discourse, these kinds of annotation have
never become widely used, possibly because the linguistic categories are context-
dependent and their identification in texts is a greater source of dispute than for other
forms of linguistic phenomena.
Anaphoric annotation
Cohesion is the vehicle by which elements in text are linked together, through the use of
pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in
English" (1976) was considered to be a turning point in linguistics, as it was the most
influential account of cohesion. Anaphoric annotation is the marking of pronoun
reference - our pronoun system can only be realised and understood by reference to large
amounts of empirical data, in other words, corpora.
Anaphoric annotation can at present only be carried out by human analysts - indeed, one
of the aims of such annotation is to provide data with which computer programs can be
trained to carry out the task. There are only a few instances of corpora which have been
anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an
example of which is given below:
A039 1 v
(1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT Charlotte_NP1
Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO
(3 [N <REF=2 its_APP$ chaplain 3) ,_, [N {{3 Rev._NNSB1 Dennis_NP1
Whitaker_NP1 3}} ,_, 38_MC N]N]Ti]V] ._.
The above text has been part-of-speech tagged and skeleton parsed, as well as
anaphorically annotated. The following codes explain the annotation:
(1 1) etc. - a noun phrase which enters into a relationship with anaphoric elements
in the text
<REF=2 - a referential anaphor; the number indicates the noun phrase which it
refers to - here it refers to noun phrase number 2, the Charlotte Police
Department
{{3 3}} - a noun phrase entering into an equivalence relationship with a preceding
noun phrase; here the Rev. Dennis Whitaker is identified as being the same
referent as noun phrase number 3, its chaplain
Phonetic transcription
Spoken language corpora can also be transcribed using a form of phonetic transcription.
Not many examples of publicly available phonetically transcribed corpora exist at the
time of writing. This is possibly because phonetic transcription is a form of annotation
which needs to be carried out by humans rather than computers, and such transcribers
have to be well skilled in the perception and transcription of speech sounds. Phonetic
transcription is therefore a very time-consuming task.
Another problem is that phonetic transcription works on the assumption that the speech
signal can be divided into single, clearly demarcated "sounds". In fact, these "sounds" do
not have such clear boundaries, so what phonetic transcription takes to be the same
sound might differ according to context.
Nevertheless, phonetically transcribed corpora are extremely useful to the linguist who
lacks the technological tools and expertise for the laboratory analysis of recorded speech.
One such example is the MARSEC corpus, which is derived from the Lancaster/IBM
Spoken English Corpus and has been developed by the Universities of Lancaster and
Leeds. The MARSEC corpus will include a phonetic transcription.
Prosody
Prosody refers to all aspects of the sound system above the level of segmental sounds,
e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora
typically follow widely accepted descriptive frameworks for prosody, such as that of
O'Connor and Arnold (1961). Usually, only the most prominent intonations are annotated,
rather than the intonation of every syllable. The example below is taken from the
London-Lund corpus:
[London-Lund sample with tone-unit numbering and prosodic symbols not reproduced]
Prosodic annotation does, however, present a number of problems:
1. It is based on the annotator's perception of the sound recording, which is to some
degree impressionistic: one listener may perceive a fall in pitch, while others may
perceive a slight rise after the fall. This leads to our second point:
2. Consistency is difficult to maintain, especially if more than one person
annotates the corpus. (This can be alleviated to some degree by having two people
both annotate a small part of the corpus.)
3. Recoverability is difficult (see Leech's 1st Maxim) since prosodic features are
carried by syllables rather than whole words - annotations appear within the
words themselves, making it difficult for software to retrieve the raw corpus.
4. Sometimes special graphics characters are used to indicate prosodic phenomena.
However, not all computers and printers can handle such characters. TEI
guidelines for text encoding will hopefully alleviate these difficulties.
Problem-oriented tagging
Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby
users take a corpus, either already annotated or unannotated, and add to it their
own form of annotation, oriented particularly towards their own research goal. This
differs in two ways from the other types of annotation we have examined in this session.
1. It is not exhaustive. Not every word (or sentence) is tagged - only those which are
directly relevant to the research. This is something which problem-oriented
tagging has in common with anaphoric annotation.
2. Annotation schemes are selected, not for broad coverage and theory-neutrality,
but for the relevance of the distinctions they make to the specific questions
that the researcher wishes to ask of his/her data.
Although it is difficult to generalise further about this form of corpus annotation, it is an
important type to keep in mind in the context of practical research using corpora.
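As a final illustration, a researcher interested only in, say, modal verbs might run a corpus through something like the Python sketch below, tagging just the words relevant to the research question and leaving everything else untouched (the tag name is invented for the example):

# Problem-oriented tagging: annotate only what the research question needs.
MODALS = {'can', 'could', 'may', 'might', 'must',
          'shall', 'should', 'will', 'would'}

def tag_modals(text):
    return ' '.join(f"{w}_MODAL" if w.lower() in MODALS else w
                    for w in text.split())

print(tag_modals("It might be solved and it must be checked"))
# -> It might_MODAL be solved and it must_MODAL be checked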