Corpus Linguistics
The word "corpus", derived from the Latin word meaning "body", may be used to refer
to any text in written or spoken form. However, in modern linguistics this term is used
to refer to large collections of texts that represent a sample of a particular variety or
use of language(s) and are presented in machine-readable form. Other definitions,
broader or stricter, exist.
Computer-readable corpora can consist of raw text only, i.e. plain text with no
additional information. Many corpora have been provided with some kind of linguistic
information, called mark-up or annotation.
Types of corpora
There are many different kinds of corpora. They can contain written or spoken
(transcribed) language, modern or old texts, and texts from one language or several
languages. The texts can be whole books, newspapers, journals, speeches, etc., or
consist of extracts of varying length. The kinds of texts included and the combination of
different texts vary between different corpora and corpus types.
The use of real examples of texts in the study of language is not new in
the history of linguistics. However, corpus linguistics has developed considerably
in recent decades due to the great possibilities offered by the processing of natural
language with computers. The availability of computers and machine-readable text
has made it possible to get data quickly and easily and also to have this data
presented in a format suitable for analysis.
Corpus linguistics is, however, not simply a matter of obtaining language data
through the use of computers. Corpus linguistics is the study and analysis of data
obtained from a corpus. The main task of the corpus linguist is not to find the data
but to analyze it. Computers are useful, and sometimes indispensable, tools used
in this process.
The Brown Corpus has spawned a number of similarly structured corpora: the
LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New
Zealand English), Australian Corpus of English (Australian English), the Frown Corpus
(early 1990s American English), and the FLOB Corpus (1990s British English). Other
corpora represent many languages, varieties and modes, and include the International
Corpus of English, and the British National Corpus, a 100 million word collection of a
range of spoken and written texts, created in the 1990s by a consortium of publishers,
universities (Oxford and Lancaster) and the British Library. For contemporary
American English, work has stalled on the American National Corpus, but the 360
million word Corpus of Contemporary American English (COCA) (1990-present) is now
available.
Combining texts into a corpus is called compiling a corpus. There are various ways
of doing this, depending on what kind of corpus you want to create and on what
resources (time, money, knowledge) you have at your disposal.
Even if you are not compiling your own corpus, it is important to know something
about corpus compilation when you use a corpus. Using a corpus is using a selection
of texts to represent the language. How the corpus has been compiled is of utmost
importance for the results you get when using it. What texts are included, how these
are marked up, the proportions of different text types, the size of the various
texts, how the texts have been selected, etc. are all important issues.
Let us imagine that you have a newspaper - a collection of texts of different kinds
(editorials, reportage on different topics, reviews, cartoons, letters to the editor,
sports commentaries, lists of shares, etc.) written by different people. You then cut
the paper into small pieces with one word on each. You put all the pieces/words
into a bowl and pick a sample of ten at random. Obviously there would be several
words that you know exist in the newspaper that are not found in your sample.
If you were to pick another ten pieces of paper you would not expect the two sets
of ten words to be exactly the same. If you picked two sets of 100 words each,
you would probably find that some words, especially frequent words like function
words, can be found in both samples, if not in exactly the same numbers. You
would also find that many words are found in only one of the samples. If you took
two very large samples you would find that the frequent words would occur to a
similar extent. Words that occur only once in the newspaper would be found in
only one of the samples (at most). Words that occur infrequently would not
necessarily be evenly distributed across the two samples.
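The bowl experiment above can be tried out in a few lines of Python. The following is only a minimal sketch: the toy "newspaper", the word counts and the helper names (draw_two_samples, shared_types) are assumptions made up for this illustration, not real corpus data. Both samples are drawn from the same bowl without putting slips back, so a word that occurs only once can land in at most one of them.

```python
import random
from collections import Counter

# Hypothetical toy "newspaper": a few very frequent function words,
# some mid-frequency content words, and three words that occur only once.
newspaper = (
    ["the"] * 60 + ["of"] * 30 + ["and"] * 30
    + ["election", "goal", "review", "shares", "cartoon"] * 3
    + ["umpire", "dividend", "sonnet"]
)
HAPAXES = {"umpire", "dividend", "sonnet"}  # words occurring exactly once

def draw_two_samples(words, size, rng):
    """Draw two samples of `size` slips each from the same bowl, without replacement."""
    slips = rng.sample(words, 2 * size)
    return Counter(slips[:size]), Counter(slips[size:])

def shared_types(first, second):
    """Return the set of words that turn up in both samples."""
    return set(first) & set(second)

rng = random.Random(0)  # fixed seed so that a run can be repeated
for size in (10, 60):
    first, second = draw_two_samples(newspaper, size, rng)
    both = sorted(shared_types(first, second))
    print(f"sample size {size}: words found in both samples = {both}")
```

Running the loop with different seeds and sample sizes shows the pattern described in the text: frequent words tend to appear in both samples, while rare words turn up unevenly or not at all.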
WebCorp: http://www.webcorp.org.uk/cgi-bin/webcorp2.nm
Prof. Rogério Pereira Azeredo
SOME QUERIES
have a bath x take a bath
have a nap x take a nap
make a mistake x commit a mistake
salt and pepper x pepper and salt
dead or alive x alive or dead
lost and found x found and lost
on and off x off and on
fish and chips x chips and fish
sick and tired x tired and sick
black and white x black on white
cats and dogs x dogs and cats
bacon and eggs x eggs and bacon
out of the (blue/green/white/black/red/grey)
(green/red/black/white/yellow) with anger
(green/red/black/white/yellow) with envy
make/prepare/fix/cook dinner
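
Queries of the "A x B" kind compare how often two competing phrases occur. When the texts are available as plain text, the comparison might be sketched in Python as below. This is an illustration only and does not use WebCorp itself: the toy text and the function names (count_phrase, compare_variants) are assumptions invented for the example.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of `phrase` in `text`."""
    words = phrase.lower().split()
    # Allow any run of whitespace between the words of the phrase.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return len(re.findall(pattern, text.lower()))

def compare_variants(text, variants):
    """Return a dictionary mapping each variant phrase to its frequency."""
    return {variant: count_phrase(text, variant) for variant in variants}

# Hypothetical toy text standing in for a real corpus.
toy_corpus = (
    "Pass the salt and pepper, please. We ordered fish and chips. "
    "Salt and pepper were on every table. He was sick and tired of waiting. "
    "Nobody asked for pepper and salt."
)

counts = compare_variants(toy_corpus, ["salt and pepper", "pepper and salt"])
print(counts)
```

With a real corpus, the same comparison indicates which word order is the conventional one. A toy text this small proves nothing about the language.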