Corpora
A corpus is a collection of actually occurring texts, written or spoken, which is stored and accessed by
means of computers and available for study and analysis by grammarians, lexicographers, teachers and
language learners.
It is a principled collection of texts available for qualitative and quantitative analysis.
This means that we can look at a language feature in a corpus in different ways. We can use a corpus to
examine how many times a certain word occurs; this gives us quantitative results, that is, numbers of
occurrences, which we can then use to compare frequencies. We can also look more qualitatively at
how a word or phrase is used across a corpus (e.g. its collocations), since the corpus provides us with many
examples of the search item in its context of use.
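The quantitative side can be illustrated with a minimal Python sketch. The two toy texts, the simple tokenising rule and the per-1,000-words normalisation are all assumptions made for illustration; normalising is one common way to compare frequencies across texts of different sizes, not something the notes above prescribe.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency(word, tokens, per=1000):
    """Return (raw count, count per `per` tokens) for `word`."""
    raw = tokens.count(word)
    return raw, per * raw / len(tokens)

# Invented toy texts standing in for a spoken and a written corpus.
spoken = tokenize("Well, I mean, you know, it was, well, fine.")
written = tokenize("The committee concluded that the results were satisfactory.")

for name, tokens in (("spoken", spoken), ("written", written)):
    raw, normalised = frequency("well", tokens)
    print(f"{name}: {raw} occurrences, {normalised:.1f} per 1,000 words")
```

Raw counts alone can mislead when the texts differ in length, which is why the sketch reports a normalised figure alongside each count.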
To count as a corpus, a collection of texts must represent something, and its merits will often be
judged on how representative it is.
It can be composed of written or spoken texts, or a mix of both, and nowadays it is possible to add
multimedia elements, such as video clips, to corpora of spoken language.
- Concordancing is a core tool in corpus linguistics: it means using corpus software to find every
occurrence of a particular word or phrase. With a computer, in a matter of seconds, we can have the node
word/phrase presented in concordance lines (with 7 or 8 words of context on either side).
- Another corpus technique which software can perform is the extremely rapid calculation of word frequency
lists (or wordlists) for any batch/set of texts. By running a word frequency list on your corpus, you can get
a ranking of all the words in it by frequency.
- Also, in relation to the frequency of words, another function is key word analysis, which allows us to
identify those words whose frequency is unusually high in comparison with some norm.
- Cluster analysis is another corpus technique: it is the process of generating chunk or cluster lists, similar
to making single-word lists, but it allows us to look for word combinations (combinations of 2, 3… or 6
words).
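The four techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy texts and function names are invented, the tokeniser is deliberately naive, and the keyness score uses log-likelihood, which is one commonly used statistic for keyword analysis rather than one named in these notes. Dedicated corpus software does considerably more.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=7):
    """Return a (left context, node, right context) tuple for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def wordlist(tokens):
    """Rank all word types by descending frequency."""
    return Counter(tokens).most_common()

def clusters(tokens, n):
    """Count every contiguous sequence of n words (a cluster or chunk)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

def keyness(word, study, reference):
    """Log-likelihood score for `word` in a study corpus against a reference corpus.

    The score measures how far the two frequencies differ; it does not say
    in which corpus the word is more frequent.
    """
    a, b = study.count(word), reference.count(word)
    c, d = len(study), len(reference)
    expected_study = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a:
        score += a * math.log(a / expected_study)
    if b:
        score += b * math.log(b / expected_reference)
    return 2 * score

# Invented toy data standing in for a study corpus and a reference ("norm") corpus.
tokens = tokenize("The cat sat on the mat. The dog sat on the log.")
reference = tokenize("A bird flew over the house and a bird sang.")

for left, node, right in concordance(tokens, "sat", width=3):
    print(f"{left:>15}  [{node}]  {right}")
print(wordlist(tokens)[:3])
print(clusters(tokens, 3).most_common(2))
print(round(keyness("sat", tokens, reference), 2))
```

On corpora this small the numbers mean very little; the point is only to show what each technique computes.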
A further corpus strategy, when looking at concordance lines, is to create a ‘lexico-grammatical profile’
of a word and its contexts of use. A lexico-grammatical profile describes typical contexts in terms of
collocations (which words occur most frequently and with statistical significance in the word’s
environment?), chunks/idioms (does the word form part of any recurrent chunks? Is the word
idiom-prone?), syntactic restrictions (what are its typical clause positions? Are there syntactic patterns
which restrict the word?), semantic restrictions (e.g. is the word only applied to humans? Can it be used
with an intensifier? What is its most frequent meaning?), prosody (patterns of use and particular
connotations in relation to environments, e.g. ‘cause’ tends to occur in negative environments) and
other relevant or recurring features.
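The collocation part of such a profile can be approximated by counting the words that occur within a fixed window around the node word. The sketch below is a simplification resting on several assumptions: the sentences are invented, the window size is arbitrary, windows are allowed to cross sentence boundaries, and only raw co-occurrence counts are produced. A statistical significance measure would still have to be applied on top of these counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # The window ignores sentence boundaries in this simple version.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sentences built around the node word 'cause'.
sample = tokenize(
    "The storm may cause damage. Smoking can cause serious problems. "
    "Delays cause problems for travellers."
)

print(collocates(sample, "cause", span=2).most_common(3))
```

Reading the most frequent collocates alongside the concordance lines is what lets the analyst judge whether a word keeps, say, predominantly negative company.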
Finally, I would like to mention that corpora have considerable application in the area of translation,
from two main perspectives: descriptive and practical. Descriptive research looks at corpora of
translations, comparing these with corpora of original texts so as to establish the characteristics both
peculiar and universal to translated texts. On the practical side, corpora have been looked at as aids in
the process of human and machine translation, and for this purpose three main types of corpora have
been distinguished (by Aston):
- Monolingual corpora (consisting of texts in a single language, which may be either the source or the
target language of a given translation)
- Comparable corpora (monolingual corpora of similar design are available for 2 or more languages and
are treated as components of a single comparable corpus)
- Parallel corpora (these have components in 2 or more languages, consisting of original texts and their
translations, e.g. a novel and its translation in another language. There are unidirectional (A → B) and
bidirectional (A → B; B → A) parallel corpora.)
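As a rough illustration of how a parallel corpus can serve as a translation aid, the sketch below stores aligned sentence pairs and looks up a source-language word together with its aligned translations. The data, the English-Spanish pairing and the function name are all invented for this example, and the sentences are assumed to be aligned already; real parallel corpora need a separate alignment step.

```python
import re

# Invented toy data: each pair aligns an English sentence with its Spanish translation
# (a unidirectional A -> B corpus in the terms used above).
parallel = [
    ("The cat sleeps.", "El gato duerme."),
    ("The dog runs.", "El perro corre."),
    ("The cat eats.", "El gato come."),
]

def bilingual_concordance(pairs, word):
    """Return the aligned pairs whose source side contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(source, target) for source, target in pairs if pattern.search(source)]

for source, target in bilingual_concordance(parallel, "cat"):
    print(f"{source:<20} | {target}")
```

Seeing every occurrence of a word next to its translation lets a translator check how it has actually been rendered in context.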