Definition of A Corpus


Definition of a corpus

The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways. In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But when used in the context of modern linguistics, the term "corpus" most frequently has more specific connotations than this simple definition. The four main characteristics of the modern corpus are summarised in the conclusion at the end of this document.

Corpora in Speech Research


A spoken corpus is important because of the following useful features:

It provides a broad sample of speech, extending over a wide selection of variables such as:
- speaker gender
- speaker age
- speaker class
- genre (e.g. newsreading, poetry, legal proceedings etc.)

This allows generalisations to be made about spoken language, as the corpus is as wide and as representative as possible. It also allows variation within a given spoken language to be studied.

It provides a sample of naturalistic speech rather than speech elicited under artificial conditions. The findings from the corpus are therefore more likely to reflect language as it is spoken in "real life", since the data are less likely to be subject to production monitoring by the speaker (such as trying to suppress a regional accent).

Because the (transcribed) corpus has usually been enhanced with prosodic and other annotations, it is easier to carry out large-scale quantitative analyses than with fresh raw data. Where more than one type of annotation has been used, it is possible to study the interrelationships between, say, phonetic annotations and syntactic structure.

Prosodic annotation of spoken corpora

Because much phonetic corpus annotation has been at the level of prosody, this has been the focus of most of the phonetic and phonological research in spoken corpora. This work can be divided roughly into three types:
1. How do prosodic elements of speech relate to other linguistic levels?
2. How does what is actually perceived and transcribed relate to the actual acoustic reality of speech?
3. How does the typology of the text relate to the prosodic patterns in the corpus?
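As a concrete illustration of the first of these questions, one might cross-tabulate a prosodic annotation layer against another annotation layer. The sketch below is a minimal, hypothetical example in Python: it assumes a transcribed corpus in which each word token carries a part-of-speech tag and an optional prosodic mark (the token format and the invented "NUCLEUS" label are assumptions, not any real annotation scheme), and it simply counts which word classes most often carry the nuclear tone.

```python
from collections import Counter

# Hypothetical annotated tokens: (word, POS tag, prosodic mark or None).
# The tag set and the "NUCLEUS" label are invented for illustration only.
corpus = [
    ("well", "UH", None), ("that", "DT", None), ("was", "VBD", None),
    ("a", "DT", None), ("lovely", "JJ", "NUCLEUS"), ("meal", "NN", None),
    ("I", "PRP", None), ("really", "RB", None), ("enjoyed", "VBD", "NUCLEUS"),
    ("it", "PRP", None),
]

# Count which parts of speech carry the nuclear tone, against their overall totals.
nucleus_by_pos = Counter(pos for _, pos, prosody in corpus if prosody == "NUCLEUS")
total_by_pos = Counter(pos for _, pos, _ in corpus)

for pos, n in nucleus_by_pos.most_common():
    print(f"{pos}: {n}/{total_by_pos[pos]} tokens carry the nucleus")
```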

Corpora in Lexical Studies


Empirical data had been used in lexicography long before the discipline of corpus linguistics was invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature, and in the nineteenth century the Oxford English Dictionary used citation slips to study and illustrate word usage. Corpora, however, have changed the way in which linguists can look at language. A linguist who has access to a corpus, or to another (non-representative) collection of machine-readable text, can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Definitions can also be more complete and precise, since a larger number of natural examples are examined.

Examples extracted from corpora can easily be organised into more meaningful groups for analysis, for example by sorting the right-hand context of the word alphabetically so that all instances of a particular collocate appear together. Furthermore, because corpus data contain a rich amount of textual information - regional variety, author, date, genre, part-of-speech tags etc. - it is easier to pin down uses of particular words or phrases as being typical of particular regional varieties, genres and so on.

The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building, as it enables lexicographers to keep on top of new words entering the language, existing words changing their meanings, or shifts in the balance of their use according to genre. However, finite corpora also have an important role in lexical studies, in the area of quantification: it is possible to produce reliable frequency counts rapidly and to subdivide these across various dimensions according to the varieties of language in which a word is used. Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words (see Session 3), mean that phrases and collocations can be treated more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.
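To make the concordancing idea concrete, here is a minimal sketch (not any particular dictionary-maker's tool) of a KWIC concordance in Python that sorts hits by their right-hand context, so that recurrent collocates line up together. The node word and the toy text are invented for illustration.

```python
def kwic(tokens, node, span=4):
    """Return (left context, node, right context) tuples for every hit of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.append((left, tok, right))
    return hits

text = ("the dog barked at the postman and the dog chased the cat "
        "while the dog in the garden slept and the dog chased a ball")
tokens = text.split()

# Sort by the right-hand context so identical collocates group together.
for left, hit, right in sorted(kwic(tokens, "dog"), key=lambda h: h[2]):
    print(f"{' '.join(left):>25}  [{hit}]  {' '.join(right)}")
```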

Corpora and Grammar

Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research to make use of corpora. Corpora make a useful tool for syntactic research because of:
- the potential for the representative quantification of a whole language variety;
- their role as empirical data for the testing of hypotheses derived from grammatical theory.

Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk and de Haan (1994a) aim to analyse the frequency of the various English clause types.

Since the 1950s the division in linguistics between rationalist, theory-based approaches and empiricist, descriptive approaches (see Session One) has often meant that these two approaches have been viewed as separate and in competition with each other. However, a group of researchers have used corpora to test essentially rationalist grammatical theory, rather than using them for pure description or the inductive generation of theory. At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. It is then loaded into a computer parser and run over a corpus to test how far it accounts for the data in the corpus. Finally, the grammar is modified to take account of those analyses which it missed or got wrong.
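The test-and-revise loop described above can be caricatured in a few lines. The sketch below is not the Nijmegen system itself; it merely uses NLTK's chart parser (an assumption on my part, since the source names no software) and a toy context-free grammar to show the idea: parse each corpus sentence and record which ones the grammar fails to account for, so that the grammar can then be revised.

```python
# A toy illustration of testing a formal grammar against corpus sentences.
# Assumes the NLTK library; the grammar and sentences are invented examples.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | Det N PP
    VP -> V NP | V NP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N  -> 'dog' | 'cat' | 'park'
    V  -> 'chased' | 'saw'
    P  -> 'in'
""")
parser = nltk.ChartParser(grammar)

sentences = [
    "the dog chased a cat",
    "the dog chased a cat in the park",
    "the dog slept",            # 'slept' is not in the grammar: a failure to revise for
]

failures = []
for s in sentences:
    tokens = s.split()
    try:
        trees = list(parser.parse(tokens))
    except ValueError:          # a word is not covered by the grammar at all
        trees = []
    if not trees:
        failures.append(s)

print(f"coverage: {len(sentences) - len(failures)}/{len(sentences)} sentences")
print("analyses to revise for:", failures)
```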

Corpora and Semantics


The main contribution that corpus linguistics has made to semantics is in helping to establish an approach to semantics which is objective and takes account of indeterminacy and gradience. Mindt (1991) demonstrates how a corpus can be used to provide objective criteria for assigning meanings to linguistic terms. He points out that, in semantics, meanings of terms are frequently described by reference to the linguist's own intuitions - the rationalist approach mentioned in the section on Corpora and Grammar. Mindt argues that semantic distinctions are associated in texts with characteristic observable contexts - syntactic, morphological and prosodic - and that by considering the environments of the linguistic entities an empirical, objective indicator for a particular semantic distinction can be arrived at.

Another role of corpora in semantics has been in establishing more firmly the notions of fuzzy categories and gradience. In theoretical linguistics, categories are usually seen as hard and fast - either an item belongs to a category or it does not. However, psychological work on categorisation suggests that cognitive categories are not usually "hard and fast" but instead have fuzzy boundaries, so it is not so much a question of whether an item belongs to one category or another, but of how often it falls into one category as opposed to the other. In looking empirically at natural language in corpora it is clear that this "fuzzy" model accounts better for the data: clear-cut boundaries do not exist; instead there are gradients of membership which are connected with frequency of inclusion.
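One simple way to operationalise such gradience is as a relative frequency across categories rather than a yes/no decision. The sketch below is a hypothetical illustration: given counts of how often a word form is tagged as a noun versus a verb in some POS-tagged corpus (the counts here are invented), it reports gradient category membership as proportions.

```python
from collections import Counter

# Invented counts: how often each word form is tagged noun vs verb
# in a POS-tagged corpus.
tag_counts = {
    "walk":  Counter({"VERB": 870, "NOUN": 130}),
    "cook":  Counter({"VERB": 640, "NOUN": 360}),
    "chair": Counter({"NOUN": 990, "VERB": 10}),
}

# Gradient membership: the proportion of use in each category, not a hard label.
for word, counts in tag_counts.items():
    total = sum(counts.values())
    profile = ", ".join(f"{tag} {n / total:.0%}" for tag, n in counts.most_common())
    print(f"{word}: {profile}")
```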

Corpora in Pragmatics and Discourse Analysis


The amount of corpus-based research in pragmatics and discourse analysis has been relatively small up to now. This is partly because these fields rely on context (Myers 1991), and the small samples of texts used in corpora tend to be somewhat removed from their social and textual contexts. Sometimes relevant social information (gender, class, region) is encoded within the corpus, but it is still not always possible to infer context from corpus texts.

Much of the work that has been carried out in this area has used the London-Lund corpus, which was until recently the only truly conversational corpus. The main contribution of such research has been to the understanding of how conversation works, with respect to lexical items and phrases which have conversational functions. Stenström (1984) correlated discourse items such as well, sort of and you know with pauses in speech and showed that such correlations related to whether or not the speaker expects a response from the addressee. Another study by Stenström (1987) examined "carry-on signals" such as right, right-o and all right. These signals were classified according to their various functions, e.g.:
- right was used in all functions, but especially as a response, to evaluate a previous response or to terminate an exchange;
- all right was used to mark a boundary between two stages in discourse;
- that's right was used as an emphasiser;
- it's alright and that's alright were responses to apologies.
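A study in the spirit of Stenström's first analysis can be sketched very simply: given a transcription in which pauses are marked, count how often each discourse item is immediately followed by a pause. The markup convention ("<pause>") and the mini-transcript below are invented for illustration and are not the London-Lund scheme.

```python
from collections import Counter

# Invented transcript: "<pause>" marks a pause; this is not the London-Lund notation.
transcript = ("well <pause> I think you know it was sort of <pause> odd "
              "you know <pause> right so we carried on right")
tokens = transcript.split()

single_items = {"well", "right"}
double_items = {"you know", "sort of"}

occurrences = Counter()
followed_by_pause = Counter()

for i, tok in enumerate(tokens):
    # two-word discourse items (e.g. "you know")
    pair = " ".join(tokens[i:i + 2])
    if pair in double_items:
        occurrences[pair] += 1
        if i + 2 < len(tokens) and tokens[i + 2] == "<pause>":
            followed_by_pause[pair] += 1
    # single-word discourse items (e.g. "well")
    if tok in single_items:
        occurrences[tok] += 1
        if i + 1 < len(tokens) and tokens[i + 1] == "<pause>":
            followed_by_pause[tok] += 1

for item, n in occurrences.items():
    print(f"{item}: {followed_by_pause[item]}/{n} occurrences followed by a pause")
```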

The availability of new conversational corpora, such as the spoken part of the BNC (British National Corpus), should provide a greater incentive both to extend and to replicate such studies, since both the amount of conversational data available and the social and geographical range of the people recorded will have increased. At present, pragmatics and discourse analysis remain poorly served by quantitative corpus-based analysis; it is to be hoped that this is one area which linguists will exploit in the near future.

Corpora and Stylistics


Stylistics researchers are usually interested in individual texts or authors rather than the more general varieties of a language and tend not to be large-scale users of corpora. Nevertheless, some stylisticians are interested in investigating broader issues such as genre, and others have found corpora to be important sources of data in their research.

In order to define an author's particular style, we must in part examine the degree to which the author leans towards different ways of putting things (technical vs non-technical vocabulary, long sentences vs short sentences, and so on). This task requires comparisons to be made not only internally within the author's own work, but also with other authors or with the norms of the language or variety as a whole. As Leech and Short (1981) point out, stylistics often demands the use of quantification to back up judgements which may otherwise appear subjective rather than objective. This is where corpora can play a useful role.

Another type of stylistic variation is the more general variation between genres and channels - for example, one of the most common uses of corpora has been in looking at the differences between spoken and written language. Altenberg (1984) examined differences in the ordering of cause-result constructions, while Tottie (1991) looked at differences in negation strategies. Other work has looked at variation between genres, using subsamples of corpora as a database. For example, Wilson (1992) used sections from the LOB and Kolhapur corpora, the Augustan Prose Sample and a sample of modern English conversation to examine the usage of since, and found that causal since had evolved from being the main causal connective in late seventeenth-century writing to being characteristic of formal learned writing in the twentieth century.
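The kind of quantification Leech and Short call for can start from very simple surface measures compared across texts or authors. The sketch below computes average sentence length and a crude lexical-variety measure (the type/token ratio) for two invented text samples; the samples and measures are illustrative assumptions, and real studies would use much larger samples and a reference corpus as the norm.

```python
import re

def style_profile(text):
    """Crude stylistic measures: mean sentence length and type/token ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    mean_len = len(tokens) / len(sentences)
    ttr = len(set(tokens)) / len(tokens)
    return mean_len, ttr

samples = {
    "author A": "The storm came quickly. It broke the mast. The crew were silent.",
    "author B": ("Although the storm arrived with unexpected violence, shattering "
                 "the mast, the crew maintained a remarkable and dignified silence."),
}

for author, text in samples.items():
    mean_len, ttr = style_profile(text)
    print(f"{author}: {mean_len:.1f} words/sentence, type/token ratio {ttr:.2f}")
```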

Corpora and Cultural Studies


It is only recently that the role of a corpus in telling us about culture has really begun to be explored. After the completion of the LOB corpus of British English, one of the earliest pieces of work to be carried out was a comparison of its vocabulary with the vocabulary of the American Brown corpus (Hofland and Johansson 1982). This revealed interesting differences which went beyond the purely linguistic ones such as spelling (colour/color) or morphology (got/gotten).

Leech and Fallon (1992) used the results of these earlier studies, along with KWIC concordances of the two corpora, to check the senses in which words were being used. They then grouped the differences which were statistically significant into fifteen broad categories. The frequencies of concepts in these categories revealed differences between the two countries which were primarily cultural rather than linguistic. For example, travel words were more frequent in American English than in British English, perhaps suggestive of the larger size of the United States. Words in the domains of crime and the military were also more common in the American data, as was "violent crime" within the crime category, perhaps suggestive of the American "gun culture". In general, the findings seemed to suggest a picture of American culture at the time of the two corpora (1961) that was more macho and dynamic than British culture.

Although this work is in its infancy and requires methodological refinement, it seems to be an interesting and promising area of study, which could also integrate more closely work in language learning with that in national cultural studies.
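Comparisons like these rest on testing whether a word's frequency differs significantly between two corpora. The sketch below implements one commonly used keyness statistic, the log-likelihood ratio (an assumption on my part: the source does not say which test those studies used), for invented frequency figures in two roughly one-million-word corpora.

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood ratio for a word's frequency in corpus A vs corpus B."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Invented counts for one word in two one-million-word corpora
# (roughly the sizes of Brown and LOB).
ll = log_likelihood(freq_a=150, freq_b=90, size_a=1_000_000, size_b=1_000_000)
print(f"log-likelihood = {ll:.2f} (values above about 3.84 are significant at p < 0.05)")
```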

Conclusion
To conclude, the main characteristics of the modern corpus can be summarised as follows:

Sampling and quantification. Because a corpus is sampled to represent the population as closely as possible, findings taken from the corpus can be generalised to the larger population. Quantification in corpus linguistics is therefore more meaningful than other forms of linguistic quantification, because it can tell us about a variety of language, not just about the sample being analysed.

Ease of access. As the data collection has been dealt with by someone else, the researcher does not have to work through the issues of sampling, collection and encoding. The majority of corpora are readily available, either free or at low cost. Once a corpus has been obtained, it is usually easy to access the data within it, e.g. by using a concordance program.

Enriched data. Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Data retrieval from annotated corpora can therefore be easier and more specific than from unannotated data.

Naturalistic data. Corpus data are not always completely unmonitored, in the sense that the people producing the spoken or written texts are not always unaware that they are being asked to participate in the building of a corpus. But for the most part, the data are largely naturalistic, unmonitored and the product of real social contexts. The corpus thus provides one of the most reliable sources of naturally occurring data that can be examined.

Webliography
http://www.lancs.ac.uk/fss/courses/ling/corpus/
http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus4/4FRA1.HTM
