Seminar 7



1. Corpus Linguistics, Applied Linguistics and Computational Linguistics.


What is corpus linguistics? Corpus linguistics is a methodology that involves
computer-based empirical analyses (both quantitative and qualitative) of
language use by employing large, electronically available collections of
naturally occurring spoken and written texts, so-called corpora.
An example of a general corpus is the British National Corpus. Some corpora
contain texts that are sampled (chosen) from a particular variety of a language,
for example, from a particular dialect or from a particular subject area. These
corpora are sometimes called 'Sublanguage Corpora'.
Corpus linguistics encompasses the compilation and analysis of collections of
spoken and written texts as the source of evidence for describing the nature,
structure, and use of languages. This work typically brings a quantitative
dimension to the description of languages by including information on the
probability with which linguistic items or processes occur in particular
contexts. Corpora vary greatly in size and design but most are nowadays in
electronic form with purpose-built computer software to support analysis.
Present-day corpus linguistics has grown out of a long tradition of using texts
as the empirical basis for linguistic description, studying all levels of language,
including phonology, lexis, grammar, and discourse.
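
The quantitative dimension mentioned above can be illustrated with a minimal sketch that estimates how often each word occurs as a share of all words in a sample. The sample text, the simple tokenisation rule and the function names are assumptions made for illustration only; real corpus software uses far larger data and more careful tokenisation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical mini-corpus; a real corpus would hold millions of words.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)
freqs = relative_frequencies(tokens)

for word in ("the", "sat", "cat"):
    print(f"{word}: {freqs[word]:.3f}")
```

In this toy sample "the" accounts for 4 of the 12 tokens, so its relative frequency is about 0.333.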
"Computational Linguistics is an interdisciplinary field which centers around
the use of computers to process or produce human language."
In some ways, computational linguistics and corpus linguistics can be seen as
overlapping disciplines. Computational linguists are dependent on computer-
readable linguistic data to use in their research, while corpus linguists often
use computational methods when analysing their data. One main difference
can be said to be that in corpus linguistics it is the data in the corpus that is the
main object of study. In computational linguistics, corpora are not studied as
such but used as a resource to solve various problems.
2. History of CL.
The first modern corpus is the Brown University Standard Corpus of Present-
day American English (‘the Brown corpus’; Francis and Kučera, 1979; Kučera
and Francis, 1967), compiled at Brown University in the 1960s. With about one
million words of text collected from a balanced, wide variety of sources, the
Brown corpus represented a breakthrough in terms of both corpus size and
corpus design. In subsequent years, however, the standard of corpus size
increased rapidly. Thanks to the enhanced storage and processing power of
modern computers, it is now not uncommon to encounter a corpus that
contains hundreds or thousands of millions of words. What came hand in hand
with the increase in scale was the expansion in genre. In addition to the so-
called ‘balanced’ corpora (see below for the definition of balanced) such as the
Brown corpus, corpora have also been developed for highly specialized
domains and with unconventional data types.
3. The Objectives of Corpus Linguistics.
The main focus of corpus linguistics is to discover patterns of authentic language
use through analysis of actual usage.
Corpus linguistic approaches use computing power to examine large bodies of language
data. These bodies of data, or corpora, facilitate
investigation of the meaning of words in context.
In corpus linguistics, quantitative and qualitative methods are extensively used in
combination.
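
Investigating words in context is often done with a keyword-in-context (KWIC) display. The following is a minimal sketch of that idea, not the method of any particular corpus tool; the sample sentences, the function name `kwic` and the `width` parameter are assumptions for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample with two senses of "bank".
sample = ("She sat on the river bank and watched the water. "
          "He went to the bank to open an account.")

for left, key, right in kwic(sample, "bank"):
    print(f"{left:>20} [{key}] {right}")
```

Counting the hits gives the quantitative side (how often "bank" occurs), while reading the aligned contexts gives the qualitative side (which sense is used in each case).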
4. Corpus Compilation and Corpus Annotation.
Apart from the pure text, a corpus can also be provided with additional linguistic
information, called 'annotation'. This information can be of different kinds, such as
prosodic, semantic or historical annotation. The most common form of annotated
corpus is the grammatically tagged one.
Corpus compilation involves “designing a corpus, collecting texts, encoding the
corpus, assembling and storing the relevant metadata, marking up the texts
where necessary and possibly adding linguistic annotation”.
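
As an illustration of what grammatical tagging adds to the pure text, the sketch below reads a sentence stored in a "word_TAG" layout and uses the tags to count and select words. The layout, the tag names and the function name `parse_tagged` are hypothetical and do not follow any particular published tagset or corpus format.

```python
from collections import Counter

# Hypothetical tagged sentence in a "word_TAG" layout; the tag names are
# illustrative and do not follow any particular published tagset.
tagged_text = "The_DET corpus_NOUN contains_VERB authentic_ADJ texts_NOUN ._PUNCT"

def parse_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits at the last underscore, so the tag is always last.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged_text)
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag == "NOUN"]

print(pairs)
print(tag_counts["NOUN"])
print(nouns)
```

With the tags in place, a search such as "all nouns" becomes a simple filter, which would not be possible on the unannotated text alone.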
5. Methods used for CL.
Corpus linguistics has generated a number of research methods, which attempt to trace a path
from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A
perspective: Annotation, Abstraction and Analysis.

• Annotation consists of the application of a scheme to texts. Annotations may include
structural markup, part-of-speech tagging, parsing, and numerous other representations.
• Abstraction consists of the translation (mapping) of terms in the scheme to terms in a
theoretically motivated model or dataset. Abstraction typically includes linguist-directed
search but may include e.g., rule-learning for parsers.
• Analysis consists of statistically probing, manipulating and generalising from the dataset.
Analysis might include statistical evaluations, optimisation of rule-bases or knowledge
discovery methods.
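
The three steps can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the small lexicon stands in for a real tagger, the adjective + noun pattern is one arbitrary example of a linguist-directed search, and all names (`LEXICON`, `annotate`, `abstract`, `analyse`) are hypothetical rather than taken from Wallis and Nelson.

```python
from collections import Counter

# Annotation: a toy lexicon stands in for a real part-of-speech tagger.
# The lexicon and tag names are hypothetical.
LEXICON = {
    "large": "ADJ", "spoken": "ADJ", "written": "ADJ",
    "corpus": "NOUN", "texts": "NOUN", "corpora": "NOUN",
    "a": "DET", "the": "DET", "contains": "VERB", "and": "CONJ",
}

def annotate(tokens):
    """Attach a tag to every token; unknown words get the tag 'UNK'."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

def abstract(tagged):
    """Linguist-directed search: collect adjective + noun sequences."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "ADJ" and t2 == "NOUN"
    ]

def analyse(pairs):
    """Count how often each noun is modified by an adjective."""
    return Counter(noun for _, noun in pairs)

tokens = "a large corpus contains spoken texts and written texts".split()
tagged = annotate(tokens)      # Annotation
dataset = abstract(tagged)     # Abstraction
result = analyse(dataset)      # Analysis

print(dataset)
print(result.most_common())
```

Each function corresponds to one of the three A's, so a different research question would typically change `abstract` and `analyse` while reusing the same annotation.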

Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus
linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient
terms. In such situations annotation and abstraction are combined in a lexical search.
The advantage of publishing an annotated corpus is that other users can then perform
experiments on the corpus (through corpus managers). Linguists whose interests and
perspectives differ from the originators' can exploit this work. By sharing data, corpus linguists
are able to treat the corpus as a locus of linguistic debate and further study.

6. Electronic Text Corpora: Typology.


Text corpora: a corpus is a large set of texts, electronically stored and processed.
The term may refer to any text in written or spoken form that is available on
computers, as software or via the internet. G. Cook (2003:73) suggests that the word
corpus refers to a databank of language which has actually occurred, whether
written, spoken or a mixture of the two. The written texts originally come from
magazines, books, diaries, newspapers, letters, popular fiction and so on, whereas the
spoken texts can be any recorded formal or informal conversations: telephone
conversations, dialogues, radio shows, political meetings and the like.
7. Advantages and disadvantages of corpora.
What are the advantages of using corpus linguistics?

The great advantage of the corpus-linguistic method is that language researchers
do not have to rely on their own or other native speakers' intuition or even on
made-up examples.

Corpora allow access to authentic data and show frequency patterns of words
and grammatical constructions. Such patterns can be used to improve language
materials or to directly teach students.

There are also two disadvantages. First, spoken language tends to be under-represented:
in the British National Corpus, for example, only about 10% of the material is spoken
language, so there is comparatively little information about speech. Second,
a corpus will never tell you what is grammatically or syntactically wrong or
right.
