ANNOTATION PROCESS OF BRITISH AND AMERICAN CORPORA
Sevinj Mammadzada
Odlar Yurdu University, Koroglu Rahimov 13, AZ1072, Baku, Azerbaijan
[email protected]
Key words: corpus linguistics, annotation, linguistic data, text fragments, word
combinations.
Açar sözlər: korpus dilçiliyi, annotasiya, linqvistik informasiya, mətn parçaları, söz
birləşmələri.
Ключевые слова: Корпусная лингвистика, аннотация, лингвистическая
информация, текстовые фрагменты, словосочетание.
Corpus linguistics is considered to be a branch of computational linguistics. This new linguistic direction, which took shape in the course of compiling frequency dictionaries, develops in connection with such problems as machine translation, the application of mathematical and statistical methods in linguistics and, finally, the formation of natural language processing systems. The further development of corpus linguistics does not exclude the possibility that it will consolidate all issues of computational linguistics. [5]
In general, the relationship between computational linguistics and corpus linguistics is often debated. Researchers sometimes separate them and sometimes include corpus linguistics in computational linguistics. In fact, the term “computational linguistics” is itself a subject of dispute. The field used to be called “mathematical linguistics” or “computing linguistics”. Now it is called “computational linguistics”, mainly because computers are used in this sphere. At first the notion “computing linguistics” was used because it was directly related to the counting of language units. The need to count undoubtedly arose from the task of finding the frequency of word use. Long before computers, a researcher had to read a text several times to solve such a task and register each required word; the registered words were then counted and their quantitative indicators determined.
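The manual counting described above can be illustrated with a short sketch. It assumes a deliberately simple tokenization rule (a word is a run of Latin letters, optionally with internal apostrophes); the function name word_frequencies and the sample sentence are hypothetical and serve only as an illustration.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word form occurs in a text (case-insensitive)."""
    # Assumed tokenization rule: a token is a run of letters,
    # optionally containing internal apostrophes (e.g. "writer's").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Hypothetical sample text, used only to demonstrate the counting.
sample = "The cat sat on the mat. The mat was new."
freq = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

A real frequency dictionary would also need lemmatization and a tokenization rule suited to the language in question; the sketch only shows the counting step.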
This article deals with the issues of annotation and marking in the corpus. Corpus linguistics is a branch of applied linguistics that studies the general principles of the formation and use of linguistic corpora. The main source of language material is texts. Corpus linguistics differs from traditional linguistics: traditional grammar focuses on the study of language, whereas corpus linguistics studies speech, which is the source of its material. A corpus is a databank of natural texts compiled from written sources and transcriptions of recorded speech. It consists of real texts, which are speech products resulting from different types of communication. [2, p.10-15]
Corpus linguistics began to form as a separate branch in the first half of the 1990s. “Corpus linguistics was in its prime”. [7] This remark appeared in J. Svartvik’s 1992 article on corpus linguistics. Corpus linguistics is closely linked to computational linguistics: it uses it and at the same time enriches it.
The English spoken in each of the four countries of the United Kingdom (England, Scotland, Wales and Northern Ireland) has its own peculiarities, and these varieties differ considerably from one another.
For a long time RP (Received Pronunciation) was regarded as a social variant, the speech of the privileged segments of the English population.
In the nineteenth century the word “received” was understood as denoting the literary language, mostly the language of the aristocracy. Later this notion was popularized as “the King’s English”.
The English spoken in the USA is called American English. In I.V. Arnold’s opinion, American English is a regional variant and therefore cannot be called a dialect. This variant is derived from standard English and has its own American national standard. [4, p. 256-266]
American English, like America itself, has a very interesting history of development. A period of three and a half centuries is reflected in its vocabulary. Such words and terms as “blue laws, sunbonnet, log cabin, forty-niner, cash house, motel, baby-sitter” give information about the past and present of America. American English does not remain within the country: quite a lot of words have passed into other languages (OK, telephone). [8, p.4] Indian words, including words from the Aztec languages, entered the language of England through Spanish and Portuguese before English took root in America. The names of some Indian tribes are now onomastic units in the USA, for example “Iowa, Kansas, Michigan”, and some of them took root in the European languages, too. The words that came to Europe from the New World had a rich exotic colouring. This was due to the mixing of the language used in America with the languages of the neighboring countries, for example: potato, tomato, chocolate, cocoa, cannibal, maize, savannah, etc. Although these words are now considered ordinary, they were once new to the English and to other Europeans. For example, by the time the word “barbeque” appeared in British English, it was already known and used in America. One variant spoken in many places and heard in the most famous music is “Black English” (spoken primarily by black people in the United States). At one time this variant was of no interest to linguists; at present, however, there is great interest in this dialect.
Although linguists do not yet have complete information about corpora, they are aware of their significance. “A corpus language is an electronic collection of texts of any language”. [6] This thought can be explained as follows: an ordinary reader reads a text in order to get some information and use it, whereas a linguist divides the text into parts, and these text fragments take their place in the corpus. Literature and materials that used to take years to collect can now be collected in a short time, so today time is spent on studying the material, not on collecting it. A corpus greatly facilitates this hard work, and its role is indispensable for linguists. The quantity and quality of the material obtained from a corpus are much higher than in the period before corpora. A study of a language based on more than 10,000 examples is more reliable than one based on 10 examples, and it is the corpus of that language that makes it possible to find these 10,000 examples. Thus, the appearance of corpus linguistics in the 1980s and the preparation of a number of corpora ultimately confirmed the urgency and importance of corpus linguistics.
There are quite a number of different types of corpora. V.P. Zakharov distinguishes the types of corpora by their purposes and features. It should be added that his classification cannot be accepted completely; it can be accepted only conditionally. By purpose, corpora are divided into multi-purpose and specialized ones. Multi-purpose corpora collect texts of different genres, whereas specialized corpora cover only one genre or one group of genres. Corpora of texts can also be grouped by genre: literary, folklore, dramatic, publicistic, etc. The marking criterion divides corpora into coded (marked) and non-coded (unmarked) groups, which can also be called indexed and unindexed. An indexed corpus contains words and sentences supplied with tags (morphological, syntactic, semantic, etc.). [2, p.15-25] By the format criterion, corpora are divided into non-electronic and electronic types. Non-electronic corpora belong to earlier periods; at present only electronic corpora are used. Another criterion applied in the classification of corpora is the way texts enter the corpus: by this criterion corpora are divided into static and updated ones. In general, the range of classification criteria is quite wide. These criteria are connected with the purposes of corpora. With the development of computational linguistics, new forms of using corpora are found. The future prospects and types of corpus linguistics are also based on the annotation of corpus materials. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least four corpora.
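The difference between an indexed (marked) and an unindexed corpus can be shown with a minimal sketch. The Token record, the tag labels (NOUN, VERB, ADJ) and the sample sentence below are hypothetical and hand-assigned for illustration; they do not reproduce the tagset of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form of the word
    pos: str    # morphological tag (illustrative label, not a real tagset)


# One sentence of an indexed corpus; tags are assigned by hand for illustration.
sentence = [
    Token("Corpora", "corpus", "NOUN"),
    Token("contain", "contain", "VERB"),
    Token("real", "real", "ADJ"),
    Token("texts", "text", "NOUN"),
]


def find_by_tag(tokens, pos):
    """Return the word forms that carry the given morphological tag."""
    return [t.form for t in tokens if t.pos == pos]


# A query that is possible only because the corpus is indexed.
print(find_by_tag(sentence, "NOUN"))
```

In an unindexed corpus only the raw word forms would be stored, so a search of this kind by grammatical category would not be possible without first tagging the text.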
One important criterion for users is the accessibility of a corpus. Free use is possible in corpora with open online access. Historical corpora may be used only with the appropriate rights. Closed corpora are intended for other purposes, and their public use is forbidden. By the number of languages they contain, corpora are divided into monolingual, bilingual and multilingual ones.
The variants and dialects of a language are contrasted in a monolingual corpus. Bilingual and multilingual corpora can be divided into two main types:
1) corpora that present numerous original texts and their translations into one or several other languages;
2) corpora that combine texts covering the same field, whether written in one or in several languages.
Such corpora are mostly used by translators. Both types are widely used in translation, machine translation, the compilation of terminological dictionaries and the comparative study of languages.
Automatic translation, an achievement of computational linguistics, has become a reality and has been made available to the public. Programs for translating from many languages into English and vice versa are already available on the internet, and automatic translation programs continue to be posted on various internet sites. The most complicated problems of translation from and into the Indo-European languages have already been worked out.
Electronic dictionaries were compiled long before the other dictionaries in the machine translation system.
There is now a wide choice of dictionaries on the software market. The first electronic dictionary programs could hold only one dictionary; now numerous dictionaries can be included in a single program.
The main issue of this article is the problem of corpus annotation. The object of the research is corpus linguistics as a whole. Studies show a direct link between marking and this problem. One of the important problems solved in the process of forming a corpus is providing the transition from marked text to unmarked text. An important feature of a corpus is the annotation of the texts it includes. With the development of corpus linguistics it became necessary to provide texts with additional information. If linguistic information concerns syntactic structure, it relates to a sentence; if it concerns a lexeme and its grammatical features, it relates to a word. Solving these problems poses certain difficulties, and the present article deals with difficulties of this kind.
The concept of "annotation" first appeared in English. The term refers to the process of adding linguistic information to an electronic corpus of spoken or written data. In Russian, "annotation" is a polysemous term: it can mean sentence markup, tagging, or text compression (summarization). We use the term in the sense of sentence markup. In applied linguistics, the "annotation" of texts is interpreted as the communication of certain additional linguistic information about a text, implemented through its markup in accordance with a certain concept or theory. The concept within which annotation is carried out can define the task of automatic text processing and set new standards. Annotation of corpus texts is a prerequisite for many machine learning methods used in automatic text processing.
One of the important problems solved in forming a corpus is providing the transition from marked text to unmarked text. The term annotation is used in English corpus terminology and has also entered Russian. Because Azerbaijani corpus linguistics is new, an appropriate terminological base has not been formed yet. Nevertheless, such variants of the term as annotasiya, annotasiyalama, nişanlama and markerləmə can be used. Annotation, or marking, means the linguistic information included in a corpus. [1, p.240-244]
Linguistic annotation of language data was originally performed in order to provide
information for the development and testing of linguistic theories, or, as it is known today,
corpus linguistics. At the time, considerable time and effort was required to annotate data with
even the simplest linguistic phenomena, and the annotated corpora available for study were quite
small. Over the past three decades, advances in computing power and storage together with
development of robust methods for automatic annotation have made linguistically-annotated data
increasingly available in ever-growing quantities. As a result, these resources now serve not only
linguistic studies, but also the field of natural language processing (NLP), which relies on
linguistically-annotated text and speech corpora to evaluate new human language technologies
and, crucially, to develop reliable statistical models for training these technologies. In recent
years, there has been a noticeable upswing in linguistic annotation activity, which has expanded
to cover a wide variety of linguistic phenomena. [9, p.1]
At present the rapid progress of computer technology, the computerization of printing and publishing, the appearance of the internet and the accumulation of a huge amount of language material stored on electronic media call for the use of these sources. Corpus linguistics, which is considered to be a new branch of computational linguistics, has a dual character.
Annotation Methods
Annotation methods can be linguistic or statistical. Both kinds are built on underlying concepts: linguistic methods rest on qualitative rules, whereas statistical methods rest on quantity. Statistical methods became popular in the 1950s. Unfortunately, their development ended very quickly, which is explained by two factors. First, there was the issue of data availability: the datasets of that time were generally so small that it was impossible to make interesting statistical generalizations about a large number of linguistic phenomena. Second, there was a general shift in the social sciences.
The use of corpora is one of the characteristic features of modern linguistics. The various kinds of linguistic information and material that make up corpora are applied to a broad range of problems.
Users of corpora are generally interested not in the content of specific texts but in the linguistic information they contain and in examples of the use of particular linguistic elements and structures. Such users are, first of all, linguists. The first linguistic studies carried out with corpora were limited to counting the frequencies of occurrence of various language elements. Statistical techniques are now used in solving complex linguistic tasks such as machine translation, speech recognition and synthesis, spelling and grammar checking, etc. Set phrases, for instance, are semantically indivisible units, which is very important to take into account in lexicography and in automatic text processing systems. Applied to corpus material, statistical methods can determine which words regularly occur together and can thus be classified as set phrases. Corpora are a rich source of data for research on lexicography and grammar.
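One common way of finding words that regularly occur together is pointwise mutual information (PMI), which compares the observed frequency of a word pair with the frequency expected if the two words were independent. The sketch below is an illustration under stated assumptions, not the method of any particular corpus tool: the function name pmi_scores, the min_count threshold and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to be called "regular"
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy "corpus", already split into tokens.
tokens = ("machine translation is hard . machine translation needs corpora . "
          "corpora help translation").split()
scores = pmi_scores(tokens)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than chance would predict, which makes it a candidate set phrase. On real corpora the threshold and the association measure would need to be chosen with care, since PMI is known to favour rare words.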
Revealing certain linguistic facts and confirming whether they are regular or accidental require the study and analysis of voluminous language material. This task can be solved only at the level of corpus linguistics. Corpus linguistics is used in applied work, in language teaching and in language research. All these factors confirm the urgency of research connected with corpus linguistics. The main theoretical issues of this research include the problem of corpus annotation. Studies show a direct link between marking and this problem. Generalizing the experience of corpus building makes it possible to consider the theoretical problems of creating modern national corpora from a new angle. [2, p.120-122]
The opportunities for analysis, segmentation and segment analysis are wide in corpus linguistics because its object is a finished text, and the units forming this text are identified. Corpus linguistics studies the wordforms of texts. The structure and mathematical design of a corpus make it possible to determine the immediate environment of a wordform. A corpus provides the material for calculating the probability of a certain sequence of wordforms. One of the advantages of corpus linguistics is that those who study the quantitative features of a language may use the corpus material directly. Using a corpus eliminates such labour-intensive work as the selection, collection and loading of material. It should be noted that available electronic libraries can also be used for some purposes. Many user programs (for example, Word) include search systems, and some problems can be solved with them. It is possible, for example, to compile a frequency dictionary of a writer’s language and to determine the total frequency of the words the writer uses. [6]
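The calculation of the probability of a sequence of wordforms can be sketched with a simple bigram model, in which the probability of each wordform is estimated from how often it follows the preceding one in the corpus. This is one standard estimate among several and is offered only as an illustration; the function names and the toy token list are hypothetical.

```python
from collections import Counter


def bigram_model(tokens):
    """Return a function estimating P(next | previous) by relative frequency."""
    # Count each wordform as a left context (the last token has no successor).
    context_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, nxt):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate available
        return pair_counts[(prev, nxt)] / context_counts[prev]

    return prob


def sequence_probability(prob, words):
    """Multiply the conditional probabilities along a sequence of wordforms.

    The result is conditional on the first wordform of the sequence.
    """
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= prob(prev, nxt)
    return p


# Hypothetical toy "corpus", already split into wordforms.
tokens = "the cat sat on the mat the cat slept".split()
prob = bigram_model(tokens)
print(prob("the", "cat"))
print(sequence_probability(prob, ["the", "cat", "sat"]))
```

Unsmoothed relative frequencies assign zero probability to any pair not seen in the corpus, so practical models usually add some form of smoothing; that step is omitted here for brevity.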
A corpus is not completed merely by choosing texts, determining contexts and including them in the corpus. In that case the corpus loses its significance and becomes a meaningless collection in comparison with an electronic library. One of the decisive features of a corpus is the annotation of the texts it includes. At first annotation covered only linguistic information. With the development of corpus linguistics it became necessary to provide texts with additional information: the corpus came to include information about its creation, the genre of the text, the author, the date of writing, the name of the work from which the text is chosen, and precise information about the edition and page number. These are not linguistic data. If linguistic information concerns syntactic structure, it relates to a sentence; if it concerns a lexeme and its grammatical features, it relates to a word. [6]
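The distinction drawn above between non-linguistic data about a text and linguistic information attached to sentences or words can be sketched as a small data structure. All class names, field names and values below are hypothetical placeholders chosen for illustration; real corpora define their own schemes.

```python
from dataclasses import dataclass, field


@dataclass
class TextMetadata:
    # Non-linguistic data about the source text.
    author: str
    title: str
    genre: str
    year: int
    edition: str
    page: int


@dataclass
class WordAnnotation:
    # Word-level linguistic information: lexeme and grammatical features.
    form: str
    lemma: str
    grammar: str


@dataclass
class SentenceAnnotation:
    # Sentence-level linguistic information: syntactic structure.
    words: list
    syntax: str


@dataclass
class CorpusEntry:
    meta: TextMetadata
    sentences: list = field(default_factory=list)


# Placeholder values only; they do not describe a real publication.
entry = CorpusEntry(
    meta=TextMetadata(
        author="(hypothetical author)",
        title="(hypothetical title)",
        genre="literary",
        year=2000,
        edition="(hypothetical edition)",
        page=1,
    ),
    sentences=[
        SentenceAnnotation(
            words=[
                WordAnnotation("Birds", "bird", "noun, plural"),
                WordAnnotation("sing", "sing", "verb, present"),
            ],
            syntax="subject + predicate",
        )
    ],
)

print(entry.meta.genre)
print([w.lemma for w in entry.sentences[0].words])
```

Keeping the three layers separate lets a user move from a context to its source data, or from an author to the corresponding texts, without mixing linguistic and non-linguistic information.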
Solving the problems mentioned poses certain difficulties. When a user directly accesses the information block about an author, he obtains information about that author’s works included in the corpus and from there can go directly to the corresponding texts. In the other case, the user chooses a context from the corpus and then obtains information about its source and date of writing. The compiler of the corpus therefore has to provide access in both directions to the information about the printed version of a context. Thus, the placement of references to marking and annotation material in the corpus, and the different views on and approaches to them, are taken into account in corpus linguistics. [3]
In the 1980s, linguistic annotation was usually motivated by the desire to study a given
linguistic phenomenon in large bodies of data, and annotation schemes typically directly
reflected a specific linguistic theory. As the need for reliable automatic annotation for larger and
larger bodies of data increased in the early 90s, there sometimes arose a tension between the
requirements for accurate automatic annotation and a comprehensive linguistic accounting that
could contribute to validation and refinement of the underlying theory. An early example is the
Penn Treebank project’s reduction and modification of the part-of-speech tagset developed for
the Brown Corpus, in order to obtain more accurate results from automatic taggers and parsers.
In the following decades, machine learning arose as the central methodology for NLP. [9, p.6]
Corpus linguistics was formed in order to collect and store an enormous amount of language material and to solve various linguistic problems using this material. When creating the national corpus of a language, it is important to take into consideration the selection criteria for the collected material. When using a corpus text, there must be an opportunity to analyze the words, word combinations and grammatical categories used in the text. The range of classification criteria for corpora is quite wide. These criteria are connected with the purposes of corpora. With the development of computational linguistics, new forms of using corpora are found. The future prospects and types of corpus linguistics are also based on the annotation of corpus material. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least four corpora.
REFERENCES
1. Mahmudov M.Ə. “Mətnin formal təhlili sistemi”. Bakı: Elm, 2002. 244 s.
2. Захаров В.П. «Корпусная лингвистика». Санкт-Петербург, 2013. 148 с.
3. Марчук Ю.Н. «Проблемы машинного перевода». Москва: Наука, 1983. 233 с.
4. Arnold I.V. “The English Word”. Moscow, 1986, p.265-266.
5. Kennedy G. “An Introduction to Corpus Linguistics”. London, 2014. 328 p.
6. McEnery T. “Corpus Linguistics”. Edinburgh University Press, 2001. 235 p.
7. Svartvik J. “Directions in Corpus Linguistics”. Proceedings of the Nobel Symposium 82, Stockholm, 4-8 August 1991. 487 p.
8. Pyles T. “Words and Ways of American English”. Random House, 1952, p.4.
9. Ide N. “Introduction: The Handbook of Linguistic Annotation”. https://www.cs.vassar.edu/~ide/papers/handbook-intro.pdf, 15 p.
ANNOTATION PROCESS OF BRITISH AND AMERICAN CORPORA
SUMMARY
This article is about corpus linguistics, its purpose and main directions. It provides a historical overview of corpus linguistics, the main national corpora and their creation. In addition, the main goals and objectives of corpus linguistics are explained in detail. The national corpus of British English is analyzed in comparison with the corpus of modern American English. The dialects and accents of these varieties are compared. It is noted that the same word is used in different variants. Regional and social variability is the main content of the article.
The concept of annotation or marking refers to the linguistic information entered into the corpus. A corpus is not completed merely by selecting texts, determining contexts by random numbers and entering them into the corpus. As corpus linguistics developed, it became clear that texts needed to be supplemented with additional information.
The peculiarities of the formation and development of corpus linguistics are also covered in the article. Its areas and directions are also specific.
The range of criteria for the classification of corpora is quite wide. These criteria are related to the purpose of the corpora. As computational linguistics develops, new forms of using corpora are sought and found. The future prospects and types of corpus linguistics are also based on the annotation of corpus materials. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least these four corpora.
The article examines corpus linguistics as an important area of applied linguistics. This article on corpus linguistics, which is emerging as an independent field of science, uses many new concepts and terms.
Mammadzada Sevinj
BRİTANİYA VƏ AMERİKA İNGİLİSCƏSİNİN KORPUSLARINDA ANNOTASİYALAMA
PROSESİ
XÜLASƏ
Bu məqalə korpus dilçiliyi, onun məqsədi və əsas istiqamətləri mövzusundadır. Məqalədə
korpus dilçiliyinin tarixi baxışına, əsas milli korpuslara və onların yaradılmalarına yer
verilmişdir. Bundan başqa korpus dilçiliyinin başlıca məqsədləri və vəzifələri geniş şəkildə
açıqlanır. Britaniya ingilis dilinin milli korpusu müasir amerikan ingilis dilinin korpusu ilə
müqayisəli şəkildə təhlil edilir. Bu dillərin dialekt və ləhcələri müqayisə edilir. Eyni sözün
müxtəlif variantlarda işlənməsi qeyd olunur. Regional və sosial variativlik məqalənin əsas
məzmunudur.
Annotasiyalama və ya markerləmə anlayışı korpusa daxil edilən linqvistik informasiyanı
nəzərdə tutur. Korpus yalnız mətnlərin seçilməsi, kontekstlərin təsadüfi ədədlər qaydası ilə
müəyyənləşdirilib korpusa daxil edilməsi ilə tamamlanmır. Korpus dilçiliyi inkişaf etdikcə
mətnlərin əlavə informasiyalarla təchiz edilməsinin də zəruriliyi aşkara çıxmışdır.
Korpus dilçiliyinin özünəməxsus formalaşma və inkişaf xüsusiyyətləri də həmçinin məqalədə
işıqlandırılır. Onun sahələri və istiqamətləri də spesifik xarakter daşıyır.
Korpusların təsnifi meyarları dairəsi kifayət qədər genişdir. Bu meyarlar korpusların
təyinatı ilə bağlılığa malikdir. Kompüter dilçiliyi inkişaf etdikcə korpuslardan istifadənin yeni
formaları axtarılıb tapılır. Korpus linqvistikasının gələcək perspektivləri və tipləri həm də korpus
materiallarının annotasiyalaşdırılmasına əsaslanır. Morfoloji, sintaktik, semantik, terminoloji
markerləmə ən azı bu dörd korpusu qeyd etməyin zəruriliyini təsdiqləyir.
Məqalədə korpus dilçiliyi tətbiqi dilçiliyin mühüm bir sahəsi kimi araşdırılır. Müstəqil bir elm
sahəsi kimi formalaşan korpus dilçiliyinə aid bu məqalədə bir çox yeni anlayış və terminlərdən
istifadə olunur.
Məmmədzadə Sevinc
ПРОЦЕСС АННОТАЦИИ БРИТАНСКОЙ И АМЕРИКАНСКОЙ КОРПУСОВ
РЕЗЮМЕ
Эта статья о корпусной лингвистике, ее назначении и основных направлениях. В
статье дается исторический обзор корпусной лингвистики, основных национальных
корпораций и их создания. Кроме того, широко разъясняются основные цели и задачи
корпусной лингвистики. Национальный корпус британского английского языка
анализируется в сравнении с корпусом современного американского английского.
Сравниваются диалекты и диалекты этих языков. Отмечено, что одно и то же слово
используется в разных вариантах. Региональная и социальная изменчивость - основное
содержание статьи.
Понятие аннотации или маркировки относится к лингвистической информации,
введенной в тело. Корпус не дополняется только выбором текстов, определением
контекстов в порядке случайных чисел и вводом их в корпус. По мере развития корпусной
лингвистики стало ясно, что тексты необходимо дополнять дополнительной
информацией.
В статье также освещены особенности становления и развития корпусной
лингвистики. Его направления и направления тоже специфичны.
Спектр критериев классификации построек достаточно широк. Эти критерии
связаны с назначением корпуса. По мере развития компьютерной лингвистики ведется
поиск новых форм использования корпусов. Будущие перспективы и типы корпусной
лингвистики также основаны на аннотациях корпусных материалов. Морфологическая, синтаксическая, семантическая, терминологическая маркировка подтверждает необходимость маркировки хотя бы этих четырех корпусов.
В статье исследуется применение корпусной лингвистики как важной области
лингвистики. В этой статье о корпусной лингвистике, которая становится
самостоятельной областью науки, используется много новых понятий и терминов.
Мамедзаде Севиндж
Reviewer: Doctor of Philological Sciences, Prof. A.Y. Məmmədov