Annotation process of British and American Corpuses

Sevinj Mammadzada

2022, Filologiya məsələləri No 2

ANNOTATION PROCESS OF BRITISH AND AMERICAN CORPORA

Sevinj Mammadzada
Odlar Yurdu University, Koroglu Rahimov 13, AZ1072, Baku, Azerbaijan
[email protected]

Key words: corpus linguistics, annotation, linguistic data, text fragments, word combinations.
Açar sözlər: korpus dilçiliyi, annotasiya, linqvistik informasiya, mətn parçaları, söz birləşmələri.
Ключевые слова: корпусная лингвистика, аннотация, лингвистическая информация, текстовые фрагменты, словосочетание.

Corpus linguistics is considered to be a branch of computational linguistics. This new linguistic direction, which took shape through the compilation of frequency dictionaries, has developed in connection with problems such as machine translation, the application of mathematical and statistical methods in linguistics, and the creation of natural language processing systems. The further development of corpus linguistics does not exclude the possibility that it will consolidate all the issues of computational linguistics. [5]

The relationship between computational linguistics and corpus linguistics is often debated. Some researchers treat them as separate fields, while others include corpus linguistics within computational linguistics. The term “computational linguistics” is itself a subject of dispute. The field used to be called “mathematical linguistics” or “computing linguistics”. It is now called “computational linguistics”, mainly because of the use of computers in this sphere. The notion “computing linguistics” was used at first because the field was directly concerned with counting language units. The need to count arose from the task of finding the frequency of word use. Long before computers, a researcher had to read a text several times to solve such a task, registering each occurrence of the required word; the registered words were then counted and their quantitative indicators determined. This article deals with the issues of annotation and marking in the corpus.
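The manual counting of word frequencies described above is the kind of task a computer now performs in a few lines. The following is a minimal sketch in Python, not a method taken from the article: the sample text, the function name `word_frequencies` and the simple tokenization rule (lower-cased alphabetic strings) are assumptions made for illustration only.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each wordform occurs in a text.

    Tokenization is deliberately naive (an assumption of this sketch):
    the text is lower-cased and every run of letters, optionally with
    an internal apostrophe, is treated as one wordform.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Hypothetical sample text, used only to demonstrate the counting.
sample = "The corpus is a collection of texts. The texts in the corpus are real."
freq = word_frequencies(sample)

# Print the three most frequent wordforms with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

A frequency dictionary of a real corpus would need a more careful tokenizer and, usually, lemmatization, but the counting step itself stays the same.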
Corpus linguistics is a branch of applied linguistics that studies the general principles of the formation and use of linguistic corpora. The main sources of language material are texts. Corpus linguistics differs from traditional linguistics: traditional grammar focuses on the study of language as a system, whereas corpus linguistics studies speech, which is the source of its material. A corpus is a databank of natural texts compiled from written sources and from transcriptions of recorded speech. It consists of real texts, which are speech products resulting from different types of communication. [2, p.10-15] Corpus linguistics began to take shape as a separate branch in the first half of the 1990s. “Corpus linguistics was in its prime”. [7] This expression appears in J. Svartvik’s 1992 article on corpus linguistics. Corpus linguistics is closely linked to computational linguistics: it uses it and at the same time enriches it.

The English spoken in the four countries of the United Kingdom (England, Scotland, Wales and Northern Ireland) has its own peculiarities, and the English of each country is quite distinct. For a long time RP (Received Pronunciation) was regarded as a social variant, the speech of the privileged segments of the English population. In the nineteenth century the word “received” was understood to refer to the literary language, mostly as the language of the aristocracy. Later this notion was popularized as “the King’s English”.

The English spoken in the USA is called American English. In I.V. Arnold’s opinion, American English is a regional variant and therefore cannot be called a dialect. The variant derives from standard English and has its own American national standard. [4, p. 256-266] American English, like America itself, has a very interesting history of development. A period of three and a half centuries is reflected in its vocabulary.
Words and terms such as “blue laws”, “sunbonnet”, “log cabin”, “forty-niner”, “cash house”, “motel” and “baby-sitter” give information about the past and present of America. American English has not remained within the country: quite a lot of its words have passed into other languages (OK, telephone). [8, p. 4] Through Spanish and Portuguese, words from the Indian languages, including the Aztec languages, influenced the language of England before English took root in America. The names of some Indian tribes are now onomastic units in the USA, for example Iowa, Kansas and Michigan. Some of them took root in European languages, too. The words that came to Europe from the New World had a rich exotic character. This was due to the mixing of the language used in America with the languages of neighbouring countries; examples include potato, tomato, chocolate, cocoa, cannibal, maize and savannah. Although these words are now considered ordinary, they were once new to the English and to other Europeans. For example, by the time the word “barbecue” appeared in British English it was already used and known in America.

One of the variants spoken in many places and heard in the most famous music is “Black English”, spoken primarily by Black people in the United States. At one time this variant was of no interest to linguists, but at present there is great interest in it.

Although linguists do not yet have complete information about the corpus, they are aware of its significance. “A corpus language is an electronic collection of texts of any language”. [6] This thought can be explained as follows: a person reads a text in order to get some information and use it, whereas a linguist divides the text into parts, and these separated fragments take their places in the corpus. Literature and materials that used to take years to collect can now be collected in a short time. Today the time is spent on studying the material, not on collecting it.
A corpus greatly facilitates this hard work, and its role is indispensable for linguists. The quantity and quality of the material obtained from a corpus are much higher than in the period before corpora existed. A study of a language based on more than 10,000 examples is more valuable than one based on 10 examples, and it is the corpus of the language that makes it possible to find those 10,000 examples. Thus the appearance of corpus linguistics in the 1980s and the preparation of a number of corpora ultimately confirmed the urgency and importance of the field.

There are many different types of corpora. V.P. Zakharov distinguishes types of corpora by their purposes and features. It should be added that his classification cannot be accepted completely; it can be accepted only conditionally. By purpose, corpora are either multi-purpose or specialized. Multi-purpose corpora collect texts of different genres, while specialized corpora cover only one genre or one group of genres. Corpora of texts can be grouped by genre: literary, folklore, dramatic, publicistic and so on. The marking criterion divides corpora into coded (marked) and non-coded (unmarked) groups, which can also be called indexed and unindexed. An indexed corpus contains words and sentences with tags (morphological, syntactic, semantic etc.). [2, p.15-25] By the format criterion, corpora are divided into non-electronic and electronic types. Non-electronic corpora belong to earlier periods; at present only electronic corpora are used. Another criterion applied in the classification is the way texts enter the corpus: on this basis corpora are divided into static and updated ones. In general, the range of classification criteria is wide enough. These criteria are connected with the purposes of the corpora.
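The difference between an indexed and an unindexed corpus can be made concrete with a small data structure. The sketch below is an illustration under stated assumptions rather than the format of any particular corpus: the class `Token`, the two sample sentences and the coarse tag labels (`NOUN`, `VERB`, `PRON`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One wordform of an indexed corpus with its morphological tags."""
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary form
    pos: str    # part-of-speech tag (hypothetical coarse tagset)


# A tiny indexed corpus: a list of sentences, each a list of tagged tokens.
indexed_corpus = [
    [Token("Corpora", "corpus", "NOUN"),
     Token("contain", "contain", "VERB"),
     Token("texts", "text", "NOUN")],
    [Token("Linguists", "linguist", "NOUN"),
     Token("annotate", "annotate", "VERB"),
     Token("them", "they", "PRON")],
]


def find_by_pos(corpus, pos):
    """Return every wordform in the corpus that carries the given tag."""
    return [tok.form for sentence in corpus for tok in sentence if tok.pos == pos]


def forms_of_lemma(corpus, lemma):
    """Return every wordform in the corpus that belongs to the given lemma."""
    return [tok.form for sentence in corpus for tok in sentence if tok.lemma == lemma]


nouns = find_by_pos(indexed_corpus, "NOUN")
corpus_forms = forms_of_lemma(indexed_corpus, "corpus")
print(nouns)
print(corpus_forms)
```

Queries of this kind (all nouns, all forms of a lemma) are possible only because the tags are stored with each word; an unindexed corpus holding the same sentences as plain strings could answer them only after being tagged.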
With the development of computational linguistics, new forms of corpus use are being found. The future prospects and types of corpus linguistics are also based on the annotation of corpus materials. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least four corresponding types of corpora. One important criterion for users is usability, which is provided by corpora with free online access. Historical corpora require a right of use. Closed corpora are intended for other purposes, and public use of them is forbidden.

According to the language of their texts, corpora are divided into monolingual, bilingual and multilingual ones. In a monolingual corpus the variants and dialects of one language are contrasted, as with the variants of English. Bilingual and multilingual corpora can be divided into two main types: 1) corpora that present numerous original texts and their translations into one or several other languages; 2) corpora that combine texts covering the same field, regardless of whether they are written in one or several languages. Such corpora are mostly used by translators. Both types are widely used in translation, machine translation, the compilation of terminological dictionaries and the comparative study of languages.

Automatic translation, an achievement of computational linguistics, has become a reality and has been made available to the public. Programs for translation from many languages into English and vice versa have already taken their places on the internet, and automatic translation programs continue to be posted on various internet sites. The most complicated problems of translation from and into the Indo-European languages have already been worked out. In machine translation systems, electronic dictionaries were compiled long before other dictionaries. There is now a wide choice of dictionaries on the software market.
Only one dictionary could be added to the first electronic dictionaries. Now numerous dictionaries can be included in an electronic dictionary.

The main issue of this article is the problem of annotating a corpus. The object of the research is corpus linguistics as a whole. The studies show a direct link between marking and the problem under discussion. One of the important problems solved in the process of forming a corpus is providing for the transition from marked text to unmarked text. An important factor for a corpus is the annotation of the texts it includes. With the development of corpus linguistics it became necessary to provide texts with additional information. Linguistic information connected with syntactic structure relates to a sentence, whereas linguistic information connected with a lexeme and its grammatical features relates to a word. Solving these problems poses certain difficulties, and the present article deals with difficulties of this kind.

The concept of “annotation” first appeared in English. The term refers to the process of adding linguistic information to an electronic corpus of spoken or written data. In Russian, “annotation” is a polysemous term: it can mean sentence markup, tagging, or text compression (summarization). We use the term in the sense of sentence markup. In applied linguistics, the “annotation” of texts is interpreted as the communication of certain additional linguistic information about a text, implemented through markup in accordance with a certain concept or theory. The concept within which annotation is carried out can define the task of automatic text processing and set new standards. Annotation of corpus texts is a prerequisite for many machine learning methods used in solving problems of automatic text processing.
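The transition from marked to unmarked text mentioned above can be sketched for one simple markup convention. The example assumes, purely for illustration, that each word carries its tag after an underscore (`word_TAG`) and that the tags follow the style of the Penn Treebank part-of-speech tagset; the function names and the sample sentence are hypothetical, and real corpora use a variety of other formats.

```python
def parse_tagged(line):
    """Split a 'word_TAG word_TAG ...' string into (word, tag) pairs.

    A token without an underscore is kept as a word with an empty tag,
    so unmarked input passes through unchanged.
    """
    pairs = []
    for item in line.split():
        word, sep, tag = item.rpartition("_")
        if not sep:              # no underscore: the token is unmarked
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs


def strip_tags(line):
    """Recover the unmarked text by dropping every tag."""
    return " ".join(word for word, _ in parse_tagged(line))


# Hypothetical marked sentence (word-level annotation only).
marked = "The_DT corpus_NN contains_VBZ real_JJ texts_NNS"
pairs = parse_tagged(marked)
unmarked = strip_tags(marked)
print(pairs)
print(unmarked)
```

The word-level tags here attach to individual tokens; sentence-level information such as syntactic structure would need a separate layer (for example a bracketed tree), which this sketch does not attempt.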
One of the important problems solved in the process of forming a corpus is providing for the transition from marked text to unmarked text. The term “annotation” is used in English corpus terminology, and it has entered Russian, too. As Azerbaijani corpus linguistics is new, an appropriate terminological base has not yet been formed. Nevertheless, such variants of the term as annotasiya, annotasiyalama, nişanlama and markerləmə can be used. Annotation, or marking, means the linguistic information included in a corpus. [1, p. 240-244]

Linguistic annotation of language data was originally performed in order to provide information for the development and testing of linguistic theories, or, as it is known today, corpus linguistics. At the time, considerable time and effort was required to annotate data with even the simplest linguistic phenomena, and the annotated corpora available for study were quite small. Over the past three decades, advances in computing power and storage together with development of robust methods for automatic annotation have made linguistically-annotated data increasingly available in ever-growing quantities. As a result, these resources now serve not only linguistic studies, but also the field of natural language processing (NLP), which relies on linguistically-annotated text and speech corpora to evaluate new human language technologies and, crucially, to develop reliable statistical models for training these technologies. In recent years, there has been a noticeable upswing in linguistic annotation activity, which has expanded to cover a wide variety of linguistic phenomena. [9, p.1]

At present the rapid progress of computing technology, the computerization of printing and publishing, the appearance of the internet and the accumulation of a huge amount of language material in electronic media make it necessary to use these sources. Corpus linguistics is a new field with a dual character.
It is considered to be a branch of computational linguistics.

Annotation Methods

Annotation methods can be linguistic or statistical, and both are built on concepts. Linguistic methods rest on qualitative rules about the material; statistical methods rest on quantity. Statistical methods became popular in the 1950s. Unfortunately, their development came to an end very quickly, which is explained by two factors. First, there was the issue of data availability: the datasets of that time were generally so small that it was impossible to make interesting statistical generalizations about a large number of linguistic phenomena. Second, there was a general shift in the social sciences.

The use of corpora is one of the characteristic features of modern linguistics. The various linguistic information and material that make up corpora are applied to a broad range of problems. Users of corpora are generally interested not in the content of specific texts but in the linguistic information they carry and in examples of the use of particular linguistic elements and structures. Such users are, first of all, linguists. The first linguistic studies carried out with corpora were limited to counting the frequencies of occurrence of various language elements. Statistical techniques are used in solving complex linguistic tasks such as machine translation, speech recognition and synthesis, spelling and grammar checking, etc. Set phrases, for example, are semantically indivisible units, which is very important to take into account in lexicography and in automatic text processing systems. Applied to corpus material, statistical methods can determine which words regularly occur together and can therefore be classified as set phrases. Corpora are a rich source of data for research on lexicography and grammar.
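The idea that statistics can reveal which words regularly occur together can be illustrated with a small sketch. The article does not name a specific measure; the example below uses pointwise mutual information (PMI) over adjacent word pairs, a common choice that is an assumption here, and the function name, the `min_count` threshold and the toy token list are likewise hypothetical.

```python
import math
from collections import Counter


def collocation_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often two words occur side by side with how often
    they would do so if they were independent. Pairs seen fewer than
    min_count times are ignored, because PMI is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy material standing in for a corpus (hypothetical).
tokens = (
    "machine translation uses a corpus and machine translation needs "
    "a large corpus because machine translation learns from a corpus"
).split()

scores = collocation_scores(tokens)
# List candidate set phrases, strongest association first.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

On real corpus material the same procedure would be run over millions of tokens, and the highest-scoring pairs would be reviewed by a lexicographer before being accepted as set phrases.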
Revealing certain linguistic facts and confirming their regularity and coincidence require the study and analysis of a large volume of language material. This task can be solved only at the level of corpus linguistics. Corpus linguistics is used in applications, in language teaching and in language research. All these factors confirm the urgency of research connected with corpus linguistics. The main theoretical issue of this research is the problem of annotating a corpus. The studies show a direct link between marking and the problem under discussion. Generalizing the experience of existing corpora makes it possible to consider the theoretical problems of creating modern national corpora from a new angle. [2, p.120-122]

The opportunities for analysis, segmentation and segment analysis are wide in corpus linguistics because its object is a finished text, and the units forming this text can be identified. Corpus linguistics studies the wordforms of texts. The structure and mathematical purpose of the corpus make it possible to determine the inner circle of a wordform. The corpus provides the material for calculating the probability of a certain sequence of wordforms. One of the advantages of corpus linguistics is that those who study the quantitative features of a language can use the corpus material directly. Using a corpus removes such labour-intensive work as choosing, collecting and loading material. It should be noted that the available electronic libraries can also be used for some purposes. Many user programs (for example, Word) include search systems, and some problems can be solved with them. It is possible to compile a frequency dictionary of a writer’s language and to determine the overall frequency of the words the writer uses. [6] A corpus is not completed merely by choosing texts, determining contexts and including them in the corpus.
A corpus limited to these steps loses its significance and becomes a meaningless collection compared with an electronic library. One of the decisive factors for a corpus is the annotation of the texts it includes. At first annotation covered only linguistic information. With the development of corpus linguistics it became necessary to provide texts with additional information. Corpora came to include information about their creation, the genre of the text, the author, the date of writing, the name of the work from which the text was chosen, precise details of the edition and the page number. These are not linguistic data. Linguistic information connected with syntactic structure relates to a sentence, whereas linguistic information connected with a lexeme and its grammatical features relates to a word. [6] Solving these problems poses certain difficulties. When a user directly opens the information block about an author, he gets information about that author’s works that are available for use and, by going from there to the text corpus, obtains the texts. In the other case, the user chooses a context from the corpus and then gets information about its source and date of writing. The compiler of the corpus thus has to provide for different directions of access to information about the printed version of a context. Accordingly, the placement of references to marking and annotation material in the corpus, and the different views of and approaches to it, are taken into account in corpus linguistics. [3]

In the 1980s, linguistic annotation was usually motivated by the desire to study a given linguistic phenomenon in large bodies of data, and annotation schemes typically directly reflected a specific linguistic theory. As the need for reliable automatic annotation for larger and larger bodies of data increased in the early 90s, there sometimes arose a tension between the requirements for accurate automatic annotation and a comprehensive linguistic accounting that could contribute to validation and refinement of the underlying theory.
An early example is the Penn Treebank project’s reduction and modification of the part-of-speech tagset developed for the Brown Corpus, in order to obtain more accurate results from automatic taggers and parsers. In the following decades, machine learning arose as the central methodology for NLP. [9, p.6]

Corpus linguistics was formed in order to collect and store an enormous amount of language material and to solve different linguistic problems using this material. When creating the national corpus of a language it is important to take into consideration the selection criteria for the collected material. When using a corpus text there must be an opportunity to analyze the words, word combinations and grammatical categories used in it. The range of classification criteria for corpora is wide enough. These criteria are connected with the purposes of the corpora. With the development of computational linguistics, new forms of corpus use are being found. The future prospects and types of corpus linguistics are also based on the annotation of corpus material. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least four corresponding types of corpora.

REFERENCES

1. Mahmudov M.Ə. “Mətnin formal təhlili sistemi”. Bakı: Elm, 2002. 244 p.
2. Захаров В.П. «Корпусная лингвистика». Санкт-Петербург, 2013. 148 p.
3. Марчук Ю.Н. «Проблемы машинного перевода». Москва: Наука, 1983. 233 p.
4. Arnold I.V. The English Word. Moscow, 1986, p. 265-266.
5. Kennedy G. “An Introduction to Corpus Linguistics”. London, 2014. 328 p.
6. McEnery T. “Corpus Linguistics”. Edinburgh University Press, 2001. 235 p.
7. Svartvik J. “Directions in Corpus Linguistics”. Proceedings of the Nobel Symposium 82, Stockholm, 4-8 August 1991. 487 p.
8. Thomas Pyles. Words and Ways of American English. Random House, 1952, p. 4.
9. Nancy Ide. “Introduction: The Handbook of Linguistic Annotation”. https://www.cs.vassar.edu/~ide/papers/handbook-intro.pdf. 15 p.
ANNOTATION PROCESS OF BRITISH AND AMERICAN CORPORA

SUMMARY

This article is about corpus linguistics, its purpose and main directions. The article provides a historical overview of corpus linguistics, the main national corpora and their creation. In addition, the main goals and objectives of corpus linguistics are explained in detail. The national corpus of British English is analyzed in comparison with the corpus of modern American English. The dialects and accents of these varieties are compared. It is noted that the same word is used in different variants. Regional and social variability is the main content of the article. The concept of annotation or marking refers to the linguistic information entered into the corpus. A corpus is not completed merely by selecting texts, determining contexts by random numbers and entering them into the corpus. As corpus linguistics developed, it became clear that texts needed to be supplemented with additional information. The peculiarities of the formation and development of corpus linguistics are also covered in the article. Its areas and directions are also specific. The range of criteria for the classification of corpora is quite wide. These criteria are related to the purpose of the corpora. As computational linguistics develops, new forms of using corpora are sought. Future prospects and types of corpus linguistics are also based on the annotation of corpus materials. Morphological, syntactic, semantic and terminological marking confirms the need to distinguish at least these four types of corpora. The article examines corpus linguistics as an important area of applied linguistics. This article on corpus linguistics, which is emerging as an independent field of science, uses many new concepts and terms.

Mammadzada Sevinj

BRİTANİYA VƏ AMERİKA İNGİLİSCƏSİNİN KORPUSLARINDA ANNOTASİYALAMA PROSESİ

XÜLASƏ

Bu məqalə korpus dilçiliyi, onun məqsədi və əsas istiqamətləri mövzusundadır.
Məqalədə korpus dilçiliyinin tarixi baxışına, əsas milli korpuslara və onların yaradılmalarına yer verilmişdir. Bundan başqa korpus dilçiliyinin başlıca məqsədləri və vəzifələri geniş şəkildə açıqlanır. Britaniya ingilis dilinin milli korpusu müasir amerikan ingilis dilinin korpusu ilə müqayisəli şəkildə təhlil edilir. Bu dillərin dialekt və ləhcələri müqayisə edilir. Eyni sözün müxtəlif variantlarda işlənməsi qeyd olunur. Regional və sosial variativlik məqalənin əsas məzmunudur. Annotasiyalama və ya markerləmə anlayışı korpusa daxil edilən linqvistik informasiyanı nəzərdə tutur. Korpus yalnız mətnlərin seçilməsi, kontekstlərin təsadüfi ədədlər qaydası ilə müəyyənləşdirilib korpusa daxil edilməsi ilə tamamlanmır. Korpus dilçiliyi inkişaf etdikcə mətnlərin əlavə informasiyalarla təchiz edilməsinin də zəruriliyi aşkara çıxmışdır. Korpus dilçiliyinin özünəməxsus formalaşma və inkişaf xüsusiyyətləri də həmçinin məqalədə işıqlandırılır. Onun sahələri və istiqamətləri də spesifik xarakter daşıyır. Korpusların təsnifi meyarları dairəsi kifayət qədər genişdir. Bu meyarlar korpusların təyinatı ilə bağlılığa malikdir. Kompüter dilçiliyi inkişaf etdikcə korpuslardan istifadənin yeni formaları axtarılıb tapılır. Korpus linqvistikasının gələcək perspektivləri və tipləri həm də korpus materiallarının annotasiyalaşdırılmasına əsaslanır. Morfoloji, sintaktik, semantik, terminoloji markerləmə ən azı bu dörd korpusu qeyd etməyin zəruriliyini təsdiqləyir. Məqalədə korpus dilçiliyi tətbiqi dilçiliyin mühüm bir sahəsi kimi araşdırılır. Müstəqil bir elm sahəsi kimi formalaşan korpus dilçiliyinə aid bu məqalədə bir çox yeni anlayış və terminlərdən istifadə olunur. Məmmədzadə Sevinc ПРОЦЕСС АННОТАЦИИ БРИТАНСКОЙ И АМЕРИКАНСКОЙ КОРПУСОВ РЕЗЮМЕ Эта статья о корпусной лингвистике, ее назначении и основных направлениях. В статье дается исторический обзор корпусной лингвистики, основных национальных корпораций и их создания. 
Кроме того, широко разъясняются основные цели и задачи корпусной лингвистики. Национальный корпус британского английского языка анализируется в сравнении с корпусом современного американского английского. Сравниваются диалекты и диалекты этих языков. Отмечено, что одно и то же слово используется в разных вариантах. Региональная и социальная изменчивость - основное содержание статьи. Понятие аннотации или маркировки относится к лингвистической информации, введенной в тело. Корпус не дополняется только выбором текстов, определением контекстов в порядке случайных чисел и вводом их в корпус. По мере развития корпусной лингвистики стало ясно, что тексты необходимо дополнять дополнительной информацией. В статье также освещены особенности становления и развития корпусной лингвистики. Его направления и направления тоже специфичны. Спектр критериев классификации построек достаточно широк. Эти критерии связаны с назначением корпуса. По мере развития компьютерной лингвистики ведется поиск новых форм использования корпусов. Будущие перспективы и типы корпусной лингвистики также основаны на аннотациях корпусных материалов. Морфологическая, синтаксическая, семантическая, терминологическая маркировка подтверждает необходимость маркировки хотя бы этих четырех корпусов. В статье исследуется применение корпусной лингвистики как важной области лингвистики. В этой статье о корпусной лингвистике, которая становится самостоятельной областью науки, используется много новых понятий и терминов. Мамедзаде Севиндж Rəyçi: fil.e.d.,prof. A.Y.Məmmədov
