Papers by Elena Bolshakova
Proceedings of the Joint Scientific Conference "Internet and Modern Society", 2011
Various NLP applications require automatic discourse analysis of texts. For the analysis of scientific and technical texts, we propose to use all typical lexical units organizing scientific discourse; we call them common scientific words and expressions, and most of them are known as discourse markers. The paper discusses features of scientific discourse, as well as the variety of discourse markers specific to scientific and technical texts. The main organizing principles of a computer dictionary comprising common scientific words and expressions are described. Key ideas of a discourse recognition procedure based on the dictionary and surface syntactic analysis are outlined.
Computational Linguistics and Intellectual Technologies, Jun 19, 2021
The paper describes a way to generate a dataset of Russian word forms, which is needed to build an appropriate neural model for morpheme segmentation of word forms. The developed generation procedure produces word forms segmented into morphs that are classified by morpheme types, based on an existing dataset of segmented lemmas and additional dictionary data, as well as a fine-grained classification of Russian inflectional paradigms, which makes it possible to correctly process word forms with alternating consonants and fluent vowels in endings. The resulting representative dataset (more than 1.6 million word forms) was used to develop a neural model for morpheme segmentation of word forms with classification of the segmented morphs. The experiments have shown that in detecting morph boundaries the model achieves quality comparable to the best segmentation models for lemmas (98% F-measure), slightly outperforming them in word-level classification accuracy (91%).
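The core of the generation step described above can be sketched as follows. This is a minimal illustration with hypothetical toy data, not the paper's actual procedure: a morph-segmented stem is expanded into labeled word forms by attaching each ending of an inflectional paradigm (the morph-type labels ROOT/END are illustrative).

```python
# Sketch (hypothetical data): expand a morph-segmented stem into labeled
# word forms by attaching the endings of its inflectional paradigm.
# Morphs are (text, type) pairs; the type labels are illustrative.

def expand_paradigm(stem_morphs, endings):
    """Produce segmented word forms: stem morphs plus one ending per form."""
    forms = []
    for ending in endings:
        morphs = list(stem_morphs)
        if ending:                      # a zero ending adds no morph
            morphs.append((ending, "END"))
        surface = "".join(text for text, _ in morphs)
        forms.append((surface, morphs))
    return forms

# toy paradigm for a noun-like stem ('book')
stem = [("книг", "ROOT")]
endings = ["а", "и", "е", "у"]
for surface, morphs in expand_paradigm(stem, endings):
    print(surface, morphs)
```

The real procedure additionally handles alternating consonants and fluent vowels, which a naive concatenation like this cannot capture.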
The paper describes experiments on automatic single-word term extraction based on combining various features of words, mainly linguistic and statistical, by machine learning methods. Since single-word terms are much more difficult to recognize than multi-word terms, a broad range of word features was taken into account, among them widely known measures (such as TF-IDF), some novel features, as well as proposed modifications of features usually applied for multi-word term extraction. A large target collection of Russian texts in the domain of banking was used for the experiments. Average Precision was chosen to evaluate the results of term extraction, and a manually created thesaurus of banking terminology was used to validate the extracted terms. The experiments showed that the use of multiple features significantly improves the results of automatic extraction of domain-specific terms. Logistic regression proved to be the best machine learning method for single-word term extraction; the subset of word features significant for term extraction was also identified.
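One of the statistical features mentioned above, TF-IDF, can be sketched over a toy corpus as follows (the corpus and tokenization are illustrative; the paper combines many such features via machine learning rather than using any single one):

```python
import math
from collections import Counter

# Sketch of one statistical word feature for term extraction: TF-IDF.
# The toy corpus below is illustrative only.
def tf_idf(word, doc_tokens, corpus):
    """TF-IDF of `word` in one tokenized document relative to a corpus."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "deposit", "rate", "rose"],
    ["the", "bank", "issued", "a", "loan"],
    ["the", "tree", "lost", "a", "leaf"],
]
print(tf_idf("deposit", corpus[0], corpus))  # domain word: positive score
print(tf_idf("the", corpus[0], corpus))      # word in every doc: 0.0
```

In the setting the abstract describes, such per-word feature values would form the input vector of a classifier such as logistic regression.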
Lecture Notes in Computer Science, 2006
A dictionary-free morphological classifier of nouns for a highly inflective language is developed. The classifier is a front-end utility for acquiring a very large DB of Russian collocations and WordNet-like semantic links. For its main functions, the classifier uses the final letters of standard noun forms and extensive morphological and lexical data. The percentage of nouns correctly classified in a standalone manner is now 99.65%. Completely error-free performance is impossible for context-free methods in principle, primarily because of homonymy: nouns with different senses may decline in different ways. Therefore, the classifier's results are additionally tested against more than 200,000 collocations stored in the DB and, when necessary, are automatically corrected.
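Classification by final letters, as described above, can be sketched in miniature. The patterns below are toy examples chosen for illustration; the real classifier rests on extensive morphological and lexical data:

```python
# Sketch (toy patterns): assign the declension class of the longest
# final-letter pattern that matches the noun. The pattern list here is
# illustrative, not the classifier's actual rule set.
def classify(noun, patterns):
    """Return the class of the longest matching final-letter pattern."""
    best = None
    for suffix, cls in patterns:
        if noun.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, cls)
    return best[1] if best else None

PATTERNS = [("ость", "fem-3rd-decl"), ("а", "fem-1st-decl"), ("о", "neut-2nd-decl")]
print(classify("скорость", PATTERNS))  # prints "fem-3rd-decl"
print(classify("книга", PATTERNS))     # prints "fem-1st-decl"
```

Homonymy is exactly what such context-free matching cannot resolve, hence the collocation-based correction step the abstract mentions.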
Springer eBooks, 2005
Syntactic links between content words in meaningful texts are intuitively perceived as 'normal,' thus ensuring text cohesion. Nevertheless, we are not aware of a broadly accepted Internet-based measure of cohesion between words syntactically linked in terms of Dependency Grammars. We propose to measure lexico-syntactic cohesion between content words by means of the Internet with a specially introduced Stable Connection Index (SCI). SCI is similar to Mutual Information known in statistics, but it does not require iterative evaluation of the total number of Web pages under the search engine's control and is insensitive to both fluctuations and slow growth of raw Web statistics. On Russian, Spanish, and English material, SCI exhibited concentrated distributions for various types of word combinations; hence lexico-syntactic cohesion acquires a simple numeric measure. It is shown that SCI evaluations can be successfully used for semantic error detection and correction, as well as for information retrieval.
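An SCI-like score in the spirit of the abstract can be sketched as follows. The exact form (constants, log base) is an assumption here; what the sketch preserves is the stated key property that, unlike plain mutual information, no total Web size N enters the formula:

```python
import math

# Sketch of an SCI-like score (assumed form, not the paper's exact formula):
# the log-ratio of the pair's page count to the geometric mean of the
# individual word counts. No total-corpus size appears, matching the
# abstract's claim about SCI vs. Mutual Information.
def sci(n_pair, n_w1, n_w2):
    if n_pair == 0:
        return float("-inf")
    return math.log2(n_pair / math.sqrt(n_w1 * n_w2))

# a frequent, stable combination scores higher than a rare one
print(sci(1000, 10_000, 10_000))  # log2(0.1) ≈ -3.32
print(sci(1, 10_000, 10_000))     # ≈ -13.29
```

Because the total page count cancels out of the ratio, the score is unaffected by slow growth of the overall Web index, which is the robustness property the abstract emphasizes.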
Communications in computer and information science, 2015
The paper describes a strategy that applies heuristics to combine sets of terminological words and word combinations pre-extracted from a scientific text by several term recognition procedures. Each procedure is based on a collection of lexico-syntactic patterns representing specific linguistic information about terms in scientific texts. The strategy is aimed at improving the quality of automatic term extraction from a particular scientific text. The experiments have shown that the strategy yields an 11–17% increase in F-measure compared with commonly used methods of term extraction.
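One plausible combining heuristic of the kind described above is simple voting across the pattern-based procedures. This is a hypothetical sketch; the paper's actual heuristics are richer:

```python
from collections import Counter

# Sketch of a combining heuristic (hypothetical): keep a candidate term
# if at least `k` of the extraction procedures proposed it.
def combine(candidate_sets, k=2):
    votes = Counter(term for s in candidate_sets for term in s)
    return {term for term, v in votes.items() if v >= k}

sets = [
    {"neural network", "model"},
    {"neural network", "data"},
    {"neural network", "model"},
]
print(combine(sets, k=2))  # {'neural network', 'model'}
```

Voting trades recall for precision: raising `k` drops candidates supported by only one procedure, which tend to be noise.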
The paper discusses facilities of computer systems for editing scientific and technical texts, which partially automate the functions of a human editor and thus help the writer to improve text quality. Two experimental systems, LINAR and CONUT, developed in the 1990s to control the quality of Russian scientific and technical texts, are briefly described, and general principles for designing more powerful editing systems are pointed out. Features of an editing system now under development are outlined, primarily the underlying linguistic knowledge base and the procedures controlling the text.
Communications in Computer and Information Science, 2019
The paper addresses the task of automatic morpheme segmentation, involving both splitting words into morphs and classification of the resulting morphs. For segmentation of Russian words, a new model based on a Bi-LSTM neural network is proposed and experimentally evaluated on several training data sets differing in labeling. The proposed model achieves quality comparable to the best supervised machine learning models for morpheme segmentation with classification, slightly outperforming them in word-level classification accuracy with a score of 89%.
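A common target encoding for this kind of sequence model is a per-letter BMES-style tag combined with the morph type; the sketch below shows the idea with an illustrative tag set and example (not taken from the paper's data):

```python
# Sketch: encode a morph segmentation as per-letter BMES tags joined with
# the morph type, a usual target representation for neural segmenters.
# Tag set and example segmentation are illustrative.
def morphs_to_tags(morphs):
    """morphs: list of (text, type) pairs; returns one tag per letter."""
    tags = []
    for text, mtype in morphs:
        if len(text) == 1:
            tags.append(f"S-{mtype}")
        else:
            tags.append(f"B-{mtype}")
            tags.extend([f"M-{mtype}"] * (len(text) - 2))
            tags.append(f"E-{mtype}")
    return tags

print(morphs_to_tags([("книг", "ROOT"), ("а", "END")]))
# ['B-ROOT', 'M-ROOT', 'M-ROOT', 'E-ROOT', 'S-END']
```

With this encoding, segmentation with classification reduces to per-letter sequence labeling, which is exactly what a Bi-LSTM over characters can learn.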
Artificial Intelligence, 2020
Lecture Notes in Computer Science, 2017
The main styles, or paradigms, of programming – imperative, functional, logic, and object-oriented – are briefly described and compared, and the corresponding programming techniques are outlined. Programming languages are classified in accordance with the main style and the techniques they support. It is argued that a profound education in computer science should include learning the basic programming techniques of all main programming paradigms.
The paper reports on preliminary results of ongoing research aimed at the development of an automatic procedure for recognizing the discourse-compositional structure of scientific and technical texts, which is required in many NLP applications. The procedure exploits as discourse markers various domain-independent words and expressions that are specific to scientific and technical texts and organize scientific discourse. The paper discusses features of scientific discourse and the common scientific lexicon comprising such words and expressions. Methodological issues in the development of a computer dictionary for the common scientific lexicon are addressed, and the basic principles of its organization are described. The main steps of the discourse-analyzing procedure based on the dictionary and surface syntactic analysis are pointed out.
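The dictionary-driven marker matching at the heart of such a procedure can be sketched with a toy dictionary (entries and role labels are illustrative; the real dictionary covers common scientific words and expressions and is combined with surface syntactic analysis):

```python
# Sketch with a toy marker dictionary; entries and labels are illustrative.
MARKERS = {
    "for example": "EXEMPLIFICATION",
    "as a result": "CONSEQUENCE",
    "in conclusion": "CONCLUSION",
}

def discourse_roles(sentence):
    """Return the discourse roles of all dictionary markers in the sentence."""
    low = sentence.lower()
    return [role for marker, role in MARKERS.items() if marker in low]

print(discourse_roles("In conclusion, the procedure performs well."))
# ['CONCLUSION']
```

Plain substring matching is only the first step; the surface syntactic analysis mentioned in the abstract is needed to disambiguate expressions that act as markers only in certain positions.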
Malapropism is a semantic error that is hard to detect because it usually retains the syntactic links between words in the sentence but replaces one content word with a similar word of quite different meaning. A method for automatic detection of malapropisms is described, based on Web statistics and a specially defined Semantic Compatibility Index (SCI). For correction of the detected errors, special dictionaries and heuristic rules are proposed, which retain only a few highly SCI-ranked correction candidates for the user's selection. Experiments on Web-assisted detection and correction of Russian malapropisms are reported, demonstrating the efficacy of the described method.
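The detect-then-rank scheme described above can be sketched as follows. The function names, the toy scorer, and the threshold are all hypothetical; in the paper the score comes from Web statistics and the candidates from special dictionaries:

```python
# Sketch (hypothetical names and values): flag a syntactically linked word
# pair whose compatibility score falls below a threshold, then rank
# dictionary-derived candidates by the same score, keeping only a few.
def detect_and_correct(score, pairs, candidates_for, threshold=-10.0, keep=3):
    corrections = {}
    for head, dep in pairs:
        if score(head, dep) < threshold:
            ranked = sorted(candidates_for(dep),
                            key=lambda c: score(head, c), reverse=True)
            corrections[(head, dep)] = ranked[:keep]
    return corrections

# toy scorer: a lookup table standing in for Web-derived compatibility values
TABLE = {("travel", "destiny"): -14.0, ("travel", "destination"): -2.0,
         ("read", "book"): -1.5}
score = lambda a, b: TABLE.get((a, b), -20.0)
pairs = [("travel", "destiny"), ("read", "book")]
cands = lambda w: ["destination", "density"] if w == "destiny" else [w]
print(detect_and_correct(score, pairs, cands))
# {('travel', 'destiny'): ['destination', 'density']}
```

Keeping only the top few candidates for the user's selection mirrors the heuristic rules mentioned in the abstract: the system proposes, the writer decides.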
The morphemic structure of words is useful for various NLP problems, in particular for deriving the meaning of unknown words in languages with rich morphology, such as Russian. For Russian, several machine learning models for automatic morpheme segmentation have been built, but only for parsing lemmas. Meanwhile, significantly varying word forms are present in texts; among them, unknown words are often encountered, and their lemmas are unknown. The paper reports on experiments comparing two ways to automatically segment Russian word forms, both involving splitting into morphs and classification of the resulting morphs. The former is based on a neural model trained on a data set automatically augmented with segmented word forms; the latter produces segmentation through a predicted lemma and a pre-trained neural morpheme segmentation model for lemmas. It was shown that the models have comparable quality in morpheme segmentation with classification, and the model based on the augme…