Papers by Alexander Gelbukh
CICLing, 2004
This issue publishes the proceedings of the second annual conference devoted to computational linguistics and intelligent text processing, CICLing 2001 (Mexico City, 18-24 February 2001). The computational linguistics contributions addressed the following topics: theories and formalisms, semantics, anaphora and reference, disambiguation, translation, text generation, dictionaries and corpora, morphology, and automatic parsing techniques. In the field of intelligent text processing, the papers ...
Lecture Notes in Computer Science, 2010
Computing Transfer Score in Example-Based Machine Translation. Rafał Jaworski, Adam Mickiewicz University, Poznań, Poland, [email protected]. Abstract: This paper presents an idea in Example-Based Machine ...
Computación y Sistemas, Mar 1, 2008
Description: One of the problems of information retrieval on the Internet and in digital libraries is low precision: a high number of retrieved documents of low relevance. For example, a person looks for information about jaguars (the animal) and the documents retrieved are about the car model. This problem arises due to the ambiguity between different senses of words. The task of determining the correct interpretation of a word ...

Most morphological analysis systems are based on the model known as two-level morphology. However, this model is not well suited to languages with irregular stem alternations (for example, Spanish or Russian). In this work we describe a computational morphological analysis system for Spanish based on a different model, whose main idea is analysis through generation. The model consists of a set of rules for obtaining all the stems of a word form for each lexeme, their storage in the dictionary, the generation of all possible hypotheses during analysis, and the verification of these hypotheses through morphological generation. A dictionary of 40,000 lemmas was used, through which more than 2,500,000 possible grammatical forms can be analyzed. For the treatment of unknown words, a heuristics-based algorithm is under development. The developed system is freely available for academic use.
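A minimal sketch of the analysis-through-generation idea described above, assuming a toy lexicon with one Spanish verb showing an irregular stem alternation (contar / cuent-); the lexicon, endings, and tag names are illustrative stand-ins, not the system's actual resources.

```python
# Hypothetical toy lexicon: each lexeme lists its stems and, for each stem,
# the endings that stem is licensed to combine with.
LEXICON = {
    "contar": {
        "cont":  ["ar", "amos"],   # regular stem
        "cuent": ["o"],            # diphthongized stem (cuento, not *conto)
    },
}
TAGS = {"ar": "infinitive", "o": "pres.1sg", "amos": "pres.1pl"}

def generate(stems, ending):
    """Generation step: build the form from the stem licensed for this ending."""
    for stem, endings in stems.items():
        if ending in endings:
            return stem + ending
    return None

def analyze(wordform):
    """Hypothesize every stem+ending split, then confirm each by generation."""
    analyses = []
    for lexeme, stems in LEXICON.items():
        for stem in stems:
            if wordform.startswith(stem):
                ending = wordform[len(stem):]
                if ending in TAGS and generate(stems, ending) == wordform:
                    analyses.append((lexeme, TAGS[ending]))
    return analyses

print(analyze("cuento"))     # [('contar', 'pres.1sg')]
print(analyze("contamos"))   # [('contar', 'pres.1pl')]
print(analyze("cuentamos"))  # []  -- rejected: generation yields 'contamos'
```

The generation check is what rejects forms built from the wrong stem, which is the point of verifying hypotheses by regenerating the surface form.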

Soft cardinality (SC) is a softened version of the classical cardinality of set theory. However, given its prohibitive computational cost (exponential order), an approximation that is quadratic in the number of terms in the text has been proposed in the past. SC Spectra is a new approximation method that runs in linear time for text strings, dividing them into consecutive substrings (i.e., q-grams) of different sizes. SC in combination with resemblance coefficients thus allows the construction of a family of similarity functions for text comparison. These similarity measures have been used in the past to address an entity resolution (name matching) problem, outperforming the SoftTFIDF measure. The SC Spectra method improves on these previous results, using less time and obtaining better performance. This allows the new method to be used with relatively large documents such as those included in classic information retrieval collections. The SC Spectra method exceeded the SoftTFIDF and cosine tf-idf baselines with an approach that requires no term weighting.
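The following is a rough, simplified sketch of the general q-gram idea only, not the exact SC Spectra formulation from the paper: the "soft" size of a string is approximated from its q-gram sets over a small range of q, and those sizes feed a resemblance coefficient (Dice here). The function names and the choice of q values are assumptions.

```python
def qgrams(text, q):
    """Set of character q-grams of a string (padded at the edges)."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def soft_size(text, qs=(2, 3)):
    """Averaged q-gram-set cardinality as a stand-in for soft cardinality."""
    return sum(len(qgrams(text, q)) for q in qs) / len(qs)

def soft_intersection(a, b, qs=(2, 3)):
    return sum(len(qgrams(a, q) & qgrams(b, q)) for q in qs) / len(qs)

def dice_similarity(a, b, qs=(2, 3)):
    """Resemblance coefficient built on the softened cardinalities."""
    inter = soft_intersection(a, b, qs)
    return 2.0 * inter / (soft_size(a, qs) + soft_size(b, qs))

print(round(dice_similarity("Alexander Gelbukh", "A. Gelbukh"), 3))     # high
print(round(dice_similarity("Alexander Gelbukh", "Rafal Jaworski"), 3)) # low
```

Each string is scanned once per q value, so the cost stays linear in the text length, which is the property the abstract emphasizes.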
This book constitutes the refereed proceedings of the 5th Mexican International Conference on Artificial Intelligence, MICAI 2006, held in Apizaco, Mexico, in November 2006. It contains over 120 papers that address such topics as knowledge representation and reasoning, machine learning and feature selection, knowledge discovery, computer vision, image processing and image retrieval, robotics, as well as bioinformatics and medical applications.

Research in Computing Science, 2013
We propose a fully automatic technique for evaluating text summaries without the need to prepare gold standard summaries manually. Standard and popular summary evaluation techniques and tools are not fully automatic; they all need some manual process or a manual reference summary. Using recognizing textual entailment (TE), automatically generated summaries can be evaluated completely automatically, without any manual preparation process. We use a TE system based on a combination of a lexical entailment module, a lexical distance module, a chunk module, a named entity module, and a syntactic textual entailment module. The documents are used as the text (T) and their summaries are taken as the hypothesis (H). Therefore, the more of the document's information is entailed by its summary, the better the summary. Compared with the ROUGE 1.5.5 evaluation scores over the TAC 2008 (formerly DUC, conducted by NIST) dataset, the proposed evaluation technique predicts the ROUGE scores with an accuracy of 98.25% with respect to ROUGE-2 and 95.65% with respect to ROUGE-SU4.
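A minimal sketch of the lexical-entailment component alone, with the document as the text T and its summary as the hypothesis H; the stop-word list and scoring are simplified stand-ins, and the paper's full system additionally combines the lexical distance, chunk, named entity, and syntactic TE modules.

```python
import re

# Hypothetical minimal stop-word list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "after"}

def content_words(text):
    """Lowercased content words of a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def entailment_score(document, summary):
    """Fraction of the document's content vocabulary covered by the summary."""
    doc_words = content_words(document)
    sum_words = content_words(summary)
    return len(doc_words & sum_words) / len(doc_words) if doc_words else 0.0

doc = "The committee approved the new climate policy after a long debate."
good_summary = "Committee approved climate policy."
bad_summary = "The weather was nice yesterday."
print(entailment_score(doc, good_summary))  # higher score
print(entailment_score(doc, bad_summary))   # lower score
```

The summary that entails more of the document's content scores higher, mirroring the evaluation principle stated above.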
Lingvisticae Investigationes, 2007
Abstract: We present a linguistic analysis of named entities in Spanish texts. Our work is focused on determining the structure of complex proper names: names with coordinated constituents, names with prepositional phrases, and names formed by several content words beginning with a capital letter. We present the analysis of circa 49,000 examples obtained from Mexican newspapers. We detail their structure and give some notions about the context surrounding them. Since named entities belong to an open class of ...
This paper describes a Twitter sentiment analysis system that classifies a tweet as positive or negative based on its overall tweet-level polarity. Supervised learning classifiers often misclassify tweets containing conjunctions like "but" and conditionals like "if", due to their special linguistic characteristics. These classifiers also assign a decision score very close to the decision boundary for a large number of tweets, which suggests that they are simply unsure, rather than completely wrong, about these tweets. To counter these two challenges, this paper proposes a system that enhances supervised learning for polarity classification by leveraging linguistic rules and sentic computing resources. The proposed method is evaluated on two publicly available Twitter corpora to illustrate its effectiveness.
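A toy illustration, not the paper's system, of the kind of linguistic rule involved: in a tweet containing "but", the clause after "but" typically carries the decisive polarity, so only that clause is scored. The polarity lexicon below is a tiny hypothetical stand-in for sentic computing resources.

```python
# Hypothetical mini polarity lexicon.
POLARITY = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def lexicon_score(text):
    """Sum of word polarities, ignoring unknown words."""
    return sum(POLARITY.get(w.strip(".,!?"), 0) for w in text.lower().split())

def classify(tweet):
    """Apply the 'but' rule, then fall back to whole-tweet lexicon scoring."""
    lowered = tweet.lower()
    if " but " in lowered:
        scope = lowered.rsplit(" but ", 1)[1]   # keep the clause after "but"
    else:
        scope = lowered
    return "positive" if lexicon_score(scope) >= 0 else "negative"

print(classify("The camera is great but the battery life is terrible"))  # negative
print(classify("Long queue, but the food was great!"))                   # positive
```

In the paper such rules supplement a supervised classifier, particularly for tweets whose decision scores fall near the boundary.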

Correct interpretation of a text frequently requires knowledge of the semantic categories of nouns, especially in languages with free word order. For example, in Spanish the phrases pintó un cuadro un pintor (lit. painted a picture a painter) and pintó un pintor un cuadro (lit. painted a painter a picture) mean the same thing: 'a painter painted a picture'; the only way to tell the subject from the object is to know that pintor 'painter' is a causal agent and cuadro 'picture' is a thing. We present a method for extracting semantic information of this kind from existing machine-readable human-oriented explanatory dictionaries. First, we extract from the dictionary an is-a hierarchy and manually mark the categories of a few top-level concepts. Then, for a given word, we follow the hierarchy upward until finding a concept whose semantic category is known. Applying this procedure to two different human-oriented Spanish dictionaries gives additional information compared with using Spanish EuroWordNet alone. In addition, we show the results of an experiment conducted to evaluate the similarity of word classifications obtained with this method.
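A minimal sketch of the lookup step under stated assumptions: a toy is-a fragment stands in for the hierarchy extracted from a dictionary, a couple of manually labelled top-level concepts carry the semantic categories, and a word's category is found by walking the hierarchy upward.

```python
# Hypothetical is-a fragment: word -> its hypernym (genus term from a dictionary).
ISA = {
    "pintor": "persona",
    "persona": "ser",
    "cuadro": "pintura",
    "pintura": "objeto",
}
# Manually labelled top-level concepts.
TOP_CATEGORIES = {
    "ser": "causal agent",
    "objeto": "thing",
}

def semantic_category(word, max_depth=20):
    """Follow is-a links upward until a concept with a known category is found."""
    current = word
    for _ in range(max_depth):        # guard against cycles in the extracted hierarchy
        if current in TOP_CATEGORIES:
            return TOP_CATEGORIES[current]
        current = ISA.get(current)
        if current is None:
            return None               # reached a word with no genus entry
    return None

print(semantic_category("pintor"))  # causal agent
print(semantic_category("cuadro"))  # thing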
CICLing, 2009
This book constitutes the proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing, held in Iaşi, Romania, in March 2010. The 60 papers included in the volume were carefully reviewed and selected from numerous submissions. The book also includes 3 invited papers. The topics covered are: lexical resources, syntax and parsing, word sense disambiguation and named entity recognition, semantics and dialog, humor and emotions, machine translation and ...

Abstract: The article presents the experiments carried out as part of the participation in the main task (English dataset) of QA4MRE@CLEF 2013. In the developed system, we first combine the question Q and each candidate answer option A to form a (Q, A) pair. Each pair is considered a hypothesis (H). We use morphological expansion to rebuild the H. Then each H is verified by assigning a matching score. Stop words and interrogative words are removed from each H, and query words are identified to retrieve the most relevant sentences from the associated document using Lucene. Relevant sentences are retrieved from the associated document based on the TF-IDF of the matching query words along with the n-gram overlap of the sentence with the H. Each retrieved sentence defines a text T. Each T-H pair is assigned a ranking score that works on the textual entailment principle. The inference weight, i.e., the matching score, is automatically assigned to each answer option based on its inference matching. Each sentence in the associated document contributes an inference score to each H. The candidate answer option that receives the highest inference score is identified as the most relevant option and selected as the answer to the given question.
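A compressed sketch of the scoring pipeline described above, with a plain-Python keyword match standing in for the Lucene/TF-IDF retrieval step and no entailment module; the function names, stop-word list, and example document are illustrative assumptions.

```python
import re

# Hypothetical minimal stop-word / interrogative-word list.
STOP = {"the", "a", "an", "of", "to", "is", "was", "what", "which", "who", "does"}

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def keywords(text):
    return {w for w in tokens(text) if w not in STOP}

def ngram_overlap(a, b, n=2):
    grams = lambda ws: {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    ga, gb = grams(tokens(a)), grams(tokens(b))
    return len(ga & gb) / max(1, len(ga))

def answer_question(question, options, document_sentences):
    """Score each (Q, A) hypothesis against the document's sentences."""
    best_option, best_score = None, -1.0
    for option in options:
        hypothesis = f"{question} {option}"          # (Q, A) pair as H
        query = keywords(hypothesis)
        score = 0.0
        for sentence in document_sentences:          # each sentence acts as a text T
            match = len(query & keywords(sentence)) / max(1, len(query))
            score += match + ngram_overlap(hypothesis, sentence)
        if score > best_score:
            best_option, best_score = option, score
    return best_option

doc = ["Alzheimer's disease mainly affects memory in elderly patients.",
       "Physical exercise is recommended for cardiovascular health."]
print(answer_question("What does Alzheimer's disease mainly affect",
                      ["memory", "eyesight"], doc))   # memory
```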

Abstract: The article presents the experiments carried out as part of the participation in the pilot task of QA4MRE@CLEF 2013. In the developed system, we first generate an answer pattern by combining the question and each answer option to form the hypothesis (H). Stop words and interrogative words are removed from each H, and query words are identified to retrieve the most relevant sentences from the associated document using Lucene. Relevant sentences are retrieved from the associated document based on the TF-IDF of the matching query words along with the n-gram overlap of the sentence with the H. Each retrieved sentence defines a text T. Each T-H pair is assigned a ranking score that works on the textual entailment principle. A matching score is automatically assigned to each answer option based on this matching. A parallel procedure also generates possible answer patterns from the given questions and answer options. Each sentence in the associated document is assigned an inference score with respect to each answer pattern. The inference score evaluated for each answer option is added to its matching score. The answer option that receives the highest selection score is identified as the most relevant option and selected as the answer to the given question.
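A small sketch of the extra step this pilot-task system adds over the main-task one: each question/answer option is turned into a declarative answer pattern (here by a naive wh-word substitution, which is an assumption rather than the paper's exact rule), and the pattern's best sentence-overlap score is added to the hypothesis matching score.

```python
import re

WH_WORDS = ("what", "which", "who", "where", "when")

def answer_pattern(question, option):
    """Replace the leading wh-word with the candidate answer option."""
    words = question.split()
    if words and words[0].lower() in WH_WORDS:
        return " ".join([option] + words[1:])
    return f"{question} {option}"

def overlap(a, b):
    """Fraction of the pattern's word types found in the sentence."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / max(1, len(ta))

def selection_score(question, option, sentences, matching_score):
    """Combine the answer-pattern inference score with the matching score."""
    pattern = answer_pattern(question, option)
    inference_score = max(overlap(pattern, s) for s in sentences)
    return matching_score + inference_score

sents = ["Mozart composed his first symphony at the age of eight."]
print(selection_score("Who composed the first symphony at age eight",
                      "Mozart", sents, matching_score=0.4))
```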

The article presents the experiments carried out as part of the participation in the pilot task (Modality and Negation) of QA4MRE@CLEF 2012. Modality and negation are two main grammatical devices for expressing extra-propositional aspects of meaning. Modality is a grammatical category that expresses aspects related to the attitude of the speaker towards statements. Negation is a grammatical category that changes the truth value of a proposition. The input for the systems is a text in which all events expressed by verbs are identified and numbered; the output should be a label per event. The possible values are: mod, neg, neg-mod, none. In the developed system, we first build a database of modal verbs of two categories: epistemic and deontic. We also use a list of 1,877 negative verbs, which is used to identify negative modality. We extract each tagged event from each sentence. The system then checks each sentence for modal verbs against the database. If a modal verb is found before an event, the event is tagged as mod. If a modal verb is present and a negation word is also found before the event, the event is tagged as neg-mod. If no modal verb is found before an event but a negation word is found before it, the event is tagged as neg. Otherwise the event is tagged as none. We trained our system on the training data (sample data) provided by the QA4MRE organizers and then tested it on the test dataset, which contains eight documents, two per each of the four topics: Alzheimer's disease, music and society, AIDS, and climate change. Our system's overall accuracy is 0.6262 (779 out of 1244).
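A minimal rule sketch of the event labelling described above: look in the left context of each marked event for a modal verb (epistemic or deontic) and for a negation cue, then assign mod, neg, neg-mod, or none. The word lists and window size are tiny hypothetical stand-ins for the databases used in the paper.

```python
# Hypothetical mini lists standing in for the modal-verb database and the
# 1,877-entry negative word list described in the abstract.
MODALS = {"must", "should", "may", "might", "can", "could", "would"}
NEGATIONS = {"not", "never", "no", "n't", "fail"}

def label_event(words, event_index, window=4):
    """Label the event at words[event_index] from its left-context cues."""
    left = [w.lower() for w in words[max(0, event_index - window):event_index]]
    has_modal = any(w in MODALS for w in left)
    has_negation = any(w in NEGATIONS for w in left)
    if has_modal and has_negation:
        return "neg-mod"
    if has_modal:
        return "mod"
    if has_negation:
        return "neg"
    return "none"

print(label_event("The treatment might not slow the disease".split(), 4))   # neg-mod
print(label_event("Vaccines do not cause autism".split(), 3))               # neg
print(label_event("Scientists discovered a new species".split(), 1))        # none
```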