Papers by Juan-Manuel Torres-Moreno
Lecture Notes in Computer Science, 2011
arXiv: Computation and Language, 2020
In this work we present a new small dataset in the Computational Creativity (CC) field: the Spanish Literary Sentences for emotion detection corpus (LISSS). We built this corpus of literary sentences to support the evaluation and design of emotion classification and detection algorithms. We constituted the corpus by manually classifying the sentences into a set of emotions: Love, Fear, Happiness, Anger and Sadness/Pain. We also present some baseline classification algorithms applied to our corpus. The LISSS corpus will be available to the community as a free resource for evaluating or creating CC-like algorithms.
In this paper, we describe enhancements to the Demanded Skills Diagnosis (DiCoDe: Diagnóstico de Competencias Demandadas), a system developed by Mexico City's Ministry of Labor and Employment Promotion (STyFE: Secretaría de Trabajo y Fomento del Empleo de la Ciudad de México) that seeks to reduce information asymmetries between job seekers and employers. The project uses web scraping techniques to retrieve job vacancies posted on private job portals on a daily basis, with the purpose of informing training and individual case management policies as well as labor market monitoring. To this end, a collaboration between STyFE and the Language Engineering Group (GIL: Grupo de Ingeniería Lingüística) was established to enhance DiCoDe by applying NLP models and semantic analysis. Through this collaboration, the macro-structure of DiCoDe's job vacancies system and its geographic referencing at the city hall (municipality) level were improved. More specifically, dictionaries were ...
Research in Computing Science, 2019
Today there are different systems capable of extracting digital document metadata. However, the absence of a defined structure for the distribution of metadata in documents from a digital art library presents a major problem, generally due to the style that each author or publisher decides to use both on the cover and inside the document. Although there are software tools that perform the task of extracting metadata, they focus only on structured documents such as journals, scientific articles, etc. Metadata is nothing more than structured data describing other data. This paper introduces the use of natural language techniques and typographic information of the text in the document for the extraction of metadata such as title, authors, publisher and date of publication. The results obtained in the evaluation with unstructured digital documents indicate the potential of the proposed approach, which is capable of producing good results in the extraction of metadata.
Research in Computing Science, 2017
In this paper we present Genex+ (Genex plus), an improved implementation of the Genex system (Générateur d'Exemples) [17], which uses the semantic closeness between certain fragments of text. The process is based on combining the successive extraction of concordances and collocations with the determination of semantic closeness using the Mutual Information method [24] and cosine calculus [3,22]. Results have shown that a good example is almost always associated with the definition of the word to which it refers, and that it can be extracted automatically by consecutive restriction of semantic fields applied to fragments of general-language corpora. This has the advantage of presenting a conceptual reformulation of what is said in the definition using highly informative textual fragments, but with a less formal register. For this second version, we implemented the changes required for a bilingual entry.
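The two measures named in the abstract can be sketched in a few lines. This is a generic illustration of pointwise Mutual Information and cosine similarity over bags of words, not the paper's exact implementation; the toy fragment is invented:

```python
import math
from collections import Counter

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy check: a fragment compared with itself has cosine similarity 1.0.
frag = Counter("a good example is associated with the definition".split())
score = cosine(frag, frag)
```

Restricting semantic fields then amounts to keeping only fragments whose PMI with the headword and cosine with the definition exceed some thresholds.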
arXiv (Cornell University), Dec 14, 2012
Previous works demonstrated that Automatic Text Summarization (ATS) by sentence extraction may be improved using sentence compression. In this work we present a sentence compression approach guided by sentence-level discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical compressed sentences. The approach is simple enough to be transposed to other languages.
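The LM-guided part of such an approach can be illustrated by scoring compression candidates with a smoothed bigram model. The toy corpus and candidates below are invented, and this is a sketch of the general technique, not the authors' actual model:

```python
import math
from collections import Counter

# Toy bigram counts standing in for a model trained on a large corpus.
corpus = "the cat sat on the mat . the cat slept .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def lm_logprob(tokens):
    """Laplace-smoothed bigram log-probability of a token sequence."""
    score = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + V))
    return score

# Between candidate compressions, keep the more probable one.
# (Note: unnormalized log-probability inherently favors shorter candidates.)
full = "the cat sat on the mat".split()
compressed = "the cat sat".split()
best = max([full, compressed], key=lm_logprob)
```

A real system would combine such an LM score with the discourse-segmentation constraints mentioned in the abstract before deleting any segment.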
Advances in Computational Intelligence, 2018
Sentence Boundary Detection (SBD) has been a major research topic since Automatic Speech Recognition transcripts began to be used for further Natural Language Processing tasks like Part-of-Speech Tagging, Question Answering or Automatic Summarization. But what about evaluation? Are standard evaluation metrics like precision, recall, F-score or classification error, and, more importantly, evaluation of an automatic system against a single reference, enough to conclude how well an SBD system is performing given the final application of the transcript? In this paper we propose Window-based Sentence Boundary Evaluation (WiSeBE), a semi-supervised metric for evaluating Sentence Boundary Detection systems based on multi-reference (dis)agreement. We evaluate and compare the performance of different SBD systems over a set of YouTube transcripts using WiSeBE and standard metrics. This double evaluation gives an understanding of how WiSeBE is a more reliable metric for the SBD task.
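The window-based, multi-reference idea can be sketched as follows: a hypothesized boundary counts as correct if it falls within a tolerance window of a reference boundary, and the score is averaged over several human references. This is a simplified illustration of the principle, not the exact WiSeBE formula:

```python
def window_f1(hyp, ref, window=1):
    """F1 where a hypothesis boundary matches any reference boundary within +/- window tokens."""
    tp = sum(1 for b in hyp if any(abs(b - r) <= window for r in ref))
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def multi_ref_score(hyp, refs, window=1):
    """Average window-based F1 over several human references."""
    return sum(window_f1(hyp, r, window) for r in refs) / len(refs)

# Two annotators' boundary positions (token indices) and one system output.
refs = [[10, 25, 40], [11, 26, 40]]
hyp = [10, 26, 39]
score = multi_ref_score(hyp, refs)
```

Averaging over references captures the point of the paper: a boundary that annotators themselves disagree on should not be scored as an all-or-nothing error.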
ArXiv, 2016
In this paper we describe a dynamic normalization process applied to social network multilingual documents (Facebook and Twitter) to improve the performance of the author profiling task for short texts. After the normalization process, character n-grams and POS-tag n-grams are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Experiments with SVM showed performance of up to 90%.
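The feature-extraction side of this pipeline, character n-grams, can be sketched in a few lines (the normalization, POS tagging and SVM steps are omitted; the example string is invented):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams, a common stylistic feature."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Stylistic cues such as character flooding surface directly in the n-grams.
feats = char_ngrams("soooo cool!!!")
```

These counts, concatenated with POS-tag n-gram counts, form the vector that a classifier such as an SVM would consume.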
TAC 2008 NIST, 2008
For the third participation of the LIA in the DUC–TAC conferences, two summarizers were developed. The first is based on the SMMR sentence scoring algorithm described in (Boudin et al., 2008). The second summarizer is a fusion of two sentence scoring methods: SMMR and a variable-length insertion gap n-term model (Favre et al., 2006; Boudin et al., 2007). We compare our two summarizers using TAC's manual and automatic assessments. The fusion achieves better automatic scores but lower manual scores than the SMMR system alone. This is likely due to an overfitting problem owing to a small training corpus (DUC 2007 update).
Ingénierie des systèmes d'information, 2006
... succ. Centre-ville Montréal (Québec) H3C 3A7 Canada {remy.kessler, juan-manuel.torres, marc.elbeze}@univ-avignon.fr ABSTRACT. ... poor. This therefore requires the use of robust and flexible automatic processing tools. ...
Lecture Notes in Computer Science, 2012
A Symbolic Approach for Automatic Detection of Nuclearity and Rhetorical Relations among Intra-sentence Discourse Segments in Spanish. [Extracted section outline: 3.2 RST Spanish Treebank; 4 Corpus analysis; 5 System implementation; 6 Experiments and evaluation; 7 Conclusions and future work.]
2013 12th Mexican International Conference on Artificial Intelligence, 2013
In this paper we revisit the Textual Energy model. We deal with the two major disadvantages of Textual Energy: the asymmetry of its distribution and the unboundedness of its maximum value. Although this model has been successfully used in several NLP tasks like summarization, clustering and sentence compression, no correction of these problems had been proposed until now. Concerning the maximum value, we analyze the computation of the Textual Energy matrix and conclude that energy values are dominated by lexical richness, growing quadratically with vocabulary size. Using the Box-Cox transformation, we show empirical evidence that a log transformation could correct both problems.
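A minimal numeric illustration of the growth problem and its correction, under the assumption that the energy matrix is computed as (A·Aᵀ)² from a binary sentence-by-term matrix (one common formulation of Textual Energy), with log(1 + x) standing in for the Box-Cox family:

```python
import math

# Binary sentence-by-term matrix A (rows: sentences, columns: vocabulary).
A = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 0]]

def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A_t = [list(col) for col in zip(*A)]       # transpose of A
G = matmul(A, A_t)                         # Gram matrix A·Aᵀ (shared-term counts)
E = matmul(G, G)                           # energy values grow quadratically with overlap
E_log = [[math.log1p(v) for v in row] for row in E]  # log transform tames the growth
```

On larger vocabularies, the raw entries of E explode while the log-transformed values stay in a narrow, comparable range, which is the behavior the abstract's correction targets.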
Computación y Sistemas, 2020
This paper aims to show that generating and evaluating summaries are two linked but different tasks, even when the same Divergence of the Probability Distribution (DPD) is used for both. This result allows the use of DPD functions for evaluating summaries automatically without references, and also for generating summaries without falling into inconsistencies.
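As an illustration of a reference-free DPD measure, the sketch below compares the word distribution of a candidate summary against its source document using the Jensen-Shannon divergence — one common choice for this role; the paper's exact divergence function may differ, and the texts are toy examples:

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Relative word frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocab}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two distributions."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: (p[w] + q[w]) / 2 for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source = "the model evaluates summaries without references".split()
summary = "the model evaluates summaries".split()
vocab = set(source) | set(summary)
score = js_divergence(distribution(summary, vocab), distribution(source, vocab))
```

A lower divergence means the summary's word distribution tracks the source more closely, which is what makes such a function usable both as a generation objective and as a reference-free evaluator.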
Technological Innovation in the Teaching and Processing of LSPs: Proceedings of TISLID'10, ISBN 9788436262179, pp. 301-310, 2011
In this paper we show that certain grammatical features, besides the lexicon, have strong potential to differentiate specialized texts from non-specialized texts. We have developed a tool including these features, trained using machine learning techniques based on association rules over two subcorpora (specialized vs. non-specialized), each divided into training and test corpora. We have evaluated this tool, and the results show that the strategy used is suitable for differentiating specialized texts from plain texts. These results can be considered an innovative perspective for research in domains related to terminology, specialized discourse and computational linguistics, with applications to the automatic compilation of Languages for Specific Purposes (LSP) corpora and the optimization of search engines, among others.
JAVA, 1996
We compare the generalization performance of the incremental algorithm Monoplan, which builds a neural network with one hidden layer of binary units, to that of other algorithms, not necessarily neural, on three well-known problems. In general, neural networks, and particularly those built with incremental approaches, produce the best generalizations. The networks built with Monoplan have a very small number of units, thanks to the performance of Minimerror, the algorithm used to train the individual neurons. This small size explains, at least in part, the excellent generalization results obtained with Monoplan.
Lecture Notes in Computer Science, 2008
The exponential growth of the Internet has allowed the development of a market of online job search sites. This paper presents the E-Gen system (Automatic Job Offer Processing system for Human Resources). E-Gen implements several complex tasks: analysis and categorization of job offers, which are unstructured text documents (e-mails of job offers, possibly with an attached document), and analysis and relevance ranking of candidate answers. We present a strategy to resolve the last task: after a process of filtering and lemmatisation, we use a vectorial representation and different similarity measures. The quality of the ranking obtained is evaluated using ROC curves.
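The ranking-evaluation step can be sketched with the probabilistic interpretation of the ROC AUC: the fraction of (relevant, irrelevant) candidate pairs that the similarity scores order correctly. The scores and relevance labels below are toy values, not data from the paper:

```python
def roc_auc(scores, labels):
    """AUC = P(score of a relevant item > score of an irrelevant one); ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Similarity scores of candidate answers vs. human relevance judgments (toy data).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

An AUC of 1.0 means every relevant candidate outranks every irrelevant one; 0.5 is chance-level ranking.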
Lecture Notes in Computer Science
This paper investigates the problem of automatic chemical Term Recognition (TR) and proposes to tackle it by fusing symbolic and statistical techniques. Unlike other solutions described in the literature, which rely only on complex and costly human-made rule-based matching algorithms, we show that the combination of a seven-rule matching algorithm and a naïve Bayes classifier achieves high performance. Through experiments performed on different kinds of available Organic Chemistry texts, we show that our hybrid approach is also consistent across different data sets.
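The statistical half of such a hybrid can be illustrated with a tiny naïve Bayes classifier over surface features of a candidate token. The features, feature names and training examples here are invented for illustration, not the paper's actual feature set:

```python
import math
from collections import defaultdict

def features(token):
    """Surface cues that often mark chemical terms (illustrative choices)."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "long": len(token) > 8,
    }

def train(examples):
    """examples: list of (token, label). Returns class counts and feature counts."""
    counts = defaultdict(lambda: defaultdict(int))
    class_counts = defaultdict(int)
    for token, label in examples:
        class_counts[label] += 1
        for name, value in features(token).items():
            counts[label][(name, value)] += 1
    return class_counts, counts

def predict(token, class_counts, counts):
    """Pick the class maximizing log prior + Laplace-smoothed log likelihoods."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n in class_counts.items():
        lp = math.log(n / total)
        for key, value in features(token).items():
            lp += math.log((counts[label][(key, value)] + 1) / (n + 2))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("2-methylbutane", "term"), ("1,3-dichlorobenzene", "term"),
               ("reaction", "other"), ("the", "other")])
```

In a hybrid system, the rule-based matcher would propose candidates and a classifier like this would filter or confirm them.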
Lecture Notes in Computer Science, 2009
The market of online job search sites grows exponentially. This implies that volumes of information (mostly in the form of free text) become impossible to process manually, so assisted analysis and categorization seem relevant to address this issue. We present E-Gen, a system which aims to perform assisted analysis and categorization of job offers and of the responses of candidates.
Lecture Notes in Computer Science, 2013
This paper presents experiments evaluating a statistical stemming algorithm based on morphological segmentation. The method estimates the affixality of word fragments by combining three indexes associated with possible cuts. This unsupervised and language-independent method has been easily adapted to generate an effective morphological stemmer. The stemmer has been coupled with Cortex, an automatic summarization system, in order to generate summaries in English, Spanish and French. Summaries have been evaluated using ROUGE. The results of this extrinsic evaluation show that our stemming algorithm outperforms several classical systems.