The article begins by briefly analysing what we understand by machine translation (MT). It then observes that the way of working that many translators have become accustomed to with translation memories can be reproduced perfectly with an MT system, provided that the system has certain characteristics, which are analysed. Finally, it mentions two complementary aspects of translation with MT systems: the use of controlled language in text production, and the relationship between multilingual text production and translation.
While sentiment analysis has become an established field in the NLP community, research into languages other than English has been hindered by the lack of resources. Although much research in multilingual and cross-lingual sentiment analysis has focused on unsupervised or semi-supervised approaches, these still require a large number of resources and do not reach the performance of supervised approaches. With this in mind, we introduce two datasets for supervised aspect-level sentiment analysis in Basque and Catalan, both of which are under-resourced languages. We provide high-quality annotations and benchmarks with the hope that they will be useful to the growing community of researchers working on these languages.
This article addresses the training available in media accessibility in Spain, focusing specifically on audio description, subtitling for the deaf, and sign language. It details the courses offered in this field at university and non-university level, as well as in companies. Finally, it presents a new proposal: the official master's degree in media accessibility.
Tradumàtica tecnologies de la traducció, Dec 30, 2020
Post-editing of machine translation (MT) has become a regular practice in the translation workflow, especially given the good quality results obtained by neural MT (NMT). This is linked to the efforts LSPs and customers have made to reduce costs in response to the recent global crisis and increasing globalization, which have had a negative impact on translators' revenues and on their working practices. In this context, translators often perceive post-editing with a negative bias. We study the attitudes of translators post-editing for the first time and relate them to their productivity rates. We also compare the results with a survey answered by professional post-editors assessing their perception of the task in the current marketplace.
How it fits together. As for interrelations within and among the three Modules, there are three types of relations: (1) overlapping parts, (2) some parts being a prerequisite for others, and (3) complementary parts. Note 1: Overlap is the case where the same topic is introduced twice in different modules, which cannot be avoided, since omitting it in one module or the other would result in an incomplete and incoherent description of the module in question (e.g. hardware components like scanner and printer are referred to in both Module A and Module B). [Figure: Module C, Language Engineering, ca. 55% of the overall size; Module B, DTP/IT for translators, ca. 30% of the overall size; possibility of specialization / deepening of knowledge.]
The recent improvements in neural MT (NMT) have driven a shift from statistical MT (SMT) to NMT. However, to assess the usefulness of MT models for post-editing (PE) and gain detailed insight into the output they produce, we need to analyse the most frequent errors and how they affect the task. We present a pilot study of a fine-grained analysis of MT errors based on post-editors' corrections for an English-to-Spanish medical text translated with SMT and NMT. We use the MQM taxonomy to compare the two MT models and obtain a categorized classification of the errors produced. Even though results show great variation among post-editors' corrections, for this language combination post-editors correct fewer errors in the NMT output. NMT also produces fewer accuracy errors and errors that are less critical.
We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora with annotations on opinion expressions across genres (news reports, blogs, product reviews and tweets) and covering multiple languages (English, Spanish, Catalan and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in languages other than English, and to build a unifying annotation framework for analyzing opinion in different genres, including controlled text, such as news reports, as well as different types of user-generated content (UGC). Given the broad design of the resource, most of the annotation effort was carried out on crowdsourcing platforms: Amazon Mechanical Turk and CrowdFlower. This created an excellent opportunity to investigate the feasibility of crowdsourcing methods for annotating large amounts of text in different languages.
Current state-of-the-art models for sentiment analysis make use of word order either explicitly by pre-training on a language modeling objective or implicitly by using recurrent neural networks (RNNs) or convolutional networks (CNNs). This is a problem for cross-lingual models that use bilingual embeddings as features, as the difference in word order between source and target languages is not resolved. In this work, we explore reordering as a pre-processing step for sentence-level cross-lingual sentiment classification with two language combinations (English-Spanish, English-Catalan). We find that while reordering helps both models, CNNs are more sensitive to local reorderings, while global reordering benefits RNNs.
In recent years, we have witnessed an increase in the use of post-editing of machine translation (PEMT) in the translation industry. It has been included as part of the translation workflow because it increases translators' productivity. Currently, many Language Service Providers offer PEMT as a service. For many years now, (closely) related languages have been post-edited using rule-based and phrase-based machine translation (MT) systems because they present fewer challenges due to their morphological and syntactic similarities. Given the recent popularity of neural MT (NMT), this paper analyzes the performance of this approach compared to phrase-based statistical MT (PBSMT) on in-domain and general-domain documents. We use standard automatic measures and temporal and technical effort to assess whether NMT yields a real improvement when it comes to post-editing the Spanish-Catalan language pair.
In this paper we discuss containers and other general nouns, and develop a proposal for representing them in a structured lexicon. We adopt a typed feature structure formalism and show that an underspecification analysis is appropriate in even more cases than those mentioned in the literature. This contributes to the simplification of the lexicon, postulating fewer lexical rules and avoiding a lot of redundancy. Our main data come from Catalan, but the results are applicable to many other languages (including English). The paper is organised as follows. In section 1 we present the Catalan data. In section 2 we discuss some of the previous proposals. Section 3 is devoted to developing our treatment, which is implemented in LKB. The main conclusions are given in section 4.
Emotion intensity prediction determines the degree or intensity of an emotion that the author expresses in a text, extending previous categorical approaches to emotion detection. While most previous work on this topic has concentrated on English texts, other languages would also benefit from fine-grained emotion classification, preferably without having to recreate the amount of annotated data available in English in each new language. Consequently, we explore cross-lingual transfer approaches for fine-grained emotion detection in Spanish and Catalan tweets. To this end we annotate a test set of Spanish and Catalan tweets using Best-Worst scaling. We compare six cross-lingual approaches, e.g., machine translation and cross-lingual embeddings, which have varying requirements for parallel data – from millions of parallel sentences to completely unsupervised. The results show that on this data, methods with low parallel-data requirements perform surprisingly better than methods that us...
Post-editing of machine translation (PEMT) is now widely used in the translation industry. This is due to the increase in the demand for translation and to the significant improvements in quality achieved by neural machine translation (NMT). PEMT has been included as part of the translation workflow because it increases translators' productivity and reduces costs. Although effective post-editing requires MT output of sufficient quality, standard automatic metrics do not always correlate with post-editing effort. We describe a standalone tool designed both for industry and research that has two main purposes: to collect sentence-level information from the post-editing process (e.g. post-editing time and keystrokes) and to present multiple evaluation scores visually so that a user can easily interpret them.
Cross-lingual transfer has improved greatly through multilingual language model pretraining, reducing the need for parallel data and increasing absolute performance. However, this progress has also brought to light the differences in performance across languages. Specifically, certain language families and typologies seem to consistently perform worse in these models. In this paper, we address what effects morphological typology has on zero-shot cross-lingual transfer for two tasks: part-of-speech (POS) tagging and sentiment analysis. We perform experiments on 19 languages from four language typologies (fusional, isolating, agglutinative, and introflexive) and find that transfer to another morphological type generally implies a higher loss than transfer to another language with the same morphological typology. Furthermore, POS tagging is more sensitive to morphological typology than sentiment analysis and, on this task, models perform much better on fusional languages than on the other t...
This paper assesses the role of multi-label classification in modelling polysemy for language acquisition tasks. We focus on the acquisition of semantic classes for Catalan adjectives, and show that polysemy acquisition naturally suits architectures used for multi-label classification. Furthermore, we explore the performance of information drawn from different levels of linguistic description, using feature sets based on morphology, syntax, semantics, and n-gram distribution. Finally, we demonstrate that ensemble classifiers are a powerful and adequate way to combine different types of linguistic evidence: a simple majority-voting ensemble classifier improves the accuracy from 62.5% (best single classifier) to 84%.
The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new machine translation paradigm, neural machine translation (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in current translation workflows because it usually increases the fluency and accuracy of the MT output. However, standard automatic measures do not always indicate the quality of the MT output, and there is still no clear correlation between PE effort and productivity. We present a quantitative analysis of different PE effort indicators for two NMT systems (transformer and seq2seq) for English-Spanish in-domain medical documents. We compare both systems and study the correlation between PE time and other scores. Results show less PE effort for the transformer NMT model and a high correlation between PE time and keystrokes.
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Attention-based deep learning systems have been demonstrated to be the state-of-the-art approach for aspect-level sentiment analysis. However, end-to-end deep neural networks lack flexibility, as one cannot easily adjust the network to fix an obvious problem, especially when more training data is not available, e.g. when the network always predicts positive on seeing the word "disappointed". Meanwhile, it is less often noted that the attention mechanism is likely to "over-focus" on particular parts of a sentence, while ignoring positions which provide key information for judging the polarity. In this paper, we describe a simple yet effective approach to leverage lexicon information so that the model becomes more flexible and robust. We also explore the effect of regularizing attention vectors to allow the network to have a broader "focus" on different parts of the sentence. The experimental results demonstrate the effectiveness of our approach.
Papers by Toni Badia