Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), 2021
Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English-Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.
Availability, Reliability and Security, Aug 17, 2021
Contact tracing apps used in tracing and mitigating the spread of COVID-19 have sparked discussions and controversies worldwide. The major concerns in relation to these apps are around privacy. Ireland was in general praised for the design of its COVID tracker app, and for the transparency with which privacy issues were addressed. However, the "voice" of the Irish public was not really heard or analysed. This study aimed to analyse Irish public sentiment towards privacy and the COVID tracker app. For this purpose, we conducted sentiment analysis on Twitter data collected from public Twitter accounts in the Republic of Ireland. We collected COVID-19 related tweets generated in Ireland from January 1, 2020 up to December 31, 2020 in order to perform sentiment analysis on this data set. Moreover, the study performed sentiment analysis on the feedback received from a national survey on privacy conducted in the Republic of Ireland. The findings of the study reveal significant criticism of the app relating to privacy concerns, but also to other aspects of the app. The findings also reveal some positive attitudes towards the fight against COVID-19, but these are not necessarily related to the technological solutions employed for this purpose.
Most Indian languages lack sufficient parallel data for Machine Translation (MT) training. In this study, we build English-to-Indian language Neural Machine Translation (NMT) systems using the state-of-the-art transformer architecture. In addition, we investigate the utility of back-translation and its effect on system performance. Our experimental evaluation reveals that the back-translation method helps to improve the BLEU scores for both English-to-Hindi and English-to-Bengali NMT systems. We also observe that back-translation is more useful in improving the quality of weaker baseline MT systems. In addition, we perform a manual evaluation of the translation outputs and observe that the BLEU metric cannot always assess MT quality as well as humans can. Our analysis shows that MT outputs for the English–Bengali pair are actually better than the BLEU scores suggest.
The way information spreads through society has changed significantly over the past decade with the advent of online social networking. Twitter, one of the most widely used social networking websites, is known as the real-time, public microblogging network where news breaks first. Most users love it for its iconic 140-character limitation and unfiltered feed that shows them news and opinions in the form of tweets. Tweets are usually multilingual in nature and of varying quality. However, machine translation (MT) of Twitter data is a challenging task, mainly for the following two reasons: (i) tweets are informal in nature (i.e., they violate linguistic norms), and (ii) parallel resources for Twitter data are scarcely available on the Internet. In this paper, we develop FooTweets, a first parallel corpus of tweets for the English–German language pair. We extract 4,000 English tweets from the FIFA 2014 World Cup and manually translate them into German with a special focus on the informal n...
Integrating Natural Language Processing (NLP) and computer vision is a promising effort. However, the applicability of these methods directly depends on the availability of specific multimodal data that include images and texts. In this paper, we present a multimodal corpus of comparable documents and their images in 9 languages from the web news articles of the Euronews website. This corpus has found widespread use in the NLP community in multilingual and multimodal tasks. Here, we focus on the acquisition of the image and text data and their multilingual alignment.
Sentiment classification has been crucial for many natural language processing (NLP) applications, such as the analysis of movie reviews, tweets, or customer feedback. A sufficiently large amount of data is required to build a robust sentiment classification system. However, such resources are not always available for all domains or for all languages. In this work, we propose employing a machine translation (MT) system to translate customer feedback into another language to investigate in which cases translated sentences can have a positive or negative impact on an automatic sentiment classifier. Furthermore, as performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated using a pivot MT system. We conduct several experiments using the above approaches to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sente...
This paper reports on a comparative evaluation of phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) for four language pairs, using the PET interface to compare educational domain output from both systems using a variety of metrics, including automatic evaluation as well as human rankings of adequacy and fluency, error-type markup, and post-editing (technical and temporal) effort, performed by professional translators. Our results show a preference for NMT in side-by-side ranking for all language pairs, texts, and segment lengths. In addition, perceived fluency is improved and annotated errors are fewer in the NMT output. Results are mixed for perceived adequacy and for errors of omission, addition, and mistranslation. Despite far fewer segments requiring post-editing, document-level post-editing performance was not found to have significantly improved in NMT compared to PBSMT. This evaluation was conducted as part of the TraMOOC project, which...
Social media platforms such as Twitter and Facebook are hugely popular websites through which Internet users can communicate and spread information worldwide. On Twitter, messages (tweets) are generated by users from all over the world in many different languages. Tweets about different events almost always encode some degree of sentiment. As is often the case in the field of language processing, sentiment analysis tools exist primarily in English, so if we want to understand the sentiment of the original tweets, we are forced to translate them from the source language into English and push the English translations through a sentiment analysis tool. However, Lohar et al. (2017) demonstrated that using freely available translation tools often caused the sentiment encoded in the original tweet to be altered. As a consequence, they built a series of sentiment-specific translation engines and pushed tweets containing either positive, neutral or negative sentiment through the appropri...
In this age of the digital economy, promoting organisations do their best to engage customers in the feedback provisioning process. With the assistance of customer insights, an organisation can develop a better product and provide a better service to its customers. In this paper, we analyse real-world samples of customer feedback from Microsoft Office customers in four languages, i.e., English, French, Spanish and Japanese, and arrive at a five-plus-one-class categorisation (comment, request, bug, complaint, meaningless and undetermined) for meaning classification. The task is to access multilingual corpora annotated by the proposed meaning categorisation scheme and develop a system to determine what class(es) the customer feedback sentences should be annotated as in four languages. We propose the following approaches to accomplish this task: (i) a multinomial naive Bayes (MNB) approach for multi-label classification, (ii) MNB with a one-vs-rest classifier approach, and (iii)...
Twitter has become an immensely popular platform where users can share information within a certain character limit (280 characters), which encourages them to deliver short and informal messages (tweets). In general, machine translation (MT) of tweets is a challenging task. However, for translating German tweets about football into English, it has been shown that a moderate translation performance in terms of the BLEU score can be achieved using phrase-based translation engines built on a tiny parallel Twitter data set [1]. In this work, we propose to further increase the translation quality using neural machine translation models and applying the following strategies: (i) we back-translate a set of out-of-domain English tweets released by "Harvard data set" in 2017 into German and add the synthetic parallel data to the tiny parallel data used in [1]; (ii) as tweets are short in general, we extract short text pairs from the large news-commentary parallel data and add it t...
We propose a novel method to bootstrap the construction of parallel corpora for new pairs of structurally different languages. We do so by combining the use of a pivot language and self-training. A pivot language enables the use of existing translation models to bootstrap the alignment, and a self-training procedure enables better alignment, both at the document and sentence level. We also propose several evaluation methods for the resulting alignment.
With the widespread use of social media and online forums, individual users have been able to actively participate in the generation of online content in different languages and dialects. Arabic is one of the fastest growing languages on the Internet, but dialects (like Egyptian and Saudi Arabian) account for a big share of the Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic, which cause many challenges for Machine Translation of informal Arabic. In this paper, we investigate the use of an automatic error correction method to improve the quality of Arabic user-generated texts and their automatic translation. Our experiments show that the new system with the automatic correction module outperforms the baseline system with a relative improvement of nearly 22.59%.
The Prague Bulletin of Mathematical Linguistics, 2016
FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel crosslingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition to this, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
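The combination of a word vector-based similarity with a term overlap-based similarity can be illustrated with a minimal sketch. This is not FaDA's code: the function names, the toy embeddings, the use of averaged vectors with cosine similarity, the Jaccard overlap, and the interpolation weight `alpha` are all assumptions made for illustration, and the sketch further assumes that the pseudo-query and the candidate documents have already been mapped into a shared vocabulary.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the embeddings of the tokens that have one (None if none do)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or all zeros."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_overlap(query_tokens, doc_tokens):
    """Jaccard overlap of the two term sets (an assumed overlap measure)."""
    q, d = set(query_tokens), set(doc_tokens)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def combined_score(query_tokens, doc_tokens, embeddings, alpha=0.5):
    """Linear interpolation of the two similarities; alpha is hypothetical."""
    vec_sim = cosine(mean_vector(query_tokens, embeddings),
                     mean_vector(doc_tokens, embeddings))
    return alpha * vec_sim + (1 - alpha) * term_overlap(query_tokens, doc_tokens)

def best_alignment(query_tokens, candidates, embeddings, alpha=0.5):
    """Return the name of the candidate document with the highest score."""
    return max(candidates,
               key=lambda name: combined_score(query_tokens, candidates[name],
                                               embeddings, alpha))

# Toy two-dimensional embeddings, invented for this example only.
toy_embeddings = {
    "football": [1.0, 0.0], "match": [0.9, 0.1], "goal": [0.8, 0.2],
    "election": [0.0, 1.0], "vote": [0.1, 0.9],
}
pseudo_query = ["football", "goal"]
candidate_docs = {"sports": ["match", "goal"], "politics": ["election", "vote"]}
print(best_alignment(pseudo_query, candidate_docs, toy_embeddings))
```

With `alpha` close to 1 the ranking is driven by the embedding similarity, and with `alpha` close to 0 by the shared terms; how the two measures are actually weighted in the tool is not stated in the abstract.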
Computational Linguistics and Intelligent Text Processing, 2014
Multi Lingual Snippet Generation (MLSG) systems provide users with snippets in multiple languages. However, collecting and managing documents in multiple languages in an efficient way is a difficult task and thereby makes this process more complicated. Fortunately, this requirement can be fulfilled in another way by translating the snippets from one language to another with the help of Machine Translation (MT) systems. The resulting system is called a Cross Lingual Snippet Generation (CLSG) system. This paper presents the development of a CLSG system by snippet translation when documents are available only in one language. We consider the English-Bengali language pair for snippet translation in one direction, English to Bengali. In this work, the main focus is on translating snippets using simpler MT concepts and excluding deeper ones. In experimental results, an average BLEU score of 14.26 and NIST score of 4.93 are obtained.
This paper reports the development of the first tagged resource for question answering research for a less computerized Indian language, namely Bengali. We developed a tagging scheme for annotating the questions based on their types. Expected answer type and question topical target are also marked to facilitate the answer search. Due to the scarcity of canonical documents on the web for Bengali, we could not take advantage of the web as a resource, and the major portion of the resource data was collected from authentic books. Six highly qualified annotators were involved in this rigorous work. At present, the resource contains 47 documents from three domains, namely history, geography and agriculture. Question answering based annotation was performed to prepare more than 2250 question-answer pairs. The inter-annotator agreement, measured with the non-weighted kappa statistic, is satisfactory.
Statistical Machine Translation (SMT) delivers a convenient format for representing how the translation process is modeled. The translations of words or phrases are generally computed based on their occurrence in some bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) words and less frequent words, especially when only limited training data are available or training and test data are in different domains. In this paper, we propose a convenient way to handle OOV and rare words using a paraphrasing technique. Initially we extract paraphrases from a bilingual training corpus with the help of comparable corpora. The extracted paraphrases are analyzed by conditionally checking the association of their monolingual distribution. Bilingual aligned paraphrases are incorporated as additional training data into the PB-SMT system. Integration of paraphrases into the PB-SMT system results in significant improvement.
2021 IEEE International Mediterranean Conference on Communications and Networking (MeditCom), 2021
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016
Comparable corpora have been shown to be useful in several multilingual natural language processing (NLP) tasks. Many previous papers have focused on how to improve the extraction of parallel data from this kind of corpus on different levels. In this paper, we are interested in improving the quality of bilingual comparable corpora in terms of an increased document alignment score. We describe our participation in the bilingual document alignment shared task of the First Conference on Machine Translation (WMT16). We propose a technique based on source-to-target sentence- and word-based scores and the fraction of matched source named entities. We performed our experiments on English-to-French document alignments for this bilingual task.
Proceedings of the 14thWorkshop on Building and Using Comparable Corpora (BUCC 2021), 2021
Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora... more Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English-Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.
availability reliability and security, Aug 17, 2021
Contact tracing apps used in tracing and mitigating the spread of COVID-19 have sparked discussio... more Contact tracing apps used in tracing and mitigating the spread of COVID-19 have sparked discussions and controversies worldwide. The major concerns in relation to these apps are around privacy. Ireland was in general praised for the design of its COVID tracker app, and the transparency through which privacy issues were addressed. However, the "voice" of the Irish public was not really heard or analysed. This study aimed to analyse the Irish public sentiment towards privacy and COVID tracker app. For this purpose we have conducted sentiment analysis on Twitter data collected from public Twitter accounts from Republic of Ireland. We collected COVID-19 related tweets generated in Ireland over a period of time from January 1, 2020 up to December 31, 2020 in order to perform sentiment analysis on this data set. Moreover, the study performed sentiment analysis on the feedback received from a national survey on privacy conducted in Republic of Ireland. The findings of the study reveal a significant criticism towards the app that relate to privacy concerns, but other aspects of the app as well. The findings also reveal some positive attitude towards the fight against COVID-19, but these are not necessarily related to the technological solutions employed for this purpose.
Most Indian languages lack sufficient parallel data for Machine Translation (MT) training. In thi... more Most Indian languages lack sufficient parallel data for Machine Translation (MT) training. In this study, we build English-to-Indian language Neural Machine Translation (NMT) systems using the state-of-the-art transformer architecture. In addition, we investigate the utility of back-translation and its effect on system performance. Our experimental evaluation reveals that the back-translation method helps to improve the BLEU scores for both English-to-Hindi and English-to-Bengali NMT systems. We also observe that back-translation is more useful in improving the quality of weaker baseline MT systems. In addition, we perform a manual evaluation of the translation outputs and observe that the BLEU metric cannot always analyse the MT quality as well as humans. Our analysis shows that MT outputs for the English–Bengali pair are actually better than that evaluated by BLEU metric.
Proceedings of the 14thWorkshop on Building and Using Comparable Corpora (BUCC 2021), 2021
Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora... more Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English-Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.
The way information spreads through society has changed significantly over the past decade with t... more The way information spreads through society has changed significantly over the past decade with the advent of online social networking. Twitter, one of the most widely used social networking websites, is known as the real-time, public microblogging network where news breaks first. Most users love it for its iconic 140-character limitation and unfiltered feed that show them news and opinions in the form of tweets. Tweets are usually multilingual in nature and of varying quality. However, machine translation (MT) of twitter data is a challenging task especially due to the following two reasons: (i) tweets are informal in nature (i.e., violates linguistic norms), and (ii) parallel resource for twitter data is scarcely available on the Internet. In this paper, we develop FooTweets, a first parallel corpus of tweets for English–German language pair. We extract 4, 000 English tweets from the FIFA 2014 world cup and manually translate them into German with a special focus on the informal n...
Integrating Natural Language Processing (NLP) and computer vision is a promising effort. However,... more Integrating Natural Language Processing (NLP) and computer vision is a promising effort. However, the applicability of these methods directly depends on the availability of a specific multimodal data that includes images and texts. In this paper, we present a collection of a Multimodal corpus of comparable document and their images in 9 languages from the web news articles of Euronews website.1 This corpus has found widespread use in the NLP community in Multilingual and multimodal tasks. Here, we focus on its acquisition of the images and text data and their multilingual alignment.
Sentiment classification has been crucial for many natural language processing (NLP) applications... more Sentiment classification has been crucial for many natural language processing (NLP) applications, such as the analysis of movie reviews, tweets, or customer feedback. A sufficiently large amount of data is required to build a robust sentiment classification system. However, such resources are not always available for all domains or for all languages. In this work, we propose employing a machine translation (MT) system to translate customer feedback into another language to investigate in which cases translated sentences can have a positive or negative impact on an automatic sentiment classifier. Furthermore, as performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated using a pivot MT system. We conduct several experiments using the above approaches to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sente...
This paper reports on a comparative evaluation of phrase-based statistical machine translation (P... more This paper reports on a comparative evaluation of phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) for four language pairs, using the PET interface to compare educational domain output from both systems using a variety of metrics, including automatic evaluation as well as human rankings of adequacy and fluency, error-type markup, and post-editing (technical and temporal) effort, performed by professional translators. Our results show a preference for NMT in side-by-side ranking for all language pairs, texts, and segment lengths. In addition, perceived fluency is improved and annotated errors are fewer in the NMT output. Results are mixed for perceived adequacy and for errors of omission, addition, and mistranslation. Despite far fewer segments requiring post-editing, document-level post-editing performance was not found to have significantly improved in NMT compared to PBSMT. This evaluation was conducted as part of the TraMOOC project, which...
Social media platforms such as Twitter and Facebook are hugely popular websites through which Int... more Social media platforms such as Twitter and Facebook are hugely popular websites through which Internet users can communicate and spread information worldwide. On Twitter, messages (tweets) are generated by users from all over the world in many different languages. Tweets about different events almost always encode some degree of sentiment. As is often the case in the field of language processing, sentiment analysis tools exist primarily in English, so if we want to understand the sentiment of the original tweets, we are forced to translate them from the source language into English and pushing the English translations through a sentiment analysis tool. However, Lohar et al. (2017) demonstrated that using freely available translation tools often caused the sentiment encoded in the original tweet to be altered. As a consequence, they built a series of sentiment-specific translation engines and pushed tweets containing either positive, neutral or negative sentiment through the appropri...
In this age of the digital economy, promoting organisations attempt their best to engage the cust... more In this age of the digital economy, promoting organisations attempt their best to engage the customers in the feedback provisioning process. With the assistance of customer insights, an organisation can develop a better product and provide a better service to its customer. In this paper, we analyse the real world samples of customer feedback from Microsoft Office customers in four languages, i.e., English, French, Spanish and Japanese and conclude a five-plus-one-classes categorisation (comment, request, bug, complaint, meaningless and undetermined) for meaning classification. The task is to %access multilingual corpora annotated by the proposed meaning categorization scheme and develop a system to determine what class(es) the customer feedback sentences should be annotated as in four languages. We propose following approaches to accomplish this task: (i) a multinomial naive bayes (MNB) approach for multi-label classification, (ii) MNB with one-vs-rest classifier approach, and (iii)...
Twitter has become an immensely popular platform where the users can share information within a c... more Twitter has become an immensely popular platform where the users can share information within a certain character limit (280 characters) which encourages them to deliver short and informal messages (tweets). In general, machine translation (MT) of tweets is a challenging task. However, for translating German tweets about football into English, it has been shown that a moderate translation performance in terms of the BLEU score can be achieved using the phrase-based translation engines built on a tiny parallel Twitter data set [1]. In this work, we propose to further increase the translation quality using the neural machine translation models and applying the following strategies: (i) we back translate a set of out-of-domain English tweets released by ”Harvard data set” in 2017 into German and add the synthetic parallel data to the tiny parallel data used in [1]; (ii) as tweets are short in general, we extract short text pairs from the large news-commentary parallel data and add it t...
We propose a novel method to bootstrap the construction of parallel corpora for new pairs of stru... more We propose a novel method to bootstrap the construction of parallel corpora for new pairs of structurally different languages. We do so by combining the use of a pivot language and self-training. A pivot language enables the use of existing translation models to bootstrap the alignment and a self-training procedure enables to achieve better alignment, both at the document and sentence level. We also propose several evaluation methods for the resulting alignment.
With the wide spread of the social media and online forums, individual users have been able to ac... more With the wide spread of the social media and online forums, individual users have been able to actively participate in the generation of online content in different languages and dialects. Arabic is one of the fastest growing languages used on Internet, but dialects (like Egyptian and Saudi Arabian) have a big share of the Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic which cause many challenges for Machine Translation of informal Arabic language. In this paper, we investigate the use of Automatic Error Correction method to improve the quality of Arabic User-Generated texts and its automatic translation. Our experiments show that the new system with automatic correction module outperforms the baseline system by nearly 22.59% of relative improvement.
The Prague Bulletin of Mathematical Linguistics, 2016
FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel cross-lingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
Computational Linguistics and Intelligent Text Processing, 2014
Multi Lingual Snippet Generation (MLSG) systems provide users with snippets in multiple languages. However, collecting and managing documents in multiple languages efficiently is a difficult task, which complicates this process. Fortunately, this requirement can be fulfilled in another way: by translating the snippets from one language to another with the help of Machine Translation (MT) systems. The resulting system is called a Cross Lingual Snippet Generation (CLSG) system. This paper presents the development of a CLSG system by snippet translation when documents are available only in one language. We consider the English-Bengali language pair for snippet translation in one direction (English to Bengali). In this work, the main focus is on translating snippets using simpler MT concepts, excluding deeper ones. In experimental results, an average BLEU score of 14.26 and a NIST score of 4.93 are obtained.
This paper reports the development of the first tagged resource for question answering research for a less computerized Indian language, namely Bengali. We developed a tagging scheme for annotating the questions based on their types. The expected answer type and question topical target are also marked to facilitate the answer search. Due to the scarcity of canonical documents on the web for Bengali, we could not take advantage of the web as a resource, and the major portion of the resource data was collected from authentic books. Six highly qualified annotators were involved in this rigorous work. At present, the resource contains 47 documents from three domains, namely history, geography and agriculture. Question answering based annotation was performed to prepare more than 2250 question-answer pairs. The inter-annotator agreement score, measured with the non-weighted kappa statistic, is satisfactory.
Statistical Machine Translation (SMT) delivers a convenient format for representing how the translation process is modeled. The translations of words or phrases are generally computed based on their occurrence in some bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) words and less frequent words, especially when only limited training data are available or when training and test data are in different domains. In this paper, we propose a convenient way to handle OOV and rare words using a paraphrasing technique. Initially, we extract paraphrases from a bilingual training corpus with the help of comparable corpora. The extracted paraphrases are analyzed by conditionally checking the association of their monolingual distribution. Bilingual aligned paraphrases are incorporated as additional training data into the PB-SMT system. Integration of paraphrases into the PB-SMT system results in significant improvement.
2021 IEEE International Mediterranean Conference on Communications and Networking (MeditCom), 2021
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016
Comparable corpora have been shown to be useful in several multilingual natural language processing (NLP) tasks. Many previous papers have focused on how to improve the extraction of parallel data from this kind of corpus on different levels. In this paper, we are interested in improving the quality of bilingual comparable corpora, as measured by an increased document alignment score. We describe our participation in the bilingual document alignment shared task of the First Conference on Machine Translation (WMT16). We propose a technique based on source-to-target sentence- and word-based scores and the fraction of matched source named entities. We performed our experiments on English-to-French document alignments for this bilingual task.
Papers by Pintu Lohar