This paper describes our methodologies for NTCIR-6 CLIR involving Korean and Japanese, and reports the official results for Stage 1 and Stage 2. We participated in three tracks: the K-K and J-J monolingual tracks and the J-K cross-lingual track. As in the previous year, we focus on handling segmentation ambiguities in Asian languages. To this end, we prepared multiple term representations for documents and queries, and merged their ranked results to generate the final ranking. In the official results, our Korean methodology ranked first in 6 of the 9 subtasks for Stage 2 and in 2 of the 3 subtasks for Stage 1. Even though our system is the same as the previous one, its final performance on NTCIR-3 to NTCIR-5 improves on our previous results after slight modifications to the feedback parameters.
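The merging of ranked results from multiple term representations, mentioned in the NTCIR-6 CLIR abstract above, can be illustrated with a minimal sketch. The abstract does not say which fusion method was used; the CombSUM-style sum of min-max normalized scores below is an assumption, and the function names and toy runs are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Scale one run's scores into [0, 1]; a constant run maps to 1.0 everywhere."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_runs(runs):
    """CombSUM-style fusion (an assumed method, not confirmed by the abstract).

    Each run maps document ids to retrieval scores obtained with one term
    representation. Normalized scores are summed per document, and documents
    are returned best first, with ties broken by document id.
    """
    fused = defaultdict(float)
    for run in runs:
        if not run:  # skip empty runs
            continue
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical runs from two term representations of the same query.
word_run = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
bigram_run = {"d2": 0.9, "d3": 0.8, "d4": 0.1}
final_ranking = merge_runs([word_run, bigram_run])
```

Normalizing before summing keeps a representation with a larger raw score range from dominating the merged ranking.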
Phrase Detectives, one of the first games-with-a-purpose for corpus annotation (www.phrasedetectives.org), went officially online on December 1st 2008, and one of its very first presentations in front of an NLP audience took place at the first edition of the “People’s Web Meets NLP” workshop in Singapore in 2009. The option of annotating Italian documents was added in 2010, and a Facebook version went live in January 2012. Although the project that funded its creation ended in September 2009, the game has stayed very much alive. In fact, it is getting more active all the time: we recently passed the 11,000-player mark and are about to reach 200,000 words of fully annotated documents, with a goal of annotating at least 1 million. In the talk I will discuss recent developments and analyze the results so far in terms of quality and quantity of annotated data and annotation costs. References: Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi, In Press Phr...
This paper describes UKP’s participation in the cross-lingual link discovery task at NTCIR-10 (CrossLink2). The task addressed in our work is to find valid anchor texts in a Chinese, Japanese, or Korean (CJK) Wikipedia page and retrieve the corresponding target Wiki pages in English. The CrossLink framework was developed from our previous CrossLink system, which works in the opposite direction of the language pairs, i.e., it discovers anchor texts in English Wikipedia pages and their corresponding targets in CJK languages. The framework consists of anchor selection, anchor ranking, anchor translation, and target discovery sub-modules. Each sub-module in the framework has been shown to work well both in monolingual settings and for English to CJK language pairs. We seek to find out whether the approach that worked very well for English to CJK would still work for CJK to English. We use the same experimental settings that were used in our previous participation, and our ...
This paper reports our experimental results at the NTCIR-6 English Patent Retrieval Subtask. Our previous participation in the Patent Retrieval Subtask revealed that patent applications, because of their length, require less smoothing of the document model than general documents such as newspaper articles. We set up an initial baseline retrieval system for U.S. patent applications and examine how it differs from that for Japanese patent applications.
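The link between document length and smoothing noted in the patent retrieval abstract above can be made concrete with a small sketch. The abstract does not name the smoothing method; Dirichlet prior smoothing of a unigram document language model is assumed here, and the function names, `mu` values, and document lengths are hypothetical.

```python
def dirichlet_prob(term_count, doc_len, collection_prob, mu):
    """P(term | document) under Dirichlet prior smoothing (assumed method).

    term_count      -- occurrences of the term in the document
    doc_len         -- total number of tokens in the document
    collection_prob -- P(term | collection), the background model
    mu              -- smoothing parameter; larger values mean more smoothing
    """
    return (term_count + mu * collection_prob) / (doc_len + mu)


def collection_weight(doc_len, mu):
    """Share of the smoothed estimate that comes from the collection model."""
    return mu / (doc_len + mu)


# Hypothetical lengths: a short news article versus a long patent application.
news_weight = collection_weight(doc_len=500, mu=1000)
patent_weight = collection_weight(doc_len=5000, mu=1000)
```

With the same `mu`, the longer document leans far less on the collection model, since its own term counts already give a more reliable estimate.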
We describe an opinion analysis system developed for the Multilingual Opinion Analysis Task at NTCIR-8. Given a topic and relevant newspaper articles, our system determines whether a sentence in the articles expresses an opinion. If so, we then extract the holder of the opinion. In the opinion judgment task, we constructed a phrase-level opinion expression extractor from a sentence-level annotated corpus. In the opinion holder extraction task, we used the probability that a word appears in an opinion holder and the dependency relationship between the word and the verb of the sentence.
Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations, and the training process of such systems is usually very long, often a few weeks, so experimentation plays a critical role and its results are important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss evaluation methodology, present our first findings, and finally outline further work. Our ultimate goal is to share our ex...
There is a need for real-time communication between the deaf and the hearing without the aid of an interpreter. Developing a machine translation (MT) system between sign and spoken languages is a multimodal task: sign language is a visual language, so the task involves the automatic recognition and translation of video images. In this paper, we present the research we have been carrying out to build an automated sign language recognizer (ASLR), which is the core component of an MT system between American Sign Language (ASL) and English. Developing an ASLR is challenging because of the lack of sufficient quantities of annotated ASL-English parallel corpora for training, testing, and developing it. This paper describes the research we have been conducting to explore a range of techniques for automatically generating synthetic data from existing datasets to improve the accuracy of ASLR. This work involved experimentation with several algorithms with var...
Subjectivity analysis is a rapidly growing field of study. Along with its applications to various NLP tasks, much work has gone into multilingual subjectivity learning from existing resources. Multilingual subjectivity analysis requires language-independent criteria for comparable outcomes across languages. This paper proposes to measure the multilanguage-comparability of subjectivity analysis tools, and provides meaningful comparisons of multilingual subjectivity analysis from various points of view.
We propose a reordering method to improve the fluency of the output of a phrase-based SMT (PBSMT) system. We parse the translation results, which follow the source language order, into non-projective dependency trees, then reorder the dependency trees to obtain fluent target sentences. Our method ensures that the translation results are grammatically correct and achieves major improvements over PBSMT on dependency-based metrics.
Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks need very large amounts of data as well as many training iterations to achieve state-of-the-art performance. This results in very high computation cost, slowing down research and industrialisation. In this paper, we propose to alleviate this problem with several training methods based on data boosting and bootstrap, with no modifications to the neural network. The approach imitates the learning process of humans, who typically spend more time learning “difficult” concepts than easier ones. We experiment on an English-French translation task, showing accuracy improvements of up to 1.63 BLEU while saving 20% of training time.
We describe an opinion analysis system developed for the Multilingual Opinion Analysis Task at NTCIR-7. Given a topic and relevant newspaper articles, our system determines whether a sentence in the articles carries an opinion and, if so, extracts the polarity and holder of the opinion. Our system uses subjectivity lexicons to score the sentiment weight of a word, together with a weight that reflects the discriminating power of the word. We borrow some techniques from information retrieval, because discovering the importance and discriminating power of a word in a collection of documents is a commonly addressed issue in information retrieval tasks. We also use our own set of heuristics that are more specific to the task. Our system achieves high performance overall, with exceptional performance on polarity judgment of sentences.
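The combination of a lexicon-based sentiment weight with a weight for a word's discriminating power, described in the NTCIR-7 abstract above, can be sketched as follows. The abstract does not give the weighting formula; smoothed inverse document frequency (IDF) is assumed as the information retrieval technique, and the lexicon, documents, and function names are hypothetical.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency; rarer terms get larger weights."""
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((len(documents) + 1) / (doc_freq + 1)) + 1.0


def sentence_score(tokens, lexicon, documents):
    """Sum of lexicon sentiment weights, each scaled by the term's IDF.

    Positive totals suggest positive polarity, negative totals negative
    polarity. Words absent from the lexicon contribute nothing.
    """
    return sum(lexicon.get(tok, 0.0) * idf(tok, documents) for tok in tokens)


# Hypothetical toy collection (each document as a set of terms) and lexicon.
documents = [{"good", "movie"}, {"bad", "movie"}, {"movie", "plot"}]
lexicon = {"good": 1.0, "bad": -1.0}

positive = sentence_score(["good", "movie"], lexicon, documents)
negative = sentence_score(["bad", "plot"], lexicon, documents)
```

Scaling by IDF lets rare, opinion-bearing words count for more than words that occur in nearly every document.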
This paper introduces a cross-language information retrieval (CLIR) framework that combines the state-of-the-art keyword-based approach with a latent semantic-based retrieval model. To capture and analyze the hidden semantics in cross-lingual settings, we construct latent semantic models that map text in different languages into a shared semantic space. Our proposed framework consists of a deep belief network (DBN) for each language, and we employ canonical correlation analysis (CCA) to construct the shared semantic space. We evaluated the proposed CLIR approach on a standard ad hoc CLIR dataset, and we show that cross-lingual semantic analysis with DBN and CCA improves on state-of-the-art keyword-based CLIR performance.
Papers by Jungi Kim