This paper studies label augmentation for training dialogue response selection. Existing models are trained with "observational" annotation, where one observed response is annotated as gold. In this paper, we propose "counterfactual augmentation" of pseudo-positive labels. We validate that the effectiveness of the augmented labels is comparable to that of true positives, such that our method outperforms state-of-the-art models trained without augmentation.
Diffusion-based generative models have achieved remarkable success in various domains. These models train a shared network on denoising tasks that span different noise levels simultaneously, a form of multi-task learning (MTL). However, analyzing and improving diffusion models from an MTL perspective remains underexplored. In particular, MTL can sometimes lead to the well-known phenomenon of negative transfer, in which conflicts between tasks degrade the performance of certain tasks. In this paper, we first analyze diffusion training from an MTL standpoint, presenting two key observations: (O1) the task affinity between denoising tasks diminishes as the gap between noise levels widens, and (O2) negative transfer can arise even in diffusion training. Building upon these observations, we aim to enhance diffusion training by mitigating negative transfer. To achieve this, we propose leveraging existing MTL methods, but the huge number of denoising tasks makes it computationally expensive to calculate the necessary per-task loss or gradient. To address this challenge, we propose clustering the denoising tasks into small task clusters and applying MTL methods to them. Specifically, based on (O1), we employ interval clustering to enforce temporal proximity among the denoising tasks within each cluster. We show that interval clustering can be solved with dynamic programming, using signal-to-noise ratio, timestep, and task affinity as clustering objectives. Through this, our approach addresses negative transfer in diffusion models while allowing efficient computation of MTL methods. We validate the proposed clustering and its integration with MTL methods through various experiments, demonstrating improved sample quality of diffusion models. Our project page is available at https://gohyojun15.github.io/ANT_diffusion/.
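To make the clustering step concrete, below is a minimal sketch of interval clustering over diffusion timesteps solved by dynamic programming, as the abstract describes. The within-cluster cost (variance of log-SNR) and the cluster count are illustrative assumptions; the paper's objectives also involve timestep- and task-affinity-based criteria.

```python
# A minimal sketch, assuming a per-timestep log signal-to-noise ratio and
# a within-interval cost of variance. Partitions timesteps into k
# contiguous intervals with minimum total cost via dynamic programming.
import numpy as np

def interval_cluster(log_snr, k):
    T = len(log_snr)
    x = np.asarray(log_snr, dtype=float)
    s1 = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums of x
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])   # prefix sums of x^2

    def cost(a, b):  # variance of x[a..b], computed in O(1) via prefix sums
        n = b - a + 1
        mean = (s1[b + 1] - s1[a]) / n
        return (s2[b + 1] - s2[a]) / n - mean ** 2

    INF = float("inf")
    dp = [[INF] * (k + 1) for _ in range(T + 1)]   # dp[i][j]: first i steps, j intervals
    cut = [[0] * (k + 1) for _ in range(T + 1)]
    dp[0][0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, min(i, k) + 1):
            for m in range(j - 1, i):              # last interval is [m, i-1]
                c = dp[m][j - 1] + cost(m, i - 1)
                if c < dp[i][j]:
                    dp[i][j], cut[i][j] = c, m

    bounds, i = [], T                              # backtrack interval boundaries
    for j in range(k, 0, -1):
        m = cut[i][j]
        bounds.append((m, i - 1))
        i = m
    return bounds[::-1]

# Example: 100 timesteps with a linearly decaying log-SNR, 5 clusters
print(interval_cluster(np.linspace(8.0, -8.0, 100), 5))
```

Because the intervals are contiguous by construction, the resulting clusters automatically satisfy the temporal-proximity constraint motivated by (O1).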
Text classification in education, usually called auto-tagging, is the automated process of assigning relevant tags to educational content, such as questions and textbooks. However, auto-tagging suffers from a data scarcity problem, which stems from two major challenges: 1) it possesses a large tag space and 2) it is multi-label. Although retrieval approaches are reportedly effective in low-resource scenarios, there have been few efforts to directly address the data scarcity problem. To mitigate these issues, we propose a novel retrieval approach, CEAA, that provides effective learning in educational text classification. Our main contributions are as follows: 1) we leverage transfer learning from question-answering datasets, and 2) we propose a simple but effective data augmentation method that introduces cross-encoder style texts into a bi-encoder architecture for more efficient inference. An extensive set of experiments shows that the proposed method is effective in multi-label scenarios and on low-resource tags compared to state-of-the-art models.
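As a rough illustration of why the bi-encoder is the right target for efficient inference, and what a cross-encoder style text looks like, here is a hedged sketch; the toy hashing encoder and the "[SEP]" concatenation format are assumptions standing in for the paper's PLM-based setup.

```python
# Contrasts cross-encoder and bi-encoder scoring for auto-tagging: with
# M tags, a cross-encoder needs one forward pass per (text, tag) pair,
# while a bi-encoder precomputes tag vectors once. Purely illustrative.
import numpy as np

def toy_encode(text, dim=64):
    """Toy bag-of-words hashing encoder (stand-in for a real PLM)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

tags = ["linear algebra", "plane geometry", "probability", "calculus limits"]
question = "What is the probability of rolling two sixes?"

# Bi-encoder: tag vectors are precomputed once, each query needs one pass.
tag_vecs = np.stack([toy_encode(t) for t in tags])
scores = tag_vecs @ toy_encode(question)

# Cross-encoder style input: text and tag concatenated into one sequence,
# which captures text-tag interactions but costs one pass per candidate.
cross_inputs = [f"{question} [SEP] {t}" for t in tags]

print(sorted(zip(tags, scores), key=lambda x: -x[1])[:2])
```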
Question generation (QG) is the task of generating a valid and fluent question based on a given context and the target answer. Depending on the purpose, instructors can ask questions about different concepts even given the same context, and the same concept can be phrased in different ways. However, QG evaluation usually depends on single-reference similarity metrics, such as n-gram-based or learned metrics, which are not sufficient to fully evaluate the potential of QG methods. To this end, we propose to paraphrase the reference question for a more robust QG evaluation. Using large language models such as GPT-3, we create semantically and syntactically diverse questions, and then adopt a simple aggregation of the popular evaluation metrics as the final score. Through our experiments, we find that using multiple (pseudo) references is more effective for QG evaluation, showing a higher correlation with human judgments than evaluation with a single reference.
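A minimal sketch of the multi-reference recipe: paraphrases of the gold question serve as pseudo-references, and the final score aggregates per-reference similarities. The token-level F1 below is a stand-in for the popular metrics the abstract mentions, and the paraphrases are hand-written here rather than GPT-3 outputs.

```python
# Multi-reference scoring: aggregate (here: max) a similarity metric
# over several pseudo-references instead of one gold question.
from collections import Counter

def token_f1(hyp, ref):
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def multi_ref_score(hypothesis, references, aggregate=max):
    return aggregate(token_f1(hypothesis, ref) for ref in references)

gold = "What causes rain to form in clouds?"
# In the paper's setting these paraphrases would come from an LLM (GPT-3).
pseudo_refs = [gold,
               "How does rain form inside clouds?",
               "What makes clouds produce rain?"]
generated = "How is rain formed in a cloud?"
print(multi_ref_score(generated, pseudo_refs))  # multi-reference score
print(token_f1(generated, gold))                # vs. single-reference score
```

A valid question phrased differently from the single gold reference is no longer unfairly penalized, since some paraphrase is likely to match its wording.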
Large Pre-trained Language Models (PLMs) have become the most desirable starting point in the field of NLP, as they have become remarkably good at solving many individual tasks. Despite such success, in this paper we argue that current paradigms of working with PLMs neglect a critical aspect of modeling human intelligence: functional compositionality. Functional compositionality, the ability to compose learned tasks, has been a long-standing challenge in the field of AI (and many other fields), as it is considered one of the hallmarks of human intelligence. An illustrative example is cross-lingual summarization, where a bilingual person (English-French) could directly summarize an English document into French sentences without explicitly translating the English document or summary into French first. We discuss why this is an important open problem that requires further attention from the field. Then, we show that current PLMs (e.g., GPT-2 and T5) do not yet exhibit functional compositionality and are far from human-level generalizability. Finally, we suggest several research directions that could push the field towards zero-shot functional compositionality of language models.
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021
Classification datasets are often biased in their observations, leaving only a few observations for minority classes. Our key contribution is detecting and reducing Under-represented (U-) and Over-represented (O-) artifacts arising from dataset imbalance, by proposing a Counterfactual Generative Smoothing approach with variants in both the feature space and the data space. Our technical contribution is smoothing majority and minority observations by sampling a majority seed and transferring it to the minority class. The proposed approaches not only outperform the state of the art on both synthetic and real-life datasets, but also effectively reduce both artifact types.
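The majority-to-minority transfer idea can be pictured with a loose feature-space sketch, assuming a simple centroid-shift rule in place of the paper's learned generative smoothing; everything below is illustrative.

```python
# Sketch: sample a majority seed's feature vector and shift it toward
# the minority class by the difference of class centroids, synthesizing
# pseudo-minority points. The centroid-shift rule is an assumption;
# the paper's CGS variants use learned generative smoothing instead.
import numpy as np

rng = np.random.default_rng(0)
maj = rng.normal(loc=[0, 0], scale=1.0, size=(500, 2))   # majority class
mino = rng.normal(loc=[4, 4], scale=1.0, size=(20, 2))   # minority class

delta = mino.mean(axis=0) - maj.mean(axis=0)  # direction majority -> minority

def smooth_minority(n_new):
    seeds = maj[rng.choice(len(maj), size=n_new)]    # sample majority seeds
    alpha = rng.uniform(0.7, 1.0, size=(n_new, 1))   # partial transfer
    return seeds + alpha * delta                     # transferred samples

augmented_minority = np.vstack([mino, smooth_minority(100)])
print(augmented_minority.shape)  # (120, 2)
```

Because the synthesized points inherit the diversity of majority seeds, the minority region is smoothed rather than merely duplicated, which is the intuition behind reducing both U- and O-artifacts.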
Proceedings of the AAAI Conference on Artificial Intelligence
This paper studies the problem of multilingual causal reasoning in resource-poor languages. Existing approaches, which translate into the most probable resource-rich language such as English, suffer from translation and language gaps between different cultural areas, leading to a loss of causality. To overcome these challenges, our goal is to identify key techniques for constructing a new causality network of cause-effect terms targeted at machine-translated English, without any language-specific knowledge of the resource-poor languages. In our evaluations on three languages, Korean, Chinese, and French, our proposed method consistently outperforms all baselines, achieving up to 69.0% reasoning accuracy, close to the state-of-the-art accuracy of 70.2% achieved on English.
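As a simplified picture of reasoning over such a causality network, the sketch below scores COPA-style alternatives by the causal association between premise terms and alternative terms; the toy network and the averaging rule are assumptions, not the paper's construction.

```python
# Edges of the toy network carry co-occurrence counts of (cause term,
# effect term) pairs, e.g. mined from "because"-style patterns; an
# alternative is scored by average association with the premise terms.
from itertools import product

causal_net = {("rain", "wet"): 40, ("rain", "umbrella"): 25,
              ("fire", "smoke"): 50, ("exercise", "tired"): 30}

def causal_score(premise_terms, alt_terms):
    pairs = list(product(premise_terms, alt_terms))
    return sum(causal_net.get(p, 0) for p in pairs) / max(len(pairs), 1)

premise = ["rain"]
alt1, alt2 = ["umbrella"], ["smoke"]
# Pick the more plausible effect of the premise
best = max([alt1, alt2], key=lambda a: causal_score(premise, a))
print(best)  # ['umbrella']
```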
This paper studies the dialogue response selection task. As state-of-the-art methods are neural models requiring a large training set, data augmentation is essential to overcome the sparsity of observational annotation, where only one observed response is annotated as gold. In this paper, we propose counterfactual augmentation: considering whether unobserved utterances would "counterfactually" replace the labelled response for the given context, and augmenting only when that is the case. We empirically show that our pipeline improves BERT-based models on two different response selection tasks without incurring annotation overheads.
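The augmentation test can be sketched as follows: an unobserved utterance becomes a pseudo-positive only if a matching model scores it at least as highly for the context as the labelled gold response. The overlap scorer below is a stand-in for the BERT-based matcher, and the margin rule is an assumption.

```python
# Promote an unobserved utterance to pseudo-positive only if it would
# "counterfactually" replace the gold response for this context, i.e.
# it scores at least as well under a relevance model.
def augment_pseudo_positives(context, gold_response, candidates, score_fn,
                             margin=0.0):
    gold_score = score_fn(context, gold_response)
    return [u for u in candidates
            if score_fn(context, u) >= gold_score - margin]

# Toy relevance scorer (stand-in for a BERT-based matcher): token overlap.
def overlap_score(context, response):
    c, r = set(context.lower().split()), set(response.lower().split())
    return len(c & r) / max(len(r), 1)

ctx = "do you want coffee or tea ?"
gold = "coffee please"
pool = ["tea sounds good", "the train is late", "i want tea"]
print(augment_pseudo_positives(ctx, gold, pool, overlap_score))
# -> ['i want tea'] is augmented; the irrelevant utterances are not
```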
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
We aim to leverage human and machine intelligence together for attention supervision. Specifically, we show that human annotation cost can be kept reasonably low, while annotation quality can be enhanced by machine self-supervision. To this end, we explore the advantage of counterfactual reasoning over the associative reasoning typically used in attention supervision. Our empirical results show that this machine-augmented human attention supervision is more effective than existing methods requiring a higher annotation cost, on text classification tasks including sentiment analysis and news categorization.
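The contrast between associative and counterfactual reasoning can be made concrete with token importance: associative importance reads off attention weights, while counterfactual importance asks how the prediction changes when a token is removed. Below is a small deletion-based sketch of the latter; the model interface (predict_proba) is an assumed stand-in.

```python
# Deletion-based counterfactual importance: the importance of a token is
# the drop in the target-class probability when the token is removed.
def counterfactual_importance(tokens, predict_proba, target_class):
    base = predict_proba(tokens)[target_class]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]       # remove token i
        scores.append(base - predict_proba(reduced)[target_class])
    return scores

# Toy sentiment model: probability of "positive" rises with "great".
def toy_predict(tokens):
    p_pos = min(0.5 + 0.4 * tokens.count("great"), 0.99)
    return {"positive": p_pos, "negative": 1 - p_pos}

print(counterfactual_importance(["the", "movie", "was", "great"],
                                toy_predict, "positive"))
# -> 'great' gets 0.4, all other tokens ~0.0
```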
The neural attention mechanism has been used as a form of explanation for model behavior. Users can either passively consume these explanations or actively disagree with them and supervise attention toward more appropriate values (attention supervision). Though attention supervision has been shown to be effective in some tasks, we find that existing attention supervision is biased. To debias it, we propose a counterfactual method that estimates the missing observations and augments the existing supervision, contributing to accuracy gains. We validate the effectiveness of our counterfactual supervision on widely adopted image benchmark datasets: CUFED and PEC.
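A hedged sketch of what attention supervision looks like as a training objective: the model's attention distribution is pulled toward a target distribution via a KL term added to the task loss. How the debiased target is constructed from counterfactual estimates is the paper's contribution; here the target is simply given, and the loss shape is an assumption.

```python
# Attention supervision as an auxiliary loss: task loss plus a KL term
# pulling the model's attention toward a (debiased) target distribution.
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_logits, target_dist, task_loss, lam=0.5):
    """task_loss + lam * KL(target || model attention)."""
    log_attn = F.log_softmax(attn_logits, dim=-1)
    kl = F.kl_div(log_attn, target_dist, reduction="batchmean")
    return task_loss + lam * kl

attn_logits = torch.tensor([[0.2, 1.5, -0.3, 0.1]], requires_grad=True)
# Debiased target: mass concentrated on the token deemed truly causal.
target = torch.tensor([[0.05, 0.85, 0.05, 0.05]])
loss = attention_supervision_loss(attn_logits, target,
                                  task_loss=torch.tensor(0.7))
loss.backward()
print(loss.item(), attn_logits.grad)
```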
Proceedings of the 28th International Conference on Computational Linguistics
In this paper, we study review generation given a set of attribute identifiers: user ID, product ID, and rating. This is a difficult subtask of natural language generation, since models are limited to the given identifiers, with no specific descriptive information about the inputs, when generating the text. The capacity of these models is thus confined by, and dependent on, how well they can capture vector representations of the attributes. We therefore propose to additionally leverage references, selected from a large pool of texts labeled with one of the attributes, as textual information that enriches the inductive biases of the given attributes. With these references, we can pose the problem as an instance of text-to-text generation, which makes the task easier since texts that are syntactically and semantically similar to the output are provided as inputs. Using this framework, we address issues such as selecting references from a large candidate set without textual context and managing model complexity for generation. Our experiments show that our models improve over previous approaches on both automatic and human evaluation metrics.
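A rough sketch of the reference selection step: given target attributes, pull texts from the pool that share attributes with the target, preferring matches on user and product over rating alone. The preference order is an assumed heuristic, not the paper's selection model.

```python
# Select reference reviews from a pool by attribute match: more shared
# attributes rank first; user/product matches outrank rating-only ones.
def select_references(target, pool, k=2):
    """target: dict of attributes; pool: list of (attrs, text) pairs."""
    def match_rank(attrs):
        shared = [a for a in ("user", "product", "rating")
                  if attrs.get(a) == target.get(a)]
        return (-len(shared), "user" not in shared, "product" not in shared)
    matching = [(attrs, text) for attrs, text in pool
                if any(attrs.get(a) == target.get(a) for a in target)]
    ranked = sorted(matching, key=lambda p: match_rank(p[0]))
    return [text for _, text in ranked][:k]

pool = [({"user": "u1", "product": "p9", "rating": 5}, "Loved this blender."),
        ({"user": "u2", "product": "p3", "rating": 5}, "Five stars, works well."),
        ({"user": "u1", "product": "p3", "rating": 2}, "Too noisy for me.")]
print(select_references({"user": "u1", "product": "p3", "rating": 5}, pool))
```

The selected texts are then concatenated with the attribute identifiers as input, turning attribute-to-text generation into the easier text-to-text setting the abstract describes.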
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
This paper studies the keyphrase generation (KG) task for scenarios where structure plays an important role. For example, a scientific publication consists of a short title and a long body, where the title can be used to de-emphasize unimportant details in the body. Similarly, for short social media posts (e.g., tweets), scarce context can be augmented from titles, though these are often missing. Our contribution is generating/augmenting structure and then encoding this information, using existing keyphrases of other documents to complement missing or incomplete titles. Specifically, we first extend the given document with related but absent keyphrases drawn from existing keyphrases, to augment missing context (generating structure), and then build a graph of the keyphrases and the given document, to obtain a structure-aware representation of the augmented text (encoding structure). Our empirical results validate that the proposed structure augmentation and structure-aware encoding can improve KG in both scenarios, outperforming the state of the art.
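The "generating structure" step can be sketched as follows: for a document with a missing title, retrieve keyphrases of similar documents and prepend the related-but-absent ones as a pseudo-title before encoding. The overlap-based retrieval and top-k choice below are illustrative assumptions; the paper's graph encoding is omitted.

```python
# Build a pseudo-title from related-but-absent keyphrases retrieved
# from a bank of (document tokens, keyphrases) pairs by token overlap.
def augment_structure(body, keyphrase_bank, top_k=3):
    body_toks = set(body.lower().split())
    scored = []
    for doc_toks, phrases in keyphrase_bank:
        overlap = len(body_toks & set(doc_toks))
        if overlap == 0:
            continue                               # unrelated document
        for p in phrases:
            if p.lower() not in body.lower():      # related but absent
                scored.append((overlap, p))
    pseudo_title = [p for _, p in sorted(scored, reverse=True)[:top_k]]
    return " ; ".join(pseudo_title) + " [SEP] " + body

bank = [({"transformer", "attention", "translation"},
         ["neural machine translation", "self-attention"]),
        ({"graph", "node", "embedding"}, ["graph neural network"])]
print(augment_structure("attention models for translation quality", bank))
```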
This paper proposes the task of Visual COPA (VCOPA). Given a premise image and two alternative images, the task is to identify the more plausible alternative in their commonsense causal context. The VCOPA task is designed so that a successful system needs a more detailed understanding of the image, commonsense knowledge, and more complex causal reasoning than state-of-the-art AI techniques provide. To that end, we generate an evaluation dataset containing 380 VCOPA questions and over 1K images on various topics, which is amenable to automatic evaluation, and present the performance of baseline reasoning approaches as initial benchmarks for future systems.