Papers by Pratyay Banerjee
Computer Vision – ECCV 2020
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. The dataset most commonly used to evaluate knowledge-based VQA is OK-VQA, but it lacks a gold-standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of the varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on the given knowledge. We introduce various ways to retrieve knowledge using text and images, and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge. The code and corpus are provided in this link. [Figure: qualitative OK-VQA examples comparing the answers of LXMERT, LXMERT + Caption, and our model, alongside the retrieved knowledge sentence and the image caption.]
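Below is a minimal sketch of the retriever stage, assuming a simple term-based (TF-IDF) retriever as one instance of text-based knowledge retrieval; the paper also retrieves with images, which is not shown here. The toy corpus, query, and top-k cutoff are illustrative and are not the released OK-VQA corpus.

```python
# A term-based retriever over a toy knowledge corpus; the reader
# (classification- or extraction-style) would consume the returned sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_corpus = [
    "a fire engine, also called a fire truck, is used in firefighting.",
    "surfing was invented in hawaii.",
    "a fire hydrant supplies water for fighting fires.",
]

def retrieve(question: str, caption: str, k: int = 2) -> list[str]:
    """Score knowledge sentences against the question plus image caption."""
    query = question + " " + caption
    vec = TfidfVectorizer().fit(knowledge_corpus + [query])
    scores = cosine_similarity(vec.transform([query]),
                               vec.transform(knowledge_corpus))[0]
    ranked = sorted(zip(scores, knowledge_corpus), reverse=True)
    return [sentence for _, sentence in ranked[:k]]

print(retrieve("What sort of vehicle used this item?",
               "a red fire hydrant sitting on the side of a road."))
```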
ArXiv, 2021
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine. While modern decompilers can reconstruct and recover much information that is discarded during compilation, inferring variable names is still extremely difficult. Inspired by recent advances in natural language processing, we propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT. Our solution takes raw decompiler output, the less semantically meaningful code, as input, and enriches it using our proposed fine-tuning technique, Constrained Masked Language Modeling. Using Constrained Masked Language Modeling introduces the challenge of predicting the number of masked tokens for the original variable name. We address this token-count prediction challenge with our post-processing algorithm. Compared to the state-of-the-art...
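A minimal sketch of the token-count issue the abstract mentions, under my assumption that one enumerates candidate mask counts and keeps the filling the masked LM scores highest; `score_filling` is a hypothetical stand-in for a real masked-LM scorer such as BERT.

```python
from typing import Callable

def predict_variable_name(
    code_tokens: list[str],
    var_position: int,
    score_filling: Callable[[list[str]], tuple[float, list[str]]],
    max_subtokens: int = 4,
) -> str:
    """Replace the placeholder variable with 1..max_subtokens [MASK] slots,
    let the LM fill each variant, and keep the best-scoring name."""
    best_score, best_name = float("-inf"), code_tokens[var_position]
    for n_masks in range(1, max_subtokens + 1):
        masked = (code_tokens[:var_position]
                  + ["[MASK]"] * n_masks
                  + code_tokens[var_position + 1:])
        score, filled_subtokens = score_filling(masked)
        if score > best_score:
            # Subword pieces (from Byte-Pair Encoding) are joined back into
            # a single identifier.
            best_score, best_name = score, "".join(filled_subtokens)
    return best_name
```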
ArXiv, 2020
Question answering (QA) in natural language (NL) has been an important aspect of AI from its early days. Winograd's "councilmen" example in his 1972 paper and McCarthy's Mr. Hug example of 1976 highlight the role of external knowledge in NL understanding. While Machine Learning has been the go-to approach in NL processing as well as NL question answering (NLQA) for the last 30 years, recently there has been an increasing emphasis on NLQA where external knowledge plays an important role. The challenges inspired by Winograd's councilmen example, and recent developments such as the Rebooting AI book, various NLQA datasets, research on knowledge acquisition in the NLQA context, and their use in various NLQA models have brought the issue of NLQA using "reasoning" with external knowledge to the forefront. In this paper, we present a survey of the recent work on them. We believe our survey will help establish a bridge between multiple fields of AI...
ArXiv, 2020
Recent work has shown that transformers are able to "reason" with facts and rules in a limited setting where the rules are natural language expressions of conjunctions of conditions implying a conclusion. Since this suggests that transformers may be used for reasoning with knowledge given in natural language, we rigorously evaluate this with respect to a common form of knowledge and its corresponding reasoning: reasoning about the effects of actions. Reasoning about action and change has been a top focus in the knowledge representation subfield of AI from the early days of AI, and more recently it has been a prominent aspect of commonsense question answering. We consider four action domains (Blocks World, Logistics, Dock-Worker-Robots, and a Generic Domain) in natural language and create QA datasets that involve reasoning about the effects of actions in these domains. We investigate the ability of transformers to (a) learn to reason in these domains and (b) transfer that learning...
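As a toy illustration of what such a QA instance can look like, here is a sketch of generating one Blocks World question about an action's effect; the state encoding and templates are my assumptions, not the paper's generation procedure.

```python
# World state maps each block to what it rests on ("table" or another block).
def apply_move(on: dict[str, str], block: str, dest: str) -> dict[str, str]:
    """Effect of move(block, dest); preconditions (both clear) are assumed."""
    updated = dict(on)
    updated[block] = dest
    return updated

state = {"A": "table", "B": "A", "C": "table"}
state = apply_move(state, "B", "C")

context = ("Block B is on block A. Block A and block C are on the table. "
           "Block B is moved onto block C.")
question = "What is block B on after the move?"
answer = state["B"]  # "C": the gold answer follows from the simulator
print(context, "|", question, "|", answer)
```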
Activity classification is the task of identifying a sequence of gestures over a period of time. It is challenging without visual cues, relying only on hand movements. Activity classification without visual cues has several applications in science and technology, and in this paper we propose a solution based on EMG and IMU features from the Myo Gesture Control Armband. We capture the temporal features of different hand gestures in multiple ways and apply machine learning and recent deep learning techniques. Our approach is promising: we are able to distinguish the Eating activity from other activities with 94.76% accuracy.
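A minimal sketch of one way to turn raw EMG/IMU streams into fixed-length temporal features for a classical classifier; the window length, feature set, and classifier are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(signal: np.ndarray, win: int = 50) -> np.ndarray:
    """signal: (timesteps, channels). Returns per-window stats, flattened."""
    n = signal.shape[0] // win
    feats = []
    for w in signal[: n * win].reshape(n, win, -1):  # (win, channels)
        feats.append(np.concatenate([
            w.mean(axis=0),                      # mean amplitude
            w.std(axis=0),                       # variability
            np.sqrt((w ** 2).mean(axis=0)),      # RMS, common for EMG
        ]))
    return np.concatenate(feats)

# Toy data: 20 recordings, 8 EMG channels, 200 timesteps, 2 activities.
rng = np.random.default_rng(0)
X = np.stack([window_features(rng.normal(size=(200, 8))) for _ in range(20)])
y = rng.integers(0, 2, size=20)  # e.g., eating vs. other
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```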
Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. These changes can be observable, such as movements, manipulations, and transformations of the objects in the scene; these are reflected in conventional video captioning. However, unlike images, actions in videos are also inherently linked to social and commonsense aspects such as intentions (why the action is taking place), attributes (such as who is doing the action, on whom, where, using what, etc.), and effects (how the world changes due to the action, the effect of the action on other agents). Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, in order to describe latent aspects such as intentions...
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
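A minimal sketch of procedural Q-A synthesis from a caption, assuming simple templates over (attribute, object) patterns; the paper's generation procedure is richer, so treat these rules as illustrative only.

```python
import re

COLORS = {"red", "blue", "green", "white", "black", "yellow"}

def qa_from_caption(caption: str) -> list[tuple[str, str]]:
    pairs = []
    # "a red fire hydrant" -> Q: "what color is the fire hydrant?" A: "red"
    for color, obj in re.findall(r"\ba (\w+) ([\w ]+?)(?: sitting| on|\.|,|$)",
                                 caption):
        if color in COLORS:
            pairs.append((f"what color is the {obj.strip()}?", color))
    return pairs

print(qa_from_caption("a red fire hydrant sitting on the side of a road."))
```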
ArXiv, 2021
Following procedural texts written in natural language is challenging. We must read the whole text to identify the relevant information or the instruction flow needed to complete a task, which is prone to failure. If such texts are structured, we can readily visualize instruction flows, reason about or infer a particular step, or even build automated systems to help novice agents achieve a goal. However, this structure-recovery task is challenging because of the diverse nature of such texts. This paper proposes to identify relevant information from such texts and generate information flows between sentences. We built a large annotated procedural text dataset (CTFW) in the cybersecurity domain (3,154 documents). This dataset contains valuable instructions regarding software vulnerability analysis experiences. We performed extensive experiments on CTFW with our LM-GNN model variants in multiple settings. To show the generalizability of both this task and our method, we also experimented with p...
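A minimal sketch of the LM-GNN idea: sentences become node features (a stubbed encoder here instead of a real LM), one round of graph message passing mixes neighbor information, and an edge scorer predicts whether an information flow exists between two sentences. The layer sizes and scoring head are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FlowPredictor(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mix = nn.Linear(dim, dim)   # shared message-passing weight
        self.edge = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1))

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor):
        # One GCN-style propagation step over the sentence graph.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = torch.relu(self.mix(adj @ node_feats / deg))
        # Score every ordered sentence pair for an information-flow edge.
        n = h.size(0)
        src = h.unsqueeze(1).expand(n, n, -1)
        dst = h.unsqueeze(0).expand(n, n, -1)
        return self.edge(torch.cat([src, dst], dim=-1)).squeeze(-1)

# Toy usage: 5 sentences, chain-adjacency graph from sentence order.
feats = torch.randn(5, 64)               # stand-in for LM embeddings
adj = torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
logits = FlowPredictor()(feats, adj)     # (5, 5) edge logits
print(logits.shape)
```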
Heavy reliance on human-annotated training datasets (which typically suffer from annotator subjectivity and linguistic priors) has led to learning spurious correlations, bias amplification, and lack of robustness in vision-and-language (V&L) models. We study whether VQA models can be trained without any human-annotated Q-A pairs or object bounding boxes. We use a self-supervised framework that involves procedural synthesis of Q-A pairs from captions and pre-training tasks for training our models. Since our Q-A pairs are synthetic, they exhibit a linguistic domain shift from the questions in VQA data and a label shift in the answer set, i.e., a zero-shot learning task. We benchmark our models on VQA-v2, GQA, and on VQA-CP, which contains a softer version of label shift.
ArXiv, 2021
Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. The dataset most commonly used to evaluate knowledge-based VQA is OK-VQA, but it lacks a gold-standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of the varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on the given knowledge. We introduce various ways to retrieve knowledge using text and images, and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge.
arXiv: Computation and Language, 2019
In this work, we formulate the NER task as a multi-answer knowledge-guided QA task (KGQA), which predicts entities by assigning only B, I, and O tags, without associating entity types with the tags. We provide different knowledge contexts, such as entity types, questions, definitions, and examples, along with the text, and train on a combined dataset of 18 biomedical corpora. This formulation (a) enables systems to jointly learn NER-specific features from varied NER datasets, (b) can use knowledge-text attention to identify words with higher similarity to the provided knowledge, improving performance, (c) reduces system confusion by limiting the prediction classes to B, I, and O only, and (d) makes detection of nested entities easier. We perform extensive experiments with this KGQA formulation on 18 biomedical NER datasets, and through these experiments we note that knowledge helps achieve better performance. Our problem formulation achieves state-of-the-art results on 12 datasets.
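A minimal sketch of the formulation's mechanics: knowledge (here an entity-type definition) is prepended to the text, and the model emits only B/I/O tags, which a decoder converts back to entity spans. The separator convention and the stubbed tags below are my assumptions; the actual tagger is not shown.

```python
def build_input(knowledge: str, text: str) -> str:
    # e.g., knowledge = "gene: a unit of heredity ..." (definition context)
    return f"{knowledge} [SEP] {text}"

def tags_to_spans(tokens: list[str], tags: list[str]) -> list[str]:
    """Decode a B/I/O sequence into surface-form entity spans."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                      # a new entity starts here
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:        # continue the open entity
            current.append(tok)
        else:                               # O tag closes any open entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "BRCA1 mutations increase cancer risk".split()
tags = ["B", "O", "O", "B", "O"]  # what a trained tagger might emit
print(tags_to_spans(tokens, tags))  # ['BRCA1', 'cancer']
```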
Analysis of vision-and-language models has revealed their brittleness under linguistic phenomena such as paraphrasing, negation, textual entailment, and word substitutions with synonyms or antonyms. While data augmentation techniques have been designed to mitigate these failure modes, methods that integrate this knowledge into the training pipeline remain under-explored. In this paper, we present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference. Experiments on benchmark datasets with images (NLVR) and video (VIOLIN) demonstrate performance improvements as well as robustness to adversarial attacks. Experiments on binary VQA explore the generalizability of this method to other V&L tasks.
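A minimal sketch of training on the worst-scoring transformed variant of each example, which is one common reading of distributionally robust optimization; the transformation set and max-loss objective are simplified assumptions relative to SDRO as described in the paper.

```python
import torch

def sdro_step(model, loss_fn, x_variants: list[torch.Tensor],
              y: torch.Tensor, opt: torch.optim.Optimizer):
    """x_variants: the original input plus its linguistic transformations
    (paraphrase, negation with label flips handled upstream, etc.)."""
    losses = torch.stack([loss_fn(model(v), y) for v in x_variants])
    worst = losses.max()          # optimize the worst case over variants
    opt.zero_grad()
    worst.backward()
    opt.step()
    return worst.item()

# Toy usage with a linear model over pre-encoded sentence features.
model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
variants = [torch.randn(4, 16) for _ in range(3)]  # 3 transformed batches
labels = torch.randint(0, 2, (4,))
print(sdro_step(model, torch.nn.functional.cross_entropy,
                variants, labels, opt))
```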
DARPA and Allen AI have proposed a collection of datasets to encourage research in question answering domains where (commonsense) knowledge is expected to play an important role. Recent language models such as BERT and GPT, which have been pre-trained on Wikipedia articles and books, have shown decent performance with little fine-tuning on several such multiple-choice question-answering (MCQ) datasets. Our goal in this work is to develop methods to incorporate additional (commonsense) knowledge into language-model-based approaches for better question answering in such domains. We first identify external knowledge sources and show that performance further improves when a set of facts retrieved through IR is prepended to each MCQ question during both the training and test phases. We then explore whether performance can be further improved by providing task-specific knowledge in different manners or by employing different strategies for using the available knowledge. We present...
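A minimal sketch of the fact-prepending step, assuming a simple separator convention; the scoring model that ranks the per-option strings is stubbed out.

```python
def format_mcq(facts: list[str], question: str,
               options: list[str]) -> list[str]:
    """Build one scorable input string per answer option."""
    context = " ".join(facts)
    return [f"{context} [SEP] {question} [SEP] {opt}" for opt in options]

inputs = format_mcq(
    facts=["surfing was invented in hawaii."],
    question="Where did this sport originate?",
    options=["california", "hawaii"],
)
# Each string would be scored by the fine-tuned LM; the highest score wins.
for s in inputs:
    print(s)
```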
Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
In this work, we propose Masked Noun-Phrase Prediction (MNPP), a pre-training strategy to tackle pronoun resolution in a fully unsupervised setting. First, we evaluate our pre-trained model on various pronoun resolution datasets without any fine-tuning. Our method outperforms all previous unsupervised methods on all datasets by large margins. Second, we move to a few-shot setting where we fine-tune our pre-trained model on WinoGrande-S and XS separately. Our method outperforms the RoBERTa-large baseline by large margins while achieving a higher AUC score after further fine-tuning on the remaining three official splits of WinoGrande.
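A minimal sketch of how a masked-LM score can rank candidate antecedents: each candidate noun phrase is substituted for the pronoun and the most probable sentence wins. `sequence_score` is a hypothetical stand-in for a real masked-LM scoring function; the stub below only illustrates the interface.

```python
import re
from typing import Callable

def resolve_pronoun(sentence: str, pronoun: str, candidates: list[str],
                    sequence_score: Callable[[str], float]) -> str:
    def fill(cand: str) -> str:
        # Word-boundary substitution so "it" does not match inside "fit".
        return re.sub(rf"\b{re.escape(pronoun)}\b", cand, sentence, count=1)
    return max(candidates, key=lambda c: sequence_score(fill(c)))

# Stub scorer that prefers the coherent filling; a real LM would do this.
stub = lambda s: float("the trophy is too big" in s)
print(resolve_pronoun(
    "The trophy doesn't fit in the suitcase because it is too big.",
    "it", ["the trophy", "the suitcase"], stub))
```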
ArXiv, 2020
Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers to the component questions and the composed question are consistent with the inferred logical operation. Our model ...
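The classical Fréchet inequalities bound the probability of a conjunction or disjunction given the component probabilities, which suggests one plausible instantiation of a compatibility penalty; this sketch is my reading of the idea, not the paper's exact loss.

```python
def frechet_bounds(p_a: float, p_b: float, op: str) -> tuple[float, float]:
    """Fréchet bounds on P(A op B) given P(A) and P(B)."""
    if op == "and":
        return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)
    if op == "or":
        return max(p_a, p_b), min(1.0, p_a + p_b)
    raise ValueError(op)

def compatibility_penalty(p_composed: float, p_a: float, p_b: float,
                          op: str) -> float:
    """Zero when the composed answer lies inside the Fréchet bounds."""
    lo, hi = frechet_bounds(p_a, p_b, op)
    return max(0.0, lo - p_composed) + max(0.0, p_composed - hi)

# "Is there a dog?" p=0.9, "Is it raining?" p=0.2; composed with AND, the
# model's yes-probability must lie in [0.1, 0.2].
print(frechet_bounds(0.9, 0.2, "and"),
      compatibility_penalty(0.5, 0.9, 0.2, "and"))
```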
In this work, we perform Named Entity Recognition (NER) with external knowledge. We formulate the NER task as a multi-answer question answering (MAQA) task and provide different knowledge contexts, such as entity types, questions, definitions, and definitions with examples. Moreover, formulating the task as a MAQA task helps to reduce other errors. This formulation (a) enables systems to jointly learn from varied NER datasets, learning more NER-specific features, (b) can use knowledge-text attention to identify words with higher similarity to the 'entity type' mentioned in the knowledge, improving performance, (c) reduces confusion in systems by limiting the classes to be predicted to only three (B, I, O), and (d) makes detection of nested entities easier. We perform extensive experiments with this Knowledge Guided NER (KGNER) formulation on 15 biomedical NER datasets, and through these experiments we see that external knowledge helps. We will release...
Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding the relative locations of objects, i.e., implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V&L models to such geometric understanding by formulating the prediction of pairwise relative locations of objects as both a classification and a regression task. Our findings suggest that state-of-the-art transformer-based V&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR): object centroid estimation and relative position estimation, and train V&L models with weak supervision from off-the-shelf depth estimators. This leads to considerable improvements in accuracy...
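A minimal sketch of deriving weak 3D supervision from an off-the-shelf depth map: an object's 2D box center plus the median depth inside the box gives a pseudo 3D centroid, and relative positions follow by comparison. The exact targets and discretization in the paper may differ.

```python
import numpy as np

def pseudo_centroid(depth: np.ndarray, box: tuple[int, int, int, int]):
    """box = (x1, y1, x2, y2) in pixels; depth is an HxW depth map."""
    x1, y1, x2, y2 = box
    z = float(np.median(depth[y1:y2, x1:x2]))   # robust depth inside the box
    return ((x1 + x2) / 2, (y1 + y2) / 2, z)

def relative_position(c_a, c_b) -> str:
    """Dominant-axis relation of b relative to a (image y grows downward).
    Pixel and depth units are mixed here; acceptable for a rough sketch."""
    dx, dy, dz = (b - a for a, b in zip(c_a, c_b))
    axis = max(("right" if dx > 0 else "left", abs(dx)),
               ("below" if dy > 0 else "above", abs(dy)),
               ("behind" if dz > 0 else "in front", abs(dz)),
               key=lambda t: t[1])
    return axis[0]

depth = np.random.default_rng(0).uniform(1, 5, size=(100, 100))
print(relative_position(pseudo_centroid(depth, (10, 10, 30, 30)),
                        pseudo_centroid(depth, (60, 20, 90, 50))))
```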
ArXiv, 2020
Open Domain Question Answering requires systems to retrieve external knowledge and perform multi-hop reasoning by composing knowledge spread over multiple sentences. In the recently introduced open-domain question answering challenge datasets QASC and OpenBookQA, we need to retrieve facts and compose them to correctly answer questions. In our work, we learn a semantic knowledge ranking model to re-rank knowledge retrieved through Lucene-based information retrieval systems. We further propose a "knowledge fusion model" that leverages the knowledge in BERT-based language models together with externally retrieved knowledge, improving the knowledge understanding of the BERT-based language models. On both the OpenBookQA and QASC datasets, the knowledge fusion model with semantically re-ranked knowledge outperforms previous attempts.
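A minimal sketch of semantic re-ranking over IR results: facts retrieved by a term-based system are re-scored by a semantic similarity function (a hypothetical stand-in here; the paper learns a ranking model), and the top facts are fused into the QA input. Fusion below is plain concatenation, whereas the paper's fusion model combines knowledge inside the network.

```python
import re
from typing import Callable

def rerank_and_fuse(question: str, ir_facts: list[str],
                    semantic_score: Callable[[str, str], float],
                    k: int = 2) -> str:
    """Re-rank IR hits by semantic relevance and prepend the top-k facts."""
    ranked = sorted(ir_facts, key=lambda f: semantic_score(question, f),
                    reverse=True)
    return " ".join(ranked[:k]) + " [SEP] " + question

# Stub scorer: word overlap; a learned cross-encoder would replace this.
tok = lambda s: set(re.findall(r"\w+", s.lower()))
overlap = lambda q, f: len(tok(q) & tok(f))

facts = ["metals conduct electricity.", "the sun rises in the east.",
         "copper is a metal."]
print(rerank_and_fuse("which material conducts electricity?", facts, overlap))
```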