Papers by Niluthpol Chowdhury Mithun
arXiv (Cornell University), Mar 29, 2023
Lecture Notes in Computer Science, 2022
2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR)
2022 26th International Conference on Pattern Recognition (ICPR)
Cornell University - arXiv, May 17, 2022
Understanding the geometric relationships between objects in a scene is a core capability in enabling both humans and autonomous agents to navigate in new environments. A sparse, unified representation of the scene topology will allow agents to act efficiently to move through their environment, communicate the environment state with others, and utilize the representation for diverse downstream tasks. To this end, we propose a method to train an autonomous agent to learn to accumulate a 3D scene graph representation of its environment by simultaneously learning to navigate through said environment. We demonstrate that our approach, GraphMapper, enables the learning of effective navigation policies through fewer interactions with the environment than vision-based systems alone. Further, we show that GraphMapper can act as a modular scene encoder to operate alongside existing learning-based solutions to not only increase navigational efficiency but also generate intermediate scene representations that are useful for other future tasks.
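A minimal sketch of the kind of sparse scene-graph accumulation described above, assuming a simple node/edge dictionary structure; the class and field names are illustrative, not GraphMapper's actual representation.

```python
# Minimal sketch (not the paper's implementation) of accumulating a sparse
# 3D scene graph as an agent observes objects while navigating.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # node id -> (class label, 3D position estimate)
    nodes: dict = field(default_factory=dict)
    # (node id, node id) -> relation label, e.g. "near", "on"
    edges: dict = field(default_factory=dict)

    def add_observation(self, obj_id, label, position, neighbors=()):
        """Insert or update an object node and its geometric relations."""
        self.nodes[obj_id] = (label, position)
        for other_id, relation in neighbors:
            if other_id in self.nodes:
                self.edges[(obj_id, other_id)] = relation


graph = SceneGraph()
graph.add_observation("chair_0", "chair", (1.2, 0.0, 3.4))
graph.add_observation("table_1", "table", (1.5, 0.0, 3.1),
                      neighbors=[("chair_0", "near")])
print(len(graph.nodes), len(graph.edges))  # 2 1
```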
arXiv Computer Science, Aug 21, 2020
Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment on that relevant video only. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge, as the system is required to perform: (i) retrieval of the relevant video, where only a segment of the video corresponds with the queried sentence, and (ii) temporal localization of the moment in the relevant video based on the sentence query. Towards overcoming this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets (Charades-STA, DiDeMo, and ActivityNet Captions) demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.
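A hedged sketch of one common way such a joint embedding space is trained: a bidirectional max-margin ranking loss over matched moment-sentence pairs. This is a simplified stand-in, not the HMAN architecture or its exact loss.

```python
# Hypothetical sketch of a max-margin ranking loss for a joint
# moment-sentence embedding space (simplified; not HMAN itself).
import torch
import torch.nn.functional as F


def ranking_loss(moment_emb, sent_emb, margin=0.2):
    """moment_emb, sent_emb: (B, D) embeddings of matched moment-sentence pairs."""
    moment_emb = F.normalize(moment_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    scores = moment_emb @ sent_emb.t()            # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)              # matched pairs on the diagonal
    # Hinge over all mismatched pairs, in both retrieval directions.
    cost_s = (margin + scores - pos).clamp(min=0)       # sentence -> moments
    cost_m = (margin + scores - pos.t()).clamp(min=0)   # moment -> sentences
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_m = cost_m.masked_fill(mask, 0)
    return cost_s.mean() + cost_m.mean()


loss = ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
```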
2022 International Conference on Robotics and Automation (ICRA)
Class imbalance is a fundamental problem in computer vision applications such as semantic segmentation. Specifically, uneven class distributions in a training dataset often result in unsatisfactory performance on under-represented classes. Many works have proposed to weight the standard cross entropy loss function with pre-computed weights based on class statistics, such as the number of samples and class margins. There are two major drawbacks to these methods: 1) constantly up-weighting minority classes can introduce excessive false positives in semantic segmentation; 2) a minority class is not necessarily a hard class. The consequence is low precision due to excessive false positives. In this regard, we propose a hard-class mining loss by reshaping the vanilla cross entropy loss such that it weights the loss for each class dynamically based on instantaneous recall performance. We show that the novel recall loss changes gradually between the standard cross entropy loss and the inverse frequency weighted loss. Recall loss also leads to improved mean accuracy while offering competitive mean Intersection over Union (IoU) performance. On the Synthia dataset, recall loss achieves a 9% relative improvement in mean accuracy with competitive mean IoU using DeepLab-ResNet18, compared to the cross entropy loss. Code is available at https://github.com/PotatoTian/recall-semseg.
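A simplified sketch of the idea of weighting cross entropy by per-class instantaneous recall; the released implementation is at the repository linked above, and the weighting formula below is only an assumed illustration, not the exact loss from the paper.

```python
# Simplified sketch: weight each class's cross entropy term by (1 - recall)
# computed from the current batch, so hard (low-recall) classes get more weight.
import torch
import torch.nn.functional as F


def recall_weighted_ce(logits, target, num_classes, eps=1e-6):
    """logits: (B, C, H, W); target: (B, H, W) with class indices."""
    pred = logits.argmax(dim=1)
    weights = torch.ones(num_classes, device=logits.device)
    for c in range(num_classes):
        gt_c = target == c
        if gt_c.any():
            recall_c = (pred[gt_c] == c).float().mean()
            weights[c] = 1.0 - recall_c + eps
    return F.cross_entropy(logits, target, weight=weights)


loss = recall_weighted_ce(torch.randn(2, 5, 32, 32),
                          torch.randint(0, 5, (2, 32, 32)), num_classes=5)
```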
We participated in the video-to-text description: matching and ranking task in TRECVID 2018. The goal of this task is to return a ranked list of the most likely text descriptions that correspond to each video in the test set. We trained joint visual-semantic embedding models using image-text pairs from an image-captioning dataset and applied them to the video-text retrieval task, utilizing key frames of videos extracted by a sparse subset selection approach. Our retrieval system performed reasonably across all the testing sets. Our best system, which uses a late fusion of similarity scores obtained from the key frames of a video, achieved a mean inverted ranking score of 0.225 on testing set C, and we ranked 4th overall on this task.
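An illustrative sketch of the late-fusion step described above: similarities between each key frame's embedding and a caption embedding are computed independently and combined into one video-level score. The function and fusion choice are assumptions for illustration, not the submitted system.

```python
# Illustrative late fusion of per-key-frame similarity scores (not the
# submitted system): score each key frame against the text, then pool.
import torch
import torch.nn.functional as F


def video_text_score(frame_embs, text_emb, fusion="mean"):
    """frame_embs: (K, D) key-frame embeddings; text_emb: (D,) caption embedding."""
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=1)  # (K,)
    return sims.mean() if fusion == "mean" else sims.max()


score = video_text_score(torch.randn(5, 512), torch.randn(512))
```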
We participated in the matching and ranking subtask in the TRECVID 2017 challenge. The task here was to return a ranked list of the most likely text descriptions that correspond to each video. We adopted a joint visual-semantic embedding approach for image-text retrieval and applied it to the video-text retrieval task, utilizing key frames extracted by a dissimilarity-based sparse subset selection approach. We trained our system on the MS-COCO dataset and tested on the TRECVID dataset. Our approach achieved an average mean inverted ranking score of 0.255 across four sets of testing data, and we ranked 3rd overall in the challenge on this task.
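For the key-frame extraction step, a much simpler greedy stand-in is sketched below: repeatedly pick the frame most dissimilar to the frames already selected. The actual submission used a sparse subset selection formulation; this is only an assumed illustration of the intuition.

```python
# Greedy stand-in for dissimilarity-based key-frame selection
# (the real method is a sparse subset selection optimization).
import numpy as np


def select_key_frames(features, k):
    """features: (N, D) per-frame descriptors; returns indices of k key frames."""
    selected = [0]                        # start from the first frame
    while len(selected) < k:
        dists = np.min(
            np.linalg.norm(features[:, None, :] - features[selected][None, :, :], axis=2),
            axis=1,
        )
        selected.append(int(dists.argmax()))  # frame most dissimilar to current set
    return selected


frames = np.random.rand(100, 128)
print(select_key_frames(frames, k=5))
```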
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this task; however, they come at a significantly increased computational load. Through this work, we design a novel approach that focuses on performing better than or comparably to the existing learning-based solutions, but under a clear time/computational budget. To this end, we propose a method to encode vital scene semantics (such as traversable paths, unexplored areas, and observed scene objects), alongside raw visual streams such as RGB, depth, and semantic segmentation masks, into a semantically informed, top-down egocentric map representation. Further, to enable the effective use of this information, we introduce a novel 2-D map attention mechanism, based on the successful multi-layer Transformer networks. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach. We show that by using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information, while operating with an 80% decrease in the agent's experience.
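A hedged sketch of what attention over a top-down map can look like: map cells are flattened into spatial tokens and passed through multi-head self-attention. The module name, channel sizes, and single-layer design are assumptions, not the paper's exact 2-D map attention architecture.

```python
# Illustrative attention over a top-down egocentric map: map cells become
# tokens for multi-head self-attention (not the paper's exact architecture).
import torch
import torch.nn as nn


class MapAttention(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, map_feat):
        """map_feat: (B, C, H, W) top-down map features."""
        b, c, h, w = map_feat.shape
        tokens = map_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)        # self-attention over cells
        return out.transpose(1, 2).view(b, c, h, w)


attended = MapAttention()(torch.randn(2, 64, 16, 16))
```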
ArXiv, 2021
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task, as they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model that focuses on combining classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language...
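A simplified sketch of the map-building step mentioned above: projecting per-pixel semantic labels, lifted to 3D points with depth, into a top-down ego-centric grid. Grid size, cell resolution, and the counting scheme are assumptions for illustration, not the paper's mapping module.

```python
# Simplified top-down ego-centric semantic map from camera-frame 3D points
# with per-point semantic labels (not the paper's mapper).
import numpy as np


def build_topdown_map(points_xyz, labels, map_size=64, cell_m=0.1, num_classes=20):
    """points_xyz: (N, 3) camera-frame points (x right, z forward); labels: (N,)."""
    grid = np.zeros((num_classes, map_size, map_size), dtype=np.int32)
    # Agent sits at the bottom-center of the map.
    col = (points_xyz[:, 0] / cell_m + map_size / 2).astype(int)
    row = (map_size - 1 - points_xyz[:, 2] / cell_m).astype(int)
    valid = (col >= 0) & (col < map_size) & (row >= 0) & (row < map_size)
    np.add.at(grid, (labels[valid], row[valid], col[valid]), 1)
    return grid  # per-class occupancy counts in the ego-centric grid


pts = np.random.uniform([-3, 0, 0], [3, 2, 6], size=(1000, 3))
sem = np.random.randint(0, 20, size=1000)
topdown = build_topdown_map(pts, sem)
```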
Author(s): Mithun, Niluthpol Chowdhury | Advisor(s): Roy-Chowdhury, Amit K | Abstract: In recent years, tremendous success has been achieved in many computer vision tasks using deep learning models trained on large hand-labeled image datasets. In many applications, this may be impractical or infeasible, either because of the non-availability of large datasets or because of the amount of time and resources needed for labeling. In this respect, an increasingly important problem in the fields of computer vision, multimedia, and machine learning is how to learn useful models for tasks where labeled data is sparse. In this thesis, we focus on learning comprehensive joint representations for different cross-modal visual-textual retrieval tasks by leveraging weak supervision, that is, supervision that is noisier and/or less precise but cheaper and/or more efficient to collect. Cross-modal visual-textual retrieval has gained considerable momentum in recent years due to the promise of deep neural network models in learning robust...
2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018
Constructing a feature representation invariant to certain types of geometric and photometric transformations is of significant importance in many computer vision applications. In spite of significant effort, developing invariant feature representations remains a challenging problem. Most of the existing representations often fail to satisfy the long-term repeatability requirements of specific applications like vision-based localization, applications whose domain includes significant, non-uniform illumination and environmental changes. To these ends, we explore the use of natural image pairs (i.e., images captured of the same location but at different times) as an additional source of supervision to generate an improved feature representation for the task of vision-based localization. Specifically, we resort to training a deep denoising autoencoder, with the CNN feature representation of one image in the pair being treated as a noisy version of the other. The resulting system thereby learns...
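A minimal sketch of that pairing idea: a denoising autoencoder whose input is the CNN descriptor of one image of a place and whose reconstruction target is the descriptor of the same place captured at a different time. Layer sizes, descriptor dimension, and optimizer settings are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: treat time-A descriptors as the "noisy" view and reconstruct
# the paired time-B descriptors of the same locations.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(4096, 1024), nn.ReLU(),   # encoder: compress the noisy descriptor
    nn.Linear(1024, 4096),              # decoder: reconstruct the paired descriptor
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)

feat_time_a = torch.randn(32, 4096)     # descriptors of images at time A
feat_time_b = torch.randn(32, 4096)     # same locations, captured at time B

recon = autoencoder(feat_time_a)
loss = nn.functional.mse_loss(recon, feat_time_b)
loss.backward()
optimizer.step()
```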
This paper presents the Online Social Network Investigator (OSNI), a scalable distributed system to search social network data based on a spatiotemporal window and a list of keywords. Given that only 2% of tweets are geolocated, we have implemented and compared various state-of-the-art location estimation techniques. Further, to enrich the context of posts, associations of images to terms are estimated through various classification techniques. The accuracies of these estimations are evaluated on large real datasets. OSNI's query interface is available on the Web.
The identity of subjects in many portraits has been a matter of debate for art historians, who have relied upon subjective analysis of facial features to resolve ambiguity in sitter identity. Developing automated face verification techniques has thus garnered interest as a quantitative way to reinforce the decisions arrived at by art historians. However, most existing works often fail to resolve ambiguities concerning the identity of the subjects due to significant variation in artistic styles and the limited availability and authenticity of art images. To these ends, we explore the use of deep Siamese Convolutional Neural Networks (CNNs) to provide a measure of similarity between a pair of portraits. To mitigate the limited training data issue, we employ a CNN-based style-transfer technique that creates several new images by recasting an art style onto other images, keeping the original image content unchanged. The resulting system thereby learns features which are discriminative and invariant...
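An illustrative Siamese setup for scoring portrait pairs: a shared-weight embedding network trained with a contrastive loss. The backbone, embedding size, and margin below are assumptions, not the paper's exact configuration.

```python
# Illustrative Siamese similarity model with a contrastive loss
# (not the paper's exact network).
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(               # shared-weight branch applied to both portraits
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64),
)


def contrastive_loss(x1, x2, same_sitter, margin=1.0):
    """same_sitter: (B,) 1 if the two portraits show the same person, else 0."""
    d = F.pairwise_distance(embed(x1), embed(x2))
    return (same_sitter * d.pow(2) +
            (1 - same_sitter) * (margin - d).clamp(min=0).pow(2)).mean()


loss = contrastive_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
                        torch.tensor([1., 0., 1., 0.]))
```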
Detection and localization of image manipulations have become of increasing interest to researchers in recent years due to the significant rise of malicious content-changing image tampering on the web. One of the major challenges for an image manipulation detection method is to discriminate between the tampered regions and other regions in an image. We observe that most manipulated images leave some traces near the boundaries of manipulated regions, including blurred edges. In order to exploit these traces in localizing the tampered regions, we propose an encoder-decoder based network where we fuse representations from early layers in the encoder (which are richer in low-level spatial cues, like edges) by skip pooling with representations of the last layer of the decoder, and use them for manipulation detection. In addition, we utilize resampling features extracted from patches of images by feeding them to LSTM cells to capture the transition between manipulated and non-manipulated blocks...
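A schematic sketch of the skip-fusion idea described above: an early encoder feature map, rich in low-level edge cues, is concatenated with the final decoder output before the per-pixel prediction head. Layer sizes and the network layout are illustrative assumptions, not the paper's architecture.

```python
# Schematic encoder-decoder with an early-layer skip connection fused into the
# final decoder output (stand-in for the skip-pooling idea; not the paper's net).
import torch
import torch.nn as nn


class SkipFusionSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())   # keeps low-level edge cues
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.head = nn.Conv2d(32, 2, 1)     # tampered vs. authentic per pixel

    def forward(self, x):
        e1 = self.enc1(x)
        d = self.dec(self.enc2(e1))
        fused = torch.cat([d, e1], dim=1)   # skip-fuse early features with decoder output
        return self.head(fused)


mask_logits = SkipFusionSegNet()(torch.randn(1, 3, 64, 64))  # (1, 2, 64, 64)
```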
2020 25th International Conference on Pattern Recognition (ICPR)
Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment on that relevant video only. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge, as the system is required to perform: (i) retrieval of the relevant video, where only a segment of the video corresponds with the queried sentence, and (ii) temporal localization of the moment in the relevant video based on the sentence query. Towards overcoming this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets (Charades-STA, DiDeMo, and ActivityNet Captions) demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.