Papers by Ming-Hsuan Yang
2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas. Most existing methods take the words in the instructions and the discrete views of each panorama as the minimal unit of encoding. However, this requires a model to match different nouns (e.g., TV, table) against the same input view feature. In this work, we propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level, namely objects and words. Our sequential BERT also enables the visual-textual clues to be interpreted in light of the temporal context, which is crucial to multi-round VLN tasks. Additionally, we enable the model to identify the relative direction (e.g., left/right/front/back) of each navigable location and the room type (e.g., bedroom, kitchen) of its current and final navigation goal, as such information is widely mentioned in instructions implying the desired next and final locations. We thus enable the model to know where the objects lie in the images and where they stand in the scene. Extensive experiments demonstrate the effectiveness of our approach against several state-of-the-art methods on three indoor VLN tasks: REVERIE, NDH, and R2R. Project repository: https://github.com/YuankaiQi/ORIST (Figure: example instructions, e.g., "Walk past the coffee table and couch and stop by the pool table."; "Go to the front of the couch towards the billiard table, and stop."; "Walk in front of the fireplace ... stop once you reach the pool table.")
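As a rough illustration of the fine-grained joint encoding described above, the sketch below feeds word tokens and detected-object features into a single transformer sequence. All dimensions, layer counts, and names are illustrative assumptions, not the ORIST implementation.

```python
import torch
import torch.nn as nn

# A minimal sketch of encoding words and detected objects as one token
# sequence for a transformer. Hypothetical dimensions and layers.
class ObjectWordEncoder(nn.Module):
    def __init__(self, vocab_size=30522, obj_feat_dim=2048, d_model=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, word_ids, obj_feats):
        """word_ids: (B, Lw) token ids; obj_feats: (B, Lo, obj_feat_dim)."""
        tokens = torch.cat([self.word_emb(word_ids),
                            self.obj_proj(obj_feats)], dim=1)
        return self.encoder(tokens)  # joint word-object representations
```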
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022
Multi-Task Learning (MTL) aims to enhance model generalization by sharing representations between related tasks for better performance. Typical MTL methods are jointly trained with complete ground truths for all tasks simultaneously. However, a single dataset may not contain the annotations for every task of interest. To address this issue, we propose the Semi-supervised Multi-Task Learning (SemiMTL) method to leverage the available supervisory signals from different datasets, particularly for the semantic segmentation and depth estimation tasks. To this end, we design an adversarial learning scheme in our semi-supervised training that leverages unlabeled data to optimize all the task branches simultaneously and accomplish all tasks across datasets with partial annotations. We further present a domain-aware discriminator structure with various alignment formulations to mitigate the domain discrepancy among datasets. Finally, we demonstrate the effectiveness of the proposed method in learning across different datasets on challenging street-view and remote sensing benchmarks.
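A minimal sketch of the adversarial alignment idea, assuming a segmentation task branch: a discriminator guesses which dataset a prediction came from, and the task network is trained to fool it on the dataset that lacks labels for this task. The module names, shapes, and loss weight are illustrative assumptions, not the SemiMTL architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainDiscriminator(nn.Module):
    """Guesses which dataset a softmaxed prediction map came from."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),  # 1 = "labeled domain"
        )

    def forward(self, seg_probs):
        return self.net(seg_probs)

def semi_mtl_step(task_net, disc, labeled, unlabeled, adv_weight=0.01):
    images, targets = labeled
    sup_loss = F.cross_entropy(task_net(images), targets)  # supervised branch
    # Align unlabeled predictions with the labeled domain's statistics.
    probs_u = F.softmax(task_net(unlabeled), dim=1)
    logits_d = disc(probs_u)
    adv_loss = F.binary_cross_entropy_with_logits(
        logits_d, torch.ones_like(logits_d))
    return sup_loss + adv_weight * adv_loss
```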
Proceedings of the 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998
Proceedings of the British Machine Vision Conference 2006, 2006
We formulate the problem of dynamic texture synthesis as a nonlinear manifold learning and traversal problem. We characterize dynamic textures as the temporal changes in spectral parameters of image sequences. For continuous changes of such parameters, it is commonly assumed that the parameters lie on or close to a low-dimensional manifold embedded in the original configuration space. For complex dynamic data, the manifolds are usually nonlinear, and we propose to use a mixture of linear subspaces to model a nonlinear manifold. These locally linear subspaces are further aligned within a global coordinate system. With the nonlinear manifold globally parameterized, we overcome the motion discontinuity problems encountered in switching linear models and dynamics. We present a nonparametric method to describe the complex dynamics of data sequences on the manifold. We also apply this approach to dynamic spatial parameters such as motion capture data. The experimental results suggest that our approach is able to synthesize smooth, complex dynamic textures and human motions, and has potential applications to other dynamic data synthesis problems.
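A minimal sketch of the mixture-of-linear-subspaces idea, assuming the per-frame spectral parameters are stacked into a matrix: cluster the data, then fit a low-dimensional PCA to each cluster. The paper's global alignment of the local coordinate systems is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Cluster the frames, then model each cluster with a local linear subspace.
def fit_local_subspaces(frames, n_clusters=8, n_dims=5):
    """frames: (num_frames, feature_dim) array of per-frame parameters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frames)
    subspaces = []
    for k in range(n_clusters):
        cluster = frames[labels == k]
        n_comp = min(n_dims, max(1, len(cluster) - 1))
        subspaces.append(PCA(n_components=n_comp).fit(cluster))
    return labels, subspaces  # traverse by moving within/between subspaces
```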
ArXiv, 2019
The study of mouse social behaviours has been increasingly undertaken in neuroscience research. However, automated quantification of mouse behaviours from videos of interacting mice is still a challenging problem, where object tracking plays a key role in locating mice in their living spaces. Artificial markers are often applied for multiple-mice tracking, but they are intrusive and consequently interfere with the movements of mice in a dynamic environment. In this paper, we propose a novel method to continuously track several mice and individual parts without requiring any specific tagging. Firstly, we propose an efficient and robust deep-learning-based mouse part detection scheme to generate part candidates. Subsequently, we propose a novel Bayesian Integer Linear Programming model that jointly assigns the part candidates to individual targets with the necessary geometric constraints whilst establishing pair-wise associations between the detected parts. There is no publicly available ...
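The paper formulates assignment as a Bayesian Integer Linear Program; as a much simplified illustration, the sketch below assigns part candidates to mouse identities with the Hungarian algorithm on a distance cost, ignoring the geometric constraints and pairwise associations.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Simplified stand-in for the paper's Bayesian ILP: match each detected part
# candidate to the mouse identity whose predicted location is closest.
def assign_parts(candidates, predicted_positions):
    """candidates: (n, 2) detections; predicted_positions: (m, 2) per mouse."""
    cost = np.linalg.norm(
        candidates[:, None, :] - predicted_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # (candidate_index, mouse_index) pairs
```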
Recent years have witnessed advances in parallel algorithms for large-scale optimization problems. Notwithstanding demonstrated success, existing algorithms that parallelize over features are usually limited by divergence issues under high parallelism or require data preprocessing to alleviate these problems. In this work, we propose a Parallel Coordinate Descent Newton algorithm using multidimensional approximate Newton steps (PCDN), where the off-diagonal elements of the Hessian are set to zero to enable parallelization. It randomly partitions the feature set into b bundles/subsets of size P and sequentially processes each bundle by first computing the descent directions for each feature in parallel and then conducting a P-dimensional line search to obtain the step size. We show that: (1) PCDN is guaranteed to converge globally despite increasing parallelism; (2) PCDN converges to the specified accuracy within a limited iteration number T, and T decreases with increas...
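A serial sketch of one bundle update under stated assumptions (a least-squares objective; the inner loop is the part PCDN runs in parallel, and the line search is simplified to scalar backtracking along the joint direction):

```python
import numpy as np

# Bundle-wise approximate Newton step for f(w) = 0.5 * ||X @ w - y||^2.
def pcdn_step(X, y, w, bundle):
    """bundle: indices of the P features updated in this step."""
    residual = X @ w - y
    grad = X.T @ residual
    direction = np.zeros_like(w)
    for j in bundle:  # parallelizable: uses only the Hessian diagonal
        h_jj = X[:, j] @ X[:, j]
        direction[j] = -grad[j] / max(h_jj, 1e-12)
    # Line search guards against divergence from ignored Hessian cross-terms.
    step, f0 = 1.0, 0.5 * residual @ residual
    while (0.5 * np.sum((X @ (w + step * direction) - y) ** 2) > f0
           and step > 1e-8):
        step *= 0.5
    return w + step * direction
```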
In this work, we propose a simple yet effective meta-learning algorithm in the semi-supervised setting. We notice that existing consistency-based approaches mostly do not consider the essential role of the label information for consistency regularization. To alleviate this issue, we bridge the relationship between the consistency loss and label information by unfolding and differentiating through one optimization step. Specifically, we exploit the pseudo labels of the unlabeled examples, which are guided by the meta-gradients of the labeled data loss, so that the model can generalize well on the labeled examples. In addition, we introduce a simple first-order approximation to avoid computing higher-order derivatives and to guarantee scalability. Extensive evaluations on the SVHN, CIFAR, and ImageNet datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
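A compressed, first-order illustration of the unfold-one-step idea under hypothetical names: update the model on pseudo-labeled data, then fit the labeled data. The paper's meta-gradient coupling between the two losses is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Rough sketch: one unfolded step on the pseudo-label loss, followed by a
# labeled-data step that checks how well the model now fits the labels.
def semi_supervised_step(model, opt, x_unlab, x_lab, y_lab):
    with torch.no_grad():
        pseudo = model(x_unlab).argmax(dim=1)   # current pseudo labels
    opt.zero_grad()
    F.cross_entropy(model(x_unlab), pseudo).backward()
    opt.step()                                  # unfolded optimization step
    opt.zero_grad()
    lab_loss = F.cross_entropy(model(x_lab), y_lab)
    lab_loss.backward()
    opt.step()                                  # labeled loss guides training
    return lab_loss.item()
```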
In this paper, we present a simple yet fast and robust algorithm which exploits the spatio-temporal context for visual tracking. Our approach formulates the spatio-temporal relationships between the object of interest and its local context in a Bayesian framework, which models the statistical correlation between the low-level features (i.e., image intensity and position) from the target and its surrounding regions. The tracking problem is then posed as computing a confidence map and obtaining the best target location by maximizing an object location likelihood function. The Fast Fourier Transform is adopted for fast learning and detection in this work. Implemented in MATLAB without code optimization, the proposed tracker runs at 350 frames per second on an i7 machine. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods in terms of efficiency, accuracy and robustness.
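A minimal sketch of the FFT-based step, assuming a learned 2-D context model of the same size as the search window: cross-correlation becomes an element-wise product in the Fourier domain, and the peak of the resulting confidence map gives the target location.

```python
import numpy as np

# Circular cross-correlation via the FFT; `context_model` stands in for the
# spatial context filter learned from previous frames.
def confidence_map(search_window, context_model):
    """Both arguments: 2-D arrays of the same shape (the local context)."""
    response = np.real(np.fft.ifft2(
        np.conj(np.fft.fft2(context_model)) * np.fft.fft2(search_window)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return response, (dy, dx)  # peak of the map = most likely target location
```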
International Journal of Computer Vision, 2022
Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem that synthesizes intermediate results between the two domains has not been well studied in the literature. Generating a smooth sequence of intermediate results bridges the gap between two different domains, facilitating the morphing effect across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector, which enables continuous translation on diverse mapping paths across various domains. In particular, utilizing the sign operation to encode the domain information, we introduce a unified attribute space shared by all domains, thereby allowing interpolation on attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetrical attribute vectors and leverage the domain information of the interpolated results along the trajectory for ad...
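A minimal sketch of the interpolation enabled by the unified attribute space: linearly blending two domain-encoded attribute vectors traces a cross-domain path, with each intermediate vector conditioning the generator for one frame of the morph. Shapes and names are assumptions.

```python
import numpy as np

# Interpolate between two sign-symmetric attribute vectors; the sign pattern
# encodes the domain, so intermediates trace a smooth cross-domain path.
def continuous_translation_path(attr_src, attr_tgt, n_steps=8):
    """attr_src, attr_tgt: 1-D attribute vectors with domain-encoding signs."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1 - a) * attr_src + a * attr_tgt for a in alphas]
```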
ArXiv, 2020
Obtaining object response maps is one important step toward weakly-supervised semantic segmentation using image-level labels. However, existing methods rely on the classification task, which can result in a response map attending only to discriminative object regions, as the network does not need to see the entire object to optimize the classification loss. To tackle this issue, we propose a principled and end-to-end trainable framework that allows the network to pay attention to other parts of the object while producing a more complete and uniform response map. Specifically, we introduce the mixup data augmentation scheme into the classification network and design two uncertainty regularization terms to better interact with the mixup strategy. In experiments, we conduct extensive analysis to demonstrate the proposed method and show favorable performance against state-of-the-art approaches.
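For reference, a minimal sketch of the standard mixup scheme the method builds on (the paper's two uncertainty regularization terms are not shown):

```python
import numpy as np

# Standard mixup: convex combinations of image pairs and their label vectors.
def mixup(x1, y1, x2, y2, alpha=0.4):
    """x*: image arrays; y*: one-hot label vectors; alpha: Beta parameter."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```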
Computer Vision – ECCV 2018, 2018
We introduce a new problem of generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting the locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network generates a fake image as well as a mask based on the encoder-decoder framework, while a discriminator network aims to detect fake images. The network is trained with three losses to account for spatial, appearance, and adversarial information. The spatial loss determines whether the locations of the predicted parts are correct. The appearance loss ensures input patches are restored in the output image without much modification. The adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals since no labels of key parts are required. Experimental results on six datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.
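A minimal sketch of combining the three losses, under assumed helper names: `patch_canvas` is a full-size image holding the input patches at their ground-truth locations (zeros elsewhere) and `patch_mask` is the matching binary mask; the weights are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_img, pred_mask, patch_canvas, patch_mask, disc_logits,
                   w_spatial=1.0, w_appearance=10.0, w_adv=1.0):
    # Spatial: the predicted part mask (sigmoid output in [0, 1]) should
    # match where the parts belong.
    spatial = F.binary_cross_entropy(pred_mask, patch_mask)
    # Appearance: input patches should be restored in the output image.
    appearance = F.l1_loss(fake_img * patch_mask, patch_canvas)
    # Adversarial: the discriminator should score the output as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return w_spatial * spatial + w_appearance * appearance + w_adv * adv
```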
Quantitative Evaluation. We quantitatively evaluate the proposed long-term correlation tracking (LCT) algorithm on the 50 benchmark sequences with comparisons to 11 state-of-the-art trackers: CSK [4], STC [10], KCF [5], MIL [1], Struck [3], CT [11], ASLA [6], TLD [7], SCM [12], MEEM [9], and TGPR [2]. We report the distance precision at a threshold of 20 pixels in Table 1 and the overlap success rate at a threshold of 0.5 in Table 2. We report the distance precision plots over eight tracking challenges in our attribute-based evaluation in Figure 1, as mentioned on line 643 in the manuscript.
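For reference, a minimal sketch of the two reported metrics (distance precision at 20 pixels and overlap success at an IoU of 0.5), with boxes given as (x, y, w, h):

```python
import numpy as np

# Fraction of frames where the predicted center is within `threshold` pixels.
def distance_precision(pred_centers, gt_centers, threshold=20.0):
    dists = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return np.mean(dists <= threshold)

# Fraction of frames where the predicted box overlaps ground truth by >= IoU.
def overlap_success(pred_boxes, gt_boxes, threshold=0.5):
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2],
                    gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3],
                    gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = (pred_boxes[:, 2] * pred_boxes[:, 3]
             + gt_boxes[:, 2] * gt_boxes[:, 3] - inter)
    return np.mean(inter / union >= threshold)
```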
International Journal of Computer Vision
Image deblurring is a classic problem in low-level computer vision, which aims to recover a sharp image from a blurred input image. Recent advances in deep learning have led to significant progress in solving this problem, and a large number of deblurring networks have been proposed. This paper presents a comprehensive and timely survey of recently published deep-learning-based image deblurring approaches, aiming to serve the community as a useful literature review. We start by discussing common causes of image blur, introduce benchmark datasets and performance metrics, and summarize different problem formulations. Next, we present a taxonomy of methods using convolutional neural networks (CNNs) based on architecture, loss function, and application, offering a detailed review ...
IEEE Transactions on Pattern Analysis and Machine Intelligence
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator. As an emerging technique to bridge the real and fake image domains, GAN inversion plays an essential role in enabling pretrained GAN models, such as StyleGAN and BigGAN, for real image editing applications. Moreover, GAN inversion also provides insights into the interpretation of the latent space of GANs and how realistic images can be generated. In this paper, we provide an overview of GAN inversion with a focus on recent algorithms and applications. We cover important techniques of GAN inversion and their applications in image restoration and image manipulation. We further elaborate on some trends and challenges for future research. A curated list of GAN inversion methods, datasets, and other related information can be found at github.com/weihaox/awesome-gan-inversion.
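A minimal sketch of optimization-based inversion, one of the technique families the survey covers: refine a random latent code until the generator reproduces the target image. In practice a perceptual loss and careful initialization are typically added; both are omitted here.

```python
import torch

# Optimize a latent code so the pretrained generator reconstructs `target`.
def invert(generator, target, latent_dim=512, steps=500, lr=0.05):
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(z), target)
        loss.backward()
        opt.step()
    return z.detach()  # inverted code; feed to generator for reconstruction
```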
2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
Image extrapolation aims at expanding the narrow field of view of a given image patch. Existing models mainly deal with natural scene images of homogeneous regions and offer no control over the content generation process. In this work, we study conditional image extrapolation to synthesize new images guided by input structured text. The text is represented as a graph to specify the objects and their spatial relations to the unknown regions of the image. Inspired by drawing techniques, we propose a progressive generative model of three stages, i.e., generating a coarse bounding-box layout, refining it to a finer segmentation layout, and mapping the layout to a realistic output. Such a multi-stage design is shown to facilitate the training process and generate more controllable results. We validate the effectiveness of the proposed method on face and human clothing datasets in terms of visual results, quantitative evaluations and flexible controls.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
The softmax cross-entropy loss function has been widely used to train deep models for various tasks. In this work, we propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification. Unlike the softmax cross-entropy loss, our method explicitly shapes the deep feature space towards a Gaussian mixture distribution. With a classification margin and a likelihood regularization, the GM loss facilitates both high classification performance and accurate modeling of the feature distribution. The GM loss can be readily used to distinguish abnormal inputs, such as adversarial examples, based on the discrepancy between the feature distributions of the inputs and the training set. Furthermore, theoretical analysis shows that a symmetric feature space can be achieved by using the GM loss, which enables the models to perform robustly against adversarial attacks. The proposed model can be implemented easily and efficiently without extra trainable parameters. Extensive evaluations demonstrate that the proposed method performs favorably not only on image classification but also on robust detection of adversarial examples generated by strong attacks under different threat models.
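A minimal sketch of a GM-style loss, assuming identity covariances and omitting the paper's classification margin: logits are negative squared distances to learned class means, plus a likelihood term pulling features toward their class mean.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMLoss(nn.Module):
    """Simplified Gaussian-mixture loss; weight is an assumption."""
    def __init__(self, num_classes, feat_dim, likelihood_weight=0.1):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.likelihood_weight = likelihood_weight

    def forward(self, features, labels):
        # (batch, num_classes) squared distances to every class mean.
        sq_dist = torch.cdist(features, self.means).pow(2)
        cls_loss = F.cross_entropy(-0.5 * sq_dist, labels)
        # Likelihood regularization: distance to the true class mean.
        lkd = 0.5 * sq_dist.gather(1, labels.unsqueeze(1)).mean()
        return cls_loss + self.likelihood_weight * lkd
```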
International Journal of Computer Vision, 2022
Dense correspondence across semantically related images has been extensively studied, but it still faces two challenges: (1) large variations in appearance, scale and pose exist even for objects from the same category, and (2) labeling pixel-level dense correspondences is labor intensive and infeasible to scale. Most existing methods focus on designing various matching modules using fully-supervised ImageNet pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching which does not rely on any ImageNet pretrained model. We show that image-level contrastive learning is a key component in encouraging the convolutional features to find correspondences between similar objects, while the performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against the state-of-the-art approaches.
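A minimal sketch of the image-level contrastive component (an InfoNCE-style loss over two augmented views); the cross-instance cycle-consistency regularization at intermediate levels is not shown.

```python
import torch
import torch.nn.functional as F

# InfoNCE: pull two views of the same image together, push others apart.
def info_nce(view1, view2, temperature=0.07):
    """view1, view2: (batch, dim) embeddings of two augmentations."""
    z1 = F.normalize(view1, dim=1)
    z2 = F.normalize(view2, dim=1)
    logits = z1 @ z2.t() / temperature               # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal positives
    return F.cross_entropy(logits, targets)
```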