Skip to main content

Sebastian Palacio

Followers

0

Following

0

Public Views

Interests

Uploads

Papers by Sebastian Palacio

Sequential Spatial Transformer Networks for Salient Object Classification

Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods

Hitchhiker's Guide to Super-Resolution: Introduction and Recent Advances

IEEE Transactions on Pattern Analysis and Machine Intelligence

With the advent of Deep Learning (DL), Super-Resolution (SR) has also become a thriving research ... more With the advent of Deep Learning (DL), Super-Resolution (SR) has also become a thriving research area. However, despite promising results, the field still faces challenges that require further research, e.g., allowing flexible upsampling, more effective loss functions, and better evaluation metrics. We review the domain of SR in light of recent advances and examine state-of-the-art models such as diffusion (DDPM) and transformer-based SR models. We critically discuss contemporary strategies used in SR and identify promising yet unexplored research directions. We complement previous surveys by incorporating the latest developments in the field, such as uncertainty-driven losses, wavelet networks, neural architecture search, novel normalization methods, and the latest evaluation techniques. We also include several visualizations for the models and methods throughout each chapter to facilitate a global understanding of the trends in the field. This review ultimately aims at helping researchers to push the boundaries of DL applied to SR.

$Research paper thumbnail of P $\approx$ NP, at least in Visual Question Answering$

P $\approx$ NP, at least in Visual Question Answering

arXiv (Cornell University), Mar 26, 2020

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by... more In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely-used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38 % of the questions, while the remaining 62 % are spread over the more than 3000 remaining answers. While several sources of biases have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare it to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (N P) feature spaces are strongly aligned, hence the expression P ≈ N P .

Adversarial Defense based on Structure-to-Signal Autoencoders

2020 IEEE Winter Conference on Applications of Computer Vision (WACV)

Adversarial attacks have exposed the intricacies of the complex loss surfaces approximated by neu... more Adversarial attacks have exposed the intricacies of the complex loss surfaces approximated by neural networks. In this paper, we present a defense strategy against gradientbased attacks, on the premise that input gradients need to expose information about the semantic manifold for attacks to be successful. We propose an architecture based on compressive autoencoders (AEs) with a two-stage training scheme, creating not only an architectural bottleneck but also a representational bottleneck. We show that the proposed mechanism yields robust results against a collection of gradient-based attacks under challenging white-box conditions. This defense is attack-agnostic and can, therefore, be used for arbitrary pre-trained models, while not compromising the original performance. These claims are supported by experiments conducted with state-of-the-art image classifiers (ResNet50 and Inception v3), on the full ImageNet validation set. Experiments, including counterfactual analysis, empirically show that the robustness stems from a shift in the distribution of input gradients, which mitigates the effect of tested adversarial attack methods. Gradients propagated through the proposed AEs represent less semantic information and instead point to low-level structural features.

What do Deep Networks Like to See?

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

We propose a novel way to measure and understand convolutional neural networks by quantifying the... more We propose a novel way to measure and understand convolutional neural networks by quantifying the amount of input signal they let in. To do this, an autoencoder (AE) was fine-tuned on gradients from a pre-trained classifier with fixed parameters. We compared the reconstructed samples from AEs that were fine-tuned on a set of image classifiers (AlexNet, VGG16, ResNet-50, and Inception v3) and found substantial differences. The AE learns which aspects of the input space to preserve and which ones to ignore, based on the information encoded in the backpropagated gradients. Measuring the changes in accuracy when the signal of one classifier is used by a second one, a relation of total order emerges. This order depends directly on each classifier's input signal but it does not correlate with classification accuracy or network size. Further evidence of this phenomenon is provided by measuring the normalized mutual information between original images and auto-encoded reconstructions from different fine-tuned AEs. These findings break new ground in the area of neural network understanding, opening a new way to reason, debug, and interpret their results. We present four concrete examples in the literature where observations can now be explained in terms of the input signal that a model uses.

XAI Handbook: Towards a Unified Framework for Explainable AI

2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021

Self-supervised Test-time Adaptation on Video Data

2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

In typical computer vision problems revolving around video data, pre-trained models are simply ev... more In typical computer vision problems revolving around video data, pre-trained models are simply evaluated at test time, without adaptation. This general approach clearly cannot capture the shifts that will likely arise between the distributions from which training and test data have been sampled. Adapting a pre-trained model to a new video encountered at test time could be essential to avoid the potentially catastrophic effects of such shifts. However, given the inherent impossibility of labeling data only available at testtime, traditional "fine-tuning" techniques cannot be leveraged in this highly practical scenario. This paper explores whether the recent progress in test-time adaptation in the image domain and self-supervised learning can be leveraged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts. In our experiments, we show that test-time adaptation approaches applied to self-supervised methods are always beneficial, but also that the extent of their effectiveness largely depends on the specific combination of the algorithms used for adaptation and self-supervision, and also on the type of covariate shift taking place.

Classless Association Using Neural Networks

The goal of this paper is to train a model based on the relation between two instances that repre... more The goal of this paper is to train a model based on the relation between two instances that represent the same unknown class. This task is inspired by the Symbol Grounding Problem and the association learning between modalities in infants. We propose a novel model called Classless Association that has two parallel Multilayer Perceptrons (MLPs) with a EM-training rule. Moreover, the training relies on matching the output vectors of the MLPs against a statistical distribution as alternative loss function because of the unlabeled data. In addition, the output classification of one network is used as target of the other network, and vice versa for learning the agreement between both unlabeled sample. We generate four classless datasets based on MNIST, where the input is two different instances of the same digit. Furthermore, our classless association model is evaluated against two scenarios: totally supervised and totally unsupervised. In the first scenario, our model reaches a good per...

Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

2020 25th International Conference on Pattern Recognition (ICPR), 2021

P ≈ NP, at least in Visual Question Answering

2020 25th International Conference on Pattern Recognition (ICPR), 2021

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by... more In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely-used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38 % of the questions, while the remaining 62 % are spread over the more than 3000 remaining answers. While several sources of biases have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare it to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (N P) feature spaces are strongly aligned, hence the expression P ≈ N P .

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

2020 25th International Conference on Pattern Recognition (ICPR), 2021

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world ... more Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm, and optimize over a single objective (e.g., minimization of the cross-entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of a particular pattern. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL) models. A SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability of using SSAL branches during inference, allow models to converge faster, focusing on a richer set of class-relevant features. We show that SSAL models consistently outperform the state-of-the-art while also providing structured predictions that are more interpretable.

Symbolic Association Using Parallel Multilayer Perceptron

Lecture Notes in Computer Science, 2016

The goal of our paper is to learn the association and the semantic grounding of two sensory input... more The goal of our paper is to learn the association and the semantic grounding of two sensory input signals that represent the same semantic concept. The input signals can be or cannot be the same modality. This task is inspired by infants learning. We propose a novel framework that has two symbolic Multilayer Perceptron (MLP) in parallel. Furthermore, both networks learn to ground semantic concepts and the same coding scheme for all semantic concepts in both networks. In addition, the training rule follows EM-approach. In contrast, the traditional setup of association task pre-defined the coding scheme before training. We have tested our model in two cases: mono-and multi-modal. Our model achieves similar accuracy association to MLPs with pre-defined coding schemes.

Sequential Spatial Transformer Networks for Salient Object Classification

Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods

Hitchhiker's Guide to Super-Resolution: Introduction and Recent Advances

IEEE Transactions on Pattern Analysis and Machine Intelligence

With the advent of Deep Learning (DL), Super-Resolution (SR) has also become a thriving research ... more With the advent of Deep Learning (DL), Super-Resolution (SR) has also become a thriving research area. However, despite promising results, the field still faces challenges that require further research, e.g., allowing flexible upsampling, more effective loss functions, and better evaluation metrics. We review the domain of SR in light of recent advances and examine state-of-the-art models such as diffusion (DDPM) and transformer-based SR models. We critically discuss contemporary strategies used in SR and identify promising yet unexplored research directions. We complement previous surveys by incorporating the latest developments in the field, such as uncertainty-driven losses, wavelet networks, neural architecture search, novel normalization methods, and the latest evaluation techniques. We also include several visualizations for the models and methods throughout each chapter to facilitate a global understanding of the trends in the field. This review ultimately aims at helping researchers to push the boundaries of DL applied to SR.

$Research paper thumbnail of P $\approx$ NP, at least in Visual Question Answering$

P $\approx$ NP, at least in Visual Question Answering

arXiv (Cornell University), Mar 26, 2020

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by... more In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely-used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38 % of the questions, while the remaining 62 % are spread over the more than 3000 remaining answers. While several sources of biases have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare it to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (N P) feature spaces are strongly aligned, hence the expression P ≈ N P .

Adversarial Defense based on Structure-to-Signal Autoencoders

2020 IEEE Winter Conference on Applications of Computer Vision (WACV)

Adversarial attacks have exposed the intricacies of the complex loss surfaces approximated by neu... more Adversarial attacks have exposed the intricacies of the complex loss surfaces approximated by neural networks. In this paper, we present a defense strategy against gradientbased attacks, on the premise that input gradients need to expose information about the semantic manifold for attacks to be successful. We propose an architecture based on compressive autoencoders (AEs) with a two-stage training scheme, creating not only an architectural bottleneck but also a representational bottleneck. We show that the proposed mechanism yields robust results against a collection of gradient-based attacks under challenging white-box conditions. This defense is attack-agnostic and can, therefore, be used for arbitrary pre-trained models, while not compromising the original performance. These claims are supported by experiments conducted with state-of-the-art image classifiers (ResNet50 and Inception v3), on the full ImageNet validation set. Experiments, including counterfactual analysis, empirically show that the robustness stems from a shift in the distribution of input gradients, which mitigates the effect of tested adversarial attack methods. Gradients propagated through the proposed AEs represent less semantic information and instead point to low-level structural features.

What do Deep Networks Like to See?

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

We propose a novel way to measure and understand convolutional neural networks by quantifying the... more We propose a novel way to measure and understand convolutional neural networks by quantifying the amount of input signal they let in. To do this, an autoencoder (AE) was fine-tuned on gradients from a pre-trained classifier with fixed parameters. We compared the reconstructed samples from AEs that were fine-tuned on a set of image classifiers (AlexNet, VGG16, ResNet-50, and Inception v3) and found substantial differences. The AE learns which aspects of the input space to preserve and which ones to ignore, based on the information encoded in the backpropagated gradients. Measuring the changes in accuracy when the signal of one classifier is used by a second one, a relation of total order emerges. This order depends directly on each classifier's input signal but it does not correlate with classification accuracy or network size. Further evidence of this phenomenon is provided by measuring the normalized mutual information between original images and auto-encoded reconstructions from different fine-tuned AEs. These findings break new ground in the area of neural network understanding, opening a new way to reason, debug, and interpret their results. We present four concrete examples in the literature where observations can now be explained in terms of the input signal that a model uses.

XAI Handbook: Towards a Unified Framework for Explainable AI

2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021

Self-supervised Test-time Adaptation on Video Data

2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

In typical computer vision problems revolving around video data, pre-trained models are simply ev... more In typical computer vision problems revolving around video data, pre-trained models are simply evaluated at test time, without adaptation. This general approach clearly cannot capture the shifts that will likely arise between the distributions from which training and test data have been sampled. Adapting a pre-trained model to a new video encountered at test time could be essential to avoid the potentially catastrophic effects of such shifts. However, given the inherent impossibility of labeling data only available at testtime, traditional "fine-tuning" techniques cannot be leveraged in this highly practical scenario. This paper explores whether the recent progress in test-time adaptation in the image domain and self-supervised learning can be leveraged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts. In our experiments, we show that test-time adaptation approaches applied to self-supervised methods are always beneficial, but also that the extent of their effectiveness largely depends on the specific combination of the algorithms used for adaptation and self-supervision, and also on the type of covariate shift taking place.

Classless Association Using Neural Networks

The goal of this paper is to train a model based on the relation between two instances that repre... more The goal of this paper is to train a model based on the relation between two instances that represent the same unknown class. This task is inspired by the Symbol Grounding Problem and the association learning between modalities in infants. We propose a novel model called Classless Association that has two parallel Multilayer Perceptrons (MLPs) with a EM-training rule. Moreover, the training relies on matching the output vectors of the MLPs against a statistical distribution as alternative loss function because of the unlabeled data. In addition, the output classification of one network is used as target of the other network, and vice versa for learning the agreement between both unlabeled sample. We generate four classless datasets based on MNIST, where the input is two different instances of the same digit. Furthermore, our classless association model is evaluated against two scenarios: totally supervised and totally unsupervised. In the first scenario, our model reaches a good per...

Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

2020 25th International Conference on Pattern Recognition (ICPR), 2021

P ≈ NP, at least in Visual Question Answering

2020 25th International Conference on Pattern Recognition (ICPR), 2021

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by... more In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely-used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38 % of the questions, while the remaining 62 % are spread over the more than 3000 remaining answers. While several sources of biases have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare it to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (N P) feature spaces are strongly aligned, hence the expression P ≈ N P .

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

2020 25th International Conference on Pattern Recognition (ICPR), 2021

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world ... more Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm, and optimize over a single objective (e.g., minimization of the cross-entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of a particular pattern. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL) models. A SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability of using SSAL branches during inference, allow models to converge faster, focusing on a richer set of class-relevant features. We show that SSAL models consistently outperform the state-of-the-art while also providing structured predictions that are more interpretable.

Symbolic Association Using Parallel Multilayer Perceptron

Lecture Notes in Computer Science, 2016

The goal of our paper is to learn the association and the semantic grounding of two sensory input... more The goal of our paper is to learn the association and the semantic grounding of two sensory input signals that represent the same semantic concept. The input signals can be or cannot be the same modality. This task is inspired by infants learning. We propose a novel framework that has two symbolic Multilayer Perceptron (MLP) in parallel. Furthermore, both networks learn to ground semantic concepts and the same coding scheme for all semantic concepts in both networks. In addition, the training rule follows EM-approach. In contrast, the traditional setup of association task pre-defined the coding scheme before training. We have tested our model in two cases: mono-and multi-modal. Our model achieves similar accuracy association to MLPs with pre-defined coding schemes.