2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020
Generative models have recently shown the ability to generate realistic data and model the underlying distribution accurately. However, jointly modeling an image with the attributes it is labeled with requires learning a cross-modal correspondence between image and attribute data. Although the information present in a set of images and in its attributes has completely different statistical properties, there exists an inherent correspondence that is challenging to capture. Various models have aimed to capture this correspondence, either through joint modeling in a variational autoencoder or through separate encoder networks whose outputs are then concatenated. We present an alternative by proposing a bridged variational autoencoder that learns the cross-modal correspondence by incorporating cross-modal hallucination losses in the latent space. In comparison to existing methods, we find that using a bridge connection in latent space not only yields better generation results but also produces a highly parameter-efficient model, with a 40% reduction in training parameters for a bimodal dataset and nearly a 70% reduction for a trimodal dataset. We validate the proposed method through comparison with state-of-the-art methods and benchmarking on standard datasets.
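The abstract does not specify the exact architecture, so the following is only a minimal sketch of how a bridged VAE with a cross-modal hallucination loss in the latent space might be wired up; the layer sizes, module names, and the MSE-based hallucination term are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgedVAE(nn.Module):
    """Illustrative bridged VAE: two modality encoders share a latent space
    linked by a 'bridge' network; a hallucination loss encourages the latent
    code of one modality to predict the other's (assumed formulation)."""

    def __init__(self, img_dim=784, attr_dim=40, z_dim=32):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
        self.enc_attr = nn.Sequential(nn.Linear(attr_dim, 64), nn.ReLU(), nn.Linear(64, 2 * z_dim))
        self.bridge = nn.Linear(z_dim, z_dim)   # maps attribute latent -> image latent
        self.dec_img = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
        self.dec_attr = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, attr_dim))

    @staticmethod
    def reparam(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

    def forward(self, img, attr):
        z_img, mu_i, lv_i = self.reparam(self.enc_img(img))
        z_attr, mu_a, lv_a = self.reparam(self.enc_attr(attr))
        # cross-modal hallucination: the bridged attribute latent should land near the image latent
        halluc = F.mse_loss(self.bridge(z_attr), z_img.detach())
        kl = -0.5 * (1 + lv_i - mu_i.pow(2) - lv_i.exp()).sum(-1).mean() \
             - 0.5 * (1 + lv_a - mu_a.pow(2) - lv_a.exp()).sum(-1).mean()
        rec = F.mse_loss(self.dec_img(z_img), img) \
              + F.binary_cross_entropy_with_logits(self.dec_attr(z_attr), attr)  # binary attributes assumed
        return rec + kl + halluc
```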
2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), 2018
E-commerce is a trading trend carried out online. It has eased transactions for sellers as well as consumers, since no personal meeting is required, and this has eventually led to an increase in market competition. Users thus make use of recommendation systems to improve their experience. The hybrid recommendation approach employs both content-based filtering and collaborative filtering methods, which compute the similarity between the user profile and the product description. Experimental results show that the recommendations match the product description and the user profile, with a mean precision of 69.7% and a recall of 73.63% for this similarity.
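The abstract only outlines a hybrid of content-based and collaborative filtering driven by user-profile/product-description similarity; the snippet below is a minimal sketch of one way such a hybrid score could be computed. The TF-IDF representation, the cosine similarity, the weighting factor alpha, and all names are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(user_profile_text, product_descriptions, collab_scores, alpha=0.5):
    """Blend content-based similarity (user profile vs. product descriptions)
    with precomputed collaborative-filtering scores (assumed formulation)."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform([user_profile_text] + product_descriptions)
    content_sim = cosine_similarity(tfidf[0], tfidf[1:]).ravel()  # user vs. each product
    return alpha * content_sim + (1 - alpha) * np.asarray(collab_scores)

# toy usage: rank two products for one user
products = ["wireless noise cancelling headphones", "stainless steel water bottle"]
scores = hybrid_scores("looking for headphones for travel", products, collab_scores=[0.8, 0.2])
print(scores.argsort()[::-1])  # product indices ordered by hybrid score
```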
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
Understanding the relationship between auditory and visual signals is crucial for many different applications, ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. However, this is challenging since the distributions of both the audio and visual modalities are inherently multimodal. Therefore, most existing methods ignore the multimodal aspect and assume that there exists only a deterministic one-to-one mapping between the two modalities. This can lead to low-quality predictions, as the model collapses to optimizing the average behavior rather than learning the full data distribution. In this paper, we present a stochastic model for generating speech from a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal given the visual signal. We demonstrate the performance of our model on the GRID dataset using standard benchmarks.
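The abstract describes the model only at a high level (recurrent networks plus a variational generative model of audio conditioned on video), so the following is a minimal sketch of one such conditional recurrent VAE. The feature dimensions, the per-step prior/posterior, and the module layout are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class StochasticVideoToSpeech(nn.Module):
    """Illustrative conditional recurrent VAE: an LSTM encodes video frame
    features, a per-step latent variable injects stochasticity, and a decoder
    LSTM predicts audio features (assumed layer sizes and feature dims)."""

    def __init__(self, vid_dim=512, aud_dim=80, z_dim=32, hid=256):
        super().__init__()
        self.video_rnn = nn.LSTM(vid_dim, hid, batch_first=True)
        self.posterior = nn.Linear(hid + aud_dim, 2 * z_dim)  # q(z_t | video, audio), used in training
        self.prior = nn.Linear(hid, 2 * z_dim)                # p(z_t | video), used at generation time
        self.decoder = nn.LSTM(hid + z_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, aud_dim)

    def forward(self, video_feats, audio_feats=None):
        h, _ = self.video_rnn(video_feats)                    # (B, T, hid)
        if audio_feats is not None:
            stats = self.posterior(torch.cat([h, audio_feats], dim=-1))
        else:
            stats = self.prior(h)
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        dec, _ = self.decoder(torch.cat([h, z], dim=-1))
        return self.out(dec), mu, logvar                      # predicted audio features

# sampling diverse speech for one silent clip (toy dimensions)
model = StochasticVideoToSpeech()
video = torch.randn(1, 75, 512)                               # e.g. 75 frames of visual features
samples = [model(video)[0] for _ in range(3)]                 # three stochastic audio predictions
```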
The ability to envisage the visual of a talking face based just on hearing a voice is a unique human capability. A number of recent works have addressed this ability. We differ from these approaches by enabling a variety of talking-face generations from a single audio input. Indeed, a system that can generate only a single talking face would be almost robotic in nature. In contrast, our unsupervised stochastic audio-to-video generation model allows for diverse generations from a single audio input. In particular, we present an unsupervised stochastic audio-to-video generation model that can capture multiple modes of the video distribution, and we ensure that all the diverse generations are plausible. We do so through a principled multi-modal variational autoencoder framework. We demonstrate its efficacy on the challenging LRW and GRID datasets, achieving performance better than the baseline while having the ability to generate multiple diverse lip-synchronized videos.
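To make the diversity claim concrete, here is a minimal sketch of how an audio-conditioned latent prior could be sampled repeatedly to decode several distinct frame sequences for the same utterance. All dimensions, module choices, and names are assumptions for illustration; the paper's multi-modal VAE is not reproduced here.

```python
import torch
import torch.nn as nn

class StochasticAudioToVideo(nn.Module):
    """Illustrative audio-to-video sampler: an audio encoder conditions a
    latent prior, and each latent draw is decoded into a different plausible
    frame sequence (all dimensions and module choices are assumed)."""

    def __init__(self, aud_dim=80, z_dim=64, hid=256, frame_dim=64 * 64):
        super().__init__()
        self.audio_rnn = nn.GRU(aud_dim, hid, batch_first=True)
        self.prior = nn.Linear(hid, 2 * z_dim)            # p(z | audio)
        self.frame_dec = nn.GRU(hid + z_dim, hid, batch_first=True)
        self.to_frame = nn.Linear(hid, frame_dim)

    def sample(self, audio_feats, n_samples=3):
        h, _ = self.audio_rnn(audio_feats)                # (B, T, hid)
        mu, logvar = self.prior(h).chunk(2, dim=-1)
        videos = []
        for _ in range(n_samples):                        # each latent draw yields a distinct video
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            dec, _ = self.frame_dec(torch.cat([h, z], dim=-1))
            videos.append(torch.sigmoid(self.to_frame(dec)))  # (B, T, frame_dim) in [0, 1]
        return videos

model = StochasticAudioToVideo()
audio = torch.randn(1, 75, 80)                            # toy audio features for one utterance
diverse_videos = model.sample(audio, n_samples=3)
```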
International Journal of Computer Sciences and Engineering, 2018
For public sector units, NGOs/NPOs, and various governance systems, the purchasing function is often far down the priority list and is still handled through manual, conventional processing. With appropriate techniques, the financial functions of these organizations can save a large amount of time and money. An efficient purchase system is central to this article; it is often overlooked when automating a typical UTD system, yet it offers the largest potential payback in purchasing.