2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
In this work we demonstrate an improvement in state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information in addition to the traditional audio information. We take a decision fusion approach to the audio-visual information, where the single-modality (audio- and visual-only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio results only).
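A minimal sketch of the stream-weighted decision fusion this abstract describes, assuming per-hypothesis HMM log-likelihoods from each modality. The function names, the sigmoid reliability-to-weight mapping, and the `sharpness` parameter are illustrative assumptions, not the paper's estimators (and the discriminative model combination variant is not sketched at all):

```python
import numpy as np

def fuse_log_likelihoods(log_lik_audio, log_lik_visual, audio_weight):
    """Stream-weighted decision fusion of per-hypothesis HMM log-likelihoods."""
    lam = float(np.clip(audio_weight, 0.0, 1.0))
    return lam * np.asarray(log_lik_audio) + (1.0 - lam) * np.asarray(log_lik_visual)

def weight_from_reliability(audio_reliability, sharpness=5.0):
    """Map an audio-stream reliability estimate in [0, 1] to the audio weight.

    Hypothetical sigmoid mapping; the paper estimates this relationship
    automatically rather than fixing it in closed form.
    """
    return 1.0 / (1.0 + np.exp(-sharpness * (audio_reliability - 0.5)))

# Example: with a mid-level reliability estimate, rank the fused scores
# and keep the best-scoring hypothesis.
lam = weight_from_reliability(0.7)
fused = fuse_log_likelihoods([-12.0, -15.5, -14.1], [-10.2, -9.8, -13.0], lam)
best_hypothesis = int(np.argmax(fused))
```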
Journal of Signal Processing Systems, 2011
Audiovisual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. An important issue in a decision fusion based AVSR system is the determination of an appropriate integration weight for the speech modalities, so that their integration ensures good performance under various SNR conditions. Generally, the integration weight is calculated from the relative reliability of the two modalities. This paper investigates the effect of the reliability measure on integration weight estimation and proposes a genetic algorithm (GA) based reliability measure which uses an optimum number of best recognition hypotheses, rather than the N best recognition hypotheses, to determine an appropriate integration weight. Further improvement in recognition accuracy is achieved by optimizing the above integration weight with a genetic algorithm. The performance of the proposed integration weight estimation scheme is demonstrated for isolated word recognition (incorporating commonly used functions in mobile phones) via a multi-speaker database experiment. The results show that the proposed schemes improve robust recognition accuracy over the conventional unimodal systems and a couple of related existing bimodal systems, namely the baseline reliability ratio-based system and the N best recognition hypotheses reliability ratio-based system, under various SNR conditions.
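A hedged sketch of reliability-ratio weight estimation from the k best hypothesis scores, as the abstract outlines. The dispersion formula and exposing `k` as a plain parameter (rather than a GA-optimised value) are assumptions for illustration:

```python
import numpy as np

def kbest_reliability(log_likelihoods, k):
    """Dispersion-based stream reliability from the k best hypothesis scores.

    Computed here as the mean gap between the top log-likelihood and the next
    k-1 best ones (k >= 2 assumed); a larger gap suggests a more confident,
    hence more reliable, stream. The paper searches for a good k with a
    genetic algorithm; here k is simply fixed.
    """
    scores = np.sort(np.asarray(log_likelihoods, dtype=float))[::-1][:k]
    return float(np.mean(scores[0] - scores[1:]))

def integration_weight(rel_audio, rel_visual):
    """Reliability-ratio integration weight applied to the audio stream."""
    return rel_audio / (rel_audio + rel_visual)
```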
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
We investigate the fusion of audio and video a posteriori phonetic probabilities in a hybrid ANN/HMM audio-visual speech recognition system. Three basic conditions to the fusion process are stated and implemented in a linear and a geometric weighting scheme. These conditions are the assumption of conditional independence of the audio and video data and the contribution of only one of the two paths when the SNR is very high or very low, respectively. In the case of the geometric weighting a new weighting scheme is developed whereas the linear weighting follows the Full Combination approach as employed in multi-stream recognition. We compare these two new concepts in audio-visual recognition to a rather standard approach known from the literature. Recognition tests were performed in a continuous number recognition task on a single speaker database containing 1712 utterances with two different types of noise added.
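The two generic weighting schemes the abstract contrasts can be sketched as follows, assuming per-frame class posteriors from the audio and video ANNs; the paper's actual Full Combination scheme sums over stream subsets and is not reproduced here:

```python
import numpy as np

def linear_fusion(p_audio, p_visual, lam):
    """Linear (weighted-sum) combination of audio and video class posteriors."""
    p = lam * np.asarray(p_audio) + (1.0 - lam) * np.asarray(p_visual)
    return p / p.sum()

def geometric_fusion(p_audio, p_visual, lam):
    """Geometric (weighted-product) combination, renormalised over the classes."""
    p = np.asarray(p_audio) ** lam * np.asarray(p_visual) ** (1.0 - lam)
    return p / p.sum()
```

Setting `lam` close to 1 recovers the audio-only path and close to 0 the video-only path, which matches the boundary conditions stated in the abstract for very high and very low SNR.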
A new, simple and practical way of fusing audio and visual information to enhance audiovisual automatic speech recognition within the framework of an application of large-vocabulary speech recognition of French Canadian speech is presented, and the experimental methodology is described in detail. The visual information about mouth shape is extracted off-line using a cascade of weak classifiers and a Kalman filter, and is combined with the large-vocabulary speech recognition system of the Centre de Recherche Informatique de Montréal. The visual classification is performed by a pair-wise kernel-based linear discriminant analysis (KLDA) applied on a principal component analysis (PCA) subspace, followed by a binary combination and voting algorithm on 35 French phonetic classes. Three fusion approaches are compared: (1) standard low-level feature-based fusion, (2) decision-based fusion within the framework of the transferable belief model (an interpretation of the Dempster-Shafer evidential theory), and (3) a combination of (1) and (2). For decision-based fusion, the audio information is considered to be a precise Bayesian source, while the visual information is considered an imprecise evidential source. This treatment ensures that the visual information does not significantly degrade the audio information in situations where the audio performs well (e.g., a controlled noise-free environment). Results show significant improvement in the word error rate to a level comparable to that of more sophisticated systems. To the authors' knowledge, this work is the first to address large-vocabulary audiovisual recognition of French Canadian speech and decision-based audiovisual fusion within the transferable belief model.
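As a rough illustration of the decision-level step, the sketch below applies the transferable belief model's unnormalised conjunctive rule to two mass functions over phonetic classes. The data representation (frozensets of class labels) is an assumption, and the paper's construction of the precise audio mass and the imprecise visual mass is not reproduced:

```python
from itertools import product

def conjunctive_combination(m_audio, m_visual):
    """Unnormalised (TBM) conjunctive combination of two mass functions.

    Each mass function is a dict mapping frozensets of phonetic classes to
    belief mass. A Bayesian (precise) source puts all mass on singletons; an
    evidential (imprecise) source may put mass on larger sets. Mass landing
    on the empty set represents conflict, which the TBM keeps unnormalised.
    """
    combined = {}
    for (a, ma), (v, mv) in product(m_audio.items(), m_visual.items()):
        inter = a & v
        combined[inter] = combined.get(inter, 0.0) + ma * mv
    return combined

# Tiny example over three hypothetical phonetic classes.
audio = {frozenset({"a"}): 0.7, frozenset({"o"}): 0.2, frozenset({"u"}): 0.1}
visual = {frozenset({"a", "o"}): 0.6, frozenset({"a", "o", "u"}): 0.4}
fused = conjunctive_combination(audio, visual)
```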
sfhmmy.ntua.gr
Most Automatic Speech Recognition (ASR) systems only use speech features extracted from the speaker's audio signal. The performance of such audio-only speech recognizers heavily degrades whenever the audio signal is not ideal, for example in environments with heavy acoustic noise. One recent approach to robust speech recognition in such adverse conditions is to also utilize in ASR systems visual speech related features extracted from videos capturing the speaker's face. This approach to robust ASR is inspired from the audio-visual mechanisms also present in human speech recognition. The purpose of this paper is twofold: (1) To make a short introduction to the field of audio-visual speech recognition and highlight the research challenges in the area; and (2) to summarize our recent research in the problem of adaptive audio-visual fusion.
2004
This paper looks into the information fusion problem in the context of audio-visual speech recognition. Existing approaches to audio-visual fusion typically address the problem in either the feature domain or the decision domain. In this work, we consider a hybrid approach that aims to take advantage of both the feature fusion and the decision fusion methodologies. We introduce a general formulation to facilitate information fusion at multiple stages, followed by an experimental study of a set of fusion schemes allowed by the framework. The proposed method is implemented on a real-time audio-visual speech recognition system, and evaluated on connected digit recognition tasks under varying acoustic conditions. The results show that the multistage fusion system consistently achieves lower word error rates than the reference feature fusion and decision fusion systems. It is further shown that removing the audio-only channel from the multistage system leads to only minimal degradations in recognition performance while providing a noticeable reduction in computational load.
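A minimal sketch of the multistage idea, assuming a feature-fusion stream (scored on concatenated audio-visual features) is combined with the single-modality streams at the decision level. The stream names and the linear log-score combination are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multistage_fusion(scores_audio, scores_visual, scores_joint, weights):
    """Combine audio-only, visual-only, and feature-fusion ("joint") streams
    at the decision level with weights summing to 1 (hypothetical names).

    Each score argument is a vector of per-hypothesis log scores; dropping
    the audio-only stream corresponds to setting its weight to zero.
    """
    w_a, w_v, w_j = weights
    return (w_a * np.asarray(scores_audio)
            + w_v * np.asarray(scores_visual)
            + w_j * np.asarray(scores_joint))
```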
2004
Audio-Visual Speech Recognition (AVSR) uses vision to enhance speech recognition, but it also introduces the problem of how to join (or fuse) the two signals. Mainstream research achieves this using a weighted product of the outputs of the phoneme classifiers for both modalities. This paper analyses current weighting measures and compares them to several new measures proposed by the authors. Most importantly, when calculating the dispersion of the output there is a shift from analysing the variance to analysing the skewness of the distribution. Experiments in AVSR using neural networks raise questions about the utility of such measures, with some intriguing results.
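A small sketch of dispersion-based reliability over a classifier's posterior output, contrasting the conventional variance with the skewness measure the abstract moves towards; the exact normalisation the paper uses is not reproduced, so treat this as illustrative only:

```python
import numpy as np

def dispersion_reliability(posteriors, measure="skewness"):
    """Reliability of one modality from the shape of its posterior vector.

    A sharply peaked output (one phoneme dominating) suggests a confident
    classifier. "variance" is the conventional dispersion measure;
    "skewness" is the alternative discussed in the abstract.
    """
    p = np.asarray(posteriors, dtype=float)
    if measure == "variance":
        return float(np.var(p))
    if measure == "skewness":
        mu, sigma = p.mean(), p.std()
        if sigma == 0.0:
            return 0.0  # flat distribution: no usable dispersion signal
        return float(np.mean((p - mu) ** 3) / sigma ** 3)
    raise ValueError(f"unknown measure: {measure}")
```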
The Journal of the Acoustical Society of America, 1990
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech is one such source for making large improvements in high noise environments, with the potential of channel and task independence. It is not affected by the acoustic environment and noise, and it possibly contains the greatest amount of complementary information to the acoustic signal. In this workshop, our goal was to advance the state of the art in ASR by demonstrating the use of visual information, in addition to the traditional audio, for large vocabulary continuous speech recognition (LVCSR). Starting with an appropriate audio-visual database, collected and provided by IBM, we demonstrated for the first time that LVCSR performance can be improved by the use of visual information in the clean audio case. Specifically, by conducting audio lattice rescoring experiments, we showed a 7% relative word error rate (WER) reduction in that condition. Furthermore, for the harder problem of speech contaminated by "speech babble" noise at 10 dB SNR, we demonstrated that recognition performance can be improved by 27% in relative WER reduction, compared to an equivalent audio-only recognizer matched to the noise environment. We believe that this paves the way to seriously address the challenge of speech recognition in high noise environments and to potentially achieve human levels of performance. In this report, we detail a number of approaches and experiments conducted during the summer workshop in the areas of visual feature extraction, hidden Markov model based visual-only recognition, and audio-visual information fusion. The latter was our main concentration: in the workshop, a number of feature fusion as well as decision fusion techniques for audio-visual ASR were explored and compared.
EURASIP Journal on Advances in Signal Processing, 2002
It has been shown that integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
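The mapping step the abstract refers to, from a measured audio reliability to the free parameter of the Separate Integration fusion, could be sketched as a simple monotone function. The piecewise-linear form and the dB endpoints below are assumptions; the paper derives its mapping empirically from the compared reliability criteria:

```python
import numpy as np

def snr_to_audio_weight(snr_db, lo_db=0.0, hi_db=20.0):
    """Hypothetical piecewise-linear mapping from an estimated SNR (dB) to the
    audio stream weight of the SI architecture: rely on video alone below
    lo_db and on audio alone above hi_db, interpolating linearly in between.
    """
    return float(np.clip((snr_db - lo_db) / (hi_db - lo_db), 0.0, 1.0))
```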
2000
A major goal of current speech recognition research is to improve the robustness of recognition systems used in noisy environments. Recent strides in computing technology have allowed consideration of systems that use visual information to augment the decision capability of the recognizer, allowing superior performance in these difficult environments. A crucial area of research in audio-visual speech