Papers by Panikos Heracleous
In this study, the use of a silicon NAM (Non-Audible Murmur) microphone in automatic speech recognition is presented. NAM microphones are special acoustic sensors which are attached behind the talker's ear and can capture not only normal (audible) speech but also very quietly uttered speech (non-audible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones are robust against noise and might be used in special systems (speech recognition, speech conversion, etc.) for sound-impaired people. Using a small amount of training data and adaptation approaches, 93.9% word accuracy was achieved on a 20k-vocabulary Japanese dictation task. Non-audible murmur recognition in noisy environments is also investigated. In this study, further analysis of NAM speech has been made using distance measures between hidden Markov model (HMM) pairs. It has been shown that...
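The distance measures between HMM pairs mentioned above are often built from divergences between the models' Gaussian state distributions. As a minimal sketch (not code from the paper), the KL divergence between two diagonal-covariance Gaussians is a common building block for such HMM-state distances; the means and variances below are illustrative assumptions:

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two diagonal Gaussians,
    a typical ingredient of distance measures between HMM state pairs."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    # Per-dimension closed form, summed over dimensions.
    return 0.5 * float(np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    ))
```

Identical Gaussians give a distance of zero; shifting the mean of one state increases it, which is what makes the measure useful for comparing, e.g., NAM-trained and normal-speech HMMs.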
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
IEEE Transactions on Audio, Speech, and Language Processing, Aug 1, 2010
Analysis and recognition of NAM speech using HMM distances and visual information. Panikos Heracleous, Viet-Anh Tran, Takayuki Nagai, and Kiyohiro Shikano. IEEE Transactions on Audio, Speech, and Language Processing.
IEEE PES Innovative Smart Grid Technologies Europe, Oct 1, 2014
Interspeech, 2009
In order to recover the movements of usually hidden articulators such as the tongue or velum, we have developed a data-based speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by EMA, and MFCC + energy features from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the chain of phonemes together with their durations, information that is subsequently used by a trajectory formation procedure based on the articulatory part of the HMMs to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2 mm.
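The two-stream setup and the reported error metric can be sketched as follows. This is a hedged illustration, not the paper's code: the stream dimensions are assumptions, and only the frame pairing and the RMS reconstruction error (in mm) are shown:

```python
import numpy as np

def make_multistream_frames(ema_frames, mfcc_energy_frames):
    """Pair the two synchronous streams frame by frame, as in multistream
    HMM training: EMA articulatory coordinates + MFCC/energy acoustics."""
    assert len(ema_frames) == len(mfcc_energy_frames), "streams must be synchronous"
    return [np.concatenate([a, m]) for a, m in zip(ema_frames, mfcc_energy_frames)]

def rms_error(reconstructed, measured):
    """RMS reconstruction error between synthesized and measured
    articulatory trajectories (same unit as the input, e.g. mm)."""
    diff = np.asarray(reconstructed, float) - np.asarray(measured, float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

A perfect reconstruction yields 0 mm; the 1.1 to 2 mm range quoted above would come from applying `rms_error` to the synthesized versus measured EMA trajectories.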
Annual Conference of the International Speech Communication Association, 2005
In this paper, we present the use of stethoscope and silicon NAM microphones in automatic speech recognition. NAM microphones are special acoustic sensors which are attached behind the talker's ear and can capture not only normal (audible) speech but also very quietly uttered speech (non-audible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired. Previously, we presented speech recognition experiments for non-audible murmur captured by a stethoscope microphone. In this paper, we also present recognition results using a more advanced NAM microphone, the so-called silicon NAM microphone. Using adaptation techniques and a small amount of training data, we achieved 93.9% word accuracy for non-audible murmur recognition. We also report experimental results demonstrating the effectiveness of a NAM microphone in noisy environments. In addition to a dictation task, we also present a keyword spotting experiment based on non-audible murmur.
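The word-accuracy figure quoted above follows the standard definition, counting substitutions, deletions, and insertions against the number of reference words. A minimal sketch (the counts below are illustrative, not the paper's):

```python
def word_accuracy(n_ref, subs, dels, ins):
    """Standard word accuracy in percent: 100 * (N - S - D - I) / N,
    where N is the number of reference words."""
    return 100.0 * (n_ref - subs - dels - ins) / n_ref
```

For example, 1000 reference words with 30 substitutions, 20 deletions, and 11 insertions would score 93.9%.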
Annual Conference of the International Speech Communication Association, 2005
The recognition of distant-talking speech in noisy and reverberant environments is a key issue in any speech recognition system. A so-called hands-free speech recognition system plays an important role in a natural and friendly human-machine interface. Considering the practical use of a speech recognition system, such a system also has to deal with the presence of multiple sound sources, including multiple talkers as well as other noise sources. This paper proposes a novel method which recognizes multiple talkers simultaneously in real environments by extending the 3-D Viterbi search to a 3-D N-best search algorithm. While the 3-D Viterbi method finds the most likely path in the 3-D trellis space, the proposed method considers multiple hypotheses for each direction in every frame. Combinations of the direction sequence and the phoneme sequence of multiple sources are included in the N-best list. The paper investigates the performance of the proposed method through experiments using real utterances of multiple talkers.
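The core idea of keeping multiple hypotheses per direction in every frame can be sketched with a much-simplified pruning step. This is a toy illustration under stated assumptions (scores are additive log-likelihoods, directions are a small discrete set, and HMM states are omitted), not the paper's decoder:

```python
import heapq

def nbest_step(hypotheses, frame_scores, n_best):
    """One frame of a simplified N-best search: extend each (score,
    direction_sequence) hypothesis by every candidate direction, then
    keep the N best hypotheses ending in each direction."""
    extended = [
        (score + frame_scores[d], seq + (d,))
        for score, seq in hypotheses
        for d in frame_scores
    ]
    pruned = []
    for d in frame_scores:
        ending_in_d = [h for h in extended if h[1][-1] == d]
        pruned.extend(heapq.nlargest(n_best, ending_in_d))
    return pruned
```

Unlike a plain Viterbi step, which would keep only the single best path per direction, setting `n_best > 1` retains alternative direction/phoneme combinations for the final N-best list.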
IEICE Transactions on Information and Systems
Speech is bimodal in nature and includes the audio and visual modalities. In addition to acoustic speech perception, speech can also be perceived using visual information provided by the mouth/face (i.e., automatic lipreading). In this study, visual speech production in noisy environments is investigated. The authors show that the Lombard effect plays an important role not only in audio speech but also in visual speech production. Experimental results show that when visual speech is produced in noisy environments, the visual parameters of the mouth/face change. As a result, the performance of a visual speech recognizer decreases.
In this article, automatic recognition of French Cued Speech based on hidden Markov models (HMM) is presented. Cued Speech is a visual system which uses handshapes in different positions, in combination with the lip patterns of speech, to make all the sounds of spoken language clearly understandable to deaf and hearing-impaired people. The aim of Cued Speech is to overcome the problems of lipreading and thus enable deaf children and adults to understand full spoken language. Automatic recognition of Cued Speech requires both lip shape and gesture recognition; in addition, the integration of the two modalities is of the greatest importance. In this study, the lip shape component is fused with the gesture component to realize Cued Speech recognition. Using concatenative feature fusion and multi-stream HMM decision fusion, vowel and consonant recognition experiments have been conducted. For vowel recognition, 87.6% vowel accuracy was obtained, showing a 61.3% relative improvement...
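The two fusion schemes named above differ in where the modalities are combined. As a hedged sketch (feature dimensions and the stream weight are illustrative assumptions, not values from the article): concatenative feature fusion stacks the per-frame feature vectors before modeling, while multi-stream decision fusion combines per-stream HMM log-likelihoods with stream weights:

```python
import numpy as np

def concatenative_feature_fusion(lip_features, hand_features):
    """Early fusion: stack lip and hand feature vectors into a single
    observation vector modeled by one HMM."""
    return np.concatenate([lip_features, hand_features])

def multistream_decision_fusion(lip_loglik, hand_loglik, lip_weight=0.6):
    """Late fusion: weighted combination of per-stream HMM log-likelihoods;
    the weight balances the reliability of the two modalities."""
    return lip_weight * lip_loglik + (1.0 - lip_weight) * hand_loglik
```

The stream weight is the practical knob in decision fusion: biasing it toward the more reliable modality (here, arbitrarily, the lips) changes which class wins when the streams disagree.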
Speech is the most natural means of communication for humans. However, in situations where audio speech is not available or cannot be perceived because of disabilities or adverse environmental conditions, people may resort to alternative methods such as augmented speech. Augmented speech is audio speech supplemented or replaced by other modalities, such as audiovisual speech or Cued Speech. Cued Speech is a visual communication mode which uses lipreading and handshapes placed in different positions to make spoken language wholly understandable to deaf individuals. The current study reports the authors' activities and progress in Cued Speech recognition for French. Previously, the authors reported experimental results for vowel and consonant recognition in Cued Speech for French in the case of a normal-hearing subject. The study has been extended by also employing a deaf cuer, and both cuer-dependent and multi-cuer experiments based on hidden Markov models (HMM) have been conducted...
Cued Speech is a visual communication mode which uses hand shapes and lip shapes to make all the sounds of spoken language clearly understandable to deaf and hearing-impaired people. Using Cued Speech, the problems of lipreading can be overcome, thus enabling deaf children and adults to understand full spoken language. Automatic recognition of Cued Speech requires lip shape recognition, gesture recognition, and integration of the two modalities. Previously, the authors have reported studies on vowel, consonant, and isolated word recognition in Cued Speech for French. In the current study, continuous phoneme recognition experiments are presented using data from a normal-hearing and a deaf cuer. In the case of the normal-hearing cuer, the phoneme correct obtained was 82.9%, and in the case of the deaf cuer, 81.5%. The results show that automatic recognition of Cued Speech performs similarly for normal-hearing and deaf cuers.