5th International Conference on Spoken Language Processing (ICSLP 1998)
This paper proposes a novel architecture that integrates a recurrent neural network (RNN) based compensation process and a hidden Markov model (HMM) based speech recognition process into a unified framework. The RNN is employed to estimate the additive bias, which represents the telephone channel effect, in the cepstral domain. Telephone channel effects are compensated by subtracting the additive bias from the cepstral coefficients of the input utterance. The integrated recognition system is trained with the MCE/GPD (minimum classification error/generalized probabilistic descent) method, using an objective function designed to minimize recognition error rates. Experimental results for speaker-independent Mandarin polysyllabic word recognition show an error rate reduction of 21.5% compared to the baseline system.
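As a rough illustration of the compensation step described in this abstract (subtracting an additive bias from the cepstral coefficients), the following is a minimal sketch. It is not the authors' system: the RNN bias estimator is replaced here by a plain per-utterance mean, and the names `estimate_bias` and `compensate` are hypothetical.

```python
def estimate_bias(frames):
    """Stand-in bias estimator (assumption): the per-coefficient mean
    over the utterance. The abstract uses an RNN for this step."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[k] for frame in frames) / n for k in range(dim)]


def compensate(frames, bias):
    """Subtract the additive bias from every cepstral frame."""
    return [[c - b for c, b in zip(frame, bias)] for frame in frames]


# Toy utterance: two frames of two cepstral coefficients each.
utterance = [[1.0, 2.0], [3.0, 6.0]]
bias = estimate_bias(utterance)
cleaned = compensate(utterance, bias)
```

Only the subtraction itself comes from the abstract; any estimator that returns one bias value per cepstral coefficient could be substituted for the mean.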
The reliability of a listening test design strongly influences the performance of the quality estimation model. In this paper we compare four listening test designs by Monte Carlo simulation. Three common problems of interval-scale ratings are included in the simulation, and their influence on estimating the underlying true quality is investigated. Among these methods, randomly choosing partial trials for Scaled Comparison could be the most reliable way to perform a listening test when interval-scale rating problems are present.
A novel scheme for allocating a variable number of pulses to each frame is proposed to reduce the bit-rate of MPE and CELP coders while maintaining the same speech quality. Since the speech signal is not stationary, the number of pulses required by a speech coder should vary from frame to frame. In this paper we approximate the optimal pulse allocation with a greedy search algorithm whose criterion is the perceptual disturbance value derived from PESQ analysis. In the experiments the proposed scheme was used to reduce the pulse numbers of two standard speech coders, G.723.1 and MPEG-4 CELP. The results show that the proposed scheme achieves over 30% bit-rate reduction in the fixed codebook (FCB) and about 20% overall for both coders while maintaining the same speech quality in both objective and subjective measures. We also designed several methods to accelerate the optimal search, which reduce the execution time by a factor of 120 in the best case.
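The greedy idea in this abstract can be sketched as follows. This is a sketch under stated assumptions, not the published algorithm: the `disturbance` callback stands in for the PESQ-derived perceptual disturbance value, and the function name, the `min_pulses` floor, and the toy cost model are all hypothetical.

```python
def greedy_pulse_allocation(max_pulses, budget, disturbance, min_pulses=1):
    """Repeatedly remove one pulse from the frame where the removal
    raises the disturbance the least, until the total fits the budget.

    max_pulses  -- starting pulse count for each frame
    budget      -- total number of pulses allowed over all frames
    disturbance -- hypothetical callback: disturbance(frame, n) is the
                   cost of coding `frame` with n pulses
    """
    alloc = list(max_pulses)
    while sum(alloc) > budget:
        best_frame, best_increase = None, None
        for frame, n in enumerate(alloc):
            if n <= min_pulses:
                continue  # this frame cannot give up another pulse
            increase = disturbance(frame, n - 1) - disturbance(frame, n)
            if best_increase is None or increase < best_increase:
                best_frame, best_increase = frame, increase
        if best_frame is None:
            break  # every frame is already at the floor
        alloc[best_frame] -= 1
    return alloc


# Toy cost model (assumption): disturbance falls as pulses are added,
# scaled by a per-frame weight.
weights = [1.0, 4.0, 0.25]


def toy_disturbance(frame, n):
    return weights[frame] / (n + 1)


allocation = greedy_pulse_allocation([4, 4, 4], 6, toy_disturbance)
```

With this toy model the heavily weighted middle frame keeps all of its pulses while the other two frames give theirs up, which mirrors the motivation that a non-stationary signal needs a different pulse count in each frame.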
Modern Chinese text contains not only Chinese characters but also non-Chinese characters such as figures, abbreviations, and marks. A properly designed text preprocessor is therefore required for a Chinese text-to-speech (TTS) system in practical use. In this paper, a Chinese TTS system with a text preprocessor for markup command processing, text normalization, and sentence segmentation is presented. A novel idea, confidence measures for TTS systems, is also proposed.
This research aims to design an auditory-only in-vehicle speech system, named Talking Car Novice Mode, that provides elicitation even a novice can easily handle. In this study, 19 participants were asked to use radio and music functions in two in-vehicle speech systems, the original Talking Car and Talking Car Novice Mode, while driving through a virtual world. Secondary task performance, the amount of time spent on tasks, and the number of calls to the help function were recorded by a camera. The annoyance scores of sentences, a NASA-TLX questionnaire, and a subjective questionnaire were completed after the test. The results indicated no significant difference between driving with and without tasks in either the reaction time for slamming the brake or the number of times users called for help. In addition, the learning curve of Talking Car Novice Mode is steep, showing that Talking Car Novice Mode provides enough elicitation to novices. Hence, Talking Car Novice Mode is expected to be friendlier and safer than the original Talking Car in-vehicle speech system for a novice user.
2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
TD-PSOLA is one of the most widely used prosodic modification techniques. However, perceptible distortions are occasionally introduced, and how TD-PSOLA affects speech quality has not been fully understood or controlled. In this paper, we present a method that estimates quality before modification is performed. By exploiting the relationship between prosodic modifications and subjective scores, 27 distance measures are proposed and respective performances are
IEEE International Conference on Acoustics Speech and Signal Processing, 2002
This paper presents a new method for automatically selecting speech segments that are expected to minimize perceptual distortion in synthesis. The method is based on comparing candidates that are fully prosody-aligned to each other. Automatic segmentation, pitch marking, and the PSOLA method work together for prosody alignment. Two distance measures, MFCC and PSQM, are used for the comparison because of human perceptual considerations. Experiments show that the average distortion when using the selected best unit in outside testing is similar to that in the training corpus, with only a few exceptions. The symmetry characteristics and correlation of these two distance measures are also studied, revealing that both are properly symmetric and consistent with each other in most cases.
2005 12th IEEE International Conference on Electronics, Circuits and Systems, 2005
Special application domains impose various specific limitations, such as storage and vocabularies, on synthesizers. This work proposes a systematic approach to developing special-domain corpus-based speech synthesizers. An optimal recording script can be generated according to configurable application requirements. Moreover, the corresponding unit inventory and synthesizer can be constructed with minimal user intervention. A route guidance example is also presented to
[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992
A novel spectral coding method, two-dimensional differential line spectra pair coding (2DdLSP), is proposed. Taking advantage of the strong inter-frame and intra-frame correlation of LSP parameters, a two-dimensional linear prediction technique is used to reduce the variance of the parameters to be quantized. One scalar quantization scheme and two vector quantization schemes are designed to quantize the 2-D prediction residuals. Without
The Journal of the Acoustical Society of America, 2010
A method of speech segment selection for concatenative synthesis based on a prosody-aligned distance measure is disclosed. This method is based on comparing speech segments taken from a speech corpus, wherein the speech segments are fully prosody-aligned to each other before distortion is measured. With prosody alignment embedded in the selection process, distortion resulting from possible prosody modification in synthesis can be taken into account objectively in the selection phase. In order to carry out the purpose of the present invention, automatic segmentation, pitch marking, and the PSOLA method work together for prosody alignment. Two distortion measures, MFCC and PSQM, are used for comparing two prosody-aligned segments of speech because of human perceptual considerations.
IEEE Transactions on Speech and Audio Processing, 1995
This correspondence proposes a new CELP coding method which embeds speech classification in the adaptive codebook search. This approach can retain the synthesized speech quality at bit-rates below 4 kb/s. A pitch analyzer is designed to classify each frame by its periodicity, and a finite-state machine then determines one of four states. The adaptive codebook search scheme is switched according to the state. Simulation results show that higher SEGSNR and lower computational complexity can be achieved, and that the pitch contour of the synthesized speech is smoother than that produced by conventional CELP coders.
Papers by Carl Kuo