Carl Kuo

A senior researcher and technical director at ITRI.

Papers by Carl Kuo

Automatic speech segmentation and verification for concatenative synthesis

8th European Conference on Speech Communication and Technology (Eurospeech 2003)

... Chih-Chung Kuo, Chi-Shiang Kuo, Jau-Hung Chen, and Sen-Chia Chang ... 2. Syllable Segmentatio...

An RNN-based compensation method for Mandarin telephone speech recognition

5th International Conference on Spoken Language Processing (ICSLP 1998)

In this paper, a novel architecture, which integrates the recurrent neural network (RNN) based compensation process and the hidden Markov model (HMM) based speech recognition process into a unified framework, is proposed. The RNN is employed to estimate the additive bias, which represents the telephone channel effect, in the cepstral domain. Compensation of telephone channel effects is implemented by subtracting the additive bias from the cepstral coefficients of the input utterance. The integrated recognition system is trained with the MCE/GPD (minimum classification error/generalized probabilistic descent) method, using an objective function designed to minimize recognition error rates. Experimental results for speaker-independent Mandarin polysyllabic word recognition show an error rate reduction of 21.5% compared to the baseline system.
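The compensation step described in this abstract, subtracting an additive bias from the cepstral coefficients, can be illustrated with a minimal sketch. The function names and toy data below are hypothetical, and the bias estimator is a plain per-utterance mean standing in for the paper's RNN, which is not reproduced here.

```python
import numpy as np


def compensate(cepstra: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Subtract an additive channel bias from every frame of cepstral coefficients.

    cepstra has shape (frames, coefficients); bias has shape (coefficients,).
    """
    return cepstra - bias


def estimate_bias_mean(cepstra: np.ndarray) -> np.ndarray:
    """Stand-in bias estimator (NOT the paper's RNN): the mean of each
    coefficient over all frames, as in classic cepstral mean subtraction."""
    return cepstra.mean(axis=0)


# Toy data (assumed for illustration): 50 frames of 12 cepstral coefficients.
rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 12))
channel_bias = np.linspace(0.5, -0.5, 12)  # hypothetical fixed channel effect
observed = clean + channel_bias            # additive in the cepstral domain

# With the true bias known, subtraction recovers the clean features.
restored = compensate(observed, channel_bias)

# With the stand-in estimator, each coefficient ends up with zero mean.
normalized = compensate(observed, estimate_bias_mean(observed))
```

A fixed channel is a convolution in the time domain, which becomes an additive term in the cepstral domain. That is why a simple subtraction is enough once the bias has been estimated.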

Comparison and Analysis of Listening Test Methods for Development of Perceptual Speech Quality Assessment

Reliability of the listening test design has a great influence on the performance of the quality estimation model. In this paper we compare four different listening test designs by Monte Carlo simulation. Three common problems of interval-scale ratings are included in the simulation, and their influences on the performance of estimating the underlying true quality are investigated. It turns out that, among these methods, randomly choosing partial trials for Scaled Comparison could be the most reliable way to perform listening tests under the influence of interval-scale rating problems.

A study of variable pulse allocation for MPE and CELP coders based on PESQ analysis

Interspeech 2005, 2005

A novel scheme of allocating variable pulses for each frame is proposed to reduce the bit-rate of MPE and CELP coders while maintaining the same speech quality. Since the speech signal is not stationary, the required pulse number in a speech coder should vary frame by frame. In this paper we approximate the optimal pulse allocation with a greedy search algorithm based on the criterion of the perceptual disturbance value derived by PESQ analysis. In the experiments the proposed scheme was used to reduce the pulse numbers of two standard speech coders, G.723.1 and MPEG-4 CELP. The results show that the proposed scheme can achieve over 30% bit-rate reduction in the fixed codebook (FCB) and about 20% overall for both coders while maintaining the same speech quality in both objective and subjective measures. We also designed several methods to accelerate the optimal search, which could reduce the execution time by a factor of 120 in the best case.
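The greedy search mentioned in this abstract can be sketched roughly as follows. This is an illustrative assumption about how such a search might be organized, not the authors' algorithm: the `disturbance` callback is a hypothetical placeholder for the PESQ-derived perceptual disturbance value, and the toy cost model is invented for the example.

```python
from typing import Callable, List


def greedy_pulse_allocation(
    initial: List[int],
    target_total: int,
    disturbance: Callable[[int, int], float],
    min_pulses: int = 1,
) -> List[int]:
    """Remove one pulse at a time from the frame where the removal raises
    the disturbance least, until the total pulse budget is met."""
    pulses = list(initial)
    while sum(pulses) > target_total:
        best_frame, best_cost = None, float("inf")
        for frame, n in enumerate(pulses):
            if n <= min_pulses:
                continue  # this frame cannot give up another pulse
            # Extra disturbance caused by dropping one pulse from this frame.
            cost = disturbance(frame, n - 1) - disturbance(frame, n)
            if cost < best_cost:
                best_frame, best_cost = frame, cost
        if best_frame is None:
            break  # every frame is already at the minimum
        pulses[best_frame] -= 1
    return pulses


# Toy disturbance model (assumed): each frame has a perceptual weight, and
# its disturbance falls as more pulses are spent on it.
weights = [1.0, 4.0, 0.25]


def toy_disturbance(frame: int, n_pulses: int) -> float:
    return weights[frame] / n_pulses


allocation = greedy_pulse_allocation([4, 4, 4], target_total=8,
                                     disturbance=toy_disturbance)
```

In this toy setting, frames with a larger weight keep more pulses. This mirrors the idea that the pulse count should vary frame by frame rather than stay fixed.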

A Chinese text-to-speech system with text preprocessing and confidence measure for practical usage

IEEE Region 10 Conference, Dec 2, 1997

Modern Chinese text contains not only Chinese characters but also non-Chinese characters, such as figures, abbreviations, and marks. Therefore, a proper design of text preprocessing is required for a Chinese text-to-speech (TTS) system in practical usage. In this paper, a Chinese TTS system with a text preprocessor for markup command processing, text normalization, and sentence segmentation is presented. A novel idea, confidence measures for TTS systems, is also proposed.

Ergonomics Design with Novice Elicitation on an Auditory-Only In-Vehicle Speech System

Lecture Notes in Computer Science, 2013

This research aims to design an auditory-only in-vehicle speech system, named Talking Car Novice Mode, and to provide elicitation that even a novice can easily handle. In this study, 19 participants were asked to use radio and music functions in two kinds of in-vehicle speech systems, the original Talking Car and Talking Car Novice Mode, while driving through a virtual world. Secondary task performance, the amount of time spent on tasks, and the number of calls to the help function were recorded by a camera. The annoyance score of sentences, the NASA-TLX questionnaire, and a subjective questionnaire were completed after the test. The results indicated that there was no significant difference between driving with and without tasks in either the reaction time of slamming the brake or the number of times users called for help. In addition, the learning curve of Talking Car Novice Mode is steep, which suggests that Talking Car Novice Mode provides enough elicitation to novices. Hence, Talking Car Novice Mode is expected to be friendlier and safer for a novice user than the original Talking Car in-vehicle speech system.

A Windowed Search Method for the Pitch Predictor in CELP Coder

JISE, 1992

Perceptual Distortion Analysis And Quality Estimation Of Prosody-Modified Speech For TD-PSOLA

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings

TD-PSOLA is one of the most widely used prosodic modification techniques. However, perceptible distortions are occasionally introduced, and how TD-PSOLA affects speech quality has not been fully understood and controlled. In this paper, we present a method that estimates quality before modification is performed. By exploiting the relationship between prosodic modifications and subjective scores, 27 distance measures are proposed and respective performances are

Speech segment selection for concatenative synthesis based on prosody-aligned distance measure

IEEE International Conference on Acoustics Speech and Signal Processing, 2002

This paper presents a new method for automatically selecting speech segments that are expected to minimize perceptual distortion in synthesis. The method is based on comparison of candidates fully prosody-aligned to each other. Automatic segmentation, pitch marking and the PSOLA method work together for prosody alignment. Two distance measures, MFCC and PSQM, are used for comparison because of human perceptual consideration. Experiments show that the average distortion when using the selected best unit in outside testing is similar to that in the training corpus, with only a few exceptions. The symmetry characteristics and correlation of these two distance measures are also studied and reveal that both are properly symmetric and consistent with each other in most cases.

Special-domain speech synthesizer

2005 12th IEEE International Conference on Electronics, Circuits and Systems, 2005

Special application domains impose various specific limitations such as storage and vocabularies on synthesizers. This work proposes a systematic approach to develop special-domain corpus-based speech synthesizers. An optimal recording script can be generated according to configurable application requirements. Moreover, the corresponding unit inventory and synthesizer can be constructed under minimal user intervention. A route guidance example is also presented to

Low bit-rate quantization of LSP parameters using two-dimensional differential coding

[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992

A novel spectral coding method, two-dimensional differential line spectra pair coding (2DdLSP), is proposed. Taking advantage of the strong inter-frame and intra-frame correlation of LSP parameters, a two-dimensional linear prediction technique is used to reduce the variance of the parameters to be quantized. One scalar quantization and two vector quantization schemes are designed to quantize the 2-D prediction residuals. Without

Method of speech segment selection for concatenative synthesis based on prosody-aligned distance measure

The Journal of the Acoustical Society of America, 2010

A method of speech segment selection for concatenative synthesis based on prosody-aligned distance measure is disclosed. This method is based on comparison of speech segments segmented from a speech corpus, wherein speech segments are fully prosody-aligned to each other before distortion measurement. With prosody alignment embedded in the selection process, distortion resulting from possible prosody modification in synthesis can be taken into account objectively in the selection phase. In order to carry out the purpose of the present invention, automatic segmentation, pitch marking and the PSOLA method work together for prosody alignment. Two distortion measures, MFCC and PSQM, are used for comparing two prosody-aligned segments of speech because of human perceptual consideration.

Pronunciation Assessment Method And System Based On Distinctive Feature Analysis

The Journal of the Acoustical Society of America, 2011

Speech classification embedded in adaptive codebook search for low bit-rate CELP coding

IEEE Transactions on Speech and Audio Processing, 1995

This correspondence proposes a new CELP coding method which embeds speech classification in adaptive codebook search. This approach can retain the synthesized speech quality at bit-rates below 4 kb/s. A pitch analyzer is designed to classify each frame by its periodicity, and with a finite-state machine, one of four states is determined. Then the adaptive codebook search scheme is switched according to the state. Simulation results show that higher SEGSNR and lower computation complexity can be achieved, and the pitch contour of the synthesized speech is smoother than that produced by conventional CELP coders.

Efficient and scalable methods for text script generation in corpus-based TTS design

7th International Conference on Spoken Language Processing (ICSLP 2002)