Temporally Weighted Linear Prediction Features For Tackling Additive Noise in Speaker Verification
6, JUNE 2010
Rahim Saeidi, Student Member, IEEE, Jouni Pohjalainen, Tomi Kinnunen, and Paavo Alku
Abstract: Text-independent speaker verification under additive noise corruption is considered. In the popular mel-frequency cepstral coefficient (MFCC) front-end, the conventional Fourier-based spectrum estimation is substituted with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and compared to FFT, which is normally used in computing MFCCs, and to conventional linear prediction. The effect of speech enhancement (spectral subtraction) on the system performance with each of the four feature representations is also investigated. Experiments by the authors on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. For factory noise at 0 dB SNR level, baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing method. The new features hold promise for noise-robust speaker verification.
Index Terms: Additive noise, speaker verification, stabilized weighted linear prediction (SWLP).
I. INTRODUCTION
Speaker verification is the task of verifying one's identity based on the speech signal [1]. A typical speaker verification system consists of a short-term spectral feature extractor (front-end) and a pattern matching module (back-end). For pattern matching, Gaussian mixture models [2] and support vector machines [3] are commonly used. The standard spectrum analysis method for speaker verification is the discrete Fourier transform, implemented as the fast Fourier transform (FFT). Linear prediction (LP) is another approach to estimate the short-time spectrum [4]. Research in speaker recognition over the past two decades has largely concentrated on tackling the channel variability problem, that is, how to normalize the adverse effects due to differing training and test handsets or channels (e.g., GSM versus landline speech) [5]. Another challenging problem in
Manuscript received February 03, 2010; revised April 05, 2010. Date of publication April 19, 2010; date of current version May 07, 2010. The work of R. Saeidi was supported by a scholarship from the Finnish Foundation for Technology Promotion (TES). The work of J. Pohjalainen and T. Kinnunen was supported by Academy of Finland Projects (127345, 135003 Lastu-programme, and 132129). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Pascale Fung. R. Saeidi and T. Kinnunen are with the School of Computing, University of Eastern Finland, FI-80101 Joensuu, Finland (e-mail: rahim.saeidi@uef.fi; tomi.kinnunen@uef.fi). J. Pohjalainen and P. Alku are with the Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, FI-00076 Aalto, Finland (e-mail: [email protected], paavo.alku@hut.fi). Digital Object Identifier 10.1109/LSP.2010.2048649
speaker recognition, and speech technology in general, is that of additive noise, that is, degradation that originates from other sound sources and adds to the speech signal. Neither FFT nor LP can robustly handle conditions of additive noise. Therefore, this topic has been studied extensively in the past few decades and many speech enhancement methods have been proposed to tackle problems caused by additive noise [6], [7]. These methods include, for example, spectral subtraction, Wiener filtering and Kalman filtering. They are all based on forming a statistical estimate for noise and removing it from the corrupted speech. Speech enhancement methods can be used in speaker recognition as a preprocessing stage to remove additive noise. However, they have two potential drawbacks. First, noise estimates are never perfect, which may result in removing not only the noise but also speaker-dependent components of the original speech. Second, additional preprocessing increases processing time, which can become a problem in real-time authentication. Another approach to increase robustness is to carry out feature normalization such as cepstral mean and variance normalization (CMVN), RASTA filtering [8] or feature warping [9]. These methods are often stacked with each other and combined with score normalization such as T-norm [10]. Finally, examples of model-domain methods, specifically designed to tackle additive noise, include model-domain spectral subtraction [11], missing feature theory [12] and parallel model combination [13], to mention a few. Model-domain methods are always limited to a certain model family, such as Gaussian mixtures. This paper focuses on short-term spectral feature extraction (Fig. 1). Several previous studies have addressed robust feature extraction in speaker identification based on LP-derived methods, e.g., [14]–[16].
All these investigations, however, use vector quantization (VQ) classifiers, and some of the feature extraction methods utilized are computationally intensive because they involve solving for the roots of LP polynomials. Unlike these previous studies, this work a) compares two straightforward noise-robust modifications of LP and b) utilizes them in a more modern speaker verification system based on adapted Gaussian mixtures [2] and MFCC feature extraction. The robust linear predictive methods used for spectrum estimation (Fig. 1) are weighted linear prediction (WLP) [17] and stabilized WLP (SWLP) [18], which is a variant of WLP that guarantees the stability of the resulting all-pole filter. Rather than removing noise as speech enhancement methods do, the weighted LP methods aim to increase the contribution to the filter optimization of those samples that have been less corrupted by noise. As illustrated in Fig. 2, the corresponding all-pole spectra may preserve the formant structure of noise-corrupted voiced speech better than
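$$\sum_{k=1}^{p} a_k \sum_{n} W_n s_{n-k} s_{n-i} = \sum_{n} W_n s_n s_{n-i}, \quad 1 \le i \le p \qquad (1)$$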
Fig. 1. Front-end of the speaker recognition system. While standard mel-frequency cepstral features derived through a mel-frequency spaced filterbank placed on the magnitude spectrum are used in this work, the way the magnitude spectrum is computed varies (FFT = fast Fourier transform, baseline method; LP = linear prediction; WLP = weighted linear prediction; SWLP = stabilized weighted linear prediction).
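As a rough illustration of the last stage of the front-end in Fig. 1, the sketch below maps an arbitrary magnitude spectrum (whether obtained by FFT, LP, WLP or SWLP) to 12 MFCCs through a 27-channel triangular mel filterbank followed by a DCT. The 8 kHz sampling rate, the 512-point spectrum, the mel-scale formula, the triangular filter shape, the logarithm floor and all function names are assumptions made for illustration; the letter does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    # A common mel-scale formula (assumed; the letter does not state which one it uses).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced on the mel scale between 0 and fs/2."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    fb = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        lo, mid, hi = edges_hz[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc_from_spectrum(mag, fb, n_ceps=12):
    """Log filterbank outputs of a magnitude spectrum followed by a DCT-II (c0 dropped)."""
    log_energy = np.log(fb @ mag + 1e-12)  # small floor avoids log(0) (assumption)
    n = fb.shape[0]
    k = np.arange(1, n_ceps + 1)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return basis @ log_energy

# Hypothetical usage: any one-sided magnitude spectrum with nfft // 2 + 1 bins can be used.
fb = mel_filterbank(n_filters=27, nfft=512, fs=8000.0)
example_spectrum = 1.0 + np.abs(np.sin(np.arange(257) / 10.0))
mfcc = mfcc_from_spectrum(example_spectrum, fb)
```

Because the spectrum estimator only enters through `mag`, swapping FFT for an all-pole estimate leaves the rest of this stage unchanged, which is the design shown in Fig. 1.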
which can be solved for the coefficients $a_k$ to obtain the WLP all-pole model $H(z) = 1/(1 - \sum_{k=1}^{p} a_k z^{-k})$. It is easy to show that conventional LP can be obtained as a special case of WLP: by setting $W_n = d$ for all $n$, where $d$ is a finite nonzero constant, $d$ becomes a multiplier of both sides of (1) and cancels out, leaving the LP normal equations [4]. The conventional autocorrelation LP method is guaranteed to always produce a stable all-pole model, that is, a filter where all poles are within the unit circle [4]. However, such a guarantee does not exist for autocorrelation WLP when the weighting function $W_n$ is arbitrary [17], [18]. Because of the importance of model stability in coding and synthesis applications, SWLP was developed [18]. The WLP normal equations (1) can alternatively be written in terms of partial weights $Z_{n,j}$ as
$$\sum_{k=1}^{p} a_k \sum_{n} Z_{n,k} s_{n-k} Z_{n,i} s_{n-i} = \sum_{n} Z_{n,0} s_n Z_{n,i} s_{n-i}, \quad 1 \le i \le p \qquad (2)$$
where $Z_{n,j} = \sqrt{W_n}$ for $0 \le j \le p$. As shown in [18] (using a matrix-based formulation), model stability is guaranteed if the partial weights are, instead, defined recursively as $Z_{n,0} = \sqrt{W_n}$ and $Z_{n,j} = \max(1, \sqrt{W_n}/\sqrt{W_{n-1}})\, Z_{n-1,j-1}$ for $1 \le j \le p$. Substitution of these values in (2) gives the SWLP normal equations.
The motivation for temporal weighting is to emphasize the contribution of the less noisy signal regions in solving the LP filter coefficients. Typically, the weighting function $W_n$ in WLP and SWLP is chosen as the short-time energy (STE) of the immediate signal history [17]–[19], computed using a sliding window of $M$ samples as $W_n = \sum_{i=1}^{M} s_{n-i}^2$. STE weighting emphasizes those sections of the speech waveform which consist of samples having a large amplitude. It can be argued that these segments of speech are likely to be less corrupted by stationary additive noise than the low-energy segments. Indeed, when compared to traditional spectral modeling methods such as FFT and LP, WLP and SWLP using STE weighting have been shown to improve noise robustness in automatic speech recognition [18], [19].
III. SPEAKER VERIFICATION SETUP
The effectiveness of the features is evaluated on the NIST 2002 speaker recognition evaluation (SRE) corpus, which consists of realistic speech samples transmitted over different cellular networks with varying types of handsets. The experiments are conducted using a standard Gaussian mixture model classifier with a universal background model (GMM-UBM) [2]. The GMM-UBM system was chosen since it is simple and may outperform support vector machines under additive noise conditions [13]. Test normalization (T-norm) [10] is applied on the logarithmic likelihood ratio scores. There
Fig. 2. Examples of FFT, LP, WLP, and SWLP spectra for a voiced speech sound taken from the NIST 2002 speaker recognition corpus and corrupted by factory noise (SNR 10 dB). The spectra have been shifted by approximately 10 dB with respect to each other.
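The kind of all-pole spectra compared in Fig. 2 can be sketched in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes the autocorrelation-style formulation (signal taken as zero outside the frame), short-time-energy weights $W_n = \sum_{i=1}^{M} s_{n-i}^2$, and, for SWLP, the recursive partial weights $Z_{n,0} = \sqrt{W_n}$, $Z_{n,j} = \max(1, \sqrt{W_n}/\sqrt{W_{n-1}}) Z_{n-1,j-1}$ as recalled from [18]. The function names, the weight floor, the values p = 20 and M = 20, and the synthetic test frame are all illustrative assumptions.

```python
import numpy as np

def delayed_matrix(s, p):
    """Column k holds s_{n-k} for n = 0..N+p-1, with the signal zero outside the frame."""
    D = np.zeros((len(s) + p, p + 1))
    for k in range(p + 1):
        D[k:k + len(s), k] = s
    return D

def ste_weights(s, M, length, floor=1e-12):
    """Short-time energy of the M samples preceding n; floored to stay positive (assumption)."""
    energy = np.concatenate((np.zeros(M), np.asarray(s, float) ** 2, np.zeros(length)))
    W = np.array([energy[n:n + M].sum() for n in range(length)])
    return np.maximum(W, floor)

def wlp(s, p, W):
    """Solve the weighted normal equations for the predictor coefficients a_1..a_p."""
    D = delayed_matrix(s, p)
    past, current = D[:, 1:], D[:, 0]
    R = past.T @ (W[:, None] * past)
    r = past.T @ (W * current)
    return np.linalg.solve(R, r)

def swlp(s, p, W):
    """Same normal equations, but with recursively defined partial weights Z[n, j]."""
    D = delayed_matrix(s, p)
    root = np.sqrt(W)
    gain = np.ones(len(W))
    gain[1:] = np.maximum(1.0, root[1:] / root[:-1])
    Z = np.zeros_like(D)
    Z[:, 0] = root
    for j in range(1, p + 1):
        Z[1:, j] = gain[1:] * Z[:-1, j - 1]
    Y = Z * D  # weighted, delayed copies of the signal
    return np.linalg.solve(Y[:, 1:].T @ Y[:, 1:], Y[:, 1:].T @ Y[:, 0])

def allpole_spectrum(a, nfft=512):
    """Magnitude of H(z) = 1 / (1 - sum_k a_k z^-k) on the unit circle."""
    return 1.0 / np.abs(np.fft.rfft(np.concatenate(([1.0], -a)), nfft))

# Hypothetical example frame: two sinusoids plus a little noise, Hamming-windowed.
rng = np.random.default_rng(0)
idx = np.arange(240)
frame = (np.sin(0.3 * idx) + 0.5 * np.sin(1.1 * idx)
         + 0.1 * rng.standard_normal(240)) * np.hamming(240)
p, M = 20, 20  # illustrative values only
W = ste_weights(frame, M, len(frame) + p)
wlp_spectrum = allpole_spectrum(wlp(frame, p, W))
swlp_spectrum = allpole_spectrum(swlp(frame, p, W))
```

With constant weights both functions reduce to conventional autocorrelation LP, which gives a convenient sanity check for an implementation of this kind.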
the conventional methods. The WLP and SWLP features were recently applied to automatic speech recognition in [19] with promising results; the authors were curious to see whether these improvements would translate to speaker verification as well.
II. SPECTRUM ESTIMATION METHODS
In linear predictive modeling with prediction order $p$, it is assumed that each speech sample can be predicted as a linear combination of $p$ previous samples, $\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}$, where $s_n$ is the digital speech signal and $\{a_k\}$ are the prediction coefficients. The difference between the actual sample and its predicted value is the residual $e_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}$. WLP is a generalization of LP. In contrast to conventional LP, WLP introduces a temporal weighting of the squared residual in model coefficient optimization, allowing emphasis of the temporal regions assumed to be little affected by the noise, and de-emphasis of the noisy regions. The coefficients $\{a_k\}$ are solved by minimizing the energy of the weighted squared residual [17] $E = \sum_{n} e_n^2 W_n = \sum_{n} \left(s_n - \sum_{k=1}^{p} a_k s_{n-k}\right)^2 W_n$, where $W_n$ is the weighting function. The range of summation of $n$ (not explicitly written) is chosen in this work to correspond to the autocorrelation method, in which the energy is minimized over a theoretically infinite interval, but $s_n$ is considered to be zero outside the actual analysis window [4]. By setting the partial derivatives of $E$ with respect to each $a_k$ to zero, one arrives at the WLP normal equations (1).
are 2982 genuine and 36 277 impostor test trials in the NIST 2002 corpus. For each of the 330 target speakers, two minutes of untranscribed, conversational speech is available to train the target speaker model. The duration of the test utterances varies between 15 and 45 s. The (gender-dependent) background models and cohort models for T-norm, having 1024 Gaussians, are trained using the NIST 2001 corpus. This baseline system [20] has comparable or better accuracy than other systems evaluated on this corpus (e.g., [21]).
Features are extracted every 15 ms from 30 ms frames multiplied by a Hamming window. Depending on the feature extraction method, the magnitude spectrum is computed differently. For the baseline method, the FFT of the windowed frame is directly computed. For LP, WLP and SWLP, the model coefficients and the corresponding all-pole spectra are first derived as explained in Section II. All three parametric methods use the same predictor order $p$. For WLP and SWLP, the short-term energy window duration is set to $M$ samples. A 27-channel mel-frequency filterbank is used to extract 12 MFCCs. After RASTA filtering, $\Delta$ and $\Delta^2$ coefficients, a standard component in modern speaker verification [1], are appended. Voiced frames are then selected using an energy-based voice activity detector (VAD). Finally, cepstral mean and variance normalization (CMVN) is performed. The procedure is illustrated in Fig. 1.
Two standard metrics are used to assess recognition accuracy: the equal error rate (EER) and the minimum detection cost function value (MinDCF). EER corresponds to the threshold at which the miss rate and false alarm rate are equal; MinDCF is the minimum value of a weighted cost function given by $0.1 \times P_{\mathrm{miss}} + 0.99 \times P_{\mathrm{fa}}$. In addition, a few selected detection error tradeoff (DET) curves are plotted, showing the full trade-off curve between false alarms and misses on a normal deviate scale. All the reported MinDCF values are multiplied by 10 for ease of comparison.
To study robustness against additive noise, noise from the NOISEX-92 database¹ is digitally added to the speech samples. This study uses the white and factory2 noises (the latter is referred to as "factory noise" throughout the paper). The background models and target speaker models are trained with clean data, but the noise is added to the test files with a given average segmental (frame-average) signal-to-noise ratio (SNR). Five values are considered: dB, where "clean" refers to the original, uncontaminated NIST samples. In summary, the evaluation data used in the present study contains linear and nonlinear distortion present in the sounds of the NIST 2002 database as well as additive noise taken from the NOISEX-92 database. Also included in the experiments is the well-known and simple speech enhancement method, spectral subtraction (as described in [6]). The effect of speech enhancement is studied alone as well as in combination with the new features. The noise model is initialized from the first five frames and updated during the non-speech periods with VAD labels given by the energy method.
¹ Samples available at http://spib.rice.edu/spib/select_noise.html
IV. SPEAKER VERIFICATION RESULTS
The results for white and factory noise are shown in Tables I and II, respectively. In addition, Fig. 3 shows a DET plot that compares the four feature sets under factory noise degradation at SNR of 0 dB without any speech enhancement. Examining
enhanced case, both proposed methods still improved upon the FFT baseline, and SWLP remained the most robust method. In summary, the weighted linear predictive features are a promising approach for speaker recognition in the presence of additive noise.
REFERENCES
[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., vol. 52, no. 1, pp. 12–40, Jan. 2010.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1, pp. 19–41, Jan. 2000.
[3] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, May 2006.
[4] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 64, no. 4, pp. 561–580, Apr. 1975.
[5] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1435–1447, May 2007.
[6] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC, 2007.
[7] T. Ganchev, I. Potamitis, N. Fakotakis, and G. Kokkinakis, "Text-independent speaker verification for real fast-varying noisy environments," Int. J. Speech Technol., vol. 7, no. 4, pp. 281–292, Oct. 2004.
[8] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
[9] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. Speaker Odyssey: The Speaker Recognition Workshop (Odyssey 2001), Crete, Greece, Jun. 2001, pp. 213–218.
[10] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Process., vol. 10, no. 1–3, pp. 42–54, Jan. 2000.
[11] J. A. Nolazco-Flores and L. P. Garcia-Perera, "Enhancing acoustic models for robust speaker verification," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2008), Las Vegas, NV, Apr. 2008, pp. 4837–4840.
[12] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, "Robust speaker recognition in noisy conditions," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1711–1723, Jul. 2007.
[13] S. G. Pillay, A. Ariyaeeinia, M. Pawlewski, and P. Sivakumaran, "Speaker verification under mismatched data conditions," Signal Process., vol. 3, no. 4, pp. 236–246, Jul. 2009.
[14] K. T. Assaleh and R. J. Mammone, "New LP-derived features for speaker identification," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 630–638, Oct. 1994.
[15] R. P. Ramachandran, M. S. Zilovic, and R. J. Mammone, "A comparative study of robust linear predictive analysis methods with applications to speaker identification," IEEE Trans. Speech Audio Process., vol. 3, no. 2, pp. 117–125, Mar. 1995.
[16] M. S. Zilovic, R. P. Ramachandran, and R. J. Mammone, "Speaker identification based on the use of robust cepstral features obtained from pole-zero transfer functions," IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 260–267, 1998.
[17] C. Ma, Y. Kamp, and L. F. Willems, "Robust signal selection for linear prediction analysis of voiced speech," Speech Commun., vol. 12, no. 2, pp. 69–81, 1993.
[18] C. Magi, J. Pohjalainen, T. Bäckström, and P. Alku, "Stabilised weighted linear prediction," Speech Commun., vol. 51, no. 5, pp. 401–411, 2009.
[19] J. Pohjalainen, H. Kallasjoki, K. J. Palomäki, M. Kurimo, and P. Alku, "Weighted linear prediction for speech analysis in noisy conditions," in Proc. Interspeech 2009, Brighton, U.K., 2009, pp. 1315–1318.
[20] R. Saeidi, H. R. S. Mohammadi, T. Ganchev, and R. D. Rodman, "Particle swarm optimization for sorted adapted Gaussian mixture models," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 2, pp. 344–353, Feb. 2009.
[21] C. Longworth and M. J. F. Gales, "Combining derivative and parametric kernels for speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 748–757, May 2009.
Fig. 4. Comparing FFT and SWLP with and without speech enhancement (SS = Spectral Subtraction).
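The two summary metrics behind the tables and DET plots, EER and MinDCF, can be computed from lists of genuine and impostor scores by sweeping a decision threshold. The sketch below is a simple illustration, not the evaluation code used in the letter: the function names are hypothetical, the EER is taken at the swept threshold where miss and false-alarm rates are closest (no interpolation), and the cost parameters default to the values commonly used in NIST evaluations of that period (miss cost 10, false-alarm cost 1, target prior 0.01), which is an assumption here.

```python
import numpy as np

def error_rates(genuine, impostor):
    """Miss and false-alarm rates for every candidate threshold (accept if score >= t)."""
    genuine = np.asarray(genuine, float)
    impostor = np.asarray(impostor, float)
    scores = np.sort(np.concatenate((genuine, impostor)))
    # One extra threshold above all scores covers the "reject everything" operating point.
    thresholds = np.concatenate((scores, [scores[-1] + 1.0]))
    p_miss = np.array([(genuine < t).mean() for t in thresholds])
    p_fa = np.array([(impostor >= t).mean() for t in thresholds])
    return p_miss, p_fa

def equal_error_rate(genuine, impostor):
    """Error rate at the swept threshold where miss and false-alarm rates are closest."""
    p_miss, p_fa = error_rates(genuine, impostor)
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

def min_dcf(genuine, impostor, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of the weighted detection cost (parameters are assumptions)."""
    p_miss, p_fa = error_rates(genuine, impostor)
    return float(np.min(c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa))

# Hypothetical toy scores, only to show the calling convention.
genuine_scores = [2.0, 3.0, 4.0, 5.0]
impostor_scores = [0.0, 1.0, 2.5, 1.5]
eer = equal_error_rate(genuine_scores, impostor_scores)
dcf = min_dcf(genuine_scores, impostor_scores)
```

The threshold sweep is quadratic in the number of trials, which is fine for a toy example; a sorted cumulative count would be the usual choice for tens of thousands of trials.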
the EER and MinDCF scores without speech enhancement, the following observations are made. The accuracy of all four feature sets degrades significantly under additive noise; performance in white noise is inferior to that in factory noise². WLP and SWLP outperform FFT and LP in most cases, with large differences at low SNRs and for factory noise; the best performing methods for white noise and factory noise are WLP and SWLP, respectively. WLP and SWLP show minor improvement over FFT also in the clean condition, showing consistency of the new features. It is interesting to note that, although SWLP is stabilized mainly for synthesis purposes and WLP has performed better in speech recognition [19], SWLP seems to slightly outperform WLP in speaker recognition.
Considering the effect of speech enhancement, as summarized by the representative DET plot in Fig. 4, speech enhancement as a preprocessing step is seen to significantly improve all four methods. In addition, according to Tables I and II, the difference becomes progressively larger with decreasing SNR. This is expected since for a less noisy signal, spectral subtraction is also likely to remove other information in addition to noise. After including speech enhancement, even though the enhancement has a larger effect than the choice of the feature set, SWLP remains the most robust method and together with WLP outperforms baseline FFT. Note that here the benefit from spectral subtraction may be quite pronounced due to almost stationary noise types.
V. CONCLUSION
Temporally weighted linear predictive features in speaker verification were studied. Without speech enhancement, the new WLP and SWLP features outperformed standard FFT and LP features in recognition experiments under additive-noise conditions. The effectiveness of spectral subtraction in highly noisy environments was also demonstrated. However, in the
² Factory noise has an overall lowpass spectral slope close to that of speech, whereas the spectrum of white noise is flat. White noise is thus likely to corrupt the higher formants of speech more severely.
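The noise corruption used in the experiments, adding a noise recording to a test file at a given average segmental (frame-average) SNR, can be sketched as follows. This is only one plausible reading of that description: the noise is scaled by a single gain so that the mean of the per-frame SNRs (in dB) equals the target. The function names, the 240-sample frame length (30 ms at an assumed 8 kHz sampling rate), the skipping of zero-energy frames and the synthetic signals are all assumptions.

```python
import numpy as np

def segmental_snr_db(speech, noise, frame_len=240):
    """Mean over frames of 10*log10(speech energy / noise energy), skipping empty frames."""
    n_frames = min(len(speech), len(noise)) // frame_len
    s = np.asarray(speech[:n_frames * frame_len], float).reshape(n_frames, frame_len)
    d = np.asarray(noise[:n_frames * frame_len], float).reshape(n_frames, frame_len)
    e_s = (s ** 2).sum(axis=1)
    e_d = (d ** 2).sum(axis=1)
    keep = (e_s > 0) & (e_d > 0)
    return float(np.mean(10.0 * np.log10(e_s[keep] / e_d[keep])))

def add_noise(speech, noise, target_snr_db, frame_len=240):
    """Scale the noise with one global gain so the segmental SNR hits the target, then add."""
    speech = np.asarray(speech, float)
    noise = np.asarray(noise[:len(speech)], float)  # noise must be at least as long as speech
    current = segmental_snr_db(speech, noise, frame_len)
    # Scaling the noise by g lowers every frame SNR by 20*log10(g) dB.
    gain = 10.0 ** ((current - target_snr_db) / 20.0)
    return speech + gain * noise

# Hypothetical signals standing in for a speech file and a NOISEX-92 noise recording.
rng = np.random.default_rng(1)
speech = rng.standard_normal(2400) * np.linspace(0.1, 1.0, 2400)
noise = rng.standard_normal(2400)
noisy = add_noise(speech, noise, target_snr_db=0.0)
```

A single global gain keeps the noise stationary in level across the file, so low-energy speech frames end up with a lower local SNR than high-energy ones, which is the situation the temporal weighting is meant to exploit.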