
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 5, Nº 1

Spectral Restoration Based Speech Enhancement for Robust Speaker Identification
Nasir Saleem*¹, Tayyaba Gul Tareen²
¹ Department of Electrical Engineering, Gomal University, D.I.Khan (Pakistan)
² Department of Electrical Engineering, Iqra University, Peshawar (Pakistan)

Received 8 October 2017 | Accepted 21 December 2017 | Published 19 January 2018

Abstract
Spectral restoration based speech enhancement algorithms are used to enhance the quality of noise-masked speech for robust speaker identification. In the presence of background noise, the performance of speaker identification systems can deteriorate severely. The present study employs and evaluates Minimum Mean-Square-Error Short-Time Spectral Amplitude estimators with a modified a priori SNR estimate, applied prior to speaker identification, to improve the performance of speaker identification systems in the presence of background noise. For speaker identification, Mel Frequency Cepstral Coefficients and Vector Quantization are used to extract the speech features and to model the extracted features, respectively. The experimental results show a significant improvement in speaker identification rates when spectral restoration based speech enhancement algorithms are used as a pre-processing step. The identification rates are found to be higher after employing the speech enhancement algorithms.

Keywords
A Priori SNR, Spectral Restoration, Speech Enhancement, Speaker Identification, Mel Frequency Cepstral Coefficients, Vector Quantization.

DOI: 10.9781/ijimai.2018.01.002

I. Introduction

Speech enhancement aims to improve speech quality by employing a variety of speech processing algorithms. The intention of enhancement is to improve the intelligibility and/or the overall perceptual quality of noise-masked speech. Enhancement of speech degraded by background noise, called noise reduction, is a significant area of speech enhancement and is relevant to diverse applications, for example mobile phones, speech/speaker recognition and identification [1], and hearing aids. Speech signals are frequently contaminated by background noise, which degrades the performance of speaker identification (SID) systems. SID systems are used in online banking, voice mail, remote computer access, etc. Therefore, for effective use of such systems, a speech enhancement system must be placed at the front end to improve identification accuracy. Fig. 1 shows the procedural block diagram of the speech enhancement and speaker identification system.

The algorithms for speech enhancement are categorized into three fundamental classes: (i) filtering techniques, including spectral subtraction [2-5], Wiener filtering [6-8] and signal subspace techniques [9-10]; (ii) spectral restoration algorithms, including Minimum Mean-Square-Error Short-Time Spectral Amplitude estimators [11-12]; and (iii) speech-model based algorithms. The systems presented in [6-8, 11-13] depend principally on accurate estimates of the signal-to-noise ratio (SNR) in all frequency bands, because the gain is computed as a function of the spectral SNR. A conventional and well-established technique for SNR estimation is the decision-directed (DD) method suggested in [11]. In the DD technique, the a priori SNR estimate follows the shape of the instantaneous SNR and introduces a one-frame delay. Therefore, to avoid the one-frame delay, momentum terms are incorporated to improve the tracking speed of the system. All the systems in [11-13] can significantly improve speech quality. Binary masking [14-18] is another class that increases speech quality and intelligibility simultaneously.

This paper presents Minimum Mean-Square-Error Short-Time Spectral Amplitude estimators with a modified a priori SNR estimate to reduce background noise and to improve the identification rates of speaker identification systems in the presence of background noise. The paper is organized as follows. Section II presents an overview of the speech enhancement system; Section III describes the speaker identification system; Section IV presents the experimental setup, results and discussion; and Section V presents the summary and concluding remarks. MATLAB R2015b is used to implement the algorithms and simulations.

* Corresponding author.
E-mail address: [email protected]

Fig. 1. Procedural block diagram of speech enhancement and speaker identification system.

II. Spectral Restoration Based Speech Enhancement System

In a classical spectral restoration based speech enhancement system, the noisy speech is given as $y(t) = s(t) + n(t)$, where $s(t)$ and $n(t)$ denote the clean speech and the noise signal, respectively. Let $Y(k,\omega_k)$, $S(k,\omega_k)$ and $N(k,\omega_k)$ denote the short-time spectra of $y(t)$, $s(t)$ and $n(t)$, respectively, with spectral element $\omega_k$ and time frame $k$. Because speech and noise are both non-stationary, the analysis is carried out frame by frame, treating speech as quasi-stationary within a frame. A speech enhancement algorithm multiplies the short-time spectrum $Y(k,\omega_k)$ by a spectral gain $G(k,\omega_k)$, and the computation of the spectral gain depends on two key parameters, the a posteriori SNR and the a priori SNR:

$$\gamma(k,\omega_k) = \frac{|Y(k,\omega_k)|^2}{E\{|N(k,\omega_k)|^2\}} = \frac{|Y(k,\omega_k)|^2}{\sigma_n^2(k,\omega_k)} \tag{1}$$

$$\xi(k,\omega_k) = \frac{E\{|S(k,\omega_k)|^2\}}{E\{|N(k,\omega_k)|^2\}} = \frac{\sigma_s^2(k,\omega_k)}{\sigma_n^2(k,\omega_k)} \tag{2}$$

where $E\{\cdot\}$ is the expectation operator, and $\gamma(k,\omega_k)$ and $\xi(k,\omega_k)$ are the a posteriori and a priori SNR, respectively. In practical implementations of a speech enhancement system, the power spectral densities of the clean speech $|S(k,\omega_k)|^2$ and of the noise $|N(k,\omega_k)|^2$ are unknown, as only the noisy speech is available. Therefore, both the instantaneous and the a priori SNR need to be estimated. The noise power spectral density is estimated during speech gaps using the standard recursive relation:

$$\hat{\sigma}_n^2(k,\omega_k) = \beta\,\hat{\sigma}_n^2(k-1,\omega_k) + (1-\beta)\,\sigma_Y^2(k-1,\omega_k) \tag{3}$$

where $\beta$ is the smoothing factor and $\hat{\sigma}_n^2(k-1,\omega_k)$ is the estimate in the previous frame. The instantaneous SNR can be calculated as:

$$\mathrm{SNR}_{INST}(k,\omega_k) = \frac{|S(k,\omega_k)|^2}{|N(k,\omega_k)|^2} \tag{4}$$

$$\xi_{DD}(k,\omega_k) = \alpha\,\frac{|G(k-1,\omega_k)\,Y(k,\omega_k)|^2}{\hat{\sigma}_n^2(k,\omega_k-1)} + (1-\alpha)\,F\{\gamma(k,\omega_k)-1\} \tag{5}$$

where $\alpha$ is a smoothing factor with the constant value 0.98, $\xi_{DD}(k,\omega_k)$ is the a priori SNR estimate obtained with the decision-directed (DD) method, and $F\{\cdot\}$ denotes half-wave rectification. Setting $\alpha$ to a fixed value close to 1 makes the DD approach introduce less residual noise. However, it may delay the estimate, since a fixed value cannot track rapid changes in speech. The DD method is efficient and performs well in speech enhancement applications; however, the a priori SNR follows the shape of the instantaneous SNR and introduces a single-frame delay. To overcome the single-frame delay, a modified form of the DD method is used to estimate the a priori SNR. The modified a priori SNR is written as:

$$\xi_{MDD}(k,\omega_k) = \alpha\,\frac{|G(k-1,\omega_k)\,Y(k,\omega_k)|^2}{\hat{\sigma}_n^2(k,\omega_k-1)} + \mu(k,\omega_k) + (1-\alpha)\,F\{\gamma(k,\omega_k)-1\} \tag{6}$$

$$\mu(k,\omega_k) = \zeta\,[\xi_{PRIO}(k-1,\omega_k) - \xi_{PRIO}(k-2,\omega_k)] \tag{7}$$

Equation (6) shows the modified DD (MDD) version used in the speech enhancement system: $\alpha$ is the smoothing parameter ($\alpha = 0.98$), $\zeta$ is the momentum parameter ($\zeta = 0.998$), $\mu(k,\omega_k)$ is the momentum term and $\lambda_D(k,\omega_k)$ is the estimate of the background noise variance. $\xi_{MDD}(k,\omega_k)$ is the a priori SNR estimate after modification. The estimated clean speech spectrum $S_{EST}(k,\omega_k)$ is obtained by multiplying the noisy speech $Y(k,\omega_k)$ by the gain function:

$$S_{EST}(k,\omega_k) = Y(k,\omega_k)\,G(k,\omega_k) \tag{8}$$

The gain function $G(k,\omega_k)$ is given as:

(9)

where ς is used to avoid large gain values at low a posteriori SNR, and ς = 10 is chosen here.

III. Speaker Identification System

The aim of a speaker recognition system is to extract identity information about a speaker; the task is divided into two sub-categories, speaker identification (SID) and speaker verification (SVR). For SID, Mel Frequency Cepstral Coefficients (MFCCs) and Vector Quantization (VQ) are used to extract the speech features and to model the extracted features, respectively. The speaker identification system operates in two stages, training and testing. In training mode, the system builds the database of speech signals and formulates a feature model of the speech utterances. In testing mode, the system uses the information stored in the database and attempts to separate and identify the speakers. Here, MFCC features are used to construct the SID system. The extracted features of the speakers are quantized to a number of centroids using the vector quantization (VQ) K-means algorithm. MFCCs are computed in the training stage as well as in the testing stage. The Euclidean distances between the MFCCs of all speakers in the training stage and the centroids of the isolated speaker in the testing stage are calculated, and a particular speaker is identified according to the minimum Euclidean distance.

A. Feature Extraction

The MFCCs are obtained by first pre-emphasizing the speech to emphasize high frequencies and to remove glottal and lip radiation effects. The resulting speech is split into frames and windowed, and the FFT is computed to obtain the spectra. To approximate the human auditory system, a bank of triangular band-pass filters is used. A linear scale is used for center frequencies below 1 kHz, while a logarithmic scale is used for center frequencies above 1 kHz. The filter bank response is shown in Fig. 2. The Mel-spaced filter bank response is given as:

(10)

The DFT is computed on the log of the Mel spectrum to obtain the cepstrum as:

(11)

where $M_g$ denotes the MFCCs, $\dot{S}$ is the $n$th Mel filter output, $K$ is the number of
MFCCs, chosen between 5 and 26, and $N_f$ is the number of Mel filters. Only the first few coefficients are considered, since most of the speaker-specific information is present in them.

Fig. 2. Mel-spaced filter bank response.

B. Vector Quantization

Vector quantization (VQ) is a lossy compression method based on block coding theory [20]. The purpose of VQ in speaker recognition systems is to create a classification system for every speaker: a large set of acoustic vectors is converted to a smaller set that represents the centroids of the distribution, as shown in Fig. 3. VQ is employed because not all MFCC feature vectors can be stored; the extracted acoustic vectors are clustered into a set of codewords (referred to as a codebook). This clustering is achieved with the K-means algorithm, which separates the M feature vectors into K centroids. Initially, K cluster centroids are chosen randomly among the M feature vectors; every feature vector is then allocated to its nearest centroid, the centroids are recomputed, and the new clusters are processed in the same way. The process continues until a stopping condition is reached, i.e., the mean square error (MSE) between the acoustic vectors and the cluster centroids falls below a predefined threshold, or there are no further changes in the cluster-center assignment [21].

Fig. 3. 2D acoustic vector analysis for speakers.

C. Speaker Identification

In the speaker recognition phase, the unknown speaker is characterized by a set of acoustic feature vectors {M1, M2, ..., Mt}, which is compared against the codebooks in the list. For every codebook a distortion is calculated, and the speaker with the lowest distortion is selected; this distortion is the sum of squared Euclidean distances between the vectors and their centroids. As a result, all feature vectors in the sequence M are compared with the codebooks, and the codebook with the minimum average distance is selected. The Euclidean distance between two points, λ = (λ1, λ2, ..., λn) and η = (η1, η2, ..., ηn), is given by [21-22]:

$$d(\lambda,\eta) = \sqrt{\sum_{i=1}^{n} (\lambda_i - \eta_i)^2} \tag{12}$$

IV. Results and Discussion

Six different speakers, three male and three female, were selected from the Noizeus [23] and TIMIT databases, respectively, and 50 speech sentences uttered by the speakers are used during the training stage for speaker identification. In the testing stage, speech utterances are selected at random to assess the identification rates. To evaluate the performance of the system, four signal-to-noise ratio levels (0 dB, 5 dB, 10 dB and 15 dB) are used, and three noise types (babble, car and street noise) are used to degrade the clean speech. The Perceptual Evaluation of Speech Quality (PESQ) [23] and the segmental SNR (SNRSeg) [24] are used to predict the speech quality after speech enhancement. Three sets of experiments are conducted to measure the speaker identification rates: clean speech with no background noise, speech degraded by background noise, and speech processed by the spectral restoration enhancement algorithms. The presented system is compared with several baseline speech enhancement algorithms: MMSE, spectral subtraction (SS), and signal subspace (Sig_Sp).

Table I shows the PESQ scores obtained with the spectral restoration based algorithm and the baseline algorithms. The proposed algorithm achieved the highest PESQ score for every noise type and SNR level, which indicates that it effectively reduced the various background noise sources in the target speech. Similarly, Fig. 4 shows the PESQ scores obtained after applying the Minimum Mean-Square-Error Short-Time Spectral Amplitude estimator with the modified a priori SNR estimate (MMSE-MDD). The modified version consistently gives the best results at all SNR levels and in all noise conditions when compared with the noisy speech and with speech processed by the traditional MMSE-STSA speech enhancement algorithm. Table II shows the SNRSeg results obtained with the spectral restoration based algorithm and the baseline algorithms. In terms of SNRSeg, too, the proposed speech enhancement algorithm outperformed the baseline algorithms in every condition. Fig. 5 shows the speech quality in terms of segmental SNR (SNRSeg), where the highest SNRSeg scores are obtained with MMSE-MDD.

The enhanced speech of the six speakers is then tested for speaker identification. Table III gives the percentage identification rates achieved with the proposed speech enhancement algorithm and with the baseline algorithms. With the proposed algorithm, the speaker identification rates improve over the baseline algorithms and over unprocessed noisy speech in all noise environments and SNR levels except babble noise at 10 dB, where MMSE gives the highest rate (Table III). At low SNR (0 dB) a significant increase in identification rates is observed in all noise environments, which shows that the noise is effectively suppressed. Fig. 6 shows the identification rates: the lowest identification rates are observed in the presence of background noise (babble, car and street); however, applying speech enhancement before speaker identification substantially increased the identification rates, as is evident in Fig. 6. The identification rates for MMSE-MDD are the highest in nearly all noise conditions and SNR levels.
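The identification procedure of Section III (one K-means codebook per speaker, then selection of the codebook with the minimum average distortion) can be sketched as follows. This is a minimal illustration rather than the authors' MATLAB implementation: the function names (`train_codebook`, `average_distortion`, `identify`), the codebook size of 8, the 12-dimensional features and the synthetic Gaussian "MFCC" vectors are all assumptions made for the example.

```python
import numpy as np

def train_codebook(features, k, n_iter=50, seed=0):
    """K-means codebook for one speaker.

    features: (M, d) array of feature vectors (e.g. MFCCs).
    Returns a (k, d) array of centroids. Hypothetical helper, not from the paper.
    """
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct feature vectors chosen at random.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every vector to every centroid.
        d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change the centroids
        centroids = new_centroids
    return centroids

def average_distortion(features, codebook):
    """Mean squared Euclidean distance from each vector to its nearest codeword."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def identify(features, codebooks):
    """Return the speaker whose codebook gives the minimum average distortion."""
    return min(codebooks, key=lambda spk: average_distortion(features, codebooks[spk]))

# Synthetic stand-in for MFCC features of two speakers (assumption for the demo).
rng = np.random.default_rng(1)
train = {
    "A": rng.normal(0.0, 1.0, (200, 12)),
    "B": rng.normal(5.0, 1.0, (200, 12)),
}
codebooks = {spk: train_codebook(f, k=8) for spk, f in train.items()}
test_utterance = rng.normal(5.0, 1.0, (40, 12))  # drawn like speaker "B"
identified = identify(test_utterance, codebooks)
print(identified)
```

In an experiment of the kind described above, `train` would hold MFCCs of the training sentences of each speaker and `test_utterance` the MFCCs of an enhanced test sentence; the identification rate is then the fraction of test utterances assigned to the correct speaker.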

TABLE I. PESQ Analysis Against Baseline Speech Enhancement Algorithms and Noisy Speech

Noise Type     SNR (dB)   Noisy Speech   Spectral Subtraction   Signal Subspace   MMSE   Proposed
Babble Noise   0          1.72           1.89                   1.91              1.89   1.97
Babble Noise   5          2.11           2.19                   2.29              2.23   2.35
Babble Noise   10         2.43           2.53                   2.61              2.55   2.69
Babble Noise   15         2.66           2.71                   2.76              2.71   2.83
Car Noise      0          1.79           1.91                   2.01              1.87   2.07
Car Noise      5          1.97           2.23                   2.31              2.21   2.45
Car Noise      10         2.31           2.42                   2.62              2.61   2.72
Car Noise      15         2.45           2.56                   2.76              2.78   2.91
Street Noise   0          1.77           1.93                   1.96              1.88   2.13
Street Noise   5          2.05           2.21                   2.31              2.12   2.43
Street Noise   10         2.41           2.57                   2.59              2.55   2.69
Street Noise   15         2.54           2.65                   2.69              2.61   2.86

TABLE II. Segmental SNR (SNRSeg) Analysis Against Baseline Speech Enhancement Algorithms and Noisy Speech

Noise Type     SNR (dB)   Noisy Speech   Spectral Subtraction   Signal Subspace   MMSE   Proposed
Babble         0          0.11           1.21                   1.55              1.12   1.66
Babble         5          1.13           1.77                   1.89              1.83   2.01
Babble         10         1.45           2.11                   2.17              1.99   2.37
Babble         15         1.64           2.34                   2.38              2.28   2.44
Car            0          0.10           1.32                   1.28              1.13   1.63
Car            5          1.23           1.89                   1.93              1.78   1.98
Car            10         1.56           2.14                   2.21              1.97   2.41
Car            15         1.66           2.29                   2.33              2.37   2.57
Street         0          0.18           1.29                   1.41              1.16   1.59
Street         5          1.43           1.88                   1.92              1.72   1.99
Street         10         1.53           2.21                   2.23              2.01   2.39
Street         15         1.67           2.35                   2.39              2.21   2.51
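The paper does not spell out how noise is mixed at a given SNR or how SNRSeg is computed, so the sketch below uses one common convention: the noise is scaled so that the global clean-to-noise power ratio equals the target SNR, and the segmental SNR is the mean of per-frame SNRs in dB, each clamped to [-10, 35] dB. The frame length of 256 samples, the clamping limits, the function names and the synthetic sine-plus-white-noise signals are assumptions for illustration; the numbers it produces are not comparable with Table II.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio is `snr_db` dB, then add it."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of frame-wise SNRs in dB, each clamped to [lo, hi] (assumed limits)."""
    eps = 1e-12  # avoids division by zero and log of zero in silent frames
    n_frames = len(clean) // frame_len
    frame_snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - processed[i * frame_len:(i + 1) * frame_len]  # residual error
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        frame_snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(frame_snrs))

# Synthetic demo: a sine wave stands in for clean speech, white noise for the interference.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noise = rng.normal(0.0, 1.0, 8000)

noisy_0db = mix_at_snr(clean, noise, 0.0)
noisy_15db = mix_at_snr(clean, noise, 15.0)
global_snr_15 = 10.0 * np.log10(np.sum(clean ** 2) / np.sum((noisy_15db - clean) ** 2))
seg_0 = segmental_snr(clean, noisy_0db)
seg_15 = segmental_snr(clean, noisy_15db)
print(global_snr_15, seg_0, seg_15)
```

To score an enhancement algorithm, `processed` would be the enhanced signal time-aligned with `clean`; a higher value means the enhanced frames are closer to the clean frames.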

TABLE III. Speaker Identification Rates of Speech Enhancement Algorithms (in Percentage)

Noise Type     SNR (dB)   Noisy Speech   Spectral Subtraction   Signal Subspace   MMSE   Proposed
Babble         0          41             52                     55                56     62
Babble         5          58             64                     67                69     71
Babble         10         77             81                     83                84     79
Babble         15         85             88                     89                88     91
Car            0          40             51                     53                55     58
Car            5          56             66                     69                71     73
Car            10         76             81                     85                87     88
Car            15         82             89                     89                90     91
Street         0          38             49                     54                57     59
Street         5          46             67                     69                71     73
Street         10         71             80                     82                86     88
Street         15         80             85                     87                90     92
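As a quick arithmetic check on Table III, the gain in identification rate at 0 dB can be expressed both in percentage points and relative to the noisy-speech rate. The short script below only restates the 0 dB rows of the table; the variable names are illustrative.

```python
# Identification rates (%) at 0 dB SNR, copied from Table III.
rates_0db = {
    "Babble": {"noisy": 41, "proposed": 62},
    "Car": {"noisy": 40, "proposed": 58},
    "Street": {"noisy": 38, "proposed": 59},
}

gains = {}
for noise_type, r in rates_0db.items():
    absolute = r["proposed"] - r["noisy"]      # gain in percentage points
    relative = 100.0 * absolute / r["noisy"]   # gain relative to noisy speech, in %
    gains[noise_type] = (absolute, round(relative, 1))
    print(f"{noise_type}: +{absolute} points ({relative:.1f}% relative)")
```

For the 0 dB rows this gives gains of 21, 18 and 21 percentage points for babble, car and street noise, i.e. roughly 51%, 45% and 55% relative to the noisy-speech rates.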

V. Summary and Conclusions

This paper presented Minimum Mean-Square-Error Short-Time Spectral Amplitude estimators with a modified a priori SNR estimate to reduce background noise and to improve the identification rates of speaker identification systems in the presence of background noise. The lowest identification rates are reported when background noises such as babble, car and street noise are present. With the proposed speech enhancement algorithm applied as a pre-processing step, the identification rates at low SNR (0 dB) rise from 41%, 40% and 38% for noisy speech to 62%, 58% and 59% in babble, car and street noise, respectively (Table III). The proposed speech enhancement algorithm offered significant improvements in terms of PESQ and SNRSeg scores. The speaker identification rates are higher than those of the baseline algorithms in nearly all noise environments and SNR levels. In the presence of noise it is difficult to identify a specific speaker; however, the use of a speech enhancement system prior to speaker identification markedly increased the identification rates. On the basis of the experimental results, it is concluded that using the proposed speech enhancement algorithm as a preprocessor can markedly increase speaker identification rates in many noisy environments compared with many other speech enhancement algorithms.

Fig. 4. PESQ analysis.

Fig. 5. SNRSeg analysis.

Fig. 6. Speaker identification rate analysis.

References
[1] Hicham, E.M., Akram, H., and Khalid, S. (2016). Using features of local densities, statistics and HMM toolkit (HTK) for offline Arabic handwriting text recognition. J. Electr. Syst. Inform. Technol., 4(3), 387-396. http://dx.doi.org/10.1016/j.jesit.2016.07.005
[2] Berouti, M., Schwartz, M., and Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 208-211.
[3] Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 4, p. IV-4164.
[4] Gustafsson, H., Nordholm, S., and Claesson, I. (2001). Spectral subtraction using reduced delay convolution and adaptive averaging. IEEE Trans. on Speech and Audio Processing, 9(8), 799-807.
[5] Saleem, N., Ali, S., Khan, U., and Ullah, F. (2013). Speech enhancement with geometric advent of spectral subtraction using connected time-frequency regions noise estimation. Research Journal of Applied Sciences, Engineering and Technology, 6(06), 1081-1087.
[6] Lim, J. and Oppenheim, A. V. (1978). All-pole modeling of degraded speech. IEEE Trans. Acoust., Speech, Signal Process., ASSP-26(3), 197-210.
[7] Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
[8] Hu, Y. and Loizou, P. (2004). Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans. on Speech and Audio Processing, 12(1), 59-67.
[9] Hu, Y. and Loizou, P. (2003). A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Trans. on Speech and Audio Processing, 11, 334-341.
[10] Jabloun, F. and Champagne, B. (2003). Incorporating the human hearing properties in the signal subspace approach for speech enhancement. IEEE Trans. on Speech and Audio Processing, 11(6), 700-708.
[11] Ephraim, Y. and Malah, D. (1984). Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., ASSP-32(6), 1109-1121.
[12] Ephraim, Y. and Malah, D. (1985). Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., ASSP-33(2), 443-445.
[13] Cohen, I. (2002). Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Processing Letters, 9(4), 113-116.
[14] Saleem, N. (2016). Single channel noise reduction system in low SNR. International Journal of Speech Technology, 20(1), 89-98. doi:10.1007/s10772-016-9391-z
[15] Saleem, N., Mustafa, E., Nawaz, A., and Khan, A. (2015). Ideal binary masking for reducing convolutive noise. International Journal of Speech Technology, 18(4), 547-554. doi:10.1007/s10772-015-9298-0
[16] Saleem, N., Shafi, M., Mustafa, E., and Nawaz, A. (2015). A novel binary mask estimation based on spectral subtraction gain induced distortions for improved speech intelligibility and quality. Technical Journal, UET, Taxila, 20(4), 35-42.
[17] Boldt, J. B., Kjems, U., Pedersen, M. S., Lunner, T., and Wang, D. (2008). Estimation of the ideal binary mask using directional systems. In Proc. Int. Workshop Acoust. Echo and Noise Control, pp. 1-4.
[18] Wang, D. (2008). Time-frequency masking for speech separation and its potential for hearing aid design. Trends in Amplification, 12(4), 332-353. doi:10.1177/1084713808326455
[19] Wang, D. (2005). On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines, pp. 181-197. doi:10.1007/0-387-22794-6_12
[20] Gray, R.M. (1984). Vector quantization. IEEE ASSP Magazine, 1(2), 4-29.
[21] Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451-461.
[22] Khan, S. S. and Ahmed, A. (2004). Cluster center initialization for K-means algorithm. Pattern Recognition Letters, 25(11), 1293-1302.
[23] Hu, Y. and Loizou, P. (2007). Subjective evaluation and comparison of speech enhancement algorithms. Speech Commun., 49(7-8), 588-601. doi:10.1016/j.specom.2006.12.006
[24] Rix, A.W., Beerends, J. G., Hollier, M. P., and Hekstra, A.P. (2001). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing (ICASSP), 749-752. doi:10.1109/ICASSP.2001.941023

Nasir Saleem
Engr. Nasir Saleem received the B.S. degree in Telecommunication Engineering from the University of Engineering and Technology, Peshawar-25000, Pakistan, in 2008 and the M.S. degree in Electrical Engineering from CECOS University, Peshawar, Pakistan, in 2012. He was a Senior Lecturer at the Institute of Engineering and Technology, Gomal University, D.I.Khan-29050, Pakistan. He is now an Assistant Professor in the Department of Electrical Engineering, Gomal University, Pakistan. He is currently pursuing the Ph.D. degree in Electrical Engineering at the University of Engineering and Technology, Peshawar-25000, Pakistan. His research interests are in the areas of digital signal processing, speech processing and speech enhancement.

Tayyaba Gul Tareen
Engr. Tayyaba Gul Tareen received the B.S. degree in Electrical Engineering from the University of Engineering and Technology, Peshawar-25000, Pakistan, in 2008 and the M.S. degree in Electrical Engineering from CECOS University, Peshawar, Pakistan, in 2012. She is currently pursuing the Ph.D. degree in Electrical Engineering at Iqra University, Peshawar-25000, Pakistan. Her research interests are in the area of digital signal processing.
