International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 5, Nº 1

Spectral Restoration Based Speech Enhancement For Robust Speaker Identification

Abstract

Spectral restoration based speech enhancement algorithms are used to enhance the quality of noise-masked speech for robust speaker identification. In the presence of background noise, the performance of speaker identification systems can deteriorate severely. The present study employed and evaluated the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator with a modified a priori SNR estimate prior to speaker identification, to improve the performance of speaker identification systems in the presence of background noise. For speaker identification, Mel Frequency Cepstral Coefficients and Vector Quantization are used to extract the speech features and to model the extracted features respectively. The experimental results showed a significant improvement in speaker identification rates when spectral restoration based speech enhancement algorithms are used as a pre-processing step. The identification rates are found to be higher after employing the speech enhancement algorithms.

Keywords: A Priori SNR, Spectral Restoration, Speech Enhancement, Speaker Identification, Mel Frequency Cepstral Coefficients, Vector Quantization.

DOI: 10.9781/ijimai.2018.01.002

* Corresponding author. E-mail address: [email protected]

Fig. 1. Procedural block diagram of Speech enhancement and speaker identification system.
Regular Issue
II. Spectral Restoration Based Speech Enhancement System

In a classical spectral restoration based speech enhancement system, the noisy speech is given as y(t) = s(t) + n(t), where s(t) and n(t) denote the clean speech and the noise signal respectively. Let Y(k,ωk), S(k,ωk) and N(k,ωk) denote the short-time spectra of y(t), s(t) and n(t) at spectral component ωk and time frame k. Frame-based analysis exploits the quasi-stationary nature of speech, since both speech and noise signals exhibit non-stationary behavior. A speech enhancement algorithm multiplies the short-time spectrum Y(k,ωk) by a spectral gain G(k,ωk), and the computation of the spectral gain depends on two key parameters, the a posteriori SNR and the a priori SNR:

γ(k,ωk) = |Y(k,ωk)|² / E{|N(k,ωk)|²} = |Y(k,ωk)|² / σn²(k,ωk)  (1)

ξ(k,ωk) = E{|S(k,ωk)|²} / E{|N(k,ωk)|²} = σs²(k,ωk) / σn²(k,ωk)  (2)

where E{·} denotes the expectation operator, and γ(k,ωk) and ξ(k,ωk) are the a posteriori and a priori SNR estimates respectively. In practical implementations of a speech enhancement system, the power spectra of the clean speech |S(k,ωk)|² and of the noise |N(k,ωk)|² are unknown, since only the noisy speech is available. Therefore, both the instantaneous and the a priori SNR need to be estimated. The noise power spectral density is estimated during speech gaps using the standard recursive relation:

σ̂n²(k,ωk) = β·σ̂n²(k−1,ωk) + (1−β)·σY²(k−1,ωk)  (3)

where β is the smoothing factor and σ̂n²(k−1,ωk) is the estimate in the previous frame. The instantaneous SNR can be calculated as:

SNRINST(k,ωk) = |S(k,ωk)|² / |N(k,ωk)|²  (4)

The decision-directed a priori SNR estimate is:

ξDD(k,ωk) = α·|G(k−1,ωk)·Y(k−1,ωk)|² / σ̂n²(k−1,ωk) + (1−α)·F{γ(k,ωk) − 1}  (5)

where α is a smoothing factor with the constant value 0.98, ξDD(k,ωk) is the a priori SNR estimate obtained via the decision-directed (DD) method, and F{·} denotes half-wave rectification. By setting α to a fixed value close to 1, the DD approach introduces little residual noise; however, a fixed value cannot track rapid changes of speech and may therefore delay the estimate. The DD method is efficient and performs well in speech enhancement applications, but the a priori SNR follows the shape of the instantaneous SNR with a single-frame delay. To overcome this single-frame delay, a modified form of the DD method is used to estimate the a priori SNR. The modified a priori SNR is written as:

ξMDD(k,ωk) = α·|G(k−1,ωk)·Y(k−1,ωk)|² / σ̂n²(k−1,ωk) + μ(k,ωk) + (1−α)·F{γ(k,ωk) − 1}  (6)

where ζ = 0.998 is the momentum parameter, μ(k,ωk) denotes the momentum term and λD(k,ωk) is the estimate of the background noise variance; ξMDD(k,ωk) is the a priori SNR estimate after modification. The estimated clean speech spectrum SEST(k,ωk) is obtained by multiplying the gain function with the noisy speech Y(k,ωk):

SEST(k,ωk) = Y(k,ωk)·G(k,ωk)  (8)

The gain function G(k,ωk) is given as:

(9)

where ς is used to avoid large gain values at low a posteriori SNR, and ς = 10 is chosen here.

III. Speaker Identification System

The aim of a speaker recognition system is to extract identity information about a speaker; the task is divided into two sub-categories, speaker identification (SID) and speaker verification (SVR). For SID, Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) are used to extract the speech features and to model the extracted features respectively. The speaker identification system operates in two stages, training and testing. In the training stage the system builds a database of speech signals and formulates a feature model of the speech utterances. In the testing stage, the system uses the information stored in the database and attempts to segregate and identify the speakers. Here, the MFCC features are used for constructing the SID system. The extracted features of the speakers are quantized to a number of centroids using the vector quantization (VQ) K-means algorithm. MFCCs are computed in the training as well as the testing stage. The Euclidean distance between the MFCCs of all speakers in the training stage and the centroids of an isolated speaker in the testing stage is calculated, and a particular speaker is identified according to the minimum Euclidean distance.

A. Feature Extraction

The MFCCs are obtained by first pre-emphasizing the speech to emphasize the high frequencies and to remove the effects of glottal pulses and lip radiation. The resulting speech is segmented into frames, windowed, and the FFT is computed to obtain the spectra. To approximate the human auditory system, a bank of triangular band-pass filters is used; a linear scale is used for center frequencies below 1 kHz, while a logarithmic scale is used for center frequencies above 1 kHz. The filter bank response is given in Fig. 2. The Mel-spaced filter bank response is given as:

(10)

The DFT is computed on the log of the Mel spectrum to form the cepstrum as:

(11)
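A minimal sketch of the spectral restoration front end of Section II may help make the recursion concrete. The momentum update for μ(k,ωk) (Eq. (7)) is not shown in this excerpt, so a simple decayed first-difference is assumed here; a Wiener-type gain stands in for the MMSE-STSA gain of Eq. (9), and the frame length, hop size and energy-based speech-gap test are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def spectral_restoration(y, frame_len=256, hop=128, alpha=0.98,
                         beta=0.9, zeta=0.998, varsigma=10.0):
    """Sketch of DD-based spectral restoration (Eqs. (1), (3), (6), (8))."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    Y = np.fft.rfft(frames, axis=1)
    # Initial noise variance from the first frames, assumed noise-only.
    noise_var = np.mean(np.abs(Y[:5]) ** 2, axis=0) + 1e-12

    S_prev = np.zeros(Y.shape[1])   # |G(k-1)Y(k-1)|^2 from the previous frame
    xi_prev = np.zeros(Y.shape[1])  # previous a priori SNR, for the momentum term
    mu = np.zeros(Y.shape[1])       # momentum term mu(k, w_k)
    out = np.zeros(len(y))
    for k in range(n_frames):
        P = np.abs(Y[k]) ** 2
        gamma = P / noise_var                                  # Eq. (1)
        xi = alpha * S_prev / noise_var + mu \
             + (1 - alpha) * np.maximum(gamma - 1, 0)          # Eq. (6)
        xi = np.maximum(xi, 0)
        mu = zeta * (xi - xi_prev)  # assumed momentum update (Eq. (7) not shown)
        xi_prev = xi
        # Wiener-type gain; varsigma limits the gain at low a posteriori SNR.
        G = xi / (xi + 1 + 1.0 / varsigma)
        S = G * Y[k]                                           # Eq. (8)
        S_prev = np.abs(S) ** 2
        if np.mean(gamma) < 2.0:    # crude speech-gap test for noise tracking
            noise_var = beta * noise_var + (1 - beta) * P      # Eq. (3)
        out[k * hop:k * hop + frame_len] += np.fft.irfft(S, frame_len) * win
    return out
```

The overlap-add at the end reconstructs a time-domain signal of the same length as the input, so the enhanced output can be fed directly to the MFCC front end of Section III.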
MFCCs, typically between 5 and 26, are chosen, where Nf is the number of Mel filters. The first few coefficients are used since most of the speaker-specific information is contained in them.

The codebook with the lowest distortion is selected, where the distortion is the sum of squared Euclidean distances between the vectors and their centroids. As a result, all feature vectors in the sequence M are compared with the codebooks, and the codebook with the minimum average distance is selected. The Euclidean distance between two points λ = (λ1, λ2, …, λn) and η = (η1, η2, …, ηn) is given by [21], [22]:

d(λ, η) = √( Σᵢ₌₁ⁿ (λᵢ − ηᵢ)² )  (12)
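Assuming MFCC matrices (frames × coefficients) are already available per speaker, the VQ codebook training and the minimum-average-distance identification described above can be sketched as follows; the codebook size, iteration count and the function names are illustrative assumptions.

```python
import numpy as np

def train_codebook(features, n_centroids=16, n_iter=20, seed=0):
    """K-means codebook for one speaker's MFCC matrix (n_frames, n_coeffs)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_centroids, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (Euclidean, Eq. (12)).
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_centroids):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return centroids

def identify(test_features, codebooks):
    """Pick the speaker whose codebook yields minimum average distortion."""
    best, best_d = None, np.inf
    for spk, cb in codebooks.items():
        d = np.linalg.norm(test_features[:, None, :] - cb[None, :, :], axis=2)
        avg = d.min(axis=1).mean()   # average nearest-centroid distance
        if avg < best_d:
            best, best_d = spk, avg
    return best
```

With well-separated synthetic features for two speakers, `identify` returns the label of the codebook closest to the test features, mirroring the minimum-Euclidean-distance decision rule of Section III.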
TABLE I. PESQ Analysis Against Baseline Speech Enhancement Algorithms and Noisy Speech
TABLE II. Segmental SNR (SNRSeg) Analysis Against Baseline Speech Enhancement Algorithms and Noisy Speech
TABLE III. Speaker Identification Rates of Speech Enhancement Algorithms (in Percentage)
Noise     SNR (dB)    Identification rates (%)
Babble     0          41   52   55   56   62
           5          58   64   67   69   71
          10          77   81   83   84   79
          15          85   88   89   88   91
Car        0          40   51   53   55   58
           5          56   66   69   71   73
          10          76   81   85   87   88
          15          82   89   89   90   91
Street     0          38   49   54   57   59
           5          46   67   69   71   73
          10          71   80   82   86   88
          15          80   85   87   90   92
V. Summary and Conclusions

This paper presented the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator with a modified a priori SNR estimate to reduce the background noise and to improve the identification rates of speaker identification systems in the presence of background noise. The lowest identification rates are reported when background noises such as babble, car and street noise are present. By applying the proposed speech enhancement algorithm as a pre-processing step, the identification rates increased by about 40%, 38% and 35% at the low SNR level (0 dB) in the three noise environments. The proposed speech enhancement algorithm offered significant improvements in terms of PESQ and SNRSeg scores, and the speaker identification rates are consistently higher than those of the baseline algorithms in all noise environments and at all SNR levels. In the presence of noise it is difficult to identify a specific speaker; however, the use of a speech enhancement system prior to speaker identification markedly increased the identification rates. On the basis of the experimental results, it is concluded that using the proposed speech enhancement algorithm as a preprocessor can markedly increase speaker identification performance in noisy environments compared with other speech enhancement algorithms.
References
[1] Hicham, E.M., Akram, H., Khalid, S. (2016). Using features of local densities, statistics and HMM toolkit (HTK) for offline Arabic handwriting text recognition. J. Electr. Syst. Inform. Technol., 4(3), 387-396. http://dx.doi.org/10.1016/j.jesit.2016.07.005
[2] Berouti, M., Schwartz, M., and Makhoul, J. (1979). Enhancement of
speech corrupted by acoustic noise. Proc. IEEE Int. Conf. Acoust., Speech,
Signal Processing, pp: 208-211.
[3] Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 4, p. IV-4164.
[4] Gustafsson, H., Nordholm, S., and Claesson, I. (2001). Spectral subtraction
using reduced delay convolution and adaptive averaging. IEEE Trans. on
Speech and Audio Processing, 9(8), 799-807.
[5] Saleem N, Ali S, Khan U and Ullah F, (2013). Speech Enhancement
with Geometric Advent of Spectral Subtraction using Connected Time-
Frequency Regions Noise Estimation. Research Journal of Applied
Sciences, Engineering and Technology, 6(06), 1081-1087.
[6] Lim, J. and Oppenheim, A. V. (1978). All-pole modeling of degraded
speech. IEEE Trans. Acoust. , Speech, Signal Proc., ASSP-26(3), 197-
210.
Fig. 4. PESQ Analysis.
[7] Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
[8] Hu, Y. and Loizou, P. (2004). Speech enhancement based on wavelet
thresholding the multitaper spectrum. IEEE Trans. on Speech and Audio
Processing, 12(1), 59-67.
[9] Hu, Y. and Loizou, P. (2003). A generalized subspace approach for
enhancing speech corrupted by colored noise. IEEE Trans. on Speech and
Audio Processing, 11, 334-341.
[10] Jabloun, F. and Champagne, B. (2003). Incorporating the human hearing
properties in the signal subspace approach for speech enhancement. IEEE
Trans. on Speech and Audio Processing, 11(6), 700-708.
[11] Ephraim, Y. and Malah, D. (1984). Speech enhancement using a minimum
mean-square error short-time spectral amplitude estimator. IEEE Trans.
Acoust.,Speech, Signal Process., ASSP-32(6), 1109-1121.
[12] Ephraim, Y. and Malah, D. (1985). Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., ASSP-33(2), 443-445.
[13] Cohen, I. (2002). Optimal speech enhancement under signal presence
uncertainty using log-spectra amplitude estimator. IEEE Signal Processing
Letters, 9(4), 113-116.
[14] Saleem, N. (2016), Single channel noise reduction system in low SNR.
International Journal of Speech Technology, 20(1), 89-98. doi: 10.1007/
s10772-016-9391-z
[15] Saleem, N., Mustafa, E., Nawaz, A., & Khan, A. (2015). Ideal binary masking for reducing convolutive noise. International Journal of Speech Technology, 18(4), 547–554. doi:10.1007/s10772-015-9298-0
Fig. 5. SNRSeg Analysis.
[16] Saleem, N., Shafi, M., Mustafa, E., & Nawaz, A. (2015). A novel binary
mask estimation based on spectral subtraction gain induced distortions
for improved speech intelligibility and quality. Technical Journal, UET,
Taxila, 20(4), 35–42.
[17] Boldt, J. B., Kjems, U., Pedersen, M. S., Lunner, T., & Wang, D. (2008).
Estimation of the ideal binary mask using directional systems. In Proc. Int.
Workshop Acoust. Echo and Noise Control, pp. 1–4.
[18] Wang, D. (2008). Time-frequency masking for speech separation and its
potential for hearing aid design. Trends in Amplification, 12(4), 332–353.
doi:10.1177/1084713808326455
[19] Wang, D. (2005). On ideal binary mask as the computational goal of
auditory scene analysis. In Speech separation by humans and machines,
pp: 181–197.doi:10.1007/0-387-22794-6_12
[20] Gray, R.M. (1984). Vector Quantization. IEEE ASSP Magazine, 1(2), 4-29.
[21] Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451-461.
[22] Khan, S. S. and Ahmad, A. (2004). Cluster center initialization algorithm for K-means clustering. Pattern Recognition Letters, 25(11), 1293-1302.
[23] Hu Y. and Loizou P. (2007). Subjective evaluation and comparison of
speech enhancement algorithms. Speech Commun., 49(7-8), 588–601.
doi:10.1016/j.specom.2006.12.006
Fig. 6. Speaker identification rate analysis.
[24] Rix A.W., Beerends J. G., Hollier M. P., Hekstra A.P. (2001). Perceptual
evaluation of speech quality (PESQ)-a new method for speech quality
assessment of telephone networks and codecs. In Acoustics, Speech, and
Signal Processing (ICASSP), 749–752. doi: 10.1109/ICASSP.2001.941023
Nasir Saleem

Engr. Nasir Saleem received the B.S. degree in Telecommunication Engineering from the University of Engineering and Technology, Peshawar-25000, Pakistan, in 2008, and the M.S. degree in Electrical Engineering from CECOS University, Peshawar, Pakistan, in 2012. He was a senior Lecturer at the Institute of Engineering and Technology, Gomal University, D.I.Khan-29050, Pakistan, and is now an Assistant Professor in the Department of Electrical Engineering, Gomal University, Pakistan. He is currently pursuing the Ph.D. degree in Electrical Engineering at the University of Engineering and Technology, Peshawar-25000, Pakistan. His research interests are in the areas of digital signal processing, speech processing and speech enhancement.