Modified Mel Frequency Cepstral Coefficient
Bangladesh
Abstract: In this paper, a new approach to closed-set speaker identification is presented. The traditional mel-frequency cepstral coefficient (MFCC) feature is modified, and the resulting feature is named the modified MFCC (MMFCC). A text-dependent dataset was used to measure the speaker identification rate of the presented method in both clean and contaminated conditions. Four types of noise were added to the clean signals to produce noisy signals over a range of signal-to-noise ratios (SNRs) from -5 dB to 10 dB. The obtained performance was compared with that of traditional features such as MFCC and the Gammatone frequency cepstral coefficient (GFCC). The results show that the proposed method achieves significantly improved performance over conventional MFCC- and GFCC-based methods under noisy conditions.
Index Terms: Text-dependent, robust speaker identification, modified MFCC (MMFCC), envelope.

I. INTRODUCTION

Automatic speaker identification is the biometric process of identifying a target speaker from a number of known or unknown speakers by matching a voice pattern against a set of speaker models. Nowadays, speaker identification is becoming popular in trading, banking, shopping, crime investigation, and information reservation, and as a forensic tool, because of the unique and distinguishing characteristics of each individual's voice production mechanism. Generally, speaker identification uses either text-dependent or text-independent speech. In an automatic text-dependent speaker identification system, the test speech is confined to specific utterances recorded over several sessions. In this paper, a new text-dependent speaker identification approach is studied. The process is implemented in three steps: feature extraction from the audio signal, speaker modeling, and testing of system performance through pattern classification.
Feature extraction from the voice signal is the most challenging issue in automatic speaker identification. Successful feature extraction provides speaker-distinguishing characteristics that fit properly into the speaker model, which in turn gives robust identification performance irrespective of the environment. For this reason, great emphasis is placed on feature extraction from the speech signal, and a number of feature extraction models have been developed. In a broad sense, they fall into two groups: auditory-periphery-based models such as MFCC [1] and GFCC [2], and voice-production-based models such as perceptual linear predictive (PLP) coefficients [3] and linear prediction cepstral coefficients (LPCCs) [4].
The MFCC is the most widely used front-end in speaker, speech, and phoneme classification, and nowadays serves as a standard feature against which newly proposed methods are compared. MFCC provides good speaker identification accuracy in clean conditions [5, 6], but in noisy conditions (at 0 dB SNR) its performance falls to a remarkable degree [7]. Applying a cubic-root operation in lieu of the log operation in the MFCC feature can slightly improve speaker identification accuracy under noisy conditions [8].
The study [5] shows that the inclusion of phase information enhances MFCC performance. According to the investigation in [9],
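The log-versus-cubic-root compression contrast discussed in connection with [8] can be illustrated with a small numerical sketch; the energy values below are synthetic placeholders, not data from the paper.

```python
import numpy as np

# Compare the two non-linearities a cepstral front-end might apply to
# filterbank energies: the conventional log, and the cubic root studied
# as a noise-robust alternative in [8].
energies = np.array([1e-4, 1e-2, 1.0, 100.0])  # placeholder band energies

log_compressed = np.log(energies)           # conventional MFCC non-linearity
cubic_compressed = energies ** (1.0 / 3.0)  # cubic-root alternative

# The log maps small energies to large negative values, so an additive
# noise floor dominates low-energy bands; the cubic root stays
# non-negative and bounded, one intuition for its better noise robustness.
print(log_compressed)
print(cubic_compressed)
```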
Fig. 2: The newly implemented method's functional block diagram.

In this study, every 6 points of each band response were treated as a frame with 50% overlap, and the energy of each band was computed following

S(i, m) = x(i, 1:m*r) * x(i, 1:m*r)^t                (3)

Here, i indicates the filter band, m is the frame index with overlap r, and t denotes the matrix transpose. This process reduces the feature size to about one-third of the conventional feature. Finally, a discrete cosine transform (DCT) was applied to convert the obtained energy spectrum into cepstral coefficients. The resulting feature is named the modified mel-frequency cepstral coefficient (MMFCC). It should be mentioned that dynamic and acceleration coefficients were not included in MMFCC; this could be a topic of further study.

B. Baseline Feature Extraction

The feature extraction processes of MFCC and GFCC are described below in turn.

i. Mel-frequency cepstral coefficient (MFCC)

The feature extraction process of conventional MFCC is almost similar to that of the proposed feature. The only difference lies in the cochlea non-linearity operation: MFCC applies the non-linearity directly to the FFT-based power spectrum and does not average the filter-response energies. The number of filter bands, the frequency range, and all other parameters were kept the same as in the proposed feature extraction procedure. The MFCC feature extraction from the speech signal was done using the rastamat toolbox [13]. To make a fair comparison with the proposed method, only the static coefficients of MFCC were taken into consideration.

ii. Gammatone filter cepstral coefficient (GFCC)

The GFCC feature is almost similar to the MFCC feature. GFCC uses a Gammatone filterbank rather than triangular filters to reflect the cochlear basilar-membrane responses, and applies a cubic-root operation to the filter responses to add cochlear non-linearity. In this study, 64 Gammatone filter bands were used to simulate frequency responses from 50 Hz to 3 kHz following [14]; this frequency range was chosen to keep pace with the proposed feature. The filter responses were decimated to 100 Hz, which is equivalent to framing, and a cubic-root operation was applied. Finally, a DCT was applied to convert the spectral information into the time domain. As observed in [14], most speech information is retained in the 1st to 23rd bands after the DCT due to its compaction property, so only the first 23 coefficients were used as the GFCC feature in this study. It should be mentioned that no FFT is needed in GFCC feature extraction.

C. Experimental Setup

This paper presents a closed-set text-dependent speaker identification experiment. The text-dependent University Malaya (UM) dataset [11] was used to record the speaker identification rate of the newly proposed method in clean and noisy conditions. Noisy speech was obtained by adding four different noises at SNRs from -5 dB to 10 dB in steps of 5 dB; white, pink, street, and babble noises were used as background noise.

The UM dataset contains 39 speakers, each with ten speech samples of the utterance 'University Malaya'. Seven samples from each speaker, clean signals only, were used to create the speaker model, which was then saved for testing. The remaining three samples were used to test the presented method in clean and distorted conditions.

The most crucial task in speaker identification is building the speaker model. A successful classifier extracts from the features the latent parameters that characterize each individual speaker's identity, and adequate information ensures accurate modeling. In this study, a Gaussian mixture model-universal background model (GMM-UBM) [15] was used for speaker modeling to achieve robust speaker identification (SID) performance.

The GMM speaker model is adapted from the UBM using the speaker's training data, which makes the system faster and more stable and gives better performance. The application of expectation maximization (EM) [16] makes GMM-based speaker modeling a successful classifier: EM can capture the required latent GMM parameters from a small quantity of training data, and the obtained parameters can be applied to new data by maximum a-posteriori (MAP) adaptation [17].

A GMM-UBM-based classifier with 128 mixture components

[Figure: identification rate under white noise]
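The band-energy step of Eq. (3) and the closing DCT can be sketched as follows. This is a minimal reading of the equation, assuming 6-point frames with hop r = 3 (50% overlap) and frame energy taken as the inner product of each frame with itself; the band responses are random placeholders and the names are illustrative, not taken from the authors' implementation.

```python
import numpy as np

# Sketch of the MMFCC band-energy computation of Eq. (3), followed by a
# DCT across bands. x stands in for filterbank band responses.
rng = np.random.default_rng(0)
n_bands, n_samples = 64, 30
x = rng.standard_normal((n_bands, n_samples))   # placeholder band responses

frame_len, hop = 6, 3                           # 6 points, 50% overlap
n_frames = 1 + (n_samples - frame_len) // hop

S = np.empty((n_bands, n_frames))
for m in range(n_frames):
    frame = x[:, m * hop : m * hop + frame_len]
    S[:, m] = np.sum(frame * frame, axis=1)     # Eq. (3): per-band frame energy

# DCT-II across the band axis turns the energy spectrum into cepstral
# coefficients (built from a cosine matrix to stay dependency-free).
k = np.arange(n_bands)[:, None]
i = np.arange(n_bands)[None, :]
dct_basis = np.cos(np.pi * k * (i + 0.5) / n_bands)
mmfcc = dct_basis @ S                           # one coefficient column per frame
```

In a real front-end, x would hold the filterbank responses of a speech signal rather than random numbers, and only the leading coefficients of each column would typically be retained.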
system like compression, two-tone rate suppression, non-linear tuning, and adaptation in the inner-hair-cell-AN synapse. As mentioned above, a cubic-root operation was applied to the speech power spectrum to reflect cochlear non-linearity. However, as observed in this study, that non-linearity was lost when the basilar-membrane energies were averaged, so the cubic-root operation did not contribute to better speaker identification performance here. Rather, the averaged energy (envelope) contributed significantly to the improved performance, and this is what distinguishes the proposed feature from MFCC.

IV. CONCLUSION

The improvement of automatic speaker identification performance under noisy conditions remains challenging. To provide better text-dependent speaker identification under contaminated conditions, a new feature named the modified mel-frequency cepstral coefficient (MMFCC) has been introduced in this paper. The newly proposed method was tested in both clean and noisy conditions using GMM-UBM, and the obtained performance was compared with conventional MFCC- and GFCC-based baselines. The proposed method provides significantly improved performance over the baseline methods. The method could not be validated more extensively owing to the scarcity of text-dependent datasets; applying the presented feature to a larger text-dependent dataset remains future work, as do text-independent speaker identification and speech recognition with the proposed feature.

REFERENCES
[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[2] Y. Shao, S. Srinivasan, and D. Wang, "Incorporating auditory feature uncertainties in robust speaker identification," in Proc. IEEE ICASSP, 2007, pp. IV-277-IV-280.
[3] E. Shriberg, "Higher-level features in speaker recognition," Lecture Notes in Computer Science, vol. 4343, pp. 241-259, 2007.
[4] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561-580, Apr. 1975.
[5] S. Nakagawa, L. Wang, and S. Ohtsuka, "Speaker identification and verification by combining MFCC and phase information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 1085-1095, 2012.
[6] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, pp. 351-356, 1990.
[7] T.-S. Chi, T.-H. Lin, and C.-C. Hsu, "Spectro-temporal modulation energy based mask for robust speaker identification," The Journal of the Acoustical Society of America, vol. 131, pp. EL368-EL374, 2012.
[8] X. Zhao and D. Wang, "Analyzing noise robustness of MFCC and GFCC features in speaker identification," in Proc. IEEE ICASSP, 2013, pp. 7204-7208.
[9] K. K. Paliwal and L. Alsteris, "Usefulness of phase spectrum in human speech perception," in Proc. Eurospeech, 2003, pp. 2117-2120.
[10] R. Plomp and H. J. M. Steeneken, "Effect of phase on the timbre of complex tones," The Journal of the Acoustical Society of America, vol. 46, pp. 409-421, 1969.
[11] M. Islam, M. Zilany, and A. Wissam, "Neural-response-based text-dependent speaker identification under noisy conditions," in International Conference for Innovation in Biomedical Engineering and Life Sciences, 2016, pp. 11-14.
[12] R. S. Holambe and M. S. Deshpande, "Nonlinearity framework in speech processing," in Advances in Non-Linear Modeling for Speech Processing. Springer, 2012, pp. 11-25.
[13] D. P. W. Ellis, "PLP and RASTA and MFCC, and inversion in Matlab," 2005. [Online]. Available: http://www.ee.columbia.edu/dpwe/resources/matlab/rastamat/
[14] Y. Shao, S. Srinivasan, and D. Wang, "Incorporating auditory feature uncertainties in robust speaker identification," in Proc. IEEE ICASSP, 2007, pp. IV-277-IV-280.
[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[16] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, vol. 4, p. 126, 1998.
[17] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "Hidden Markov Model Toolkit (HTK) Version 3.2.1 User's Guide," Cambridge University Engineering Department, 2002.