
Research Journal of Applied Sciences, Engineering and Technology 4(1): 33-40, 2012

ISSN: 2040-7467
© Maxwell Scientific Organization, 2012
Submitted: September 02, 2011    Accepted: October 07, 2011    Published: January 01, 2012

Speaker Identification and Verification using Vector Quantization and Mel Frequency Cepstral Coefficients

A. Srinivasan
Department of ECE, Srinivasa Ramanujan Centre, SASTRA University,
Kumbakonam-612001, India

Abstract: In the study of speaker recognition, the Mel Frequency Cepstral Coefficient (MFCC) method is the best-known and most popular method for feature extraction. In recent years, the vector quantization technique has further been used to minimize the amount of data to be handled. In the present study, speaker recognition using Mel Frequency Cepstral Coefficients and Vector Quantization is carried out for the letter “Zha” (in the Tamil language). The experimental results are analyzed with the help of MATLAB under different conditions, and the results are shown to remain efficient in a noisy environment.

Key words: MATLAB, Mel frequency, speaker recognition, vector quantization (VQ)

INTRODUCTION

In speech recognition, HMMs have been used to model observed patterns since the 1970s. Many researchers (Rabiner and Juang, 1986; Rabiner and Schafer, 1978; Russell and Moore, 1985; Fu, 1980) published a large number of papers presenting the HMM as a tool for these practical problems. These studies were written by researchers interested in pattern recognition, often from an engineering or computer science viewpoint, and they usually focus on algorithms and on results in practical situations. Speech recognition recognizes the words spoken, whereas speaker recognition identifies and verifies the speaker; it is a biometric modality that uses an individual’s voice for recognition. In general, speaker recognition comprises two different subtasks, viz. Speaker Identification (SI) and Speaker Verification (SV), of which the identification task is considered the more difficult: when the number of speakers increases, the probability of an incorrect decision increases (Doddington, 1985; Furui, 1986, 1994, 2001; Prabhakar et al., 2003). The performance of the verification task is not, at least in theory, affected by the population size, since only two speakers are compared.

In auditory speaker recognition, considerable differences between individuals have been observed (Rose, 2002; Schmidt-Nielsen and Crystal, 2000). Moreover, human performance decreases as the time between hearing the two voices increases (Kerstholt et al., 2003). Several studies have compared human and machine performance in speaker recognition (Schmidt-Nielsen and Crystal, 2000; Liu et al., 1997; Sullivan and Pelecanos, 2001). Schmidt-Nielsen and Crystal conducted a large-scale comparison in which nearly 50,000 listening judgments were performed by 65 listeners grouped in panels of 8, and the results were compared with state-of-the-art computer algorithms; it was observed that individual human listeners vary significantly in their ability to recognize speakers.

In recent years, higher-level cues have begun to interest more and more researchers in automatic speaker recognition (Campbell et al., 2003; Doddington, 2001; Reynolds et al., 2003; Xiang, 2003). For instance, automatic systems that use several low- and high-level speaker cues have recently been introduced (Campbell et al., 2003; Reynolds et al., 2003). Although many new techniques have been invented and developed, a number of practical limitations still prevent the widespread deployment of applications and services. Vector Quantization (VQ) is an efficient data compression technique that has been used in various applications involving VQ-based encoding and VQ-based recognition, and it has been very popular in the field of speech recognition. Speech recognition of the letter “Zha” (in the Tamil language) has previously been carried out using LPC (Srinivasan et al., 2009) and using HMM (Srinivasan, 2011).

In the present study, speaker recognition using Mel Frequency Cepstral Coefficients and Vector Quantization is carried out for the letter “Zha” (in the Tamil language), and the experimental results are analyzed with the help of MATLAB.

METHODOLOGY

Modules of speaker recognition: A speaker recognition system is mainly composed of the following four modules:


• Front-end processing: The “signal processing” part, which converts the sampled speech signal into a set of feature vectors that characterize the properties of speech able to separate different speakers. Front-end processing is performed in both the training and the recognition phase.
• Speaker modeling: This part reduces the feature data by modelling the distributions of the feature vectors.
• Speaker database: The speaker models are stored here.
• Decision logic: Makes the final decision about the identity of the speaker by comparing the unknown feature vectors to all models in the database and selecting the best matching model.
Mel frequency cepstral coefficients: Mel Frequency Cepstral Coefficients are coefficients that represent audio based on human perception, and they have enjoyed great success in speaker recognition applications. They are derived from the Fourier transform of the audio clip. In this technique the frequency bands are positioned logarithmically, whereas in the plain Fourier transform the frequency bands are spaced uniformly. Because the frequency bands are positioned logarithmically, MFCC approximates the human auditory response more closely than other representations, and the coefficients allow better processing of the data. The calculation of the Mel cepstrum is the same as that of the real cepstrum, except that the Mel cepstrum’s frequency scale is warped to correspond to the Mel scale.
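The paper’s own processing was done in MATLAB; purely as an illustration, the following is a minimal NumPy/SciPy sketch of a standard MFCC front end (framing and windowing, power spectrum, mel-spaced triangular filterbank, log compression, DCT). The frame sizes and the filter and coefficient counts are common defaults, not values taken from the paper.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    # Mel-scale warping: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs, n_fft=512, n_filters=26, n_coeffs=13,
         frame_len=0.025, frame_step=0.010):
    """Return an (n_frames, n_coeffs) array of MFCC feature vectors.
    Assumes len(signal) is at least one frame length."""
    signal = np.asarray(signal, dtype=float)
    flen, fstep = int(frame_len * fs), int(frame_step * fs)
    n_frames = 1 + (len(signal) - flen) // fstep
    window = np.hamming(flen)

    # Triangular filters spaced evenly on the mel scale,
    # i.e., logarithmically in frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    feats = np.empty((n_frames, n_coeffs))
    for t in range(n_frames):
        frame = signal[t * fstep: t * fstep + flen] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        log_mel = np.log(fbank @ power + 1e-12)   # log filterbank energies
        feats[t] = dct(log_mel, type=2, norm='ortho')[:n_coeffs]  # mel cepstrum
    return feats
```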

Vector quantization: Vector quantization is the process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its centre, called a codeword; the collection of all codewords is called a codebook. Vector Quantization (VQ) is a lossy data compression method based on the principle of block coding. It is a fixed-to-fixed length algorithm, and may be thought of as an approximator.

The technique of VQ consists of extracting a small number of representative feature vectors as an efficient means of characterizing the speaker-specific features, since storing every single vector generated from the training data is impossible. Figure 1 shows a conceptual diagram of this recognition process; only two speakers and two dimensions of the acoustic space are shown. The circles are the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors with the clustering algorithm described below. The resulting codewords (centroids) are shown as black circles and black triangles for speakers 1 and 2, respectively.

The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is “vector-quantized” using each trained codebook and the total VQ distortion is computed; the speaker whose VQ codebook gives the smallest total distortion is identified as the speaker of the input utterance. In other words, during training the feature vectors are clustered to form a codebook for each speaker; in the recognition stage, the data from the tested speaker are compared against the codebook of each speaker, the differences are measured, and these differences are then used to make the recognition decision.
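A minimal sketch of this decision rule, assuming codebooks have already been trained (the helper names are hypothetical, not from the paper):

```python
import numpy as np

def total_vq_distortion(features, codebook):
    """Sum of squared distances from each feature vector to its
    nearest codeword (the total VQ distortion of the utterance)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def identify_speaker(features, codebooks):
    """codebooks: dict mapping speaker id -> (N, k) codeword array.
    Returns the speaker whose codebook yields the smallest distortion."""
    return min(codebooks,
               key=lambda spk: total_vq_distortion(features, codebooks[spk]))
```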


Fig. 1: VQ classification model (acoustic sample vectors and codebook centroids for speaker 1 and speaker 2, with a sample VQ distortion indicated)

Design problem: The VQ design problem can be stated as follows: given a vector source with known statistical properties, given a distortion measure, and given the number of code vectors, find a codebook and a partition of the space which result in the smallest average distortion. We assume that there is a training sequence consisting of M source vectors:

$$T = \{x_1, x_2, \ldots, x_M\}$$

This training sequence can be obtained from some large database; for example, if the source is a speech signal, the training sequence can be obtained by recording several long telephone conversations. M is assumed to be sufficiently large that all the statistical properties of the source are captured by the training sequence. We assume that the source vectors are k-dimensional, e.g.:

$$x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,k}), \quad m = 1, 2, \ldots, M$$

Let N be the number of code vectors and let

$$C = \{c_1, c_2, \ldots, c_N\}$$

represent the codebook; each code vector is also k-dimensional:

$$c_n = (c_{n,1}, c_{n,2}, \ldots, c_{n,k}), \quad n = 1, 2, \ldots, N$$

Let S_n be the encoding region associated with code vector c_n, and let

$$P = \{S_1, S_2, \ldots, S_N\}$$

denote the partition of the space. If the source vector x_m lies in the encoding region S_n, then its approximation, denoted Q(x_m), is c_n:

$$Q(x_m) = c_n \quad \text{if } x_m \in S_n$$

Assuming a squared-error distortion measure, the average distortion is given by:

$$D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \left\| x_m - Q(x_m) \right\|^2$$

The design problem can be succinctly stated as follows: given T and N, find C and P such that D_{ave} is minimized.
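In code, the quantizer Q and the average distortion translate almost directly from these definitions; a small NumPy sketch (illustrative only):

```python
import numpy as np

def quantize(x, codebook):
    """Q(x): map each source vector to its nearest codeword."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[d2.argmin(axis=1)]

def average_distortion(train, codebook):
    """D_ave = (1 / (M k)) * sum_m ||x_m - Q(x_m)||^2."""
    diff = train - quantize(train, codebook)
    return (diff ** 2).sum() / train.size   # train.size == M * k
```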

Optimality criteria: If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.

Nearest neighbor condition:

$$S_n = \left\{ x : \|x - c_n\|^2 \le \|x - c_{n'}\|^2 \ \ \forall\, n' = 1, 2, \ldots, N \right\}$$

This condition says that the encoding region S_n should consist of all vectors that are closer to c_n than to any of the other code vectors; for vectors lying on a boundary, any tie-breaking procedure will do.

Centroid condition:

$$c_n = \frac{\sum_{x_m \in S_n} x_m}{\sum_{x_m \in S_n} 1}, \quad n = 1, 2, \ldots, N$$

This condition says that the code vector c_n should be the average of all the training vectors that lie in its encoding region S_n. In implementation, one should ensure that at least one training vector belongs to each encoding region (so that the denominator in the above equation is never 0).
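Together, the two conditions suggest a Lloyd-style iteration: partition by nearest neighbour, then recompute centroids. A minimal NumPy sketch of one such iteration, with the empty-cell guard the text calls for, might look as follows:

```python
import numpy as np

def lloyd_step(train, codebook):
    """One iteration: enforce the nearest-neighbour condition, then the
    centroid condition. Returns the updated codebook and D_ave."""
    M, k = train.shape
    # Nearest-neighbour condition: assign each x_m to its closest codeword
    d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Centroid condition: each codeword becomes the mean of its cell
    new_codebook = codebook.copy()
    for n in range(len(codebook)):
        cell = train[labels == n]
        if len(cell) > 0:        # keep the old codeword if the cell is empty
            new_codebook[n] = cell.mean(axis=0)
    d_ave = d2[np.arange(M), labels].sum() / (M * k)
    return new_codebook, d_ave
```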
Algorithm: After the enrolment session, the acoustic vectors extracted from the input speech of each speaker provide a set of training vectors for that speaker. As described above, the next important step is to build a speaker-specific VQ codebook for each speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm (Linde et al., 1980), for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule

$$y_n^+ = y_n(1 + \epsilon), \qquad y_n^- = y_n(1 - \epsilon)$$

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-neighbour search: for each training vector, find the codeword in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages: it starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.
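A compact sketch of the full procedure, reusing lloyd_step from above (ε = 0.01 as in the text; the stopping tolerance tol is an assumed value, and the target size M is taken to be a power of two since each split doubles the codebook):

```python
import numpy as np

def lbg(train, M, eps=0.01, tol=1e-3):
    """Design an M-vector codebook by repeated splitting (LBG)."""
    codebook = train.mean(axis=0, keepdims=True)  # step 1: 1-vector codebook
    while len(codebook) < M:
        # step 2: split each codeword y_n into y_n(1+eps) and y_n(1-eps)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        # steps 3-5: iterate search and update until D_ave stops improving
        prev = np.inf
        while True:
            codebook, d_ave = lloyd_step(train, codebook)
            if prev - d_ave < tol * max(d_ave, 1e-12):
                break
            prev = d_ave
    return codebook
```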


Fig. 2: VQ flow chart (find the centroid; while the codebook size m is below M, split each centroid so that m = 2m, cluster the vectors, find the new centroids, and compute the distortion D; repeat until the change D' − D falls below ε, then stop)

LBG design algorithm: The LBG VQ design algorithm is an iterative algorithm which alternately enforces the two optimality criteria above. The algorithm requires an initial codebook, which is obtained by the splitting method: an initial code vector is set as the average of the entire training sequence and is then split into two, and the iterative algorithm is run with these two vectors as the initial codebook. The resulting two code vectors are split into four, and the process is repeated until the desired number of code vectors is obtained. Figure 2 shows the flow chart of the algorithm.

EXPERIMENTAL RESULTS

The most distinctive letter in the Tamil language is “Zha”, because of the deliberation that the articulation of the sound demands. Therefore, a trained set of fifty speakers was selected to spell the letter “Zha” for the speaker recognition experiments. Environmental noise during the recording process was overcome using the WaveSurfer tool, a simple but powerful interface; the standard speech analyses, namely the waveform, spectrogram, pitch and power panes, were examined. A magnitude and frequency comparison of 3 male and 3 female speakers is shown in Table 1.

The trained and test sets of voice data are processed in MATLAB using vector quantization, and the Mel frequency cepstral coefficients so obtained are shown in Fig. 3. The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis.

In the speaker identification process, the test data are matched against the trained data to identify the speakers (Fig. 4). During the verification process of speaker recognition, both positive and negative results were obtained.


Table 1: Magnitude and frequency comparison

Sl. No.   Frequency (Hz)   F1 (dB)   F2 (dB)   F3 (dB)   M1 (dB)   M2 (dB)   M3 (dB)
1         15.625           -20.47    -20.98    -19.67    -21.02    -20.86    -20.71
2         140.625          -32.68    -32.73    -32.01    -32.94    -33.13    -32.75
3         390.625          -36.63    -36.74    -36.13    -36.94    -37.14    -37.01
4         640.625          -41.36    -34.99    -40.97    -41.48    -41.92    -40.99
5         1015.625         -51.20    -47.91    -51.00    -51.90    -52.65    -50.90

F1-F3: female speakers; M1-M3: male speakers

Fig. 3: MFCC vectors before and after VQ

Fig. 4: Speaker identification


Fig. 5: Speaker verification (positive result)

Fig. 6: Speaker verification (negative result)


If the speaker’s voice matches, the output is reported as a match; otherwise it is reported as a mismatch. Fig. 5 (positive) and Fig. 6 (negative) show the verification of the speakers.

The performance of VQ is typically given in terms of the signal-to-distortion ratio (SDR):

$$\mathrm{SDR} = 10 \log_{10} \left( \frac{\sigma^2}{D_{ave}} \right)$$

where σ² is the variance of the source and D_{ave} is the average squared-error distortion. The higher the SDR, the better the performance.
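As a sketch, the SDR can be computed for a trained codebook by reusing the average_distortion helper defined earlier (an illustration, not the paper’s code):

```python
import numpy as np

def sdr_db(train, codebook):
    """Signal-to-distortion ratio in dB for a trained codebook."""
    sigma2 = train.var()   # variance of the source
    return 10.0 * np.log10(sigma2 / average_distortion(train, codebook))
```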
CONCLUSION

Speaker recognition using Mel Frequency Cepstral Coefficients and Vector Quantization has been carried out for the letter “Zha” (in the Tamil language). The experimental results, analyzed with the help of MATLAB, show that the method is efficient. The process can be extended to any number of speakers. In future, the speaker recognition process will be of prime importance for voice-based Automatic Teller Machines.

ACKNOWLEDGMENT

The author would like to thank G. Rajaa Krishnamurthy and G. Raghavan, Tata Consultancy Services, Chennai, for their help with the MATLAB processing; the speakers for spelling out the letter “Zha”; and the reviewer for kindly reviewing this article and offering valid suggestions to improve the manuscript.

REFERENCES

Campbell, J., D. Reynolds and R. Dunn, 2003. Fusing high- and low-level features for speaker recognition. Proc. 8th European Conference on Speech Communication and Technology (Eurospeech), Geneva, Switzerland, pp: 2665-2668.
Doddington, G., 1985. Speaker recognition: Identifying people by their voices. Proc. IEEE, 73(11): 1651-1664.
Doddington, G., 2001. Speaker recognition based on idiolectal differences between speakers. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech), Aalborg, Denmark, pp: 2521-2524.
Fu, K.S., 1980. Statistical Pattern Classification Using Contextual Information. Research Studies Press.
Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoustics, Speech, Signal Processing, ASSP-34(1): 52-59.
Furui, S., 1994. An overview of speaker recognition technology. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp: 1-9.
Furui, S., 2001. Digital Speech Processing, Synthesis, and Recognition. 2nd Edn., Marcel Dekker, Inc., New York.
Kerstholt, J., E. Jansen, A. van Amelsvoort and A. Broeders, 2003. Earwitness line-ups: Effects of speech duration, retention interval and acoustic environment on identification accuracy. Proc. 8th European Conference on Speech Communication and Technology (Eurospeech), Geneva, Switzerland, pp: 709-712.
Linde, Y., A. Buzo and R. Gray, 1980. An algorithm for vector quantizer design. IEEE Trans. Commun., 28: 84-95.
Liu, L., J. He and G. Palm, 1997. A comparison of human and machine in speaker recognition. Proc. 5th European Conference on Speech Communication and Technology (Eurospeech), Rhodos, Greece, pp: 2327-2330.
Prabhakar, S., S. Pankanti and A. Jain, 2003. Biometric recognition: Security and privacy concerns. IEEE Security Privacy Magazine, 1: 33-42.
Rabiner, L.R. and B.H. Juang, 1986. An introduction to hidden Markov models. IEEE Acoustics, Speech, Signal Processing Magazine, 3: 4-16.
Rabiner, L.R. and R.W. Schafer, 1978. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, N.J.
Reynolds, D., W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones and B. Xiang, 2003. The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition. Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, pp: 784-787.
Rose, P., 2002. Forensic Speaker Identification. Taylor & Francis, London.
Russell, M.J. and R.K. Moore, 1985. Explicit modeling of state occupancy in hidden Markov models for speech signals. Proc. ICASSP-85, IEEE, New York, pp: 5-8.
Schmidt-Nielsen, A. and T. Crystal, 2000. Speaker verification by human listeners: Experiments comparing human and machine performances using the NIST 1998 speaker evaluation data. Digital Signal Processing, 10: 249-266.
Srinivasan, A., K. Srinivasa Rao, D. Narasimhan and K. Kannan, 2009. Speech processing of the letter ‘zha’ in Tamil language with LPC. Contemp. Eng. Sci., 2(10): 497-505.


Srinivasan, A., 2011. Speech recognition using hidden Markov model. Appl. Math. Sci., 5(79): 3943-3948.
Sullivan, K. and J. Pelecanos, 2001. Revisiting Carl Bildt’s impostor: Would a speaker verification system foil him? Proc. Audio- and Video-Based Biometric Authentication (AVBPA), Halmstad, Sweden, pp: 144-149.
Xiang, B., 2003. Text-independent speaker verification with dynamic trajectory model. IEEE Signal Proc. Lett., 10: 141-143.

