Reference Paper 4
Published in IET Biometrics
Received on 16th February 2014
Revised on 10th September 2014
Accepted on 11th September 2014
doi: 10.1049/iet-bmt.2014.0011
ISSN 2047-4938
Abstract: The rapid pace of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her voice, regardless of the spoken content. In this study, the authors designed and implemented a novel text-independent multimodal speaker
identification system based on wavelet analysis and neural networks. Wavelet analysis comprises discrete wavelet transform,
wavelet packet transform, wavelet sub-band coding and Mel-frequency cepstral coefficients (MFCCs). The learning module
comprises general regression, probabilistic and radial basis function neural networks, forming decisions through a majority
voting scheme. The system was found to be competitive and it improved the identification rate by 15% as compared with the
classical MFCC. In addition, it reduced the identification time by 40% as compared with the back-propagation neural
network, Gaussian mixture model and principal component analysis. Performance tests conducted using the GRID database
corpora have shown that this approach has faster identification time and greater accuracy compared with traditional
approaches, and it is applicable to real-time, text-independent speaker identification systems.
analyses, were first tested and proposed in [23]. The primary idea was to irregularly prune the decomposition tree generated by WPT for enhanced accuracy. A feature extraction scheme derived from the wavelet eigenfunction was proposed in [25], and a text-independent SIS was proposed in [26] based on an improved wavelet transform, which relies on the kernel canonical correlation analysis. WPT, which is analogous to DWT in some ways, obtains the speech signal using a recursive binary tree and performs a form of recursive decomposition. Instead of performing decomposition only on approximations, it decomposes the details as well. WPT, therefore, has a better feature representation than DWT [25]. This is why WPT is used as part of our proposed MNN as laid out in Sections 3 and 4.
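To make this distinction concrete, the sketch below contrasts a DWT (which splits only the approximation branch) with a full WPT decomposition of a single speech frame using the PyWavelets library, and derives simple log-energy features from the resulting sub-bands. The wavelet family ('db4'), the decomposition depth and the energy features are illustrative assumptions, not the exact configuration used in this paper.

# Illustrative sketch (assumed settings): contrast DWT and WPT on a
# windowed speech frame and compute a log-energy feature per sub-band.
import numpy as np
import pywt

frame = np.random.randn(1024)   # stand-in for one windowed speech frame
level = 3                       # decomposition depth (assumption)

# DWT: only the approximation branch is split at each level,
# giving level + 1 sub-bands: [cA3, cD3, cD2, cD1].
dwt_coeffs = pywt.wavedec(frame, 'db4', level=level)

# WPT: both approximation and detail branches are split recursively,
# giving 2**level terminal sub-bands at the chosen depth.
wp = pywt.WaveletPacket(data=frame, wavelet='db4', maxlevel=level)
wpt_nodes = wp.get_level(level, order='freq')

def subband_log_energies(coeff_list):
    # Log-energy of each sub-band: a compact, commonly used feature.
    return [float(np.log(np.sum(c ** 2) + 1e-12)) for c in coeff_list]

dwt_features = subband_log_energies(dwt_coeffs)                   # 4 values
wpt_features = subband_log_energies([n.data for n in wpt_nodes])  # 8 values
print(len(dwt_features), len(wpt_features))

The finer, uniform tiling of the frequency axis produced by WPT is what gives it the richer feature representation referred to above.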
GMM is extensively used for classification tasks in speaker identification [1, 19]. It is a parametric learning model that assumes that the process being modelled has the characteristics of a Gaussian process whose parameters do not change over time. This assumption is valid because a signal can be assumed to be stationary over a Hamming window. The GMM tries to capture the underlying probability distribution governing the instances presented during the training phase. Given a test instance, the likelihood that the instance was generated by each specific speaker's GMM is estimated. The GMM with the maximum likelihood owns the test instance, which is then declared to belong to the respective speaker. In this paper, we employ GMM for the classification task.
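As a minimal illustration of this maximum-likelihood decision, the sketch below fits one scikit-learn GaussianMixture per enrolled speaker and assigns a test utterance to the speaker whose model yields the highest average log-likelihood over the utterance's frames. The component count, diagonal covariance and random toy features are assumptions made for the example, not the settings used in this paper.

# Minimal GMM speaker-identification sketch (assumed settings).
import numpy as np
from sklearn.mixture import GaussianMixture

def enrol_speakers(training_features, n_components=8):
    # training_features: dict speaker_id -> (n_frames, n_dims) feature matrix.
    # One GMM is fitted per speaker on that speaker's training frames.
    models = {}
    for speaker, feats in training_features.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=0)
        models[speaker] = gmm.fit(feats)
    return models

def identify(models, test_features):
    # The speaker whose GMM gives the highest average log-likelihood
    # for the test utterance's frames owns the test instance.
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random 13-dimensional frame vectors standing in for features.
rng = np.random.default_rng(0)
train = {'spk1': rng.normal(0.0, 1.0, (200, 13)),
         'spk2': rng.normal(0.5, 1.0, (200, 13))}
models = enrol_speakers(train)
print(identify(models, rng.normal(0.5, 1.0, (50, 13))))  # expected: 'spk2'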
high dimensionality or strict decision conditions [10, 27]. The combination of multiple NNs eliminates the poor performance that results from over- or under-fitting the training data with individual NNs, since each network has a different level of generalisation capability. Combining multiple NNs therefore achieves a higher identification rate, but complicates the method by increasing the training time.
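The decisions of the individually trained networks are fused through a majority-voting scheme (as stated in the abstract). The sketch below shows that fusion step in isolation; the three classifiers are abstracted as callables returning a speaker label, and the tie-breaking rule is an assumption made for brevity.

# Majority-voting fusion sketch: combine per-classifier speaker decisions.
# The classifiers stand in for the trained networks of the proposed MNN;
# any callable that maps a feature vector to a speaker label will do.
from collections import Counter

def majority_vote(classifiers, feature_vector):
    # The most frequent label wins; on a complete tie, fall back to the
    # first classifier's decision (an arbitrary, assumed tie-break).
    votes = [clf(feature_vector) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    return label if count > 1 else votes[0]

# Toy usage: two of three stand-in classifiers agree on 'spk7'.
clf_a = lambda x: 'spk7'
clf_b = lambda x: 'spk7'
clf_c = lambda x: 'spk2'
print(majority_vote([clf_a, clf_b, clf_c], feature_vector=None))  # 'spk7'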
Below we briefly describe the architectures of some of the NNs that we implemented as part of our proposed MNN.
The PNN has an input layer where the input vectors are inserted (in this case, the audio feature vectors). The network also includes one or more hidden layers with multiple neurons that are connected through weighted paths. Additionally, it includes one or more output neurons depending on the number of different classes. The PNN is a statistical classifier network that applies the maximum a posteriori hypothesis to classify a test pattern X as class Ci if the following applies

P(X|Ci)P(Ci) ≥ P(X|Cj)P(Cj)   ∀ j   (1)

Here, P(Ci) is the prior probability of speaker i that is determined from the training feature vectors. P(X|Ci) is the conditional probability that this pattern is generated from class Ci, assuming that the training data follow a probability density function (PDF). A PDF is estimated for each speaker class. As shown in Fig. 2, the third layer from left
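As a concrete reading of decision rule (1), the sketch below estimates P(X|Ci) for each class with a Gaussian (Parzen-window) kernel centred on that class's training vectors, weights it by a prior P(Ci) taken from the class proportions, and selects the class with the largest product. The spread value and the toy data are assumptions for illustration only.

# Sketch of the maximum a posteriori rule in equation (1):
# choose class i such that P(X|Ci)P(Ci) >= P(X|Cj)P(Cj) for all j.
import numpy as np

def class_likelihood(x, class_vectors, sigma=0.5):
    # Parzen-window estimate of P(X|C): an isotropic Gaussian kernel is
    # centred on every training vector of the class (sigma is the spread).
    sq_dist = np.sum((class_vectors - x) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2))))

def pnn_classify(x, training_sets):
    # training_sets: dict class_label -> (n_vectors, n_dims) array.
    # Priors P(Ci) are taken as the class proportions in the training data.
    total = sum(len(v) for v in training_sets.values())
    posteriors = {c: class_likelihood(x, v) * (len(v) / total)
                  for c, v in training_sets.items()}
    return max(posteriors, key=posteriors.get)

# Toy usage with 2-D vectors standing in for audio feature vectors.
rng = np.random.default_rng(1)
data = {'C1': rng.normal(0.0, 0.3, (40, 2)), 'C2': rng.normal(2.0, 0.3, (40, 2))}
print(pnn_classify(np.array([1.9, 2.1]), data))  # expected: 'C2'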
generated a text-dependence for the training data, whereas in our training set, the user speaks up to 1000 different sentences, yielding a text-independent data set.
The performance results are summarised in Table 3 and show that our proposed system yields the most accurate results (97.5%) for text-independent speaker identification compared with an established set of algorithms including GMM, PCA, the parallel classifier model in [10] and BPNN. Note that in [10] text-dependent classification was used.
It is noteworthy that MFCC produces a two-dimensional (2D) matrix, whereas BPNN is by nature designed to cater for 1D input. Therefore, there is a mismatch between these algorithms, and they cannot be used together without losing a large part of the information. Hence, there is no data for MFCC with parallel BPNN in Table 3. Moreover, it can be observed that MNN outperforms the other classifiers for most feature extraction schemes, as expected (in most columns of Table 3, MNN gives the highest identification rate). The only exception is MFCC; in this case, GMM gives a better result than MNN for the following reason. When a GMM is fitted to a smoothed spectrum of speech, an alternative set of features can be extracted from the signal. In addition to the standard MFCC parameterisation, complementary information is embedded in these extra features. Combining GMM means with MFCC by concatenation into a single feature vector can therefore improve identification performance. This is the reason why MFCC performs best with GMM. However, the best result in this table is achieved by MNN when used with WPT.
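A rough sketch of this idea is given below: a small GMM is fitted to the MFCC frames of an utterance, its component means are flattened, and they are concatenated with the utterance's average MFCC vector to form a single feature vector. The component count and the per-utterance averaging are illustrative choices, not the recipe used in this paper.

# Sketch of concatenating GMM means with MFCC statistics into one vector.
import numpy as np
from sklearn.mixture import GaussianMixture

def combined_feature_vector(mfcc_frames, n_components=4):
    # mfcc_frames: (n_frames, n_mfcc) matrix for a single utterance.
    # Returns [mean MFCC vector | flattened GMM component means].
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', random_state=0)
    gmm.fit(mfcc_frames)
    mfcc_mean = mfcc_frames.mean(axis=0)   # standard MFCC summary
    gmm_means = gmm.means_.ravel()         # complementary information
    return np.concatenate([mfcc_mean, gmm_means])

# Toy usage: 100 frames of 13 MFCCs -> a 13 + 4 * 13 = 65-dimensional vector.
frames = np.random.default_rng(2).normal(size=(100, 13))
print(combined_feature_vector(frames).shape)   # (65,)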
4.4 Receiver operating characteristics

We also calculated the receiver operating characteristic (ROC) curve to illustrate the performance superiority of our proposed system. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the spread. The fraction of true positives out of the total actual positives is known as the TPR, and the fraction of false positives out of the total actual negatives is called the FPR. The FPR is the complement of specificity (i.e. one minus specificity) and corresponds to the false acceptance rate (FAR). The TPR, also known as sensitivity, is the complement of the false rejection rate (FRR). The ROC curve is a graphical plot that illustrates a binary classifier's performance as the discrimination threshold is varied. TPR and FPR depend on the size of the enrolment database and on the decision threshold for the matching scores and/or the number of matched identifiers returned. Therefore, a ROC curve plots the rate of accepted impostor attempts against the corresponding rate of true positives, parametrically as a function of the decision threshold. The results can be changed by adjusting this threshold. The 30 experiments we conducted produced different combinations of FPR and TPR. These two values were plotted to generate a ROC curve for DWT, WPT, WSBC, irregular decomposition and MFCC for a comparison of accuracy with the proposed MNN, as shown in Fig. 8. The figure shows that the ROC curve for WPT lies very close to the upper-left boundary and has more area under it than DWT, WSBC, MFCC and irregular decomposition. The ROC curve for MFCC on the same data lies closest to the diagonal and shows the least effective accuracy compared with the rest of the pre-processing algorithms. A speaker identification system would be far from usable if the TPR is too low or the FPR is too high. The goal is to operate with low values of FPR and high values of TPR; therefore, the upper-left portion of this figure is the practical region of operability.
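The sketch below shows how such a curve can be traced from raw matching scores: for each candidate threshold, the TPR is the fraction of genuine-speaker scores accepted and the FPR is the fraction of impostor scores accepted. The synthetic score distributions are placeholders, not the GRID results behind Fig. 8.

# ROC sketch: sweep a decision threshold over genuine and impostor scores
# and record the resulting (FPR, TPR) pairs.
import numpy as np

def roc_points(genuine_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    points = []
    for t in thresholds:
        tpr = np.mean(genuine_scores >= t)   # accepted genuine / all genuine
        fpr = np.mean(impostor_scores >= t)  # accepted impostors / all impostors
        points.append((fpr, tpr))
    return np.array(points)

rng = np.random.default_rng(3)
genuine = rng.normal(2.0, 1.0, 500)    # higher scores for the true speaker
impostor = rng.normal(0.0, 1.0, 500)
curve = roc_points(genuine, impostor)
print(curve[:3])   # (FPR, TPR) pairs at the lowest thresholds (both near 1)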
The variations of FAR and FRR with different sets of threshold values for WPT, WSBC and DWT are shown in Fig. 9. Since WSBC and DWT are the closest competitors of WPT (as depicted in the ROC of Fig. 8), we only
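For completeness, the sketch below shows how FAR and FRR trade off against the decision threshold (the quantity plotted in Fig. 9) and where the equal error rate (EER) lies; the scores are again synthetic placeholders rather than the paper's measurements.

# FAR/FRR sketch: FAR falls and FRR rises as the threshold increases;
# the equal error rate (EER) is where the two curves intersect.
import numpy as np

def far_frr(genuine_scores, impostor_scores, thresholds):
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    frr = np.array([np.mean(genuine_scores < t) for t in thresholds])
    return far, frr

rng = np.random.default_rng(4)
genuine = rng.normal(2.0, 1.0, 500)
impostor = rng.normal(0.0, 1.0, 500)
thresholds = np.linspace(-3.0, 5.0, 200)
far, frr = far_frr(genuine, impostor, thresholds)
eer_index = int(np.argmin(np.abs(far - frr)))
print(f"EER ~ {(far[eer_index] + frr[eer_index]) / 2:.3f} "
      f"at threshold {thresholds[eer_index]:.2f}")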