
Published in IET Biometrics
Received on 16th February 2014
Revised on 10th September 2014
Accepted on 11th September 2014
doi: 10.1049/iet-bmt.2014.0011

ISSN 2047-4938

Speaker identification using multimodal neural networks and wavelet analysis

Noor Almaadeed 1,2, Amar Aggoun 3, Abbes Amira 2,4

1 Department of Computer Engineering, Brunel University, Kingston Lane, Uxbridge, Middlesex UB8 3PH, UK
2 Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar
3 Department of Computer Science and Technology, University of Bedfordshire, University Square, Luton LU1 3JU, UK
4 Department of Engineering and Computer Science, University of the West of Scotland, Paisley, UK
E-mail: [email protected]

Abstract: The rapid pace of technological progress in recent years has led to a tremendous rise in the use of biometric
authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her
voice regardless of the content. In this study, the authors designed and implemented a novel text-independent multimodal speaker
identification system based on wavelet analysis and neural networks. Wavelet analysis comprises discrete wavelet transform,
wavelet packet transform, wavelet sub-band coding and Mel-frequency cepstral coefficients (MFCCs). The learning module
comprises general regressive, probabilistic and radial basis function neural networks, forming decisions through a majority
voting scheme. The system was found to be competitive and it improved the identification rate by 15% as compared with the
classical MFCC. In addition, it reduced the identification time by 40% as compared with the back-propagation neural
network, Gaussian mixture model and principal component analysis. Performance tests conducted using the GRID database
corpora have shown that this approach has faster identification time and greater accuracy compared with traditional
approaches, and it is applicable to real-time, text-independent speaker identification systems.

1 Introduction

The task of speaker recognition can comprise speaker identification (i.e. identifying the current speaker) or speaker verification (i.e. verifying whether the speaker is who he claims to be) [1]. There are two types of speaker identification: text-dependent (the speaker is given a specific set of words to be uttered) and text-independent (the speaker is identified regardless of the words spoken) [2]. This paper proposes a novel approach towards building a text-independent speaker identification system (SIS).

A digital speech signal in its crudest form comprises frequency values sampled at consistent time intervals. It must be pre-processed to extract feature vectors that represent unique information for a particular speaker irrespective of the speech content. A learning algorithm generalises these feature vectors for various speakers during training and verifies the speaker's identity using a test signal during the test phase. In practice, no two digital signals are the same, even for the same speaker and the same set of words. The amplitude and pitch in a speaker's voice can vary from one recording session to another. Environmental noise, the recording equipment, the speed at which the speaker speaks and the speaker's various psychological and physical states increase the complexity of this task. Text-independent speaker identification allows the speaker to speak any set of words during a test. For such versatile systems, there is a need for a general feature extraction strategy to extract text-independent features from a speech signal.

The classical Mel-frequency cepstral coefficients (MFCC) method is likely the most popular feature extraction strategy used to date. This method is utilised herein for comparison with wavelet analysis. Linear predictive coding (LPC) has immensely aided text-dependent identification tasks [3]. Both MFCC and LPC use a global approach for speech analysis and are, therefore, susceptible to additive noise in the speech [4]. In this paper, we employed MFCC for comparison and relied heavily on wavelet-analysis strategies for feature extraction.

There are essentially two broad categories of methods for developing learning algorithms based on extracted speech features: generative and discriminative models. Generative methods are widely used and include stochastic models such as the hidden Markov model (HMM) [5], the Gaussian mixture model (GMM) [6] and template-based models (e.g. vector quantisation) [7]. The goal of a generative model is to symbolise the distribution space of the stored data generated from a particular class. This training process ignores competing data and considers only related data. In contrast, discriminative models shape the discriminative areas of a distribution. The primary purpose of this method is to reduce classification errors in the stored data as much as possible. Unlike generative models, data from all competing classes are also considered. Major discriminative models include polynomial classifiers [8], the support
vector machine [9], the multilayer perceptron and artificial neural network (ANN) [10] methods, such as the general regressive NN (GRNN) [11], probabilistic NN (PNN) and radial basis function NN (RBF-NN) models [12].

To date, no single biometric system has been developed that can claim to identify or verify a speaker in all varieties of environments. Accurate classification of a speaker is a challenge when intra-class differences exceed inter-class differences, which primarily arises from a text-independent approach or noisy data. In an attempt to resolve this problem, two or more biometric techniques can be combined in a single system to improve the effectiveness of identification. This information fusion can be generated at different levels for multimodal biometrics. Information fusion merges information from disparate sources with different conceptual, contextual and typographical expressions. In multimodal biometrics, this is possible at the sensor, feature, score or decision levels [13, 14]. In sensor-level fusion, the core data from multiple sensors are combined for each modality, which reduces classification error. In feature-level fusion, the speaker information received from multiple sources undergoes a feature extraction step, and this information is fused logically. In score-level fusion, a score is assigned to each individual biometric system, and these scores are used to make the final classification decision. In decision-level schemes, the final decision to accept or reject an individual is generated via a voting procedure (e.g. majority, AND, OR etc.). Many researchers, like Nefian et al. [15], tend to lean towards early fusion approaches for audio-visual speech recognition. A speaker verification system based on audio-visual hybrid fusion of a set of cross-modal features was proposed in [16]. For a personnel authentication system based on face and voice, Chetty and Wagner [17] also developed a feature-level fusion to check liveness, and presented test results on the VidTIMIT and UCBN databases.

The fusion performed in this paper involved the decision-level scheme. Different wavelet feature extraction techniques and decision-level schemes were investigated using three popular classifiers for text-independent, open-set speaker identification. The selected architectures were GRNN, PNN and RBF-NN. These NNs are fast, reliable and efficient for non-linear and complex data. Compared with back-propagation NNs (BPNN), which require a long training period, these networks are instantly trained and produce immediate results when applied to a test signal. Combining multiple ANNs enhances the generalisation capability and increases the identification rate. It also reduces the false accept rate (FAR) for a given false reject rate (FRR), and vice versa [18]. This motivated us to develop a novel identification system, namely the multimodal NN (MNN).

This paper is organised as follows. Section 2 describes wavelet feature extraction methods and the basics of NNs. In Section 3, we introduce the proposed fusion system with a detailed justification of the feature extraction methods and NNs that were chosen. Section 4 presents a comprehensive analysis of the performance and test results for this scheme, and finally Section 5 presents conclusions and recommendations for future work.

2 Overview of system components

Wavelet and wavelet packet analysis have proven to be effectual signal processing techniques for a variety of digital signal processing problems. They have also been used in many different feature extraction schemes designed for the task of speech or voice identification.

A speech signal contains a huge amount of data. For example, a 1 s speech signal consists of ∼50 000 floating-point values in a single linear vector. The performance enhancement of an SIS requires a careful selection of suitable features from the raw set available, which is usually somewhat redundant. The most relevant and significant information must be chosen from the original feature space using an appropriate feature selection scheme. The first block in the SIS is the feature extraction block. In this phase, the rough audio signal is pre-processed to extract only the distinguishing features for analysis from the entire signal. The feature extraction techniques used are discrete wavelet transform (DWT), wavelet packet transform (WPT), wavelet sub-band coding (WSBC) and MFCC. Section 2.1 presents a review of the basics of these techniques.

NNs are the most common approach to learning non-linear or complex training spaces. NNs are widely applied in numerous data analysis and speaker identification schemes, as well as classification tasks [19]. For an ANN, there is no need to predict the transfer function between the input and output ahead of time, and this is one of its greatest advantages. In Section 2.2, we provide a comprehensive description of different NNs in the context of our proposed SIS.

2.1 Wavelet analysis and feature extraction

Wavelet transforms [20-22] have been studied comprehensively in recent times and widely utilised in various areas of science and engineering. In wavelet analysis, a mother wavelet is processed through dilation and translation. Many signals of interest can, in general, be represented with wavelet decompositions. The fundamental idea behind wavelets is to analyse a given signal according to a scale [10]. The wavelet successively decomposes the given signal into a set of smaller signals at multiple levels and analyses each piece of the signal at different frequencies with different resolutions. For instance, good time resolution with poor frequency resolution at high frequencies, and high frequency resolution with poor time resolution at low frequencies, are more suitable for samples with short-duration high-frequency components and long-duration low-frequency components, respectively. The window width is altered as the transform is computed for each spectral component. Wavelets are well suited for approximating data with sharp discontinuities. An example of a signal in the wavelet domain and a short-time Fourier transform is illustrated in Fig. 1.

Fig. 1 Wavelet decomposition of signal S into detailed and approximate components

Wavelets are a class of functions used to localise a given function in both space and scaling. A family of wavelets can be constructed from a function called a mother wavelet, which is confined in a finite interval and has a zero average. A set of wavelets is formed from the mother wavelet by translating and scaling it. The wavelet trees and methodologies that have been used for speaker or speech recognition include the DWT, the WPT and the Mel-scale and sub-band coding algorithm WSBC, which were first utilised for speaker identification in [23, 24]. The advantage of WSBC is that it models the human auditory system and thus decreases the number of parameters for the entire WPT, which reduces the time required for speaker identification. Finally, the wavelet with irregular decomposition algorithm, along with other wavelet
analyses, were first tested and proposed in [23]. The primary idea was to irregularly prune the decomposition tree generated by WPT for enhanced accuracy. A feature extraction scheme derived from the wavelet eigenfunction was proposed in [25], and a text-independent SIS was proposed in [26] based on an improved wavelet transform, which relies on kernel canonical correlation analysis. WPT, which is analogous to DWT in some ways, obtains the speech signal using a recursive binary tree and performs a form of recursive decomposition. Instead of performing decomposition only on approximations, it decomposes the details as well. WPT, therefore, has a better feature representation than DWT [25]. This is why WPT is used as part of our proposed MNN, as laid out in Sections 3 and 4.
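To make the WPT front end concrete, the following minimal Python sketch decomposes one speech frame into wavelet-packet sub-bands and uses the log-energy of each sub-band as the feature vector. It relies on the PyWavelets library; the Daubechies-4 mother wavelet, the decomposition level and the log-energy features are illustrative assumptions, since the paper does not specify them.

import numpy as np
import pywt

def wpt_features(frame, wavelet='db4', level=3):
    # Full wavelet-packet decomposition: both approximations and
    # details are split at every level, giving 2**level sub-bands.
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')  # sub-bands in frequency order
    energies = np.array([np.sum(node.data ** 2) for node in nodes])
    return np.log(energies + 1e-12)            # log-energy per sub-band

frame = np.random.randn(375)        # one 15 ms frame at 25 kHz
print(wpt_features(frame).shape)    # (8,) features at level 3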
GMM is extensively used for classification tasks in speaker identification [1, 19]. It is a parametric learning model that assumes that the process being modelled has the characteristics of a Gaussian process whose parameters do not change over time. This assumption is valid because a signal can be assumed to be stationary over a Hamming window. GMM tries to capture the underlying probability distribution governing the instances presented during the training phase. Given a test instance, the GMM tries to estimate the maximum likelihood that the test instance has been generated from a specific speaker's GMM. The GMM with the maximum value of likelihood owns the test instance, which is then declared to belong to the respective speaker. In this paper, we employ GMM for the classification task.
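As a concrete illustration of this maximum-likelihood decision, the sketch below trains one GMM per speaker and assigns a test utterance to the model with the highest average log-likelihood. scikit-learn's GaussianMixture stands in for whatever GMM implementation was actually used, and the diagonal covariance and component count are assumptions (Section 4.2 later finds one or two components optimal).

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_per_speaker, n_components=2):
    # features_per_speaker: dict speaker_id -> (n_frames, n_dims) array
    models = {}
    for spk, feats in features_per_speaker.items():
        models[spk] = GaussianMixture(n_components=n_components,
                                      covariance_type='diag').fit(feats)
    return models

def identify(models, test_feats):
    # score() is the mean log-likelihood of the test frames; the
    # highest-scoring speaker model 'owns' the test instance.
    return max(models, key=lambda spk: models[spk].score(test_feats))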

2.2 Neural networks

An NN consists of multiple perceptrons combined in multiple layers, beginning with the input layer, followed by one or more hidden layers and ending at the output layer. Each perceptron has an associated weight. These weights are adjusted during training to map the training samples to the known target concepts. At the end of training, a tuned weight matrix is produced, which corresponds to a complex function that maps the input to the output.

The most common NN types include BPNN and feed-forward networks. The training input is passed through the network a number of times to adjust the weights accordingly. The iterative data training process requires multiple passes through the network for correct training. This requires a large amount of time before the network converges to a fine-tuned weight matrix. Therefore, ANNs are notorious for long training times and over- or under-fitting training data. The use of combined multiple NNs is an excellent means to apply machine learning under high dimensionality or strict decision conditions [10, 27]. The combination of multiple NNs eliminates the poor performance from over- or under-fitting the training data with individual NNs, wherein each network has a different level of generalisation capability. The combination of multiple NNs resolves the higher identification rate problem but complicates the method by increasing the training time. Below we briefly describe the architectures of some of the NNs that we implemented as part of our proposed MNN.

The PNN has an input layer where the input vectors are inserted (in this case, the audio feature vectors). The network also includes one or more hidden layers with multiple neurons that are connected through weighted paths. Additionally, it includes one or more output neurons depending on the number of different classes. PNN is a statistical classifier network that applies the maximum a posteriori hypothesis to classify a test pattern X_i as class C_i if the following applies

$$P(X_i \mid C_i)\,P(C_i) \ge P(X_i \mid C_j)\,P(C_j) \quad \forall j \qquad (1)$$

Here, P(C_i) is the prior probability of speaker i, determined from the training feature vectors. P(X_i|C_i) is the conditional probability that this pattern is generated from class C_i, assuming that the training data follow a probability density function (PDF). A PDF is estimated for each speaker class.
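A minimal sketch of this decision rule follows, with the class-conditional PDFs estimated by Parzen windows with Gaussian kernels, the classical PNN formulation; the kernel and its width sigma are assumptions here, as (1) only states the MAP rule itself.

import numpy as np

def pnn_classify(x, train_sets, priors, sigma=0.5):
    # train_sets: dict class_id -> (n_i, d) array of training vectors
    # priors: dict class_id -> P(C_i); returns argmax_i P(x|C_i) P(C_i)
    scores = {}
    for c, X in train_sets.items():
        d2 = np.sum((X - x) ** 2, axis=1)            # squared distances
        scores[c] = np.mean(np.exp(-d2 / (2 * sigma ** 2))) * priors[c]
    return max(scores, key=scores.get)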


As shown in Fig. 2, the third layer from left to right is the class layer, and the last layer is the output, which represents the winning class.

Fig. 2 Architecture of a PNN: input layer, hidden layer, class layer and output layer [12] (© Massachusetts Institute of Technology, 1989)

On the other hand, RBF networks [12] apply RBFs directly to each input value, without associating a weight line from the input to the RBF layer, as shown in Fig. 3. An RBF network consists of three layers of neurons: input, hidden and output. The hidden layer neurons represent a series of centres in the input data space. The RBF is given by

$$y_m = f_m(x) = \exp\left[-\frac{|x - c_m|^2}{2\sigma^2}\right] \qquad (2)$$

Here, |x − c_m|² is the square of the distance between the input feature vector x and the centre vector c_m of the current RBF node.

Fig. 3 Structure of an RBF NN [12] (© Massachusetts Institute of Technology, 1989)

The network output is a weighted sum over these RBF nodes and is calculated as

$$z_j = \frac{1}{M}\sum_{m=1}^{M} u_{mj}\, y_m \qquad (3)$$

These networks have many uses, such as time-series prediction, classification and system control. In the context of speaker identification, the RBF-NN utilises the projection of an eigenface space to compute the NN input features. There are two main categories of learning: supervised and unsupervised. The RBF-NN has both a supervised and an unsupervised component to its learning.
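Equations (2) and (3) translate directly into code. The sketch below evaluates the RBF hidden layer and the averaged weighted sum for one input vector; how the centres c_m and weights u_mj are trained (e.g. by clustering and least squares) is not detailed in the paper, so they are simply inputs here.

import numpy as np

def rbf_forward(x, centres, weights, sigma):
    # centres: (M, d) array of c_m; weights: (M, J) array of u_mj
    d2 = np.sum((centres - x) ** 2, axis=1)   # |x - c_m|^2 for each node
    y = np.exp(-d2 / (2 * sigma ** 2))        # eq. (2): hidden layer outputs
    return (weights.T @ y) / len(centres)     # eq. (3): outputs z_j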
Next is the GRNN, which is based on general regression. These networks, first proposed in 1991 [27], are widely used in many identification tasks. Fig. 4 shows the block diagram of the GRNN architecture. It is a one-pass learning algorithm, which can be used for estimating continuous variables such as transient content in a speech signal. The GRNN has a structure similar to those of the PNN and RBF networks but is based on general regression as proposed in [28]. In contrast to PNNs and RBF-NNs, a GRNN uses a PDF based on a normal distribution.

Fig. 4 GRNN architecture [27] (© IEEE, 1991)
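For context, the regression estimate that underlies the GRNN in Specht's formulation [27] is the kernel-weighted average of the training targets; it is quoted here as background, since the text above does not reproduce it:

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} y_i \exp\left(-\|x - x_i\|^2 / (2\sigma^2)\right)}{\sum_{i=1}^{n} \exp\left(-\|x - x_i\|^2 / (2\sigma^2)\right)}$$

Each training pair (x_i, y_i) contributes in proportion to its kernel distance from the test input, which is why the network trains in a single pass.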

3 Proposed MNN

This section presents a novel approach using multiple NNs (PNN, RBF-NN and GRNN) as classifiers in an SIS with a wavelet-based selection method. The proposed system consists of feature extraction and modelling blocks that use multiresolution analysis and decision fusion with an MNN, respectively.

Speaker identification is an expert system based on the single biometric of voice data. It first extracts audio features from the raw audio data. An audio stream consists of thousands of values in the range [−1, 1] that are sampled at a regular interval. An 8 kHz sampling rate means that 8000 such values vary each second when a speaker's audio is recorded. These raw values only tell us about the amplitude variations in the speech and do not convey any explicit information about the speaker. Since we are using text-independent speaker identification, we must extract distinguishing speech features that describe a speaker's orientation or, more specifically, the qualities of the speaker's glottal tract, which are independent of the language being used. If the same speaker speaks a different set of words next time, our system should still identify the speaker. Therefore, we must transform the raw signal into a parametric representation.

Usually, short-time spectral analysis techniques, such as LPC and MFCC, are used to transform the raw signal into a parametric representation containing the most important characteristics of the signal [29]. MFCC, originally developed for speech recognition systems, employs both logarithmically spaced filters and Mel-scale filters, which are less susceptible to noise and variations in the physical conditions of the speaker. Herein, we have used MFCC to capture the most phonetically important characteristics for speaker identification from the audio signal. We selected the widely accepted MFCC features for our research because of their demonstrated superior performance. However, the MFCC feature vector describes only the power spectral envelope of a single frame, not the information in its dynamics. To incorporate the ongoing changes over multiple frames, the first and second derivatives of the features can be computed; these are known as the delta and delta–delta coefficients, respectively [30]. These dynamic features of cepstral coefficients are often employed to improve speech recognition performance [31, 32]. As a pre-processing step on the audio signal, we perform pre-emphasis to compensate for the high-frequency falloff, and then perform short-term analysis using windowing.
In the proposed MNN, we use multiple ANNs for classification with wavelet-based feature extraction methods, namely DWT, WPT, WSBC and irregular decomposition. The prominent features extracted through these methods are fed into a learning model wherein the target concept is modelled and mapped to the training samples for classification. This system employs the following three classifier architectures in parallel: GRNN, PNN and RBF-NN. The architectures of these NNs have been described in Section 2.2. In this section, we present the major highlights of the text-independent SIS based on bootstrap aggregating these equally robust but fast learners, which were chosen for the reasons stated below.

BPNNs, RBF-NNs, GRNNs and PNNs can be easily differentiated from each other on the basis of structure, training strategy, sample requirements, training time, accuracy and suitability for various types of data. PNNs, RBF-NNs and GRNNs require just a fraction of the samples, as well as much shorter training times, compared with BPNN. These NNs are more adaptive in converging quickly to a decision surface, as more neurons can be added at runtime to aid the results, in contrast to BPNNs, which have a fixed number of neurons in the hidden layers. PNNs, RBF-NNs and GRNNs are also more suitable for low-dimensional data such as the different wavelet-analysis methods yield through DWT, WPT, WSBC or irregular decomposition. Therefore, PNNs, RBF-NNs and GRNNs are the best candidates for bagging, as they are simultaneously strong and fast learners. Furthermore, the work in [23] reported that the back-propagation algorithm over-fits training data and has a higher error rate than RBF-NN. These reasons were the primary motivations for developing a scheme that resolves the under- and over-fitting problems and minimises the training time.

The proposed system architecture for speaker identification is illustrated in Fig. 5. A system with text-independent speaker identification methods was constructed using an MNN with majority vote, including the GRNN, PNN and RBF-NN models. The voting is conducted as follows

$$\text{VoteCount}(X_i \mid C_i) = \text{GRNN\_Output}(X_i \mid C_i) + \text{PNN\_Output}(X_i \mid C_i) + \text{RBFNN\_Output}(X_i \mid C_i) \qquad (4)$$

Each test sample is passed through each of the three NNs. If any two of the networks classify the given test sample as belonging to the same speaker from the training data, then the test sample is declared to belong to that speaker. However, where each network classifies the given test sample as a different class, the sample is considered 'not identified'.
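The two-of-three rule of (4) is small enough to state as code; each classifier is assumed to return a speaker label for the test sample.

from collections import Counter

def majority_vote(grnn_label, pnn_label, rbf_label):
    votes = Counter([grnn_label, pnn_label, rbf_label])
    label, count = votes.most_common(1)[0]
    # Unanimous or 2-of-3 agreement identifies the speaker; three
    # different labels mean open-set rejection.
    return label if count >= 2 else 'not identified'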
During the training phase, feature vectors extracted from the training data are fed into each of the networks in parallel. These networks require only one pass through the data, in contrast to the multiple epochs/iterations used in BPNN. The size (i.e. number of neurons) of the input layer is equal to the number of MFCC features. Each neuron takes in streams of data as inputs that arise from the consecutive frames. Some advanced NNs have the size of the input layer enlarged to two or three adjacent frames [33] in order to obtain a better context dependency for the acoustic feature vectors. The number of input neurons can also be chosen by multiplying the cepstral order by the total frame number [34], leading to an extremely large input layer. However, in both the above cases, the computational times are affected because of the increased number of hidden layers and states. If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long [33] (in addition, the network may over-fit the data and start modelling random noise).

For testing, the extracted feature vectors from the test signal are fed to all the ANNs in parallel, and three classification outputs are calculated, corresponding to the three classifiers used here. The majority voting scheme is employed for the classification results of the ANNs. The class that obtains two out of three votes is taken to be the final classification result. During the test phase, the procedure used was almost the same as the one in the training phase. The test speaker's file is pre-processed to extract wavelet features. These features are classified individually by the trained PNN, GRNN and RBF-NN. The majority voting scheme ensures equal weight, and the final classification is made on these premises.

Fig. 5 Proposed system for speaker identification

In contrast to the BPNN and feed-forward networks, none of these networks requires iterative training, which takes a
considerable amount of time. Additionally, each of these networks focuses on a different probing level to fit the same training data. One of them fits the training data completely (over-fitting), the second learns the training data with an error margin (under-fitting) and the third lies between the previous two, which helps to increase the ability of the overall system to generalise to both known and unknown signal instances. Moreover, the combination of these networks with a majority voting scheme helps to overcome the under- and over-fitting problems. This approach improves the classification accuracy of the overall system. Only the fusion of the PNN, RBF-NN and GRNN networks in the voting scheme is capable of reducing the training time and obtaining a higher accuracy than those of the BPNN and feed-forward networks; moreover, such a method is still faster than methods that use BPNN and feed-forward networks.

4 Results and analysis

In this section, we describe the outcomes of the comprehensive testing performed on the GRID corpus [35]. Below we first describe the experimental procedure in Section 4.1, and then present the accuracy and computational effectiveness of the proposed MNN in Sections 4.2-4.5. We provide a comprehensive analysis and comparisons with some other existing systems in Section 4.6.

4.1 Evaluation methods

The identification experiment was performed using the GRID speech corpus [35]. GRID is a multi-speaker audio-visual sentence database that supports joint computational-behavioural studies in speech perception. GRID consists of high-quality audio and video recordings of 1000 sentences spoken by 18 male and 16 female speakers. It uses a fixed and simple grammatical structure, <command:4> <colour:4> <preposition:4> <letter:25> <number:10> <adverb:4>, where the numbers in brackets indicate the number of choices at each point, for example, 'bin blue at A1 again' or 'place green by D2 now'. Speakers produced such sentences at a normal speaking rate and were asked to complete each sentence in 3 s. The reason we chose GRID for our studies is that these sentences control for differences in speaking style and syntax, and the existence of many keyword repetitions allows for cross-condition comparisons of acoustic properties. Different gross phonetic classes (nasal, vowel, fricative, plosive and liquid) were used as the initial or final sounds of filler words in each position [36], thereby allowing a wide range of phonetic features to be captured.

The 10-fold cross-validation experiments were used to test all 34 speakers in the GRID database using different values of the spread. The spread denotes how closely the NN should fit the training data. The default value range for the spread is between 0 and 1, with 1 being the most generalised fit to the training data, with relatively lower accuracy. A spread of 0 is a completely close fit to the training data and produces maximum accuracy. We can say that 1 under-fits the training data, whereas 0 over-fits it. The spread is also known as the radius of a neuron: with a larger spread, neurons at a distance from a point have a greater influence. There is a trade-off in choosing different values of spread between 0 and 1. This variable was chosen as the base variable, and 30 different values were assigned to it. This resulted in 30 different experiments on the same data from the 34 speakers in GRID. Since all the utterances recorded in the GRID corpus have the same length and sampling rate, they transform to the same number of frames, the same MFCC output vector length and the same number of neurons in the input layer of the subsequent NN. The averaged identification results are presented in the subsequent sections. The test results presented in this section were collected on a computer with a 2.8 GHz Intel Core 2 Duo processor and 4 GB of memory.

4.2 Audio and feature extraction

A speech signal contains a massive amount of data. For example, a 1 s speech signal consists of ∼25 000-50 000 floating-point values in a single linear vector. GRID database files have a fixed sampling rate of 25 kHz. An audio signal is usually segmented into frames of 10-30 ms with some overlap [37]. Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame. Overlapping windows allow analysis centred at a frame point. In our case, the audio signal is divided into 15 ms frames using Hamming windows with a 10 ms overlap to smooth out the frequencies at the edges of each frame or window. An audio signal is constantly changing, but we assume that on short time scales it does not change much. If the frame is too short, we do not have enough samples to obtain a reliable spectral estimate; if it is too long, the signal changes too much throughout the frame. A 15 ms window at 25 kHz (for the GRID database) corresponds to 375 samples, which is enough to obtain a reliable spectral shape. Although some researchers tend to choose a larger frame size, several others [38-40] have found 15 ms frame sizes more useful than longer ones, depending on the database and methodology applied. We employed different frame sizes in our studies, and 15 ms turned out to be the best choice.
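These framing parameters translate into the short sketch below: pre-emphasis followed by 15 ms Hamming-windowed frames with a 10 ms overlap at the 25 kHz GRID rate. The pre-emphasis coefficient of 0.97 is a common default and an assumption here; the paper fixes only the window length, overlap and sampling rate.

import numpy as np

FS = 25000                          # GRID sampling rate, Hz
FRAME = int(0.015 * FS)             # 15 ms window -> 375 samples
STEP = FRAME - int(0.010 * FS)      # 10 ms overlap -> 125-sample hop

def preemphasise(signal, alpha=0.97):
    # First-order high-pass to compensate for the high-frequency falloff
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal):
    n = 1 + (len(signal) - FRAME) // STEP
    idx = np.arange(FRAME)[None, :] + STEP * np.arange(n)[:, None]
    return signal[idx] * np.hamming(FRAME)  # taper the frame edges

frames = frame_signal(preemphasise(np.random.randn(FS)))  # 1 s of audio
print(frames.shape)                 # (198, 375): 198 frames of 375 samples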
The resultant frames are further processed using logarithmically spaced filters and Mel-scale filters, Fourier transforms and cosine transforms to produce an MFCC vector for each frame. The number of filters in the Mel-scale filterbank is adjusted to control the number of MFCC features. The delta and delta–delta features are computed using linear regression formulas. These additional features have the capability of performing better than an MFCC-only implementation, but usually incur an enormous numerical burden. Later in this section, we perform some tests to find the appropriate balance of these features.
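The linear-regression formula for the delta coefficients just mentioned can be sketched as follows; applying the same function twice gives the delta–delta features. The half-window N = 2 is a conventional default, not a value taken from the paper.

import numpy as np

def deltas(feats, N=2):
    # feats: (n_frames, n_coeffs) MFCC matrix; returns
    # d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats, dtype=float)
    for t in range(feats.shape[0]):
        acc = sum(n * (padded[t + N + n] - padded[t + N - n])
                  for n in range(1, N + 1))
        out[t] = acc / denom
    return out

mfcc = np.random.randn(198, 20)     # e.g. 198 frames of 20 MFCCs
delta = deltas(mfcc)                # first derivative
delta_delta = deltas(delta)         # second derivative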
The feature extraction block of this system consists of the following algorithms: DWT, WPT, WSBC and irregular decomposition. All feature vectors are linear vectors of
length ≤64, as summarised in Table 1. Within the scope of this research, experimentation and testing were performed with all of these approaches, selecting one at a time. In this phase, the rough audio signal is pre-processed to extract only the distinguishing features from the entire signal for analysis. Only one of the above strategies is used at a time, and the programme allows the user to select the feature extraction strategy during the training and testing phases. Each feature yields a different set of parameters and unique training data on which the programme trains itself. The testing phase includes the feature extraction strategy to generate consistent results. Experimental results show that WPT generates the most accurate results, as described in Section 4.3.

Table 1 Summary of feature extraction vectors for wavelet analysis

Input                          Feature extraction scheme      Output vector length
1 s long audio signal          discrete wavelet transform     8
(GRID) recorded at 44.1 kHz    wavelet packet transform       64
                               WPT in Mel-scale (WSBC)        6
                               irregular decomposition        57
                               MFCC                           20 × 450

One of our goals was to establish the optimum number of MFCC features for our case. Fig. 6 summarises the results obtained with the various numbers of MFCC features used. These results were collected for 20 files per speaker using 10-fold cross-validation. This experiment indicates that varying the number of MFCC features gradually from 10 to 20 has a notable impact on the overall accuracy of the system, but the gain starts decaying beyond 20 features. However, the number of MFCC features is directly proportional to computation time. Therefore, 20 MFCC features were found to be the best trade-off between computation time and overall accuracy.

Fig. 6 Accuracy (%) of speaker identification using MFCC features and GMM

We have also incorporated the delta and delta–delta MFCC features into our framework to check whether they offer any additional performance gain. As stated before, adding 20 features each from the delta and delta–delta MFCCs results in a 60-element feature vector, thereby tremendously affecting the computation time. In Table 2, we provide a performance summary on changing the numbers of MFCC, delta MFCC and delta–delta MFCC features, where the total number of features is 20 or close to it. It can be seen that the all-MFCC case performs better than any of the other cases. If we use all three types of features in equal amounts, we need a total of 48 elements to get close to the 20-MFCC case. These results imply that the time-derivative features are not good substitutes for having more MFCC features, especially when the computational burden has to be accounted for. The classification accuracy presented in [41] also supports a similar observation, where different feature sets perform unevenly and the differential features do not perform very well. Therefore, we chose to use 20 MFCC features and none of the delta or delta–delta features in the remainder of our experiments.

Table 2 Comparison of MFCC, delta MFCC and delta–delta MFCC

Number of features from                          Total number     Performance
MFCC    Delta MFCC    Delta–delta MFCC           of features      accuracy, %
20      0             0                          20               99.3
0       20            0                          20               52.5
0       0             20                         20               36.9
10      10            0                          20               91.1
10      0             10                         20               86.4
0       10            10                         20               77.5
6       6             6                          18               66.6
7       7             7                          21               72.3
8       8             8                          24               76.2
16      16            16                         48               98.9

We also experimented to improve the accuracy of the model with respect to the number of Gaussian mixtures allowed per GMM. The objective is to choose the best mixture components to achieve high discrimination accuracy. Theoretically, too few mixture components can produce a GMM which does not accurately model the distinguishing characteristics of a speech distribution. However, too many components can reduce performance when there is a large number of model parameters relative to the available training data, and can also result in excessive computational complexity [42].

In the tests performed in [43], the results show that, as the number of Gaussians in a GMM increases from 2 to 32, the average speech entropy in each Gaussian decreases while the average speaker entropy remains nearly constant. We varied the order of the mixture gradually from 1 to 16 to find the most appropriate value. Fig. 7 summarises the results. It can be seen that increasing the order does not necessarily increase the system accuracy; as a matter of fact, there are some uneven fluctuations at certain values. The large orders caused very high computational expense, but did not seem to yield good performance for the small amount of available training data. The mixture component selection is limited by the amount of training data; model order selection becomes more important with smaller amounts of training data. On further investigation, we found that many of the mixtures reduced to single points, as they did not have enough values to carry on further computation. However, the above experiment shows that 1 and 2 Gaussian mixtures provide the optimum accuracy for voice.

4.3 Identification accuracy

The same testing criteria were applied to GMM, BPNN and principal component analysis (PCA) for comparison with the proposed MNN. The wavelet packet analysis (8-level) generated a 97.5% identification rate, which is a large improvement compared with MFCC (with 20 feature vectors), which had a 77.5% identification rate for these experiments. The findings in [23] suggest that irregular decomposition generates better results than the DWT, WPT and WPT-in-Mel-scale algorithms. In contrast, we found that both WPT in Mel-scale and irregular decomposition were less accurate than WPT when tested on the GRID speech database. This difference arises because [23] used a limited set of five sentences for each speaker, which
generated a text-dependence for the training data, whereas in our training set the user speaks up to 1000 different sentences, yielding a text-independent data set.

Fig. 7 Effect of the Gaussian mixture order on (a) accuracy and (b) model training and average identification times

The performance results are summarised in Table 3 and show that our proposed system yields the most accurate results (97.5%) for text-independent speaker identification compared with an established set of algorithms including GMM, PCA, the parallel classifier model in [10] and BPNN. Note that in [10] text-dependent classification was used.

Table 3 Accuracy rate of the MNN compared with other algorithms

                      DWT, %    WPT, %    MFCC, %    WSBC, %    Irregular decomposition, %
GMM                   35.80     38.60     83.30      36.50      33.26
MNN                   84.70     97.50     77.50      94.40      80.80
BPNN                  40.38     41.47     21.20      34.48      32.07
PCA                   —         —         82.90      —          —
parallel BPNN [10]    61.25     65.43     —          58.67      56.85

It is noteworthy that MFCC produces a two-dimensional (2D) matrix, whereas BPNN is by nature designed to cater for 1D input. Therefore, there is a mismatch between these algorithms, and they cannot be used together without losing a large part of the information. Hence, there is no entry for MFCC with parallel BPNN in Table 3. Moreover, it can be observed that MNN outperforms the other classifiers for most feature extraction schemes, as expected (in most columns of Table 3, MNN results in the highest identification rate). The only exception is MFCC; in this case, GMM gives a better result than MNN, for the following reason. When a GMM is fitted to a smoothed spectrum of speech, an alternative set of features can be extracted from the signal. In addition to the standard MFCC parameterisation, complementary information is embedded in these extra features. Combining GMM means with MFCC by concatenation into a single feature vector can therefore improve identification performance. This is the reason why MFCC performs best with GMM. However, the best result in this table is achieved using MNN with WPT.

4.4 Receiver operating characteristics

We also calculated the receiver operating characteristic (ROC) curve, illustrating the performance superiority of our proposed system. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the spread. The fraction of true positives out of the total actual positives is known as the TPR, and the fraction of false positives out of the total actual negatives is called the FPR. The FPR is the same as the complement of specificity (i.e. one minus specificity), also known as the FAR. The TPR is also known as sensitivity, and is the complement of the FRR. The ROC curve is a graphical plot that can illustrate a binary classifier's performance as the discrimination threshold is varied. TPR and FPR depend on the size of the enrolment database and the decision threshold for the matching scores and/or the number of matched identifiers returned. Therefore, a ROC curve plots the rate of accepted impostor attempts against the corresponding rate of true positives, parametrically as a function of the decision threshold. The results can be changed by adjusting this threshold. The 30 experiments we conducted produced different combinations of the FPR and TPR. These two values were plotted to generate a ROC curve for DWT, WPT, WSBC, irregular decomposition and MFCC for a comparison of accuracy with the proposed MNN, as shown in Fig. 8. This figure shows that the ROC curve for WPT lies very close to the upper left boundary and has more area under it compared with DWT, WSBC, MFCC and irregular decomposition. The ROC curve for MFCC with the same data lies closest to the diagonal and shows the least effective accuracy compared with the rest of the pre-processing algorithms. A speech identification system would be far from usable if the TPR is too low or the FPR is too high. The goal is to operate with low values of FPR and high values of TPR; therefore, the upper-left portion of this figure is the practical region of operability.

The variations of FAR and FRR with different sets of threshold values for WPT, WSBC and DWT are shown in Fig. 9. Since WSBC and DWT are the closest competitors of WPT (as depicted in the ROC of Fig. 8), we only
compared their performances to WPT and omitted the other algorithms for legibility. For each of the algorithms, we plotted the FAR-FRR pairs for different thresholds, and placed the equal error rate (EER) point (where FAR equals FRR) in the middle. Note that the threshold values are not of interest here, and they are different for different algorithms; we normalised them to align their EER values. Fig. 9 shows that WPT can achieve a much lower EER (about 5%) than the other two schemes. If we inspect this figure along any vertical line, we can see that the FAR and/or FRR for WPT are less than those of the other schemes. Both Figs. 8 and 9 suggest that WPT combined with the proposed fusion system of MNN outperforms DWT, WSBC, MFCC and irregular decomposition.

Fig. 8 ROC curves for various wavelet-analysis algorithms tested using the proposed MNN

Fig. 9 FARs and FRRs of the proposed MNN compared with different schemes
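Given per-trial match scores and ground-truth labels, the ROC points and the EER crossing can be computed as in this sketch; scikit-learn's roc_curve stands in for whatever tooling produced Figs. 8 and 9.

import numpy as np
from sklearn.metrics import roc_curve

def roc_and_eer(y_true, scores):
    # y_true: 1 for genuine trials, 0 for impostor trials
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    frr = 1.0 - tpr                   # FRR is the complement of TPR
    i = np.argmin(np.abs(fpr - frr))  # threshold where FAR ~ FRR
    eer = (fpr[i] + frr[i]) / 2.0     # equal error rate estimate
    return fpr, tpr, eer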
comprehensively validate that the proposed system is more
sophisticated because it outperforms other systems both in
4.5 Operational speed accuracy and performance times. Below we analyse the
reasons and trends of our MNN and then compare them
The proposed system has 2-fold advantages in terms of with some state-of-the-art techniques.
accuracy and speed. PCA, because of its dual nature (a In the research results reported in [1], it was suggested that
classifier and a dimensionality reduction algorithm), is irregular decomposition yields better results than DWT, WPT
compatible with MFCC as the feature extraction strategy. and WPT in Mel-scale algorithms. On the contrary, we found
Table 4 shows the training time and identification time of that both WPT in Mel-scale and irregular decomposition were
the state-of-the-art algorithms in speaker identification less accurate than WPT. Pawar et al. [1] used a limited,
text-dependent set of five sentences for each speaker,
whereas our training set is truly text-independent with the
user speaking up to 1000 different sentences. One of the
main reasons why our system works more efficiently is
because of the application of the majority voting scheme
during the parallel combination of three classifiers, which
are all fast and robust. These classifiers may suffer from
inadequacies such as under- or over-fitting problems when
used alone, but mitigate each other’s shortcomings when
combined in the proposed manner. In summary, the
proposed system owes its performance improvement to:
(a) bootstrap aggregating of multiple classifiers for a better

Table 5 Average training and identification times for MNN


with different input layer sizes

Size of the input layer of NN 20 40 60

average training time, s 0.8 3.5 8.1


Fig. 9 FARs and FRRs of the proposed MNN compared with average identification time, s 0.05 0.32 0.85
different schemes

26 IET Biom., 2015, Vol. 4, Iss. 1, pp. 18–28


This is an open access article published by the IET under the Creative Commons Attribution doi: 10.1049/iet-bmt.2014.0011
License (http://creativecommons.org/licenses/by/3.0/)
www.ietdl.org
hypothesis in the decision space; (b) careful selection of multiple combined ANN instances of the same class that complement each other by tackling the under- and over-fitting problems; and (c) selection of the most suitable feature extraction strategy (i.e. WPT). Instead of using other recently popular methods such as BPNN, we explored the more adaptive, instantly trained class of NNs, which substantially improved the classification accuracy and reduced the identification time.

The recent development in human identification in other areas of the world has been inspiring and competitive. For any novel method to succeed, a comparative analysis of performance against the state-of-the-art methods is deemed necessary. Table 6 shows the reported performance of some of the existing methods alongside the performance accuracy of our system. It shows that our system outperforms several other published systems that achieve some of the best identification rates available in the literature. In the first four rows of Table 6, we provide comparisons with systems that used the same corpus as ours (GRID). For the purpose of comparison across different databases (e.g. CSLU and TIMIT), in the last three rows of the table, we also show results based on some other widely used databases. A short description of these systems follows.

Table 6 Performance comparison with state-of-the-art speaker identification approaches

References                Algorithm, database                                                      Performance accuracy, %
[44]                      32-mixture GMM, UBM, MFCC, spectro-temporal modulation, GRID corpus      91.7
[45]                      exemplar-based sparse representation, sparse discriminant analysis,      95.5
                          dot-scoring, GRID
[46]                      GMM-UBM (mixed-UBM and multi-conditioned GMMs), GRID corpus              85
[47]                      GMM speech prior, single mixture HMM, speech segregation, GRID corpus    96.3
[48]                      GMM-UBM (independent of pre-processing algorithm), TIMIT                 96.8
[49]                      LPC, K-means, TI digits_1, TI digits_2 and TIMIT databases               96.37
[50]                      MFCC, parametric neural network, CSLU speaker recognition corpora        90.6
proposed multimodal NN    wavelets, MNN, GRID corpus                                               97.5

In [44], an algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations. An exemplar-based representation and sparse discrimination approach was proposed in [45] that outperformed the baseline GMM-universal background model (UBM) and HMM-based systems by a large margin. The GMM-UBM system in [46] showed an average 85% identification accuracy on the GRID corpus when a mixed UBM and multi-conditioned GMMs were utilised. The work in [47] presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of signal-to-noise ratios by treating segregation and recognition as coupled problems. The system in [48] is mainly based on verification using the likelihood ratio test. The likelihood functions used some effective GMMs that are relatively simple and easy to implement. For speaker representation, it employed the UBM, from which speaker models were derived using Bayesian adaptation. The verification performance was further enhanced using score normalisation, and was successfully tested in several NIST speaker recognition evaluations. In [49], robust perceptual features and an iterative clustering approach are proposed for isolated-digit and continuous speech recognition and speaker identification, with evaluation performed on clean test speech. A new SIS based on a modified NN was proposed in [50], namely the multiple parametric self-organising map (M-PSOM). It attempted to reduce the acceptance of impostors while maintaining a high accuracy for identification. Most prior systems rely on a single NN for an entire SIS, but the M-PSOM utilises parametric NNs for the individual speakers to record and depict their distinctive acoustic signatures. That paper demonstrated that the method outperforms many other competitive methods such as wavelets, GMM, HMM and vector quantisation. Our proposed approach outperforms all these published systems in terms of accuracy, and therefore proves itself as one of the best candidates for speaker identification.

5 Conclusions

Conventional approaches to speaker identification, with slow identification and poor accuracy, are inadequate in a real-world setting. We have been motivated by these shortcomings to conceive and implement a novel approach in this paper: one that combines multiple NNs with wavelet analysis to construct a method that outperforms classical GMM, BPNN and PCA in both identification time and accuracy. Through comprehensive testing using the GRID database, the system described herein is 97.5% accurate with a 50 ms identification time when WPT is the feature extraction method. Our real-time approach is directly applicable to industrial devices for security and authentication, and it lays the foundation for further research in speaker identification for real-time systems. In the future, to further develop the approach described herein, we will combine real-time facial recognition with speaker identification to generate a more robust system that is applicable in industry. Moreover, we will combine audio and visual features at the feature level with MNNs to further improve the accuracy.

6 References

1 Pawar, R.V., Kajave, P.P., Mali, S.N.: 'Speaker identification using neural networks'. Proc. World Academy of Science, Engineering and Technology, 2005, no. 7, pp. 429-433
2 Rabiner, L., Juang, B.H.: 'Fundamentals of speech recognition' (Prentice-Hall, 1993)
3 Kinsner, W., Peters, D.: 'A speech recognition system using linear predictive coding and dynamic time warping'. Proc. Annual Int. Conf. IEEE Engineering in Medicine & Biology Society, New Orleans, LA, 4-7 November 2006, no. 3, pp. 1070-1071
4 Benesty, J., Sondhi, M., Huang, Y.: 'Springer handbook of speech processing' (Springer, 2007)
5 Abdalla, M.I., Ali, H.S.: 'Wavelet-based Mel-frequency cepstral coefficients for speaker identification using hidden Markov models', J. Telecommun., 2010, 1, (2), pp. 16-21
6 Suvarna Kumar, G., Prasad Raju, K.A., Rao, M., et al.: 'Speaker recognition using GMM', Int. J. Eng. Sci. Technol., 2010, 2, (6), pp. 2428-2436
7 Kekre, H.B., Kulkarni, V.: 'Speaker identification by using vector quantization', Int. J. Eng. Sci. Technol., 2010, 2, (5), pp. 1325-1331
8 Campbell, W.M., Assaleh, K.T., Broun, C.C.: 'Speaker recognition with polynomial classifiers', IEEE Trans. Speech Audio Process., 2002, 10, (4), pp. 205-212
9 Wang, J.C., Yang, C.H., Wang, J.F., Lee, H.P.: 'Robust speaker identification and verification', IEEE Comput. Intell. Mag., 2007, 2, (2), pp. 52-59
10 Shukla, A., Tiwari, R., Hemant Kumar, M., Kala, R.: 'Speaker identification using wavelet analysis and modular neural networks', J. Acoust. Soc. India (JASI), 2009, 36, (1), pp. 14-19
11 Revada, L.K.V., Rambatla, V.K., Ande, K.V.N.: 'A novel approach to speech recognition by using generalised regression neural networks', IJCSI Int. J. Comput. Sci. Issues, 2011, 1, pp. 483-489
12 Moody, J., Darken, C.J.: 'Fast learning in networks of locally-tuned processing units', Neural Comput., 1989, 1, (2), pp. 281-294
13 Hall, D.L., Llinas, J.: 'Handbook of multi-sensor data fusion' (CRC Press, UK, 2011)
14 Ross, A., Jain, A.: 'Information fusion in biometrics', Pattern Recognit. Lett., 2003, 24, (3), pp. 2115-2125
15 Nefian, A., Liang, L., Pi, X., Liu, X., Murphy, K.: 'Dynamic Bayesian networks for audio-visual speech recognition', EURASIP J. Adv. Signal Process., 2002, 11, pp. 1274-1288
16 Chetty, G., Wagner, M.: 'Audio visual speaker verification based on hybrid fusion of cross modal features', in 'Pattern Recognition and Machine Intelligence' (Springer, Berlin, 2007)
17 Chetty, G., Wagner, M.: 'Investigating feature-level fusion for checking liveness in face-voice authentication'. Int. Symp. on Signal Processing and its Applications, 2005, vol. 1
18 Arora, S., Bhattacharjee, D., Nasipuri, M., Malik, L., Kundu, M., Basu, D.K.: 'Performance comparison of SVM and ANN for handwritten Devnagari character recognition', IJCSI Int. J. Comput. Sci., 2010, 7, (3), pp. 1-10
19 Xiang, B., Berger, T.: 'Efficient text-independent speaker verification with structural Gaussian mixture models and neural network', IEEE Trans. Speech Audio Process., 2003, 11, (5), pp. 447-456
20 Mallat, S.: 'A wavelet tour of signal processing' (Elsevier, UK, 1999)
21 Lung, S., Chen, C.: 'Further reduced form of Karhunen-Loeve transform for text independent speaker recognition', Electron. Lett., 1998, 34, (14), pp. 1380-1382
22 Vetterli, M., Kovacevic, J.: 'Wavelets and subband coding' (Prentice-Hall, New Jersey, 1995)
23 Wu, J.D., Lin, B.F.: 'Speaker identification using discrete wavelet packet transform technique with irregular decomposition', Expert Syst. Appl., 2009, 36, (2), pp. 3136-3143
24 Deshpande, M.S., Holambe, R.S.: 'Speaker identification using admissible wavelet packet based decomposition', Int. J. Inf. Commun. Eng., 2011, 6, (1), pp. 20-23
25 Lung, Y.: 'Feature extracted from wavelet eigenfunction estimation for text-independent speaker recognition', Pattern Recognit., 2004, 37, pp. 1543-1544
26 Lung, Y.: 'Improved wavelet feature extraction using kernel analysis for text-independent speaker recognition', Digit. Signal Process., 2010, 20, (5), pp. 1400-1407
27 Specht, D.F.: 'A general regression neural network', IEEE Trans. Neural Netw., 1991, 2, (6), pp. 568-576
28 Amrouche, A., Rouvaen, J.: 'Efficient system for speech recognition using general regression neural network', Int. J. Intell. Technol., 2006, 1, (2), pp. 183-189
29 Lu, W., Sun, W., Lu, H.: 'Robust watermarking based on DWT and non-negative matrix factorization', Comput. Electr. Eng., 2009, 35, (1), pp. 183-188
30 Ye, J.: 'Speech recognition using time domain features from phase space reconstructions'. PhD thesis, Marquette University, Milwaukee, Wisconsin, 2004
31 Furui, S.: 'Speaker-independent isolated word recognition using dynamic features of speech spectrum', IEEE Trans. ASSP, 1986, 34, (1), pp. 52-59
32 Wilpon, J.G., Lee, C.H., Rabiner, L.R.: 'Improvements in connected digit recognition using higher order spectral and energy features'. Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Toronto, Canada, 1991
33 Rottland, J., Neukirchen, C., Willett, D., Rigoll, G.: 'Large vocabulary speech recognition with context dependent MMI-connectionist/HMM systems using the WSJ database'. EUROSPEECH, 1997
34 Hamzah, R., Jamil, N., Seman, N.: 'Filled pause classification using energy-boosted Mel-frequency cepstrum coefficients'. Proc. Int. Conf. on Robotic, Vision, Signal Processing & Power Applications, 2014, pp. 311-319
35 'The GRID audio corpus for speech recognition'. Available at http://www.dcs.shef.ac.uk/spandh/gridcorpus
36 Cooke, M., Barker, J., Cunningham, S., Shao, X.: 'An audio-visual corpus for speech perception and automatic speech recognition', J. Acoust. Soc. Am., 2006, 120, (5), pp. 2421-2424
37 Holmes, W.: 'Speech synthesis and recognition' (CRC Press, UK, 2001)
38 Gelbart, D.: 'Ensemble feature selection for multi-stream automatic speech recognition'. Technical Report No. UCB/EECS-2008-160, University of California at Berkeley, December 2008
39 Mirhassani, S.M., Ting, H.N.: 'Fuzzy-based discriminative feature representation for children's speech recognition', Digit. Signal Process., 2014, 31, pp. 102-114
40 Morris, A., Bloothooft, G., Barry, W., Andreeva, B., Koreman, J.C.: 'Human and machine identification of consonantal place of articulation from vocalic transition segments'. EUROSPEECH, 1997
41 Li, D., Sethi, I., Dimitrova, N., McGee, T.: 'Classification of general audio data for content-based retrieval', Pattern Recognit. Lett., 2001, 22, (5), pp. 533-544
42 Morris, A., Wu, D., Koreman, J.: 'GMM based clustering and speaker separability in the TIMIT speech database', IEICE Trans. Fundam. Syst., 2005, 85, pp. 1-8
43 Reynolds, D.: 'Robust text-independent speaker identification using Gaussian mixture speaker models', IEEE Trans. Speech Audio Process., 1995, 3, (1), pp. 72-83
44 Chi, T.S., Lin, T.H., Hsu, C.C.: 'Spectro-temporal modulation energy based mask for robust speaker identification', J. Acoust. Soc. Am., 2012, 131, (5), pp. 368-374
45 Gemmeke, J., Virtanen, T., Hurmalainen, A.: 'Exemplar-based sparse representations for noise robust automatic speech recognition', IEEE Trans. Audio Speech Lang. Process., 2011, 19, (7), pp. 2067-2080
46 Saeidi, R., Mowlaee, P., Kinnunen, T., Tan, Z., Christensen, M., Jensen, H., Franti, P.: 'Signal-to-signal ratio independent speaker identification for co-channel speech signals'. Proc. IEEE Int. Conf. Pattern Recognition, 2010, pp. 4545-4548
47 Barker, J., Ma, N., Coy, A., Cooke, M.: 'Speech fragment decoding techniques for simultaneous speaker identification and speech recognition', Comput. Speech Lang., 2010, 24, (1), pp. 94-111
48 Reynolds, D., Quatieri, T., Dunn, R.: 'Speaker verification using adapted Gaussian mixture models', Digit. Signal Process., 2000, 10, (3), pp. 19-41
49 Revathi, A., Ganapathy, R., Venkataramani, Y.: 'Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach', Int. J. Comput. Sci. Inf. Technol., 2009, 1, (2), pp. 30-42
50 Gomez, P.: 'A text independent speaker recognition system using a novel parametric neural network', Int. J. Signal Process., Image Process. Pattern Recognit., 2011, 1, pp. 1-16
