Speaker Recognition Using Pulse Coupled Neural Networks
I. INTRODUCTION

Speaker identification is a particular category of the more general task of speaker recognition; it aims to reveal the identity of a speaker by comparing the incoming utterance with a number of previously stored patterns. The identification is performed using features extracted from each utterance, which are afterwards used as inputs to a classifier. Artificial neural networks (ANN) are widely used as pattern classifiers [1], and many researchers have applied ANN to speaker recognition tasks [2]. The current wave of interest in Pulse Coupled Neural Networks (PCNN) comes from biology studies [5], where a number of researchers are working to develop models that mimic natural neurons [7]. Most works concentrate on theoretical aspects of the neuron model, and only a few researchers report practical applications of this paradigm [8]. This paper reports the use of PCNN in a new architecture for speaker recognition systems. First, a pulsed neuron model is presented and a two-layer architecture of pulsed neurons is proposed. Afterwards, text-independent speaker recognition is performed on a closed set of 10 speakers in order to verify whether the proposed architecture is suitable for the task.

II. PULSED NEURON MODEL

In pulsed neuron models, the inputs and outputs are short pulse sequences generated over time. The model used in this work is the Spike Response Model described in Maass [3], presented with some modifications in Figure 1. Considering the neuron in the figure to be the i-th neuron, its internal state is described by a variable u_i that represents the membrane potential of biological systems. When this internal variable exceeds a threshold, the neuron fires and generates a pulse on its output y_i. The internal variable depends on external inputs arriving through connections from other neurons, x_j, and on an internal feedback from the neuron output that adjusts the neuron's sensitivity to subsequent inputs. When a pulse arrives at input x_j, it is multiplied by the corresponding weight, which represents the synaptic efficiency. The pulse then passes through a leaky integrator with time constant τ_ji that mimics the biological fading of the post-synaptic potential. All incoming pulses are added at the soma, and if the weighted sum exceeds the threshold v_i, a pulse is generated at the output. This analysis assumes the neuron is at its rest state and no previous input pulse has occurred, so p_ji is null and there is no effect from the dynamic threshold. Immediately after an output pulse is generated, the feedback link receives a pulse and generates a dynamic threshold that is added to the static threshold, making the membrane potential more negative. The dynamic threshold is likewise implemented with a leaky integrator and corresponds to the biological ionic channels returning to the rest state. This negative potential makes the next firing of the neuron more difficult, so several accumulated input pulses are necessary to make the neuron fire again. If no further input pulses arrive, the post-synaptic and membrane potentials fade with time constants τ_ji and τ_i respectively, returning the neuron to the rest state.
Fig. 1. Block diagram of the Spike Response Model, modified from Maass [3].
In order to run this neuron model on digital computers, a discrete model must be used. The discrete model is obtained by dividing continuous time into intervals of constant duration T, which are used to compute each neuron state over the discrete time space. The leaky integrators were replaced by first-order digital filters whose relaxation factors correspond to the fading factors. Equations 1 and 2 model the digital filter for post-synaptic fading:

$$p_{ji}(n) = r_{ji}\, p_{ji}(n-1) + x_j(n) \qquad (1)$$

$$r_{ji} = \exp\!\left(-\frac{T}{\tau_{ji}}\right) \qquad (2)$$

Similarly, Equations 3 and 4 model the digital filter for membrane potential fading:

$$\eta_i(n) = r_i\, \eta_i(n-1) + y_i(n) \qquad (3)$$

$$r_i = \exp\!\left(-\frac{T}{\tau_i}\right) \qquad (4)$$

Finally, considering the digital filters presented above, Equation 5 describes the discrete-space neuron model that was used to implement the pulsed layers in this work. It models the membrane potential resulting from the inputs and the dynamic threshold, with r_ji and r_i given by Equations 2 and 4 respectively:

$$u_i(n) = \sum_{j \in \Gamma_i} w_{ji}\, p_{ji}(n) - v_i\, \eta_i(n) \qquad (5)$$

where Γ_i is the space containing all pre-synaptic neurons connected to neuron i, x_j is the input pulse from neuron j, w_ji represents the connection efficiency from neuron j to neuron i, and y_i represents the pulse at the neuron output. The first term, Equation 6, represents the sum of all inputs from the neurons j ∈ Γ_i connected to neuron i:

$$\sum_{j \in \Gamma_i} w_{ji}\, p_{ji}(n) \qquad (6)$$

and the second term, Equation 7,

$$v_i\, \eta_i(n) \qquad (7)$$

represents the contribution of the dynamic threshold, which is related to the recent firing states of neuron i and is indicated as θ_i in Figure 1.
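As a concrete illustration of Equations 1-7, the following Python sketch implements one neuron of the discrete model. The class name, the parameter values, and the update order within a time step are our assumptions for illustration, not taken from the paper.

```python
import numpy as np

class PulsedNeuron:
    """One neuron of the discrete Spike Response Model (Eqs. 1-7).

    Parameter values are illustrative, not taken from the paper.
    """

    def __init__(self, weights, v=1.0, T=1.0, tau_ji=10.0, tau_i=20.0):
        self.w = np.asarray(weights, dtype=float)  # w_ji, synaptic efficiencies
        self.p = np.zeros_like(self.w)             # p_ji, post-synaptic potentials
        self.eta = 0.0                             # eta_i, dynamic-threshold state
        self.v = v                                 # v_i, static threshold
        self.r_ji = np.exp(-T / tau_ji)            # Eq. 2: input relaxation factor
        self.r_i = np.exp(-T / tau_i)              # Eq. 4: threshold relaxation factor

    def step(self, x):
        """Advance one interval T; x is the 0/1 input pulse vector."""
        self.p = self.r_ji * self.p + x            # Eq. 1: post-synaptic fading
        u = self.w @ self.p - self.v * self.eta    # Eq. 5: membrane potential
        y = 1.0 if u >= self.v else 0.0            # fire when u_i exceeds v_i
        self.eta = self.r_i * self.eta + y         # Eq. 3: dynamic threshold grows on firing
        return y

# Example: a neuron with 16 inputs driven by random pulse trains.
rng = np.random.default_rng(0)
neuron = PulsedNeuron(weights=rng.uniform(0.1, 0.3, 16))
pulses = [neuron.step((rng.random(16) < 0.2).astype(float)) for _ in range(100)]
```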
III. PROPOSED LAYERED ARCHITECTURE
The neural network architecture proposed for feature extraction consists of two layers of pulsed neurons, as illustrated in Figure 2. The first layer converts the input sequence into a pulse-modulated sequence, and the second extracts and generates the features that are passed to an MLP classifier layer.
Fig. 2. Proposed architecture: (a) input Mel cepstral coefficients; (b) pulse-encoded sequence; (c) pulses mapped through layer 2; (d) speaker identified.

A. Pulse coding layer

In this layer, the number of neurons is equal to the dimension of the input vector, and there are no feedback connections between the neurons. As the input sequence is presented to this layer, each vector coefficient is fed to one of the input neurons, changing the internal state of the corresponding neuron. As the sequence is presented to the neuron inputs, the membrane potential varies proportionally with the amplitude of the input value. The result presented at the output of this layer is a frequency-modulated pulse sequence whose rate is proportional to the amplitude of each input coefficient.
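The amplitude-to-rate behavior of this coding layer can be sketched as follows. The function below is a simplified stand-alone illustration that resets the potential after each pulse instead of using the dynamic threshold of Section II; the constants are assumptions.

```python
import numpy as np

def pulse_encode(coeff, steps=200, tau=10.0, threshold=1.0):
    """Encode one input coefficient as a pulse train; a higher amplitude
    charges the membrane potential faster, so the pulse rate grows with it."""
    r = np.exp(-1.0 / tau)          # relaxation factor, as in Eq. 2
    u, pulses = 0.0, np.zeros(steps)
    for n in range(steps):
        u = r * u + coeff           # leaky integration of the coefficient
        if u >= threshold:
            pulses[n] = 1.0
            u -= threshold          # simplified reset (no dynamic threshold)
    return pulses

# pulse_encode(0.6).sum() yields more pulses than pulse_encode(0.2).sum()
```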
B. Feature extraction layer

This layer is the novelty of this architecture. It is composed of a linear arrangement of neurons interconnected so as to form a ring. The network follows the training principles of the well-known self-organizing maps [9] and is trained using competitive rules [6]. The ring configuration works better than a simple linear arrangement because it ensures that no border distortion is introduced during training. The training algorithm of this ring self-organized map is based on self-organization and makes use of neighborhood updating rules. During training, the winning neuron's weights are updated strongly, and the neighbors' weights are updated with a strength inversely proportional to their distance from the winner. The neighborhood is wide at the start of training and is decreased until only the winning neuron's weights are updated. At the end of the training process, the weights assume the most common combinations of values presented in the input sequence. The statistics learned by this layer are used to map the pulse sequence generated by the previous layer into a new, normalized pulse sequence for each speaker, which is presented to the classifier layer.
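A minimal sketch of such a ring-shaped self-organizing map is given below. The Gaussian neighborhood function, the linear decay schedules, and all hyperparameter values are assumptions for illustration, and the inputs are treated as ordinary real vectors rather than pulse trains.

```python
import numpy as np

def train_ring_som(data, n_units=100, epochs=500, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D self-organizing map whose units form a closed ring.

    The circular (wrap-around) distance removes the border distortion of a
    plain linear arrangement. Sketch only; hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.uniform(data.min(), data.max(), size=(n_units, dim))
    radius0 = radius0 or n_units / 2

    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays to 0
        radius = max(radius0 * (1.0 - frac), 1.0)  # neighborhood shrinks to the winner
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            idx = np.arange(n_units)
            d = np.abs(idx - winner)
            d = np.minimum(d, n_units - d)         # circular distance on the ring
            h = np.exp(-(d ** 2) / (2 * radius ** 2))  # neighborhood strength
            weights += lr * h[:, None] * (x - weights)
    return weights
```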
C. Classifier layer

Since the main objective of this work was to evaluate the ability of the two previous layers to retain speaker information, a multilayer perceptron (MLP) was chosen as a reference classifier. Once the ability of the architecture to retain speaker information is demonstrated, a new classifier layer using pulse coupled neurons can be proposed as future work. The multilayer perceptron was implemented in a "classic configuration" [4] with three fully connected layers: an input layer, one hidden layer, and an output layer. The input layer was configured with the same dimension as the output vector of the feature extraction layer, and the number of outputs was made equal to the number of classes (speakers) to identify. At the hidden layer, the number of neurons was fixed at 300. In the input layer, the neuron function was linear, and these units served as coupling for the input vector. In the hidden and output layers, the neuron function was chosen to be an adjustable sigmoid, described by Equation 8, which made it possible to adjust the neuron response to the dynamic range of the input data:

$$f(x) = \frac{1}{1 + \exp(-x/a)} \qquad (8)$$

where a is the adjustable parameter.
The MLP training was done using a standard backpropagation algorithm [4] with an adaptive learning rate and a fixed momentum.
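For reference, an approximately equivalent classifier can be configured with scikit-learn as sketched below. The library's standard logistic activation stands in for the adjustable sigmoid of Equation 8, and the momentum value and iteration budget are assumptions.

```python
from sklearn.neural_network import MLPClassifier

# Reference classifier: 100-dimensional icons in, 10 speakers out, 300 hidden
# units, SGD backpropagation with an adaptive learning rate and fixed momentum.
clf = MLPClassifier(
    hidden_layer_sizes=(300,),
    activation="logistic",      # stand-in for the adjustable sigmoid of Eq. 8
    solver="sgd",
    learning_rate="adaptive",   # adaptive learning rate, as described
    momentum=0.9,               # fixed momentum (value assumed)
    max_iter=1000,              # iteration budget assumed
)
# clf.fit(train_icons, train_speaker_ids); clf.predict(test_icons)
```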
IV. SPEECH CORPUS AND REPRESENTATION

A. Speech corpus

The speech database was constructed using a subset of the Speaker Recognition v1.0 database from CSLU (Center for Spoken Language Understanding, Oregon Graduate Institute, U.S.A.). The selected subset was composed of 10 speakers (5 males and 5 females), and 8 phrases recorded over a digital phone line were used, with an 8 kHz sampling rate and 16-bit resolution. The phrases were uttered 5 times by each speaker in different recording sessions over time, resulting in 40 utterances per speaker.
B. Speech segment representation

The utterances were segmented into 256-point frames, resulting in 32 ms frames of speech. Consecutive frames were overlapped by 128 points (16 ms), and 16 Mel cepstral coefficients were calculated for each frame with 20 filters uniformly spaced on the Mel scale, using a discrete cosine transform (DCT).
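A feature extraction step matching this description can be sketched with the librosa library as below; librosa is our choice for illustration, not the tooling used by the authors.

```python
import librosa

def mfcc_frames(wav_path):
    """16 Mel cepstral coefficients per 32 ms frame (256 samples at 8 kHz),
    with a 128-sample hop (16 ms overlap) and 20 mel filters, followed by a DCT
    (applied internally by librosa)."""
    y, sr = librosa.load(wav_path, sr=8000)   # resample to the corpus rate
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=16, n_fft=256, hop_length=128, n_mels=20,
    )
    return mfcc.T  # one 16-dimensional vector per frame
```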
V. EXPERIMENTAL RESULTS

To conduct the experiments, the phrases were split into two sets: the first with phrases 1-2-3-4 for training and the second with phrases 5-6-7-8 for testing. The first layer of the proposed architecture was configured with 16 pulsed neurons, each processing one of the MFCC coefficients. The second layer was configured with 100 pulsed neurons interconnected as described in Section III. The classifier layer was configured with a number of input neurons equal to the number of output neurons of the second layer.
In order to train and test the classifier, it was necessary to define the feature vector to be used. In these experiments, the feature used for a given phrase was the number of output pulses of each neuron belonging to the feature extraction layer. With this strategy, the resulting feature vector is a one-dimensional vector of dimension 100, and each element represents the number of pulses presented at the output of the feature extraction layer by the corresponding neuron. This feature vector is called an icon (see the sketch after this list). The training sequence for the proposed architecture is done in two phases:
* The first phase comprises the training of the feature extraction layer, in which each speaker's statistics are learned. In this phase, the pulse sequences generated at the first layer are presented, and the neural network corresponding to this layer is trained using a competitive self-organizing algorithm as described in Section III.
* The second phase comprises the training of the final classifier layer used to verify the speaker identity. In this phase, the training phrases are encoded into pulses by the first layer and mapped statistically by the second layer, and the corresponding pulse sequence is used to compose an icon representing the phrase.
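The icon construction reduces each phrase to per-neuron pulse counts; a minimal sketch (function name and array layout assumed):

```python
import numpy as np

def make_icon(pulse_outputs):
    """Icon for one phrase: total output pulses per feature-extraction neuron.

    pulse_outputs: (n_time_steps, 100) array of 0/1 pulses from layer 2.
    Returns a 100-dimensional count vector.
    """
    return np.asarray(pulse_outputs).sum(axis=0)
```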
It is important to note that this second phase is expected to change when a final classifier layer is implemented using pulsed neurons.
[Figure: recognition rate (%) for the training/test configurations of the experiments; legible values include 99, 98, 86, and 76 for configurations B and C.]
A. Test 1

In Test 1, the training set was presented to the recognizer and layer two was trained over 500 epochs. Afterwards, the training set was presented again and layer two generated the corresponding output pulse sequences. These pulse sequences were then transformed into icons as described previously and used to train the classifier. After the trainings were completed, the testing sequences were presented to the recognizer, resulting in 96% recognition for the training set and 82% for the testing set.
B. Test 2

Test 2 was conducted in order to investigate the reason for the different recognition rates in Test 1 and to verify whether the pulsed layers were extracting the speaker information. Each phrase used in this experiment has a phonetic content that characterizes the speaker, and the best characterization is obtained when phrases with complementary phonetic contents are all included in the training set, providing a more complete model for each speaker. Following these considerations, the phrases used in the training and test sets were selected to provide variable amounts of information in the training phase, with a single phrase used for testing, in order to observe how the recognition rate changes. It is expected that the recognition rate increases as the training set contains more information. The phrases used to train and test the recognizer, as well as the resulting recognition rates, are shown in the figure above.
VI. CONCLUSIONS

The results show that the use of pulse coupled neural networks for the speaker recognition task is promising. The lower recognition rate on the test set was due to a poor training set, as demonstrated in Test 2, and the recognition rate is expected to increase with a larger training set. In Test 2, when phrase 1 was returned to the training set the recognition rate increased, making it clear that the pulsed neural network was capturing this additional information and passing it to the classifier, which demonstrates that the proposed architecture preserves the speaker information. Finally, the main conclusions are:
a) PCNN are promising for speaker recognition tasks; b) the proposed architecture was able to generate features that lead to speaker recognition.
Future work will explore other strategies to choose feature vectors from PCNN and implement a classifying layer using pulsed neurons.
REFERENCES
[1] R. P. Lippmann, "Pattern classification using neural networks," IEEE Communications Magazine, vol. 27, no. 11, pp. 47-64, Nov. 1989.
[2] K. R. Farrell and R. J. Mammone, "Speaker recognition using neural networks and conventional classifiers," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, part II, pp. 194-205, Jan. 1994.
[3] W. Maass and C. Bishop, Eds., Pulsed Neural Networks. Cambridge, MA: MIT Press, 1999.
[4] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[5] R. C. deCharms and M. M. Merzenich, "Primary cortical representation of sounds by the coordination of action-potential timing," Nature, vol. 381, 13 June 1996.
[6] B. Ruf and M. Schmitt, "Self-organization of spiking neurons using action-potential timing," IEEE Transactions on Neural Networks, vol. 9, no. 3, pp. 575-578, May 1998.
[7] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659-1671, Dec. 1997.
[8] D. Mercier and R. Seguier, "Spiking neurons (STANNs) in speech recognition," in Proc. 3rd WSES International Conference on Neural Networks and Applications, Interlaken, Feb. 2002.
[9] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.