Artificial Neural Networks and Support Vector Machine for Voice Disorders Identification

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 5, 2016

Abstract—The diagnosis of voice diseases through invasive medical techniques is efficient but often uncomfortable for patients; therefore, automatic speech recognition methods have attracted growing interest in recent years and have achieved real success in the identification of voice impairments. In this context, this paper proposes a reliable algorithm for voice disorders identification based on two classification algorithms: the Artificial Neural Networks (ANN) and the Support Vector Machine (SVM). The feature extraction task is performed by the Mel Frequency Cepstral Coefficients (MFCC) and their first and second derivatives. In addition, the Linear Discriminant Analysis (LDA) is proposed as a feature selection procedure in order to enhance the discriminative ability of the algorithm and minimize its complexity. The proposed voice disorders identification system is evaluated with widespread performance measures such as the accuracy, sensitivity, specificity, precision and Area Under Curve (AUC).

Keywords—Automatic Speech Recognition (ASR); Pathological voices; Artificial Neural Networks (ANN); Support Vector Machine (SVM); Linear Discriminant Analysis (LDA); Mel Frequency Cepstral Coefficients (MFCC)

I. INTRODUCTION

When the mechanism of voice production is affected, the voice becomes pathological and sometimes unintelligible, which causes many problems and difficulties in integrating the social environment and having an easy exchange between members of the same community. Therefore, the diagnosis of voice impairments is imperative to avoid such issues. Voice disorders can be classified into three main categories: organic, functional, or a combination of both [1]. This study is designed for organic voice disorders. Indeed, a voice disorder is organic if it is caused by a structural (anatomic) or physiologic disease, either a disease of the larynx itself or a remote systemic or neurologic disease that alters laryngeal structure or function [2]. In this research, we have worked on both structural and neurogenic disorders. Four types of pathologies are examined: chronic laryngitis, cyst, Reinke's edema and spasmodic dysphonia, since they are widespread diseases and their medical analysis remains tricky to date. Among the many techniques for identifying voice diseases, automatic acoustic analysis has proven its efficiency in recent years and has met with growing success. The advantage of acoustic analysis is its nonintrusive nature and its potential for providing quantitative data with reasonable expenditure of analysis time [3]. Therefore, several techniques and methods have been introduced and many studies have been conducted in the literature. Some of these researches indicate that voice disorders identification can be done by the exploitation of Mel Frequency Cepstral Coefficients (MFCC) together with the harmonics-to-noise ratio, normalized noise energy and glottal-to-noise excitation ratio, where a Gaussian mixture model was used as classifier [4]. Also, Daubechies' discrete wavelet transform, linear prediction coefficients, and the least-squares Support Vector Machine (LS-SVM) were investigated in [5]. In addition, a voice recognition algorithm was proposed in [6] based on the MFCC coefficients, their first and second derivatives, the F-ratio and Fisher's discriminant ratio as feature reduction methods, and the Gaussian Mixture Model (GMM) as classifier; the main idea there consists in demonstrating that the detection of voice impairments can be performed using both mel cepstral vectors and their first derivative, ignoring the second derivative. In this paper, we will show that the contribution of the first and second derivatives of the MFCC features mainly depends on the classifier. Indeed, the Artificial Neural Networks (ANN) and the Support Vector Machine (SVM) are investigated as classifiers in this work and a comparative study between their respective performances is conducted. In addition, three combinations of the MFCC features and their first and second derivatives are proposed for the feature extraction task. In order to select the most relevant parameters from the resulting feature vector, the Linear Discriminant Analysis (LDA) is suggested as a feature selection procedure. Furthermore, the system performance is assessed in terms of the accuracy, sensitivity, specificity, precision and Area Under Curve (AUC). In the next section, the methodology and database used in this work are described as well as the performance measures. Then, Section 3 presents the experimental results and Section 4 discusses these results. Finally, we conclude this paper in Section 5.

II. MATERIALS AND METHODS

A. Database

In this research, we have selected the voice samples from the 'Saarbrucken Voice Database' (SVD) [7], [8], which is a German voice disorders database collected in collaboration with the Department of Phonetics and ENT at the Caritas clinic St. Theresia in Saarbrucken and the Institute of Phonetics of the University of the Saarland. It contains 2225 voice samples with a sampling rate of 50 kHz and a 16-bit amplitude resolution. Subjects sustained the vowels [i], [a] and [u] for 1 s. In this study, the continuous vowel [a] phonation produced by 50 normal people and 70 patients was examined. Four types of pathologies are investigated: chronic laryngitis
(24), cyst (6), Reinke's edema (19) and spasmodic dysphonia (21).

B. The Proposed Algorithm

In this paper, the extraction of the acoustical features from the speech signal is performed by the MFCC parameterization method. In addition, the first and second derivatives, which provide information about the dynamics of the time variation in the original MFCC features, were investigated to verify their contribution to the proposed algorithm. In order to optimize the voice disorders detection, a projection-based Linear Discriminant Analysis (LDA) is suggested as the feature selection method, and a comparative study is elaborated between optimized and non-optimized features for every tested combination. As regards the classification task, the Artificial Neural Networks (ANN) are used as an unconventional approach in addition to the Support Vector Machine as a new method successfully exploited in recent years, Fig. 1.

[…] binning. Therefore, the Mel filtering process has to be performed. Thus, the obtained speech signal spectrum is filtered by a group of triangular bandpass filters that simulate the characteristics of the human ear [9], [10]. The following equation is used to compute the Mel frequency f_Mel for a given linear frequency f_Hz in Hz:

    f_Mel = 2595 * log10(1 + f_Hz / 700)    (1)

The nonlinear frequency characteristic of the human auditory system is approximated by the Mel filtering procedure. At this stage, a natural logarithm is applied to each output spectrum from the Mel filter bank. Finally, the Discrete Cosine Transform (DCT) is performed to convert the log Mel spectrum back into the time domain; thus, the Mel Frequency Cepstrum Coefficients (MFCC) are obtained. Besides, there are several ways to approximate the first derivative of a cepstral coefficient. In this research, we use the following formula [11]:

    Δx(t) = dx(t)/dt ≈ Σ_{m = -M..M} m · x(t + m)    (2)
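As an illustration, the Mel mapping in (1) and the delta approximation in (2) can be sketched in a few lines. The paper's experiments were run in MATLAB; this Python transcription is only a sketch, and the window half-width M = 2 is an assumed value, not one reported above:

```python
import math

def hz_to_mel(f_hz):
    """Eq. (1): map a linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def delta(coeffs, t, M=2):
    """Eq. (2): unnormalized first derivative of a cepstral
    trajectory at frame t, using a window of +/- M frames.
    Frame indices outside the sequence are clamped to the edges."""
    total = 0.0
    for m in range(-M, M + 1):
        idx = min(max(t + m, 0), len(coeffs) - 1)
        total += m * coeffs[idx]
    return total

# By construction of the scale, 1000 Hz maps to roughly 1000 mel
print(round(hz_to_mel(1000.0), 2))   # → 999.98

# A linearly increasing trajectory has a constant positive delta
c = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(delta(c, 3, M=2))              # → 10.0
```

In practice the sum in (2) is often normalized by Σ m², but the paper's formula is stated without that factor, so the sketch follows it literally.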
[…] rates in order to conclude the most effective classifier for the identification of voice disorders.

1) Support Vector Machine:

Support Vector Machines are a class of learning techniques introduced by Vladimir Vapnik in the early 90s [14], [15]. In binary classification, the training data come from only two different classes (+1 or -1). The idea of the SVM is to find a hyperplane that best separates the two classes with maximum margin. If the data are linearly separable, it is called a "hard-margin SVM". If the data are non-linearly separable, it is called a "soft-margin SVM". In this case, the data are mapped into a higher-dimensional space where the separating function becomes linear. This transformation is often performed using a "kernel mapping function" and the new space is called the "feature space". The most widely used SVM kernel functions are the linear kernel, the polynomial kernel and the Radial Basis Function (RBF), i.e. the Gaussian kernel.

The training phase of the SVM classifier involves searching for the hyperplane that maximizes the margin. Such a hyperplane is called the "optimal separating hyperplane".

In this research, the proposed algorithm was trained with the Radial Basis Function (RBF) as a Gaussian SVM kernel, using LIBSVM, which is an SVM library [16].

2) Artificial Neural Networks:

Artificial Neural Networks are certainly one of the most effective approaches for speech recognition thanks to their numerous architectures and learning algorithms. In this paper, the architecture of the proposed neural network is composed of three layers: an input layer for the transmission of the input features without distortion, a hidden layer containing 250 neurons (with the sigmoid as activation function) and an output layer containing one neuron with a linear activation function. Each layer is fully connected to the next one. The proposed neural network learning is performed based on the principles of Bayesian regularization algorithms. Indeed, the network weight values are adjusted successively at every learning step in order to achieve an output as close as possible to the considered data [17].

Concerning the Bayesian approach, it is based on the exploitation of a random distribution of the network weight probabilities. The neural network learning consists in determining this distribution given the training data. Indeed, after examination of the training data, the initial probability attributed to the weights before learning is transformed into a final distribution through the application of Bayes' theorem [17].

F. Evaluation Process

In order to judge the effectiveness and the robustness of the proposed algorithm, it has to be assessed according to different performance measures. In this research, five performance measures were used: accuracy, sensitivity, specificity, precision and the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. Indeed, sensitivity measures the ability of the algorithm to recognise pathological samples. It opposes specificity, which evaluates the ability of the algorithm to identify normal samples. Precision represents the proportion of well-classified pathological samples from the pathological class. Furthermore, accuracy measures the algorithm's correct classification rate, and the AUC is an important statistical property for evaluating the discriminability between the two classes of normal and pathological samples. Therefore, the AUC provides another way to measure the accuracy of the proposed system. These measures are based on the following notions:

TP : True Positive : identified as pathological when pathological samples are actually present

TN : True Negative : identified as normal when normal samples are actually present

FP : False Positive : identified as pathological when normal samples are actually present

FN : False Negative : identified as normal when pathological samples are actually present

These measures can be calculated as follows:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    Sensitivity = TP / (TP + FN)

    Specificity = TN / (TN + FP)

    Precision = TP / (TP + FP)

    Area Under Curve (AUC) = (1/2) * (TP / (TP + FN) + TN / (TN + FP))

III. EXPERIMENTAL RESULTS

In this research, the dataset was divided into two parts: 70% of the data were used for training and 30% for validation. All simulations were conducted in MATLAB 2013a on an Intel Core-i7, 2.20 GHz CPU with 4 GB RAM.

A. Evaluation Based on the SVM Performance

In this part of the article, we present the SVM performance rates for different combinations of the MFCC coefficients before and after applying the LDA feature selection procedure. Table 1 shows the SVM performance in terms of accuracy (Acc %), sensitivity (Sens %), specificity (Spec %), precision (Prec %) and AUC (%) for the different MFCC feature vectors.

The experimental results show that there is a slight increase in the SVM performance rates between the MFCC and MFCC_Delta1 combinations, of 0.04% in the accuracy rate, 0.03% in the AUC rate, 0.04% in the sensitivity rate, 0.05% in the specificity rate and 0.07% in the precision rate. Whereas the system performances are exactly equal for the combinations of MFCC_Delta1 and MFCC_Deltas1&2, with an accuracy rate of 80.4%, sensitivity of 87.83%, specificity of 73.58%, AUC of 80.7% and precision of 72.29%. Therefore, we can note that the first and the second derivatives do not provide a significant improvement in the system performances when the SVM is used as classifier, which demonstrates that the
SVM algorithm is not sensitive to the information provided by these features about the dynamics of the time variation in the original MFCC vector. Besides, after applying the LDA procedure, the SVM performance rates are certainly less close, but not distant enough to change the whole analysis about the contribution of the first and the second derivatives in the […]

[…] for the MFCC_Delta1 and 6.94% between the optimized and non-optimized MFCC_Delta1&2 features.

Fig. 4. Comparison between the SVM AUC rates of the optimized and non-optimized MFCC features (non-optimized: 80.67%, 80.70%, 80.70%; optimized: 87.53%, 87.31%, 87.64% for MFCC, MFCC_Delta1 and MFCC_Deltas1&2, respectively)

[Figure: comparison of the SVM accuracy rates (%) of the optimized and non-optimized MFCC features (non-optimized: 80.36%, 80.40%, 80.40%; optimized: 86.28%, 86.07%, 86.44% for MFCC, MFCC_Delta1 and MFCC_Deltas1&2, respectively); caption lost in extraction]
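The rates discussed above are all derived from the confusion counts defined in the Evaluation Process section. A minimal sketch of that computation follows, in Python rather than the MATLAB used for the experiments; the confusion counts shown are hypothetical, since the paper reports only the derived rates:

```python
def performance_measures(tp, tn, fp, fn):
    """Compute the five measures of Section II-F from raw
    confusion counts (pathological = positive class)."""
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "precision":   tp / (tp + fp),
        # the balanced-accuracy form of the AUC used in this paper
        "auc":         0.5 * (sens + spec),
    }

# Hypothetical confusion counts for a 30% validation split
# (NOT the paper's actual counts, which are not reported):
m = performance_measures(tp=60, tn=28, fp=8, fn=4)
print({k: round(100 * v, 2) for k, v in m.items()})
# → {'accuracy': 88.0, 'sensitivity': 93.75, 'specificity': 77.78,
#    'precision': 88.24, 'auc': 85.76}
```

Note that a high sensitivity with a lower specificity, as in the results above, means the classifier misses few pathological voices at the cost of more false alarms on normal ones.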
[…] combinations is observed before and after applying the LDA transformation.

As regards the LDA method, it was applied to the different MFCC combinations in order to select the most significant parameters from the feature extraction task to be the input vector of the ANN architecture. This strategy leads to an optimization of the system performance. Indeed, the experimental results show an improvement in the ANN performance measurements for all the optimized MFCC feature combinations. Fig. 5 compares the ANN accuracy rates of the optimized and non-optimized MFCC vectors.

Fig. 5. Comparison between the ANN accuracy rates of the optimized and non-optimized MFCC features (non-optimized: 75.13%, 81.19%, 85.20%; optimized: 80.25%, 84.06%, 87.82% for MFCC, MFCC_Delta1 and MFCC_Deltas1&2, respectively)

The experimental results exposed in Fig. 5 show an optimization of 5.12% in the accuracy rate of the non-optimized MFCC features, while the improvement is about 2.87% for the combination of the MFCC features and their first derivatives. Also, the optimization procedure provides a 2.62% increase in the accuracy rate of the MFCC features associated with their first and second derivatives. In fact, the improvement was observed for all performance measures, namely the AUC rates, which were improved to reach 81.87% for the MFCC combination with an optimization of 6.85%, while 3.85% and 2.75% were the improvement rates for the combinations of MFCC_Delta1 and MFCC_Delta1&2, respectively, Fig. 6.

Fig. 6. Comparison between the ANN AUC rates of the optimized and non-optimized MFCC features (non-optimized: 75.02%, 81.74%, 85.21%; optimized: 81.87%, 85.59%, 87.96% for MFCC, MFCC_Delta1 and MFCC_Deltas1&2, respectively)

Finally, the optimized MFCC_Delta1&2 combination reached the best ANN performance rates with an accuracy rate of 87.82%, sensitivity of 99.12%, specificity of 80.31%, AUC of 87.96% and a precision of 81.42%, as mentioned in Table 2.

IV. DISCUSSION

In this paper, the ANN is proposed as an unconventional approach in addition to the SVM as a new method successfully exploited in speech recognition. The main motivation for conducting this research was to investigate the efficiency of each of those classifiers in the identification of voice disorders. In addition, it was interesting to scrutinize the contribution of the first and second derivatives of the MFCC features for every classifier. The experimental results demonstrate that the effect of these derivative features depends on the classifier. Indeed, when the SVM is used as classifier, the first and second derivatives do not provide any improvement to the system performance compared to the original MFCC features. However, when the ANN is used as classifier, these derivative features can be considered important since they contribute to the improvement of the system performance. In this case, there is an average improvement of about 4% between the combinations of the MFCC, MFCC_Delta1 and MFCC_Delta1&2.

Besides, the LDA procedure is used to select the most relevant parameters from a resulting feature vector in order to reduce the system dimensionality without affecting its performance. Indeed, our findings show that the LDA method minimizes the system complexity while improving the performance rates for every feature combination; therefore, it can be considered as an optimization procedure.

Table 3 compares the proposed algorithms with previous significant works. It is observed that the proposed algorithm appears competitive for the detection of voice disorders from the Saarbrucken Voice Database (SVD).

TABLE III. COMPARATIVE TABLE BETWEEN PROPOSED ALGORITHM AND PREVIOUS WORKS

Finally, with an accuracy rate of 86.44%, sensitivity of 98.24%, specificity of 77.04%, AUC of 87.64% and precision of 74.42%, the SVM classifier can be judged efficient for voice disorders identification. Also, the ANN classifier offers an accuracy rate of 87.82%, sensitivity of 99.12%, specificity of
80.31%, AUC of 87.96% and precision of 81.42%, which are slightly better than those of the SVM classifier, leading to the conclusion that the ANN classifier is likewise effective for voice impairment identification. With these performance rates, the proposed algorithm can be considered reliable for the identification of pathological voices from normal ones.

V. CONCLUSION

This paper proposes an optimized voice disorders identification algorithm based on short-term cepstral parameters and the Linear Discriminant Analysis as feature selection method. As regards the classification task, it is performed by the Artificial Neural Networks and the Support Vector Machine. The three combinations of MFCC, MFCC_Delta1 and MFCC_Delta1&2 are examined in order to conclude the role of the derivative features. Indeed, the experimental results demonstrate that the contribution of the first and second derivatives of the MFCC features varies according to the classifier. In addition, the LDA transformation can be considered as an optimization procedure since it improves the system performance while reducing its dimensionality. Accuracy rates of 86.44% and 87.82% were obtained by the SVM and the ANN, respectively. Therefore, we can conclude that the ANN and the SVM are efficient for voice disorders identification, with a slight advantage to the ANN. Many future improvements can be proposed, such as including other feature extraction methods in a hybrid scheme in order to improve the performance rates. For instance, we can suggest the Discrete Wavelet Transform to be integrated with the proposed MFCC features. In addition, the real-time implementation of the proposed algorithm may be envisaged.

REFERENCES

[1] A. Akbari and M. K. Arjmandi, "An efficient voice pathology classification scheme based on applying multi-layer linear discriminant analysis to wavelet packet-based features," Biomedical Signal Processing and Control, vol. 10, pp. 209-223, 2014.

[2] A. E. Aronson and D. M. Bless, Clinical Voice Disorders, 4th ed., New York: Thieme, 2009.

[3] Lions Voice Clinic, University of Minnesota, Department of Otolaryngology, Minneapolis, MN 55455, USA.

[4] D. Martinez, E. Lleida, A. Ortega, A. Miguel and J. Villalba, "Voice pathology detection on the Saarbruecken Voice Database with calibration and fusion of scores using MultiFocal toolkit," Advances in Speech and Language Technologies for Iberian Languages, vol. 328, pp. 99-109, 2012.

[5] E. F. Fonseca, R. C. Guido, P. R. Scalassara, C. D. Maciel and J. C. Pereira, "Wavelet time-frequency analysis and least squares support vector machine for the identification of voice disorders," Computers in Biology and Medicine, vol. 37, pp. 571-578, 2007.

[6] J. I. Godino-Llorente, P. Gomez-Vilda and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters," IEEE Trans. Biomed. Eng., vol. 53, pp. 1943-1953, 2006.

[7] W. J. Barry and M. Putzer, Saarbrucken Voice Database, Institute of Phonetics, University of the Saarland.

[8] M. Putzer and J. Koreman, "A German database of patterns of pathological vocal fold vibration," Phonus 3, Institute of Phonetics, University of the Saarland, pp. 143-153, 1997.

[9] X. Xiong, "Robust speech features and acoustic models for speech recognition," PhD dissertation, School of Computer Engineering, Nanyang Technological University, 2009.

[10] V. Tiwari, "MFCC and its applications in speaker recognition," International Journal on Emerging Technologies, vol. 1, pp. 19-22, 2010.

[11] J. W. Picone, "Signal modeling techniques in speech recognition," Proc. of the IEEE, vol. 81, pp. 1215-1247, 1993.

[12] G. Quanquan, L. Zhenhui and H. Jiawei, "Linear discriminant dimensionality reduction," in Machine Learning and Knowledge Discovery, ser. Lecture Notes in Computer Science, Germany: Springer, 2011, pp. 549-564.

[13] V. S. Tomar, "Discriminant feature space transformations for automatic speech recognition," Department of Electrical and Computer Engineering, McGill University, Montreal, 2012.

[14] I. Guyon, B. Boser and V. Vapnik, "Automatic capacity tuning of very large VC-dimension classifiers," Advances in Neural Information Processing Systems, pp. 147-155, 1993.

[15] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annual Workshop on Computational Learning Theory (COLT '92), New York, 1992.

[16] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1-27:27, 2011.

[17] R. M. Neal, Bayesian Learning for Neural Networks, New York: Springer-Verlag, 1996.

[18] A. Al-nasheri, Z. Ali, G. Muhammad and M. Alsulaiman, "Voice pathology detection using auto-correlation of different filters bank," in Proc. AICCSA'14, Doha, Qatar, 2014.

[19] I. M. M. El Emary, M. Fezari and F. Amara, "Towards developing a voice pathologies detection system," Journal of Communications Technology and Electronics, vol. 59, pp. 1280-1288, 2014.