A Front-End Technique for Automatic Noisy
Speech Recognition
Hay Mar Soe Naing
Risanuri Hidayat
Rudy Hartanto
Yoshikazu Miyanaga
University of Computer Studies, Thaton
Thaton, Myanmar
[email protected]
Gadjah Mada University
Yogyakarta, Indonesia
[email protected]
Gadjah Mada University
Yogyakarta, Indonesia
[email protected]
Hokkaido University
Sapporo, Japan
[email protected]
Abstract—Sounds in a real environment rarely occur in isolation; they combine into complex mixtures and usually happen concurrently. Auditory masking relates to the perceptual interaction between sound components. This paper proposes modeling the effect of simultaneous masking in the Mel frequency cepstral coefficient (MFCC) front end, which effectively improves the performance of the resulting system. Moreover, Gammatone frequency integration is presented to warp the energy spectrum, providing gradually decaying weights and compensating for the loss of spectral correlation. Experiments are carried out on the Aurora-2 database, and frame-level cross entropy-based deep neural network (DNN-HMM) training is used to build the acoustic model. With models trained on multi-condition speech data, the accuracy of our proposed feature extraction method reaches 98.14% at 10dB, 94.40% at 5dB, 81.67% at 0dB and 51.50% at -5dB SNR.
Index Terms—Feature Extraction, Gammatone Filterbank,
Psychoacoustics, Simultaneous Masking, Speech Recognition
I. INTRODUCTION
Speech recognition is a critical technology in human-computer interface (HCI) systems and emulates the human auditory function [1] [2]. In parallel with the development of computer technology, automatic speech recognition (ASR) is entering our lives, and its applications are more and more pervasive. It is built into medical, banking, tourism and information inquiry, speech-to-speech translation and other service systems [3]. The recognition performance of an ASR system is unavoidably affected by channel interference and unwanted background noise. Hence, to improve the performance of the system, we need to remove the corrupting noise and enhance the quality of the speech signal
[4]. Noise reduction techniques can be applied to an ASR system from different perspectives, such as speech enhancement at the signal level, extraction of robust feature vectors, and adjustment of the back-end acoustic models. In a real environment, the ambient noise cannot be characterized in advance and is hard to predict. Noise reduction techniques should therefore not depend on assumptions about the noise conditions or on training parameters tuned to specific noise scenarios. The primary intention of a noise-robust feature extractor is to make few or no assumptions about the
noise information. This is one of the challenging tasks in recent and ongoing research. Appropriate and relevant speech features can differentiate the different speech classes under the disturbance of environmental noise and the variability of speaker characteristics [5].
Some earlier works have approached the problem from different perspectives to increase the noise immunity of Mel frequency cepstral coefficients (MFCC) under noisy conditions. MFCC simulates the human auditory system and captures the main characteristics of phonemes in speech. In [6], spectral subtraction (SpecSub) is used with conventional MFCC by estimating the noise spectrum from non-speech regions and subtracting it from the noisy speech. The SpecSub algorithm performs reasonably well under noisy conditions; however, it does not work well on clean speech. In [7], a psychoacoustic model of frequency masking was suggested, introducing a transformation of the power spectral density at multiple fundamental harmonic frequencies into MFCC. In [8], a front-end spectral subtraction algorithm was presented to pre-filter the noise in the speech signal, and various frequency warping scales were analyzed together with a non-perceptual scale. Another study proposed power normalized cepstral coefficients (PNCC) in which the logarithm is replaced with a q-log power function, with mean normalization applied after the logarithm to remove the effect of convolutional noise [9].
This paper proposes a modified MFCC method to extract relevant and appropriate acoustic feature vectors from noisy speech by applying the simultaneous masking effect of the human hearing mechanism. The minimum masking threshold is calculated from a psychoacoustic model, commonly used in audio watermarking technology, and represents the most sensitive limit for distortion of the signal. By incorporating the masking effect model into conventional MFCC, the noise effect can be lessened without substantial loss or perceptible deterioration of the speech signal.
The organization of this paper is as follows. Section II describes the modeling of the simultaneous masking effect and the Gammatone filterbank analysis, and Section III presents the proposed cepstral feature. In Section IV, the feature recognition engines based on the Gaussian Mixture Model-based Hidden Markov Model (GMM-HMM) and the hybrid Deep Neural Network (DNN)-HMM are briefly explained. The experimental results and discussion are presented in Section V. The last section concludes the paper.
II. METHODOLOGY
An automatic speech recognition system may suffer from low recognition performance due to variation among speakers, different channels, or surrounding noise. Thus, robustness has been a crucial problem in the signal and speech processing area [10]. Psychoacoustics is the study of sound perception in the human auditory system; it includes the concept of auditory masking, how humans respond to different frequencies, and the relation between loudness and sound pressure level. Typically, examining and modeling the human hearing mechanism is a logical approach to enhancing the accuracy of a speech recognition system. A weak but audible sound becomes inaudible in the presence of another, louder sound. This is called the effect of auditory masking, and it is fundamental to the psychoacoustic modeling process. The masking effect is related to the selectivity of auditory processing and to how human ears respond to different complex sounds in a real-life environment. Simultaneous masking occurs between two sound components that are close in frequency. The low-power signal (maskee) becomes inaudible due to the concurrent presence of another, louder sound component (masker). Both the masker and the maskee can be either a tone or narrow-band noise.
Figure 1 shows the nature of simultaneous or frequency masking, where the stronger signal S0 is the masker. Due to the presence of the masker, the absolute hearing threshold is elevated to a new hearing threshold. This is called the masking threshold and is a kind of just noticeable limit for distortion in the hearing mechanism. Any sound components below this curve cannot be heard and are masked by the masker. The faint sounds S1 and S2 are wholly inaudible because their sound pressure levels lie below the masking threshold. The sound S3 is partly masked by the masker S0, and only the portion above the masking threshold is perceivable. The masker produces sufficiently strong excitation patterns on the basilar membrane in the human cochlea. This excitation prevents the detection of a weaker sound excitation within the same critical
band [11]. Simultaneous masking usually happens in real environments. Audio watermarking technology exploits the minimum masking threshold (MMT) to hide watermark information. Based on this consideration, this paper introduces a modified MFCC that combines the masking effect to extract robust acoustic features from noisy speech. Using the masking model in feature extraction makes it possible to determine which frequency components contribute most to the masking threshold and how much noise can be mixed into the signal without being perceived; it can also determine the amplitude of the speech signal.

Fig. 1: Nature of Simultaneous or Frequency Masking.
A. Modeling Minimum Masking Threshold
There are six documented processing steps to implement the modeling of the minimum masking threshold in the psychoacoustic model [11] [12] [13].
Step 1: Perform FFT analysis: Spectral estimation is
computed for each frame by applying the Fast Fourier
Transform (FFT) to produce the spectral coefficients. The
power spectral density estimate of x̃(n) is defined as follows.
"N −1
# 2
1 X
−j2πnk
P SD(k) = 10 log10
x̃(n) exp
(1)
N n=0
N
where 0 < k ≤ N/2. Then, the power spectral density estimate PSD(k) is normalized to a 65dB sound pressure level, corresponding to the most sensitive range of human conversational speech.

$$P(k) = 65 - \max\{PSD(k)\} + PSD(k) \qquad (2)$$
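As an illustration of Step 1, the following minimal Python/NumPy sketch computes the PSD estimate of Eq. (1) for one Hamming-windowed 512-point frame and normalizes it to the 65dB reference of Eq. (2); the small floor added before the logarithm is an implementation detail, not part of the paper.

```python
import numpy as np

def psd_estimate(frame, n_fft=512):
    """PSD of one windowed frame, Eq. (1): 10*log10(|(1/N) sum x(n) e^{-j2pi nk/N}|^2)."""
    spectrum = np.fft.rfft(frame, n_fft) / n_fft
    return 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)   # floor avoids log(0)

def normalize_to_spl(psd, spl_ref=65.0):
    """Normalize the PSD estimate to a 65 dB sound pressure level, Eq. (2)."""
    return spl_ref - psd.max() + psd
```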
Step 2: Finding tonal and non-tonal components: The tone-like and noise-like frequency components are selected from the local maxima of the normalized power spectral density estimate, i.e., bins larger than both of their two neighbors. If a local maximum exceeds its neighboring components within a specific Bark range D_k by at least 7dB, it is denoted as a tonal masker; otherwise, it is treated as a non-tonal masker.

$$S_{TM} = \{P(k) \mid P(k) - P(k \pm D_k) \ge 7\,\mathrm{dB}\} \qquad (3)$$

$$S_{NM} = \{P(k) \notin S_{TM}\} \qquad (4)$$
where S_TM is the set of tonal maskers and S_NM is the set of non-tonal maskers. As the masking effect is additive in the power domain, the sound pressure level of each masker is computed as follows:

$$P_{TM,NM}(k) = 10\log_{10}\sum_{j} 10^{P(j)/10}, \quad \forall P(j) \qquad (5)$$

where P(j) ranges over the set of tonal and non-tonal components, P(j) ∈ S_{TM,NM}.
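A minimal sketch of Step 2 is given below, assuming P is the normalized PSD of one frame. The fixed neighbourhood half-width d_k stands in for the frequency-dependent Bark range D_k, which the paper does not spell out, and the grouping of each masker with its two immediate neighbours in Eq. (5) is likewise an assumption.

```python
import numpy as np

def find_tonal_maskers(P, d_k=2, delta_db=7.0):
    """Step 2: bins that are local maxima and exceed their D_k-neighbourhood by 7 dB (Eq. 3)."""
    tonal = []
    for k in range(d_k, len(P) - d_k):
        if P[k] <= P[k - 1] or P[k] <= P[k + 1]:
            continue                                   # not a local maximum
        neighbours = np.r_[P[k - d_k:k - 1], P[k + 2:k + d_k + 1]]
        if neighbours.size == 0 or np.all(P[k] - neighbours >= delta_db):
            tonal.append(k)
    return tonal                                       # remaining local maxima form the non-tonal set

def masker_spl(P, k):
    """Combine the power of a masker and its two neighbours into one SPL value (Eq. 5)."""
    return 10.0 * np.log10(np.sum(10.0 ** (P[k - 1:k + 2] / 10.0)))
```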
Step 3: Determination of valid maskers: The magnitude of each masker must exceed the absolute threshold of hearing. Within any group of maskers occurring within a 0.5 Bark distance, only the masker with the highest sound pressure level is preserved and the rest are eliminated.

$$P_{TM,NM}(k) \ge ATH(k) \qquad (6)$$

$$P_{TM,NM}(k) = \arg\max_{k_0 \in [-0.5,\,0.5]} P_{TM,NM}(k + k_0) \qquad (7)$$
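The pruning of Step 3 can be sketched as follows. The paper does not give explicit expressions for ATH(k) or the Hz-to-Bark mapping, so Terhardt's approximation of the absolute hearing threshold and Zwicker's Bark formula are used here as assumptions; the sampling rate and FFT size are placeholder parameters.

```python
import numpy as np

def ath_db(f_hz):
    """Absolute threshold of hearing in dB SPL (Terhardt's approximation, assumed here)."""
    f = f_hz / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

def bark(f_hz):
    """Hz-to-Bark mapping (Zwicker's formula, assumed here)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def prune_maskers(maskers, fs=8000, n_fft=512):
    """Step 3: keep maskers above ATH (Eq. 6); within 0.5 Bark keep only the strongest (Eq. 7).

    `maskers` maps an FFT bin index k to that masker's SPL in dB."""
    freqs = {k: k * fs / n_fft for k in maskers}
    valid = {k: p for k, p in maskers.items() if p >= ath_db(freqs[k])}
    kept = {}
    for k, p in valid.items():
        rivals = [valid[j] for j in valid if abs(bark(freqs[j]) - bark(freqs[k])) <= 0.5]
        if p >= max(rivals):
            kept[k] = p
    return kept
```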
Step 4: Figuring individual masking thresholds: The tonal and non-tonal masking threshold expresses the masking contribution of a masker located at frequency index j to a maskee at frequency index i. The individual masking thresholds T_{TM,NM}(i, j) are given by:

$$T_{TM,NM}(i,j) = P_X[z(j)] + \Delta_X[z(j)] + SF(i,j) \qquad (8)$$

where P_X[z(j)] refers to the sound pressure level of the tonal or non-tonal masker at frequency index j, and z(j) is the Bark frequency of j. The term Δ_X is the masking index of the tonal or non-tonal masker, and SF(i, j) denotes the spreading function of the masking contribution from the masker at j to the maskee at i.
Step 5: Figuring of global masking threshold: The powers corresponding to the upper and lower slopes of individual sub-band masking curves and a given absolute hearing threshold are summed to form a composite global masking contour.

$$T_g(i) = 10^{ATH(i)/10} + \sum_{j=1}^{N_{TM}} 10^{T_{TM}(i,j)/10} + \sum_{j=1}^{N_{NM}} 10^{T_{NM}(i,j)/10} \qquad (9)$$

$$T_g(i) = 10\log_{10} T_g(i) \qquad (10)$$
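Steps 4 and 5 can be combined in the sketch below. The spreading function SF(i, j) and the masking indices Δ_X are not specified in the paper, so a simple two-slope spreading function and constant masking indices are used purely as placeholders; z and ath hold the per-bin Bark values and the absolute hearing threshold in dB.

```python
import numpy as np

def spreading_db(dz):
    """Assumed two-slope spreading function over the Bark distance dz = z(i) - z(j)."""
    return 25.0 * dz if dz < 0 else -10.0 * dz

def global_threshold(z, ath, tonal, nontonal, delta_tm=-6.0, delta_nm=-2.0):
    """Individual thresholds (Eq. 8) combined with ATH in the power domain (Eqs. 9-10).

    `tonal` and `nontonal` map a masker bin j to its SPL; delta_tm/delta_nm are placeholders."""
    tg_lin = 10.0 ** (ath / 10.0)
    for maskers, delta in ((tonal, delta_tm), (nontonal, delta_nm)):
        for j, spl in maskers.items():
            sf = np.array([spreading_db(z[i] - z[j]) for i in range(len(z))])
            t_ij = spl + delta + sf                      # Eq. (8)
            tg_lin = tg_lin + 10.0 ** (t_ij / 10.0)      # Eq. (9)
    return 10.0 * np.log10(tg_lin)                       # Eq. (10)
```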
Step 6: Figuring of minimum masking threshold: The minimum masking threshold is obtained from the global masking threshold. The spectral subsamples of the global masking threshold are mapped onto 32 uniform sub-bands (1 ≤ n ≤ 32).

$$T_M(m) = \min_{f_{id}(i)\in n} T_g(i), \qquad m = [8(n-1)+1] : 8n \qquad (11)$$
The MMT, which is computed from the psychoacoustic model, represents the most sensitive limit, i.e., the just noticeable distortion of the signal. Any sounds or frequency components lying below the threshold cannot be heard and are masked by a masker [13]. This concept is analyzed and integrated into the conventional MFCC feature extraction to tolerate the noise impact. The spectrum value of each frame is compared with the minimum masking threshold; if a spectrum value is lower than the masking threshold, it is set to the value of the threshold limit. In this way, a modified spectrum magnitude is worked out for each frame and passed through the process of filterbank analysis.
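A sketch of Step 6 and of the spectral flooring described above is given next, assuming a 512-point FFT so that the global-threshold samples map onto 32 uniform sub-bands of 8 bins each; the helper names are illustrative only.

```python
import numpy as np

def minimum_masking_threshold(tg, n_subbands=32):
    """Step 6: take the minimum of the global threshold in each uniform sub-band (Eq. 11)."""
    per_band = len(tg) // n_subbands                     # e.g. 256 bins -> 8 per sub-band
    return np.array([tg[n * per_band:(n + 1) * per_band].min() for n in range(n_subbands)])

def floor_spectrum(psd_db, mmt, n_subbands=32):
    """Raise any spectral value below its sub-band's minimum masking threshold up to it."""
    per_band = len(psd_db) // n_subbands
    floored = psd_db.copy()
    for n in range(n_subbands):
        sl = slice(n * per_band, (n + 1) * per_band)
        floored[sl] = np.maximum(psd_db[sl], mmt[n])
    return floored
```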
Fig. 2: Frequency Integration of Gammatone Filterbank at 16kHz.
B. Gammatone Filterbank Analysis
The speech signal causes vibration of the basilar membrane in the inner ear's cochlea in the human auditory system. Each part of the basilar membrane responds to localized frequency information of the speech signal. Similarly, a digital filterbank resembles the processing of the basilar membrane in auditory modeling, with each bandpass filter simulating the frequency characteristics of a portion of the basilar membrane [14]. Human hearing is most sensitive to frequencies between 2000 and 5000Hz and less sensitive in the high-frequency region. This affects the performance of the ASR system, and in conventional MFCC the triangular Mel filterbank is used to warp the spectrum envelope to address it. However, the triangular filterbank is symmetrically tapered at its ends and assigns no weight outside each sub-band. As a consequence, the correlation between sub-bands and nearby spectral information from adjacent sub-bands may be lost. The Gammatone filterbank has a Gaussian-like shape, allows the weights to decay gradually at both ends, and compensates for the possible loss of spectral correlation [15]. The frequency response of a Gammatone filterbank is illustrated in Figure 2. This Gaussian-shaped filterbank is substituted for the triangular Mel filterbank. The Gammatone filterbank is physiologically motivated to imitate the structure of the human auditory system. The impulse response of the Gammatone filter in the time domain is specified as follows [16]:
$$g(t) = a\,t^{\,n-1}\,e^{-2\pi b t}\cos(2\pi f_c t + \phi) \qquad (12)$$

where a is the amplitude; n is the order of the filter, which determines the slope of each filter; f_c is the center frequency; φ is the phase shift; and b is the bandwidth of the filter, which specifies the duration of the impulse response.
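The impulse response of Eq. (12) and a 64-channel filterbank can be sketched as follows. The ERB-based choice of the bandwidth b and the ERB-scale spacing of the centre frequencies are common conventions assumed here, since the paper does not state them.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.04, a=1.0, phi=0.0):
    """Gammatone impulse response, Eq. (12): a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)              # Glasberg-Moore ERB (assumed)
    b = 1.019 * erb
    return a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t + phi)

def gammatone_filterbank(n_filters=64, fs=16000, f_lo=50.0):
    """64 impulse responses with centre frequencies spaced uniformly on the ERB-rate scale."""
    erb_lo = 21.4 * np.log10(4.37e-3 * f_lo + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * (fs / 2.0) + 1.0)
    centres = (10.0 ** (np.linspace(erb_lo, erb_hi, n_filters) / 21.4) - 1.0) / 4.37e-3
    return [gammatone_ir(fc, fs) for fc in centres]
```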
III. PERCEPTUAL BASED GAMMATONE FREQUENCY CEPSTRAL COEFFICIENTS (PGFCC)
The proposed feature extraction involves modeling the minimum masking threshold based on the presence of maskers and maskees in every frame. Each frame has 512 points, and the frame shift is 384 points. Firstly, Hamming windowing is applied to reduce the discontinuities at the beginning and end of each frame. Then, the minimum masking threshold is computed based on the simultaneous (frequency) masking of the psychoacoustic model. The tone-like and noise-like components are determined from the normalized power spectral density using the 7dB local-maximum criterion, and irrelevant elements are removed. After that, the individual and global masking thresholds are calculated. After computing the global masking threshold, its spectral subsamples are mapped onto 32 uniform sub-bands to generate the minimum masking threshold. Then, the normalized FFT outputs are compared with the minimum masking threshold in each frame; if a spectral value is lower than the threshold, it is assigned the threshold value. These modified spectral features are passed through the Gaussian-shaped 64-channel Gammatone filterbank to warp the spectrum envelope. After getting the
results of the filterbank process, the discrete cosine transform (DCT) is applied to the coefficients to achieve maximum decorrelation among the feature vectors. The amplitude of the speech signal varies over time; these amplitude variations represent the short-term energy, which provides information about the time-dependent characteristics. The short-time spectral energy is calculated from the log filterbank energy of each frame. Finally, the cepstral coefficients 2-13 and the short-term spectral energy (PGFCC+Energy) are defined as the proposed front-end feature vectors. Figure 3 illustrates the detailed process flow of the proposed feature extraction technique.

Fig. 3: Process Flow of Proposed Perceptual Simultaneous Masking Effect based Cepstral Feature Technique.
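To make the overall flow of Figure 3 concrete, the single-frame sketch below strings the earlier (hypothetical) helpers together: masking-floored log spectrum, Gammatone channel energies, log compression, DCT, and the appended log energy. The filterbank is assumed to be a (64 x 257) matrix of per-channel weights on the FFT bins, mmt_db is the per-bin threshold expanded from the 32 sub-bands, and CMVN is omitted.

```python
import numpy as np
from scipy.fftpack import dct

def pgfcc_frame(frame, filterbank, mmt_db, n_fft=512):
    """One frame of the proposed PGFCC front end (sketch, not the authors' MATLAB code)."""
    windowed = frame * np.hamming(len(frame))
    psd_db = 10.0 * np.log10(np.abs(np.fft.rfft(windowed, n_fft)) ** 2 + 1e-12)
    psd_db = np.maximum(psd_db, mmt_db)                  # simultaneous-masking floor
    power = 10.0 ** (psd_db / 10.0)
    fb_energy = filterbank @ power                       # 64 Gammatone channel energies
    ceps = dct(np.log(fb_energy + 1e-12), type=2, norm='ortho')
    frame_energy = np.log(np.sum(fb_energy) + 1e-12)     # short-term log energy
    return np.append(ceps[1:13], frame_energy)           # 12 PGFCC (coeffs 2-13) + energy
```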
IV. FEATURE RECOGNITION ENGINE
After extracting the cepstral features from the speech signal, the feature recognition engine determines the most likely word sequences within the feature space learned from the training utterances. This section describes the statistical approaches used as recognition engines: the Gaussian Mixture Model (GMM) based hidden Markov model (HMM) and the cross entropy-based hybrid Deep Neural Network (DNN)-HMM model. The Kaldi toolkit is used to implement the acoustic model training for this study.
A. GMM-HMM Model
The Gaussian Mixture Model-based Hidden Markov Model (GMM-HMM) is one of the most popular statistical models for interpreting the sequential structure of speech. Each HMM state uses a mixture of Gaussians to model the spectral representation of speech. Each training stage is followed by an alignment between the acoustic feature vectors and the sound units [17].
Triphone training (PGFCC+∆+∆∆): This training concentrates on the contextual information between the left and right phonemes. The delta and delta-delta features represent the first-order and second-order derivatives.

Triphone training using LDA+MLLT: Linear discriminant analysis (LDA) takes the spliced PGFCC features and reduces the feature dimension to 40 for all data to produce the HMM states. The MLLT step then applies a linear transformation to obtain a significant adjustment for each individual speaker.

Speaker Adaptive Training using fMLLR: The primary goal of speaker adaptation is to modify the acoustic model parameters to match the features of the actual audio. Features are linearly transformed to normalize speaker variability using feature-space maximum likelihood linear regression (fMLLR).
B. DNN-HMM Model
This phase trains a DNN to provide posterior probability estimates for each HMM state given the observation sequence. The network is trained to optimize a given training objective function using the standard error back-propagation procedure. Typically, frame-level cross-entropy is employed as the objective, and optimization is carried out with mini-batch stochastic gradient descent (SGD) [18]. Parameter learning with the cross-entropy criterion proceeds in three steps. Firstly, a GMM-HMM model is trained using maximum likelihood estimation; it supplies the state priors and transition probabilities used in DNN-HMM training. Then, a forced alignment is generated by matching the acoustic feature vectors with the corresponding labels using the Viterbi algorithm with the GMM-HMM model. The resulting state labels are the learning targets of the DNN output layer, and the DNN weights are learned by minimizing the cross-entropy cost function given below,
$$C = -\sum_{i=1}^{Q} q_i \log p_i \qquad (13)$$

where C denotes the cross-entropy cost function, Q is the number of states, p_i is the softmax layer output for state i, and q_i is the target label for state i.
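For illustration, with one-hot targets the frame-level cross-entropy of Eq. (13) reduces to the negative log probability of the aligned state, as in the short sketch below (variable names are illustrative).

```python
import numpy as np

def cross_entropy(posteriors, target_states):
    """Eq. (13) with one-hot q: C = -log p_target, averaged over the mini-batch frames.

    `posteriors` is a (frames, states) softmax output; `target_states` holds the aligned labels."""
    probs = posteriors[np.arange(len(target_states)), target_states]
    return float(np.mean(-np.log(probs + 1e-12)))
```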
V. RESULTS AND DISCUSSION
The AURORA database was designed by the European Telecommunications Standards Institute (ETSI) to evaluate and standardize the performance of distributed speech recognition (DSR) systems in noisy environments. AURORA-2 [19] is based on the original TIDigits English connected-digits database released by the Linguistic Data Consortium (LDC) [20]. A total of 8,440 utterances are taken from the training part of TIDigits, spoken by 55 male and 55 female adult speakers. These utterances are equally split into twenty subsets, each containing 422 utterances. The twenty subsets represent four different noise situations, subway (recorded in a moving suburban train), exhibition (the atmosphere of a classical exhibition hall with a mixture of voices), babble (a mixture of several chattering voices) and car noise (recorded inside a running car), at different SNR levels (clean, 20dB, 15dB, 10dB, 5dB). For testing, 1,001 utterances from 52 male and 52 female speakers are taken from the testing part of the TIDigits database. A summary of the data usage is given in Table I.
TABLE I: Detail Description of Aurora-2 Speech Corpus

Category      | Description
--------------|-------------------------------------------------------------
Vocabulary    | continuous digit sequences (0-9) plus 'oh'
Sampling      | 44.1kHz, 16 bits, mono channel
Participants  | Male: 111 spks (21-70 ages); Female: 114 spks (17-59 ages)
Training      | 8,440 utts., multi-condition*
Testing       | 1,001 utts.; Subway, Babble, Exhibition, Car

* Multi-condition training under subway, exhibition hall, car and babble noise at clean condition and SNRs of 20dB, 15dB, 10dB and 5dB.
In this paper, connected-digit recognition is carried out to evaluate the noise robustness of the proposed feature extraction technique. The feature extraction part is implemented in MATLAB. The extracted acoustic features are passed to the Kaldi toolkit to build speaker-adaptive triphone models using the GMM-HMM and hybrid DNN-HMM techniques. Firstly, speaker adapted training (SAT), i.e., training on fMLLR-adapted features, is performed; in this training, the number of leaves is 300 and the total number of Gaussians is 3000. Then, the frame-level cross entropy-based DNN-HMM training is carried out with three hidden layers before the softmax. Each hidden layer has 378 neurons and the network has 272 output units. The initial learning rate is 0.008, and the mini-batch size is 256 by default.
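The topology described above can be written down as the following minimal PyTorch sketch; the input dimension and the ReLU nonlinearity are assumptions, since the paper does not state them, and the actual experiments were run with Kaldi rather than with this code.

```python
import torch.nn as nn

def build_dnn(input_dim=440):
    """Three hidden layers of 378 units and 272 output states, per the setup above."""
    layers, dim = [], input_dim
    for _ in range(3):
        layers += [nn.Linear(dim, 378), nn.ReLU()]
        dim = 378
    layers += [nn.Linear(dim, 272), nn.LogSoftmax(dim=-1)]
    return nn.Sequential(*layers)   # trained with mini-batch SGD (lr 0.008, batch size 256)
```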
TABLE III: Recognition Accuracy (%) of Proposed PGFCC Feature Extraction on Different Noise Situations and SNRs

Model | SNR  | Subway | Exhibition | Babble | Car   | Avg.
------|------|--------|------------|--------|-------|------
GMM   | 10dB | 97.36  | 95.16      | 95.19  | 95.82 | 95.88
GMM   | 5dB  | 92.08  | 89.63      | 85.13  | 87.92 | 88.69
GMM   | 0dB  | 76.11  | 71.46      | 55.50  | 57.08 | 65.04
GMM   | -5dB | 36.97  | 32.27      | 21.55  | 16.91 | 26.93
DNN   | 10dB | 98.89  | 98.52      | 96.86  | 98.30 | 98.14
DNN   | 5dB  | 95.73  | 95.40      | 91.05  | 95.41 | 94.40
DNN   | 0dB  | 86.64  | 85.22      | 71.80  | 83.00 | 81.67
DNN   | -5dB | 61.74  | 57.51      | 38.81  | 47.93 | 51.50
TABLE II: Recognition Accuracy (%) of MFCC Feature Extraction on Different Noise Situations and SNR Levels

Model | SNR  | Subway | Exhibition | Babble | Car   | Avg.
------|------|--------|------------|--------|-------|------
GMM   | 10dB | 96.68  | 93.37      | 96.61  | 95.56 | 95.5
GMM   | 5dB  | 91.56  | 84.45      | 86.94  | 87.06 | 87.5
GMM   | 0dB  | 71.72  | 62.82      | 54.87  | 46.82 | 59.05
GMM   | -5dB | 23.95  | 25.64      | 16.9   | 9.90  | 19.09
DNN   | 10dB | 98.71  | 97.72      | 97.97  | 98.54 | 98.24
DNN   | 5dB  | 95.92  | 93.86      | 93.02  | 94.54 | 94.34
DNN   | 0dB  | 82.56  | 80.99      | 71.19  | 75.40 | 77.54
DNN   | -5dB | 42.00  | 45.26      | 29.78  | 19.36 | 34.10
Table II reports the word recognition rate (%) of the MFCC technique over four different noise situations, namely subway, exhibition, babble, and car noise. With the GMM-HMM model, the average recognition accuracy over all noise situations is 95.55% at an SNR of 10dB, 87.50% at 5dB, 59.05% at 0dB and 19.09% at -5dB. Additionally, the cross entropy-based DNN-HMM model is evaluated to boost the performance of the system. Hybrid DNN-HMM systems combine the DNN's strong learning power with the HMM's sequential modeling ability to outperform existing GMM-HMM systems. In this experiment, the average recognition accuracy reached 98.24% at an SNR of 10dB, 94.34% at 5dB, 77.54% at 0dB and 34.10% at -5dB. However, conventional MFCC functions well in a quiet environment while its results degrade under background noise. The simultaneous masking effect based on the psychoacoustic model is therefore integrated into MFCC to overcome this drawback of conventional MFCC. By introducing the simultaneous masking effect into the feature extraction technique, the noise effect on the speech signal can be lessened even when the noise level is high, and irrelevant feature components can also be minimized.
Fig. 4: Overall noise validation of MFCC and the proposed PGFCC: (a) mean accuracy at different SNR levels; (b) mean accuracy for different noise types, using the cross-entropy based DNN-HMM acoustic model.
According to this experiment, our proposed feature extractor is robust against noise and outperforms MFCC, especially at the low SNRs of 0dB and -5dB. Table III reports the recognition accuracy of the proposed PGFCC technique. With the GMM-HMM model, the recognition accuracy reaches 95.88% at an SNR of 10dB, 88.69% at 5dB, 65.04% at 0dB and 26.93% at -5dB. When the DNN-HMM model is built, the average accuracy rises to 98.14% at 10dB, 94.40% at 5dB, 81.67% at 0dB and 51.50% at -5dB. As illustrated in Figure 4(a), compared with conventional MFCC, the average relative improvements with the hybrid DNN-HMM acoustic model are 6.44% at an SNR of 5dB, 25.57% at 0dB, and 91.2% at -5dB. However, the recognition performance decreases slightly at 10dB; the relative decrement is 0.1% using DNN-HMM. In addition, the overall accuracy (%) was examined to see how the results vary across noise situations over all SNR levels. As seen in Figure 4(b), the average accuracy of the proposed PGFCC is higher for all types of noise. Using the DNN-HMM model, the relative improvements are 7.5%, 5.91%, 2.24%, and 12.78% under subway, exhibition, babble and car noise, respectively.

Although the relative improvement is not very significant at higher SNR levels compared with conventional MFCC, we observed that the proposed algorithm achieves a more substantial improvement at lower SNR levels, especially at 0dB and -5dB. Moreover, the proposed algorithm does not perform well on babble noise at 10dB and 5dB. Since babble noise is by nature a mixture of several chattering voices, such sounds may confuse auditory perception. Our proposed method finds the sensitive threshold of perception in the hearing mechanism based on auditory masking and influences the amplitude of the signal. As a consequence, at higher SNR levels (10dB and 5dB) under babble noise, our proposed method may lead to a substantial loss of spectral information and possibly distort the original clean signal.
VI. CONCLUSION
In this paper, we propose a modified front-end algorithm based on MFCC that incorporates simultaneous masking and Gammatone frequency integration. This method imitates the nature of the human auditory system to handle noise-adverse situations. The Aurora-2 database is used to carry out the experiments, and the GMM-HMM and DNN-HMM recognizers are used to verify the robustness of the proposed front-end algorithm. Although the word recognition rates of our proposed method are comparable with MFCC at high SNRs, they significantly outperform MFCC at lower SNRs, especially at 0dB and -5dB. The highest accuracy is obtained with DNN-HMM, which provides 98.14% at 10dB, 94.40% at 5dB, 81.67% at 0dB and 51.50% at -5dB. Although the proposed algorithm did not manage to improve the already satisfying accuracy at high SNRs of babble noise, its recognition rate was higher than MFCC in noisier conditions. Future work will provide a detailed analysis of the proposed algorithm with further improvements for clean conditions and babble noise, and the performance will also be analyzed on a large-vocabulary continuous speech recognition system.
ACKNOWLEDGMENT
This work was financially supported by ASEAN University Network/Southeast Asia Engineering Education Develop-
ment Network (AUN/SEED-Net), JICA. This study was also
supported in parts by the Ministry of Education, Science,
Sports and Culture, Grant-in-Aid for Scientific Research (B)
(18H0321) and the Ministry of Internal Affairs and Communications for SCOPE Program (185001003).
REFERENCES
[1] S. A. Majeed, H. Husain, S. A. Samad, and T. F. Idbeaa, “Mel
frequency cepstral coefficients (Mfcc) feature extraction enhancement
in the application of speech recognition: A comparison study,” J. Theor.
Appl. Inf. Technol., vol. 79, no. 1, pp. 38–56, 2015.
[2] B. T. Sai, I. C. Yadav, S. Shahnawazuddin, and G. Pradhan, “Enhancing
pitch robustness of speech recognition system through spectral smoothing,” 12th Int. Conf. Signal Process. Commun. SPCOM, pp. 242–246,
2018.
[3] S. K. Gaikwad, B. W. Gawali, and P. Yannawar, “A Review on Speech
Recognition Technique,” Int. J. Comput. Appl., vol. 10, no. 3, pp. 16–24,
2010.
[4] N. Wada, N. Hayasaka, S. Yoshizawa, and Y. Miyanaga, “Robust speech
recognition with feature extraction using combined method of RSF
and DRA,” in IEEE International Symposium on Communications and
Information Technologies: ISCIT, 2004, vol. 2, pp. 1001–1004.
[5] S. J. Arora and R. P. Singh, “Automatic speech recognition: A Review,”
Int. J. Comput. Appl., vol. 60, no. 9, pp. 34–44, 2012.
[6] A. L. Georgescu, H. Cucu, C. Burileanu, “SpeeD’s DNN approach to
Romanian speech recognition,” in International Conference on Speech
Technology and Human-Computer Dialogue (SpeD),” 2017, pp. 1–8.
[7] K. K. Tomchuk, “Spectral Masking in MFCC Calculation for Noisy
Speech,” in Wave Electronics and its Application in Information and
Telecommunication Systems, WECONF, 2018, pp. 1–4.
[8] N. Upadhyay and H. G. Rosales, “Robust Recognition of English Speech
in Noisy Environments Using Frequency Warped Signal Processing,”
Natl. Acad. Sci. Lett., vol. 41, no. 1, pp. 15–22, Feb. 2018.
[9] H. F. Pardede, “On noise robust feature for speech recognition based on
power function family,” in International Symposium on Intelligent Signal
Processing and Communication Systems, ISPACS, 2015, pp. 386–391.
[10] D. Darabian, H. Marvi, and M. S. Noughabi, “Improving the performance of MFCC for Persian robust speech recognition,” Journal of
Artificial Intelligence. Data Mining, vol. 3, no. 2, pp. 149–156, 2015.
[11] Y. Lin and W. H. Lin, “Audio watermark: A comprehensive foundation
using MATLAB,” 2015.
[12] H. K. Maganti and M. Matassoni, “A perceptual masking approach for
noise robust speech recognition,” Eurasip Journal of Audio, Speech,
Music Processing, vol. 2012, no. 1, pp. 1–9, 2012.
[13] H. M. S. Naing, R. Hidayat, B. Winduratna, and Y. Miyanaga, “Psychoacoustical masking effect-based feature extraction for robust speech
recognition,” International Journal of Innovative Computing, Information
and Control, vol. 15, no. 5, pp. 1641–1654, 2019.
[14] M. Russo, M. Stella, M. Sikora, and V. Pekić, “Robust cochlear-modelbased speech recognition,” Computers, vol. 8, no. 1, 2019.
[15] G. K. Liu, “Evaluating gammatone frequency cepstral coefficients with
neural networks for emotion recognition from speech,” arXiv preprint
arXiv:1806.09010, 2018.
[16] J. Qi, D. Wang, Y. Jiang, and R. Liu, “Auditory features based on
Gammatone filters for robust speech recognition,” in Proceedings - IEEE
International Symposium on Circuits and Systems, 2013, pp. 305–308.
[17] P. Upadhyaya, S. K. Mittal, Y. V. Varshney, O. Farooq and M. R. Abidi,
“Speaker adaptive model for hindi speech using Kaldi speech recognition
toolkit,” International Conference on Multimedia, Signal Processing and
Communication Technologies (IMPACT), 2017, pp.222-226.
[18] V. V. Vegesna, K. Gurugubelli, H. K. Vydana, B. Pulugandla, M.
Shrivastava, and A. K. Vuppala, “DNN-HMM acoustic modeling for
large vocabulary Telugu speech recognition,” in International Conference
on Mining Intelligence and Knowledge Exploration, 2017, pp. 189–197.
[19] D. Pearce and H. G. Hirsch, “The AURORA experimental framework
for the performance evaluation of speech recognition systems under
noisy conditions,” in 6th International Conference on Spoken Language
Processing, ICSLP, 2000.
[20] R. Leonard, “A database for speaker-independent digit recognition,” in
ICASSP ’84. IEEE International Conference on Acoustics, Speech, and
Signal Processing, 1984, vol. 9, no. 1, pp. 328–331.