SPEAKER RECOGNITION FROM RAW WAVEFORM WITH SINCNET
Mirco Ravanelli, Yoshua Bengio∗
Mila, Université de Montréal, ∗ CIFAR Fellow
ABSTRACT
Deep learning is progressively gaining popularity as a viable
alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural
Networks (CNNs) when fed by raw speech samples directly.
Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and
formants. Proper design of the neural network is crucial to
achieve this goal.
This paper proposes a novel CNN architecture, called
SincNet, that encourages the first convolutional layer to
discover more meaningful filters. SincNet is based on
parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of
each filter, only low and high cutoff frequencies are directly
learned from data with the proposed method. This offers a
very compact and efficient way to derive a customized filter
bank specifically tuned for the desired application.
Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed
architecture converges faster and performs better than a standard CNN on raw waveforms.
Index Terms— speaker recognition, convolutional neural
networks, raw samples.
1. INTRODUCTION
Speaker recognition is a very active research area with notable applications in various fields such as biometric authentication, forensics, security, speech recognition, and speaker
diarization, which has contributed to steady interest in
this discipline [1]. Most state-of-the-art solutions are based
on the i-vector representation of speech segments [2], which
contributed to significant improvements over previous Gaussian Mixture Model-Universal Background Models (GMM-UBMs) [3]. Deep learning has shown remarkable success
in numerous speech tasks [4–7], including recent studies in
speaker recognition [8, 9]. Deep Neural Networks (DNNs)
have been used within the i-vector framework to compute
Baum-Welch statistics [10], or for frame-level feature extraction [11]. DNNs have also been proposed for direct discrim-
inative speaker classification, as witnessed by the recent literature on this topic [12–15]. Most past attempts, however, employed hand-crafted features such as FBANK and
MFCC coefficients [12, 16, 17]. These engineered features
were originally designed from perceptual evidence, and there
are no guarantees that such representations are optimal for all
speech-related tasks. Standard features, for instance, smooth
the speech spectrum, possibly hindering the extraction of crucial narrow-band speaker characteristics such as pitch and formants. To mitigate this drawback, some recent works have
proposed directly feeding the network with spectrogram bins
[18–20] or even with raw waveforms [21–30]. CNNs are the
most popular architecture for processing raw speech samples,
since weight sharing, local filters, and pooling help discover
robust and invariant representations.
We believe that one of the most critical parts of current
waveform-based CNNs is the first convolutional layer. This
layer not only deals with high-dimensional inputs, but is also
more affected by vanishing gradient problems, especially
when employing very deep architectures. The filters learned
by the CNN often take noisy and incongruous multi-band
shapes, especially when few training samples are available.
These filters certainly make some sense for the neural network, but do not appeal to human intuition, nor appear to lead
to an efficient representation of the speech signal.
To help the CNNs discover more meaningful filters in the
input layer, this paper proposes to add some constraints on
their shape. Compared to standard CNNs, where the filter-bank characteristics depend on several parameters (each element of the filter vector is directly learned), SincNet convolves the waveform with a set of parametrized sinc functions that implement band-pass filters. The low and high cutoff frequencies are the only filter parameters learned
from data. This solution still offers considerable flexibility,
but forces the network to focus on high-level tunable parameters with broad impact on the shape and bandwidth of the
resulting filter.
Our experiments are carried out under challenging but realistic conditions, characterized by minimal training data (i.e.,
12-15 seconds for each speaker) and short test sentences (lasting from 2 to 6 seconds). Results achieved on a variety of
datasets show that the proposed SincNet converges faster and
achieves better end task performance than a more standard
CNN. Under the considered experimental setting, our archi-
tecture also outperforms a more traditional speaker recognition system based on i-vectors.
The remainder of the paper is organized as follows. The
SincNet architecture is described in Sec. 2. Sec. 3 discusses
the relation to prior work. The experimental setup and results
are outlined in Sec. 4 and Sec. 5 respectively. Finally, Sec. 6
discusses our conclusions.
2. THE SINCNET ARCHITECTURE
The first layer of a standard CNN performs a set of time-domain convolutions between the input waveform and some Finite Impulse Response (FIR) filters [31]. Each convolution is defined as follows (note that most deep learning toolkits actually compute correlation rather than convolution; the resulting flipped, i.e., mirrored, filters do not affect the results):
y[n] = x[n] ∗ h[n] = Σ_{l=0}^{L−1} x[l] · h[n − l]    (1)
where x[n] is a chunk of the speech signal, h[n] is the filter
of length L, and y[n] is the filtered output. In standard CNNs,
all the L elements (taps) of each filter are learned from data.
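For concreteness, a minimal sketch of the operation in Eq. (1) is shown below (illustrative values only; in a standard CNN the taps h[n] would be learned, and many filters are applied in parallel):

```python
import numpy as np

x = np.random.randn(3200)            # x[n]: a 200 ms speech chunk at 16 kHz (illustrative)
h = np.random.randn(251)             # h[n]: one FIR filter of length L = 251, all taps free
y = np.convolve(x, h, mode="same")   # y[n] = sum_l x[l] * h[n - l], as in Eq. (1)
```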
Conversely, the proposed SincNet (depicted in Fig. 1) performs the convolution with a predefined function g that depends only on a few learnable parameters θ, as highlighted in
the following equation:
y[n] = x[n] ∗ g[n, θ]    (2)
A reasonable choice, inspired by standard filtering in digital signal processing, is to define g such that a filter-bank
composed of rectangular bandpass filters is employed. In the
frequency domain, the magnitude of a generic bandpass filter
can be written as the difference between two low-pass filters:
G[f, f1, f2] = rect(f / (2f2)) − rect(f / (2f1)),    (3)
where f1 and f2 are the learned low and high cutoff frequencies, and rect(·) is the rectangular function in the magnitude frequency domain (its phase is considered to be linear). After returning to the time domain (using the inverse Fourier transform [31]), the reference function g becomes:
g[n, f1, f2] = 2f2 sinc(2πf2 n) − 2f1 sinc(2πf1 n),    (4)
where the sinc function is defined as sinc(x) = sin(x)/x.
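As an illustration, the band-pass function of Eq. (4) can be generated as follows (a sketch assuming cutoff frequencies expressed as fractions of the sampling rate and a symmetric index n centered at zero; note that numpy's sinc is the normalized one, so np.sinc(2fn) equals sin(2πfn)/(2πfn)):

```python
import numpy as np

def sinc_bandpass(f1, f2, L):
    """Time-domain band-pass filter of Eq. (4).

    f1, f2: low/high cutoffs as fractions of the sampling rate (0 <= f1 < f2 <= 0.5).
    L: odd filter length, so that n is symmetric around 0.
    """
    n = np.arange(L) - (L - 1) / 2
    return 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)

# e.g., a 300-3400 Hz band at fs = 16 kHz
g = sinc_bandpass(300 / 16000, 3400 / 16000, L=251)
```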
The cut-off frequencies can be initialized randomly in the
range [0, fs /2], where fs represents the sampling frequency
of the input signal. As an alternative, filters can be initialized with the cutoff frequencies of the mel-scale filter-bank,
which has the advantage of directly allocating more filters
in the lower part of the spectrum, where many crucial clues
about the speaker identity are located.

Fig. 1: Architecture of SincNet.

To ensure f1 ≥ 0 and
f2 ≥ f1 , the previous equation is actually fed by the following parameters:
f1^abs = |f1|    (5)
f2^abs = f1^abs + |f2 − f1|    (6)
Note that no bounds have been imposed to force f2 to
be smaller than the Nyquist frequency, since we observed that
this constraint is naturally fulfilled during training. Moreover,
the gain of each filter is not learned at this level. This parameter is managed by the subsequent layers, which can easily
attribute more or less importance to each filter output.
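The following sketch illustrates the constraint of Eqs. (5)-(6), together with a possible mel-scale initialization of the cutoff frequencies (the mel formulas are the standard ones; the exact initialization used in the released code may differ):

```python
import numpy as np

def mel_init_cutoffs(n_filt=80, fs=16000, f_min=30.0):
    """Initialize (f1, f2) pairs in Hz on the mel scale (illustrative)."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = to_hz(np.linspace(to_mel(f_min), to_mel(fs / 2), n_filt + 1))
    return edges[:-1], edges[1:]          # low and high cutoff of each filter

def constrain_cutoffs(f1, f2):
    """Eqs. (5)-(6): enforce f1 >= 0 and f2 >= f1 on the learned parameters."""
    f1_abs = np.abs(f1)
    f2_abs = f1_abs + np.abs(f2 - f1)
    return f1_abs, f2_abs
```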
An ideal bandpass filter (i.e., a filter where the passband
is perfectly flat and the attenuation in the stopband is infinite)
requires an infinite number of elements L. Any truncation of
g thus inevitably leads to an approximation of the ideal filter,
characterized by ripples in the passband and limited attenuation in the stopband. A popular solution to mitigate this issue
is windowing [31]. Windowing is performed by multiplying
the truncated function g with a window function w, which
aims to smooth out the abrupt discontinuities at the ends of g:
gw[n, f1, f2] = g[n, f1, f2] · w[n].    (7)
This paper uses the popular Hamming window [32], defined
as follows:
w[n] = 0.54 − 0.46 · cos(2πn / L).    (8)
The Hamming window is particularly suitable to achieve high
frequency selectivity [32]. However, results not reported here
reveal no significant performance difference when adopting
other functions, such as Hann, Blackman and Kaiser windows.
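A sketch of the windowing step of Eqs. (7)-(8), implementing the Hamming window exactly as written above (with the division by L):

```python
import numpy as np

def hamming_windowed(g):
    """Apply the Hamming window of Eq. (8) to a truncated filter g, as in Eq. (7)."""
    L = len(g)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)
    return g * w
```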
All operations involved in SincNet are fully differentiable
and the cutoff frequencies of the filters can be jointly optimized with other CNN parameters using Stochastic Gradient
Descent (SGD) or other gradient-based optimization routines.
As shown in Fig. 1, a standard CNN pipeline (pooling, normalization, activations, dropout) can be employed after the
first sinc-based convolution. Multiple standard convolutional
or fully-connected layers can then be stacked together to finally perform a speaker classification with a softmax classifier.
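Putting the pieces together, a minimal PyTorch-style sketch of a sinc-based first layer is given below. This is not the released SincNet code: the initialization, padding, and tensor layout are illustrative assumptions; only the cutoff frequencies are learnable, and the windowed filters of Eqs. (4)-(8) are rebuilt from them at every forward pass.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv1d(nn.Module):
    """Sketch of a sinc-parameterized convolutional layer (illustrative)."""

    def __init__(self, out_channels=80, kernel_size=251):
        super().__init__()
        assert kernel_size % 2 == 1, "odd length keeps the filters symmetric"
        self.kernel_size = kernel_size
        # hypothetical initialization of normalized cutoffs (mel-scale init is also possible)
        low = torch.linspace(0.01, 0.40, out_channels)
        self.f1 = nn.Parameter(low)                     # low cutoff / fs
        self.f2 = nn.Parameter(low + 0.05)              # high cutoff / fs
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        ham = 0.54 - 0.46 * torch.cos(2 * math.pi * torch.arange(kernel_size) / kernel_size)
        self.register_buffer("window", ham)             # Eq. (8)

    def forward(self, x):                               # x: (batch, 1, time)
        f1 = torch.abs(self.f1)                         # Eq. (5)
        f2 = f1 + torch.abs(self.f2 - self.f1)          # Eq. (6)

        def sinc(z):                                    # sin(z)/z with sinc(0) = 1
            z_safe = torch.where(z == 0, torch.ones_like(z), z)
            return torch.where(z == 0, torch.ones_like(z), torch.sin(z) / z_safe)

        a1 = 2 * math.pi * f1.unsqueeze(1) * self.n     # (out_channels, kernel_size)
        a2 = 2 * math.pi * f2.unsqueeze(1) * self.n
        g = 2 * f2.unsqueeze(1) * sinc(a2) - 2 * f1.unsqueeze(1) * sinc(a1)   # Eq. (4)
        filters = (g * self.window).unsqueeze(1)        # (out_channels, 1, kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)
```

Standard pooling, layer normalization, leaky ReLU activations, and further CNN/DNN layers can then be stacked on top of this module, as in Fig. 1.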
2.1. Model properties
The proposed SincNet has some remarkable properties:
• Fast Convergence: SincNet forces the network to focus only on the filter parameters with major impact
on performance. The proposed approach actually implements a natural inductive bias, utilizing knowledge
about the filter shape (similar to feature extraction
methods generally deployed on this task) while retaining flexibility to adapt to data. This prior knowledge
makes learning the filter characteristics much easier,
helping SincNet to converge significantly faster to a
better solution.
• Few Parameters: SincNet drastically reduces the
number of parameters in the first convolutional layer.
For instance, if we consider a layer composed of F
filters of length L, a standard CNN employs F · L
parameters, against the 2F considered by SincNet. If
F = 80 and L = 100, we employ 8k parameters for
the CNN and only 160 for SincNet. Moreover, if we
double the filter length L, a standard CNN doubles its
parameter count (e.g., we go from 8k to 16k), while
SincNet has an unchanged parameter count (only two
parameters are employed for each filter, regardless of its
length L). This offers the possibility to derive very selective filters with many taps, without actually adding
parameters to the optimization problem. Moreover,
the compactness of the SincNet architecture makes it
suitable in the few-sample regime (see the parameter-count sketch after this list).
• Computational Efficiency: The proposed function g
is symmetric. This means we can perform convolution
in a very efficient way by only considering one side of
the filter and inheriting the results for the other half.
This saves 50% of the first-layer computation over a
standard CNN.
• Interpretability: The SincNet feature maps obtained
in the first convolutional layer are considerably more interpretable and human-readable than those obtained with other approaches.
The filter bank, in fact, only depends on parameters
with a clear physical meaning.
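As referenced in the Few Parameters property above, the F · L versus 2F counts can be checked with a trivial sketch:

```python
def first_layer_params(n_filters, filter_length, sinc=False):
    """F*L parameters for a standard CNN first layer, 2*F (two cutoffs per filter) for SincNet."""
    return 2 * n_filters if sinc else n_filters * filter_length

print(first_layer_params(80, 100))               # 8000 for the standard CNN
print(first_layer_params(80, 100, sinc=True))    # 160 for SincNet
print(first_layer_params(80, 200))               # 16000: doubling L doubles the CNN count
print(first_layer_params(80, 200, sinc=True))    # 160: SincNet is unaffected by L
```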
3. RELATED WORK
Several works have recently explored the use of low-level
speech representations to process audio and speech with
CNNs. Most prior attempts exploit magnitude spectrogram
features [18–20, 33–35]. Although spectrograms retain more
information than standard hand-crafted features, their design
still requires careful tuning of some crucial hyper-parameters,
such as the duration, overlap, and type of the frame window, as well as the number of frequency bins. For this reason,
a more recent trend is to directly learn from raw waveforms,
thus completely avoiding any feature extraction step. This
approach has shown promise in speech recognition [21–25], emotion recognition [26], speaker recognition [28], spoofing detection [27], and speech synthesis [29, 30]. Similar to SincNet,
some previous works have proposed to add constraints on the
CNN filters, for instance forcing them to work on specific
bands [33, 34]. Unlike the proposed approach, the
latter works operate on spectrogram features and still learn
all the L elements of the CNN filters. An idea related to the
proposed method has been recently explored in [35], where
a set of parameterized Gaussian filters is employed. This approach operates in the spectrogram domain, while SincNet directly operates on the raw waveform in the time domain.
To the best of our knowledge, this study is the first to show
the effectiveness of the proposed sinc filters for time-domain
audio processing from raw waveforms using convolutional
neural networks. Several past works target speech recognition, while our study specifically considers a speaker recognition application. The compact filters learned by SincNet are
particularly suitable for speaker recognition tasks, especially
in a realistic scenario characterized by few seconds of training
data for each speaker and short sentences for testing.
4. EXPERIMENTAL SETUP
The proposed SincNet has been evaluated on different corpora
and compared to numerous speaker recognition baselines. In
the spirit of reproducible research, we perform most experiments using publicly available data such as Librispeech, and
release the code of SincNet on GitHub (https://github.com/mravanelli/SincNet/). In the following sections, an overview of the experimental settings is provided.

Fig. 2: Examples of filters learned by a standard CNN (a) and by the proposed SincNet (b), using the Librispeech corpus. The first row reports the filters in the time domain, while the second one shows their magnitude frequency response.
4.1. Corpora
To provide experimental evidence on datasets characterized
by different numbers of speakers, this paper considers the
TIMIT (462 spks, train chunk) [36] and Librispeech (2484
spks) [37] corpora. Non-speech intervals at the beginning
and end of each sentence were removed. The Librispeech
sentences with internal silences lasting more than 125 ms
were split into multiple chunks. To address text-independent
speaker recognition, the calibration sentences of TIMIT (i.e.,
the utterances with the same text for all speakers) have been
removed. For TIMIT, five sentences for each speaker were used for training, while the remaining three were used for testing. For the Librispeech corpus, the training and test material was randomly selected so as to provide
12-15 seconds of training material for each speaker and test
sentences lasting 2-6 seconds.
4.2. SincNet Setup
The waveform of each speech sentence was split into chunks
of 200 ms (with 10 ms overlap), which were fed into the SincNet architecture. The first layer performs sinc-based convolutions as described in Sec. 2, using 80 filters of length L = 251
samples. The architecture then employs two standard convolutional layers, both using 60 filters of length 5. Layer normalization [38] was used for both the input samples and for
all convolutional layers (including the SincNet input layer).
Next, three fully-connected layers composed of 2048 neurons
and normalized with batch normalization [39] were applied.
All hidden layers use leaky-ReLU [40] non-linearities. The
parameters of the sinc-layer were initialized using mel-scale
cutoff frequencies, while the rest of the network was initialized with the well-known “Glorot” initialization scheme [41].
Frame-level speaker classification was obtained by applying
a softmax classifier, providing a set of posterior probabilities
over the targeted speakers. A sentence-level classification was
simply derived by averaging the frame predictions and voting
for the speaker who maximizes the average posterior.
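A sketch of this sentence-level decision rule (average the frame-level posteriors, then pick the speaker with the highest average):

```python
import numpy as np

def sentence_decision(frame_posteriors):
    """frame_posteriors: (n_frames, n_speakers) softmax outputs for one sentence."""
    return int(np.argmax(frame_posteriors.mean(axis=0)))
```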
Training used the RMSprop optimizer, with a learning
rate lr = 0.001, α = 0.95, ε = 10^−7, and minibatches of
size 128. All the hyper-parameters of the architecture were
tuned on TIMIT, then inherited for Librispeech as well.
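For reference, the optimizer configuration above corresponds to the following PyTorch call (the `model` object is a placeholder standing in for the full SincNet stack):

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 462)   # placeholder module; the real model is the SincNet described above
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.95, eps=1e-7)
```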
The speaker verification system was derived from the
speaker-id neural network considering two possible setups.
First, we consider the d-vector framework [12, 20], which
relies on the output of the last hidden layer and computes
the cosine distance between the test and the claimed-speaker d-vectors. As an alternative solution (denoted in the following
as DNN-class), the speaker verification system can directly
take the softmax posterior score corresponding to the claimed
identity. The two approaches will be compared in Sec. 5.
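The two scoring strategies can be summarized with a short sketch (hypothetical helper names; d-vectors are the last-hidden-layer activations of the speaker-id network):

```python
import numpy as np

def dvector_score(test_dvec, claimed_dvec):
    """Cosine similarity between the test d-vector and the claimed speaker's d-vector."""
    return np.dot(test_dvec, claimed_dvec) / (
        np.linalg.norm(test_dvec) * np.linalg.norm(claimed_dvec))

def dnn_class_score(softmax_posteriors, claimed_id):
    """DNN-class scoring: the softmax posterior assigned to the claimed identity."""
    return softmax_posteriors[claimed_id]
```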
To perform an accurate evaluation, 10 utterances from
impostors were randomly selected for each sentence coming
from a genuine speaker. Impostors were taken from a speaker
pool different from that used for training the speaker-id network.
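Verification performance is then summarized by the Equal Error Rate. The sketch below shows how the EER can be computed from genuine and impostor scores (assuming higher scores indicate the genuine speaker; the exact operating-point interpolation may differ from the toolkit actually used):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Return the EER by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])   # false acceptance
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])     # false rejection
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```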
Fig. 3: Cumulative frequency response of the SincNet and CNN filters (normalized filter sum vs. frequency [Hz]).

Fig. 4: Frame Error Rate (%) of SincNet and CNN models over various training epochs. Results are reported on TIMIT.
4.3. Baseline Setups
We compared SincNet with several alternative systems. First,
we considered a standard CNN fed by the raw waveform. This
network is based on the same architecture as SincNet, but replacing the sinc-based convolution with a standard one.
A comparison with popular hand-crafted features was
also performed. To this end, we computed 39 MFCCs (13
static+∆+∆∆) and 40 FBANKs using the Kaldi toolkit [42].
These features, computed every 25 ms with 10 ms overlap,
were gathered to form a context window of approximately
200 ms (i.e., a context similar to that of the considered
waveform-based neural network). A CNN was used for
FBANK features, while a Multi-Layer Perceptron (MLP) was
used for MFCCs (CNNs exploit local correlation across features and cannot be effectively used with uncorrelated MFCC features). Layer normalization was used for the
FBANK network, while batch normalization was employed
for the MFCC one. The hyper-parameters of these networks
were also tuned using the aforementioned approach.
For speaker verification experiments, we also considered an i-vector baseline. The i-vector system was implemented with the SIDEKIT toolkit [43]. The GMM-UBM
model, the Total Variability (TV) matrix, and the Probabilistic Linear Discriminant Analysis (PLDA) were trained on the
Librispeech data (avoiding test and enrollment sentences).
GMM-UBM was composed of 2048 Gaussians, and the rank
of the TV and PLDA eigenvoice matrices was 400. The enrollment and test phases were conducted on Librispeech using the
same set of speech segments used for DNN experiments.
5. RESULTS
This section reports the experimental validation of the proposed SincNet. First, we perform a comparison between the
filters learned by a SincNet and by a standard CNN. We then
compare our architecture with other competitive systems on
both speaker identification and verification tasks.
5.1. Filter Analysis
Inspecting the learned filters is a valuable practice that provides insight into what the network is actually learning. Fig.
2 shows some examples of filters learned by a standard CNN
(Fig. 2a) and by the proposed SincNet (Fig. 2b) using the Librispeech dataset (the frequency response is plotted between 0
and 4 kHz). As observed in the figures, the standard CNN
does not always learn filters with a well-defined frequency
response. In some cases the frequency response looks noisy
(see the first filter of Fig. 2a), while in others it assumes multi-band shapes (see the third filter of the CNN plot). SincNet, instead, is specifically designed to implement rectangular bandpass filters, leading to more meaningful filters.
Beyond a qualitative inspection, it is important to highlight which frequency bands are covered by the learned filters.
Fig. 3 shows the cumulative frequency response of the filters
learned by SincNet and CNN. Interestingly, there are three
main peaks which clearly stand out from the SincNet curve
(see the red line in the figure). The first one corresponds to the
pitch region (the average pitch is 133 Hz for males and 234 Hz for females). The second peak (approximately located at
500 Hz) mainly captures first formants, whose average value
over the various English vowels is indeed 500 Hz. Finally,
the third peak (ranging from 900 to 1400 Hz) captures some
important second formants, such as the second formant of the
vowel /a/, which is located on average at 1100 Hz. This
filter-bank configuration indicates that SincNet has successfully adapted its characteristics to address speaker identification. Conversely, the standard CNN does not exhibit such a
meaningful pattern: the CNN filters tend to correctly focus
on the lower part of the spectrum, but peaks tuned on first and
second formants do not clearly appear. As one can observe
from Fig. 3, the CNN curve stands above the SincNet one.
SincNet, in fact, learns filters that are, on average, more selective than CNN ones, possibly better capturing narrow-band
speaker clues.
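The cumulative response of Fig. 3 can be approximated from the learned first-layer filters as follows (a sketch; the exact normalization used in the figure is an assumption):

```python
import numpy as np

def cumulative_response(filters, fs=16000, n_fft=1024):
    """filters: (n_filters, filter_length) array of learned first-layer filters.
    Returns the frequency axis in Hz and the peak-normalized sum of magnitude responses."""
    mag = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    cum = mag.sum(axis=0)
    return np.fft.rfftfreq(n_fft, d=1.0 / fs), cum / cum.max()
```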
5.2. Speaker Identification
Fig. 4 shows the learning curves of SincNet compared with
that of a standard CNN. These results, achieved on the TIMIT
dataset, highlight a faster decrease of the Frame Error Rate
(FER%) when SincNet is used. Moreover, SincNet converges to better performance, leading to a FER of 33.0%
against a FER of 37.7% achieved with the CNN baseline.
              TIMIT    LibriSpeech
  DNN-MFCC     0.99        2.02
  CNN-FBANK    0.86        1.55
  CNN-Raw      1.65        1.00
  SINCNET      0.85        0.96

Table 1: Sentence Error Rate (SER%) of speaker identification systems trained on TIMIT (462 spks) and Librispeech (2484 spks). SincNet outperforms the competing alternatives.
Table 1 reports the achieved Sentence Error Rates (SER%).
The table shows that SincNet outperforms other systems on
both TIMIT and Librispeech datasets. The gap with a standard CNN fed by raw waveform is particularly large on
TIMIT, confirming the effectiveness of SincNet when few
training data are available. Although this gap is reduced
when LibriSpeech is used, we still observe a 4% relative
improvement that is also obtained with faster convergence
(1200 vs 1800 epochs). Standard FBANKs provide results
comparable to SincNet only on TIMIT, but are significantly
worse than our architecture when using Librispeech. With few
training data, the network cannot discover filters much better
than FBANKs, but with more data a customized filter-bank is
learned and exploited to improve the performance.
5.3. Speaker Verification
As a last experiment, we extend our validation to speaker
verification. Table 2 reports the Equal Error Rate (EER%)
achieved with the Librispeech corpus. All DNN models show
promising performance, leading to an EER lower than 1% in
all cases. The table also highlights that SincNet outperforms
the other models, showing a relative performance improvement of about 11% over the standard CNN model. DNN-class
models perform significantly better than d-vectors. Despite
the effectiveness of the latter approach, a new DNN model must be trained (or fine-tuned) for each new speaker added into the pool [28]. This makes the DNN-class approach better performing, but less flexible, than d-vectors.
              d-vector    DNN-class
  DNN-MFCC      0.88         0.72
  CNN-FBANK     0.60         0.37
  CNN-Raw       0.58         0.36
  SINCNET       0.51         0.32

Table 2: Speaker verification Equal Error Rate (EER%) on Librispeech for the different systems. SincNet outperforms the competing alternatives.
For the sake of completeness, experiments have also been
conducted with standard i-vectors. Although a detailed comparison with this technology is out of the scope of this paper, it is worth noting that our best i-vector system achieves
an EER of 1.1%, rather far from what is achieved with DNN systems. It is well-known in the literature that i-vectors provide competitive performance when more training material
is used for each speaker and when longer test sentences are
employed [44–46]. Under the challenging conditions faced in
this work, neural networks achieve better generalization.
6. CONCLUSIONS AND FUTURE WORK
This paper proposed SincNet, a neural architecture for directly processing waveform audio. Our model, inspired by
the way filtering is conducted in digital signal processing, imposes constraints on the filter shapes through efficient parameterization. SincNet has been extensively evaluated on challenging speaker identification and verification tasks, showing
performance benefits for all considered corpora.
Beyond performance improvements, SincNet also significantly improves convergence speed over a standard CNN, and
is more computationally efficient due to exploitation of filter symmetry. Analysis of the SincNet filters reveals that the
learned filter-bank is tuned to precisely extract some known
important speaker characteristics, such as pitch and formants.
In future work, we would like to evaluate SincNet on other
popular speaker recognition tasks, such as VoxCeleb. Although this study targeted speaker recognition only, we believe that the proposed approach defines a general paradigm
to process time-series and can be applied in numerous other
fields. Our future efforts will thus be devoted to extending it to
other tasks, such as speech recognition, emotion recognition,
speech separation, and music processing.
Acknowledgement
We would like to thank Gautam Bhattacharya, Kyle Kastner,
Titouan Parcollet, Dmitriy Serdyuk, Maurizio Omologo, and
Renato De Mori for their helpful comments. This research
was enabled in part by support provided by Calcul Québec
and Compute Canada.
7. REFERENCES
[1] H. Beigi, Fundamentals of Speaker Recognition, Springer, 2011.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and
P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[3] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn,
“Speaker verification using adapted Gaussian mixture
models,” Digital Signal Processing, vol. 10, no. 1–3,
pp. 19–41, 2000.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[5] D. Yu and L. Deng, Automatic Speech Recognition - A
Deep Learning Approach, Springer, 2015.
[6] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large
vocabulary speech recognition,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 20, no.
1, pp. 30–42, 2012.
[7] M. Ravanelli, Deep Learning for Distant Speech Recognition, PhD Thesis, University of Trento, 2017.
[8] M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep
neural network approaches to speaker recognition,” in
Proc. of ICASSP, 2015, pp. 4814–4818.
[9] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10,
pp. 1671–1675, 2015.
[10] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and
J. Alam, “Deep neural networks for extracting Baum-Welch statistics for speaker recognition,” in Proc. of
Speaker Odyssey, 2014.
[11] S. Yaman, J. W. Pelecanos, and R. Sarikaya, “Bottleneck features for speaker recognition,” in Proc. of
Speaker Odyssey, 2012, pp. 105–108.
[12] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and
J. Gonzalez-Dominguez, “Deep neural networks for
small footprint text-dependent speaker verification,” in
Proc. of ICASSP, 2014, pp. 4052–4056.
[13] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer,
“End-to-end text-dependent speaker verification,” in
Proc. of ICASSP, 2016, pp. 5115–5119.
[14] D. Snyder, P. Ghahremani, D. Povey, D. Romero,
Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Proc. of SLT, 2016, pp. 165–170.
[15] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and
S. Khudanpur, “X-vectors: Robust dnn embeddings for
speaker recognition,” in Proc. of ICASSP, 2018.
[16] F. Richardson, D. A. Reynolds, and N. Dehak, “A
unified deep neural network for speaker and language
recognition,” in Proc. of Interspeech, 2015, pp. 1146–
1150.
[17] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. of Interspeech, 2017, pp. 999–1003.
[18] C. Zhang, K. Koishida, and J. Hansen, “Text-independent speaker verification based on triplet convolutional neural network embeddings,” IEEE/ACM
Trans. Audio, Speech and Lang. Proc., vol. 26, no. 9,
pp. 1633–1644, 2018.
[19] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker
embeddings for short-duration speaker verification,” in
Proc. of Interspeech, 2017, pp. 1517–1521.
[20] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb:
a large-scale speaker identification dataset,” in Proc. of
Interspeech, 2017.
[21] D. Palaz, M. Magimai-Doss, and R. Collobert, “Analysis of CNN-based speech recognition system using raw
speech as input,” in Proc. of Interspeech, 2015.
[22] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson,
and O. Vinyals, “Learning the speech front-end with raw
waveform CLDNNs,” in Proc. of Interspeech, 2015.
[23] Y. Hoshen, R. Weiss, and K. W. Wilson, “Speech acoustic modeling from raw multichannel waveforms,” in
Proc. of ICASSP, 2015.
[24] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan,
M. Bacchiani, and A. Senior, “Speaker localization and
microphone spacing invariant acoustic modeling from
raw multichannel waveforms,” in Proc. of ASRU, 2015.
[25] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep neural networks using raw time
signal for LVCSR,” in Proc. of Interspeech, 2014.
[26] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi,
M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu
features? end-to-end speech emotion recognition using
a deep convolutional recurrent network,” in Proc. of
ICASSP, 2016, pp. 5200–5204.
[27] H. Dinkel, N. Chen, Y. Qian, and K. Yu, “End-to-end spoofing detection with raw waveform CLDNNs,” in
Proc. of ICASSP, pp. 4860–4864, 2017.
[28] H. Muckenhirn, M. Magimai-Doss, and S. Marcel, “Towards directly modeling raw speech signal for speaker verification using CNNs,” in Proc. of ICASSP, 2018.
[29] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan,
O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and
K. Kavukcuoglu, “Wavenet: A generative model for raw
audio,” CoRR, vol. abs/1609.03499, 2016.
[30] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain,
J. Sotelo, A. C. Courville, and Y. Bengio, “Samplernn:
An unconditional end-to-end neural audio generation
model,” CoRR, vol. abs/1612.07837, 2016.
[31] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Prentice Hall, NJ,
2011.
[32] S. K. Mitra, Digital Signal Processing, McGraw-Hill,
2005.
[33] T. N. Sainath, B. Kingsbury, A. R. Mohamed, and
B. Ramabhadran, “Learning filter banks within a deep
neural network framework,” in Proc. of ASRU, 2013,
pp. 297–302.
[34] H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, and J. Guo, “DNN
Filter Bank Cepstral Coefficients for Spoofing Detection,” IEEE Access, vol. 5, pp. 4779–4787, 2017.
[35] H. Seki, K. Yamamoto, and S. Nakagawa, “A deep
neural network integrated with filterbank learning for
speech recognition,” in Proc. of ICASSP, 2017, pp.
5480–5484.
[36] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA
TIMIT Acoustic Phonetic Continuous Speech Corpus
CDROM,” 1993.
[37] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur,
“Librispeech: An ASR corpus based on public domain
audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
[38] J. Ba, R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
[39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of ICML, 2015, pp. 448–456.
[40] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. of ICML, 2013.
[41] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. of AISTATS, 2010, pp. 249–256.
[42] D. Povey et al., “The Kaldi Speech Recognition Toolkit,” in Proc. of ASRU, 2011.
[43] A. Larcher, K. A. Lee, and S. Meignier, “An extensible speaker identification sidekit in python,” in Proc. of
ICASSP, 2016, pp. 5095–5099.
[44] A. K. Sarkar, D. Matrouf, P. M. Bousquet, and J. F. Bonastre, “Study of the effect of i-vector modeling on short
and mismatch utterance duration for speaker verification,” in Proc. of Interspeech, 2012, pp. 2662–2665.
[45] R. Travadi, M. Van Segbroeck, and S. Narayanan,
“Modified-prior i-Vector Estimation for Language Identification of Short Duration Utterances,” in Proc. of Interspeech, 2014, pp. 3037–3041.
[46] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan,
and M. Mason, “i-vector based speaker recognition on
short utterances,” in Proc. of Interspeech, 2011, pp.
2341–2344.