End-to-End Speech Recognition: A Survey

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL.
32, 2024 325
End-to-End Speech Recognition: A Survey

Rohit Prabhavalkar , Member, IEEE, Takaaki Hori , Senior Member, IEEE, Tara N. Sainath , Fellow, IEEE,
Ralf Schlüter , Senior Member, IEEE, and Shinji Watanabe , Fellow, IEEE
Abstract—In the last decade of automatic speech recognition the acoustic feature set (e.g., non-linear discriminant/tandem
(ASR) research, the introduction of deep learning has brought approach [6], [7]). In language modeling, deep learning has
considerable reductions in word error rate of more than 50% replaced count-based approaches [8], [9], [10]. However, in
relative, compared to modeling without deep learning. In the
wake of this transition, a number of all-neural ASR architectures these early attempts at introducing deep learning, the classi-
have been introduced. These so-called end-to-end (E2E) models cal ASR architecture was unmodified. Classical state-of-the-art
provide highly integrated, completely neural ASR models, which ASR systems today are composed of many separate components
rely strongly on general machine learning knowledge, learn more and knowledge sources: especially speech signal preprocessing;
consistently from data, with lower dependence on ASR domain- methods for robustness with respect to recording conditions;
specific experience. The success and enthusiastic adoption of deep
learning, accompanied by more generic model architectures has led phoneme inventories and pronunciation lexica; phonetic clus-
to E2E models now becoming the prominent ASR approach. The tering; handling of out-of-vocabulary words; various methods
goal of this survey is to provide a taxonomy of E2E ASR models for adaptation/normalization; elaborate training schedules with
and corresponding improvements, and to discuss their properties different objectives including sequence discriminative training,
and their relationship to classical hidden Markov model (HMM)
etc. The potential of deep learning, on the other hand, initiated
based ASR architectures. All relevant aspects of E2E ASR are
covered in this work: modeling, training, decoding, and external successful approaches to integrate formerly separate modeling
language model integration, discussions of performance and de- steps, e.g., by integrating speech signal pre-processing and fea-
ployment opportunities, as well as an outlook into potential future ture extraction into acoustic modeling [11], [12].
developments. More consequently, the introduction of deep learning to ASR
Index Terms—End-to-end, automatic speech recognition. also initiated research to replace classical ASR architectures
based on hidden Markov models (HMM) with more integrated
I. INTRODUCTION joint neural network model structures [13], [14], [15], [16].
These ventures might be seen as trading specific speech pro-
1
HE classical statistical architecture decomposes an au-
T tomatic speech recognition (ASR) system into four main
components: acoustic feature extraction from speech audio sig-
cessing models for more generic machine learning approaches
to sequence-to-sequence processing – akin to how statistical
approaches to natural language processing have come to re-
nals, acoustic modeling, language modeling and search based on place more linguistically oriented models. For these all-neural
Bayes’ decision rule [1], [2], [3]. Classical acoustic modeling approaches recently the term end-to-end (E2E) [14], [17], [18],
is based on hidden Markov models (HMMs) to account for [19] has been established. Therefore, first of all an attempt
speaking rate variation. Within the classical approach, deep to define the term end-to-end in the context of ASR is due
learning has been introduced into acoustic and language mod- in this survey. According to the Cambridge Dictionary, the
eling. In acoustic modeling, deep learning has replaced Gaus- adjective “end-to-end” is defined as: “including all the stages of
sian mixture distributions (hybrid HMM [4], [5]) or augmented a process” [20]. We therefore propose the following definition
of end-to-end ASR: an integrated ASR model that enables joint
Manuscript received 21 February 2023; revised 2 September 2023; accepted training from scratch; avoids separately obtained knowledge
5 October 2023. Date of publication 30 October 2023; date of current version 16
November 2023. The associate editor coordinating the review of this manuscript sources; and, provides single-pass recognition consistent with
and approving it for publication was Prof. Kai Yu. (Corresponding author: Rohit the objective to optimize the task-specific evaluation measure,
Prabhavalkar.) i.e., usually label (word, character, subword, etc.) error rate.
Rohit Prabhavalkar is with Google LLC., Mountain View, CA 94043 USA
(e-mail: [email protected]). While this definition suffices for the present discussion, we note
Takaaki Hori is with Apple Inc., Cambridge, MA 02142 USA (e-mail: that such an idealized definition hides many nuances involved in
[email protected]). the term E2E and lacks distinctiveness; we elaborate on some of
Tara N. Sainath is with Google LLC., New York, NY 10011 USA (e-mail:
[email protected]). these nuances in Section II to discuss the various connotations
Ralf Schlüter is with Lehrstuhl Informatik 6 - Computer Science De- of the term E2E in the context of ASR.
partment, RWTH Aachen University, 52074 Aachen, Germany (e-mail: What are potential benefits of E2E approaches to ASR? The
[email protected]).
Shinji Watanabe is with Carnegie Mellon University, Pittsburgh, PA 15213 primary objective when developing an ASR systems is to mini-
USA (e-mail: [email protected]). mize the expected word error rate; secondary objectives are to re-
Digital Object Identifier 10.1109/TASLP.2023.3328283 duce time and memory complexity of the resulting decoder, and
1 The term “classical” here refers to the former, long-term, state-of-the-art
ASR architecture based on the decomposition into acoustic and language model, – assuming a constrained development budget – genericity, and
and with acoustic modeling based on hidden Markov models. ease of modeling. First of all, an integrated ASR system, defined
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
https://creativecommons.org/licenses/by/4.0/
326 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 32, 2024
in terms of a single neural network structure supports genericity components, which may also mean dropping the classical sepa-
of modeling and may allow for faster development cycles when ration of ASR into an acoustic model and a language model.
building ASR systems for new languages or domains. Similarly, However, in practice E2E ASR systems are often combined
ASR models defined by a single neural network structure may with external language models trained on text-only data, which
become more ‘lean’ compared to classical modeling, with a weakens the end-to-end nature of the system to some extent.
simpler decoding process, obviating the need to integrate sep- b) Joint Training: In terms of model training, E2E can be
arate models. The resulting reduction in memory footprint and interpreted as estimating all parameters, of all components of a
power consumption supports embedded ASR applications [21], model jointly using a single objective function that is consistent
[22]. Furthermore, end-to-end joint training may help to avoid with the task at hand, which in case of ASR means minimiz-
spurious optima from intermediate training stages. Avoiding ing the expected word error rate.2 However, the term lacks
secondary knowledge sources like pronunciation lexica may distinctiveness here, as classical and/or modular ASR model
be helpful for languages/domains where such resources are architectures also support joint training with a single objective.
not easily available. Also, secondary knowledge sources may c) Training from Scratch: The E2E property can also be
themselves be erroneous; avoiding these may improve models interpreted with respect to the training process itself, by re-
trained directly from data, provided that sufficient amounts of quiring training from scratch, avoiding external knowledge like
task-specific training data are available. prior alignments or initial models pre-trained using different
With the current surge of interest in E2E ASR models and criteria or knowledge sources. However, note that pre-training
an increasing diversity of corresponding work, the authors of and fine-tuning strategies are also relevant, if the model has
this review think it is time to provide an overview of this rapidly explicit modularity, including self-supervised learning [25] or
evolving domain of research. The goal of this survey is to provide joint training of front-end and speech recognition models [26].
an in-depth overview of the current state of research on E2E Especially in case of limited amounts of target task training
ASR systems, covering all relevant aspects of E2E ASR, with data, utilizing large pretrained models is important to obtain
a contrastive discussion of the different E2E and classical ASR performant E2E ASR systems.
architectures. d) Avoiding Secondary Knowledge Sources: For ASR, stan-
This survey of E2E speech recognition is structured as fol- dard secondary knowledge sources are pronunciation lexica and
lows. Section II discusses the nuances in the term E2E as it phoneme sets, as well as phonetic clustering, which in classical
applies to ASR. Section III describes the historical evolution of state-of-the-art ASR systems usually is based on classifica-
E2E speech recognition, with specific focus on the input-output tion and regression trees (CART) [27]. Secondary knowledge
alignment and an overview of prominent E2E ASR models. sources and separately trained components may introduce er-
Section IV discusses improvements of the basic E2E models, rors, might be inconsistent with the overall training objective
including E2E model combination, training loss functions, con- and/or may generate additional cost. Therefore, in an E2E
text, encoder/decoder structures and endpointing. Section V approach, these would be avoided. Standard joint training of
provides an overview of E2E ASR model training. Decoding an E2E model requires using a single kind of training data,
algorithms for the different E2E approaches are discussed in which in case of ASR would be transcribed speech audio data.
Section VI. Section VII discusses the role and integration of However, in ASR often even larger amounts of text-only data,
(separate) language models in E2E ASR. Section VIII reviews as well as optional untranscribed speech audio are available.
experimental comparisons of the different E2E as well as classi- One of the challenges of E2E modeling therefore is how to
cal ASR approaches. Section IX provides an overview of appli- take advantage of text-only and audio-only data jointly with-
cations of E2E ASR. Section X investigates future directions of out introducing secondary (pretrained) models and/or training
E2E research in ASR, before concluding in Section XI. Finally, objectives [28], [29].
we note that this survey paper also includes comparative discus- e) Direct Vocabulary Modeling: Avoiding pronunciation lex-
sions between novel E2E models and classical HMM-based ASR ica and corresponding subword units leave E2E recognition
approaches in terms of various aspects; most sections end with vocabularies to be derived from whole word or character rep-
a summarization of the relationship between E2E models and resentations. Whole word models [30], according to Zipf’s
HMM-based ASR approaches in relation to the topics covered law [31], would require unrealistically high amounts of tran-
within the respective sections. scribed training data for large vocabularies, which might not be
attainable for many tasks. On the other hand, methods to generate
subword vocabularies based on characters, like the currently
II. DISTINCTIVENESS OF THE TERM E2E popular byte pair encoding (BPE) approach [32], might be seen
As noted in Section I the term E2E provides an idealized as secondary approaches outside the E2E objective, even more
definition of ASR systems, and can benefit from a more detailed so if acoustic data is considered for subword derivation [33],
discussion based on the following perspectives. [34], [35], [36].
a) Joint Modeling: In terms of ASR, the E2E property can be
interpreted as considering all components of an ASR system
2 Note that this does not necessarily require Bayes Risk training, as standard
jointly as a single computational graph. Even more so, the
training criteria like cross entropy, maximum mutual information and maximum
common understanding of E2E in ASR is that of a single joint likelihood in case of classical ASR models asymptotically guarantee optimal
modeling approach that does not necessarily distinguish separate performance in the sense of Bayes decision rule, also [23], [24].
PRABHAVALKAR et al.: END-TO-END SPEECH RECOGNITION: A SURVEY 327
f) Generic Modeling: Finally, E2E modeling also requires b) Implicit Alignment Modeling: does not introduce a latent
genericity of the underlying modeling: task-specific constraints alignment variable, but models the label sequence posterior
are learned completely from data, in contrast to task-specific P (C|X) directly.
knowledge which influences the modeling of the system ar- Explicit alignment modeling approaches can mainly be distin-
chitecture in the first place. For example, the monotonicity guished by their choice of latent variable; these can be encoded
constraint in ASR may be learned completely from data in an in terms of valid emission paths in corresponding finite state
end-to-end fashion (e.g., in attention-based approaches [16]), or automata (FSA) [38] which relate the input and output sequences
it may directly be implemented, as in classical HMM structures. – the approach taken in our article. Typically, latent variables in
However, model constraints may be considered by way of regu- explicit alignment modeling in transducer E2E models intro-
larization in E2E ASR model training, and can thus provide an duce extensions to the output label set with different forms of
alternative way to introduce task-specific knowledge. continuation labels (including, but not limited to so-called blank
g) Single-Pass Search: In terms of the recognition/search labels).3
problem, the E2E property can be interpreted as integrating all
components (models, knowledge sources) of an ASR system A. Encoder and Decoder Modules
before coming to a decision. This is in line with Bayes’ decision
Irrespective of the alignment modeling approach, following
rule, which exactly requires a single global decision integrating
the notation introduced in [41], it is useful to view all E2E ASR
all available knowledge sources, which is supported by both
models as being composed of an encoder module and a decoder
classical ASR models as well as E2E models. On the other hand,
module. The encoder module, denoted H(X), maps an input
multipass search is not only exploited by classical ASR models,
acoustic frame sequence, X, of length T into a higher-level
but also by E2E ASR models, the most prominent case here
representation, H(X) = (h1 , . . . , hT ) of length T (typically
being (external) language model rescoring.
T ≤ T ). Note that the encoder output is independent of the
All in all, we need to conclude that a) “E2E” does not provide
hypothesized label sequence. The decoder module models the
a clear distinction between classical and novel, so-called E2E
label sequence posterior on top of the encoder output:
models, and b) the E2E property often is weakened in practice,
leaving the term as a more general, idealized perspective on ASR P (C|X) = P C H(X)
modeling. Thus, we may distinguish different approaches based upon
how the output label sequence distribution (including potential
III. A TAXONOMY OF E2E MODELS IN ASR latent variables resulting from the alignment modeling) are
decomposed into individual label (and alignment) contributions;
Before we derive a taxonomy of E2E ASR modeling ap-
these may occur per output label position, per encoder frame
proaches, we first introduce our notation. We denote the input
position, or combinations thereof:
speech utterance as X, which we assume has been parameterized
into D-dimensional acoustic frames (e.g., log-mel features) of P C[, A]H(X)
length T : X = (x1 , . . . , xT ), where xt ∈ RD . We denote the L

corresponding word sequences as C, which can be decomposed = P ci [, ai ]ci−1 i−1 i−1 i−1
1 [, a1 ], vi (c1 [, a1 ], H(X))
into a suitable sequence of labels of length L: C = (c1 , . . . , cL ), i=1
where each label cj ∈ C. Our description is agnostic to the where the notation mi−1 corresponds to the sequence
1
specific representation used for decomposing the word sequence of i − 1 previous instances of the variables m; and,
into labels; popular choices include characters, words, or sub- vi (ci−1 i−1
1 [, a1 ], H(X)) denotes a context-vector that provides
word sequences (e.g., BPE [32], word-pieces [37]). the connection between encoder output, H(X), and the label
ASR may be viewed as a sequence classification problem output position, i. In general the context vector may depend
which maps a variable length input, X, into an output, C, of on the label context (and possibly the latent variable context,
unknown length. Following Bayes’ decision rule, any statistical for explicit alignment modeling approaches). Apart from the
approach to ASR must determine how to model the word se- underlying alignment model and corresponding output label de-
quence posterior probability, P (C|X). Thus, a natural taxonomy composition, decoder modules differ in terms of the assumptions
of E2E ASR modeling can be based on the various strategies for on their label context ci−1 (and their latent variable context
1
modeling this word sequence posterior: i.e., how the alignment ai−1 ), which correspond to different conditional independence
1
problem between input and output sequence is handled; and, assumptions, and by their access to the encoder output. For
how sequence modeling is decomposed to the level of individual example, the local posterior may only depend on a single encoder
input vectors xt and/or output labels cl . We find that it is useful frame output (i.e., with the context vector being reduced to a
to distinguish implicit and explicit modeling approaches, based single encoder frame’s output): vi (ci−1
1 , H(X)) = hti (X). As
on the modeling of the sequence-to-sequence alignment: we shall see in detail in the following sections, the simplest
a) Explicit Alignment Modeling: does not necessarily refer case of an encoder frame-level decomposition (with L = T , and
to the determination of a single unique alignment, but instead
introduces an explicit alignment modeled as a latent variable, A:
3 For example, these extensions may also include explicit duration variables,
P (C|X) = P (C, A|X) leading to segmental models [39]. Such models can be rewritten into equivalent
A transducer models [40], and vice-versa.
ti = i) corresponds to CTC [13]; AED models [16] and their

variants maintain the full dependency of the context vector.
Finally, different E2E models can also be distinguished by
the specific modeling choices that are involved in the design
of the neural network used to implement the encoder and the
decoder. These might involve feed-forward neural networks,
convolutional neural networks, recurrent neural networks (either
uni-directional or bi-directional) [42], attention [43], and vari-
ous combinations thereof (e.g., transformers [44] or conform-
ers [45]). These modeling choices and corresponding training
methods can be applied across E2E ASR models and therefore
do not enter the taxonomy of E2E ASR models discussed here.
However, specific choices will be discussed as part of the exem- Fig. 1. Example alignment sequence for a CTC model with the target sequence
C = (s, e, e) (right), alongside a (non-deterministic) finite state automaton
plary E2E ASR models presented in Section VIII and Section IX. (FSA) [38] (left) representing the set of all valid alignment paths.
B. Explicit Alignment Modeling Approaches

example, (X, C), we denote the set of all valid alignments,
Early E2E modeling approaches modeled alignments explic-
(X,C) = {A = (a1 , a2 , . . . , aT )}, such that each at ∈ Cb with
ACTC
itly through a latent variable, which is marginalized out (possi- the additional constraint that A is identical to C after first
bly, approximately) during training and inference. Examples of collapsing consecutive identical labels, and then removing all
this family of approaches include connectionist temporal clas- blank symbols. For example, if T = 10, and C = (s, e, e),
sification (CTC) [13], the recurrent neural network transducer then A = (s, b , b , e, e, b , e, e, b , b) ∈ ACTC
(X,C) , as il-
(RNN-T) [14], the recurrent neural aligner (RNA) [46], and lustrated in Fig. 1. As can be seen in this example, repeated
the hybrid auto-regressive transducer [47] (HAT). As will be labels in the output can be represented by intervening blanks.
discussed in subsequent sections, the latter modeling approaches Following (1), CTC defines the posterior probability of the label
in this family represent increasingly sophisticated modeling sequence C conditioned on the input, X, by marginalizing over
of alignments, with fewer independence assumptions and are all possible CTC alignments as:
thus increasingly powerful. A common feature of all explicit
alignment models discussed in this section is that they introduce PCTC (C|X) = P (A|H(X))
an additional blank symbol, denoted b, and define an output A∈ACTC
(X,C)
probability distribution over symbols in the set Cb = C ∪ {b}. T

The interpretation of the b symbol varies slightly between each
= P (at |at−1 , . . . , a1 , H(X))
of these models, as we discuss in greater details below. For now, t=1
A∈ACTC
it suffices to say that given a specific training example, (X, C), (X,C)
each of these models defines a set of valid alignments, denoted T

by A(T,C) , and define the conditional distribution P (C|X) by = P (at |ht ) (2)
marginalizing over all valid alignment sequences: A∈ACTC t=1
(X,C)
P (C|X) = P (C|A, H(X))P (A|H(X)) Critically, as can be seen in (2), CTC makes a strong indepen-
A dence assumption that the model’s output at time t is condition-

= P (A|H(X)) (1) ally independent of the outputs at other timesteps, given the local
A∈A(T =|H(X)|,C)
encoder output at time t.
Thus, a CTC model consists of a neural network that models
where, by definition P (C|A, H(X)) = 1 if and only if A ∈
the distribution P (at |X), at each step as shown in Fig. 2.
A(T,C) and 0 otherwise.4 We discuss the specific formulations
The encoder is connected to a softmax layer with |Cb | targets
of each of these models in the subsequent sections.
representing the individual probabilities in (2): P (at = c|X) =
1) Connectionist Temporal Classification (CTC): Connec-
P (at = c|H(X)), which comprises the decoder module for
tionist Temporal Classification (CTC) was proposed by Graves
CTC. Thus, at each step, t, the model consumes a single encoded
et al. [13] as a technique for mapping a sequence of input tokens
frame ht and outputs a distribution over the labels; in other
to a corresponding sequence of output tokens. CTC explicitly
words, the model “outputs” a single label either blank, b, or
models alignments between the encoder output, H(X), and
one of the targets in C.
the label sequence, C, by introducing a special “blank” label,
2) Recurrent Neural Network Transducer (RNN-T): The Re-
denoted by b: Cb = C ∪ {b}. An alignment, A ∈ Cb∗ , is thus
current Neural Network Transducer (RNN-T) [14], [48] was
a sequence of labels in C or b.5 Given a specific training
proposed by Graves as an improvement over the basic CTC
model [13], by removing some of the conditional independence
4 This is equivalent to the assumption that the mapping from an alignment A
assumptions that we discussed previously. The RNN-T model,
to a label sequence C is unique, by definition.
5 S ∗ denotes a Kleene closure: the set of all possible sequences composed of which is depicted in Fig. 3, is best understood by contrasting
tokens in the set S. it against the CTC model. As with CTC, the RNN-T model
Fig. 4. Example alignment sequence (right) for an RNN-T model with the
target sequence C = (s, e, e). Horizontal transitions in the image correspond
to blank outputs. The FSA (left) represents the set of all valid RNN-T alignment
Fig. 2. Representation of the CTC model consisting of an encoder which maps paths.
the input speech into a higher-level representation, and a softmax layer which
predicts frame-level probabilities over the set of output labels and blank.
require no special treatment as illustrated in Fig. 4, where,
i1 = i2 = 0; i3 = i4 = 1; i10 = 3; etc.
We may then define the posterior probability P (C|X) as
before:

PRNNT (C|X) = P (A|H(X))
A∈ARNNT
(X,C)
T
+L
= P (aτ |aτ −1 , . . . , a1 , H(X))
A∈ARNNT τ =1
(X,C)
T
+L
= P (aτ |ciτ , ciτ −1 , . . . , c0 , hτ −iτ )
A∈ARNNT τ =1
(X,C)
T
+L
= P (aτ |piτ , hτ −iτ ) (3)
Fig. 3. RNN-T Model [14], [48] consists of an encoder which transforms the
input speech frames into a high-level representation, and a prediction-network A∈ARNNT τ =1
(X,C)
which models the sequence of non-blank labels that have been output previously.
The prediction network output, pit , represents the output after producing the where, P = (p1 , . . . , pL ) represents the output of the prediction
previous non-blank label sequence c1 , . . . , cit . The joint network produces a network depicted in Fig. 3 which summarizes the sequence of
probability distribution over the output symbols (augmented with blank) given
the prediction network state and a specific encoded frame.
previously predicted non-blank labels, implemented as another
neural network: pj = N N (·|c0 , . . . , cj−1 ), where c0 is a special
start-of-sentence label, sos. Thus, as can be seen in (2), RNN-T
augments the output symbols with the blank symbol, and thus reduces some of the independence assumptions in CTC since
defines a distribution over label sequences in Cb . Similarly, as the output at time t is conditionally dependent on the sequence
with CTC, the model consists of an encoder which processes the of previous non-blank predictions, but is independent of the
input acoustic frames X to generate the encoded representation specific choice of alignment (i.e., the choice of the frames at
H(X) = (h1 , . . . , hT ). which the non-blank tokens were emitted).
Unlike CTC, however, the blank symbol in RNN-T has a Our presentation of RNN-T alignments considers the “canoni-
slightly different interpretation; for each input encoder frame, cal” case. In principle, however, the model can encode the same
ht , the RNN-T model outputs a sequence of zero or more set of conditional independence assumptions in RNN-T (i.e.,
symbols in C which are terminated by a single blank sym- the model structure), while considering alternative alignment
bol. Thus, we may define the set of all valid alignment se- structures as in the work of [49]. In their work, Moritz et al.,
quences in RNN-T as: ARNNT (X,C) = {A = (a1 , a2 , . . . , aT +L )}, represent valid frame-level alignments as an arbitrary graph.
the set of all sequences of T + L symbols in Cb∗ , which This formulation, for example, allows for the use of “CTC-like”
are identical to C after removing all blanks. Finally, for a alignments in the RNN-T model (i.e., outputting a single label –
given output position τ , let iτ denote the number of non- blank, or non-blank – at each frame) while conditioning on the
blank labels in the partial sequence (a1 , . . . , aτ −1 ). Thus, the set of previous non-blank symbols as in the RNN-T model.
number of blanks in the partial sequence (a1 , . . . , aτ −1 ) is 3) Recurrent Neural Aligner (RNA): The recurrent neural
τ − iτ − 1. For example, if T = 7, and C = (s, e, e), then aligner (RNA) was proposed by Sak et al. [46]. The RNA
A = (b , s, b , b , b , e, e, b , b , b) ∈ ARNNT
(X,C) . Note model generalizes the RNN-T model by removing one of its
that, unlike the CTC model, repeated labels in the output conditional independence assumptions. The model, depicted in
Fig. 6. Example alignment sequence (right) for an RNA model with the target
sequence C = (s, e, e). Horizontal transitions in the image correspond to blank
outputs; diagonal transitions correspond to outputting a non-blank symbol. The
FSA (left) represents the set of valid alignments for the RNA model. Although
the FSA is identical to the corresponding FSA for RNN-T in Fig. 4, the semantics
of the b label are different in the two cases.
Fig. 5. RNA Model [46] resembles the RNN-T model [14], [48] in terms of the where, as before it denotes the number of non-blank symbols
model structure. However, this model is only permitted to output a single label in the partial alignment sequence (a1 , . . . , at−1 ), and qt−1 =
– either blank, or non-blank – in a single frame. Unlike RNN-T, the prediction
network state in the RNA model, qt−1 , depends on the entire alignment sequence
NN(·|at−1 , . . . , a1 ) represents the output of a neural network
at−1 , . . . , a1 . The joint network produces a probability distribution over the which summarizes the entire partial alignment sequence, where
output symbols (augmented with blank) given the prediction network state and NN(·) represents a suitable neural network (an LSTM in [46]).
a specific encoded frame.
Thus, RNA removes the one remaining conditional indepen-
dence assumption of the RNN-T model, by conditioning on the
sequence of previous non-blank labels as well as the alignment
Fig. 5, is best understood by considering how it differs from
that generated them. However, this comes at a cost: the exact
the RNN-T model. As with CTC and RNN-T, the RNA model
computation of the log-likelihood in (3) (and corresponding
defines a probability distribution over blank augmented labels
gradients) is intractable. Instead, RNA makes two simplifying
in the set Cb , where b has the same semantics as in the CTC
assumption to ensure tractable training: by assuming that the
model: at each frame the model can only output a single label
model can only output a single label at each frame; and utilizing
– either blank, or non-blank – before advancing to the next
a straight-through estimator for the alignment [50]. The latter
frame; unlike CTC (but as in RNN-T) the model only outputs a
constraint – allowing only a single label (blank or non-blank)
single instance of each non-blank label. More specifically, the
at each frame – has also been explored in the context of the
set of valid alignments, ARNA (X,C) = (a1 , . . . , aT ), in the RNA monotonic RNN-T model [51]. Finally, we note that the work
model consist of length T sequences in Cb∗ with exactly T − L
in [52] further generalizes the RNA model by employing two
blank symbols, and which are identical to C after removing all
RNNs when defining the state: a slow RNN (which corresponds
blanks. Thus, the blank symbol has a different interpretation
to the sequence of previously predicted non-blank labels), and
in RNA and the RNN-T models: in RNN-T, outputting a blank
a fast RNN (which also conditions on the frames at which the
symbol advances the model to the next frame; in RNA, however,
non-blank labels were output).
the model advances to the next frame after outputting a single
blank or non-blank label. Restricting the model to output a
C. Implicit Alignment Modeling Approaches
single non-blank label at each frame improves computational
efficiency and simplifies the decoding process, by limiting the One of the main benefits of the explicit alignment approaches
number of model expansions at each frame (in constrast to RNN- such as CTC, RNN-T, or RNA is that they result in ASR models
T decoding). For example, if T = 8, and C = (s, e, e), then that are easily amenable to frame-synchronous decoding.6 In this
A = (b , s, b , e, b , b , e, b) ∈ ARNA
(X,C) as illustrated in
section, we discuss the attention-based encoder-decoder (AED)
Fig. 6. models (also known as, listen-attend-and-spell (LAS)) [15],
The RNA posterior probability, P (C|X), is defined as: [16], [53], which employs the attention mechanism [43] to
implicitly identify and model the portions of the input acoustics

PRNA (C|X) = P (A|H(X)) which are relevant to each output unit. These models were
A∈ARNA
first popularized in the context of machine translation [54].
(X,C)
Unlike explicit alignment modeling approaches, attention-based
T
encoder-decoder models use an attention mechanism [43] to
= P (at |at−1 , . . . , a1 , H(X)) learn a correspondence between the entire acoustic sequence and
A∈ARNA t=1 the individual labels. Such models support label-synchronous
(X,C)
T

= P (at |qt−1 , ht ) (4) 6 By frame-synchronous decoding, we refer to the ability of the model to
t=1 produce output label for each input frame of speech. Models such as CTC,
A∈ARNA
(X,C) RNN-T, or RNA, support frame-synchronous decoding.
L+1

= P (ci |si , vi ) (5)
i=1
where, vi corresponds to a context vector, which summarizes
the relevant portions of the encoder output, H(X), given the se-
quence of previous predictions ci−1 , . . . , c0 ; and, si corresponds
to the corresponding decoder state after outputting the sequence
of previous symbols, which is produced by updating the decoder
state based on the previous context vector and output label:
si = Decoder(vi−1 , si−1 , ci−1 )
The symbol c0 = sos is a special start-of-sentence symbol
which serves as the first input to the attention-based decoder
before it has produced any outputs. As can be seen in (5), an
important benefit of AED models over models such as CTC or
RNN-T is that they do not make any independence assumptions
between model outputs and the input acoustics, and are thus
more general than the implicit alignment models, while being
considerably easier to train and implement since we do not have
Fig. 7. Attention-based encoder decoder (AED) model [15], [16], [53].
The output distribution is conditioned on the decoder state, si (which sum- to explicitly marginalize over all possible alignment sequences.
marizes the previously decoded symbols), and the context vector, vi (which However, this comes at a cost: previously generated context
summarizes the encoder output based on the decoder state). In the seminal work vectors (which are analogous to the decoded partial alignment
this is accomplished by concatenating the two
of Chan et al., [16], for example,
in explicit alignment models) are not revised as the decoding
vectors, as denoted by the symbol in the figure.
proceeds. Stated another way, while the encoder processing
H(X) might be bi-directional, the decoding process in AED
models reveals a left-right asymmetry [55].
decoding, meaning that during inference, each hypothesis in the
1) Computing the Context Vector in AED Models: As we
beam is expanded by 1 label.
mentioned before, the context vector, vi , is computed by em-
In the explicit alignment approaches presented in
ploying the attention mechanism [43]. The central idea behind
Section III-B, during inference, the model continues to output
these approaches is to define a state vector si which corresponds
symbols until it has processed the final frame at which point
to the state of the model after outputting c1 , . . . , ci−1 . The atten-
the decoding process is complete; similarly, during training, the
tion function, atten(ht , si ) ∈ R, then defines a score between
forward-backward algorithm aligns over all possible alignment
the model state after outputting i − 1 previous symbols, and
sequences. Since an AED model processes the entire acoustic
each of the encoded frames in H(X). These scores can then be
sequence at once, the model needs a mechanism by which
normalized using the softmax function to define a set of weights
it can indicate that it is done emitting all output symbols.
corresponding to each ht as:
This is achieved by augmenting the set of outputs with an
end-of-sentence symbol, eos, so that the output vocabulary exp{atten(ht , si )}
αt,i = T
consists of the set Ceos = C ∪ {eos}. Thus, the AED model, t =1 exp{atten(ht , si )}
depicted in Fig. 7, consists of an encoder network – which Intuitively, the weight αt,i represents the relevance of a par-
encodes the input acoustic frame sequence, X = (x1 , . . . , xT ), ticular encoded frame ht when outputting the next symbol ci ,
into a higher-level representation H(X) = (h1 , . . . , hT ) – after the model has already output the symbols c1 , . . . , ci−1 , as
and an attention-based decoder which defines the probability illustrated in Fig. 8. The context vector summarizes the encoder
distribution over the set of output symbols, Ceos . Thus, output based on the computed attention weights:
given a paired training example, (X, C), we denote by
Ce = (c1 , . . . , cL , eos), the ground-truth symbol sequence of vi = αt,i ht
t
length (L + 1) augmented with the eos symbol. AED models
compute the conditional probability of the output sequence A number of possible attention mechanisms have been ex-
augmented with the eos symbol as: plored in the literature: the most common forms are called
‘content-based attention’, which include dot-product atten-
tion [16] and additive attention [43]. The content-based atten-
P (Ce |X) = P (Ce |H(X)) tion computes the attention score atten(ht , si ) based on the
L+1
relevance between ht and si . However, the score does not
= P (ci |ci−1 , . . . , c0 = sos , H(X)) consider location information, i.e., it is determined by only the
i=1 content, independent of the position. This can lead to incorrect
L+1 attention weights with a large discrepancy against the previous

= P (ci |ci−1 , . . . , c0 = sos , vi ) steps. Thus, location-based attention atten(si , fi,j ) has been pro-
i=1 posed [15], where fi,j is a convolutional feature vector extracted
which computes the context vector, vi , as a sum over all input

frames ht , the various proposed models constrain this sum to
be computed over a subset of frames to allow for streaming
decoding. In the context of our presentation, it is easiest to
think of these models as consisting of an underlying alignment
(whether known or unknown) which can be used to perform
streaming inference.
The Neural Transducer (NT) [59] explicitly partitions the
Fig. 8. Unlike models such as RNN-T or CTC, AED models do not have input encoder frames into T W non-overlapping chunks of length
explicit alignment. However, it is possible to interpret the attention weights αt,i
for a particular output symbol ci as an alignment weight which is represented W : H1W = [h1 , . . . , hW ]; · · · ; HTWW = [hT W +1 , . . . , hT W W ],
T
above for the target sequence C = (s, e, e, eos). In this representation, the where T W = W , and ht = 0 if t > T . Unlike the AED model
size of the circle and the darkness level are proportional to the corresponding which examines all encoded frames when computing the context
attention weights; thus the total probability mass is the same for each row. As
illustrated above, the first few frames correspond to the first symbol c1 = s, vector, the NT model is restricted to process a single chunk
while the latter frames correspond to the second ‘e’: c3 = e. at a time; the model only advances to the next chunk when it
outputs a special end-of-chunk symbol (analogous to eos in
from αi−1 , the attention weights in the previous step. The hybrid the AED model); inference in the model terminates when the
attention, i.e., a combination of the content- and location-based model has output the end-of-chunk symbol in the final chunk
attentions, has also been investigated in [15], showing a higher HTWW . If the alignments of the ground-truth output sequence,
accuracy than the separate ones. Besides, other location-based C, with respect to the W -length chunks are unknown, then it is
methods use a Gaussian (mixture) model estimated with si to possible to train the system by using a rough initial alignment
obtain attention weights [56], [57]. Transformer model [44] where symbols are distributed equally among the T W chunks,
uses only content-based dot-product attention, but also takes followed by iterative refinement by computing the most likely
location information into account through positional encoding. output alignments given the current model parameters [59] sim-
Apart from the specific choice of the attention mechanism, a ilar to forced-alignments in HMM-based systems. An alternate
common technique to improve performance involves the use approach [63] consists of using a separate system (e.g., a clas-
of multiple independent attention heads – vi1 , . . . , viK – which sical hybrid system) to get initial alignments (e.g., word-level
are then concatenated together to obtain the final context vector alignments), which can be used to assign sub-word units to the
vi = [vi1 ; . . . ; viK ], in the so-called multi-head attention ap- individual chunks.
proach [44], or indeed by stacking together multiple attention- An alternative approach, proposed by Raffel et al. [60],
based layers in the transformer decoder presented by Vaswani modifies the vanilla AED model by explicitly introducing an
et al. [44]. alignment module which scans the encoder frames, H(X),
from left-to-right to identify whether the current frame should
be used to emit any outputs (modeled as a Bernoulli random
D. From Implicit to Explicit Alignment Modeling variable). If a frame, τ , is selected, then the model produces
AED models, which make no conditional independence as- an output based on the local encoder frame, hτ . The process is
sumptions, are extremely powerful, often outperforming explicit then repeated starting from the currently selected frame, thus
alignment E2E approaches such as CTC, or RNN-T [41]. How- allowing multiple outputs to be generated at the same frame.
ever, these models also have some significant disadvantages, This results in a model with hard monotonic alignments between
most notably that the models are typically non-streaming: i.e., the input speech and the output labels since the models are con-
the models must process all acoustic frames before they can strained to generate outputs in a streaming fashion. A Monotonic
generate any output hypotheses. A somewhat related issue is Chunkwise Attention (MoChA) model [61] improves upon the
that the models are extremely sensitive to the length of the work of Raffel et al., by allowing the model to generate the
acoustic sequences, which requires special processing to be able next output using a context vector computed using attention
to decode long-form audio [58]. There is a body of work that lies over a local window of frames to the left of the selected frame
in between these two extremes: models such as the neural trans- τ : hτ −W +1 , . . . , hτ . Thus, the MoChA model consists of a
ducer [59], or those based on monotonic alignments [60] and its two-level process – identifying frames where output should
variants (e.g., monotonic chunkwise alignments (MoChA) [61], be produced following [60], followed by an AED model over
monotonic infinite lookback (MILK) [62] etc.) use an explicit frames to the left of the selected frame. A refinement to the
alignment model, while also utilizing an attention mechanism MoChA model, proposed by Arivazhagan et al. [62] – the mono-
that allows the model to examine local acoustics in order to tonic infinite lookback (MILK) attention model – computes the
refine predictions. In other words, this corresponds to a class of context vector over all frames to the left of the selected frame
streaming AED models. Generally speaking, these models are τ (i.e., h1 , . . . , hτ ) at each step. Another two-fold approach to
motivated by the observation that speech (unlike tasks such as enable streaming operation is presented in [64] under the term of
machine translation) exhibits a ‘local’ relationship between the triggered attention, where a CTC-network is used to trigger, i.e.
encoded frames (assuming that the encoder is uni-directional) control the activation of an AED model with a limited decoder
and the output units; thus, unlike the general AED model delay. We also direct interested readers to studies of various
attention variants: Merboldt et al. [65] compare a number of example, Watanabe et al. [86] find that attention-based models
local monotonic attention variants; Zeyer et al. [66] discuss perform poorly on long or noisy utterances, mainly because
segmental attention variants; Zeyer et al. [67] study the related the model has too much flexibility in predicting alignments
decoding and the relevance of segment length modeling, leading when presented with the entire input utterance. In contrast,
to improved generalization towards long sequences. Segmental models such as CTC – which have left-to-right constraints
attention models are related to transducer models [68]. However, during decoding – perform much better in these cases. They
segmental E2E ASR models are not limited to be realized based propose to employ a multi-task learning strategy with both CTC
on the attention mechanism and may not only be related to a and attention-based losses, which provides a 5–14% relative
direct HMM [39], but have also been shown to be equivalent improvement in word error rate over attention-based models on
to neural transducer modeling [40], thus even providing a clear Wall Street Journal (WSJ) and Chime tasks. Pang et al. [87]
relation between duration modeling and blank probabilities. explore combining the benefits of RNN-T and AED. Specif-
ically, RNN-T decodes utterances in a left-to-right fashion,
Relationship to Classical ASR which works well for long utterances. On the other hand, since
AED sees the entire utterance, it often shows improvements for
In classical ASR models, these frame-level alignments can be utterances where surrounding context is needed to predict the
modeled with HMMs while using generative GMMs or neural current word, e.g., ”one dollar and fifty cents” →
networks to model the output distribution of acoustic frames; $1.50. To combine RNN-T and AED, the authors propose to
frame-level alignments to train neural network acoustic models produce a first-pass result with RNN-T, that is then rescored with
may be obtained by force-alignment from a base GMM-HMM AED in the second-pass. To reduce computation, the authors
systems, but direct sequence training not requiring initial align- share the encoder between RNN-T and AED. The authors find
ments is also possible [69]. that RNN-T + AED provides a 17–22% relative improvement
E2E models introduce alternative alignment modeling ap- in word error rate over RNN-T alone on a voice search task.
proaches to ASR. While the attention mechanism provides a Other flavors of streaming 1st-pass following by attention-based
qualitatively novel approach to map acoustic observation se- 2nd-pass rescoring, such as deliberation [88], have also been
quences to label sequences, transducer approaches [13], [14], explored. One of the issues with such rescoring approaches
[46], [70] handle the alignment problem in a way that can be in- is that any potential improvements are limited to the lattice
terpreted to be similar to HMMs with a specific model topology, produced by the 1st-pass system. To address this, methods which
including marginalization over alignments [71], [72], [73]. CTC run a 2nd-pass beam search have also been explored, particularly
models can also be employed in an HMM-like fashion during in the context of streaming ASR – e.g. cascaded encoder [89],
decoding [74]. Moreover, transducer approaches are equivalent Y-architecture [90] and Universal ASR [91].
to segmental models/direct HMM [40].
Another prominent feature of E2E systems besides the align-
ment property is their direct character-level modeling avoiding B. Incorporating Context
phoneme-based modeling and pronunciation lexica [16], [19], Contextual biasing to a specific domain, including a user’s
[74], [75], [76], [77], [78], [79], [80], [81], [82], with some song names, app names and contact names, is an important com-
even heading for whole-word modeling [30], [76]. However, ponent of any production-level automatic speech recognition
character-level modeling also is viable with classical hybrid (ASR) system. Contextual biasing is particularly challenging
HMM architectures [83] and has been shown to work even with in E2E models because these models typically retain only a
standard HMM models w/o neural networks [84]. small list of candidates during beam search, and tend to perform
poorly when recognizing words that are seen infrequently during
IV. ARCHITECTURE IMPROVEMENTS TO BASIC E2E MODELS training (typically named entities), which is the main source
In this section, we describe various algorithmic changes of biasing phrases. There have been a few approaches in the
to vanilla E2E models which are critical in order to obtain literature to incorporate context.
improved performance over classical ASR systems. First, we One approach, known as shallow-fusion contextual bias-
describe various ways of combining different complementary ing [92], constructs a stand-alone weighted finite state transducer
E2E models to improve performance. Next, we introduce ways (FST) representing the biasing phrases. The scores from the
to incorporate context into these models to improve performance biasing FST are interpolated with the scores of the E2E model
on rare proper noun entities. We then describe improved encoder during beam search, with special care taken to ensure we do not
and decoder architectures that take better advantage of the many over- or under-bias phrases. An alternate approach proposes to
cores on specialized architectures such as tensor processing units inject biasing phrases into the model in an all-neural fashion. For
(TPUs) [85]. Finally, we discuss how to improve the latency of example, Pundak et al. [93] represent a set of biasing phrases
the model through an integrated E2E endpointer. by embedding vectors. These vectors are fed as additional input
to an attention-based model, which can then choose to attend
to the phrases and hence boost the chances of predicting the
A. Combinations of Models
phrases. Kim and Metze [94] propose to bias towards dialog
Different end-to-end models are complementary, and there context. In addition, Bruguier et al. [95] extend [93], by lever-
have been numerous attempts at combining these methods. For aging phonemic pronunciations for the biasing phrases when
constructing phrase embeddings. Finally, Delcroix et al. [96] use decoding. These models have shown improved latency and WER
an utterance-wise context vector like an i-vector computed by trade-off by having the endpointing decision predicted as part
a pooling across frame-by-frame hidden state vectors obtained of the model. Furthermore, [114], [115] explored using the CTC
from a sub network (this sub-network is called a sequence- blank symbol for endpoint detection.
summary network).
V. TRAINING E2E MODELS
C. Encoder and Decoder Structure In general, training of E2E models follows deep learning
There have been improvements to encoder architectures of schemes [116], [117], with specific consideration of the sequen-
E2E models over time. The first end-to-end models used long tial structure and the latent alignment problem to be handled in
short-term memory recurrent neural networks (LSTMs), for both ASR. E2E ASR models may be trained end-to-end, notwith-
the encoder and decoder. The main drawback of these sequential standing potential elaborate training schedules and extensive
models is that each frame depends on the computation from the data augmentation. Part of the appeal of end-to-end models
previous frame, and therefore multiple frames cannot be batched is that they do not assume conditional independence between
in parallel. the input frames. Given a training set T = {(Xn , Cn )}N n=1 ,
With the improvement of hardware, specifically on-device the training
criterion L to be minimized can be written as:
Edge Tensor Processing Units (TPUs), with thousands of cores, L=− N n=1 log P (C n |X n ) (which is equivalent to maximiz-
architectures that can better take advantage of the hardware, have ing the total conditional log-likelihood).
been explored. Such architectures include convolution-based
architectures, such as ContextNet [97]. The use of self-attention A. Alignment in Training
to replace the sequential recurrence in LSTMs was explored E2E models such as RNN-T and CTC introduce an additional
in Transformers for ASR [98], [99], [100]. Finally, combining blank token b for alignment. Therefore optimization implies
self-attention with convolution, known as Conformer [45], or marginalizing across all alignments, as follows:
multi-layer perceptron [101], was also explored. Both Trans- N

former and Conformer have shown competitive performance to Lex = − log P (Cn , An |Xn )
LSTMs on many tasks [102], [103]. n=1 An
On the decoder side, research for transducer models has
This requires the forward-backward algorithm [118], [119] for
shown that a large LSTM decoder can be replaced with a simple
efficient computation of the training criterion and its gradient,
embedding lookup table, that attends to only a few previous
with minor modifications for CTC, RNN-T, and RNA models, as
tokens from the model [47], [104], [105], [106], [107]. This
well as classical (full-sum) hybrid ANN/HMMs corresponding
demonstrates that most of the power of the E2E model is in
to the differences in alignments defined in each of these models.
the encoder, which has been a consistent theme of both E2E as
In comparison, AED models are based on implicit alignment
well as classical hybrid HMM models. However, improved de-
modeling approaches, and the training criterion does not have a
coder modeling may also be effective depending on the specific
latent variable A for explicit alignment as:
downstream task. Research has shown that extended decoder N
architectures enable pre-training and adaptation of the decoder
Lim = − log P (Cn |Xn )
using extensive text-only data, leading to accuracy gains [108],
n=1
[109]. For example, one approach separates RNN-T’s prediction We refer the interested reader to the individual papers for further
network into separate blank and vocabulary prediction (LM) details on the training algorithms [13], [14], [15], [16], [46],
components, where the LM component can be trained with text [48], [53], [71], [120]. As shown in Section III-A, in both
data [108]. In addition, in line with the growing interest in large explicit and implicit alignment cases, P (C|X) is factorized
language models in recent years, research has also begun on with respect to input time t and output position i, respec-
solving multiple tasks, including speech recognition, using only tively, and the factorized distribution is conditioned on the
an auto-regressive, GPT-style decoder [110], [111]. label context ci−1
1 , except

for CTC. For example, in the AED
case: log P (C|X) = L i−1
i=1 log P (ci |X, c1 ). During training,
D. Integrated Endpointing we use a teacher-forcing technique where the ground truth
An important characteristic of streaming speech recognition history is used as a label context.
systems is that they must endpoint quickly, so that the ASR result As part of the training procedure, all E2E as well as classical
can be finalized and sent to the server for the appropriate action hidden Markov models for ASR provide mechanisms to solve
to be performed. Endpointing is typically done with an external the underlying sequence alignment problem - either explicitly
voice-activity detector. Since endpointing is both an acoustic via corresponding latent variables, as in CTC, RNN-T or RNA,
and language model decision, recent works in streaming RNN-T and also hybrid ANN/HMM, or implicitly, as in AED models.
models [112], [113] have investigated predicting a microphone Also, the distinction between speech and silence needs to be
closing token eos at the end of the utterance – e.g., “What’s the considered, which may be handled explicitly by introducing
weather eos”. Following the notation from Section III, this is silence as a latent label (hybrid ANN/HMM), or implicitly by
done by including an eos token as part of the set of class labels not labeling silence at all, as currently is the standard in virtually
C and encouraging the model to predict this token to terminate all E2E models.
E2E models also may take advantage of hierarchical train- candidate hypotheses in the beam. Thus, this type of training
ing schedules. These schedules may comprise several separate first requires training the model to optimize P (C|X) in order
training passes and explicit, initially generated alignments that to initialize the model with a good set of parameters to run
are kept fixed for some Viterbi-style [121], [122], [123] train- a beam search. However, also direct approaches have been
ing epochs before re-enabling E2E-style full-sum training that introduced that avoid this separation to train discriminatively
marginalizes over all possible alignments. Such an alternative from scratch [69], [136].
approach is employed by Zeyer et al. [52], where an initial Papers that explore penalizing word errors include, Mini-
full-sum RNN-T model is used to generate an alignment and mum Word Error Rate (MWER) training [137], where the loss
continue with framewise cross-entropy training. This greatly function is constructed such that the expected number of word
simplifies the training process by replacing the summation over errors are minimized. Further work includes MWER for RNN-
all possible alignments in (4) by a single term corresponding to T and self-attention-T [138], as well as MWER using prefix
the alignment sequence generated. Recently, a similar procedure search instead of n-best [139]. Also, there have been studies
has been introduced in [124] also employing E2E models, only. that consider MWER in terms of reinforcement learning [140],
In this work, CTC is used to initialize the training and to gener- [141]. Optimal Completion Distillation (OCD) [81] proposes
ate an initial alignment, followed by intermediate Viterbi-style to minimize the total edit distance using an efficient dynamic
RNN-T training and final full-sum fine tuning, which improved programming algorithm. Finally, another body of research with
convergence compared to full-sum-only training approaches. sequence training introduce a separate external language model
It is interesting to note that in contrast to the RNN-T and at training time [142], which can also be done efficiently via
RNA label-topologies, CTC does not require alignments with approximate lattice recombination [129] and also lattice-free
single label emissions per label position. However, training approaches [130].
CTC models eventually does lead to single label emissions per
hypothesized label. An analysis of this property of CTC training D. Pretraining
which is usually called peaky behavior can be found in [125] and
references therein. Laptev et al. [126] even introduces a CTC All E2E models as well as classical hidden Markov models
variant without non-blank loop transitions. for ASR provide holistic models that in principle enable training
from scratch. However, many strategies exist to initialize and
guide the training process to reach optimal performance and/or to
B. Training With External Language Models
obtain efficient convergence by applying pretraining and model
E2E ASR models generally are normalized on sequence growing [143], [144]. Supervised layer-wise pretraining has
level. Therefore, sequence training with the maximum mutual been successfully applied for classical [5], [145], as well as
information criterion [127] is the same as standard cross en- attention-based ASR models [146], which can be combined with
tropy/conditional likelihood training. However, once external intermediate sub-sampling schemes [147], and model grow-
language models are included in the training phase, sequence ing [148]. Pretraining approaches utilizing untranscribed audio,
normalization needs to be included explicitly, leading to MMI large-scale semi-supervised data and/or multilingual data [149],
sequence discriminative training. This has been exploited as a [150], [151], [152], [153], [154], [155], [156], [157], [158],
further approach to combine E2E models with external language [159], [160] would deserve a self-contained survey and they are
models trained on text-only data during the training phase it- applicable for hybrid DNN/HMM and E2E approaches likewise
self [128], [129], [130]. – they will not be further discussed here.
C. Minimum Word Error Rate Training E. Training Schedules and Curricula

Since the objective of speech recognition is to minimize Dedicated training schedules have been developed to guide
word error rate (WER), there has been a growing number of the optimization process and as part of that reach proper
research studies that incorporate this into the objective function alignment behavior explicitly or implicitly [52], [124], [147].
by minimizing the model-based expectation of the number of Many approaches exist for learning rate control [161], [162]:
word errors, as follows: NewBob [163], [164] and enhancements [162]; global ver-
N sus parameter-wise learning rate control (exponential decay,
Lmwer = W(Cn , Cn )P (Cn |Xn ) power decay, etc.) [165]; learning rate warm-up [44]; warm
n=1 Cn

restarts/cosine annealing [166]; weight decay versus gradually
where W(Cn , Cn ) is the word error count in a hypothesis Cn decreasing batch size [167]; fine-tuning [168] or population-
given a reference Cn , and n is an index which iterates over the based training [169]; etc. For a survey of meta learning cf. [170].
entire training set. These methods, known as sequence or dis- Sequence learning approaches also consider curriculum learn-
criminative training, have shown great improvements for classi- ing [171], [172], e.g., by considering short sequences first [173],
cal ASR [131], [132], [133], [134], [135], and have since been [174]; interim increase of sub-sampling [147] initially more
explored in E2E models. Typically these losses are constructed sub-sampling; or, for multi-speaker ASR training sort mixed
by running in ‘beam-search’ mode rather than teacher-forcing speech by SNR and start with speakers of balanced energy and
mode, and construct a loss from the errors made from the mixed gender [175].
F. Optimization and Regularization lacks consideration of an (external) language model during train-
Optimization usually is based on stochastic gradient de- ing. However, these potential shortcomings may be remedied
scent [176], with momentum [177], [178], and a number of cor- by using sequence discriminative training criteria [127] and
responding adaptive approaches, most prominently Adam [179] lattice-free training approaches [69].
and variants thereof [145], [179], [180]. In contrast to strict E2E systems, the classical ASR architec-
Investing more training epochs seems to provide improve- ture includes the use of secondary knowledge sources beyond the
ments [52, Table 8], and also averaging over epochs has been primary training data, i.e. (transcribed) speech audio for acoustic
reported to help [102]. For a discussion of the double descent model training, and textual data for language model training.
effect and its relation to the amount of training data, label noise Most prominently, this includes the use of a pronunciation lexi-
and early stopping cf. [181]. con and the definition of a phoneme set. Secondary resources like
Regularization strongly contributes to training performance: pronunciation lexica may be helpful in low-resource scenarios.
e.g., L2 and weight decay [166], [182]; weight noise [183]; However, their generation often is costly and may even introduce
adaptive mean L1/L2 [184]; gradient noise [185]; dropout [186], errors, like pronunciations from a lexicon not reflecting the
[187], [188], layer dropout [189], [190], [191]; dropcon- actual pronunciations observed. Therefore, for large enough
nect [192]; zoneout [193]; smoothing of attention scores [15]; training resources, secondary knowledge sources might become
label smoothing [194]; scheduled sampling [195]; auxiliary obsolete [209], or even harmful, in case of erroneous information
loss [194], [196]; variable backpropagation through time [197], introduced [210], [211].
[198]; mixup [199]; increased frame rate [180]; or, batch nor- Classical ASR models usually are trained successively, with
malization [200]. knowledge derived from models trained earlier injected into
later training stages, e.g. in the form of HMM state align-
ments. However, such approaches from classical ASR might
G. Data Augmentation also be interpreted as specific training schedules. Initializing
deep learning models using HMM alignments obtained from
Training of E2E ASR models also benefit from data augmen- acoustic models based on mixtures of Gaussians may be in-
tation methods, which might also be viewed as regularization terpreted in this way, with the Gaussian mixtures serving as an
methods. However, their diversity and impact on performance initial shallow model. In classical ASR, also approaches training
justifies a separate overview. deep neural networks from scratch while avoiding intermediate
Most data augmentation methods perform data perturbation training of Gaussians has been proposed [212], [213], [214], also
by exploiting certain dimensions of speech signal variation: in combination with character-level modeling [83]. Another step
speed perturbation [201], [202], vocal tract length perturba- towards more integrated training of classical systems has been
tion [201], [203], frequency axis distortion [201], sequence noise to apply discriminative training criteria avoiding intermediate
injection [204], SpecAugment [205], or semantic mask [206]. (usually lattice-based) representations of competing word se-
Also, text-only data may be used to generate data using text- quences [69], [136], [215], [216], [217].
to-speech (TTS) on feature [207] or signal level [208]. In a The training of classical ASR systems usually applies sec-
comparison of the effect of TTS-based data augmentation on ondary objectives to solve subtasks like phonetic clustering.
different E2E ASR architectures in [208], AED seemed to be The classification and regression trees (CART) approach is
the only architecture that appeared to benefit significantly from used to cluster triphone HMM states [27], [218]. More re-
the TTS data. cent approaches proposed clustering within a neural network
In a recent study [174] and corresponding follow-up modeling framework, while still retaining secondary cluster-
work [180], many of the regularization and data augmenta- ing objectives [213], [219]. However, also in E2E approaches
tion methods listed here have been exploited jointly leading secondary objectives are used, most prominently for subword
to state-of-the-art performance on the Switchboard task for a generation, e.g. via byte-pair encoding [32]. Also, available
single-headed AED model. pronunciation lexica can be utilized indirectly for assisting sub-
word generation for E2E systems [35], [36], which are shown to
outperform byte-pair encoding. Within classical ASR systems,
Relationship to Classical ASR phonetic clustering also can be avoided completely by modeling
E2E systems attempt to define ASR models that integrate phonemes in context directly [220].
all knowledge sources into a single global joint model that does It is interesting to observe that specifically attention-based
not utilize secondary knowledge sources and avoids the classical encoder-decoder models require larger numbers of training
separation into acoustic and language models. These global joint epochs to reach high performance, e.g. for a comparison of
models are completely trained from scratch using a single global systems trained on Switchboard 300 h cf. Table 5 in [221]. Also,
training criterion based on a single kind of (transcribed) training attention-based encoder-decoder models have been shown to
data and thus require less ASR domain-specific knowledge suffer from low training resources [222], [223], which can be
provided sufficient amounts of training data are available. improved by a number of approaches, including regularization
While standard hybrid ANN/HMM training for ASR using techniques [174] as well as data augmentation using SpecAug-
frame-wise cross entropy already is discriminative, it is not ment [224] and text-to-speech (TTS) [29]. SpecAugment also is
yet sequence discriminative, requires prior alignments and also shown to improve classical hybrid HMM models [225]. TTS on
the other hand considerably improved attention-based encoder- described below. However, it is known that the degradation of the
decoder models trained on limited resources, but did not reach greedy search algorithm is not very large [16], [46], especially
the performance of other E2E approaches or hybrid HMM mod- when the model is well trained in matched conditions.8
els, which in turn were not considerably improved by TTS [208].
Multilingual approaches also help improve ASR development B. Beam Search
for low resource tasks, again both for classical [226], as well as
The beam search algorithm is introduced to approximately
for E2E systems [227], [228].
consider a subset of possible hypotheses C˜ among all possible
hypotheses U ∗ during decoding, i.e., C˜ ⊂ U ∗ . A predicted output
VI. DECODING E2E MODELS
sequence Ĉ is selected among a hypothesis subset C˜ instead of
This section describes several decoding algorithms for end- all possible hypotheses U ∗ , i.e.,
to-end speech recognition. The basic decoding algorithm of end- Ĉ = arg max P (C|X) (6)
to-end ASR tries to estimate the most likely sequence Ĉ among C∈C˜
all possible sequences, as follows: The beam search algorithm is to find a set of possible hypotheses
˜ which can include promising hypotheses efficiently by avoid-
C,
Ĉ = arg max∗ P (C|X)
C∈U ing the combinatorial explosion encountered with all possible
The following section describes how to obtain the recognition hypotheses U ∗ .
result Ĉ. There are two major beam search categories: 1) frame syn-
chronous beam search and 2) label synchronous beam search.
A. Greedy Search The major difference between them is whether it performs
The Greedy search algorithm is mainly used in CTC, which hypothesis pruning for every input frame t or every output token
ignores the dependency of the output labels as follows: i. The following sections describe these two algorithms in more
T detail.

Â = arg max P (at |X)
at C. Label Synchronous Beam Search
t=1
where at is an alignment token introduced in Section III-B1. The Suppose we have a set of partial hypotheses up to (i − 1)th
original character sequence is obtained by converting alignment token C˜1:i−1 . A set of all possible partial hypotheses up to ith
token sequence Â to the corresponding token sequence Ĉ. The token C1:i is expanded from C˜1:i−1 as follows:
argmax operation can be performed in parallel over input frame
C1:i = {(C˜1:i−1 , ci = c)}c∈U (7)
t, yielding fast decoding [13], [229], although the lack of the
output dependency causes relatively poor performance than the The number of hypotheses |C1:i | would be |C˜1:i−1 | × |U |, at
attention and RNN-T based methods in general. most. The beam search algorithm prunes the low probability
CTC’s fast decoding is further boosted with transformer [44], score hypotheses from C1:i and only keeps a certain number
[98], [102] and its variants [45], [103] since their entire com- (beam size Δ) of hypotheses at i among C1:i . This pruning step
putation across the frames is parallelized [190], [230]. For ex- is represented as follows:
ample, the non-autoregressive models, including Imputer [231], C˜1:i = NBESTC ∈C P (C1:i |X), where |C˜1:i | = Δ (8)
1:i 1:i
Mask-CTC [230], Insertion-based modeling [232], Continuous Note that NBEST(·) is an operation to extract top Δ hypotheses
integrate-and-fire (CIF) [233] and other variants [234], [235] in terms of the probability score P (C1:i |X) computed from
have been actively studied as an alternative non-autoregressive an end-to-end neural network, or a fusion of multiple scores
model to CTC. [235] shows that CTC greedy search and its described in Section VII-B.
variants achieve 0.06 real-time factor (RTF)7 by using Intel(R) In the label synchronous beam search, the length of the
Xeon(R) Silver 4114 CPU, 2.20 GHz. The paper also shows output sequence (N ) is unknown. Therefore, during this pruning
that the degradation of the non-autoregressive models from the process, we also add the hypothesis that reaches the end of an
attention/RNN-T methods with beam search is not extremely utterance (i.e., predict the end of sentence symbol eos) to a set
large (19.7% with self-conditioned CTC [234] versus 18.5 and of hypotheses C˜ in (6) as a promising hypothesis.
18.9% with AED and RNN-T, respectively). The label synchronous beam search does not explicitly depend
The greedy search algorithm is also used as approximate on the alignment information; thus, it is often used in implicit
decoding for both implicit and explicit alignment modeling ap- alignment modeling approaches, including AED. Due to this
proaches, including AED, RNA, CTC, and RNN-T, as follows: nature, sequence hypotheses of the same length might cover
ĉi = arg max P (ci |Ĉ1:i−1 , X) for i = 1, . . . , N a completely different number of encoder frames, unlike the
ci
frame synchronous beam search, as pointed out by [40]. As a
ât = arg max P (at |Â1:t−1 , X) for t = 1, . . . , T result, we observe that the scores of very short and long segment
at
hypotheses often become the same range, and the beam search
The greedy search algorithm does not consider alternate hy-
potheses in a sequence compared with the beam search algorithm
8 On the other hand, in the AED models, increasing the search space does not
consistently improve the speech recognition performance [77], [236] – a fact
7 The ratio of the actual decoding time to the duration of the input speech. also observed in neural machine translation [237].
wrongly selects such hypotheses. [86] shows an example of such beam search together with CTC and neural LM, which applies
extreme cases, resulting in large deletion and insertion errors for on-the-fly rescoring to the hypotheses given by CTC prefix
short and long-segment hypotheses, respectively. Thus, the label search using the AED and LM scores.
synchronous beam search requires heuristics to limit the output
sequence length to avoid extremely long/short output sequences. E. Block-Wise Decoding
Usually, the minimum and maximum length thresholds are
determined proportionally to the input frame length |X| with Another beam search implementation uses a fixed-length
tunable parameters ρmin and ρmax as Lmin = ρmin |X| , Lmax = block unit for the input feature. In this block processing, we can
ρmax |X| . Although these are quite intuitive ways to control use the future context inside the block by using the non-causal
the length of a hypothesis, the minimum and maximum output encoder network based on the BLSTM, output-delayed unidi-
lengths depend on the token unit or type of script in each lan- rectional LSTM, or transformer (and its variants). This future
guage. Another heuristic is to provide an additional score related context information avoids the degradation of the fully causal
to the output length or attention weights – e.g., a length penalty, network. In this setup, the chunk size becomes the trade-off
and a coverage term [77], [238]. The end-point detection [239] is of controlling latency and accuracy. This technique is used in
also used to estimate the hypothesis length automatically. [236] both RNN-T [100], [250], [251] and AED [61], [252], [253],
redefines the implicit length model of the attention decoder to [254]. Block-wise processing is especially important for implicit
take into account beam search, resulting in consistent behavior alignment modeling approaches, including AED, since it can
without degradation for increasing beam sizes. provide block-wise monotonic alignment constraint between the
Note that there are several studies on applying label input feature and output label, and realize block-wise streaming
synchronous beam search to explicit alignment modeling decoding.
approaches. For example, label synchronous beam search algo-
rithms for CTC are realized by marginalizing all possible align- F. Model Fusion During Decoding
ments for each label hypothesis [13]. [240] extends CIF [233] Similar to the classical HMM-based beam search, we combine
to produce label-level encoder representation and realizes label various scores obtained from different modules, including the
synchronous beam search in RNN-T. main end-to-end ASR and LM scores.
1) Synchronous Score Fusion: The most simple score fusion
D. Frame Synchronous Beam Search is performed when the scores of multiple modules are synchro-
In contrast to the label synchronous case in (8), the frame nized. In this case, we can simply add the multiple scores at each
synchronous beam search algorithm performs pruning at every frame t or label i. The most well-known score combination is
input frame t, as follows: LM shallow fusion.
LM shallow fusion: As discussed in Section VII, various
C˜1:i(t) = NBESTC P (C1;i(t) |X), where |C˜1:i(t) | = Δ
1;i(t) neural LMs can be integrated with end-to-end ASR. The most
where C1;i(t) is an i(t)-length label sequence obtained from simple integration is based on LM shallow fusion [255], [256],
the alignment A1:t , which is introduced in Section III-B. [257], as discussed in Section VII-B1, which (log-) linearly adds
P (C1;i(t) |X) is obtained by summing up all possible alignments the LM score Plm (C1:i ) to E2E ASR scores P (C1:i |X) during
A1:t ∈ A(X,C1;i(t) ) . Unlike the label synchronous beam search, beam search in (8) as follows:
frame synchronous beam search depends on explicit alignment
log P (C1:i |X) → log P (C1:i |X) + γ log Plm (C1:i )
A; thus, it is often used for explicit alignment modeling ap-
proaches, including CTC, RNN-T, and RNA. C1:i(t) is an ex- where γ is a language model weight. Of course, we can combine
panded partial hypotheses up to input frame t, similar to (7). other scores, such as the length penalty and coverage terms, as
Compared with the label synchronous algorithm, the frame discussed in Section VI-C.u
synchronous algorithm needs to handle additional output to- 2) Asynchronous Score Fusion: If we combine the frame-
ken transitions inside the beam search algorithm. The frame dependent score functions, P (at |·), used in explicit alignment
synchronous algorithm can be easily extended in online and/or modeling approaches, e.g., CTC, RNN-T, and label-dependent
streaming decoding, thanks to the explicit alignment information score functions, P (ci |·), used implicit alignment modeling ap-
with input frame and output token. proaches, e.g., AED, language model, we have to deal with the
Classical approaches to beam search for HMM, but also CTC mismatch between the frame and label time indices t and i,
and RNN-T variants, are based on weighted finite state transduc- respectively.
ers (WFST) [38], [74], [241] or lexical prefix trees [106], [242], In the time-synchronous beam search, this fusion is performed
[243]. They are categorized as frame synchronous beam search. by incorporating the language model score in the label transition
These methods are often combined with an N-gram language state [22], [70], [258]. [247] also combines a word-based lan-
model or a full-context neural language model [244], [245]. guage model and token-based CTC model by incorporating the
RNN-T [14], [246] and CTC prefix search [247] can deal with language model score triggered by the word delimiter (space)
a neural language model by incorporating the language model symbol.
score in the label transition state. Interestingly, triggered atten- In the label-synchronous beam search, we first compute the
tion approaches [248], [249] allow us to use implicit alignment label-dependent scores from the frame-dependent score function
modeling approaches, including AED, in frame-synchronous by marginalizing all possible alignments given a hypothesis label
sequence. CTC/attention joint decoding [86] is a typical exam- a lattice. The lattice output can also be used for spoken term
ple, where the CTC score is computed by marginalizing all possi- detection, spoken language understanding, and word posteriors.
ble alignments based on the CTC forward algorithm [229]. This However, due to the lack of Markov assumptions, RNN-T and
approach eliminates the wrong alignment issues and difficulties AED cannot merge the hypothesis and cannot generate a lattice
of finding the correct end of sentences in the label-synchronous straightforwardly, unlike the HMM-based or CTC systems. To
beam search [86]. tackle this issue, there are several studies of modifying these
Note that the model fusion method during beam search can models by limiting the output dependencies in the fixed length
realize simple one-pass decoding, while it limits the time unit (i.e., finite-history) [47], [267], or keeping the original RNN-
of the models to be the same or it requires additional dynamic T structure but merging the similar hypotheses during beam
programming to adjust the different time units, especially for search [107].
the label-synchronous beam search. This dynamic programming
computation becomes significantly large when the length of the I. Vectorization Across Both Hypotheses and Utterances
utterance becomes larger and requires some heuristics to reduce
the computational cost [259]. We can accelerate the decoding process by vectorizing mul-
tiple hypotheses during the beam search, where we replace
the score accumulation steps for each hypothesis with vector-
G. Lexical Constraint During Score Fusion
matrix operations for the vectorized hypotheses. This has been
Classically, we use a word-based language model to capture studied in RNN-T [22], [258], [268] and attention-based [259]
the contextual information with the word unit, and also consider models. This modification leverages the parallel computing
the word-based lexical constraint for ASR. However, end-to-end capabilities of multi-core CPUs, GPUs and TPUs, resulting in
ASR often uses a letter or token unit and it causes further significant speedups, while enabling multiple utterances to be
unit mismatch during beam search. As described in previous processed simultaneously in a batch. Major deep neural network
sections, the classical approach of incorporating the lexical and end-to-end ASR toolkits support this vectorization. For
constraint from the token unit to the word unit is based on a example, Tensorflow9 [269], and FAIRESEQ10 [270] provide
WFST. This method first makes a TLG transducer composed a vectorized beam search interface for a generic sequence to
of the token (T), word lexicon (L), and word-based language sequence task, and it can be used for attention-based end-to-
transducers (G) [74]. This TLG transducer has been used for end ASR. End-to-end ASR toolkits including ESPnet11 [259],
both CTC [74] and attention-based [53] models. ESPRESSO12 [261], LINGVO [271], and, RETURNN13 [272]
Another approach used in the time synchronous beam search also support the vectorized beam search algorithm.
is to insert the word-based language model score triggered by
the word delimiter (space) symbol [75]. To synchronize the Relationship to Classical ASR
word-based language model with a character-based end-to-end
ASR, [260] combines the word and character-based LMs with One of the most prominent properties shared between E2E and
the prefix tree representation, while [239], [261] uses look-ahead classical statistical ASR systems is the use of a single-pass de-
word probabilities to predict next characters instead of using the coding strategy, which integrates all knowledge sources involved
character-based LM. The prefix tree representation is also used (models, components), before coming to a final decision [123].
for the sub-word token unit case [262], [263]. This includes the use of full label context dependency both
for E2E systems [51], [77], [174], [229], [262], [273], [274],
H. Multi-Pass Fusion [275], as well as classical systems via full-context language
models [244], [245], [276], [277]. In classical ASR systems,
The previous fusion methods are performed during the beam even HMM alignment path summation may be retained in
search, which enables a one-pass algorithm. The popular alter- search [278]. Both E2E as well as classical ASR systems employ
native methods are based on multi-pass algorithms where we do beam search in decoding. However, compared to classical search
not care about the synchronization and perform n-best or lattice approaches, E2E beam search usually is highly simplified with
scoring by considering the entire context within an utterance. very small beam sizes around 1 to 100 [15], [16], [77], [147].
[16] uses the N-best rescoring techniques to integrate a word- Very small beam sizes also partly mask a length bias exhibited by
based language model. [55] combines forward and backward E2E attention-based encoder-decoder models [279], [280], thus
searches within a multi-pass decoding framework to combine trading model errors against search errors [281]. An overview of
bidirectional LSTM decoder networks. Recently two-pass al- approaches to handle the length bias beyond using small beam
gorithms of switching different end-to-end ASR systems have sizes in ASR is presented in [236].
been investigated, including RNN-T → AED [264]; CTC →
AED [265], [266]. This aims to provide streamed output in the
first pass and re-scoring with AED in the second pass to refine the 9 [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/
previous output, thus satisfying a real-time interface requirement contrib/seq2seq/BeamSearchDecoder
10 [Online]. Available: https://github.com/pytorch/fairseq/blob/master/
while providing high recognition performance.
fairseq/sequence_generator.py
In addition to the N-best output in the above discussion, there 11 [Online]. Available: https://github.com/espnet/espnet
is a strong demand for generating a lattice output for better 12 [Online]. Available: https://github.com/freewym/espresso
multi-pass decoding thanks to richer hypothesis information in 13 [Online]. Available: https://github.com/rwth-i6/returnn

Many classical ASR search paradigms are based on multipass such as characters, subwords, and single words, the probability
approaches that successively generate search space representa- distribution can be computed based on the chain rule as:
tions applying increasingly complex acoustic and/or language L+1

models [243], [282], [283]. However, multipass strategies also P (C) = P (ci |c0:i−1 )
are employed using E2E models, which however softens the i=1
E2E concept. Decoder model combination is pursued in a where ci denotes the i-th token of C, and c0:i−1 represents to-
two-pass approach, while even retaining latency constraints as ken sequence c0 , c1 , . . . , ci−1 , assuming c0 = sos and cL+1 =
in [87]. Further multipass approaches include E2E adaptation eos.
approaches [284], [285], [286], [287]. Most LMs are designed to provide the conditional probability
P (ci |c0:i−1 ), i.e., they are modeled to predict the next token
VII. LM INTEGRATION given a sequence of the preceding tokens. We briefly review
such LMs focusing on the different techniques to represent each
This section discusses language models (LMs) used for E2E
token, ci , and back-history, c0:i−1 .
ASR. Hybrid ASR systems have long been using a pretrained
1) N-Gram LM: N -gram LMs have long been used for
LM [2], whereas most end-to-end (E2E) ASR systems employ
ASR [2]. Early E2E systems in [53], [74], [77] also employed
a single E2E model that includes a network component acting
an N -gram LM. The N -gram models rely on the Markov
as an LM.14 For example, the prediction network of RNN-T
assumption that the probability distribution of the next token
and the decoder network of AED models take on the role of a
depends only on the previous N − 1 tokens, i.e., P (ci |c0:i−1 ) ≈
LM covering label back-histories. Therefore, E2E ASR does not
P (ci |ci−N +1:i−1 ), where N is typically 3 to 5 for word-based
seem to require external LMs. Nevertheless, many studies have
models and higher for sub-word and character-based models.
demonstrated that external LMs help improve the recognition
The maximum likelihood estimates of N -gram probabilities are
accuracy in E2E ASR.
determined based on the counts of N sequential tokens in the
There are presumably three reasons that E2E ASR still re-
training data set as:
quires an external LM:
K(ci−N +1 , . . . , ci )
a) Compensation for poor generalization: E2E models need P (ci |ci−N +1:i−1 ) =
to learn a more complicated mapping function than classical ci K(ci−N +1 , . . . , ci )
modular-based models such as acoustic models. Consequently, where, K(·) denotes the count of each token sequence. Since the
E2E models tend to face overfitting problems if the amount data size is finite, it is important to apply a smoothing technique
of training data is not sufficient. Pretrained LMs potentially to avoid estimating the probabilities based on zero or very small
compensate for the less generalized predictions made by E2E counts for rare token sequences. Those techniques compensate
models. the N -gram probabilities with lower order models, e.g., (N −
b) Use of external text data: E2E models need to be trained 1)-gram models, according to the magnitude of the count [288].
using paired speech and text data, while LMs can be trained However, since the N -gram probabilities still rely on the discrete
with only text data. Generally, text data can be collected more representation of each token and the history, they suffer from
easily than the paired data. The training speed of an LM is also data sparsity problems, leading to poor generalization.
faster than that of E2E models for the same number of sentences. The advantage of the N -gram models is their simplicity,
Accordingly, the LM can be improved more effectively with although they underperform state-of-the-art neural LMs. In the
external text data, providing additional performance gain to the training, the main step is to just count the N tuples in the data
ASR system. set, which is required only once. During decoding, the LM
c) Domain adaptation: Domain adaptation helps improve probabilities can be obtained very quickly by table lookup or
recognition accuracy when the E2E model is applied to a specific can be attached to a decoding graph, e.g., WFST, in advance.
domain. However, domain adaptation of the E2E model requires 2) FNN-LM: The feed-forward neural network (FNN) LM
a certain amount of paired data in the target domain. Also, when was proposed in [9], which estimates N -gram probabilities
multiple domains are assumed, it may be costly to maintain using a neural network. The network accepts N − 1 tokens, and
multiple E2E models for the domains the system supports. If a predicts the next token as:
pretrained LM for the target domain is available, it may more P (ci |ci−N +1:i−1 ) = softmax(Wo hi + bo )
easily improve recognition accuracy for domain-specific words
and speaking styles without updating the E2E model. hi = tanh(Wh ei + bh )
This section reviews various types of LMs used for E2E ASR ei = concat(E(ci−N +1 ), . . . , E(ci−1 ))
and fusion techniques to integrate LMs into E2E models.
where Wo and Wh are weight matrices, and bo and bh are bias
vectors. E(y) provides an embedding vector of c, and concat(·)
A. Language Models
operation concatenates given vectors.15 This model first maps
The LMs provide a prior probability distribution, P (C). If each input token to an embedding space, and then obtains
the sentence, C, can be decomposed into a sequence of tokens hidden vector, hi , as a context vector representing the previous
14 In the simplest case of a CTC model as in Fig. 2, the included LM component 15 We omit the optional direct connection from the embedding layer to the
however is limited to a label prior without label context. softmax layer in [9] for simplicity.
N − 1 tokens. Finally, it outputs the probability distribution of characters [294]. Moreover, they are highly parallelizable and
the next token through the softmax layer. Although this LM thus suitable for training the model with a large training data set.
still relies on the Markov assumption, it outperforms classical 5) Transformer LM: Transformer architecture [44] has been
N -gram LMs described in the previous section. The superior applied to LMs [295] and used for ASR [102], [296], where
performance of FNN-LM is primarily due to the distributed the LMs are designed as a Transformer decoder without any
representation of each token and the history. The LM learns to inputs from other modules such as encoders. The hidden vector
represent token/context vectors such that semantically similar is computed as:
tokens/histories are placed close to each other in the embedding hi = FFN(hi ) + hi
space. Since this representation has a better smoothing effect
than the count-based one used for N -gram LMs, FNN-LM can hi = MHA(ei , e1:i , e1:i ) + ei
provide a better generalization than N -gram LMs for predicting where FFN(·) and MHA(·, ·, ·) denote a feed forward network
the next token. Neural network-based LMs basically utilize this and a multi-head attention module, respectively. The multihead
type of representation. attention and feed-forward blocks are typically stacked multiple
3) RNN-LM: A recurrent neural network (RNN) LM was times, e.g., 6 times [102], to obtain the final hidden vector.
introduced to exploit longer contextual information over N − 1 The advantage of Transformer LMs is that they can take all
previous tokens using recurrent connections [289]. Unlike FNN- tokens in the history into account through the self-attention
LM, the hidden vector is computed as: mechanism without summarizing them into a fixed-size memory
hi = recurrence(ei , hi−1 ) like RNN-LMs. Thus, the long history can be fully considered
with attention to predict the next token, achieving better per-
ei = E(ci−1 ) formance than RNN-LMs. However, the computational com-
where, recurrence(ei , hi−1 ) represents a recursive function, plexity increases quadratically as the length of the sequence.
which accepts previous hidden vector hi−1 with input ei , and Therefore, the history length is typically limited to a fixed size
outputs next hidden vector hi . In the case of simple (Elman-type) or within every single sentence. To overcome this limitation,
RNN, the function can be computed as Transformer-XL [297] reuses already computed activations,
recurrence(e, h) = tanh(Wh e + Wr h + bh ) which includes information on farther previous tokens, and the
model is trained with a truncated back-propagation through time
where, Wr is a weight matrix for the recurrent connection,
(BPTT) algorithm [298]. Compressive Transformer [299] ex-
which is applied to the previous hidden vector h. This recurrent
tends this approach to utilize even longer contextual information
loop makes it possible to hold the history information in the
by incorporating a compression step to keep older, but important,
hidden vector without limiting the history to N − 1 tokens.
information in a fixed-size memory network.
However, the history information decays exponentially as tokens
are processed with this recursion. Therefore, currently stacked
B. Fusion Approaches
LSTM layers are more widely used for the recurrent network,
which have separate internal memory cells and gating mecha- There are several ways to incorporate an external LM into
nisms to keep long-range history information [290]. With this E2E ASR, called LM fusion. Their purpose is to improve the
mechanism, RNN-LMs outperform other N -gram-based models recognition accuracy of E2E ASR by leveraging the benefits
in many tasks. of the external LM described in the first part of this section.
4) ConvLM: Convolutional neural networks (ConvLM) have However, there can be a mismatch in the prediction between the
also been applied to LMs [291], [292], [293]. ConvLM [292] E2E model and the LM when trained on different data sets, and
replace the recurrent connections used in RNN-LMs with gated therefore the LM may not collaborate well with the E2E model.
temporal convolutions. The hidden vector is computed as Researchers have investigated various LM fusion approaches to
hi = hi ⊗ σ(gi ) reduce the mismatch between models in different situations.
1) Shallow Fusion: Shallow fusion is the most popular ap-
hi = ei−k+1:i ∗ W + b proach to combine the pretrained E2E model and LM in the
gi = ei−k+1:i ∗ V + c inference time. As we described in Section VI-F, shallow fu-
sion simply combines the E2E and LM scores by a log-linear
where ⊗ is element-wise multiplication, ∗ is a temporal convo-
combination as
lution operation, and k is the patch size. σ(gi ) represents a gating
function of convoluted activation hi , and is modeled as a sigmoid Score(C|X) = log P (C|X) + γ log P (C) (9)
function. W and V are matrices for convolution and b and c are where γ is a scaling factor for the LM [255], [256], [257]. The
bias vectors. The convolution and gating blocks are typically advantage of this approach is that it is easy and effective when
stacked multiple times with residual connections. In [293], a there are no major mismatches between the source and target
ConvLM with 14 blocks has been applied for E2E ASR. Similar domains.
to FNN-LM, ConvLM allow us to use only a fixed history size, 2) Deep Fusion: Deep fusion [300] is an approach to com-
but they are more parameter efficient and easier to utilize longer bine an LM with an E2E model using a joint network. Given a
histories than the FNN-LM by stacking the layers. Thus, they pretrained E2E model and an LM, all the network parameters
achieve competitive performance to that of RNN-LMs [292], are fine-tuned jointly so that the models collaborate better to im-
even with the finite history consisting of short tokens such as prove the recognition accuracy, where the joint network is used
to combine the E2E and LM states through a gating mechanism

that controls the contribution of the LM according to the current
state.
3) Cold Fusion: Cold fusion [301] is another approach to
combine a pretrained LM like deep fusion, but the E2E model
is learned while freezing the LM parameters. Since the E2E
model is aware of the LM throughout training, it learns to use
the LM to reduce language specific information and capture
only the relevant information to map the source to the target
sequence. This mechanism reduces the role of LM in the E2E
model and alleviates the language bias of the training data.
Accordingly, the E2E model becomes more robust to domain
mismatches between the training data and the target domain.
Fig. 9. E2E ASR performance improvement in the switchboard task.
Unlike deep fusion, cold fusion makes it possible to combine
the E2E model with a pretrained LM for the target domain,
improving the recognition accuracy. Component fusion [302] C. Use of Large-Scale Pretrained LMs
extends cold fusion to use a pretrained LM with transcriptions of In recent years, LMs trained with large-scale text data are
the training data for the E2E model, more focusing on reducing available for different NLP tasks. BERT [306] and GPT-2 [307]
the bias of the training data. are representative models based on Transformer LMs. Such LMs
4) Internal LM Estimation: There is another approach to have also been applied to E2E ASR systems in different ways,
reduce language bias in training data through shallow fusion. e.g., N-best rescoring [308] and dialog context embedding [309].
The language bias is a problem when a big domain mismatch
exists between the source domain (training data) and the target
domain (test data) because the E2E model scores are strongly Relationship to Classical ASR
dependent on the language priors in the source domain. To The architecture of classical ASR systems provides a separa-
remove such a bias from the score, we can explicitly estimate tion between the acoustic model and the language model. In con-
the LM that represents the language priors, called Internal LM, trast to this, E2E models avoid this separation and define a joint
and subtract the LM score from the ASR score of (9): model. While this allows for training with a single objective, it
limits training of the (implicit) prior to the transcriptions of the
Score(C|X) = log Pϕ (C|X) − γϕ log Pϕ (C) + γτ log Pτ (C)
audio training data. To exploit further text-only training data,
where subscripts ϕ and τ indicate the source and target do- usually a separate LM is combined with E2E models, nonethe-
mains, respectively. γϕ and γτ are their scaling factors. Sub- less. However, due to the implicit prior of E2E models, i.e. the
tracting the internal LM score corresponds to approximating internal language model, combination with separate language
acoustic probability density Pϕ (X|C) because Pϕ (X|C) ∝ models is not straightforward and requires corresponding inter-
Pϕ (C|X)/Pϕ (C) is satisfied for fixed X, where the ASR score nal language model estimation and compensation approaches,
can be seen as a classical hybrid ASR system. Accordingly, e.g. [47], [128], [303], [304], [310]. At least from the recognition
the subtracted E2E model score plays a role of acoustic model accuracy perspective, it remains unclear, if the clear separation
and makes it more domain independent in terms of language, of acoustic modeling and language modeling in the classical
achieving a higher recognition accuracy in combination with ASR architecture is a disadvantage because of separate training
the external LM Pτ (C). objectives, or rather an advantage, since text-only training data
The density ratio method [303] trains an internal LM using the may be used easily. Also, the language model training objective,
transcript of the training data. Hybrid autoregressive transducer i.e. language model perplexity, is observed to correlate well
(HAT) [47] extends RNN-T so that the model becomes the with word error rate [311], [312], [313], [314]. Furthermore,
internal LM when the encoder output is eliminated, i.e., set to discriminative approaches to language modeling [315] may be
zero. This approach simplifies the framework by utilizing the viewed as a step towards joint modeling.
prediction network as the internal LM, which avoids training an
additional LM and using it in the inference time. In the work
of [304], an approach similar to HAT has been proposed where VIII. OVERALL PERFORMANCE TRENDS OF E2E APPROACHES
IN COMMON BENCHMARKS
the internal LM is formulated on top of standard RNN-T and
attention-based encoder-decoder models, respectively. In [128], This section summarizes various techniques with the common
several techniques to estimate internal LMs have been proposed ASR benchmarks based on switchboard (SWBD) [316] in Fig. 9
for AED models, where an estimated bias vector is fed to the and Librispeech [317] in Fig. 10 to see the trajectory of the
LM instead of a zero vector. The bias vector can be estimated techniques developed in end-to-end ASR. We choose these two
by averaging encoder states or context vectors, or by a small databases because they are widely used in speech and machine
LSTM predicting the context vector based on the decoder label learning communities and cover spontaneous (SWBD) and read
context, only. These techniques to estimate the internal LM were speech (Librispeech) speaking styles. Figs. 9 and 10 show that
also evaluated for RNN-T in [305]. the performance improvement relative to the initial works [79],
IX. DEPLOYMENT OF E2E MODELS

Many of the ideas discussed in this paper have been explored
by various industry research labs [265], [326], [327], [328],
[329], [330], [331], inter alia. In this section, we review the
development of on-device production-level systems at Google
as a typical case study for deployment.
The first streaming E2E model, deployed to production,
was launched in 2019 for the Pixel 4 smartphone [22], [332].
This model used a streaming RNN-T first-pass system, while
re-scoring first-pass hypotheses with an AED system in the
second pass. In addition, FST-based contextual biasing [92] was
employed in the model, which was critical to obtain accurate
Fig. 10. E2E ASR performance improvement in the Librispeech task. results for diverse queries. This model ran on CPU and was
much faster than real time.
[147] based on the E2E models is significant, and the error rates In 2020, for the Pixel 5 smartphone [333], the system was
of all tasks become less than half of the original error rates!16 improved further to reduce user-perceived latency (i.e., the time
Although the overall trends show that the ASR performance between when the user speaks, and when words appear on
has steadily improved over time, there are several remarkable the device). This included advancements such as end-to-end
gains. One significant gain observed in both benchmarks in endpointing [113] to encourage faster microphone closing; as
the middle of 2019 comes from the data augmentation method well as FastEmit [91] to encourage the model to emit tokens
represented by SpecAugment [205], [206], as discussed in earlier.
Section V-G. The subsequent gains mostly come from the explo- Finally, in 2021 the model was further improved for the Pixel
ration of the new neural network architectures, including trans- 6 smartphone [334], to take advantage of the tensor processing
former [102], [318], conformer [45], [103], and contextnet [97] unit (TPU) [85] on the device. Improvements include the use
on top of SpecAugment, as discussed in Section IV-C. Such an of conformer layers for the encoder [45]; a small embedding
exploration is also performed in language modeling to improve prediction network for the decoder [104]; a 2-pass cascaded
the ASR performance [102], [296]. The final gain observed in encoder to run a 2nd-pass beam search [89]; and, a neural LM
the Librispeech benchmark in 2021 is based on self-supervised re-scorer to help improve accuracy long-tail named entities. This
learning [25], [319] and semi-supervised learning [320], [321]. model is the best ASR system that Google has released to date,
These techniques utilize a considerable amount of unlabeled both in terms of quality and latency.
in-domain speech data (e.g., Libri-light 60 K hours [322]).
X. AREAS FOR FUTURE WORK
Relationship to Classical ASR
Currently, E2E models dominate the academic debate on
Speech recognition research has always been pushed by inter- ASR. However, at least partly, this is not (yet?) reflected in the
national evaluation campaigns (e.g. as lead by NIST) and corre- corresponding commercial deployment of E2E ASR architec-
sponding benchmark tasks. The competition between classical tures. E2E models are not yet the perfect match for all ASR
and E2E approaches is nicely reflected in the widely used Lib- conditions and further research is needed to take full advantage
rispeech [317] and Switchboard [316] tasks, showing that E2E of the benefits of E2E modeling.
models gain momentum. As shown in Fig. 10, on Librispeech, E2E models seem to perform really well when training data
the current best-published classical hybrid systems range around is abundant, while not scaling well to low-resource conditions.
2.3% (test-clean) and 4.9% (test-other) word error rate [222], Similarly, domain change requires a flexible exchange of lan-
[323], while there already are a number of E2E systems pro- guage models, which is natural for classical ASR models based
viding similar performance [205], [206], [224], [320], with on a separation of acoustic and language models. Ongoing
some E2E systems clearly outperforming former state-of-the-art research on the use of external language models in E2E models
results with word error rates down to 1.8% (test-clean) and 3.7% and internal language model estimation already is promising,
(test-other) [324] with similar results reported in [45], [97]. but can be expected to see further improvements.
Merging insights from classical HMM-based and monotonic Top E2E ASR systems usually require orders of magnitude
RNN-T provided similarly well results with a limited training more training epochs than comparable classical ASR systems,
budget [124]. Finally, when trained on Switchboard 300 h, the and further research into efficient and robust optimization and
current best result, obtained with an E2E system [180] is 5.4% training schedules is needed.
compared to 6.6% word error rate for the best hybrid system The high level of integration of E2E models also involves a
result [325] on the HUB5’00 Switchboard test set, in Fig. 9. loss in modularity, which might support the explainability and
reusability of models. Also, more efficient training schedules
16 For readers who want to know the latest update of these benchmarks can also
might take advantage of modularity. One assumed advantage of
check https://github.com/syhw/wer_are_we and https://github.com/thu-spmi/ E2E models is that everything is trained from data and secondary
ASR-Benchmarks/blob/main/README.md. knowledge sources (e.g. pronunciation lexica and phoneme sets)
are avoided. However, rare events, like rare words in ASR still multilingual ASR; adaptation to new application domains, and
provide a challenge, which needs further research. speakers; etc.), which we do not cover due to space limitations.
With the missing separation of acoustic and language models,
the question arises of how to exploit text-only resources in ACKNOWLEDGMENT
E2E model training - do we foresee solutions beyond training
data generation using TTS? We note that a number of recent The authors would like to thank Julian Dierkes, Yifan Peng,
works have explored approaches to combine speech and text Zoltán Tüske, Albert Zeyer, and Wei Zhou for their help on
modalities by attempting to implicitly or explicitly map them refining our manuscript.
into a shared space [159], [335], [336], [337], [338], [339], [340],
[341]. Furthermore, high-performance E2E solutions exist for REFERENCES
both discriminative problems like ASR, as well as generative [1] T. Bayes, “An essay towards solving a problem in the doctrine of
problems like TTS, how can both be exploited jointly to support chances,” Philos. Trans. Roy. Soc. London, vol. 53, pp. 370–418, 1763.
[2] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA,
semi-supervised training based on text-only and/or audio-only USA: MIT Press, 1997.
data on top of transcribed speech audio [28], [342]? [3] L. R. Rabiner, “A tutorial on hidden Markov models and selected appli-
For AED architectures, we observe a length bias, which cations in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286,
Feb. 1989.
complicates the decoding process. Although many heuristics [4] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A
are known to tackle length bias in AED, we are still missing Hybrid Approach. Norwell, MA, USA: Kluwer Academic Publishers,
a well-founded explanation for it, as well as a corresponding 1993.
[5] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using
remedy of the original model. context-dependent deep neural networks,” in Proc. Interspeech, 2011,
Other open research problems include speaker adaptation and pp. 437–440.
robustness to recording conditions, especially in mismatch sit- [6] V. Fontaine, C. Ris, and H. Leich, “Nonlinear discriminant analysis for
improved speech recognition,” in Proc. Eurospeech, 1997, pp. 1–4.
uations. The E2E principle also provides a promising candidate [7] H. Hermansky, D. P. W. Ellis, and S. Sharma, “Tandem connectionist
to solve multichannel ASR by providing an E2E solution jointly feature extraction for conventional HMM systems,” in Proc. IEEE Int.
tackling the source separation, speaker diarization and speech Conf. Acoust., Speech, Signal Process., 2000, vol. 3, pp. 1635–1638.
[8] M. Nakamura and K. Shikano, “A study of english word category
recognition problem [26], [343]. prediction based on neural networks,” in Proc. IEEE Int. Conf. Acoust.,
Finally, we need to investigate, if E2E is a suitable guiding Speech, Signal Process., 1989, pp. 731–734.
principle, and how different E2E ASR models relate to each [9] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language
model,” in Proc. Neural Inf. Process. Syst., 2000, pp. 932–938.
other as well as to classical ASR approaches. The most important [10] H. Schwenk and J.-L. Gauvain, “Connectionist language modeling for
guiding principle of ASR research and development has been large vocabulary continuous speech recognition,” in Proc. IEEE Int. Conf.
performance, and ASR has been boosted strongly by widely Acoust., Speech, Signal Process., 2002, pp. 765–768.
[11] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep
used benchmark tasks and international evaluation campaigns. neural networks using raw time signal for LVCSR,” in Proc. Interspeech,
With the current diversity of classical and E2E models, we also 2014, pp. 890–894.
need to resolve the question of what constitutes state-of-the-art [12] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and
A. Senior, “Speaker location and microphone spacing invariant acoustic
in ASR today, and can we expect a common state-of-the-art ASR modeling from raw multichannel waveforms,” in Proc. IEEE Autom.
architecture in the future? Speech Recognit. Understanding, 2015, pp. 30–36.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connection-
ist temporal classification: Labelling unsegmented sequence data with
XI. CONCLUSION recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2006,
pp. 369–376.
In this work, we presented a detailed overview of end-to-end [14] A. Graves, “Sequence transduction with recurrent neural networks,” in
Proc. Int. Conf. Mach. Learn., Edinburgh, Scotland, Jun. 2012.
approaches to ASR. Such models, which have grown in popular- [15] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
ity over the last few years, propose to use highly integrated neural “Attention-based models for speech recognition,” in Proc. Adv. Neural
network components which allow input speech to be converted Inf. Process. Syst., 2015, pp. 577–585.
[16] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A
directly into output text sequences through character-based out- neural network for large vocabulary conversational speech recognition,”
put units. Thus, such models eschew the classical modular ASR in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 4960–
architecture consisting of an acoustic model, a pronunciation 4964.
[17] P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar, “An end-to-end
model, and a language model, in favor of a single compact discriminative approach to machine translation,” in Proc. Assoc. Comput.
structure, and rely on the data to learn effectively. These design Linguistics, 2006, pp. 761–768.
choices enable the deployment of highly accurate on-device [18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P.
Kuksa, “Natural language processing (Almost) from scratch,” J. Mach.
speech recognition models (see Section IX), but also come with Learn. Res., vol. 12, pp. 2493–2537, 2011.
a number of downsides which are still areas of active research [19] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with
(see Section X). recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2014,
pp. 1764–1772.
Finally, we direct interested readers to Li’s excellent contem- [20] “Cambridge Dictionary,” Accessed: Feb. 21, 2020. [Online]. Available:
poraneous overview article on end-to-end ASR [344], which https://dictionary.cambridge.org/dictionary/english/end-to-end
offers a complementary perspective to our own. In particular, [21] R. Pang et al., “Compression of end-to-end models,” in Proc. Interspeech,
2018, pp. 27–31.
readers of [344] may find a more detailed exposition on the [22] Y. He et al., “Streaming end-to-end speech recognition for mobile
choice of encoder structure, and the applications of E2E ap- devices,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
proaches to allied ASR areas (e.g., multi-speaker recognition; Brighton, U.K., 2019, pp. 6381–6385.
[23] R. Schlüter and H. Ney, “Model-based MCE bound to the true bayes’ [50] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating
error,” IEEE Signal Process. Lett., vol. 8, no. 5, pp. 131–133, May 2001. gradients through stochastic neurons for conditional computation,” 2013,
[24] H. Ney, “On the relationship between classification error bounds and arXiv:1308.3432.
training criteria in statistical pattern recognition,” in Proc. Iberian Conf. [51] A. Tripathi, H. Lu, H. Sak, and H. Soltau, “Monotonic recurrent neural
Pattern Recognit. Image Anal., 2003, pp. 636–645. network transducer and decoding strategies,” in Proc. IEEE Autom.
[25] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “”wav2vec 2.0: A Speech Recognit. Understanding, 2019, pp. 944–948.
framework for self-supervised learning of speech representations,” in [52] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A new training
Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 12449–12460. pipeline for an improved neural transducer,” in Proc. Interspeech, 2020,
[26] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, “MIMO- pp. 2812–2816.
Speech: End-to-end multi-channel multi-speaker speech recognition,” in [53] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-
Proc. IEEE Autom. Speech Recognit. Understanding, 2019, pp. 237–244. to-end attention-based large vocabulary speech recognition,” in Proc.
[27] L. Breiman, J. Friedman, C. Stone, and R. Olshen, Classication and IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 4945–4949.
Regression Trees. Belmont, CA, USA: Taylor & Francis, 1984. [54] Y. Wu et al., “Google’s neural machine translation system: Bridging the
[28] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: gap between human and machine translation,” 2016, arXiv:1609.08144.
Speech chain by deep learning,” in Proc. IEEE Autom. Speech Recognit. [55] M. Mimura, S. Sakai, and T. Kawahara, “Forward-backward attention
Understanding Workshop, Dec. 2017, pp. 301–308. decoder,” in Proc. Interspeech, 2018, pp. 2232–2236.
[29] M. K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. [56] A. Graves, “Generating sequences with recurrent neural networks,” 2013,
Černocký, “Semi-supervised sequence-to-sequence ASR using unpaired arXiv:1308.0850.
speech and text,” in Proc. Interspeech, 2019, pp. 3790–3794. [57] J. Hou, S. Zhang, and L.-R. Dai, “Gaussian prediction based attention
[30] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic- for online end-to-end speech recognition,” in Proc. Interspeech, 2017,
to-word LSTM model for large vocabulary speech recognition,” in Proc. pp. 3692–3696, doi: 10.21437/Interspeech.2017-751.
Interspeech, 2017, pp. 3707–3711. [58] C.-C. Chiu et al., “A comparison of end-to-end models for long-form
[31] G. K. Zipf, Human Behavior and the Principle of Least Effort. Boston, speech recognition,” in Proc. IEEE Autom. Speech Recognit. Understand-
MA, USA: Addison-Wesley Press, 1949. ing, 2019, pp. 889–896.
[32] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of [59] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio,
rare words with subword units,” in Proc. Assoc. Comput. Linguistics, “An online sequence-to-sequence model using partial conditioning,” in
2015, pp. 1715–1725. Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 5067–5075.
[33] W. Chan, Y. Zhang, Q. Le, and N. Jaitly, “Latent sequence decom- [60] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and
positions,” in Proc. Int. Conf. Learn. Representations, 2017. [Online]. linear-time attention by enforcing monotonic alignments,” in Proc. Int.
Available: https://openreview.net/forum?id=SyQq185lg Conf. Mach. Learn., 2017, pp. 2837–2846.
[34] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-CTC: Automatic unit [61] C.-C. Chiu and C. Raffel, “Monotonic Chunkwise attention,” in Proc.
selection and target decomposition for sequence labelling,” in Proc. Int. Int. Conf. Learn. Representations, 2018.
Conf. Mach. Learn., Aug. 2017, pp. 2188–2197. [62] N. Arivazhagan et al., “Monotonic infinite lookback attention for simul-
[35] H. Xu, S. Ding, and S. Watanabe, “Improving end-to-end speech recog- taneous machine translation,” in Proc. Assoc. Comput. Linguistics, 2019,
nition with pronunciation-assisted sub-word modeling,” in Proc. IEEE pp. 1313–1323.
Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 7110–7114. [63] T. N. Sainath et al., “Improving the performance of online neural trans-
[36] W. Zhou, M. Zeineldeen, Z. Zheng, R. Schlüter, and H. Ney, “Acoustic ducer models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
data-driven subword modeling for end-to-end speech recognition,” in 2018, pp. 5864–5868.
Proc. Interspeech, 2021, pp. 2886–2890. [64] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end
[37] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2012, Process., 2019, pp. 5666–5670.
pp. 5149–5152. [65] A. Merboldt, A. Zeyer, R. Schlüter, and H. Ney, “An analysis
[38] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers of local monotonic attention variants,” in Proc. Interspeech, 2019,
in speech recognition,” Comput. Speech Lang., vol. 16, no. 1, pp. 69–88, pp. 1398–1402.
2002. [66] A. Zeyer, R. Schlüter, and H. Ney, “A Study of Latent Monotonic
[39] E. Beck, M. Hannemann, P. Doetsch, R. Schlüter, and H. Ney, “Seg- Attention Variants,” Mar. 2021, arXiv:2103.16710.
mental encoder-decoder models for large vocabulary automatic speech [67] A. Zeyer, R. Schmitt, W. Zhou, R. Schlüter, and H. Ney, “Monotonic
recognition,” in Proc. Interspeech, 2018, pp. 766–770. segmental attention for automatic speech recognition,” in Proc. IEEE
[40] W. Zhou, A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “Equivalence Spoken Lang. Technol. Workshop, 2023, pp. 229–236.
of segmental and neural transducer modeling: A proof of concept,” in [68] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous
Proc. Interspeech, 2021, pp. 2891–2895. transformers for end-to-end speech recognition,” in Proc. IEEE Int. Conf.
[41] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, Acoust., Speech, Signal Process., 2020, pp. 7884–7888.
“A comparison of sequence-to-sequence models for speech recognition,” [69] D. Povey et al., “Purely sequence-trained neural networks for ASR
in Proc. Interspeech, 2017, pp. 939–943. based on lattice-free MMI,” in Proc. Interspeech, 2016, pp. 2751–2755,
[42] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural doi: 10.21437/Interspeech.2016-595.
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [70] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2Letter: An end-to-end
[43] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation convnet-based speech recognition system,” 2016, arXiv:1609.03193.
by jointly learning to align and translate,” in Proc. Int. Conf. Learn. [71] P. Haffner, “Connectionist speech recognition with a global MMI algo-
Representations, 2015. rithm,” in Proc. Eurospeech, 1993, pp. 1929–1932.
[44] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. [72] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of gener-
Process. Syst., 2017, pp. 5998–6008. alized full-sum HMM training,” in Proc. Interspeech, 2017, pp. 944–948.
[45] A. Gulati et al., “Conformer: Convolution-augmented transformer for [73] T. Raissi, W. Zhou, S. Berger, R. Schlüter, and H. Ney, “HMM vs. CTC
speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040. for automatic speech recognition: Comparison based on full-sum training
[46] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: from scratch,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2023,
An encoder-decoder neural network model for sequence to sequence pp. 287–294.
mapping,” in Proc. Interspeech, 2017, pp. 1298–1302. [74] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech
[47] E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregressive recognition using deep RNN models and WFST-Based decod-
transducer (HAT),” in Proc. IEEE Int. Conf. Acoust., Speech, Signal ing,” in Proc. IEEE Autom. Speech Recognit. Understanding, 2015,
Process., 2020, pp. 6139–6143. pp. 167–174.
[48] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep [75] A. Hannun et al., “Deep Speech: Scaling up End-to-End Speech Recog-
recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech, nition,” 2014, arXiv:1412.5567.
Signal Process., 2013, pp. 6645–6649. [76] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural
[49] N. Moritz, T. Hori, S. Watanabe, and J. Le Roux, “Sequence transduction network encoder-decoder for large vocabulary end-to-end speech recog-
with graph-based supervision,” in Proc. IEEE Int. Conf. Acoust., Speech, nition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016,
Signal Process., 2022, pp. 7212–7216. pp. 5060–5064.
[77] J. Chorowski and N. Jaitly, “Towards better decoding and language model [102] S. Karita et al., “A comparative study on transformer vs RNN in speech
integration in sequence to sequence models,” in Proc. Interspeech, 2017, applications,” in Proc. IEEE Autom. Speech Recognit. Understanding
pp. 523–527. Workshop, 2019, pp. 449–456.
[78] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for [103] P. Guo et al., “Recent Developments on ESPNET Toolkit Boosted by
end-to-end speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Conformer,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
Signal Process., 2017, pp. 4845–4849. 2021, pp. 5874–5878.
[79] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, “Multitask learning with [104] R. Botros, T. Sainath, R. David, E. Guzman, W. Li, and Y. He,
low-level auxiliary tasks for encoder-decoder based speech recognition,” “Tied & reduced RNN-T decoder,” in Proc. Interspeech, 2021,
in Proc. Interspeech, 2017, pp. 3532–3536. pp. 4563–4567.
[80] A. Renduchintala, S. Ding, M. Wiesner, and S. Watanabe, “Multi-modal [105] M. Ghodsi, X. Liu, J. Apfel, R. Cabrera, and E. Weinstein, “RNN-
data augmentation for end-to-end ASR,” in Proc. Interspeech, 2018, Transducer with stateless prediction network,” in Proc. IEEE Int. Conf.
pp. 2394–2398. Acoust., Speech, Signal Process., 2020, pp. 7049–7053.
[81] S. Sabour, W. Chan, and M. Norouzi, “Optimal completion distilla- [106] W. Zhou, S. Berger, R. Schlüter, and H. Ney, “Phoneme based neural
tion for sequence learning,” in Proc. Int. Conf. Learn. Representations, transducer for large vocabulary speech recognition,” in Proc. IEEE Int.
2019. Conf. Acoust., Speech, Signal Process., 2021, pp. 5644–5648.
[82] C. Weng et al., “Improving attention based sequence-to-sequence mod- [107] R. Prabhavalkar et al., “Less is more: Improved RNN-T decoding using
els for end-to-end english conversational speech recognition,” in Proc. limited label context and path merging,” in Proc. IEEE Int. Conf. Acoust.,
Interspeech, 2018, pp. 761–765. Speech, Signal Process., 2021, pp. 5659–5663.
[83] D. Le, X. Zhang, W. Zheng, C. Fügen, G. Zweig, and M. L. Seltzer, [108] X. Chen, Z. Meng, S. Parthasarathy, and J. Li, “Factorized
“From senones to chenones: Tied context-dependent graphemes for neural transducer for efficient language model adaptation,” in
hybrid speech recognition,” in Proc. IEEE Autom. Speech Recognit. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022,
Understanding Workshop, 2019, pp. 457–464. pp. 8132–8136.
[84] S. Kanthak and H. Ney, “Context-dependent acoustic modeling using [109] Z. Meng et al., “Modular hybrid autoregressive transducer,” in Proc. IEEE
graphemes for large vocabulary speech recognition,” in Proc. IEEE Int. Spoken Lang. Technol. Workshop, 2023, pp. 197–204.
Conf. Acoust., Speech, Signal Process., 2002, pp. 845–848. [110] T. Wang et al., “VioLA: Unified codec language models for speech
[85] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor recognition, synthesis, and translation,” 2023, arXiv:2305.16107.
processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Architecture, [111] P. K. Rubenstein et al., “AudioPaLM: A large language model that can
2017, pp. 1–12. speak and listen,” 2023, arXiv:2306.12925.
[86] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid [112] S.-Y. Chang, B. Li, and G. Simko, “A unified endpointer using multitask
CTC attention architecture for end-to-end speech recognition,” IEEE and multidomain training,” in Proc. IEEE Autom. Speech Recognit.
J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1240–1253, Dec. Understanding Workshop, 2019, pp. 100–106.
2017. [113] B. Li et al., “Towards fast and accurate streaming end-to-end ASR,”
[87] T. N. Sainath et al., “Two-pass end-to-end speech recognition,” in Proc. in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020,
Interspeech, 2019, pp. 2773–2777. pp. 6069–6073.
[88] K. Hu, T. N. Sainath, R. Pang, and R. Prabhavalkar, “Deliberation model [114] T. Yoshimura, T. Hayashi, K. Takeda, and S. Watanabe, “End-to-End
based two-pass end-to-end speech recognition,” in Proc. IEEE Int. Conf. automatic speech recognition integrated with CTC-Based voice activity
Acoust., Speech, Signal Process., 2020, pp. 7799–7803. detection,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
[89] A. Narayanan et al., “Cascaded encoders for unifying streaming and 2020, pp. 6999–7003.
non-streaming ASR,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal [115] Y. Fujita, T. Wang, S. Watanabe, and M. Omachi, “Toward streaming ASR
Process., 2021, pp. 5629–5633. with non-autoregressive insertion-based model,” in Proc. Interspeech,
[90] A. Tripathi, J. Kim, Q. Zhang, H. Lu, and H. Sak, “Transformer trans- 2021, pp. 3740–3744.
ducer: One model unifying streaming and non-streaming speech recog- [116] Y. Bengio, “Practical recommendations for gradient-based training of
nition,” 2020, arXiv:2010.03192. deep architectures,” in Neural Networks: Tricks of the Trade, 2nd ed.,
[91] J. Yu et al., “Universal ASR: Unify and improve streaming asr with full- Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, arXiv:1206.5533.
context modeling,” in Proc. Int. Conf. Learn. Representations, 2021. [117] J. Schmidhuber, “Deep learning in neural networks: An overview,”
[92] D. Zhao et al., “Shallow-fusion end-to-end contextual biasing,” in Proc. Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
Interspeech, 2019, pp. 1418–1422. [118] L. Baum, “An inequality and associated maximization technique in
[93] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, statistical estimation for probabilistic functions of Markov processes,”
“Deep context: End-to-end contextual speech recognition,” in Proc. IEEE Inequalities, vol. 3, pp. 1–8, 1972.
Spoken Lang. Technol. Workshop, 2018, pp. 418–425. [119] L. Rabiner and B.-H. Juang, “An introduction to hidden Markov models,”
[94] S. Kim and F. Metze, “Dialog-context aware end-to-end speech recogni- IEEE Trans. Acoust., Speech, Signal Process., vol. 3, no. 1, pp. 4–16,
tion,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2018, pp. 434–440. Jan. 1986.
[95] A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath, “Phoebe: [120] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe, “Neural network-
Pronunciation-aware contextualization for end-to-end speech recogni- Gaussian mixture hybrid for speech recognition or density estimation,”
tion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, in Proc. Adv. Neural Inf. Process. Syst., 1991, pp. 175–182.
pp. 6171–6175. [121] R. E. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton
[96] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani, “Aux- Univ. Press, 1957.
iliary feature based adaptation of end-to-end ASR systems,” in Proc. [122] A. Viterbi, “Error bounds for convolutional codes and an asymptoti-
Interspeech, 2018, pp. 2444–2448. cally optimal decoding algorithm,” IEEE Trans. Inf. Theory, vol. 13,
[97] W. Han et al., “ContextNet: Improving convolutional neural networks for pp. 260–269, Apr. 1967.
automatic speech recognition with global context,” in Proc. Interspeech, [123] H. Ney, “The use of a one-stage dynamic programming algorithm for
2020, pp. 3610–3614. connected word recognition,” IEEE Trans. Acoust., Speech, Signal Pro-
[98] L. Dong, S. Xu, and B. Xu, “Speech-transformer: A. no-recurrence cess., vol. 32, no. 2, pp. 263–271, Apr. 1984.
sequence-to-sequence model for speech recognition,” in Proc. IEEE Int. [124] W. Zhou, W. Michel, R. Schlüter, and H. Ney, “Efficient training of neural
Conf. Acoust., Speech, Signal Process., 2018, pp. 5884–5888. transducer for speech recognition,” in Proc. Interspeech, Incheon, Korea,
[99] Q. Zhang et al., “Transformer transducer: A streamable speech recogni- 2022, pp. 2058–2062.
tion model with transformer encoders and RNN-T loss,” in Proc. IEEE [125] A. Zeyer, R. Schlüter, and H. Ney, “Why does CTC result in Peaky
Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7829–7833. behavior?,” 2021, arXiv:2105.14849.
[100] C.-F. Yeh et al., “Transformer-transducer: End-to-snd speech recognition [126] A. Laptev, S. Majumdar, and B. Ginsburg, “CTC variations through
with self-attention,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal new WFST topologies,” in Proc. Interspeech, 2022, pp. 1041–1045,
Process., 2019, pp. 7829–7833. doi: 10.21437/interspeech.2022-10854.
[101] Y. Peng, S. Dalmia, I. Lane, and S. Watanabe, “Branchformer: Parallel [127] X. He, L. Deng, and W. Chou, “Discriminative learning in sequential
MLP-attention architectures to capture local and global context for speech pattern recognition–a unifying review for optimization-oriented speech
recognition and understanding,” in Proc. Int. Conf. Mach. Learn., 2022, recognition,” IEEE Signal Process. Mag., vol. 25, no. 5, pp. 14–36,
pp. 17627–17643. Sep. 2008.
[128] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and [150] Y.-C. Chen, S.-F. Huang, H.-y. Lee, Y.-H. Wang, and C.-H. Shen,
H. Ney, “Investigating methods to improve language model integration “Audio Word2vec: Sequence-to-sequence autoencoding for unsupervised
for attention-based encoder-decoder ASR models,” in Proc. Interspeech, learning of audio segmentation and representation,” IEEE/ACM Trans.
2021, pp. 2856–2860. Audio, Speech, Lang. Process., vol. 27, no. 9, pp. 1481–1493, Sep. 2019,
[129] N.-P. Wynands, W. Michel, J. Rosendahl, R. Schlüter, and H. Ney, doi: 10.1109/TASLP.2019.2922832.
“Efficient sequence training of attention models using approximative re- [151] S. Scanzio, P. Laface, L. Fissore, R. Gemello, and F. Mana, “On the use
combination,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., of a multilingual neural network front-end,” in Proc. Interspeech, 2008,
2022, pp. 8002–8006. pp. 2711–2714.
[130] Z. Yang, W. Zhou, R. Schlüter, and H. Ney, “Lattice-free sequence [152] Z. Tüske, J. Pinto, D. Willett, and R. Schlüter, “Investigation on cross-
discriminative training for phoneme-based neural transducers,” in Proc. and multilingual MLP features under matched and mismatched acoustical
IEEE Int. Conf. Acoust., Speech, Signal Process., 2023, pp. 1–5. conditions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
[131] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, “MMIE training 2013, pp. 7349–7353.
of large vocabulary recognition systems,” Speech Commun., vol. 22, no. 4, [153] S. Zhou, S. Xu, and B. Xu, “Multilingual end-to-end speech recog-
pp. 303–314, 1997. nition with a single transformer on low-resource languages,” 2018,
[132] D. Povey and P. Woodland, “Improved discriminative training techniques arXiv:1806.05059.
for large vocabulary continuous speech recognition,” in Proc. IEEE Int. [154] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, “Mas-
Conf. Acoust., Speech, Signal Process., 2001, pp. 45–48. sively multilingual adversarial speech recognition,” in Proc. North
[133] R. Schlüter, W. Macherey, B. Müller, and H. Ney, “Comparison of Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol.,
discriminative training criteria and optimization methods for speech 2019, pp. 96–108.
recognition,” Speech Commun., vol. 34, no. 3, pp. 287–310, May [155] W. Hou, Y. Dong, B. Zhuang, L. Yang, J. Shi, and T. Shinozaki, “Large-
2001. scale end-to-end multilingual speech recognition and language identifica-
[134] B. Kingsbury, “Lattice-based optimization of sequence classification tion with multi-task learning,” in Proc. Interspeech, 2020, pp. 1037–1041.
criteria for neural-network acoustic modeling,” in Proc. IEEE Int. Conf. [156] V. Pratap et al., “Massively multilingual ASR: 50 languages, 1 model, 1
Acoust., Speech, Signal Process., 2009, pp. 3761–3764. billion parameters,” in Proc. Interspeech, 2020, pp. 4751–4755.
[135] G. Heigold, R. Schlüter, H. Ney, and S. Wiesler, “Discriminative train- [157] B. Li et al., “Scaling end-to-end models for large-scale multilingual
ing for automatic speech recognition: Modeling, criteria, optimization, ASR,” in Proc. IEEE Autom. Speech Recognit. Understanding, 2021,
implementation, and performance,” IEEE Signal Process. Mag., vol. 29, pp. 1011–1018.
no. 6, pp. 58–69, Nov. 2012. [158] Y. Zhang et al., “BigSSL: Exploring the frontier of large-scale semi-
[136] W. Michel, R. Schlüter, and H. Ney, “Comparison of lattice-free and supervised learning for automatic speech recognition,” IEEE J. Sel. Topics
lattice-based sequence discriminative training criteria for LVCSR,” in Signal Process., vol. 16, no. 6, pp. 1519–1532, Oct. 2022.
Proc. Interspeech, 2019, pp. 1601–1605. [159] Z. Chen et al., “MAESTRO: Matched speech text representations through
[137] R. Prabhavalkar et al., “Minimum word error rate training for attention- modality matching,” in Proc. Interspeech, 2022, pp. 4093–4097.
based sequence-to-sequence models,” in Proc. IEEE Int. Conf. Acoust., [160] A. Radford et al., “Introducing whisper - robust speech recognition via
Speech, Signal Process., 2018, pp. 4839–4843. large-scale weak supervision,” Sep. 2022. [Online]. Available: https://
[138] C. Weng, C. Yu, J. Cui, C. Zhang, and D. Yu, “Minimum bayes risk openai.com/blog/whisper/
training of RNN-Transducer for end-to-end speech recognition,” in Proc. [161] T. P. Vogl, J. Mangis, A. Rigler, W. Zink, and D. Alkon, “Accelerating
Interspeech, 2020, pp. 966–970, doi: 10.21437/Interspeech.2020-1221. the convergence of the back-propagation method,” Biol. Cybern., vol. 59,
[139] M. K. Baskar, L. Burget, S. Watanabe, M. Karafiát, T. Hori, and J. no. 4, pp. 257–263, 1988.
H. Černockỳ, “Promising Accurate Prefix Boosting for Sequence-to- [162] N. S. Keskar and G. Saon, “A nonmonotone learning rate strategy for
Sequence ASR,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Pro- SGD training of deep neural networks,” in Proc. IEEE Int. Conf. Acoust.,
cess., May 2019, pp. 5646–5650. Speech, Signal Process., 2015, pp. 4974–4978.
[140] A. Tjandra, S. Sakti, and S. Nakamura, “Sequence-to-sequence asr opti- [163] S. Renals, N. Morgan, H. Bourlard, C. Wooters, and P. Kohn, “Connec-
mization via reinforcement learning,” in Proc. IEEE Int. Conf. Acoust., tionist speech recognition: Status and prospects,” Int. Conf. Sci. Inst.,
Speech, Signal Process., 2018, pp. 5829–5833. 1991, Tech. Rep. TR-OI-070.
[141] S. Karita, A. Ogawa, M. Delcroix, and T. Nakatani, “Sequence training [164] D. Johnson, D. Ellis, C. Oei, C. Wooters, and P. Faerber, “QuickNet,” Int.
of encoder-decoder model using policy gradient for end-to-end speech Conf. Swarm Intell., Berkeley, 2004. [Online]. Available: http://www.icsi.
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., berkeley.edu/Speech/qn.html
2018, pp. 5839–5843. [165] A. Senior, G. Heigold, M. Ranzato, and K. Yang, “An empirical study of
[142] W. Michel, R. Schlüter, and H. Ney, “Early stage LM integration using learning rates in deep neural networks for speech recognition,” in Proc.
local and global log-linear combination,” in Proc. Interspeech, 2020, IEEE Int. Conf. Acoust., Speech, Signal Process., 2013, pp. 6724–6728.
pp. 3605–3609. [166] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in
[143] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm Proc. Int. Conf. Learn. Representations, 2019.
for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, [167] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay
Jul. 2006. the learning rate, increase the batch size,” in Proc. Int. Conf. Learn.
[144] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer- Representations, 2018.
wise training of deep networks,” in Proc. Adv. Neural Inf. Process. Syst., [168] J. Howard and S. Ruder, “Universal language model fine-tuning for text
2006, pp. 153–160. classification,” in Proc. Assoc. Comput. Linguistics, 2018, pp. 328–339.
[145] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney, “A [169] M. Jaderberg et al., “Population based training of neural networks,” 2017,
comprehensive study of deep bidirectional LSTM RNNs for acoustic arXiv:1711.09846.
modeling in speech recognition,” in Proc. IEEE Int. Conf. Acoust., [170] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning
Speech, Signal Process., 2017, pp. 2462–2466. in neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell.,
[146] A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible vol. 44, no. 9, pp. 5149–5169, Sep. 2021.
neural toolkit with application to translation and speech recognition,” in [171] J. L. Elman, “Learning and development in neural networks: The im-
Proc. Assoc. Comput. Linguistics, 2018, pp. 128–133. portance of starting small,” Cognition, vol. 48, no. 1, pp. 71–99, 1993,
[147] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to- doi: 10.1016/0010-0277(93)90058-4.
end attention models for speech recognition,” in Proc. Interspeech, 2018, [172] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum
pp. 7–11. learning,” in Proc. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[148] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A comprehensive [173] D. Amodei et al., “Deep speech 2: End-to-end speech recognition
analysis on attention models,” in Proc. Adv. Neural Inf. Process. Syst., in english and mandarin,” in Proc. Int. Conf. Mach. Learn., 2016,
2018. pp. 173–182.
[149] Y. Chung, C. Wu, C. Shen, H. Lee, and L. Lee, “Audio Word2Vec: [174] Z. Tüske, G. Saon, K. Audhkhasi, and B. Kingsbury, “Single headed
Unsupervised learning of audio segment representations using sequence- attention based sequence-to-sequence model for state-of-the-art results
to-sequence autoencoder,” in Proc. Interspeech, 2016, pp. 765–769. on switchboard,” in Proc. Interspeech, 2020, pp. 551–555.
[175] W. Zhang, X. Chang, Y. Qian, and S. Watanabe, “Improving end-to- [202] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation
end single-channel multi-talker speech recognition,” IEEE/ACM Trans. for speech recognition,” in Proc. Interspeech, 2015, pp. 3586–3589.
Audio, Speech, Lang. Process., vol. 28, pp. 1385–1394, 2020. [203] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP)
[176] B. Polyak, “Some methods of speeding up the convergence of iteration improves speech recognition,” in Proc. Int. Conf. Mach. Learn.,
methods,” USSR Comput. Math. Math. Phys., vol. 4, no. 5, pp. 1–17, 2013.
1964, doi: 10.1016/0041-5553(64)90137-5. [204] G. Saon, Z. Tüske, K. Audhkhasi, and B. Kingsbury, “Sequence noise
[177] Y. Nesterov, “A method of solving a convex programming problem with injected training for end-to-end speech recognition,” in Proc. IEEE Int.
convergence rate O( k12 ),” Sov. Math. Doklady, vol. 27, pp. 372–376, Conf. Acoust., Speech, Signal Process., 2019, pp. 6261–6265.
1983. [205] D. S. Park et al., “SpecAugment on large scale datasets,” in Proc. IEEE
[178] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6879–6883.
initialization and momentum in deep learning,” in Proc. Int. Conf. Mach. [206] C. Wang et al., “Semantic mask for transformer based end-to-end speech
Learn., 2013, pp. 1139–1147. recognition,” in Proc. Interspeech, 2020, pp. 971–975.
[179] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [207] T. Hayashi et al., “Back-translation-style data augmentation for end-
in Proc. Int. Conf. Learn. Representations, 2015. to-end ASR,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2018,
[180] Z. Tüske, G. Saon, and B. Kingsbury, “On the limit of english conversa- pp. 426–433.
tional speech recognition,” in Proc. Interspeech, 2021, pp. 2062–2066. [208] N. Rossenbach, M. Zeineldeen, B. Hilmes, R. Schlüter, and H. Ney,
[181] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Comparing the benefit of synthetic training data for various automatic
“Deep double descent: Where bigger models and more data hurt,” in Proc. speech recognition architectures,” in Proc. IEEE Autom. Speech Recognit.
Int. Conf. Learn. Representations, 2020. Understanding Workshop, 2021, pp. 788–795.
[182] A. Krogh and J. Hertz, “A simple weight decay can improve generaliza- [209] T. N. Sainath et al., “No need for a lexicon? Evaluating the value
tion,” in Proc. Neural Inf. Process. Syst., 1991, pp. 950–957. of the pronunciation lexica in end-to-end models,” in Proc. IEEE
[183] A. F. Murray and P. J. Edwards, “Enhanced MLP performance and fault Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 5859–5863,
tolerance resulting from synaptic weight noise during training,” IEEE doi: 10.1109/ICASSP.2018.8462380.
Trans. Neural Netw., vol. 5, no. 5, pp. 792–802, Sep. 1994. [210] C. Wooters and A. Stolcke, “Multiple-pronunciation lexical modeling in
[184] A. Graves, “Practical variational inference for neural networks,” in Proc. a speaker independent speech understanding system,” in Proc. Int. Conf.
Adv. Neural Inf. Process. Syst., 2011. Spoken Lang. Process., 1994, pp. 1363–1366.
[185] A. Neelakantan et al., “Adding gradient noise improves learning for very [211] I. McGraw, I. Badr, and J. R. Glass, “Learning lexicons from speech
deep networks,” Nov. 2015, arXiv:1511.06807. using a pronunciation mixture model,” IEEE/ACM Trans. Audio, Speech,
[186] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Lang. Process., vol. 21, no. 2, pp. 357–366, Feb. 2013.
Salakhutdinov, “Improving neural networks by preventing co-adaptation [212] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, “GMM-Free DNN
of feature detectors,” Jul. 2012, arXiv:1207.0580. acoustic model training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal
[187] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification Process., 2014, pp. 5602–5606, doi: 10.1109/ICASSP.2014.6854675.
with deep convolutional neural networks,” in Proc. Adv. Neural Inf. [213] G. Gosztolya, T. Grósz, and L. Tóth, “GMM-Free flat start
Process. Syst., 2012. sequence-discriminative DNN training,” in Proc. Interspeech, 2016,
[188] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: pp. 3409–3413, doi: 10.21437/Interspeech.2016-391.
Representing model uncertainty in deep learning,” in Proc. Int. Conf. [214] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “Flat-start single-
Mach. Learn., 2016, pp. 1050–1059. stage discriminatively trained HMM-based models for ASR,” IEEE/ACM
[189] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep Trans. Audio, Speech, Lang. Process., vol. 26, no. 11, pp. 1949–1961,
networks with stochastic depth,” in Proc. Eur. Conf. Comput. Vis., 2016, Nov. 2018.
pp. 646–661. [215] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig,
[190] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, and A. Waibel, “Very “The IBM 2004 conversational telephony system for rich transcription,”
deep self-attention networks for end-to-end speech recognition,” in Proc. in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2005, pp. 205–
Interspeech, 2019, pp. 66–70. 208.
[191] J. Lee and S. Watanabe, “Intermediate loss regularization for CTC-Based [216] H. Hadian, D. Povey, H. Sameti, J. Trmal, and S. Khudanpur,
speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal “Improving LF-MMI using unconstrained supervisions for ASR,”
Process., 2021, pp. 6224–6228. in Proc. IEEE Spoken Lang. Technol. Workshop, 2018, pp. 43–47,
[192] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization doi: 10.1109/SLT.2018.8639684.
of neural networks using DropConnect,” in Proc. Int. Conf. Mach. Learn., [217] N. Kanda, Y. Fujita, and K. Nagamatsu, “Lattice-free state-level mini-
2013, pp. 1058–1066. mum Bayes risk training of acoustic models,” in Proc. Interspeech, 2018,
[193] D. Krueger et al., “Zoneout: Regularizing RNNs by randomly preserving pp. 2923–2927, doi: 10.21437/Interspeech.2018-79.
hidden activations,” in Proc. Int. Conf. Learn. Representations, 2017. [218] S. J. Young and P. C. Woodland, “The use of state tying in continuous
[194] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking speech recognition,” in Proc. Eurospeech, 1993, pp. 2203–2206.
the inception architecture for computer vision,” in Proc. IEEE Conf. [219] S. Wiesler, G. Heigold, M. Nußbaum-Thom, R. Schlüter, and H. Ney,
Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826. “A discriminative splitting criterion for phonetic decision trees,” in Proc.
[195] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling Interspeech, 2010, pp. 54–57.
for sequence prediction with recurrent neural networks,” in Proc. Adv. [220] T. Raissi, E. Beck, R. Schlüter, and H. Ney, “Towards consistent hybrid
Neural Inf. Process. Syst., 2015. HMM acoustic modeling,” 2021, arXiv:2104.02387.
[196] T. Trinh, A. Dai, T. Luong, and Q. Le, “Learning longer-term dependen- [221] M. Zeineldeen, A. Zeyer, W. Zhou, T. Ng, R. Schlüter, and H. Ney, “A
cies in RNNs with auxiliary losses,” in Proc. Int. Conf. Mach. Learn., systematic comparison of grapheme-based vs. phoneme-based label units
2018, pp. 4965–4974. for encoder-decoder-attention models,” Nov. 2020, arXiv:2005.09336.
[197] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for [222] C. Lüscher et al., “RWTH ASR systems for LibriSpeech: Hybrid vs
on-line training of recurrent network trajectories,” IEEE Neural Comput., attention,” in Proc. Interspeech, 2019, pp. 231–235.
vol. 2, no. 4, pp. 490–501, Dec. 1990. [223] D. Park et al., “Improved noisy student training for automatic speech
[198] S. Merity, N. S. Keskar, and R. Socher, “An analysis of neural language recognition,” in Proc. Interspeech, 2020, pp. 2817–2821.
modeling at multiple scales,” 2018, arXiv:1803.08240. [224] D. S. Park et al., “SpecAugment: A simple data augmentation method
[199] L. Meng, J. Xu, X. Tan, J. Wang, T. Qin, and B. Xu, “MixSpeech: Data for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–
augmentation for low-resource automatic speech recognition,” in Proc. 2617.
IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 7008–7012. [225] W. Zhou, W. Michel, K. Irie, M. Kitza, R. Schlüter, and H. Ney, “The
[200] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep net- RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM
work training by reducing internal covariate shift,” in Proc. Int. Conf. with SpecAugment,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal
Mach. Learn., 2015, pp. 448–456. Process., 2020, pp. 7839–7843.
[201] N. Kanda, R. Takeda, and Y. Obuchi, “Elastic spectral distortion for low [226] J. Cui et al., “Multilingual representations for low resource speech
resource speech recognition with deep neural networks,” in Proc. IEEE recognition and keyword search,” in Proc. IEEE Autom. Speech Recognit.
Autom. Speech Recognit. Understanding, 2013, pp. 309–314. Understanding, 2015, pp. 259–266.
[227] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, “Massively [253] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, “Online hy-
multilingual adversarial speech recognition,” in Proc. North Amer. brid CTC/Attention architecture for end-to-end speech recognition,”
Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2019, in Proc. Interspeech, 2019, pp. 2623–2627, doi: 10.21437/Inter-
pp. 96–108. speech.2019-2018.
[228] A. Kannan et al., “Large-scale multilingual speech recognition with a [254] E. Tsunoo, Y. Kashiwagi, and S. Watanabe, “streaming transformer ASR
streaming end-to-end model,” in Proc. Interspeech, 2019, pp. 2130–2134. with blockwise synchronous beam search,” in Proc. IEEE Spoken Lang.
[229] A. Graves, “Connectionist Temporal Classification,” in Supervised Se- Technol. Workshop, 2021, pp. 22–29.
quence Labelling with Recurrent Neural Networks. Heidelberg, Ger- [255] K. Hwang and W. Sung, “Character-level language modeling with hi-
many: Springer, 2012, pp. 61–93. erarchical recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust.,
[230] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, “Mask Speech, Signal Process., 2017, pp. 5720–5724.
CTC: Non-autoregressive end-to-end ASR with CTC and mask predict,” [256] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-
in Proc. Interspeech, 2020, pp. 3655–3659. Attention based end-to-end speech recognition with a deep CNN encoder
[231] W. Chan, C. Saharia, G. Hinton, M. Norouzi, and N. Jaitly, “Imputer: and RNN-LM,” in Proc. Interspeech, 2017, pp. 949–953.
Sequence modelling via imputation and dynamic programming,” in Proc. [257] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and
Int. Conf. Mach. Learn., 2020, pp. 1403–1413. R. Prabhavalkar, “An analysis of incorporating an external lan-
[232] Y. Fujita, S. Watanabe, M. Omachi, and X. Chang, “Insertion-based mod- guage model into a sequence-to-sequence model,” in Proc. IEEE
eling for end-to-end automatic speech recognition,” in Proc. Interspeech, Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 5824–5828,
2020, pp. 3660–3664. doi: 10.1109/ICASSP.2018.8462682.
[233] L. Dong and B. Xu, “Cif: Continuous integrate-and-fire for end-to-end [258] G. Saon, Z. Tüske, D. Bolanos, and B. Kingsbury, “Advancing RNN
speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal transducer technology for speech recognition,” in Proc. IEEE Int. Conf.
Process., 2020, pp. 6079–6083. Acoust., Speech, Signal Process., 2021, pp. 5654–5658.
[234] J. Nozaki and T. Komatsu, “Relaxing the conditional independence [259] H. Seki, T. Hori, S. Watanabe, N. Moritz, and J. Le Roux, “Vectorized
assumption of CTC-Based ASR by conditioning on intermediate pre- beam search for CTC-Attention-Based speech recognition,” in Proc.
dictions,” in Proc. Interspeech, 2021, pp. 3735–3739. Interspeech, 2019, pp. 3825–3829.
[235] Y. Higuchi et al., “A comparative study on non-autoregressive modelings [260] T. Hori, S. Watanabe, and J. R. Hershey, “Multi-level language mod-
for speech-to-text generation,” in Proc. IEEE Autom. Speech Recognit. eling and decoding for open vocabulary end-to-end speech recogni-
Understanding, 2021, pp. 47–54. tion,” in Proc. IEEE Autom. Speech Recognit. Understanding, 2017,
[236] W. Zhou, R. Schlüter, and H. Ney, “Robust beam search for encoder- pp. 287–293.
decoder attention based speech recognition without length bias,” in Proc. [261] Y. Wang et al., “Espresso: A fast end-to-end neural speech recognition
Interspeech, 2020, pp. 1768–1772. toolkit,” in Proc. IEEE Autom. Speech Recognit. Understanding Work-
[237] P. Koehn and R. Knowles, “Six challenges for neural machine transla- shop, 2019, pp. 136–143.
tion,” in Proc. 1st Workshop Neural Mach. Transl., 2017, pp. 28–39. [262] Z. Tüske, K. Audhkhasi, and G. Saon, “Advancing sequence-to-sequence
[238] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, “Modeling coverage for based speech recognition,” in Proc. Interspeech, 2019, pp. 3780–3784.
neural machine translation,” in Proc. Assoc. Comput. Linguistics, 2016, [263] J. Drexler and J. Glass, “Subword regularization and beam search de-
pp. 76–85. coding for end-to-end automatic speech recognition,” in Proc. IEEE Int.
[239] T. Hori, J. Cho, and S. Watanabe, “End-to-end speech recognition with Conf. Acoust., Speech, Signal Process., 2019, pp. 6266–6270.
word-based rnn language models,” in Proc. IEEE Spoken Lang. Technol. [264] T. N. Sainath et al., “Two-pass end-to-end speech recognition,” in Proc.
Workshop, Dec. 2018, pp. 389–396. Interspeech, 2019, pp. 2773–2777.
[240] K. Deng and P. C. Woodland, “Label-synchronous neural transducer for [265] Z. Yao et al., “WeNet: Production oriented streaming and non-streaming
end-to-end ASR,” 2023, arXiv:2307.03088. end-to-end speech recognition toolkit,” in Proc. Interspeech, 2021,
[241] T. Hori and A. Nakamura, Speech Recognition Algorithms Using pp. 4054–4058.
Weighted Finite-State Transducers. San Rafael, CA, USA: Morgan & [266] D. Wu et al., “U2++: Unified two-pass bidirectional end-to-end model
Claypool Publishers, 2013. for speech recognition,” 2021, arXiv:2106.05642.
[242] R. Haeb-Umbach and H. Ney, “Improvements in beam search for 10000- [267] M. Zapotoczny, P. Pietrzak, A. Lancucki, and J. Chorowski, “Lattice
Word continuous-speech recognition,” IEEE Speech Audio Process., generation in attention-based speech recognition models,” in Proc. Inter-
vol. 2, no. 2, pp. 353–356, Apr. 1994. speech, 2019, pp. 2225–2229.
[243] H. Ney and S. Ortmanns, “Progress in dynamic programming search [268] J. Kim, Y. Lee, and E. Kim, “Accelerating RNN transducer inference
for LVCSR,” Proc. IEEE, vol. 88, no. 8, pp. 1224–1240, Aug. 2000, via adaptive expansion search,” IEEE Signal Process. Lett., vol. 27,
doi: 10.1109/5.880081. pp. 2019–2023, 2020.
[244] T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding [269] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,”
with recurrent neural network language model for speech recogni- in Proc. 12th USENIX Symp. Operating Syst. Des. Implementation, 2016,
tion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 265–283.
pp. 6364–6368. [270] M. Ott et al., “FAIRSEQ: A fast, extensible toolkit for sequence mod-
[245] E. Beck, W. Zhou, R. Schlüter, and H. Ney, “LSTM language modeling,” in Proc. North Amer. Chapter Assoc. Comput. Linguistics: Hum.
els for LVCSR in first-pass decoding and lattice-rescoring,” 2019, Lang. Technol., 2019, pp. 48–53.
arXiv:1907.01030. [271] J. Shen et al., “Lingvo: A modular and scalable framework for sequence-
[246] G. Saon, Z. Tüske, and K. Audhkhasi, “Alignment-length synchronous to-sequence modeling,” 2019, arXiv:1902.08295.
decoding for RNN transducer,” in Proc. IEEE Int. Conf. Acoust., Speech, [272] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney,
Signal Process., 2020, pp. 7804–7808. “RETURNN: The RWTH extensible training framework for universal
[247] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech,
vocabulary continuous speech recognition using bi-directional recurrent Signal Process., 2017, pp. 5345–5349.
DNNs,” 2014, arXiv:1408.2873. [273] A. Hannun, A. Lee, Q. Xu, and R. Collobert, “Sequence-to-sequence
[248] N. Moritz, T. Hori, and J. Le Roux, “Triggered Attention for End-to-End speech recognition with time-depth separable convolutions,” in Proc.
Speech Recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Interspeech, 2019, pp. 3785–3789.
Process., Brighton, U.K., May 2019, pp. 5666–5670. [274] M. Li, M. Liu, and H. Masanori, “End-to-end speech recognition with
[249] N. Moritz, T. Hori, and J. Le, “Streaming automatic speech recognition adaptive computation steps,” in Proc. IEEE Int. Conf. Acoust., Speech,
with the transformer model,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6246–6250.
Signal Process., 2020, pp. 6074–6078. [275] P. Bahar, N. Makarov, A. Zeyer, R. Schüter, and H. Ney, “Exploring a
[250] M. Jain et al., “RNN-T for latency controlled ASR with improved beam zero-order direct HMM based on latent attention for automatic speech
search,” 2019, arXiv:1911.01629. recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
[251] L. Lu, C. Liu, J. Li, and Y. Gong, “Exploring transformers for large-scale 2020, pp. 7854–7858.
speech recognition,” in Proc. Interspeech, 2020, pp. 5041–5045. [276] Z. Huang, G. Zweig, and B. Dumoulin, “Cache based recurrent neu-
[252] T. Wang, Y. Fujita, X. Chang, and S. Watanabe, “Streaming end-to-end ral network language model inference for first pass speech recogni-
ASR based on blockwise non-autoregressive models,” in Proc. Inter- tion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014,
speech, 2021, pp. 3755–3759. pp. 6354–6358.
[277] J. Jorge, A. Giménez, J. Iranzo-Sánchez, J. Civera, A. Sanchis, and A. [303] E. McDermott, H. Sak, and E. Variani, “A density ratio approach to
Juan, “Real-time one-pass decoder for speech recognition using LSTM language model fusion in end-to-end automatic speech recognition,” in
language models,” in Proc. Interspeech, 2019, pp. 3820–3824. Proc. IEEE Autom. Speech Recognit. Understanding Workshop, 2019,
[278] W. Zhou, R. Schlüter, and H. Ney, “Full-sum decoding for hybrid HMM pp. 434–441.
based speech recognition using LSTM language model,” in Proc. IEEE [304] Z. Meng et al., “Internal language model estimation for domain-adaptive
Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7834–7838. end-to-end speech recognition,” in Proc. IEEE Spoken Lang. Technol.
[279] P. Sountsov and S. Sarawagi, “Length bias in encoder decoder models Workshop, 2020, pp. 243–250.
and a case for global conditioning,” in Proc. Empirical Methods Natural [305] W. Zhou, Z. Zheng, R. Schlüter, and H. Ney, “On language model
Lang. Process., 2016, pp. 1516–1525. integration for RNN transducer based speech recognition,” in Proc. IEEE
[280] K. Murray and D. Chiang, “Correcting length bias in neural machine Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 8407–8411.
translation,” in Proc. WMT, 2018, pp. 212–223. [306] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training
[281] F. Stahlberg and B. Byrne, “On NMT search errors and model errors: Cat of deep bidirectional transformers for language understanding,” in Proc.
got your tongue?,” in Proc. Empirical Methods Natural Lang. Process., Assoc. Comput. Linguistics, 2019, pp. 4171–4186.
2019, pp. 3354–3360. [307] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
[282] N. Deshmukh, A. Ganapathiraju, and J. Picone, “Hierarchical search for “Language models are unsupervised multitask learners,” 2019, ope-
large-vocabulary conversational speech recognition: Working toward a nAI blog. [Online]. Available: https://cdn.openai.com/better-language-
solution to the decoding problem,” IEEE Signal Process. Mag., vol. 16, models/language_models_are_unsupervised_multitask_learners.pdf
no. 5, pp. 84–107, Sep. 1999. [308] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked language
[283] L. Nguyen and R. Schwartz, “Single-tree method for grammar-directed model scoring,” in Proc. Assoc. Comput. Linguistics, 2020, pp. 2699–
search,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1999, 2712.
pp. 613–616. [309] S. Kim, S. Dalmia, and F. Metze, “Gated embeddings in end-to-end
[284] L. Sarı, N. Moritz, T. Hori, and J. L. Roux, “Unsupervised speaker adapta- speech recognition for conversational-context fusion,” in Proc. Assoc.
tion using attention-based speaker memory for end-to-end ASR,” in Proc. Comput. Linguistics, 2019, pp. 1131–1141.
IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7384–7388. [310] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech
[285] F. Weninger, J. Andrés-Ferrer, X. Li, and P. Zhan, “Listen, attend, transducer model with internal language model prior correction,” in Proc.
spell and adapt: Speaker adapted sequence-to-sequence ASR,” in Proc. Interspeech, 2021, pp. 2052–2056.
Interspeech, 2019, pp. 3805–3809. [311] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach
[286] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Speaker adaptation for attention- to continuous speech recognition,” IEEE Trans. Pattern Anal. Mach.
based end-to-end speech recognition,” in Proc. Interspeech, 2019, Intell., vol. PAMI-5, no. 2, pp. 179–190, Mar. 1983.
pp. 241–245. [312] J. Makhoul and R. Schwartz, “State of the art in continuous speech
[287] N. Tomashenko and Y. Estève, “Evaluation of feature-space speaker recognition,” Proc. Nat. Acad. Sci., vol. 92, no. 22, pp. 9956–9963,
adaptation for end-to-end acoustic models,” in Proc. 11th Int. Conf. Lang. Oct. 1995.
Resour. Eval., 2018, pp. 3163–3170. [313] D. Klakow and J. Peters, “Testing the correlation of word error rate and
[288] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques perplexity,” Speech Commun., vol. 38, no. 1, pp. 19–28, 2002.
for language modeling,” in Proc. Assoc. Comput. Linguistics, 1996, [314] M. Sundermeyer, H. Ney, and R. Schlüter, “From feedforward to re-
pp. 310–318. current LSTM neural networks for language modeling,” IEEE/ACM
[289] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, Trans. Audio, Speech, Lang. Process., vol. 23, no. 3, pp. 517–529,
“Recurrent neural network based language model,” in Proc. Interspeech, Mar. 2015.
2010, pp. 1045–1048. [315] T. Hori, C. Hori, S. Watanabe, and J. R. Hershey, “Minimum word error
[290] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for training of long short-term memory recurrent neural network language
language modeling,” in Proc. Interspeech, 2012, pp. 194–197. models for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech,
[291] N.-Q. Pham, G. Kruszewski, and G. Boleda, “Convolutional neural net- Signal Process., 2016, pp. 5990–5994.
work language models,” in Proc. Joint Conf. Empirical Methods Natural [316] J. Godfrey, E. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone
Lang. Process., 2016, pp. 1153–1162. speech corpus for research and development,” in Proc. IEEE Int. Conf.
[292] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling Acoust., Speech, Signal Process., 1992, pp. 517–520.
with gated convolutional networks,” in Proc. 34th Int. Conf. Mach. Learn., [317] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An
2017, pp. 933–941. ASR corpus based on public domain audio books,” in Proc. IEEE Int.
[293] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, Conf. Acoust., Speech, Signal Process., 2015, pp. 5206–5210.
and R. Collobert, “Fully convolutional speech recognition,” 2018, [318] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, “A comparison of
arXiv:1812.06864. transformer and LSTM encoder decoder models for ASR,” in Proc. IEEE
[294] T. Likhomanenko, G. Synnaeve, and R. Collobert, “Who needs words? Autom. Speech Recognit. Understanding, 2019, pp. 8–15.
lexicon-free speech recognition,” in Proc. Interspeech, 2019, pp. 3915– [319] W.-N. Hsu et al., “HuBERT: Self-supervised speech representation learn-
3919. ing by masked prediction of hidden units,” IEEE/ACM Trans. Audio,
[295] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character- Speech, Lang. Process., vol. 19, pp. 3451–3460, 2021.
level language modeling with deeper self-attention,” in Proc. AIII, 2019, [320] G. Synnaeve et al., “End-to-end ASR: From supervised to semi-
pp. 3159–3166. supervised learning with modern architectures,” in Proc. Int. Conf. Mach.
[296] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with Learn., 2020.
deep transformers,” in Proc. Interspeech, 2019, pp. 3905–3909. [321] E. G. Ng, C.-C. Chiu, Y. Zhang, and W. Chan, “Pushing the limits
[297] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov, of non-autoregressive speech recognition,” in Proc. Interspeech, 2021,
“Transformer-XL: Attentive language models beyond a fixed-length con- pp. 3725–2729.
text,” in Proc. Assoc. Comput. Linguistics, 2019, pp. 2978–2988. [322] J. Kahn et al., “Libri-light: A benchmark for ASR with limited or no
[298] P. J. Werbos, “Backpropagation through time: What it does and how supervision,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990, 2020, pp. 7669–7673.
doi: 10.1109/5.58337. [323] Y. Wang et al., “Transformer-based acoustic modeling for hybrid speech
[299] J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
pressive transformers for long-range sequence modelling,” in Proc. Adv. 2020, pp. 6874–6878.
Neural Inf. Process. Syst., 2020, pp. 6154–6158. [324] K. Kim et al., “E-branchformer: Branchformer with enhanced merging
[300] C. Gulcehre et al., “On using monolingual corpora in neural machine for speech recognition,” in Proc. IEEE Spoken Lang. Technol. Workshop,
translation,” 2015, arXiv:1503.03535. 2023, pp. 84–91.
[301] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training [325] M. Kitza, P. Golik, R. Schlüter, and H. Ney, “Cumulative adap-
Seq2Seq models together with language models,” in Proc. Interspeech, tation for BLSTM acoustic models,” in Proc. Interspeech, 2019,
2018, pp. 387–391. pp. 754–758.
[302] C. Shan et al., “Component fusion: Learning replaceable language model [326] C.-C. Chiu et al., “State-of-the-art speech recognition with sequence-
component for end-to-end speech recognition system,” in Proc. IEEE Int. to-sequence models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal
Conf. Acoust., Speech, Signal Process., 2019, pp. 5361–5635. Process., 2018, pp. 4774–4778.
[327] K. Kim et al., “Attention based on-device streaming speech recognition Takaaki Hori (Senior Member, IEEE) received the
with large speech corpus,” in Proc. IEEE Autom. Speech Recognit. Ph.D. degree in system and information engineering
Understanding Workshop, 2019, pp. 956–963. from Yamagata University, Yonezawa, Japan, in 1999.
[328] J. Li et al., “Developing RNN-T models surpassing high-performance From 1999 to 2015, he was engaged in researches
hybrid models with customization capability,” in Proc. Interspeech,2020, on speech recognition and spoken language process-
pp. 3590–3594. ing with Cyber Space Laboratories and Communi-
[329] R. Hsiao, D. Can, T. Ng, R. Travadi, and A. Ghoshal, “Online automatic cation Science Laboratories, Nippon Telegraph and
speech recognition with listen, attend and spell model,” IEEE Signal Telephone Corporation, Tokyo, Japan. From 2015
Process. Lett., vol. 27, pp. 1889–1893, 2020. to 2021, he was a Senior Principal Research Scien-
[330] Y. Shi et al., “Emformer: Efficient memory transformer based acoustic tist with Mitsubishi Electric Research Laboratories,
model for low latency streaming speech recognition,” in Proc. IEEE Int. Cambridge, MA, USA. He is currently a Machine
Conf. Acoust., Speech, Signal Process., 2021, pp. 6783–6787. Learning Researcher with Apple. His research interests include automatic speech
[331] X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time recognition, spoken language understanding, and language modeling. During
streaming transformer transducer for speech recognition on large-scale 2020–2022, he was a Member of the IEEE Speech and Language Processing
dataset,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, Technical Committee.
pp. 5904–5908.
[332] T. N. Sainath et al., “A streaming on-device end-to-end model surpassing
server-side conventional model quality and latency,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process., 2020, pp. 6059–6063.
Tara N. Sainath (Fellow, IEEE) received the Ph.D.
[333] B. Li et al., “A better and faster end-to-end model for streaming
degree in electrical engineering and computer sci-
ASR,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021,
ence from the Massachusetts Institute of Technology,
pp. 5634–5638.
Cambridge, MA, USA, in 2009. The main focus of her
[334] T. N. Sainath et al., “An efficient streaming non-recurrent on-device
Ph.D. work was in acoustic modeling for noise robust
end-to-end model with improvements to rare-word modeling,” in Proc.
speech recognition. After her Ph.D., she spent five
Interspeech, 2021, pp. 1777–1781.
years with Speech and Language Algorithms Group,
[335] A. Bapna et al., “SLAM: A unified encoder for speech and language
IBM Thomas J. Watson Research Center, before join-
modeling via speech-text joint pre-training,” 2021, arXiv:2110.10329.
ing Google Research. She was the Program Chair
[336] A. Bapna et al., “mSLAM: Massively Multilingual Joint Pre-Training for
for ICLR in 2017 and 2018. She has co-organized
Speech and text,” 2022, arXiv:2202.01374.
numerous special sessions and workshops, including
[337] Y. Tang et al., “Unified speech-text pre-training for speech translation and
Interspeech 2010, ICML 2013, Interspeech 2016 and ICML 2017. In addition,
recognition,” in Proc. Assoc. Comput. Linguistics, 2022, pp. 1488–1499.
she is a Member of the IEEE Speech and Language Processing Technical
[338] Y.-A. Chung, C. Zhu, and M. Zeng, “SPLAT: Speech-language joint
Committee and an Associate Editor for IEEE/ACM TRANSACTIONS ON AUDIO,
pre-training for spoken language understanding,” in Proc. North Amer.
SPEECH, AND LANGUAGE PROCESSING.
Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2021,
pp. 1897–1907.
[339] J. Ao et al., “SpeechT5: Unified-modal encoder-decoder pre-training for
spoken language processing,” in Proc. Assoc. Comput. Linguistics, 2022,
pp. 5723–5738.
[340] S. Thomas, H. K. J. Kuo, B. Kingsbury, and G. Saon, “Towards reducing Ralf Schlüter (Senior Member, IEEE) received the
the need for speech training data to build spoken language understanding Dr.rer.nat. degree in computer science and the Ha-
systems,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022, bilitation degree in computer science from RWTH
pp. 7932–7936. Aachen University, Aachen, Germany, in 2000 and
[341] T. N. Sainath et al., “JOIST: A joint speech and text streaming model for 2009, respectively. In May 1996, he joined the Com-
ASR,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2023, pp. 52–59. puter Science Department, RWTH Aachen Univer-
[342] T. Hori, R. Astudillo, T. Hayashi, Y. Zhang, S. Watanabe, and J. Le Roux, sity, where he is currently a Lecturer and Academic
“Cycle-consistency training for end-to-end speech recognition,” in Proc. Director, leading the Automatic Speech Recognition
IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6271–6275. Group, the Chair Computer Science 6 – Machine
[343] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel Learning and Human Language Technology. In 2019,
end-to-end speech recognition,” in Proc. Int. Conf. Mach. Learn., 2017, he also joined AppTek GmbH Aachen as Senior Re-
pp. 2632–2641. searcher. His research interests include sequence classification, specifically all
[344] J. Li, “Recent advances in end-to-end automatic speech recogni- aspects of automatic speech recognition, decision theory, stochastic modeling,
tion,” APSIPA Trans. Signal Inf. Process, vol. 11, no. 1, Nov. 2021, and signal analysis. During 2013–2019, he was the Subject Editor of Speech
doi: 10.1561/116.00000050. Communication.
Shinji Watanabe (Fellow, IEEE) received the B.S.,

Rohit Prabhavalkar (Member, IEEE) received the M.S., and Ph.D. (Dr. Eng.) degrees from Waseda
Ph.D. degree in computer science and engineering University, Tokyo, Japan. He is currently an Associate
from The Ohio State University, Columbus, OH, Professor with Carnegie Mellon University, Pitts-
USA, in 2013. Following his Ph.D., he joined the burgh, PA, USA. He was a research Scientist with
Speech Technologies group, Google, where he is NTT Communication Science Laboratories, Kyoto,
currently a Staff Research Scientist. At Google, his Japan, from 2001 to 2011, a Visiting Scholar with
research has focused primarily on developing com- the Georgia Institute of Technology, Atlanta, GA,
pact acoustic models which can run efficiently on USA, in 2009, and a Senior Principal Research Scien-
mobile devices, and on developing improved end-to- tist with Mitsubishi Electric Research Laboratories,
end automatic speech recognition systems. He has Cambridge, MA, USA, from 2012 to 2017. Before
coauthored more than 50 refereed papers, which have Carnegie Mellon University, he was an Associate Research Professor with
received two best paper awards (ASRU 2017; ICASSP 2018). He is currently Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His
a Member of the IEEE Speech and Language Processing Technical Committee research interests include automatic speech recognition, speech enhancement,
during 2018–2024, and an Associate Editor for the IEEE/ACM TRANSACTIONS spoken language understanding, and machine learning for speech and language
ON AUDIO, SPEECH, AND LANGUAGE PROCESSING. processing. He is an ISCA Fellow.

End-to-End Speech Recognition: A Survey

Uploaded by

Copyright:

Available Formats

End-to-End Speech Recognition: A Survey

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

End-to-End Speech Recognition: A Survey

Uploaded by

Copyright:

Available Formats

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL.

32, 2024 325