
Automatic Speech Recognition: Introduction

Peter Bell

Automatic Speech Recognition — ASR Lecture 1


11 January 2020

ASR Lecture 1 Automatic Speech Recognition: Introduction 1


Automatic Speech Recognition — ASR
Course details
Lectures: About 18 lectures, delivered live on Teams for now
Labs: Weekly lab sessions – using Python, OpenFst (openfst.org) and later Kaldi (kaldi-asr.org)
Lab sessions will start in Week 3 – exact format TBA.
Assessment:
First five lab sessions worth 10%
Coursework, building on the lab sessions, worth 40%
Open book exam in April or May worth 50%
People:
Course organiser: Peter Bell
Guest lecturers: Hiroshi Shimodaira and Yumnah Mohammied
TA: Andrea Carmantini
Demonstrators: Chau Luu and Electra Wallington
http://www.inf.ed.ac.uk/teaching/courses/asr/


Your background
If you have taken:
Speech Processing and either of (MLPR or MLP): perfect!
Either of (MLPR or MLP) but not Speech Processing (probably you are from Informatics): you’ll require some speech background
A couple of the lectures will cover material that was in Speech Processing
Some additional background study (including material from Speech Processing)
Speech Processing but neither of (MLPR or MLP) (probably you are from SLP): you’ll require some machine learning background (especially neural networks)
A couple of introductory lectures on neural networks provided for SLP students
Some additional background study


Labs

Series of weekly labs using Python, OpenFst and Kaldi


They count towards 10% of the course credit
Labs start week 3 – exact arrangements TBA
You will need to work in pairs
Labs 1-5 will give you hands-on experience of using HMM
algorithms to build your own ASR system
These labs are an important pre-requisite for the coursework –
take advantage of the demonstrator support!
Later optional labs will introduce you to Kaldi recipes for
training acoustic models – useful if you will be doing an
ASR-related research project


What is speech recognition?

Speech-to-text transcription
Transform recorded audio into a sequence of words
Just the words, no meaning.... But do need to deal with
acoustic ambiguity: “Recognise speech?” or “Wreck a nice
beach?”
Speaker diarization: Who spoke when?
Speech recognition: what did they say?
Paralinguistic aspects: how did they say it? (timing,
intonation, voice quality)
Speech understanding: what does it mean?


Why is speech recognition difficult?


From a linguistic perspective
Many sources of variation:
Speaker: tuned for a particular speaker, or speaker-independent? Adaptation to speaker characteristics
Environment: noise, competing speakers, channel conditions (microphone, phone line, room acoustics)
Style: continuously spoken or isolated? Planned monologue or spontaneous conversation?
Vocabulary: machine-directed commands, scientific language, colloquial expressions
Accent/dialect: recognising the speech of all speakers who speak a particular language
Other paralinguistics: emotional state, social class, ...
Language spoken: an estimated 7,000 languages, most with limited training resources; code-switching; language change

From a machine learning perspective

As a classification problem: very high dimensional output space
As a sequence-to-sequence problem: very long input sequence (although limited re-ordering between acoustic and word sequences)
Data is often noisy, with many “nuisance” factors of variation
Very limited quantities of training data available (in terms of words) compared to text-based NLP: manual speech transcription is very expensive (around 10x real time)
The hierarchical and compositional nature of speech production and comprehension makes it difficult to handle with a single model


The speech recognition problem

We generally represent recorded speech as a sequence of acoustic feature vectors (observations) X, and the output word sequence as W
At recognition time, our aim is to find the most likely W given X
To achieve this, statistical models are trained using a corpus of labelled training utterances (Xn, Wn)


Representing recorded speech (X)

Represent a recorded utterance as a sequence of feature vectors

Reading: Jurafsky & Martin section 9.3
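To make “a sequence of feature vectors” concrete, here is a minimal sketch (not course code) of the standard framing step: the waveform is sliced into overlapping frames, conventionally with a 25 ms window and a 10 ms shift, and one feature vector is computed per frame. For illustration the “feature” here is just log frame energy; real front ends compute MFCC or filterbank features instead.

```python
import math

def frames(samples, rate, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping fixed-length frames."""
    win = int(rate * win_ms / 1000)      # samples per window
    shift = int(rate * shift_ms / 1000)  # samples per frame shift
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, shift)]

def log_energy(frame):
    # log of the frame's total energy; a tiny floor avoids log(0)
    return math.log(sum(x * x for x in frame) + 1e-10)

# one second of a 100 Hz sine wave at an 8 kHz sampling rate
rate = 8000
signal = [math.sin(2 * math.pi * 100 * t / rate) for t in range(rate)]
feats = [log_energy(f) for f in frames(signal, rate)]
print(len(feats))  # 98 frames: roughly one every 10 ms
```

So one second of audio becomes on the order of a hundred feature vectors, which is why the input sequence X is so much longer than the word sequence W.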


Labelling speech (W)

Labels may be at different levels: words, phones, etc.


Labels may be time-aligned – i.e. the start and end times of an
acoustic segment corresponding to a label are known

Reading: Jurafsky & Martin chapter 7 (especially sections 7.4, 7.5)


Two key challenges

In training the model:
Aligning the sequences Xn and Wn for each training utterance
[Figure: acoustic frames x1 x2 x3 x4 ... aligned to word labels (w1 w2: NO RIGHT), to phone labels (p1–p5: n oh r ai t), and to grapheme labels (g1–g7: n o r i g h t)]

In performing recognition:
Searching over all possible output sequences W to find the most likely one

The hidden Markov model (HMM) provides a good solution to both problems


The Hidden Markov Model

[Figure: HMM states generating the observation sequence x1 x2 x3 x4 ...]

A simple but powerful model for mapping a sequence of continuous observations to a sequence of discrete outputs
It is a generative model for the observation sequence
Algorithms exist for training (forward-backward) and recognition-time decoding (Viterbi)
Later in the course we will also look at newer all-neural, fully-differentiable “end-to-end” models

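As a sketch of the decoding idea, here is a tiny Viterbi implementation for a toy two-state HMM with discrete emissions. All the probabilities are invented for illustration; real ASR HMMs have many more states and emit continuous feature vectors, but the dynamic programme is the same.

```python
import math

# Toy HMM: two hidden states, two possible observations ("a", "b").
states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
             ("s2", "s1"): math.log(0.4), ("s2", "s2"): math.log(0.6)}
log_emit = {("s1", "a"): math.log(0.9), ("s1", "b"): math.log(0.1),
            ("s2", "a"): math.log(0.2), ("s2", "b"): math.log(0.8)}

def viterbi(obs):
    """Most likely hidden state sequence for an observation sequence."""
    # delta[s]: best log score of any state path ending in state s
    delta = {s: log_init[s] + log_emit[(s, obs[0])] for s in states}
    back = []  # backpointers: best predecessor of each state at each time
    for o in obs[1:]:
        prev = delta
        back.append({s: max(states, key=lambda p: prev[p] + log_trans[(p, s)])
                     for s in states})
        delta = {s: prev[back[-1][s]] + log_trans[(back[-1][s], s)]
                 + log_emit[(s, o)] for s in states}
    # backtrace from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b"]))  # ['s1', 's1', 's2']
```

Working in log probabilities keeps the scores from underflowing on long sequences, which matters when X contains thousands of frames.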
Hierarchical modelling of speech

[Figure: a generative model of the utterance “No right” at successive levels: Utterance W, Word level (NO RIGHT), Subword level (n oh r ai t), HMM, Acoustics X]


“Fundamental Equation of Statistical Speech Recognition”

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W∗ is given by

W∗ = arg max_W P(W | X)

Applying Bayes’ Theorem:

P(W | X) = p(X | W) P(W) / p(X)
         ∝ p(X | W) P(W)

W∗ = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model
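The decomposition can be illustrated with a toy rescoring example. The scores below are invented log probabilities for three candidate transcriptions of the same audio; in a real system they would come from an acoustic model and a language model. The point is how the language model resolves acoustic ambiguity of the “Recognise speech” vs “Wreck a nice beach” kind.

```python
# Toy illustration of W* = argmax_W p(X|W) P(W), in log probabilities.
# All numbers are made up for this example.
candidates = {
    "recognise speech":   {"log_p_x_given_w": -120.0, "log_p_w": -4.2},
    "wreck a nice beach": {"log_p_x_given_w": -118.5, "log_p_w": -9.7},
    "wreck an ice beach": {"log_p_x_given_w": -119.0, "log_p_w": -12.3},
}

def combined_score(scores):
    # log p(X|W) + log P(W); p(X) is constant over W, so it is dropped
    return scores["log_p_x_given_w"] + scores["log_p_w"]

best = max(candidates, key=lambda w: combined_score(candidates[w]))
print(best)  # recognise speech
```

Here the acoustically best candidate loses: the language model penalty for “wreck a nice beach” outweighs its small acoustic advantage, which is exactly the trade-off the product p(X | W) P(W) encodes.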


Speech Recognition Components

W∗ = arg max_W p(X | W) P(W)

Use an acoustic model, language model, and lexicon to obtain the most probable word sequence W∗ given the observed acoustics X

[Figure: system diagram: recorded speech X undergoes signal analysis; the search space combines the acoustic model p(X | W) and the language model P(W), both estimated from training data, to produce the decoded text W∗ (transcription)]


Phones and Phonemes

Phonemes
abstract unit defined by linguists based on contrastive role in
word meanings (eg “cat” vs “bat”)
40–50 phonemes in English
Phones
speech sounds defined by the acoustics
many allophones of the same phoneme (eg /p/ in “pit” and
“spit”)
limitless in number
Phones are usually used in speech recognition – but there is no
conclusive evidence that they are the basic units of speech
Possible alternatives: syllables, automatically derived units, ...

(Slide taken from Martin Cooke from long ago)


Evaluation

How accurate is a speech recognizer?


String edit distance
Use dynamic programming to align the ASR output with a
reference transcription
Three types of error: insertions, deletions, substitutions
Word error rate (WER) sums the three types of error. If there
are N words in the reference transcript, and the ASR output
has S substitutions, D deletions and I insertions, then:

WER = 100 · (S + D + I) / N %        Accuracy = 100 − WER %
Speech recognition evaluations: common training and
development data, release of new test sets on which different
systems may be evaluated using word error rate
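Since WER sums the three error types, the minimum total edit distance between reference and hypothesis is all that is needed for the rate itself (separate S/D/I counts require a backtrace). A minimal sketch of the dynamic programme, not the standard scoring tools:

```python
def wer(ref, hyp):
    """Word error rate via dynamic-programming string edit distance."""
    ref, hyp = ref.split(), hyp.split()
    n, m = len(ref), len(hyp)
    # dist[i][j]: minimum edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                            # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                 # substitution (or match)
                             dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1)  # insertion
    return 100.0 * dist[n][m] / n

print(wer("no right", "no no right"))  # one insertion over 2 reference words: 50.0
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than a true percentage.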


Next Lecture

[Figure: the system diagram again: recorded speech, signal analysis, acoustic model, language model, search space, training data, decoded text (transcription)]


Example: recognising TV broadcasts


Reading

Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): Chapter 7 (esp. 7.4, 7.5) and Section 9.3.

General interest:
The Economist Technology Quarterly, “Language: Finding a Voice”, Jan 2017.
http://www.economist.com/technology-quarterly/2017-05-01/language
The State of Automatic Speech Recognition: Q&A with Kaldi’s Dan Povey, Jul 2018.
https://medium.com/descript/the-state-of-automatic-speech-recognition-q-a-with-kaldis-dan-povey-c860aada9b85
