
Automatic Speech Recognition: Introduction

Peter Bell

Automatic Speech Recognition — ASR Lecture 1


11 January 2020

ASR Lecture 1 Automatic Speech Recognition: Introduction 1


Automatic Speech Recognition — ASR
Course details
Lectures: About 18 lectures, delivered live on Teams for now
Labs: Weekly lab sessions – using Python, OpenFst (openfst.org) and later Kaldi (kaldi-asr.org)
Lab sessions will start in Week 3 – exact format TBA.
Assessment:
First five lab sessions worth 10%
Coursework, building on the lab sessions, worth 40%
Open book exam in April or May worth 50%
People:
Course organiser: Peter Bell
Guest lecturers: Hiroshi Shimodaira and Yumnah Mohammied
TA: Andrea Carmantini
Demonstrators: Chau Luu and Electra Wallington
http://www.inf.ed.ac.uk/teaching/courses/asr/


Your background
If you have taken:
Speech Processing and either of (MLPR or MLP): perfect!
Either of (MLPR or MLP) but not Speech Processing (probably you are from Informatics): you’ll require some speech background
A couple of the lectures will cover material that was in Speech Processing
Some additional background study (including material from Speech Processing)
Speech Processing but neither of (MLPR or MLP) (probably you are from SLP): you’ll require some machine learning background (especially neural networks)
A couple of introductory lectures on neural networks provided for SLP students
Some additional background study


Labs

Series of weekly labs using Python, OpenFst and Kaldi


They count towards 10% of the course credit
Labs start week 3 – exact arrangements TBA
You will need to work in pairs
Labs 1-5 will give you hands-on experience of using HMM
algorithms to build your own ASR system
These labs are an important pre-requisite for the coursework –
take advantage of the demonstrator support!
Later optional labs will introduce you to Kaldi recipes for
training acoustic models – useful if you will be doing an
ASR-related research project


What is speech recognition?

Speech-to-text transcription
Transform recorded audio into a sequence of words
Just the words, no meaning.... But do need to deal with
acoustic ambiguity: “Recognise speech?” or “Wreck a nice
beach?”
Speaker diarization: Who spoke when?
Speech recognition: what did they say?
Paralinguistic aspects: how did they say it? (timing,
intonation, voice quality)
Speech understanding: what does it mean?


Why is speech recognition difficult?


From a linguistic perspective
Many sources of variation:
Speaker: tuned for a particular speaker, or speaker-independent? Adaptation to speaker characteristics
Environment: noise, competing speakers, channel conditions (microphone, phone line, room acoustics)
Style: continuously spoken or isolated? Planned monologue or spontaneous conversation?
Vocabulary: machine-directed commands, scientific language, colloquial expressions
Accent/dialect: recognising the speech of all speakers who speak a particular language
Other paralinguistics: emotional state, social class, ...
Language spoken: an estimated 7,000 languages, most with limited training resources; code-switching; language change

From a machine learning perspective

As a classification problem: very high dimensional output space
As a sequence-to-sequence problem: very long input sequence (although limited re-ordering between acoustic and word sequences)
Data is often noisy, with many “nuisance” factors of variation
Very limited quantities of training data available (in terms of words) compared to text-based NLP: manual speech transcription is very expensive (around 10x real time)
The hierarchical and compositional nature of speech production and comprehension makes it difficult to handle with a single model


The speech recognition problem

We generally represent recorded speech as a sequence of acoustic feature vectors (observations) X, and the output word sequence as W
At recognition time, our aim is to find the most likely W given X
To achieve this, statistical models are trained using a corpus of labelled training utterances (Xn, Wn)


Representing recorded speech (X)

Represent a recorded utterance as a sequence of feature vectors

Reading: Jurafsky & Martin section 9.3
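To make “a sequence of feature vectors” concrete, here is a minimal sketch (not course code) of the standard framing step: the waveform is sliced into overlapping frames, conventionally with a 25 ms window and a 10 ms shift, and one feature vector is computed per frame. For illustration the “feature” here is just log frame energy; real front ends compute MFCC or filterbank features instead.

```python
import math

def frames(samples, rate, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping fixed-length frames."""
    win = int(rate * win_ms / 1000)      # samples per window
    shift = int(rate * shift_ms / 1000)  # samples per frame shift
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, shift)]

def log_energy(frame):
    # log of the frame's total energy; a tiny floor avoids log(0)
    return math.log(sum(x * x for x in frame) + 1e-10)

# one second of a 100 Hz sine wave at an 8 kHz sampling rate
rate = 8000
signal = [math.sin(2 * math.pi * 100 * t / rate) for t in range(rate)]
feats = [log_energy(f) for f in frames(signal, rate)]
print(len(feats))  # 98 frames: roughly one every 10 ms
```

So one second of audio becomes on the order of a hundred feature vectors, which is why the input sequence X is so much longer than the word sequence W.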


Labelling speech (W)

Labels may be at different levels: words, phones, etc.


Labels may be time-aligned – i.e. the start and end times of an
acoustic segment corresponding to a label are known

Reading: Jurafsky & Martin chapter 7 (especially sections 7.4, 7.5)


Two key challenges

In training the model:
Aligning the sequences Xn and Wn for each training utterance
[Figure: acoustic frames x1 x2 x3 x4 ... aligned to word labels (w1 w2: NO RIGHT), to phone labels (p1–p5: n oh r ai t), and to grapheme labels (g1–g7: n o r i g h t)]

In performing recognition:
Searching over all possible output sequences W to find the most likely one

The hidden Markov model (HMM) provides a good solution to both problems


The Hidden Markov Model

[Figure: HMM states generating the observation sequence x1 x2 x3 x4 ...]

A simple but powerful model for mapping a sequence of continuous observations to a sequence of discrete outputs
It is a generative model for the observation sequence
Algorithms exist for training (forward-backward) and recognition-time decoding (Viterbi)
Later in the course we will also look at newer all-neural, fully-differentiable “end-to-end” models

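As a sketch of the decoding idea, here is a tiny Viterbi implementation for a toy two-state HMM with discrete emissions. All the probabilities are invented for illustration; real ASR HMMs have many more states and emit continuous feature vectors, but the dynamic programme is the same.

```python
import math

# Toy HMM: two hidden states, two possible observations ("a", "b").
states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
             ("s2", "s1"): math.log(0.4), ("s2", "s2"): math.log(0.6)}
log_emit = {("s1", "a"): math.log(0.9), ("s1", "b"): math.log(0.1),
            ("s2", "a"): math.log(0.2), ("s2", "b"): math.log(0.8)}

def viterbi(obs):
    """Most likely hidden state sequence for an observation sequence."""
    # delta[s]: best log score of any state path ending in state s
    delta = {s: log_init[s] + log_emit[(s, obs[0])] for s in states}
    back = []  # backpointers: best predecessor of each state at each time
    for o in obs[1:]:
        prev = delta
        back.append({s: max(states, key=lambda p: prev[p] + log_trans[(p, s)])
                     for s in states})
        delta = {s: prev[back[-1][s]] + log_trans[(back[-1][s], s)]
                 + log_emit[(s, o)] for s in states}
    # backtrace from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b"]))  # ['s1', 's1', 's2']
```

Working in log probabilities keeps the scores from underflowing on long sequences, which matters when X contains thousands of frames.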
Hierarchical modelling of speech

[Figure: a generative model of the utterance “No right” at successive levels: Utterance W, Word level (NO RIGHT), Subword level (n oh r ai t), HMM, Acoustics X]


“Fundamental Equation of Statistical Speech Recognition”

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W∗ is given by

W∗ = arg max_W P(W | X)

Applying Bayes’ Theorem:

P(W | X) = p(X | W) P(W) / p(X)
         ∝ p(X | W) P(W)

W∗ = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model
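The decomposition can be illustrated with a toy rescoring example. The scores below are invented log probabilities for three candidate transcriptions of the same audio; in a real system they would come from an acoustic model and a language model. The point is how the language model resolves acoustic ambiguity of the “Recognise speech” vs “Wreck a nice beach” kind.

```python
# Toy illustration of W* = argmax_W p(X|W) P(W), in log probabilities.
# All numbers are made up for this example.
candidates = {
    "recognise speech":   {"log_p_x_given_w": -120.0, "log_p_w": -4.2},
    "wreck a nice beach": {"log_p_x_given_w": -118.5, "log_p_w": -9.7},
    "wreck an ice beach": {"log_p_x_given_w": -119.0, "log_p_w": -12.3},
}

def combined_score(scores):
    # log p(X|W) + log P(W); p(X) is constant over W, so it is dropped
    return scores["log_p_x_given_w"] + scores["log_p_w"]

best = max(candidates, key=lambda w: combined_score(candidates[w]))
print(best)  # recognise speech
```

Here the acoustically best candidate loses: the language model penalty for “wreck a nice beach” outweighs its small acoustic advantage, which is exactly the trade-off the product p(X | W) P(W) encodes.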


Speech Recognition Components

W∗ = arg max_W p(X | W) P(W)

Use an acoustic model, language model, and lexicon to obtain the most probable word sequence W∗ given the observed acoustics X

[Figure: system diagram: recorded speech X undergoes signal analysis; the search space combines the acoustic model p(X | W) and the language model P(W), both estimated from training data, to produce the decoded text W∗ (transcription)]


Phones and Phonemes

Phonemes
abstract unit defined by linguists based on contrastive role in
word meanings (eg “cat” vs “bat”)
40–50 phonemes in English
Phones
speech sounds defined by the acoustics
many allophones of the same phoneme (eg /p/ in “pit” and
“spit”)
limitless in number
Phones are usually used in speech recognition – but there is no
conclusive evidence that they are the basic units of speech
Possible alternatives: syllables, automatically derived units, ...

(Slide taken from Martin Cooke from long ago)


Evaluation

How accurate is a speech recognizer?


String edit distance
Use dynamic programming to align the ASR output with a
reference transcription
Three types of error: insertions, deletions, substitutions
Word error rate (WER) sums the three types of error. If there
are N words in the reference transcript, and the ASR output
has S substitutions, D deletions and I insertions, then:

WER = 100 · (S + D + I) / N %        Accuracy = 100 − WER %
Speech recognition evaluations: common training and
development data, release of new test sets on which different
systems may be evaluated using word error rate
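Since WER sums the three error types, the minimum total edit distance between reference and hypothesis is all that is needed for the rate itself (separate S/D/I counts require a backtrace). A minimal sketch of the dynamic programme, not the standard scoring tools:

```python
def wer(ref, hyp):
    """Word error rate via dynamic-programming string edit distance."""
    ref, hyp = ref.split(), hyp.split()
    n, m = len(ref), len(hyp)
    # dist[i][j]: minimum edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                            # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                 # substitution (or match)
                             dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1)  # insertion
    return 100.0 * dist[n][m] / n

print(wer("no right", "no no right"))  # one insertion over 2 reference words: 50.0
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than a true percentage.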


Next Lecture

[Figure: the system diagram again: recorded speech, signal analysis, acoustic model, language model, search space, training data, decoded text (transcription)]


Example: recognising TV broadcasts


Reading

Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): Chapter 7 (esp. 7.4, 7.5) and Section 9.3.

General interest:
The Economist Technology Quarterly, “Language: Finding a Voice”, Jan 2017.
http://www.economist.com/technology-quarterly/2017-05-01/language
The State of Automatic Speech Recognition: Q&A with Kaldi’s Dan Povey, Jul 2018.
https://medium.com/descript/the-state-of-automatic-speech-recognition-q-a-with-kaldis-dan-povey-c860aada9b85
