asr01-intro
Peter Bell
Speech-to-text transcription
Transform recorded audio into a sequence of words
Just the words, not their meaning, but we still need to deal with
acoustic ambiguity: “Recognise speech?” or “Wreck a nice
beach?”
Speaker diarization: Who spoke when?
Speech recognition: What did they say?
Paralinguistic aspects: How did they say it? (timing,
intonation, voice quality)
Speech understanding: What does it mean?
[Figure: the acoustic frame sequence x1 x2 x3 x4 ... shown against three possible output representations: words (NO RIGHT), phones (n oh r ai t) and characters (n o r i g h t)]
Performing recognition:
Search over all possible output sequences W
to find the most likely one
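Using the slides' notation (X for the acoustics, W for an output sequence), the search above is commonly written as the maximisation below. The symbol W^* for the selected sequence is introduced here for illustration and does not appear in the slides.

```latex
W^{*} = \arg\max_{W} P(W \mid X)
```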
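[Figure: statistical speech recognition framework. Signal analysis turns the signal into acoustics X. The acoustic model p(X | W) and the language model P(W), estimated from training data, define the search space from which the output W is found. Subword units (n oh r ai t) are modelled with HMMs.]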
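By Bayes' rule, maximising P(W | X) over W is equivalent to maximising p(X | W) P(W), because p(X) does not depend on W. A minimal sketch of that scoring, assuming a tiny hand-written candidate list in place of a real search space: the score values and the names `acoustic_log_likelihood`, `language_model_prob` and `decode` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical scores for illustration only; a real system would compute
# these from trained acoustic and language models.
# log p(X | W): how well each candidate word sequence explains the acoustics X.
acoustic_log_likelihood = {
    "recognise speech": -120.0,
    "wreck a nice beach": -119.5,
}
# P(W): prior probability of each word sequence under the language model.
language_model_prob = {
    "recognise speech": 1e-4,
    "wreck a nice beach": 1e-7,
}


def decode(candidates):
    """Return the candidate W that maximises log p(X | W) + log P(W)."""
    def score(w):
        return acoustic_log_likelihood[w] + math.log(language_model_prob[w])
    return max(candidates, key=score)


best = decode(list(acoustic_log_likelihood))
print(best)
```

In this toy setting the acoustic scores of the two candidates are nearly tied, so the language model prior decides between them. Working in the log domain avoids numerical underflow when many small probabilities are multiplied.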
Phonemes
abstract units defined by linguists according to their contrastive role in
word meanings (e.g. “cat” vs “bat”)
40–50 phonemes in English
Phones
speech sounds defined by their acoustics
many allophones of the same phoneme (e.g. aspirated /p/ in “pit” vs
unaspirated /p/ in “spit”)
limitless in number
Phones are usually used in speech recognition, but there is no
conclusive evidence that they are the basic units of speech
recognition
Possible alternatives: syllables, automatically derived units, ...
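To make the choice of unit concrete, here is a small sketch that expands the word sequence "NO RIGHT" into phones or into characters. The phone strings come from the example in these slides; the `LEXICON` dictionary and both function names are hypothetical illustrations, not part of any particular toolkit.

```python
# Toy pronunciation lexicon; the entries follow the "NO RIGHT" example,
# the dictionary itself is a hypothetical illustration.
LEXICON = {
    "NO": ["n", "oh"],
    "RIGHT": ["r", "ai", "t"],
}


def words_to_phones(words):
    """Expand a word sequence into its phone sequence via the lexicon."""
    phones = []
    for word in words:
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(LEXICON[word])
    return phones


def words_to_characters(words):
    """Alternative unit inventory: lower-case characters, word breaks dropped."""
    return [ch for word in words for ch in word.lower()]


phones = words_to_phones(["NO", "RIGHT"])
chars = words_to_characters(["NO", "RIGHT"])
print(phones)  # ['n', 'oh', 'r', 'ai', 't']
print(chars)   # ['n', 'o', 'r', 'i', 'g', 'h', 't']
```

Phone units need a pronunciation lexicon, whereas character units can be read directly off the spelling, at the cost of a less direct link to the sounds.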