Hands-On Speech Recognition With Kaldi/TIMIT
Demystify Automatic Speech Recognition
(ASR)
& Deep Neural Networks (DNN)
Yamin Ren
Leo Reny
HYPech.com
India • Japan • Korea • Singapore • United Kingdom • United States
Automatic Speech Recognition Series
Hands-on Kaldi : ALL RIGHTS RESERVED.
Leo Reny, Yamin Ren
Publisher and GM: Mel Pearce
Associate Director: Kris Burns
Manager of Editorial Services: Hayden Simpson
Marketing Manager: Lilyana Cardenas
Acquisitions Editor: Alexandra Lewis
Project and Copy Editor: Isabella Fletcher
Technical Reviewer: Sergio Crawford
Interior Layout Tech: ReApex Limited
Cover Designer:
No part of this work covered by the copyright herein may be reproduced, transmitted, stored,
or used in any form or by any means graphic, electronic, or mechanical, including but not
limited to photocopying, recording, scanning, digitizing, taping, web distribution, information
networks, or information storage and retrieval systems, except as permitted under Section
107 or 108 of the 1976 United States Copyright Act, without the prior written permission of
the publisher.
Throughout this book, we refer to products and designs which are not our property. These
references are meant only to be informational. We do not represent the companies
mentioned and were not paid promotional fees. However, if these companies would like to
send us evaluation copies of future products, we would be thrilled. References to products
are not endorsements, but reflect our opinions in some cases.
Computer software products mentioned are the property of their respective publishers.
Instead of attempting to list every software publisher and brand, or to include trademark
symbols throughout this book, we acknowledge that these product and brand names are
protected under U.S. and international laws. Fonts and designs are the intellectual property
of the design artists. Although U.S. copyright laws do not protect font designs, we consider
them the property of the designers and licensing agencies.
For product information and assistance, contact us at Hypech Publisher Support, 1-872-222-8067.
For permission to use material from this text or product, submit all requests to:
[email protected]
Printed in the US 1 2 3 4 5 6 7 12 11 10
Kaldi is an open-source toolkit for speech recognition and signal processing, written in C++
and freely available under the Apache License v2.0. Access, Excel, Microsoft, SQL Server,
and Windows are registered trademarks of Microsoft Corporation. MySQL is a registered
trademark of MySQL AB. Mac OS is a registered trademark of Apple Inc. All other trademarks
are the property of their respective owners.
For You
Acknowledgments
We would like to acknowledge and thank Daniel Povey for his
invaluable work on Kaldi, indeed a great ASR Toolkit, and for
his great help and kindness in letting us use some of the text
of Kaldi’s web pages describing the software. After he moved
to Xiaomi, Daniel’s brilliant wisdom inspired all the people
working around him.
1.2.5 Decoding
Below is the comparison from one point of view.
1.4 Kaldi
$ cat INSTALL
Then rerun the dependency check:
$ extras/check_dependencies.sh
If everything is in place, the script prints "OK".
Now compile Kaldi; this may take hours:
$ cd kaldi/tools/; make; cd ../src; ./configure; make
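If your machine has several cores, the build can be parallelized; a minimal sketch assuming a 4-core machine (the -j value is up to you, and --shared is an optional configure flag):
$ cd kaldi/tools; make -j 4; cd ../src; ./configure --shared; make -j 4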
2.4 Yes or No
waves_yesno.tar.gz 100%[===================>]
waves_yesno/
waves_yesno/1_0_0_0_0_0_1_1.wav
waves_yesno/1_1_0_0_1_0_1_0.wav
waves_yesno/1_0_1_1_1_1_0_1.wav
waves_yesno/1_1_1_1_0_1_0_0.wav
waves_yesno/0_0_1_1_1_0_0_0.wav
waves_yesno/0_1_1_1_1_1_1_1.wav
waves_yesno/0_1_0_1_1_1_0_0.wav
waves_yesno/1_0_1_1_1_0_1_0.wav
waves_yesno/1_0_0_1_0_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_0.wav
waves_yesno/0_1_0_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_1_0.wav
waves_yesno/1_0_0_0_1_0_0_1.wav
waves_yesno/1_1_0_1_1_1_1_0.wav
waves_yesno/0_0_1_1_1_1_0_0.wav
waves_yesno/1_1_0_0_1_1_1_0.wav
waves_yesno/0_0_1_1_0_1_1_1.wav
waves_yesno/1_1_0_1_0_1_1_0.wav
waves_yesno/0_1_0_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_0_0_1.wav
Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexicon.txt is OK
**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
fstaddselfloops data/lang/phones/wdisambig_phones.int data/lang/phones/wdisambig_words.int
fstisstochastic data/lang_test_tg/G.fst
1.20397 1.20397
Succeeded in formatting data.
See kaldi-asr.org/doc/data_prep.html for more information.
steps/diagnostic/analyze_alignments.sh --cmd
utils/run.pl data/lang exp/mono0a
steps/diagnostic/analyze_alignments.sh: see stats in
exp/mono0a/log/analyze_alignments.log
1 warnings in exp/mono0a/log/update.*.log
exp/mono0a: nj=1 align prob=-81.88 over 0.05h [retry=0.0%, fail=0.0%] states=11
gauss=371
steps/train_mono.sh: Done training mono
phone system in exp/mono0a
tree-info exp/mono0a/tree
tree-info exp/mono0a/tree
fsttablecompose data/lang_test_tg/L_disambig.fst data/lang_test_tg/G.fst
fstdeterminizestar --use-log=true
fstminimizeencoded
fstpushspecial
fstisstochastic data/lang_test_tg/tmp/LG.fst
0.534295 0.533859
[info]: LG not stochastic.
fsttablecompose exp/mono0a/graph_tgpr/Ha.fst data/lang_test_tg/tmp/CLG_1_0.fst
fstdeterminizestar --use-log=true
fstminimizeencoded
fstrmsymbols exp/mono0a/graph_tgpr/disambig_tid.int fstrmepslocal
fstisstochastic exp/mono0a/graph_tgpr/HCLGa.fst
0.5342 -0.000422432
HCLGa is not stochastic
3.1 About TIMIT
Text corpus design was a joint effort among the
Massachusetts Institute of Technology (MIT), Stanford
Research Institute (SRI), and Texas Instruments (TI). The
speech was recorded at TI, transcribed at MIT, and has been
maintained, verified, and prepared for CD-ROM production by
the National Institute of Standards and Technology (NIST).
Additional information including the referenced material and
some relevant reprints of articles may be found in the printed
documentation which is also available from NTIS (NTIS#
PB91-100354).
s5/data as below.
We now have the complete TIMIT corpus in Kaldi. Once the installation is complete, read
./data/README.DOC first, and find more information in ./data/DOC.
cd train/DR1/FCJF0
Entering one of these speaker directories, we find 10 utterances, each with four files:
.phn, .txt, .wav, and .wrd. Let's open them and see what's inside.
.wav is the recorded audio (strictly speaking, TIMIT audio is stored in the NIST SPHERE
format rather than ordinary WAV, but let's treat it as WAV for now). The others are the
corresponding metadata: lexicon-style transcriptions and language model training text in
plain text format. The metadata and text files help Kaldi understand the WAV data and
transfer the speech to feature vectors, and then to characters, words, and language.
3.2.3 TIMIT Recipes
The recipe usually starts with run.sh in the root directory. Links to the Kaldi scripts are
stored in the utils/ and steps/ directories.
# Set which stage this script starts from. Set it to a stage that has
# already been executed to avoid re-running the same commands.
stage=0
Stage is set to 0 by default, which means the recipe will run all blocks. If we hit an error,
we can check which stages passed successfully and re-run the recipe with ./run.sh --stage x,
or set the stage variable to a specific number to run only that stage.
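Under the hood, each block of run.sh is wrapped in a guard like the sketch below (the stage numbers and echo bodies here are illustrative, not the actual recipe):
stage=0
. utils/parse_options.sh  # lets "./run.sh --stage N" override the default
if [ $stage -le 0 ]; then
  echo "stage 0: data preparation"
fi
if [ $stage -le 1 ]; then
  echo "stage 1: feature extraction"
fi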
Chapter 4. Kaldi Data Preparation
from beginning to FST
The speech data is the core data for speech recognition. The lexicon data is like a
dictionary: it includes a word dictionary (word-pronunciation pairs) and a phoneme
dictionary (the types of phonemes). The language data is the language model (mostly
trained apart from the text of the speech data). Most ASR datasets are accompanied by a
language model.
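To make this concrete, here is a small sketch of how the three kinds of data look in Kaldi's plain-text formats (the utterance ID, path, and pronunciations are made up for illustration):
# speech data: wav.scp maps utterance IDs to audio
utt001 /corpus/timit/train/dr1/fcjf0/sa1.wav
# lexicon data: lexicon.txt maps words to phone sequences
she sh iy
had hh ae d
# language data: an n-gram language model in ARPA format, later compiled into G.fst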
For the TIMIT dataset specifically, the three kinds of data are integrated in the dataset,
and we can easily get them from the scripts in local. Before touching the code, read the
TIMIT description (some docs ship with the dataset; the data collection process is
frustrating, and all the datasets are built with great effort).
The binaries and executable files of Kaldi are stored in different folders according to a
logic. path.sh is used to link them to the TIMIT recipe. Kaldi's path definitions are
complicated, which is why the path.sh file is called at the beginning of all Kaldi scripts.
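For reference, a minimal sketch of what path.sh typically contains in an s5 recipe (your checkout may differ in details):
export KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo "common_path.sh is missing" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C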
set -e means the script stops on any error. To check whether path.sh has been set up
correctly, just run any Kaldi binary:
feat-to-dim
If it prints its usage message, we are good. Even though we may not understand what it is
showing, we know the paths are all set.
Keep the acoustic model parameters (lines 18-32) as they are.
After data preparation, each data set (train, test, dev) will produce _sph.flist, _sph.scp,
.uttids, .trans, .text, _wav.scp, and .utt2spk files.
Step 3: Dictionary
We define the FST in a separate file. Each line in the FST file describes an arc, except the
last line, which holds the final state. The first two columns give the states the arc transits
from and to. The third column is the input label and the fourth is the output label; if the
output label is a dash, the output is empty.
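For example, a toy two-arc FST in this text format (the states and labels are invented for illustration); the final line holds only the final state:
0 1 a x
1 2 b -
2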
$ local/timit_format_data.sh
Two files are most important: words.txt and phones.txt. Both are OpenFst symbol tables
mapping symbols to integers, and for TIMIT they look almost the same, since the words here
are phones. The *.csl files hold the colon-separated IDs of the silence phones, the
non-silence phones, and all phones. We can use fstprint to check the content of L.fst:
fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head
The result is:
0 0 aa aa
0 0 ae ae
0 0 ah ah
0 0 ao ao
0 0 aw aw
0 0 ax ax
0 0 ay ay
0 0 b b
The Result
wav-to-duration --read-entire-file=true scp:train_wav.scp ark,t:train_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for 3696 audio files.
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:94) Mean duration was 3.06336, min and max durations were 0.91525, 7.78881
wav-to-duration --read-entire-file=true scp:dev_wav.scp ark,t:dev_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for 400 audio files.
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:94) Mean duration was 3.08212, min and max durations were 1.09444, 7.43681
wav-to-duration --read-entire-file=true scp:test_wav.scp ark,t:test_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for 192 audio files.
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:94) Mean duration was 3.03646, min and max durations were 1.30562, 6.21444
inpfile: data/local/lm_tmp/lm_phone_bg.ilm.gz
outfile: /dev/stdout
loading up to the LM level 1000 (if any)
dub: 10000000
OOV code is 50
OOV code is 50
Saving in txt format to /dev/stdout
Dictionary & language model preparation succeeded
Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexicon.txt is OK
Checking data/local/dict/lexiconp.txt
--> reading data/local/dict/lexiconp.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexiconp.txt is OK
fstisstochastic data/lang_test_bg/G.fst
0.000510126 -0.0763018
utils/validate_lang.pl data/lang_test_bg
Checking data/lang_test_bg/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_test_bg/phones.txt is OK
Checking data/lang_test_bg/phones/context_indep.{txt, int, csl} ...
Checking data/lang_test_bg/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
fsttablecompose data/lang_test_bg/L_disambig.fst data/lang_test_bg/G.fst
One widely used model is to cut the audio wave into frames, taken roughly every 10 ms. From
each frame we can extract 39 numbers representing the frame's audio features; put together,
they form the frame's feature vector.
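A one-utterance sketch of that pipeline with Kaldi binaries (the scp file name is illustrative): 13 MFCCs plus deltas and delta-deltas give the 39 numbers per frame:
$ compute-mfcc-feats scp:one_utt.scp ark:- | add-deltas ark:- ark,t:- | head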
The script reads train/wav.scp, divides it into several even parts, and invokes run.pl to
extract the features; one job processes one part of wav.scp. The binaries compute-mfcc-feats
and copy-feats create the MFCC folder, which contains .ark and .scp files. The
utterance-to-feature pairs are stored in .scp; the real features are stored in .ark. Kaldi
uses these two files together.
Let’s check the files:
$ head raw_mfcc_train.1.scp
faem0_si1392 ./timit/s5/mfcc/raw_mfcc_train.1.ark:13
faem0_si2022 ./timit/s5/mfcc/raw_mfcc_train.1.ark:6313
faem0_si762 ./timit/s5/mfcc/raw_mfcc_train.1.ark:9349
faem0_sx132 ./timit/s5/mfcc/raw_mfcc_train.1.ark:13048
faem0_sx222 ./timit/s5/mfcc/raw_mfcc_train.1.ark:16916
faem0_sx312 ./timit/s5/mfcc/raw_mfcc_train.1.ark:20537
faem0_sx402 ./timit/s5/mfcc/raw_mfcc_train.1.ark:25549
faem0_sx42 ./timit/s5/mfcc/raw_mfcc_train.1.ark:29767
fajw0_si1263 ./timit/s5/mfcc/raw_mfcc_train.1.ark:32804
fajw0_si1893 ./timit/s5/mfcc/raw_mfcc_train.1.ark:39403
Scripts and archives are organized as tables. A table has an index (like utt_id). The .scp
is a text file in which each line has a key and a corresponding file location, telling Kaldi
where the data lives. An archive file can be text or binary; its format is the key (like
utt_id), a space, and then the object.
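To see the table form directly, we can dump an archive as text; the key comes first, then the feature matrix rows (output omitted here):
$ copy-feats scp:data/train/feats.scp ark,t:- | head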
We could check MFCC’s dimension like below:
feat-to-dim scp:data/train/feats.scp
feat-to-dim ark:mfcc/raw_mfcc_train.1.ark
The result is 13, as we expected.
MFCC Feature Extration & CMVN for Training and Test set
===============================================================
train_nj has been given 30. --nj 30 instructs Kaldi to split the wav files into thirty even
parts to process, which are stored in data/train/split30 labeled with numbers 1 to 30. Each
block contains three scp files and other supporting files.
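The splitting is done by a standard helper, roughly like this sketch (the recipe invokes it internally):
$ utils/split_data.sh data/train 30
This creates data/train/split30/1 through data/train/split30/30, each with its own wav.scp, feats.scp, and utt2spk.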
......
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36
steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39
steps/diagnostic/analyze_alignments.sh --cmd run.pl
--max-jobs-run 10 data/lang exp/mono
steps/diagnostic/analyze_alignments.sh: see stats in
exp/mono/log/analyze_alignments.log
2 warnings in exp/mono/log/align.*.*.log
exp/mono: nj=4 align prob=-99.15 over 3.12h [retry=0.0%, fail=0.0%] states=144 gauss=985
The mono folder will be created under exp, in which the model
parameters are stored in .mdl files. We could check the
contents like below:
$ gmm-copy --binary=false exp/mono/0.mdl - | less
The output will be:
<TransitionModel>
<Topology>
<TopologyEntry>
<ForPhones>
2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 </ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0
0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1
0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2
0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
<TopologyEntry>
<ForPhones>
1
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.5 <Transition> 1 0.5 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
</Topology>
<Triples> 144
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
3 0 6
3 1 7
3 2 8
4 0 9
............
Kaldi labels each p.d.f. with a number starting from 0 (called pdf-ids), while in an HTK
model they have names. The .mdl file also tells us that it holds no data linking
context-dependent phones to pdf-ids; that mapping lives in the tree file. Let's check it:
$ copy-tree --binary=false exp/mono/tree - | less
The output is:
copy-tree --binary=false exp/mono/tree
LOG (copy-tree:main():copy-tree.cc:55) Copied tree
ContextDependency 1 0 ToPdf TE 0 49 ( NULL TE -1 3 ( CE 0 CE 1
CE 2 )
TE -1 3 ( CE 3 CE 4 CE 5 )
TE -1 3 ( CE 6 CE 7 CE 8 )
TE -1 3 ( CE 9 CE 10 CE 11 )
TE -1 3 ( CE 12 CE 13 CE 14 )
TE -1 3 ( CE 15 CE 16 CE 17 )
TE -1 3 ( CE 18 CE 19 CE 20 )
TE -1 3 ( CE 21 CE 22 CE 23 )
TE -1 3 ( CE 24 CE 25 CE 26 )
TE -1 3 ( CE 27 CE 28 CE 29 )
TE -1 3 ( CE 30 CE 31 CE 32 )
TE -1 3 ( CE 33 CE 34 CE 35 )
···················
EndContextDependency
This is the monophone tree. It is trivial since there are no branches. CE means "constant
eventmap", representing the leaves of the tree.
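To see the tree graphically rather than in this text dump, Kaldi ships a drawing helper (this sketch assumes Graphviz's dot is installed):
$ draw-tree data/lang/phones.txt exp/mono/tree | dot -Tps -Gsize=8,10.5 > tree.ps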
faem0_si2022 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 4 3 3 3 3 3 6 5 5 5 266 265 265 265 265 265 265
268 267 270 269 269 20 22 24 80 82 81 84 83 32 31 34
33 33 33 33 36 35 35 35 35 35 35 122 121 121 121 121
121 121 121 121 124 123 126 125 146 145 148 147 150
62 64 63 66 65 65 65 65 65 65 68 70 72 71 242 241 244
246 245 224 223 223 223 223 223 223 223 223 223 223
223 226 225 225 228 152 154 153 156 260 262 264 263
68 70 72 71 212 211 211 211 214 213 213 216 215 215
215 215 44 43 46 45 45 45 48 47 47 254 256 258 122 121 121 121
121 121 121 121 121 121 121 124 123 126 125 125 26 25 25 28 27
27 27 27 27 30 29 29 29 2 1 1 1 1 4 3 3 3 3 3 3 6 5
exp/mono/log/acc.37.12.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.1242 over 595815 frames.
exp/mono/log/acc.38.10.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0045 over 666753 frames.
exp/mono/log/acc.38.11.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -95.769 over 793715 frames.
exp/mono/log/acc.38.12.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0953 over 595815 frames.
exp/mono/log/acc.39.10.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -98.9901 over 666753 frames.
exp/mono/log/acc.39.11.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -95.7472 over 793715 frames.
exp/mono/log/acc.39.12.log:LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0786 over 595815 frames.
We can see the acoustic likelihood of each iteration.
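A quick way to pull those numbers out of the logs is plain grep (nothing Kaldi-specific here):
$ grep -h "Overall avg like per frame" exp/mono/log/acc.*.log | tail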
6.4 Decoding
Then create the new exp/mono/graph/HCLGa.fst with $tree and $model. The final output is:
tree-info exp/mono/tree
tree-info exp/mono/tree
fsttablecompose data/lang_test_bg/L_disambig.fst
data/lang_test_bg/G.fst
fstdeterminizestar --use-log=true
fstpushspecial
fstminimizeencoded
fstisstochastic data/lang_test_bg/tmp/LG.fst
-0.00841335 -0.00928529
make-h-transducer --disambig-syms-out=exp/mono/graph/disambig_tid.int --transition-scale=1.0 data/lang_test_bg/tmp/ilabels_1_0 exp/mono/tree exp/mono/final.mdl
fstminimizeencoded
fstdeterminizestar --use-log=true
fsttablecompose exp/mono/graph/Ha.fst data/lang_test_bg/tmp/CLG_1_0.fst
fstrmsymbols exp/mono/graph/disambig_tid.int fstrmepslocal
fstisstochastic exp/mono/graph/HCLGa.fst
0.000381767 -0.00951546
for x in exp/{mono,tri,sgmm,dnn,combine}*/decode*; do [ -d $x ] && echo $x | grep "${1:-.*}" >/dev/null && grep WER $x/wer_* 2>/dev/null | utils/best_wer.sh; done
===================================================
tree-info exp/mono/tree
tree-info exp/mono/tree
fsttablecompose data/lang_test_bg/L_disambig.fst
data/lang_test_bg/G.fst
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
fstisstochastic data/lang_test_bg/tmp/LG.fst
-0.00841336 -0.00928521
fstisstochastic data/lang_test_bg/tmp/CLG_1_0.fst
-0.00841336 -0.00928521
make-h-transducer --disambig-syms-out=exp/mono/graph/disambig_tid.int --transition-scale=1.0 data/lang_test_bg/tmp/ilabels_1_0 exp/mono/tree exp/mono/final.mdl
fstrmepslocal
fsttablecompose exp/mono/graph/Ha.fst data/lang_test_bg/tmp/CLG_1_0.fst
fstdeterminizestar --use-log=true
fstrmsymbols exp/mono/graph/disambig_tid.int fstminimizeencoded
fstisstochastic exp/mono/graph/HCLGa.fst
0.000381709 -0.00951555
The key issue for speech recognition (not only Kaldi, but the whole field) is speech
variation, which occurs everywhere: channel/microphone type, environmental noise, speaking
style, vocal anatomy, gender, accent, health, etc.
The rest is just iterating for a better result. The default is 35 iterations. When reaching
iterations 10, 20, and 30, Kaldi re-aligns the data to create a new ali.gz for further
processing.
incgauss=$[($totgauss-$numgauss)/$max_iter_inc]
This line computes how many Gaussians to add at each iteration: the gap between the target
total ($totgauss) and the current count ($numgauss), spread evenly over $max_iter_inc
increase steps.
7.3 Output
===============================================================
tri1 : Deltas + Delta-Deltas Training & Decoding
===============================================================
63 warnings in exp/tri1/log/init_model.log
44 warnings in exp/tri1/log/update.*.log
1 warnings in exp/tri1/log/compile_questions.log
fstrmepslocal
fstminimizeencoded
fstrmsymbols exp/tri1/graph/disambig_tid.int
fstdeterminizestar --use-log=true
fstisstochastic exp/tri1/graph/HCLGa.fst
0.000443876 -0.0175772
HCLGa is not stochastic
8.1 General
Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
...
===================================================
tri2 : LDA + MLLT Training & Decoding
===================================================
Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
Training pass 4
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 5
Training pass 6
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 7
Training pass 8
Training pass 9
Training pass 10
Aligning data
Training pass 11
Training pass 12
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 13
Training pass 14
Training pass 15
Training pass 16
Training pass 17
Training pass 18
Training pass 19
Training pass 20
Aligning data
Training pass 21
Training pass 22
Training pass 23
Training pass 24
Training pass 25
Training pass 26
Training pass 27
Training pass 28
Training pass 29
Training pass 30
Aligning data
Training pass 31
Training pass 32
Training pass 33
Training pass 34
steps/diagnostic/analyze_alignments.sh --cmd run.pl
--max-jobs-run 10 data/lang exp/tri2
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri2/log/analyze_alignments.log
215 warnings in exp/tri2/log/update.*.log
92 warnings in exp/tri2/log/init_model.log
1 warnings in exp/tri2/log/compile_questions.log
exp/tri2: nj=30 align prob=-47.97 over 3.12h [re
tree-info exp/tri2/tree
make-h-transducer --disambig-syms-out=exp/tri2/graph/disambig_tid.int --transition-scale=1.0 data/lang_test_bg/tmp/ilabels_3_1 exp/tri2/tree exp/tri2/final.mdl
fstrmepslocal
fsttablecompose exp/tri2/graph/Ha.fst data/lang_test_bg/tmp/CLG_3_1.fst
fstminimizeencoded
fstrmsymbols exp/tri2/graph/disambig_tid.int
fstdeterminizestar --use-log=true
fstisstochastic exp/tri2/graph/HCLGa.fst
0.000445813 -0.0175771
HCLGa is not stochastic
===================================================
tri3 : LDA + MLLT + SAT Training & Decoding
===================================================
===================================================
SGMM2 Training & Decoding
===================================================
fstrmepslocal
fsttablecompose exp/sgmm2_4/graph/Ha.fst data/lang_test_bg/tmp/CLG_3_1.fst
fstrmsymbols exp/sgmm2_4/graph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/sgmm2_4/graph/HCLGa.fst
0.000485195 -0.0175772
HCLGa is not stochastic
MMI Formula: MMI training maximizes the mutual information between the acoustics and the
reference transcription, i.e. the probability of the correct transcription relative to all
competing transcriptions.
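The standard MMI objective, as commonly written in the literature (a reconstruction of the usual form; O_u and W_u are the observations and reference transcript of utterance u, and \kappa is the acoustic scale):

F_{\mathrm{MMI}}(\lambda) = \sum_{u} \log \frac{p_{\lambda}(O_u \mid W_u)^{\kappa}\, P(W_u)}{\sum_{W} p_{\lambda}(O_u \mid W)^{\kappa}\, P(W)}

Converting it a little: the denominator sums over all competing word sequences W (approximated by denominator lattices in practice), so maximizing F_{\mathrm{MMI}} is the same as maximizing the log-posterior of the references, \sum_{u} \log P_{\lambda}(W_u \mid O_u).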
===================================================
tree-info exp/sgmm2_4_ali/tree
tree-info exp/sgmm2_4_ali/tree
fsttablecompose exp/sgmm2_4_denlats/lang/L_disambig.fst exp/sgmm2_4_denlats/lang/G.fst
fstminimizeencoded
fstdeterminizestar --use-log=true
fstpushspecial
fstisstochastic exp/sgmm2_4_denlats/lang/tmp/LG.fst
1.27271e-05 1.27271e-05
fstcomposecontext --context-size=3 --central-position=1 --read-disambig-syms=exp/sgmm2_4_denlats/lang/phones/disambig.int --write-disambig-syms=exp/sgmm2_4_denlats/lang/tmp/disambig_ilabels_3_1.int exp/sgmm2_4_denlats/lang/tmp/ilabels_3_1.7065 exp/sgmm2_4_denlats/lang/tmp/LG.fst
make-h-transducer --disambig-syms-out=exp/sgmm2_4_denlats/dengraph/disambig_tid.int --transition-scale=1.0 exp/sgmm2_4_denlats/lang/tmp/ilabels_3_1 exp/sgmm2_4_ali/tree exp/sgmm2_4_ali/final.mdl
fsttablecompose exp/sgmm2_4_denlats/dengraph/Ha.fst exp/sgmm2_4_denlats/lang/tmp/CLG_3_1.fst
fstrmepslocal
fstminimizeencoded
===================================================
DNN Hybrid Training & Decoding
===================================================
LOG (copy-int-vector[5.5.164~1-9698]:main():copy-int-vector.cc:83) Copied 3696 vectors of int32.
Getting subsets of validation examples for diagnostics and combination (in order to avoid stressing the disk, these won't all run at once).
steps/nnet2/get_egs.sh: Finished preparing training examples
steps/nnet2/train_tanh.sh: 15 + 5 = 20 iterations,
steps/nnet2/train_tanh.sh: (while reducing learning rate) + (with
constant learning rate).
Training neural net (pass 0)
Training neural net (pass 1)
Training neural net (pass 2)
......
Training neural net (pass 7)
Training neural net (pass 8)
Training neural net (pass 9)
Training neural net (pass 10)
Training neural net (pass 11)
Training neural net (pass 12)
Mixing up from 1904 to 5000 components
Training neural net (pass 13)
Training neural net (pass 14)
Training neural net (pass 15)
Training neural net (pass 16)
Training neural net (pass 17)
Training neural net (pass 18)
Training neural net (pass 19)
Setting num_iters_final=5
--do-smbr true \
  $data_fmllr/train data/lang $srcdir ${srcdir}_ali ${srcdir}_denlats $dir || exit 1
# Decode
for ITER in 1 6; do
  steps/nnet/decode.sh --nj 20 --cmd "$decode_cmd" \
    --nnet $dir/${ITER}.nnet --acwt $acwt \
    $gmmdir/graph $data_fmllr/test $dir/decode_test_it${ITER} || exit 1
  steps/nnet/decode.sh --nj 20 --cmd "$decode_cmd" \
    --nnet $dir/${ITER}.nnet --acwt $acwt \
    $gmmdir/graph $data_fmllr/dev $dir/decode_dev_it${ITER} || exit 1
done
fi
===================================================
# INFO
steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
dir : exp/dnn4_pretrain-dbn
Train-set : data-fmllr-tri3/train '3696'
LOG ([5.5.164~1-9698]:main():cuda-gpu-available.cc:49)
### IS CUDA GPU AVAILABLE? 'HP' ###
# PREPARING FEATURES
copy-feats --compress=true scp:data-fmllr-tri3/ ... ark:-
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M, total:8116M, free/total:0.954946
# + normalization of NN-input at 'exp/dnn4_pretrain-dbn/tr_splice5_cmvn-g.nnet'
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5.nnet
num-components 3
input-dim 40
output-dim 440
number-of-parameters 0.00088 millions
component 1 : <Splice>, input-dim 40, output-dim 440,
###
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to exp/dnn4_pretrain-dbn/1.dbn
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:366M, total:8116M, free/total:0.954884
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.0902158min, (fps 207802)
LOG ([5.5.164~1-9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/2.cmvn
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/2.dbn
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7751M, used:365M, total:8116M, free/total:0.954984
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954984
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.1077min, (fps 174068)
LOG ([5.5.164~1-9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/3.cmvn
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M, total:8116M, free/total:0.954946
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files (fps 157076)
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/4.cmvn
initializing 'exp/dnn4_pretrain-dbn/4.rbm.init'
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:366M, total:8116M, free/total:0.954907
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.13468min, (fps 139197)
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/5.cmvn
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7751M, used:365M, total:8116M, free/total:0.954999
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.148966min, (fps 125848)
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/6.cmvn
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/6.dbn
# REPORT
# RBM pre-training progress (line per-layer)
Pre-training finished.
# Removing features tmpdir /tmp/Kaldi.1A01 @ HP train.ark
# Accounting: time=1934 threads=1
# Ended (code 0) at Fri Jan 4 13:56:03 CST 2019, elapsed time 1934 seconds
# steps/nnet/train.sh --feature-transform exp/dnn4_pretrain-dbn/final.feature_transform --dbn exp/dnn4_pretrain-dbn/6.dbn --hid-layers 0 --learn-rate 0.008 data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10 data/lang exp/tri3_ali exp/tri3_ali exp/dnn4_pretrain-dbn_dnn
# INFO
steps/nnet/train.sh : Training Neural Network
dir : exp/dnn4_pretrain-dbn_dnn
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M, total:8116M, free/total:0.954953
# PREPARING FEATURES
# re-saving features to local disk,
feat-to-dim 'ark:copy-feats scp:exp/dnn4_pretrain-dbn_dnn/train.scp.10k ark:- |' -
num-components 3
input-dim 40
output-dim 440
number-of-parameters 0.00088 millions
component 1 : <Splice>, input-dim 40, output-dim 440,
###
# NN-INITIALIZATION
# getting input/output dims :
feat-to-dim 'ark:copy-feats scp:exp/dnn4_pretrain-dbn_dnn/train.scp.10k ark:- | nnet-forward "nnet-concat exp/dnn4_pretrain-dbn_dnn/final.feature_transform '\''exp/dnn4_pretrain-dbn/6.dbn'\'' -|" ark:- ark:- |' -
LOG (nnet-forward[5.5.164~1-9698]:SelectGpuId():cu-device.cc:128) Manually selected to compute on CPU.
nnet-concat exp/dnn4_pretrain-dbn_dnn/final.feature_transform exp/dnn4_pretrain-dbn/6.dbn -
LOG ([5.5.164~1-9698]:Init():nnet-nnet.cc:314) </NnetProto>
LOG (nnet-initialize[5.5.164~1-9698]:main():nnet-initialize.cc:63) Written initialized model to exp/dnn4_pretrain-dbn_dnn/nnet.init
nnet-concat exp/dnn4_pretrain-dbn/6.dbn exp/dnn4_pretrain-dbn_dnn/nnet.init exp/dnn4_pretrain-dbn_dnn/nnet_dbn_dnn.init
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/6.dbn
tree-info exp/dnn4_pretrain-dbn_dnn/tree
tree-info exp/dnn4_pretrain-dbn_dnn/tree
fsttablecompose exp/dnn4_pretrain-dbn_dnn_denlats/lang/L_disambig.fst exp/dnn4_pretrain-dbn_dnn_denlats/lang/G.fst
fstminimizeencoded
fstdeterminizestar --use-log=true
fstpushspecial
fstisstochastic exp/dnn4_pretrain-dbn_dnn_denlats/lang/tmp/CLG_3_1.fst
1.27657e-05 0
make-h-transducer --disambig-syms-out=exp/dnn4_pretrain-dbn_dnn_denlats/dengraph/disambig_tid.int --transition-scale=1.0 exp/dnn4_pretrain-dbn_dnn_denlats/lang/tmp/ilabels_3_1 exp/dnn4_pretrain-dbn_dnn/tree exp/dnn4_pretrain-dbn_dnn/final.mdl
fstrmepslocal
fsttablecompose exp/dnn4_pretrain-dbn_dnn_denlats/dengraph/Ha.fst exp/dnn4_pretrain-dbn_dnn_denlats/lang/tmp/CLG_3_1.fst
fstrmsymbols exp/dnn4_pretrain-dbn_dnn_denlats/dengraph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/dnn4_pretrain-dbn_dnn_denlats/dengraph/HCLGa.fst
0.000473932 -0.000485819
Success
Chapter 14. Final Results
from MonoPhone to DNN
===============================================================
%WER 31.8 | 400 15057 | 71.6 19.6 8.8 3.4 31.8 100.0 | -0.506 | exp/mono/decode_dev/score_5/ctm_39phn.filt.sys
%WER 24.8 | 400 15057 | 79.2 15.8 5.0 4.0 24.8 99.8 | -0.147 | exp/tri1/decode_dev/score_10/ctm_39phn.filt.sys
%WER 22.6 | 400 15057 | 81.0 14.2 4.7 3.6 22.6 99.5 | -0.217 | exp/tri2/decode_dev/score_10/ctm_39phn.filt.sys
%WER 20.5 | 400 15057 | 82.6 12.8 4.7 3.1 20.5 99.8 | -0.588 | exp/tri3/decode_dev/score_10/ctm_39phn.filt.sys
%WER 23.2 | 400 15057 | 80.0 14.8 5.2 3.3 23.2 99.5 | -0.188 | exp/tri3/decode_dev.si/score_10/ctm_39phn.filt.sys
%WER 21.0 | 400 15057 | 81.9 12.5 5.6 2.9 21.0 99.8 | -0.525 | exp/tri4_nnet/decode_dev/score_5/ctm_39phn.filt.sys
%WER 18.1 | 400 15057 | 85.2 11.0 3.8 3.3 18.1 99.5 | -0.662 | exp/sgmm2_4/decode_dev/score_6/ctm_39phn.filt.sys
%WER 18.3 | 400 15057 | 84.8 11.3 3.9 3.2 18.3 99.5 | -0.279 | exp/sgmm2_4_mmi_b0.1/decode_dev_it1/score_8/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.2 11.4 4.4 2.7 18.5 99.5 | -0.138 | exp/sgmm2_4_mmi_b0.1/decode_dev_it2/score_10/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.3 11.4 4.4 2.8 18.5 99.5 | -0.148 | exp/sgmm2_4_mmi_b0.1/decode_dev_it3/score_10/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.3 11.4 4.3 2.8 18.5 99.5 | -0.155 | exp/sgmm2_4_mmi_b0.1/decode_dev_it4/score_10/ctm_39phn.filt.sys
%WER 17.4 | 400 15057 | 84.8 10.4 4.7 2.3 17.4 99.3 | -0.616 | exp/dnn4_pretrain-dbn_dnn/decode_dev/score_6/ctm_39phn.filt.sys
%WER 17.3 | 400 15057 | 85.1 10.4 4.5 2.4 17.3 99.3 | -0.581 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_dev_it1/score_6/ctm_39phn.filt.sys
%WER 17.0 | 400 15057 | 85.9 10.2 3.9 2.9 17.0 99.5 | -0.610 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_dev_it6/score_6/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 85.9 11.0 3.2 2.8 16.9 99.5 | -0.126 | exp/combine_2/decode_dev_it1/score_6/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 86.0 10.9 3.1 2.8 16.9 99.3 | -0.126 | exp/combine_2/decode_dev_it2/score_6/ctm_39phn.filt.sys
%WER 16.8 | 400 15057 | 85.8 10.8 3.4 2.6 16.8 99.3 | -0.036 | exp/combine_2/decode_dev_it3/score_7/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 86.3 10.9 2.8 3.2 16.9 99.0 | -0.260 | exp/combine_2/decode_dev_it4/score_5/ctm_39phn.filt.sys
%WER 32.1 | 192 7215 | 71.1 19.4 9.5 3.3 32.1 100.0 | -0.468 | exp/mono/decode_test/score_5/ctm_39phn.filt.sys
%WER 26.1 | 192 7215 | 77.6 16.9 5.5 3.6 26.1 100.0 | -0.136 | exp/tri1/decode_test/score_10/ctm_39phn.filt.sys
%WER 24.0 | 192 7215 | 79.7 15.0 5.4 3.7 24.0 99.5 | -0.260 | exp/tri2/decode_test/score_10/ctm_39phn.filt.sys
%WER 21.2 | 192 7215 | 82.1 13.2 4.7 3.3 21.2 99.5 | -0.695 | exp/tri3/decode_test/score_9/ctm_39phn.filt.sys
%WER 24.0 | 192 7215 | 79.3 15.4 5.2 3.4 24.0 99.5 | -0.281 | exp/tri3/decode_test.si/score_9/ctm_39phn.filt.sys
%WER 22.8 | 192 7215 | 80.3 13.7 6.0 3.1 22.8 100.0 | -0.423 | exp/tri4_nnet/decode_test/score_5/ctm_39phn.filt.sys
%WER 19.5 | 192 7215 | 82.8 12.1 5.1 2.3 19.5 100.0 | -0.120 | exp/sgmm2_4/decode_test/score_10/ctm_39phn.filt.sys
%WER 19.7 | 192 7215 | 84.3 11.9 3.8 3.9 19.7 100.0 | -0.673 | exp/sgmm2_4_mmi_b0.1/decode_test_it1/score_6/ctm_39phn.filt.sys
%WER 20.0 | 192 7215 | 83.1 12.3 4.6 3.1 20.0 100.0 | -0.179 | exp/sgmm2_4_mmi_b0.1/decode_test_it2/score_9/ctm_39phn.filt.sys
%WER 20.0 | 192 7215 | 83.1 12.4 4.5 3.2 20.0 100.0 | -0.195 | exp/sgmm2_4_mmi_b0.1/decode_test_it3/score_9/ctm_39phn.filt.sys
%WER 19.9 | 192 7215 | 83.8 12.1 4.1 3.7 19.9 99.5 | -0.521 | exp/sgmm2_4_mmi_b0.1/decode_test_it4/score_7/ctm_39phn.filt.sys
%WER 18.3 | 192 7215 | 84.4 11.2 4.4 2.7 18.3 100.0 | -1.177 | exp/dnn4_pretrain-dbn_dnn/decode_test/score_4/ctm_39phn.filt.sys
%WER 18.4 | 192 7215 | 84.7 11.2 4.2 3.1 18.4 100.0 | -1.168 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_test_it1/score_4/ctm_39phn.filt.sys
%WER 18.5 | 192 7215 | 84.9 11.4 3.7 3.4 18.5 100.0 | -0.779 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_test_it6/score_5/ctm_39phn.filt.sys
%WER 17.8 | 192 7215 | 84.9 11.6 3.5 2.7 17.8 99.5
%WER 18.2 | 192 7215 | 85.1 11.6 3.3 3.2 18.2 99.5 | -0.236 | exp/combine_2/decode_test_it2/score_5/ctm_39phn.filt.sys
%WER 18.1 | 192 7215 | 84.8 11.7 3.5 2.9 18.1 99.5 | -0.124 | exp/combine_2/decode_test_it3/score_6/ctm_39phn.filt.sys
%WER 18.0 | 192 7215 | 84.7 11.8 3.5 2.8 18.0 99.5 | -0.130 | exp/combine_2/decode_test_it4/score_6/ctm_39phn.filt.sys
===================================================
Finished successfully on Fri Jan 4 14:16:46 CST 2019
===================================================
Next Step
When voice assistants began to emerge in 2011 with the introduction of Siri, no one could
have predicted that this novelty would become a driver for tech innovation. Now, nearly 10
years later, it is estimated that one in six Americans owns a smart speaker (Google Home,
Amazon Echo), and eMarketer forecasts that nearly 100 million smartphone users will be
using voice assistants in 2020.
A
acc-tree-stats 113
acc_tree_stats 125
AccumAmDiagGmm 81
AccumDiagGmm 81
acoustic features 3
Acoustic likelihood 95
acoustic model 3
Acoustic Model 84
acoustic models 4, 8
Acoustic models 6
Acoustic-Phonetic Continuous Speech Corpus 38
AffineTransform 160
algorithm 9
ali 112
ali/ali.gz 114
ali.gz 115
alignment 113
alignments 114
AmDiagGmm 81
ANN 2
ASR I, 4, 7
Automatic Speech Recognition I
B
b_a_i 115
bash 17
Baum-Welch 82
Bayes 150
BLAS 12
BMMI 161
Build_tree 125
C
CD-1 159
CE 138
Cepstral mean and variance normalization 76
check_dependencies.sh 19
CMN 158
CMU 5, 10
CMVN 76, 77
coefficient 75
compile-train-graphs 114
convert-ali 114
corpus 37
CUDA 157
D
Dan's DNN 149, 150
DARPA 38
datasets 48
DCT 5, 75
decode.sh 83
decoding 3
Decoding 9, 96
Decoding Graph 98
De-distribution 8
deep learning 10
Deep Neural Networks 7, 149
defacto 34
Deltas 109
Deltas + Delta-Deltas 109
DFT 75
DiagGmm 81
DiagGMM 85
DiagGmmNormal 81
dialect 39
Dictionary 59
dir/ali.gz 114
discriminative training 138
DNN 2, 7, 157, 158
DNN/CNN 75
DNN hybrid 12
DNN-HMM-based 8
DR1 44
DR8 44
DTW 2
E
egs 41
ELRA 13
EM 82
EndContextDependency 90
eventmap 90
F
fbank 75
feats.scp 76
feature-alignment 161
Feature Extraction 5
Features extraction 3
feature vectors 3
FFT 5
Filter-bank Analysis 3
Final Results 191
Finite State Transducers (FSTs) 12
fMLLR 77, 125, 131, 158
fMLLR-adapted 125
forward 112
forward pdf 112
forward-pdf 81
Fourier Analysis 3
Frame-level Cross-entropy 160
framing 3
FST 12, 60, 98
fstinput 13
fsts.JOB.gz 85
G
Gaussian-Bernoulli 159
GCONSTS 85
Git 16
github 17
GMM 7, 8, 75
gmm-est 86
GMM-HMM 2, 7
gmm-init-model 114
gmm-init-mono 84
gmm-mixup 114
GPU 15, 157
H
HCLG 85, 98
Hidden Markov Model 7
Hinton 2
HMM 7
hmm-state 81
HTK 2, 89
hybrid 7
Hybrid Training and Decoding framework 150
I
Individual Word Error Rates 7
input_transform 160
ISIP 10
IWER 7
K
Kaldi 10, 11
Karel 149
Karel's 158
Karel's DNN 157
Karel Vesely 149, 157
L
language model 3
Language Model 9
LAPACK 12
lattice 97, 141
LAU 60
layer-wise pre-training 157
LDA 119, 125
LDA+MLLT 119, 125
LDC 13, 37
LDC93S1 37
L_disambig.fst 61, 98, 99
lexicon 3
L.fst 61
likelihood 11, 137
likelihood, Acoustic 95
LM 9
LVCSR 103
M
machine learning 10
maximum likelihood 11
Maximum likelihood 138
Maximum Mutual Information 138
max_iter_inc 115
Mel Frequency Cepstral Coefficients 5
mfcc 35
MFCC 5, 75
MFCCs 110
MFCC's DCT 75
mkgraph.sh 83
MLE 86, 139
MLLT 119, 125
MMI 137, 138
MMI + SGMM2 137
MMI model 139
monoPhone 111
Monophone 82
MonoPhone 81
MPE 161
N
N-best 97
N-Gram 2
NIST 39
nnetbin 149
nonlinearities 150
NSN 60
NTIS 39
NVidia 149
O
oov.txt/int 61
OpenFst 12
OpenFST 22
P
pattern recognition I
pdf 112
pdfclass 90
p.d.f-id 92
Perl 17
perplexity 103
phones 61
phone status 112
phones.txt 61
Povey 12
pre_ali.gz 132
pre-emphasis 3
pre-processing 3
Pre-processing 4
pre-trained DBN 160
Pre-training 159
probability of Transition 63
P(W|O) 139
Python 17
Pytorch 10
R
RBM 149
Recipes 46
Restricted Boltzmann Machines 157
run_dnn.sh 158
S
SAT 125
scp 83
self-loop 112
self-loop pdf 112
Sequence-discriminative 161
Sequence-discriminative Training 161
SGD 8
SGMM2 131
SGMMs 12
Sigmoid 160
signals 3
SIL 60
SIL_B 60
sMBR 149, 157, 162
sMBR4 161
Softmax 160
Speech Recognition 3
Sphinx 5, 10
spk2gender 53
spk_id 54
SPN 60
SRI 39
stages 158
steps/align_fmllr.sh 132
steps/align_si.sh 112
stm 53
Stochastic Gradient Descent 161
Subspace Gaussian Mixture Models (SGMMs) 12
sum-tree-stats 113
SVM 2
T
tanh 160
TE 90
Tensorflow 10
TIMIT 2, 8
TIMIT corpus 44
toolkit 11
topo 61, 85
trailblazers 34
train_mono.sh 83
train-mono.sh 83
transducers 22
TransitionModel 81
Tri1 109
tri1 : Deltas + Delta-Deltas 112
tri2 119, 121
tri2: LDA + MLLT 119
tri3 125, 126
tri3: LDA+MLLT+SAT 125
triPhone 111
Triphone 109
triphonestates 160
U
utils 46
utt 112
utterance 54
utt_id 54
V
vector 5
Viterbi 9, 81, 82, 90
VQ 2
W
WER 34, 75
WFST 22, 96
windowing 3
word dictionary 47
word error rate 34
words.txt 61
Y
Yes or No 23
Z
Zhang 12