
Hands-on Speech Recognition with

Kaldi/TIMIT
Demystify Automatic Speech Recognition
(ASR)
& Deep Neural Networks (DNN)
Yamin Ren
Leo Reny
HYPech.com
India • Japan • Korea • Singapore • United Kingdom • United States
Automatic Speech Recognition Series
Hands-on Kaldi: ALL RIGHTS RESERVED.

Authors: Leo Reny, Yamin Ren
Publisher and GM: Mel Pearce
Associate Director: Kris Burns
Manager of Editorial Services: Hayden Simpson
Marketing Manager: Lilyana Cardenas
Acquisitions Editor: Alexandra Lewis
Project and Copy Editor: Isabella Fletcher
Technical Reviewer: Sergio Crawford
Interior Layout Tech: ReApex Limited
Cover Designer: Luke Fletcher
Indexer: Harley Kramer
Proofreader: Blair Salas

No part of this work covered by the copyright herein may be reproduced, transmitted, stored,
or used in any form or by any means graphic, electronic, or mechanical, including but not
limited to photocopying, recording, scanning, digitizing, taping, web distribution, information
networks, or information storage and retrieval systems, except as permitted under Section
107 or 108 of the 1976 United States Copyright Act, without the prior written permission of
the publisher.

Throughout this book, we refer to products and designs that are not our property. These references are meant only to be informational. We do not represent the companies mentioned and were not paid promotional fees. However, if these companies would like to send us evaluation copies of future products, we would be thrilled. References to products are not endorsements, but they do reflect our opinions in some cases.

Computer software products mentioned are the property of their respective publishers. Instead of attempting to list every software publisher and brand, or including trademark symbols throughout this book, we simply acknowledge that these product and brand names are protected under U.S. and international laws. Fonts and designs are the intellectual property of the design artists. Although U.S. copyright law does not protect font designs, we consider them the property of the designers and licensing agencies.
For product information and assistance, contact us at Hypech Publisher Support, 1-872-222-8067.
For permission to use material from this text or product, submit all requests to:
[email protected]


Printed in the US 1 2 3 4 5 6 7 12 11 10

Kaldi is an open-source speech recognition toolkit written in C++ for speech recognition and signal processing, freely available under the Apache License v2.0. Access, Excel, Microsoft, SQL Server, and Windows are registered trademarks of Microsoft Corporation. MySQL is a registered trademark of MySQL AB. Mac OS is a registered trademark of Apple Inc. All other trademarks are the property of their respective owners.

ISBN-13: 978-1-7774122-6-5 (e-Book)


ISBN-13: 978-1-7774122-5-8 (Book)
ISBN-13: 979-8-5666334-9-7 (KDP)

Hypech.com is a leading provider of customized learning solutions for IT technology.


Hypech.com learning products are represented in Canada by ReApex Corp.

Visit our corporate website at http://HYPech.com.

For You
Acknowledgments
We would like to acknowledge and thank Daniel Povey for his
invaluable work on Kaldi, indeed a great ASR Toolkit, and for
his great help and kindness in letting us use some of the text
of Kaldi’s web pages describing the software. After he moved
to Xiaomi, Daniel’s brilliant wisdom inspired all the people
working around him.

Moreover, we would like to acknowledge Karel Vesely for his


great work on his very effective version of DNN.

We would like to thank those friends from Google, Xiaomi, iFlytek, and Baidu Institute who contributed to our understanding of ASR and AI. Without their encouragement and support, we might not have started this Kaldi book project.

We would like to thank the many editors and workers at


Hypech Publisher who skillfully enhanced and improved many
aspects of this book as it was brought to fruition. Without
Alexandra Lewis, my acquisitions editor, this book literally
would not exist. Sergio Crawford, my technical editor, did an
outstanding job in his review. I thank him for the numerous
suggestions and corrections that he provided. Finally, Mel Pearce, my project editor, was superb in pulling it all together, while adding a professional touch and coherency to the entire project.

Finally, and most especially, we would like to thank everyone in our immediate family for their encouragement and support as we've dedicated ourselves to this project.
The Author
Seeing rather than listening could help the hearing-loss community a lot. That is why Yamin Ren founded Seeing Voice Corporation in 2018, devoted to helping people with hearing loss through the advanced artificial intelligence technology of Automatic Speech Recognition.

As the Product Manager, Yamin led a team of 30 developers to build the cloud-based ASR platform SV Cloud Speech Recognition. The SV ASR is based on Kaldi and TensorFlow. The system has been widely used in government and industry. One of its successful products is the Seeing Voice AR Glass.

Yamin has been involved with AI and software development for 20+ years. He designed and developed multiple software products for factories and companies before starting in ASR. His main areas of expertise are software development, product design, and innovation. He has served as a consultant for the Canadian federal government, Glencore, and many other industry leaders. He has worked with Accenture and HWG. He holds an MBA from the University of California, Los Angeles, with a specialization in Management Information Systems.

For more information on his current activities or to contact the


author, please visit http://www.hypech.com.
CONTENTS
Introduction ix
Chapter 1. ASR Review 1

1.1 History of ASR 2


1.2 General Steps 3
1.2.1 Pre-processing 4
1.2.2 Feature Extraction 5
1.2.3 Acoustic Model Training 6
1.2.4 Language Model Training 9
1.2.5 Decoding 9
1.3 Popular ASR tools 10
1.4 Kaldi 11

Chapter 2. Kaldi Installation 15

2.1 Machine and OS 15


2.2 Install and Configure Git 16
2.3 Install and Configure Kaldi 17
2.4 Yes or No 23

Chapter 3. Setup TIMIT 37

3.1 About TIMIT 38


3.2 TIMIT Directory Structure and Files 41
3.2.1 Top Level Directories and Files 41
3.2.2 TIMIT Corpus 44
3.2.3 TIMIT Recipes 46

Chapter 4. Kaldi Data Preparation 47

Step 1: Setup Environment (line 1-32) 49


Step 2: Prepare Data (line 33-50) 52
Step 3: Dictionary 59
Step 4: Create FST Script 60
The Result 64

Chapter 5. Kaldi Feature Extraction 75

Step 1: Get feats.scp 76


Step 2: Get cmvn.scp 77
MFCC & CMVN Output 79

Chapter 6. MonoPhone 81

6.1 Monophone Training Method 82


6.2 Training Process 83
6.3 Train Acoustic Model 84
6.4 Decoding 96
6.3.1 Build up Decoding Graph 98
6.3.2 Decoding 102
6.5 Final Results and Output 103

Chapter 7. Triphone: Deltas 109

7.1 General 110


7.2 tri1 : Deltas + Delta-Deltas Analysis 112
7.3 Output 116

Chapter 8. tri2: LDA + MLLT 119

8.1 General 119
8.2 tri2 Output 121

Chapter 9. tri3: LDA+MLLT+SAT 125

tri3 Output 126

Chapter 10. SGMM2 131


10.1 SGMM2 General 131
10.2 SGMM2 Output 133

Chapter 11. MMI + SGMM2 137

11.1 Why MMI? 137


11.2 MMI General 139
11.3 MMI + SGMM2 Output 142

Chapter 12. Dan’s DNN 149

12.1 General 149
12.2 Output 151

Chapter 13. Karel’s DNN 157

13.0 Stage 0: Store Features 158


13.1 Stage 1: Pre-training 159
13.2 Stage 2: Frame-level Cross-entropy 160
13.3 Stage 3: Sequence-discriminative Training 161
13.4 Stage 4: Iteration of sMBR 162
13.5 Final Output 163

Chapter 14. Final Results 191


Next Step 195
Index 196
Introduction
Speech Recognition

Speech is an acoustic signal containing the information of an idea formed in the speaker's mind. The purpose of speech recognition is to retrieve the information contained in the speech signal. Noise and other irrelevant signals hurt the accuracy of speech recognition.

Speech processing has three stages. The signal processing stage studies the human auditory system and processes the signal in the form of frames. The phoneme stage processes the speech phonemes, the basic units of speech. Built on signal and phoneme, the last stage is word processing, where the linguistic content of the speech is the target.

Kaldi models the above three stages. It is roughly composed of two parts: the traditional HMM-GMM and the deep neural network (DNN).

When learning Kaldi, it's a good idea to start with HMM-GMM. Many mature models such as steps/train_delta.sh, steps/train_fmllr.sh, and steps/decode.sh are all based on the HMM-GMM model. Lacking this fundamental knowledge, you will be confused about decision trees, alignments, lattices, etc.

To learn HMM-GMM, the best resource is not Kaldi but another open-source tool, HTK. The HTK Book elaborates all the details and concepts of speech recognition, from training to decoding.
For the HMM-GMM purpose, we only need to know decision trees (DTs), training network extension, the Viterbi algorithm, and the EM algorithm well. Try to identify the differences among them and understand when to use which.

Chapters 2, 8, 10, 12, and 13 of the HTK Book are the most important. Many professionals suggest checking the Kaldi scripts while reading the book, which helps connect the concepts with practice. For example, the build-tree script in steps/train_delta.sh is a perfect match to Chapter 10, tree-based clustering, of the HTK Book, and the gmm-est tool helps a lot in understanding Chapter 8, Parameter Re-Estimation Formulae.

We will understand speech recognition clearly after clarifying HMM-GMM, following which we can start on the neural network. Kaldi supports most of the neural networks, such as MLP, RNN, CNN, LSTM, and CTC.

MLP might be the best entrance to the neural network; we can treat MLP as the basic neural network model. Kaldi provides three tools for neural networks: nnet1, nnet2, and nnet3.

Both nnet1 and nnet2 use the HMM-DNN framework. The difference is that nnet2 uses the NG-SGD method first introduced by Dan Povey, which supports multi-threaded parallel training. To learn nnet2, we can start with Dan's papers published after 2012.

Armed with nnet1 and nnet2, we can challenge the nnet3 chain model and other neural networks like RNN, CNN, and LSTM. The rest is just following the papers from Microsoft, Dan Povey, Google, and the University of Toronto.

Some languages (for example, Chinese) have tones. The tone itself carries meaning, so we need to add tone information to the features.

All in all, speech recognition is still more theoretical than practical. We need to learn and understand many theories before doing valuable work, and most of the formulas need to be derived to be understood.

Kaldi is a speech recognition platform providing many ASR models such as GMM, SGMM, DNN, and HMM. We can train on our own data with these models and then recognize with them. The only thing we need to do is modify the scripts to use our own corpus. There are a couple of competing toolkits (such as HTK).

Kaldi is comparatively easy to start with and to deliver. We have implemented several ASR projects using Kaldi, and the feedback for the online version is not bad. In this book, we will show every step from start to delivery.

Our next exploration will focus on offline ASR; Dan tells us this has been accomplished in the chain model, though it is not as effective as we expected.

Chapter 1 discusses some background of ASR. Knowing the history and context of a new topic is, in our humble opinion, the best way to understand it. Always ask: Who is it? Where did it come from? Where is it going?
Once these soul questions have been answered, we get to install Kaldi. Chapter 2, Installation, explains the Kaldi environment and installation process. We could be on Mac, PC, Linux, Windows, or any other platform.

We test the Kaldi installation with some small projects like yes/no. There is another recipe, 10-digit speech recognition, that is also good for testing, but it is not included in this book.

Chapter 3 downloads and sets up TIMIT in Kaldi with specific environment parameters.
Chapter 4 prepares the data for TIMIT. We learn about FSTs, the dictionary, and some other relevant concepts during preparation. Chapter 5 extracts features; MFCC and CMVN will be discussed in detail.
Chapter 6 runs the monophone model for TIMIT, where all the fundamental ASR concepts are explained.

Chapters 7 to 9 run the triphone models tri1, tri2, and tri3.
Chapter 10 runs the SGMM2 model.
Chapter 11 runs MMI + SGMM2.
Chapter 12 runs Dan's DNN.

Chapter 13 covers all the stages of Karel's DNN, including storing features, pre-training, frame-level cross-entropy, sequence-discriminative training, and iteration of sMBR.

Chapter 14 presents the final results of the whole TIMIT run, which can be used as a template for comparison.
When finishing the whole book, we will be able to run Kaldi's ASR neural network models. This book gives you a starting point to pursue higher goals in the artificial intelligence world.
Chapter 1. ASR Review
Automatic Speech Recognition
The essence of voice and speech recognition is pattern recognition. We compare the unknown speech pattern with known ones; the best match is the recognized result. To put it simply, we train on voice and speech to get a statistical model that connects language, voice, or speech, with probabilities, to the waveform, based on matched records of speech and waveform, so that once we get a waveform we can derive the actual speech. Speech recognition is an interdisciplinary technology, including signal processing, pattern recognition, probability, information systems, sound mechanics, hearing mechanics, statistics, linguistics, and artificial intelligence.
Original comic by sandserif, https://www.instagram.com/sandserifcomics/

1.1 History of ASR

History can help us understand the present and predict the future.

Bell Labs is considered the pioneer of speech recognition. The first system, called "Audrey," was developed by Davis and colleagues in the early 1950s.
Even though CMU and some other famous institutes joined speech recognition research, progress was very slow in the '60s and '70s.

The '80s saw fast growth in speech recognition. Two major technologies were developed, HMM and N-gram, in which statistics started to be used for prediction. Scholars changed the research method from template models to the HMM statistical model. GMM-HMM has been the main framework in the speech recognition field since then.

Speech recognition matured in the '90s, largely due to the personal computer. Faster processors and bigger memory made statistical models practical. HTK, developed at Cambridge, promoted the GMM-HMM model, with which speech recognition technology achieved close to 80% accuracy.

The key breakthrough happened in 2006, when the DBN was introduced by Hinton. The DBN directly led to the revival of DNN research.

The year 2009 saw Hinton's success with TIMIT. Scholars started using DNN-HMM instead of GMM-HMM.

Different techniques for the feature-vector extraction process and for speech pattern recognition have been developed since 2009, among which Dynamic Time Warping (DTW), Vector Quantization (VQ), Artificial Neural Networks (ANN), and Support Vector Machines (SVM) are popular.

1.2 General Steps


So-called "speech recognition" actually converts a piece of speech signal into text. The system is mainly composed of acoustic features, an acoustic model, a lexicon, a language model, and a decoder.

To create more acceptable feature vectors, we usually perform pre-emphasis, windowing, Fourier analysis, filter-bank analysis, framing, and other pre-processing. The pre-processing of voice helps extract the original voice signal.

Feature extraction transforms the sound signal from the time domain to the frequency domain and provides an appropriate feature vector for the acoustic model. The acoustic model calculates a score for each feature based on acoustic analysis. The language model finds the word probabilities based on linguistic theory. In the end, the word probabilities are decoded into text based on the lexicon.
Kaldi ASR: Extending the ASpIRE model, Mar 11, 2017 ~ KRISZTIÁN
1.2.1 Pre-processing

There are certain requirements for a speech recognition model to process raw sound. Pre-processing can filter noise, identify endpoints, and perform framing and pre-emphasis.
Distinguishing speech signals from other non-speech signals is always important for ASR systems. Audio and speech segmentation will always be needed to break the continuous audio stream into manageable chunks. By using ASR acoustic models trained for a particular acoustic condition, such as male versus female speakers or wide bandwidth (high-quality microphone input) versus narrow telephone bandwidth, the overall performance can be significantly improved.

The pre-processing can also be designed to provide additional interesting information, such as division into speaker turns and speaker identities, allowing for automatic indexing and retrieval of all occurrences of the same speaker. It can also provide end-of-turn information relevant for recovering punctuation marks, syntactic parsing, etc. Additionally, speaker segmentation and clustering information can be used for efficient speaker adaptation, which has been shown to significantly improve ASR accuracy. All this information, when combined with the text output of the ASR, results in a rich transcription that is easier to understand.

1.2.2 Feature Extraction

Speech is an analog audio signal in the form of sound waves. ASR is a mathematical model that requires digital input. Features are the numbers that represent the wave. The main method we use is MFCC (Mel Frequency Cepstral Coefficients).

MFCC uses the FFT (Fast Fourier Transform) to convert each frame of the speech sample from a time series of bounded time-domain signals into a frequency spectrum. The frame that has undergone the windowing process is converted into a frequency spectrum. Humans perceive voice information based on time-domain signals, so at this stage the Mel spectrum is converted back toward the time domain using the Discrete Cosine Transform (DCT). The result is called the Mel-frequency cepstral coefficients (MFCC).

CMU Sphinx uses MFCC in a slightly different way. Starting from an audio clip, it slides windows of 25 ms width, 10 ms apart, to extract MFCC features. For each window frame, 39 MFCC parameters are extracted. The primary objective of speech recognition is to build a statistical model to infer the text sequence W (say, "cat sits on a mat") from a sequence of feature vectors X. Putting these per-frame values together, we get the feature vector.
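Kaldi reads these framing parameters from a small configuration file that its feature-extraction scripts pass on to compute-mfcc-feats. Below is a minimal sketch of such a conf/mfcc.conf (illustrative values only, assuming Kaldi's usual one-option-per-line config syntax):

# 25 ms windows shifted by 10 ms, 13 cepstra per frame, 16 kHz audio
--use-energy=false
--sample-frequency=16000
--frame-length=25
--frame-shift=10
--num-ceps=13

The steps/make_mfcc.sh script used later in Chapter 5 typically picks this file up from the recipe's conf directory.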

Another approach looks for all possible sequences of words (with a limited maximum length) and finds the one that matches the input acoustic features the best.
Original from https://manningbooks.medium.com/the-computer-visionpipeline-part-4-feature-extraction-6343ef063588
1.2.3 Acoustic Model Training

An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. A well-trained acoustic model (often called a voice model) is critical to the performance of an ASR-enabled application.
from: http://www.kecl.ntt.co.jp/icl/signal/en/topics1.html

Acoustic models are trained by taking audio recordings of speech and their text transcriptions and creating statistical representations of the sounds that make up each word. A simple rule is that a model trained with hundreds or thousands of hours of transcribed audio will perform better than a model trained with limited audio. Quantity of data is important, but the data should also include a broad distribution of sounds.

The HMM (Hidden Markov Model) is widely used in ASR systems for acoustic modeling. The unit of the acoustic model could be a phone, a syllable, a word, or another level of sound. For a small amount of data, the syllable is good enough. For a large sample, using phones (i.e., consonants and vowels) is more appropriate. The larger the sample, the smaller the acoustic model unit.

The task of automatic speech recognition has traditionally been accomplished by modeling the problem as a Hidden Markov Model (HMM). Gaussian Mixture Models (GMMs) have been used to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients that represents the acoustic input.

In the context of speech recognition, the GMM-HMM hybrid approach has a serious shortcoming: GMMs are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. Because speech is typically produced by modulating a relatively small number of parameters, it has been hypothesized that the true underlying structure is much lower-dimensional than is immediately apparent in a window that contains many coefficients.

Deep Neural Networks (DNNs), on the other hand, have the potential to learn much better models of data that lie on or near a nonlinear manifold. Many studies have confirmed this hypothesis, with DNN systems outperforming GMMs by approximately 6% in Individual Word Error Rate (IWER). The networks are trained to optimize a given training objective function using the standard error back-propagation procedure. In a DNN-HMM hybrid system, the DNN is trained to provide posterior probability estimates for the different HMM states. Typically, cross-entropy is used as the objective, and the optimization is done through stochastic gradient descent (SGD). For any given objective, the important quantity to calculate is its gradient with respect to the activations at the output layer. The gradients for all the parameters of the network can be derived from this one quantity via the back-propagation procedure.
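For instance, with a softmax output layer and the cross-entropy objective, this gradient takes a particularly simple and well-known form:

$$\frac{\partial E}{\partial a_k} = y_k - t_k$$

where $a_k$ is the pre-softmax activation for HMM state $k$, $y_k$ is the network's posterior for that state, and $t_k$ is the (usually one-hot) training target for the current frame.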

Compared to traditional GMM-HMM acoustic models, DNN-HMM-based acoustic models perform better on the TIMIT database. When compared with the GMM, the DNN is advantageous in the following ways:

1. No assumption about the distribution of the acoustic features is required when the DNN models their posterior probability.

2. The GMM requires de-correlation processing of the input features, but the DNN is capable of using various forms of input features.

3. The GMM can only use single-frame speech as input, but the DNN is capable of capturing valid context information by splicing adjoining frames.
1.2.4 Language Model Training

Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language modeling can effectively connect grammar and semantics and describe the relationships between words, so as to improve recognition accuracy and narrow down the search scope.

There are three stages in language modeling: lexicon, grammar, and syntax.

The N-gram model is so far the most effective way to predict language. In daily life, we can often understand another person's meaning before they finish the sentence: given the first several words, we predict the rest. An N-gram uses the previous n-1 words to predict the next word, and the whole sentence is derived from this running prediction. Bigrams and trigrams are the most commonly used.
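Formally, the N-gram assumption approximates the probability of a word sequence by conditioning each word only on its $n-1$ predecessors:

$$P(w_1, w_2, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

so a bigram ($n=2$) reduces to $\prod_i P(w_i \mid w_{i-1})$ and a trigram ($n=3$) to $\prod_i P(w_i \mid w_{i-2}, w_{i-1})$.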

1.2.5 Decoding

The acoustic model and language model calculate the probabilities that a particular frame of the input speech corresponds to each of a specific set of speech sounds. Decoding finds the most likely word sequence given this per-frame, per-sound probability matrix, in light of the phone models, the word pronunciations in the dictionary, and the word-sequence constraints described by the grammar.

To be more accurate, decoding means searching. The decoder essentially searches over all possible alignments of all possible pronunciations of all possible word sequences to find the most likely one. The Viterbi algorithm helps limit the number of paths that must be searched to find the maximum-likelihood path.
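At its core, the Viterbi algorithm is a dynamic-programming recursion over frames. In standard HMM notation,

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$

where $\delta_t(j)$ is the score of the best partial path ending in state $j$ at frame $t$, $a_{ij}$ is the transition probability from state $i$ to state $j$, and $b_j(o_t)$ is the acoustic score of observation $o_t$ in state $j$. Keeping a back-pointer to the maximizing $i$ at each step lets the decoder recover the best full path at the end.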

1.3 Popular ASR tools

Below is an old table showing the characteristics of Kaldi, CMU Sphinx, HTK, Julius, and ISIP. Kaldi has since been integrated with TensorFlow and PyTorch and has made dramatic progress in deep learning and machine learning.

Below is the comparison from one point of view.
1.4 Kaldi

Kaldi is an open-source toolkit made for dealing with speech recognition. It can also handle speaker recognition and speaker separation.

Kaldi is written mainly in C++, but the toolkit is wrapped with Bash and Python scripts on top of the C++. For basic usage, this wrapping spares the need to go too deep into the source code.
Daniel Povey, The Kaldi Speech Recognition Toolkit, 2011

At its early stage, the Kaldi toolkit supported modeling of context-dependent phones of arbitrary context lengths, all commonly used techniques that can be estimated using maximum likelihood, and almost all adaptation techniques available in the ASR literature. At present, it also supports the more recently proposed Subspace Gaussian Mixture Models (SGMMs) (Huang, Hasegawa-Johnson, 2010; Povey et al., 2011), discriminative training, and the very promising DNN hybrid training and decoding (Kaldi, WEB-a; Vesely et al., 2013; Kaldi, WEB-b; Zhang et al., 2014; Povey et al., 2015). Moreover, developers are working on using large language models in the FST framework, and the development of Kaldi is continuing.

Kaldi is considered to be "easy to use" (once you learn the basics, and assuming you understand the underlying science).
• It's "easy to extend and modify."
• It's "redistributable": unrestrictive license, community project.
• If your stuff works or is interesting, the Kaldi team is open to including it and your example scripts in the central repository: more citations, as others build on it.

In particular, even if Kaldi is similar in aims and scope to HTK, and the goal is still to have modern and flexible code, written in C++, that is easy to modify and extend, the important features that represent the main reasons to use Kaldi over other toolkits include:

• Code-level integration with Finite State Transducers (FSTs): compiling against the OpenFst toolkit (using it as a library);
• Extensive linear algebra support: including a matrix library that wraps standard BLAS and LAPACK routines;
• Extensible design: providing, as far as possible, algorithms in the most generic form possible; for instance, decoders are templated on an object that provides a score indexed by a (frame, fst input-symbol) tuple, meaning that the decoder can work from any suitable source of scores, such as a neural net;

• An open license: the code is licensed under Apache 2.0, which is one of the least restrictive licenses available;
• Complete recipes: Kaldi makes available complete recipes for building speech recognition systems that work from widely available databases such as those provided by ELRA or the Linguistic Data Consortium (LDC).
It should be noted that the goal of releasing complete recipes is an important aspect of Kaldi. Since the code is publicly available under a license that permits modifications and re-release, this encourages people to release their code, along with their script directories, in a format similar to Kaldi's own example scripts.
Chapter 2. Kaldi Installation
Machine, OS, and System
For learning purposes, a PC with a GPU will be able to run most of the Kaldi models.

2.1 Machine and OS

We bought a new SSD to install in an old HP Pavilion Gaming PC with a GTX 1070 Ti. We installed Ubuntu 18.04.1 LTS onto the new SSD and configured the machine as a dual-boot system with Ubuntu and Windows.

2.2 Install and Configure Git


We need Git to get Kaldi.
$sudo apt install git-all
Some simple configuration:
$git config --global user.name "Leo Reny"
$git config --global user.email [email protected]
$git config --global alias.co checkout
$git config --global alias.br branch
$git config --global alias.st status

After configuration, run below to check the result:


$git config --list
The output:
leo@Ubuntu:~$ git config --list
user.name=Leo Reny
[email protected]
alias.co=checkout
alias.br=branch
alias.st=status

2.3 Install and Configure Kaldi


We clone Kaldi from GitHub into the local folder ./lKaldi and name the remote "golden". Cloning takes about two minutes.
$git clone https://github.com/Kaldi-asr/Kaldi.git lKaldi --origin golden

Once the cloning is finished, go inside the Kaldi folder. There, the INSTALL file explains the installation steps. Cat the file to see what's included.

Kaldi deals with many file formats, among which:
.sh for Bash scripts
.py for Python scripts
.cc for C++ code
.h for header files, containing variables, functions, etc. used by various C++ files
.pl for Perl scripts, useful for processing text files
To check the installation requirements, we run:

$cat INSTALL

Now, go into tools to check the dependencies with:
$extras/check_dependencies.sh
Follow the instructions to install the apps it reports as missing. To compile Kaldi, we need to install some packages:
$sudo apt install make
$sudo apt-get install g++ automake autoconf sox gfortran libtool subversion python2.7

$sudo apt-get install zlib1g-dev


$~/lKaldi/tools$ extras/install_irstlm.sh
$~/lKaldi/tools$ source env.sh
$~/lKaldi/tools$ sudo ./extras/install_mkl.sh

Rerun:
$extras/check_dependencies.sh
We should get "OK", which means the dependencies are satisfied.
Now we compile Kaldi. It might take hours.
$cd lKaldi/tools/; make; cd ../src; ./configure; make
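If the machine has several CPU cores, the same build can be run in parallel to save time (a sketch; replace 4 with your core count):
$cd lKaldi/tools/; make -j 4
$cd ../src; ./configure; make -j 4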

Following the instructions about anything that is missing, install all the required components. After quite a long time compiling, we finally have Kaldi ready.
Let's check the directory structure:

The most important directories are:

• egs, which stands for "examples"
• tools, which contains Kaldi dependencies and setup instructions
• src, which contains the source code

The samples are stored in ./egs, which we will use frequently in this book. Let's check tools and src here:
WFSTs are popular for modeling ASR transducers. One popular open-source WFST toolkit is OpenFst, and it is heavily used by Kaldi. To use a WFST in OpenFst, we need to define the input and output symbols and the FST definition. The symbol file contains the vocabulary and maps words to the unique IDs used by OpenFst.
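As a small illustration (hypothetical file names, not part of the TIMIT recipe): a symbol file is a plain-text mapping from labels to integer IDs, with 0 conventionally reserved for the empty label <eps>,

<eps> 0
YES 1
NO 2

and fstcompile turns a text FST description into OpenFst's binary form:
$fstcompile --isymbols=words.txt --osymbols=words.txt text_fst.txt binary.fst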

OpenFst is a library for constructing, combining, optimizing,


and searching weighted finite-state transducers (FSTs). FSTs
have key applications in speech recognition and synthesis.

Below are the src details:

With this information in hand, we are comfortable with the Kaldi installation. Let's do a test run in the next section, Yes or No.

2.4 Yes or No

It's not a philosophical debate; it's a sample dataset provided by Kaldi. Let's check the folder.
We can't see any wav files. That's because Kaldi doesn't download them by default. No big deal: Kaldi will pull them when the recipe runs.
$cd ~/lKaldi/egs/yesno/s5
$sudo ./run.sh
[sudo] password for sv:
--2019-01-03 09:01:50-- http://www.openslr.org/resources/1/waves_yesno.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://101.96.8.160/www.openslr.org/resources/1/waves_yesno.tar.gz [following]
--2019-01-03 09:01:51-- http://101.96.8.160/www.openslr.org/resources/1/waves_yesno.tar.gz
Connecting to 101.96.8.160:80... connected. HTTP request sent, awaiting response... 200 OK
Length: 4703754 (4.5M) [application/x-gzip]
Saving to: ‘waves_yesno.tar.gz’

waves_yesno.tar.gz 100%[===================>] 4.49M 313KB/s in 15s


waves_yesno/1_1_0_1_1_0_0_1.wav waves_yesno/0_1_1_1_0_1_0_1.wav
waves_yesno/0_1_1_1_0_0_0_0.wav waves_yesno/README~
waves_yesno/0_1_0_0_0_1_0_0.wav waves_yesno/1_0_0_0_0_0_0_1.wav
waves_yesno/1_1_0_1_1_0_1_1.wav waves_yesno/1_1_0_0_0_0_0_1.wav
waves_yesno/1_0_0_0_0_0_0_0.wav waves_yesno/0_1_1_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_0_0.wav waves_yesno/1_1_1_0_0_0_0_1.wav
waves_yesno/1_0_1_0_1_0_0_1.wav waves_yesno/0_1_0_0_1_0_1_1.wav
waves_yesno/0_0_1_1_1_1_1_0.wav waves_yesno/1_1_0_0_0_1_1_1.wav
waves_yesno/0_1_1_1_0_0_1_0.wav waves_yesno/1_1_0_1_0_1_0_0.wav
waves_yesno/1_1_1_1_1_1_1_1.wav waves_yesno/0_0_1_0_1_0_0_1.wav
waves_yesno/1_1_1_1_0_0_1_0.wav waves_yesno/0_0_1_1_1_0_0_1.wav
waves_yesno/0_1_0_1_0_0_0_0.wav waves_yesno/1_1_1_1_1_0_0_0.wav
waves_yesno/README
waves_yesno/0_1_1_0_0_1_1_1.wav waves_yesno/0_0_1_0_0_1_1_0.wav
waves_yesno/1_1_0_0_1_0_1_1.wav waves_yesno/1_1_1_0_0_1_0_1.wav
waves_yesno/0_0_1_0_0_1_1_1.wav waves_yesno/0_0_1_1_0_0_0_1.wav
waves_yesno/1_0_1_1_0_1_1_1.wav waves_yesno/1_1_1_0_1_0_1_0.wav
waves_yesno/1_1_1_0_1_0_1_1.wav waves_yesno/0_1_1_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_1_1_0.wav
waves_yesno/1_1_1_1_1_1_0_0.wav
waves_yesno/0_0_0_0_1_1_1_1.wav
Preparing train and test data
Dictionary preparation succeeded

2019-01-03 09:02:07 (304 KB/s) - ‘waves_yesno.tar.gz’ saved [4703754/4703754]

waves_yesno/
waves_yesno/1_0_0_0_0_0_1_1.wav
waves_yesno/1_1_0_0_1_0_1_0.wav
waves_yesno/1_0_1_1_1_1_0_1.wav
waves_yesno/1_1_1_1_0_1_0_0.wav
waves_yesno/0_0_1_1_1_0_0_0.wav
waves_yesno/0_1_1_1_1_1_1_1.wav
waves_yesno/0_1_0_1_1_1_0_0.wav
waves_yesno/1_0_1_1_1_0_1_0.wav
waves_yesno/1_0_0_1_0_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_0.wav
waves_yesno/0_1_0_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_1_0.wav
waves_yesno/1_0_0_0_1_0_0_1.wav
waves_yesno/1_1_0_1_1_1_1_0.wav
waves_yesno/0_0_1_1_1_1_0_0.wav
waves_yesno/1_1_0_0_1_1_1_0.wav
waves_yesno/0_0_1_1_0_1_1_1.wav
waves_yesno/1_1_0_1_0_1_1_0.wav
waves_yesno/0_1_0_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_0_0_1.wav

utils/prepare_lang.sh --position-dependent-phones false data/local/dict <SIL> data/local/lang data/lang

Checking data/local/dict/silence_phones.txt ...


--> reading data/local/dict/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/silence_phones.txt is OK

Checking data/local/dict/optional_silence.txt ...


--> reading data/local/dict/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/optional_silence.txt is OK

Checking data/local/dict/nonsilence_phones.txt ...


--> reading data/local/dict/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt


--> disjoint property is OK.

Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexicon.txt is OK

Checking data/local/dict/extra_questions.txt ...


--> data/local/dict/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory data/local/dict]

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
fstaddselfloops data/lang/phones/wdisambig_phones.int data/lang/phones/wdisambig_words.int

prepare_lang.sh: validating output directory


utils/validate_lang.pl data/lang
Checking data/lang/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/phones.txt is OK

Checking words.txt: #0 ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...

--> silence.txt and nonsilence.txt are disjoint


--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...


--> found no unexplainable phones in phones.txt

Checking data/lang/phones/context_indep.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/context_indep.txt

--> data/lang/phones/context_indep.int corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.csl corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.{txt, int, csl} are OK

Checking data/lang/phones/nonsilence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 2 entry/entries in data/lang/phones/nonsilence.txt

--> data/lang/phones/nonsilence.int corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.csl corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang/phones/silence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/silence.txt

--> data/lang/phones/silence.int corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.csl corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.{txt, int, csl} are OK

Checking data/lang/phones/optional_silence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/optional_silence.txt

--> data/lang/phones/optional_silence.int corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.csl corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.{txt, int, csl} are OK
--> data/lang/phones/disambig.csl corresponds to data/lang/phones/disambig.txt

Checking data/lang/phones/disambig.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 2 entry/entries in data/lang/phones/disambig.txt

--> data/lang/phones/disambig.int corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.{txt, int, csl} are OK

Checking data/lang/phones/roots.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 3 entry/entries in data/lang/phones/roots.txt

--> data/lang/phones/roots.int corresponds to data/lang/phones/roots.txt
--> data/lang/phones/roots.{txt, int} are OK

Checking data/lang/phones/sets.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 3 entry/entries in data/lang/phones/sets.txt

--> data/lang/phones/sets.int corresponds to data/lang/phones/sets.txt
--> data/lang/phones/sets.{txt, int} are OK

Checking data/lang/phones/extra_questions.{txt, int} ... Checking optional_silence.txt ...


--> reading data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.txt is OK
Checking disambiguation symbols: #0 and #1
--> data/lang/phones/disambig.txt has “#0” and “#1”
--> data/lang/phones/disambig.txt is OK

Checking topo ...


Checking word-level disambiguation symbols...
--> data/lang/phones/wdisambig.txt exists (newer prepare_lang.sh)

Checking data/lang/oov.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces

--> 1 entry/entries in data/lang/oov.txt


--> data/lang/oov.int corresponds to data/lang/oov.txt
--> data/lang/oov.{txt, int} are OK

--> data/lang/L.fst is olabel sorted


--> data/lang/L_disambig.fst is olabel sorted
--> SUCCESS [validating lang directory data/lang]

Preparing language models for test


arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_test_tg/words.txt input/task.arpabo data/lang_test_tg/G.fst
LOG (arpa2fst[5.5.164~1-9698]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.164~1-9698]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.164~1-9698]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 1 to 1

fstisstochastic data/lang_test_tg/G.fst
1.20397 1.20397
Succeeded in formatting data.

steps/make_mfcc.sh --nj 1 data/train_yesno exp/make_mfcc/train_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word ‘bold’ in http://Kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating MFCC features for train_yesno
steps/compute_cmvn_stats.sh data/train_yesno exp/make_mfcc/train_yesno mfcc
Succeeded creating CMVN stats for train_yesno
fix_data_dir.sh: kept all 31 utterances.
fix_data_dir.sh: old files are kept in data/train_yesno/.backup

steps/make_mfcc.sh --nj 1 data/test_yesno exp/make_mfcc/test_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word ‘bold’ in http://Kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
It seems not all of the feature files were successfully processed (29 != 31);
consider using utils/fix_data_dir.sh data/test_yesno
Less than 95% the features were successfully generated. Probably a serious error.
steps/compute_cmvn_stats.sh data/test_yesno exp/make_mfcc/test_yesno mfcc
Succeeded creating CMVN stats for test_yesno
fix_data_dir.sh: kept 29 utterances out of 31
fix_data_dir.sh: old files are kept in data/test_yesno/.backup

steps/train_mono.sh --nj 1 --cmd utils/run.pl --totgauss 400 data/train_yesno data/lang exp/mono0a

steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 2
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 3
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 4
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 5
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 6
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 7
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 8
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 10
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 11
steps/train_mono.sh: Pass 12
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 13
steps/train_mono.sh: Pass 14
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 15
steps/train_mono.sh: Pass 16
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 17
steps/train_mono.sh: Pass 18
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 19
steps/train_mono.sh: Pass 20
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 21
steps/train_mono.sh: Pass 22
steps/train_mono.sh: Pass 23
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 24
steps/train_mono.sh: Pass 25
steps/train_mono.sh: Pass 26
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 27
steps/train_mono.sh: Pass 28
steps/train_mono.sh: Pass 29
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 30
steps/train_mono.sh: Pass 31
steps/train_mono.sh: Pass 32
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 33
steps/train_mono.sh: Pass 35
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36
steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39

steps/diagnostic/analyze_alignments.sh --cmd utils/run.pl data/lang exp/mono0a
steps/diagnostic/analyze_alignments.sh: see stats in exp/mono0a/log/analyze_alignments.log
1 warnings in exp/mono0a/log/update.*.log
exp/mono0a: nj=1 align prob=-81.88 over 0.05h [retry=0.0%, fail=0.0%] states=11 gauss=371
steps/train_mono.sh: Done training monophone system in exp/mono0a

tree-info exp/mono0a/tree
tree-info exp/mono0a/tree

fsttablecompose data/lang_test_tg/L_disambig.fst data/lang_test_tg/G.fst
fstdeterminizestar --use-log=true
fstminimizeencoded
fstpushspecial
fstisstochastic data/lang_test_tg/tmp/LG.fst
0.534295 0.533859
[info]: LG not stochastic.

fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test_tg/phones/disambig.int --write-disambig-syms=data/lang_test_tg/tmp/disambig_ilabels_1_0.int data/lang_test_tg/tmp/ilabels_1_0.24717 data/lang_test_tg/tmp/LG.fst

fstisstochastic data/lang_test_tg/tmp/CLG_1_0.fst
0.534295 0.533859
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono0a/graph_tgpr/disambig_tid.int --transition-scale=1.0 data/lang_test_tg/tmp/ilabels_1_0 exp/mono0a/tree exp/mono0a/final.mdl

fsttablecompose exp/mono0a/graph_tgpr/Ha.fst data/lang_test_tg/tmp/CLG_1_0.fst

fstdeterminizestar --use-log=true
fstminimizeencoded
fstrmsymbols exp/mono0a/graph_tgpr/disambig_tid.int fstrmepslocal
fstisstochastic exp/mono0a/graph_tgpr/HCLGa.fst
0.5342 -0.000422432
HCLGa is not stochastic

add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono0a/final.mdl exp/mono0a/graph_tgpr/HCLGa.fst
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd utils/run.pl exp/mono0a/graph_tgpr exp/mono0a/decode_test_yesno
steps/diagnostic/analyze_lats.sh: see stats in exp/mono0a/decode_test_yesno/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,1,2) and mean=1.2
steps/diagnostic/analyze_lats.sh: see stats in exp/mono0a/decode_test_yesno/log/analyze_lattice_depth_stats.log
local/score.sh --cmd utils/run.pl data/test_yesno exp/mono0a/graph_tgpr exp/mono0a/decode_test_yesno
local/score.sh: scoring with word insertion penalty=0.0,0.5,1.0
%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] exp/mono0a/decode_test_yesno/wer_10_0.0

The WER (word error rate) is 0: perfect. WER is widely accepted as the de facto metric for ASR; the lower, the better.
Word Error Rate = (Substitutions + Insertions + Deletions) / Number of Words Spoken
Substitutions occur any time a word gets replaced (for example, "twinkle" is transcribed as "crinkle").
Insertions occur any time a word gets added that wasn't said (for example, "trailblazers" becomes "tray all blazers").
Deletions occur any time a word is omitted from the transcript (for example, "get it done" becomes "get done").
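As a quick worked example with made-up numbers: if a 10-word reference sentence is recognized with one substitution, one insertion, and one deletion, then

$$\text{WER} = \frac{S + I + D}{N} = \frac{1 + 1 + 1}{10} = 30\%$$

In the yes/no log above, the line "%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ]" reports the same quantity: zero errors over 232 reference words.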
By now, we can be sure that Kaldi has been successfully installed. Let's check the folder again: clearly, data, exp, mfcc, and waves_yesno have been added.
Chapter 3. Setup TIMIT
TIMIT Structure

TIMIT is often used as a benchmark for performance, although results on TIMIT are not always transferable to other corpora. TIMIT is available as LDC corpus LDC93S1 and is one of the original clean speech databases (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1).

"The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance."

"The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation." The zipped TIMIT is about 442.21 MB and can be found for free at https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3.

3.1 About TIMIT

TIMIT (the TIMIT Acoustic-Phonetic Continuous Speech Corpus) resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO).
Text corpus design was a joint effort among the
Massachusetts Institute of Technology (MIT), Stanford
Research Institute (SRI), and Texas Instruments (TI). The
speech was recorded at TI, transcribed at MIT, and has been
maintained, verified, and prepared for CD-ROM production by
the National Institute of Standards and Technology (NIST).
Additional information including the referenced material and
some relevant reprints of articles may be found in the printed
documentation which is also available from NTIS (NTIS#
PB91-100354).

TIMIT contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The table below shows the number of speakers for the 8 dialect regions.
The text material in the TIMIT prompts (found in the file
“prompts.doc”) consists of 2 dialect “shibboleth” sentences
designed at SRI, 450 phonetically-compact sentences
designed at MIT, and 1890 phonetically-diverse sentences
selected at TI. The dialect sentences (the SA sentences)
were meant to expose the dialectal variants of the speakers
and were read by all 630 speakers. The phonetically-compact
sentences were designed to provide a good coverage of pairs
of phones, with extra occurrences of phonetic contexts
thought to be either difficult or of particular interest. Each
speaker read 5 of these sentences (the SX sentences) and
each text was spoken by 7 different speakers. The
phonetically-diverse sentences (the SI sentences) were
selected from existing text sources - the Brown Corpus
(Kuchera and Francis, 1967) and the Playwrights Dialog
(Hultzen, et al., 1964) - so as to add diversity in sentence
types and phonetic contexts. The selection criteria maximized
the variety of allophonic contexts found in the texts. Each
speaker read 3 of these sentences, with each sentence being read by only a single speaker. The table below summarizes the speech material in TIMIT.

The speech material has been subdivided into portions for training and testing. The criteria for the subdivision are described in the file "testset.doc".

3.2 TIMIT Directory Structure and Files


Kaldi will install TIMIT directories and scripts. Only thing we
need do is to download TIMIT database. Below screenshot
shows the folder structure of once Kaldi has been
successfully installed.

3.2.1 Top Level Directories and Files

~/lKaldi/egs stands for "examples" and contains example training recipes for some speech corpora, such as TIMIT. Under each of these folders are different versions (s3, s4, s5, etc.). The highest number, usually s5, is the most current version and should be used for any new development or training.
Check the folder structure of ./egs/timit/s5: all the folders are there except ./data.
Download the TIMIT data (around 400 MB to 600 MB) from http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1, a school library, or other sources, extract it to a local folder, rename the root folder to data, and put it into ./egs/timit/s5/data as below.
We now have the complete TIMIT in Kaldi.
Once the installation is complete, read ./data/README.DOC first, and find more information in ./data/DOC.

The main folders and files we are going to use are:
cmd.sh contains all the parameters.
conf contains configurations for certain scripts.
data contains any data directories, such as the train and test directories for TIMIT.
local typically contains files that relate only to the corpus we're working on (e.g., TIMIT).
path.sh contains the path to the Kaldi source directory.
steps contains scripts for creating an ASR system.
utils contains scripts to modify Kaldi files in certain ways, for example to subset data directories into smaller pieces.
exp, which will be created later, contains the actual experiments and models, as well as logs.

The TIMIT recipe is embodied in the run.sh script, with supporting files in ./local. This script calls high-level scripts in ./steps and ./utils, which in turn call binaries in Kaldi that perform the actual computation. steps and utils are two folders linked to Kaldi.
3.2.2 TIMIT Corpus

The TIMIT corpus is located at ./Data. Below is the structure.


The folder structure is exactly same for./TESTand ./TRAIN. Below
them are 8 folders, DR1-DR8, representing 8 dialects.

Each dialect folder contains a number of folders, one per speaker (F means female, M means male). Let's look at one of them:

cd train/DR1/FCJF0

Inside, we find 10 utterances. Each has four files: .phn, .txt, .wav, and .wrd. Let's try to open them and see what's inside.

The .wav is the audio file recorded in a sound format (it is actually a little different, but let's assume it is for now). The others are corresponding metadata such as the lexicon, XML metadata, and language model training text in txt format. The metadata and text files help Kaldi understand the WAV and turn the speech into feature vectors and then into characters, words, and language.
3.2.3 TIMIT Recipes

We have the audio files (.wav). We have the metadata and training text files (.phn, .wrd, and .txt). The next step is to put them together using a recipe.

The recipe usually starts with run.sh in the root directory. Links to Kaldi have been stored in the utils/ and steps/ directories.

Edit the run.sh script to set paths to various tools and to specify the location of the directories containing the WAV files, XML files, language model training text, and the Combilex lexicon.

Each Kaldi recipe consists of multiple stages, which simply means: run the commands in a block if stage is less than or equal to the block's number x.

# Set which stage this script is on. You can set it to a stage number that has already been executed to avoid running the same commands repeatedly.
stage=0

Stage is set to 0 by default, which means the recipe will run all blocks. If we encounter an error, we can check which stages passed successfully and re-run the recipe with ./run.sh --stage x to resume from that stage, as in the sketch below.
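The stage mechanism itself is just a few shell comparisons. Below is a simplified sketch of the pattern (not the literal TIMIT run.sh, but the same idea; utils/parse_options.sh is the standard helper that lets command-line flags such as --stage override the variables):

stage=0
. utils/parse_options.sh  # allow ./run.sh --stage N to override the value above

if [ $stage -le 0 ]; then
  echo "Stage 0: data preparation"
  local/timit_data_prep.sh $timit || exit 1
fi

if [ $stage -le 1 ]; then
  echo "Stage 1: MFCC feature extraction"
  steps/make_mfcc.sh --nj 4 data/train exp/make_mfcc/train mfcc || exit 1
fi

Running ./run.sh --stage 1 skips the data preparation block and resumes at feature extraction.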
Chapter 4. Kaldi Data Preparation
from beginning to FST

The speech data is the core data for speech recognition. The lexicon data is like a dictionary: it includes a word dictionary (word-pronunciation pairs) and a phoneme dictionary (the types of phonemes). The language data is the language model (mostly the language model is trained separately from the text of the speech data). Usually, a dataset for ASR is accompanied by a language model.
For the TIMIT dataset specifically, the three kinds of data are integrated in the dataset; we can easily get them from the scripts in local.

Before working with the code, read the TIMIT description (some docs are in the dataset; the data collection process is frustrating, and all such datasets are built with great effort).

The starting point of TIMIT training is run.sh, located at the root of the TIMIT recipe. We will run run.sh step by step up to the DNN.
Step 1: Setup Environment (line 1-32)

The first 14 lines are comments. Line 15 sources cmd.sh.

TIMIT uses cmd.sh to set up the environment variables. The "dot space dot" form runs cmd.sh inside the current shell, so the variables it exports stay set when it returns; it is equivalent to sourcing the file. If we instead run it as a plain command, cmd.sh executes in a child shell, and when that shell exits its environment variables are not kept, which means we did nothing.
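The difference is easy to demonstrate with the two ways of invoking the script (train_cmd is one of the variables exported by cmd.sh):

$. ./cmd.sh     # "dot space dot": cmd.sh runs in the current shell, so its exported variables stay set afterwards
$./cmd.sh       # runs cmd.sh in a child shell; anything it exports disappears when that shell exits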

The cmd.sh reads:

We are running TIMIT on our local machine, so we need to change queue.pl to run.pl (queue.pl is for Oracle GridEngine, which is out of our scope).
export train_cmd="run.pl --mem 4G"
export decode_cmd="run.pl --mem 4G"
export cuda_cmd="run.pl --gpu 1"

The final cmd.sh should look like this:

Lines 16-17 of run.sh are:
[ -f path.sh ] && . ./path.sh
set -e

The binaries and executable files of Kaldi are stored in different folders, following a certain logic, and path.sh links them to the TIMIT recipe. Kaldi's path definitions are fairly involved, which is why path.sh is sourced at the beginning of all Kaldi scripts.
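A typical egs-style path.sh looks roughly like the sketch below (based on the common Kaldi recipe pattern; the exact contents of the TIMIT version may differ slightly):

export KALDI_ROOT=`pwd`/../../..
[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo "Missing $KALDI_ROOT/tools/config/common_path.sh" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C

It defines KALDI_ROOT relative to the recipe directory and puts the Kaldi and OpenFst binaries on the PATH, which is what makes a command like feat-to-dim (used below) callable from anywhere inside the recipe.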

set -e means stop on any error. Just run any Kaldi binary to check whether path.sh has been set correctly:
feat-to-dim
If the output looks like below, it's good. Even though we may not understand what it is showing, we know that the paths have all been set.
Keep the acoustic model parameters (lines 18-32) as they are.

Step 2: Prepare Data (line 33-50)

Line 39 points to the TIMIT data location. Change it to the path on our own computer:
$ timit=/home/leo/lKaldi/egs/timit/s5/data #data path

Keep running timit_data_prep.sh (line 41), we will get:


$ local/timit_data_prep.sh $timit || exit 1 #prepare data
As we could see, timit_data_prep.shcreates localfolder under datato
check and integrate the audio wav files and metadata of the
wav files, make it ready for further processing. The
processing result will be stored to ./data/local/data.

After data preparation, each data set (train, test, dev) produces _sph.flist, _sph.scp, .uttids, .trans, .text, _wav.scp, .utt2spk, .spk2utt, .spk2gender, .stm, and .glm files.

The TIMIT .wav files are actually in NIST SPHERE (.sph) format. To really listen to a waveform, it has to be converted with tools/sph2pipe_v2.5/sph2pipe (the same tool that timit_data_prep.sh relies on).
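For example, to convert a single utterance by hand (the input path below is just an illustration of the TIMIT layout):

$ tools/sph2pipe_v2.5/sph2pipe -f wav TIMIT/TRAIN/DR1/FCJF0/SA1.WAV sa1.wav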
These files are linked together by the utterance id (utt_id) and the speaker id (spk_id).

Kaldi provides a validation script to check the data preparation; run it on each set:

$ utils/validate_data_dir.sh data/train
$ utils/validate_data_dir.sh data/test

If everything is fine, each run ends with a successful-validation message.

A common mistake is a missing utt2spk file. If so, it can be regenerated from the spk2utt file with utils/spk2utt_to_utt2spk.pl.
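For instance, assuming data/train/spk2utt exists and is correct:

$ utils/spk2utt_to_utt2spk.pl data/train/spk2utt > data/train/utt2spk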
We can count the number of sentences in a set with the wc command:
wc -l < data/train/utt2spk
Yes, we have 3696 sentences here.

Here is the new folder structure (level 1): the main additions are dev, test, train, lang, lang_test_bg, and local. Please note that TRAIN and TEST come with the original TIMIT installation CD/files.
The level-2 file structure of data is as below.

Step 3: Dictionary

The dictionary maps words to phones. (In general, such a mapping can also be produced by machine learning and more complicated functions.)

Line 43, local/timit_prepare_dict.sh, creates the dictionary at ./data/local/dict. Internally, timit_prepare_dict.sh invokes a few IRSTLM scripts:

irstlm/bin/build-lm.sh creates the phone language model lm_tmp/lm_phone_bg.ilm.gz.
irstlm/bin/compile-lm processes lm_train.text to create nist_lm/lm_phone_bg.arpa.gz.
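We can also sanity-check the resulting dictionary directory by hand (prepare_lang.sh runs the same kind of check later):

$ utils/validate_dict_dir.pl data/local/dict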

Step 4: Create FST Script

Lines 47 & 48 create the FSTs.

An FST is defined in a plain text file. Each line in the file describes one arc, except the last line, which marks a final state. The first two columns give the state the arc transitions from and the state it transitions to; the third column is the input label and the fourth column is the output label. If the output is a dash, it is empty. A toy example follows.
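Here is a toy lexicon-style FST in that text format (made-up labels, purely for illustration):

0 1 hh hello
1 2 ah -
2 3 l -
3 4 ow -
4

State 0 is the start state, and the lone 4 on the last line marks state 4 as a final state. Walking the arcs consumes the input phones hh ah l ow and emits the single output word hello.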

Kaldi lets users define and categorize different types of phones, including "real phones" and silence phones. All this category information helps Kaldi build HMM topologies and decision trees.

prepare_lang.sh transforms the silence and non-silence phone files into the form Kaldi needs. The file context_indep.txt contains all the phones which are not "real phones", i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU). The file may contain many variants of these phones depending on word position; for example, SIL_B is the silence phone occurring at the beginning of a word.
To create the FSTs, we run utils/prepare_lang.sh followed by local/timit_format_data.sh:
$ utils/prepare_lang.sh --sil-prob 0.0 --position-dependent-phones false --num-sil-states 3 \
  data/local/dict "sil" data/local/lang_tmp data/lang

$ local/timit_format_data.sh

The first script uses the files in ./data/local/dict to create data/lang, which contains the language information:

L.fst: the lexicon (dictionary) compiled into an FST
L_disambig.fst: the lexicon FST with disambiguation symbols added to eliminate ambiguity
phones.txt: phones and their integer ids
words.txt: words and their integer ids
oov.txt / oov.int: the unknown (out-of-vocabulary) word
topo: the HMM topology file, including transition probabilities, transition paths, ids, etc.
phones/: lists of the phones grouped by category (silence, nonsilence, sets, roots, ...)

The input files are lexicon.txt, lexiconp.txt, silence_phones.txt, nonsilence_phones.txt, optional_silence.txt, and extra_questions.txt; utils/make_lexicon_fst.pl converts the lexicon into FST format.

Two files are especially important: words.txt and phones.txt. Both are OpenFst symbol tables mapping symbols to integers (words and phones respectively), and they look very similar.
The *.csl files hold colon-separated lists of the silence, non-silence, and other phone ids. We can use fstprint to check the content of L.fst:
fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head
The result is:
0 0 aa aa
0 0 ae ae
0 0 ah ah
0 0 ao ao
0 0 aw aw
0 0 ax ax
0 0 ay ay
0 0 b b

The topo file defines the three HMM states of each phone; each <Transition> entry gives a transition probability.

The second script, local/timit_format_data.sh, sets up the data format for testing: it creates the data/lang_test_bg directory, copies the files from data/lang into it, and builds G.fst from data/local/nist_lm/lm_phone_bg.arpa.gz.

The Result

Below is the final result of data preparation:


sv@HP:~/lKaldi/egs/timit/s5$ ./run.sh
===============================================================

Data & Lexicon & Language Preparation


===============================================================

wav-to-duration --read-entire-file=true
scp:train_wav.scp ark,t:train_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for
3696 audio files.

LOG (wav-to-duration[5.5.164~1-9698]:main():wav
to-duration.cc:94) Mean duration was 3.06336, min
and max durations were 0.91525, 7.78881

wav-to-duration --read-entire-file=true
scp:dev_wav.scp ark,t:dev_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for
400 audio files.

LOG (wav-to-duration[5.5.164~1-9698]:main():wav
to-duration.cc:94) Mean duration was 3.08212, min
and max durations were 1.09444, 7.43681

wav-to-duration --read-entire-file=true
scp:test_wav.scp ark,t:test_dur.ark
LOG (wav-to-duration[5.5.164~1-9698]:main():wav-to-duration.cc:92) Printed duration for
192 audio files.

LOG (wav-to-duration[5.5.164~1-9698]:main():wav
to-duration.cc:94) Mean duration was 3.03646, min
and max durations were 1.30562, 6.21444

Data preparation succeeded


LOGFILE:/dev/null

$bin/ngt -i=”$inpfile” -n=$order -gooout=y -o=”$gzip -c > $tmpdir/ngram.${sdict}.gz” -


fd=”$tmpdir/$sdict” $dictionary $additional_parameters >> $logfile 2>&1

$bin/ngt -i=”$inpfile” -n=$order -gooout=y -o=”$gzip -c > $tmpdir/ngram.${sdict}.gz” -


fd=”$tmpdir/$sdict” $dictionary $additional_parameters >> $logfile 2>&1

$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing


“$additional_smoothing_parameters” --size $order
--ngrams “$gunzip -c $tmpdir/ngram.${sdict}.gz” -sublm $tmpdir/lm.$sdict
$additional_parameters >> $logfile 2>&1

inpfile: data/local/lm_tmp/lm_phone_bg.ilm.gz
outfile: /dev/stdout
loading up to the LM level 1000 (if any)
dub: 10000000
OOV code is 50
OOV code is 50
Saving in txt format to /dev/stdout
Dictionary & language model preparation succeeded

utils/prepare_lang.sh --sil-prob 0.0 --position-dependent-phones false --num-sil-states 3


data/local/dict sil data/local/lang_tmp data/lang

Checking data/local/dict/silence_phones.txt ...


--> reading data/local/dict/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/silence_phones.txt is OK

Checking data/local/dict/optional_silence.txt ...


--> reading data/local/dict/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/optional_silence.txt is OK

Checking data/local/dict/nonsilence_phones.txt ...


--> reading data/local/dict/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/nonsilence_phones.txt is OK
Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexicon.txt is OK

Checking data/local/dict/lexiconp.txt
--> reading data/local/dict/lexiconp.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/lexiconp.txt is OK

Checking lexicon pair data/local/dict/lexicon.txt and data/local/dict/lexiconp.txt


--> lexicon pair data/local/dict/lexicon. txt and data/local/dict/lexiconp.txt match

Checking data/local/dict/extra_questions.txt ...


--> reading data/local/dict/extra_questions.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/local/dict/extra_questions.txt is OK
--> SUCCESS [validating dictionary directory data/local/dict]

fstaddselfloops data/lang/phones/wdisambig_phones.int data/lang/phones/wdisambig_words.int

prepare_lang.sh: validating output directory


utils/validate_lang.pl data/lang
Checking data/lang/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/phones.txt is OK

Checking words.txt: #0 ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/words.txt is OK

Checking disjoint: silence.txt, nonsi


lence.txt, disambig.txt ...

--> silence.txt and nonsilence.txt are disjoint


--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...


--> found no unexplainable phones in phones.txt
Checking data/lang/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/context_indep.txt

--> data/lang/phones/context_indep.int corre


sponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.csl corre
sponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.{txt, int, csl} are OK

Checking data/lang/phones/nonsilence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 47 entry/entries in data/lang/phones/nonsilence.txt

--> data/lang/phones/nonsilence.int corre


sponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.csl corre
sponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang/phones/silence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/silence.txt

--> data/lang/phones/silence.int corre


sponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.csl corre
sponds to data/lang/phones/silence.txt

--> data/lang/phones/silence.{txt, int, csl} are OK

Checking data/lang/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.int corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.csl corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang/phones/disambig.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 2 entry/entries in data/lang/phones/disambig.txt
--> data/lang/phones/disambig.int corre
sponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.csl corre
sponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.{txt, int, csl} are OK

Checking data/lang/phones/roots.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 48 entry/entries in data/lang/phones/roots.txt

--> data/lang/phones/roots.int corre


sponds to data/lang/phones/roots.txt
--> data/lang/phones/roots.{txt, int} are OK

Checking data/lang/phones/sets.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 48 entry/entries in data/lang/phones/sets.txt

--> data/lang/phones/sets.int corre


sponds to data/lang/phones/sets.txt

--> data/lang/phones/sets.{txt, int} are OK

Checking data/lang/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 2 entry/entries in data/lang/phones/extra_questions.txt
--> data/lang/phones/extra_questions.int corresponds to data/lang/phones/extra_questions.txt
--> data/lang/phones/extra_questions.{txt, int} are OK

Checking optional_silence.txt ...


--> reading data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1


--> data/lang/phones/disambig.txt has “#0” and “#1”
--> data/lang/phones/disambig.txt is OK

Checking topo ...


Checking word-level disambiguation symbols...
--> data/lang/phones/wdisambig.txt ex
ists (newer prepare_lang.sh)

Checking data/lang/oov.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/oov.txt
--> data/lang/oov.int corresponds to data/lang/oov.txt
--> data/lang/oov.{txt, int} are OK

--> data/lang/L.fst is olabel sorted


--> data/lang/L_disambig.fst is olabel sorted
--> SUCCESS [validating lang directory data/lang]

Preparing train, dev and test data


utils/validate_data_dir.sh: Successfully validated data-directory data/train
utils/validate_data_dir.sh: Successful
ly validated data-directory data/dev
utils/validate_data_dir.sh: Successfully validated data-directory data/test
Preparing language models for test
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/ lang_test_bg/words.txt -
data/lang_test_bg/G.fst LOG (arpa2fst[5.5.164~1-9698]:Read():arpa-fileparser.cc:94)
Reading \data\ section.

LOG (arpa2fst[5.5.164~1-9698]:Read():arpa-fileparser.cc:149) Reading \1-grams: section.


LOG (arpa2fst[5.5.164~1-9698]:Read():arpa-fileparser.cc:149) Reading \2-grams: section.

WARNING (arpa2fst[5.5.164~1-9698]:ConsumeNGram():arpa-lmcompiler.cc:313) line 60


[-3.26717 <s> <s>] skipped: n-gram has invalid BOS/EOS placement

LOG (arpa2fst[5.5.164~1-9698]:RemoveRedundantStates():ar pa-lm-compiler.cc:359)


Reduced num-states from 50 to 50

fstisstochastic data/lang_test_bg/G.fst
0.000510126 -0.0763018
utils/validate_lang.pl data/lang_test_bg
Checking data/lang_test_bg/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_test_bg/phones.txt is OK

Checking words.txt: #0 ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_test_bg/words.txt is OK

Checking disjoint: silence.txt, nonsi


lence.txt, disambig.txt ...

--> silence.txt and nonsilence.txt are disjoint


--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking data/lang_test_bg/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_test_bg/phones/context_indep.txt
--> data/lang_test_bg/phones/context_indep.int corresponds to data/lang_test_bg/phones/context_indep.txt
--> data/lang_test_bg/phones/context_indep.csl corresponds to data/lang_test_bg/phones/context_indep.txt
--> data/lang_test_bg/phones/context_indep.{txt, int, csl} are OK
Checking data/lang_test_bg/phones/non
silence.{txt, int, csl} ...

--> text seems to be UTF-8 or ASCII, checking whitespaces


--> text contains only allowed whitespaces
--> 47 entry/entries in data/lang_test_bg/phones/nonsilence.txt

--> data/lang_test_bg/phones/nonsilence.int corre


sponds to data/lang_test_bg/phones/nonsilence.txt
--> data/lang_test_bg/phones/nonsilence.csl corre
sponds to data/lang_test_bg/phones/nonsilence.txt
--> data/lang_test_bg/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang_test_bg/phones/silence.{txt, int, csl} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_test_bg/phones/silence.txt

--> data/lang_test_bg/phones/silence.int corre


sponds to data/lang_test_bg/phones/silence.txt
--> data/lang_test_bg/phones/silence.csl corre
sponds to data/lang_test_bg/phones/silence.txt

--> data/lang_test_bg/phones/silence.{txt, int, csl} are OK

Checking data/lang_test_bg/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_test_bg/phones/optional_silence.txt
--> data/lang_test_bg/phones/optional_silence.int corresponds to data/lang_test_bg/phones/optional_silence.txt
--> data/lang_test_bg/phones/optional_silence.csl corresponds to data/lang_test_bg/phones/optional_silence.txt
--> data/lang_test_bg/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang_test_bg/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 2 entry/entries in data/lang_test_bg/phones/disambig.txt
--> data/lang_test_bg/phones/disambig.int corresponds to data/lang_test_bg/phones/disambig.txt
--> data/lang_test_bg/phones/disambig.csl corresponds to data/lang_test_bg/phones/disambig.txt
--> data/lang_test_bg/phones/disambig.{txt, int, csl} are OK

Checking data/lang_test_bg/phones/roots.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 48 entry/entries in data/lang_test_bg/phones/roots.txt

--> data/lang_test_bg/phones/roots.int corre


sponds to data/lang_test_bg/phones/roots.txt
--> data/lang_test_bg/phones/roots.{txt, int} are OK

Checking data/lang_test_bg/phones/sets.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 48 entry/entries in data/lang_test_bg/phones/sets.txt

--> data/lang_test_bg/phones/sets.int corre


sponds to data/lang_test_bg/phones/sets.txt
--> data/lang_test_bg/phones/sets.{txt, int} are OK
Checking data/lang_test_bg/phones/ex
tra_questions.{txt, int} ...

--> text seems to be UTF-8 or ASCII, checking whitespaces


--> text contains only allowed whitespaces

--> 2 entry/entries in data/lang_test_bg/


phones/extra_questions.txt
--> data/lang_test_bg/phones/extra_questions.int corresponds to
data/lang_test_bg/phones/extra_questions.txt
--> data/lang_test_bg/phones/extra_questions.{txt, int} are OK

Checking optional_silence.txt ...


--> reading data/lang_test_bg/phones/optional_silence.txt
--> data/lang_test_bg/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1


--> data/lang_test_bg/phones/disambig.txt has “#0” and “#1”
--> data/lang_test_bg/phones/disambig.txt is OK

Checking topo ...


Checking word-level disambiguation symbols...
--> data/lang_test_bg/phones/wdisambig.txt exists (newer prepare_lang.sh)

Checking data/lang_test_bg/oov.{txt, int} ...


--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_test_bg/oov.txt

--> data/lang_test_bg/oov.int corre


sponds to data/lang_test_bg/oov.txt
--> data/lang_test_bg/oov.{txt, int} are OK

--> data/lang_test_bg/L.fst is olabel sorted


--> data/lang_test_bg/L_disambig.fst is olabel sorted
--> data/lang_test_bg/G.fst is ilabel sorted
--> data/lang_test_bg/G.fst has 50 states

fstdeterminizestar data/lang_test_bg/G.fst /dev/null


--> data/lang_test_bg/G.fst is determinizable

--> utils/lang/check_g_properties.pl success


fully validated data/lang_test_bg/G.fst

--> utils/lang/check_g_properties.pl succeeded.


--> Testing determinizability of L_disambig . G fstdeterminizestar

fsttablecompose data/lang_test_bg/L_dis
ambig.fst data/lang_test_bg/G.fst

--> L_disambig . G is determinizable


--> SUCCESS [validating lang directory data/lang_test_bg] Succeeded in formatting data.
Chapter 5. Kaldi Feature Extraction
MFCC

Voice, or speech, is an analog signal, which is hard to reproduce with a handful of parameters. Scientists have built different models to make the reproduction as close as possible, and they found that some features are more useful than others.

One widely used approach is to cut the audio wave into frames, taken roughly every 10 ms. From each frame we extract 39 numbers representing that frame's acoustic characteristics; stacked together, they form the feature vector.

A GMM (Gaussian Mixture Model) with diagonal covariances ignores the correlation among the feature dimensions. The result can be fine for each unit in isolation but sometimes makes no sense as a whole.

MFCC compensates for this: its final DCT step decorrelates the features, so they fit the GMM's assumptions and do a better job in feature extraction. The MFCC pipeline basically consists of windowing the signal, applying the DFT, warping the frequencies onto the Mel scale, taking the log of the magnitudes, and finally applying the DCT.

DNNs/CNNs, in contrast, can exploit correlated features directly. They are therefore usually fed filterbank (fbank) features, which are essentially MFCCs without the final DCT, and this typically reduces WER.

Step 1: Get feats.scp

Kaldi writes the feature vectors out and indexes them in feats.scp. In other words, we keep the elements that represent the speech and ignore the useless parts of the sound (noise, emotion). Lines 60-63 do the job.

steps/make_mfcc.sh stores the MFCC features and indexes them in feats.scp. There are 13 dimensions per frame; deltas and delta-deltas are computed only later, when the MFCC features are used, expanding them to 39 dimensions.
Take data/train as an example: make_mfcc.sh reads data/train/wav.scp, splits it evenly into several parts, and invokes run.pl to extract the features in parallel; each job processes one part of wav.scp. The binaries compute-mfcc-feats and copy-feats create mfcc/raw_mfcc_train.JOB.ark and the corresponding mfcc/raw_mfcc_train.JOB.scp files, where JOB is the index of the part.
The last step combines those mfcc/raw_mfcc_train.JOB.scp files into a single data/train/feats.scp, from which we can then calculate the CMVN (cepstral mean and variance normalization) statistics. A quick hand-run check of the feature dimensions is sketched below.
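As a quick, hand-run sketch (not part of run.sh), we can pipe raw 13-dimensional MFCCs through add-deltas and confirm that the dimension grows to 39:

$ compute-mfcc-feats scp:data/train/wav.scp ark:- | add-deltas ark:- ark:- | feat-to-dim ark:- -

This should print 39.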

Step 2: Get cmvn.scp

We calculate CMVN statistics for each speaker. For each of the 13 dimensions, three groups of statistics are accumulated: the count, the sum, and the sum of squares. The count is the total number of frames; the sum adds up each dimension over all frames; the sum of squares does the same with the squared values. The mean and variance are then computed from these.

There is no option to do CMVN per utterance here. The idea is that if you did it per utterance, it would not make sense to do per-speaker fMLLR on top of that (since you would be doing fMLLR on top of different offsets), so what would be the use of the speaker information? If you do want per-utterance normalization, you should probably make the speaker-ids identical to the utterance-ids; the speaker information does not have to correspond to actual speakers, it is just the level you want to adapt at.
To inspect cmvn.scp or cmvn.ark, we can use copy-matrix.

Each speaker has a matrix with 28 entries: the first 13 are the sums, followed by the frame count, and the last 13 are the sums of squares. The index to these matrices is stored in cmvn.scp.
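For example, to print the per-speaker CMVN statistics in text form:

$ copy-matrix scp:data/train/cmvn.scp ark,t:- | head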

The mfcc folder contains .ark and .scp files. The utterance-to-feature mapping is stored in the .scp file, while the actual features are stored in the .ark file; Kaldi uses the two files together.
Let’s check the files:
$ head raw_mfcc_train.1.scp
faem0_si1392 ./timit/s5/mfcc/raw_mfcc_train.1.ark:13
faem0_si2022 ./timit/s5/mfcc/raw_mfcc_train.1.ark:6313
faem0_si762 ./timit/s5/mfcc/raw_mfcc_train.1.ark:9349
faem0_sx132 ./timit/s5/mfcc/raw_mfcc_train.1.ark:13048
faem0_sx222 ./timit/s5/mfcc/raw_mfcc_train.1.ark:16916
faem0_sx312 ./timit/s5/mfcc/raw_mfcc_train.1.ark:20537
faem0_sx402 ./timit/s5/mfcc/raw_mfcc_train.1.ark:25549
faem0_sx42 ./timit/s5/mfcc/raw_mfcc_train.1.ark:29767
fajw0_si1263 ./timit/s5/mfcc/raw_mfcc_train.1.ark:32804
fajw0_si1893 ./timit/s5/mfcc/raw_mfcc_train.1.ark:39403

Scripts (.scp) and archives (.ark) are organized like a table indexed by a key (such as the utt_id). The .scp file is a text file; each line contains a key and a corresponding file location (with an offset), telling Kaldi where the data lives.

An archive file can be text or binary. Its format is: key (such as utt_id), a space, and then the object itself.
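To look at the actual feature matrices in text form (one matrix per utterance, 13 numbers per row), we can run, for example:

$ copy-feats scp:data/train/feats.scp ark,t:- | head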
We could check MFCC’s dimension like below:

feat-to-dim scp:data/train/feats.scp
feat-to-dim ark:mfcc/raw_mfcc_train.1.ark
The result is 13, as we expected.

MFCC & CMVN Output


===============================================================

MFCC Feature Extration & CMVN for Training and Test set
===============================================================

steps/make_mfcc.sh --cmd run.pl --max-jobs-run 10


--nj 10 data/train exp/make_mfcc/train mfcc
steps/make_mfcc.sh: moving data/train/
feats.scp to data/train/.backup
utils/validate_data_dir.sh: Successfully val
idated data-directory data/train
steps/make_mfcc.sh: [info]: no segments file ex
ists: assuming wav.scp indexed by utterance.

Succeeded creating MFCC features for train


steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc Succeeded creating
CMVN stats for train

steps/make_mfcc.sh --cmd run.pl --max-jobs-run


10 --nj 10 data/dev exp/make_mfcc/dev mfcc
steps/make_mfcc.sh: moving data/dev/
feats.scp to data/dev/.backup
utils/validate_data_dir.sh: Successful
ly validated data-directory data/dev
steps/make_mfcc.sh: [info]: no segments file ex
ists: assuming wav.scp indexed by utterance.

Succeeded creating MFCC features for dev


steps/compute_cmvn_stats.sh data/dev exp/make_mfcc/dev mfcc Succeeded creating
CMVN stats for dev
steps/make_mfcc.sh --cmd run.pl --max-jobs-run 10
--nj 10 data/test exp/make_mfcc/test mfcc
steps/make_mfcc.sh: moving data/test/
feats.scp to data/test/.backup
utils/validate_data_dir.sh: Successfully val
idated data-directory data/test
steps/make_mfcc.sh: [info]: no segments file ex
ists: assuming wav.scp indexed by utterance.

Succeeded creating MFCC features for test


steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc Succeeded creating
CMVN stats for test
Chapter 6. MonoPhone
Viterbi

The feature vector is the last thing we retrieve from the audio wave itself; the rest of the work relies on our modeling and analysis. The first thought is to use phone recognition, or the monophone model, as ASR professionals call it.

The monophone model is the easiest to train, for two reasons: first, English has only around 40 phones; second, "mono" also means it is context independent.

The drawback is also clear: it has little ability to capture variation.

The following classes are used in monophone training:

• DiagGmm: holds the parameters of one diagonal-covariance GMM.
• DiagGmmNormal: the same GMM in its original (non-inverted) parameterization.
• AmDiagGmm: the acoustic model, i.e. the collection of all the GMMs.
• AccumDiagGmm: accumulated statistics for one GMM.
• AccumAmDiagGmm: the collection of AccumDiagGmm accumulators.
• TransitionModel: holds the HMM topology, the log transition probabilities, the transition-ids, the transition-states, and the triples (phone, hmm-state, forward-pdf) mappings.

6.1 Monophone Training Method

Kaldi uses Viterbi training instead of Baum-Welch for monophone GMM training; we still update the same three kinds of GMM parameters (means, variances, and mixture weights) within the EM framework. A Baum-Welch update is time consuming because it requires the forward and backward computations and accumulates over all paths, whereas Viterbi training uses only the single best (Viterbi) path. Since monophones are context independent, Viterbi training works well here.

Viterbi training aligns the extracted features against the model from the previous round to obtain the HMM state of each frame (i.e. its transition-id); this is called forced alignment. The forced alignment gives the state sequence of the feature vectors. For example:

An alignment is a representation of the sequence of HMM states taken by the Viterbi (best-path) alignment of an utterance. In Kaldi, an alignment is synonymous with a sequence of transition-ids. Most of the time an alignment is derived from aligning the reference transcript of an utterance, in which case it is called a forced alignment. Lattices also contain alignment information, as sequences of transition-ids for each word sequence in the lattice. The program show-alignments shows alignments in a human-readable format.

Though Viterbi training yields good performance in most cases, it sometimes leads to suboptimal models, especially when using discrete HMMs to model spontaneous speech. In those cases Baum-Welch is more robust than both Viterbi training and the combined approach, which compensates for its high computational cost. With continuous HMMs, Viterbi training turns out to be as good as Baum-Welch at a much lower cost.

6.2 Training Process

There are three main steps to train the monophone system:
1. Train the mono acoustic model.
2. Make a decoding graph with the current configuration for testing.
3. Decode the development and test sets with that graph.

The three steps correspond to Kaldi's three scripts: train_mono.sh, mkgraph.sh, and decode.sh.

train_nj is set to 30. --nj 30 instructs Kaldi to split the training data into 30 even parts for processing, stored in data/train/split30 and labeled 1 to 30; each part contains its own scp files and other supporting files.

train_mono.sh also maintains a $stage variable so it can resume from an interruption instead of re-training from the beginning; for example, if $stage = 10, training resumes from that pass rather than from pass 0.
6.3 Train Acoustic Model

Line 69 of run.sh calls train_mono.sh.

Line 80, gmm-init-mono, initializes the monophone model: it creates 0.mdl and tree under exp/mono/, calculates the global mean and variance of every feature dimension, reads the topo file, creates the shared-phone list based on $lang/phones/sets.int, and builds ctx_dep (the tree) from that shared-phone list.

To be more specific, Kaldi creates a single-component GMM whose mean and variance are initialized to the global mean and variance. The Gaussian class is DiagGmm. Instead of the raw values it stores the inverse variances and the products of the means with the inverse variances, and it also stores the constants (the <GCONSTS> in the GMM part of the .mdl file).

The loop from lines 87 to 93 runs compile-train-graphs, which compiles the transcript graphs into exp/mono/fsts.JOB.gz (30 JOBs in total) in ark format, keyed by the utt-ids in train.tra. It compiles an FST for each sentence, essentially an HCLG graph restricted to that sentence's transcript.
The loop from lines 95 to 101 aligns the data equally, creating an alignment sequence from the features and the FSTs; the output statistics go to exp/mono/0.JOB.acc.

Line 107, gmm-est, re-estimates the maximum-likelihood parameters of 0.mdl from those accumulated statistics. The new model is stored in exp/mono/1.mdl and the old exp/mono/0.*.acc files are removed. Lines 112-149 iterate this process, updating both the transition model and the GMMs, as sketched below.
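As a rough, conceptual sketch of one such iteration (heavily simplified; the real steps/train_mono.sh adds beams, scales, parallel jobs, and many options):

feats="ark,s,cs:apply-cmvn --utt2spk=ark:data/train/split30/1/utt2spk scp:data/train/split30/1/cmvn.scp scp:data/train/split30/1/feats.scp ark:- | add-deltas ark:- ark:- |"
# re-align the frames to HMM states (transition-ids) with the current model
gmm-align-compiled exp/mono/1.mdl "ark:gunzip -c exp/mono/fsts.1.gz|" "$feats" "ark:|gzip -c > exp/mono/ali.1.gz"
# accumulate GMM and transition statistics from that alignment
gmm-acc-stats-ali exp/mono/1.mdl "$feats" "ark:gunzip -c exp/mono/ali.1.gz|" exp/mono/1.1.acc
# re-estimate the model from the summed statistics
gmm-est exp/mono/1.mdl "gmm-sum-accs - exp/mono/1.*.acc|" exp/mono/2.mdl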

The output of this train_mono.sh will be:

steps/train_mono.sh: Initializing monophone system.


steps/train_mono.sh: Compiling training graphs steps/train_mono.sh:
Aligning data equally (pass 0) steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data

························
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36

steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39
steps/diagnostic/analyze_alignments.sh --cmd run.pl
--max-jobs-run 10 data/lang exp/mono
steps/diagnostic/analyze_alignments.sh: see stats in
exp/mono/log/analyze_alignments.log
2 warnings in exp/mono/log/align.*.*.log exp/mono: nj=4 align
prob=-99.15 over 3.12h [retry=0.0%, fail=0.0%] states=144
gauss=985

A mono folder is created under exp, with the model parameters stored in .mdl files. We can inspect the contents like this:
$ gmm-copy --binary=false exp/mono/0.mdl - | less
The output will be:
<TransitionModel>
<Topology>
<TopologyEntry>

<ForPhones>
2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 </ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0
0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1
0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2
0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>

<TopologyEntry>
<ForPhones>
1
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.5 <Transition> 1 0.5 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>

<State> 3 </State>
</TopologyEntry>
</Topology>

<Triples> 144
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
3 0 6
3 1 7
3 2 8
4 0 9

············

The top part is the topology information. There are 48 phones, and phone 1 sits in a separate section; checking phones.txt tells us it is sil (silence). In the topology, each phone's HMM starts in its initial state with probability 1 and ends in a final state: here states 0, 1, and 2 are the emitting HMM states, and state 3 is the final (exit) state.

Underneath the topology there is a <Triples> tag. It maps each phone and each of its three states to a unique number; that is, there are num-phones x num-states = 48 x 3 = 144 triples. It is easy to see that 0.mdl is the initialized model, in which all the parameters still hold their initial values. We train this model for 40 passes; the final probabilities end up in 40.mdl.

Kaldi labels each p.d.f. with a number starting from 0 (the pdf-id); unlike in HTK, the p.d.f.s have no names.
The .mdl file also does not contain the data linking (context-dependent) phones to pdf-ids; that mapping lives in the tree file. Let's check the tree file:
$ copy-tree --binary=false exp/mono/tree - | less
The output is:
copy-tree --binary=false exp/mono/tree
LOG (copy-tree:main():copy-tree.cc:55) Copied tree
ContextDependency 1 0 ToPdf TE 0 49 ( NULL TE -1 3 ( CE 0 CE 1
CE 2 )

TE -1 3 ( CE 3 CE 4 CE 5 )
TE -1 3 ( CE 6 CE 7 CE 8 )
TE -1 3 ( CE 9 CE 10 CE 11 )
TE -1 3 ( CE 12 CE 13 CE 14 )
TE -1 3 ( CE 15 CE 16 CE 17 )
TE -1 3 ( CE 18 CE 19 CE 20 )
TE -1 3 ( CE 21 CE 22 CE 23 )
TE -1 3 ( CE 24 CE 25 CE 26 )
TE -1 3 ( CE 27 CE 28 CE 29 )
TE -1 3 ( CE 30 CE 31 CE 32 )
TE -1 3 ( CE 33 CE 34 CE 35 )
···················
EndContextDependency

This is the monophone tree. It is very trivial, since there are no branches. CE means constant eventmap, representing a leaf of the tree.

TE means table eventmap, representing a lookup table. There is no SE (split eventmap, representing a branch of the tree) because this is a monophone system.

TE 0 49 is a table eventmap on key 0 (the phone) with 49 entries; inside the parentheses sit 49 event maps. The first one is NULL, a placeholder for entry 0, since phone-id 0 is reserved for epsilon.

An entry such as TE -1 3 ( CE 75 CE 76 CE 77 ) is a table eventmap on key -1, which is the pdf-class from the topo file and represents the HMM state. There are 3 states for this phone, so the key takes the values 0, 1, 2, and the three constant eventmaps in the parentheses are three leaves of the tree.

Below we check the Viterbi alignments produced by training. Each line of data in ali.*.gz corresponds to one training utterance.

$ copy-int-vector "ark:gunzip -c exp/mono/ali.1.gz|" ark,t:- | head -n 2

The output is:
faem0_si1392 2 4 3 3 3 3 3 3 6 5 5 5 5 5 38 37 37
40 42 218 217 217 217 217 217 217 217 217 217 217 217
220 219 222 221 221 221 248 247 247 247 247 247 250
252 176 175 178 177 177 177 177 177 180 122 121 121 121 121 121
121 121 124 123 123 126 125 26 28 27 30 29 29 212 211 214 216
215 146 148 150 260 262 264 263 263 278 277 277 277 280 282 281
281 281 14 13 13 13 13 16 15 18 17 176 175 178 180 62 64 66 206
208 210 242 241 244 246 170 169 169 169 169 172 171 174 173 173
173 173 173 38 37 37 37 40 42 41 218 217 217 217 217 217 217 217
217 217 220 219 222 221 146 145 148 150 62 64 66 56 55 55 55 55
55 55 58 60 59 59 248 250 249 252 251 251 251 251 251 116 115
115 115 115 118 117 117 120 119 224 223 223 223 223 223 223 226
225 225 228 98 97 97 100 99 102 101 266 265 265 265 265 268 267
267 267 270 269 269 86 88 90 110 109 112 111 114 122 121 121 121
121 121 121 121 121 121 124 123 126 125 8 7 7 7 7 7 7 7 7 10 9 9
12 212 214 216 176 175 175 178 177 180 134 133 133 133 136 135
138 137 86 88 90 89 278 277 277 277 280 282 281 281 38 37 40 42
62 64 66 65 206 208 207 207 207 207 207 210 209 14 13 13 16 15
15 18 17 17 17 62 61 64 66 164 166 168 167 167 152 154 156 188
190 189 189 192 191 191 224 223 223 223 223 223 223 223 223 223
226 225 225 228 227 86 85 85 85 85 85 85 85 85 85 85 88 87 90 260
259 262 264 263 68 70 72 71 71 71 2 4 3 3 6 5 5 5 5 5 5 14 13 13 16
15 15 15 15 15 18 17 17 182 181 184 186 260 262 264 68 70 72 122
121 121 121 121 121 121 124 123 123 126 152 151 151 151 151 154
153 153 153 156 155 155 155 155 155 155 170 169 169 169 169 169
169 172 171 174 173 260 259 262 261 264 263 263 263 218 217 217
217 217 217 217 217 217 217 220 219 222 221 221 2 1 4 3 3 3 3 3 3
36

faem0_si2022 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 4 3 3 3 3 3 6 5 5 5 266 265 265 265 265 265 265
268 267 270 269 269 20 22 24 80 82 81 84 83 32 31 34
33 33 33 33 36 35 35 35 35 35 35 122 121 121 121 121
121 121 121 121 124 123 126 125 146 145 148 147 150
62 64 63 66 65 65 65 65 65 65 68 70 72 71 242 241 244
246 245 224 223 223 223 223 223 223 223 223 223 223
223 226 225 225 228 152 154 153 156 260 262 264 263
68 70 72 71 212 211 211 211 214 213 213 216 215 215
215 215 44 43 46 45 45 45 48 47 47 254 256 258 122 121 121 121
121 121 121 121 121 121 121 124 123 126 125 125 26 25 25 28 27
27 27 27 27 30 29 29 29 2 1 1 1 1 4 3 3 3 3 3 3 6 5

This data is the result of the Viterbi alignment; each utterance is represented by one line. Looking at the exp/mono/tree file above, the maximum pdf-id is far smaller than the numbers we see here. The reason is that alignments do not use pdf-ids; they use a finer-grained identifier, the transition-id, which combines the phone, the HMM state, and the particular transition from the topo file. To learn more about the transition-ids, we can run:
$ show-transitions data/lang/phones.txt exp/mono/0.mdl
The output is:
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
Transition-id = 1 p = 0.5 [self-loop]
Transition-id = 2 p = 0.5 [0 -> 1]

Transition-state 2: phone = sil hmm-state = 1 pdf = 1 Transition-id = 3


p = 0.5 [self-loop]
Transition-id = 4 p = 0.5 [1 -> 2]
Transition-state 3: phone = sil hmm-state = 2 pdf = 2 Transition-id = 5
p = 0.75 [self-loop]
Transition-id = 6 p = 0.25 [2 -> 3]

Transition-state 4: phone = a hmm-state = 0 pdf = 3 Transition-id = 7


p = 0.75 [self-loop]
Transition-id = 8 p = 0.25 [0 -> 1]

Transition-state 5: phone = a hmm-state = 1 pdf = 4 Transition-id = 9


p = 0.75 [self-loop]
Transition-id = 10 p = 0.25 [1 -> 2]

Transition-state 6: phone = a hmm-state = 2 pdf = 5 Transition-id = 11


p = 0.75 [self-loop]
Transition-id = 12 p = 0.25 [2 -> 3]

Transition-state 7: phone = b hmm-state = 0 pdf = 6 Transition-id = 13


p = 0.75 [self-loop]

Transition-id = 14 p = 0.25 [0 -> 1]


Transition-state 8: phone = b hmm-state = 1 pdf = 7
Transition-id = 15 p = 0.75 [self-loop]
Transition-id = 16 p = 0.25 [1 -> 2]
Transition-state 9: phone = b hmm-state = 2 pdf = 8
Transition-id = 17 p = 0.75 [self-loop]
Transition-id = 18 p = 0.25 [2 -> 3]
Transition-state 10: phone = ch hmm-state = 0 pdf = 9
Transition-id = 19 p = 0.75 [self-loop]
Transition-id = 20 p = 0.25 [0 -> 1]

Transition-state 11: phone = ch hmm-state = 1 pdf = 10


Transition-id = 21 p = 0.75 [self-loop]
Transition-id = 22 p = 0.25 [1 -> 2]
Transition-state 12: phone = ch hmm-state = 2 pdf = 11

Transition-id = 23 p = 0.75 [self-loop]


Transition-id = 24 p = 0.25 [2 -> 3]
.....
Transition-state 82: phone = z hmm-state = 0 pdf = 81
Transition-id = 163 p = 0.75 [self-loop]
Transition-id = 164 p = 0.25 [0 -> 1]
Transition-state 83: phone = z hmm-state = 1 pdf = 82
Transition-id = 165 p = 0.75 [self-loop]
Transition-id = 166 p = 0.25 [1 -> 2]
Transition-state 84: phone = z hmm-state = 2 pdf = 83
Transition-id = 167 p = 0.75 [self-loop]
Transition-id = 168 p = 0.25 [2 -> 3]

Obviously, it’s the initial status before training. To make it


more clear, let’s try this way:
show-alignments data/lang/phones.txt exp/mono/40.mdl
exp/mono/40.occs | less
#(.occs means occupation counts)

The output is:


Transition-state 1: phone = sil hmm-state = 0 pdf = 0 Transition-id = 1
p = 0.934807 count of pdf = 1.13866e+06 [self-loop]
Transition-id = 2 p = 0.0651934 count of pdf = 1.13866e+06 [0 -> 1]
Transition-state 2: phone = sil hmm-state = 1 pdf = 1 Transition-id = 3
p = 0.889584 count of pdf = 672302 [self-loop]
Transition-id = 4 p = 0.110416 count of pdf = 672302 [1 -> 2]
Transition-state 3: phone = sil hmm-state = 2 pdf = 2 Transition-id = 5
p = 0.7137 count of pdf = 259284 [self-loop]
Transition-id = 6 p = 0.2863 count of pdf = 259284 [2 -> 3]
Transition-state 4: phone = a hmm-state = 0 pdf = 3 Transition-id = 7
p = 0.713307 count of pdf = 390711 [self-loop]
Transition-id = 8 p = 0.286693 count of pdf = 390711 [0 -> 1]
Transition-state 5: phone = a hmm-state = 1 pdf = 4 Transition-id = 9
p = 0.594051 count of pdf = 275931 [self-loop]
Transition-id = 10 p = 0.405949 count of pdf = 275931 [1 -> 2]
Transition-state 6: phone = a hmm-state = 2 pdf = 5 Transition-id = 11
p = 0.594987 count of pdf = 276569 [self-loop]
Transition-id = 12 p = 0.405013 count of pdf = 276569 [2 -> 3]
Transition-state 7: phone = b hmm-state = 0 pdf = 6 Transition-id = 13
p = 0.590539 count of pdf = 19660 [self-loop]

Transition-id = 14 p = 0.409461 count of pdf = 19660 [0 -> 1]


Transition-state 8: phone = b hmm-state = 1 pdf = 7

Transition-id = 15 p = 0.417553 count of pdf = 13821 [self-loop]

We are now using 40.mdl, so these are the final transition probabilities the model arrived at.
Next, let's look at how the training likelihood evolved:
grep Overall exp/mono/log/acc.{?.?,?.??,??.?,??.??}.log
# Below is the last part of the output:
exp/mono/log/acc.35.10.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115)

exp/mono/log/acc.37.12.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.1242
over 595815 frames.

exp/mono/log/acc.38.10.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0045
over 666753 frames.

exp/mono/log/acc.38.11.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -95.769
over 793715 frames.

exp/mono/log/acc.38.12.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0953
over 595815 frames.

exp/mono/log/acc.39.10.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -98.9901
over 666753 frames.

exp/mono/log/acc.39.11.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -95.7472
over 793715 frames.

exp/mono/log/acc.39.12.log:LOG (gmm-acc-stats
ali:main():gmm-acc-stats-ali.cc:115) Overall avg like per frame (Gaussian only) = -99.0786
over 595815 frames.
We can see the acoustic likelihood of each iteration.

6.4 Decoding

With the acoustic model, pronunciation lexicon, and language model built, we are ready to decode (transcribe) the audio clips into words. Conceptually, our objective is to search for the most likely sequence of words according to these models.

However, naively searching over all possible sequences would be astonishingly inefficient.

The overall picture of decoding-graph creation in Kaldi is the construction of the graph HCLG.fst, which is composed from four FSTs: H.fst, C.fst, L.fst, and G.fst.

• G is the Grammar, or Language Model. Technically it is an acceptor (its input and output symbols are the same) that encodes the grammar or language model; to make it compose cleanly with the other three WFSTs, we simply treat it as a WFST whose input and output are identical.

• L is the Lexicon (pronunciation dictionary). Its output symbols are words and its input symbols are phones; it is composed with G.
• C represents the Context-dependency: its output symbols are phones and its input symbols represent context-dependent phones, i.e. windows of N phones.

• H contains the HMM definitions; its output symbols represent context-dependent phones and its input symbols are transition-ids, which encode the pdf-id and other information.

If we were to summarize the approach in one line (and one line obviously cannot capture all the details), it would be roughly as follows, where asl == "add-self-loops", rds == "remove-disambiguation-symbols", and H' is H without the self-loops:

HCLG = asl(min(rds(det(H' o min(det(C o min(det(L o G))))))))

The composition therefore proceeds from G outward: G -> L -> C -> H.

A lattice is used to store the candidate sequences produced during decoding; the decoder explores the branches to find the best score and the N-best hypotheses used to output the text.

In essence, a lattice is a directed acyclic graph. Each node of the graph represents the ending time of a word, and each arc represents a possible word along with that word's acoustic score and language model score.

In discriminative training, computing the denominator involves summing over all possible word sequences; this is estimated by generating lattices and summing over all the words in the lattice. In practice the numerator statistics are also computed from lattices, which is useful for summing over multiple pronunciations.

Numerator and denominator lattices are generated for every training utterance. The denominator lattice uses the recognition setup (with a weaker language model). Each word in the lattice is decoded to give a phone segmentation, and forward-backward is then used to compute the state occupation probabilities.
6.4.1 Build up the Decoding Graph

We need to build the decoding graph before decoding. Line 71 tells us it is done by:

$ utils/mkgraph.sh --mono data/lang_test_bg exp/mono exp/mono/graph

utils/mkgraph.sh creates a fully expanded decoding graph, HCLG, which contains the language model, the pronunciation dictionary, the context-dependency, and the HMM structure. With HCLG, steps/decode.sh can decode the speech into lattices (which are themselves FSTs).

Internally (see http://vpanayotov.blogspot.com/2012/06/kaldi-decoding-graph-construction.html for a detailed walkthrough), mkgraph.sh first checks that L_disambig.fst and G.fst exist. It then composes them into a new $lang/tmp/LG.fst; using the new LG.fst and disambig.int, it creates $lang/tmp/CLG_1_0.fst, ilabels_1_0, and disambig_ilabels_1_0.int; and finally it creates exp/mono/graph/HCLGa.fst from $tree and $model.
The final result is:
tree-info exp/mono/tree
tree-info exp/mono/tree

fsttablecompose data/lang_test_bg/L_disambig.fst
data/lang_test_bg/G.fst

fstdeterminizestar --use-log=true
fstpushspecial
fstminimizeencoded
fstisstochastic data/lang_test_bg/tmp/LG.fst

-0.00841335 -0.00928529

fstcomposecontext --context-size=1 --central-position=0 --read-


disambig-syms=data/lang_test_bg/ phones/disambig.int --write-
disambig-syms=data/lang_ test_bg/tmp/disambig_ilabels_1_0.int
data/lang_test_ bg/tmp/ilabels_1_0

fstisstochastic data/lang_test_bg/tmp/CLG_1_0. fst


-0.00841335 -0.00928515

make-h-transducer --disambig-syms-out=exp/mono/
graph/disambig_tid.int --transition-scale=1.0 data/
lang_test_bg/tmp/ilabels_1_0 exp/mono/tree exp/mono/ final.mdl

fstminimizeencoded
fstdeterminizestar --use-log=true
fsttablecompose exp/mono/graph/Ha.fst data/lang_

test_bg/tmp/CLG_1_0.fst
fstrmsymbols exp/mono/graph/disambig_tid.int fstrmepslocal
fstisstochastic exp/mono/graph/HCLGa.fst 0.000381767 -0.00951546

add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono/final.mdl


Together with final.mdl, the resulting exp/mono/graph/HCLG.fst is what we need for decoding.
6.4.2 Decoding
We decode the development and test sets separately. Lines 73 & 74:
steps/decode.sh --nj "$decode_nj" --cmd "$decode_cmd" exp/mono/graph data/dev exp/mono/decode_dev
steps/decode.sh --nj "$decode_nj" --cmd "$decode_cmd" exp/mono/graph data/test exp/mono/decode_test
The decoding logs are stored in exp/mono/decode_dev/log and exp/mono/decode_test/log. Below is the decoding output for the development set.
steps/decode.sh --nj 1 --cmd run.pl --max-jobs-run 10 exp/mono/ graph data/dev
exp/mono/decode_dev
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd run.pl --max-jobs-run 10 exp/mono/graph
exp/mono/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_
dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(5,26,120) and mean=61.6
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_
dev/log/analyze_lattice_depth_stats.log
From here we can carry on with the remaining training and decoding steps. With final.mdl, words.txt (the dictionary), and HCLG.fst under the graph directory, we can recognize speech.

The decoding graph amounts to a huge HMM. In theory, we could run Viterbi decoding over it directly to find the most likely sequence of words.

In reality, this involves a great deal of work and simplification. We need to optimize this huge graph for faster decoding: we apply determinization so it can be searched deterministically, and minimization to reduce the redundancy in states and transitions.

6.5 Final Results and Output


Here we go:
#!/bin/bash

for x in exp/{mono,tri,sgmm,dnn,combine}*/decode*; do
  [ -d $x ] && echo $x | grep "${1:-.*}" >/dev/null && grep WER $x/wer_* 2>/dev/null | utils/best_wer.sh
done

for x in exp/{mono,tri,sgmm,dnn,combine}*/decode*; do
  [ -d $x ] && echo $x | grep "${1:-.*}" >/dev/null && grep Sum $x/score_*/*.sys 2>/dev/null | utils/best_wer.sh
done

See the results on my computer:

%WER 31.8 | 400 15057 | 71.7 19.4 8.9 3.5 31.8 100.0 | -0.496 | exp/mono/decode_dev/score_5/ctm_39phn.filt.sys
%WER 32.2 | 192 7215 | 70.5 19.4 10.0 2.8 32.2 100.0 | -0.211 | exp/mono/decode_test/score_6/ctm_39phn.filt.sys

WER is 31.8% and 32.2% for the development and test sets respectively. Not bad.

The most popular metric in ASR is WER. Given a reference corpus with N words, the WER (word error rate) is the number of insertions, deletions, and substitutions (i.e. the edit distance) needed to convert our predictions into the ground truth, divided by N.
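As a quick, made-up example: if a reference transcript has N = 10 words and our hypothesis needs 1 substitution, 1 deletion, and 1 insertion to match it, then WER = (1 + 1 + 1) / 10 = 30%.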

Another metric is perplexity, which measures how well a probability distribution predicts the data. For a good LVCSR system and language model, the predictions should have low entropy and low perplexity.

Intuitively, we can view perplexity as the average number of choices a random variable has. For a language model, perplexity measures how many probable words, on average, can follow a given sequence of words. For a good language model the number of choices should be small. If every word were equally likely, the perplexity would be at its highest and would equal the number of words in the vocabulary.
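For instance, a uniform language model over a 10,000-word vocabulary has a perplexity of 10,000, whereas a model that, in the geometric-mean sense, assigns each test word a probability of about 1/100 has a perplexity of about 100.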

===================================================

MonoPhone Training & Decoding


===================================================

steps/train_mono.sh --nj 30 --cmd run.pl --max-jobsrun 10 data/train


data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.

steps/train_mono.sh: Compiling training graphs steps/train_mono.sh:


Aligning data equally (pass 0) steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 2
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 3
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 4
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 5
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 6
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 7
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 8
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 9
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 10
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 11
steps/train_mono.sh: Pass 12
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 13 steps/train_mono.sh: Pass 14
steps/train_mono.sh: Aligning data steps/train_mono.sh: Pass 15
steps/train_mono.sh: Pass 16 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 17 steps/train_mono.sh: Pass 18
steps/train_mono.sh: Aligning data steps/train_mono.sh: Pass 19
steps/train_mono.sh: Pass 20 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 21 steps/train_mono.sh: Pass 22
steps/train_mono.sh: Pass 23 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 24 steps/train_mono.sh: Pass 25
steps/train_mono.sh: Pass 26 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 27 steps/train_mono.sh: Pass 28
steps/train_mono.sh: Pass 29 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 30 steps/train_mono.sh: Pass 31
steps/train_mono.sh: Pass 32 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 33 steps/train_mono.sh: Pass 34
steps/train_mono.sh: Pass 35 steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36
steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39

steps/diagnostic/analyze_alignments.sh --cmd run.pl


--max-jobs-run 10 data/lang exp/mono
steps/diagnostic/analyze_alignments.sh: see stats in
exp/mono/log/analyze_alignments.log
2 warnings in exp/mono/log/align.*.*.log

exp/mono: nj=30 align prob=-99.15 over 3.12h [retry=0.0%,


fail=0.0%] states=144 gauss=986
steps/train_mono.sh: Done training monophone system in exp/mono

tree-info exp/mono/tree
tree-info exp/mono/tree

fsttablecompose data/lang_test_bg/L_disambig.fst
data/lang_test_bg/G.fst
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
fstisstochastic data/lang_test_bg/tmp/LG.fst
-0.00841336 -0.00928521

fstcomposecontext --context-size=1 --central-position=0 --read-


disambig-syms=data/lang_test_bg/ phones/disambig.int --write-
disambig-syms=data/ lang_test_bg/tmp/disambig_ilabels_1_0.int
data/lang_ test_bg/tmp/ilabels_1_0.11459 data/lang_test_bg/tmp/
LG.fst

fstisstochastic data/lang_test_bg/tmp/CLG_1_0.fst
-0.00841336 -0.00928521

make-h-transducer --disambig-syms-out=exp/mono/
graph/disambig_tid.int --transition-scale=1.0 data/
lang_test_bg/tmp/ilabels_1_0 exp/mono/tree exp/mono/ final.mdl

fstrmepslocal
fsttablecompose exp/mono/graph/Ha.fst data/lang_
test_bg/tmp/CLG_1_0.fst

fstdeterminizestar --use-log=true
fstrmsymbols exp/mono/graph/disambig_tid.int fstminimizeencoded
fstisstochastic exp/mono/graph/HCLGa.fst
0.000381709 -0.00951555

add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono/final.mdl


exp/mono/graph/HCLGa.fst steps/decode.sh --nj 5 --cmd run.pl --
max-jobs-run 10 exp/mono/graph data/dev exp/mono/decode_dev
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/mono/graph exp/mono/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/
mono/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(5,25,121) and
mean=55.2

steps/diagnostic/analyze_lats.sh: see stats in exp/


mono/decode_dev/log/analyze_lattice_depth_stats.log
steps/decode.sh --nj 5 --cmd run.pl --max-jobs-run 10
exp/mono/graph data/test exp/mono/decode_test decode.sh: feature
type is delta

steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10


exp/mono/graph exp/mono/decode_test
steps/diagnostic/analyze_lats.sh: see stats in exp/
mono/decode_test/log/analyze_alignments.log Overall, lattice depth
(10,50,90-percentile)=(6,27,142) and mean=76.0

steps/diagnostic/analyze_lats.sh: see stats in exp/


mono/decode_test/log/analyze_lattice_depth_stats. log
Chapter 7. Triphone: Deltas
Tri1: Deltas + Delta-Deltas

ASR proceeds in three steps: phone, word, sentence. The monophone model brings us the phones; we use the triphone model to get to words.

The key issue for speech recognition (not only in Kaldi, but in the whole field) is speech variation, which occurs everywhere: channel/microphone type, environmental noise, speaking style, vocal anatomy, gender, accent, health, and so on.

Speech is a continuous acoustic flow, composed of partly stable and partly dynamic regions. The actual waveform of a word is affected by many factors beyond the phones themselves: the context of each phone, the speaker, the style, the dialect, and so on.

Coarticulation causes what we actually hear to differ from the canonical form of a phone; in daily life we rely on context to identify a phone.
7.1 General

To deal with coarticulation, we divide each phone into several sub-phone units. Take the word "speech": the first part of a phone is influenced by what precedes it, the middle part is stable, and the third part is influenced by what follows. This is why we use the triphone model when recognizing speech with HMMs. If only the preceding context is considered, it is called a biphone.

Putting the phone into context, we get triphone or senone models. A senone is not just a phone with a couple more symbols attached; it is a tied state obtained through a decision tree and other, more complicated machinery.

A triphone is still a phone, but in context. Compared with monophones (e.g. t, iy, n), a triphone is written t-iy+n: it behaves like the monophone iy but takes the context into account, meaning the phone before iy is t and the phone after it is n.

The triphone is a useful sub-word unit because it models the phone in context, and thus captures the most important coarticulation effects. Since both left and right contexts are involved, the number of such triphones is very large.

How many triphones are there? English is usually described with about 40 phones, which gives 40 x 40 x 40 = 64k possible triphones. A cross-word system might use roughly 60k three-state HMMs with 10-component Gaussian mixtures per state, i.e. about 1.8M Gaussians over 39-dimensional feature vectors (12 MFCCs + energy, plus deltas and accelerations). Assuming diagonal Gaussians, that is about 790 parameters per state, for a total well over 100 million parameters (around 180k states x 790, roughly 142M). We would need a very large amount of training data to train such a system directly.
The general steps for triphone training are the same as for monophone training: align, accumulate statistics, build the tree, train, build the graph, and decode. The main difference between the triphone and monophone systems is decision-tree state tying; the underlying principles, classes, and programs are the same, and both are HMM systems.

7.2 tri1 : Deltas + Delta-Deltas Analysis

Here is the script.
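The stage is summarized here as a minimal sketch, reconstructed from the run log in Section 7.3; $train_cmd and $decode_cmd stand for the usual cmd.sh settings (run.pl --max-jobs-run 10 in this run), the graph is built with utils/mkgraph.sh, and the exact line numbers referenced below depend on the recipe version.

steps/align_si.sh --boost-silence 1.25 --nj 30 --cmd "$train_cmd" \
  data/train data/lang exp/mono exp/mono_ali

steps/train_deltas.sh --cmd "$train_cmd" 2500 15000 \
  data/train data/lang exp/mono_ali exp/tri1

utils/mkgraph.sh data/lang_test_bg exp/tri1 exp/tri1/graph
steps/decode.sh --nj 5 --cmd "$decode_cmd" exp/tri1/graph data/dev exp/tri1/decode_dev
steps/decode.sh --nj 5 --cmd "$decode_cmd" exp/tri1/graph data/test exp/tri1/decode_test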

Line 83, steps/align_si.sh, aligns the training data with the monophone model and writes the alignments to exp/mono_ali.

In the alignment (ali), each utterance is associated with a sequence of transition-ids. Each transition-id decomposes into a transition-state and a transition-index. A transition-state maps one-to-one to a (phone-id, hmm-state, pdf-id) triple, where the hmm-state is one of (0,1,2) and the pdf-id is the index of a leaf of the decision tree, i.e. a distinct acoustic state. For a given hmm-state there are two pdfs, the forward pdf and the self-loop pdf, which are the same in most cases; the transition-index records which transition was taken, forward or self-loop. In short, the purpose of the alignment is to link each feature vector to a specific phone state.
Line 87 calls steps/train_deltas.sh; its main parts are the following.

Line 81 accumulates tree statistics: using acc-tree-stats and sum-tree-stats, and based on the alignment information, it accumulates context-dependent statistics for each monophone.

Line 91 cluster-phones and line 94 compile-questions generate the questions used for tree building.

Based on the above, line 99 build-tree builds the decision tree. Line 104 gmm-init-model and line 114 gmm-mixup create the initial model from the alignments produced earlier: the mean and variance of each leaf of the decision tree are computed and serve as the starting model for each state.

Line 122 convert-ali converts the alignments from $alidir to the current tree, using ali/final.mdl, dir/1.mdl, and dir/tree to turn ali/ali.gz into dir/ali.gz, so that the alignments match the new decision tree.

Line 129 compile-train-graphs compiles the graphs of the transcripts based on the FSTs created earlier. It maps each utterance through the lexicon and builds a new state-level FST combining the context, the topology, and the tree, writing dir/fsts.JOB.gz. The subsequent training iterations are based on the alignments above.

The rest is simply iterating for a better result. The default is 35 iterations; at iterations 10, 20, and 30, Kaldi re-aligns the data and writes a new ali.gz for further processing.

In the other iterations ali.gz does not change: Kaldi keeps the state alignment fixed and only computes the expectation and maximizes it. Line 147 gmm-acc-stats-ali and line 150 gmm-est train the GMM for each state.

incgauss=$[($totgauss-$numgauss)/$max_iter_inc]

Here max_iter_inc = 25, so the number of Gaussians is increased over the first 25 iterations. numgauss starts at numleaves, which means each leaf has roughly one Gaussian at the beginning; numgauss then grows by incgauss at every iteration until totgauss is reached.
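With the values used in this run (2500 leaves and 15000 total Gaussians, as passed to train_deltas.sh in Section 7.3, and the default max_iter_inc of 25), the increment works out as follows; this is just a worked example in shell arithmetic, not part of the recipe:

numleaves=2500; totgauss=15000; max_iter_inc=25
numgauss=$numleaves                                 # roughly one Gaussian per leaf at the start
incgauss=$[($totgauss-$numgauss)/$max_iter_inc]     # (15000 - 2500) / 25 = 500
echo "adding $incgauss Gaussians per iteration until $totgauss is reached"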

Now we can understand why train_deltas works better than train_mono.

Each monophone is pronounced differently in different contexts, but train_mono can only provide one GMM to cover all of them, which is far from enough.

train_deltas, in contrast, uses different GMMs for the same monophone depending on its context. For example, one GMM fits the "a" in w_a_n and another GMM fits the "a" in b_a_i; both GMM outputs still map to the monophone a. Using different models for the different realizations of the same monophone addresses its multiple, changing states.

As a result, the features of "a" are more easily aligned to "a". From this we can also see that the ali file really does map each feature vector to a phone state.

7.3 Output
===================================================
tri1 : Deltas + Delta-Deltas Training & Decoding
===================================================

steps/align_si.sh --boost-silence 1.25 --nj 30 --cmd run.pl


--max-jobs-run 10 data/train data/lang exp/mono exp/mono_ali steps/align_si.sh: feature type
is delta
steps/align_si.sh: aligning data in data/train using model from exp/mono, putting alignments
in exp/mono_ali steps/diagnostic/analyze_alignments.sh --cmd run.
pl --max-jobs-run 10 data/lang exp/mono_ali
steps/diagnostic/analyze_alignments.sh: see stats
in exp/mono_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.
steps/train_deltas.sh --cmd run.pl --max-jobs-run 10 2500 15000 data/train data/lang
exp/mono_ali exp/tri1 steps/train_deltas.sh: accumulating tree stats
steps/train_deltas.sh: getting questions for tree-building, via clustering
steps/train_deltas.sh: building the tree
steps/train_deltas.sh: converting alignments from exp/mono_ali to use current tree

steps/train_deltas.sh: compiling graphs of transcripts steps/train_deltas.sh: training pass 1


steps/train_deltas.sh: training pass 2
steps/train_deltas.sh: training pass 3
steps/train_deltas.sh: training pass 4
steps/train_deltas.sh: training pass 5
steps/train_deltas.sh: training pass 6
steps/train_deltas.sh: training pass 7
steps/train_deltas.sh: training pass 8
steps/train_deltas.sh: training pass 9
steps/train_deltas.sh: training pass 10
steps/train_deltas.sh: aligning data
steps/train_deltas.sh: training pass 11
steps/train_deltas.sh: training pass 12
steps/train_deltas.sh: training pass 13
steps/train_deltas.sh: training pass 14


steps/train_deltas.sh: training pass 15


steps/train_deltas.sh: training pass 16
steps/train_deltas.sh: training pass 17
steps/train_deltas.sh: training pass 18
steps/train_deltas.sh: training pass 19
steps/train_deltas.sh: training pass 20
steps/train_deltas.sh: aligning data
steps/train_deltas.sh: training pass 21
steps/train_deltas.sh: training pass 22
steps/train_deltas.sh: training pass 23
steps/train_deltas.sh: training pass 24
steps/train_deltas.sh: training pass 25
steps/train_deltas.sh: training pass 26
steps/train_deltas.sh: training pass 27
steps/train_deltas.sh: training pass 28
steps/train_deltas.sh: training pass 29
steps/train_deltas.sh: training pass 30
steps/train_deltas.sh: aligning data
steps/train_deltas.sh: training pass 31
steps/train_deltas.sh: training pass 32
steps/train_deltas.sh: training pass 33
steps/train_deltas.sh: training pass 34

steps/diagnostic/analyze_alignments.sh --cmd run. pl --max-jobs-run 10 data/lang exp/tri1


steps/diagnostic/analyze_alignments.sh: see stats in exp/tri1/log/analyze_alignments.log

63 warnings in exp/tri1/log/init_model.log
44 warnings in exp/tri1/log/update.*.log
1 warnings in exp/tri1/log/compile_questions.log

exp/tri1: nj=30 align prob=-95.27 over 3.12h [retry=0.0%, fail=0.0%] states=1872


gauss=15037 tree-impr=5.40
steps/train_deltas.sh: Done training system with delta+delta-delta features in exp/tri1
tree-info exp/tri1/tree
tree-info exp/tri1/tree

fstcomposecontext --context-size=3 --central-position=1 --read-disambig-


syms=data/lang_test_bg/phones/ disambig.int --write-disambig-syms=data/lang_test_bg/

tmp/disambig_ilabels_3_1.int data/lang_test_bg/tmp/ ilabels_3_1.8577


data/lang_test_bg/tmp/LG.fst

fstisstochastic data/lang_test_bg/tmp/CLG_3_1.fst 0 -0.00928518

make-h-transducer --disambig-syms-out=exp/tri1/graph/disambig_tid.int --transition-


scale=1.0 data/lang_test_bg/ tmp/ilabels_3_1 exp/tri1/tree exp/tri1/final.mdl

fsttablecompose exp/tri1/graph/Ha.fst data/


lang_test_bg/tmp/CLG_3_1.fst

fstrmepslocal
fstminimizeencoded
fstrmsymbols exp/tri1/graph/disambig_tid.int
fstdeterminizestar --use-log=true
fstisstochastic exp/tri1/graph/HCLGa.fst
0.000443876 -0.0175772
HCLGa is not stochastic

add-self-loops --self-loop-scale=0.1 --reorder=true exp/tri1/final.mdl exp/tri1/graph/HCLGa.fst


steps/decode.sh --nj 5 --cmd run.pl --max-jobs-run 10 exp/tri1/graph data/dev
exp/tri1/decode_dev decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10 exp/tri1/graph
exp/tri1/decode_dev steps/diagnostic/analyze_lats.sh: see stats in exp/
tri1/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(3,11,42) and mean=18.9
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri1/decode_dev/log/analyze_lattice_depth_stats.log steps/decode.sh --nj 5 --cmd run.pl --
max-jobs-run 10 exp/tri1/graph data/test exp/tri1/decode_test decode.sh: feature type is
delta
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10 exp/tri1/graph
exp/tri1/decode_test steps/diagnostic/analyze_lats.sh: see stats in exp/
tri1/decode_test/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(3,12,47) and mean=21.7
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri1/decode_test/log/analyze_lattice_depth_stats.log
Chapter 8. tri2: LDA + MLLT
Discriminant Analysis

LDA + MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then later estimate, over multiple iterations, a diagonalizing transform known as MLLT or STC (from http://kaldi-asr.org/doc/transform.html).

8.1 General

Here is the script:
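A minimal sketch of the tri2 stage as run in this book, with paths and option values reconstructed from the run log in Section 8.2 ($train_cmd and $decode_cmd come from cmd.sh):

steps/align_si.sh --nj 30 --cmd "$train_cmd" \
  data/train data/lang exp/tri1 exp/tri1_ali

steps/train_lda_mllt.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" 2500 15000 \
  data/train data/lang exp/tri1_ali exp/tri2

utils/mkgraph.sh data/lang_test_bg exp/tri2 exp/tri2/graph
steps/decode.sh --nj 5 --cmd "$decode_cmd" exp/tri2/graph data/dev exp/tri2/decode_dev
steps/decode.sh --nj 5 --cmd "$decode_cmd" exp/tri2/graph data/test exp/tri2/decode_test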

Let’s examine the output and see the difference:


steps/train_lda_mllt.sh --cmd slurm.pl --mem 4G 2500 20000
data/train data/lang exp/tri2_ali exp/tri3a steps/train_lda_mllt.sh:
Accumulating LDA statistics.

steps/train_lda_mllt.sh: Accumulating tree stats


steps/train_lda_mllt.sh: Getting questions for tree clustering.
steps/train_lda_mllt.sh: Building the tree
steps/train_lda_mllt.sh: Initializing the model

steps/train_lda_mllt.sh: Converting alignments from exp/tri2_ali to use


current tree
steps/train_lda_mllt.sh: Compiling graphs of transcripts

Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
...

steps/diagnostic/analyze_alignments.sh --cmd slurm. pl --mem 4G


data/lang exp/tri3a
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri3a/log/analyze_alignments.log
333 warnings in exp/tri3a/log/acc.*.*.log

1422 warnings in exp/tri3a/log/align.*.*.log


7 warnings in exp/tri3a/log/lda_acc.*.log
1 warnings in exp/tri3a/log/build_tree.log
1 warnings in exp/tri3a/log/compile_questions.log

exp/tri3a: nj=10 align prob=-48.75 over 150.18h [retry=0.3%,


fail=0.0%] states=2136 gauss=20035 treeimpr=5.07 lda-sum=24.62
mllt:impr,logdet=0.96,1.40

steps/train_lda_mllt.sh: Done training system with LDA+MLLT


features in exp/tri3a

Compared with train_deltas, there is one extra step after reading the original features: a feature transformation. Training is then based on the transformed features instead of the original ones.
8.2 tri2 Output

===================================================
tri2 : LDA + MLLT Training & Decoding
===================================================

steps/align_si.sh --nj 30 --cmd run.pl --max-jobsrun 10 data/train


data/lang exp/tri1 exp/tri1_ali steps/align_si.sh: feature type is delta

steps/align_si.sh: aligning data in data/train using model from


exp/tri1, putting alignments in exp/ tri1_ali

steps/diagnostic/analyze_alignments.sh --cmd run.pl


--max-jobs-run 10 data/lang exp/tri1_ali
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri1_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.

steps/train_lda_mllt.sh --cmd run.pl --max-jobs-run


10 --splice-opts --left-context=3 --right-context=3
2500 15000 data/train data/lang exp/tri1_ali exp/ tri2

steps/train_lda_mllt.sh: Accumulating LDA statistics.


steps/train_lda_mllt.sh: Accumulating tree stats
steps/train_lda_mllt.sh: Getting questions for tree clustering.

steps/train_lda_mllt.sh: Building the tree


steps/train_lda_mllt.sh: Initializing the model

steps/train_lda_mllt.sh: Converting alignments from exp/tri1_ali to use


current tree
steps/train_lda_mllt.sh: Compiling graphs of transcripts

Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 4
steps/train_lda_mllt.sh: Estimating MLLT Training pass 5
Training pass 6
steps/train_lda_mllt.sh: Estimating MLLT Training pass 7
Training pass 8
Training pass 9
Training pass 10
Aligning data
Training pass 11
Training pass 12
steps/train_lda_mllt.sh: Estimating MLLT Training pass 13
Training pass 14
Training pass 15
Training pass 16
Training pass 17
Training pass 18
Training pass 19
Training pass 20
Aligning data
Training pass 21
Training pass 22
Training pass 23
Training pass 24
Training pass 25
Training pass 26
Training pass 27
Training pass 28
Training pass 30
Aligning data
Training pass 31
Training pass 32
Training pass 33
Training pass 34
steps/diagnostic/analyze_alignments.sh --cmd run.pl
--max-jobs-run 10 data/lang exp/tri2
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri2/log/analyze_alignments.log
215 warnings in exp/tri2/log/update.*.log
92 warnings in exp/tri2/log/init_model.log
1 warnings in exp/tri2/log/compile_questions.log
exp/tri2: nj=30 align prob=-47.97 over 3.12h [retry=0.0%, fail=0.0%] states=2016 gauss=15030 tree-impr=5.58 lda-sum=28.42 mllt:impr,logdet=1.62,2.22

steps/train_lda_mllt.sh: Done training system with LDA+MLLT


features in exp/tri2
tree-info exp/tri2/tree

tree-info exp/tri2/tree

make-h-transducer --disambig-syms-out=exp/tri2/
graph/disambig_tid.int --transition-scale=1.0 data/
lang_test_bg/tmp/ilabels_3_1 exp/tri2/tree exp/tri2/ final.mdl

fstrmepslocal
fsttablecompose exp/tri2/graph/Ha.fst data/lang_
test_bg/tmp/CLG_3_1.fst

fstminimizeencoded
fstrmsymbols exp/tri2/graph/disambig_tid.int fstdeterminizestar --use-
log=true
fstisstochastic exp/tri2/graph/HCLGa.fst
0.000445813 -0.0175771
HCLGa is not stochastic

add-self-loops --self-loop-scale=0.1 --reorder=true exp/tri2/final.mdl


exp/tri2/graph/HCLGa.fst steps/decode.sh --nj 5 --cmd run.pl --max-
jobs-run 10 exp/tri2/graph data/dev exp/tri2/decode_dev decode.sh:
feature type is lda
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/tri2/graph exp/tri2/decode_dev steps/diagnostic/analyze_lats.sh:
see stats in exp/ tri2/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(2,8,30) and mean=13.6
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri2/decode_dev/log/analyze_lattice_depth_stats.log steps/decode.sh
--nj 5 --cmd run.pl --max-jobs-run 10 exp/tri2/graph data/test
exp/tri2/decode_test decode.sh: feature type is lda

steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10


exp/tri2/graph exp/tri2/decode_test steps/diagnostic/analyze_lats.sh:
see stats in exp/ tri2/decode_test/log/analyze_alignments.log Overall,
lattice depth (10,50,90-percentile)=(2,9,34) and mean=15.4

steps/diagnostic/analyze_lats.sh: see stats in exp/


tri2/decode_test/log/analyze_lattice_depth_stats. log
Chapter 9. tri3: LDA+MLLT+SAT
Adaptive

SAT refers to Speaker Adapted Training, i.e. training on fMLLR-adapted features. It can be done on top of either LDA+MLLT or delta and delta-delta features. If no transforms are supplied in the alignment directory, the script estimates transforms itself before building the tree (and in any case it estimates transforms a number of times during training).

SAT adds a speaker-dependent feature transformation on top of LDA+MLLT. The script accepts either fMLLR-transformed features or the original MFCC features, depending on whether transform files exist in the alignment folder: if they exist, it uses them; otherwise it estimates the fMLLR transforms itself. SAT then uses these transforms when running acc-tree-stats to accumulate statistics for the decision tree, and build-tree likewise works on the transformed features.
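A minimal sketch of the tri3 stage, reconstructed from the run log below ($train_cmd and $decode_cmd as before):

steps/align_si.sh --nj 30 --cmd "$train_cmd" --use-graphs true \
  data/train data/lang exp/tri2 exp/tri2_ali

steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
  data/train data/lang exp/tri2_ali exp/tri3

utils/mkgraph.sh data/lang_test_bg exp/tri3 exp/tri3/graph
steps/decode_fmllr.sh --nj 5 --cmd "$decode_cmd" exp/tri3/graph data/dev exp/tri3/decode_dev
steps/decode_fmllr.sh --nj 5 --cmd "$decode_cmd" exp/tri3/graph data/test exp/tri3/decode_test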
tri3 Output

===================================================
tri3 : LDA + MLLT + SAT Training & Decoding
===================================================

steps/align_si.sh --nj 30 --cmd run.pl --max-jobsrun 10 --use-graphs


true data/train data/lang exp/ tri2 exp/tri2_ali

steps/align_si.sh: feature type is lda


steps/align_si.sh: aligning data in data/train us

ing model from exp/tri2, putting alignments in exp/ tri2_ali

steps/diagnostic/analyze_alignments.sh --cmd run.pl


--max-jobs-run 10 data/lang exp/tri2_ali
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri2_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.

steps/train_sat.sh --cmd run.pl --max-jobs-run 10 2500 15000


data/train data/lang exp/tri2_ali exp/ tri3

steps/train_sat.sh: feature type is lda


steps/train_sat.sh: obtaining initial fMLLR transforms since not
present in exp/tri2_ali

steps/train_sat.sh: Accumulating tree stats steps/train_sat.sh: Getting


questions for tree clustering.
steps/train_sat.sh: Building the tree
steps/train_sat.sh: Initializing the model

steps/train_sat.sh: Converting alignments from exp/ tri2_ali to use


current tree

steps/train_sat.sh: Compiling graphs of transcripts Pass 1


Pass 2
Estimating fMLLR transforms
Pass 4
Estimating fMLLR transforms Pass 5
Pass 6
Estimating fMLLR transforms Pass 7
Pass 8
Pass 9
Pass 10
Aligning data
Pass 11
Pass 12
Estimating fMLLR transforms Pass 13
Pass 14
Pass 15
Pass 16
Pass 17
Pass 18
Pass 19
Pass 20
Aligning data
Pass 21
Pass 22
Pass 23
Pass 24
Pass 25
Pass 26
Pass 27
Pass 28
Pass 29
Pass 30
Aligning data
Pass 31
Pass 33
Pass 34
steps/diagnostic/analyze_alignments.sh --cmd run.pl
--max-jobs-run 10 data/lang exp/tri3
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri3/log/analyze_alignments.log
11 warnings in exp/tri3/log/update.*.log
1 warnings in exp/tri3/log/compile_questions.log 40 warnings in
exp/tri3/log/init_model.log
steps/train_sat.sh: Likelihood evolution:
-50.308 -49.4272 -49.2236 -49.0161 -48.2868 -47.5841
-47.1577 -46.8853 -46.6384 -46.1028 -45.8385 -45.5103
-45.3276 -45.1911 -45.0677 -44.9549 -44.8449 -44.7357
-44.6329 -44.4723 -44.3375 -44.2495 -44.1644 -44.0833
-44.0051 -43.9287 -43.8543 -43.7816 -43.7107 -43.6167
-43.542 -43.5163 -43.5002 -43.4878
exp/tri3: nj=30 align prob=-47.11 over 3.12h [retry=0.0%, fail=0.0%]
states=1904 gauss=15020 fmllrimpr=4.07 over 2.78h tree-impr=8.78
steps/train_sat.sh: done training SAT system in exp/ tri3
tree-info exp/tri3/tree
tree-info exp/tri3/tree
make-h-transducer --disambig-syms-out=exp/tri3/
graph/disambig_tid.int --transition-scale=1.0 data/
lang_test_bg/tmp/ilabels_3_1 exp/tri3/tree exp/tri3/ final.mdl
fstrmepslocal
fsttablecompose exp/tri3/graph/Ha.fst data/lang_
test_bg/tmp/CLG_3_1.fst
fstminimizeencoded
fstrmsymbols exp/tri3/graph/disambig_tid.int fstdeterminizestar --use-
log=true
fstisstochastic exp/tri3/graph/HCLGa.fst
0.000445813 -0.0175772
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/tri3/final.mdl
exp/tri3/graph/HCLGa.fst steps/decode_fmllr.sh --nj 5 --cmd run.pl --
max-jobsrun 10 exp/tri3/graph data/dev exp/tri3/decode_dev
steps/decode.sh --scoring-opts --num-threads 1
--skip-scoring false --acwt 0.083333 --nj 5 --cmd run.pl --max-jobs-run
10 --beam 10.0 --model exp/ tri3/final.alimdl --max-active 2000
exp/tri3/graph data/dev exp/tri3/decode_dev.si
decode.sh: feature type is lda
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/tri3/graph exp/tri3/decode_dev.si
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_dev.si/log/analyze_alignments.log Overall, lattice depth
(10,50,90-percentile)=(2,9,34) and mean=15.3
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_dev.si/log/analyze_lattice_depth_stats. log
steps/decode_fmllr.sh: feature type is lda
steps/decode_fmllr.sh: getting first-pass fMLLR transforms.
steps/decode_fmllr.sh: doing main lattice generation phase
steps/decode_fmllr.sh: estimating fMLLR transforms a second time.
steps/decode_fmllr.sh: doing a final pass of acoustic rescoring.
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/tri3/graph exp/tri3/decode_dev steps/diagnostic/analyze_lats.sh:
see stats in exp/ tri3/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,5,16) and mean=7.6
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_dev/log/analyze_lattice_depth_stats.log
steps/decode_fmllr.sh --nj 5 --cmd run.pl --max-jobsrun 10
exp/tri3/graph data/test exp/tri3/decode_test steps/decode.sh --
scoring-opts --num-threads 1
--skip-scoring false --acwt 0.083333 --nj 5 --cmd run.pl --max-jobs-run
10 --beam 10.0 --model exp/ tri3/final.alimdl --max-active 2000
exp/tri3/graph data/test exp/tri3/decode_test.si
decode.sh: feature type is lda
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/tri3/graph exp/tri3/decode_test.si
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_test.si/log/analyze_alignments.log Overall, lattice depth
(10,50,90-percentile)=(2,10,36) and mean=16.5
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_test.si/log/analyze_lattice_depth_ stats.log
steps/decode_fmllr.sh: feature type is lda
steps/decode_fmllr.sh: getting first-pass fMLLR transforms.
steps/decode_fmllr.sh: doing main lattice generation phase
steps/decode_fmllr.sh: estimating fMLLR transforms a second time.
steps/decode_fmllr.sh: doing a final pass of acoustic rescoring.
steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10
exp/tri3/graph exp/tri3/decode_test steps/diagnostic/analyze_lats.sh:
see stats in exp/ tri3/decode_test/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,5,18) and mean=8.4
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri3/decode_test/log/analyze_lattice_depth_stats. log
Chapter 10. SGMM2
SGMM on top of fMLLR
10.1 SGMM2 General

SGMM2 refers to SGMM (subspace Gaussian mixture model) training with speaker vectors. This training is normally run on top of fMLLR features obtained from a conventional system, but it also works on top of any type of speaker-independent features (based on deltas+delta-deltas or LDA+MLLT). Here is the script:
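As a minimal sketch, reconstructed from the run log in Section 10.2 ($train_cmd and $decode_cmd as before), the SGMM2 stage looks roughly like this:

steps/align_fmllr.sh --nj 30 --cmd "$train_cmd" \
  data/train data/lang exp/tri3 exp/tri3_ali

steps/train_ubm.sh --cmd "$train_cmd" 400 \
  data/train data/lang exp/tri3_ali exp/ubm4

steps/train_sgmm2.sh --cmd "$train_cmd" 7000 9000 \
  data/train data/lang exp/tri3_ali exp/ubm4/final.ubm exp/sgmm2_4

utils/mkgraph.sh data/lang_test_bg exp/sgmm2_4 exp/sgmm2_4/graph
steps/decode_sgmm2.sh --nj 5 --cmd "$decode_cmd" --transform-dir exp/tri3/decode_dev \
  exp/sgmm2_4/graph data/dev exp/sgmm2_4/decode_dev
steps/decode_sgmm2.sh --nj 5 --cmd "$decode_cmd" --transform-dir exp/tri3/decode_test \
  exp/sgmm2_4/graph data/test exp/sgmm2_4/decode_test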

After the previous training stages, three more steps are run: steps/align_fmllr.sh, steps/train_ubm.sh, and steps/train_sgmm2.sh. The first of these is the fMLLR alignment:
steps/align_fmllr.sh --cmd slurm.pl --mem 4G --nj 10 data/train
data/lang exp/tri3a exp/tri3a_ali

steps/align_fmllr.sh: feature type is lda


steps/align_fmllr.sh: compiling training graphs

steps/align_fmllr.sh: aligning data in data/train using


exp/tri3a/final.mdl and speaker-independent features.

steps/align_fmllr.sh: computing fMLLR transforms


steps/align_fmllr.sh: doing final alignment. steps/align_fmllr.sh: done
aligning data.

steps/diagnostic/analyze_alignments.sh --cmd slurm. pl --mem 4G


data/lang exp/tri3a_ali
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri3a_ali/log/analyze_alignments.log

283 warnings in exp/tri3a_ali/log/align_pass1.*.log


4 warnings in exp/tri3a_ali/log/fmllr.*.log
305 warnings in exp/tri3a_ali/log/align_pass2.*.log

align_fmllr.sh first aligns the data to produce pre_ali.gz, computes the fMLLR transforms based on pre_ali.gz, and then creates trans.JOB.

SGMM2 needs somewhat more memory (preferably more than 8 GB) and ideally a GPU, which makes it faster. On our machine, everything up to MMI + SGMM2 took about 1 hour and 40 minutes, the DNN phase took 2 hours, and DNN + SGMM took 20 minutes, about 4 hours in total; with a GPU it should be much faster. The recognition output of each stage is shown below.

10.2 SGMM2 Output

===================================================
SGMM2 Training & Decoding
===================================================

steps/align_fmllr.sh --nj 30 --cmd run.pl --max-jobsrun 10 data/train


data/lang exp/tri3 exp/tri3_ali

steps/align_fmllr.sh: feature type is lda


steps/align_fmllr.sh: compiling training graphs

steps/align_fmllr.sh: aligning data in data/train using


exp/tri3/final.alimdl and speaker-independent features.

steps/align_fmllr.sh: computing fMLLR transforms


steps/align_fmllr.sh: doing final alignment. steps/align_fmllr.sh: done
aligning data.

steps/diagnostic/analyze_alignments.sh --cmd run.pl


--max-jobs-run 10 data/lang exp/tri3_ali
steps/diagnostic/analyze_alignments.sh: see stats in
exp/tri3_ali/log/analyze_alignments.log
steps/train_ubm.sh --cmd run.pl --max-jobs-run 10 400 data/train
data/lang exp/tri3_ali exp/ubm4

steps/train_ubm.sh: feature type is lda


steps/train_ubm.sh: using transforms from exp/tri3_ali

steps/train_ubm.sh: clustering model exp/tri3_ali/ final.mdl to get


initial UBM

steps/train_ubm.sh: doing Gaussian selection Pass 0


Pass 1
Pass 2

steps/train_sgmm2.sh --cmd run.pl --max-jobs-run 10 7000 9000


data/train data/lang exp/tri3_ali exp/ubm4/ final.ubm exp/sgmm2_4

steps/train_sgmm2.sh: feature type is lda


steps/train_sgmm2.sh: using transforms from exp/ tri3_ali
steps/train_sgmm2.sh: accumulating tree stats

steps/train_sgmm2.sh: Getting questions for tree clustering.

steps/train_sgmm2.sh: Building the tree


steps/train_sgmm2.sh: Initializing the model steps/train_sgmm2.sh:
doing Gaussian selection steps/train_sgmm2.sh: compiling training
graphs steps/train_sgmm2.sh: converting alignments
steps/train_sgmm2.sh: training pass 0 ...
steps/train_sgmm2.sh: training pass 1 ...
steps/train_sgmm2.sh: training pass 2 ...
steps/train_sgmm2.sh: training pass 3 ...
steps/train_sgmm2.sh: training pass 4 ...
steps/train_sgmm2.sh: training pass 5 ...
steps/train_sgmm2.sh: re-aligning data
steps/train_sgmm2.sh: training pass 6 ...
steps/train_sgmm2.sh: training pass 7 ...
steps/train_sgmm2.sh: training pass 8 ...
steps/train_sgmm2.sh: training pass 9 ...
steps/train_sgmm2.sh: training pass 10 ... steps/train_sgmm2.sh: re-
aligning data
steps/train_sgmm2.sh: training pass 11 ... steps/train_sgmm2.sh:
training pass 12 ... steps/train_sgmm2.sh: training pass 13 ...
steps/train_sgmm2.sh: training pass 14 ... steps/train_sgmm2.sh:
training pass 15 ... steps/train_sgmm2.sh: re-aligning data
steps/train_sgmm2.sh: training pass 16 ... steps/train_sgmm2.sh:
training pass 17 ... steps/train_sgmm2.sh: training pass 18 ...
steps/train_sgmm2.sh: training pass 19 ... steps/train_sgmm2.sh:
training pass 20 ... steps/train_sgmm2.sh: training pass 21 ...
steps/train_sgmm2.sh: training pass 22 ... steps/train_sgmm2.sh:
training pass 23 ... steps/train_sgmm2.sh: training pass 24 ...
steps/train_sgmm2.sh: building alignment model (pass 25) steps/train_sgmm2.sh: building
alignment model (pass 26) steps/train_sgmm2.sh: building alignment model (pass 27) 213
warnings in exp/sgmm2_4/log/update_ali.*.log 1875 warnings in
exp/sgmm2_4/log/update.*.log 1 warnings in
exp/sgmm2_4/log/compile_questions.log Done
tree-info exp/sgmm2_4/tree
tree-info exp/sgmm2_4/tree
make-h-transducer --disambig-syms-out=exp/sgmm2_4/

graph/disambig_tid.int --transition-scale=1.0 data/


lang_test_bg/tmp/ilabels_3_1 exp/sgmm2_4/tree exp/
sgmm2_4/final.mdl

fstrmepslocal
fsttablecompose exp/sgmm2_4/graph/Ha.fst data/lang_
test_bg/tmp/CLG_3_1.fst
fstrmsymbols exp/sgmm2_4/graph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/sgmm2_4/graph/HCLGa.fst 0.000485195
-0.0175772
HCLGa is not stochastic

add-self-loops --self-loop-scale=0.1 --reorder=true


exp/sgmm2_4/final.mdl exp/sgmm2_4/graph/HCLGa.fst

steps/decode_sgmm2.sh --nj 5 --cmd run.pl --maxjobs-run 10 --


transform-dir exp/tri3/decode_dev exp/ sgmm2_4/graph data/dev
exp/sgmm2_4/decode_dev

steps/decode_sgmm2.sh: feature type is lda


steps/decode_sgmm2.sh: using transforms from exp/
tri3/decode_dev

steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10


exp/sgmm2_4/graph exp/sgmm2_4/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/
sgmm2_4/decode_dev/log/analyze_alignments.log Overall, lattice
depth (10,50,90-percentile)=(2,6,20) and mean=9.4

steps/diagnostic/analyze_lats.sh: see stats in exp/


sgmm2_4/decode_dev/log/analyze_lattice_depth_stats. log
steps/decode_sgmm2.sh --nj 5 --cmd run.pl --max-jobsrun 10 --
transform-dir exp/tri3/decode_test exp/ sgmm2_4/graph data/test
exp/sgmm2_4/decode_test

steps/decode_sgmm2.sh: feature type is lda


steps/decode_sgmm2.sh: using transforms from exp/
tri3/decode_test

steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10


exp/sgmm2_4/graph exp/sgmm2_4/decode_ test

steps/diagnostic/analyze_lats.sh: see stats in exp/


sgmm2_4/decode_test/log/analyze_alignments.log Overall, lattice
depth (10,50,90-percentile)=(2,6,23) and mean=10.4

steps/diagnostic/analyze_lats.sh: see stats in exp/


sgmm2_4/decode_test/log/analyze_lattice_depth_ stats.log

Even though SGMM is a good approach, Dan Povey points out: "But I recommend that you ignore SGMMs. Right now neural nets are where it's at, and SGMMs were, at best, the most highly evolved state of a dying approach. MMI is still somewhat relevant as it's an example of a sequence-level objective (maximum posterior of the correct transcript) which is equivalent to the objective function used in CTC and other popular modern approaches, and it's still used when training neural nets."
Chapter 11. MMI + SGMM2
Maximum Mutual Information
11.1 Why MMI?

Up to the last chapter, most of the models were trained with maximum likelihood estimation (MLE). Maximum likelihood provides a consistent approach to parameter estimation, which means estimates can be developed for a large variety of estimation problems; for example, it can be applied in reliability analysis to censored data under various censoring models. However, it also has clear disadvantages:

• The likelihood equations need to be specifically worked out


for a given distribution and estimation problem. The
mathematics is often non-trivial, particularly if confidence
intervals for the parameters are desired.

• The numerical estimation is usually non-trivial. Except for a


few cases where the maximum likelihood formulas are in fact
simple, it is generally best to rely on high quality statistical
software to obtain maximum likelihood estimates. Fortunately,
high quality maximum likelihood software is becoming
increasingly common.

• Maximum likelihood estimates can be heavily biased for


small samples. The optimality properties may not apply for
small samples.
• Maximum likelihood can be sensitive to the choice of
starting values.
To work around the first problem we use GMMs to model the acoustic distributions, but a GMM may not be trained thoroughly, which remains a problem.

The other disadvantages simply have no effective remedy within MLE. That is why MMI (Maximum Mutual Information) is introduced. Rather than a generic deep-learning method, it is a framework specific to speech processing: discriminative training. In the DNN-HMM systems discussed later, the cross-entropy (CE) loss is often used to minimize the phoneme prediction error rate, but the CE criterion evaluates each speech frame independently and therefore ignores the context across the phoneme sequence. To address this, discriminative training methods were proposed. The discriminative criteria include Maximum Mutual Information (MMI), boosted MMI (bMMI), Minimum Phone Error (MPE), and state-level minimum Bayes risk (sMBR).

MMI Formula:
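In its standard form (written here in LaTeX notation; O_u are the observations of utterance u, W_u its reference transcript, and kappa the acoustic scale), the criterion is:

\[
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{u=1}^{U} \log
    \frac{p_\lambda(O_u \mid W_u)^{\kappa}\, P(W_u)}
         {\sum_{W'} p_\lambda(O_u \mid W')^{\kappa}\, P(W')}
\]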

Let's convert it a little bit:
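Since the denominator sums over all word sequences, the ratio is simply the (scaled) posterior of the correct transcript, so the same objective can be rewritten as:

\[
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{u=1}^{U} \log P_\lambda(W_u \mid O_u)
\]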

11.2 MMI General


It is now easier to see that the objective is P(W|O); in other words, the MMI model takes the language model into account. Compared with MLE, the advantage is that even if the assumed distribution is not ideal, the output will not be that bad.

We can think of it as the ratio of correct paths to competing paths: when the score of the correct paths is pushed up, that of the wrong paths goes down, which is what "discriminative training" means.

Here is the script from Kaldi.
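A minimal sketch of the MMI + SGMM2 stage, reconstructed from the run log in Section 11.3 ($train_cmd and $decode_cmd as before):

steps/align_sgmm2.sh --nj 30 --cmd "$train_cmd" --transform-dir exp/tri3_ali \
  --use-graphs true --use-gselect true \
  data/train data/lang exp/sgmm2_4 exp/sgmm2_4_ali

steps/make_denlats_sgmm2.sh --nj 30 --sub-split 30 --acwt 0.2 --lattice-beam 10.0 \
  --beam 18.0 --cmd "$decode_cmd" --transform-dir exp/tri3_ali \
  data/train data/lang exp/sgmm2_4_ali exp/sgmm2_4_denlats

steps/train_mmi_sgmm2.sh --acwt 0.2 --cmd "$decode_cmd" --transform-dir exp/tri3_ali \
  --boost 0.1 --drop-frames true \
  data/train data/lang exp/sgmm2_4_ali exp/sgmm2_4_denlats exp/sgmm2_4_mmi_b0.1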

Line 175, steps/train_mmi_sgmm2.sh, takes the SGMM2 alignments and the denominator lattices as input and outputs the MMI-trained models (exp/sgmm2_4_mmi_b0.1 in this run). Inside the train_mmi_sgmm2.sh script:

Line 44 of train_mmi_sgmm2.sh compares the phone tables of the two input directories and stops if they are not the same; each phone is associated with a unique integer.

Lines 47-49 locate each utterance's features. alidir/{tree,final.mdl,ali.1.gz} provide the decision tree, the model, and the alignment information, while denlatdir/lat.1.gz stores the denominator (word) lattices.

Starting from line 69 the features are transformed: final.mat holds the feature transform, and transform-feats applies it, producing features in ark format that are used for decoding and for the lattices.

Line 108, lattice-boost-ali, boosts the lattice: on each arc it raises the likelihood (lowers the graph cost) in proportion to b times the frame phone error, so that paths with more errors become stronger competitors. This is what turns plain MMI into boosted MMI by modifying the lattice. The model is needed for its transition information, to convert pdf-ids to phones. With the silence-phones parameter the boosting ignores errors on silence; alternatively max-silence-error can be used, and setting max-silence-error to 1 is equivalent to passing no silence-phones.

11.3 MMI + SGMM2 Output

===================================================

MMI + SGMM2 Training & Decoding


===================================================

steps/align_sgmm2.sh --nj 30 --cmd run.pl --max-jobsrun 10 --


transform-dir exp/tri3_ali --use-graphs true
--use-gselect true data/train data/lang exp/sgmm2_4
exp/sgmm2_4_ali

steps/align_sgmm2.sh: feature type is lda


steps/align_sgmm2.sh: using transforms from exp/ tri3_ali

steps/align_sgmm2.sh: aligning data in data/train using model


exp/sgmm2_4/final.alimdl
steps/align_sgmm2.sh: computing speaker vectors (1st pass)
steps/align_sgmm2.sh: computing speaker vectors (2nd pass)

steps/align_sgmm2.sh: doing final alignment.


steps/align_sgmm2.sh: done aligning data.

steps/diagnostic/analyze_alignments.sh --cmd run.pl


--max-jobs-run 10 data/lang exp/sgmm2_4_ali

steps/diagnostic/analyze_alignments.sh: see stats in


exp/sgmm2_4_ali/log/analyze_alignments.log
steps/make_denlats_sgmm2.sh --nj 30 --sub-split 30
--acwt 0.2 --lattice-beam 10.0 --beam 18.0 --cmd run. pl --max-jobs-
run 10 --transform-dir exp/tri3_ali data/train data/lang
exp/sgmm2_4_ali exp/sgmm2_4_ denlats

steps/make_denlats_sgmm2.sh: Making unigram grammar FST in


exp/sgmm2_4_denlats/lang
steps/make_denlats_sgmm2.sh: Compiling decoding graph in
exp/sgmm2_4_denlats/dengraph

tree-info exp/sgmm2_4_ali/tree
tree-info exp/sgmm2_4_ali/tree
fsttablecompose exp/sgmm2_4_denlats/lang/L_disambig. fst
exp/sgmm2_4_denlats/lang/G.fst
fstminimizeencoded
fstdeterminizestar --use-log=true
fstpushspecial
fstisstochastic exp/sgmm2_4_denlats/lang/tmp/LG.fst 1.27271e-05
1.27271e-05
fstcomposecontext --context-size=3 --central-posi

tion=1 --read-disambig-syms=exp/sgmm2_4_denlats/
lang/phones/disambig.int --write-disambig-syms=exp/
sgmm2_4_denlats/lang/tmp/disambig_ilabels_3_1.int
exp/sgmm2_4_denlats/lang/tmp/ilabels_3_1.7065 exp/
sgmm2_4_denlats/lang/tmp/LG.fst

fstisstochastic exp/sgmm2_4_denlats/lang/tmp/ CLG_3_1.fst


1.27657e-05 0

make-h-transducer --disambig-syms-out=exp/sgmm2_4_
denlats/dengraph/disambig_tid.int --transitionscale=1.0
exp/sgmm2_4_denlats/lang/tmp/ilabels_3_1 exp/sgmm2_4_ali/tree
exp/sgmm2_4_ali/final.mdl

fsttablecompose exp/sgmm2_4_denlats/dengraph/Ha.fst
exp/sgmm2_4_denlats/lang/tmp/CLG_3_1.fst
fstrmepslocal
fstminimizeencoded

fstrmsymbols exp/sgmm2_4_denlats/dengraph/disambig_ tid.int


fstdeterminizestar --use-log=true

fstisstochastic exp/sgmm2_4_denlats/dengraph/HCLGa. fst


0.000481185 -0.000485819

add-self-loops --self-loop-scale=0.1 --reorder=true


exp/sgmm2_4_ali/final.mdl
exp/sgmm2_4_denlats/dengraph/HCLGa.fst

steps/make_denlats_sgmm2.sh: feature type is lda


steps/make_denlats_sgmm2.sh: using fMLLR transforms from
exp/tri3_ali

steps/make_denlats_sgmm2.sh: Merging archives for data subset 1


steps/make_denlats_sgmm2.sh: Merging archives for data subset 2
steps/make_denlats_sgmm2.sh: Merging archives for data subset 3
steps/make_denlats_sgmm2.sh: Merging archives for data subset 4
steps/make_denlats_sgmm2.sh: Merging archives for data subset 5
steps/make_denlats_sgmm2.sh: Merging archives for data subset 6
steps/make_denlats_sgmm2.sh: Merging archives for data subset 7
steps/make_denlats_sgmm2.sh: Merging archives for data subset 8

steps/make_denlats_sgmm2.sh: Merging archives for data subset 9


steps/make_denlats_sgmm2.sh: Merging archives for data subset 10
steps/make_denlats_sgmm2.sh: Merging archives for data subset 11

steps/make_denlats_sgmm2.sh: Merging archives for data subset 12


steps/make_denlats_sgmm2.sh: Merging archives for data subset 13
steps/make_denlats_sgmm2.sh: Merging archives for data subset 14
steps/make_denlats_sgmm2.sh: Merging archives for data subset 15
steps/make_denlats_sgmm2.sh: Merging archives for data subset 16
steps/make_denlats_sgmm2.sh: Merging archives for data subset 17
steps/make_denlats_sgmm2.sh: Merging archives for data subset 18
steps/make_denlats_sgmm2.sh: Merging archives for data subset 19

steps/make_denlats_sgmm2.sh: Merging archives for data subset 20


steps/make_denlats_sgmm2.sh: Merging archives for data subset 21
steps/make_denlats_sgmm2.sh: Merging archives for data subset 22
steps/make_denlats_sgmm2.sh: Merging archives for data subset 23
steps/make_denlats_sgmm2.sh: Merging archives for data subset 24
steps/make_denlats_sgmm2.sh: Merging archives for data subset 25
steps/make_denlats_sgmm2.sh: Merging archives for data subset 26

steps/make_denlats_sgmm2.sh: Merging archives for data subset 27


steps/make_denlats_sgmm2.sh: Merging archives for data subset 28
steps/make_denlats_sgmm2.sh: Merging archives for data subset 29

steps/make_denlats_sgmm2.sh: Merging archives for data subset 30


steps/make_denlats_sgmm2.sh: done generating denominator
lattices with SGMMs.

steps/train_mmi_sgmm2.sh --acwt 0.2 --cmd run. pl --max-jobs-run


10 --transform-dir exp/tri3_ali
--boost 0.1 --drop-frames true data/train data/lang exp/sgmm2_4_ali
exp/sgmm2_4_denlats exp/sgmm2_4_mmi_ b0.1

steps/train_mmi_sgmm2.sh: feature type is lda


steps/train_mmi_sgmm2.sh: using transforms from exp/ tri3_ali

steps/train_mmi_sgmm2.sh: using speaker vectors from


exp/sgmm2_4_ali
steps/train_mmi_sgmm2.sh: using Gaussian-selection info from
exp/sgmm2_4_ali

Iteration 0 of MMI training


Iteration 0: objf was 0.501016467634615, MMI auxf change was
0.0162648878979182
Iteration 1 of MMI training
Iteration 1: objf was 0.515660655672759, MMI auxf change was
0.00237356455193395
Iteration 2 of MMI training

Iteration 2: objf was 0.51831743276386, MMI auxf change was


0.000636215653485037
Iteration 3 of MMI training
Iteration 3: objf was 0.519198604496081, MMI auxf change was
0.000381818294967297
MMI training finished

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


1 --transform-dir exp/tri3/decode_dev data/lang_test_bg data/dev
exp/sgmm2_4/decode_dev exp/sgmm2_4_mmi_b0.1/decode_dev_it1

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_dev
steps/decode_sgmm2_rescore.sh: feature type is lda
steps/decode_sgmm2_rescore.sh: using transforms from
exp/tri3/decode_dev
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/1.mdl

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


1 --transform-dir exp/tri3/decode_test data/lang_test_bg data/test
exp/sgmm2_4/ decode_test
exp/sgmm2_4_mmi_b0.1/decode_test_it1

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_test
steps/decode_sgmm2_rescore.sh: feature type is lda

steps/decode_sgmm2_rescore.sh: using transforms from


exp/tri3/decode_test
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/1.mdl
steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter
2 --transform-dir exp/tri3/decode_dev data/lang_test_bg data/dev
exp/sgmm2_4/decode_dev exp/sgmm2_4_mmi_b0.1/decode_dev_it2

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_dev
steps/decode_sgmm2_rescore.sh: feature type is lda

steps/decode_sgmm2_rescore.sh: using transforms from


exp/tri3/decode_dev
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/2.mdl

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


2 --transform-dir exp/tri3/decode_test data/lang_test_bg data/test
exp/sgmm2_4/ decode_test
exp/sgmm2_4_mmi_b0.1/decode_test_it2

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_test
steps/decode_sgmm2_rescore.sh: feature type is lda
steps/decode_sgmm2_rescore.sh: using transforms from
exp/tri3/decode_test
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/2.mdl

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


3 --transform-dir exp/tri3/decode_dev data/lang_test_bg data/dev
exp/sgmm2_4/decode_dev exp/sgmm2_4_mmi_b0.1/decode_dev_it3

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_dev

steps/decode_sgmm2_rescore.sh: feature type is lda


steps/decode_sgmm2_rescore.sh: using transforms from
exp/tri3/decode_dev
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/3.mdl
steps/decode_sgmm2_rescore.sh --cmd run.pl --max

jobs-run 10 --iter 3 --transform-dir exp/tri3/decode_test


data/lang_test_bg data/test exp/sgmm2_4/ decode_test
exp/sgmm2_4_mmi_b0.1/decode_test_it3

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_test
steps/decode_sgmm2_rescore.sh: feature type is lda

steps/decode_sgmm2_rescore.sh: using transforms from


exp/tri3/decode_test
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/3.mdl

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


4 --transform-dir exp/tri3/decode_dev data/lang_test_bg data/dev
exp/sgmm2_4/decode_dev exp/sgmm2_4_mmi_b0.1/decode_dev_it4

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_dev
steps/decode_sgmm2_rescore.sh: feature type is lda
steps/decode_sgmm2_rescore.sh: using transforms from
exp/tri3/decode_dev
steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM
model in exp/sgmm2_4_mmi_b0.1/4.mdl

steps/decode_sgmm2_rescore.sh --cmd run.pl --maxjobs-run 10 --iter


4 --transform-dir exp/tri3/decode_test data/lang_test_bg data/test
exp/sgmm2_4/ decode_test
exp/sgmm2_4_mmi_b0.1/decode_test_it4

steps/decode_sgmm2_rescore.sh: using speaker vectors from


exp/sgmm2_4/decode_test
steps/decode_sgmm2_rescore.sh: feature type is lda
steps/decode_sgmm2_rescore.sh: using transforms from
exp/tri3/decode_test

steps/decode_sgmm2_rescore.sh: rescoring lattices with SGMM


model in exp/sgmm2_4_mmi_b0.1/4.mdl
Chapter 12. Dan’s DNN
DNN Hybrid Training

Deep Neural Networks (DNNs) have been the hot topic in speech recognition since around 2010, with Google, Microsoft, Facebook, and Amazon, to name a few, among the main players. Kaldi has also demonstrated how easily DNN techniques can be incorporated to improve recognition performance in almost all recognition tasks.

12.1 General

Kaldi currently contains two parallel implementations for DNN training. The first implementation supports Restricted Boltzmann Machine (RBM) pre-training, stochastic gradient descent training using NVidia Graphics Processing Units (GPUs), and discriminative training such as boosted MMI and state-level minimum Bayes risk (sMBR).

The second implementation was originally written to support parallel training on multiple CPUs, although it has since been extended to support parallel GPU-based training as well.

The first is located in the code sub-directories nnet/ and nnetbin/ and is primarily maintained by Karel Vesely. The other is located in the code sub-directories nnet2/ and nnet2bin/ and is primarily maintained by Daniel Povey (this code was originally based on an earlier version of Karel's code, but it has been extensively rewritten). Neither codebase is more "official" than the other; both are still being developed in parallel.
In this chapter, we will show Dan’s DNN mainly.

Dan's DNN follows a classic hybrid training and decoding framework using a simple deep network with tanh nonlinearities. In addition, system combination with minimum Bayes risk decoding can be used; in that case a lattice combination creates a union of lattices, normalized by removing their total forward cost, and the resulting lattice is used as input to the last decoding step.
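For reference, the training and decoding commands of this stage, as they appear in the run log in Section 12.2, look roughly as follows ($train_cmd and $decode_cmd as before):

steps/nnet2/train_tanh.sh --mix-up 5000 --initial-learning-rate 0.015 \
  --final-learning-rate 0.002 --num-hidden-layers 2 --num-jobs-nnet 30 \
  --cmd "$train_cmd" data/train data/lang exp/tri3_ali exp/tri4_nnet

steps/nnet2/decode.sh --nj 5 --num-threads 6 --cmd "$decode_cmd" \
  --transform-dir exp/tri3/decode_dev exp/tri3/graph data/dev exp/tri4_nnet/decode_dev
steps/nnet2/decode.sh --nj 5 --num-threads 6 --cmd "$decode_cmd" \
  --transform-dir exp/tri3/decode_test exp/tri3/graph data/test exp/tri4_nnet/decode_test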
Line 196, steps/nnet2/train_tanh.sh, trains a fairly vanilla network with tanh nonlinearities.

12.2 Output

===================================================
DNN Hybrid Training & Decoding
===================================================

steps/nnet2/train_tanh.sh --mix-up 5000 --initiallearning-rate 0.015 --


final-learning-rate 0.002
--num-hidden-layers 2 --num-jobs-nnet 30 --cmd run. pl --max-jobs-
run 10 data/train data/lang exp/tri3_ ali exp/tri4_nnet

steps/nnet2/train_tanh.sh: calling get_lda.sh

steps/nnet2/get_lda.sh --transform-dir exp/tri3_ali


--splice-width 4 --cmd run.pl --max-jobs-run 10 data/ train data/lang
exp/tri3_ali exp/tri4_nnet

steps/nnet2/get_lda.sh: feature type is lda


steps/nnet2/get_lda.sh: using transforms from exp/ tri3_ali

feat-to-dim ‘ark,s,cs:utils/subset_scp.pl --quiet 333


data/train/split30/1/feats.scp | apply-cmvn
--utt2spk=ark:data/train/split30/1/utt2spk scp:data/
train/split30/1/cmvn.scp scp:- ark:- | splice-feats
--left-context=3 --right-context=3 ark:- ark:- | transform-feats
exp/tri4_nnet/final.mat ark:- ark:- | transform-feats --
utt2spk=ark:data/train/split30/1/ utt2spk ark:exp/tri3_ali/trans.1 ark:-
ark:- |’ -

splice-feats --left-context=3 --right-context=3 ark:- ark:-


transform-feats exp/tri4_nnet/final.mat ark:- ark:- transform-feats --
utt2spk=ark:data/train/split30/1/ utt2spk ark:exp/tri3_ali/trans.1 ark:-
ark:-
apply-cmvn --utt2spk=ark:data/train/split30/1/utt2spk
scp:data/train/split30/1/cmvn.scp scp:- ark:-

WARNING (feat-to-dim[5.5.164~1-9698]:Close():Kaldiio.cc:515) Pipe


utils/subset_scp.pl --quiet 333 data/train/split30/1/feats.scp | apply-
cmvn
--utt2spk=ark:data/train/split30/1/utt2spk scp:data/
train/split30/1/cmvn.scp scp:- ark:- | splice-feats
--left-context=3 --right-context=3 ark:- ark:- | transform-feats
exp/tri4_nnet/final.mat ark:- ark:- | transform-feats --
utt2spk=ark:data/train/split30/1/ utt2spk ark:exp/tri3_ali/trans.1 ark:-
ark:- | had nonzero return status 36096

feat-to-dim ‘ark,s,cs:utils/subset_scp.pl --quiet 333


data/train/split30/1/feats.scp | apply-cmvn
--utt2spk=ark:data/train/split30/1/utt2spk scp:data/
train/split30/1/cmvn.scp scp:- ark:- | splice-feats
--left-context=3 --right-context=3 ark:- ark:- | transform-feats
exp/tri4_nnet/final.mat ark:- ark:- | transform-feats --
utt2spk=ark:data/train/split30/1/ utt2spk ark:exp/tri3_ali/trans.1 ark:-
ark:- | splicefeats --left-context=4 --right-context=4 ark:- ark:- |’ -

transform-feats exp/tri4_nnet/final.mat ark:- ark:- splice-feats --left-


context=3 --right-context=3 ark:- ark:-
transform-feats --utt2spk=ark:data/train/split30/1/ utt2spk
ark:exp/tri3_ali/trans.1 ark:- ark:-
apply-cmvn --utt2spk=ark:data/train/split30/1/utt2spk
scp:data/train/split30/1/cmvn.scp scp:- ark:- splice-feats --left-
context=4 --right-context=4 ark:- ark:-

WARNING (feat-to-dim[5.5.164~1-9698]:Close():Kaldiio.cc:515) Pipe


utils/subset_scp.pl --quiet 333 data/train/split30/1/feats.scp | apply-
cmvn
--utt2spk=ark:data/train/split30/1/utt2spk scp:data/
train/split30/1/cmvn.scp scp:- ark:- | splice-feats
--left-context=3 --right-context=3 ark:- ark:- | transform-feats
exp/tri4_nnet/final.mat ark:- ark:- | transform-feats --
utt2spk=ark:data/train/split30/1/ utt2spk ark:exp/tri3_ali/trans.1 ark:-
ark:- | splicefeats --left-context=4 --right-context=4 ark:- ark:- | had
nonzero return status 36096

steps/nnet2/get_lda.sh: Accumulating LDA statistics.


steps/nnet2/get_lda.sh: Finished estimating LDA
steps/nnet2/train_tanh.sh: calling get_egs.sh
steps/nnet2/get_egs.sh --transform-dir exp/tri3_ali
--splice-width 4 --samples-per-iter 200000 --numjobs-nnet 30 --stage
0 --cmd run.pl --max-jobs-run 10
--io-opts --max-jobs-run 5 data/train data/lang exp/ tri3_ali
exp/tri4_nnet

steps/nnet2/get_egs.sh: feature type is lda steps/nnet2/get_egs.sh:


using transforms from exp/ tri3_ali
steps/nnet2/get_egs.sh: working out number of frames of training
data

utils/data/get_utt2dur.sh: segments file does not exist so getting


durations from wave files
utils/data/get_utt2dur.sh: successfully obtained utterance lengths
from sphere-file headers
utils/data/get_utt2dur.sh: computed data/train/utt2dur
feat-to-len ‘scp:head -n 10 data/train/feats.scp|’ ark,t:-
steps/nnet2/get_egs.sh: Every epoch, splitting the data up into 1
iterations,
steps/nnet2/get_egs.sh: giving samples-per-iteration of 37740 (you
requested 200000).

Getting validation and training subset examples.


steps/nnet2/get_egs.sh: extracting validation and training-subset
alignments.
copy-int-vector ark:- ark,t:-

LOG (copy-int-vector[5.5.164~1-9698]:main():copyint-vector.cc:83)
Copied 3696 vectors of int32. Getting subsets of validation examples
for diagnostics and combination.

Creating training examples


Generating training examples on disk

steps/nnet2/get_egs.sh: rearranging examples into parts for different


parallel jobs
steps/nnet2/get_egs.sh: Since iters-per-epoch == 1, just
concatenating the data.
Shuffling the order of training examples

(in order to avoid stressing the disk, these won’t all run at once).
steps/nnet2/get_egs.sh: Finished preparing training examples

steps/nnet2/train_tanh.sh: initializing neural net Training transition


probabilities and setting priors

steps/nnet2/train_tanh.sh: Will train for 15 + 5 epochs, equalling

steps/nnet2/train_tanh.sh: 15 + 5 = 20 iterations,
steps/nnet2/train_tanh.sh: (while reducing learning rate) + (with
constant learning rate).
Training neural net (pass 0)
Training neural net (pass 1)
Training neural net (pass 2)
......
Training neural net (pass 7)
Training neural net (pass 8)
Training neural net (pass 9)
Training neural net (pass 10)
Training neural net (pass 11)
Training neural net (pass 12)
Mixing up from 1904 to 5000 components
Training neural net (pass 13)
Training neural net (pass 14)
Training neural net (pass 15)
Training neural net (pass 16)
Training neural net (pass 17)
Training neural net (pass 18)
Training neural net (pass 19)
Setting num_iters_final=5

Getting average posterior for purposes of adjusting the priors.


Re-adjusting priors based on computed posteriors Done
Cleaning up data

steps/nnet2/remove_egs.sh: Finished deleting examples in


exp/tri4_nnet/egs
Removing most of the models

steps/nnet2/decode.sh --cmd run.pl --max-jobs-run 10


--nj 5 --num-threads 6 --transform-dir exp/tri3/decode_dev
exp/tri3/graph data/dev exp/tri4_nnet/decode_dev

steps/nnet2/decode.sh: feature type is lda


steps/nnet2/decode.sh: using transforms from exp/ tri3/decode_dev

steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10 --iter


final exp/tri3/graph exp/tri4_ nnet/decode_dev

steps/diagnostic/analyze_lats.sh: see stats in exp/


tri4_nnet/decode_dev/log/analyze_alignments.log Overall, lattice
depth (10,50,90-percentile)=(7,32,159) and mean=71.1

steps/diagnostic/analyze_lats.sh: see stats in exp/


tri4_nnet/decode_dev/log/analyze_lattice_depth_ stats.log

score best paths


score confidence and timing with sclite
Decoding done.

steps/nnet2/decode.sh --cmd run.pl --max-jobs-run 10


--nj 5 --num-threads 6 --transform-dir exp/tri3/decode_test
exp/tri3/graph data/test exp/tri4_nnet/decode_test

steps/nnet2/decode.sh: feature type is lda

steps/nnet2/decode.sh: using transforms from exp/ tri3/decode_test


steps/diagnostic/analyze_lats.sh --cmd run.pl --maxjobs-run 10 --iter
final exp/tri3/graph exp/tri4_ nnet/decode_test
steps/diagnostic/analyze_lats.sh: see stats in exp/
tri4_nnet/decode_test/log/analyze_alignments.log Overall, lattice
depth (10,50,90-percentile)=(7,35,183) and mean=84.0

steps/diagnostic/analyze_lats.sh: see stats in exp/


tri4_nnet/decode_test/log/analyze_lattice_depth_ stats.log

score best paths


score confidence and timing with sclite
Decoding done.
Chapter 13. Karel’s DNN
Multiple Stages
Kaldi currently contains two parallel implementations for DNN training; neither codebase is more "official" than the other, and both are still being developed in parallel. Dan's DNN was described in the last chapter. Here we go through Karel's recipe. Karel's DNN uses layer-wise pre-training based on RBMs (Restricted Boltzmann Machines), per-frame cross-entropy training, and sequence-discriminative training in a lattice framework, optimizing the sMBR criterion (State Minimum Bayes Risk). The systems are built on top of LDA+MLLT+fMLLR features obtained from auxiliary GMM (Gaussian Mixture Model) models. The whole DNN training runs on a single GPU using CUDA (Compute Unified Device Architecture, the parallel computing architecture created by Nvidia). Karel's DNN is located in the code sub-directories nnet/ and nnetbin/, and is primarily maintained by Karel Vesely. The script is pretty simple, just one line:
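In the TIMIT s5 recipe this is typically just the following call (the exact path is an assumption based on the standard egs/timit/s5 layout):

local/nnet/run_dnn.sh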

Karel’s DNN is split into several stages.


13.0 Stage 0: Store Features

Here is the run_dnn.sh.


Lines 31-50 store the 40-dimensional features (MFCC, LDA, MLLT, fMLLR with CMN) on disk to simplify the later training stages.
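A minimal sketch of this stage, reconstructed from the log in Section 13.5 ($train_cmd as before):

steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" --transform-dir exp/tri3/decode_test \
  data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" --transform-dir exp/tri3/decode_dev \
  data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" --transform-dir exp/tri3_ali \
  data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data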


13.1 Stage 1: Pre-training

The layer-wise RBM (Restricted Boltzmann Machine) pre-training algorithm is implemented as Contrastive Divergence with one step of Markov Chain Monte Carlo sampling (CD-1). The hyper-parameters of the recipe were tuned on the 100-hour Switchboard subset; for smaller databases, mainly the number of epochs N needs to be scaled, roughly to 100 hours / set size. The training is unsupervised, so it is sufficient to provide a single data directory with input features.

When training an RBM with Gaussian-Bernoulli units there is a high risk of weight explosion, especially with larger learning rates and thousands of hidden neurons. To avoid this, a mechanism has been implemented that compares the variance of the training data with the variance of the reconstructed data within a minibatch: if the reconstruction variance is more than twice as large, the weights are shrunk and the learning rate is temporarily reduced.
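The pre-training call itself, as it appears in the log in Section 13.5, is short:

steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 \
  data-fmllr-tri3/train exp/dnn4_pretrain-dbn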

13.2 Stage 2: Frame-level Cross-entropy

In this phase a DNN that classifies frames into triphone states is trained by mini-batch Stochastic Gradient Descent. The default is to use sigmoid hidden units, softmax output units, and fully connected AffineTransform layers. The default learning rate is 0.008 and the mini-batch size is 256; no momentum or regularization is used. The optimal learning rate depends on the type of hidden units: 0.008 for sigmoid, 0.00001 for tanh.
The input transform and the pre-trained DBN (i.e. Deep Belief Network, a stack of RBMs) are passed into the script using the options '--input-transform' and '--dbn'; only the output layer is initialized randomly. An early-stopping criterion is used to prevent over-fitting: the objective function is measured on a cross-validation set (i.e. a held-out set), so two pairs of feature/alignment directories are needed to perform the supervised training. A sketch of the call is shown below.
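A sketch of how this stage is typically invoked in Karel's run_dnn.sh; the concrete values ($dbn, $ali, the 90/10 train/cross-validation split directories) are assumptions following the recipe's usual conventions rather than values taken from the log above:

# 90/10 split of the fMLLR training data into training and cross-validation sets
utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train \
  data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10

ali=exp/tri3_ali                       # alignments used as frame labels
dbn=exp/dnn4_pretrain-dbn/6.dbn        # pre-trained DBN (file name depends on the configured depth)
steps/nnet/train.sh --dbn $dbn --hid-layers 0 --learn-rate 0.008 \
  data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10 \
  data/lang $ali $ali exp/dnn4_pretrain-dbn_dnn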

13.3 Stage 3: Sequence-discriminative Training

In this phase the neural network is trained to classify whole sentences correctly, which is closer to the overall ASR objective than frame-level training. The goal of sequence-discriminative training is to maximize the expected accuracy of state labels derived from the reference transcriptions, using a lattice framework to represent the competing hypotheses. Training is done by Stochastic Gradient Descent (SGD) with per-utterance updates, a low constant learning rate of 1e-5, and 3-5 epochs. Faster convergence is observed when the lattices are re-generated after the first epoch.

MMI, bMMI, MPE, and sMBR training are all supported. In sMBR optimization, silence frames are excluded from the accumulation of approximate accuracies.

13.4 Stage 4: Iteration of sMBR

In this phase we iterate the sMBR training to get a better result.

if [ $stage -le 4 ]; then
  # Re-train the DNN by 6 iterations of sMBR
  steps/nnet/train_mpe.sh --cmd "$cuda_cmd" --num-iters 6 --acwt $acwt \
    --do-smbr true \
    $data_fmllr/train data/lang $srcdir ${srcdir}_ali ${srcdir}_denlats $dir || exit 1
  # Decode
  for ITER in 1 6; do
    steps/nnet/decode.sh --nj 20 --cmd "$decode_cmd" \
      --nnet $dir/${ITER}.nnet --acwt $acwt \
      $gmmdir/graph $data_fmllr/test $dir/decode_test_it${ITER} || exit 1
    steps/nnet/decode.sh --nj 20 --cmd "$decode_cmd" \
      --nnet $dir/${ITER}.nnet --acwt $acwt \
      $gmmdir/graph $data_fmllr/dev $dir/decode_dev_it${ITER} || exit 1
  done
fi

echo Success
exit 0

13.5 Final Output

===================================================

DNN Hybrid Training & Decoding (Karel’s recipe)


===================================================

steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --max-jobs-run 10 --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
utils/copy_data_dir.sh: copied data from data/test to data-fmllr-tri3/test
utils/validate_data_dir.sh: Successfully validated data-directory data-fmllr-tri3/test
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test

steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --max-jobs-run 10 --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
utils/copy_data_dir.sh: copied data from data/dev to data-fmllr-tri3/dev
utils/validate_data_dir.sh: Successfully validated data-directory data-fmllr-tri3/dev
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev

steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --max-jobs-run 10 --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
utils/copy_data_dir.sh: copied data from data/train to data-fmllr-tri3/train
utils/validate_data_dir.sh: Successfully validated data-directory data-fmllr-tri3/train
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali

Speakers, src=462, trn=416, cv=46 /tmp/sv_WRwQQ/speakers_cv
utils/data/subset_data_dir.sh: reducing #utt from 3696 to 3328
utils/data/subset_data_dir.sh: reducing #utt from 3696 to 368
# steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn

# Started at Fri Jan 4 13:23:49 CST 2019


#

steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-


tri3/train exp/dnn4_pretrain-dbn

# INFO
steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a
stack of RBMs

dir : exp/dnn4_pretrain-dbn
Train-set : data-fmllr-tri3/train ‘3696’
LOG ([5.5.164~1-9698]:main():cuda-gpu-available. cc:49)
### IS CUDA GPU AVAILABLE? ‘HP’ ###

WARNING ([5.5.164~1-9698]:SelectGpuId():cu-device. cc:203) Not in


compute-exclusive mode. Suggestion: use ‘nvidia-smi -c 3’ to set
compute exclusive mode

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:323)


Selecting from 1 GPUs
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338)
cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M,
total:8116M, free/total:0.954946

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:385) Trying


to select device: 0 (automatically), mem_ratio: 0.954946

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:404)


Success selecting device 0 free mem ratio: 0.954946

LOG ([5.5.164~1-9698]:FinalizeActiveGpu():cu-device.cc:258) The


active GPU is [0]: GeForce GTX 1070 free:7684M, used:431M,
total:8116M, free/total:0.946814 version 6.1
### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ##

### Testing CUDA setup with a small computation (setup = cuda-


toolkit + gpu-driver + Kaldi):
### Test OK!

# PREPARING FEATURES
copy-feats --compress=true scp:data-fmllr-tri3/train/feats.scp ark,scp:/tmp/Kaldi.1A01/train.ark,exp/dnn4_pretrain-dbn/train_sorted.scp
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
# 'apply-cmvn' not used,
feat-to-dim 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- |' -
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
WARNING (feat-to-dim[5.5.164~1-9698]:Close():Kaldi-io.cc:515) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 36096
# feature dim : 40 (input of 'feature_transform')
+ default 'feature_transform_proto' with splice +/-5 frames


nnet-initialize --binary=false exp/dnn4_pretrain-dbn/splice5.proto exp/dnn4_pretrain-dbn/tr_splice5.nnet
9698]:Init():nnet-nnet.cc:314) <Splice> <InputDim> 40 <OutputDim> 440 <BuildVector> -5:5 </BuildVector>
LOG (nnet-initialize[5.5.164~1-9698]:main():nnet-initialize.cc:63) Written initialized model to exp/dnn4_pretrain-dbn/tr_splice5.nnet
# compute normalization stats from 10k sentences
compute-cmvn-stats ark:- exp/dnn4_pretrain-dbn/cmvn-g.stats
nnet-forward --print-args=true --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M, total:8116M, free/total:0.954946
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954946
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954946
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7684M, used:431M, total:8116M, free/total:0.946814 version 6.1
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.0376807min, (fps 497522)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to exp/dnn4_pretrain-dbn/cmvn-g.stats
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.

# + normalization of NN-input at 'exp/dnn4_pretrain-dbn/tr_splice5_cmvn-g.nnet'
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5.nnet
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating cmvn-to-nnet exp/dnn4_pretrain-dbn/cmvn-g.stats -|
cmvn-to-nnet exp/dnn4_pretrain-dbn/cmvn-g.stats -
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to:
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/tr_splice5_cmvn-g.nnet

### Showing the final 'feature_transform':
nnet-info exp/dnn4_pretrain-dbn/tr_splice5_cmvn-g.nnet
LOG (nnet-info[5.5.164~1-9698]:main():nnet-info.cc:57) Printed info about exp/dnn4_pretrain-dbn/tr_splice5_cmvn-g.nnet

num-components 3
input-dim 40
output-dim 440
number-of-parameters 0.00088 millions
component 1 : <Splice>, input-dim 40, output-dim 440, frame_offsets [ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
component 2 : <AddShift>, input-dim 440, output-dim 440, shift_data ( min -0.168308, max 0.0655773, mean -0.00402776, stddev 0.0406097, skewness -2.33079, kurtosis 6.28039 ) , lr-coef 0
component 3 : <Rescale>, input-dim 440, output-dim 440, scale_data ( min 0.320696, max 0.967125, mean 0.762129, stddev 0.15756, skewness -0.640707, kurtosis -0.34954 ) , lr-coef 0

###

# PRE-TRAINING RBM LAYER 1
# initializing 'exp/dnn4_pretrain-dbn/1.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/1.rbm' (input gauss, lrate 0.01, iters 40)
# converting RBM to exp/dnn4_pretrain-dbn/1.dbn
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/1.rbm exp/dnn4_pretrain-dbn/1.dbn
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to exp/dnn4_pretrain-dbn/1.dbn

# PRE-TRAINING RBM LAYER 2
# computing cmvn stats 'exp/dnn4_pretrain-dbn/2.cmvn' for RBM initialization
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:366M, total:8116M, free/total:0.954884
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954884
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954884
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7684M, used:432M, total:8116M, free/total:0.946752 version 6.1
nnet-concat exp/dnn4_pretrain-dbn/final.feature_transform exp/dnn4_pretrain-dbn/1.dbn -
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/final.feature_transform
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating exp/dnn4_pretrain-dbn/1.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.0902158min, (fps 207802)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/2.cmvn
# initializing 'exp/dnn4_pretrain-dbn/2.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/2.rbm' (lrate 0.4, iters 20)
# appending RBM to exp/dnn4_pretrain-dbn/2.dbn
nnet-concat exp/dnn4_pretrain-dbn/1.dbn 'rbm-convert-to-nnet exp/dnn4_pretrain-dbn/2.rbm - |' exp/dnn4_pretrain-dbn/2.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/1.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating rbm-convert-to-nnet exp/dnn4_pretrain-dbn/2.rbm - |
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/2.rbm -
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/2.dbn

# PRE-TRAINING RBM LAYER 3
# computing cmvn stats 'exp/dnn4_pretrain-dbn/3.cmvn' for RBM initialization
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7751M, used:365M, total:8116M, free/total:0.954984
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954984
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954984
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7685M, used:431M, total:8116M, free/total:0.946853 version 6.1
nnet-concat exp/dnn4_pretrain-dbn/final.feature_transform exp/dnn4_pretrain-dbn/2.dbn -
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/final.feature_transform
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating exp/dnn4_pretrain-dbn/2.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.1077min, (fps 174068)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/3.cmvn
# initializing 'exp/dnn4_pretrain-dbn/3.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/3.rbm' (lrate 0.4, iters 20)
# appending RBM to exp/dnn4_pretrain-dbn/3.dbn
nnet-concat exp/dnn4_pretrain-dbn/2.dbn 'rbm-convert-to-nnet exp/dnn4_pretrain-dbn/3.rbm - |' exp/dnn4_pretrain-dbn/3.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/2.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating rbm-convert-to-nnet exp/dnn4_pretrain-dbn/3.rbm - |
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/3.rbm -
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/3.dbn
# PRE-TRAINING RBM LAYER 4
# computing cmvn stats 'exp/dnn4_pretrain-dbn/4.cmvn' for RBM initialization
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M, total:8116M, free/total:0.954946
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954946
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954946
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7684M, used:431M, total:8116M, free/total:0.946814 version 6.1
nnet-concat exp/dnn4_pretrain-dbn/final.feature_transform exp/dnn4_pretrain-dbn/3.dbn -
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/final.feature_transform
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating exp/dnn4_pretrain-dbn/3.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.11935min, (fps 157076)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/4.cmvn
# initializing 'exp/dnn4_pretrain-dbn/4.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/4.rbm' (lrate 0.4, iters 20)
# appending RBM to exp/dnn4_pretrain-dbn/4.dbn
nnet-concat exp/dnn4_pretrain-dbn/3.dbn 'rbm-convert-to-nnet exp/dnn4_pretrain-dbn/4.rbm - |' exp/dnn4_pretrain-dbn/4.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/3.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating rbm-convert-to-nnet exp/dnn4_pretrain-dbn/4.rbm - |
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/4.rbm -
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/4.dbn
# PRE-TRAINING RBM LAYER 5
# computing cmvn stats 'exp/dnn4_pretrain-dbn/5.cmvn' for RBM initialization
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:366M, total:8116M, free/total:0.954907
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954907
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954907
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7684M, used:432M, total:8116M, free/total:0.946775 version 6.1
nnet-concat exp/dnn4_pretrain-dbn/final.feature_transform exp/dnn4_pretrain-dbn/4.dbn -
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/final.feature_transform
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating exp/dnn4_pretrain-dbn/4.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.13468min, (fps 139197)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/5.cmvn
# initializing 'exp/dnn4_pretrain-dbn/5.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/5.rbm' (lrate 0.4, iters 20)
# appending RBM to exp/dnn4_pretrain-dbn/5.dbn
nnet-concat exp/dnn4_pretrain-dbn/4.dbn 'rbm-convert-to-nnet exp/dnn4_pretrain-dbn/5.rbm - |' exp/dnn4_pretrain-dbn/5.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/4.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating rbm-convert-to-nnet exp/dnn4_pretrain-dbn/5.rbm - |
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/5.rbm -
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/5.dbn
# PRE-TRAINING RBM LAYER 6
# computing cmvn stats 'exp/dnn4_pretrain-dbn/6.cmvn' for RBM initialization
9698]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
9698]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 1 GPUs
9698]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): GeForce GTX 1070 free:7751M, used:365M, total:8116M, free/total:0.954999
9698]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.954999
9698]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.954999
9698]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: GeForce GTX 1070 free:7685M, used:431M, total:8116M, free/total:0.946868 version 6.1
nnet-concat exp/dnn4_pretrain-dbn/final.feature_transform exp/dnn4_pretrain-dbn/5.dbn -
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/final.feature_transform
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating exp/dnn4_pretrain-dbn/5.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:-
LOG (copy-feats[5.5.164~1-9698]:main():copy-feats.cc:143) Copied 3696 feature matrices.
LOG (nnet-forward[5.5.164~1-9698]:main():nnet-forward.cc:192) Done 3696 files in 0.148966min, (fps 125848)
9698]:main():compute-cmvn-stats.cc:168) Wrote global CMVN stats to standard output
9698]:main():compute-cmvn-stats.cc:171) Done accumulating CMVN stats for 3696 utterances; 0 had errors.
LOG (cmvn-to-nnet[5.5.164~1-9698]:main():cmvn-to-nnet.cc:114) Written cmvn in 'nnet1' model to: exp/dnn4_pretrain-dbn/6.cmvn
# initializing 'exp/dnn4_pretrain-dbn/6.rbm.init'
# pretraining 'exp/dnn4_pretrain-dbn/6.rbm' (lrate 0.4, iters 20)
# appending RBM to exp/dnn4_pretrain-dbn/6.dbn
nnet-concat exp/dnn4_pretrain-dbn/5.dbn 'rbm-convert-to-nnet exp/dnn4_pretrain-dbn/6.rbm - |' exp/dnn4_pretrain-dbn/6.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/5.dbn
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:65) Concatenating rbm-convert-to-nnet exp/dnn4_pretrain-dbn/6.rbm - |
rbm-convert-to-nnet exp/dnn4_pretrain-dbn/6.rbm -
LOG (rbm-convert-to-nnet[5.5.164~1-9698]:main():rbm-convert-to-nnet.cc:69) Written model to
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat.cc:82) Written model to exp/dnn4_pretrain-dbn/6.dbn

# REPORT
# RBM pre-training progress (line per-layer)

exp/dnn4_pretrain-dbn/log/rbm.1.log:progress: [59.4867 53.2198


52.0588 51.4113 51.068 50.8504 50.702 50.5875 50.5049 50.4403
50.3839 50.3431 50.3531 50.3546 50.3716 50.3526 50.3494 50.3537
50.3309 50.3397 50.3387 50.3246 50.325 50.3117 ]

exp/dnn4_pretrain-dbn/log/rbm.2.log:progress: [6.72542 5.54473


5.45044 5.38909 5.33717 5.28904 5.23951 5.19439 5.14991 5.10887
5.06825 5.04026 ]

exp/dnn4_pretrain-dbn/log/rbm.3.log:progress: [5.87008 4.54053


4.45563 4.41395 4.37866 4.34733 4.31496 4.28637 4.25737 4.23046
4.20576 4.18703 ]

exp/dnn4_pretrain-dbn/log/rbm.4.log:progress: [4.41877 3.54689


3.48168 3.45219 3.42853 3.40722 3.38683 3.36887 3.35155 3.3345
3.31848 3.30785 ]

exp/dnn4_pretrain-dbn/log/rbm.5.log:progress: [4.11132 3.10282


3.0477 3.02518 3.00828 2.99136 2.97569 2.96107 2.94746 2.93379
2.92229 2.91543 ]

exp/dnn4_pretrain-dbn/log/rbm.6.log:progress: [3.17977 2.53633


2.4901 2.47436 2.46173 2.44852 2.43675 2.42552 2.41665 2.40667
2.39865 2.39314 ]

Pre-training finished.
# Removing features tmpdir /tmp/Kaldi.1A01 @ HP train.ark
# Accounting: time=1934 threads=1

# Ended (code 0) at Fri Jan 4 13:56:03 CST 2019, elapsed time 1934
seconds
# steps/nnet/train.sh --feature-transform exp/dnn4_ pretrain-
dbn/final.feature_transform --dbn exp/dnn4_ pretrain-dbn/6.dbn --hid-
layers 0 --learn-rate 0.008 data-fmllr-tri3/train_tr90 data-fmllr-
tri3/train_ cv10 data/lang exp/tri3_ali exp/tri3_ali exp/dnn4_ pretrain-
dbn_dnn

# Started at Fri Jan 4 13:56:03 CST 2019


#

steps/nnet/train.sh --feature-transform exp/dnn4_ pretrain-


dbn/final.feature_transform --dbn exp/dnn4_ pretrain-dbn/6.dbn --hid-
layers 0 --learn-rate 0.008 data-fmllr-tri3/train_tr90 data-fmllr-
tri3/train_ cv10 data/lang exp/tri3_ali exp/tri3_ali exp/dnn4_ pretrain-
dbn_dnn

# INFO
steps/nnet/train.sh : Training Neural Network dir : exp/dnn4_pretrain-
dbn_dnn

Train-set : data-fmllr-tri3/train_tr90 3328, exp/tri3_ali


CV-set : data-fmllr-tri3/train_cv10 368 exp/tri3_ali
LOG ([5.5.164~1-9698]:main():cuda-gpu-available. cc:49)
### IS CUDA GPU AVAILABLE? ‘HP’ ###

WARNING ([5.5.164~1-9698]:SelectGpuId():cu-device. cc:203) Not in


compute-exclusive mode. Suggestion: use ‘nvidia-smi -c 3’ to set
compute exclusive mode

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:323)


Selecting from 1 GPUs

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device.cc:338)
cudaSetDevice(0): GeForce GTX 1070 free:7750M, used:365M,
total:8116M, free/total:0.954953

LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:385) Trying


to select device: 0 (automatically), mem_ratio: 0.954953
LOG ([5.5.164~1-9698]:SelectGpuIdAuto():cu-device. cc:404)
Success selecting device 0 free mem ratio: 0.954953

LOG ([5.5.164~1-9698]:FinalizeActiveGpu():cu-device.cc:258) The


active GPU is [0]: GeForce GTX 1070 free:7684M, used:431M,
total:8116M, free/total:0.946822 version 6.1

### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ##


### Testing CUDA setup with a small computation (setup = cuda-
toolkit + gpu-driver + Kaldi):
### Test OK!
# PREPARING ALIGNMENTS
Using PDF targets from dirs 'exp/tri3_ali' 'exp/tri3_ali'
hmm-info exp/tri3_ali/final.mdl
copy-transition-model --binary=false exp/tri3_ali/final.mdl exp/dnn4_pretrain-dbn_dnn/final.mdl
9698]:main():copy-transition-model.cc:62) Copied transition model.

# PREPARING FEATURES
# re-saving features to local disk,

copy-feats --compress=true scp:data-fmllr-tri3/ train_tr90/feats.scp


ark,scp:/tmp/Kaldi.pqzk/train. ark,exp/dnn4_pretrain-
dbn_dnn/train_sorted.scp

LOG (copy-feats[5.5.164~1-9698]:main():copy-feats. cc:143) Copied


3328 feature matrices.

copy-feats --compress=true scp:data-fmllr-tri3/ train_cv10/feats.scp


ark,scp:/tmp/Kaldi.pqzk/ cv.ark,exp/dnn4_pretrain-dbn_dnn/cv.scp

LOG (copy-feats[5.5.164~1-9698]:main():copy-feats. cc:143) Copied


368 feature matrices.
# importing feature settings from dir ‘exp/dnn4_pretrain-dbn’
# cmvn_opts=’’ delta_opts=’’ ivector_dim=’’ # ‘apply-cmvn’ is not
used,

feat-to-dim ‘ark:copy-feats
scp:exp/dnn4_pretraindbn_dnn/train.scp.10k ark:- |’ -

copy-feats scp:exp/dnn4_pretrain-dbn_dnn/train. scp.10k ark:-


WARNING (feat-to-dim[5.5.164~1-9698]:Close():Kaldiio.cc:515) Pipe
copy-feats scp:exp/dnn4_pretraindbn_dnn/train.scp.10k ark:- | had
nonzero return status 36096

# feature dim : 40 (input of ‘feature_transform’) # importing


‘feature_transform’ from ‘exp/dnn4_pretrain-
dbn/final.feature_transform’

### Showing the final ‘feature_transform’:


nnet-info exp/dnn4_pretrain-dbn_dnn/imported_final.
feature_transform

LOG (nnet-info[5.5.164~1-9698]:main():nnet-info. cc:57) Printed info


about exp/dnn4_pretrain-dbn_dnn/ imported_final.feature_transform

num-components 3
input-dim 40
output-dim 440
number-of-parameters 0.00088 millions
component 1 : <Splice>, input-dim 40, output-dim 440, frame_offsets [ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
component 2 : <AddShift>, input-dim 440, output-dim 440, shift_data ( min -0.168308, max 0.0655773, mean -0.00402776, stddev 0.0406097, skewness -2.33079, kurtosis 6.28039 ) , lr-coef 0
component 3 : <Rescale>, input-dim 440, output-dim 440, scale_data ( min 0.320696, max 0.967125, mean 0.762129, stddev 0.15756, skewness -0.640707, kurtosis -0.34954 ) , lr-coef 0

###

# NN-INITIALIZATION
# getting input/output dims :

feat-to-dim ‘ark:copy-feats
scp:exp/dnn4_pretraindbn_dnn/train.scp.10k ark:- | nnet-forward
“nnetconcat exp/dnn4_pretrain-dbn_dnn/final.feature_ transform
‘\’’exp/dnn4_pretrain-dbn/6.dbn’\’’ -|” ark:- ark:- |’ -

copy-feats scp:exp/dnn4_pretrain-dbn_dnn/train. scp.10k ark:-

nnet-forward “nnet-concat exp/dnn4_pretrain-dbn_dnn/


final.feature_transform ‘exp/dnn4_pretrain-dbn/6. dbn’ -|” ark:- ark:-

LOG (nnet-forward[5.5.164~1-9698]:SelectGpuId():cudevice.cc:128)
Manually selected to compute on CPU. nnet-concat
exp/dnn4_pretrain-dbn_dnn/final.feature_ transform
exp/dnn4_pretrain-dbn/6.dbn -

LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:53)


Reading exp/dnn4_pretrain-dbn_dnn/final.feature_transform

LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:65)


Concatenating exp/dnn4_pretrain-dbn/6.dbn

LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:82) Written


model to
WARNING (feat-to-dim[5.5.164~1-9698]:Close():Kaldiio.cc:515) Pipe
copy-feats scp:exp/dnn4_pretrain-dbn_ dnn/train.scp.10k ark:- | nnet-
forward “nnet-concat exp/dnn4_pretrain-
dbn_dnn/final.feature_transform ‘exp/dnn4_pretrain-dbn/6.dbn’ -|”
ark:- ark:- | had nonzero return status 36096
# genrating network prototype exp/dnn4_pretrain-dbn_ dnn/nnet.proto
# initializing the NN ‘exp/dnn4_pretrain-dbn_dnn/ nnet.proto’ ->
‘exp/dnn4_pretrain-dbn_dnn/nnet.init’

nnet-initialize --seed=777 exp/dnn4_pretrain-dbn_dnn/nnet.proto exp/dnn4_pretrain-dbn_dnn/nnet.init
9698]:Init():nnet-nnet.cc:314) <AffineTransform> <InputDim> 1024 <OutputDim> 1904 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.091474
9698]:Init():nnet-nnet.cc:314) <Softmax> <InputDim> 1904 <OutputDim> 1904
9698]:Init():nnet-nnet.cc:314) </NnetProto>
LOG (nnet-initialize[5.5.164~1-9698]:main():nnet-initialize.cc:63) Written initialized model to exp/dnn4_pretrain-dbn_dnn/nnet.init
nnet-concat exp/dnn4_pretrain-dbn/6.dbn exp/dnn4_ pretrain-
dbn_dnn/nnet.init exp/dnn4_pretrain-dbn_ dnn/nnet_dbn_dnn.init
LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:53)
Reading exp/dnn4_pretrain-dbn/6.dbn

LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:65)


Concatenating exp/dnn4_pretrain-dbn_dnn/nnet. init

LOG (nnet-concat[5.5.164~1-9698]:main():nnet-concat. cc:82) Written


model to exp/dnn4_pretrain-dbn_dnn/ nnet_dbn_dnn.init

# RUNNING THE NN-TRAINING SCHEDULER

steps/nnet/train_scheduler.sh --feature-transform exp/dnn4_pretrain-


dbn_dnn/final.feature_transform
--learn-rate 0.008 exp/dnn4_pretrain-dbn_dnn/nnet_ dbn_dnn.init
ark:copy-feats scp:exp/dnn4_pretraindbn_dnn/train.scp ark:- |
ark:copy-feats scp:exp/ dnn4_pretrain-dbn_dnn/cv.scp ark:- | ark:ali-
to-pdf exp/tri3_ali/final.mdl “ark:gunzip -c exp/tri3_ali/ ali.*.gz |” ark:- |
ali-to-post ark:- ark:- | ark:alito-pdf exp/tri3_ali/final.mdl “ark:gunzip -c
exp/ tri3_ali/ali.*.gz |” ark:- | ali-to-post ark:- ark:- | exp/dnn4_pretrain-
dbn_dnn

CROSSVAL PRERUN AVG.LOSS 7.7600 (Xent),

ITERATION 01: TRAIN AVG.LOSS 2.0913, (lrate0.008), CROSSVAL


AVG.LOSS 1.9001, nnet accepted (nnet_dbn_
dnn_iter01_learnrate0.008_tr2.0913_cv1.9001)

ITERATION 02: TRAIN AVG.LOSS 1.3923, (lrate0.008), CROSSVAL


AVG.LOSS 1.7871, nnet accepted (nnet_dbn_
dnn_iter02_learnrate0.008_tr1.3923_cv1.7871)

ITERATION 03: TRAIN AVG.LOSS 1.1886, (lrate0.008), CROSSVAL


AVG.LOSS 1.7467, nnet accepted (nnet_dbn_
dnn_iter03_learnrate0.008_tr1.1886_cv1.7467)

ITERATION 04: TRAIN AVG.LOSS 1.0431, (lrate0.008), CROSSVAL


AVG.LOSS 1.7458, nnet accepted (nnet_dbn_
dnn_iter04_learnrate0.008_tr1.0431_cv1.7458) ITERATION 05:
TRAIN AVG.LOSS 0.8836, (lrate0.004), CROSSVAL AVG.LOSS
1.6402, nnet accepted (nnet_dbn_
dnn_iter05_learnrate0.004_tr0.8836_cv1.6402)

ITERATION 06: TRAIN AVG.LOSS 0.8084, (lrate0.002), CROSSVAL


AVG.LOSS 1.5782, nnet accepted (nnet_dbn_
dnn_iter06_learnrate0.002_tr0.8084_cv1.5782)

ITERATION 07: TRAIN AVG.LOSS 0.7764, (lrate0.001), CROSSVAL


AVG.LOSS 1.5362, nnet accepted (nnet_dbn_
dnn_iter07_learnrate0.001_tr0.7764_cv1.5362)
ITERATION 08: TRAIN AVG.LOSS 0.7625, (lrate0.0005),
CROSSVAL AVG.LOSS 1.5088, nnet accepted (nnet_dbn_
dnn_iter08_learnrate0.0005_tr0.7625_cv1.5088)

ITERATION 09: TRAIN AVG.LOSS 0.7554, (lrate0.00025),


CROSSVAL AVG.LOSS 1.4934, nnet accepted (nnet_dbn_
dnn_iter09_learnrate0.00025_tr0.7554_cv1.4934)

ITERATION 10: TRAIN AVG.LOSS 0.7511, (lrate0.000125),


CROSSVAL AVG.LOSS 1.4856, nnet accepted (nnet_dbn_
dnn_iter10_learnrate0.000125_tr0.7511_cv1.4856)

ITERATION 11: TRAIN AVG.LOSS 0.7481, (lrate6.25e-05),


CROSSVAL AVG.LOSS 1.4820, nnet accepted (nnet_dbn_
dnn_iter11_learnrate6.25e-05_tr0.7481_cv1.4820) ITERATION 12:
TRAIN AVG.LOSS 0.7461, (lrate3.125e-05), CROSSVAL AVG.LOSS
1.4804, nnet accepted (nnet_dbn_ dnn_iter12_learnrate3.125e-
05_tr0.7461_cv1.4804)

ITERATION 13: TRAIN AVG.LOSS 0.7448, (lrate1.5625e-05),


CROSSVAL AVG.LOSS 1.4797, nnet accepted (nnet_dbn_
dnn_iter13_learnrate1.5625e-05_tr0.7448_cv1.4797) finished, too
small rel. improvement 0.000472845

steps/nnet/train_scheduler.sh: Succeeded training the Neural


Network : ‘exp/dnn4_pretrain-dbn_dnn/final.nnet’

steps/nnet/train.sh: Successfuly finished. ‘exp/ dnn4_pretrain-


dbn_dnn’

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --acwt 0.2


exp/tri3/graph data-fmllr-tri3/ test exp/dnn4_pretrain-
dbn_dnn/decode_test

# Removing features tmpdir /tmp/Kaldi.pqzk @ HP cv.ark


train.ark
# Accounting: time=333 threads=1
# Ended (code 0) at Fri Jan 4 14:01:36 CST 2019, elapsed time 333
seconds

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --acwt 0.2


exp/tri3/graph data-fmllr-tri3/dev exp/dnn4_pretrain-
dbn_dnn/decode_dev

steps/nnet/align.sh --nj 20 --cmd run.pl --max-jobsrun 10 data-fmllr-


tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-
dbn_dnn_ali

steps/nnet/align.sh: aligning data ‘data-fmllr-tri3/ train’ using


nnet/model ‘exp/dnn4_pretrain-dbn_dnn’, putting alignments in
‘exp/dnn4_pretrain-dbn_dnn_ ali’

steps/nnet/align.sh: done aligning data.

steps/nnet/make_denlats.sh --nj 20 --cmd run.pl --maxjobs-run 10 --


acwt 0.2 --lattice-beam 10.0 --beam 18.0 data-fmllr-tri3/train data/lang
exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_denlats
Making unigram grammar FST in exp/dnn4_pretrain-dbn_
dnn_denlats/lang

Compiling decoding graph in exp/dnn4_pretrain-dbn_


dnn_denlats/dengraph

tree-info exp/dnn4_pretrain-dbn_dnn/tree
tree-info exp/dnn4_pretrain-dbn_dnn/tree

fsttablecompose exp/dnn4_pretrain-dbn_dnn_denlats/
lang/L_disambig.fst exp/dnn4_pretrain-dbn_dnn_denlats/lang/G.fst

fstminimizeencoded
fstdeterminizestar --use-log=true
fstpushspecial

fstisstochastic exp/dnn4_pretrain-dbn_dnn_denlats/ lang/tmp/LG.fst


1.27271e-05 1.27271e-05
fstcomposecontext --context-size=3 --central-position=1 --read-
disambig-syms=exp/dnn4_pretrain-dbn_
dnn_denlats/lang/phones/disambig.int --write-disambig-
syms=exp/dnn4_pretrain-dbn_dnn_denlats/lang/
tmp/disambig_ilabels_3_1.int exp/dnn4_pretrain-dbn_
dnn_denlats/lang/tmp/ilabels_3_1.17584 exp/dnn4_pretrain-
dbn_dnn_denlats/lang/tmp/LG.fst

fstisstochastic exp/dnn4_pretrain-dbn_dnn_denlats/
lang/tmp/CLG_3_1.fst
1.27657e-05 0

make-h-transducer --disambig-syms-out=exp/dnn4_pretrain-
dbn_dnn_denlats/dengraph/disambig_tid.int
--transition-scale=1.0 exp/dnn4_pretrain-dbn_dnn_
denlats/lang/tmp/ilabels_3_1 exp/dnn4_pretrain-dbn_ dnn/tree
exp/dnn4_pretrain-dbn_dnn/final.mdl

fstrmepslocal

fsttablecompose exp/dnn4_pretrain-dbn_dnn_denlats/
dengraph/Ha.fst exp/dnn4_pretrain-dbn_dnn_denlats/
lang/tmp/CLG_3_1.fst

fstrmsymbols exp/dnn4_pretrain-
dbn_dnn_denlats/dengraph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/dnn4_pretrain-dbn_dnn_denlats/
dengraph/HCLGa.fst
0.000473932 -0.000485819

add-self-loops --self-loop-scale=0.1 --reorder=true exp/dnn4_pretrain-


dbn_dnn/final.mdl exp/dnn4_pretrain-
dbn_dnn_denlats/dengraph/HCLGa.fst
steps/nnet/make_denlats.sh: generating denlats from data ‘data-fmllr-
tri3/train’, putting lattices in ‘exp/dnn4_pretrain-dbn_dnn_denlats’
steps/nnet/make_denlats.sh: done generating denominator lattices.
steps/nnet/train_mpe.sh --cmd run.pl --gpu 1 --numiters 6 --acwt 0.2 -
-do-smbr true data-fmllr-tri3/ train data/lang exp/dnn4_pretrain-
dbn_dnn exp/dnn4_ pretrain-dbn_dnn_ali exp/dnn4_pretrain-
dbn_dnn_denlats exp/dnn4_pretrain-dbn_dnn_smbr

Pass 1 (learnrate 0.00001)


TRAINING FINISHED; Time taken = 1.02159 min; processed
18350.9 frames per second.
Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0
with other errors.

Overall average frame-accuracy is 0.871751 over 1124823 frames.


Pass 2 (learnrate 1e-05)

TRAINING FINISHED; Time taken = 1.02279 min; processed


18329.4 frames per second.
Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0
with other errors.
Overall average frame-accuracy is 0.878631 over 1124823 frames.

Pass 3 (learnrate 1e-05)


TRAINING FINISHED; Time taken = 1.01812 min; processed 18413.4 frames per second.

Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0


with other errors.
Overall average frame-accuracy is 0.882521 over 1124823 frames.
Pass 4 (learnrate 1e-05)
TRAINING FINISHED; Time taken = 1.02529 min; processed
18284.6 frames per second.
Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0
with other errors.
Overall average frame-accuracy is 0.885317 over 1124823 frames.
Pass 5 (learnrate 1e-05)
TRAINING FINISHED; Time taken = 1.03147 min; processed
18175.1 frames per second.
Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0
with other errors.
Overall average frame-accuracy is 0.887541 over 1124823 frames.
Pass 6 (learnrate 1e-05)
TRAINING FINISHED; Time taken = 1.03413 min; processed
18128.3 frames per second.
Done 3696 files, 0 with no reference alignments, 0 with no lattices, 0
with other errors.

Overall average frame-accuracy is 0.889426 over 1124823 frames.


MPE/sMBR training finished

Re-estimating priors by forwarding 10k utterances from training set.

steps/nnet/make_priors.sh --cmd run.pl --max-jobsrun 10 --nj 20 data-


fmllr-tri3/train exp/dnn4_pretrain-dbn_dnn_smbr

Accumulating prior stats by forwarding 'data-fmllr-tri3/train' with 'exp/dnn4_pretrain-dbn_dnn_smbr'
Succeeded creating prior counts 'exp/dnn4_pretrain-dbn_dnn_smbr/prior_counts' from 'data-fmllr-tri3/train'
steps/nnet/train_mpe.sh: Done. 'exp/dnn4_pretrain-dbn_dnn_smbr'

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --nnet


exp/dnn4_pretrain-dbn_dnn_smbr/1.nnet
--acwt 0.2 exp/tri3/graph data-fmllr-tri3/test exp/ dnn4_pretrain-
dbn_dnn_smbr/decode_test_it1

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --nnet


exp/dnn4_pretrain-dbn_dnn_smbr/1.nnet
--acwt 0.2 exp/tri3/graph data-fmllr-tri3/dev exp/ dnn4_pretrain-
dbn_dnn_smbr/decode_dev_it1

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --nnet


exp/dnn4_pretrain-dbn_dnn_smbr/6.nnet
--acwt 0.2 exp/tri3/graph data-fmllr-tri3/test exp/ dnn4_pretrain-
dbn_dnn_smbr/decode_test_it6

steps/nnet/decode.sh --nj 20 --cmd run.pl --max-jobsrun 10 --nnet


exp/dnn4_pretrain-dbn_dnn_smbr/6.nnet
--acwt 0.2 exp/tri3/graph data-fmllr-tri3/dev exp/ dnn4_pretrain-
dbn_dnn_smbr/decode_dev_it6

Success
Chapter 14. Final Results
from MonoPhone to DNN
===============================================================

Getting Results [see RESULTS file]


===================================================
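The fields in each line below follow the standard sclite summary layout: after the sentence and word counts (e.g. 400 15057) come the Corr, Sub, Del, Ins, Err and S.Err percentages, so the reported %WER is simply Sub + Del + Ins. For the first monophone line, 19.6 + 8.8 + 3.4 = 31.8. A quick sanity check with plain shell/awk, using the numbers copied from that line:

# re-derive %WER from the Sub, Del and Ins columns of the mono dev line
echo "71.6 19.6 8.8 3.4 31.8 100.0" | awk '{printf "WER = %.1f\n", $2 + $3 + $4}'
# prints: WER = 31.8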

%WER 31.8 | 400 15057 | 71.6 19.6 8.8 3.4 31.8 100.0 | -0.506 | exp/mono/decode_dev/score_5/ctm_39phn.filt.sys
%WER 24.8 | 400 15057 | 79.2 15.8 5.0 4.0 24.8 99.8 | -0.147 | exp/tri1/decode_dev/score_10/ctm_39phn.filt.sys
%WER 22.6 | 400 15057 | 81.0 14.2 4.7 3.6 22.6 99.5 | -0.217 | exp/tri2/decode_dev/score_10/ctm_39phn.filt.sys
%WER 20.5 | 400 15057 | 82.6 12.8 4.7 3.1 20.5 99.8 | -0.588 | exp/tri3/decode_dev/score_10/ctm_39phn.filt.sys
%WER 23.2 | 400 15057 | 80.0 14.8 5.2 3.3 23.2 99.5 | -0.188 | exp/tri3/decode_dev.si/score_10/ctm_39phn.filt.sys
%WER 21.0 | 400 15057 | 81.9 12.5 5.6 2.9 21.0 99.8 | -0.525 | exp/tri4_nnet/decode_dev/score_5/ctm_39phn.filt.sys
%WER 18.1 | 400 15057 | 85.2 11.0 3.8 3.3 18.1 99.5 | -0.662 | exp/sgmm2_4/decode_dev/score_6/ctm_39phn.filt.sys
%WER 18.3 | 400 15057 | 84.8 11.3 3.9 3.2 18.3 99.5 | -0.279 | exp/sgmm2_4_mmi_b0.1/decode_dev_it1/score_8/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.2 11.4 4.4 2.7 18.5 99.5 | -0.138 | exp/sgmm2_4_mmi_b0.1/decode_dev_it2/score_10/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.3 11.4 4.4 2.8 18.5 99.5 | -0.148 | exp/sgmm2_4_mmi_b0.1/decode_dev_it3/score_10/ctm_39phn.filt.sys
%WER 18.5 | 400 15057 | 84.3 11.4 4.3 2.8 18.5 99.5 | -0.155 | exp/sgmm2_4_mmi_b0.1/decode_dev_it4/score_10/ctm_39phn.filt.sys
%WER 17.4 | 400 15057 | 84.8 10.4 4.7 2.3 17.4 99.3 | -0.616 | exp/dnn4_pretrain-dbn_dnn/decode_dev/score_6/ctm_39phn.filt.sys
%WER 17.3 | 400 15057 | 85.1 10.4 4.5 2.4 17.3 99.3 | -0.581 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_dev_it1/score_6/ctm_39phn.filt.sys
%WER 17.0 | 400 15057 | 85.9 10.2 3.9 2.9 17.0 99.5 | -0.610 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_dev_it6/score_6/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 85.9 11.0 3.2 2.8 16.9 99.5 | -0.126 | exp/combine_2/decode_dev_it1/score_6/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 86.0 10.9 3.1 2.8 16.9 99.3 | -0.126 | exp/combine_2/decode_dev_it2/score_6/ctm_39phn.filt.sys
%WER 16.8 | 400 15057 | 85.8 10.8 3.4 2.6 16.8 99.3 | -0.036 | exp/combine_2/decode_dev_it3/score_7/ctm_39phn.filt.sys
%WER 16.9 | 400 15057 | 86.3 10.9 2.8 3.2 16.9 99.0 | -0.260 | exp/combine_2/decode_dev_it4/score_5/ctm_39phn.filt.sys

%WER 32.1 | 192 7215 | 71.1 19.4 9.5 3.3 32.1 100.0 | -0.468 | exp/mono/decode_test/score_5/ctm_39phn.filt.sys
%WER 26.1 | 192 7215 | 77.6 16.9 5.5 3.6 26.1 100.0 | -0.136 | exp/tri1/decode_test/score_10/ctm_39phn.filt.sys
%WER 24.0 | 192 7215 | 79.7 15.0 5.4 3.7 24.0 99.5 | -0.260 | exp/tri2/decode_test/score_10/ctm_39phn.filt.sys
%WER 21.2 | 192 7215 | 82.1 13.2 4.7 3.3 21.2 99.5 | -0.695 | exp/tri3/decode_test/score_9/ctm_39phn.filt.sys
%WER 24.0 | 192 7215 | 79.3 15.4 5.2 3.4 24.0 99.5 | -0.281 | exp/tri3/decode_test.si/score_9/ctm_39phn.filt.sys
%WER 22.8 | 192 7215 | 80.3 13.7 6.0 3.1 22.8 100.0 | -0.423 | exp/tri4_nnet/decode_test/score_5/ctm_39phn.filt.sys
%WER 19.5 | 192 7215 | 82.8 12.1 5.1 2.3 19.5 100.0 | -0.120 | exp/sgmm2_4/decode_test/score_10/ctm_39phn.filt.sys
%WER 19.7 | 192 7215 | 84.3 11.9 3.8 3.9 19.7 100.0 | -0.673 | exp/sgmm2_4_mmi_b0.1/decode_test_it1/score_6/ctm_39phn.filt.sys
%WER 20.0 | 192 7215 | 83.1 12.3 4.6 3.1 20.0 100.0 | -0.179 | exp/sgmm2_4_mmi_b0.1/decode_test_it2/score_9/ctm_39phn.filt.sys
%WER 20.0 | 192 7215 | 83.1 12.4 4.5 3.2 20.0 100.0 | -0.195 | exp/sgmm2_4_mmi_b0.1/decode_test_it3/score_9/ctm_39phn.filt.sys
%WER 19.9 | 192 7215 | 83.8 12.1 4.1 3.7 19.9 99.5 | -0.521 | exp/sgmm2_4_mmi_b0.1/decode_test_it4/score_7/ctm_39phn.filt.sys
%WER 18.3 | 192 7215 | 84.4 11.2 4.4 2.7 18.3 100.0 | -1.177 | exp/dnn4_pretrain-dbn_dnn/decode_test/score_4/ctm_39phn.filt.sys
%WER 18.4 | 192 7215 | 84.7 11.2 4.2 3.1 18.4 100.0 | -1.168 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_test_it1/score_4/ctm_39phn.filt.sys
%WER 18.5 | 192 7215 | 84.9 11.4 3.7 3.4 18.5 100.0 | -0.779 | exp/dnn4_pretrain-dbn_dnn_smbr/decode_test_it6/score_5/ctm_39phn.filt.sys
%WER 17.8 | 192 7215 | 84.9 11.6 3.5 2.7 17.8 99.5 | -0.143 | exp/combine_2/decode_test_it1/score_6/ctm_39phn.filt.sys
%WER 18.2 | 192 7215 | 85.1 11.6 3.3 3.2 18.2 99.5 | -0.236 | exp/combine_2/decode_test_it2/score_5/ctm_39phn.filt.sys
%WER 18.1 | 192 7215 | 84.8 11.7 3.5 2.9 18.1 99.5 | -0.124 | exp/combine_2/decode_test_it3/score_6/ctm_39phn.filt.sys
%WER 18.0 | 192 7215 | 84.7 11.8 3.5 2.8 18.0 99.5 | -0.130 | exp/combine_2/decode_test_it4/score_6/ctm_39phn.filt.sys

===================================================
Finished successfully on Fri Jan 4 14:16:46 CST 2019
===================================================
Next Step
When voice assistants began to emerge in 2011 with the introduction of Siri, no one could have predicted that this novelty would become a driver for tech innovation. Now, nearly ten years later, it is estimated that one in six Americans owns a smart speaker (Google Home, Amazon Echo), and eMarketer forecasts that nearly 100 million smartphone users will be using voice assistants in 2020.

This is only the beginning of voice technology, as we will see major advancements in the user interface in the years to come. Voice is the future of brand interaction and customer experience. There is a lot of opportunity for much deeper and much more conversational experiences with customers.

The question is, are you willing to jump on this opportunity?


Index
#
$alidir 114
$lang/tmp/LG.fst 99
$model. 100
$tree 100
39-dimension feature vectors 110
.glm 53
.mdl 89
.phn 45, 46
.spk2gender 53
.spk2utt 53
.stm 53
.txt 45
.wav 45
.wrd 45, 46

A
acc_tree_stats 125
acc-tree-stats 113
AccumAmDiagGmm 81
AccumDiagGmm 81
acoustic features 3
Acoustic likelihood 95
acoustic model 3
Acoustic Model 84
acoustic models 4, 8
Acoustic models 6
Acoustic-Phonetic Continuous Speech Corpus 38
AffineTransform 160
algorithm 9
ali 112
ali/ali.gz 114
alignment 113
alignments 114
ali.gz 115
AmDiagGmm 81
ANN 2
ASR I, 4, 7
Automatic Speech Recognition I
DiagGmm 81
DiagGMM 85
DiagGmmNormal 81
dialect 39
Dictionary 59
dir/ali.gz 114
discriminative training 138
DNN 2, 7, 157, 158
DNN/CNN 75
DNN-HMM-based 8
DNN hybrid 12
DR1 44
DR8 44
DTW 2

B
b_a_i. 115
bash 17
Baum-Welch 82
Bayes 150
BLAS 12
BMMI 161
Build_tree 125

C
CD-1 159
CE 138
Cepstral mean and variance normalization 76

check_dependencies.sh 19
CMN 158
CMU 5, 10
CMVN 76, 77
coefficient 75
compile-train-graphs 114
convert-ali 114
corpus 37
CUDA 157

D
Dan’s DNN 149, 150
DARPA 38
datasets 48
DCT 5, 75
decode.sh. 83
decoding 3
Decoding 9, 96
Decoding Graph 98
De-distribution 8
deep learning 10
Deep Neural Networks 7, 149
defacto 34
Deltas 109
Deltas + Delta-Deltas 109
DFT 75

E
egs 41
ELRA 13
EM 82
EndContextDependency 90
eventmap 90

F
fbank 75
feats.scp. 76
feature-alignment 161
Feature Extraction 5
Features extraction 3
feature vectors 3
FFT 5
Filter-bank Analysis 3
Final Results 191
Finite State Transducers (FSTs) 12
fMLLR 77, 125, 131, 158
fMLLR-adapted 125
forward 112
forward pdf 112
forward-pdf 81
Fourier Analysis 3
Frame-level Cross-entropy 160
framing 3
FST 12, 60, 98
fstinput 13
fsts.JOB.gz 85
layer-wise pre-training 157
LDA 119, 125
LDA+MLLT 119, 125
LDC 13, 37
LDC93S1 37
L_disambig.fst 61, 98, 99
lexicon 3
L.fst 61
likelihood, Acoustic 95
likelihood 11, 137
LM 9
LVCSR 103

G
Gaussian-Bernoulli 159
GCONSTS 85
Git 16
github 17
GMM 7, 8, 75
gmm-est 86
GMM-HMM 2, 7
gmm-init-model 114
gmm-init-mono 84
gmm-mixup 114
GPU 15, 157

H
HCLG 85, 98
Hidden Markov Model 7
Hinton 2
HMM 7
hmm-state 81
HTK 2, 89
hybrid 7
Hybrid Training and Decoding framework 150

I
Individual Word Error Rates 7
input_transform 160
ISIP 10
IWER 7

K
Kaldi 10, 11
Karel 149
Karel’s 158
Karel’s DNN 157
Karel Vesely 149, 157
L
language model 3
Language Model 9
LAPACK 12
lattice 97, 141
LAU 60

M
machine learning 10
maximum likelihood 11
Maximum likelihood 138
Maximum Mutual Information 138
max_iter_inc 115
Mel Frequency Cepstral Coefficients 5
mfcc 35
MFCC 5, 75
MFCCs 110
MFCC’s DCT 75
mkgraph.sh 83
MLE 86, 139
MLLT 119, 125
MMI 137, 138
MMI + SGMM2 137
MMI model 139
monoPhone 111
Monophone 82
MonoPhone 81
MPE 161

N
N-best 97
N-Gram 2
NIST 39
nnetbin 149
nonlinearities 150
NSN 60
NTIS 39
NVidia 149
SGMM2 131
SGMMs 12
Sigmoid 160
signals 3
SIL 60
SIL_B 60
sMBR 149, 157, 162
sMBR4 161
Softmax 160
Speech Recognition 3
Sphinx 5, 10
spk2gender 53
spk_id 54
SPN 60
SRI 39
stages 158
steps/align_fmllr.sh. 132
steps/align_si.sh 112
stm 53
Stochastic Gradient Descent 161
sum-tree-stats 113
SVM 2

O
oov.txt/int 61
OpenFst 12
OpenFST 22

P
pattern recognition I
pdf 112
pdfclass 90
p.d.f-id 92
Perl 17
perplexity 103
phones 61
phone status. 112
phones.txt 61
Povey 12
pre_ali.gz 132
pre-emphasis 3
pre-processing 3
Pre-processing 4
pre-trained DBN 160
Pre-training 159
probability of Transition 63
P(W|O) 139
Python 17
Pytorch 10

R
RBM 149
Recipes 46
Restricted Boltzmann Machines 157
run_dnn.sh 158

S
SAT 125
scp 83
self-look pdf 112
self-loop. 112
Semi-supervised Gaussian Mixture Model (SGMMs) 12
Sequence-discriminative 161
Sequence-discriminative Training 161
SGD 8

T
tanh 160
TE 90
Tensorflow 10
TIMIT 2, 8
TIMIT corpus 44
toolkit 11
topo 61, 85
trailblazers 34
train_mono.sh 83
train-mono.sh 83
transducers 22
TransitionalModel 81
Tri1 109
tri1: Deltas + Delta-Deltas 112
tri2 119, 121
tri2: LDA + MLLT 119
tri3 125, 126
tri3: LDA+MLLT+SAT 125
triPhone 111
Triphone 109
triphonestates 160

U
utils 46
utt 112
utterance 54
utt_id 54

V
vector 5
Viterbi 9, 81, 82, 90
VQ 2

W
WER 34, 75
WFST 22, 96
windowing 3
word dictionary 47
word error rate 34
words.txt 61

Y
Yes or No 23
Z
Zhang 12
