FoCal MultiClass Manual
Niko Brümmer
Spescom DataVoice
[email protected]
June 2007
Contents
1 Introduction
  1.1 Availability
  1.2 Relationship to original FoCal Toolkit
  1.3 Does it work?
  1.4 Manual organization
  1.5 Acknowledgements

II Toolkit User Manual

5 How to get up and running
  5.1 Software Versions

6 Package Reference
  6.1 Multi-class Cllr evaluation
  6.2 Fusion
  6.3 Linear backend
  6.4 Quadratic backend
  6.5 HLDA backend
  6.6 Application
  6.7 NIST LRE
  6.8 Examples
  6.9 Other packages
Chapter 1
Introduction
This document is the user manual for the FoCal Multi-class Toolkit, a collection of utilities, written in MATLAB, to help researchers with the problems of evaluation, fusion, calibration and decision-making in multi-class statistical pattern recognition.
Although applicable more widely, the toolkit has been designed primarily with the NIST LRE-07¹ language recognition evaluation in mind.
This toolkit does not form a complete pattern or language recognizer; rather, it helps the creators of such recognizers to evaluate, calibrate and fuse the outputs of those recognizers, and to use those outputs for making cost-effective decisions.
1.1 Availability
Documentation and MATLAB code for this toolkit are available at:
http://niko.brummer.googlepages.com/focalmulticlass
This toolkit is made freely available, for non-commercial use, with the understanding that the authors of the toolkit and their employers cannot be held responsible in any way for the toolkit or its use.
¹ http://www.nist.gov/speech/tests/lang/2007/index.htm
1.2 Relationship to original FoCal Toolkit
The original FoCal Toolkit, by the same author and also written in MATLAB, was designed with application to the NIST SRE speaker detection evaluations in mind. Speaker detection is a two-class recognition problem.
In contrast, the FoCal Multi-class Toolkit is designed for multi-class recognition problems, but it has the following in common with the original FoCal:
- All of chapter 2 (except possibly for section 2.3, if it does not interest you).
- All of chapter 3.
- Relevant parts of chapter 4, if you are interested in using Gaussian backends.
Also keep in mind that MATLAB-style help comments are available (in the m-file source code) for most of the functions and provide extensive additional information.
1.5 Acknowledgements
The author wishes to thank Lukas Burget for making his HLDA optimization
code available for use in this toolkit and David van Leeuwen for research
collaboration.
Part I
Chapter 2
Application-independent multi-class recognition
2.1 Input
In all cases of multi-class recognition (detection or identification), we assume
that the following four essential inputs are given:
1. Input data x. (In language recognition x is a speech segment.)
2.2 Processing
The toolkit is designed to be applicable when the above-listed inputs are
processed with the following four steps (of which the toolkit does steps 2
and 3):
- $\vec\ell$ is in a symmetrical, but redundant form. An asymmetrical, but non-redundant form can be obtained, e.g., by choosing the constant $c = -\log P(\vec\ell(x)|H_N)$, so that the last component is always zero, and then omitting that component. In this toolkit, we work with the symmetrical, redundant form.
- $P(\vec\ell(x)|H_N)$ is not the same as $P(x|H_N)$. All of the processing between input $x$ and $\vec\ell(x)$ leads to some inevitable information loss. Given this information loss, an optimal representation of the information that remains is in the well-calibrated form of equation (2.2).
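As a concrete illustration of the two forms, here is a minimal MATLAB sketch (ours, not a toolkit function):

    % Symmetrical, redundant form: one log-likelihood per class.
    ell = [2.1; -0.3; 0.7];               % example with N = 3 classes

    % Asymmetrical, non-redundant form: shift so that the last component
    % becomes zero, then omit that component.
    ell_nonred = ell(1:end-1) - ell(end); % length N-1

    % A representative of the symmetrical form can be recovered, since
    % log-likelihoods matter only up to a common additive constant.
    ell_sym = [ell_nonred; 0];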
$$\vec{P}_{H|\vec\ell} = \begin{bmatrix} P(H_1|\vec\ell\,) \\ P(H_2|\vec\ell\,) \\ \vdots \\ P(H_N|\vec\ell\,) \end{bmatrix} = \left( \sum_{j=1}^{N} P_j e^{\ell_j} \right)^{\!-1} \begin{bmatrix} P_1 e^{\ell_1} \\ P_2 e^{\ell_2} \\ \vdots \\ P_N e^{\ell_N} \end{bmatrix} \qquad (2.3)$$

where $P_j$ denotes the prior probability $P(H_j)$.
$$P(H_i|\vec\ell\,) \ge \tfrac{1}{2} \;\Rightarrow\; \text{accept } H_i, \qquad P(H_i|\vec\ell\,) < \tfrac{1}{2} \;\Rightarrow\; \text{reject } H_i. \qquad (2.4)$$
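For illustration, a minimal MATLAB sketch (ours, not a toolkit function) of equations (2.3) and (2.4), computed stably in the log domain:

    ell   = [2.1; -0.3; 0.7];     % log-likelihood-vector for one trial
    prior = [1/3; 1/3; 1/3];      % prior distribution [P_1; ...; P_N]

    % Posterior (2.3): prior-weighted likelihoods, normalized. Subtracting
    % the maximum before exponentiation avoids numerical overflow.
    logpost = log(prior) + ell;
    logpost = logpost - max(logpost);
    post = exp(logpost) / sum(exp(logpost));

    % Detection decisions (2.4): accept H_i iff its posterior reaches 1/2.
    accept = (post >= 0.5);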
9
Steps 1 and 2 are application-independent steps to extract recognition
information from the input and to format this information.
Detection of class Hi is to make the binary decision: Does the input belong
to class Hi or not?
Take the case of three classes H1 , H2 , H3 as example: There are three ways
of writing the identification task as combinations of detection tasks:
For more hypotheses, this recipe is applied recursively and there is a combinatorial explosion of ways to write any N-class identification as combinations of one-against-K detections, where K ranges from 1 to N − 1.
2.3.2 Soft decisions
The case of soft decisions is similar. Once the prior information has been
incorporated, a soft N-class identification decision is just a (posterior) probability distribution over the possible answers, which can be written for the
two cases as:
- For identification the soft decision is a probability distribution of the form: $[P(H_1), P(H_2), \ldots, P(H_N)]^T$.
- For detection of $H_i$ the soft decision is a probability distribution of the form: $[P(H_i), P(\neg H_i)]^T$, where $\neg$ denotes "not" and where

$$P(\neg H_i) = 1 - P(H_i) = \sum_{j \neq i} P(H_j). \qquad (2.5)$$
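In MATLAB, the soft detection decision for class i follows from the identification posterior in one line (our illustration):

    post = [0.6; 0.3; 0.1];           % identification posterior (N = 3)
    i = 1;
    det = [post(i); 1 - post(i)];     % [P(H_i); P(not H_i)], equation (2.5)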
2.4 Summary
The toolkit works on the principle of representing general multi-class recognition information as log-likelihood-vectors, where there is a likelihood for each possible class. If well-calibrated, these likelihoods can be used to make cost-effective Bayes decisions in a wide variety of recognition applications.
¹ Again, there is a combinatorial explosion of different ways to do this. If you need to know exactly how many ways there are, see sequence A109714 in the On-Line Encyclopedia of Integer Sequences: http://www.research.att.com/~njas/sequences/A109714.
Chapter 3
$x_t$ is the input,
The multi-class Cllr measure of goodness of these log-likelihoods is:

$$C_{llr} = -\frac{1}{T} \sum_{t=1}^{T} w_t \log_2 P_t \qquad (3.1)$$

where $P_t$ is the posterior probability for the true class of trial $t$, as calculated from the given log-likelihoods and the flat prior, $P(H_i) = P_{\text{flat}} = \frac{1}{N}$:

$$P_t = P\big(H_{c(t)} \,\big|\, \vec\ell(x_t)\big) = \frac{\exp \ell_{c(t)}(x_t)}{\sum_{j=1}^{N} \exp \ell_j(x_t)}. \qquad (3.2)$$
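For concreteness, here is a minimal MATLAB sketch of equations (3.1) and (3.2), with the trial weights $w_t$ taken as 1; the toolkit function MULTICLASS_CLLR is the authoritative implementation, and the names below are ours:

    function cllr = cllr_sketch(LL, labels)
    % LL:     T-by-N matrix of log-likelihoods, LL(t,j) = ell_j(x_t).
    % labels: T-by-1 vector of true class indices c(t).
    [T, N] = size(LL);
    % Flat-prior posterior (3.2): the 1/N prior cancels in the ratio.
    % Subtracting the row maximum makes the log-sum-exp numerically stable.
    m = max(LL, [], 2);
    logdenom = log(sum(exp(bsxfun(@minus, LL, m)), 2)) + m;
    logPt = LL(sub2ind([T, N], (1:T)', labels(:))) - logdenom;
    % Average logarithmic cost (3.1), converted from nats to bits.
    cllr = -mean(logPt) / log(2);
    end

As a check against the reference value in section 3.1.1 below, a useless recognizer with LL = zeros(T, N) gives cllr = log2(N).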
3.1.1 Properties

- Cllr has the sense of cost: lower Cllr is better.
- Cllr has units in bits of information.
- $0 \le C_{llr} \le \infty$.
- $C_{llr} = 0$ only for perfect recognition, where the posterior for the true class, $P_t = 1$, for every trial $t$.
- $C_{llr} = \infty$, if $P_t = 0$, for any trial $t$. Keep log-likelihoods finite for finite Cllr. More generally, keep log-likelihood magnitudes moderately small for small Cllr.
- $C_{llr} = \log_2 N$ is the reference value for a useless, but nevertheless well-calibrated recognizer, which extracts no information from the speech, and which acknowledges this by outputting equal log-likelihoods: $\ell_1(x_t) = \ell_2(x_t) = \cdots = \ell_N(x_t) = c_t$.
- $C_{llr} < \log_2 N$ indicates a useful recognizer which can be expected to help to make decisions which have lower average cost than decisions based on the prior alone.
- $C_{llr} > \log_2 N$ indicates a bad recognizer which can be expected to make decisions which have higher average cost than decisions based on the prior alone.
¹ If there are an equal number of trials of each class, then $w_t = 1$.
3.1.2 Interpretations
For details, see [1].
- $\log_2 N - C_{llr}$ is the average effective amount of useful information, about the class of the input, that the evaluated recognizer delivers to the user, relative to the information given by the flat prior.
- The logarithmic cost for the posterior, $-\log_2(P_t)$, can be interpreted as an expected cost of using the recognizer, over a wide range of different identification and detection applications, and over a wide range of different cost coefficients.
- The logarithmic cost for the posterior, $-\log_2(P_t)$, can be interpreted as an integral, over a wide range of different identification and detection applications, of the error-rate of the recognizer.
function that computes an average two-class Cllr over target languages (see toolkit function AVG_CDET_AND_CLLR).
Although this toolkit is aimed at application in NIST LRE07, we prefer to use the multi-class version rather than the averaged two-class version. The multi-class Cllr is employed in all of the following toolkit functions:

- MULTICLASS_CLLR (for evaluation)
- MULTICLASS_MIN_CLLR (for evaluation)
- CALREF_PLOT (for evaluation)
- TRAIN_NARY_LLR_FUSION (for fusion and calibration)
3.2.1 Calibration transformation

The calibration transformation which is used to perform the above-mentioned decomposition is:

$$\vec\ell\,'(x_t) = \alpha \vec\ell(x_t) + \vec\beta \qquad (3.5)$$

where $\vec\ell(x_t)$ is the original log-likelihood-vector for trial $t$; $\vec\ell\,'(x_t)$ is the transformed (calibrated) log-likelihood-vector; $\alpha$ is a positive scalar; and $\vec\beta$ is an $N$-vector. Note that the calibration parameters $(\alpha, \vec\beta)$ are constant for all the trials of an evaluation. This calibration transformation serves a dual purpose in this toolkit:
We can now explicitly specify the refinement and calibration losses in terms of the calibration transformation (3.5):

$$\text{refinement loss} = \min_{\alpha, \vec\beta} C'_{llr}, \qquad (3.6)$$

$$\text{calibration loss} = \max_{\alpha, \vec\beta} \big( C_{llr} - C'_{llr} \big), \qquad (3.7)$$

where $C'_{llr}$ is defined by rewriting (3.1) and (3.2) in terms of $\vec\ell\,'(\cdot)$:

$$C'_{llr} = -\frac{1}{T} \sum_{t=1}^{T} w_t \log_2 P'_t, \qquad (3.8)$$

$$P'_t = P\big(H_{c(t)} \,\big|\, \vec\ell\,'(x_t)\big). \qquad (3.9)$$
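To make the decomposition concrete, the refinement loss can be approximated by a direct numerical search over $(\alpha, \vec\beta)$, here with fminsearch and the cllr_sketch function sketched in section 3.1; this is only our illustrative sketch, and MULTICLASS_MIN_CLLR is the toolkit's proper implementation:

    % LL: T-by-N log-likelihoods; labels: true classes, as in cllr_sketch.
    N = size(LL, 2);
    % Parametrize alpha = exp(p(1)) > 0 and beta = p(2:end).
    obj = @(p) cllr_sketch(bsxfun(@plus, exp(p(1)) * LL, p(2:end)), labels);
    p = fminsearch(obj, zeros(1, N + 1));     % start at alpha = 1, beta = 0
    refinement_loss  = obj(p);                            % equation (3.6)
    calibration_loss = cllr_sketch(LL, labels) - refinement_loss;   % (3.7)

Note that (3.7) equals $C_{llr}$ minus the minimized $C'_{llr}$, since $C_{llr}$ itself does not depend on $(\alpha, \vec\beta)$; the simplex search is adequate here only for small $N$.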
3.3 Fusion
Multi-class Cllr has a third purpose in the toolkit, namely as logistic regression optimization objective for the fusion of multiple recognizers. Let there be K input recognizers, where the k-th recognizer outputs its own (not necessarily well-calibrated) log-likelihood-vector $\vec\ell_k(x_t)$ for every trial $t$. Then the fused (and calibrated) log-likelihood-vector is:

$$\vec\ell\,'(x_t) = \sum_{k=1}^{K} \alpha_k \vec\ell_k(x_t) + \vec\beta. \qquad (3.10)$$

This is implemented by toolkit function APPLY_NARY_LIN_FUSION. The fusion coefficients may be found as:

$$(\alpha_1, \alpha_2, \ldots, \alpha_K, \vec\beta\,) = \arg\min C'_{llr}, \qquad (3.11)$$

where $C'_{llr}$ is calculated for the fused $\vec\ell\,'(\cdot)$ over a supervised training database.
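Applying the fusion rule (3.10) is a single weighted sum; here is a sketch with our own toy variable names (APPLY_NARY_LIN_FUSION is the toolkit's implementation, and TRAIN_NARY_LLR_FUSION finds the coefficients as in (3.11)):

    % LLk{k}: T-by-N log-likelihood matrix of the k-th recognizer.
    LLk = {randn(5, 3), randn(5, 3)};    % two 3-class recognizers, 5 trials
    alpha = [0.8; 1.2];                  % K-by-1 fusion weights
    beta  = [0, 0.1, -0.1];              % 1-by-N offset vector

    K = numel(LLk);
    fused = zeros(size(LLk{1}));
    for k = 1:K
        fused = fused + alpha(k) * LLk{k};   % weighted sum in (3.10)
    end
    fused = bsxfun(@plus, fused, beta);      % add the offset vector beta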
⁴ NARY = multi-class (N-ary); LLR = Linear Logistic Regression.
Chapter 4
During recognition, the log-likelihood for each class is just the log Gaus-
sian probability density of the score-vector, given the language.
1. By assuming homoscedastic models (where all models share a common within-class covariance), a linear (or, more correctly, affine) transform from score-space to log-likelihood-space results, as sketched below.
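To see why, expand the log Gaussian density: with a shared covariance $C$, the quadratic term $-\frac{1}{2} s^T C^{-1} s$ is the same for every class and may be dropped (log-likelihoods matter only up to a common constant), leaving the affine map $\ell_i(s) = \mu_i^T C^{-1} s - \frac{1}{2} \mu_i^T C^{-1} \mu_i$. A minimal MATLAB sketch of this map, with our own toy values (the toolkit's linear-backend functions are the real implementation):

    mu = [0 1 2; 0 0 1];    % d-by-N: one class mean per column (d=2, N=3)
    C  = eye(2);            % shared within-class covariance (d-by-d)
    s  = [0.5; 0.2];        % a score-vector to transform

    % Affine map: ell_i(s) = mu_i'*inv(C)*s - 0.5*mu_i'*inv(C)*mu_i
    A = (C \ mu)';                        % N-by-d linear part
    b = -0.5 * sum(mu .* (C \ mu), 1)';   % N-by-1 offset
    ell = A * s + b;                      % log-likelihoods, up to a constant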
size $d \times r$. Additionally, the mean is also constrained: The mean $\mu_i$, for class $i$, is constrained so that:

$$\mu_i = R \begin{bmatrix} m_i \\ m_0 \end{bmatrix} \qquad (4.2)$$
The quadratic backend additionally allows smoothing over the class covariances, where each class covariance is smoothed by convex combination with the average of the covariances. The degree of smoothing is controlled with the combination weight. A weight of 0 means no smoothing, while maximum smoothing with a weight of 1 gives a linear backend again.
See the reference sections 6.4 and 6.5 for more details on quadratic backends.
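The smoothing itself is just the convex combination described above; a sketch, with our own variable names, for class covariances stacked in a d-by-d-by-N array:

    Csep = cat(3, [2 0; 0 1], [1 .5; .5 1], eye(2));  % N = 3 class covariances
    w = 0.5;                       % combination weight in [0, 1]
    Cavg = mean(Csep, 3);          % (unweighted) average covariance
    % w = 0 leaves the quadratic backend unchanged; w = 1 makes all
    % classes share Cavg, recovering a linear backend.
    Csmooth = (1 - w) * Csep + w * repmat(Cavg, [1 1 size(Csep, 3)]);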
Part II
Toolkit User Manual

Chapter 5
How to get up and running
Chapter 6
Package Reference
All of the toolkit functions are written as MATLAB m-files and every function has the usual help comments in the m-file. These comments form the detailed documentation for each function, which we do not repeat here. Instead, we give a general overview of the functions and where and how to apply them.
The functions in the toolkit are grouped under a number of subdirectories according to functionality. We shall refer to each of these groups of MATLAB functions as a package. The rest of this chapter documents each package in a separate section.
6.1 Multi-class Cllr evaluation

See figure 6.1 for an example of the plot produced by this function: There are 5 coloured columns which compare the performance of 5 different 3-class recognizers against the reference value¹ of $\log_2 3$, represented by the leftmost black column. The user-supplied name of each system is printed under its corresponding column. The total height of each column represents the total Cllr, the green part represents the refinement loss and the red part the calibration loss.
At the MATLAB prompt, type help <function_name> for more information on the above functions.
[Figure 6.1: Example CALREF_PLOT output. Bar chart of Cllr [bits] for systems 0, 1, 2, 3 and the fusion 1+2+3, with each bar split into refinement loss and calibration loss, shown against the reference level log(N)/log(2).]
6.2 Fusion
This package performs discriminative (logistic regression) training to calibrate the scores of a single recognizer, or to form a calibrated fusion of the scores of multiple recognizers. Training is performed on a supervised database of scores.
The scores that serve as input to the calibration or fusion must be in log-likelihood-vector form (as defined by equation (2.2)), but need not be well-calibrated. See section 3.3 for more detail on fusion and calibration.
¹ See section 3.1.1.
6.3 Linear backend

This package provides the following functions:
At the MATLAB prompt, type help <function_name> for more information on the above functions.
TRAIN_LDA This function is not presently used by any other part of this toolkit (and will probably not be needed by the user), but is included to document the relationship between the linear Gaussian backend and linear discriminant analysis (LDA).

At the MATLAB prompt, type help <function_name> for more information on the above functions.
6.4 Quadratic backend
This package provides functions to train and apply (heteroscedastic) quadratic backends to transform general score-vectors to log-likelihood-vectors.
If output scores of multiple recognizers are stacked into one score-vector, then this backend effectively performs a calibrated fusion of the recognizers and is therefore a generalization of the fusion package. It is more general because it allows input of a more general form. It is up to the user to decide whether to use the discriminative or generative options, or indeed to even chain them.
The linear backend is less general than the quadratic ones and is therefore more strongly regularized, which probably makes it a safer backend to use when training data is scarce.
As noted in chapter 4, the quadratic backend can be regularized by
assuming structured PPCA or FA covariance models, or by smoothing
between class covariances.
6.5 HLDA backend

As noted in chapter 4, this HLDA backend can be further regularized by smoothing between class covariances.
At the MATLAB prompt, type help <function_name> for more information on the above functions. Also see the HLDA package for more details on the HLDA implementation.
6.6 Application
This package provides functions to serve as a link between applications and
the application-independent information carried by the log-likelihood-vectors
used in the rest of this toolkit.
$$L_i = \log \frac{P(\vec\ell\,|H_i)}{P(\vec\ell\,|\neg H_i)} \qquad (6.1)$$
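A sketch of (6.1), under the usual assumption (ours, for illustration) that the likelihood of the composite hypothesis $\neg H_i$ is the mixture of the other classes' likelihoods, weighted by the prior renormalized to condition on $\neg H_i$:

    ell   = [2.1; -0.3; 0.7];        % log P(ell_vec | H_j) for j = 1..N
    prior = [0.5; 0.25; 0.25];       % prior over the N classes
    i = 1;
    j = setdiff(1:numel(ell), i);    % indices of the competing classes
    wts = prior(j) / sum(prior(j));  % prior conditioned on "not H_i"
    m = max(ell(j));                 % stable log-sum-exp for the mixture
    lognot = log(wts' * exp(ell(j) - m)) + m;
    Li = ell(i) - lognot;            % detection log-likelihood-ratio (6.1)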
At the MATLAB prompt, type help <function_name> for more information on the above functions.
6.7 NIST LRE

This package provides average Cdet and average Cllr tools to evaluate these outputs.

AVG_CDET_AND_CLLR This combines the above two functions, calling first the one and then the other. It therefore takes log-likelihood input and calculates average Cdet and average Cllr.
At the MATLAB prompt, type help <function_name> for more information on the above functions.
6.8 Examples
This package provides a number of example scripts which serve the following
purposes:
To show how to call and how to use together several of the toolkit
functions.
This package provides the following scripts (they are m-file scripts, not functions):
MULTIFOCAL_EXAMPLE1_FUSION This example serves as introduction and demonstration of the toolkit. It starts by generating synthetic training and test data in log-likelihood-vector format, for a number of different hypothetical, 3-class recognizers. The training data has been purposefully made scarce in order to create a difficult fusion problem. It then demonstrates multi-class Cllr evaluation of these recognizers as well as a (discriminative) fusion between them. Then it also demonstrates a variety of generative Gaussian backend fusions on the same data.

MULTIFOCAL_EXAMPLE2_QUADPPCA The synthetic data for this example is in general score-vector format (not log-likelihood-vector format). The generative backends are applicable to this data, but not the discriminative fusion. The data is generated according to a heteroscedastic PPCA model. A variety of different generative backends are tested on this data.

MULTIFOCAL_EXAMPLE3_LINPPCA The synthetic data for this example is in general score-vector format (not log-likelihood-vector format). The generative backends are applicable to this data, but not the discriminative fusion. The data is generated according to a homoscedastic PPCA model. A variety of different generative backends are tested on this data.

MULTIFOCAL_EXAMPLE4_LINFA The synthetic data for this example is in general score-vector format (not log-likelihood-vector format). The generative backends are applicable to this data, but not the discriminative fusion. The data is generated according to a homoscedastic FA model. A variety of different generative backends are tested on this data.

MULTIFOCAL_EXAMPLE5_HLDA The synthetic data for this example is in general score-vector format (not log-likelihood-vector format). The generative backends are applicable to this data, but not the discriminative fusion. The data is generated according to an HLDA model. A variety of different generative backends are tested on this data.

MULTIFOCAL_EXAMPLE6_FUSION This is a repeat of the first part of MULTIFOCAL_EXAMPLE1_FUSION, with the same types of data and the same discriminative fusion. But in this example, evaluation is done with the function CDET_CLLR_PLOT, instead of CALREF_PLOT, which was used in the former example. (See section 6.7 and type help cdet_cllr_plot in MATLAB for more information.)
The above m-file scripts do not take or return any parameters and do not display any help. Just type the script-name at the command-prompt to execute. The user is encouraged to edit these scripts in order to experiment with the toolkit.
The data in all of these examples is synthetic, used here just to demonstrate how these tools work. Don't base conclusions about the relative merit of the toolkit backends on this data. Instead, use the tools demonstrated here to base conclusions on your own real data.
6.9 Other packages

DATA_SYNTHESIS Functions for creating the synthetic data used by the EXAMPLES package.

See the comments in the MATLAB source code for more information.
Bibliography
[1] Niko Brümmer, "Measuring, refining and calibrating speaker and language information extracted from speech," Ph.D. dissertation, Stellenbosch University, to be submitted 2007.

[3] Niko Brümmer and David van Leeuwen, "On Calibration of Language Recognition Scores," Odyssey 2006.

[4] David van Leeuwen and Niko Brümmer, "Channel-dependent GMM and Multi-class Logistic Regression," Odyssey 2006.

[8] M.J.F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.