INTERSPEECH 2014
Extending Limabeam with discrimination and coarse gradients
Charles Fox, Thomas Hain
Speech and Hearing group
Department of Computer Science
University of Sheffield, UK
[email protected]
10.21437/Interspeech.2014-236
Abstract
Limabeam is an approach to multi-microphone array processing for ASR which makes minimal assumptions about system geometry, instead searching for filters that maximise output likelihoods under a speech model. The first results of Limabeam on the AMI meeting corpus are given, followed by two extensions of the algorithm for this corpus. First, it is shown that the original local gradient following sticks in local minima, and a coarser gradient is used. Second, a new discriminative objective function is provided to handle mismatched silence models. The extensions are based on examination of 2D receptive fields and 2D likelihood maps, which are novel near-field analogues of radial beamformer response patterns but show no radial symmetry and have many local minima. The extended Limabeam improves WER over TDOA baselines on the AMI corpus, by 1% rel. when both are adapted with decodes and by 19% rel. when both are adapted with ground truth.
Index Terms: ASR, beamforming, discriminative
1. Introduction
ASR in noisy, reverberant, distant-talking environments such as
meeting rooms remains a difficult task [1, 2] as additive noise
from overlapping speakers and non-speech sources and convolutional noise from reverberation degrade the signal. However
in some cases, instrumentation with multi-microphone arrays
may be possible, either via static installations or ad-hoc networks using, say, the participants’ mobile phones. Such array
signals can be processed with weighted delay-and-sum transforms $\{w_{ij}\}$ of the multiple input channels $x_i[t]$ to an output channel $y[t]$ for ASR,

$$y[t] = \sum_i \sum_j w_{ij}\, x_i[t - j]. \tag{1}$$
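For concreteness, this filter-and-sum transform may be sketched as follows; the array shapes and the function name are illustrative assumptions rather than details from the original paper.

```python
# A minimal sketch of Eq. (1), assuming x is an (n_mics, n_samples)
# array of input channels and w an (n_mics, n_taps) array of FIR weights.
import numpy as np

def filter_and_sum(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y[t] = sum_i sum_j w[i, j] * x[i, t - j]."""
    n_samples = x.shape[1]
    y = np.zeros(n_samples)
    for xi, wi in zip(x, w):
        # Convolving each channel with its own filter implements the
        # inner sum over taps j; truncate back to the input length.
        y += np.convolve(xi, wi, mode="full")[:n_samples]
    return y
```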
Traditional beamformers [3] have used geometric assumptions to choose $\{w_{ij}\}$ to optimise criteria. For example, Time Delay Of Arrival (TDOA, [4]) is optimal for a single source in the presence of diffuse white noise; Minimum Variance Distortionless Response (MVDR, [5]) is optimal assuming the target source has the widest variance of any combination of sources. The Multiple-input/output Inverse Theorem (MINT, [6]) is optimal for discrete, non-diffuse noise sources. None of these assumptions is perfectly valid, and it has been noted [7] that such mismatch is acute as they all tend to be highly sensitive to low noise in any of their inputs.

Likelihood-maximising beamforming (Limabeam) was introduced to ASR by [8, 9]; it makes minimal assumptions and instead searches the $\{w_{ij}\}$ using gradient descent to maximise the likelihood of the signal under an ASR speech model. Using numerical gradient descent with a CMU Sphinx-3 speech model on the CMU-8 [8] and CMU-WSJ-PDS [9] corpora, gains of 31.4% relative have been reported [9], though with the caveat that these significant gains were shown only for longer (>7s) utterances. These corpora contain only single static speakers reading written scripts, so that smaller utterances can easily be chunked together.

The present study tests for independent replication of these results on a standard, unscripted multi-speaker meeting corpus, AMI [1]. It finds no significant improvement over TDOA beamforming using basic Limabeam in this new setting – which has more realistic noise types and utterance characteristics – but suggests two extensions to Limabeam which then do allow it to improve on the TDOA scores. The first is a change of objective function, replacing likelihood with a discriminative likelihood ratio to avoid a problem with silence models in the AMI environment. The second is to replace local gradients with coarser gradients, allowing the search to avoid some local minima.

Other improvements to Limabeam and AMI have been suggested, including sub-band and parameter-sharing Limabeam [10, 11], cepstral Limabeam [12, 13], and neural network recognition gains on AMI [14], which could all be combined with the present extensions.
2. Baseline experiments
Unlike CMU-8/WSJ-PDS, AMI consists of unscripted, simulated business meetings by groups of four participants around a table. Meetings were recorded in three meeting rooms, each instrumented with a circular, 100mm-radius array of omnidirectional microphones in the centre of the table. Training and test sets of 12,000 (15.7 hours) and 1,188 (1.9 hours) non-overlapping, human-segmented utterances are defined, with independent sets of speakers.
Unlike CMU-8/WSJ-PDS, no chunking of short utterances
is performed as our tests are intended as a proxy for more general meetings where speakers may move, or where diarisation
is unclear. Audio used here is at 16kHz, and converted to PLP
features [15]. All processing was per-utterance unless stated
otherwise.
Baselines were obtained for Individual Headset Microphone (IHM) and Single Distant Microphone (SDM) channels after training 3-state left-to-right hidden Markov models (HMMs), with a state-clustered phonetic decision tree tying the states of 16-component GMMs that model the output probabilities. Training followed a standard HTK mixup procedure [16]. Word error rates (WERs) are obtained with NIST sclite [17]. Decodes are based on HTK HDecode with a 3-gram language model trained from the AMI training set. Decode parameters were fixed at the start of baseline testing (s15p0) and were not changed to overfit later experiments.
Figure 1: (a) 8kHz receptive field for TDOA; (b) 6kHz receptive field for a random BF; (c),(d) likelihood fields for utterances; (e),(f) de-silenced likelihood fields; (g),(h) discriminative likelihood ratio fields. White dots show speaker locations; black dots are microphones. The x and y axes are physical coordinates in a 5.5x4m room. The rectangle shows a typical meeting table size.
Baseline WERs were further obtained for standard beamformers. TDOA audio was created per-utterance via

$$w_{ij} = \delta\!\left(j = \arg\max_{\tau} r_\tau(i, 0)\right), \tag{2}$$

where $r_\tau(i, 0)$ is the cross-correlation at lag $\tau$ between channel $i$ and a reference channel $0$, chosen as the channel with the highest utterance energy, and $\delta$ is a Dirac delta function. We also created baselines for the audio output of the standard Beamformit (BFIT, [18]) software using its default settings, which is based on TDOA but includes cross-utterance smoothing optimised for meeting rooms.
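A minimal numpy sketch of this per-utterance TDOA rule follows; placing each channel's single unit tap at its cross-correlation peak, and shifting lags so that all delays are non-negative, are assumptions about implementation details the paper does not spell out.

```python
import numpy as np

def tdoa_weights(x: np.ndarray, n_taps: int, ref: int = 0) -> np.ndarray:
    """One unit tap per channel at its alignment lag, as in Eq. (2)."""
    n_mics, n_samples = x.shape
    lead = np.zeros(n_mics, dtype=int)
    for i in range(n_mics):
        # Peak of the full cross-correlation; index n_samples - 1 is lag 0,
        # so the offset below is the lead of channel i over the reference.
        r = np.correlate(x[ref], x[i], mode="full")
        lead[i] = np.argmax(r) - (n_samples - 1)
    delays = lead - lead.min()        # shift so every delay is non-negative
    w = np.zeros((n_mics, n_taps))
    for i in range(n_mics):
        # A leading channel is delayed by its lead to align the channels;
        # delays beyond the filter length are clipped to the last tap.
        w[i, min(delays[i], n_taps - 1)] = 1.0
    return w
```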
Figure 2: Convergence of gradient descent searches, using exact analytic and numerically estimated gradients, on a 10s utterance.
Table 1: Baseline results (S, D, I are substitution, deletion, and insertion rates).

data   model       WER    S      D      I
IHM    xwrd        39.8   24.3   11.5   4.0
SDM    xwrd        66.0   45.5   16.7   3.9
IHM    MLLR(gnd)   23.4   12.0    8.7   2.7
SDM    MLLR(gnd)   50.9   32.4   15.8   2.7
IHM    MLLR(hyp)   37.2   17.4   16.5   3.3
SDM    MLLR(hyp)   62.7   39.4   19.9   3.4
TDOA   SPR         60.6   41.1   15.7   3.8
BFIT   SPR         61.2   40.0   17.8   3.5
TDOA   MLLR(gnd)   51.8   31.8   17.3   2.7
TDOA   MLLR(hyp)   59.4   36.3   19.9   3.2
Maximum Likelihood Linear Regression (MLLR, [16], from the IHM base) adaptation baselines for TDOA are also shown in table 1. MLLR is trained on a per-speaker basis using ground truth (gnd) and TDOA-decode (hyp) test set transcripts; only means are adapted and 5 regression tree class transforms are used. Ground truth training gives an indication of what adaptation would achieve given large amounts of per-speaker training data.
3. Limabeam experiments
The most basic form of Limabeam models speech as a single GMM on MFCC features (an objective similar to a Wiener filter's, but optimising GMM feature likelihoods rather than frequencies) and was tested on AMI. All Limabeam versions in the
present paper are initialised to TDOA weights, and work with 8
microphones with 10 positive delay taps each (80 parameters).
Pilot experiments (e.g. fig. 2) found no significant differences in compute time or WER when switching from analytic gradients [9] to numerically computed gradients. The analytic solution makes many evaluations around each point to compute full-dimensional gradients, which are used to take a few large, accurate steps; numerical gradient descent takes many smaller steps, giving a smoother convergence curve but the same solution.
The basic results in table 1 use Single Pass Retraining (SPR,
[16]) to train new TDOA and BFIT models based on the previous IHM alignment. TDOA was found to outperform BFIT in
this case.1
1 Following this, GCC-PHAT [19] was also tested and found 11% worse than TDOA, and a static TDOA fixing parameters over all utterances for each speaker was 2.6% worse than TDOA. GCC-PHAT usually improves WER in strong reverberation but does not here. The static TDOA worsening was surprising, as AMI speakers are seated and were expected to have similar, smoothable TDOA values throughout; it shows the large effect of small head movements.
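For reference, the GCC-PHAT lag estimate used in that comparison can be sketched as below (the function name and the small regularising constant are assumptions):

```python
import numpy as np

def gcc_phat_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Lag of a relative to b via the phase transform (GCC-PHAT)."""
    n = len(a) + len(b)
    A, B = np.fft.rfft(a, n), np.fft.rfft(b, n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12   # PHAT: keep only phase information
    cc = np.fft.irfft(cross, n)
    cc = np.roll(cc, n // 2)         # move lag 0 to the centre
    return int(np.argmax(np.abs(cc)) - n // 2)
```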
Figure 4: WER of standard LIMA-HMM by utterance length.
Figure 3: Cumulative distribution of AMI utterance lengths.
Standard BFGS quasi-Newton search was used to perform gradient-based optimisation in both cases (Octave fminunc, whose numerical gradients use $w_{ij}$ moves of 6e-6). Numerical gradient computation allowed simple switching from MFCC to PLP features, which gave a 7.1% relative WER improvement. However, even with this, the WER in table 2 did not improve on the relevant (TDOA, SPR, 60.6%) baseline, suggesting that LIMA-GMM likelihood is not a good proxy for WER optimisation.
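In scipy terms this optimisation loop can be sketched as follows, where neg_log_lik stands for a hypothetical function that beamforms one utterance with the candidate weights and returns its negated log-likelihood under the model (the paper's own implementation used Octave fminunc):

```python
import numpy as np
from scipy.optimize import minimize

def optimise_weights(w_tdoa: np.ndarray, neg_log_lik) -> np.ndarray:
    """BFGS from the TDOA initialisation, with numerical gradients."""
    res = minimize(
        neg_log_lik,
        w_tdoa.ravel(),           # 8 mics x 10 taps -> 80-dim vector
        method="BFGS",
        options={"eps": 6e-6},    # finite-difference step, as in the text
    )
    return res.x.reshape(w_tdoa.shape)
```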
A full HMM-based Limabeam was then implemented. Alignment was performed at every parameter evaluation (80 for each gradient descent iteration) using HVite with ground truth transcripts. MLLR adaptation was applied and results are shown as LIMA-HMM in table 2. Again this standard Limabeam underperformed its baseline (TDOA, MLLR(gnd)).
Table 2: Limabeam results.

data       model   WER    S      D      I
LIMA-GMM   SPR     60.8   41.3   15.7   3.9
LIMA-HMM   MLLR    64.4   36.1   26.1   2.2
4. Inspection of corpus and beams
Previous work [9] showed on other corpora that Limabeam gave no improvement on short utterances, such as those of less than 7s duration. To explore this for the AMI corpus, figs 3 and 4 show the length distribution of AMI utterances and the WER of LIMA-HMM by utterance length. These suggest that some of the overall poor performance is due to a large number of short utterances, confirming the findings of [9].
Inspection of the $w_{ij}$ during optimisation suggested that most utterances' parameters were shifting only by small amounts away from the initial TDOA solutions – comparable to the search step size of the optimiser. This could occur if TDOA solutions are already local (or global) optima, making it impossible for the optimiser to escape from them. To gain some intuition about the shape of the search space, we examined the theoretical (no reverb or noise) spatial receptive field patterns for various $w_{ij}$ sets. Traditionally, beamformer responses are plotted only as functions of angle, not radius, under a far-field assumption. This assumption might not hold for the distances in the AMI corpus, where speakers sit <1m from the array. Fig. 1(a) shows a typical receptive field over one AMI meeting room (at 6kHz) for a TDOA filter focused on a speaker location. For TDOA filters it can be seen that the near-field effects are limited to a small region around the mic array, so the usual radial plots are appropriate. However, Limabeam can search a much
larger space – in our case 80D – than the 3D manifold of TDOA solutions. Just one example of a receptive field given random weights is shown in fig. 1(b), which indicates the ability to produce fields very unlike the classical side-lobed shapes, and where near-field effects do dominate the area where the AMI speakers sit. Such plots suggest many local minima where Limabeam could stick.

Single-frequency receptive fields differ from likelihood fields, however, which are illustrated in fig. 1(c) and (d). These show the effect of deliberately choosing $\{w_{ij}\}$ to focus on a speaker at the white dot, then moving the speaker around in simulation. At each pixel, the same utterance is placed there and the likelihood of the beamformed signal is measured under the HMM model. This likelihood is plotted as the pixel colour. The resulting likelihood fields are again very non-smooth and show many localised minima.

It must be emphasised that this 2D field (actually a slice of a 3D room) is not the space searched by the 80-dimensional LIMA search. But it does give a suggestion of the shape of the latter space and of the types of minima found in it, and therefore that Limabeam is likely to get stuck often in such minima. The likelihood fields are highly diffuse, lacking clear single peaks at speaker locations. (This is the first publication to show such fields.)
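A minimal sketch of how such a single-frequency, near-field receptive field can be computed; the grid resolution, speed of sound, and free-field (no reverb or noise) propagation model are assumptions consistent with the description above.

```python
import numpy as np

def receptive_field(w, mics, fs=16000, f=6000.0, c=343.0,
                    room=(5.5, 4.0), step=0.05):
    """|filter-and-sum response| at frequency f for a source at each point.

    w: (n_mics, n_taps) filter weights; mics: (n_mics, 2) positions (m).
    A point source at p reaches mic i after |p - m_i| / c seconds with
    1/r spherical attenuation; tap j then adds j / fs of further delay.
    """
    xs = np.arange(0.0, room[0], step)
    ys = np.arange(0.0, room[1], step)
    taps = np.arange(w.shape[1]) / fs
    field = np.zeros((len(ys), len(xs)))
    for yi, y in enumerate(ys):
        for xi, x in enumerate(xs):
            r = np.hypot(mics[:, 0] - x, mics[:, 1] - y)
            r = np.maximum(r, 1e-3)   # guard grid points on a microphone
            phase = -2j * np.pi * f * (r[:, None] / c + taps[None, :])
            field[yi, xi] = abs(np.sum(w * np.exp(phase) / r[:, None]))
    return field
```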
5. Extensions

The AMI corpus contains strong, localised noise sources: at least one of the rooms has a loud server rack. It was also found that 27% of segment time is assigned to silence. Together these could lead to optimising the filter towards local minima that transform silence in new utterances to sound like silence in the trained models, perhaps dominating any transform of speech sound. To test this hypothesis, two alternative optimisation objectives were constructed.

The first objective function is a per-frame likelihood average, but with silence phones excluded,

$$\mathrm{obj}_1(\{w_{ij}\}) = \frac{1}{M} \sum_{n=1}^{N} b(n) \log P(x(n) \mid \mathcal{M}, \{w_{ij}\}), \tag{3}$$

where there are $N$ frames in the utterance, $M$ of which contain non-silence phones in the current alignment as indicated by the indicator function $b$, and $\mathcal{M}$ is the HMM model giving a best alignment.
Instead of simply excluding silence, the second objective function actively penalises the transformation of model silence into new-utterance silence by using the discriminative function

$$\mathrm{obj}_2(\{w_{ij}\}) = \log \frac{\prod_n P(x(n) \mid \mathcal{M}, \{w_{ij}\})}{\prod_n P(x(n) \mid \mathcal{S}, \{w_{ij}\})}, \tag{4}$$

where $\mathcal{S}$ is an HMM model consisting only of a single GMM trained on silence from the training set.
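Both objectives reduce to sums of per-frame log-likelihoods. A minimal sketch, assuming hypothetical helpers frame_ll(n) = log P(x(n)|M, w), sil_ll(n) = log P(x(n)|S, w), and is_speech(n) for the indicator b(n):

```python
def obj1(frame_ll, is_speech, n_frames: int) -> float:
    """Eq. (3): average log-likelihood over the M non-silence frames."""
    speech = [n for n in range(n_frames) if is_speech(n)]
    return sum(frame_ll(n) for n in speech) / len(speech)

def obj2(frame_ll, sil_ll, n_frames: int) -> float:
    """Eq. (4): the log of a ratio of products becomes a sum of
    per-frame log-likelihood differences under the two models."""
    return sum(frame_ll(n) - sil_ll(n) for n in range(n_frames))
```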
Sample likelihood fields for these two objective functions are shown in fig. 1(e),(f) for obj1 and fig. 1(g),(h) for obj2. Manual inspection of 10 utterances each suggested that these forms are typical, and that obj2 tends to have a smoother form. This objective was thus selected for full testing.
While these spatial fields look considerably smoother than the basic likelihood fields, it might still be possible that local minima exist in the higher-dimensional weight space, and so a second extension aims to help escape from any such minima by using less localised gradient estimates. Analytic gradients are perfectly local, measuring the slope at an infinitesimal point. Standard numerical gradient descent approximates this by measuring the gradient between finitely but closely spaced points. However, for very non-smooth surfaces, such as fractals, local gradients are of little use and larger steps should be taken to escape from very small minima.
Two methods were tested to do this: first, a simulated annealing search (SA, [20]); and second, gradient descent searches using coarse gradient estimates, obtained by sampling points 2000 times further away from the current solution than used by the standard optimiser (GDx2000). Limabeam searches of most forms are computationally expensive to run (e.g. 400 days of 3GHz core time for the AMI test set), so these alternatives to basic gradient descent were used to give just an indication of alternative methods rather than an exhaustive search.
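A minimal sketch of such a coarse finite-difference gradient: it is a standard forward difference except that the probe spacing is scaled up, so slopes are measured across, and are blind to, minima narrower than the probe distance.

```python
import numpy as np

def coarse_gradient(obj, w: np.ndarray, eps: float = 6e-6,
                    scale: float = 2000.0) -> np.ndarray:
    """Forward-difference gradient with probe spacing scale * eps."""
    h = eps * scale
    g = np.zeros_like(w)
    f0 = obj(w)
    for k in range(w.size):
        w_probe = w.copy()
        w_probe.flat[k] += h          # probe far from the current point
        g.flat[k] = (obj(w_probe) - f0) / h
    return g
```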
Results are shown in table 3. All runs here use Discriminative LIMA-HMM and MLLR training on either ground truth or
SPR-TDOA decodes, and in TDOA model space.
Table 3: Extended Limabeam results. ‘gnd’ = MLLR adaptation performed using ground truth data; ‘hyp’ = MLLR adaptation performed on decoded hypotheses. GD = standard gradient descent search; SA = simulated annealing search; GDx2000 = coarse gradient descent search.
search    MLLR   WER    S      D      I
GD        gnd    48.2   25.0   21.1   2.0
GD        hyp    59.3   36.4   19.8   3.1
SA        gnd    49.9   29.2   19.0   1.7
SA        hyp    63.0   39.0   21.3   2.6
GDx2000   gnd    41.8   24.0   15.8   2.0
GDx2000   hyp    58.7   37.6   18.0   3.1
The hyp-based experiments give marginal improvements over TDOA (0.1% abs. for GD, no improvement for SA, and 1% rel. for GDx2000). The ground-truth LIMA-MLLR results give a much more impressive improvement over ground-truth TDOA-MLLR (19% relative, though still not as large as in [9]), suggesting that if sufficient Lima-processed per-speaker training data were available then such improvements would also occur.

6. Conclusion

No significant WER gains were found on AMI with standard Limabeam. However, by extending it with a discriminative objective and coarse gradient descent we obtained a 1% relative improvement, and a suggestion from gnd MLLR that larger gains of up to 19% over gnd-MLLR standard Limabeam would be available given large amounts of per-speaker training data.

Unusual shapes formed by arbitrary parameter values emphasise that there is more to beamforming than is shown in traditional angular response plots under far-field assumptions. For AMI, the space where speakers sit is susceptible to dominating near-field effects, which can produce non-radially symmetric local minima in both beamformer receptive fields and likelihood maps. Even conservatively quantising each weight to 10 possible values gives a $10^{80}$-sized search space – comparable to the number of atoms in the universe – and impossible to search with any current computer. The space contains all known linear beamformers and many more, so any Limabeam-like search is heuristic. Adapting the coarseness of gradient descent to better fit intuition about the minima distribution gave a small WER improvement. This suggests that future work could quantify such prior knowledge and use it to create custom search algorithms to better exploit it.

7. References
[1] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike
Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec,
Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal,
et al., “The AMI meeting corpus: A pre-announcement,”
in Machine learning for multimodal interaction, pp. 28–
39. Springer, 2006.
[2] Charles Fox, Yulan Liu, Erich Zwyssig, and Thomas Hain, “The Sheffield Wargames Corpus,” in Proceedings of Interspeech, 2013.
[3] Robert J Mailloux, Phased array antenna handbook,
Artech House Boston, 2005.
[4] Ralph O Schmidt, “Multiple emitter location and signal parameter estimation,” Antennas and Propagation, IEEE Transactions on, vol. 34, no. 3, pp. 276–280, 1986.
[5] Jack Capon, “High-resolution frequency-wavenumber spectrum analysis,” Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[6] Jacob Benesty, Jingdong Chen, Yiteng Arden Huang, and
Jacek Dmochowski, “On microphone-array beamforming from a mimo acoustic signal processing perspective,”
Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 3, pp. 1053–1065, 2007.
[7] Felicia Lim, Mark RP Thomas, and Patrick A Naylor,
“Mintformer: A spatially aware channel equalizer,” in
Applications of Signal Processing to Audio and Acoustics
(WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–
4.
[8] Michael L Seltzer, Microphone array processing for robust speech recognition, Ph.D. thesis, Carnegie Mellon
University, 2003.
[9] Michael L Seltzer, Bhiksha Raj, and Richard M Stern,
“Likelihood-maximizing beamforming for robust handsfree speech recognition,” Speech and Audio Processing,
IEEE Transactions on, vol. 12, no. 5, pp. 489–498, 2004.
[10] Michael L Seltzer and Richard M Stern, “Subband
likelihood-maximizing beamforming for speech recognition in reverberant environments,” Audio, Speech, and
Language Processing, IEEE Transactions on, vol. 14, no.
6, pp. 2109–2121, 2006.
[11] Michael L Seltzer and Richard M Stern, “Parameter
sharing in subband likelihood-maximizing beamforming
for speech recognition using microphone arrays,” in
Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on.
IEEE, 2004, vol. 1, pp. I–881.
[12] Dominik Raub, John W McDonough, and Matthias Wölfel, “A cepstral domain maximum likelihood beamformer for speech recognition,” in Interspeech, 2004.
[13] Kshitiz Kumar and Richard M Stern, “Maximum-likelihood-based cepstral inverse filtering for blind speech
dereverberation,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on.
IEEE, 2010, pp. 4282–4285.
[14] Pawel Swietojanski, Arnab Ghoshal, and Steve Renals,
“Hybrid acoustic models for distant and multichannel
large vocabulary speech recognition,”
in Automatic
Speech Recognition and Understanding (ASRU), 2013
IEEE Workshop on. IEEE, 2013, pp. 285–290.
[15] Hynek Hermansky, “Perceptual linear prediction (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, Apr. 1990.
[16] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw,
XA Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al.,
“The HTK book,” 2006.
[17] National Institute of Standards and Technology (NIST), Speech Recognition Scoring Toolkit (SCTK) Version 2.4.0, web resource: http://www.itl.nist.gov/iad/mig/tools, 2010.
[18] X Anguera, “Beamformit, the fast and robust acoustic beamformer,” http://www.icsi.berkeley.edu/xanguera, 2006.
[19] Michael S Brandstein and Harvey F Silverman, “A robust
method for speech signal time-delay estimation in reverberant rooms,” in Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. IEEE, 1997, vol. 1, pp. 375–378.
[20] William L Goffe, “Simann: a global optimization algorithm using simulated annealing,” Studies in Nonlinear
Dynamics and Econometrics, vol. 1, no. 3, pp. 169–176,
1996.