discriminants, and the potential strength of evidence they yield. The four
The final portion of the thesis considers the combination of the four parameters
likelihood ratio.
presents for the first time a comprehensive survey of current forensic speaker
statistics, for four phonetic and linguistic parameters that survey participants
the forensic speech science and likelihood ratios for forensics literature by
considering what steps can be taken to conceptually align the area of forensic
speaker comparison with more developed areas of forensic science (e.g. DNA)
comparison system.
Table of Contents
Title Page............................................................................................................................................................i
Abstract ............................................................................................................................................................iii
Acknowledgements ................................................................................................................................. 16
Declaration ................................................................................................................................................... 19
Chapter 1: Introduction......................................................................................................................... 23
2.3.2 Prior Odds................................................................................................................................... 43
3.1 Introduction........................................................................................................................................ 71
3.2 The Survey .......................................................................................................................................... 74
3.2.1 SurveyGizmo.............................................................................................................................. 74
3.10 What is Considered Discriminant ........................................................................................... 89
4.1 Introduction........................................................................................................................................ 96
4.5.3 Discussion ................................................................................................................................ 123
4.6.2.1 LRs for ARs in 100 SSBE Male Speakers ............................................................ 128
4.6.2.2 LRs for ARs of 25 Speakers with Variation in the Minimum Number of
5.5 Discussion......................................................................................................................................... 172
5.5.2 Comparison of LTFD, MFCC, MVKD, and GMM-UBM Results ............................. 174
7.1 Introduction..................................................................................................................................... 202
8.4.2 Within-Parameter Correlation Results ........................................................................ 238
9.1 Summary of the Forensic Speaker Comparison Practices Survey ............................ 267
9.3 Summary of Discrimination Performance by the Overall System ............................ 274
List of Tables and Figures
List of Tables
combinations (100 speakers)
Table 5.5: Package length variability 171
Table 5.6: Overview of discriminant formant studies where F3 performs 173
best
Table 5.7: Summary of LR-based discrimination for LTFD and MFCC in 175
competing studies
Table 6.1: Tailored Frequency Ranges for Selected Speakers 189
Table 6.2: Summary of LR-based discrimination for F0 (100 speakers) 196
Table 6.3: F0 package length variability for LR results 198
Table 6.4: Fundamental frequency across different package lengths 198
Table 7.1: Number of clickers versus number of non-clickers over 213
varying speech sample lengths
Table 7.2: Summary of Speakers Mean and Median Click Rates - Int1 223
versus Int2 and Int3
Table 7.3: Changes in click rate across speaker - Int1 versus Int2 224
Table 7.4: Changes in click rate across speaker - Int1 versus Int3 224
Table 7.5: Mean click rates of the three interlocutors 224
Table 8.1: Formant pairing within LTFD 237
Table 8.2: Between-parameter pairings 237
Table 8.3: Correlation coefficients within LTFD 242
Table 8.4: Correlation coefficients within- and between-parameters 247
Table 8.5: Summary of LR-based discrimination for the complete system 250,
(100 speakers) 288
Table 8.6: Summary of LR-based discrimination for alternative systems 253
(100 speakers)
Table 8.7: Summary of LR-based discrimination for individual 260
parameters (100 speakers)
Table 8.8: Performance comparison between individual parameters and 261
the complete system
Table 8.9: Best-performing system or individual parameter in relation to 262
LR statistics
Table 9.1: Human-based results against ASR (Batvox) results from 275
French et al. (2012) on studio quality data
List of Figures
Figure 6.2: Distribution of mean fundamental frequency 190
Figure 6.3: Distribution of standard deviation in fundamental frequency 191
Figure 6.4: Cumulative percentages for mean fundamental frequency 192
Figure 6.5: Cumulative percentages for standard deviation in 193
fundamental frequency
Figure 6.6: Tippett plot of fundamental frequency 197
Figure 7.1: The action of the vocal organs in producing a velaric 204
ingressive voiceless dental click k (a) first stage velic and
anterior closure; (b) second stage, expansion of the enclosed
oral space; (c) third stage, release of the anterior closure.
(Laver, 1994, p. 176)
Figure 7.2: IPA Chart - Clicks Excerpt 204
Figure 7.3: Distribution of click occurrences by functional category 212
Figure 7.4: Distribution of click totals over five minutes of speech 214
Figure 7.5: Distribution of click rate (clicks/minute) in DyViS population 215
Figure 7.6: Cumulative percentages for click rate 216
Figure 7.7: Mean and range of click rates across all speakers 218
Figure 7.8: Distribution of standard deviation in click rate 219
Figure 7.9: Cumulative percentages for standard deviation in click rate 220
Figure 8.1: LTFD1 versus LTFD2 238
Figure 8.2: LTFD1 versus LTFD3 239
Figure 8.3: LTFD1 versus LTFD4 239
Figure 8.4: LTFD2 versus LTFD3 240
Figure 8.5: LTFD2 versus LTFD4 240
Figure 8.6: LTFD3 versus LTFD4 241
Figure 8.7: Mean AR versus LTFD1-4 243
Figure 8.8: Click rate versus mean AR 244
Figure 8.9: Click rate versus mean F0 244
Figure 8.10: Click rate versus LTFD1-4 245
Figure 8.11: Mean F0 versus LTFD1-4 246
Figure 8.12: Mean F0 versus mean AR 247
Figure 8.13: Tippett plot of the complete system 251
Figure 8.14: Zoomed-in Tippett plot of the complete system 252
Figure 8.15: Tippett plot of the complete system - parameters calibrated 255
individually and then combined
Figure 8.16: Zoomed-in Tippett plot of the complete system - parameters 256
calibrated individually and then combined (-6 to 6 LLR)
Figure 8.17: Zoomed-in Tippett plot of the complete system - system 257
calibration after combination of parameters (-10 to 10 LLR)
List of Appendices
Appendix A: Survey Instructions
Acknowledgements
There is an ancient African proverb that says 'it takes a village to raise a child'.
A portion of that proverb - 'it takes a village' - is now colloquially used to
acknowledge the influence a group of people can have when contributing to
something bigger than themselves.
Despite appearing as the single author of this piece of work, I believe that it
takes not just a single individual but a village to bring a PhD to fruition. To
that extent, I would like to take the opportunity to thank my village.
The first thank you goes to my supervisor, Professor Peter French, whose
support, guidance, and encouragement has helped shape every bit of this
research. Thank you for every discussion (of which there are far too many to
even count), the time you have spent editing drafts/abstracts/papers, and all
the opportunities you have graciously allowed me.
Thank you is also extended to Professor Paul Foulkes and Dr. Dominic Watt for
their advice and suggestions on this thesis. A general thank you is also due to the
York team for helping to secure the University's position in the BBfor2 project.
I must thank BBfor2 for being a wonderful learning environment over the past
years. An important thank you goes to Dr. David van Leeuwen and Dr. Henk van
den Heuvel for their leadership and organization of the project. A thank you is
also extended to my supervisor in the BBfor2 network, Dr. Didier Meuwly, for
providing valuable insight into the application of Bayes' Theorem in forensics.
I would also like to thank those at the University of Canterbury in the New
Zealand Institute for Language, Brain and Behaviour, especially Professor Jen
Hay for allowing me to be a part of such an amazing linguistic environment.
Thank you to Dr. Kevin Watson and Dr. Lynn Clark for giving me the
opportunity to work alongside you both.
A special thank you goes to Professor Colin Aitken, to whom I sent an email
containing a complicated statistics question in early 2012. If I could have
anticipated what was to follow, then I probably would have emailed sooner.
That email began what was to become an extremely rewarding exchange of
ideas at the interface of phonetics and statistics. Thank you for taking the time
to reply.
A thank you is also extended to those who have helped me along the way
through discussions, insight, and simply encouragement. Thank you to
Professor Anders Eriksson, Phil Harrison, Dr. Michael Jessen, Dr. Christin
Kirchhübel, Dr. Carmen Llamas, Professor Francis Nolan, Dr. Richard Ogden (for
all things non-pulmonic), and Lisa Roberts.
To all my office buddies over the years, I thank you for putting up with my
annoying finger tapping (as I counted syllables for AR) and the clicking sounds I
made (trying to determine place of articulation), the discussions we have had,
and simply for keeping me from talking to myself: Natalie Fecher, Jessica
Wormald, Becky Taylor, and Rana Alhussein Almbark.
A thank you must also go to Vincent Hughes for all our LR discussions. Without
a fellow LR-researcher this thesis just would not be the same. A thank you also
goes to the Forensic Research Group in the Language and Linguistic Science
Department at the University of York for providing a safe and nurturing
environment in which to exchange ideas.
Now to those behind the scenes. I must thank my family - Mom, Dad, and Lauren
for your unequivocal love and support. You have always encouraged me to
follow my interests, so much so that your encouragement has never faltered
even with me being 6,000 miles away from home. And sorry for all those middle
of the night phone calls when I miscalculated the time difference.
To Tom, my best friend and biggest supporter, you know the details of this PhD
almost as well as I do. Thank you for being my shoulder to cry on when things
got stressful, lending your ear as I discussed a long day's worth of work,
providing unyielding patience, and your continued support and unconditional
love.
Finally, despite my love for phonetics and linguistics, the work reported in this
thesis would not have been possible without the financial support of the
Bayesian Biometrics for Forensics (BBfor2) Network, funded by Marie Curie
Actions (EC Grant Agreement No. PITN-GA-2009-238803). Without the
generous support of this funding body, this thesis would not have been possible.
Declaration
This is to certify that this thesis comprises original work and that all
contributions from external sources are acknowledged by explicit references.
I also declare that aspects of the research have been previously published or
submitted to journals and conference proceedings. These publications are as
follows:
Signed:
Erica Gold
'Not everything that can be counted counts, and not everything that counts can be counted.'
Chapter 1 Introduction
The research presented in this thesis explores the calculation of numerical
likelihood ratios using both phonetic and linguistic features. Articulation rate,
ingressive plosive sounds) are analyzed with the purpose of considering intra-
contribution of the thesis in the field of forensic speech science, and provides a
used in forensic speaker comparison cases. The research aims are then
task carried out by forensic phoneticians (Foulkes and French, 2012, p. 558),
and the majority of research in the field of forensic speech science (hereafter
FSS) is oriented towards this task. FSC is also referred to by other terms such as
voice comparison (Rose, 2002; Rose and Morrison, 2009). However, the term
comparison is preferred in this thesis for two reasons. Firstly, it is not possible
conclusion framework, given that there is always, to some extent, within-
under the hypothesis that the samples came from the same person, versus the
probability of obtaining the evidence under the hypothesis that two different
speakers produced the criminal and suspect samples. The term speaker is
preferred over voice in this thesis as not all parameters examined in FSC work
are products of just the voice per se. The manifestation of speech parameters
can also be a reflection of the social and psychological mind-set of the individual
(e.g. French et al., 2010). Therefore, forensic speaker comparison is the term
the criminal recording to also contain other sounds associated with the crime
taking place. In the UK, the suspect sample is usually a recording of a police
interview (Nolan, 1983; Rose, 2002) with the suspect. The objective of the
hypothesis that the samples came from the same person, versus the probability
under the hypothesis that two different speakers produced the criminal and
include binary decisions (either the two speakers are the same person or they
between the criminal and suspect; Broeders, 1999), the UK Position Statement
ratios (LR, either verbal or numerical, expressing the likelihood of finding the
2009b; 2009c). There has recently, however, been a strong promotion of the
2 Other researchers have used ASR to mean automatic speech recognition (e.g. Goel, 2000).
However, ASR is used in this thesis to mean automatic speaker recognition.
The use of the numerical LR in research literature has been given
increasing attention, starting with Rose (1999). However, the use of the
numerical LR in courts in cases involving FSCs is rare (see Rose, 2012; 2013).
Many reasons exist for the limited representation of the numerical LR in court, a
number of which are addressed in French et al. (2010). However, the key
calculated with regard to the (very) few parameters for which there are
available population statistics, French et al. (2010, p. 149) argue that one runs
the risk of producing an opinion that could lead to a miscarriage of justice. This
is due to the fact that the analysis would fail to consider a large number of other
available parameters for which there are no population statistics. In turn, this
could impact the conclusion the expert would arrive at in a FSC case.
The motivation for this thesis stems directly from the difficulties and
for the field of forensic speech science to align itself with other more developed
focus simply upon the reasons for or against the implementation of a numerical
LR. However, far less empirical work has been carried out to examine the
study serves as an exercise in calculating numerical LRs for speech data. The
data are derived from the speech of a homogeneous group of speakers, while
be implemented in a real FSC case, should the analyst choose to utilize a
numerical LR framework.
The primary aims of this thesis are threefold. The first aim is to provide
the field of FSS with a comprehensive summary of current FSC practices used
around the world, which has never been previously available. The survey will
provide details regarding the types of analysis that are used and their
they adopt. The survey will also serve as the primary motivation for the
selection of the four phonetic and linguistic parameters examined in this thesis.
population will nonetheless provide the field of forensic phonetics with useful
information that can further inform FSC casework. The analysis focuses on
parameters that can be considered under a numerical LR. Through this analysis,
speakers. As such, this research will address the arguments set forth in French
The third, and final, aim of the present body of work is to take the
examining potential correlations that exist between and within parameters, and
evidence. Numerical LRs are then calculated, and strength of evidence and the
framework (in part or full) for FSCs. Most importantly, this work aims to
casework.
called 'paradigm shift' in forensic science, changes in the law, Bayes' Theorem,
FSCs, the use of likelihood ratios (LRs) in FSCs, and a discussion of the
limitations relating to these topics. The research questions for this thesis that
have arisen from previous research reviewed in this chapter are also presented.
Chapter 3 reports on the construction, contents, and results of the first
plosives).
chapter investigates the effects that the package length (time intervals) of
tokens may have on results. Finally, LRs are calculated for individual formants
as well as in combination. The levels of discrimination are presented, along with
on F0 as well as the external factors known to affect F0. Population statistics are
offered for F0, and the effects of the package length of tokens are investigated
for potential differences in the discriminant results for F0. The chapter
concludes with the calculation of LRs, and examines the levels of discrimination,
limitation of not being able to calculate LRs for click rate, due to the lack of
rate.
Chapters 4-7 are explored in Chapter 8. The chapter provides a summary of the
are calculated between all parameters as well as within parameters. These data
are used to inform the appropriate methods for the combination of parameters
in order to create a complete system (consisting of the four parameters
explored in this thesis). Overall likelihood ratios (OLRs) are calculated for the
combinations of LTFD, F0, and AR. The performance of the complete system (in
associated with the calculation of numerical LRs are discussed, as well as the
thesis, revisits the thesis aims and identifies opportunities and challenges that
Chapter 2 Literature Review
In this chapter, an overview is presented of the literature surrounding the
so-called 'paradigm shift' in forensics, changes in the law, Bayes' Theorem,
forensic speaker comparisons (FSCs), the use of likelihood ratios (LRs) in FSCs, and the
limitations and shortcomings surrounding these topics that have led to the
The term 'paradigm shift' was first introduced by Kuhn in 1962. A paradigm in
shift as a change in these basic assumptions within the ruling theory of science.
the center of the universe rather than the Earth (Kuhn, 1962). Kuhn (1962)
argues that once a paradigm shift is complete, a scientist is unable to reject the
new paradigm in favor of the old one. As asserted by Kuhn (1962), paradigms
In 2005, Saks and Koehler wrote a review entitled The Coming Paradigm
and Koehler, 2005, p. 895). They begin their review by describing the state of
traditional forensic science that follows a frequentist view of evaluation,
that two evidentially-relevant traces, e.g. recorded speech samples, were made
others in the world (Saks and Koehler, 2005, p. 892). Linking evidence to a
objects will always be different. Therefore when two pieces of evidence are
being compared that are not observably different, an expert will conclude that
they were made by the same object or person (Saks and Koehler, 2005). The
given population) has failed to be taken into account (i.e. they reject the
uniqueness assumption).
Saks and Koehler (2005) reveal that in the decade leading up to their
review, many people had been falsely convicted of serious crimes, only to be
later exonerated by DNA evidence that had not been previously tested at the
time of the trial. The authors state that erroneous convictions sometimes occur,
convictions), it was found that 63% were due in some part to erroneous forensic
science expert testimony (Saks and Koehler, 2005, p. 892). This was the second
identifications4. The authors explicitly state that the criticism does not apply to
for other forensic science disciplines. The reasoning behind the statement is
that DNA typing follows three main principles: (1) the technology [is] an
application of knowledge derived from core scientific principles, (2) the courts
and scientists [can] scrutinize applications of the technology, and (3) it offers
emulate the approach taken by DNA typing, whereby the courts are provided
explicitly stating it, Saks and Koehler are essentially arguing for the adoption of
the likelihood ratio (LR; 1.1.1) as the medium for presenting conclusions to
the trier(s) of fact (e.g. judge, jury). They are arguing for forensic science to
take on such an approach, Saks and Koehler (2005, p. 892) recommend that
forensic scientists will need to work closely with experts in other fields to
In recent years, following the paper by Saks and Koehler (2005), there
have also been calls for improvements in the quality of forensic evidence by a
number of legal and government bodies. It has been argued that all areas of
4 Multiple factors were considered in each false conviction. Therefore, while erroneous forensic
evidence was a contributing factor in 63% of the false convictions, erroneous eyewitness
testimony was the only other factor contributing to more false convictions (71%; Saks and
Koehler, 2005, p. 892).
forensic science need to be more transparent, that forensic examinations should
Commission of England & Wales, 2011). These calls for changes to forensic
evaluation were made for the same reasons that Saks and Koehler (2005)
alluded to with respect to false convictions being made from poorly presented
forensic evidence as well as the changes that have occurred in the law.
A number of rulings made in the last century have significantly changed the face
United States. Starting in 1923, with the ruling of Frye v. United States, courts
moved away from accepting testimony from expert witnesses on the basis of the
methods not used by others in the same forensic discipline. The Frye ruling
(Frye v United States (293 F. 1013 D.C. Cir. [1923])) determined that expert
testimony was only admissible if the method of analysis used gained general
paragraph 5).
United States Supreme Court implemented a new ruling with regard to the
must demonstrate that it can stand on a dependable (i.e. tested) foundation. The
ruling challenged those in the field of forensics to show that the forensic
method in question had been tested, that its error rate has been established,
and this error rate was acceptably low. The Daubert ruling has since been
reliable. The ruling by the United States Supreme Court was intended to lower
would have previously been considered inadmissible under the Frye ruling. At
the same time, Daubert was meant to raise the threshold for long-established
Daubert ruling was brought to light. The case in question included handwriting
This decision was made even though the field of handwriting analysis had
loophole in Daubert the handwriting evidence was not excluded. The reason
given was that since the methods used to collect evidence were found to have no
scientific basis, Daubert did not apply to handwriting identification as it was not
It was not until 1999, in the case of Kumho Tire v. Carmichael, that the
United States Supreme Court directly addressed whether or not Daubert applied
organizations in which they argued that the majority of the expert testimony
that they offered did not include scientific theories, methodologies, techniques,
toolmarks, bullets, and shell casings, and bloodstain pattern identification (Brief
Ironically, the practitioners that were initially lobbying for their expertise to be
admissible on scientific grounds were now denying that they were a science.
States Supreme Court ruled in Kumho Tire that all expert testimony would be
be admissible in court.
Although the rulings described in this section pertain to law in the United
States, these rulings have had a large impact on the legislation in other
the basis of the qualifications of the expert testifying rather than the methods
matter without the assistance of the witness possessing special knowledge or
experience in the area and when (ii) the subject matter of the opinion forms
intermediate between the Frye and Daubert rulings, where the Bonython ruling
encompasses the Frye ruling and includes the expectation that the testimony is
reliable (again, this is usually satisfied with reference to the expert's academic
pedigree and the lack of previous miscarriages of justice in relation to the given
expert testimony).
Committee (2009) and the Law Commission of England & Wales (2011) have
also been influenced by such U.S. rulings, and have urged forensic sciences to
make changes to their current practices that would align them more closely
with measures set out in Daubert. With respect to the changes in legislature, the
developments in presenting DNA typing, the paradigm shift, and the calls made
Bayes' Theorem was first proposed by the Reverend Thomas Bayes in the 1740s, then
updated and published by Richard Price (Bayes and Price, 1763), and later the
Laplace5 (Laplace, 1781). Laplace is in reality the man who turned Bayes
around the world (Bertsch McGrayne, 2012). Bayes' Theorem was created as a
way in which to update one's beliefs. The theorem has three central
components: the posterior odds, the prior odds, and the likelihood ratio (LR), as
shown in (1):
\[
\frac{p(H_p \mid E)}{p(H_d \mid E)} = \frac{p(H_p)}{p(H_d)} \times \frac{p(E \mid H_p)}{p(E \mid H_d)} \tag{1}
\]

where Hp is the prosecution hypothesis (e.g. the criminal and suspect are the same person) and Hd is the
defense hypothesis (e.g. the criminal and suspect are different people). The E in
the equation represents the evidence. Bayes' Theorem proposes that the posterior
odds (the probability of the prosecution hypothesis
5 The term 'updated' is used here to mean that a prior probability can be adjusted/modified by
taking into account any new evidence or observation (e.g. likelihood ratio(s) in forensics) to
arrive at a posterior probability (see Equation 1).
being correct given the evidence divided by the probability of the defense
hypothesis being correct, given the evidence) is equal to the prior odds (the
probability of the defense hypothesis being correct; see 2.3.2 for an example)
(Aitken and Taroni, 2004) or what is also referred to as the strength of evidence
same results on the basis of the defense hypothesis. The LR is the only portion
opinion. The opinion that the expert provides on the strength of evidence is
constitutes one part of the Bayesian framework), as that would imply the
hypothesis (the evidence being from the same person/object), while the
has come from a different person/object). Ideally, the defense hypothesis would
be set by the defense (Champod and Meuwly, 2000); however, this is rarely
done and the responsibility usually falls to the expert, who typically renders the
probability (Champod and Meuwly, 2000, p. 195). In the LR equation, when the
numerator presents a greater value than the denominator, there is support for
the prosecution hypothesis, and when the denominator is greater than the
of the resulting LR, or rather the distance of the resulting LR from 1. Therefore,
an LR of 100 means that the probability of the evidence (given the competing
prosecution and defense hypotheses) is 100 times more likely to have been
obtained/to have come from the suspect than someone else in the population. If
in the same case the LR was 1/100, then the evidence is 100 times more likely
to have been obtained/come from someone in the population other than the
defense hypothesis can take a value between 0 and 1 (inclusive), while the LR
can take a value between 0 and ∞ (Aitken and Taroni, 2004). Due to the fact that
translation, which makes it easier for the trier(s) of fact (e.g. judge, jury) to
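As a brief worked illustration of this logarithmic translation (the values of 100 and 1/100 are the illustrative values used above, not results from this thesis):

\[
\log_{10}(100) = 2, \qquad \log_{10}\!\left(\tfrac{1}{100}\right) = -2, \qquad \log_{10}(1) = 0
\]

On the log10 scale, equally strong support for the prosecution and defense hypotheses is therefore expressed by values of equal magnitude but opposite sign, with 0 marking the point at which the evidence supports neither hypothesis over the other.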
The combined LRs from multiple pieces of evidence have been referred
probability following Naïve Bayes (Kononenko, 1990; Hand and Yu, 2001).
evidence, (1) the use of an LR algorithm that can handle correlation through
and Taroni, 2004), (2) Bayesian networking that will account for correlations by
weightings (Aitken and Taroni, 2004), or (3) a solution proposed in the field of
accounts for correlations in resulting LRs and then applies statistical weightings
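As a sketch of the Naïve Bayes style of combination referred to above (assuming, purely for illustration, that the individual parameters are mutually independent), the overall likelihood ratio (OLR) is simply the product of the constituent LRs, or equivalently the sum of their log10 values:

\[
OLR = \prod_{i=1}^{n} LR_i, \qquad \log_{10} OLR = \sum_{i=1}^{n} \log_{10} LR_i
\]

For example, hypothetical LRs of 10, 5, and 0.5 for three independent parameters would combine to an OLR of 25. The correlation-handling options listed above are needed precisely because this independence assumption rarely holds for speech parameters.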
the evidence given the prosecution hypothesis than it is to obtain the evidence
given the defense hypothesis. A verbal LR will not include any numbers in its
evidence. For example, it is more probable that one would obtain the
evidence given the prosecution hypothesis than it would be to obtain the
introduced; they are then updated by the new information, which results in a
posterior probability (Aitken and Taroni, 2004). For example, suppose a crime
known to be one of the 101 inhabiting the island, then the prior odds of the
suspect being the criminal is 1/100. 6 These prior odds will then be updated by
the trier(s) of fact as new evidence is presented throughout the case. The prior
odds are a key factor in the separation between a frequentist way (see 1.1) of
When the prior odds are used in research a numerical value is typically
given, and to incorporate those odds into a Bayesian framework it only requires
simple multiplication with the likelihood ratio. In practice, the prior odds can be
problematic. Robertson and Vignaux (1995, p. 19) state that this is especially
true since very large or very small prior odds can give some very startling
If that same case had prior odds of 1/1000, multiplied by the same LR of 4 (e.g. a
Hp of 8 divided by a Hd of 2 is equal to 4), the posterior odds would then be
0.004 (i.e. in favor of the defense hypothesis). Significant (or even small)
changes of the prior odds can dramatically change the posterior odds. This is
demonstrated in the case above, where large prior odds cause the posterior
odds to be in favor of the prosecution hypothesis, and much smaller prior odds
remaining constant). Prior odds can also be problematic in practice, given that
Bayes' Theorem assigns the responsibility of establishing the prior odds to the
trier(s) of fact. This means that typically the jurors are held to be responsible for
In Equation 1, the posterior odds are the results of the prior odds after
being updated by the LR. Numerically, the posterior odds are the multiplication
of the prior odds by the LR (Aitken and Taroni, 2004). Deriving the posterior
odds, as with the prior odds, is the responsibility of the trier(s) of fact
the posterior odds by considering the prior odds they had initially established,
Neither the likelihood ratio nor the prior odds on their own constitute a
Bayesian probability; rather, it is the value of the posterior odds that equates to
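Before turning to the fallacies, the update described in Equation 1 can be restated compactly. Using the figures already given in 2.3.2 (the pairing of the island prior of 1/100 with the LR of 4 is added here purely for comparison and is not part of the original example):

\[
\text{posterior odds} = \text{prior odds} \times LR
\]
\[
\tfrac{1}{1000} \times 4 = 0.004, \qquad \tfrac{1}{100} \times 4 = 0.04
\]

In both cases the posterior odds remain well below 1, i.e. they favor the defense hypothesis; with prior odds of 1/1000, an LR greater than 1,000 would be required before the posterior odds favored the prosecution hypothesis.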
2.3.4 Logical Fallacies
implementation of the Bayesian framework. The first is for the forensic expert
to report posterior odds, since the expert does not typically have access to the
prior odds, as they are generally set by the trier(s) of fact and not an expert
(Rose and Morrison, 2009). Even if the expert were to have access to them, the
prior odds will vary in accordance with individuals' personal beliefs about the
cases, and the beliefs are subject to natural bias. This fallacy of presenting
posterior odds, as committed by a forensic expert, also means that the expert is
taking on the role of trier of fact (Robertson and Vignaux, 1995), which in fact
infringes on what has been called the ultimate issue. The ultimate issue in law
is the decision about the guilt or innocence of a suspect by the trier(s) of fact. If
the suspect made the shoe mark, the expert then places himself/herself in the
role of decision maker, rather than an objective party presenting facts relating
to the case (see Joseph Crosfield & Sons v. Techno-Chemical Laboratories Ltd.).
Lucy, 2005) or the inversion fallacy (Kaye, 1993). The prosecutor's fallacy
interchanged with the probability of the hypothesis given the evidence (Lucy,
2005). The inversion of the probabilities gives undue weight to the prosecution
hypothesis by assuming that the prior odds of a random match (or two pieces of
hypothesis. For example, an expert states that there is a 10% chance the suspect
would have the crime blood type if he were innocent. Thus there is a 90%
piece of evidence without attention to any associated value (e.g. prior odds). For
which the suspect comes from. The defense argues that the DNA profile would
very little value. On the contrary, cutting the total population from 1 in 10,000
to 1 in 100 means that 9,900 people are being excluded, and also it is highly
unlikely that all of the 100 individuals are equally likely to be the criminal (Evett
The three fallacies presented in this section are all flawed in a logical
sense as the expert takes on responsibility that is not his/hers (e.g. presenting
occur. For other errors in the interpretation of Bayes' Theorem, see Aitken and
2.4 Forensic Speaker Comparison
Given the current paradigm shift and the logically and legally correct framework
that Bayes' Theorem offers, practitioners in the field of FSC are making the
forensics (i.e. those using a Bayesian framework, for example, DNA). Acceptance
of the forensic paradigm shift in FSC has already been acknowledged and
embraced (French et al., 2010). However, the ease with which an LR approach
can be adopted is an issue in itself. This is largely due to the challenges that
speech data in the forensic context present. This section builds upon the
parameters that are commonly analyzed in FSCs, and the way in which FSC
conclusions are currently framed in the UK. This section will situate the
demonstrating the current state of the field as it attempts to align itself with
forensic disciplines.
scientists (Foulkes and French, 2012). The task of the expert is to provide expert
opinion on the speech evidence to the trier(s) of fact. The expert opinion in a
samples) under the hypothesis that the samples came from the same person,
versus the probability of obtaining the evidence (the typicality of the analyzed
speech parameters) under the hypothesis that two different speakers produced
in a FSC are varied, unlike DNA where the same techniques are routinely
applied across cases. That is to say that the methodologies involved in FSCs
this are twofold: (1) speech data is complex in nature (confounding factors are
often present), and (2) there is no single speech parameter that is omnipresent
reference to the simple fact that no two speech utterances produced even by the
same speaker are ever identical. It is this intra-speaker variation that sets
forensic speech science apart from some other forensic disciplines. DNA is an
example of forensic evidence where the criminal and suspect samples can be
identical. For speech, unlike DNA, it will never be the case that the probability of
factors (e.g. the interlocutor, illness, speaking style, intoxication); however, the
(e.g. vocal tract length, the rate at which vocal cords vibrate), as well as
phonological and social factors (e.g. age, sex, class; Chambers, 2005; Eckert,
2000; Wardhaugh, 2006). It is also possible for these variables to interact with
one another, whereby their effects are manifested in the speech of individuals
differently.
under a combined auditory and acoustic phonetic analysis (French and Stevens,
phonological, and social factors. The relationships that exist in the data when
account in the evidence (as is also the case for other forensic sciences). This is
combination of both (e.g. discrete at one level and continuous at another for the
assumption of normality is not always advised as it can lead to miscarriages of
similar to the rest of the population). Algorithms used for calculating LRs (those
typicality on the basis of data existing at all areas under the normal distribution
curve. If a normal distribution curve does not accurately describe the data, the
complex is the inevitable reality that speech recordings made under forensic
2001), the distance (of the speaker) from the microphone (Vermeulen, 2009),
and cellular phone audio recording codecs (Gold, 2009) have been shown to
ratios and high levels of background noise and/or overlapping speech. For this
reason, the task of extraction of the necessary parameters is made more difficult
they are typically not as severe (or not present at all) in direct and high-quality
recordings made in police interviews (e.g. a recording of a suspect) or other
comparable situations.
Given that the suspect recording in a FSC case typically comes from a
police interview, there is often a mismatch in the conditions under which the
criminal and suspect samples are elicited. The criminal recording is frequently
made in situations that involve high emotional states, physical activity, or the
influence of drugs or alcohol. They also tend to be short in duration and limited
speakers has proved fruitless since research began in the field of FSS. This
should not come as a surprise, given the inherent complexity of speech data, as
outlined in 2.4.1. Research has shown that the vocal tract is highly plastic,
parameter that makes one speaker different does not necessarily make another
speaker different, and that parameters that make a speaker differ can vary over
phonetic and linguistic parameters that have ideal characteristics for FSC. These
7 'Shibboleth' is used here to refer to a single identifiable parameter that can discriminate
between all speakers.
1. High between-speaker variability: The parameter should show a high
transmission technologies
prohibitively difficult
Criteria 3-6 are specifically concerned with practical issues that arise in
given to work with, while criteria 5 and 6 are associated with the recording and
criteria suggested by Nolan (1983) identify the true difficulty of the FSC task.
That is, the expert has to identify and examine phonetic and linguistic
parameters that have high inter-speaker variation, but also low intra-speaker
phonetic and linguistic parameters for analysis in FSCs, the most obvious
question is:
Before questioning the proper (or logically and legally correct) framework in
FSC analysis itself. Only after establishing the general expectations of the FSS
frameworks. For without any analyzed forensic speech evidence, there can be
no valid conclusion.
Position Statement that was introduced in 2007. The UK Position Statement was
expressed in forensic speaker comparison cases (French and Harrison, 2007, p.
137). The UK Position Statement stemmed from the ruling of the Appeal Court of
England and Wales in R v. Doheny and Adams (1996), which showed that the
interpretation of the DNA evidence at the initial trial had been flawed by the
Position Statement signified a shift in the role of the forensic phonetician when
suggests that experts in the past were often trying to identify speakers (French
and Harrison, 2007, p. 138). However, under their new approach an expert
would not be making identifications per se. Instead, the expert will take on a
whether the voice in the questioned recordings fits the description of the
suspect (French and Harrison, 2007, p. 138). The UK Position Statement was
also proposed with the intention of aligning the field of FSC with more modern
assessment. The conclusion framework set out in French and Harrison (2007)
potentially involves a two-part decision. The first part concerns the assessment
of whether the samples are consistent with having been spoken by the same
person. The second part, which only comes into play if there is a positive
distinctive the combination of features that are common to the samples may be.
Figure 2.1: Illustration of the UK Position Statement (Rose and Morrison, 2009, p.141)
conclusion about consistency cannot be made, then the expert concludes with a
single evaluation (i.e. not consistent, or no decision). In the event that the expert
finds the two speech samples to be consistent, s/he will then assess the
distinctiveness. The degree of distinctiveness is made on a five-point
the experience of the expert so that he/she can provide a statement of the
is not providing a single probability of the hypothesis (e.g. the speaker in the
criminal sample is likely to be person X), but not quite meeting the logical
the consistency and distinctiveness account for both the similarity and the
typicality of the speech recordings. However, the inner workings of the Position
Statement do not hold true to the logical framework of an LR. There are two
main reasons for this mismatch: (1) assessments are made on different scales,
and (2) there is no logical procedure for combining (and weighing) constituent
142). Although Rose and Morrison acknowledge that the decision about
point scale). This is due to the fact that the judgment is wholly categorical, and
the ternary decision cannot be intuitively placed on a scale. A scale would imply
some degree of hierarchy, and it is difficult to argue that, for example, 'no-decision'
should be ranked before 'inconsistent' (or vice versa). Therefore, the
on a scale of one to five; however, this does not follow the same logic as the
relationship between the two assessments; instead they exist more as two
defined boundaries an expert is faced with a hard decision. So, for example, if a
criminal sample has an F0 mean of 115 Hz, while the population mean is 90 Hz,
is susceptible to the cliff-edge effect, the same can actually be said for the verbal
LR scale provided by Evett (1998). Although the verbal scale suggested by Evett
(1998) is associated with Log10 LRs, the cliff-edge effect can still occur for those
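A small numerical illustration of how the cliff-edge effect can arise on a log10 LR scale (both LR values and the band boundary at a log10 LR of 4 are invented here purely for illustration; they are not taken from Evett (1998) or from this thesis):

\[
\log_{10}(9\,500) \approx 3.98, \qquad \log_{10}(10\,500) \approx 4.02
\]

The two LRs differ by only about 10%, yet if a verbal band boundary happened to sit at a log10 LR of 4 they would attract different verbal descriptors, despite representing nearly identical strengths of evidence.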
expected to combine individual LRs for parameters that are mutually
weightings, s/he then runs the risk of over- or under-estimating the strength of
evidence (as s/he is essentially considering the same evidence multiple times).
strength of evidence.
Despite the disparity between the UK Position Statement and the LR, the
allows the expert to avoid the undesirable and lengthy task of collecting
populations that could ever be required for a FSC case. Secondly, the framework
allows the expert to avoid the difficulty of calculating numerical LRs for all
LR algorithms. These two attributes are technically part and parcel of the same
thing, as they together evaluate the denominator of the LR; however, the
is safe to argue that no LR algorithm could ever account for or encompass the
models for calculating LRs for all speech parameters. Through experience and
education, an expert is able to account for instances of accommodation, channel
mismatch, intoxication, emotional effects, and social factors. These factors tend
It is now generally accepted in forensics that the logically and legally correct
the output is a likelihood ratio (Saks & Koehler, 2005). The domain of forensic
speech science is no exception to this, and efforts have been made in the last
decade and a half (starting with Rose, 1999) to incorporate the LR into
numerator of the LR contains the probability of obtaining the evidence given the
hypothesis that the speech came from the same speaker, while the denominator
is the probability of obtaining the evidence given the hypothesis that the speech
divided by the probability obtained from the denominator, and the result is the
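Restated in the notation of Equation 1, the ratio described here is:

\[
LR = \frac{p(E \mid H_p)}{p(E \mid H_d)}
\]

where, in the FSC context, Hp is the hypothesis that the questioned and known samples were produced by the same speaker and Hd is the hypothesis that they were produced by different speakers.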
with only a single case example of the LR being used in FSC casework to date
(Rose, 2012; 2013). The research carried out has predominantly had two main
foci: (1) assessing speaker discrimination using a numerical LR and (2) overall
(Alderman, 2004; Rose, 2007a; Rose, 2010a; Rose and Winter, 2010; Zhang et
2009; Enzinger, 2010a; Kinoshita and Osanai, 2006; Morrison, 2009a; Rose et
al., 2006), and long-term formant distributions over vowel mixtures (Becker et
purposes has not been given the same amount of attention as vowel-based LR
research. Those studies that have been carried out on features other than
include fundamental frequency (F0), voice onset time (VOT), nasals, laterals,
and fricatives (Kavanagh, 2010; 2011; 2013; Kinoshita, 2002; 2005; Kinoshita et
al., 2009, Coe, 2012). Traditional FSC does not just involve the analysis of vowels
and the non-vocalic features listed above. For this reason, further empirical
parameters that have not had their discriminant ability tested. To date,
parameters have been selected for analysis based principally on their ease of
(2) If experts are to provide their opinions on the most helpful speaker
speaker discriminants?
useful discriminants is that such parameters will perform better than speech
parameters selected for analysis arbitrarily (simply because they are easily
(Kinoshita, 2001; Morrison, 2011; Zhang et al., 2008), exploring the issues
2010; Rose, 2006c; 2010b; Rose et al., 2004), identifying the relevant population
for the LR (Hughes, in progress; Hughes and Foulkes, 2012; Morrison et al.,
2012a; 2012b), exploring the amount of data that is preferred for the reference
(Morrison et al., 2010; Morrison, 2013; Rose 2010a; 2010b, 2013a; Rose et al.,
2004; Zhang et al., 2008), system calibration (Morrison, 2012; Morrison et al.,
2010; Morrison and Kinoshita, 2008), and measures of system validity and
research questions relating to the calculation of LRs in FSCs that would greatly
benefit from further empirical study and assistance from forensic statisticians.
the complexity of speech data and has opted for often convenient but erroneous
a single parameter alone is not sufficient. Given the large body of literature on
between speakers?
speakers?
One would hypothesize that adding ever more parameters would further
advance the task of FSC, since theory and research tells us that speakers are
The only publications that report the use of LRs for multiple speech
parameters in FSC casework are those of Rose (2012; 2013b) in connection with
Morgan Chase bank, asking to transfer $150 million from the Australian
Hong Kong. Before the close of business, the criminal called the bank asking for
confirmation of the details in the fax he had sent. When the Australian
$150 million, an investigation followed. A suspect was identified, and Rose was
asked to compare the recording of the fraudulent telephone call with recorded
telephone calls known to have been made by the suspect. The analysis and
report were produced five years prior to Rose's publications about it (2012;
2013b), so he presents the original analysis that was carried out as well as a
In the original analysis, he identified many tokens of the word 'yes' in both
the criminal and suspect recordings, as well as the utterance 'not too bad' in the
criminal recording and multiple occurrences of the same phrase in the suspect
recordings. Therefore, the majority of the analysis and the resulting LR were
to establish the typicality of the criminal's speech, Rose defined the relevant
2013b, p. 284). He then collected relevant speech samples from 35 adult males,
designated time-points, the fundamental frequency (F0) in 'not too bad' taken
tones in 'not too bad', formant measurements of the vowels in 'not too bad', and
the frequency cut-off in /s/ from the word 'yes' (Rose, 2013b). After
some degree of correlation had to exist between the parameters. For this
reason, parameters that were assumed to have some degree of correlation with
one another (e.g. formant measurements for certain vowels) were thrown out,
Five years after the conclusion of the case, Rose provided a critique of the
number of developments made in the field since the R-v-Hufnagl case that could
have made a significant difference in his analysis. These include vowel (and
quantification of accuracy and precision (validity and reliability; e.g. Cllr and
between-parameter correlations for calculating OLRs (e.g. fusion). If any of
can confidently be said that the strength of the numerical LR would not be
identical to that presented in Rose (2013b; also shown through his reanalysis of
the case material); most likely, the strength of the LR would weaken as
The final portion of Rose (2013b) commented upon the court's reaction
something rarely discussed in forensic phonetics. The expert testimony did not
condensed his analysis into two main points for the jury, which he emphasized
on multiple occasions: (1) the LR is for estimating the strength of the evidence
and not the probability that the suspect is the criminal, and (2) the jury should
not give much weight to the specific value allocated to the LR, just that it was
'very big'. Whether Rose's testimony made an impression on the triers of fact in
R-v-Hufnagl is unknown. However, the jury did return a guilty verdict (Rose,
2013b). Rose also notes that it was perhaps vital to his testimony that the judge
was encouraging towards his approach and that this helped him (Rose) to
articulate to the court the strength of the speech evidence. It can be assumed
that not all judges would act in the same manner, and presenting the same
Overall, it is encouraging to see an example of a real case in which a
nice backdrop to the case and the type of speech material Rose chose to analyze.
The critique at the end of the paper is a positive contribution, as it shows how
the field has evolved in the past five years since the case analysis was
completed. The paper also shines light on the reception of the LR in a court,
which again often goes without attention in the literature. However, the paper
perhaps brings up more questions (both theoretical and practical) about the
instance, how does an expert begin to select parameters for analysis under an
LR framework? How can an expert argue why s/he has selected certain
relevant to the remainder of this thesis. The first is that real-world cases are
(Rose, 2013b, p. 318). This means that the LR calculation is not the same in
every FSC case, or for every phonetic/linguistic parameter selected for analysis.
case-to-case basis. The second statement asserted by Rose is that FSC might
lend itself more readily to a verbal LR over a numerical LR8. The reason for this
8 A verbal LR is simply a verbal, rather than numerical, statement of the probability of obtaining
the evidence given the prosecution hypothesis over the probability of obtaining the evidence
is that precise figures may be misleading in that numerical LRs may be difficult
for the trier(s) of fact to interpret9 (Rose, 2013b, p. 305). The final statement
comes from Judge Hodgson (2002) but is reiterated by Rose (2013b): since not
able to quantify the exact tongue shape of a speaker's //? In this instance, a
one that is not completely transparent in its description. Should these types of
case that other phonetic-linguistic parameters can be made to fit the mold in the
the speech evidence is feasible, given that numerical LRs cannot currently be
algorithms and/or the qualitative nature of certain parameters), and the lack of
number of limitations and difficulties that can occur when applying the
given the defense hypothesis. For example, the verbal statement could be presented as 'it is
extremely more probable to obtain the given evidence under hypothesis x than y'.
9 For example, is there really much of a difference between an LR of 1.1 × 10^14 and an LR of
1.11 × 10^14?
numerical LR framework to FSCs, which are largely due to the complexity of
speech data. If the field is to continue in its efforts to align itself with more
already adopted the LR framework (e.g. DNA), various aspects of the actual
(4) For this reason, it is essential to ask: What are the practical
This chapter has presented a series of research questions that have been
motivated by the prior literature and existing legal rulings with regard to
forensic evidence, while further developing the research aims of the thesis. This
section reiterates the research questions identified in this chapter, which will be
(1) What phonetic and linguistic parameters do practicing forensic
discriminant?
speaker discriminants?
Chapter 3 International Survey of
Forensic Speaker Comparison
Practices
3.1 Introduction
(i) For the first time, to make available to the wider forensic, legal, and
(ii) To draw upon the very considerable collective experience of FSC experts
that are considered to have the greatest potential for discriminating between
individuals.
It will become apparent from the results presented below that there is a great
for expressing the conclusions that arise from the comparisons. Some of the
differences found are, undoubtedly, a function of the rules, regulations and laws
context of the constraints on the admissibility of expert evidence in different
science.
3.1.1 Background
While research has been carried out on many facets of FSS and FSCs,
there has not been any research that has comprehensively surveyed the FSC
practices employed by experts around the world. The extent to which the FSS
community had been aware of commonly used FSC practices has been limited to
FSC case. The objective of the exercise was not to survey practitioners, but
rather to observe and assess basic methods that participants chose to employ in
some of the basic methods involved in a FSC case, which were confined to: the
provide the field of FSS with a body of information relating to FSC methods, but
did create interest and a platform on which to conduct further research into the
Hollien and Majewski (2009) discuss the prevalence of inconsistencies in
FSS practices with particular attention to FSCs, like those reported by Cambier-
Langeveld (2007). The authors suggest that the field of FSS lacks any real
consensus in terms of procedures and methods for FSC cases. They argue that it
determine the identity of a speaker, and that without any standards or common
(and that other experts could adopt should they wish to), which includes
confidence levels of their judgments. Despite their efforts to offer their own
standard and protocol for FSCs, the authors fail to acknowledge alternative
across the globe. I would argue that an understanding of the current state of the
For a field that came to fruition in the late 1980s/early 1990s (the time
at which acoustic and auditory phonetic analysis began regularly being used in
the UK courts, at least (French, p.c.)), little has been done to unify and
standardize the field over this time. While Cambier-Langeveld (2007) and
Hollien and Majewski (2009) argue that there is a lack of consensus in the field
of FSC and that standards are almost non-existent, I would suggest that the only
way to remedy such a fault is first to assess the current methodologies and
the world.
3.2 The Survey
participants were kept anonymous and also given the option of answering some
3.2.1 SurveyGizmo
surveys, collecting data and performing analysis. [The] tool supports a variety
selected as the medium for the survey over other similar websites for a number
of reasons: the server is secure, the package offers the ability to save and
enhanced privileges, and an excellent user interface for creating the survey. All
responses collected from the survey are saved on the SurveyGizmo server, and
where they gave their consent to participate in the survey and agreed to their
data being used in future research. After giving consent, participants were
structure. They were allowed to stop the survey at any time and save it, so that
this feature as the total time (including interruptions) it took most participants
3.3 Participants
Institutes, the National Institute for Standards and Technology (NIST) for those
agreed to participate, and data were collected from July 2010 through March
2011.
3.3.1 Countries
Australia, Austria, Brazil, China, Germany, Italy, the Netherlands, South Korea,
Spain, Sweden, Turkey, UK, and USA. Although the majority of the participants
were from Europe, a total of five continents were represented in the results.
3.3.2 Place of Work
3.3.3 Experience
18,221, ranging from 4 to 6,000, with a mean of 506. The respondents had a
Auditory Phonetic-cum-Acoustic Phonetic Analysis (AuPA+AcPA):
examinations.
one another, may be found, inter alia, in Baldwin and French (1990), Drygajlo
(2007), French (1994), French and Stevens (2013), Greenberg et al. (2010),
Table 3.1.
Table 3.1: Methods of analysis employed by country
Method       Countries
AcPA         Italy
AuPA+AcPA    Australia, Austria, Brazil, China, Germany, Netherlands, Spain, Turkey, UK, USA
HASR         Spain, Germany, South Korea, Sweden, USA
important vary from analyst to analyst within each of the method categories.
Currently, there is much debate in the field about the logical and legally
correct frameworks (French and Harrison, 2007; Rose and Morrison, 2009;
across the world. The conclusion frameworks may be grouped under the
following headings:
Binary Decision:
A two-way choice that either the criminal and suspect are the same
numerical one, and it may use such terms as 'likely'/'very likely' to be
the same (or different) speakers. These types of judgments are often
labelled as frequentist.
evidence in the USA have had a large impact on all fields of forensics
and generally fail to acknowledge the defense hypothesis (e.g. the
suspect).
basis of the prosecution hypothesis (that they come from the same
Aitken and Taroni (2004), Evett (1990), and Evett et al. (2000), as
they provide the essential building blocks for the proper assessment
UK Position Statement:
The first part concerns the assessment of whether the samples are
The second part, which only comes into play if there is a positive
inter alia, Broeders (2001), Champod and Evett (2000), French and Harrison
(2007), French et al. (2010), Jessen (2008), Morrison (2009c), and Rose and
Morrison (2009).
Table 3.3.
Table 3.3: Methods used for analysis in forensic speaker comparisons against conclusion
frameworks
As seen in Table 3.3, there is a tendency for participants using AuPA + AcPA to
frameworks.
countries appear more than once, as there were multiple respondents from the
frameworks.
A Likert scale was used to measure the respondents' level of satisfaction with
the conclusion method he/she used. Likert ratings were averaged across
satisfied). Table 3.5 reports the number of experts responding, mean scores,
3.5.1 Population Statistics
Out of all respondents, 70% reported that they use some form of
population statistics in arriving at their conclusions. 58% stated that they had
(used by almost all of the 70%), articulation rate, voice onset time, long term
3.6 Guidelines
each forensic speaker comparison case and if so whether they had been
involved in its design. 85% of respondents used some form of a protocol or set
Origin Number
Developed personally 6
Developed in conjunction with colleagues 17
Given it by place of work 3
3.7 Casework Analysis: Alone or in Conjunction with Others
Besides carrying out casework in their native language, 56% of experts stated
that they also conduct casework in other languages. Collectively, these experts
have worked on cases in over 40 different languages other than their own.
Of those who work with other languages, 94% require the assistance of a
This section reports on the aspects of recorded speech that respondents take
3.9.1.1 Segmental Features
their examinations. With regard to vowels, 81% invariably carried out some
form of analysis and 13% routinely did so. 94% of all experts evaluated the
auditory quality of vowels, 97% carried out some form of formant examinations
63% measured F1-F3, and 10% measured either F1 and F2 or F2 and F3. In
some form of examination; 52% invariably did so. For all experts, 88% of
energy loci.
answers using a 6-point Likert Scale ranging from 1 (never) to 6 (always). The
number of experts responding, mean Likert ratings, and standard deviations are
represented in Table 3.8 for those respondents who are native English speakers
measure, for those conducting some form of AcPA, 94% reported measuring the
mean, 41% the median, 34% the mode, 72% standard deviation, 25% the
alternative baseline (the value (in Hz) that falls 7.64% below the F0; Lindh,
frequency listed above, the most common combination for analysis was the
measure the mean, median, mode, standard deviation, and alternative baseline.
known as the varco (the standard deviation divided by the mean (Jessen et al.,
2005)), the first and third quartiles, and kurtosis/skew. It is important to note
frequency, a large proportion point out that it is usually of little help. One
94% of the respondents who include an AuPA stated that they examine
voice quality as part of their overall procedure, although only 77% of these
invariably or routinely examine it. Further to this, 61% of those who examine
version of such a scheme, for its description. Of those experts examining voice
quality the large majority (63%) reported using the Laver Voice Profile Analysis
analysis of voice quality and provide some form of a verbal description. The
Asthenia, Strain; see Bhuta et al. 2004 for more information) or a modified
version of it (13%), and a single expert (3%) reported using LTS spectra (long-
term spectra) for examining voice quality. Furthermore, three experts provided
voice quality can often be central to the analysis and is best analyzed
forensic speaker comparisons and the third expert comments that they are
increasingly of the view that voice quality is one of the most valuable but least
85% of all respondents stated that they examine intonation with one or
invariably. The specific aspects of intonation vary, with tonality11 being
reported more than tonicity12: 67% vs. 38% of respondents (Ladd, 1996, p. 10).
(e.g. speaking rate (SR) or articulation rate; Künzel, 1997). For formal measures,
compared to only 19% that use speaking rate and 16% that use both
articulation rate and speaking rate. Those using AR were asked how they
syllables for AR rather than linguistic ones. 73% stated that they examine
behaviors, patterns of code switching). 88% of all experts stated that they
11 Tonality marks one kind of unit of language activity and roughly where each such unit
begins and ends; one tone group is, as it were, one move in a speech act (Halliday, 1967, p. 30).
12 Tonicity marks the focal point of each such unit of activity: every move has one (major), or
one major and one minor, concentration point, shown by the location of the tonic syllable, the
start of the tonic (Halliday, 1967, p. 30).
3.9.2.2 Non-Linguistic Features
In addition to being asked about features within the linguistic, phonetic and
identify which feature from any domain they found most useful for
processes (e.g. connected speech processes) and fluency were all reported by
13% of the respondents. One respondent went as far as stating that vowel
participants alluded to the fact that despite some individual parameters having
13 It is perhaps better to classify tongue clicking as a linguistic feature when it is used in an
inherently functional way, such as that described in Chapter 7. However, at the time of the
survey, clicks were classified as non-linguistic.
3.11 Discussion
The purpose of this chapter has not been to advance any argument or to
develop theoretical propositions. Rather, its objective has been the much more
mundane one answering to the motivations set out in 3.1: laying out basic
day and drawing upon the collective expertise of FSC experts worldwide so as
to identify current working methods and features of speech that are considered
Those not directly involved in this specialist field but working, for
the lack of consensus over such fundamental matters as how speech samples
assigned greatest importance during the analytic process, and how conclusions
are to be expressed at the end of it. However, we are assured by those working
in various other fields of forensic science that the level of dissensus uncovered
Indeed, some of the practices and preferences found here are undoubtedly
dictated or constrained by the rules of the institutions and firms in which the
disciplines, the options, particularly for the framing of conclusions, may be laid
their auspices. For instance, the Dutch government forensic science facility, the
p. c.). Likewise, one of the participants in the present study who used a binary
Over and above the rules laid down by public and private sector
laboratories, nations may also set down requirements, either by statute or via
case law. Some jurisdictions, notably England and Wales in the UK,
very high degree of autonomy and discretion over the methods of analysis
he/she adopts and the way the outcomes are formulated. In respect of forensic
speaker comparison evidence, the England and Wales position was affirmed in
the Appeal Court ruling R -v- Robb (1991), in which the court ruled that
whether or not an expert used any acoustic testing was entirely his/her own
decision, and re-iterated in relation to the same issue in the more recent appeal
R -v- Flynn and St John (2008). Indeed, the main analytic issues over which the
higher courts have seen fit to pass down general prohibitions to forensic
experts concern the use of statistics in representing the strength of evidence (cf.
R -v- Doheny and Adams (1996); R -v- T (2011)). Where experts enjoy freedom
legal system, it is of note that all nine UK experts taking part in the study use the
combined AuPA + AcPA method. Indeed, this method is the predominant one
across all countries represented in the survey (25 = AuPA + AcPA; 10 = other
see Table 3.3). Thus, although the results show a wide range of variation in
As for the future, certain trends can be predicted. One is that as time
(semi)automatic systems and to their capabilities for handling real case (i.e.,
being incorporated into casework alongside the AuPA + AcPA approach. This
development would be particularly apposite in the USA, where the appeal court
ruling Daubert -v- Merrell Dow Pharmaceuticals Inc. (1993) is taken by many
lower courts as the benchmark for admissibility of expert evidence, and within
that ruling is the statement that the court ordinarily should consider the known
or potential error rate of the method. ASRs readily lend themselves to meeting
this criterion, and, indeed, many systems are subject to such testing as part of
ASRs have an LR as one of their easily selectable options for representing the
only half as many (7) use some form of LR (3 = verbal LR; 4 = numerical LR).
The increasing use of ASR software, together with the current paradigm shift
in forensic science towards Bayesian reasoning, and the use of LRs for
experts using LRs and a corresponding decrease in the use of other conclusion
frameworks.
what Saks and Koehler (2005) have called a "paradigm shift" in the evaluation
and presentation of evidence in the forensic sciences which deal with the
origin, e.g., DNA profiles, finger marks, hairs, fibres, glass fragments, tool marks,
that not all speech evidence is of the quantifiable type, as demonstrated in the
survey results from the present study. The Bayesian framework of likelihood
ratios has been adopted by many fields in the forensic disciplines where
There are elements of speech that are best described/analyzed qualitatively (i.e.
certain aspects of voice quality (e.g. lingual body orientation), lexical, syntactic,
quantified in some form, then it is plausible that we will one day see an entire
then there will still be experts who will continue to present such features in a
be predicted that more research on LRs will be carried out. This can be seen as a
experience (as reported in this survey), may now be tested empirically, and
features. Given this, experts and researchers in the field of forensic speech
found to be most discriminant through intrinsic14 and extrinsic15 likelihood
ratio testing.
Insofar as the present study lays bare that information, it may be considered to
3.12 Limitations
limitations. The first is the limited number of experts who took part in the
survey. Ideally, one would like to work with a larger sample size in order to
14 Intrinsic likelihood ratio testing uses the same set of speakers (e.g. from the same speech
corpus) for both the test and reference samples.
15 Extrinsic likelihood ratio testing uses different sets of speakers (e.g. from different speech
(Campbell et al., 2009), there are a number of experts around the world who
use ASR alone, and the survey results presented in this chapter fail to represent
this fact. Despite efforts to recruit participants that utilize ASR alone, no such
The final limitation is the simple fact that these results have a limited
shelf life: the field is always changing and evolving, and these results are
only a snapshot of it as it stood in 2010-2011. The
trends seen in the survey will certainly vary in the future as more research is
As stated in 1.3, this survey served in part as a hunting ground for identifying
parameters from the survey that experts found to be highly discriminant and/or
stops). The subsequent chapters analyze each of the four parameters in turn,
Chapter 4 Articulation Rate
4.1 Introduction
speech tempo and 73% of those doing so with varying regularity (Chapter 3). It
is also reported that when asked which parameters they found highly
they found speech tempo to be the most useful parameter for discriminating
speakers. Overall, speech tempo was ranked as the third most helpful
of individuals provides insight into the distribution of and variation within the
speaking rate (SR) or articulation rate (AR; Künzel, 1997). Both speaking and
articulation rate measure the speech tempo of an individual, but the two
measures capture slightly different aspects of tempo. Speaking rate (SR) can be
pauses, that are contained within the overall speaking-turn (Laver, 1994, p.
158). Articulation rate (AR) is defined as the rate at which a given utterance is
and ends with silence (Laver, 1994, p. 158). The difference between speaking
rate and articulation rate is that the former includes disfluencies and
and unfilled pauses. Within the field of forensic speech science the majority of
(100+ speakers) exist for German and Chinese, but as yet, there has not been a
similar study carried out on English. This study presents an analysis of the
ARs and standard deviations (SDs) of 100 Southern Standard British English
(SSBE) male speakers. The results concern both the inter-speaker and intra-
Goldman-Eisler (1956) was one of the first to analyze and calculate articulation
rate in a population. In her study, she examined the spontaneous speech of eight
by counting the number of syllables (the definition of the syllable and interval
interviewer to the next, which is usually occasioned by the subject having come
to a natural stop or pause (Goldman-Eisler, 1956, p. 137). It was found that the
mean AR across speakers was 4.95 syllables per second, ranging from 4.4 for
the slowest to 5.9 for the fastest speakers. The mean standard deviation across
speakers was 0.91 syllables per second (syll/sec), ranging from 0.54 to 1.48.
Southern British males in mock police interviews. Along with the AR and SR
Howard (2011) also provided baseline results for the speakers using interpause
syll/sec with a range of 5.14 to 7.00 and a standard deviation of 0.89 syll/sec
SR. However, this study focused on intra-speaker variation. It was observed that
and the same observations were made. It is noted that different speech tasks,
states, can cause differences in the pauses that speakers use (e.g. number, kind,
tasks. Articulation rate differed slightly across the different tasks, but it was
retested claims that intra-speaker variation was lower in AR than SR. He
For the experiment, the read and spontaneous speech of five males and five
females was analyzed, and SR was found to be higher in read speech than in
spontaneous speech. This is largely due to the fact that speakers use far fewer
hesitation pauses (i.e. filled and unfilled) in read speech than in spontaneous.
AR, on the other hand, was not significantly different between read and
that were smaller than they were with SR. To further evaluate the possible
of both intra- and inter-speaker differences. According to the equal error rates
than SR, investigators have begun to examine AR in more detail. In keeping with
Künzel's conclusion that AR "will have to be interpreted with caution when used
in forensic speaker recognition until its possibilities and limitations have been
assessed on the basis of genuine case material and large numbers of speakers"
(1997, p. 79), additional studies have been conducted in both German and
was measured for both spontaneous and read speech. It was found that, contra
Künzel (1997), the mean AR was significantly higher in read speech. Overall,
Jessen found the mean AR for the 100 speakers was 5.21 syll/sec. In order to
"memory stretches" (Jessen, 2007b, p. 53) were utilized rather than interpause
stretches and intonation phrases (Trouvain, 2004, p. 50), which had been
memory stretches as the phonetic expert going through the speech signal
can easily be retained in short-term memory. After listening several times, the
expert then counts the number of syllables that he/she is able to recall from
Cao and Wang (2011) followed the methodology of Jessen (2007b) and
examined the ARs for 101 male Mandarin Chinese speakers in spontaneous
in AR, and found both the global ARs (GAR) and means of local ARs (LARmean)
to be fairly normally distributed (GAR and LARmean are explained further in
4.3.2). The mean global articulation rate (GAR) was 6.58 syll/sec and the mean
of the local articulation rates (LARmean) was 6.66 syll/sec. They also report
that the range of AR for a given speaker is relatively small and stable. However,
studies. The authors attribute the difference to the simpler syllable structure
found in Chinese. Chinese syllables are largely /CV/ in shape; therefore more
syllables per second can be produced than is possible with the inherently longer
English speakers is 50, and many studies examined are based on considerably
Table 4.1: Overview of articulation rate studies
As can be seen in Table 4.1, a number of studies have examined AR for both
read and spontaneous speech, but to date only two small-scale studies on
British English have been carried out (Goldman-Eisler, 1956; Kirchhübel and
Howard, 2011). Combined, the results for ARs from these two studies in respect of
spontaneous speech have a mean rate of 5.29 syllables per second, with the
slowest mean at 4.81 syll/sec and the fastest at 5.88 syll/sec. It is important to
note that these figures are the result of studies of only a few varieties of English,
and how other varieties and dialects may pattern is unknown. The most recent
more than 1.00 syll/sec relative to Goldman-Eisler's (1956) study carried out
about 55 years earlier. It is hypothesized that the results of the present study
will pattern more closely with those of Kirchhübel and Howard (2011) than
articulation rate in a large, homogeneous group of 100 male speakers. This data
4.3.1 Data
The data for the current chapter as well as subsequent chapters come
speech corpus collected under simulated forensic conditions (de Jong et al.
and accent group. All speakers were recorded under both studio and telephone
recording conditions for Task 2 (see below), and under studio recording
conditions for a number of different speaking styles (i.e. Task 1, Task 3, and
Task 4). The DyViS recordings include four tasks identified below (adapted
given visual stimuli (e.g. pictures of people and places) to prompt the
number of target words from the visual stimuli that were elicited by the
interlocutor (i.e. the interlocutor asked the speaker specific questions in order
to elicit the target words). The second task in DyViS is a telephone conversation
between the speaker and his accomplice Robert Freeman (the interlocutor is
the same person for all 100 speakers). Task 2 is recorded from the studio end of
the telephone call as well as via an intercepted external BT landline. The second
task, like Task 1, is spontaneous speech; however, whereas in Task 1 the
interlocutor questions the speaker in a mock police interview, the interlocutor
for Task 2 elicits from the speaker the same target words as those used in the
police interview, which
Tasks 3 and 4 of DyViS are both forms of read speech. Task 3 consists of
a read news report pertaining to the alleged drug trafficking crime. The same
target words are included in the read report. Task 4 is read speech from
controlled sentences that have a large number of SSBE vowels in nuclear non-
The studies carried out in the remainder of this thesis will include only
data from either Task 1 or Task 2 (studio quality, spontaneous speech) of the
DyViS database. The current chapter uses Task 2 for calculating articulation
rate.
The DyViS studio recordings were all made using a Marantz PMD670
portable solid state recorder at a sampling rate of 44.1 kHz and 16 bit depth (de
Jong et al., 2007). All speakers were recorded via a Sennheiser ME64-K6
4.3.2 Methodology
The Task 2 recordings used for the current study were 15 to 25 minutes
lower boundary for memory stretches because it was the maximum number
of tokens that could be extracted from the shortest of the 100 recordings. The
(Künzel, 1997; Trouvain, 2004). As Jessen (2007b, p. 53) explains, the first
speaker of a language, one has a fairly reliable intuitive ability to count the
avoids the need to become involved in examining intensity peaks in the acoustic
production (Keating, 1988, cited in Jessen, 2007b, p. 53). For these reasons,
of the language, whereas a phonetic syllable is one that is manifested in
phonetic reality (Jessen, 2007b, p. 53). Jessen gives an example using the
phrase "did you eat yet?". Phonologically, we would count this as having four
rare cases even increased. If the phrase were to be reduced, it might be realized
as perhaps two syllables, as in "jeet yet" (Jessen, 2007b). For this reason, the use
obviously be lower than if the same phrase was counted as four phonological
syllables (see Jessen, 2007b, pp. 53-54 for further discussion). Jessen (2007b)
suggests that syllables are best defined phonetically, rather than phonologically.
homogeneous population with the same accent and that counting syllables
phonetically has been shown to cause "curious artefacts"17 (Jessen, 2007b, p.
54; Koreman, 2006), the current study is based on phonological syllables rather
The third methodological decision, and the one which is perhaps the
most influential on the results, involves the kind of speech interval that is
selected for determining AR. The AR can be calculated for the entire duration of
local ARs can be calculated (Jessen, 2007b, p. 54). Miller et al. (1984) showed
that speakers often change their speech tempo over the course of longer
occur within a single recording, it is more useful to obtain local ARs. Previously,
to identify speech intervals over which to calculate local ARs (Trouvain, 2004).
selecting speech intervals, a much simpler and more pragmatic approach was
chosen for this study. Interpause stretches tend to result in intervals that are
pauses are reliant on phonetic and linguistic judgments made by the analyst,
17 These artefacts occur when one speaker may be speaking quickly and as a consequence
deletes phonetic syllables, whereas another speaker is typically inclined to reduce or
completely delete the same number of phonetic sounds. This second speaker is, however, able
to preserve the number of underlying phonological syllables.
current study for computing local ARs is referred to as a "memory stretch"
(Jessen, 2007b, p. 54). After listening several times to that interval of speech,
the expert then counts the number of syllables that he/she is able to recall from
Sony Sound Forge Audio Studio 10.0 was used for analysis. Speech intervals
were only selected at least two minutes into the recording, to allow the speaker
were chosen, and the region marked out. After listening several times, I would
type out the speech phrase on the region marker tag. Following this, I would
count the number of syllables included in that interval. After collecting enough
memory stretches, it was possible to view all recorded regions that listed the
number of syllables and included the length of the speech segment. Those
figures were entered into Microsoft Excel and local and global ARs as well as
analyzed in this study. The mean and standard deviation of AR for each speaker
are used for analysis and are reported in 4.3.3 below. The maximum number
of syllables in a memory stretch was 26, but most stretches contained between
7 and 11 syllables (in order not to "push the limits"18 (Jessen, 2007b, p. 55), and
implemented by Jessen, four syllables or more per stretch were used. The
that each memory stretch consisted of only fluent speech (for speech intervals).
Fluent speech was defined as speech that did not include the following: any kind
any syllable lengthening (judged subjectively) that went beyond canonical non-
measured per speaker was approximately 30, with a standard deviation of 2.1, a
range of 26-32, and 2,993 total ARs calculated for the 100 speakers.
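Although the calculations themselves were carried out in Microsoft Excel, the arithmetic is simple enough to sketch in a few lines of code. The snippet below is only an illustration, assuming each speaker's memory stretches are stored as (syllable count, duration in seconds) pairs; the function and variable names are hypothetical rather than taken from the original workflow.

```python
# Sketch: local and global articulation rates from memory stretches.
# Each stretch is a (syllable_count, duration_in_seconds) pair.
from statistics import mean, stdev

def articulation_rates(stretches):
    """Return local ARs, their mean (LARmean), their SD, and the global AR (GAR)."""
    local_ars = [syllables / duration for syllables, duration in stretches]
    global_ar = sum(s for s, _ in stretches) / sum(d for _, d in stretches)
    return local_ars, mean(local_ars), stdev(local_ars), global_ar

# Example with three hypothetical stretches for one speaker:
stretches = [(9, 1.50), (11, 1.78), (7, 1.21)]
local_ars, lar_mean, lar_sd, gar = articulation_rates(stretches)
print(f"LARmean = {lar_mean:.2f} syll/sec, SD = {lar_sd:.2f}, GAR = {gar:.2f}")
```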
4.3.3 Results
The distributions of the local AR means and the standard deviations for
individuals are presented in Figures 4.1 and 4.2. The y-axis represents the
number of speakers that fall within a given range and the x-axis depicts
18 That is, in order to avoid trying to remember such an extensive interval of speech that
mistakes are made when trying to recall it, as this could potentially affect the resulting ARs.
[Figure 4.1: Mean articulation rates across the 100 speakers (y-axis: number of speakers)]
The mean AR for the population is 6.02 syll/sec, with an overall
range of mean AR from 4.57 to 7.79 syll/sec. The standard deviation of the
mean is 0.64 syll/sec. The 100 speakers have mean ARs within a 3.22 syll/sec
window.
The data were checked for two levels of outliers. This thesis defines
suspected outliers as values falling between 1.5 times and 3 times the
interquartile range (IQR) above the third quartile or below the first quartile.
Any values falling beyond the 3-times-IQR bounds are confirmed as definite or
extreme outliers in this thesis. The mean AR has six suspected outliers at 7.23
syll/sec, 7.24 syll/sec, 7.47 syll/sec (x2), 7.53 syll/sec, and 7.79 syll/sec.
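The two fence levels just described can be sketched as follows; this is a minimal illustration using numpy quartiles, with the per-speaker mean ARs assumed to be held in a simple list.

```python
# Sketch: flag suspected (between 1.5x and 3x IQR) and extreme (beyond 3x IQR) outliers.
import numpy as np

def classify_outliers(values):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    suspected = [v for v in values
                 if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)]
    extreme = [v for v in values if v < outer_lo or v > outer_hi]
    return suspected, extreme

# e.g. suspected, extreme = classify_outliers(mean_ars)  # mean_ars: 100 per-speaker means
```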
[Figure 4.2: Standard deviations for articulation rate (y-axis: number of speakers)]
distributed. The mean SD is 1.20 syll/sec, with a range of 0.72 to 3.95 syll/sec.
Those speakers who lie towards the left end of the x-axis are considered
relatively more consistent in their AR than those speakers who fall towards the
right end, who are characterized as having a more variable AR. The SDs of the
100 speakers lie within a range of 3.23 syll/sec, which is a larger range (by 0.01
syll/sec) than the range of means found for AR (see Figure 4.1). The SDs for AR
have three suspected outliers at 1.72 syll/sec, 1.77 syll/sec, and 1.87 syll/sec. There are
The cumulative distribution graph of means in Figure 4.3 below shows
the percentile within which a given AR falls. The y-axis is the cumulative
Figure 4.3: Cumulative percentages for mean articulation rate (x-axis: articulation rate in
syllables/second; y-axis: cumulative proportion, %)
The curve in Figure 4.3 is characterized by a steep central portion, but rather
approximately 5.3 and 6.6 syll/sec, into which roughly 73% of the population
[Figure 4.4: Cumulative percentages for the standard deviations of articulation rate]
The curve in Figure 4.4 is similar to the curve seen in Figure 4.3, as it is
However, the slopes at the ends are not as gradual as the curve of the mean
SDs fall.
is important to note that the mean SD (1.2 syll/sec) for a speaker is about twice
the SD (0.64 syll/sec) for between-speaker variation. One would be more likely
to find higher levels of variation within any given speaker of SSBE than between
that speaker and others. This variability is also shown in the variance ratio,
squared within-speaker SD (Rose et al., 2006). A value of less than one indicates
that there is more variation within speakers than between them. A value
greater than one indicates that there is more variation between speakers than
within speakers. The variance ratio for AR is 0.2844, which confirms that there
is more variation within speakers than between them.
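As a rough check, the reported value is consistent with forming the ratio from the group-level figures given above, assuming it is computed as the squared between-speaker SD of the means divided by the squared mean within-speaker SD:

\[
\text{variance ratio} = \frac{0.64^{2}}{1.20^{2}} = \frac{0.4096}{1.44} \approx 0.284
\]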
4.3.4 Discussion
In addition to providing population statistics, there are three main points
to be drawn from the results reported in 4.3.3. The first is that the ARs found
in the current study are very different from those found by older AR studies,
namely Goldman-Eisler (1956), Robb et al. (2004), and Jacewicz et al. (2009).
range of 4.40 to 5.90, and a standard deviation of 0.91 syllables per second
(range = 0.54 to 1.48). Her research was based on the spontaneous speech of
unnatural sound prolongations), and this perhaps in part accounts for her lower
mean AR than that found in the present study. Another reason for the higher AR
results in the present study could be the use of phonological
syllables rather than phonetic syllables. The use of phonetic syllables could
potentially lead to lower ARs, as it counts only those syllables which are actually
articulated by the speaker (see 4.3.2 for the example of "did you eat yet" (4
The second point is that claims that forensic practitioners who took part
there is more variation occurring within speakers than between them. For this
typical mean AR. However, this is not to say that the parameter is not helpful
The final point is that forensic phoneticians need to take care when using
between speakers, but may carry more weight when AR is used to discriminate
4.3.5 Limitations
A possible limitation of the present study is the selection of memory
analyst calculating the AR. This can potentially lead to high levels of variation in
results to those obtained from more commonly defined and objective interval
The following section compares the results found for memory stretches in 4.3
4.4.1 Methodology
Twenty-five of the same speakers as those reported on above were
randomly selected from Task 2 of DyViS, and five minutes of speech from each
individual was analyzed starting at a point two minutes into each speech
sample. It is important to note that for comparison purposes, the mean AR for
an individual using memory stretches was only calculated using intervals from
the same five-minute speech sample as was used when calculating AR using
inter-pause stretches. The remaining aspects of the methodology were also kept
Inter-pause stretches are delimited here by both filled and unfilled pauses
that lasted 130ms or longer (Dankovičová, 1997) but were not stop closures.
The interval also had to include at least five syllables. The criteria for items that
were excluded from analysis in an interval were identical to those set by the
exclusion rules in 4.3.2. This meant that intervals excluded any kind of pauses,
English. A mean of 40 intervals was measured per speaker, with a range of 26-
58, amounting to 1,011 ARs in total. The mean number of syllables per
4.4.2 Results
The mean ARs for both memory stretches and inter-pause stretches are
presented below in Figure 4.5. The y-axis represents the AR in syll/sec and the
x-axis shows the 25 randomly selected speakers from the 100 speakers in the
DyViS Database.
Figure 4.5: Comparison of mean articulation rate for memory stretches versus inter-pause
stretches (x-axis: the 25 randomly selected speakers 016, 019, 024, 025, 030, 031, 035, 039,
050, 055, 056, 057, 060, 062, 063, 066, 067, 068, 071, 075, 076, 078, 085, 087, and 091;
y-axis: articulation rate in syllables/second)
Figure 4.5 provides mean ARs generated using both methodologies. The means
for each speaker are displayed above one another to allow for a visual
evident in the crossing lines in Figure 4.5. All speakers show relatively small
differences (especially speakers 031, 050, 056, 063, 075, 078, and 085) in their
mean ARs. However, some speakers have larger differences (e.g. speakers 016,
035, 068, 071, 076, and 091) than others. Using the absolute values of the
The mean AR for the 25 speakers using inter-pause stretches was 5.98
memory stretches, which was 5.96 syll/sec. The mean AR calculated by the two
amount, given that the mean ARs of the speakers are between 5.00 and 7.00
syll/sec and the mean SD for the 25 speakers (using memory stretches) was
0.64 syll/sec. Using a Wilcoxon signed-rank test on the two sets of data, the
4.4.3 Discussion
The present study gives rise to two important conclusions. The first is
as long as the following are kept consistent: the basic unit of speech defined
here as the phonological syllable, and the exclusion rules. Based on the findings
in 4.4.2 and the experience gained from calculating 125 mean ARs, I am now
of the opinion that mean AR measurements are affected more by the exclusion
rules than they are by the actual definition of the speech interval (memory
stretch vs. inter-pause unit). The exclusion rules were described in 4.3.2 and
concern what speech can be excluded from analysis, e.g. whether false starts,
excluded. These exclusion rules can vary from analyst to analyst, as one might
find that a repetition such as "I-I-I-I am going to the store" should be excluded, but
judgments made with respect to exclusion in an analysis may be exercised by
constraints of the rules. Therefore, the differences that were found among the
mean ARs for the 25 speakers could more likely be attributed to variability that
results found in 4.3 using memory stretches are valid and reliable
calculate ARs using inter-pause stretches is far greater than the time it takes to
use memory stretches rather than inter-pause stretches in order to use time
Given that the methodology for selecting speech intervals does not appear to
interval. AR was recalculated for 25 speakers using different syllable lengths for
speech intervals.
4.5.1 Methodology
The data for the first 25 speakers of the study that were presented in
4.3 were used to recalculate mean ARs using different minimum syllable
requirements for speech intervals. There were seven possible minimum syllable
requirements for the speech interval, ranging from four to ten20. Microsoft Excel
was used to remove tokens from the speakers' data for each given requirement,
and mean ARs and SDs for all individuals were recalculated once tokens had
4.5.2 Results
The mean AR and SD results across all speakers for different minimum
syllable requirements in a speech interval are provided in Table 4.3. The table
details for each minimum syllable level the mean number of tokens included for
calculating speakers' overall AR as well as the group's mean AR, and the group's
mean SD.
Table 4.3: Summary of articulation rate statistics when varying the minimum number of
syllables in the speech interval
20 This range was chosen based on the number of tokens available for the different syllable
lengths.
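The recalculation described in 4.5.1 amounts to re-filtering the token set and re-averaging. A sketch of that procedure is given below, assuming the tokens are held in a table with columns speaker, syllables, and duration (hypothetical names; the original analysis was carried out in Microsoft Excel rather than in code).

```python
# Sketch: per-speaker mean AR and SD for each minimum syllable requirement.
import pandas as pd

def ar_by_min_syllables(tokens: pd.DataFrame, minimums=range(4, 11)) -> pd.DataFrame:
    """tokens has columns: speaker, syllables, duration (seconds)."""
    tokens = tokens.assign(ar=tokens["syllables"] / tokens["duration"])
    rows = []
    for m in minimums:
        kept = tokens[tokens["syllables"] >= m]
        per_speaker = kept.groupby("speaker")["ar"].agg(["mean", "std", "count"])
        rows.append({"min_syllables": m,
                     "mean_tokens": per_speaker["count"].mean(),
                     "group_mean_ar": per_speaker["mean"].mean(),
                     "group_mean_sd": per_speaker["std"].mean()})
    return pd.DataFrame(rows)
```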
There is a prominent trend present in Table 4.3: as the minimum number of
syllables required for a speech interval increases, the mean AR for speakers also
increases while the mean SD decreases. This shows that perhaps AR becomes
slightly more stable within individuals as the minimum syllable count per
variation. Table 4.4 shows the mean differences in mean, SD, and the difference
interval and the second and third columns display the mean Δ for mean AR and
SD. The Δs in columns two and three are calculated by taking the average value
(mean AR or SD) at a given syllable level and subtracting the baseline means
(those values calculated for 4< syllables). Positive values represent an increase,
Within-speaker
                 Mean Δ for Mean AR (syll/sec)    Mean Δ for SD (syll/sec)
5<               0.024                            -0.013
6<               0.106                            -0.029
7<               0.192                            -0.045
8<               0.286                            -0.062
9<               0.359                            -0.062
10<              0.392                            -0.160
Table 4.4 shows that individuals are patterning similarly to the group as a
minimum number of syllables in a speech interval the more stable a speaker's
AR becomes. This could also be due in part to the decreasing number of tokens
involved in the calculation at the 10< syllable level. However, the number of
tokens is still rather robust for syllables at the nine or more level and below,
syllables required for a speech interval and the second presents the mean Δ for
the SD of the mean ARs. As in Table 4.4, a positive value shows an increase,
Between-speaker
                 Mean Δ for SD of AR Means (syllables/second)
5<               -0.022
6<               -0.004
7<               -0.002
8<                0.014
9<               -0.006
10<               0.028
The overall trend displayed in Figure 4.8 is inconsistent, in that as the minimum
requirement levels the mean Δ is 0.001 syll/sec. This means that the variation
between speakers' mean ARs remains relatively stable despite the changes
4.5.3 Discussion
The conclusion to be drawn from the results obtained by increasing the
likelihood ratios and could potentially lead to stronger evidential values for AR
as a discriminant. In the next section, LRs are calculated for AR, as well as to
LRs provide a framework for the estimation of the strength of evidence under
4.6.1 Methodology
The LR calculations for AR were performed using a MATLAB
(MVKD) formula (Morrison, 2007). MVKD was chosen over Lindley's (1977)
univariate LR, because it can account for inter- and intra-speaker variation (i.e.
for both the LR numerator and denominator. Morrison also suggests that
speaker variation (Morrison, 2008, p. 97), the MVKD approach is more suitable
The MVKD formula provided by Aitken and Lucy (2004) assumes that
(GMM-UBM21; Reynolds et al., 2000) has also been proposed for calculating LRs
(Becker et al., 2008; French et al., 2012; Rose and Winter, 2010). GMM-UBM,
distributions instead of kernel densities (as per MVKD). The most significant
characteristics when comparing same (SS) and different (DS) speaker pairs
(Reynolds et al., 2000). Morrison has found that a GMM-UBM, which does not
both better and worse than MVKD on different occasions (Lindh and Morrison,
21 GMM refers to the way in which the data are modeled: a GMM is a parametric density function
that is comprised of a number of Gaussian component functions (Reynolds et al., 2000). UBM
refers to the way in which the background population is modeled. A UBM is used to represent
general, person-independent parameters to be compared against a model of person-specific
parameters (Reynolds et al., 2000).
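To make the GMM-UBM idea concrete, a much-simplified sketch is given below. It is not the system used in the studies cited (which involve MAP adaptation of the UBM and subsequent calibration); it simply fits one GMM to pooled background data and one to the suspect data, then scores the questioned sample against both using scikit-learn. All names are hypothetical.

```python
# Sketch: simplified GMM-UBM-style log-likelihood-ratio scoring.
# background_X: feature frames pooled from the reference population
# suspect_X:    feature frames from the known (suspect) sample
# questioned_X: feature frames from the disputed sample
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ubm_llr(background_X, suspect_X, questioned_X, n_components=8):
    ubm = GaussianMixture(n_components, covariance_type="diag",
                          random_state=0).fit(background_X)
    speaker = GaussianMixture(n_components, covariance_type="diag",
                              random_state=0).fit(suspect_X)
    # Average per-frame log-likelihood under each model; the difference acts as
    # a log-likelihood-ratio-like score (same- vs different-speaker origin).
    return np.mean(speaker.score_samples(questioned_X)
                   - ubm.score_samples(questioned_X))
```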
MVKD has been shown to provide reliable and important strength-of-evidence
(Hughes, 2011; Morrison, 2008; 2009a; Rose, 2006a; 2007a, 2007b), and will
(SS) and different-speaker (DS) LR calculations for AR. The script calls for the
100 speakers' samples to be split in half (i.e. 50/50), such that SS comparisons
speakers 051-100 act as the background population. The calculated raw LRs
allows zero, rather than one, to act as the center point between the support for
Hp and Hd. The log transforms are also beneficial in normalizing distributions
Table 4.6. The verbal scale, adapted from Champod and Evett (2000), is based
Table 4.6: Expressions for strength of evidence in terms of Log10 LR and the corresponding
verbal expression, following Champod and Evett's (2000) verbal scale
This scale was previously used by the UK Forensic Science Service (Champod
and equal error rate (EER), which are both metrics of system validity. The term
validity refers to how well a system (in this thesis a system can be an
performance error was assessed using Cllr, which is a common assessment used
a Bayesian error metric that quantifies the ability of the system to output LRs
that align correctly with the prior knowledge of whether speech samples were
produced by the same or different speakers. The Cllr acts as an error measure
that captures the "gradient goodness" of a set of likelihood ratios derived from
test data (Morrison, 2009b, p. 6). Previous studies of LRs for forensic speaker
comparisons have shown that Cllr proves appropriate for measuring errors
(Morrison and Kinoshita, 2008; Morrison, 2011). The equation commonly used
for Cllr (Brümmer and du Preez, 2006) is:

C_{llr} = \frac{1}{2}\left(\frac{1}{N_{SS}}\sum_{i=1}^{N_{SS}}\log_{2}\left(1+\frac{1}{LR_{SS_{i}}}\right)+\frac{1}{N_{DS}}\sum_{j=1}^{N_{DS}}\log_{2}\left(1+LR_{DS_{j}}\right)\right) \qquad (1)
Cllr was calculated using Brümmer's FOCAL toolkit23 function cllr.m with the
log-LRs as input. Values of Cllr that are closer to zero indicate that error is low.
performance, while values above one indicate very poor performance (van
validity. This is based on the point at which the percentage of false hits (DS
pairs that ostensibly offer support for the Hp) and the percentage of misses (SS
pairs that appear to offer support for the Hd) are equal (Brümmer and du Preez,
2006, p. 230).
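A sketch of how these two metrics can be computed from sets of same-speaker and different-speaker log10 LRs is given below. The Cllr function mirrors equation (1); the EER here is obtained by a simple threshold sweep, whereas the actual analysis used Brümmer's FOCAL cllr.m.

```python
# Sketch: Cllr and EER from arrays of same-speaker (SS) and different-speaker (DS) log10 LRs.
import numpy as np

def cllr(ss_log10lr, ds_log10lr):
    ss_lr = 10.0 ** np.asarray(ss_log10lr)
    ds_lr = 10.0 ** np.asarray(ds_log10lr)
    return 0.5 * (np.mean(np.log2(1 + 1 / ss_lr)) + np.mean(np.log2(1 + ds_lr)))

def eer(ss_log10lr, ds_log10lr):
    ss, ds = np.asarray(ss_log10lr), np.asarray(ds_log10lr)
    thresholds = np.unique(np.concatenate([ss, ds]))
    misses = np.array([(ss < t).mean() for t in thresholds])       # SS pairs below threshold
    false_hits = np.array([(ds >= t).mean() for t in thresholds])  # DS pairs at/above threshold
    i = np.argmin(np.abs(misses - false_hits))                     # where the two rates cross
    return (misses[i] + false_hits[i]) / 2
```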
4.6.2 Results
The following sections detail the results of two sets of calculations of the
considers how the minimum number of syllables in a speech interval may affect
The results for the calculation of LRs on ARs are summarized in Table
4.7. The second row contains the results from SS comparisons and the third row
negative value (providing support for the defense hypothesis). The third
column presents the mean LLR for all comparisons (either SS or DS). The
calculated LLR. The final two columns provide the EER and Cllr values for the
entire system.
Table 4.7: Summary of LR-based discrimination for articulation rate (100 speakers)
Comparisons   % Correct   Mean LLR   Min LLR   Max LLR   EER     Cllr
AR SS         90.0          0.18      -1.48      2.06
                                                          .3340   .8981
AR DS         46.2         -2.94      -8.76      0.82
Table 4.7 shows that AR performs much better with SS comparisons than DS
perform better than SS pairs. However, it appears that, because the degree of
variation in AR is so high within speakers overall, the system tends to allocate
within-speaker variation. This is evident in the fact that for DS comparisons, the
system performs slightly worse than chance (which is 50%, since an LLR
respectively) as the AR system tends to over-predict pairs being from the same
speaker rather than different speakers (note the low rate of correct DS
judgments)24. Table 4.7 shows that Cllr is approaching one, but is still under it.
poor performance. The EER is high at 33.4% for AR as a system and the mean
where zero is the division between support for Hp (>0) and support for Hd (<0).
that are steeper indicate a weaker strength-of-evidence. The results for SS and
24 An additional explanation for the poor performance of DS comparisons could be that the
system is not optimally calibrated (see 8.5.3). This is evident in the intersection between the
SS and the DS distributions in Figure 4.6, as the intersection is not at LLR = 0, but further to the
right into the higher scores. Therefore, many DS comparisons obtain an LLR larger than zero.
This miscalibration is also potentially the reason for the poor Cllr values.
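A cumulative plot of the SS and DS LLR distributions (of the general kind discussed here) can be produced with a few lines of matplotlib; the sketch below uses hypothetical array names and is not the plotting routine used for the figure itself.

```python
# Sketch: cumulative distributions of SS and DS log10 LRs.
import numpy as np
import matplotlib.pyplot as plt

def cumulative_plot(ss_llr, ds_llr):
    for scores, label in [(ss_llr, "Same speaker comparisons"),
                          (ds_llr, "Different speaker comparisons")]:
        x = np.sort(np.asarray(scores))
        y = np.arange(1, len(x) + 1) / len(x) * 100   # cumulative proportion (%)
        plt.plot(x, y, label=label)
    plt.axvline(0, linestyle=":")   # LLR = 0: boundary between support for Hp and Hd
    plt.xlabel("Log10 likelihood ratio")
    plt.ylabel("Cumulative proportion (%)")
    plt.legend()
    plt.show()
```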
[Figure 4.6: Same-speaker and different-speaker comparisons for articulation rate LLRs]
Figure 4.6 shows that error rates are higher for DS comparisons than they are
for SS comparisons. The SS line is steeper than that of the DS line and provides a
relatively low strength of evidence. DS, on the other hand, can attain higher
remember when analyzing SS and DS LR results that "two samples cannot get
more similar for a feature than identical" (Rose et al., 2006, p. 334), and
therefore DS comparisons will always carry the potential for achieving a higher
between individuals, and only produces higher strength of evidence for a very
4.6.2.2 LRs for ARs of 25 Speakers with Variation in the Minimum Number of Syllables in a Speech Interval
The following section reports on the LRs calculated for the ARs of 25
interval. Table 4.8, which has a similar structure to that in Table 4.7, provides
the results of the different systems. The first column includes a value next to the
Table 4.8: Summary of LR-based discrimination for mean articulation rate when varying the
minimum number of syllables in a speech interval (25 speakers)
Comparisons                % Correct   Mean LLR   Min LLR   Max LLR   EER      Cllr
4 <   Same Speaker         84.6         0.005     -1.800     0.369
                                                                       0.2500   1.0260
4 <   Different Speaker    52.6        -0.252     -3.153     0.771
5 <   Same Speaker         76.9        -0.014     -1.703     0.340
                                                                       0.3109   1.0326
5 <   Different Speaker    53.2        -0.260     -2.676     0.665
6 <   Same Speaker         69.2        -0.068     -1.73      0.412
                                                                       0.3846   1.0976
6 <   Different Speaker    54.5        -0.316     -2.718     0.608
7 <   Same Speaker         53.8        -0.152     -1.908     0.435
                                                                       0.4615   1.1780
7 <   Different Speaker    55.1        -0.345     -2.751     0.585
8 <   Same Speaker         69.2        -0.103     -2.731     0.662
                                                                       0.3782   1.1104
8 <   Different Speaker    53.8        -0.280     -1.839     0.409
9 <   Same Speaker         61.5        -0.135     -1.243     0.414
                                                                       0.4615   0.9958
9 <   Different Speaker    52.6        -0.025     -0.595     0.317
10 <  Same Speaker         53.8         0.024     -1.175     0.322
                                                                       0.5385   0.9848
10 <  Different Speaker    39.7        -0.033     -0.234     0.218
For DS comparisons that number drops by 15.4%, with Cllr improving by .0412.
EER is more erratic, showing both increases and decreases in its performance
percentage of correct SS comparisons and the Cllr value are worse, while the
minimum syllable lengths above five perform worse than the AR system in
as there were only 13 test speakers and 12 reference speakers, whereas the
tests described in 4.6.2.1 included 50 speakers in both test and reference sets.
4.6.3 Discussion
Although within-speaker variation can be decreased by increasing the
produced a Cllr better than that seen in 4.6.2 with all 100 speakers and a
the 25 speakers, the number of useable speech intervals available for each
this along with the increase in the minimum syllables required for a speech
interval.
Most importantly, by redefining the minimum syllable count
susceptible to changes in data collection and methodologies for AR. The need
4.7 Conclusion
the large number of available methodologies for the calculation of AR, it appears
that the definition of a speech interval (memory stretch vs. inter-pause unit) does
not have a significant effect on the results. However, in the context of real
as experts have claimed it to be (Chapter 3). This raises the question as to why
some analysts are using it at all in casework except for instances of very high or
low AR. However, exceptions exist for those speakers that are classified as
outliers. It has been shown that AR offers a very weak strength of evidence for
strength of evidence. However, this must be traded off against the fact that they
produce a very high rate of incorrect DS judgments. Despite all efforts to
syllables in a speech interval, the overall system is not improved from the
original system results shown in 4.6.2. The Cllr in the best-performing system
clearly needed, because there is a risk that some experts in the field are
analyzing certain features rather blindly. That is to say, they are giving weight
is shown by the fact that 93% of experts surveyed analyze speech tempo, and
speech tempo to be the single most useful discriminant. The analysis carried out
in this chapter provides evidence that AR may be a far from useful discriminator
in many cases.
Although AR may not be the discriminant shibboleth all experts hope for,
speakers may have a very low or high AR, and in which the parameter can be
(2007a, p. 1820) points out not all speakers differ from each other in the same
good discriminant parameter, as was evident in the small number of high LRs
Chapter 5 Long-Term Formant Distributions
5.1 Introduction
that 28% of forensic phoneticians found vowels to be the most useful feature in
the research carried out. LTFD is the method used to calculate the average
values for each formant of a speaker over a given speech recording. For a given
formant (i.e. F1-F4), measurements for all vowels produced by a single speaker
are averaged across the entire recording or relevant portions of the recording.
This means that for each formant (F1-F4) of a speaker there is an LTFD value
length for the current study); therefore, long vowels carry more weight than
short vowels in that they yield a greater number of measurements per vowel. A
analysis. This results in greater time savings. LTFD also avoids the potential
correlations between vowel phonemes which would not allow those correlated
phonemes to be combined, since combinations of parameters under Bayes
framework. Population statistics which consist of LTFD means and SDs are
combinations relevant for casework. The results of the LRs are presented and
Many different methods have been offered for acoustic analysis, the most
formants for different vowels (Jessen, 2008; Rose, 2002; 2006a, b, c; Rose et al.,
2003). Investigations have also been carried out using formant dynamics in
McDougall and Nolan, 2007; Hughes, 2011; Hughes et al., 2009). Formant
speakers and that there is variation with respect to formants between speakers.
This research led to the argument that the development of techniques for
capturing the average of all spectral slices in a recording. However, LTS
include background noise (Nolan and Grigoras, 2005). LTFD, like LTS, was also
the vowels and certain voiced portions in a recording (Nolan and Grigoras,
2005). There have also been a number of analyses concerned with vowels
under an LR framework (e.g. Alderman, 2004; Kinoshita and Osanai, 2006; Rose,
2007a; Morrison, 2009). However, only three have considered LTFD (Becker et
Nolan and Grigoras (2005) were the first to report the use of LTFD for
LTFD1 and LTFD2 in order to eliminate a suspect who is thought to have made
some obscene phone calls. The first author carried out an auditory analysis of
Diphthongs were also included in the analysis and the beginning and end points
of the vowels were measured. In general, the vowel analysis by the first author
suggested that the speech in the criminal samples and the speech in the suspect
sample were poorly matched. Each vowel in the suspect samples exhibited
rendering the criminal and suspect samples incompatible. Given that there were
fundamental frequency (SFF), LTS, and LTFD. All three approaches were carried
out using Catalina Toolbox26. LTFD was used in the re-analysis of the data in the
obscene phone call case. This approach considers "the long-term disposition of
formants" (Nolan and Grigoras, 2005, p. 162), an aspect which LTS fails to
grasp. Only the voiced frames in the recordings were used for analysis and
linear prediction was used to estimate LTFD1-4. The results from LTFD1-4 give
LTFD2 and LTFD3 in the criminal recordings are considerably lower than in the
samples than the suspect sample. Given the distribution of LTFD, this analysis
gave further substance to the argument that there were two different speakers
involved and that the suspect and criminal were not one and the same person.
very clear picture of the average behaviour of each formant (Nolan and
Grigoras, 2005, p. 169). It also provides strong insights into the dimensions of a
speaker's vocal tract where these are reflected in the maximum LTFD27. The
formant frequency values for LTFD are inversely related to the speaker's vocal
tract size, whereby a longer vocal tract will result in lower formant values
(Nolan and Grigoras, 2005; French et al., 2012). LTFD also has the capacity to
indicate certain habits speakers use, such as palatalization, which are indicated
by a raised LTFD2 (Nolan and Grigoras, 2005). French et al. (2012) also show
that voice qualities related to tongue body position are correlated with LTFD.
Additionally, the shape of the distributions for the estimates of each formant is
useful in identifying speakers who have either more or less variable formants.
26 Available at http://www.forensicav.ro/download.htm
27 The maximum LTFD (for LTFD1 and LTFD2) is reflected in the overall area of a speaker's
vowel space.
This is classified by distributions which are leptokurtic (narrow-peaked) or
Moos (2010) utilized the LTFD method detailed in Nolan and Grigoras
(2005) and analyzed LTFD2 and LTFD3 values of mobile phone speech in both
read and spontaneous speech of 71 male German speakers from the Pool 2010
corpus (Jessen et al., 2005). The spontaneous speech was elicited from speakers
without using certain proscribed words, similar to the strategies in the board
game Taboo. The person who was matched with the speaker feigned
and longer stretches of speech. The read speech was produced by speakers
reading a German version of "The North Wind and the Sun" (Moos, 2010).
Recordings were edited to include only the vocalic portions, where laterals,
approximants, vocalic hesitations, and creaky voice were part of that stream.
Nasals, areas of strong nasality, and vowels spoken on a high pitch (where
were not included. After cutting down the recordings in Wavesurfer, the length
of the spontaneous speech was between 12 and 83 seconds (mean = 40 sec) and
the length of the read speech was between 8 and 16 seconds (mean = 12 sec).
28 This is also seen in Simpson (2008) and Clermont et al. (2008) with regard to F3 values for a
number of phonemes that were measured using mid-point center frequencies.
smaller intra-speaker variation. The LTFD values from read speech were
reported as being higher than those in spontaneous speech. However, this could
be due in part to the fact that "The North Wind and the Sun" is a tool used in
phonetic experiments to get an array of phonemes and tokens. For this reason,
spread of LTFD values per speaker. It is important to note that details of the
spontaneous speech that was elicited were not provided, and could have in
phonemes as is usually the case with read speech (of course this is dependent
on the chosen text). It was also noted by Moos (2010) that it is vital to know
not correlated to) F0, dialect, and speech rate, making LTFD viable for
Becker et al. (2008) investigated the use of Gaussian Mixture Models for
LTFD1-3 under an LR framework (see Jessen et al., 2013 for a similar LTFD
study but using the software Vocalise29). Spontaneous speech was used from 68
male German speakers from the Pool 2010 corpus recorded in a laboratory
setting. The speech had been elicited as described above in Moos (2010), and as
in Moos (2010), the data used in Becker et al. (2008) were transmitted through
These recordings were then edited to remove consonantal information as well
Formant tracking was then used to identify peaks and LTFD measurements
were extracted in Wavesurfer. The formant measurements for the first half of
each recording were used as a training set, and the test set consisted of the
test set were halved again in order to increase the possible number of
comparisons. This resulted in recordings from the training set being around 22
seconds in length, while those in the test set were around 11 seconds long.
Background Model (UBM), and one Gaussian Mixture Model (GMM) was
speakers were used in the test and a total of 100 same-speaker comparisons
The lowest (i.e. the best performing) EERs were found for combinations
which achieved EERs of 0.030 and 0.042, respectively. The lowest EER in which
an EER of 0.053. Overall, discrimination levels were high, and Becker et al.
(2008) note that the speaker models created using LTFD can relate directly to
Jessen and Becker (2010) build on Becker et al. (2008) and investigate
between LTFD2-3 and body height for 81 male speakers from the Pool 2010
corpus of telephone transmitted speech. Both LTFD2 and LTFD3 were found to
were associated with lower LTFD2 and LTFD3. The Pearson correlation
coefficients were both just above 0.3 in magnitude (r = -0.316 and -0.339, respectively), but
were nonetheless significant at the 1% level. Jessen and Becker (2010) then
examined the consistency with which analysts measure LTFD. Five phoneticians
measured LTFD2 and LTFD3 for 20 speakers from the Digs dialect corpus
(Jessen and Becker, 2010). LTFD means were compared across analysts and it
was found that LTFD2 had Pearson correlations between 0.84 and 0.95, while
for LTFD3 these figures were between 0.98 and 0.9930. Consistency across
analysts was higher for LTFD3 than for LTFD2, but both formants achieved
highly consistent results overall, showing that the methodology for LTFD
Grigoras, 2005). Jessen and Becker (2010) tested this hypothesis using three
found that the different languages did not appear to differ in terms of the LTFD-
The authors then investigated the effects of Lombard speech31 on LTFD.
speech) were used to analyze possible Lombard effects on LTFD1-3. LTFD1 was
condition, with high levels of intra-speaker variation present. LTFD2 and LTFD3
were inconsistent in their effects across speakers, and both yielded non-
2010; Künzel, 2001). Finally, the authors tested the performance of LTFD
analysis modeled using GMMs against ASR. They found ASR to outperform LTFD
analysis, with an EER of 0.107 for ASR as compared to an EER of 0.243 for LTFD.
LRs and then applies statistical weightings) was also used to try to improve
results (EER = 0.108). However, it still performed worse than ASR on its own.
The results found by Jessen and Becker (2010) were promising for LTFD use in
phoneticians, the potential to use LTFD statistics from one language across
many languages, and the limited effect of Lombard speech on LTFD2 and
LTFD3.
31 Lombard speech is the tendency of speakers to increase their vocal effort when speaking
(typically due to loud noise; Lombard, 1911).
signal, which are reflections of the dimensions of the vocal tract) and voice
quality (VQ). They considered the efficacy and limitations of the three
parameters, and correlations among the parameters. The study used the
recordings from Task 2 of the DyViS database (Nolan et al., 2009). All original
automatically extract and measure F1-F4 every 5 msec. This yielded between
10,000 and 30,000 F1-F4 measurements per speaker. LRs for LTFD1-4 were
calculated using UBM-GMM in the same way as that detailed in Becker et al.
(2008). In total there were 200 same speaker (SS) comparisons and 9800
divided in half. French et al. (2012) found LTFD1-4 to perform very well, with
comparison with the discrimination levels of the MFCCs and VQ on the same
data set, LTFD achieved similar error rates to the other methods; one method
LTFD, MFCCs, and VQ, there were correlations found between the LRs produced
from MFCCs and LRs calculated from the UBM-GMM analysis of LTFDs (r =
0.39). There was a weak correlation identified between VQ and LTFD globally (r
= 0.12). However, there were some specific aspects of VQ that were more
closely correlated with single LTFD measurements (e.g. raised larynx and
LTFD1, r = 0.40). In conclusion, French et al. (2012) suggest the use of a vocal
32 This was done using Synthesis Toolkit CV software that was adapted by Philip Harrison from J P French Associates.
tract output measurement (e.g. MFCC, VQ, LTFD) to be used as one of many
human behavior.
The research in the studies presented above was carried out using a
discriminant performance of LTFD was tested in previous studies only using the
The following section discusses the collection of population statistics for LTFD
distribution and variation that occurs in LTFD for a large group of individuals
5.3.1 Methodology
25, were analyzed. The recordings were from Task 2 (a conversation between
the speaker and his accomplice) of the DyViS database (Nolan et al., 2009). The
concatenated vowels per speaker, and the iCAbS formant tracker (Clermont et
al., 2007) was used to automatically extract and measure F1-F4 every 5 msec.
Population statistics were calculated by averaging all measurements for each
formant for a single speaker and also taking the SD for each of those formants.
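As an illustration only, the per-speaker statistics described here could be computed along the following lines (a minimal Python sketch; the input format, with one array of 5 msec F1-F4 measurements per speaker, is an assumption rather than the script actually used):

    import numpy as np

    def ltfd_population_stats(formant_tracks):
        # formant_tracks: dict mapping speaker ID to an (N, 4) array of F1-F4
        # measurements taken every 5 msec over that speaker's concatenated vowels
        means, sds = {}, {}
        for speaker, frames in formant_tracks.items():
            frames = np.asarray(frames, dtype=float)
            means[speaker] = frames.mean(axis=0)        # mean LTFD1-4 for this speaker
            sds[speaker] = frames.std(axis=0, ddof=1)   # within-speaker SDs of LTFD1-4
        return means, sds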
5.3.2 Results
for LTFD1 means and SDs are provided in Figures 5.1 and 5.2. The y-axis
represents the number of speakers with mean LTFD formant frequencies that
fall within a given range and the x-axis represents 10Hz-wide formant
Figure 5.1: Mean LTFD1 (number of speakers per 10Hz-wide frequency range)
Figure 5.1 shows a normal distribution with a slight negative skew due to
375.6Hz, 386.7Hz and one suspected outlier at 515.6Hz. The overall mean for
the group LTFD1 is 451Hz, with a range of 364.7Hz to 515.6Hz. The SD of the
means is 29.9Hz, and all 100 speakers' means fall within a 150.9Hz range.
Figure 5.2: Standard deviations for LTFD1 (number of speakers per frequency range)
The standard deviation values for LTFD1 within speakers follow a roughly
and 209.8Hz. The mean SD is 131.4Hz, with a range of 64.8Hz to 209.8Hz. The
SD of the mean SDs is 26.8Hz. All 100 speakers have SDs within 145Hz, which is
a larger range (by 58.7Hz) than the range of means found in Figure 5.1.
5.3 and Figure 5.4, respectively, show the percentile within which a given
LTFD1 mean or SD falls within the population. The y-axis is the cumulative
Figure 5.3: Cumulative distribution of mean LTFD1
The curve in Figure 5.3 is characterized by steepness in the central portion and
gentle gradients in the first and third portions. Despite the steepness of the
central section, the curve is overall much more gradual than that seen for AR in
Chapter 4. 1 SD from the mean gives a range between 421.1Hz and 480.9Hz,
into which roughly 83% of the population tested here falls. The cumulative
Figure 5.4: Cumulative distribution of LTFD1 standard deviations
The curve in Figure 5.4 is slightly steeper than that seen in Figure 5.3, and the
beginning and end portions are less gradual. 1 SD from the mean SD gives a
53.6Hz range between 104.6Hz and 158.2Hz, into which roughly 77% of the
population falls. Given that the mean LTFD values are representative of the
variation that occurs between speakers, and the SD values represent within-
ascertain which variation is higher. The LTFD1 variance ratio is 0.05, which is
The results for LTFD2 are presented in Figures 5.5 and 5.6. The graphs
Figure 5.5: Mean LTFD2 (number of speakers per frequency range)
Figure 5.5 shows a normal distribution with a slight positive skew due in part to
a suspected outlier at 1633Hz. The overall mean LTFD2 for the group is
Figure 5.6: Standard deviations for LTFD2 (number of speakers per 10Hz-wide range, 240-440Hz)
The standard deviation for LTFD2 within speakers is again roughly normally
distributed. There are two suspected outliers at 425.4Hz and 437.7Hz. The
mean SDs is 37.7Hz. All 100 speakers have SDs within 188.4Hz.
The cumulative distribution graphs of means and SDs in Figure 5.7 and
Figure 5.8, respectively, show the percentiles at which a given LTFD2 mean or
Figure 5.7: Cumulative distribution of mean LTFD2 (cumulative proportion (%) against formant frequency, 1350-1650Hz)
The curve in Figure 5.7 is rather gradual compared to those in Figures 5.3 and
5.5. 1 SD from the mean gives a range between 1420.8Hz and 1532.6Hz, into
which roughly 72% of the sample population falls. The cumulative distribution
Figure 5.8: Cumulative distribution of LTFD2 standard deviations
The curve in Figure 5.8 is similar to that in Figure 5.7; however, the rate of
increase of the middle portion in Figure 5.8 is much more variable. 1 SD from
the mean SD gives a range between 285Hz and 360.4Hz, into which roughly
the inter-speaker variation for LTFD2, there is a variance ratio of 0.03. This
The results for LTFD3 are presented in Figures 5.9 and 5.10 below. The
graphs represent the population distributions for LTFD3 mean and SD,
respectively.
Figure 5.9: Mean LTFD3 (number of speakers per frequency range)
Figure 5.9 shows a normal distribution with a slight negative skew. This is
LTFD3 for the group is 2478.5Hz, with a range of 2212.6Hz to 2824.4Hz. The SD
of the LTFD3 means is 106.5Hz, and all 100 speakers fall within a 611.8Hz
window.
Figure 5.10: Standard deviations for LTFD3 (number of speakers per frequency range)
distributed with a slight negative skew. It can be expected that the skew is due
516Hz. The SD of the LTFD3 SDs is 62.8Hz. All 100 speakers have SDs within
5.11 and Figure 5.12, respectively, illustrate the percentile at which a given
Figure 5.11: Cumulative distribution of mean LTFD3
The curve in Figure 5.11 is rather steep in the middle portion, with the
beginning and end portions of the slope being more gradual. 1 SD from the
mean gives a range between 2372Hz and 2585Hz, into which roughly 70% of
the population falls. The cumulative distribution of individual speakers' SDs for
Figure 5.12: Cumulative distribution of LTFD3 standard deviations
The curve in Figure 5.12 is rather steep, with the end portion of the slope, which
starts around 375Hz, being more gradual than the beginning. 1 SD from the
mean SD gives a range between 215.1Hz and 340.7Hz, into which roughly 68% of
the sample population falls. The calculated variance ratio for LTFD3 is 0.15. This
suggests it is more likely that one will find higher levels of variation within a
speaker than between that speaker and the rest of the population, which was
The results for LTFD4 are presented in Figures 5.13 and 5.14 below. The
graphs display the population distributions for LTFD4 mean and SD,
respectively.
Figure 5.13: Mean LTFD4 (number of speakers per frequency range)
Figure 5.13 shows that mean LTFD4 has a fairly normal distribution, though
with a slight negative skew. However, there are no suspected (1.5 x the
revealing measurement errors that occurred for those individuals that appear
to have lower LTFD4 means. The overall mean LTFD4 for the group is
Figure 5.14: Standard deviations for LTFD4 (number of speakers per frequency range)
of the LTFD4 SDs is 67.2Hz. All 100 SSBE speakers have SDs within 292.7Hz of
each other.
5.15 and Figure 5.16, respectively, show the percentile at which a given LTFD4
Figure 5.15: Cumulative distribution of mean LTFD4
more similar to the curve in Figure 5.7 than it is to the curves illustrated in
Figures 5.3 and 5.11. 1 SD from the mean gives a range between 3490Hz and
3831.8Hz, in which roughly 63% of the sample population falls. The cumulative
Figure 5.16: Cumulative distribution of LTFD4 standard deviations
The curve in Figure 5.16 has a rather gradual, almost linear slope. The graph
has a similar shape to the curve shown in Figure 5.15. 1 SD from the mean SD
gives a range between 415Hz and 549.4Hz, into which roughly 65% of the
population falls. The variance ratio for LTFD4 is 0.13, which again indicates that
one will be more likely to find higher levels of variation within a speaker than
between that speaker and the rest of the population, which we also saw in the
case of LTFD1-3. Given that variance ratios less than one indicate that more
variation occurs within individuals than between them, the LTFDs with higher
variance ratios should be relatively better speaker discriminants than those with
lower variance ratios. LTFD3 had the highest ratio at 0.15, followed by LTFD4 at
0.13, LTFD1 at 0.05, and finally LTFD2 at 0.03. The LR results in the next section
5.3.3 LTFD1-4 Results Compiled
displayed in Table 5.1 and 5.2. Table 5.1 compiles LTFD1-4 results for between-
speaker variation, while Table 5.2 presents the compilation of LTFD1-4 results
for within-speaker variation. The first columns identify the formant, and the
second through fourth columns contain mean SD, range (of SD), and SD (of SD).
The mean formant measurements for LTFD1-4 in Table 5.1 are very similar to
those of schwa [ə], where F1 is about 500 Hz, F2 is 1500 Hz, F3 is 2500 Hz, and F4 is
3500 Hz (Johnson, 2003). Given that LTFD is an average across all vowel
phonemes, some type of central (with respect to the vowel space) vowel would
be expected. The results in Table 5.1 and 5.2 also show that LTFD3 and LTFD4
have the smallest ratios of mean SD to mean formant value (277.9: 2478.5 and
482.2: 3660.9). This is indicative of the two higher formants being more stable
within speakers, suggesting that they will be better speaker discriminants than
the lower formants (this also coincides with the variance ratios from 5.3.2
above).
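For clarity, one reading of the variance ratio used throughout this section is the between-speaker variance of the speaker means divided by the mean within-speaker variance; the sketch below implements that reading and reproduces the reported values to a close approximation, but it should be treated as illustrative rather than as the exact calculation used here.

    import numpy as np

    def variance_ratio(speaker_means, speaker_sds):
        # speaker_means: one mean value per speaker (e.g. mean LTFD1 for each speaker)
        # speaker_sds:   the corresponding within-speaker standard deviations
        between = np.var(np.asarray(speaker_means, dtype=float), ddof=1)
        within = np.mean(np.square(np.asarray(speaker_sds, dtype=float)))
        return between / within   # values below 1 imply more variation within speakers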
5.4 Likelihood Ratios for LTFD
The discriminant results of LTFD are presented in the following section. In the
combination with other formants. The second part considers the effects that
5.4.1 Methodology
2007) for the 100 male speakers in DyViS Task 2. The MVKD formula was
originally developed for use with evidence that included repeated measures of a
given parameter (Aitken and Lucy, 2004). However, LTFD considers evidence
from all possible vowel categories, resulting in raw data that can be extremely
varied. For this reason, the raw formant data were averaged over 0.5 sec
single token) for F1-F4 in order to obtain what Moos refers to as packages
(Moos, 2010). There were 100 to 284 (LTFD1-4) measurements per speaker,
calculate LRs, whereby speakers 1-50 acted as the test set and speakers 51-100
acted as the reference set. LRs are calculated for LTFD1-4 individually as well as
the field ( 3.9.1.1). Traditionally, the two formants most commonly used in
casework and sociolinguistic studies are F1 and F2, which are measured in
order to reveal aspects of an individual's vowel space (Ash, 1988; Milroy and
Gordon, 2003). Some FSS experts still analyze only these two formants (
LTFD research has reported that formants are often prone to variation. It
telephone (the 340-3700Hz band) there are many acoustic properties of the
signal that are often affected (Foulkes and French, 2012). The most notable are
Often F4 is missing from the signal altogether (Künzel, 2001; Byrne and
Foulkes, 2004). A similar effect has also been reported for recordings made
using the video and voice recorders in cellular phones (Gold, 2009). For this
reason, some experts avoid F1 and F4 altogether, meaning that only LTFD2 and
aware that the distance between the microphone (of the recording device) and
the talker (in conjunction with the room acoustics) can also have effects on
3.9.1.1), and therefore LTFD1-LTFD3 are considered, as they are the most
in combination to represent the ideal case where F1-F4 are all measurable.
This also provides the upper boundary in terms of the maximum number of
features within a parameter that can be used (for the given data) to achieve the
performance (EER and Cllr) and the magnitude of strength of evidence
summarized in Table 5.3. The leftmost column represents the LTFD that was
likelihood ratio (LLR) above zero, and correct DS comparisons have an LLR of
less than zero. The mean LLR is in the third column, followed by the minimum
and maximum LLR in the next two columns. The final two columns present the EER and Cllr.
Comparisons       % Correct    Mean LLR    Min LLR     Max LLR    EER      Cllr
LTFD1 SS          72.0         0.224       -2.158      1.902      .2806    .8840
LTFD1 DS          71.7         -4.858      -68.768     1.993
LTFD2 SS          70.0         0.162       -1.077      1.259      .3165    .8119
LTFD2 DS          67.5         -1.939      -27.814     1.602
LTFD3 SS          88.0         0.288       -8.373      3.743      .1700    1.0731
LTFD3 DS          80.6         -11.857     -139.273    1.734
LTFD4 SS          68.0         0.238       -2.258      1.378      .2214    .8085
LTFD4 DS          80.2         -11.574     -124.808    1.301
(EER and Cllr are reported once per formant, across the SS and DS comparisons combined.)
Table 5.3 shows that, overall, LTFD3 has the lowest EER (.1700) but the highest
Cllr (1.0731). LTFD4 performed second-best in terms of EER (.2214) and best
for Cllr (.8085). The highest EER was for LTFD2 at .3165, but it had the second-
correct results than did DS comparisons, with the exception of LTFD4, where DS
comparisons performed 12.2% better. The strength of evidence (i.e. mean LLR)
evidence is lower, ranging from 0.162 to 0.288. With respect to Champod and
Evett's verbal scale (2000, p. 240), the LLR scores for SS would not even
there are some cases where a formant individually achieves a stronger strength
of evidence, as per LTFD3 with its maximum LLR of 3.743 (moderately strong
evidence).
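For reference, the validity metrics reported in Tables 5.3 and 5.4 can be recovered from the sets of same-speaker and different-speaker log10 LRs. The sketch below follows the standard definition of the log-likelihood-ratio cost (Cllr) and a simple threshold sweep for the EER; the input arrays are hypothetical.

    import numpy as np

    def cllr(ss_llr, ds_llr):
        # Log-likelihood-ratio cost from log10 LRs of SS and DS comparisons
        ss_lr = np.power(10.0, np.asarray(ss_llr, dtype=float))
        ds_lr = np.power(10.0, np.asarray(ds_llr, dtype=float))
        return 0.5 * (np.mean(np.log2(1.0 + 1.0 / ss_lr)) +
                      np.mean(np.log2(1.0 + ds_lr)))

    def eer(ss_llr, ds_llr):
        # Equal error rate: sweep a decision threshold over the pooled scores
        ss, ds = np.asarray(ss_llr, dtype=float), np.asarray(ds_llr, dtype=float)
        thresholds = np.sort(np.concatenate([ss, ds]))
        miss = np.array([np.mean(ss < t) for t in thresholds])
        false_alarm = np.array([np.mean(ds >= t) for t in thresholds])
        i = np.argmin(np.abs(miss - false_alarm))
        return (miss[i] + false_alarm[i]) / 2.0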
LTFD3 performs the best overall, followed by LTFD4, LTFD1, and finally LTFD2.
Despite returning the highest Cllr, LTFD3 has the highest percentage of correct
SS and DS comparisons. It also offers the lowest EER, while providing the
strongest strength of evidence for SS and DS. The suspected reason for Cllr
being at its highest for LTFD3 is that Cllr appears to be greatly affected by
parameters (e.g. vowels) that produce wider ranges and higher magnitudes of
and DS comparisons with high strengths of evidence. For this reason, high Cllrs
appear to be calculated for parameters that have the potential to offer
discriminant, it achieved a lower Cllr than LTFD3, because the magnitude of the
AR LLRs overall were smaller. The same is true of LTFD2, which has the lowest
the formants can potentially yield even better performances. Table 5.4 below
Table 5.4: Summary of LR-based discrimination for different LTFD1-4 combinations (100
speakers)
Comparisons          % Correct    Mean LLR    Min LLR     Max LLR    EER      Cllr
LTFD1+2 SS           70.0         0.417       -2.472      2.761      .2041    .7648
LTFD1+2 DS           85.0         -7.477      -76.391     1.996
LTFD2+3 SS           76.0         0.334       -7.828      3.768      .1392    .9630
LTFD2+3 DS           89.9         -14.173     -156.130    1.956
LTFD1+2+3 SS         74.0         0.625       -7.632      3.676      .1147    1.0161
LTFD1+2+3 DS         94.3         -19.307     -155.807    3.007
LTFD1+2+3+4 SS       84.0         1.160       -5.292      5.466      .0414    .5411
LTFD1+2+3+4 DS       97.4         -29.228     -162.931    2.854
Four different LTFD combination scenarios are presented in Table 5.4. LTFD1+2
performed the worst with respect to EER (.2041), but was the second-best in terms of
Cllr (.7648). LTFD1+2+3+4 performed the best with respect to EER, which was
.0414, and had the lowest Cllr (.5411). The highest proportion of correct SS and
comparisons.
33 An additional explanation for the poor Cllr values could be that the system is not optimally calibrated (see 8.5.3), as was also seen in 4.6.2.1.
The Tippett plots for the four LTFD1-4 combinations are presented in Figures 5.17-5.20.
Figures 5.17-5.20: Tippett plots for the four LTFD combination scenarios (cumulative proportion against log10 likelihood ratio for same-speaker and different-speaker comparisons)
Figures 5.17-5.20 illustrate the range of LLR values calculated for the four
LTFD1-4 scenarios. LTFD1+2+3+4 had the largest positive values for SS LLR
and the largest negative values for DS LLR (i.e. best strengths of evidence in
both cases), and the best overall mean LLRs (SS = 1.16, DS = -29.228). LTFD1+2
had the smallest LLR ranges for both SS and DS, and the weakest mean DS LLR.
LTFD2+3 yielded the weakest mean SS LLR, and the second-weakest mean DS
LLR. LTFD1+2+3+4 offered the strongest LLR for same-speaker pairs at 5.466
(very strong evidence), and the strongest minimum LLR for DS at -162.931.
LTFDs for individual formants, the four combination scenarios were able to
comparisons.
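Plots of this kind can be drawn directly from the two sets of log10 LRs; the matplotlib sketch below plots the empirical cumulative proportion of each set, which is one common Tippett-plot convention and not necessarily the exact one used to generate Figures 5.17-5.20.

    import numpy as np
    import matplotlib.pyplot as plt

    def tippett_plot(ss_llr, ds_llr):
        for scores, label in ((ss_llr, "Same speaker comparisons"),
                              (ds_llr, "Different speaker comparisons")):
            scores = np.sort(np.asarray(scores, dtype=float))
            cumulative = np.arange(1, len(scores) + 1) / len(scores)
            plt.step(scores, cumulative, where="post", label=label)
        plt.xlabel("Log10 Likelihood Ratio")
        plt.ylabel("Cumulative Proportion")
        plt.legend()
        plt.show()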
5.4.3 Results of Package Length
The length of the package over which a distribution is calculated can vary. A
package length of 0.5 seconds was chosen for the study, as it was found to yield
the lowest EERs. The effects of package length variation can be seen in Table 5.5
for the LTFD combination of LTFD1+2+3+4. The size of the package length is
comparisons that were correct. The mean, minimum, and maximum LLRs for SS
and DS comparisons are found in columns four through nine, while EER and Cllr
The results in Table 5.5 suggest that package length affects the discriminant
34 LTFDs in their raw form readily lend themselves to a UBM-GMM algorithm for calculating
LRs. However, in order to test the MVKD formula, LTFDs were put into packages, as the MVKD
formula is not equipped to handle streams of data.
35 Package length was also used by Moos (2010) to determine stability within LTFD over
varying quantities of data. The packages are used in a similar way here, but are evaluated in
terms of their validity.
correct DS comparisons. In general, EER improves with the decrease in package
length; however, EER appears to have a threshold around 0.5 seconds, above
It is important to note that these results apply only to the given study,
where the total duration of material per speaker was around 50 seconds. It
could be the case that package length has different effects on speech samples
longer or shorter than the 50 seconds used in the current study. However, this
analysis serves as a starting point for further investigation into the effects of
variability of package length and the overall length of speech samples on LTFD
results.
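The packaging step whose length is varied here (averaging the raw 5 msec formant frames into fixed-duration packages before the MVKD formula is applied; see 5.4.1 and footnote 34) reduces to a simple windowed average. A sketch, with the 0.5-second default and 5 msec frame step taken from the text and everything else illustrative, is:

    import numpy as np

    def make_packages(frames, package_length=0.5, frame_step=0.005):
        # frames: (N, 4) array of F1-F4 measurements taken every frame_step seconds;
        # consecutive frames are averaged into packages of package_length seconds
        frames = np.asarray(frames, dtype=float)
        per_package = int(round(package_length / frame_step))   # e.g. 100 frames per 0.5 s
        n_full = len(frames) // per_package                     # incomplete packages dropped
        trimmed = frames[:n_full * per_package]
        return trimmed.reshape(n_full, per_package, frames.shape[1]).mean(axis=1)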
5.5 Discussion
The following section considers the discriminant value of higher formants and
lower formants, and also compares the results from the present study with
The results from individual LTFD LRs revealed that LTFD3 and LTFD4
performed better than the lower formants, LTFD1 and LTFD2, in discriminating
between speakers. These results suggest that the higher formants carry more
Jessen, 1997; McDougall, 2004; Moos, 2010; Simpson, 2008; Clermont et al.,
investigating the discriminant ability of formants where F3 (a higher formant)
Study                                       Data                             Formants Considered   Measurements                   Most Discriminant
Jessen (1997)                               German; 20 speakers              F1-3                  Peaks in spectra               F3
McDougall (2004)                            Australian English; 5 speakers   F1-3                  Dynamic                        F3
Moos (2010)                                 German; 71 speakers              F1-3                  LTFD                           F3
Simpson (2008) and Clermont et al. (2008)   British English; 25 speakers     F1-3                  Temporal midpoint of formant   F3
Hughes (2013)                               British English; 97 speakers     F1-3                  Dynamic                        F3
lower formants (as seen in this study, and those listed in Table 5.6) can be
obtained by recourse to phonetic theory. The first and second formants are
frequencies are related in large part to tongue position: the first formant
correlates inversely with tongue height and the second formant is associated
with tongue frontness/backness (Clark and Yallop, 1990, p. 268). The range of
and shape of his/her vocal tract, while the given configuration of a speaker's
vocal tract will determine its F1 and F2 values. In general, the lower formants
(i.e. F1 and F2) do not encode speaker-specific information; rather, they are
that they are less affected by behavioral and physiological variation than are
with the resonances in the smaller cavities of the vocal tract, which allow for
less intra-speaker variation (i.e. smaller cavities offer a smaller space in which
variation of F3 and F4 is limited with respect to variation in the size of the vocal
tract (which does not show a wide range of variation; Xue and Hao, 2006). It is
correlated in part to voice qualities that involve the backing of the tongue body,
accent group they studied (SSBE speakers). The same was also found for
specific, to some degree F3 can also encode accent information, specifically that
To this extent, the suggestion that higher formants carry more speaker-
discriminant information than lower ones is borne out in the current research,
The results presented in the current study were calculated using the MVKD
formula. However, GMM-UBM has also been used on the same data (French et
al. 2012), and LTFD on German data (Becker et al., 2008). The MFCC results
(French et al., 2012) and the LR results from French et al. (2012) and Becker et
al. (2008) for LTFD are compared (Table 5.7) to the results found in the present
study.
Table 5.7: Summary of LR-based discrimination for LTFD and MFCC in the current study and
competing studies
The LTFD results from all three studies are generally similar36 regardless of
whether GMM-UBM or MVKD was used. However, given that French et al.
(2012) and the current study are based on the same recordings, it would be
expected that SS comparison results were more similar than they are (94% to
84%, respectively). This could suggest that for LTFD it is preferable to use
comparisons were made by French et al. (2012), whereas the current study only
could be due in part to the disparity in sample size (10% is equivalent to five SS
comparisons).
The tendency (albeit a small one) is for LTFD to miss SS pairs, and for
MFCC to mistake DS pairs for SS pairs. In view of this, it could be argued that in
the context of security, where investigators are working to put together a list of
MFCCs are more likely to include additional suspects (despite their innocence)
36 Becker et al. (2008) also included results using bandwidths. However, those results are not presented here.
similarity when comparing non-similar speaker pairings. In a judicial context,
similarity when the samples are in fact similar by virtue of having been spoken
5.6 Conclusion
Overall, the results presented in this chapter suggest that LTFD is a good
speaker discriminant, despite all LTFDs having variance ratios that imply intra-
LTFDs. The best combination, LTFD1+2+3+4, had an EER of only 0.0414, which
results of the survey reported in Chapter 3, it appears that experts were correct
in identifying formants (in one form or another; e.g. LTFD, or for individual
where the values of LTFD means were higher in read speech than spontaneous
speech. It appears that speaking style can have a large impact on LTFD results.
available to work with, and whether the material in the suspect and criminal
The most attractive aspect of LTFD may not be in its successful results,
but in the fact that LTFD is not correlated with a number of other parameters
individually and later there is a desire to combine results from those multiple
vowels. This often results in a scenario where certain vowels are inevitably
the multiplication of individual LRs only when pieces of evidence are mutually
produce a LTFD. The only drawback to this lies in the high level of
generalization that is entailed when all vowels are averaged, meaning that
LTFD and MFCC analysis can provide insights into the vocal tract; however,
MFCC) would be combined with other pieces of speech evidence into an overall
LR (owing to the strong correlations between LTFD and MFCC; French et al.,
2012). For this reason, unless a single phoneme can yield more promising LR
results for different populations, these results suggest that LTFD should be
Chapter 6 Long-Term Fundamental
Frequency
6.1 Introduction
frequency as the number of times per second that the vocal folds complete a
intervals (e.g. a phoneme, a word). Clark and Yallop further explain that F0 is
controlled by the muscular forces determining vocal fold settings and tensions
in the larynx, and by the aerodynamic forces of the respiratory system which
drive the larynx and provide the source of energy for phonation itself (Clark
and Yallop, 2001, p. 333). Speakers are known to differ from one another in the
distribution of spectral energy (of F0) within their speech, due largely to
phonatory/vocal tract settings (Clark and Yallop, 2001). For this reason, F0 is
phonatory and other vocal settings. The survey completed by expert forensic
casework. Alongside voice quality, F0 was also claimed to be the most useful
Despite the popularity of F0, the parameter is not immune to exogenous
recording codecs (Braun, 1995; Gold, 2009; Künzel, 2001; Papp, 2008). It is
already highly variable within speakers and the presence of these factors makes
it even more so. Regardless of this, however, the experts' expectations remain
key role in a forensic case is outlined in Nolan (1983, p. 124). The expectation
of F0 being a good speaker discriminant may stem from the view that it is to
positive note, F0 has been shown to be rather robust to background noise and is
1986; Loakes, 2006). However, only Hudson et al. (2007) provides statistics for
a group of English speakers. There is also only one study on English that reports
Australian English (Loakes, 2006). For this reason, the current chapter
forensic phonetic research. Kinoshita et al. (2009, p. 92) suggest that the
popularity of F0 stems from promising results in early speaker recognition
together relevant F0 literature, and divides it into two parts. The first part
forensic phonetics, and the second examines exogenous factors that can affect
and female speakers in both read and spontaneous speech, and across multiple
languages, are provided in Traunmüller and Eriksson (1995). Many new studies
have been conducted since Traunmüller and Eriksson (1995). The majority of
speakers.
year), produced by six male speakers of Australian English. The six speakers
standard deviation of 17.4Hz (range: 14.3-19.2Hz). Loakes (2006) also presents
Loakes (2006) were lower than those reported by Rose (2003). However, read
speech tends to elicit higher F0 values than spontaneous speech (Loakes, 2006).
Lindh (2006) reports long-term F0 values for 109 young male Swedish
speakers taken from short samples of spontaneous speech. The male speakers,
alternative baseline F0 of 86.3Hz. The alternative baseline is the value (in Hz)
that falls 7.64% below the mean F0 (approximately 1.43 standard deviations;
see Lindh (2006) for more on alternative baseline). These F0 values are higher
than those found for Australian English. Rose (2002) suggests that F0 values
The study most relevant to the research reported in the current chapter
100 male speakers of British English drawn from the DyViS database. The
authors use Task 1 of DyViS, where individuals are taking part in a simulated
police interview. Three to five minutes of spontaneous speech per speaker were
analyzed after all background noises were removed. A Praat script was used to
extract mean, median, and mode F0 for each speaker. The aim of the research
was to gain an understanding of the distribution of F0 in a large homogeneous
group of English speakers primarily for forensic casework. Hudson et al. (2007)
report a group mean F0, median F0, and mode F0 of 102.2, 106, and 105 Hz
respectively. The study provides the population results in a format that is useful
The studies detailed above have reported average F0 values for groups
2006 and Rose, 2003, but at a very limited level). Kinoshita (2005) was the first
(1977) formula and synthetic (i.e. invented) criminal and suspect F0s (both
mean and SD) were created. The 90 male speakers acted as the reference
population. The results presented had a small range of LR estimates, and were
rather close to unity (i.e. not supporting a preference for one hypothesis or the
evidence.
6.2.2 Effects of Exogenous Factors on F0
technology, to varying extents (Braun, 1995; Gold, 2009; Junqua, 1996; Künzel,
2001; Liénard and Di Benedetto, 1999; Papp, 2009; Zetterholm, 2006). F0 has
case in forensic cases that F0 measurements are not robust to the effects of
and which may have some relevance to forensic situations. An exhaustive list of
and sample size (French, 1990; Horii, 1975; Mead, 1974; Steffan-Battog et al.,
Hollbrook, 1981), age, a history of smoking (Braun, 1994; Gilbert and Weismer,
1974; Murphy and Doyle, 1987; Sorensen and Horii, 1982), alcohol
consumption (Klingholz et al., 1988; Künzel et al., 1992; Pisoni and Martin,
1989; Sobell et al., 1982), testosterone drugs and anabolic steroids (Bauer,
(Bouchayer and Cornut, 1992), surgical stripping of the vocal folds after
Keilmann and Hülse, 1992), shortening of vocal folds (Oates and Dacakis, 1983),
and lingual block (e.g. use of anesthesia; Hardcastle, 1975). Finally, Braun
are emotions (sorrow, anger, fear; Williams and Stevens, 1972), stress (Hecker
et al., 1968, Scherer, 1977), vocal fatigue (Novak et al., 1991), depression
(Darby and Hollien, 1977; Hollien 1980; Scherer et al., 1976), schizophrenia
(Hollien, 1980; Saxman and Burk, 1968), time of day (Garrett and Healey,
1987), and background noise level (Dieroff and Siegert, 1966; Lombard, 1911;
detailed by Braun (1995) pose problems that FSC experts are confronted with
when analyzing F0. Many other studies have been carried out since Braun
(1995) to examine the effects of external factors on F0. Those studies that are
F0. They looked at 12 French vowels spoken in isolation by ten speakers (five
experimenter who stood at varying distances in the room from the speaker
(close, normal, and far). The distance at which the experimenter stood relative
to the speaker was intended to induce change in the vocal effort that the
speaker assumed would be required for the experimenter to hear the speaker
(2000) reports that nearly 25% of cases in Germany involve voice disguise, and
speech from 100 speakers (50 males and 50 females) where they were asked to
adopt different voice disguises (high, low, and denasalized). Künzel showed that
speakers were effectively and consistently able to disguise their voices using F0
were observed applying different phonatory strategies, which caused difficulty
in associating the change in F0 with a particular change the speaker has made in
his/her phonatory setting. Zetterholm (2006) has also examined disguise and
Swedish TV personalities. She found that the impersonator was able to vary his
modal F0 voice from its normal average of 118Hz, to anywhere between 97Hz
and 225Hz.
received more attention of late, perhaps due in part to the advent of new and
emerging technologies. This opens new avenues for potential technical effects
investigating the effects of video and voice recorders in cellular phones. Three
different cellular phones were used in the experiment. All phones encoded the
speech signal using an AMR or mpeg4 codec for voice and video recorders
between one and five percent in mean F0, and the SD of F0 changing from 9 to
63.6% for a single cellular phone. Overall, voice recorders (AMR codec) were
found to make bigger changes in F0 than video recorders did. It has previously
been shown that the GSM AMR codec (a speech-encoding codec similar to the
AMR) has the tendency to change voiced frames into unvoiced frames, and vice
216). This could be the case for the AMR codec found in cellular phones, and any
GSM AMR.
Variation in F0 can be caused by numerous exogenous factors, many of
which (those relevant to casework, at any rate) have been detailed above. The
potential of F0.
linguistically homogeneous group of 100 male speakers. These data serve as the
F0, where in the discussion it will be made apparent that the variability is
6.3.1 Methodology
The current study uses the recordings from Task 2 of the DyViS
database. Each recording for all 100 speakers was used in its entirety. However,
after the editing of the files, the recordings were between 2:25 minutes and
11:17 minutes in length, with an average of 6:21 minutes per file. Using Praat
(version 5.1.35), multiple passes were made through the recordings to ensure
that there was only speech remaining. The first phase consisted of the removal
of all the portions where the interlocutor was speaking, and any silent pauses
with the talker's speech. The second phase removed all intrusive noises in the
listening phase was used to ensure that everything but the speech of the
speaker had been properly removed from the recordings. This final phase saw
only minor edits that typically amounted to the removal of less than one second
token has not been previously tested. For this study I chose to segment the
speech into 10-second intervals (see 6.4.3 for effects of package length). Each
file was then annotated using a Praat text grid. The tier represented the package
length (in seconds) according to which the speech signal was subsequently
marked out in the text grids until the end of each recording. If the final segment
did not meet the interval length requirement, it was not included in analysis.
Figure 6.1 depicts an example of the text grid annotations used for all
recordings.
Figure 6.1: Example of a text grid annotation
used to extract mean F0 and standard deviation values for each interval. The
Praat script was set to a frequency range of 50-300 Hz (following Hudson et al.,
2007). After reviewing the F0 Praat picture distributions (for octave jumps and
unwanted pitch artefacts), 64 speakers were found to have reliable F0. The
and were therefore re-run using tailored ranges. The tailored ranges were
chosen through trial and error, where the range with the least amount of errors
was chosen as the best possible frequency range. These are detailed in Table
6.1.
Table 6.1: Tailored Frequency Ranges for Selected Speakers
After the Praat script was re-run using tailored frequency ranges for these
speakers, all F0 means and standard deviations were imported into Microsoft
Excel for further analysis. There were a total of 7,447 intervals for all 100 speakers.
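In outline, the interval-by-interval extraction described above could be reproduced with the praat-parselmouth interface to Praat, as sketched below; this is an assumption for illustration, since the study itself used a Praat script applied to text-grid intervals, and speaker-specific tailored ranges would replace the 50-300Hz defaults where necessary.

    import numpy as np
    import parselmouth

    def f0_per_interval(wav_path, interval=10.0, floor=50.0, ceiling=300.0):
        # Mean F0 and SD for each full fixed-length interval of an edited recording
        sound = parselmouth.Sound(wav_path)
        stats = []
        for i in range(int(sound.duration // interval)):
            part = sound.extract_part(i * interval, (i + 1) * interval)
            pitch = part.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
            f0 = pitch.selected_array["frequency"]
            f0 = f0[f0 > 0]                                   # discard unvoiced frames
            if len(f0) > 1:
                stats.append((f0.mean(), f0.std(ddof=1)))
        return stats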
6.3.2 Results
presented in Figures 6.2 and 6.3. The y-axis represents the number of speakers
that fall within a given range and the x-axis depicts F0 in Hertz (Hz).
Figure 6.2: Mean fundamental frequency (number of speakers per frequency range)
There is a normal distribution in the DyViS corpus for mean F0, as illustrated
in Figure 6.2. The mean F0 for the population is 103.6Hz, with an overall mean
There are no suspected outliers (as defined in Chapter 4) in the mean F0 data.
37 Technically, this is the mean of the means of the means (i.e. the mean across speakers of the means across tokens of each speaker of the means of all the raw F0 values of each token).
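The outlier terminology used in this chapter and in Chapter 5 is assumed here to follow the conventional interquartile-range criterion (values beyond 1.5 times the IQR from the quartiles as suspected outliers, and beyond 3 times as extreme outliers), in which case the check reduces to the following sketch; if Chapter 4 defines the criterion differently, that definition takes precedence.

    import numpy as np

    def outliers(values):
        # Flag suspected (beyond 1.5 x IQR) and extreme (beyond 3 x IQR) outliers
        values = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        suspected = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
        extreme = values[(values < q1 - 3.0 * iqr) | (values > q3 + 3.0 * iqr)]
        return suspected, extreme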
Figure 6.3: Mean standard deviation for fundamental frequency (number of speakers per 2Hz-wide range, 6-38Hz)
There is a roughly normal distribution for the SDs of F0 in Figure 6.3, with a
slight positive skewing due to a number of outliers. There are seven suspected
25.59Hz, and 27.62Hz) and one extreme outlier at 37.32Hz. Including the
outliers, the mean SD for the population is 15.1Hz, with a range of 7.4Hz-
37.3Hz. If those outliers are removed, the distribution becomes more normal
6.4 and 6.5 (respectively) show the percentiles at which a given F0 mean or SD
38 Technically, this is the mean SD of the mean SD (i.e. the mean SD across speakers of the SDs across tokens of each speaker of the means of all the F0 raw values of each token).
proportion of the population, and the x-axis presents fundamental frequency
(Hz).
Figure 6.4: Cumulative percentages for mean fundamental frequency
The curve in Figure 6.4 is characterized by a steep central section but has gentle
gradients at both ends. The data show that the lowest 20% of the speakers have
a mean F0 below 93Hz, while the highest 20% have an F0 above 115Hz. This
leaves only a narrow band of 22Hz in which the remaining 60% of speakers are
Figure 6.5: Cumulative percentages for mean F0 standard deviations (cumulative proportion (%) against standard deviation, 7-39Hz)
The first half of the curve in Figure 6.5 is characterized by a steep trajectory,
while the second half has a fairly long gradient trajectory at the upper end. The
lengthy gradient to the end of the trajectory is due to the suspected outliers and
extreme outliers confirmed above in the current section. Observing the spread
of the SDs, the lowest 20% of speakers have F0 SDs below 12Hz, and the highest
20% have F0 SDs above 17.5Hz. This leaves a remarkably narrow band of 5.5Hz
in which the majority of speakers fall (60%). Overall, the F0 data have a
variance ratio of 0.7152, which indicates that there is more variation occurring
6.3.3 Discussion
The results for the present study are very similar to those reported by
Hudson et al. (2007), which used Task 1 of DyViS. The difference in the two
group means is only 2.4Hz. Results from the present study are also similar to
those presented in Loakes (2006), who gave a mean F0 of 105.2Hz for the
languages the results presented in 6.3 are somewhat dissimilar in that both
Swedish (Lindh, 2006) and German (Künzel, 1989) report higher mean F0s at
The lower mean F0 values found for SSBE compared to those found for
that numerous speakers in the DyViS database have creaky voice qualities
(Hudson et al., 2007), and the creaky voice qualities of speakers were included
in the current study. This is because the F0 distributions of speakers who
frequently use creaky phonation tend to be bimodal, with the first peak
representing the creaky voice quality and the second peak representing modal
types are averaged, which thus results in a lower mean F0 (Hudson et al., 2007).
As such, it is most likely the case that, as pointed out in Hudson et al. (2007),
speakers' modes did not correspond to their means. For this reason, it would be
individuals. Overall, the results also suggest that mean F0 and SD are perhaps
not the best measures for those speakers with intermittently-present creaky
voice.
6.4 Likelihood Ratios
The discriminant results for F0 are presented in the following section. The first
part provides LR results for F0, and the second part considers the effects of
6.4.1 Methodology
whereby the test and the reference speakers came from the same population of
100 speakers. Speakers 1-50 were used as the test speakers, while speakers 51-
100 served as the reference speakers. Mean F0 and SD parameters were both
used for each token spoken by a given individual in the calculation of the LRs.
Performance of the system was assessed in terms of both the magnitude of LRs
(Champod and Evett, 2000) and system validity (Cllr and EER).
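Schematically, the comparison design (speakers 1-50 scored as test speakers, with speakers 51-100 providing the reference population) can be set up as below. The exact pairing used to form the same-speaker trials is not spelled out here, so the split of each test speaker's data into two non-overlapping samples is illustrative, and the scoring function (the MVKD calculation over the per-interval mean F0 and SD values) is left abstract.

    from itertools import combinations

    def comparison_pairs(test_speakers):
        # test_speakers: dict mapping speaker ID to a list of per-interval
        # (mean F0, SD) feature vectors for that speaker
        halves = {s: (feats[: len(feats) // 2], feats[len(feats) // 2:])
                  for s, feats in test_speakers.items()}
        ss = list(halves.values())                                 # same-speaker trials
        ds = [(halves[x][0], halves[y][1])                         # different-speaker trials
              for x, y in combinations(sorted(halves), 2)]
        return ss, ds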
The results for the calculation of LRs for F0 are summarized in Table 6.2.
The second row contains the results from same-speaker (SS) comparisons and
the third row contains the different-speaker (DS) comparison results. The
correctly identified, whereby a log likelihood ratio (LLR) above zero was correct
for a SS comparison and an LLR of less than zero was a correct judgment for DS
comparisons. The mean LLR is found in the third column, followed by the
minimum and maximum LLRs. The final two columns present the performance
Comparison    % Correct    Mean LLR    Min LLR     Max LLR    EER       Cllr
10 sec SS     92.0         0.958       -3.404      1.936      0.0849    0.4547
10 sec DS     89.9         -24.204     -269.159    1.906
percentage of correct judgments. The mean LLR for DS offers very strong
evidence to support the defense hypothesis (Hd; Champod and Evett, 2000),
while the mean LLR for SS only offers limited evidence to support the
prosecution hypothesis (Hp). Even the Max LLR for SS does not reach a strength
the value of 1.936). The EER for the system is higher than that found for a
combined LTFD system in Chapter 5 (0.0414; see Table 5.4), but is significantly
better than that found for AR in Chapter 4 (0.334; see Table 4.7). The Cllr for F0
as a system is generally better than the Cllrs achieved in Chapters 4 and 5 for
The Tippett plot in Figure 6.6 offers a visual measure of the performance
of F0 as a discriminant feature.
Figure 6.6: Tippett plot for F0 (cumulative proportion against log10 likelihood ratio for same-speaker and different-speaker comparisons)
Figure 6.6 shows that there is a narrow range in LLR for SS and that most LLRs
for SS are relatively similar. The DS comparisons have a wider spread of LLRs. It
is also clear that DS comparisons can achieve very large LLRs, which offers a
for analysis. This involves dividing the recording into multiple sections (or
tokens). The most efficacious token length (or referred to here as package
seconds was chosen for the study as it was found to yield the lowest EER.
However, it is possible to vary the size of the package length (similar to that
seen in Chapter 5). The effects of package length variation can be seen in Table
6.3 for F0. The table has the same formatting as that of Table 6.2, although Table
Comparison    % Correct    Mean LLR    Min LLR     Max LLR    EER       Cllr
5 sec SS      88.0         0.868       -5.176      2.030      0.1010    0.5634
5 sec DS      91.1         -29.879     -308.588    2.031
10 sec SS     92.0         0.958       -3.404      1.936      0.0849    0.4547
10 sec DS     89.9         -24.204     -269.159    1.906
15 sec SS     92.0         0.970       -2.526      1.999      0.1016    0.4407
15 sec DS     89.0         -20.964     -233.785    1.809
20 sec SS     92.0         0.960       -2.536      1.880      0.0967    0.4383
20 sec DS     88.7         -18.620     -206.961    1.717
The results in Table 6.3 suggest that package length affects the discriminant
performance of F0. However, the increase of package length does not appear to
be linearly correlated with the overall system performance in terms of EER. For
values shown in Table 6.4, which displays the results for mean F0s, F0 range,
standard deviations (SD), and range of SDs across the four different package
lengths.
Table 6.4 shows that there is relatively little difference between F0 results as a
function of the different package lengths. The biggest difference found in the
results is in the mean of SDs for 5 seconds (14.3Hz), compared to the mean of
SDs for the larger package lengths (15.1-15.5Hz). The 5-second package length
was also found to have the biggest difference in Cllr (Table 6.3) with the longer
package lengths. On the basis of these results, it could be argued that choosing a
the data.
It is important to note that like the package length results found for
LTFD in Chapter 5, the results presented in this section relate specifically to the
present recordings, in which the total length of material per speaker was
around six minutes. It could again be the case that package length affects longer
6.5 Discussion
The results presented in the present study provide a starting point for further
investigation into the discriminant value of F0. However, the study was limited
by the highly controlled nature of the recordings, which were relatively free
from the influences of the exogenous factors that are known to affect F0 values,
as detailed in 6.2.2. More studies which incorporate those factors are needed.
The results of the present study were produced using only mean and
survey also reported that it is not uncommon for experts to use other measures
of F0 in their casework. Kinoshita et al. (2009) showed promising results when
kurtosis, modal F0, and modal density). This points to a need to reassess the
strength of evidence.
6.6 Conclusion
speaker discriminant overall, and has promise for demonstrating that two
voices have come from the same speaker (rather than different speakers) in the
comes from different recordings. Previous literature would suggest that its
they are correct in that F0 does well discriminating same speakers that come
from the same recording, but it is uncertain whether that result will hold true
good recording conditions and high audio quality used for the current study are
comparison of mean F0s and SDs is unlikely to advance the methods used for
exceptions are to be made for those individuals who lie towards the margins of
the distribution curve or who can be classed as outliers, and the case remains
Chapter 7 Click Rate
7.1 Introduction
The survey results presented in Chapter 3 indicated that experts examine many
tongue clicking, and both filled and silent hesitation phenomena. In respect of
examined recordings for the presence of velaric ingressive stops (i.e. clicks),
science has focused primarily on vowels, and to some extent consonants and
fundamental frequency (Gold and Hughes, 2013). However, there remains a gap
analyzed in this chapter are used by speakers in a discursive manner that can be
classified as conveying linguistic meaning (i.e. they are used here as a discourse
difficulties in attributing a numerical strength of evidence to measure discrete
data.
velaric airstream, such as Zulu. Laver (1994, p. 174) explains further that a
closure made by the back of the tongue against the velum. A second closure is
also made, further forward in the mouth, either by the tip, blade or front of the
tongue or by the lips. For a lingual click, there is a closure made by the back of
the tongue coming into contact with the soft palate, and the front portion of the
tongue is then drawn downwards. This process increases the volume of the
space occupied by the air trapped in between the two closures rarefying the
intra-oral air-pressure. When the more forward of the two closures is released,
the outside air at atmospheric pressure flows in to fill the partial vacuum
(Laver 1994, p. 174). It is at this point that a click is realized. Figure 7.1 below
illustrates the actions of the vocal organs involved in the production of a click
sound.
Figure 7.1: The action of the vocal organs in producing a velaric ingressive voiceless dental click: (a) first stage, velic and anterior closure; (b) second stage, expansion of the enclosed oral space; (c) third stage, release of the anterior closure (Laver, 1994, p. 176)
Figure 7.1 provides an illustration of the process involved for the vocal organs
in the production of a dental click. This is just one of six possible places of
(IPA). The five different click types are provided in the figure below, which is an
The six different clicks presented in Figure 7.2 above are most commonly
2006, p. 139) and extensive research has been carried out to document clicks in
those languages (e.g. Greenberg, 1950; Herbert, 1990; Jessen and Roux, 2002).
Nama, !X, and !X clicks are used phonemically (Laver, 1994, p. 174). Clicks
are also found in English, but unlike those in African languages they are not
states of a speaker. Previous research suggests that certain clicks are used to
convey such things as annoyance (Abercrombie, 1967, p. 31; Ball, 1989, p. 10),
sympathy (Gimson, 1970, p. 34), and disapproval (Crystal, 1987, p. 126). There
is also evidence to suggest that the phonetic properties of clicks can vary
The first type are clicks that occur in the onset of a new sequence, the second
are clicks used in the onset of a new and disjunctive sequence, and the third
type are clicks produced in the middle of a sequence of talk when the speaker
is engaged in the activity of searching for a word (2005 p. 176). The following
Fragment 1: Holt.SO.88.1.2/bath/
05: (.)
06: Bil: mm
Fragment 2: Holt.1.8/Saturday/
03: (0.2)
05: Les: [!]. .hhh anyway we had a very good evening o:n saturday
06: (0.2)
Fragment 3: Holt.U.88.2.2/natter/
01 Les .hhhh and theres the- the natte- uhm (0.2) (0.3) !
by the bilabial click in line 4. A second type of click is used for the onset of a
click on line 5. And the final click type is found in Fragment 3, where both
bilabial and alveolar clicks are being used to signify the search for a word by the
speaker in line 1. The three click types above, in combination with clicks being
used paralinguistically (i.e. to show emotion or affect), are used in the analysis
potential one can begin with the assumption that for any aspect of affect or
For example, while one can signal annoyance, disapproval or sympathy by use
of clicks, there are many other ways of signaling these states to interlocutors.
topic or the fact that one is having difficulty finding a word, other forms
Given that this is so, one might reasonably expect there to be variability across
speakers in terms of whether clicks or other forms are their preferred option.
Chapter 3 to the effect that clicks have high value as speaker discriminants.
7.3 Data
The recordings analyzed were of 100 male speakers of SSBE aged 18-25 years
from the Dynamic Variability in Speech (DyViS) corpus (Nolan et al., 2009). Two
of the 100 speakers played the role of a criminal suspect and was interrogated
by one of two project interviewers (Int2 and Int3) who played the role of a
accomplice and explaining what had occurred in the police interview. The role
of the accomplice in this data set was played by the same project interviewer
recordings used for analysis were made at the subject's end of the line, i.e. they
7.4 Methodology
criteria: (a) it must vary (ideally quite widely) across speakers; (b) it must be
this section, the methods employed to test the intra- and inter-speaker variation
of click rates are outlined. Task 2 recordings are used for the first portion of the
click analysis of the current study. As mentioned above, each speaker conversed
The first two minutes of each recording were ignored so as to allow for
speakers to settle into the interaction. All subsequent speech from each subject,
up to a maximum of five minutes net (i.e. after excluding long pauses and the
speech of the interlocutor), was then extracted and divided into one-minute
intervals (giving a combined total of 499 minutes of net speech). 99 of the 100
speakers produced enough speech to meet the five-minute target. One speaker
(speaker 012) fell just short of this, and the analysis was therefore based on
analysis of just four minutes of his speech. The extracted speech was examined
auditorily during two listening sessions in Sony Sound Forge (version 10.0;
analysis done auditorily) and Praat (version 5.1.35; auditory and acoustic
analysis done simultaneously) for instances of clicks. Any sounds that auditorily
and visually resembled clicks but were not apparently produced on a velaric
ingressive airstream were excluded from the analysis. This resulted in the
At the end of this process there were a total of 454 clicks left. Each click was
(2007; 2011a; 2011b), i.e. initiating a new speaking turn, indicating topical
transcribed excerpts from the recordings (clicks are indicated by the symbol !,
39 Pike (1943 p. 103) says percussors differ from initiators in several ways in opening and
closing they move perpendicularly to the entrance of the air chamber . . . ; they produce no
directional air current but merely a disturbance that starts sound waves which are modified by
certain cavity resonators; they manifest their releasing or approaching percussive timbre only
at the moments of the opening and closing of some passage . . . Typical percussives are made by
the opening and closing of the lips, the tongue making closure at the alveolar ridge, the velum
closing, the vocal folds making a glottal closure, and the sublaminal percussive of the cluck
click (Ogden 2013 p. 302). The most common percussives found in the current data set were
related to the opening and closing of the lips.
Initiating new speaking turn
S011: Mhm
Int1: Do I know it
Int1: Um wha- did they trace that phone call when you were in the uh
grotty booth
7.5 Results
some general findings on phonetic and functional aspects of the clicks are
presented.
7.5.1 Phonetic Properties of the Clicks
The scope of the present study did not extend to a detailed analysis of
the phonetic and acoustic properties of the clicks. However, in terms of their
to the passive articulator, they ranged from dental through alveolar to post-
gravity and lower level of intensity than the other variants. At the other
extreme, post-alveolar clicks are the highest in intensity, have a relatively short
data, it appears that, at this stage at least, place of articulation proved very
difficult to classify more finely, and that no individual speaker clearly stood out
from the others in respect of this dimension. It is supposed that because clicks
are not used phonemically by SSBE speakers, the precise place of articulation
for clicks does not matter to a speaker or listener when used in sequence
management. It is rather that the presence of any form of apical click can signify
play an important role for those clicks used as affective markers, since place of
articulation for clicks has been shown to signify different emotions (Ball, 1989;
Figure 7.3: Distribution of click occurrences by functional category
Of the 454 clicks that occur in the combined 499 minutes of speech examined,
word search accounts for just over half of all clicks (51.32%). Taken together,
(48.24%) to those used to indicate word search. Affective use represents the
smallest category, with only two examples (0.44%). Whilst the latter may to
some extent be accounted for by the fact that the attitudinal stances that clicks
are used to convey (pity, disapproval) seldom arise in the type of conversation
in Table 7.1, looking first at clickers versus non-clickers. The leftmost column in
Table 7.1 presents the length of time over which clicks were analyzed, while the
second and third columns represent the numbers of speakers who were found
been found to click at least once in the given speech sample, and a non-clicker is
defined as a speaker who does not click at all in the given speech sample.
Table 7.1: Number of clickers versus number of non-clickers over varying speech sample
lengths
As seen in Table 7.1, if one considers each sample in its entirety, the proportion
clickers decreases as sample length increases, owing to the fact that so many of
the speakers click very infrequently. This can be seen in Figure 7.4, in which it
is apparent that 74% of the DyViS population clicks five times or fewer over the
five-minute period, i.e. they have a click rate of one click per minute or less.
Figure 7.4 displays the number of speakers on the y-axis and number of total
Figure 7.4: Distribution of the total number of clicks (number of speakers against total clicks over five minutes of net speech)
Approximately 50% of speakers click only once, twice or not at all. And while
the mean number of clicks for the group as a whole is 4.26 clicks over five
minutes of net speech, this is highly skewed by three speakers who produce a
very high number of clicks (24, 28, and 54). The mean number of clicks per
speaker drops to 3.4 clicks when the three most extreme clickers are removed.
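The per-speaker click-rate statistics reported in this section reduce to simple arithmetic over the one-minute counts; a sketch, with a hypothetical counts table as input, is given below.

    import numpy as np

    def click_rate_stats(counts_per_minute):
        # counts_per_minute: dict mapping speaker ID to a list of click counts,
        # one count per one-minute interval of net speech (five per speaker here)
        rates = {s: float(np.mean(c)) for s, c in counts_per_minute.items()}      # clicks/min
        spreads = {s: float(np.std(c, ddof=1)) for s, c in counts_per_minute.items()}
        group_mean = float(np.mean(list(rates.values())))
        return rates, spreads, group_mean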
Figure 7.5 presents the mean click rates in clicks per minute (clicks/min.),
rather than as a cumulative number of clicks, as seen in Figure 7.4. The y-axis
presents the number of speakers that fall within a given range and the x-axis
Figure 7.5: Mean click rate (number of speakers per 0.15 clicks/minute range, with outlying speakers at 4.8, 5.6, and 10.8 clicks/minute)
Figure 7.5. The mean click rate for the population is 0.88 clicks/min, with a
means is 1.41 clicks/min. There are two suspected outliers at 3.00 clicks/min
and 3.50 clicks/min. There are also three extreme outliers at 4.80 clicks/min,
shows the percentile at which a given click rate falls in relation to the
Figure 7.6: Cumulative distribution of mean click rates (clicks/minute)
The curve in Figure 7.6 starts at 25% for those with a click rate of 0, meaning
that 25% of the DyViS population have no clicks present in their speech. From
logarithmic growth. Figure 7.6 shows that roughly 70% of the population have
click rates at or below 0.8 clicks/min, and only 30% have larger click rates.
7.5.3.1 Discussion
length (see Table 7.1), and it is not possible to specify a threshold sample
duration for determining click rate, as the sample duration is dependent upon
frequency of clicking. For example, to determine that someone has a click rate
of, say, 0.2 clicks per minute, it would be necessary to have a sample five
minutes in length, during which time the speaker clicks only once. However, to
216
establish that someone had a click rate of, say 10 per minute, all one would need
is one minute of speech or indeed less. This assumes, of course, that the
clicks would be evenly distributed across time. And, as will be seen in the
the data. For the present, however, it is noted that the low number of click
totals for the majority of speakers makes the discrimination capacity of clicks
discriminant for the handful of speakers who produce high click totals, if these
Speakers are represented on the x-axis and the click rates (clicks per minute) on
the y-axis. A speakers mean click rate is represented by a black dot and the
vertical bars indicate the range between the minimum and maximum click rate
217
Click Rate Ranges by Speaker
16
mean
14
Click Rate (clicks/minute)
12
10
0
0 20 40 60 80 100
Speakers
Figure 7.7: Mean and range of click rates across all speakers
mean click rate increases, such that the higher-rate clickers have a greater
range of variability across the individual minute blocks. Thus, even for those
speakers for whom clicks might serve as a potentially discriminant feature, the
clicks tend to occur in localized clusters rather than being evenly spread
throughout the sample. This effectively means that in order to establish that
someone has a high click rate, the analyst would need a relatively large amount
containing around one minute of net speech from the target speaker are not
unusual. Obtaining five minutes of net speech is much less common. Thus, the
218
There is a limited amount of data with which we can calculate variability.
Caution must be exercised when interpreting the SD data. Figure 7.8 presents
Standard Deviation
30
Number of Speakers
25
20
15
10
Figure 7.8 has a positively skewed distribution, like that seen for mean click
rate in Figure 7.5. There are two suspected outliers in the population (at 1.95
clicks/min and 2.07 clicks/min), and one extreme outlier at 5.40 clicks/min.
The mean SD for click rate in the population is 0.69 clicks/min, with a range of 0
clicks/min to 5.40 clicks/min. The SD of the SDs for click rate is 0.70, which is
actually higher than the mean, indicating a large spread in click rate values.
Figure 7.9. The y-axis shows the cumulative proportion of the population with a
SD at a given point, and the x-axis presents click rate in clicks per minute.
219
Standard Deviations for Click Rate
100
The curve in Figure 7.9 is similar to the curve seen in Figure 7.6, but is slightly
more gradient than logarithmic in its growth. The data in Figure 7.9 show that
25% of the speakers have SDs under 0.25 clicks/min., due to the 25% of
speakers who do not click at all in their five minutes of net speech. The variance
ratio for click rate is 4.06, which signifies that there is more variation between
speakers than within speakers for click rate. A variance ratio of 4.06 is the
highest that has been achieved for any parameter in the current thesis
must be remembered that there were on average only five click tokens per
speaker.
220
7.5.4.1 Discussion
discourse markers). There is no reason to assume that the need to express the
clicks can fulfill should be evenly spread across time. A more detailed analysis
across interactions as well as within them, i.e. some types of conversation may
well present more opportunities than others. For the present, however, another
accommodation effects.
range of linguistic features (c.f. Giles, 1973; Giles and Ogay, 2007; Shepard, Giles
The click data considered so far were all drawn from the Task 2
recordings of the DyViS database, where each of the 100 subjects conversed
with the same interlocutor, Int1. The recorded interviews that make up the
conversing with the 100 subjects. The further work reported in this section was
more frequently in the Task 1 recordings when speaking with Int2 and Int3
221
than in the Task 2 recordings when speaking with Int1. This observation
subjects actual click rates in the Task 1 recordings relative to the Task 2
recordings, and (b) examining the click rates of Int2 and Int3 relative to Int1.
The latter was undertaken with a view to determining whether any increase in
accommodation effect.
speaking with Int2 and 25 speaking with Int3. As with the Task 2 sampling
procedure, the first two minutes of the conversations were excluded from the
analysis to allow for settling-in time. Three minutes of net speech were
sample from the Task 2 recordings. Click rates were then compared across the
two tasks. The comparisons showed that, although there was no statistically
(using a chi-squared test, where p = .7401 (Int1 to Int2) and p = .0880 (Int1 to
Int3)), clickers did show a marked increase in click rate when speaking to Int2
and Int3 over when speaking to Int1. The results are summarized in Table 7.2.
The first column identifies the interlocutor, and the second and third columns
present the mean and median click rates, respectively, for the given
interlocutor.
222
Table 7.2: Summary of Speakers Mean and Median Click Rates - Int1 versus Int2 and Int3
The increase across Int1 to Int2 is significant at the 1% level (using a Wilcoxon
signed rank test, p = .0034 and n = 25). That between Int1 to Int3 falls just short
of significance at this level (1%), but achieves it if one speaker whose high click
rate (speaker 07s click rate is 14.33 clicks/min for Task 1 and 12 clicks/min for
24).
The actual changes mean, minimum and maximum - for speakers are
represented in Tables 7.3 and 7.4 (Int1 versus Int2 and Int1 versus Int3). The
first column identifies the direction of change in click rate. The second column
in Table 7.3 and 7.4 identifies the number of speakers with a given change in
click rate, and columns three through five present the mean, minimum, and
223
Table 7.3: Changes in click rate across speaker - Int1 versus Int2
Number of
(Int2-Int1) Mean Minimum Maximum
Speakers
Increase 15 1.601 0.003 3.003
Same 4
Decrease 6 -0.334 -0.003 -0.670
Table 7.4: Changes in click rate across speaker - Int1 versus Int3
Number of
(Int3-Int1) Mean Minimum Maximum
Speakers
Increase 17 1.264 -2.333 4.333
Same 0
Decrease 5 3.444 -0.003 -0.333
In attempting to account for the increases in click rates when subjects spoke to
Int2 and Int3, click rates for Int2 and Int3 were calculated from three randomly-
minutes of net speech after the settling-in period, thus providing a total net
sample of nine minutes for each interlocutor. For Int1 an equivalent portion of
post-settling-in speech was extracted from the Task 2 recordings with the same
three subjects selected for Int2 and Int3, thereby providing a total net sample of
18 minutes. The mean click rates for the three interlocutors are set out in Table
7.5. The first column identifies the interlocutor and the second column presents
224
Given the click rates established for subjects from the Task 2 recordings, Int1
might be seen as a relatively average clicker. Int2 and Int3, however, would be
for the increased click rates of the subjects when conversing with Int2 and Int3
would be that they are accommodating their clicking behavior towards that of
effect is bilateral and that interviewers also adjust their click rates towards
those of the subjects. The data to test this view are not available within the
is a factor40; it may or may not be significant that Int1 is a young male, while
explanation of the differences might be that the Task 1 interactions offer more
clicking opportunities, these being mock police interviews in which the subjects
are asked questions that might well have them searching for words in
answering. However, this would not account for the relatively high click rates of
Int2 and Int3, and although there are currently no formal findings to present on
this, the clear impression is that there are no obvious differences amongst click
opportunities.
of evidence. The absence of an LR calculation for clicks is due entirely to the fact
that a model does not currently exist with which it might be calculated.
40 Accent may also be a factor, since only one of the interlocutors was also an SSBE speaker.
225
However, there are a number of mathematical procedures that can be used to
(2004 p. 37) state that for any particular type of evidence the distribution of
important to use the model that best fits the distribution of data in order to
with when used to calculate numerical LRs, as they are discrete in nature. Aside
from DNA profiling (which works with discrete data), there is a lack of methods
when data are discrete rather than continuous. In forensic speech science, there
has not been any LR research that has carried out a comprehensive analysis of
(Gold and Hughes, 2013), for which it is possible to assume normality. Once an
continuous data to be modeled using the means and covariances only (Aitken
and Gold, 2013, p. 148). However, for discrete data a description of the
clicking rate in speakers, there are two main issues to consider when seeking
how to model the data appropriately. The first is the possibility for each
discrete data entry (e.g. the 5-minute recording) to have multiple levels of
response (e.g. a click count for each minute in the recording). For example, in
the present data, multiple levels of response are represented by the multiple
226
click counts over a given amount of time. More specifically, Speaker A may have
5 minutes of net speech, where each minute of net speech yields an individual
count (e.g. 0,0,1,2,0). The second issue is the correlation that exists between
counts. Given that 25% of the population was found not to click at all over 5
minutes of net speech, it is apparent that correlations exist between counts, and
The work in Aitken and Gold (2013) explores the issues and limitations
bivariate Bernoulli model are proposed for evaluating clicks and any other
discrete data that act in a similar way to clicks. The models proposed in Aitken
and Gold are basic models; however they illustrate issues that need to be
other models may be built (Aitken and Gold 2013 p. 154). Likelihood ratios
are provided in Aitken and Gold (2013, p. 153). However, they are based on a
limited data set whereby and (set distributions of the population) were not
distribution. The LR results for clicks were between 0.30 and 3.35 (i.e. giving
very limited evidence for support) which are small but intuitively sensible
(Aitken and Gold, 2013, p. 154). More practical work is needed to further
develop the models. However, it is hoped that further testing will also produce
smaller LRs. Intuitively, this would align with there being a finite number of
41Structural learning makes decisions based on the data at hand, and uses those data to inform
a given model/algorithm/framework (Porwal et al., 2013)
227
Given the limited strength of evidence reported for clicks in Aitken and
Gold (2013), the lack of models for calculating click-rate LRs may not be all that
made, however, for those individuals who lie towards the margins of the
distribution curve and who can be classified as outliers with respect to click
rate.
7.7 Conclusion
speaker behavior is largely unsupported by the present data for young male
individuality. Secondly, even for the high-rate clickers who stand apart from
228
comparison casework. In spite of these findings, it is suggested that, in certain
cases, it may well be. Studies such as those of Wright (2007; 2011a; 2011b) and
and affective meanings, provide normative data and descriptions. For this
reason, these studies allow forensic practitioners to assess the speech samples
for providing valuable resources. This is, in fact, just a further instance of a
different for other varieties of English or (for example) differ in accordance with
speaker age or gender - and nothing has been found in the sociolinguistic
literature on English to support that view - the mere comparison of click rates
229
Chapter 8 Overall Likelihood Ratios
8.1 Introduction
The survey of FSC practices (Chapter 3) revealed that for the vast
(despite some parameters having greater weight than others). For this reason,
the current chapter addresses the issue of combining parameters for speaker
an FSC has traditionally been carried out by experts through implicit mental
that the speakers in the suspect and criminal samples are the same person
transparent. As such, it has been argued that different experts will weigh certain
parameters more highly than others, based purely on personal opinions (Rose
and Morrison, 2009). For this reason, the traditional method of parameter
Bayes theorem on the other hand offers a more explicit and transparent
230
LRs (or equivalently, the addition of individual LLRs) assuming that there is
and Bayesian Networks have been put forward to circumvent the problem
power, strength of evidence, and validity can be tested for all analyzed speech
combined system. However, the blend of approaches taken in this chapter has
never been used before. The chapter begins by exploring the existing
relationships between LTFD, AR, F0, and click rate to check for potential
combination methods given the (in)dependencies that exist amongst the given
231
parameters. After the combination of individual parameters into a complete
working system, the discriminant ability, strength of evidence, and validity are
the strength of evidence for calculating OLRs for the data under scrutiny.
2004; Kinoshita, 2002; Rose et al., 2003; Rose et al., 2004; Rose and Winter,
2010). The main focus of this earlier research is the issue of combining
linguistic phonetic parameters often recognized this problem but did nothing to
try to ameliorate it. For example, Kinoshita (2002) used nave Bayes to combine
LRs based on the best-performing set of formant predictors from /m/, //, and a
parameters).
232
from /: / in Japanese. Linear regression was applied to the parameters to
assess the degree of correlation only within individual parameters (i.e. the
formants). The individual LRs were combined into an OLR using an assumption
tested because it was assumed that given the very different phonetic nature of
the three segments used, there was unlikely to be much correlation between all
but their highest formants (Rose et al. 2003 p. 195). Rose et al. (2003) make a
predict that the parameters are not correlated. However, without further
testing, correlations may go unexposed (and unrealized). Rose et al. (2003) also
note that linguistic theory leads them to believe that the higher formants may
be correlated, yet nothing is done to account for it. Therefore, it is probable that
the results produced for the study were over- or under-estimations of the
strength of evidence.
Aitken and Lucys (2004) MVKD formula treats the set of data from which LRs
are computed as multivariate data, and as such is able to account for within-
strength of SS and DS LRs compared with the more conservative MVKD model.
The proportion of errors was also better when independence was assumed,
which led Rose et al. (2004) to conclude that the correct formula is still not
233
exploiting all the discriminability in the speech data and (as such) the Idiots
approach nave Bayes is still preferable (Rose et al. 2004 p. 496). However
the study fails to discuss the fact that nave Bayes produces misrepresentative
was developed within the field of ASR (Brmmer et al., 2007; Gonzalez-
Rodriguez et al., 2007; Ramos Castro, 2007) and has since been applied in a
Morrison et al., 2010; Rose, 2010b; Rose, 2011) leading Rose and Winter (2010,
p. 42) to claim that fusion is one of the main advances to have emerged from
automatic methods.
Fusion is currently the only alternative to a nave Bayes approach for LR-
problems with fusion. Firstly, back-end processing, as the name suggests, deals
with correlations after the generation of numerical LRs has been performed.
are not correlated by virtue of their internal structure and which therefore
234
generate non-correlated LRs. More broadly, there is also an issue of efficiency.
Since fusion is implemented after the generation of LRs, the original analysis
8.3 Data
The present study does not introduce any new data, as it works with the data
presented in Chapters 4-7. The parameters under consideration are mean long-
term formant frequency distributions (for F1-4), mean articulation rate, long-
8.4 Correlations
This section considers potential correlations that exist within and between
found for the group of 100 speakers do not exhibit the same patterns as those
8.4.1 Methodology
correlations were calculated for the data: those within LTFD (i.e. LTFD1, LTFD2,
LTFD3, LTFD4) and those between parameters (LTFD1-4, AR, F0, click rate).
The formants within LTFD are treated as individual parameters for correlation
testing, given that phonetic theory has established that individual formant
235
measurements represent different physiological aspects of a vowel (e.g.
point per person so a mean value was calculated for each speakers LTFD1-4,
AR, and F0 (i.e. three separate means). The data for click rate already existed as
a single data point for each speaker, so no additional mean calculations were
required.
coefficients, as the latter assesses how well the relationship between two
consideration are not known (nor can they be assumed to be linear), the
calculate the correlation coefficients for all pairs of parameters. Table 8.1
presents all six possible pairing combinations for the LTFD parameter, and
236
Table 8.1: Formant pairings within LTFD
Parameter 1 Parameter 2
Long-term Formant Distributions
LTFD1 LTFD2
LTFD1 LTFD3
LTFD1 LTFD4
LTFD2 LTFD3
LTFD2 LTFD4
LTFD3 LTFD4
Parameter 1 Parameter 2
LTFD1 Mean F0
LTFD1 AR
LTFD1 Click Rate
LTFD2 Mean F0
LTFD2 AR
LTFD2 Click Rate
LTFD3 Mean F0
LTFD3 AR
LTFD3 Click Rate
LTFD4 Mean F0
LTFD4 AR
LTFD4 Click Rate
Mean F0 AR
Mean F0 Click Rate
AR Click Rate
Tables 8.1 and 8.2 are complete lists of all 21 parameter pairings. The first and
second columns simply identify the parameters that are being compared against
each other.
237
independence is made by the expert, which in turn can result in different
this can cause variation in the LR results). For the purpose of this study,
presented in Figures 8.1 - 8.6. The y-axis presents one LTFD parameter, while
r = -0.16
LTFD1 (Hz)
238
r = 0.05
LTFD1 (Hz)
LTF
r = -0.03
Figure x:
LTFD1 (Hz)
239
r = 0.43
LTFD2 (Hz)
r = 0.20
LTFD2 (Hz)
240
r = 0.41
LTFD3 (Hz)
The scatterplots in Figures 8.1 - 8.3 do not exhibit any strong relationships
between the variables, and graphically suggest that there are no correlations.
Figures 8.4 and 8.6 are characterized as having moderate positive correlations,
while Figure 8.5 has a slightly weaker positive correlation. The correlation
present in Figure 8.5 (LTFD2 vs. LTFD4) is most likely representative of indirect
correlation, given that LTFD2 correlates with LTFD 3, and LTFD3 correlates
with LTFD4.
in Table 8.3. The intersection of a column and row indicates a given comparison,
and the value within the box is Spearmans rank correlation coefficient (r). A
241
Table 8.3: Correlation coefficients within LTFD
Based on the results seen in Figure 8.1 - 8.6 and Table 8.3, an informed
within LTFD1-4. The results suggest that LTFD2 is correlated with LTFD3,
LTFD3 is correlated with LTFD4, and LTFD2 is indirectly correlated with LTFD4
However, this correlation was not deemed to be significant (r is less than 0.25;
presented in Figures 8.7 - 8.12. The y-axis represents the first parameter, and
242
r = -0.13
r = 0.15
r = -0.14 r = -0.12
243
r = -0.04
AR (syllables/second)
r = -0.03
AR
F0 (Hz)
244
r = 0.10 r = -0.22
r = -0.13 r = -0.20
245
r = 0.08
r = 0.19
r = -0.06
r = -0.07
246
r = -0.06
AR (syllables/second)
The scatterplots in Figures 8.7 - 8.12 do not exhibit any signs of strong (or even
scatterplots have a limited number of data points (only 100 in this case)
compared with the number of data points for Figures 8.1 - 8.6, the calculation of
correlation coefficients is necessary (as was also seen in Table 8.3) to quantify
247
In terms of between-parameter correlations, Table 8.4 does not present any
new strong correlations (those within LTFD have already been discussed in
8.4.2). The strongest relationships found between parameters are those for
LTFD2 vs. click rate (-0.22), LTFD4 vs. click rate (-0.20), and LTFD2 vs. mean F0
(0.19). Linguistics literature and phonetic theory do not give any reason to lead
Therefore, the very weak correlations seen in Table 8.4 have most likely
happened by chance. Correlation does not imply causality, and these three cases
8.4.3.1 Discussion
Based on the results seen in Figures 8.7-8.12 and Table 8.4, an informed
judgment can be made with respect to the parameter correlations for the data
set. The results suggest that there is no parameter correlation between LTFD,
mean AR, mean F0, and click rate (confirmation given by Marjan Sjerps, p.c.). As
OLRs can be calculated for the system. A model does not currently exist with
43 However, it is possible that F0 and F2 could be related. High F2 values are associated with
tongue fronting, and as the tongue body fronts, it pulls on the hyoid bone, from which the larynx
is suspended. Laryngeal tension of this sort would promote higher F0, because of tension on the
vocal folds. Therefore, it is possible that one might anticipate a correlation between high F2
values and high F0 values.
248
which calculate numerical LRs for click rate, and therefore the OLRs calculated
8.5.1 Methodology
Individual LRs were calculated for LTFD1, AR, and F0 using a MatLab
and a separate LR was calculated for LTFD2-4 together using the same MVKD
formula. This was done in order for the algorithm to take into account the
methodology, whereby the test and the reference speakers came from the same
population of 100 speakers, was used for all LR calculations. Speakers 1-50
were used as the test speakers, while speakers 51-100 served as the reference
speakers.
The results from the individual LRs and the LR from LTFD2-4 were then
multiplied together following Nave Bayes (given that 8.4 demonstrated that
AR, F0, LTFD1, and LTFD2-4 were independent of one another) to form a
complete system. Additional variations of the system were also computed in the
same manner as for the complete system, whereby the LR for LTFD2-4 is always
calculated together (in the MVKD formula) and multiplied by the other
calculate basic statistics, EER, and Cllr for the OLR system and variations on this
system.
249
Logistic-regression calibration was also applied to the complete system
(see 8.5.3 for further discussion) was applied in the first instance to individual
complete system after the parameters had been combined in order to compare
The results of the OLR for the complete system are provided in Table 8.5.
deviation), and AR (mean). The first column in Table 8.5 presents the
mean LLR, minimum LLR, and max LLR. The final two columns report on the
complete systems validity where the sixth column provides the EER and the
Table 8.5: Summary of LR-based discrimination for the complete system (100 speakers)
Comparison % Correct Mean LLR Min LLR Max LLR EER Cllr
Complete System SS 92.00 5.673 -3.082 7.316
.0607 .3793
Complete System DS 93.27 1.560 -infinity 3.963
Table 8.5 shows that the combination of all parameters into the complete
system provides an EER of 0.0607, and a Cllr of 0.3793. It appears that the
stronger than that seen in the tests reported in Chapters 4-6. Figure 8.13
45This script was created by Niko Brmmer, modified by Geoffrey Morrison, and edited by
Vincent Hughes.
250
presents the Tippett plot of the complete system, and Figure 8.14 is a zoomed-in
251
Same speaker comparisons
Different speaker comparisons
Figures 8.13 and 8.14 illustrate the distribution of SS and DS pairs. The strength
of evidence of the DS pairs is higher than the strength of evidence offered by the
SS pairs. Following Champod and Evett (2000), the system has the potential to
that is considered very strong support. Figure 8.14 shows that the crossover
close to the zero threshold, but not on it, and it is possible that calibration of the
252
combined systems should they outperform the complete system. Table 8.6
provides ten alternative systems to the complete system from Table 8.5. The
Table 8.6: Summary of LR-based discrimination for alternative systems (100 speakers)
Comparison % Correct Mean LLR Min LLR Max LLR EER Cllr
LTFD1+LTFD2-4+F0 SS 92.00 5.691 -3.517 7.348
.0631 .4322
LTFD1+LTFD2-4+F0 DS 92.82 1.528 -infinity 4.151
LTFD1+LTFD2-4+AR SS 98.00 3.811 -1.423 5.380
.1310 .6101
LTFD1+LTFD2-4+AR DS 73.06 1.311 -infinity 4.082
LTFD1+LTFD2-4 SS 94.00 3.807 -1.618 5.426
.1361 .6348
LTFD1+LTFD2-4 DS 71.43 1.387 -infinity 4.508
LTFD1+F0+AR SS 86.00 2.432 -3.900 3.594
.0709 .4780
LTFD1+F0+AR DS 95.43 0.134 -infinity 2.373
LTFD1+F0 SS 82.00 2.150 -4.335 3.353
.0771 .5266
LTFD1+F0 DS 94.41 0.096 -infinity 2.306
LTFD1+AR SS 76.00 1.046 -1.967 2.143
.2284 .7873
LTFD1+AR DS 77.14 0.083 -infinity 2.101
LTFD2-4+F0+AR SS 96.00 5.220 -2.149 6.887
.0647 .4160
LTFD2-4+F0+AR DS 89.22 1.469 -infinity 3.848
LTFD2-4+F0 SS 96.00 5.249 -2.585 6.933
.0707 .4742
LTFD2-4+F0 DS 88.20 1.465 -infinity 3.817
LTFD2-4+AR SS 100.00 3.319 0.457 4.951
.0929 .8413
LTFD2-4+AR DS 60.20 0.789 -infinity 3.213
F0+AR SS 88.00 1.625 -2.959 2.457
.0855 .4197
F0+AR DS 91.47 0.048 -268.938 2.143
None of the alternative systems in Table 8.6 outperforms the complete system
in terms of validity. The next best performing system in terms of EER (after the
the inclusion of AR, which suggests that the inclusion of more parameters
253
8.5.3 Overall Likelihood Ratio Results: Calibrated
Fienberg, 1983), but has since made its way into automatic speaker comparison
loss46 in the computed LR values may lead to arbitrarily high errors. For this
been applied here to the complete system in two different orders to compare
calibrated results. Figures 8.15 and 8.16 illustrate the first method, in which
illustrates the results of the second method, where individual parameters were
254
Same speaker comparisons
Different speaker comparisons
Figure 8.15: Tippett plot of the complete system - parameters calibrated individually and then
combined
255
Same speaker comparisons
Different speaker comparisons
Figure 8.16: Zoomed-in Tippett plot of the complete system - parameters calibrated
individually and then combined (-6 to 6 LLR)
256
Same speaker comparisons
Different speaker comparisons
Figure 8.17: Zoomed-in Tippett plot of the complete system- system calibration after
combination of parameters (-10 to 10 LLR)
and 8.16) resulted in an EER of .0554 and a Cllr of .2831. There was an
improvement in both the EER and Cllr from the uncalibrated system of .0053
and .0962, respectively. The calibration of the complete system after the
value) of EER of .0011, and an improvement (i.e. a lower value) of Cllr of .1408.
Results show that for the complete system, calibration before combination
provides the best EER, while calibration after combination provides the best
Cllr. The differences between the two methods are minimal. However, one
improves the gradient result for incorrect/correct judgments (Cllr) while the
other improves the hard detection error rate (EER). In forensic speaker
257
comparison, a protocol for the order in which the application of calibration
should take place has not been previously discussed. Therefore, more research
applied. It is also entirely possible that calibration could be applied twice, once
before combination and once after. However, that was not tested here.
8.6 Discussion
This section focuses on three key discussion points: whether clicks can help
from all systems combinde to the results found for the individual parameters,
system, however, did not include click rate as one of the combined parameters,
as click rate did not lend itself to the calculation of numerical likelihood ratios.
The system should now consider whether click rate has the potential to help in
those pairs that include any of the speakers that had extreme outlying click
rates (see 7.5.3). The three extreme outliers in terms of click rate (speakers
007, 024, and 033) form one half of the pairings in 20 DS pairs. If those extreme
258
click rates were considered within the complete system, one could propose that
those 20 DS pairs would then be judged correctly. This would then increase the
extreme click rate outliers are not a part of any of the incorrectly-judged SS
pairs, so click rate will not help further the comparison in these pairings.
improve the system performance to some degree. Although click rate could be
used to help discriminate those SS and DS pairs that were incorrectly identified,
there is the potential for click rate to also decrease the system performance. If
important to note that this discussion of system performance (where click rate
is included) remains hypothetical without including all of the click rate data. As
parameter (to a higher extent than AR, even). The inclusion of this unstable
parameter could cause more variation in OLRs, which could in turn weaken the
system. It is also important to consider that Aitken and Gold (2013) showed
that the LRs produced for click rate were associated with relatively weak
strength of evidence. Therefore, perhaps the inclusion of click rate in all 2500 SS
and DS comparisons may not contribute significantly to the OLRs, leaving the
259
8.6.2 Comparing Individual Parameters to the Systems
to their being placed into a combined system. Table 8.7 contains all the
Table 8.7: Summary of LR-based discrimination for individual parameters (100 speakers)
Comparison % Correct Mean LLR Min LLR Max LLR EER Cllr
LTFD1 SS 72.00 0.224 -2.158 1.902
.2806 .8840
LTFD1 DS 71.70 -4.858 -68.768 1.993
LTFD234 SS 74.00 0.649 -8.461 4.996
.0798 .9023
LTFD234 DS 95.31 -25.812 -166.915 3.046
F0 SS 92.00 0.958 -3.404 1.936
.0849 .4547
F0 DS 89.90 -24.204 -269.159 1.906
AR SS 90.00 0.180 -1.480 2.060
.3340 .8981
AR DS 46.20 -2.940 -8.760 0.820
Table 8.7 shows that LTFD234 has the lowest EER at .0798, followed by F0 at
.0849. LTFD1 and AR both have EERs around .30. The best-performing
parameters close to .90. The results for the individual parameters would
suggest that a system including LTFD234 and F0 will have the best opportunity
of performing well in terms of system validity. It also appears that a system that
by its DS comparisons.
260
Table 8.8 provides a summary of the improvements and deteriorations
individual parameter for a given LR statistic. The first column identifies the LR
statistic, and the second column indicates whether or not the complete system
performing individual parameter (i.e. LTFD1-4, AR, or F0). The final column
Table 8.8: Performance comparison between individual parameters and the complete system
Table 8.8 shows that the complete system outperformed any individual
parameter in terms of EER, Cllr, SS Mean LLR, DS Min LLR, and SS Max LLR
with respect to DS % Correct, DS Mean LLR, SS Min LLR, and DS Max LLR
complete system has (most importantly) the best system validity as well as the
261
evidence for a single parameters tends to be very limited for SS (see Table 8.7),
The same LR statistics presented in Table 8.8 are also considered with
8.9. Table 8.9 presents the LR statistic in the first column and identifies which
Table 8.9 identifies whether the complete system (light blue), an alternative
the given LR statistic. The complete system remains the best in terms of system
validity (which is the most important of the statistics). The alternative systems
achieve the best performance for five of the LR statistics, while individual
parameters are the best performing for two of the LR statistics. The results in
Table 8.9 confirm the opinions set out by experts in 3.10 that the inclusion of
262
respect to all the LR statistics. It is the case that there are smaller systems and
respects. Therefore, the general opinion of experts - that more parameters are
does not appear to be the case that the more parameters that are used for
8.6.3 Limitations
of validity it still has four important limitations to consider. The first limitation
better performance in all respects. Results suggest that the addition of the
performing individual parameters may not improve the overall system. This
Should the expert only select the best-performing parameters, in terms of EER,
speech? Additionally, the expert must recognize that with the addition of
confidence interval, and it may be the case that the confidence intervals
measure of credibility will be what sets the complete system apart from an
263
The second limitation to the complete system lies in the steps taken
one another. It is possible that the complete system is limited in respect of the
Chapter 5 for LTFD1-4. MVKD was used to calculate LRs for LTFD1-4. However,
dictate treating LTFD1 separately from LTFD234 (as was seen in 8.6.2), and
the LRs from the two sets to be multiplied following nave Bayes. The EER for
the MVKD combination was 0.0414, while the current chapter reports an EER
for LTFD1+LTFD234 of 0.1361. The two methods for combining LTFD1-4 lead
correlations into account when working with the given data has caused EER to
increase.
that different results can be achieved when the individual parameters are
a second phase after the combination of parameters has been carried out. More
264
research is needed in order to make an educated decision on the order in which
The final limitation is very basic, but it is perhaps the most important. It
on one another. A threshold of 0.25 was selected, but this was somewhat
testing and theory to allow for more reliable decisions to be made on the
and languages.
8.7 Conclusion
The results of this study have shown that the combination of parameters into a
Cllr). It is not necessarily the case that more parameters will improve all aspects
of the system, but where it matters most - in terms of validity - the addition of
thesis (AR, LTFD, F0, and to some extent clicks) raises the question of what will
happen when other parameters are added to the system. Following expert
opinion in this respect (see Chapter 3), one would expect validity to further
improve. However, it could be the case that there will be a threshold at which
265
that the addition of unstable (highly variable) parameters will make the
is expected that the strength of evidence for SS pairs will only increase as
parameters are added, while the strength of evidence will remain similar for DS
incorporated into a numerical LR. For this reason, the current complete system
can only serve as a building block contributing towards a larger system that will
266
Chapter 9 Discussion
In this chapter, the results from the previous six chapters (3-8) are summarized
and discussed. The results of the first international survey of FSC practices are
evaluated with respect to four of the valued phonetic and linguistic parameters
linguists working outside FSS, the degree of variation in methods will not be
surprising to those working in various other fields of forensic science (see the
insight into which parameters experts identified as being the most helpful
speaker discriminant parameters above all others. The following are the top five
1. Voice quality
2. Dialect/ accent variants and vowel formants
3. Speaking tempo and fundamental frequency
267
4. Rhythm
5. Lexical/grammatical choices, vowel and consonant realizations,
phonological processes, and fluency
parameters that hold the most discriminant power in FSCs. The discriminant
well expert expectations of the parameters matched the results, and whether
Chapter 3 as one of the most helpful speaker discriminants (ranked 3rd in 9.1).
influence methodology has on the calculation of AR: (1) the definition of the
speech interval does not significantly affect results, (2) varying the minimum
268
stable, and (3) testing suggests that the exclusion of speech segments, and
perhaps the definition of the syllable (i.e. phonetic versus phonological) may
The findings tell a different story from that predicted by expert opinion,
suggesting that there are misconceptions about the discriminant capacity of AR.
while DS pairs were correctly identified at a rate less than chance (46.2%).
Articulation rate contributed weak strength of evidence for SS pairs, and only
highest EER of the three parameters tested under the LR framework, i.e. AR,
speaker discriminants (ranked 2nd in 9.1). This provided the motivation for
further discriminant testing. The results from the analysis of LTFD provide both
269
specificity: (1) small changes in the package length for LTFD have only a small
effect on results, and (2) higher formants (LTFD3 and LTFD4) are suggested to
pairs were correctly identified 97.4% of the time. As a system, LTFD1-4 had an
EER of 0.0414 (the lowest EER of the three parameters tested under the LR
framework: AR, LTFD, and F0) and a Cllr of 0.5411. Despite the promising
separately, the combined LTFD2-4 system still achieves a low EER of .0798 and
a Cllr of 0.9023, where SS pairs and DS pairs are correctly identified 74% and
findings from Becker at al. (2008), Moos (2010), French et al. (2012), and Jessen
et al. (2013), suggest that LTFDs perform very similarly to MFCCs under
rather strong. The only potential limitation of LTFD is that it averages across all
vowels that relate accent information. Unless a single vowel phoneme can yield
more promising results, the evidence suggests that LTFD should be considered
Including both could be seen as doubling evidence, insofar as LTFD measurements encompass the
47
270
9.2.3 Long-Term Fundamental Frequency
3) as being one of the most helpful speaker discriminants (ranked 3rd in 9.1).
This motivated empirical testing of the parameter. The results from the analysis
the phonetic findings, there are two main observations. Firstly, small changes to
the package length of F0 only have a small effect on the results (as we saw with
correlate with LTFD1-4. Given previous research (Narang et al., 2012; Syrdal
and Steele, 1985) it might have been expected that F0 and LTFD1 would be
correlated, especially since using Lombard speech it has been shown that F1
LTFD1 was also reported by Moos (2010), and may be an indication that F0 and
phonemes (Narang et al., 2012; Syrdal and Steele, 1985). However, once F0 is
compared to an LTFD that relationship is lost, perhaps because (i) F0 and F1 are
not correlated for all phonemes (and an averaging of phonemes eliminates any
strong correlation present in the data), or more likely (ii) there is non-vowel
pairs 92% of the time, and DS pairs 89.9% of the time. F0 contributed a rather
weak strength of evidence for SS pairs, while DS pairs had a much stronger
271
of the three parameters tested under the LR framework: AR, LTFD, and F0) and
a Cllr of 0.4547. Given the findings and the plethora of previous LR research on
factors (e.g. disguise, recording transmission, vocal effort; see 2.2 for more
forensic case. The difference between the F0 in the criminal and suspect
recordings (the suspect sounded more nervous in the criminal recording than
the suspect recording (Boss, 1996, p. 156)). Unless it were to transpire that F0 is
robust to many of the factors detailed in 2.2, the mere comparison of mean F0s
and SDs is on its own unlikely to advance the speaker comparison task
Again, this motivated empirical testing. The results from the analysis of click
272
of the phonetic findings, there are two pertinent observations: (1) discourse
effects are present in clicks, in that the rate of clicking, rather than being solely a
beyond the variety of English analyzed in this thesis, the view of those forensic
present data for young male speakers of SSBE. Click data is positively skewed
and discrete, and there is currently no method available for deriving an LR from
them (although see Aitken and Gold (2013) for current developments in
proposed algorithms for calculating the LRs of clicks). For this reason, the
speaker individuality. Additionally, even for the high-rate clickers who stand
that one would need speech samples of a length seldom encountered in criminal
forensic recordings in order to reliably establish an overall click rate. Given the
patterns of clicking behavior are different for other varieties of English or differ
273
in accordance with speaker age or gender - and nothing has been found in the
of click rates across samples is, in the overwhelming majority of cases, unlikely
to advance the speaker comparison task. However, as with AR, it is advised that
outliers.
long term formant frequencies (LTFD), fundamental frequency (F0), and clicks
as a combined, overall system. The forensic findings are assessed with respect
of SS and DS pairs correct, strength of evidence, and validity), how well expert
expectations of the parameters corresponded with the results, and if the results
of the parameters in this thesis was motivated by expert opinion and the lack of
AR, F0) correctly identified SS pairs 92% of the time, and DS pairs 93.3% of the
time. The combined system contributed very good strength of evidence for SS
pairs, and even stronger strength of evidence for DS pairs. Overall, the combined
274
system had an EER of 0.0607 and a Cllr of 0.3793. After the complete system
was calibrated, the EER and Cllr decreased to 0.0554 and 0.2831, respectively.
validity (EER and Cllr). It is not necessarily the case that the inclusion of more
of the system, but where it matters most (validity) the addition of parameters
does result in an improvement. The results of the complete system are almost as
good as those from ASRs under similar conditions (French et al., 2012). Table
9.1 compares the results of the research presented in this thesis against those
Table 9.1: Human-based results against ASR (Batvox) results from French et al. (2012) on
studio quality data
that the system developed in this thesis only incorporates three parameters
Out of all the views advanced by experts that were reported in Chapter 3,
perhaps the most valuable and the most accurate is that the combination of
275
current system serves as a starting point from which to expand. Additional
combination.
ASR performances on the same studio-quality recordings, and (2) the FSS
The research conducted for this thesis was not intended to provide a
course of the current research project, the methodologies employed have made
ASRs used for FSCs are typically known for demonstrating their validity through
error rates, and the testing of these ASRs is easily replicable. ASR error rates are
trial, and a false positive is a non-target trial classified as a target trial (van
and linguistic parameters) for FSCs has become known for not providing error
and click rate, is comparable to that of an ASR tested on the same type of data
8.0%49. An ASR system tested on the same data (French and Harrison, 2010)
reported false positive errors of 4.5%, and achieved zero false negatives.
system is dependent upon the expertise of the analyst. It is likely that some
given that the methodology for extracting data is relatively automatic50. AR and
click rate calculations are more dependent on the analyst. For example, when
and what to ignore, and for clicks the analyst must decide whether s/he is
hearing a click rather than a percussive. For this reason, caution should be
French et al. (2012) showed that the ASR only minimally outperformed the
277
human-based system on studio-quality data. However, such a comparison puts
1997; French and Harrison, 2010; Reynolds, 2002; Reynolds et al., 2000). The
quality gets worse. French and Harrison (2010) tested an ASR on real case data
where the outcomes of the cases were known to the authors (i.e. guilty/not
guilty verdicts). For the 767 comparisons undertaken, the ASR achieved an EER
of 24.2%. When the ASR was tested on the cases in which the system judged the
comparisons and produced an EER of 5.4%. When the ASR was further tested on
15.1% was returned for the 369 comparisons. The results indicate a steep fall in
performance for the ASR when processing less than ideal quality recordings.
ASRs when testing lower-quality recordings. For example, the 767 real forensic
comparisons used for testing with an ASR by French and Harrison, 2010 had
278
9.4.1.2 Scope for Improvements in the Human-Based System
The human-based system created for this thesis was limited to four
parameters owing to the inherent time restrictions of the research. Given the
analysis presented in 8.6 and 8.7, it is reasonable to assume that the addition
of good speaker discriminants would increase the validity of the system. With
parameters should not be done ad hoc, but should involve phonetically- and
that are good speaker discriminants (e.g. voice quality, VOT, or those
parameters that are significantly correlated with others (but if they are,
based system with ASRs. This could potentially be similar to the Vocalise 51
software package created by Oxford Wave Research Ltd (Vocalise, 2013) that
LTFD; see 10.2 for further discussion of Vocalise). Given the good performance
of the ASR on the data tested, and the good performance achieved by the
choose between the inclusion of MFCCs or LTFDs, as they effectively analyze the
same aspects of a speaker (i.e. vocal tract resonances). LTFD analysis provides
the analyst with an indication of the speakers habitual use of the vowel space,
279
which is strongly correlated with the dimensions of the vocal tract (Moos, 2010;
French et al., 2012). MFCC analysis reports on approximately the same aspect,
tract. For this reason, LTFDs and MFCCs are highly correlated (French et al.,
2012) and the inclusion of both parameters would result in the doubling of
that MFCCs should be selected over LTFDs. An integrated system could then
consist of ASR analysis, F0, AR, and click rate, which again is entirely possible if
ASR (MFCC) data are found to be independent of the other parameters included
52In ASR analysis more time would be required of the analyst for the editing of recordings
(cleaning them up, for example, with a bandpass filter), and for human-based analysis more
time would be required for the analysts to extract measurements and additional data from the
poor quality recordings.
280
trade-off with the variation that may be present in the FSC results. The more
in results. For example, if an FSC comparison was vital to a criminal case, rather
than having an ASR completely reject a case that was of poor recording quality
(or offer a conclusion with a very high EER associated with it), a human-based
and the facts of the evidence; however, within FSCs the application of the LR
(e.g. package lengths of LTFD or F0, as seen in 5-6) in order to achieve a more
The present study did not address the debate surrounding the
delimitation of the relevant population, as intrinsic LRs were carried out on the
known optimal population (i.e. the test and reference sets of speakers came
281
the relevant population will remain a difficulty facing FSCs, as was also argued
subside. See Hughes (in progress) for further information and discussion on the
Harrison (2007) and French et al. (2010), is a very real problem for the
testament to this, for the present study it took nearly three years to collect
population statistics for just four parameters for 100 speakers in a single and
very specific population (and, moreover, these data were extracted from a
person. At that point, given the occurrence of sound change, it would be time to
scrap the collected population statistics and start again. Such a feat hardly
faced by the FSS community, because as it stands only certain parameters can
282
number of parameters beside vowels to play a role in characterizing an
multiple options for this, but little testing has been done to compare such
methods. It is also the case that only three of these methods (i.e. fusion, MVKD,
GMM, and nave Bayes) are ever really implemented in FSC. If the numerical LR
framework is to become the way of the future, then research should consider
the use of Bayesian Networks for combining speech evidence; these have
inevitable, and the research presented in this thesis is no exception. This section
outlines and discusses three general methodological limitations: (1) the absence
of non-contemporaneous data, (2) the use of intrinsic LR testing, and (3) using
only speakers 1-50 for all the same-speaker comparisons when calculating LRs.
The recordings used in the present study were all obtained from single
283
studies citing the importance of incorporating recordings that are made several
days, weeks, months, or even years apart (see Enzinger and Morrison, 2012;
Loakes, 2006; Morrison et al., 2012b; Nolan et al., 2009; Rhodes, 2013).
However, I would argue that there are a number of other external factors that
few days (such as the effects of accommodation reported in 7.5.5). In any case,
the results presented here, despite the fact that they are based on the use of
criminal samples on which to test the human-based system. The LRs calculated
in the present study were based on the DyViS data set of recordings of 100
speakers, whereby the first 50 speakers always acted as the test samples and
the second set of 50 speakers always acted as the reference samples. This means
that the 100 speakers were in some way a part of either the test or reference
sample, and that tests were not conducted using outside data sets. As a result,
The final limitation is the way the data (i.e. speakers) were divided for
resulting LRs. For empirical testing purposes, the calculation of intrinsic LRs
typically requires a set of speakers to act as both the criminal and the suspect,
284
while an additional set of speakers must act as the reference population. The
population can vary and is entirely at the analysts discretion. For convenience
this thesis divided the data set evenly. As such, the calculation of LRs produced
the speakers to have been divided in a number of alternative ways, which might
may serve as a helpful tool for casework. It is also suggested that all forensic
relative ease of extracting LTFDs from speech recordings means that this
FSCs.
impossible. For this reason, experts are faced with a number of decisions,
with other forensic disciplines. Judge Hodgson from Australia argues that not all
types of evidence can be sensibly assigned an LR, and that therefore there is no
(Hodgson, 2002). Should the field of FSS agree on this statement, it is worth
54 By complete it is meant that all possible analyzable speech parameters are included.
285
LR. If all speech evidence can be forced into a numerical LR model then an
experts job is done. However, this does not seem a plausible possibility, given
the complexity of speech evidence. For this reason, it is my view that experts are
left with three options for the future: they must take the first two if they wish to
align with other forensic disciplines, and the third, a default option, if they are
logically and legally correct framework within which to present evidence, the
framework will also be played by regulations under the country of practice and
simply practicality issues. All countries and institutions are constrained by the
laws, rules, and regulations in which they work. In 3.11, an expert from China
able to choose their conclusion framework, their decision will come down to a
practicality factor. Given the results of this thesis, I believe that the implications
for the field of FSCs are such that a complete numerical LR is unrealistic, and
286
Chapter 10 Summary & Conclusion
The following chapter reviews the empirical results and discussion detailed in
Chapters 3-9 by relating them back to the four research questions introduced in
Chapter 2. The chapter concludes with suggestions for future work that would
SD, range, alternative baseline), voice quality, intonation, speech tempo (AR and
Out of all the possible parameters analyzed in FSCs, the experts top
ranked parameters (the first three) in terms of the expected discriminant ability
were: (1) voice quality, (2) dialect/accent variants and vowel formants, and (3)
speaking tempo and F0. Three of these parameters (LTFD, AR, F0) and clicks
287
(2) If experts are to provide their opinion on the most helpful speaker
discriminants, will these selected parameters be good speaker
discriminants?
Chapters 4-7 presented the results of the discriminant ability of AR, LTFD, F0,
and clicks. LTFD and F0 showed promising results, with EERs of less than .1. AR
and clicks were not as good at discriminating between speakers. AR had an EER
of .33, and although LRs were not calculated for clicks, the results suggested that
The results presented in Chapters 4-7, in combination with the results from the
survey in Chapter 3, suggest that disparities exist between expert opinions and
Table 8.5: Summary of LR-based discrimination for the complete system (100 speakers)
Comparison % Correct Mean LLR Min LLR Max LLR EER Cllr
Complete System SS 92.00 5.673 -3.082 7.316
.0607 .3793
Complete System DS 93.27 1.560 -infinity 3.963
for LTFD, F0, and AR in combination. The system combining these three
288
parameters produces a lower EER and Cllr than any individual parameter. The
results supported the view expressed by survey participants, i.e. that the
The proper combination of speech evidence will vary depending on the given
data set. However, for all cases, correlations should be tested in order to
evidence). For the given data set, dependent relationships were identified
amongst LTFD2-4, and MVKD was used to account for the existing correlations
mutually exclusive and the speech evidence was therefore combined using
Nave Bayes.
Chapter 8 revealed that for the given data set, the addition of more parameters
improved system validity. However, this did not improve all of the LR statistics
(e.g. exceptions included SS and DS percent correct, SS and DS Mean LLR, and SS
289
Chapter 9 discussed the limitations and implications of using a numerical LR
framework that arose during the development of the system created for this
thesis. Those limitations include, but are not confined to: subjective elements of
LRs for complex speech data distributions, and the range of methods used to
Chapter 9 concluded that two fundamental factors the amount of time needed
are the same as those used by an ASR. However, the time needed to collect and
analyze the data for the actual LR calculation is much more intensive for a
290
human-based system was capable of producing results comparable to those of
ASRs. It also proposed that with real forensic material (e.g. degraded or shorter
ASR.
(1) the integration of ASRs and phonetic-linguistic parameters, and (2) research
into more transparent and successful ways for combining correlated speech
linguistic parameters. Future research in this area may benefit from the
MFCCs.
and logical method for combining evidence (see Evett et al., 2002). In forensics,
the use of Bayesian Networks for the derivation of FSC LRs enables any
291
correlations that exist between (or within) parameters to be weighted
10.3 Conclusion
This thesis set out to explore viability of aligning the field of FSC with other,
the FSS community if experts are to continue in their efforts to align themselves
count (those which are discrete rather than continuous, and which cannot be
adequately quantified). In the process of addressing the main aims of this thesis,
additional findings were presented, including: the survey of FSC practices, the
discriminant capacity of individual parameters (AR, LTFD, F0, clicks) and those
adopting a numerical LR framework for FSCs. It is also hoped that this research
292
Appendix A
Survey Instructions:
Each participant was emailed a unique, secured web-link that directed them to
the survey. The survey had general instructions on the first page, which read:
Please keep in mind that there are no right or wrong answers. This survey is
meant to serve as a tool to gain insight into the general practices in forensic
speaker comparisons around the world as well as finding out which features
All answers will be kept anonymous and names of participants will never be
Before each of the 9 sections of the survey there were additional instructions to
remind the participants to generalize to the best of their ability across all cases
they had worked on rather than always responding that the given feature was
questions with regards to all speaker comparison cases you have worked on. I
understand that many features and analyses are case dependent, so please do
293
There were some participants who gave the predicted it depends on the case
answer for questions, but for the most part the majority did an excellent job at
294
Appendix B
Example of a Bayesian Network for Speech Evidence:
Bayesian Networks have never been used in FSS before, yet they offer a
method for combining evidence that is both transparent and readily accepted
by other forensic communities (Evett et al., 2002). For this reason, further
[Figure: example Bayesian Network with parameter nodes for Articulation Rate, LTFD1, Click Rate, and F0.]
Bayesian Networks, such as the one shown in the figure above, can be used to calculate Overall
LRs (OLRs) provided that probability densities and variances exist for each
parameter. If a case does not account for evidence from a certain parameter in
the figure above, it is possible to simply leave that node out and the remaining
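As a rough illustration of this point, the Python sketch below assumes the simplest possible network structure, in which every parameter node depends only on the same-/different-speaker hypothesis, so the overall LR factorises across the nodes; the density values are hypothetical placeholders, and a node with no usable evidence is simply omitted from the product.

    def overall_lr(node_densities, observed_nodes):
        # node_densities maps each node to a pair of probability densities
        # (same-speaker, different-speaker) evaluated at the observed evidence.
        # Nodes with no usable evidence are simply left out of the product.
        olr = 1.0
        for node in observed_nodes:
            p_same, p_diff = node_densities[node]
            olr *= p_same / p_diff
        return olr

    # Hypothetical density values for the four nodes in the figure above.
    densities = {
        "Articulation Rate": (0.42, 0.15),
        "LTFD1": (0.08, 0.02),
        "Click Rate": (0.30, 0.25),
        "F0": (0.12, 0.40),
    }

    # A case with no usable click-rate evidence: that node is dropped.
    print(overall_lr(densities, ["Articulation Rate", "LTFD1", "F0"]))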
List of Abbreviations
Δ change or difference
AR articulation rate
DS different speaker
F0 fundamental frequency
LR likelihood ratio
MFCC Mel-frequency cepstral coefficient
SD standard deviation
SS same speaker
VQ voice quality
References
Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh
University Press.
Aitken, C. G. G. and Taroni, F. (2004). Statistics and the Evaluation of Evidence for
Forensic Scientists (2nd ed.). Chichester: John Wiley & Sons, Ltd.
Ball, M. J. (1989). Phonetics for speech pathology. London: Whurr Publishers Ltd.
Bayes, T. (1736). An Introduction to the Doctrine of Fluxions, and Defence of the
Mathematicians against the Objections of the Author of the Analyst. Printed for
J. Noon, London. The Eighteenth Century Research Publications Microfilm A
7173 reel 3774 no. 06.
Becker, T., Jessen, M., and Grigoras, C. (2008). Forensic speaker verification
using formant features and Gaussian mixture models. Proceedings of
Interspeech 2008. Brisbane, pp. 1505-1508.
Bertsch McGrayne, S. (2012). The Theory that Would Not Die: How Bayes' Rule
Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged
Triumphant from Two Centuries of Controversy. New Haven and London: Yale
University Press.
Bhuta, T., Patrick, L., and Garnett, J. D. (2004). Perceptual evaluation of voice
quality and its correlation with acoustic measurements. Journal of Voice,
18(3), pp. 299-304.
Brümmer, N., Burget, L., Černocký, J. H., Glembek, O., Grézl, F., Karafiát, M., van
Leeuwen, D. A., Matejka, P., Schwarz, P., and Strasheim, A. (2007). Fusion of
heterogeneous speaker recognition systems in the STBU submission for the
NIST SRE 2006. IEEE Transactions on Audio Speech and Language Processing,
15, pp. 2072-2084.
Byrne, C. and Foulkes, P. (2004). The mobile phone effect on vowel formants.
International Journal of Speech, Language and the Law, 11, pp. 83-102.
Campbell, J., Shen, W., Campbell, W., Schwartz, R., Bonastre, J. F., and Matrouf, D.
(2009). Forensic speaker recognition: A need for caution. IEEE Signal
Processing Magazine, March 2009, pp. 95-103.
Champod, C. and Meuwly, D. (2000). The inference of identity in forensic speaker
recognition. Speech Communication, 31, pp. 193-203.
Clark, J. and Yallop, C. (2001). An Introduction to Phonetics and Phonology (2nd
ed.). Oxford: Blackwell Publishers Ltd.
Clermont, F., French, P., Harrison, P., and Simpson, S. (2008). Population data for
English spoken in England: A modest first step. Paper presented at the
Conference of the International Association of Forensic Phonetics and
Acoustics. Plymouth, United Kingdom.
Crystal, D. (1987). The Cambridge Encyclopaedia of Language. Cambridge:
Cambridge University Press.
de Jong, G., McDougall, K., and Nolan, F. (2007). Sound Change and Speaker
Identity: An Acoustic Study. In: Christian Müller (ed.), Speaker Classification
II: Selected Papers. Berlin: Springer, pp. 130-141.
Evett, I. W. (1990). The theory of interpreting scientific transfer evidence.
Forensic Science Progress, 4, pp. 141-179.
Evett, I. W., Gill, P. D., Jackson, G., Whitaker, J., and Champod, C. (2002).
Interpreting small quantities of DNA: The hierarchy of propositions and the
use of Bayesian networks. Journal of Forensic Sciences, 47(3), pp. 520-530.
Evett, I. W., Jackson, G., Lambert, J. A., and McCrossan, S. (2000). The impact of
the principles of evidence interpretation on the structure and content of
statements. Science & Justice, 40, pp. 233-239.
Evett, I. and Weir, B. S. (1998). Interpreting DNA Evidence: Statistical Genetics for
Forensic Scientists. Sunderland, Massachusetts: Sinauer Associates.
French, P., Foulkes, P., Harrison, P., and Stevens, L. (2012). Vocal tract output
measures: relative efficacy, interrelationships and limitations. Paper
presented at the Conference of the International Association of Forensic
Phonetics and Acoustics Conference, Santander, Spain.
French, P. and Harrison, P. (2010). The work of speech and audio analysts. AES
Conference no. 128. London, United Kingdom.
French, J. P., Nolan, F., Foulkes, P., Harrison, P., and McDougall, K. (2010). The
UK position statement on forensic speaker comparison: a rejoinder to Rose
and Morrison. International Journal of Speech, Language and the Law, 17(1),
pp. 143-152.
Fritzell, B., Sundberg, J., and Strange-Ebbesen, A. (1982). Pitch change after
stripping oedematous vocal folds. Folia Phoniatrica, 34, pp. 29-32.
Gold, E. (2009). The Effects of Video and Voice Recorders in Cellular Phones on
Vowel Formants and Fundamental Frequency. Unpublished; University of
York. MSc.
Gonzalez-Rodriguez, J., Rose, P., Ramos D., Toledano, D. T., and Ortega-Garcia, J.
(2007). Emulating DNA: rigorous quantification of evidential weight in
transparent and testable forensic speaker recognition, IEEE Transactions of
Audio, Speech and Language Processing, 15, pp. 2104-2115.
Greenberg, C. S., Martin, A., Brandschain, L., Campbell, J., Cieri, C., Doddington, G.,
and Godfrey, J. (2010). Human Assisted Speaker Recognition in SRE10. Proc.
Odyssey 2010, Brno, Czech Republic, June-July 2010.
Guillemin, B. and Watson, C. (2008). Impact of the GSM mobile phone network
on the speech signal: some preliminary findings. International Journal of
Speech, Language and the Law, 15(2), pp. 193-218.
Hand, D. J., and Yu, K. (2001). Idiot's Bayes - not so stupid after all? International
Statistical Review, 69(3), pp. 385-399.
Hecker, M. L., Stevens, K. N., von Bismarck, G., and Williams, C. E. (1968).
Manifestations of task-induced stress in the acoustic speech signal. Journal of
the Acoustical Society of America, 44, pp. 993-1001.
Hollien, H. (1980). Vocal indicators of psychological stress. In Wright, F., Bahn,
C., and Rieben, R.W. (eds.) Forensic Psychology and Psychiatry. New York:
John Wiley & Sons Ltd., pp. 47-72.
Hudson, T., de Jong, G., McDougall, K., Harrison, P., and Nolan, F. (2007). F0
statistics for 100 young male speakers of Standard Southern British English.
In 16th Proceedings of the International Congress of Phonetic Sciences,
Saarbrücken, pp. 1809-1812.
Ishihara, S. and Kinoshita, Y. (2008). How many do we need? Exploration of the
Population Size Effect on the performance of forensic speaker classification.
9th Annual Conference of the International Speech Communication Association
(Interspeech). Brisbane, Australia, pp. 1941-1944.
Jessen, M. and Roux, J.C. (2002). Voice quality differences associated with stops
and clicks in Xhosa. Journal of Phonetics, 30, pp. 1-52.
Jessen, M., Köster, O., and Gfroerer, S. (2005). Influence of vocal effort on
average and variability of fundamental frequency. International Journal of
Speech, Language, and the Law, 12(2), pp. 174-213.
Johnson, K. (2003). Acoustic and Auditory Phonetics (2nd ed.). Malden, MA:
Blackwell.
Kavanagh, C. (2011). Intra- and inter-speaker variability in duration and
spectral properties of English /s/. Paper presented at the Conference of the
Acoustical Society of America, San Diego, USA.
Kinoshita, Y., Ishihara, S., and Rose, P. (2009). Exploring the discriminatory
potential of F0 distribution parameters in traditional forensic speaker
recognition. International Journal of Speech, Language and the Law, 16, pp.
91-111.
Kirchhbel, C. and Howard, D. (2011). Investigating the acoustic characteristics
of deceptive speech. Proceedings of the 17th International Congress of
Phonetic Sciences, August 17-21, 2011, Hong Kong, China, pp. 1094-1097.
Koreman, J. (2006). Perceived speech rate: the effects of articulation rate and
speaking rate in spontaneous speech. Journal of the Acoustical Society of
America, 119, pp. 582-596.
Künzel, H., Braun, A., and Eysholdt, U. (1992). Einfluß von Alkohol auf Stimme
und Sprache. Heidelberg: Kriminalistik-Verlag.
Laplace, P. S. (1781). Mémoire sur les Probabilités. OC (9), pp. 383-485.
Lindh, J. and Morrison, G. S. (2011). Humans versus machine: Forensic voice
comparison on a small database of Swedish voice recordings. Proceedings of
the 17th International Congress of Phonetic Sciences, Hong Kong, China, pp.
1254-1257.
Miller, J. L., Grosjean, F., and Lomanti, C. (1984). Articulation rate and its
variability in spontaneous speech: a reanalysis and some implications.
Phonetica, 41, pp. 215-225.
McDougall, K. (2004). Speaker-specific formant dynamics: an experiment on
Australian English /aɪ/. International Journal of Speech, Language and the
Law, 11 (1), pp. 103-130.
Morrison, G. S. (2013). Tutorial on logistic-regression calibration and fusion:
converting a score to a likelihood ratio. Australian Journal of Forensic
Sciences, 45, pp. 173-197.
Morrison, G. S., Ochoa, F., and Thiruvaran, T. (2012). Database selection for
forensic voice comparison. Proceedings of Odyssey 2012: The Language and
Speaker Recognition Workshop, Singapore, International Speech
Communication Association, pp. 62-77.
Morrison, G. S., Rose, P., and Zhang, C. (2012). Protocol for the collection of
databases of recordings for forensic-voice-comparison research and practice.
Australian Journal of Forensic Sciences, 44, pp. 155-167.
Morrison, G.S., Thiruvaran, T., and Epps, J. (2010). An issue in the calculation of
logistic-regression calibration and fusion weights for forensic voice
comparison. Proceedings of the 13th Australasian International Conference on
Speech Science and Technology. Melbourne, Australia, pp. 74-77.
Narang, V., Misra, D. and Yadav, R. (2012). F1 and F2 correlation with F0: A
study of vowels of Hindi, Punjabi, Korean, and Thai. International Journal of
Asian Language Processing, 22(2), pp. 63-73.
Nolan, F. and Grigoras, C. (2005). A case for formant analysis in forensic speaker
identification. International Journal of Speech, Language and the Law, 12(2),
pp. 143-173.
Nolan, F., McDougall, K., de Jong, G. and Hudson, T. (2009). The DyViS database:
style-controlled recordings of 100 homogeneous speakers for forensic
phonetic research. International Journal of Speech, Language and the Law,
16(1), pp. 31-57.
Novak, A., Dlouha, O., Capkova, B., and Vohradnik, M. (1991). Voice fatigue after
theater performance in actors. Folia Phoniatrica, 43, pp. 74-78.
Oates, J. M. and Dacakis, G. (1983). Speech pathology considerations in the
management of transsexualism – a review. British Journal of Disorders of
Communication, 18, pp. 139-151.
Pike, K. (1943). Phonetics: A critical analysis of phonetic theory and a technic for
the practical description of sounds. Ann Arbor, MI: University of Michigan
Press.
Porwal, U., Shi, Z., and Setlur, S. (2013). Machine learning in handwritten Arabic
text recognition. In Govindaraju, V. and C. R. Rao (eds.) Handbook of Statistics
31 - Machine Learning: Theory and Applications. Amsterdam: North Holland,
pp. 443-470.
Reynolds, D. A., Quatieri, T. F., and Dunn, R. (2000). Speaker verification using
adapted Gaussian mixture models. Digital Signal Processing, 10 (1-3), pp. 19-
41.
Rhodes, R. (2013). Assessing the Strength of Non-Contemporaneous Forensic
Speech Evidence. Unpublished; University of York. PhD.
Robb, M., Maclagan, M. A., and Chen, Y. (2004) Speaking rates of American and
New Zealand varieties of English. Clinical Linguistics & Phonetics, 18(1), pp. 1-
15.
Rose, P. (2007b). Going and getting it – Forensic Speaker Recognition from the
perspective of a traditional practitioner/researcher. Paper presented at the
Australian Research Council Network in Human Communicative Science
Workshop: FSI not CSI – Perspectives in State-of-the-Art Forensic Speaker
Recognition, Sydney.
Rose, P. (2011). Forensic voice comparison with Japanese vowel acoustics – a
likelihood ratio-based approach using segmental cepstra. In W. S. Lee, E. Zee
(Eds.) Proceedings of the 17th International Congress of Phonetic Sciences,
Hong Kong, 17-21 August, pp. 1718-1721.
Rose, P. (2012). 'Yes, Not Too Bad' – Likelihood Ratio-Based Forensic Voice
Comparison in a $150 Million Telephone Fraud. Proceedings of the 14th
Australasian International Conference on Speech Science and Technology.
Macquarie University, Australia, pp. 161-164.
Rose, P. (2013b). Where the science ends and the law begins: Likelihood ratio-
based forensic voice comparison in a $150 million telephone fraud.
International Journal of Speech, Language and the Law, 20(2), pp. 277-324.
Rose, P., Kinoshita, Y., and Alderman, T. (2006). Realistic extrinsic forensic
speaker discrimination with the diphthong /aɪ/. Proceedings of the 11th
Australasian International Conference on Speech Science and Technology.
University of Auckland, New Zealand, pp. 329-334.
Rose, P., Lucy, D., and Osanai, T. (2004). Linguistic-acoustic forensic speaker
identification with likelihood ratios from a multivariate hierarchical random
effects model – a non-idiot's Bayes approach. Proceedings of the 10th
Australasian Conference on Speech Science and Technology. Macquarie
University, Australia, pp. 492-497.
Rose, P., Osanai, T., and Kinoshita, Y. (2003). Strength of forensic speaker
identification evidence: multispeaker formant- and cepstrum-based
segmental discrimination with a Bayesian likelihood ratio as threshold.
Forensic Linguistics, 10, pp. 179-202.
Saxman, J. H. and Burk, K. W. (1968). Speaking fundamental frequency
characteristics of adult female schizophrenics. Journal of Speech, Language,
and Hearing Research, 11, pp. 194-203.
Scherer, K. R., Helfrich, H., Standke, R., and Wallbott, H. (1976). Psychoakustische
und kinesische Verhaltensanalyse. Research report, University of Giessen.
Syrdal, A. and Steele, S. (1985). Vowel F1 as a function of speaker fundamental
frequency. Journal of the Acoustical Society of America, 78, S56.
Wright, M. (2011b). The phonetics–interaction interface in the initiation of
closings in everyday English telephone calls. Journal of Pragmatics, 43(4), pp.
1080-1099.
Xue S. A. and Hao J. G. (2006). Normative Standards for vocal tract dimensions
by race as measured by acoustic pharyngometry. Journal of Voice, 20, pp.
391-400.
COURT CASES
Joseph Crossfield and Sons Ltd v Techno Chemical Laboratories Ltd (29 TLR, 378
& 379 [1913])