INTERSPEECH 2015
Anomaly-Based Annotation Errors Detection in TTS Corpora
Jindřich Matoušek 1,2, Daniel Tihelka 2
1 Department of Cybernetics, 2 New Technologies for the Information Society (NTIS),
Faculty of Applied Sciences, University of West Bohemia, Czech Republic
[email protected], [email protected]
DOI: 10.21437/Interspeech.2015-146
Abstract
In this paper we adopt several anomaly detection methods to detect annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Correctly annotated words are considered normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to the normal patterns of the trained detection models. Word-level feature sets including basic features derived from forced alignment and various acoustic, spectral, phonetic, and positional features were examined. Dimensionality reduction techniques were also applied to reduce the number of features. The first results, with an F1 score of almost 89%, show that anomaly detection could help in detecting annotation errors in read-speech corpora for TTS synthesis.
Index Terms: annotation error detection, anomaly detection, read-speech corpora, speech synthesis
1. Introduction
Word-level annotation of speech data is still one of the most important processes for many speech-processing tasks. Concretely, concatenative speech synthesis methods, including the very popular unit selection, assume the word-level (textual) annotation to be correct, i.e. that the textual annotation literally matches the corresponding speech signal. However, in the case of the large speech corpora used today for corpus-based speech synthesis (usually tens of hours of speech), it is almost impossible to guarantee such a perfect annotation: (semi-)automatic annotation approaches (see, e.g., [1-7]) are still error-prone, and manual annotation is a time-consuming, costly, but, given the large amount of data, still not errorless process [8]. If not detected, any mismatch between speech data and its annotation may inherently result in audible glitches in synthetic speech [9].

As incorrect annotation is often manifested by gross phonetic segmentation errors, many studies have focused on various refinements of the segmentation scheme (see, e.g., [10-16]). In our previous work [17] we focused on a way to fix the origin of the segmentation errors, i.e. to fix the annotation errors. We proposed several classification and/or detection methods which attempted to detect annotation errors in a read-speech corpus suitable for text-to-speech (TTS) synthesis. In contrast to other studies [3, 5, 18-21], which focus rather on revealing bad phone-like segments, we aimed mainly at revealing word-level errors, i.e. misannotated words. We believe word-level annotation error detection is more robust because phone-level detection could result in many "false positive" detections. Using word-level features, a sequence of bad phone segments typical for a misannotated word can be revealed.

In this paper we further investigate the possibilities of using an anomaly detection (also called novelty detection or outlier detection [22]) approach to detect word-level annotation errors. In Sec. 2 we introduce the methods we use for anomaly detection. In Sec. 3 our data set is presented. Sec. 4 describes the various feature sets that we considered. In Secs. 5 and 6 experiments and results are described and discussed. Conclusions are drawn in Sec. 7.

This research was supported by the grant TAČR TA01030476. The access to the MetaCentrum clusters provided under the programme LM2010005 is highly appreciated.

2. Methods

The problem of the automatic detection of misannotated words can be viewed as a problem of anomaly detection. In general, anomaly detection is the identification of items (so-called anomalous items) which do not conform to an expected pattern or to other items in a data set [23]. In our case, misannotated words are considered anomalous examples, and correctly annotated words are taken as normal examples. In our scenario, anomaly detection can be viewed as an unsupervised detection technique under the assumption that the majority of the examples in the unlabeled data set are normal, or that the training data is not polluted by anomalies. Given only the normal training data, an algorithm creates a representational model of this data. If newly encountered data is too different from this model, it is labeled as anomalous. This can be perceived as an advantage over a standard classification approach, in which a substantial number of both negative (normal) and positive (anomalous) examples is needed. Nevertheless, if some anomalous examples are available in the anomaly detection framework, they can be used to tune the detector and to evaluate its performance.

Let us denote x^(1), ..., x^(Nn) the training set of normal (i.e. not anomalous) examples, where Nn is the number of normal training examples, each example x^(i) ∈ R^{Nf}, and Nf is the number of features.
2.1. Univariate Gaussian distribution
In this method, each feature xj (j = 1, ..., Nf) is modeled separately using a univariate Gaussian distribution (UGD) with mean µj ∈ R and variance σj² ∈ R, under the assumption of feature independence, i.e. xj ∼ N(µj, σj²). The probability of xj being generated by N(µj, σj²) can then be written as p(xj; µj, σj²).
The training consists of fitting the parameters µj, σj² using

    \mu_j = \frac{1}{N_n} \sum_{i=1}^{N_n} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{N_n} \sum_{i=1}^{N_n} \left( x_j^{(i)} - \mu_j \right)^2.    (1)

Having the estimated µj, σj², the probability of a new example x (either normal or anomalous) can be computed as

    p(\mathbf{x}) = \prod_{j=1}^{N_f} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{N_f} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right).    (2)

If p(x) is very small, i.e. p(x) < ε, the example x does not conform to the distribution of the normal examples and can be denoted as anomalous.
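As an illustration, a minimal sketch of this detector in Python (NumPy) follows; the feature matrix X_train, the evaluated examples X_test, and the random data are placeholders for illustration, not artifacts of the original experiments:

    import numpy as np

    def fit_ugd(X_train):
        """Fit per-feature Gaussians on normal training examples (Eq. 1)."""
        mu = X_train.mean(axis=0)
        var = X_train.var(axis=0)  # 1/Nn * sum((x - mu)^2), i.e. ddof=0
        return mu, var

    def ugd_is_anomalous(X, mu, var, eps):
        """Flag examples whose joint density (Eq. 2) falls below eps."""
        # Per-feature Gaussian densities, multiplied across features.
        dens = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
        p = dens.prod(axis=1)
        return p < eps

    # Usage: words as rows, features as columns.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1124, 28))  # normal (correctly annotated) words
    X_test = rng.normal(size=(10, 28))
    mu, var = fit_ugd(X_train)
    print(ugd_is_anomalous(X_test, mu, var, eps=0.005))

In practice, summing log-densities instead of multiplying raw densities avoids numerical underflow when the number of features grows.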
2.2. Multivariate Gaussian distribution
Multivariate Gaussian distribution (MGD) is a generalization of the univariate Gaussian distribution. In this case, the p(xj) are not modeled independently; instead, p(x) is modeled in one go using a mean vector µ ∈ R^{Nf} and a covariance matrix Σ ∈ R^{Nf×Nf}, i.e. x ∼ N_{Nf}(µ, Σ).

The training can now be written as

    \mu = \frac{1}{N_n} \sum_{i=1}^{N_n} \mathbf{x}^{(i)}, \qquad \Sigma = \frac{1}{N_n} \sum_{i=1}^{N_n} (\mathbf{x}^{(i)} - \mu)(\mathbf{x}^{(i)} - \mu)^\top.    (3)
The probability of a new example x being generated by N(µ, Σ) is now

    p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{N_f} |\Sigma|}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mu)^\top \Sigma^{-1} (\mathbf{x} - \mu) \right).    (4)

Again, if p(x) < ε, the example x is considered anomalous.
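A corresponding sketch, assuming SciPy is available; the small regularization term added to the covariance diagonal is our own safeguard against a singular Σ, not part of the original method:

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_mgd(X_train, reg=1e-6):
        """Fit mean vector and covariance matrix on normal examples (Eq. 3)."""
        mu = X_train.mean(axis=0)
        # rowvar=False: rows are examples; bias=True gives the 1/Nn estimate.
        cov = np.cov(X_train, rowvar=False, bias=True)
        cov += reg * np.eye(cov.shape[0])  # keep Sigma invertible
        return mu, cov

    def mgd_is_anomalous(X, mu, cov, eps):
        """Flag examples whose density (Eq. 4) falls below eps."""
        p = multivariate_normal(mean=mu, cov=cov).pdf(X)
        return p < eps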
2.3. One-class SVM

The one-class SVM (OCSVM) algorithm maps the input data into a high-dimensional feature space via a kernel function and iteratively finds the maximal-margin hyperplane which best separates the training data from the origin. This results in a binary decision function f(x) which returns +1 in a "small" region capturing the (normal) training examples and -1 elsewhere (see Eq. (8)) [24].
The hyperplane parameters w and ρ are determined by solving the quadratic programming problem

    \min_{\mathbf{w},\,\xi,\,\rho} \; \frac{1}{2}\|\mathbf{w}\|^2 + \frac{1}{\nu N_n} \sum_{i=1}^{N_n} \xi_i - \rho    (5)

subject to

    \mathbf{w} \cdot \Phi(\mathbf{x}^{(i)}) \ge \rho - \xi_i, \quad i = 1, 2, \ldots, N_n, \quad \xi_i \ge 0,    (6)
where Φ(x^(i)) is the mapping defining the kernel function, ξi are slack variables, and ν ∈ (0, 1] is an a priori fixed constant which represents an upper bound on the fraction of examples that may be anomalous. We used a Gaussian radial basis function kernel

    K(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\|^2 \right)    (7)

where γ is a kernel parameter and ||x - x'|| is a dissimilarity measure between the examples x and x'.
Solving the minimization problem (5) using Lagrange multipliers αi and using the kernel function (7) for the dot-product calculations, the decision function for a new example x then becomes

    f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \Phi(\mathbf{x}) - \rho) = \operatorname{sgn}\left( \sum_{i=1}^{N_n} \alpha_i K(\mathbf{x}^{(i)}, \mathbf{x}) - \rho \right).    (8)
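Since the experiments below use scikit-learn, a minimal OCSVM sketch with that toolkit might look as follows; the ν and γ values are the cross-validated ones reported later in Table 2, while X_train and X_test are placeholders:

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1124, 28))  # normal (correctly annotated) words
    X_test = rng.normal(size=(10, 28))

    # nu bounds the fraction of training examples treated as anomalous;
    # gamma is the RBF kernel parameter from Eq. (7).
    ocsvm = OneClassSVM(kernel="rbf", nu=0.005, gamma=0.03125)
    ocsvm.fit(X_train)

    pred = ocsvm.predict(X_test)  # +1 = normal, -1 = anomalous (Eq. 8)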
3. Experimental data

We used a Czech read-speech corpus of a single-speaker male voice [25], recorded for the purposes of unit-selection speech synthesis in the state-of-the-art text-to-speech system ARTIC [26]. The voice talent was instructed to speak in a "news-broadcasting style" and to avoid any spontaneous expressions. The full corpus consisted of 12242 utterances (approx. 18.5 hours of speech), segmented into phone-like units using HMM-based forced alignment (carried out by the HTK toolkit [27]) with acoustic models trained on the speaker's data [15]. From this corpus we selected Nn = 1124 words which were annotated correctly (i.e. normal examples), and Na = 273 words (213 of them being different) which contained some annotation error (i.e. anomalous examples). The misannotated words were collected during ARTIC system tuning and evaluation. The decision whether the annotation was correct or not was made by a human expert who analyzed the phonetic alignment.

4. Features

There were two kinds of features used in our experiments. Phone-level features were extracted for each phone given the phone boundaries generated by HMM-based forced alignment. Word-level features were collected directly on the word level, usually as a ratio or distribution of various phonetic properties within a word. The features are summarized in Table 1.

Table 1: Summary of features used for anomaly detection.

  Phone-level features
    Basic       duration, forced-aligned acoustic likelihood
    Acoustic    energy, formants (F1, F2, F3, F2/F1), fundamental frequency (F0), zero crossing, voiced/unvoiced ratio
    Spectral    spectral crest factor, rolloff, flatness, centroid, spread, kurtosis, skewness, harmonic-to-noise ratio
    Other       score predictive model (SPM) [14], energy/duration ratio, spectral centroid/duration ratio

  Word-level features
    Phonetic    phonetic voicedness ratio, sonority ratio, syllabic consonants ratio, articulation manner distribution, articulation place distribution, word boundary voicedness match [17]
    Positional  forward/backward position of word/phrase in phrase/utterance, the position of the phrase in an utterance

To emphasize anomalies in the feature values, each phone-level feature was modeled using a classification and regression tree (CART). This context-dependent model was trained on the same forced-aligned speech corpus as used throughout this paper, using various context questions (mainly phonetic, prosodic, and positional) to grow the tree, similarly as described for the duration feature in [17, 28]. For each phone and each feature, the leaves of the corresponding tree represent the predicted mean and standard deviation of the feature. The deviation of the actual feature value from the CART-predicted value was then expressed by means of z-scores. For each phone and feature, an independent CART was trained using the EST tool wagon [29].

Since the anomaly detection is performed on the word level, the phone-level features were converted to word-level ones using the following statistics (denoted as "stats" in Table 2), calculated for each word: the mean (or median, respectively), minimum, and maximum phone-level feature value, and the range of the feature values. In order to emphasize outlying feature values within a word, a histogram of the values of each phone-level feature was also used, similarly as in [17]. The scheme of feature extraction and collection is illustrated in Fig. 1. The total number of all features used in our experiments was 359.

[Figure 1: Scheme of feature extraction and collection. Utterances are split into phrases, words, and phones; phone-level features (basic, acoustic, spectral, other) are expressed as deviations from CART-predicted values and collected into word-level statistics and histograms, complemented by word-level phonetic and positional features.]
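To make the aggregation step concrete, a small sketch of turning per-phone z-scores into the word-level "stats" and histogram features is given below; the statistics chosen (mean, median, min, max, range) follow the text, while the histogram bin edges are an illustrative assumption:

    import numpy as np

    def word_level_features(phone_zscores):
        """Aggregate one word's per-phone z-scores for a single feature."""
        z = np.asarray(phone_zscores, dtype=float)
        stats = np.array([z.mean(), np.median(z), z.min(), z.max(),
                          z.max() - z.min()])
        # Histogram of z-scores emphasizes outlying phones within the word;
        # these bin edges are an illustrative choice, not from the paper.
        edges = np.array([-10.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 10.0])
        hist, _ = np.histogram(np.clip(z, -9.99, 9.99), bins=edges)
        return np.concatenate([stats, hist / len(z)])

    # Example: a 5-phone word whose third phone deviates strongly.
    print(word_level_features([0.1, -0.4, 4.2, 0.3, -0.2]))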
5. Experiments

5.1. Model training and selection

For the purposes of anomaly detection model training and selection, the normal examples were divided into training and validation examples using 10-fold cross validation, with 60% of the normal examples used for training and 20% used for validation in each cross-validation fold. The remaining 20% of the normal examples were held out for the final evaluation of the model. As for the anomalous examples, 50% of them were used in cross validation when selecting the best model parameters, and the remaining 50% were used for the final evaluation.

The standard training procedure was utilized to train the models described in Sec. 2. The models' parameters were optimized during model selection, i.e. by selecting the values that yielded the best results (in terms of F1 score, see Sec. 6.1) in a grid search over relevant values of the parameters with 10-fold cross validation. For both UGD and MGD, the parameter ε was searched in the interval [10^-100, 0.1]. For OCSVM, the parameter ν was searched in the range [0.005, 0.3], and the kernel parameter γ in the recommended exponentially growing interval [2^-15, 2^4] [30]. Moreover, we experimented with various feature set combinations, and the selection of the best feature combination for each model was also a part of the model selection phase. The scikit-learn toolkit [31] was employed in our experiments.

The optimal parameters of each detection model and the best feature combination found during the model selection phase are shown in Table 2 as UGD∗, MGD∗, and OCSVM∗.

Table 2: Summary of models used for final evaluation. The Parameters column specifies the optimal values as found by cross validation ("—" means that no optimal values were found). The number in parentheses denotes the number of features.

  Model ID    Parameters               Features
  UGD∗        ε = 0.005                duration: stats + histogram + zscore, acoust. likelihood: stats + histogram, energy: zscore (28)
  MGD∗        ε = 2.5e-14              duration: stats + histogram + zscore, acoust. likelihood: stats + histogram, energy/duration: stats (28)
  OCSVM∗      ν = 0.005, γ = 0.03125   duration: stats + histogram + zscore, acoust. likelihood: stats + histogram, energy/duration: stats (28)
  UGDdim      ε = 5.0e-24              PCA (20)
  MGDdim      ε = 5.0e-24              PCA (20)
  OCSVMdim    ν = 0.125, γ = 0.125     ICA (30)
  UGD0        ε = 2.0e-7               duration: stats, acoust. likelihood: stats (8)
  MGD0        ε = 7.9e-4               duration: stats, acoust. likelihood: stats (8)
  OCSVM0      ν = 0.05, γ = 0.25       duration: stats, acoust. likelihood: stats (8)
  UGDall      —                        all features (359)
  MGDall      —                        all features (359)
  OCSVMall    ν = 0.075, γ = 2.4e-4    all features (359)
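A sketch of the threshold selection via grid search, assuming labeled validation folds with anomalous words mixed in; f1_score is scikit-learn's implementation of Eq. (9), and the candidate ε grid is illustrative:

    import numpy as np
    from sklearn.metrics import f1_score

    def select_epsilon(p_val, y_val, candidates):
        """Pick the density threshold that maximizes F1 on validation data.

        p_val: densities of validation examples under the fitted model,
        y_val: 1 for misannotated (anomalous) words, 0 for correct ones.
        """
        best_eps, best_f1 = None, -1.0
        for eps in candidates:
            y_pred = (p_val < eps).astype(int)  # below threshold -> anomalous
            f1 = f1_score(y_val, y_pred)
            if f1 > best_f1:
                best_eps, best_f1 = eps, f1
        return best_eps, best_f1

    # Illustrative grid spanning the searched interval [1e-100, 0.1].
    eps_grid = np.logspace(-100, -1, 100)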
5.2. Dimensionality reduction

In order to select the best feature combination automatically, we also experimented with various dimensionality reduction techniques: principal component analysis (PCA), independent component analysis (ICA), and feature agglomeration (FAG). PCA decomposes a feature set into a set of successive orthogonal components that explain a maximum amount of the variance. ICA attempts to separate a feature set into independent additive subcomponents. Feature agglomeration applies hierarchical clustering to group together features that behave similarly. The number of features was treated as another parameter of the model selection process; hence, the optimal number of features for each reduction technique and each detection model was determined during the cross validation. The comparison of results on the validation set shown in Fig. 2 indicates that the reduction techniques, with some exceptions, behave similarly. CVF stands for "cross-validation features", i.e. the models UGD∗, MGD∗, and OCSVM∗ that use feature combinations selected by cross validation. The best combination of the model and the reduction technique is shown in Table 2 as UGDdim, MGDdim, and OCSVMdim.

[Figure 2: Comparison of dimensionality reduction techniques on the validation set; F1 [%] of UGD, MGD, and OCSVM for PCA, ICA, FAG, and CVF.]
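A sketch of the three reduction techniques with scikit-learn; the numbers of components (20 and 30 below, as reported in Table 2) are the parameters that were themselves cross-validated:

    from sklearn.decomposition import PCA, FastICA
    from sklearn.cluster import FeatureAgglomeration

    # Each reducer is fitted on the normal training examples only and
    # then applied to validation/test examples.
    reducers = {
        "PCA": PCA(n_components=20),
        "ICA": FastICA(n_components=30, max_iter=1000),
        "FAG": FeatureAgglomeration(n_clusters=20),
    }
    # X_reduced = reducers["PCA"].fit(X_train).transform(X_val)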
6. Evaluation

For the final evaluation on the test data, we used the models specified in Table 2. UGD0, MGD0, and OCSVM0 denote models trained on basic features only, and UGDall, MGDall, and OCSVMall denote models trained on all features. The performance of all these models was then evaluated using the held-out test data.

6.1. Detection metrics

Due to the unbalanced number of normal and anomalous examples, the F1 score is often used to evaluate the performance of an anomaly detection system:

    F_1 = \frac{2PR}{P + R}, \qquad P = \frac{tp}{pp}, \qquad R = \frac{tp}{ap},    (9)

where P is precision, the ability of a detector not to detect as misannotated a word that is annotated correctly; R is recall, the ability of a detector to detect all misannotated words; tp means "true positives" (i.e., the number of words correctly detected as misannotated); pp stands for "predicted positives" (i.e., the number of all words detected as misannotated); and ap means "actual positives" (i.e., the number of actually misannotated words).

The F1 score was also used to optimize all parameters and feature set combinations, as described in Sec. 5.1.
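As a quick worked check of Eq. (9), with hypothetical counts: suppose a detector flags 10 words as misannotated, of which 8 really are, and the test set contains 12 misannotated words in total. Then

    P = \frac{8}{10} = 0.800, \qquad R = \frac{8}{12} \approx 0.667, \qquad F_1 = \frac{2 \cdot 0.800 \cdot 0.667}{0.800 + 0.667} \approx 0.727.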
6.2. Statistical significance

Since the data set used is relatively small, statistical significance tests were performed to compare the results of the proposed detection models. We applied McNemar's test [32], in which two detectors A and B are tested on a test set, and for all test examples the following four numbers are recorded: the number of examples detected incorrectly by both A and B (n00), the number of examples detected incorrectly by A but correctly by B (n01), the number of examples detected incorrectly by B but correctly by A (n10), and the number of examples detected correctly by both A and B (n11). Under the null hypothesis, the two detectors should have the same error rate, i.e. n01 = n10. McNemar's test is based on a χ² test for goodness of fit that compares the distribution of counts expected under the null hypothesis to the observed counts:

    \chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}},    (10)

where a "continuity correction" term (of -1 in the numerator) is incorporated to account for the fact that the statistic is discrete while the χ² distribution with 1 degree of freedom is continuous. If the null hypothesis is correct, the probability that this quantity is greater than χ²(1, 0.95) = 3.841 is less than 0.05 (the significance level α = 0.05). So we may reject the null hypothesis in favor of the hypothesis that the two detectors have different performance when the computed χ² statistic exceeds 3.841.
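A sketch of this test, assuming per-example correctness vectors for the two detectors; only the disagreement counts n01 and n10 enter the statistic:

    import numpy as np

    def mcnemar_chi2(correct_a, correct_b):
        """McNemar's test statistic with continuity correction (Eq. 10)."""
        a = np.asarray(correct_a, dtype=bool)
        b = np.asarray(correct_b, dtype=bool)
        n01 = int(np.sum(~a & b))  # A wrong, B right
        n10 = int(np.sum(a & ~b))  # A right, B wrong
        if n01 + n10 == 0:
            return 0.0  # no disagreements, no evidence of a difference
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

    # The detectors differ significantly at alpha = 0.05 if this exceeds 3.841.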
6.3. Results and discussion

The evaluation of the proposed anomaly detection models is given in Table 3. Model IDs written in bold denote that the corresponding models performed better than the other ones according to McNemar's statistical significance test (α = 0.05). The differences among the models written in bold are not statistically significant. For comparison, random detection considering the fraction of misannotated words in our test data set is also shown as RAND.

Table 3: Final evaluation of the proposed anomaly detection models on test data.

  Model ID    P [%]    R [%]     F1 [%]
  UGD∗        84.83    89.78     87.23
  MGD∗        87.32    90.51     88.89
  OCSVM∗      85.71    87.59     86.64
  UGDdim      88.15    86.86     87.50
  MGDdim      88.15    86.86     87.50
  OCSVMdim    85.40    85.40     85.40
  UGD0        84.26    66.42     74.29
  MGD0        76.03    81.02     78.45
  OCSVM0      82.95    78.10     80.45
  UGDall      37.85    100.00    54.91
  MGDall      47.06    99.27     63.85
  OCSVMall    87.97    85.40     86.67
  RAND        23.70    25.50     24.60
As can be seen, the dimensionality reduction techniques achieve results similar to the careful feature combinations selected by cross validation. Similarly, OCSVM achieves statistically comparable results when all features are used (in contrast to UGD and MGD, which fail with so many features). As for the absolute comparison of the individual anomaly detection techniques, all three performed comparably well, with the differences not being statistically significant.

As for the feature sets, the feature combinations selected by cross validation confirm the importance of the features emphasizing anomalies (expressed both as z-score deviations from CART-predicted values and as histograms). On the other hand, spectral, phonetic, and positional features seem less important.

Comparing the proposed anomaly-based annotation error detection with the classification-based detection of [17], similarly good results were achieved. This is a good finding because, unlike classification-based detection, we do not need any misannotated words to train an anomaly-based detector; thus, training data collection should be easier.
7. Conclusions

We experimented with three anomaly detection techniques to detect word-level annotation errors in a read-speech corpus used for TTS. We showed that all three methods, after being carefully configured by a grid search and cross-validation process, performed similarly well, with an F1 score of almost 89%. This result suggests that anomaly detection could help in detecting annotation errors in read-speech corpora for TTS synthesis. No misannotated words need to be collected, as the anomaly detectors are trained only on correctly annotated words.

In our future work we plan to carry out an error analysis to spot any potential systematic trend in the misdetected words. As the effective features seem to be to some extent voice-independent, we also plan to find out how the described anomaly detection copes with data from more speakers and/or more languages. We would also like to find out how sensitive the proposed detection method is to spontaneous speech data.

Using the proposed anomaly detection, we believe the annotation process accompanying the development of a new TTS voice could be reduced to only the correction of words detected as misannotated. Lessons learned from the anomaly detection might also be used for automatic error detection in synthetic speech [33-35].
8. References
[1] S. Cox, R. Brady, and P. Jackson, "Techniques for accurate automatic annotation of speech waveforms," in International Conference on Spoken Language Processing, Sydney, Australia, 1998.
[2] H. Meinedo and J. Neto, "Automatic speech annotation and transcription in a broadcast news task," in ISCA Workshop on Multilingual Spoken Document Retrieval, Hong Kong, 2003, pp. 95-100.
[3] J. Adell, P. D. Agüero, and A. Bonafonte, "Database pruning for unsupervised building of text-to-speech voices," in IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006, pp. 889-892.
[4] T. J. Hazen, "Automatic alignment and error correction of human generated transcripts for long speech recordings," in INTERSPEECH, Pittsburgh, USA, 2006, pp. 1606-1609.
[5] R. Tachibana, T. Nagano, G. Kurata, M. Nishimura, and N. Babaguchi, "Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone," in INTERSPEECH, Antwerp, Belgium, 2007, pp. 1917-1920.
[6] M. P. Aylett, S. King, and J. Yamagishi, "Speech synthesis without a phone inventory," in INTERSPEECH, Brighton, Great Britain, 2009, pp. 2087-2090.
[7] O. Boeffard, L. Charonnat, S. Le Maguer, D. Lolive, and G. Vidal, "Towards fully automatic annotation of audiobooks for TTS," in Language Resources and Evaluation Conference, Istanbul, Turkey, 2012, pp. 975-980.
[8] J. Matoušek and J. Romportl, "Recording and annotation of speech corpus for Czech unit selection speech synthesis," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Berlin: Springer, 2007, vol. 4629, pp. 326-333.
[9] J. Matoušek, D. Tihelka, and L. Šmídl, "On the impact of annotation errors on unit-selection speech synthesis," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Springer, 2012, vol. 7499, pp. 456-463.
[10] D. Toledano, L. Gomez, and L. Grande, "Automatic phonetic segmentation," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617-625, 2003.
[11] J. Matoušek, D. Tihelka, and J. Psutka, "Experiments with automatic segmentation for Czech speech synthesis," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2003, vol. 2807, pp. 287-294.
[12] J. Kominek and A. W. Black, "A family-of-models approach to HMM-based segmentation for unit selection speech synthesis," in INTERSPEECH, Jeju Island, Korea, 2004, pp. 1385-1388.
[13] S. S. Park and N. S. Kim, "On using multiple models for automatic speech segmentation," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2202-2212, 2007.
[14] C.-Y. Lin and R. Jang, "Automatic phonetic segmentation by score predictive model for the corpora of Mandarin singing voices," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2151-2159, 2007.
[15] J. Matoušek and J. Romportl, "Automatic pitch-synchronous phonetic segmentation," in INTERSPEECH, Brisbane, Australia, 2008.
[16] A. Rendel, E. Sorin, R. Hoory, and A. Breen, "Towards automatic phonetic segmentation for TTS," in IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, pp. 4533-4536.
[17] J. Matoušek and D. Tihelka, "Annotation errors detection in TTS corpora," in INTERSPEECH, Lyon, France, 2013.
[18] S. Wei, G. Hu, Y. Hu, and R.-H. Wang, "A new method for mispronunciation detection using Support Vector Machine based on Pronunciation Space Models," Speech Communication, vol. 51, no. 10, pp. 896-905, 2009.
[19] R. Donovan and P. Woodland, "A hidden Markov-model-based trainable speech synthesizer," Computer Speech & Language, vol. 13, no. 3, pp. 223-241, 1999.
[20] J. Kominek and A. W. Black, "Impact of durational outlier removal from unit selection catalogs," in Speech Synthesis Workshop, Pittsburgh, USA, 2004, pp. 155-160.
[21] Y.-J. Kim, A. K. Syrdal, and M. Jilka, "Improving TTS by higher agreement between predicted versus observed pronunciations," in Speech Synthesis Workshop, 2004, pp. 127-132.
[22] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126, 2004.
[23] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, no. 3, pp. 1-58, 2009.
[24] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.
[25] J. Matoušek, D. Tihelka, and J. Romportl, "Building of a speech corpus optimised for unit selection TTS synthesis," in Language Resources and Evaluation Conference, Marrakech, Morocco, 2008.
[26] D. Tihelka, J. Kala, and J. Matoušek, "Enhancements of Viterbi search for fast unit selection synthesis," in INTERSPEECH, Makuhari, Japan, 2010, pp. 174-177.
[27] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4). Cambridge, U.K.: Cambridge University, 2006.
[28] J. Romportl and J. Kala, "Prosody modelling in Czech text-to-speech synthesis," in Speech Synthesis Workshop, Bonn, Germany, 2007, pp. 200-205.
[29] P. Taylor, R. Caley, A. W. Black, and S. King, "Edinburgh Speech Tools Library: System Documentation," http://www.cstr.ed.ac.uk/projects/speech_tools/manual-1.2.0, 1999.
[30] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[32] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, pp. 1895-1923, 1998.
[33] H. Lu, S. Wei, L. Dai, and R.-H. Wang, "Automatic error detection for unit selection speech synthesis using log likelihood ratio based SVM classifier," in INTERSPEECH, Makuhari, Japan, 2010, pp. 162-165.
[34] W. Y. Wang and K. Georgila, "Automatic detection of unnatural word-level segments in unit-selection speech synthesis," in IEEE Automatic Speech Recognition and Understanding Workshop, Hawaii, USA, 2011, pp. 289-294.
[35] J. Vít and J. Matoušek, "Concatenation artifact detection trained from listeners evaluations," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2013, vol. 8082, pp. 169-176.