IberSPEECH 2018 Proceedings
AND PROCEEDINGS
IberSPEECH2018
BARCELONA, NOVEMBER 21-23
Wed 21
  08:00        Registration (starts at 08:00)
  09:00-09:20  Opening
  09:20-10:40  Oral 1: Speaker recognition
  10:40-11:00  Coffee break
  11:00-12:00  Keynote: Tanja Schultz
  12:00-13:30  Posters: Topics on Speech Technologies
  15:00-16:40  Oral 2: ASR & Speech Applications
  20:00        Welcome Reception

Thu 22
  09:00-10:40  Oral 4: Synthesis, Production & Analysis
  10:40-11:00  Coffee break
  11:00-12:00  Keynote: Rob Clark
  12:00-13:30  Special Session: Projects, Demo and PhD thesis
  15:00-16:40  Albayzin Evaluations
  20:30        Gala Dinner

Fri 23
  09:00-10:40  Oral 5: Text & NLP Applications
  10:40-11:00  Coffee break
  11:00-12:00  Keynote: Lluis Marquez
  12:00-13:00  Round Table
  13:00-13:30  Closing
IberSPEECH2018
NOVEMBER 21-23
BARCELONA
Edited by Antonio Bonafonte, Jordi Luque and Francesc Alías Pujol
Committees
  Organizing Committee
  Program Committee
Organizing Institutions
Awards
Venue
Papers
Welcome Message
tations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses,
discussion panels, a round table, and awards to the best thesis and papers.
The core of the scientific program of IberSPEECH2018 includes a total of 37 full regular paper
contributions, which will be presented across 5 oral sessions and 1 poster session. To
ensure the quality of all the contributions, each submitted paper was reviewed by three
members of the scientific review committee. All the papers in the conference will be accessible
through the International Speech Communication Association (ISCA) Online Archive. Paper
selection was based on the scores and comments provided by the scientific review committee,
which includes over 86 researchers from different institutions (mainly from Spain and
Portugal, but also from France, Germany, Brazil, Slovakia, Ireland, Greece, Hungary, Slovenia,
Austria and the United Kingdom).
Furthermore, an extended version of selected papers will be published as a special issue of
the journal Applied Sciences, “IberSPEECH 2018: Speech and Language Technologies for
Iberian Languages”, published by MDPI with full open access. In addition to regular paper
sessions, the IberSPEECH2018 scientific program features the following activities: the
ALBAYZIN evaluation challenge session, a special session including the presentation of demos,
research projects and recent PhD theses, a round table and three keynote lectures.
Following the success of previous ALBAYZIN technology evaluations since 2006, this year's
ALBAYZIN evaluations focus on multimedia analysis of TV broadcast content.
Under the framework of a newly created Cátedra RTVE at Universidad de Zaragoza, we intro-
duce and report on the results of the IberSPEECH-RTVE 2018 Challenge. The Corporación de
Radiotelevisión Española (RTVE) has provided participants with an annotated TV broadcast
database and the necessary tools for the evaluations, promoting the fair and transparent
comparison of technology in different fields related to speech and language technology. It
comprises four different challenge evaluations: Speech to Text Challenge (S2TC), Speaker Di-
arization Challenge (SDC) and Multimodal Diarization Challenge (MDC), organized by RTVE
and Universidad de Zaragoza; and the Search on Speech Challenge (SoSC) jointly organized
by Universidad San Pablo-CEU and AuDIaS from Universidad Autónoma de Madrid with the
support of the ALBAYZIN Committee. Overall, 7 teams participated in the S2TC challenge,
8 teams in the SDC, 3 teams in the MDC, and 3 more teams in the SoSC challenge, which
resulted in 21 system description papers. Additionally, 11 special session papers
are also included in the conference program. These were intended to describe either
progress in current or recent research and development projects, demonstration systems, or
PhD thesis extended abstracts competing for the PhD Award. Furthermore, IberSPEECH2018
features 3 remarkable keynote speakers: Prof. Tanja Schultz (University of Bremen, Germany
and Institute of Carnegie Mellon, Pittsburgh, PA USA), Dr. Rob Clark (Google, London, UK)
and Dr. Lluís Màrquez (Amazon, Barcelona, Spain), whom we would like to thank
for their extremely valuable participation.
Moreover, a round table with recognized experts will discuss the role of research and
innovation in both academia and industry. Such a symbiosis creates market
power by exploring and developing new categories which will eventually become the next
blue oceans of our society. However, converting such activities into real businesses, making
strong bones and figuring out new products that have a major impact in the real world is
a non-trivial task. We expect the round table to be an opportunity for both worlds to find
and explore synergies and collaboration.
The social program of IberSPEECH2018 sets sail with the welcome reception at the Escola
d’Enginyeria Barcelona Est (EEBE), at the recently created UPC Diagonal Besòs Campus, next
to the Telefonica Tower. EEBE aims to become a top-quality academic centre in the field of
engineering for the 21st-century industry that is capable of acting as an agent of transfor-
mation at a local and international level. The EEBE was born from the Barcelona College of
Industrial Engineering (EUETIB) and from part of the teaching and research activity in chem-
ical and materials engineering hitherto carried out at the Barcelona School of Industrial En-
gineering (ETSEIB). The gala dinner will be held at Restaurant Marítim, next to the legendary
Barcelona Reial Club Marítim, designed by Lázaro Rosa-Violán from Contemporain Studio,
to provide the aesthetics and flavours of different Mediterranean paradises.
Finally, we would like to thank all those whose effort made this conference possible, including
the members of the organizing committee, the local organizing committee, the ALBAYZIN
committee, the scientific reviewer committee, the authors, the conference attendees, the
supporting institutions, and so many people who gave their best to achieve a successful
conference.
Barcelona, November 2018
Jordi Luque, General Chair
Organizing Committee
General Chair:
Jordi Luque, Telefónica Research, Spain
General Co-Chairs:
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Francesc Alías Pujol, La Salle – Universitat Ramon Llull, Spain
António Teixeira, Universidade de Aveiro, Portugal
Publication Chairs:
Francesc Alías Pujol, La Salle – Universitat Ramon Llull, Spain
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Special Session and Awards Chair:
Ascensión Gallardo-Antolín, Universidad Carlos III, Spain
Evaluations Chairs:
Alfonso Ortega, Universidad de Zaragoza, Spain
Eduardo Lleida, Universidad de Zaragoza, Spain
Luis Javier Rodríguez Fuentes, Universidad del País Vasco, Spain
Local Committee:
Francesc Alías Pujol, La Salle – Universitat Ramon Llull, Spain
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Jordi Luque, Telefónica Research, Spain
Jordi Pons, Universitat Pompeu Fabra, Spain
Bardia Rafieian, Universitat Politècnica de Catalunya, Spain
Marta Ruiz Costa-Jussà, Universitat Politècnica de Catalunya, Spain
Carlos Segura, Telefónica Research, Spain
Joan Serrà, Telefónica Research, Spain
Scientific Review Committee
Jon Ander Gómez, Universitat Politècnica de València, Spain
Emilio Granell, Universitat Politècnica de València, Spain
Inma Hernaez, University of the Basque Country (UPV/EHU), Spain
Javier Hernando, Universitat Politècnica de Catalunya, Spain
Lluís-F. Hurtado, Universitat Politècnica de València, Spain
Oliver Jokisch, Leipzig University of Telecommunications (HfTL), Germany
Oscar Koller, Microsoft Germany GmbH, Germany
Eduardo Lleida, University of Zaragoza, Spain
José David Lopes, Heriot-Watt University, UK
Paula López Otero, Universidade da Coruña, Spain
Jordi Luque, Telefónica Research, Spain
Carlos David Martínez Hinarejos, Universitat Politècnica de València, Spain
Helena Moniz, INESC/FLUL, Portugal
Juan Montero, Universidad Politécnica de Madrid, Spain
Nicolás Morales, Nuance Communications GmbH, Germany
Climent Nadeu, Universitat Politècnica de Catalunya, Spain
Juan L. Navarro-Mesa, Universidad de Las Palmas de Gran Canaria, Spain
Eva Navas, University of the Basque Country, Spain
Géza Németh, Budapest University of Technology & Economics, Hungary
Nelson Neto, Universidade Federal do Pará, Brazil
Hermann Ney, RWTH Aachen University, Germany
Alfonso Ortega, University of Zaragoza, Spain
Yannis Pantazis, Foundation for Research and Technology – Hellas, Greece
Carmen Peláez-Moreno, University Carlos III Madrid, Spain
Thomas Pellegrini, Université de Toulouse; IRIT, France
Mikel Penagarikano, University of the Basque Country, Spain
Fernando Perdigao, Institute of Telecommunications (IT), Lisbon, Portugal
José L. Pérez-Córdoba, University of Granada, Spain
Ferran Pla, Universitat Politècnica de València, Spain
Jiri Pribil, Slovak Academy of Sciences, Slovakia
Jorge Proenca, IT – Coimbra, Portugal
Michael Pucher, Acoustics Research Institute, Austria
Paulo Quaresma, Universidade de Evora, Portugal
Ganna Raboshchuk, ELSA Corp., Portugal
Sam Ribeiro, The University of Edinburgh, UK
Eduardo Rodriguez Banga, University of Vigo, Spain
Marta Ruiz Costa-Jussà, Universitat Politècnica de Catalunya, Spain
Luis Javier Rodríguez-Fuentes, Univ. of the Basque Country UPV/EHU, Spain
Rubén San-Segundo, Universidad Politécnica de Madrid, Spain
Jon Sánchez, Aholab – EHU/UPV, Spain
Joan Andreu Sánchez, Universitat Politècnica de València, Spain
Emilio Sanchis, Universitat Politècnica de València, Spain
Diana Santos, University of Oslo, Norway
Ibon Saratxaga, University of the Basque Country, Spain
Encarna Segarra, Universitat Politècnica de València, Spain
Carlos Segura Perales, Telefónica Research, Spain
Joan Serrà, Telefónica Research, Spain
Alberto Simões, 2Ai Lab – IPCA, Portugal
Rubén Solera-Ureña, INESC-ID Lisboa, Portugal
António Teixeira, University of Aveiro, Portugal
Javier Tejedor, Universidad CEU San Pablo, Spain
Doroteo Toledano, Universidad Autónoma de Madrid, Spain
Isabel Trancoso, INESC ID Lisboa / IST, Portugal
Cassia Valentini-Botinhao, The University of Edinburgh, UK
Amparo Varona, University of the Basque Country, Spain
Andrej Zgank, University of Maribor, Slovenia
Catalin Zorila, Toshiba Cambridge Research Laboratory, UK
Organizing Institutions
IberSPEECH2018 has been partially funded by the project Red Temática en Tecnologías del Habla 2017
(TEC2017-90829-REDT) funded by Ministerio de Ciencia, Innovación y Universidades.
Awards
All regular papers are candidates for this award. The award, given based on the review reports
and the presentation at the conference, grants the authors the publication of an extended
version of their work within the Special Issue of the Applied Sciences journal (MDPI) entitled
“IberSPEECH 2018: Speech and Language Technologies for Iberian Languages”.
Papers submitted to Albayzin evaluation tasks are candidates for these awards. The awards
will be given to the winners of the Albayzin evaluation challenges, in accordance with the
evaluation plan and rules defined for each task.
Papers submitted to the PhD Thesis special session are candidates for this award. The award
is given based on the decision of a committee formed by the General Chair, the
Technical Program Chair and the Special Session and Awards Chair. The criteria include
the quality of the document, the impact of the thesis and the clarity
of the presentation at the conference.
This is an honorary prize awarded by the Spanish Thematic Network on Speech Technology
(RTTH) that recognizes experienced individuals who have made outstanding contributions
related to speech technology research in Spain.
IberSPEECH2018 Edition
Venue
• Password: *******
The main body of the conference will be held in Torre Telefónica, on floor 0 (see figure) and
in the Auditorium on floor 2, which is reached via the elevators depicted in the figure. The
following diagram outlines the main conference areas and services:
Lunch: 21 and 22 November, 13:30.
D’Ins Escola, Restaurant i Càtering
A gastronomic offer adapted to the occasion and a catering service with an added value
that will enrich it: THE SOCIAL VALUE of the PEOPLE that work in this service. People who
participate in a training and job placement program developed by the Fundación Formació
i Treball.
Address: Carrer de Ramon Llull, 240, 08930, Sant Adrià del Besòs.
Lunch on Wednesday 21st and Thursday 22nd will be served very close to Torre Telefónica
(350 m).
Social Program
Gala dinner: Thursday 22 November, 20:30.
Restaurant Marítim
Address: Moll d’Espanya, 08039, Barcelona
Telephone: +34 93 221 17 75
Web: www.maritimrestaurant.es
The gala dinner will be held downtown, next to the sea (close to the Cristóbal Colón statue).
Invited Speakers
Abstract.- Speech is a complex process emitting a wide range of biosignals, including, but
not limited to, acoustics. These biosignals – stemming from the articulators, the articulator
muscle activities, the neural pathways, and the brain itself – can be used to circumvent limi-
tations of conventional speech processing in particular, and to gain insights into the process
of speech production in general. In my talk I will present ongoing research at the Cognitive
Systems Lab (CSL), where we explore a variety of speech-related muscle and brain activities
based on machine learning methods with the goal of creating biosignal-based speech
processing devices for communication applications in everyday situations and for speech
rehabilitation, as well as gaining a deeper understanding of spoken communication. Several
applications will be described such as Silent Speech Interfaces that rely on articulatory mus-
cle movement captured by electromyography to recognize and synthesize silently produced
speech, Brain-to-text interfaces that recognize continuously spoken speech from brain activ-
ity captured by electrocorticography to transform it into text, and Brain-to-Speech interfaces
that directly synthesize audible speech from brain signals.
Rob Clark received his PhD from the University of Edinburgh in
2003. His primary interest is in producing engaging synthetic
speech. Before joining Google Rob was at the University of Ed-
inburgh for many years involved in both teaching and research
relating to text-to-speech synthesis. Rob was one of the primary
developers and maintainers of the open source Festival text-to-
speech synthesis system. In 2015 he joined Google where he is
working on text-to-speech synthesis and prosody.
Abstract.- This talk addresses the issue of producing appropriate and engaging text-to-
speech. The quality of speech produced by modern text-to-speech systems is sufficiently
intelligible and naturally sounding that we are now seeing it widely used in an increasing
number of real world applications. While the speech generated can sound very natural, we
are still a long way from ensuring it always sounds appropriate and engaging in the context
of a particular discourse or dialogue. We present recent work at Google which begins to
address this issue by looking at techniques to generate variation in prosody and speaking
style using latent representations and discuss the problems and challenges that we face in
going further.
Lluís Màrquez is a Principal Applied Scientist at Amazon Re-
search in Barcelona. From 2013 to 2017 he had a Principal Sci-
entist role at the Arabic Language Technologies group from the
Qatar Computing Research Institute (QCRI), and previously, he
was Associate Professor at the Technical University of Catalonia
(UPC, 2000-2013). He holds a university award-winning PhD in
Computer Science from UPC (1999). His research focuses on nat-
ural language understanding by using statistical machine learn-
ing models. He has 150+ papers in Natural Language Processing
and Machine Learning journals and conferences. He has been
General and Program Co-chair of major conferences in the area
(ACL, EMNLP, EACL, CoNLL, *SEM, EAMT, etc.), and held several
organizational roles in ACL and EMNLP too. He was co-organizer
of various international evaluation tasks at Senseval/SemEval
(2004, 2007, 2010, 2015-2017) and CoNLL shared tasks (2004-
2005, 2008-2009). He was Secretary and President of the ACL
special interest group on Natural Language Learning (SIGNLL) in
the period 2007-2011. More recently, he was President-elect and
President of the European Chapter of the ACL (EACL; 2013-2016)
and member of the ACL Executive Committee (2015-2016). Lluís
Màrquez has been Guest Editor of special issues at Computational
Linguistics, LRE, JNLE, and JAIR in the period 2007-2015.
He has participated in 16 national and EU research projects, and
2 projects with technology transfer to the industry, acting as the
principal site researcher in 10 of them, helping companies embed
AI in their business.
Abstract.- Automatic Question Answering (Q&A), i.e., the task of building computer programs
that are able to answer questions posed in natural language, has a long tradition in the
fields of Natural Language Processing and Information Retrieval. In recent years, Q&A appli-
cations have had a tremendous impact in industry and they are ubiquitous (e.g., embedded
in any of the personal assistants that are in the market, Siri, Alexa, Cortana, Google Assistant,
etc.). At the same time, we have witnessed a renewed interest in the scientific community, as
Q&A has become one of the paradigmatic tasks for assessing the ability of machines to com-
prehend text. A plethora of corpora, resources and systems have blossomed and flooded the
community in the last three years. These systems can do very impressive things, for instance,
finding answers to open ended questions in long text contexts with super-human accuracy,
or answering complex questions about images, by mixing the two modalities. As in many
other fields, these state-of-the-art systems are implemented using machine learning in the
form of neural networks (deep learning). The new AI, of course. But do these Q&A systems
really understand what they read? In simpler words, do they provide the right answers
for the right reasons? Several recent studies have shown that Q&A systems are actually very
brittle. They generalize badly and they fail miserably when presented with simple adversarial
examples. The machine learning algorithms are very good at picking up all the biases and
artefacts in the corpora, and they learn to find answers based on shallow text properties and
pattern matching. But they do not show much understanding or reasoning ability, after all.
Following this serious setback, there is a new push in the community for carefully designing
more complex and bias-free datasets, and more robust and explainable systems. Hopefully,
this will lead to a new generation of smarter and more useful Q&A engines in the near future.
In this talk, I will overview the present and the future of Question Answering by going over
all the aforementioned topics.
Technical Program
Speaker Recognition
Wednesday, 21 November 2018, 09:20 – 10:40
Chair: Xavier Anguera, ELSA Corp.
Towards expressive prosody generation in TTS for reading aloud applications
Monica Dominguez, Alicia Burga, Mireia Farrús, Leo Wanner
Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features
Alejandro Gomez-Alanis, Antonio M. Peinado, José Andrés González López, Angel M. Gomez
The observation likelihood of silence: analysis and prospects for VAD applications
Igor Odriozola, Inma Hernaez, Eva Navas, Luis Serrano, Jon Sanchez
Audio event detection on Google’s Audio Set database: Preliminary results using different types of DNNs
Javier Darna-Sequeiros, Doroteo T. Toledano
ASR & Speech Applications
Wednesday, 21 November 2018, 15:00 – 16:40
Chair: Carmen García Mateo, University of Vigo
Speech & Language Technologies Applied to Health
Wednesday, 21 November 2018, 17:00 – 18:40
Chair: Mireia Farrús, Universitat Pompeu Fabra
Towards an automatic evaluation of the prosody of people with Down syndrome
Mario Corrales-Astorgano, Pastora Martínez-Castilla, David Escudero-Mancebo, Lourdes Aguilar, César González-Ferreras, Valentín Cardeñoso-Payo
Influence of tense, modal and lax phonation on the three-dimensional finite element synthesis of vowel [A]
Marc Freixes, Marc Arnela, Joan Claudi Socoró, Francesc Alías Pujol, Oriol Guasch
Special Session
Thursday, 22 November 2018 12:00 – 13:30
Chair: Ricardo de Córdoba, Universidad Politécnica de Madrid
Silent Speech: Restoring the Power of Speech to People whose Larynx has been Removed
José Andrés González López, Phil D. Green, Damian Murphy, Amelia Gully, James M. Gilbert
PhD Thesis
Bottleneck and Embedding Representation of Speech for DNN-based Language and Speaker Recognition
Alicia Lozano-Diez, Joaquin Gonzalez-Rodriguez, Javier Gonzalez-Dominguez
Deep Learning for i-Vector Speaker and Language Recognition: A Ph.D. Thesis Overview
Omid Ghahabi
Albayzin Evaluation
Thursday, 22 November 2018 15:00 – 16:40
Chair: Alfonso Ortega & Eduardo Lleida, Universidad de Zaragoza
UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge
Miquel Angel India Massana, Itziar Sagastiberri, Ponç Palau, Elisa Sayrol, Josep Ramon Morros, Javier Hernando
In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge
Ignacio Viñals, Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
DNN-based Embeddings for Speaker Diarization in the AuDIaS-UAM System for the Albayzin 2018 IberSPEECH-RTVE Evaluation
Alicia Lozano-Diez, Beltran Labrador, Diego de Benito, Pablo Ramirez, Doroteo T. Toledano
The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization Challenge
Abbas Khosravani, Cornelius Glackin, Nazim Dugan, Gérard Chollet, Nigel Cannings
GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation
Luis J. Rodríguez-Fuentes, Mikel Peñagarikano, Amparo Varona, Germán Bordel
Cenatav Voice Group System for Albayzin 2018 Search on Speech Evaluation
Ana R. Montalvo, Jose M. Ramirez, Alejandro Roble, Jose R. Calvo
MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge
Javier Jorge, Adrià Martínez-Villaronga, Pavel Golik, Adrià Giménez, Joan Albert Silvestre-Cerdà, Patrick Doetsch, Vicent Andreu Císcar, Hermann Ney, Alfons Juan, Albert Sanchis
The Vicomtech-PRHLT Speech Transcription Systems for the IberSPEECH-RTVE 2018 Speech to Text Transcription Challenge
Haritz Arzelus, Aitor Alvarez, Conrad Bernath, Eneritz García, Emilio Granell, Carlos David Martinez Hinarejos
Intelligent Voice ASR system for Iberspeech 2018 Speech to Text Transcription Challenge
Nazim Dugan, Cornelius Glackin, Gérard Chollet, Nigel Cannings
TransDic, a public domain tool for the generation of phonetic dictionaries in standard and dialectal Spanish and Catalan
Juan-María Garrido, Marta Codina, Kimber Fodge
Papers
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{vmingote,amiguel,ortega,lleida}@unizar.es
10.21437/IberSPEECH.2018-1

Abstract
In this paper, we propose a new differentiable neural network alignment mechanism for
text-dependent speaker verification which uses alignment models to produce a supervector
representation of an utterance. Unlike previous works with similar approaches, we do not
extract the embedding of an utterance from the mean reduction of the temporal dimension. Our
system replaces the mean by a phrase alignment model to keep the temporal structure of each
phrase which is relevant in this application since the phonetic information is part of the
identity in the verification task. Moreover, we can apply a convolutional neural network as
front-end, and thanks to the alignment process being differentiable, we can train the whole
network to produce a supervector for each utterance which will be discriminative with respect
to the speaker and the phrase simultaneously. As we show, this choice has the advantage that
the supervector encodes the phrase and speaker information providing good performance in
text-dependent speaker verification tasks. In this work, the process of verification is
performed using a basic similarity metric, due to simplicity, compared to other more elaborate
models that are commonly used. The new model using alignment to produce supervectors was tested
on the RSR2015-Part I database for text-dependent speaker verification, providing competitive
results compared to similar size networks using the mean to extract embeddings.
Index Terms: Text Dependent Speaker verification, HMM Alignment, Deep Neural Networks,
Supervectors

1. Introduction
Recently, techniques based on discriminative deep neural networks (DNN) have achieved a
substantial success in many speaker verification tasks. These techniques follow the philosophy
of the state-of-the-art face verification systems [1][2] where embeddings are usually extracted
by reduction mechanisms and the decision process is based on a similarity metric [3].
Unfortunately, in text-dependent tasks this approach does not work efficiently since the
pronounced phrase is part of the identity information [4][5]. A possible cause of the
imprecision in text-dependent tasks could be derived from using the mean as a representation
of the utterance as we show in the experimental section. To solve this problem, this paper
shows a new architecture which combines a deep neural network with a phrase alignment method
used as a new internal layer to maintain the temporal structure of the utterance. As we will
show, it is a more natural solution for the text-dependent speaker verification, since the
speaker and phrase information can be encoded in the supervector thanks to the neural network
and the specific states of the supervector.

In the context of text-independent speaker verification tasks, the baseline system based on
i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA) [6][7] are still
among the best results of the state-of-the-art. The i-vector extractor represents each
utterance in a low-dimensional subspace called the total variability subspace as a fixed-length
feature vector and the PLDA model produces the verification scores. However, as we previously
mentioned, many improvements on this baseline system have been achieved in recent years by
progressively substituting components of the systems by DNNs, thanks to their larger
expressiveness and the availability of bigger databases. Examples of this are the use of DNN
bottleneck representations as features replacing or combined with spectral parametrization [8],
training DNN acoustic models to use their outputs as posteriors for alignment instead of GMMs
in i-vector extractors [9], or replacing PLDA by a DNN [10]. Other proposals similar to face
verification architectures have been more ambitious and have trained a discriminative DNN for
multiclass classifying and then extract embeddings by reduction mechanisms [11] [12], for
example taking the mean of an intermediate layer named usually bottleneck layer. After that
embedding extraction, the verification score is obtained by a similarity metric such as cosine
similarity [11].

The application of DNNs and the same techniques as in text-independent models for
text-dependent speaker verification tasks has produced mixed results. On the one hand, specific
modifications of the traditional techniques have been shown successful for text-dependent tasks
such as i-vector+PLDA [13], DNNs bottleneck as features for i-vector extractors [14] or
posterior probabilities for i-vector extractors [14][15]. On the other hand, speaker embeddings
obtained directly from a DNN have provided good results in tasks with large amounts of data and
a single phrase [16] but they have not been as effective in tasks with more than one pass
phrase and smaller database sizes [4][5]. The lack of data in this last scenario may lead to
problems with deep architectures due to overfitting of models.

Another reason that we explore in the paper for the lack of effectiveness of these techniques
in general text-dependent tasks is that the phonetic content of the uttered phrase is relevant
for the identification. State-of-art text-independent approaches to obtain speaker embeddings
from an utterance usually reduce temporal information by pooling and by calculating the mean
across frames of the internal representations of the network. This approach may neglect the
order of the phonetic information because in the same phrase the beginning of the sentence may
be totally different from what is said at the end. An example of this is the case when the
system asks the speaker to utter digits in some random order. In that case a mean vector would
fail to capture the combination of phrase and speaker. Therefore one of the objectives of the
paper is to show that it is important to keep this phrase information for the identification
process, not just the information of who is speaking.

In previous works we have developed systems that need to store a model per user which were
adapted from a universal background model and the evaluation of the trial was based on
a likelihood ratio [17][18]. One of the drawbacks of this approach is the need to store a large
amount of data per user and the speed of evaluation of trials, since likelihood expressions
were dependent on the frame length. In this paper, we focus on systems using a vector
representation of a trial or a speaker model. We propose a new approach that includes alignment
as a key component of the mechanism to obtain the vector representation from a deep neural
network. Unlike previous works, we substitute the mean of the internal representations across
time, which is used in other neural network architectures [4][5], by a frame-to-state alignment
to keep the temporal structure of each utterance. We show how the alignment can be applied in
combination with a DNN acting as a front-end to create a supervector for each utterance. As we
will show, the application of both sources of information in the process of defining the
supervector provides better results in the experiments performed on RSR2015 compared to
previous approaches.

This paper is organized as follows. In Section 2 we present our system and especially the
alignment strategy developed. Section 3 presents the experimental data. Section 4 explains the
results achieved. Conclusions are presented in Section 5.

2. Deep neural network based on alignment
In view of the aforementioned imprecisions in the results achieved in previous works for this
task with only DNNs and a basic similarity metric, we decided to apply an alignment mechanism
due to the importance of the phrases and their temporal structure in this kind of task. Since
the same person does not always pronounce a phrase at the same speed or in the same way due to
differences in the phonetic information, it is usual that there exists an articulation and
pronunciation mismatch between two compared speech utterances even from the same person.

In Fig. 1 we show the overall architecture of our system, where the mean reduction to obtain
the vector embedding before the backend is substituted by the alignment process to finally
create a supervector per audio file. This supervector can be seen as a mapping between an
utterance and the state components of the alignment, which allows the phrase information to be
encoded.

2.1. Alignment mechanism
In this work, we select a Hidden Markov Model (HMM) as the alignment technique in all the
experiments, but other possibilities would be Gaussian Mixture Model (GMM) or DNN posteriors.
In text-dependent tasks we know the phrase transcription, which allows us to construct a
specific left-to-right HMM model for each phrase of the data and obtain a Viterbi alignment per
utterance.

One reason to employ a phrase HMM alignment was its simplicity: independent HMM models can be
trained for the different phrases used in our experiments without the need for phonetic
information. Another reason was that, using the decoded sequence provided by the Viterbi
algorithm in a left-to-right architecture, each state of the HMM is guaranteed to correspond to
at least one frame of the utterance, so no state is empty.

The process followed to add this alignment to our system is detailed below. Once the models
for alignment are trained, a sequence of decoded states $\gamma = (q_1, \ldots, q_t)$ is
obtained, where $q_t \in \{1, \ldots, Q\}$ indicates the decoded state at time $t$. Before
adding these vectors to the neural network, they are preprocessed and converted into a matrix
of ones and zeros according to their correspondence with the states, which makes it possible to
use them directly inside the neural network. In this way, we put a one at each state for the
frames that belong to that state. As a result of this process, we have the alignment matrix
$A \in \mathbb{R}^{T \times Q}$ with components $a_{t q_t} = 1$ and $\sum_q a_{tq} = 1$, which
means that only one state is active at a time.

For example, if we train an HMM model with 4 states, obtain a vector $\gamma$ and apply the
previous transformation, the resultant matrix $A$ would be:

$$
\gamma = [1, 1, 1, 2, 2, 3, 3, 4] \rightarrow A =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{1}
$$
For the verification process, once our system is trained, one su-
After this process, as we show in Fig. 2, we added this ma-
pervector is extracted for each enroll and test file, and then a
trix to the network as a matrix multiplication like one layer mo-
cosine metric is applied over them to achieve the verification
re, thanks to the expression as a matrix product it is easy to dif-
scores.
ferentiate and this enables to backpropagate gradients to train
neural network as usual. This matrix multiplication allows as-
signing the corresponding frames to each state resulting in a su-
pervector. Then, the speaker verification is performed with this
supervector. The alignment as a matrix multiplication can be
expressed as a function of the input signal to this layer xct with
dimensions (c × t) and matrix of alignment of each utterance A
with dimensions (t × q):
P
t xct · atq
scq = P (2)
t atq
Figure 2: Process of alignment: the input signal x is multiplied by an alignment matrix A to produce a matrix with vectors s_Q, which are then concatenated to obtain the supervector.
input signal over the acoustic features thus we obtain the traditional supervector. However, we expect to improve on this baseline result, so we propose adding some layers as a front-end before the alignment layer and training them in combination with the alignment mechanism.
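As an illustration of the alignment layer, the following is a minimal NumPy sketch of Eq. (1) and Eq. (2): it builds the one-hot alignment matrix from a decoded state sequence and averages the frames assigned to each state before concatenating them. The function names (`alignment_matrix`, `supervector`) and the toy input are assumptions made for this example; they are not taken from the authors' PyTorch implementation.

```python
import numpy as np

def alignment_matrix(gamma, num_states):
    """Build the T x Q one-hot matrix A of Eq. (1) from 1-based decoded states."""
    gamma = np.asarray(gamma)
    A = np.zeros((gamma.shape[0], num_states))
    A[np.arange(gamma.shape[0]), gamma - 1] = 1.0  # one active state per frame
    return A

def supervector(x, A):
    """Eq. (2): average the frames of each state, then concatenate the states.

    x has shape (C, T) and A has shape (T, Q); the result has shape (C * Q,).
    Assumes no state is empty, as guaranteed by a left-to-right Viterbi path.
    """
    counts = A.sum(axis=0)       # number of frames assigned to each state
    s = (x @ A) / counts         # (C, Q) per-state means
    return s.T.reshape(-1)       # s_1, s_2, ..., s_Q stacked into one vector

# Toy example: 2 channels, 8 frames, 4 states (hypothetical data).
gamma = [1, 1, 1, 2, 2, 3, 3, 4]
A = alignment_matrix(gamma, 4)
x = np.arange(16, dtype=float).reshape(2, 8)
sv = supervector(x, A)
```

Because the operation is a plain matrix product followed by a normalization, an equivalent layer in an automatic differentiation framework would let gradients flow to any front-end placed before it.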
For deep speaker verification, some simple architectures with only dense layers [4] have been proposed. More recently, deeper networks such as Residual CNN Networks [5] have been tried, but in the text-dependent task they have not achieved results as good as the previous simple approaches.

In our network we propose a straightforward architecture with only a few layers, which uses 1-dimension convolution (1D convolution) layers instead of the dense layers or 2D convolution layers used in other works. Our proposal is to operate in the temporal dimension to add context information to the process, while at the same time the channels are combined at each layer. The context information which is added depends on the size of the kernel used in the convolution layer.

To use this type of layer, it is convenient that the input signals have the same size so they can be concatenated and passed to the network. For this reason, we apply a transformation that interpolates or zero-pads the input signals so that all of them have the same dimensions.

The operation of the 1D convolution layers is depicted in Fig. 3. The signal used as layer input and its context (the previous and the subsequent frames) are multiplied frame by frame with the corresponding weights. The result of this operation for each frame is linearly combined to create the output signal.

Figure 3: Operation with 1D convolution layers: (a) general pipeline of this operation; (b) example of how k context frames from the input are multiplied by the weight matrix W, where the output is equivalent to a linear combination of convolutions.

3. Experimental Data

In all the experiments in this paper, we used the RSR2015 text-dependent speaker verification dataset [19]. This dataset consists of recordings from 157 male and 143 female speakers. There are 9 sessions for each speaker pronouncing 30 different phrases. Furthermore, this data is divided into three speaker subsets: background (bkg), development (dev) and evaluation (eval). We develop our experiments on Part I of this data set and we employ the bkg and dev data (194 speakers, 94 female/100 male) for training. The evaluation part is used for enrollment and trial evaluation.

4. Results

In our experiments, we do not need the phrase transcription to obtain the corresponding alignment, because one phrase-dependent HMM model has been trained with the background partition using a left-to-right model of 40 states for each phrase. With these models we can extract statistics from each utterance of the database and use this alignment information inside our DNN architecture. As input to the DNN, we employ 20-dimensional Mel-Frequency Cepstral Coefficients (MFCC) with their first and second derivatives as features, obtaining a final input dimension of 60. On these input features we apply a data augmentation method called Random Erasing [20], which helps us to avoid overfitting in our models due to the lack of data in this database.

The DNN architecture consists of a front-end part, in which several different configurations of layers have been tested as we will detail in the experiments, and a second part, which is an alignment based on HMM models. Finally, we have extracted supervectors as a combination of front-end and alignment with a flatten layer, and with them we have obtained speaker verification scores by using a cosine similarity metric without any normalization technique.

A set of experiments was performed using PyTorch [21] to evaluate our system. We compare a front-end with mean reduction, similar in philosophy to [4][5], against either the feature input directly or a front-end, both followed by the HMM alignment. For the front-end, we implemented 3 different layer configurations: one convolutional layer with a kernel of dimension 1, equivalent to a dense layer but keeping the temporal structure and without adding context information; one convolutional layer with a kernel of dimension 3; and three convolutional layers with a kernel of dimension 3.

In Table 1 we show equal error rate (EER) results with the different architectures trained on the background subset for female, male and both partitions together. We have found that, as we expected, the first approach with the mean reduction mechanism for extracting embeddings does not perform well for this text-dependent speaker verification task. It seems that this type of embedding does not correctly represent the information needed to discriminate the correct speaker and the correct phrase simultaneously. Furthermore, we show how replacing the typical mean reduction with a new alignment layer inside the DNN achieves a relative improvement of 91.62 % in terms of the EER %.

Nevertheless, these EER results were still quite high, so we decided to try to improve them by training with the background and development subsets together. In Table 2, we can see that if we use more data for training our systems, we achieve better performance, especially in deep architectures with more than
Table 1: Experimental results on RSR2015 part I [19] eval subset, where EER % is shown. These results were obtained by training only with bkg subset.

Figure 4: Results of EER % varying train percentage where standard deviation is shown only for both gender independent results.

For illustrative purposes, we also represent our high-dimensional supervectors in a two-dimensional space using t-SNE [22], which preserves distances in a low-dimensional space. In Fig. 5(a), we show this representation for the architecture which uses the mean to extract the embeddings, while in Fig. 5(b) we represent the supervectors of our best system. As we can see, in the second system the representation is able to cluster the examples from the same person, whereas the first method is not able to cluster together examples from the same person. On the other hand, in both representations data are self-organized to show on one side examples from female identities

5. Conclusions

In this paper we present a new method to add a new layer as an alignment inside DNN architectures for encoding meaningful information from each utterance in a supervector, which allows us to conserve the relevant information that we use to verify the speaker identity and the correspondence with the correct phrase. We have evaluated the models on the text-dependent speaker verification database RSR2015 part I. Results confirm that the alignment as a layer within the DNN architecture is a promising line of research, since we have obtained competitive results with a straightforward and simple alignment technique which has a low computational cost; better results may therefore be achievable with other, more powerful techniques.

6. Acknowledgements

This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the project TIN2017-85854-C4-1-R, by Gobierno de Aragón/FEDER (research group T36 17R) and by Nuance Communications, Inc. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
7. References

[1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708, 2014.
[2] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 815–823.
[3] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Asian conference on computer vision. Springer, 2010, pp. 709–720.
[4] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.specom.2015.07.003
[5] E. Malykh, S. Novoselov, and O. Kudashev, “On residual cnn in text-dependent speaker verification task,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10458 LNAI, pp. 593–601, 2017.
[6] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[8] A. Lozano-Diez, A. Silnova, P. Matejka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey, vol. 2016, 2016, pp. 352–357.
[9] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.
[10] O. Ghahabi and J. Hernando, “Deep belief networks for i-vector based speaker recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1700–1704.
[11] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Proc. Interspeech, 2017, pp. 1517–1521.
[12] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017, pp. 999–1003.
[13] H. Zeinali, H. Sameti, and L. Burget, “Hmm-based phrase-independent i-vector extractor for text-dependent speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421–1435, 2017.
[14] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot, “Deep neural networks and hidden markov models in i-vector-based text-dependent speaker verification,” in Odyssey-The Speaker and Language Recognition Workshop, 2016, pp. 24–30.
[15] S. Dey, S. Madikeri, M. Ferras, and P. Motlicek, “Deep neural network based posteriors for text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5050–5054.
[16] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2016-May, no. Section 3, pp. 5115–5119, 2016.
[17] A. Miguel, J. Villalba, A. Ortega, E. Lleida, and C. Vaquero, “Factor Analysis with Sampling Methods for Text Dependent Speaker Recognition,” Proceedings of the 15th Annual Conference of the International Speech Communication Association, Interspeech 2014, no. September, pp. 1342–1346, 2014.
[18] A. Miguel, J. Llombart, A. Ortega, and E. Lleida, “Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition,” pp. 2819–2823, 2017.
[19] A. Larcher, K. A. Lee, B. Ma, and H. Li, “Text-dependent speaker verification: Classifiers, databases and RSR2015,” Speech Communication, vol. 60, pp. 56–77, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.specom.2014.03.001
[20] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.
[21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
[22] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
6 10.21437/IberSPEECH.2018-2
Figure 1: Standard i-vector PLDA framework. Given some input utterances S_trn and S_tst, the system generates llr(w_trn, w_tst). (Block diagram stages: GMM, i-vector extractor, centering, whitening, length normalization, PLDA.)

Figure 2: Proposed system. The zeroth order Baum-Welch statistics from both enrollment and test utterances are compared by means of the KL distance, which is then fused with the original final score. (Block diagram stages: GMM, i-vector extractor, centering, whitening, length normalization, PLDA, KL distance, fusion.)
3. Phonetic Mismatch Compensation

Short utterances are an already known problem in speaker verification, with several contributions [8][9][10][11][12][13]. Most of these solutions assume a sort of uncertainty term because of the missing information, which must be compensated. This uncertainty term summarizes how limited the information in the utterances is, but does not pay attention to which phonemes are missing. Therefore, this term is used as a sort of quality measure of the utterance representations. Consequently, scores are only compensated by these representation quality approximations, without any concern about the conditional dependencies when comparing enrollment and test. Comparing utterances with similar limited information is not as harmful as comparing utterances with totally mismatched phonetic content.

In our view, the detailed phonetic information is too valuable a source of side information to ignore. Besides, its quality keeps increasing as ASR systems evolve. This sort of knowledge allows the identification of the missing acoustic content, making possible some compensation for it and a fair comparison of short utterances with only the available audio.

Therefore, our proposal is a proof of concept, a first attempt to include the phonetic information in the evaluation of the trial. In this work we address the phonetic mismatch between enrollment and test utterances in a speaker verification system. For this reason we have defined a distance between the enrollment and the test utterance of a trial. This distance is defined to measure how different the acoustic content of the enrollment and test utterances is, hence measuring how fair the trials are in terms of acoustic similarity. The higher the number of matching phonemes, the more reliable the score is. Similarly, the lower the phoneme similarity, the less restrictive the score should be, in order to gain robustness against mismatches.

Within the standard i-vector PLDA framework, we consider the KL distance as the metric between enrollment and test utterances. This metric is formulated as follows:

$$\mathrm{KL}_{dist}(p, q) = \mathrm{KL}(p\|q) + \mathrm{KL}(q\|p) \qquad (1)$$

$$\mathrm{KL}(p\|q) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)} \, dx \qquad (2)$$

where KL represents the Kullback-Leibler divergence between distributions p and q. KL divergence is not symmetric, hence not a distance, so we make use of the symmetric version instead. This distance will compare our phonetic information, in this work the zeroth order Baum-Welch statistics, extracted from the GMM-UBM step in the pipeline. This information, related to the acoustic content in the utterance, has a strong relationship with the desired phonetic content. Nevertheless, no side information is required for its extraction.

The obtained distance can be taken into account at any later point of the speaker verification system (i-vector extractor, PLDA, etc.). In this work, a fusion of the PLDA score with the distance is considered, made by means of logistic regression. The schematic of the tested framework is illustrated in Fig. 2.

With this fusion, the new score is able to compensate for the phonetic mismatch in the trials, providing at the same time some sort of quality measure. However, this distance just analyzes the interaction between enrollment and test utterances in the scoring process, but does not analyze the utterance representation itself.

Table 1: EER(%) and minDCF metrics for the original long utterances, the chopped short utterance and the phonetically balanced short utterance

Utterance                           EER(%)   MinDCF
Baseline Long Utterance             3.25     0.16
Chopped Short Utterance             8.57     0.40
Phoneme Balanced Short Utterance    4.11     0.20

Table 2: KL distance and Error (%) for both target and non-target trials depending on the trial length: long utterances (Long), chopped short utterances (Short) and phonetically balanced short utterances (Phon. Balanced). Error estimated at NIST operation point.

Utterance        Long     Short    Phon. Balanced
Distance
  Target         1.06     3.62     2.55
  Non-target     1.74     4.61     3.47
Error (%)
  Target         28.43    80.40    40.79
  Non-target     0.06     0.01     0.03
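To make the symmetric KL distance of Eqs. (1) and (2) concrete, the sketch below evaluates it on zeroth order statistics, treated here as a discrete distribution over GMM components. The normalization of the occupation counts, the `eps` smoothing that avoids taking the logarithm of zero, and the function names are assumptions made for this illustration; the paper does not specify these details.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q) for two strictly positive, normalized vectors."""
    return float(np.sum(p * np.log(p / q)))

def kl_distance(n_enroll, n_test, eps=1e-10):
    """Symmetric KL distance of Eq. (1) between two zeroth order statistics.

    n_enroll, n_test: per-component occupation counts (one value per Gaussian).
    eps is a hypothetical smoothing constant for components with zero count.
    """
    p = np.asarray(n_enroll, dtype=float) + eps
    q = np.asarray(n_test, dtype=float) + eps
    p /= p.sum()  # turn counts into a distribution over components
    q /= q.sum()
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy counts for a 4-component model (hypothetical values).
enroll = [120.0, 80.0, 40.0, 10.0]
matched = [60.0, 40.0, 20.0, 5.0]     # same proportions, shorter utterance
mismatched = [5.0, 10.0, 90.0, 95.0]  # different acoustic content

d_matched = kl_distance(enroll, matched)
d_mismatched = kl_distance(enroll, mismatched)
```

In this toy case the distance depends only on the proportions of the counts, so an utterance that is shorter but covers the same components in the same proportions yields a distance close to zero.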
4. Experiments

Our experiments try to analyze the relevance of the phonetic mismatch in short utterances with limited acoustic information. We have opted for a speaker verification task with NIST SRE datasets, with available long utterances (around 5 minutes of speech with a unique speaker). Cohorts from SRE04, SRE05, SRE06 and SRE08 are used to construct the speaker verification system, the standard i-vector PLDA framework. This system consists of a 2048-Gaussian GMM-UBM and a 400-dimension i-vector extractor. I-vectors are centered, whitened and length-normalized [14] before being evaluated in a 400-dimension PLDA. Gaussianized MFCCs with first and second derivatives are the input for the system.

The described system has been evaluated on three subsets based on the NIST SRE10 det. 5 core-extended female experiment. The original utterances constitute the reference system with long utterances. Short utterances are created by chopping random segments from the original ones, only ensuring that the short utterance contains between 3 and 60 seconds of speech. Another short utterance subset is also extracted, selecting the frames so that the original and the extracted utterances share the same phoneme distribution. This subset is referred to as phonetically balanced in the paper. This last subset is impossible to find in real life, but allows the analysis of short utterances with (short utterances) and without (phonetically balanced) phonetic variability, making comparisons possible.

The first analysis compares the performance of the three subsets, evaluated with the reference speaker verification system. Both sorts of short utterances are expected to yield degraded performance with regard to the original long ones due to the limited information. The relevance of the phonetic variability is checked by direct comparison between these two results. In Table 1 we present the obtained performances for the long original utterances with respect to their shorter versions, either chopped or phonetically balanced.

The results indicate that both types of variability imply a loss of performance, as expected. However, the level of degradation is far from being the same. Whilst the standard chopped short segments suffer a 163.69% relative degradation in terms of EER, the phonetically balanced short utterances are degraded by only a relative 26.46%. Therefore, the estimation variability, i.e. how robust our i-vector is given the limited information, is not nearly as influential as the phonetic variability present in real-life short utterances. It is important to bear in mind that the amount of data evaluated per short utterance is significantly smaller than in the original long utterance, sometimes ruling out up to 95% of the original audio. The obtained results for long and short utterances are considered the baseline results for the experiments onwards.

The previous results show a significant impact of the phonetic variability on the utterance modeling capabilities. However, it is still unclear how this variability affects the performance of our system. Hence we have performed a study comparing the acoustic mismatch between enrollment and test utterances with the error score. As a first approach, the acoustic mismatch is measured by means of the KL distance between the distributions of the zeroth order Baum-Welch statistics for both the enrollment and the test utterances. Thus we are comparing which components of the GMM contribute to the i-vector extraction for enrollment and test. The results are shown in Table 2, comparing the newly proposed distance with the error at the evaluation operating point ($C_{Miss} = 10$, $C_{FA} = 1$, $P_{tgt} = 0.01$). The results are differentiated between target and non-target populations for a better understanding.

The results indicate that short utterances suffer from the acoustic mismatch between enrollment and test, which is much more significant than in long utterances. This extra mismatch occurs with both target and non-target trial populations. However, this extra mismatch does not have the same effect on the error term. Whereas non-target populations are not affected in terms of error, target trials are, which explains the degradation of short utterances. Trials with short utterances fail because the speaker verification system considers the phonetic variability as speaker variability, not differentiating between them.

The proposed solution is the compensation of the original scores by means of the phonetic distance between enrollment and test utterances. As a first approach, we propose a simple yet effective linear regression fusing two systems, the i-vector PLDA and the KL distance. This first approach helps the speaker verification system to notice whether the acoustic mismatch may be degrading the score or not. The results with this score are shown in Table 3.

According to the results, consistent improvements have been obtained, reaching up to 10% relative improvement. Notably, not only short utterances improve but long utterances do as well.

Finally, it is possible to analyze the benefits of the phonetic compensation and its impact on the different populations (target and non-target) in our trial subsets. The comparison between our baseline system and our compensated version is included in Table 4.

The results indicate a significant reduction of the target trials error (False Negative cases) with both short and long utter-
Table 3: EER(%) and MinDCF metrics for trials with original long utterances and short utterances, evaluating with the standard i-vector PLDA system (Baseline) and our proposed compensated version (Compensated)

PLDA, etc.) unaltered. Further work should be done in order to determine the best and most efficient ways to make use of this phonetic information.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10 10.21437/IberSPEECH.2018-3
Figure 1: Block diagram showing different stages of the RBM vector extraction and its input to Bottom-up AHC.
desired (known) number of clusters achieved. The clustering frames in order to generate 80-dimensional feature inputs to the
algorithm is based on computing a distance/similarity matrix RBMs. With a shift of one frame, we generate almost 10 million
M (X) between all the speakers’ segments. Where X is the samples for the URBM training. The large amount of training
set of segments to be clustered. Hence the RBM vectors of all samples will favor more efficient learning that will lead to more
the segments are extracted, the matrix M (X) is computed by accurate URBM. All the RBMs used in this paper comprise of
scoring all the RBM vectors against all. Thus for N RBM vec- 80 visible and 400 hidden units. The URBM was trained for 200
tors, the matrix M (X) has dimensions N × N . In every itera- epochs with a learning rate of 0.0005, weight decay of 0.0002
tion, the segments with minimum/maximum distance/similarity and a batch size of 100. All the adapted RBM models for the
scores are clustered together and the matrix M (X) is updated. test speaker segments are trained with 200 epochs with a learn-
The corresponding rows and columns of the clustered segments ing rate of 0.005, weight decay of 0.000002 and a batch size of
are removed from M (X) and a new row and column are added. 64. The PCA is trained with the background RBM supervectors
The new row and column contains the distance scores between as discussed in section 2.3. Finally, fixed dimensional RBM
the new and old clusters. The new scores are computed accord- vectors are extracted for the speakers’ segments and are used
ing to the linkage algorithm used. For example segments Sa in the speaker clustering experiments. Different dimensions of
and Sb are clustered in Sab . Then the scores between new clus- the RBM vectors are evaluated which will be discussed in the
ter (Sab ) and old segment (Sn ) are computed as follows: results section.
(a) Average Linkage: There are several metrics to measure the performance of
speaker clustering. For example cluster impurity (or con-
1 versely cluster purity), rand index, normalized mutual infor-
s(Sab , Sn ) = {s(Sa , Sn ) + s(Sb , Sn )} (1)
2 mation (NMI) and F-measure as described in [22]. We have
(b) Single Linkage: considered the Cluster Impurity (CI) measure in this work. CI
measures the quality of a cluster, to what extent a cluster con-
s(Sab , Sn ) = max{s(Sa , Sn ), s(Sb , Sn )} (2) tains segments from different speakers. However, this metric
has a trivial solution when there is only one segment per clus-
Where s(Sab , Sn ) is the score between new cluster Sab and old ter. To deal with this, Speaker Impurity (SI) is measured at the
segment Sn while s(Sa , Sn ) is the score between old segments same time. SI measures to what extent a speaker is distributed
Sa and Sn . among clusters. There is a trade-off between CI and SI [23].
In this way, the process is iterated until a stopping criterion CI and SI are plotted against each other in an Impurity Trade-
is met. There are two methods to control the iterations: (1) Fix off (IT) curve and an Equal Impurity (EI) point is marked as
a threshold, and (2) Add an additional information to the system working point.
about the desired (known) number of clusters. The system stops
when this number is reached. In this work, we did not let the 5. Results
system know any desired number of clusters and we have used
the thresholding method. We have tuned a threshold in order to Different lengths for RBM vectors as well as for i-vectors are
see the performance of the system at different possible working evaluated using cosine scoring and average linkage clustering
points. The system performance is measured with respect to the ground truth cluster labels. We will discuss evaluation metrics in section 4.

4. Experimental Setup and Database

The experiments were performed on audio from the AGORA database, which contains audio recordings of 34 TV shows broadcast by the Catalan channel TV3 [20]. Each show comprises two parts, a and b, so there are 68 audio files in total, each approximately 38 minutes long. These files contain segments from 871 adult Catalan and 157 adult Spanish speakers. For the clustering experiments in this work, we selected 38 audio files for testing and used the remaining 30 as background data. The background data is used to train the Universal Background Model (UBM), the Total Variability (T) matrix, the URBM and the PCA. From the testing audio files, we manually extracted 2631 speaker segments according to the ground truth rich transcription. These segments belong to the 414 speakers that appear in the audio files.

For both the baseline and proposed systems, 20-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features are extracted using a Hamming window of 25 ms with a 10 ms shift. For the baseline, a 512-component UBM is trained to extract i-vectors, and the PLDA is trained on the background i-vectors using the Alize toolkit [21]. For the proposed system, more than 3000 speaker segments are extracted from the background shows according to the ground truth rich transcription. For each segment, we concatenate the features of 4 neighboring [...]

[...] algorithm. The results are shown in the second column of Table 1. The table shows that increasing the dimension improves performance in terms of Equal Impurity (EI), both for i-vectors and for RBM vectors. However, for i-vectors the best choice is 800 dimensions. For RBM vectors, the 2000-dimensional vectors perform better than the others. In this case, a relative improvement of 11% is achieved compared to the 800-dimensional i-vectors. A further increase in the length of the RBM vectors beyond 2000 degrades the performance in terms of EI. The third column of Table 1 compares the performance of the RBM vectors with the baseline i-vectors for the single linkage clustering algorithm using [...]

Table 1: Comparison of speaker clustering results for the proposed RBM vectors with i-vectors. The dimensions of the vectors are given in parentheses. Each column shows Equal Impurity (EI) in % for a different scoring and linkage combination.

Approach             EI% (Cosine,   EI% (Cosine,   EI% (PLDA,
                     Average)       Single)        Single)
i-vector (400)       49.19          46.26          36.16
i-vector (800)       46.66          42.19          35.91
i-vector (2000)      46.79          42.83          35.89

RBM vector (400)     51.36          39.66          37.36
RBM vector (800)     47.20          40.02          32.36
RBM vector (2000)    41.53          37.14          31.68
Figure 3: Comparison of Impurity Trade-off (IT) curves for the proposed RBM vectors with i-vectors. Different dimensions of RBM vectors are evaluated using cosine scoring with the average linkage algorithm for clustering. The dimensions of i-vectors and RBM vectors are given in parentheses. [Plot of Speaker Impurity (%) against Cluster Impurity (%). Legend: i-vector (400): EI=49.19%; i-vector (800): EI=46.66%; i-vector (2000): EI=46.79%; RBM vector (400): EI=51.36%; RBM vector (800): EI=47.2%; RBM vector (2000): EI=41.53%; RBM vector (2400): EI=52.48%; RBM vector (3000): EI=50.51%.]

Figure 4: Comparison of Impurity Trade-off (IT) curves for the proposed RBM vectors with i-vectors. Different dimensions of RBM vectors are evaluated using cosine and PLDA scoring with the single linkage algorithm for clustering. The dimensions of i-vectors and RBM vectors are given in parentheses. [Plot of Speaker Impurity (%) against Cluster Impurity (%). Legend: i-vector (800) Cosine: EI=42.19%; i-vector (800) PLDA: EI=35.91%; RBM vector (2000) Cosine: EI=37.14%; RBM vector (2000) PLDA: EI=31.68%.]
7. References

[1] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 10 2015.
[2] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.
[3] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep neural networks for extracting baum-welch statistics for speaker recognition,” in Proc. Odyssey, 2014, pp. 293–298.
[4] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 10 2015.
[5] K. Chen and A. Salman, “Learning speaker-specific characteristics with a deep neural architecture,” IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1744–1756, 11 2011.
[6] L. Deng, D. Yu et al., “Deep learning: methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[7] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking speaker identification using bottleneck features of DNN.” in Interspeech, 2013, pp. 3661–3664.
[8] M. Senoussaoui, N. Dehak, P. Kenny, R. Dehak, and P. Dumouchel, “First attempt of boltzmann machines for speaker verification,” in Odyssey 2012-The Speaker and Language Recognition Workshop, 2012.
[9] O. Ghahabi and J. Hernando, “Restricted boltzmann machines for vector representation of speech in speaker recognition,” Computer Speech & Language, vol. 47, pp. 16–29, 1 2018.
[10] O. Ghahabi and J. Hernando, “Deep learning backend for single and multisession i-vector speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807–817, 4 2017.
[11] P. Safari, O. Ghahabi, and J. Hernando, “From features to speaker vectors by means of restricted boltzmann machine adaptation,” in ODYSSEY 2016-The Speaker and Language Recognition Workshop, 2016, pp. 366–371.
[12] H. Sayoud and S. Ouamour, “Speaker clustering of stereo audio documents based on sequential gathering process,” Journal of Information Hiding and Multimedia Signal Processing, vol. 4, pp. 344–360, 10 2010.
[13] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic segmentation, classification and clustering of broadcast news audio,” in Proc. DARPA speech recognition workshop, 1997, pp. 97–99.
[14] H. Ghaemmaghami, D. Dean, S. Sridharan, and D. A. van Leeuwen, “A study of speaker clustering for speaker attribution in large telephone conversation datasets,” Computer Speech & Language, vol. 40, pp. 23–45, 11 2016.
[15] S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Transactions on audio, speech, and language processing, vol. 14, no. 5, pp. 1557–1565, 9 2006.
[16] J. Jorrín, P. García, and L. Buera, “DNN bottleneck features for speaker clustering,” Proc. Interspeech 2017, pp. 1024–1028, 2017.
[17] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 6 2006.
[18] P. Safari, O. Ghahabi, and J. Hernando, “Feature classification by means of deep belief networks for speaker recognition,” in Signal Processing Conference (EUSIPCO), 2015 23rd European. IEEE, 2015, pp. 2117–2121.
[19] O. Ghahabi and J. Hernando, “Deep belief networks for i-vector based speaker recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1700–1704.
[20] H. Schulz and J. A. R. Fonollosa, “A catalan broadcast conversational speech database,” in Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages, 2009, pp. 27–30.
[21] A. Larcher, J. F. Bonastre, B. G. B. Fauve, K. A. Lee, C. Lévy, H. Li, J. S. D. Mason, and J. Y. Parfait, “Alize 3.0-open source toolkit for state-of-the-art speaker recognition.” in Interspeech, 2013, pp. 2768–2772.
[22] C. D. Manning, P. Raghavan, and H. Schütze, “Introduction to information retrieval,” Cambridge university press; 2008, pp. 158–163.
[23] D. A. van Leeuwen, “Speaker linking in large data sets,” Proceedings of the Speaker and Language Recognition Odyssey, pp. 202–208, 6 2010.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-4
2.2. Data augmentation

Data augmentation (DA) is a commonly used strategy to increase the quantity of training data. It is a key ingredient of state-of-the-art systems for image and speech recognition [9]. It can act as a regularizer that prevents overfitting [10] and improves performance in imbalanced class problems [11], making the whole process more robust and achieving better performance. It is also very useful for small data sets, as in our case, where augmenting the speech database improves accuracy as a consequence [12].

2.3. Classifiers

Methods such as Gaussian Mixture Models (GMM) are generally used for speaker recognition, and Support Vector Machines (SVM) are widely applied as well [13], [14]. However, several studies suggest the use of Deep Learning for speaker recognition [15], and others show an improvement in SR performance using Convolutional Neural Networks [16].

In recent years, Deep Learning algorithms have skyrocketed in many scientific fields, especially when large amounts of data are available. For this research, however, we aim to keep a balance between computational complexity and accuracy, due to the constraints imposed by the hardware of our target device and the reduced amount of data originally available. In addition, preliminary tests comparing GMM, SVM and Multi-Layer Perceptron (MLP) led us to choose the latter, a precursor of Deep Neural Networks, due to its better performance.

3.1. Feature Extraction

The acoustic features of speech extracted from audio signals should reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioral patterns (e.g., voice pitch, speaking style).

We worked with the features extracted in the work done by Alba Mínguez [17] within BINDI for stress detection, since the database employed is the same. These are the pitch, the first three formants, twelve Mel-Frequency Cepstral Coefficients and the energy of the signals. The short-term features were computed every 10 ms of audio, and then a temporal integration was performed over 1 s segments by calculating the mean and standard deviation. This results in one feature vector per 1 s audio frame, which is the rate at which the accompanying heart rate measures used for labelling stress were taken.

3.2. Data augmentation

On our device, we would hypothetically have neutral speech for the learning step and might find stressed speech for testing. For this reason, and given the low number of samples we have, we considered generating a synthetically stressed database by performing data augmentation for the particular case of stress conditions. To be able to produce stressed speech out of neutral utterances we carried out an analysis, first listening to the audio signals and detecting what differences could be appreciated between them, and second, measuring those differences between stressed and neutral frames for the same speakers.

As a first outcome, we realized that locution speed reflects the stress of a person: we tend to pronounce more words per second and produce longer pauses when stressed. In these same conditions, there is a tendency to raise the frequency of our voices. Thus, the speed and pitch of the audio signals are the two variables that we aim to modify by using the SOX library [18] in order to artificially simulate speech under stress conditions.

4. Experiments

In this section we present the construction of our system on a block-by-block basis: we introduce the database, the labelling strategy, the preprocessing of the data, and the experiments carried out.

4.1. Corpus database

We used the so-called VOCE Corpus Database [19], a database of recordings from 45 speakers in neutral and stress conditions. For each user, speech was recorded in 3 different scenarios: recording, prebaseline and baseline. These were acquired, respectively, in a public speaking setting where the speaker is supposed to be under stress, while the speaker reads a paper 24 hours before the speech, and while the speaker reads the same paper again 30 minutes before the public speaking setting. The heart rate (HR) was also acquired every second for the three recordings.

However, we only used 21 of the 45 speakers due to the lack of properly recorded HR information, noisy audio or missing recordings. We divided these 21 speakers into two sets. Set 1 was composed of 10 speakers whose HR was coherent with the recordings, in the sense that the heart rate remained stable while the speaker was reading but rose in the public speaking setting. Set 2 was made up of the remaining 11 speakers. Table 1 specifies the number of samples per setting, each sample representing a 1 s audio frame.

4.2. Preprocessing

For simplicity, we begin by converting the audio recordings from stereo to mono, followed by downsampling from 44100 Hz to 16000 Hz to reduce the computational cost of the problem without losing too much precision. Then, the signals are normalized in amplitude so that they can be compared with each other, and finally they go through a voice activity detector (VAD) [20] that removes silent audio frames, as those do not carry valuable information for our task.

4.3. Labelling

Labelling an audio signal to determine the presence of stress is a delicate matter, since there is no prescribed way to do so given that stress is non-binary and very subjective. Taking a pragmatic perspective, we once more relied on the work done by Alba Mínguez [17], where the recordings of this corpus were labelled according to each user's heart rate (HR). Every 1 s audio frame is labelled as stressed or neutral using two different heart rate thresholds. We selected the binarization threshold that gave the better results in their report, which was the 75th percentile of the HR of the user.

Figure 1: Block Diagram of the system

4.4. Balancing the data

We soon realized that the data instances were not balanced for each speaker. An adjustment needs to be made for each set and condition to get consistent estimates, as all classes have the same importance. Nevertheless, the use of an over-sampling technique alone would have a big drawback in our case, because some users have significantly more samples than others and this would create too many artificial samples. To cope with this problem, we randomly cropped the neutral samples at a threshold of 120 samples for both sets, and the stressed samples at a threshold of 300 samples. Applying an over-sampling technique (in particular, SMOTE) [11] to the cropped data then produced new samples, resulting in a balanced data set.

4.5. Preliminary Experiments

For an initial experimental set-up we used the data available for Sets 1 and 2 (21 speakers). This preliminary experiment was carried out to observe the effect of mismatched conditions on the speaker recognition rate. First, we divided the data into neutral (NS) and stressed speech (S) and experimented with training on one type of speech and testing on the other, and then with mixing both types. In order to get reliable results, these experiments were repeated 50 times where, in each repetition, at least 50% of the data was randomly chosen for testing. The results in terms of accuracy (percentage of audio segments correctly classified) are in Table 2.

4.6. Synthetic Stress

We performed an analysis to measure the differences in mean pitch between neutral and stressed audio frames for each speaker using VOICEBOX [20], and we also estimated the average elocution speed for each user. To do this, we obtained an automatic transcription of each recording using Google Speech Recognition [21] and then computed the mean number of words per second.

The pitch differences from neutral to stressed speech were between -2% and +7%, with an average increase of 2.2%. As regards the elocution speed, subjectively it seems to increase in stressed speech, but our analysis gave us the opposite conclusion. The number of words per second was higher when the user was reading a text (2.2 words/s on average) than when the speaker was giving an oral presentation (1.85 words/s). By listening to the signals, we determined that the words were pronounced faster but there were more pauses between them, leading to a lower elocution rate overall.

Thus, we changed the locution speed and the pitch of the original database to produce synthetically stressed speech samples. The pitch was modified in steps of [-6%, -3%, +3%, +6%], and the signals were reproduced at the following speeds: [-20%, -15%, -10%, -5%]. All these modifications are applied to the original sets and result in an augmentation of the data, with one new synthetic set per modification.
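
The pitch and speed modifications listed in Section 4.6 can be enumerated programmatically before being handed to an audio tool. The sketch below only computes the parameters and is not the authors' implementation. It assumes that each pitch step and each speed step is applied on its own (giving eight synthetic sets, in line with "one new synthetic set per modification") rather than as a 4x4 grid. The conversion to cents is included because audio tools often express pitch shifts in cents or semitones rather than in percent. All function and variable names are ours.

```python
import math

PITCH_STEPS_PCT = [-6, -3, 3, 6]        # pitch changes from Section 4.6
SPEED_STEPS_PCT = [-20, -15, -10, -5]   # playback speed changes from Section 4.6


def pitch_percent_to_cents(percent: float) -> float:
    """Convert a relative frequency change in percent to cents (1200 per octave)."""
    return 1200.0 * math.log2(1.0 + percent / 100.0)


def speed_percent_to_factor(percent: float) -> float:
    """Convert a relative speed change in percent to a multiplicative factor."""
    return 1.0 + percent / 100.0


def build_modifications() -> list:
    """One entry per synthetic set (assumption: modifications are not combined)."""
    mods = [{"kind": "pitch", "percent": p, "cents": pitch_percent_to_cents(p)}
            for p in PITCH_STEPS_PCT]
    mods += [{"kind": "speed", "percent": s, "factor": speed_percent_to_factor(s)}
             for s in SPEED_STEPS_PCT]
    return mods


modifications = build_modifications()
for mod in modifications:
    print(mod)
```

Under these assumptions a +6% pitch change is close to one semitone (about 101 cents), and a -20% speed change corresponds to a playback factor of 0.8.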
Figure 3: Equivalence between training data and number of experiment.
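
The heart-rate-based labelling rule of Section 4.3 can be illustrated with a short sketch. This is a hedged illustration and not the code of [17]. The percentile method ("inclusive" interpolation from the Python standard library) and the choice to label a frame as stressed when its HR is strictly above the per-user threshold are both assumptions, since the text only states that the 75th percentile of the user's HR is used as the binarization threshold. The function names and the sample values are hypothetical.

```python
import statistics
from typing import List


def hr_threshold(hr_bpm: List[float]) -> float:
    """75th percentile of one user's heart-rate samples (inclusive method)."""
    return statistics.quantiles(hr_bpm, n=4, method="inclusive")[2]


def label_frames(hr_bpm: List[float]) -> List[str]:
    """Label each 1 s frame from the HR sample taken in that second."""
    threshold = hr_threshold(hr_bpm)
    # Assumption: strictly above the per-user threshold means "stressed".
    return ["stressed" if hr > threshold else "neutral" for hr in hr_bpm]


# Hypothetical HR samples (beats per minute), one per 1 s audio frame.
example_hr = [60, 62, 64, 66, 100]
labels = label_frames(example_hr)
print(hr_threshold(example_hr), labels)
```

Because the threshold is computed per user, speakers with a generally high resting heart rate are not labelled as stressed throughout.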
7. References

[1] J. H. Hansen and S. Patil, “Speaker classification,” C. Müller, Ed. Berlin, Heidelberg: Springer-Verlag, 2007, ch. Speech Under Stress: Analysis, Modeling and Recognition, pp. 108–137. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-74200-5_6
[2] F. Dellaert, T. Polzin, and A. Waibel, “Recognizing emotion in speech,” vol. 3, 12 1996.
[3] I. Murray and J. L. Arnott, “Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion,” vol. 93, pp. 1097–108, 03 1993.
[4] “UC3M4SAFETY - Multidisciplinary team for detecting, preventing and combating violence against women,” 2017. [Online]. Available: http://portal.uc3m.es/portal/page/portal/inst_estudios_genero/proyectos/UC3M4Safety
[5] A. Poddar, M. Sahidullah, and G. Saha, “Speaker verification with short utterances: a review of challenges, trends and opportunities,” IET Biometrics, vol. 7, no. 2, pp. 91–101, 2018.
[6] E. Shriberg, Higher-Level Features in Speaker Recognition. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 241–259. [Online]. Available: https://doi.org/10.1007/978-3-540-74200-5_14
[7] J. P. Campbell, “Speaker recognition: a tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep 1997.
[8] D. A. Reynolds, T. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models.” vol. 10, no. 1, pp. 19–41, 2000.
[9] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” 2013. [Online]. Available: http://www.cs.toronto.edu/~ndjaitly/jaitly-icml13.pdf
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR, 2003.
[11] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” CoRR, vol. abs/1106.1813, 2011. [Online]. Available: http://arxiv.org/abs/1106.1813
[12] I. Rebai, Y. BenAyed, W. Mahdi, and J.-P. Lorré, “Improving speech recognition using data augmentation and acoustic model fusion,” Procedia Computer Science, vol. 112, pp. 316–322, 2017.
[13] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using gmm supervectors for speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, May 2006.
[14] K. A. Abdalmalak and A. Gallardo-Antolín, “Enhancement of a text-independent speaker verification system by using feature combination and parallel structure classifiers,” Neural Computing and Applications, vol. 29, no. 3, pp. 637–651, Feb 2018.
[15] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, “Speaker recognition using neural networks and conventional classifiers,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 194–205, Jan 1994.
[16] M. McLaren, Y. Lei, N. Scheffer, and L. Ferrer, “Application of convolutional neural networks to speaker recognition in noisy conditions,” pp. 686–690, 01 2014.
[17] A. Mínguez-Sánchez, “Detección de estrés en señales de voz [Stress detection in voiced signals],” p. 86, 06 2017. [Online]. Available: https://github.com/minguezalba/Stress_Detection
[18] R. Bittner, E. Humphrey, and J. Bello, PySOX: Leveraging the Audio Signal Processing Power of SOX in Python. International Conference on Music Information Retrieval (ISMIR-16), 8 2016.
[19] A. Aguiar, M. Kaiseler, C. M, J. Silva, M. H, and P. Almeida, “Voce corpus: Ecologically collected speech annotated with physiological and psychological stress assessments.” 05 2014. [Online]. Available: https://repositorio-aberto.up.pt/bitstream/10216/85669/2/133351.pdf
[20] M. Brookes, “Voicebox: Speech processing toolbox for matlab [software],” 01 2011. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[21] A. Zhang, “Speech recognition (version 3.8) [software],” 2017. [Online]. Available: https://github.com/Uberi/speech_recognition
[22] J. Ridley Stroop, “Studies of interference in serial verbal reactions,” in Journal of Experimental Psychology: General, vol. 121, 03 1992, pp. 15–23.
10.21437/IberSPEECH.2018-5
[Figure: block diagram of the methodology. Recoverable labels: L1, L2; audio, subtitles, script; segment extraction; word alignment; speaker annotation; prosody annotation.]

prosodic features and for segmenting multi-speaker portions, we have used a speech-to-text aligner software. Multi-speaker segments were split from the words following speech-dashes. For merging incomplete segments, punctuation information was used.

2.2. Speech-to-text alignment
2.4. Word-level acoustic feature annotation

Each word in the extracted segments is automatically annotated with the following acoustic features: mean fundamental frequency (f0), mean intensity, speech rate and silence intervals (pauses) before and after. The first two features are extracted with the ProsodyTagger toolkit [10] built on Praat [11]. Pause information is calculated from word-boundary information, and speech rate is calculated using:

$$\text{word speech rate} = \frac{\#\text{syllables in word}}{\text{word duration}} \tag{1}$$

To represent speaker-independent, perceptual acoustic variations in the segments, both f0 and intensity values are converted into a logarithmic semitone scale relative to the speaker norm value. Thus, speaker mean values are represented by zero in both cases. Semitone values are calculated with the corresponding formula:

$$\text{semitone}(x, \text{norm}) = 12 \cdot \log\left(\frac{x}{\text{norm}}\right) \tag{2}$$

2.5. Cross-lingual segment alignment based on subtitle cues

The first three methodologies presented in this section dealt with the extraction of segments in each language. This subsection explains how the segments extracted for each language are aligned to create the bilingual segment pairs.

We have developed an aligning process based on the timing information of the extracted segments. Note that the segment alignments can be one-to-one, one-to-many, many-to-one or many-to-many depending on the sentence structure in the subtitles. To create our own alignment algorithm based on time cues, we first defined a metric that measures the correlation percentage between two sets of ordered segments $S=\langle s_1, \dots, s_N\rangle$ and $E=\langle e_1, \dots, e_N\rangle$:

$$\text{segments correlation} = \max\left(0, \frac{\text{correlating}}{\text{span}} \times 100\right) \tag{3}$$

$$\text{correlating} = \min(ee_N, se_N) - \max(es_1, ss_1) \tag{4}$$

$$\text{span} = \max(ee_N, se_N) - \min(es_1, ss_1) \tag{5}$$

where $es_x$ and $ee_x$ denote the starting and ending time of the $x$th segment in set $E$, and $ss_x$ and $se_x$ denote the starting and ending time of the $x$th segment in set $S$.

The alignment procedure is as follows. Two indexes $i_E$ and $i_S$ are kept, which slide through the segments of each language. First, the segments corresponding to each index are checked to see whether they correlate more than the $T_{Sure}$ threshold. If they do, they are assigned as a one-to-one matched pair. If not, the possibilities of one-to-many, many-to-one or many-to-many matches are considered. This is done by computing the correlations between combinations of the current and the two following segments and selecting the most correlating segment set pair. While considering combinations of segments, it is ensured that two merged segments belong to the same speaker and are no more than 10 seconds apart. If the combined segment set pair with the highest correlation has a correlation above the $T_{Merged}$ threshold, then the combinations are merged into one segment and paired with each other.

Although the $T_{Sure}$ threshold catches most of the one-to-one mapping segments, we realized that many of them fall below this threshold even if they map. So, we added another decision step: if the one-to-one mapping correlation scores higher than the merged pairings and is above a $T_{OK}$ threshold, then it is preferred as the matched pair.

2.6. Output format

We needed to store the corpus segments in a way that is convenient to use with machine learning based applications. We used the Proscript library [12] for storing the enhanced transcripts. This library makes it possible to store and manipulate speech transcript related data. The segments are stored in csv files that keep the information listed in Table 1. A csv file containing all the segments is created for each episode as well.

Table 1: Segment information kept in a Proscript format csv file.

Information    Details
word           tokenized
id             unique word id
timing         start and end times
pause          coming before and after
punctuation    attached to beginning and end
f0             in Hertz and log-scale (semitones)
intensity      in Decibels and log-scale
speech rate    relative to syllables

3. Compiling the Heroes corpus

We put our methodology into practice by compiling a corpus from the science fiction TV series Heroes^6. Originating from the United States, Heroes ran on TV channels worldwide between the years 2006 and 2010. The whole series consists of 4 seasons and 77 episodes and is dubbed into many languages including Spanish, Portuguese, French and Catalan. Each episode runs for a length of 42 minutes.

We chose this series as we had access to the DVDs with Spanish dubbing. Also, we found it to have the Spanish subtitles closest to the Spanish dubbing scripts among other series.

3.1. Raw data acquisition

The DVDs of the series were obtained from the Pompeu Fabra University Library. Episodes were extracted using the Handbrake software and were saved as Matroska format (mkv) files. Like DVDs, mkv files can hold multiple embedded channels of audio and subtitles. In order to run our scripts we first needed to extract the audio and subtitle pairs for both languages. Audio is extracted using the mkvextract command line tool^7. As subtitles were embedded as bitmap images in the DVD, we had to run optical character recognition (OCR) in order to get srt format subtitles. As OCR is an error-prone process, the resulting srt files needed to be spell checked.

We collected English and Spanish audio of 21 episodes, totaling 25 hours of raw audio, and their corresponding subtitles. The episode scripts were obtained from a fan site on the Internet^8.

                                English   Spanish
Avg. # sentences (subtitles)    647       554
Avg. # sentences (extracted)    628       513
Avg. # segments                 526       459
Avg. # parallel segments             334

Table 4: Average numbers for each episode.

^6 Produced by Tailwind Productions, NBC Universal Television Studio (2006-2007) and Universal Media Studios (2007-2010)
^7 https://mkvtoolnix.download/
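
Equations (1) and (2) of Section 2.4 translate directly into code. The sketch below is illustrative only: the function names are ours, and the base-2 logarithm is an assumption, since the text writes the logarithm without a base (base 2 gives the usual 12 semitones per doubling of the value).

```python
import math


def word_speech_rate(n_syllables: int, word_duration_s: float) -> float:
    """Equation (1): syllables per second for a single word."""
    if word_duration_s <= 0:
        raise ValueError("word duration must be positive")
    return n_syllables / word_duration_s


def semitone(x: float, norm: float) -> float:
    """Equation (2), assuming a base-2 logarithm (12 semitones per doubling)."""
    if x <= 0 or norm <= 0:
        raise ValueError("x and norm must be positive")
    return 12.0 * math.log2(x / norm)


# Hypothetical values: a 3-syllable word lasting 0.5 s, and a word whose
# mean f0 (200 Hz) is twice the speaker's mean f0 (100 Hz).
print(word_speech_rate(3, 0.5))
print(semitone(200.0, 100.0))
print(semitone(150.0, 150.0))
```

A value equal to the speaker norm maps to zero, which matches the statement that speaker mean values are represented by zero.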
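
The segment correlation metric of Section 2.5 (Equations (3) to (5)) can be sketched as follows. This is a minimal illustration and not the authors' movie2parallelDB code. It reads the "correlating" term as the temporal overlap of the two segment sets, that is, the earlier of the two end times minus the later of the two start times, which is our interpretation of Equation (4). The type alias and function name are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def segments_correlation(s_set: List[Segment], e_set: List[Segment]) -> float:
    """Correlation percentage between two time-ordered, non-empty segment sets."""
    s_start, s_end = s_set[0][0], s_set[-1][1]
    e_start, e_end = e_set[0][0], e_set[-1][1]
    correlating = min(e_end, s_end) - max(e_start, s_start)  # overlap, Eq. (4)
    span = max(e_end, s_end) - min(e_start, s_start)         # total extent, Eq. (5)
    if span <= 0:
        return 0.0
    return max(0.0, correlating / span * 100.0)              # Eq. (3)


# Hypothetical timings: one Spanish segment against two merged English segments.
spanish = [(10.0, 14.0)]
english = [(10.0, 12.0), (12.5, 14.0)]
print(segments_correlation(spanish, english))
print(segments_correlation([(0.0, 10.0)], [(5.0, 15.0)]))
print(segments_correlation([(0.0, 1.0)], [(2.0, 3.0)]))
```

Identical time extents score 100, and sets that do not overlap in time are clipped to 0 by the max in Equation (3). In the alignment procedure these scores would then be compared against the thresholds described in the text, whose values are not given here.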
7. References
[1] A. Tsiartas, P. G. Georgiou, and S. S. Narayanan, “A
study on the effect of prosodic emphasis transfer on over-
all speech translation quality,” in 2013 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Pro-
cessing, May 2013, pp. 8396–8400.
[2] G. K. Anumanchipalli, L. C. Oliveira, and A. W. Black,
“Intent transfer in speech-to-speech machine transla-
tion,” in Spoken Language Technology (SLT) Workshop.
IEEE, 2012, pp. 153–158.
[3] Q. T. Do, S. Sakti, and S. Nakamura, “Toward expres-
sive speech translation: A unified sequence-to-sequence
lstms approach for translating words and emphasis,” in
INTERSPEECH, 2017.
[4] J.-P. Goldman, P.-E. Honnet, R. Clark, P. N. Garner,
M. Ivanova, A. Lazaridis, H. Liang, T. Macedo, B. Pfis-
ter, M. S. Ribeiro, E. Wehrli, and J. Yamagishi, “The si-
wis database: a multilingual speech database with acted
emphasis,” 2016.
[5] P. D. Agüero, J. Adell, and A. Bonafonte, “Prosody
generation for speech-to-speech translation,” in Interna-
tional Conference on Acoustics, Speech, and Signal Pro-
cessing (ICASSP), vol. 1. IEEE, 2006, pp. 557–560.
[6] T. Kano, S. Takamichi, S. Sakti, G. Neubig, T. Toda,
and S. Nakamura, “An end-to-end model for cross-
lingual transformation of paralinguistic information,”
Machine Translation, Apr 2018. [Online]. Available:
https://doi.org/10.1007/s10590-018-9217-7
[7] A. Öktem, M. Farrús, and L. Wanner, “Automatic ex-
traction of parallel speech corpora from dubbed movies,”
in Proceedings of the 10th Workshop on Building and Us-
ing Comparable Corpora (BUCC), Vancouver, Canada,
2017, pp. 31–35.
[8] A. Öktem, “movie2parallelDB: Automatic parallel
speech database extraction from dubbed movies,”
2018. [Online]. Available: https://github.com/alpoktem/
movie2parallelDB
[9] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and
M. Sonderegger, “Montreal Forced Aligner: Trainable
text-speech alignment using Kaldi,” in Proc. Interspeech,
2017, pp. 498–502.
[10] M. Dominguez, M. Farrús, and L. Wanner, “An
automatic prosody tagger for spontaneous speech,” in
Proceedings of COLING 2016, the 26th International
Conference on Computational Linguistics: Technical
Papers. The COLING 2016 Organizing Committee,
2016, pp. 377–386. [Online]. Available: http://www.
aclweb.org/anthology/C16-1037
[11] P. Boersma and D. Weenink, “Praat: Doing phonet-
ics by computer [Computer software], retrieved from
http://www.praat.org/,” 2017.
[12] A. Öktem, “Proscript: Python library for prosodic
annotation of speech segments,” 2018. [Online]. Available:
https://github.com/alpoktem/proscript
10.21437/IberSPEECH.2018-6
[Figure: ASR system block diagram. Recoverable labels: Speech Audio, Feature Extraction (Front-end), Feature Vectors, Decoder, Graphemes, Language Models, Pronunciation Dictionary.]

tional language such as interviews or programs with clear dictation and a minimal amount of background noise or music. Based on these priorities, we downloaded approximately 490 hours of video with their corresponding subtitles in srt format from 17 programs. The distribution of topics and durations for each program can be seen in Table 1.
Table 1: The programmes used in constructing the ASR system. Table shows their respective themes, total downloaded durations and
the final duration used for the training.
finally converted these IPA version to the CMU Sphinx readable between 130 and 6800 Hz; i.e. 12 cepstra using the C0 as the
format. In total we have used 37 phonemes, consistent with the energy component plus their deltas and delta deltas adding up to
literature on Catalan phonetic corpus [12]. 39 total parameters (1s c d dd). For the acoustic model train-
Additional information on the structure of Catalan language ing, our Gaussian mixture model contains 32 Gaussian densi-
is necessary for the decoding phase. Within an ASR system ties, and 6000 tied HMM states.
the statistical information on the linguistic grammar and syn- In our process, we started by estimating the transition prob-
tax represented through the language model, and these models abilities of the Context-Independent (CI) HMMs for forced
can be prepared using a sufficiently large text corpus. In this alignment of the acoustic data. For the forced alignment itself
work, we have taken advantage of the subtitle of our audio cor- we used the sphinx3 align executable that needed to be com-
pus and merged them with the Catalan OpenSubtitles Corpus piled apart from the Sphinx-5prealpha library. In this step, the
[13] to build a basis for our language models. audio files are aligned with their respective transcriptions using
The final corpus is cleaned from all symbols and punc- the CI models, in the case when there is a mismatch with the
tuation, and numbers are normalized using espeak tool. The transcription and the alignment result, the audio files are elimi-
complete corpus has 5.3 million tokens with around 100,000 nated for the following steps. After the non-aligning segments
unique tokens which we used for compiling the phonetic are eliminated we were left with 240 hours of total audio. The
lexicon. Using approximately 58k words (which appear at final amount of audio per programme used is shown in table 1.
least twice in the corpus) we have prepared one 3-gram The transition probabilities of the CI HMMs re-estimated
(OT large 3gram) and another 4-gram (OT large 4gram) lan- using filtered data set, and following this phase a complete list
guage model in ARPA format using the CMU Language Model of tri-phones (58289 in our case) are built and their transition
toolkit (CMUCLMTK). probabilities are estimated in the form of Context-Dependent
(CDs) HMMs. These tri-phones account for both between-word
and within-word contexts, however since the training data might
4. Training not account for all the possibilities, the unseen tri-phones are
For our training process we have used the standard CMU Sphinx tied to the seen tri-phones using decision trees.
training steps with very minor changes. The training starts with We performed our training in a resource limited environ-
extraction of the Mel-Frequency Cepstral Coefficients (MFCC) ment. For four threads of Intel(R) Atom(TM) CPU N2800 with
27
1.86GHz, the whole training process took about 120 hours.

4.1. Evaluation

In order to evaluate the word error rate (WER) of the acoustic models we wanted to make sure that the test voices do not appear in the training corpus. In order to evaluate the acoustic models for similar recordings, we downloaded different TV3 programmes that were not used in the training. Four hours of new TV3 recordings evaluated with the OT large 4gram language model resulted in a WER of 35.2%. For the decoding we used the standard decoding script within CMU Sphinx.

In order to evaluate the accuracy of the models in a cleaner environment and also to guarantee 100% speaker exclusion we decided to use another test set for a second evaluation round. For this we used the FESTCAT corpus, which is specifically designed for creating a speech synthesis system for Catalan [14, 15], and consists of 28 hours of recordings from 10 different voices (5 female, 5 male). For evaluating our ASR system we used 4 hours in total from 4 female and 4 male voices with their corresponding transcripts. It should be noted that due to the clean environment of the recordings, the FESTCAT dataset also represents a more ideal audio quality.

Due to our restricted text corpus, we have created another set of language models, in addition to the ones explained in subsection 3.3, specifically for the test decoding. The second set of language models uses a corpus of FESTCAT text plus the corpus explained in subsection 3.3. Using this new corpus we created one 3-gram language model with the most frequent 20k words (OTF 20k 3gram) and two other models with the most frequent 58k words (OTF large 3gram, OTF large 4gram), similar to the OT large models. Whereas for the OT large 4gram language model we ended up with a WER of 31.95%, the best results were attained by the OTF large 4gram model at 11.68%. The results for each language model with the corresponding real-time decoding factor (xRT) for an Intel i7-4510U 3 GHz Quad-core architecture are shown in table 2. The high precision of the OTF results shows that if the acoustic conditions are perfect and the language models are "in-domain," our acoustic models can recognize voices that are not in their training set reasonably well. In addition, in our case the main factor which improved the recognition precision was the amount of pruning that the corpus was subjected to for constructing the language models. Whereas moving from the most frequent 20k words to the most frequent 58k words makes a considerable improvement, the effect of using 4-grams instead of 3-grams seems to be very small, probably due to our specific test condition. However, one important difference between the 3-gram and 4-gram models is the xRT, for which the 3-gram models are considerably faster than the 4-gram models. Note that we did not undertake any optimization of the decoding parameters, either for best precision or for best computational performance.

Table 2: The WER and xRT results for different language models for the FESTCAT test dataset.

Language Model     WER (%)   xRT
OT large 4gram     31.95     0.952
OTF 20k 3gram      22.50     0.872
OTF large 3gram    12.11     0.900
OTF large 4gram    11.68     1.002

5. Future Work

The most important and basic step for improving our ASR system is to use a better pronunciation dictionary built with a better grapheme to phoneme conversion system. In this work, for its ease of use we have taken advantage of espeak; however, the festival based FESTCAT speech synthesis system is specifically implemented for Catalan and allows for a more refined grapheme to phoneme conversion. Training the acoustic model with this improved pronunciation dictionary will allow for better results overall.

The acoustic data itself, depending on the sound quality, background music levels and speaker mistakes, should be clustered into clean and other, similar to the librispeech dataset [16]. Additionally, we plan to build a gender diarization model, determining whether the voice is male or female for each segment, in order to assess the gender balance of the whole dataset.

With these acoustic models, it will be possible to align an audio file with its given text. This process will not only be useful in cleaning the dataset itself, but will also allow extending the current set without relying on the cue start and end times within the subtitles. This implies further Catalan acoustic data could be assembled by using the audio and just its corresponding transcription.

Related to this possibility, another tool we would like to develop is a system of automatic punctuation in Catalan. The readability of recognized transcripts depends a lot on sentence segmentation and correct punctuation. The methods for training punctuation engines using recurrent neural networks (RNN) are very well developed, especially with the use of a large text corpus [17]. It was also recently demonstrated that acoustic data with word-aligned transcriptions can be used to create prosody based punctuation models [18]. For now this type of model has only been trained for English. With the possibility of doing word level alignments for Catalan, we will be able to train one for Catalan in the near future.

6. Conclusions

In this paper, we have described the building of an ASR system for a new language, using only publicly available resources. Applying our methodology to Catalan, we compiled a dataset of 240 hours of transcribed broadcast speech and used it to develop large-vocabulary speech recognition models, both of which are distributed openly online. The accuracy of the resulting models shows that they can be a base for speech technology developers to access the Catalan speaking community. Building a voice input interface for a desktop or mobile application is as easy as installing the CMU Sphinx toolkit (https://cmusphinx.github.io/wiki/tutorial/) and placing the models in its installation directory. The ASR system further gives the possibility to adapt acoustic and language models for more specialized vocabularies and acoustic environments. We believe that the practical and low-cost setup of CMU Sphinx makes it an important player amongst other ASR engines, despite the more modern neural network based alternatives. It keeps its relevance especially for minority languages which have few open acoustic data resources available.

7. Acknowledgements

This project was funded by Softcatalà. The authors would like to thank Antonio Bonafonte for his guidance during the writing of this paper.
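As an illustration of the two metrics reported in Section 4.1 and Table 2, the following minimal sketch computes the WER as the word-level edit distance divided by the reference length, and the xRT as decoding time divided by audio duration. It is not the scoring script shipped with CMU Sphinx: the function names `word_error_rate` and `real_time_factor` and the example sentences are hypothetical, and the xRT definition is the conventional one, assumed here because the paper does not spell it out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """xRT, assumed here to be decoding time divided by audio duration."""
    return decoding_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical example: one word of a four-word reference is dropped.
    print(word_error_rate("el gat negre dorm", "el gat dorm"))
    print(real_time_factor(90.0, 100.0))
```

Under this definition an xRT below 1 means decoding runs faster than real time, which is the case for all but one of the models in Table 2.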
8. References

[1] H. Schulz, M. Ruiz, and J. A. R. Fonollosa, “TECNOPARLA - Speech technologies for Catalan and its application to speech-to-speech translation,” Procesamiento del lenguaje natural, vol. 41, pp. 319–320, Sep 2008.
[2] J. Mariño, J. Padrell, A. Moreno, and C. Nadeu, “Workshop on speech recognition based on very large telephone speech databases,” C. Draxler, 2000, pp. 57–61.
[3] M. Gales, S. Young et al., “The application of hidden Markov models in speech recognition,” Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2008.
[4] K.-F. Lee, H.-W. Hon, and R. Reddy, “An overview of the sphinx speech recognition system,” in Readings in Speech Recognition. Elsevier, 1990, pp. 600–610.
[5] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler et al., “The 1996 hub-4 sphinx-3 system,” in Proc. DARPA Speech recognition workshop, vol. 97. Citeseer, 1997.
[6] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M. Warmuth, and P. Wolf, “The CMU sphinx-4 speech recognition system,” in IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong, vol. 1, 2003, pp. 2–5.
[7] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, “Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices,” in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1. IEEE, 2006, pp. I–I.
[8] D. Huggins-Daines and A. I. Rudnicky, “Mixture pruning and roughening for scalable acoustic models,” in Proceedings of the ACL-08: HLT Workshop on Mobile Language Processing, 2008, pp. 21–24.
[9] D. Huggins-Daines and A. I. Rudnicky, “Combining mixture weight pruning and quantization for small-footprint speech recognition,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. IEEE, 2009, pp. 4189–4192.
[10] C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, “Comparing open-source speech recognition toolkits,” Tech. Rep., DHBW Stuttgart, 2014.
[11] P. Vojtas, J. Stepan, D. Sec, R. Cimler, and O. Krejcar, “Voice recognition software on embedded devices,” in Asian Conference on Intelligent Information and Database Systems. Springer, 2018, pp. 642–650.
[12] I. Esquerra, C. N. Camprubí, L. Villarrubia, and P. León, “Design of a phonetic corpus for speech recognition in Catalan,” in Workshop on Language Resources for European Minority Languages at the Conference on Language Resources and Evaluation (LREC), Granada, 1998.
[13] J. Tiedemann, “News from OPUS - A collection of multilingual parallel corpora with tools and interfaces,” in Recent Advances in Natural Language Processing, N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, Eds. Borovets, Bulgaria: John Benjamins, Amsterdam/Philadelphia, 2009, vol. V, pp. 237–248.
[14] A. Bonafonte, J. Adell, I. Esquerra, S. Gallego, A. Moreno, and J. Pérez, “Corpus and voices for Catalan speech synthesis,” in Proceedings of LREC Conference 2008, 2008, pp. 3325–3329.
[15] A. Bonafonte, L. Aguilar, I. Esquerra, S. Oller, and A. Moreno, “Recent work on the FESTCAT database for speech synthesis,” in Proceedings of LREC Conference 2008, 2009, pp. 3325–3329.
[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[17] O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Interspeech 2016, 2016.
[18] A. Öktem, M. Farrus, and L. Wanner, “Attentional parallel RNNs for generating punctuation in transcribed speech,” in 5th International Conference on Statistical Language and Speech Processing SLSP 2017, Le Mans, France, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-7
no discontinuities and artifacts can be reduced. This is translated into better quality speech as rated by human listeners.

The proposals were initially conceived to improve the speech obtained with a deep generative network able to model multiple speakers with the same structure. Nevertheless, the speaker-dependent normalization (see section 2.2) could be used as a new pre-processing technique in a variety of problems, and the look ahead approach (section 2.3) can be generalized to time series modeling. Current state-of-the-art TTS models like WaveNet [1], Tacotron [8] or VQ-VAE [9] already model several speakers with a single model, but do not apply speaker-dependent normalizations, an omission that is shown here to deteriorate results. The look ahead proposal is not mentioned in those works either, but it outperformed our baseline model and could be applied to other time series modeling systems.

In the next section, first the baseline system is presented. It consists of SampleRNN [4], extended to generate speech conditioned on acoustic features and speaker identity. Then, the speaker-dependent normalization and the look ahead are introduced. In section 3, the experimental setup is described. Section 4 presents the experimental results that show how both proposals outperform the baseline system.

Figure 1: Classical speaker-independent normalization

Figure 2: Proposed speaker-dependent normalization

2. Multi-speaker Network

2.1. Baseline

SampleRNN is an unconditional end-to-end neural audio generation model [4] that consists of two recurrent modules running at different clock rates, which aim to model the short and long term dependencies of speech signals, and one module with auto-regressive multi-layer perceptrons (MLPs) that processes speech sample by sample. The authors of SampleRNN reported that gated recurrent unit (GRU) [10] cells worked slightly better than long short-term memory (LSTM) ones, hence this is the recurrent architecture adopted for this work. The three-tier architecture provides flexibility in allocating the amount of computational resources for modeling different levels of abstraction and is very memory-efficient during training. The final output of the SampleRNN model is the probability of the current sample value conditioned on all the previous values of the sequence, which can be expressed following the chain rule of probability as stated in equation (1). This follows a Multinoulli distribution, which could be unintuitive given the nature of speech signals, which are real-valued, but achieves better results as it does not assume any distribution shape of the data and thus can more easily model arbitrary distributions. In this work, speech samples are quantized with 8 bits, having therefore 256 possible values. Differing from the linear quantization proposed in SampleRNN, we apply a $\mu$-law companding transformation [11] before classifying into the 256 possible classes to flatten the Laplacian-like distribution of the speech signals.

$$P(X) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}) \qquad (1)$$

In order to generate speech with coherent spoken content, the model was conditioned as in [3] with acoustic features obtained with Ahocoder [12], a high-quality harmonics-plus-noise vocoder that predicts a set of features that can characterize speech signals. The adapted model with its conditioning inputs is depicted in figure 3. This changes the previous equation, as a new dependence factor is included; the new formulation follows equation (2), where $l_t \in \mathbb{R}^{43}$ stands for a 43-dimensional acoustic vector corresponding to the analysis window of the current sample $x_t$.

$$P(X) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}, l_t) \qquad (2)$$

Differing from the original SampleRNN model [4], and apart from the previously mentioned addition of the acoustic conditioners that allow coherent speech to be synthesized, the authors also incorporated the blocks on the left of figure 3. These aim to differentiate among all the speakers of the database by means of embedding an identifier, which is also used to condition the model jointly with the aforementioned Ahocoder features. Hence $l_t$ is augmented to include the speaker identity by concatenating the embedding to the acoustic features, resulting in a vector $\hat{l}_t \in \mathbb{R}^{49}$.

2.2. Speaker-dependent feature normalization

Features fed to a neural network are often normalized beforehand to control the magnitude of both the activations and the gradients in training. Under the hypothesis of having speaker-dependent features, an independent normalization for each of the speakers was proposed to isolate the speech features from the source. Maximum and minimum values for each of the parameters were found within the training partition, so it could happen that some features of the validation or test partitions exceed the bounds. The chosen normalization function was a simple feature scaling that follows equation (3), which bounds each of the features between 0 and 1. This approach could also be applied with other normalization functions like the z-score, i.e. statistical normalization. This last option was not tested before the writing of this paper due to the low improvement in results obtained with this modification alone (see table 1). Nevertheless, as can be seen in the same table, this approach outperforms the other models when combined with the look ahead approach (explained in section 2.3). Therefore, a statistical normalization could also be tested in future work.

$$\hat{x} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (3)$$
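The speaker-dependent scaling of equation (3) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `fit_speaker_bounds` and `normalize`, the speaker labels, and the toy feature values are hypothetical, and mapping a constant feature to 0 is an assumption made only to avoid a division by zero.

```python
def fit_speaker_bounds(training_frames):
    """Collect per-speaker, per-feature minima and maxima.

    training_frames: iterable of (speaker_id, feature_vector) pairs taken
    from the training partition only.
    Returns a dict {speaker_id: (mins, maxs)}.
    """
    bounds = {}
    for speaker, feats in training_frames:
        if speaker not in bounds:
            bounds[speaker] = (list(feats), list(feats))
            continue
        mins, maxs = bounds[speaker]
        for k, value in enumerate(feats):
            mins[k] = min(mins[k], value)
            maxs[k] = max(maxs[k], value)
    return bounds


def normalize(speaker, feats, bounds):
    """Apply equation (3) using the bounds of the given speaker."""
    mins, maxs = bounds[speaker]
    scaled = []
    for value, lo, hi in zip(feats, mins, maxs):
        span = hi - lo
        # Assumption: a feature that never varies in training is mapped to 0.
        scaled.append((value - lo) / span if span > 0 else 0.0)
    return scaled


if __name__ == "__main__":
    # Toy data: (pitch-like value, voicing-like value) for two speakers.
    train = [
        ("male", [100.0, 0.0]), ("male", [200.0, 1.0]),
        ("female", [200.0, 0.0]), ("female", [400.0, 1.0]),
    ]
    bounds = fit_speaker_bounds(train)
    print(normalize("male", [150.0, 0.5], bounds))
    print(normalize("female", [300.0, 0.5], bounds))
```

In this toy case both speakers end up with the same normalized value for the middle of their own range, which is the effect the speaker-dependent approach aims for. Frames outside the training range simply map outside [0, 1], as noted in the text.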
This proposal aims to give importance to the speaker identity, ideally allowing voice conversion without the need for a complex mapping of features. Inspiration came from the behavior of the pitch for every speaker, which is depicted for both speaker-independent and speaker-dependent normalizations in figures 1 and 2 respectively. These plots illustrate the evolution of the logarithmic fundamental frequency for four different speakers, two males and two females, who read the exact same text; the contours are thus very similar once normalized following a speaker-dependent approach. Note that there is some time shifting due to different durations of phonemes and pauses, but the signals are still very similar.

After the classical speaker-independent normalization (figure 1), it is very easy to distinguish between females (75, 76) and males (79, 80). This means that it would be impossible to perform voice conversion, because the network does not need the speaker identifier, this information being implicit in the features. This redundancy explains the futility of this input observed when trying to change the speaker identity at will. The behavior of the pitch once normalized by speaker is very similar if the intonation is comparable. Nevertheless, the other features that are fed to the network (see next section) resulted in very similar normalizations for both speaker-independent and speaker-dependent approaches.

2.3. Look ahead

In the modeling of non-real-time sequences, such as the generation of speech in a TTS system, the features that will be fed to the network are known beforehand. This means that, in contrast with a phone call where both ends are talking in real time, the features that will condition the sequence at future time steps are always known and thus can be used to better model the generated signal.

With this idea in mind, the causality that speech synthesizers inherited from the vocoders used in decoding is questioned, and both the current and future windows of features are fed to the network. This results in a larger model, because the number of features is doubled at each time step, but also achieves better quality without the need for additional types of features.

Note that the look ahead approach modifies the architecture because the upper right 1D-convolution block doubles its input size (the original value of 43 is crossed out in the figure and replaced by 86 to accept both the features of the current and future frames).

3. Experimental setup

In this section we characterize the experimental conditions used to evaluate the previous approaches. First we describe the speech data used to estimate the models. Then, the acoustic parameters, architecture and learning hyperparameters are outlined. Finally, the methodology used to evaluate the system is described.

3.1. Dataset

The speech dataset used in the experiments is formed by six Spanish voices from the TC-STAR project [13], where half of them are males and the other half are females. The database was unbalanced, with one of the female speakers having barely a quarter of the speech recording time of the others. Although some works such as [14] recommend balancing the data per speaker so that all speakers have approximately the same number of training samples, we chose to use all the available data per speaker to avoid restricting all of them to only 14 minutes of speech instead of an hour. The total duration of the whole dataset including the six speakers amounts to 5.25 hours, which we divide into 80% for training, 10% for validation and 10% for testing.

3.2. Feature Design and Hyper-parameters

The acoustic parameters are extracted with Ahocoder in frames of length 15 ms shifted every 5 ms, obtaining 40 Mel-frequency cepstral coefficients, the maximum voiced frequency (fv), the logarithmic F0 value and the voiced/unvoiced flag (uv). To tackle the discontinuity in the logF0 statistics in unvoiced signals, the extracted pitch is post-processed with a log-linear interpolation for the unvoiced segments following previous strategies [14].

All these features are then scaled following either the proposed speaker-dependent or the more classical speaker-independent normalization. The normalized features are then rearranged to match the speech sample dimensions used in training, and the speaker embedding is added as an independent input to the system, as mentioned earlier.

The learning strategy was to train each of the models derived from the previous proposals with mini-batch stochastic gradient descent (SGD) using a mini-batch size of 128 and minimizing the negative log-likelihood (NLL). The chosen optimizer is adaptive moment estimation (ADAM) [15], for its effectiveness in many problems and ease of use. It is an SGD algorithm with adaptive learning rate, with an initial value of $10^{-4}$, which we enhance for our task with an external rate controller known as a scheduler. This had two milestones, at epochs 15 and 35. At each of these milestones, the learning rate is scaled down by a factor of 0.1, which counteracts the sudden changes in the loss curve that show up in the first epochs. Weight normalization [16] is also used in the 1D-convolutional layers to speed up the convergence of the model.

3.3. Subjective evaluation

As this is a generative task that involves synthesized nuances in the speech that are difficult to evaluate with any objective metric, a mean opinion score (MOS) test is conducted. The MOS is a rating of the naturalness of the speech signal on an integer scale ranging from 1 to 5. The meaning of each scale value is Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1).

In total 4 systems are evaluated, combining both proposed improvements in all possible ways: (1) speaker-dependent normalization and look-ahead (Spk-D + LA); (2) speaker-independent normalization and look-ahead (Spk-Ind + LA); (3) speaker-dependent normalization (Spk-D); and (4) only speaker-independent normalization (Spk-Ind). To perform the test, 25 subjects were asked to rate each of the 4 proposed systems on a set of 8 test utterances, one per modelled speaker (4 males and 4 females). In total, each listener was prompted to rate 32 audio samples (4 systems times 8 utterances), and they could listen to the different systems as many times as required to compare and rate them. For each sentence, the transcription of the audio was provided to ease the listening, and the audios of the different systems synthesizing the same sentence were arranged side by side for comparison, in a random order per utterance (i.e. the system identity was hidden and mixed among the different utterances).
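The look ahead conditioning of Section 2.3 can be sketched as a simple stacking of each frame's features with those of the following frame, which turns 43-dimensional conditioning vectors into 86-dimensional ones. This is only an illustration under stated assumptions: the function name `add_look_ahead` is hypothetical, and pairing the last frame with itself is an assumption, since the text does not say how the final frame is handled.

```python
def add_look_ahead(frames):
    """Concatenate each frame's features with those of the next frame.

    frames: list of per-frame feature vectors (lists of floats).
    Returns a list of the same length whose vectors are twice as long.
    """
    if not frames:
        return []
    stacked = []
    for t, current in enumerate(frames):
        # Assumption: the last frame has no successor and is paired with itself.
        future = frames[t + 1] if t + 1 < len(frames) else frames[-1]
        stacked.append(list(current) + list(future))
    return stacked


if __name__ == "__main__":
    toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for vector in add_look_ahead(toy):
        print(vector)
    # With 43 features per frame, each stacked vector has 86 entries.
    print(len(add_look_ahead([[0.0] * 43 for _ in range(3)])[0]))
```

Because the acoustic features of a TTS utterance are known before synthesis starts, this stacking needs no change to the data itself, only to the input size of the conditioning layer.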
[Figure 3: the adapted model with its conditioning inputs. Block labels recovered from the figure: Speaker ID; Window 80 samples; Window 80 future samples; Tier 3; GRU (input size 1024, hidden size 1024); ConvT 1D (channels 1024→1024, kernel sizes 4 and 20); Sample-Level Module.]
7. References
[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu,
“WaveNet: A Generative Model for Raw Audio,” ICASSP,
IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, pp. 1–15, 2016. [Online]. Available:
http://arxiv.org/abs/1609.03499
[2] O. Barbany Mayor, “Multi-Speaker Neural Vocoder,” Bachelor’s
thesis, Universitat Politècnica de Catalunya, 2018.
[3] A. Bonafonte, S. Pascual, and G. Dorca, “Spanish statisti-
cal parametric speech synthesis using a neural vocoder,” Proc.
Interspeech, pp. 1998–2001, 2018.
[4] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,
A. Courville, and Y. Bengio, “SampleRNN: An Unconditional
End-to-End Neural Audio Generation Model,” ICLR, pp. 1–11,
2017. [Online]. Available: http://arxiv.org/abs/1612.07837
[5] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Unsupervised
speaker adaptation for DNN-based TTS synthesis,” ICASSP,
IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, pp. 4475–4479, 2015.
[6] S. Pascual and A. Bonafonte, “Multi-output RNN-LSTM for
multiple speaker speech synthesis and adaptation,” in 24th
European Signal Processing Conference, EUSIPCO 2016,
Budapest, Hungary, August 29 - September 2, 2016, 2016,
pp. 2325–2329. [Online]. Available: https://doi.org/10.1109/
EUSIPCO.2016.7760664
[7] T. Toda, L. H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu,
and J. Yamagishi, “The voice conversion challenge 2016,” Proc.
Interspeech, vol. 08-12-Sept, pp. 1632–1636, 2016.
[8] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J.
Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio,
Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A.
Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis
model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available:
http://arxiv.org/abs/1703.10135
[9] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural
Discrete Representation Learning,” in NIPS, 2017. [Online].
Available: http://arxiv.org/abs/1711.00937
[10] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau,
F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase
Representations using RNN Encoder-Decoder for Statistical
Machine Translation,” CoRR, 2014. [Online]. Available: http:
//arxiv.org/abs/1406.1078
[11] ITU-T. Recommendation G. 711, “Pulse Code Modulation (PCM)
of voice frequencies,” 1988.
[12] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics plus
Noise Model based Vocoder for Statistical Parametric Speech
Synthesis,” IEEE Journal on Selected Topics in Signal Processing,
vol. 8, no. 2, pp. 184–194, 2014.
[13] A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain,
H. V. D. Heuvel, H. Hain, X. S. Wang, and M. N. Garcia, “TC-
STAR : Specifications of Language Resources and Evaluation for
Speech Synthesis,” Proceedings of the Language Resources and
Evaluation Conference LREC06, pp. 311–314, 2006.
[14] S. Pascual de la Puente, “Deep learning applied to speech synthe-
sis,” Master’s thesis, Universitat Politècnica de Catalunya, 2016.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” CoRR, vol. abs/1412.6980, 2014. [Online].
Available: http://arxiv.org/abs/1412.6980
[16] T. Salimans and D. P. Kingma, “Weight normalization:
A simple reparameterization to accelerate training of deep
neural networks,” CoRR, vol. abs/1602.07868, 2016. [Online].
Available: http://arxiv.org/abs/1602.07868
10.21437/IberSPEECH.2018-8
3. Experimental framework
This section briefly describes both the text corpora used to train
the LMs and the testing data used in the experiments.
Table 2: Characteristics of the combined Language Models

        No. INV words   Training text size (M of words)   OOV words
CLM1    720,000         118                               2.5 %
CLM2    730,000         118                               2.5 %
CLM3    630,000         352                               2.8 %
CLM4    900,000         435                               2.3 %

readings and orally produced and read speeches. It consists of 30 files with an average duration of 3:50 minutes per recording and a total duration of approximately 115 minutes (about 2 hours).

2. Second Corpus: Speech in newscasts. A corpus with audio recordings of television newscasts from TVG (Televisión de Galicia). They present a mixture of spontaneous and planned or read speech, but with more contemporary themes and vocabulary than in the first corpus. It consists of 10 files with an average duration of 34 minutes per recording and a total duration of 340 minutes (5 hours and 40 minutes).

3. Third Corpus: Speech in TED Talks. A corpus with audio recordings from TED Talks in Galician [18]. They present planned speech that is not read, being of a spontaneous nature. It consists of 10 files with an average duration of 16 minutes per recording and a total duration of 163 minutes (2 hours and 43 minutes).

4. Experimental results

Three experiments were carried out to assess the impact of the different language models on ASR performance:
• Experiment 1: recognition using a single-pass decoding strategy using the 3-gram LMs of Table 2.
• Experiment 2: rescoring with 4-gram language models.
• Experiment 3: rescoring with RNNLMs.

4.1. Experiment 1

The mixture of models described in Section 3.1 has been tested on each of the corpora described in Section 3.2. The results can be seen in Table 3. The SLM column shows the best result obtained by a single LM, that is, without mixing LMs.

Table 3: Combined Language Models Results

                         SLM      CLM1     CLM2     CLM3     CLM4
First Corpus    WER%     21.02    17.61    17.51    17.55    18.14
                CI-95%   ± 1.84   ± 1.76   ± 1.71   ± 1.78   ± 1.77
Second Corpus   WER%     21.39    21.56    21.52    23.46    22.86
                CI-95%   ± 2.63   ± 2.58   ± 2.55   ± 2.78   ± 2.70
Third Corpus    WER%     19.07    18.77    18.68    19.48    19.18
                CI-95%   ± 2.24   ± 2.44   ± 2.57   ± 2.70   ± 2.81

Table 3 shows that for the first and second corpus it is not possible to reduce the average WER with any of the LM combinations, but it is possible to reduce the confidence interval of the results. Only for the third corpus was the average WER slightly reduced, improving the results of single LMs.

In view of these findings, it can be concluded that model combinations do not significantly reduce the averaged WER, but do lead to an improvement in the confidence interval. On average these combined models present more robust results than any of the single models.

It is interesting to compare the results obtained by CLM1 and CLM2. We recall that CLM1 is trained by combining all the texts, whereas in CLM2 the previously-trained models are mixed. Table 3 shows that the average WER values are lower for CLM2, and therefore it is better to combine previously-trained single language models.

4.2. Experiment 2

In this experiment a tetragram rescoring of the lattice obtained in the previous experiment is performed. Table 4 shows the average WER together with the 95% confidence interval of the rescoring results.

Table 4: Tetragram Language Models Rescoring Results

                         SLM      CLM1     CLM2     CLM3     CLM4
First Corpus    WER%     17.60    17.51    17.52    16.72    17.01
                CI-95%   ± 1.84   ± 1.75   ± 1.72   ± 1.76   ± 1.75
Second Corpus   WER%     22.80    21.46    21.40    22.64    21.79
                CI-95%   ± 2.65   ± 2.63   ± 2.61   ± 2.68   ± 2.74
Third Corpus    WER%     19.45    18.72    18.52    18.45    17.97
                CI-95%   ± 2.62   ± 2.46   ± 2.56   ± 2.68   ± 2.60
Average WER in
analysis corpora         19.95    19.23    19.14    19.27    18.92

The average WER and the confidence interval for the three corpora analyzed are reduced. In the first corpus the average WER is reduced by approximately 1% (expressed in absolute terms), obtaining a value of 16.72%. The reduction is similar in the second corpus, going from 22.80% to 21.40% WER. In the third corpus it goes from 19.45% to 17.97%.

It is also interesting to compare the average WER for the three corpora shown in Table 4. The lowest average value is obtained by CLM4, the model that combines the greatest number of single LMs, and the one with the fewest out-of-vocabulary (OOV) words. The table also shows that all the combinations of models obtain lower WER results than single models; that is, when applying the tetragram rescoring, the CLMs are clearly superior, being more robust in confidence interval and with a lower average WER.

4.3. Experiment 3

In this last experiment a rescoring with the RNNLM is applied to the lattices obtained by the decoding of experiment 2, that is, an RNNLM rescoring is applied to the lattice resulting from the tetragram rescoring. For this, RNNLMs have been trained using the same text as in the previous experiment. Table 5 shows the results.

In the first corpus all language models reduce the average WER obtained. An absolute reduction of up to 1.5% is achieved, reaching a value of 16.05% average WER. However, in the second and third corpus, applying the rescoring with RNNLMs does not reduce the WER obtained for all LMs.

The average WER in the three analyzed corpora (final row in Table 5) is slightly reduced compared to the values obtained in experiment 2, achieving a value of 18.83% when using the
CLM4. We can conclude that the best strategy to reduce WER has been to combine the language models that provided the best results in the first experiment. The combined models increase vocabulary size, provide more training texts and therefore reduce the OOV words, while also providing greater robustness against variation in speech.

Table 5: RNNLM Rescoring Results

                         SLM      CLM1     CLM2     CLM3     CLM4
First Corpus    WER%     16.05    16.82    16.63    16.52    16.57
                CI-95%   ± 1.69   ± 1.74   ± 1.76   ± 1.77   ± 1.80
Second Corpus   WER%     22.98    21.61    21.37    23.44    21.39
                CI-95%   ± 2.62   ± 2.67   ± 2.58   ± 2.83   ± 2.73
Third Corpus    WER%     19.27    19.07    18.73    18.82    18.52
                CI-95%   ± 2.94   ± 3.01   ± 2.95   ± 3.27   ± 2.95
Average WER in
analysis corpora         19.43    19.16    18.91    19.59    18.83

5. Discussion

This section offers a discussion of the WER results and the use of data in a modern system when working with a minority language.

5.2. Use of limited data in a modern system

To train the models that an ASR system uses, large corpora of audio (for the acoustic model) and text (for the language model) are necessary. In both cases, the type of data that is collected must be representative of the speech that one wants to recognize.

Obtaining a large corpus of audio recordings with their corresponding transcriptions, and which is representative of the speech, is not easy when working with minority or less well-resourced languages. Large databases with this information are simply not available. Although there may be TV or radio stations that broadcast in the language, it is difficult to obtain such audio material with accurate transcriptions, and it often lacks variety in terms of speech types.

This study has shown that one solution to deal with the shortage of resources in acoustic modeling is to use data from languages with similar phonetics. In our case, looking at Galician, the acoustic models have been trained using data from Spanish, multiplying the amount of data by 4. Spanish, being a widely spoken language, has the necessary resources to obtain correctly transcribed audio corpora. Therefore, in our acoustic modeling, approximately 70% of the data used is in Spanish (over 79 hours of Spanish speech). The other 30% corresponds to more than 30 hours of Galician.

For language models, a text corpus was created using data obtained from: 1) different magazines and newspapers published in the language; 2) downloading the information present
5.1. WER results in the Galician version of Wikipedia; 3) information obtained
The reduction of the average WER achieved in the first corpus from small text corpora. In order to increase the amount of data
is greater than that achieved in the second and third. Such a dif- available, another solution was to obtain a large corpus of data
ference in behavior between the first corpus and the other two in another language, and translate it into Galician using a free
may be due to the different character of the linguistic samples automatic translation tool.
[11]. The first corpus is composed mainly of read texts (writ-
ten language), while the second corpus presents a high num- 6. Conclusions
ber of speakers, a heterogeneous mixture of speech types (read
The results obtained for Galician ASR are promising. Im-
language, statements by different speakers, situations including
proving the training text of the language models and applying
noise, music, a mixture of Galician and Spanish, among others).
RNNLM in decoding resulted in reducing the average WER ob-
To see how such a heterogeneous mixture of speech affects
tained. However, it has also been shown that increasing the
the results, all interviews were removed from the recordings,
complexity of the system leads to more training data. The strate-
leaving only the speech of presenters and reporters. The results
gies applied to work with minority and less well- resourced lan-
show an absolute reduction of more than 7% compared to the
guages have also contributed to the positive results in recogni-
best case for this corpus, obtaining an average WER of 13.88%.
tion.
Therefore, the speech type of this corpus clearly does affect the
results here. As a future line of research, we plan to improve the acoustic
A detailed analysis of the recognition errors can lead to a and language models of the ASR, as well as to use more efficient
further reduction of the average WER. Yet it must be taken into algorithms in the decoding stage.
account that some of the errors, at least in the oral corpus (not
read), must be assumed to be inevitable. These errors reflect 7. Acknowledgements
the doubts and errors in speakers’ pronunciation, deviations in
This work has received financial support from the Spanish
forms, etc. They might appear as errors in the transcription on
Ministerio de Economı́a y Competitividad through project
which the calculation of the WER is based, but in which the
’TraceThem’ (TEC2015-65345-P), from the Xunta de Galicia
recognition is in fact successful.
(Agrupación Estratéxica Consolidada de Galicia accreditation
Finally, in order to check how far we can get with the cur- 2016-2019, Galician Research Network TecAnDaLi ED431D
rent training data, a new RNNLM model was trained, introduc- 2016/011) and the European Union (European Regional Devel-
ing the transcripts of the analyzed corpora into the development opment Fund – ERDF). Our gratitude to the Ramon Piñeiro
text. With this, we sought to model the network so that it was Institute of the Xunta de Galicia for allowing the use of the
specifically prepared to recognize the corpus on which it was CORGA material and for its collaboration in the labeling of the
going to be tested. Of course, applying this technique is only second and third corpora.
possible when the correct transcripts are available. The results
show an absolute WER reduction of 0.2%, that is, a small and
not significant improvement. This result leads us to conclude 8. References
that it is difficult to continue reducing the WER with this train- [1] G. Hinton, L. Deng, D. Yu, and Y. Wang. 2012. Deep Neural Net-
ing data and with these algorithms. works for Acoustic Modeling in Speech Recognition. IEEE Signal
38
Processing Magazine, vol. 9, no. 3, pp. 82-97.
[2] J. T. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, vol. 15, no. 4, pages 403–434.
[3] D. Jurafsky and J.H. Martin. 2008. Speech and Language Processing: An Introduction to Language Processing, Computational Linguistics, and Speech Recognition.
[4] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Černocký. 2011. RNNLM - Recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196-201.
[5] E. Arisoy, T.N. Sainath, B. Kingsbury, and B. Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 20-28, Stroudsburg, PA, USA. Association for Computational Linguistics.
[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.
[7] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
[8] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur. 2018. A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition. In ICASSP.
[9] M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney. 2014. Lattice decoding and rescoring with long-span neural network language models. In Fifteenth Annual Conference of the International Speech Communication Association.
[10] X. Chen, X. Liu, A. Ragni, Y. Wang and M. Gales. 2017. Future word contexts in neural network language models. arXiv preprint arXiv:1708.05592.
[11] A. Piñeiro, C. García and L. Docío. 2018. Estudio sobre el impacto del corpus de entrenamiento del modelo de lenguaje en las prestaciones de un reconocedor de habla. Sociedad Española para el Procesamiento del Lenguaje Natural.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý. 2011. The Kaldi Speech Recognition Toolkit. In ASRU.
[13] V. Peddinti, D. Povey and S. Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of INTERSPEECH 2015.
[14] L. Docío, A. Cardenal and C. García. 2006. TC-STAR 2006 automatic speech recognition evaluation: The uvigo system. In Proc. of TC-STAR Workshop on Speech-to-Speech Translation, ELRA, Paris, France.
[15] C. García, J. Tirado, L. Docío and A. Cardenal. 2004. Transcrigal: A bilingual system for automatic indexing of broadcast news. In IV International Conference on Language Resources and Evaluation.
[16] A. Stolcke. 2002. SRILM – An extensible language modeling toolkit. Proceedings of the International Conference on Statistical Language Processing, Denver, Colorado.
[17] I. Alegría, I. Arantzabal, M. Forcada, X. Gómez, L. Padró, J.R. Pichel, and J. Waliño. 2006. OpenTrad: Traducción automática de código abierto para las lenguas del estado Español. Procesamiento del Lenguaje Natural.
[18] TEDxGalicia. x=independent organized TED event. http://www.tedxgalicia.com/
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
40 10.21437/IberSPEECH.2018-9
appropriately correlated with each other, comprehension of the message is positively affected (cf., e.g., [5] for German and [15] for Catalan). Therefore, there is reason to assume that a conversational application considering the notions of content packaging by means of the relation between thematicity and prosody will benefit from the same advantages as in natural conversation environments. Most of all, conversational avatars in applications for children in educational settings [16], applications for those with special needs [1] as well as for the elderly [2] and, in particular, for those with cognitive impairments [17], would greatly benefit from such a communicatively-oriented improvement.

State-of-the-art conversational applications, in particular TTS systems, do not yet include communicative information. The task is not trivial. It involves, in the first place, a communicative theoretical model, automatic tools to parse the Information Structure of a text, and, last but not least, a generative model of the related prosodic contour. Some preliminary attempts to include thematicity in TTS applications were made in the past. Consider, for instance, Steedman's work [6] on the correlation of theme and rheme to rising and falling intonation patterns, which was tested in the Festival speech synthesizer [18], and the creation of dedicated tags in MaryTTS [19] for the notions of givenness and contrast [20]. However, these attempts have a major shortcoming in that they use a flat binary thematicity structure, which does not suffice to describe the complexity of content packaging, especially in relation to prosody. Our previous studies (see, e.g., [21]) suggest that hierarchical thematicity based on propositions as described by Mel'čuk [8] constructs a more versatile scaffolding for communicative modeling of computer interaction with humans. As already mentioned above, Mel'čuk's methodological approach has also proved instrumental in natural language generation applications [22, 23].

In contrast to IS models that propose a partition of sentences into a theme and a rheme, Mel'čuk [8] argues in the context of the Meaning–Text Theory for a tripartite hierarchical division ('theme', 'rheme', and 'specifier' – the element which sets the utterance's context) within propositions that further permits embeddedness of communicative spans; consider (1) for illustration of hierarchical thematicity (annotated following the guidelines established in [24]) of the sentence Ever since, the remaining members have been desperate for the United States to rejoin this dreadful group. A total of five partitions are identified, including three spans at level 1 (L1), a specifier (SP1), theme (T1) and rheme (R1), and two embedded spans at level 2 (L2)² in the rheme, a theme (T1(R1)) and a rheme (R1(R1)).³

(1) [Ever since,]SP1 [the remaining members]T1 [have been desperate [for the United States]T1(R1) [to rejoin this dreadful group.]R1(R1)]R1

² Levels are connected to the concept of embeddedness of spans: for instance, a main theme (T1 at L1) may be subdivided into further thematicity spans, which will belong to L2 thematicity.
³ As more than one thematicity span may exist within the same proposition, abbreviations include a number (e.g., 'SP1') that indicates the number of occurrences at each level (e.g., 'SP2' would be the second specifier in a specific thematicity level).

A hierarchical thematicity structure of this kind has been shown to correlate better with ToBI labels than binary flat thematicity [9]. Such a correlation still does not solve the problem of a one-to-one mapping between a specific intonation label (e.g., H*) and a static acoustic parameter (e.g., an increase of 50% in fundamental frequency). A more varied range of prosodic cues based on the analysis of the available corpus of read speech annotated with hierarchical thematicity (in the NLG module) and prosody has been shown to yield an improvement in the perception of expressiveness of the synthesized speech [4]. However, there is still no application that can provide an automatic derivation of thematicity-based prosody cues for raw texts that arrive at the TTS application, such as the implementation proposed in this paper.

3. Methodology

This paper proposes an approach that tests the formal representation of information (or communicative) structure proposed by Mel'čuk [8] and its correspondence to prosody in the context of a concept-to-speech (CTS) application, where text coming from a web-retrieved service is input to the TTS engine.

3.1. Objectives

Our work envisages the study of the IS–prosody interface from a methodological perspective based on a speech synthesis implementation setup. The proposed methodology has the following underlying goals:

• to provide automatic tools to investigate the effect of thematicity–prosody correspondence in human-machine interaction contexts;
• to explore the advantages and limitations of a thematicity-based prosody enrichment in speech synthesis;
• to provide a preliminary scaffolding to incrementally add other communicative dimensions, registers and languages.

Such a methodology addresses two main research issues in this field: (i) the lack of implementation settings of the IS–prosody correspondence and (ii) testing of the integration of the IS–prosody interface in computational settings.

3.2. Pipeline

The proposed pipeline sketched in Figure 1 includes four modules:

1. Tokenizer. It splits the text into sentences and words. Punctuation marks are also tokenized as required to serve as input for the syntactic parser.
2. Syntactic parser. The parser by Bohnet [25, 26] is used. This parser is trained on the TIGER Penn Treebank [27] and outputs a fourteen-columned CONLL file.
3. Communicative parser. This rule-based system derives thematicity labels from syntactic structure. It outputs a CONLL file with an added column for communicative structure (i.e., the output CONLL has fifteen columns). For now, it only derives hierarchical thematicity labels.
4. SSML prosody converter. It converts the thematicity spans derived by the communicative parser to SSML spans and assigns a variety of prosody tags to each span. This module is based on the tool presented in [28].

The use of the Speech Synthesis Markup Language (SSML) [29] convention for prosody enrichment, as proposed in [28], facilitates the integration of the methodology proposed in this paper within the context of TTS applications.
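The bracketed notation of example (1) can be unpacked into a small tree of spans, which is the kind of structure the SSML prosody converter needs as input. The following is only a minimal sketch, not the authors' parser: the `Span` class, the `parse_thematicity` function and the label pattern are hypothetical names introduced here for illustration, and the sketch assumes labels of the form `SP1`, `T1`, `R1` or `T1(R1)` written directly after each closing bracket.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Span:
    label: str                      # e.g. "SP1", "T1", "R1", "T1(R1)"
    text: str                       # plain text covered by the span
    level: int                      # 1 for L1 spans, 2 for spans embedded in an L1 span
    children: list = field(default_factory=list)

# Assumed label shape: letters + digits, optionally followed by a parenthesised parent label.
LABEL_RE = re.compile(r"[A-Z]+\d+(?:\([A-Z]+\d+\))?")

def parse_thematicity(annotated):
    """Parse '[text]LABEL' bracket notation into a list of top-level Span objects."""
    top_level = []
    stack = []                      # one (children, text_parts) pair per open bracket
    i = 0
    while i < len(annotated):
        ch = annotated[i]
        if ch == "[":
            stack.append(([], []))
            i += 1
        elif ch == "]":
            if not stack:
                raise ValueError("closing bracket without opening bracket")
            children, parts = stack.pop()
            match = LABEL_RE.match(annotated, i + 1)
            if match is None:
                raise ValueError("missing thematicity label after ']'")
            text = " ".join("".join(parts).split())   # normalise whitespace
            span = Span(match.group(0), text, len(stack) + 1, children)
            if stack:
                stack[-1][0].append(span)
                stack[-1][1].append(text)             # parent text includes the child text
            else:
                top_level.append(span)
            i = match.end()
        else:
            if stack:                                 # characters between spans are ignored
                stack[-1][1].append(ch)
            i += 1
    if stack:
        raise ValueError("opening bracket without closing bracket")
    return top_level

example = ("[Ever since,]SP1 [the remaining members]T1 [have been desperate "
           "[for the United States]T1(R1) [to rejoin this dreadful group.]R1(R1)]R1")
for span in parse_thematicity(example):
    print(span.level, span.label, "|", span.text)
    for child in span.children:
        print("  ", child.level, child.label, "|", child.text)
```

Applied to example (1), the sketch yields three L1 spans and two L2 spans nested inside the rheme, matching the five partitions described in the text.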
Figure 1: Communicative generation pipeline.

Figure 2: Example of the output of the communicative parser in CONLL format annotated with thematicity.

3.3. The Communicative Parser

The main contribution of this paper is a rule-based communicative parser for texts in German. In what follows, we sketch the core functions and algorithm of the parser.⁴

⁴ The code of both the communicative parser and the thematicity to SSML module is available in the following repository under a GNU v.3 licence: https://github.com/TalnUPF/KRISTthem2prosModule

The parser is implemented as a Python script that requires a CONLL file with the part-of-speech (POS) and dependency syntax analysis per token in each sentence. Clauses are the main syntactic cue to detect propositions. Thus, the main algorithm loops over the POS and dependency relation columns to identify complex and coordinated clauses in the first place and label propositions. Then, thematicity is labeled focusing on the detection of specifiers and themes, which usually have as syntactic correlates frontal modifiers and subjects respectively.

Several functions have been scripted to find propositions and thematicity spans and to annotate them following the guidelines established in [24]. Those guidelines establish the convention of using square brackets ("[" and "]") to mark the beginning and end of a thematicity span and curly braces ("{" and "}") to signal the beginning and end of a proposition. The resulting output is a CONLL file that has one column at the end with the annotation of thematicity (cf. Figure 2).

4. Experimental Setup

A working corpus has been created from web-retrieved text in German on advice for sleeping routines and local news. The corpus contains eight texts with a total of 1,418 words. In what follows, we present the experimental setup with respect to the prosody enrichment procedure and the IS–prosody correspondence.

The open source software MaryTTS⁵ [19] was used for the implementation. The default synthesized speech output has been enriched using MaryXML prosody specifications⁶, which follow the SSML recommendation⁷.

⁵ Available at http://mary.dfki.de/
⁶ http://mary.dfki.de/documentation/maryxml/index.html
⁷ https://www.w3.org/TR/speech-synthesis/

The SSML prosody tags allow control of six optional attributes (overall pitch, pitch contour, pitch range, speech rate, duration and volume). These attributes can be modified independently or in combination. For our implementation, overall pitch and speech rate were chosen individually and in combination. Absolute values (e.g., '+50Hz' for increasing F0 by a specific amount of hertz (Hz)) and relative values can be used to apply the modification. An example of an SSML prosody tag for modification of two prosodic elements is presented below:

Example (1)
<prosody rate="-10%" pitch="+20%">text to be modified </prosody>

Moreover, the SSML boundary tag that controls the introduction of pauses at a specific location was also used after each thematicity span. The duration of the break is specified in milliseconds (ms). An example of an SSML boundary tag is introduced below:

Example (2)
Text before the break <boundary duration="100"/>text after the break.

The correspondence between thematicity and prosody is presented as variations from referent prosody tag values involving fundamental frequency (F0) and speech rate (SR) over thematicity spans (cf. Table 1). We propose testing a varied range of values generated automatically, against a manual implementation following the findings in [10, 4], where a variety of prosodic cues for each thematicity span is presented based on corpus analysis.

         F0      SR
T1       +15%    -15%
R1       +10%    +10%
SP1      +20%    -10%
P        +15%    -10%

Table 1: Referent prosody tag values for L1 thematicity.

Table 1 shows the referent modification for theme (T1), rheme (R1) and specifier (SP1) spans within L1 thematicity. Propositions are defined as clauses that contain a finite verb and they are the referent units for thematicity segmentation. They can include L1 and L2 spans and embrace under one communicative label different types of syntactic relationships, for example coordination, juxtaposition and subordination. The referent values assigned to each span are chosen randomly within a range of plus or minus 5 points in each new sentence. Thus, even though the annotation of thematicity in this experiment is restricted to the sentence domain, an automatic variation is envisaged to generate a different range of prosodic parameters across sentences.
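The prosody enrichment step described in this section can be illustrated with a short sketch that wraps each thematicity span in a prosody tag, varies the referent values of Table 1 by up to 5 points per sentence, and appends a boundary tag after every span. This is a sketch under stated assumptions, not the published module: the function names `jitter` and `sentence_to_ssml` are hypothetical, the 100 ms pause is simply the value used in Example (2), and uniform integer sampling is one possible reading of "chosen randomly".

```python
import random
from xml.sax.saxutils import escape

# Referent (F0, speech rate) modifications in percent for L1 spans, copied from Table 1.
REFERENT = {"T1": (15, -15), "R1": (10, 10), "SP1": (20, -10)}

def jitter(value, rng, spread=5):
    """Return a value drawn at random within +/- `spread` points of the referent value."""
    return value + rng.randint(-spread, spread)

def sentence_to_ssml(spans, rng, pause_ms=100):
    """Build markup for one sentence.

    spans: list of (label, text) pairs in reading order, e.g. [("SP1", "Ever since,")].
    pause_ms: assumed pause length after each span (value taken from Example (2)).
    """
    parts = []
    for label, text in spans:
        f0, rate = REFERENT[label]
        pitch_value = jitter(f0, rng)
        rate_value = jitter(rate, rng)
        parts.append(
            f'<prosody rate="{rate_value:+d}%" pitch="{pitch_value:+d}%">'
            f"{escape(text)}</prosody>"
            f'<boundary duration="{pause_ms}"/>'
        )
    return "".join(parts)

rng = random.Random(2018)  # fixed seed only to make the illustration repeatable
print(sentence_to_ssml([("SP1", "Ever since,"), ("T1", "the remaining members")], rng))
```

Calling the function once per sentence with the same random generator produces different prosodic parameters across sentences while keeping each span within the range around its referent value.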
5. Evaluation

For the evaluation of automatic assignment of thematicity-based prosody, a selection of newspaper articles in German has been made. From those articles, sentences with different communicative structures have been selected for the perception test, as detailed below.

For the evaluation of the thematicity-based prosody enrichment module, expressiveness was assessed by means of a perception test using: (1) a Mean Opinion Score (MOS) with a 5-point Likert scale: 1-bad, 2-poor, 3-fair, 4-good, and 5-excellent; and (2) a pairwise comparison. Seventeen participants took part in the evaluation, all of them either native speakers of German or proficient speakers. The test was conducted fully in German and participants were informed that our goal was to investigate whether synthesized speech was perceived as better expressing the communicative content of the sentence taking into account prosodic variability. Six sentences were included in the perception test, representative of different complexity in syntax and communicative structure:

S1 Warme Fuß- und Vollbäder direkt vor dem Schlafengehen fördern den Nachtschlaf.
S2 Der Begriff der Schlafhygiene bezeichnet Verhaltensweisen, die einen gesunden Schlaf fördern.
S3 Dafür sorgen, dass das Schlafzimmer ruhig und dunkel ist und eine angenehme Temperatur hat.
S4 Landrat Thomas Reumann schlägt vor, den Finanzierungsantrag zu stellen, will aber erst im Haushalt 2018 Gelder einstellen.
S5 Das funktioniert nur, wenn alle mitmachen.
S6 Im übrigen betonte er, dass der Landkreis nicht allein sei, sondern Städte und Gemeinden als Partner habe, die den Beschluss mittragen müssten.

Three samples of each sentence were included in the MOS test: (1) the default TTS output (DEF); (2) automatic thematicity-based modifications (AUT) and (3) manual thematicity-based prosody modifications (MAN). The pairwise comparison included the default TTS output versus the automatic thematicity-based prosody modification. A total of fifty-one answers are considered in the evaluation. Table 2 shows results of the MOS test. In all cases, the best scoring sample is the thematicity-based prosody modification (either manual or automatic). This supports the initial hypothesis that thematicity-based prosody modifications are perceived as more expressive. In sentences 2, 3, 5 and 6 the best scoring option is the automatic version, whereas sentences 1 and 4 score best for the manual version of the modification. These results are in line with the pairwise comparison shown in Table 3, where all choices go for the thematicity-based modification except for sentence 4.

         S1      S2      S3      S4      S5      S6      Average
DEF      2.65    3.35    3.12    2.71    3.18    3.06    3.01
AUT      3.00    3.53    3.41    2.65    3.53    3.71    3.30
MAN      3.12    2.88    3.06    3.00    2.71    2.88    2.94

Table 2: Results of the MOS test

         S1      S2      S3      S4      S5      S6      Average
DEF      47%     29%     29%     71%     41%     47%     44%
AUT      53%     71%     71%     29%     59%     53%     56%

Table 3: Results from the pairwise comparison for the default and automatic modification

6. Conclusions

Given the relevant role of the Information Structure–prosody interface in human communication, it seems reasonable that next-generation conversational agents face new challenges in adopting communicatively-oriented models. Current speech technologies have been oblivious to advances in theoretical fields studying this correlation, basically due to the lack of a formal representation of the communicative (or information) structure and limited capabilities of prosody enrichment standards to achieve variability in implementation settings.

The present study provides a methodology for a more versatile integration of the IS–prosody interface in TTS for reading aloud applications. Such a methodology contributes in several aspects to the state of the art: (i) a formal description of hierarchical thematicity is used; (ii) a communicative parser that derives thematicity labels is introduced; and (iii) the prosodic cues are automatically derived and tested in a TTS application. All in all, this study pivots the transition from theoretical work on the IS–prosody interface to the integration of thematicity-based prosody enrichment to achieve more expressive synthesized speech.

A limitation of the current study is that it only considers relative acoustic parameters over rather large text segments. Key aspects of prosody modeling, like F0 contour generation in terms of prominence and phrasing, remain to be looked into. Future work is, furthermore, aimed at exploring other dimensions of communicative structure like emphasis and foregroundedness within the framework that has been proposed in this paper.

7. Acknowledgements

This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. It has been also partly supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502). The third author is partially funded by the Ramón y Cajal program.
8. References [16] D. Prez-Marn and I. Pascual-Nieto, “An exploratory study on
how children interact with pedagogic conversational agents,” Be-
[1] B. L. Mencı́a, D. D. Pardo, A. H. Trapote, and L. A. H. Gómez, haviour & Information Technology, vol. 32, no. 9, pp. 955–964,
“Embodied Conversational Agents in Interactive Applications for 2013.
Children with Special Educational Needs,” in Technologies for In-
clusive Education: Beyond Traditional Integration Approaches, [17] P. Wargnier, G. Carletti, Y. Laurent-Corniquet, S. Benveniste,
D. Griol Barres, Z. Callejas Carrión, and R. L.-C. Delgado, Eds. P. Jouvelot, and A. S. Rigaud, “Field evaluation with cognitively-
Hershey, USA: IGI Global, 2013, pp. 59–88. impaired older adults of attention management in the Embodied
Conversational Agent Louise,” in IEEE International Conference
[2] A. Ortiz, M. del Puy Carretero, D. Oyarzun, J. J. Yanguas, on Serious Games and Applications for Health, SeGAH 2016, Or-
C. Buiza, M. F. Gonzalez, and I. Etxeberria, Elderly Users in lando, Florida, USA, 2016, pp. 1–8.
Ambient Intelligence: Does an Avatar Improve the Interaction?
Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 99– [18] A. W. Black and P. A. Taylor, “The Festival Speech Syn-
114. thesis System: System documentation,” Human Commun-
ciation Research Centre, University of Edinburgh, Scot-
[3] L. Wanner, E. André, J. Blat, S. Dasiopoulou, M. Farrús, land, UK, Tech. Rep. HCRC/TR-83, 1997, avaliable at
T. Fraga, E. Kamateri, F. Lingenfelser, G. Llorach, O. Martı́nez, http://www.cstr.ed.ac.uk/projects/festival.html.
G. Meditskos, S. Mille, W. Minker, L. Pragst, D. Schiller,
A. Stam, L. Stellingwerff, F. Sukno, B. Vieru, and S. Vrochidis, [19] M. Schröder and J. Trouvain, “The German Text-to-Speech
“KRISTINA: A Knowledge-Based Virtual Conversation Agent,” Synthesis System MARY: A Tool for Research, Development and
in Proceedings of the 15th International Conference on Practi- Teaching,” International Journal of Speech Technology, vol. 6,
cal Applications of Agents and Multi-Agent Systems (PAAMS), no. 4, pp. 365–377, 2003. [Online]. Available: http://mary.dfki.de
Oporto, Portugal, 2017. [20] F. Kügler, B. Smolibocki, and M. Stede, “Evaluation of informa-
[4] M. Domı́nguez, M. Farrús, and L. Wanner, “Thematicity-based tion structure in speech synthesis : The case of product recom-
Prosody Enrichment for Text-to-Speech Applications,” in Pro- mender systems perception,” in ITG Conference on Speech Com-
ceedings of the 9th International Conference on Speech Prosody munication, IEEE, 2012, pp. 26–29.
2018 (SP2018), Poznań, Poland, 2018. [21] M. Domı́nguez, M. Farrús, and L. Wanner, “Combining acous-
tic and linguistic features in phrase-oriented prosody prediction,”
[5] D. Meurers, R. Ziai, N. Ott, and J. Kopp, “Evaluating Answers to
in Proceedings of the 8th International Conference on Speech
Reading Comprehension Questions in Context: Results for Ger-
Prosody, Boston, USA, 2016, pp. 796–800.
man and the Role of Information Structure,” in Proceedings of the
TextInfer 2011 Workshop on Textual Entailment, ser. TIWTE ’11. [22] L. Wanner, B. Bohnet, and M. Giereth, “Deriving the Commu-
Stroudsburg, PA, USA: Association for Computational Linguis- nicative Structure in Applied NLG,” in Proceedings of the 9th Eu-
tics, 2011, pp. 1–9. ropean Workshop on Natural Language Generation at the Bian-
nual Meeting of the European Chapter of the Association for
[6] M. Steedman, “Information structure and the syntax-phonology Computational Linguistics, 2003, pp. 100–104.
interface,” Linguistic inquiry, vol. 31, no. 4, pp. 649–689, Fall
2000. [23] M. Ballesteros, B. Bohnet, S. Mille, and L. Wanner, “Data-driven
sentence generation with non-isomorphic trees,” in Proceedings
[7] M. Haji-Abdolhosseini and S. Müller, “Constraint-Based Ap- of the Annual Conference of the North American Association
proach to Information Structure and Prosody Correspondence,” for Computational Linguistics – Human Language Technologies
in Proceedings of the 10th International Conference on Head- (NAACL – HLT), 2015.
Driven Phrase Structure Grammar. CSLI Publications, 2003,
pp. 143–162. [24] B. Bohnet, A. Burga, and L. Wanner, “Towards the annotation of
penn treebank with information structure,” in Proceedings of the
[8] I. A. Mel’čuk, Communicative Organization in Natural Lan- Sixth International Joint Conference on Natural Language Pro-
guage: The semantic-communicative structure of sentences. Am- cessing, Nagoya, Japan, 2013, pp. 1250–1256.
sterdam, Philadephia: Benjamins, 2001.
[25] B. Bohnet and J. Nivre, “A Transition-Based System for Joint
[9] M. Domı́nguez, M. Farrús, A. Burga, and L. Wanner, “Using hier- Part-of-Speech Tagging and Labeled Non-Projective Dependency
archical information structure for prosody prediction in content- Parsing,” in Proceedings of the 2012 Joint Conference on Empiri-
to-speech applications,” in Proceedings of the 8th International cal Methods in Natural Language Processing and Computational
Conference on Speech Prosody, Boston, USA, 2016, pp. 1019– Natural Language Learning (EMNLP-CoNLL ’12), Jeju Island,
1023. Korea, 2012, pp. 1455–1465.
[10] M. Domı́nguez, M. Farrús, and L. Wanner, “Compilation of cor- [26] ——, “he Best of BothWorlds – A Graph-based Completion
pora to study the information structureprosody interface,” in 11th Model for Transition-based Parsers,” in Proceedings of the 13th
edition of the Language Resources and Evaluation Conference Conference of the European Chapter of the Association for Com-
(LREC2018), Mijazaki, Japan, 2018. putational Linguistics (EACL), Avignon, France, 2012, pp. 77–87.
[11] M. Halliday, “Notes on Transitivity and Theme in English, Parts [27] S. Brants, S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lez-
1-3,” Journal of Linguistics, vol. 3, no. 1, pp. 37–81, 1967. ius, C. Rohrer, G. Smith, and H. Uszkoreit, “TIGER: Linguis-
tic Interpretation of a German Corpus,” Journal of Language and
[12] R. Schwarzschild, “Givenness, avoidf and other constraints on the
Computation, no. 2, pp. 597–620, 2004.
placement of accent,” Natural Language Semantics, vol. 7, no. 1,
pp. 141–177, 1999. [28] M. Domı́nguez, M. Farrús, and L. Wanner, “A thematicity-based
prosody enrichment tool for cts,” in Proceedings of the 18th An-
[13] E. Hajičová, B. Partee, and P. Sgall, Topic-Focus Articulation, Tri- nual Conference of the International Speech Communication As-
partite Structures, and Semantic Content. Kluwer Academic sociation (INTERSPEECH 2017), Stockholm, Sweden, 2017, pp.
Publishers, Dordrecht, 1998. 3421–2.
[14] H. H. Clark and S. E. Haviland, “Comprehension and the given- [29] P. Taylor and A. Isard, “SSML: A Speech Synthesis Markup Lan-
new contract,” Discourse production and comprehension. Dis- guage,” Speech Communication, vol. 21, no. 1-2, pp. 123–133,
course processes: Advances in research and theory, vol. 1, pp. February 1997.
1–40, 1977.
[15] M. Vanrell, I. Mascaró, F. Torres-Tamarit, and P. Prieto, “Intona-
tion as an Encoder of Speaker Certainty: Information and Con-
firmation Yes-No Questions in Catalan,” Language and Speech,
vol. 56, no. 2, pp. 163–190, 2013.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-10
Abstract
As Automatic Speaker Verification (ASV) becomes more popular, so do the ways impostors can gain illegal access to speech-based biometric systems. For instance, impostors can use Text-to-Speech (TTS) and Voice Conversion (VC) techniques to generate speech acoustics resembling the voice of a genuine user and, hence, gain fraudulent access to the system. To prevent this, a number of anti-spoofing countermeasures have been developed for detecting these high technology attacks. However, the detection of previously unforeseen spoofing attacks remains challenging. To address this issue, in this work we perform an extensive empirical investigation on the speech features and back-end classifiers providing the best overall performance for an anti-spoofing system based on a deep learning framework. In this architecture, a deep neural network is used to extract a single identity spoofing vector per utterance from the speech features. Then, the extracted vectors are passed to a classifier in order to make the final detection decision. Experimental evaluation is carried out on the standard ASVspoof 2015 data corpus. The results show that classical FBANK features and Linear Discriminant Analysis (LDA) obtain the best performance for the proposed system.
Index Terms: Automatic speaker verification, spoofing detection, deep neural networks, deep features, classifier.
1. Introduction
Automatic Speaker Verification (ASV) aims to authenticate the identity claimed by a given individual [1]. However, most ASV systems are vulnerable to spoofing attacks, in which an impostor tries to gain fraudulent access to the system by presenting to the ASV system speech acoustics resembling the voice of a genuine user. Four types of spoofing attacks have been identified [2]: (i) replay (i.e. using pre-recorded voice of the target user), (ii) impersonation (i.e. mimicking the voice of the target user), and also either (iii) text-to-speech synthesis (TTS) or (iv) voice conversion (VC) systems to generate artificial speech resembling the voice of a legitimate user. The aim of this work is to develop robust anti-spoofing countermeasures for either VC or TTS based attacks.
The performance of anti-spoofing systems can vary significantly depending on the voice features used to feed them. Due to this, voice features have attracted the attention of a number of researchers [8, 9, 10]. However, anti-spoofing systems based on neural networks usually use classical voice features, such as FBANKs, and to the best of our knowledge, the new popular CQCC features have not been employed yet to feed these types of systems.
In the last years, deep feature extraction has been explored to obtain more discriminative and effective features for spoofing detection [6, 7, 11]. This technique consists of employing deep neural networks in the front-end of the anti-spoofing system which are fed by speech features, so that the deep features extracted by the neural network are passed to a classifier in order to make the final detection decision (genuine or spoof). The core idea is to take advantage of the nonlinear modeling and discriminative capabilities of deep neural networks, which have been shown to be suitable for feature engineering [3], not only for spoofing detection, but also for speech recognition [4], speaker recognition [3], and speech synthesis [5].
In this work, we compare the performance of different features and back-ends in an anti-spoofing system which extracts deep features [6] in order to detect VC and TTS attacks. This anti-spoofing system employs a convolutional neural network (CNN) plus a recurrent neural network (RNN) and gets a single spoofing identity representation per utterance. Although a similar comparison has already been studied in [7], our study presents three important differences: (1) our anti-spoofing system employs a CNN to extract convolutional features at the speech frame level, (2) we compare the performance of classical features, such as FBANKs and MFCCs, with the performance of the recent popular CQCC features [8], and (3) we combine different features and classifiers in order to find the combination which offers the best performance.
This paper is organized as follows. Section 2 describes the features and back-ends we are going to compare in a CNN + RNN anti-spoofing system. Then, in Section 3, we outline the speech corpora, the network training, and the performance evaluation details. Section 4 discusses the results of the different features and back-ends in the deep neural network based anti-spoofing system. Finally, we present the conclusions derived from this research in Section 5.
2. System description
This section is devoted to the description of the anti-spoofing system. First, Section 2.1 describes different voice features: FBANK, MFCC and CQCC. The neural network architecture for deep feature extraction is detailed in Section 2.2. Furthermore, Section 2.3 describes different classifiers (back-ends): Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and One-Class Support Vector Machine (One-Class SVM).
2.1. Speech features
As demonstrated in [11], traditional log Mel filterbank features (FBANK) are effective for detecting spoofing attacks with systems based on neural networks. These features are obtained by passing the Short Time Fourier Transform (STFT) magnitude spectrum through a Mel-filterbank and applying a log operation. However, FBANK features are usually highly correlated.
One way to decorrelate these features is to apply the Discrete
Cosine Transform (DCT) to get the classical Mel Frequency
Cepstral Coefficient (MFCC) features.
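As a minimal sketch of the relationship just described, the following NumPy code derives FBANK features from an STFT magnitude spectrum and then applies a DCT to obtain MFCC-like coefficients. The function names, the 512-point FFT, the 16 kHz sampling rate, the Hamming window and the random test signal are illustrative assumptions; only the 25 ms window, the 10 ms shift and the 48 filters follow the values reported in Section 3.2.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular Mel filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):          # rising slope
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling slope
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def fbank(signal, sr=16000, win=0.025, hop=0.010, n_fft=512, n_filters=48):
    """Log Mel filterbank energies, one row per frame."""
    frame_len, frame_hop = int(round(win * sr)), int(round(hop * sr))
    n_frames = 1 + (len(signal) - frame_len) // frame_hop
    idx = np.arange(frame_len)[None, :] + frame_hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    mag = np.abs(np.fft.rfft(frames, n_fft))          # STFT magnitude spectrum
    return np.log(mag @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)

def dct2(x):
    """Orthonormal DCT-II along the last axis (decorrelating transform)."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return x @ basis.T

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)   # one second of synthetic audio (assumption)
fb = fbank(signal)                    # FBANK features
mfcc = dct2(fb)                       # MFCC-like features (all 48 coefficients kept)
```

In practice only the first few DCT coefficients are usually retained; the sketch keeps all of them so that the transform stays invertible.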
In [8], CQCC features, which are obtained using the Constant Q Transform (CQT), are proposed for spoofing detection. The Q factor is a measure of the selectivity of each filter and is defined as the ratio between the center frequency and the bandwidth of the filter. In the STFT, the bandwidth is the same for all filters, so the Q factor increases when moving from low to high frequencies. In contrast, the bandwidth of the filters employed in the CQT grows with the center frequency so that the Q factor stays constant, which yields a higher frequency resolution at low frequencies and a higher temporal resolution at high frequencies. In this manner, the CQCC features try to imitate the human perception system, which is known to approximate a constant Q factor between 500 Hz and 20 kHz [20].
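The difference between the two transforms can be illustrated numerically. The short sketch below computes the Q factor (center frequency divided by bandwidth) for uniformly spaced STFT bins and for geometrically spaced CQT bins. The sampling rate, FFT size, minimum frequency and number of bins per octave are assumed values chosen only for illustration.

```python
import numpy as np

sr, n_fft = 16000, 512                      # assumed analysis settings

# STFT: every bin has the same bandwidth sr / n_fft.
stft_fc = np.arange(1, n_fft // 2 + 1) * sr / n_fft
stft_bw = np.full_like(stft_fc, sr / n_fft)
stft_q = stft_fc / stft_bw                  # grows linearly with frequency

# CQT: center frequencies are geometrically spaced and the bandwidth
# is proportional to the center frequency.
bins_per_octave = 12                        # assumed resolution
f_min = 32.7                                # assumed lowest center frequency (Hz)
k = np.arange(8 * bins_per_octave)
cqt_fc = f_min * 2.0 ** (k / bins_per_octave)
cqt_bw = cqt_fc * (2.0 ** (1.0 / bins_per_octave) - 1.0)
cqt_q = cqt_fc / cqt_bw                     # identical for every bin
```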
In this work, we employ the classical FBANK and MFCC features, as well as the popular CQCC features, to feed the anti-spoofing system.
Figure 1: Front-end architecture of the anti-spoofing system which extracts a spoofing identity vector per utterance (N represents the number of context windows per utterance). This system is proposed in [6].
2.2. Front-end
Table 1: Structure of the ASVspoof 2015 data corpus divided by the training, development and evaluation sets [14].
Subset         # Speakers          # Utterances
               Male    Female      Genuine    Spoofed
Training       10      15          3750       12,625
Development    15      20          3497       49,875
Evaluation     20      26          9404       184,000
3. Experimental framework
To evaluate the performance of several features and back-ends in an anti-spoofing system based on neural networks, the ASVspoof 2015 dataset [14], a standard data corpus for research on spoofing detection, was employed. Details about the methodology followed for training and testing are also given in this section.
3.1. Speech corpus
The ASVspoof 2015 corpus [14] defines three datasets (training, development and evaluation), each one containing a mix of genuine and spoofed speech. The structure of these three datasets is shown in Table 1. Spoofing attacks were generated either by TTS or VC. A total of 10 types of spoofing attacks (S1 to S10) are defined: three of them are implemented using TTS (S3, S4 and S10), and the remaining seven (S1, S2, S5, S6, S7, S8 and S9) using different VC systems. Attacks S1 to S5 are referred to as known attacks, since the training and development sets contain data for these types of attacks, while attacks S6 to S10 are referred to as unknown attacks, because they only appear in the evaluation set. More details about this corpus can be found in [14].
3.2. Spectral Analysis
The frame window size is 25 ms with 10 ms of frame shift. Moreover, the size of the context window is W = 31 frames, and the number of filters used to get the spectral features is M = 48 filters. In contrast to [7] and [11], we use 48-dimensional static spectral features without delta and acceleration coefficients, since we found that the context window of 31 frames already exploits the correlations between consecutive frames. Therefore, a higher spectral resolution is achieved while the size of the spectral feature vector is smaller than in [7].
3.3. Training
The CNN and RNN networks are trained using the Adam optimizer [18]. As there are K = 5 known spoofing attacks in the data corpus, the softmax layer of both CNN and RNN contains K + 1 = 6 neurons (one per class). The two fully connected layers of the CNN have 1024 sigmoid neurons, and the layer of the RNN has 1920 GRUs, which is the length of the identity spoofing vector of the whole utterance. To prevent the problem of overfitting, the initial dropout probabilities are 50% and 40% from the first to the last fully connected layer, respectively. Also, early stopping is applied in order to stop the training process when no improvement of the cross entropy is obtained after 15 iterations. All the specified parameters of the system have been optimized using the validation set of the data corpus [14].
3.4. Performance evaluation
The equal error rate (EER) is used to evaluate the system performance. As described in the ASVspoof 2015 challenge evaluation plan [14], the EER was computed independently for each spoofing algorithm and then the average EER across all attacks was used. To compute the average EER, we used the Bosaris toolkit [15].
4. Experimental results
4.1. Comparison of features and back-ends
Table 2 shows the detailed results of the different features (FBANK, MFCC and CQCC) and classifiers (LDA, SVM and SVM One-Class) in the described CNN + RNN anti-spoofing system. Furthermore, a summary of these results is shown in Fig. 2. The best performance is obtained with the combination of FBANK features and the LDA classifier. On average, the FBANK features obtain the best performance independently of the back-end, although MFCC features perform better on the SVM One-Class considering all the attacks. The CQCC features achieve the best average performance in the known attacks with LDA and SVM back-ends, but these two combinations perform very poorly in the S10 attack.
Regarding the back-ends, the LDA outperforms the other two classifiers in the known and unknown attacks. Moreover, the binary SVM classifier performs much better than SVM One-Class using FBANK and CQCC features.
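The evaluation protocol of Section 3.4 (one EER per spoofing attack, then the mean over attacks) can be sketched as follows. This is a simple threshold sweep written for illustration, not the Bosaris toolkit implementation used in the paper; the function names and the convention that higher scores mean "more genuine" are assumptions.

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    """Approximate EER (%) by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    # False rejection rate: genuine trials scored below the threshold.
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))      # operating point where both rates meet
    return 100.0 * (far[i] + frr[i]) / 2.0

def average_eer(genuine_scores, spoof_scores_by_attack):
    """EER per attack and its unweighted mean across attacks."""
    per_attack = {name: eer(genuine_scores, scores)
                  for name, scores in spoof_scores_by_attack.items()}
    return per_attack, float(np.mean(list(per_attack.values())))

# Toy scores (hypothetical): attack "A" is perfectly separable, attack "B" is not.
genuine = np.array([2.0, 3.0, 4.0])
per_attack, avg = average_eer(genuine, {
    "A": np.array([-1.0, 0.0, 1.0]),
    "B": np.array([2.0, 3.0, 4.0]),
})
```

Averaging per-attack EERs, rather than pooling all spoofed trials, gives every attack the same weight regardless of how many utterances it contributes.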
Table 2: Comparison on evaluation dataset for each spoofing attack in terms of EER (%)
                         Known attacks                              Unknown attacks                            Total
Features  Back-end            S1     S2     S3     S4     S5   Avg.     S6     S7     S8     S9    S10   Avg.   Avg.
FBANK     LDA               0.01   0.09   0.00   0.00   0.11   0.04   0.66   0.21   0.00   0.36   7.16   1.68   0.86
          SVM               0.03   0.13   0.00   0.01   0.22   0.08   0.77   0.34   0.18   0.48  10.46   2.44   1.26
          SVM One-Class     0.36   2.07   0.17   0.12   4.37   1.42   5.44   1.34   0.34   1.53   8.23   3.38   2.40
MFCC      LDA               0.06   0.08   0.00   0.00   0.06   0.04   0.11   0.12   0.00   0.05  15.43   3.14   1.59
          SVM               0.05   0.19   0.01   0.01   0.23   0.10   0.22   0.21   0.05   0.15  29.58   6.04   3.07
          SVM One-Class     0.43   1.97   0.12   0.12   2.11   0.95   3.38   2.07   0.06   1.03  11.09   3.53   2.24
CQCC      LDA               0.04   0.04   0.00   0.00   0.04   0.02   0.13   0.51   0.05   0.08  11.76   2.51   1.27
          SVM               0.03   0.01   0.01   0.01   0.02   0.01   0.06   0.37   0.07   0.02  21.52   4.41   2.21
          SVM One-Class     1.72   6.14   0.49   0.47   7.34   3.23  10.13   9.67   1.39   6.50   8.54   7.25   5.24
Figure 3: Comparison on evaluation dataset for known and unknown spoofing attacks in terms of average (%) EER
7. References
[20] B. C. J. Moore, “An Introduction to the Psychology of Hearing”, BRILL, 2003.
[1] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li,
“Spoofing and countermeasures for speaker verification: A survey,”
in Speech Communication, vol. 66, pp. 130–153, 2015.
[2] Z. Wu et al., “Anti-spoofing for text-independent speaker verifi-
cation: An initial database, comparison of countermeasures, and
human performance,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 24, no. 4, pp. 768–783, 2016.
[3] Liu, Y., Qian, Y., Chen, N., Fu, T., Zhang, Y., Yu, K., “Deep fea-
ture for text-dependent speaker verification”, in Speech Communi-
cation, vol. 13, pp. 1–13, 2015.
[4] Grézl, F., Karafiát, M., Kontár, S., Černocký, J., “Probabilistic and
bottle-neck feature for LVCSR of meetings,” in Proc. IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2007, pp. 757–760.
[5] Wu, Z., King, S., “Improving trajectory modelling for dnn-based
speech synthesis by using stacked bottleneck features and min-
imum trajectory error training,” in IEEE/ACM Trans. on Audio,
Speech and Language Processing, vol. 24, pp. 1255–1265, 2016.
[6] Alejandro Gomez-Alanis, Antonio M. Peinado, Jose A. Gonzalez,
and Angel M. Gomez, “A Deep Identity Representation for Noise
Robust Spoofing Detection,” in Proc. InterSpeech, 2018.
[7] Y. Qian, N. Chen, and K. Yu, “Deep features for automatic spoofing
detection,” in Speech Communication, vol. 85, pp. 43–52, 2016.
[8] M. Todisco, H. Delgado, and N. Evans, “A new feature for auto-
matic speaker verification anti-spoofing: Constant Q cepstral coef-
ficients,” Proc. Odyssey, 2016, pp. 249–252.
[9] T. B. Patel and H. A. Patil, “Combining evidences from mel cep-
stral, cochlear filter cepstral and instantaneous frequency features
for detection of natural vs. spoofed speech,” Proc. Interspeech,
2015, pp. 2062–2066.
[10] Muckenhirn, H., Korshunov, P., Magimai-Doss, M., Marcel, S.,
“Long-Term Spectral Statistics for Voice Presentation Attack De-
tection,” in IEEE/ACM Trans. on Audio, Speech and Language Pro-
cessing, vol. 25, pp. 2098–2111, 2017.
[11] Y. Qian, N. Chen, H. Dinkel, and Z. Wu, “Deep Feature Engi-
neering for Noise Robust Spoofing Detection,” IEEE/ACM Trans-
actions on Audio, Speech and Language Processing, vol. 25, no.
10, pp. 1942–1955, 2017.
[12] Scholkopf, B., Williamson, R. C., Smola, A. J., et al., “Support
vector method for novelty detection,” in Proc. NIPS, 2000, pp. 582–
588.
[13] Jesús Villalba, Antonio Miguel, Alfonso Ortega, and Eduardo
Lleida, “Spoofing detection with dnn and one-class svm for the
asvspoof 2015 challenge,” in Proc. InterSpeech, 2015, pp. 2067–
2071.
[14] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M.
Sahidullah, and A. Sizov, “ASVspoof 2015: The first automatic
speaker verification spoofing and countermeasures challenge,” in
Proc. InterSpeech, 2015, pp. 2037–2041.
[15] N. Brümmer and E. deVilliers, “The BOSARIS toolkit: Theory,
algorithms and code for surviving the new DCF,” in NIST SRE11
Speaker Recognition Workshop, Atlanta, Georgia, USA, Dec. 2011,
pp. 1–23.
[16] Kyunghyun Cho, et al., “Learning Phrase Representations using
RNN Encoder-Decoder for Statistical Machine Translation,” Proc.
Empirical Methods in Natural Language Processing, 2014, pp.
1724–1734.
[17] S. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of
deep networks,” in Proc. Spoken Language Technology Workshop,
2014, pp. 159–164.
[18] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
[19] Chunlei Zhang, Chengzhu Yu, and John H. L. Hansen, “An in-
vestigation of Deep-Learning Frameworks for Speaker Verification
Antispoofing,” IEEE Journal of Selected Topics in Signal Process-
ing, vol. 11, pp. 684–694, 2017.
10.21437/IberSPEECH.2018-11
2.2. The impact of CMVN
The use of CMVN has a significant impact on the curves that observation likelihoods form. When testing a sample signal and computing frame by frame the observation likelihoods at each state of the silence HMM, very different curves are obtained depending on whether CMVN is applied or not. Figure 1 illustrates this difference. The middle and bottom diagrams show the curves formed by the observation log-likelihoods generated by each HMM state s0, s1 and s2, without and with normalization respectively, through an utterance composed of four words. In this case, the normalization has been performed using the means and variances computed from the file.
2.4. Robustness against different SNR values
Another point of interest in a VAD is its robustness to different recording conditions. As an example, we have chosen four signals from the Spanish SpeeCon database [14] to illustrate the impact of the recording distance on the observation likelihood curves. These four signals correspond to the same utterance, but were recorded by means of four different microphones: a headset (channel C0), a lavalier (channel C1), a medium-distance cardioid microphone (0.5-1 meter, channel C2) and a far-distance omnidirectional microphone (channel C3). Each of these channels represents a different SNR, C0 being the cleanest (around 20 dB) and C3 the noisiest (0 dB).
Figure 2 shows the observation log-likelihoods generated
by the central state of the silence HMM trained with the Basque
Speecon-like database. The utterance is the same as the one in
Figure 1 (note that the signal in Figure 1 corresponds to the C1
signal in Figure 2). The darkest curve corresponds to the C0
channel and the lightest one to the C3 channel.
speakers, with 10 utterances per speaker: 6300 files for
each noise level. The total speech content in the database
is 86.57 % (not well balanced), and the label files are the
ones belonging to the classic TIMIT database. All audio
files are distributed as single-channel 16 kHz, 16-bit FLAC, but
have been converted to 16-bit PCM.
2. ECESS subset of the Spanish Speecon database: it was
used in the ECESS evaluation campaign of voice activ-
ity and voicing detection in 2008. It includes 1020 ut-
terances recorded in different environments (office, en-
tertainment, car and public place) distributed among the
C0 , C1 , C2 and C3 subsets (total number of files: 4080).
There are 60 different speakers, each of whom utters
17 sentences. The total speech content in the database
is 55.77 % (well balanced), and it contains reference
speech and silence labels specifically designed to assess
different VAD algorithms. The signals in the database
were recorded at 16 kHz and 16 bit per sample.
Each file’s features have been normalized off-line, with the
means and variances calculated from the file itself. The on-line
performance has been left for future research.
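The off-line, per-file normalization described above can be sketched in a few lines of NumPy. The function name, the small constant added to the standard deviation and the synthetic feature matrix are illustrative assumptions, not part of the original system.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-file cepstral mean and variance normalization.

    features: array of shape (n_frames, n_coeffs) for a single file.
    The statistics are computed from the file itself (off-line CMVN).
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
feats = 5.0 + 3.0 * rng.standard_normal((200, 13))   # hypothetical cepstral features
norm = cmvn(feats)
```

An on-line variant would have to estimate the mean and variance incrementally, since the whole file is not available in advance.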
3.2. Error metrics
The VAD accuracy experiment consists in evaluating the ability of the system to discriminate between speech and silence segments at different SNR levels, in terms of silence error-rate (ER0) and speech error-rate (ER1). These two rates are computed as the fractions of the silence frames and speech frames that are incorrectly classified (N_{0,1} and N_{1,0}, respectively) among the number of real silence frames and speech frames in the whole database (N_0^{ref} and N_1^{ref}, respectively), as shown in equation 5. In addition, the TER (total error rate) has also been computed as the average of the ER0 and ER1 (equation 6).
\[ ER_0 = \frac{N_{0,1}}{N_0^{\mathrm{ref}}} \times 100; \qquad ER_1 = \frac{N_{1,0}}{N_1^{\mathrm{ref}}} \times 100 \tag{5} \]
\[ TER = \frac{ER_0 + ER_1}{2} \tag{6} \]
A minimum duration of 15 frames both for speech and silence segments was set. This value was empirically chosen after some preliminary experiments.
Figure 3: ER0 and ER1 (top) and TER (bottom) for different decision threshold values when testing the signals of SNR 50 to 5 dB in the babble noise subset (left) and the white noise subset (right) of the Noisy TIMIT database.
Regarding the error rates, the minimum TERs are obtained at Th = −150, except for 5, 10 and 15 dB in the white noise subset, which occur at −100. Thus, we can consider the point of Th = −150 as the most valid threshold. Some ER0 and ER1 values obtained for Th = −150 are shown in Table 1.
Table 1: TER, ER0 and ER1 for Th = −150 on the signals of SNR 50, 35, 20 and 5 dB in the babble noise and white noise subsets of the Noisy TIMIT database.
SNR      Babble                     White
           ER0     ER1     TER       ER0     ER1     TER
50 dB    34.89    6.71   20.80     34.88    6.95   20.92
35 dB    30.87    7.48   19.18     28.05    9.18   18.62
20 dB    26.35   11.25   18.80     21.53   16.89   19.21
5 dB     22.60   20.78   21.70     15.49   30.90   23.20
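The error rates of equations 5 and 6 can be computed directly from frame-level labels. The following sketch assumes that both the reference and the hypothesis are given as sequences of 0 (silence) and 1 (speech); the function name and the toy label sequences are hypothetical.

```python
import numpy as np

def vad_error_rates(ref, hyp):
    """Return (ER0, ER1, TER) in percent for frame labels 0 = silence, 1 = speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    n0_ref = np.sum(ref == 0)                 # real silence frames
    n1_ref = np.sum(ref == 1)                 # real speech frames
    n01 = np.sum((ref == 0) & (hyp == 1))     # silence classified as speech
    n10 = np.sum((ref == 1) & (hyp == 0))     # speech classified as silence
    er0 = 100.0 * n01 / n0_ref
    er1 = 100.0 * n10 / n1_ref
    return er0, er1, (er0 + er1) / 2.0

ref = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
hyp = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
er0, er1, ter = vad_error_rates(ref, hyp)
```

Because TER averages the two rates instead of pooling all frames, it is not dominated by the majority class when the speech content of a database is unbalanced.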
Table 2: TER, ER0 and ER1 with Th = −150 on the signals of channels C0, C1, C2 and C3 in the Spanish Speecon database.
        ER0     ER1     TER
C0     6.21    2.74    4.48
C1     4.22    6.13    5.18
C2     7.10    6.00    6.55
C3     9.46    6.45    7.96
Table 3: TER, ER0 and ER1 for 5 and 10 frames long speech-segment margins, with Th = −150 for the signals of channels C0, C1, C2 and C3 in the ECESS subset of the Spanish Speecon database.
      5 frames                  10 frames
        ER0     ER1     TER       ER0     ER1     TER
C0    10.84    1.29    6.07     15.68    0.79    8.24
C1     7.94    3.47    5.71     12.42    2.30    7.36
C2    10.91    3.50    7.21     15.39    2.47    8.93
C3    13.29    3.95    8.62     17.59    2.89   10.24
The table shows that ER1 decreases and ER0 increases. TER increases as well, because ER0 increases faster than ER1 decreases. All in all, the use of a margin around speech segments significantly decreases ER1, with only a minor resulting TER degradation.
4.3. Comparison with other systems
In order to validate the previous results, our results have been compared with the outcomes of three popular standard VAD algorithms carried out in a previous work [18]. These systems are standards defined by ITU (International Telecommunication Union) and ETSI (European Telecommunications Standards Institute):
1. The VAD algorithm of the ITU G.729 system [19].
2. The AFE-FD (frame-dropping mechanism) algorithm implemented in ETSI AFE-DSR (Advanced Front-End for Distributed Speech Recognition) [20].
3. The AFE-NR (noise reduction system) algorithm implemented in ETSI AFE-DSR [20].
Table 4 shows the results obtained for the three VAD systems along with the proposed method (using Th = −150 and a margin of 10 frames), over the same dataset (4080 files from the ECESS subset). Regarding ER1, the AFE-FD gets better results, and also the AFE-NR for C0 and C1. However, both systems show the disadvantage of getting very high ER0 for all the channels (the lowest value is 38.10 %). This means that many silence frames will be sent to the recognizer. The ER0 in our results are between 12.42 and 17.59 %.
Table 4: Comparison of different VAD algorithm results at four SNR levels
(a) Silence error rates (ER0)
      G.729   AFE-FD   AFE-NR    Prop.
C0    56.06    63.88    58.23    15.68
C1    70.23    54.75    55.96    12.42
C2    59.54    52.10    38.10    15.39
C3    70.49    50.10    47.65    17.59
5. Conclusions
In this paper, we have assessed the usefulness of the observation likelihood generated by the central state GMM of a silence HMM trained using CMVN, as a possible basis on which to build a VAD system. We have seen that a good classification between speech and silence can be performed, just by setting a threshold in the curves that observation likelihoods form.
The silence HMM has been trained using the close-talk channel from the Basque Speecon-like database. Then, a threshold analysis has been carried out, processing the babble and white noise files of the Noisy TIMIT database. As a conclusion, we have noticed that the minimum error rates occur at the same likelihood point in 17 SNR values out of a total of 20. This point is the one we have chosen as the threshold.
This threshold has been tested with a separate database: the ECESS subset of the Spanish Speecon database. The results obtained for this database are even better than those obtained for the Noisy TIMIT, which leads us to think that the silence observation likelihood behaves similarly on different channels.
Additionally, the results of the test have been compared with three different standard VAD systems. Although the best speech error rates have not been achieved with the use of the decision threshold, we have got the best silence error rates. Our results are quite competitive; actually, the best total classification rates have been obtained.
As a final conclusion, competitive results are obtained just by setting a decision threshold to the silence observation likelihood curves. This fact has been applied in [21], where a method called Multi-Normalization Scoring (MNS) is used to exploit the discriminative potential of the observation likelihood scores. Robust on-line results are shown in that paper, where the scores obtained with MNS are classified with a Multi-Layer Perceptron (MLP). This issue and others related to the selection of the optimal threshold are being investigated currently in our laboratory.
6. Acknowledgements
This work has been partially supported by the EU (FEDER) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/FEDER, UE) and by the Basque Government under grant KK-2017/00043 (BerbaOla).
7. References
[1] M. K. Mustafa, T. Allen, and K. Appiah, Research and Development in Intelligent Systems XXXI. Springer International Publishing, 2014, ch. A Review of Voice Activity Detection Techniques for On-Device Isolated Digit Recognition on Mobile Devices, pp. 317–329.
[2] T. Virtanen, R. Singh, and B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition, 1st ed. Wiley Publishing, 2012.
[3] S. G. Tanyer and H. Ozer, “Voice activity detection in nonstationary noise,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 478–482, 2000.
[4] J. Tatarinov and P. Pollák, “HMM and EHMM based voice activity detectors and design of testing platform for VAD classification,” Digital Technologies, vol. 1, pp. 1–4, 2008.
[5] H. Veisi and H. Sameti, “Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement,” vol. 6, no. 1, pp. 54–63, 2012.
[6] Ó. Varela, R. S. Segundo, and L. A. Hernández, “Combining pulse-based features for rejecting far-field speech in a HMM-based Voice Activity Detector,” vol. 37, no. 4, pp. 589–600, 2011.
[7] D. Enqing, L. Guizhong, Z. Yatong, and C. Yu, “Voice activity detection based on short-time energy and noise spectrum adaptation,” in ICSP 2002 – 6th International Conference on Signal Processing Proceedings, August 26-30, Beijing, China, Proceedings. IEEE, 2002, pp. 464–467.
[8] Y. W. Tan, W. J. Liu, W. Jiang, and H. Zheng, “Hybrid SVM/HMM architectures for statistical model-based voice activity detection,” 2014, pp. 2875–2878.
[9] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity detection,” 2013, pp. 7378–7382.
[10] S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions,” 2014, pp. 2519–2523.
[11] Y. Obuchi, “Framewise speech-nonspeech classification by neural networks for voice activity detection with statistical noise suppression,” in ICASSP, 2016, pp. 5715–5719.
[12] M. Westphal, “The use of cepstral means in conversational speech recognition,” in EUROSPEECH 1997 – 5th European Conference on Speech Communication and Technology, September 22-25, Rhodes, Greece, Proceedings. ISCA, 1997, pp. 1143–1146.
[13] I. Odriozola, I. Hernaez, M. I. Torres, L. J. Rodriguez-Fuentes, M. Penagarikano, and E. Navas, “Basque Speecon-like and Basque SpeechDat MDB-600: speech databases for the development of ASR technology for Basque,” in LREC 2014, Ninth International Conference on Language Resources and Evaluation, May 26-31, Reykjavik, Iceland, Proceedings, 2014, pp. 2658–2665.
[14] D. Iskra, B. Grosskopf, K. Marasek, H. van den, F. Diehl, and A. Kiessling, “Speecon speech databases for consumer devices: Database specification and validation,” in LREC 2002, Third International Conference on Language Resources and Evaluation, May 27-31, Las Palmas, Spain, Proceedings, 2002, pp. 329–333.
[15] A. Abdulaziz and V. Kepuska, “Noisy TIMIT speech (LDC2017S04),” 3 2017. [Online]. Available: http://hdl.handle.net/11272/UFA9N
[16] B. Kotnik, P. Sendorek, S. Astrov, T. Koç, T. Çiloglu, L. D. Fernández, E. R. Banga, H. Höge, and Z. Kacic, “Evaluation of voice activity and voicing detection,” in INTERSPEECH 2008 – 8th Annual Conference of the International Speech Communication Association, September 22-26, Brisbane, Australia, Proceedings, 2008, pp. 1642–1645.
[17] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett, “DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, 1993.
[18] I. Luengo, E. Navas, I. Odriozola, I. Saratxaga, I. Hernaez, I. Sainz, and D. Erro, “Modified LTSE-VAD algorithm for applications requiring reduced silence frame misclassification,” in LREC 2010, Seventh International Conference on Language Resources and Evaluation, May 17-23, Valletta, Malta, Proceedings, 2010, pp. 1539–1544.
[19] P. Setiawan, S. Schandl, H. Taddei, H. Wan, J. Dai, L. B. Zhang, D. Zhang, J. Zhang, and E. Shlomot, “On the ITU-T G.729.1 silence compression scheme,” in EUSIPCO 2008 – 16th European Signal Processing Conference, August 25-28, Lausanne, Switzerland, Proceedings, 2008, pp. 1–5.
[20] E. Standards, “Speech processing, transmission and quality aspects (STQ); distributed speech recognition; front-end feature extraction algorithm; compression algorithms,” ETSI Standards, European Telecommunications Standards Institute, vol. ES 201 108 Recommendation, 2002.
[21] I. Odriozola, I. Hernaez, and E. Navas, “An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods,” Expert Systems with Applications (ESwA), vol. 110, pp. 52–61, 2018.
10.21437/IberSPEECH.2018-12
2. System Description
2.1. Phone-gram definition
Phonotactic systems use context information to improve the performance of LID. In this regard, we propose to use phonetic units that implicitly incorporate context information as features (phone-grams). They can be defined as the grouping of two or more phonemes in a new unit (Figure 1). In this work we have used 2-grams only because of the scattering observed in higher orders.

phonetic unit that is going to appear next according to the context in which the unit is included. In the second case, the objective is to normalize the counts and then smooth them trying to obtain a vector representation with homogeneous values.
The model definition normally used to train both is defined at the word level [13], but we work at the phone level. The objective is to find the co-occurrence of phonemes and phoneme sequences that tend to appear in similar contexts for a specific language. Hence, we expect to improve the results compared with the system based on uniphone sequences. Our study focuses on phone-grams, and their use in the continuous space has been called Phone-based Embeddings (Ph-Emb).
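The phone-gram units of Section 2.1 can be illustrated with a short sketch that groups consecutive phonemes into 2-gram units. The sliding window with a stride of one phoneme, the separator character and the example phoneme sequence are assumptions made for illustration; the exact grouping scheme of the system is the one shown in Figure 1.

```python
def phone_grams(phonemes, n=2, sep="_"):
    """Group consecutive phonemes into n-gram units (assumed stride of one)."""
    return [sep.join(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

seq = ["b", "a", "r", "s", "a"]      # hypothetical recognizer output
units = phone_grams(seq)             # ['b_a', 'a_r', 'r_s', 's_a']
```

Each resulting unit is then treated as a single token, in the same way a word is treated when training word embeddings.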
[12] that produces the phonetic sequences corresponding to
the utterances.

Figure 2: Global System Architecture (Front-End: phoneme recognizer
and phone-gram creation for the training and evaluation data;
Back-End: Phone-based Embeddings generation, UBM training, T matrix,
i-Vectors and multi-class logistic classifier that outputs the
detected language).

The second component of the system is the "Back-End".
First, we obtain the phone-gram sequences from the
phoneme sequences of each language. The sequences obtained
are used to train the Phone-based Embeddings. To
model the Phone-based Embeddings we have used both
alternatives described above, Skip-Gram and GloVe modelling.
After that, we replace every phone-gram by its
respective Phone-based Embedding and use it as the input feature
vector to the i-Vector system. All these vectors are used to
train the T matrix and the UBM needed to obtain the
i-Vectors. Finally, we use these i-Vectors as features to
train a multiclass logistic classifier that determines the detected
language [19], [20].
As we have a different Phone-based Embedding for each
language to be recognized, we considered two alternatives to
manage the set of vectors for each phone-gram. We have
called them "Single Vector Embedding" and "Multiple Vector
Embeddings".

2.8. Single Vector Embedding (SVE)

We organize the phone-gram sequence in a column,
replacing each phone-gram by its corresponding Phone-based
Embedding of a specific language, and repeat this process for
all the other languages. So, the first column will contain the
Phone-based Embedding sequences trained with data from
language 1, the second one will contain the Phone-based
Embedding sequences trained with data from language 2,
and so on. Finally, we obtain a matrix that includes the
Phone-based Embeddings trained with all the languages to be
recognized (Figure 3).

different i-Vectors for each language. Our proposal is to fuse
the scores provided by the individual language-dependent
systems, expecting a better performance and a lower
computational cost.

2.10. Acoustic System using MFCCs

We have fused the scores of the proposed techniques with
the scores obtained from an acoustic system to check if they
provide complementary information. The acoustic system has
been generated as follows: from each speech utterance, 12
MFCC coefficients including C0 [21] are extracted for each
frame. The silence and noise segments of the acoustic signal
are removed using a Voice Activity Detector. To
reduce the noise perturbation, a RASTA filter is used
together with cepstral mean and variance normalization
(CMVN). We have a feature vector of dimension 56,
generated from the concatenation of the SDC parameters using
the 7-1-3-7 configuration. Feature vectors are used to train the
total variability matrix, from which the i-Vectors of dimension
400 with 512 Gaussians are extracted (optimal configuration).

3. Results

We have to define the optimal training parameters of the
Phone-based Embeddings for 2-grams: vector size, window size,
number of training iterations and negative sampling factor.
Negative sampling is an optimization method used to improve
the robustness of the NEs by applying logistic regression. It
reduces the computational complexity and makes the vector
estimation more efficient.
The window size corresponds to the number of
phone-gram units considered to the left and to the right of the
current unit, which are taken as contextual
information. The vector size is the size of the embedding vector. In
all cases, the results in the tables represent the fusion of the
three phonetic recognizers.

3.1. Single Vector Embedding (SVE)

As described in Section 2.8, we have generated a
sequence of phone-grams for each language. We have tested
several options for the feature vector size, obtaining an
optimum for size 40, which gives a final vector of size 240
considering the 6 languages to be recognized. Regarding the number of
Gaussians in the i-Vector system, we have obtained an
optimum for 512. The best result was a Cavg of 24.69%.
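
The SVE layout described in Section 2.8 can be sketched as follows: for every position in the phone-gram sequence, the embeddings of that phone-gram under each language-dependent model are concatenated, so 6 languages with 40-dimensional embeddings give 240-dimensional feature vectors. This is a minimal sketch under stated assumptions, not the authors' implementation: the lookup tables are random placeholders rather than trained Skip-Gram or GloVe models, the language codes are made up, and the names `random_table`, `sve_features`, `LANGUAGES` and `EMB_SIZE` are hypothetical.

```python
import random
from typing import Dict, List

EMB_SIZE = 40  # optimum vector size reported in Section 3.1
LANGUAGES = ["l1", "l2", "l3", "l4", "l5", "l6"]  # placeholder codes for the 6 languages

Embedding = List[float]
Table = Dict[str, Embedding]


def random_table(vocab: List[str], seed: int) -> Table:
    """Stand-in for a trained language-dependent embedding model (assumption)."""
    rng = random.Random(seed)
    return {pg: [rng.uniform(-1.0, 1.0) for _ in range(EMB_SIZE)] for pg in vocab}


def sve_features(sequence: List[str], tables: Dict[str, Table]) -> List[Embedding]:
    """Concatenate, for each phone-gram, its embedding under every language model."""
    features = []
    for pg in sequence:
        vector: Embedding = []
        for lang in LANGUAGES:  # fixed language order keeps the blocks aligned
            vector.extend(tables[lang][pg])
        features.append(vector)
    return features


vocab = ["k_a", "a_s", "s_a"]
tables = {lang: random_table(vocab, seed=i) for i, lang in enumerate(LANGUAGES)}
feats = sve_features(["k_a", "a_s", "s_a", "a_s"], tables)
print(len(feats), len(feats[0]))  # 4 240
```

The resulting sequence of 240-dimensional vectors is what would be passed to the i-Vector system in this configuration.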
uses a three phone-gram context window and the second one a
five phone-gram context window with the following weights:
A) 3 phone-grams context: Final Ph-Emb = Left Ph-Emb * 0.25
   + Central Ph-Emb * 0.50 + Right Ph-Emb * 0.25
B) 5 phone-grams context: Final Ph-Emb = Second Left Ph-Emb * 0.10
   + Left Ph-Emb * 0.15 + Central Ph-Emb * 0.50
   + Right Ph-Emb * 0.15 + Second Right Ph-Emb * 0.10
The objective is to assign more weight to the current
phone-gram while taking into account information from the
neighbouring units. Using option A (Cavg: 24.38) we obtained
a 14.4% relative improvement over option B (Cavg: 28.49),
using the model for the Basque language and the Skip-Gram
technique as reference. The optimum vector size for SG-Emb
is 80, with 512 Gaussians in the i-Vector system, 10 iterations
and a window size of 8. All the optimization has been
carried out on the development data set. After fusing all the
languages, we obtained a Cavg of 18.70% with MVE, which is a
relative improvement of 24.3% over SVE.

3.4. GloVe model for the MVE

We have also evaluated our approach using the GloVe
model (Section 2.6) instead of the Skip-Gram model for our
best system with contextual information, because it
incorporates information on the co-occurrence of phone-grams
in the whole training data set. The optimal configuration
parameters are: vector size of 80, window size of 4, and 30
iterations. The optimum number of Gaussians for the i-Vector
system has been 512 (the same as for Skip-Gram). Fusing all the
languages as before, we obtain a Cavg of 16.70%, which is a
10.7% relative improvement over MVE based on the
Skip-Gram model.

3.5. Summary of results

In Table 2 we present the summary of results obtained with
the techniques proposed in this paper. As we can see, the final
system using the GloVe model provides the best results.

Table 2: Summary of results.

System                       Cavg    Improvement %
SVE                          24.69
MVE context and Skip-Gram    18.70   24.3
MVE context and GloVe        16.70   32.4

3.6. Fusion with the acoustic model

The objective of this technique is to improve an existing
LID system, which is based on acoustic information (Section
2.10). So, we present the results of fusing the existing acoustic
LID system with our two best systems, based on Phone-based
Embeddings obtained with the Skip-Gram and GloVe models
(Table 3).

Table 3: Phone-based Embeddings systems fused with
an acoustic system.

System                Cavg   Improvement %
Acoustic system       7.60
Fusion with SG-Emb    5.40   28.9
Fusion with Gl-Emb    5.01   34.1

4. Conclusions

We have demonstrated that the use of Phone-based
Embeddings as feature vectors provides improvements in an
LID task. We have used as a baseline a first system that uses
Phone-based Embeddings as feature vectors with rather poor
results. However, using the new approaches proposed in this
paper the results improved, and the fusion of our best
configuration with our acoustic system provides significant
improvements.
Our baseline system uses the "SVE" technique to obtain
a Cavg of 24.69%. Considering this poor result, we decided to
change the approach and use an individual matrix for each
language, fusing the scores from all individual systems at the
back-end, and we obtained a Cavg of 19.73% using the
Skip-Gram modelling.
Then, we proposed the inclusion of context information in
the Phone-based Embeddings, including the two or four
nearest neighbours. After fusing all the language models we
obtained a Cavg of 18.7% using the Skip-Gram modelling.
Finally, using the GloVe modelling we obtain a Cavg of 16.7%,
with a 10.7% relative improvement over Skip-Gram modelling
and 32.4% compared to the baseline system.
Also, the fusion with the acoustic-based system provides a
34.1% relative improvement, which demonstrates that both
systems provide complementary information for the LID task.
As future research lines, we propose to study the effects of
higher order units using a larger database. We will also
evaluate other types of language models for the neural
embeddings. Also, we expect to use models with a high
number of layers (char-RNN) and combine them with
convolutional DNNs to get better local context characteristics.

5. Acknowledgements

The work leading to these results has been supported by the
AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR
(MINECO, TEC2017-84593-C2-1-R) projects. The authors also
thank Mark Hallet for the English revision of this paper and all
the other members of the Speech Technology Group for the
continuous and fruitful discussion on these topics. We
gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Titan X Pascal GPU used for this
research.

6. References
[1] Y. Muthusamy, E. Barnard and A. Cole, “Reviewing automatic
language identification,” in Signal Processing Magazine, IEEE,
1994, pp. 33–41.
[2] L. D’Haro, R. Cordoba, C. Salamea and J. Echeverry, “Extended
phone-likelihood ratio features and acoustic-based i-vectors for
language recognition,” in Proceedings in Acoustics, Speech and
Signal Processing, ICASSP, 2014, pp. 5342–5346.
[3] N. Brummer and D. Van Leeuwen, “On calibration of language
recognition scores,” in Speaker and Language Recognition
Workshop, IEEE Odyssey 2006, pp. 1–8.
[4] J. Turian, L. Ratinov, and Y. Bengio, “Word Representations: A
simple and general method for semi-supervised learning,” in
Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, 2010, pp. 384–394.
[5] S. Bengio and G. Heigold, “Word Embeddings for Speech
Recognition,” in INTERSPEECH 2014 – 15th Annual Conference
of the International Speech Communication Association,
September 14-18, Singapore, Proceedings, 2014, pp. 1053–
1057.
[6] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean,
“Distributed representations of words and phrases and their
compositionality,” in Advances in neural information processing
systems, 2013, pp. 3111–3119.
[7] J. Pennington, R. Socher, and C. Manning, “Glove: Global
vectors for word representations,” in Proceedings of the
conference on empirical methods in natural language
processing, 2014, pp. 1532–1543.
[8] P. Wang, B. Xu, J. Xu, G. Tian, C. Liu, and H. Hao, “Semantic
expansion using word embedding clustering and convolutional
neural network for improving short text classification,” , 2016,
pp. 806–814.
[9] L. D’Haro, R. Cordoba, M. Caraballo and J. Pardo, “Low-
resource language recognition using a fusion of phoneme
posteriorgram counts, acoustic and glottal-based i-vectors,” in
Proceedings in Acoustics, Speech and Signal Processing
ICASSP, 2013, pp. 6852–6856.
[10] L. Rodriguez-Fuentes, M. Penagarikano, A. Varona, M. Diez
and G. Bordel, “KALAKA-3: a database for the assessment of
spoken language recognition technology on YouTube audios,” in
Language Resources and Evaluation, 2016, pp. 221–243.
[11] A. Martin and C. Greenberg, “The 2009 NIST Language
Recognition Evaluation,” in Speaker and Language Recognition
Workshop, IEEE Odyssey 2010, pp. 165–171.
[12] P. Ace, P. Schwarz and V. Ace, “Phoneme recognition based on
long temporal context,” PhD. Thesis, Brno University of
Technology, Faculty of Information Technology, 2009.
[13] T. Mikolov, K. Chen, G. Corrado and J. Dean, “Efficient
estimation of word representations in vector space,” in
Proceedings of Workshop at ICLR, 2013, pp. 1–12.
[14] D. Guthrie, B. Allison, W. Liu, L. Guthrie and Y. Wilks, “A
closer look at Skip-Gram modelling,” in Proceedings of the 5th
International Conference on Language Resources and
Evaluation, 2006, pp. 1–4.
[15] Y. Yuang, L. He, L. Peng and Z. Huang, “A new study based on
word2vec and cluster for document categorization,” in Journal
of Computational Information Systems, 2014, pp. 9301–9308.
[16] F. Morin and Y. Bengio, “Hierarchical probabilistic neural
network language model,” in Aistats, 2005, pp. 246–252.
[17] S. Yujing, X. Yeming, X. Ji, P. JieLin and Y. Yonghong,
“Recurrent neural network language model with vector-space
word representations,” in the 21st International Congress on
Sound and Vibration, 2014.
[18] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet,
“Front-end factor analysis for speaker verification,” in Audio,
Speech and Language Processing, 2011, pp. 788–798.
[19] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M.
Kara, D. Van Leeuwen, P. Matejka, P. Schwarz and A.
Strasheim, “Fusion of heterogeneous speaker recognition
systems in the STBU submission for the NIST speaker
recognition evaluation,” in Audio, Speech and Language
Processing, 2007, pp. 2072–2084.
[20] M. BenZeghiba, J. Gauvain and L. Lamel, “Language score
calibration using adapted Gaussian back-end,” in
INTERSPEECH 2009 – 10th Annual Conference of the
International Speech Communication Association, 2009, pp.
2191–2194.
[21] S. Davis and P. Mermelstein, “Comparison of parametric
representations for monosyllabic word recognition in
continuously spoken sentences,” in Acoustics, Speech and Signal
Processing. 1980, pp. 357–366.
10.21437/IberSPEECH.2018-13

1. Introduction

The fields of Machine Translation (MT) and Automatic Speech
Recognition (ASR) share many features, including conceptual
foundations, sustained interest and attention of researchers in
the field, remarkable progress in the last two decades and the
resulting wide popular use. Both ASR and MT still have a long way
to go and, as a result, do not give perfect results. Speech
Translation (ST) applications are typically created by combining
ASR and MT systems [1, 2].
This pipeline implies that each system has to be trained with
its own dataset (which is required to be large), creating a big
drawback for low-resourced languages. In addition, all errors
made by the recognizer go to the MT system, and then the MT
system itself adds its own errors. The errors are combined, and
the results are often very poor.
Deep learning architectures have allowed for end-to-end
approaches for both machine translation [3] and speech
recognition [4]. Both systems are based on an encoder-decoder
architecture with recurrent neural networks and attention
mechanisms. This architecture has been successfully extended to
end-to-end speech translation [5].
Recently, a new architecture has been proposed for
addressing machine translation [6]. Later, this architecture has
also been used for speech recognition¹ [7]. In both cases, the
Transformer outperforms previous architectures based on
recurrent neural networks. Inspired by these previous works, this
paper describes how to adapt this architecture to end-to-end
speech translation. The rest of the paper is organised as follows.
Section 2 briefly describes the architecture of the Transformer
to make this paper self-contained. Section 3 reports the details

¹ https://tensorflow.github.io/tensor2tensor/tutorials/asr_with_transformer.html

Figure 1: Model architecture of the Transformer.

2.1. Input/Output

Originally, the input of the Transformer is a sequence of words
divided into sub-units called tokens. Once the text is
tokenized, a matrix of real
numbers collects the embedding vectors (typically of size
$d_{model} = 512$). Taking the raw input sequence as
$x = (x_1, x_2, \ldots, x_m)$ and
the embedded representation as $w = (w_1, w_2, \ldots, w_m)$ with
$w_j \in \mathbb{R}^f$, each $w_j$ is a column vector of the input matrix
belonging to the space $\mathbb{R}^{V \times f}$, with $V$ as the number of
embeddings and $f$ the number of features of each embedding.
The decoder generates an output sequence corresponding to the
input sentence.

2.2. Positional encoding

The lack of recurrence and convolution in the model means that
no information about the order of the sequence is available. A good
way to keep the order of the sentence is to add a positional
encoding to the input embeddings at the bottoms of the encoder
and decoder stacks.
What the Transformer uses is a sequence of vectors
$p = (p_1, p_2, \ldots, p_m)$, with $p_j \in \mathbb{R}^f$, which is added
element-wise to the original matrix.

2.3. Encoder

The encoder consists of a stack of N layers, each of them
composed of two sub-layers: a multi-head attention mechanism and
a fully-connected feed-forward net, plus residual connections
(referenced in Figure 1 as "Add") on both stages, followed by a
normalization layer.
The multi-head attention has several parallel attention layers,
or heads, which concatenate attention functions with different
linearly projected queries, keys and values.

participants called family members or close friends. The audio
files of the CALLHOME Corpus are available at LDC96S35.
The transcripts of these audio files are available at LDC96T17.
The transcript files are in plain-text, tab-delimited format (tdf)
with UTF-8 character encoding. In order to adapt the transcript
files for the Transformer, all the text was turned into capital
letters and a reference number was added at the beginning of each
sentence, consisting of six digits running from
"000000" to the last sentence and separated by a tab,
as follows:

000000 HELLO
000001 ALO.
000002 ALO, BUENAS NOCHES. QUIéN ES?
000003 QUé TAL, EH, YO SOY GUILLERMO, CóMO ESTáS?
000004 AH GUILLERMO.
...
003637 OH MY GOD.
003638 MHM. Y NO LE PODíAN HACER NADA, NO.
003639 MM.
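
The transcript adaptation described above (upper-casing plus a six-digit, tab-separated reference number) can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `number_transcript` is hypothetical, and Python's `str.upper` also capitalises accented letters, whereas the listing above keeps them in lower case.

```python
from typing import Iterable, List


def number_transcript(sentences: Iterable[str]) -> List[str]:
    """Upper-case each sentence and prefix a zero-padded six-digit index and a tab."""
    return [
        f"{index:06d}\t{sentence.strip().upper()}"
        for index, sentence in enumerate(sentences)
    ]


lines = number_transcript(["Hello", "Alo.", "Alo, buenas noches. Quién es?"])
for line in lines:
    print(line)
```

Numbering starts at 000000, as in the listing, so the index of the last line equals the number of sentences minus one.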
5000 steps and the learning rate warm-up steps set to 8000⁹.
To adapt the speech features in the ASR encoder, we used
conv_relu_conv from tensor2tensor. As features, we used a
mel filterbank of 80 coefficients computed every 10 ms with a
window of 25 ms. As preprocessing for the ASR inputs, we used the
tensor2tensor options as follows: conv1d(inputs, filter_size=1536,
kernel_size=9) + relu + conv1d(inputs, filter_size=384,
kernel_size=1). The Transformer gets a vector of dimension 384
every 10 ms. Also for the speech part, the input and target
maximum sequence lengths need clarification. An input
maximum sequence length of 1550 means that only examples
whose audio has fewer than 1550 frames are used,
which implies that, with frames of 10 ms, the maximum length of
the input audio is approximately 15.5 seconds.
On the other hand, a target maximum sequence length
of 350 means that the training transcripts are limited to a maximum
size of 350 characters.
The speech models were trained on TPUs [10] following
the suggested parameters for the librispeech task of the
tensor2tensor library [11].

⁹ Setting the learning rate warm-up steps to 8000 means that during
the first 8k steps the learning rate grows linearly and then follows
an inverse square root decay.

Hparam                            Text-to-Text (GPU)                ASR/ST (TPU)
Number of encoder layers          6                                 6
Number of decoder layers          6                                 4
Gradient clipping                 No                                No
Learning rate                     0.2                               0.15
Momentum                          0.9                               0.9
Audio sampling rate               -                                 8000
Batch size                        4096                              16
Maximum length                    256                               125550
Input sequence maximum length     0                                 1550
Target sequence maximum length    0                                 350
Adam optimizer                    β1 = 0.9, β2 = 0.997, ε = 10^-9   β1 = 0.9, β2 = 0.997, ε = 10^-9
Attention layers                  8                                 2
Initializer                       uniform unit scaling              uniform unit scaling
Initializer gain                  1.0                               1.0
Training steps                    250000                            210000

Table 2: Training parameters.

3.3. Training

When training, as there are several GPUs or TPUs, the
parameters are applied to each one. So the effective batch size is the
number of GPUs (in this case 4, or 8 in a TPU) multiplied by
the batch size. In each batch the parameters are updated using
stochastic gradient descent and the Adam optimizer [12].
Both the ASR/ST and MT systems use a character-based
tokenization. This implies that the models look for the correlation
of input and output sentences character by character.

3.4. Results

The evaluation of each model involving translation was done by
computing the Bilingual Evaluation Understudy score (BLEU)
[13]. The BLEU score is the most widely used metric in the field
of MT, and it compares the decoded sentence with the target
sentence of the test set by looking into the modified n-gram
precision. As for ASR evaluation, a commonly used metric is Word
Error Rate (WER) [14], which is defined as the ratio of word errors
(substitutions, deletions and insertions) to words processed. For the
evaluation of ASR systems, punctuation marks were not taken
into account.

System                 Results
ASR ES (TPU)           38.02 (WER)
MT (GPU)               55.05 (BLEU)
ASR + MT (TPU/GPU)     19.97 (BLEU)
ASR EN (TPU)           20.47 (BLEU)

Table 3: Results of the model evaluation. ASR ES stands for the
speech recognition with Spanish transcriptions as target. ASR
EN stands for the speech recognition with English
transcriptions as target.

Comparing ASR+MT concatenation and End-to-End
Speech Translation, the results show that, in terms of BLEU,
the latter is slightly better than the former, gaining 0.5 points of
BLEU.
Figure 2 shows an example in which, when concatenating ASR
and MT, the errors are also concatenated. The Spanish target
word BAILA (DANCE in English), when recognized with the
model ASR ES, is misrecognized and transcribed as the word VAYA,
which has a very similar sound but a totally different meaning. As
a consequence, the final translation output cannot reproduce the
word DANCE, which carries much of the meaning of the
sentence. In this case, the end-to-end system is able to produce
a better translation.

Figure 2: Speech Translation Example.

4. Conclusions

This paper proposes the use of the Transformer as the main
architecture for Speech Recognition, Machine Translation and Speech
Translation. To the best of our knowledge, this is the first time
that this promising architecture is used to build an End-to-End
Speech Translation system. BLEU results show that the
End-to-End Speech Translation architecture provides slightly
better results than the standard ASR and MT concatenation.
Examples show that these better results are achieved by avoiding
the concatenation of errors.
In future work, it would be interesting to train a system
capable of doing multi-task learning [15]. This system would
build several models and not only the one learning to translate
from Spanish speech to English text. The new multi-task model
would learn in addition Spanish Recognition and/or
Spanish-to-English text translation.

5. Acknowledgements

This work is supported in part by the Spanish Ministerio de
Economía y Competitividad, the European Regional Development
Fund and the Agencia Estatal de Investigación, through
the postdoctoral senior grant Ramón y Cajal, the contract
TEC2015-69266-P (MINECO/FEDER,EU) and the contract
PCIN-2017-079 (AEI/MINECO).

6. References

[1] A. Waibel and C. Fugen, “Spoken language translation,” IEEE
Signal Processing Magazine, vol. 25, no. 3, pp. 70–79, May 2008.
[2] M. Dureja and S. Gautam, “Article: Speech-to-speech translation:
A review,” International Journal of Computer Applications, vol.
129, no. 13, pp. 28–30, November 2015, published by Foundation
of Computer Science (FCS), NY, USA.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine
translation by jointly learning to align and translate,”
CoRR, vol. abs/1409.0473, 2014. [Online]. Available:
http://arxiv.org/abs/1409.0473
[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and
spell,” CoRR, vol. abs/1508.01211, 2015. [Online]. Available:
http://arxiv.org/abs/1508.01211
[5] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu,
and Z. Chen, “Sequence-to-sequence models can directly
translate foreign speech,” 2017. [Online]. Available:
https://arxiv.org/abs/1703.08581
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
in Advances in Neural Information Processing Systems, 2017, pp.
6000–6010.
[7] S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-Based
Sequence-to-Sequence Speech Recognition with the Transformer in
Mandarin Chinese,” ArXiv e-prints, 2018.
[8] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez,
S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al.,
“Tensor2tensor for neural machine translation,” arXiv preprint
arXiv:1803.07416, 2018.
[9] R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, and N. Jaitly, “An
analysis of attention in sequence-to-sequence models,” in Proc.
of Interspeech, 2017.
[10] N. Jouppi, “Google supercharges machine learning tasks with tpu
custom chip,” Google Blog, May, vol. 18, 2016.
[11] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez,
S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar,
R. Sepassi, N. Shazeer, and J. Uszkoreit, “Tensor2tensor for
neural machine translation,” CoRR, vol. abs/1803.07416, 2018.
[Online]. Available: http://arxiv.org/abs/1803.07416
[12] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” arXiv preprint arXiv:1412.6980, 2014.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method
for automatic evaluation of machine translation,” in Proceedings
of the 40th annual meeting on association for computational
linguistics. Association for Computational Linguistics, 2002, pp.
311–318.
[14] A. C. Morris, V. Maier, and P. Green, “From wer and ril to mer and
wil: improved evaluation measures for connected speech
recognition,” in Eighth International Conference on Spoken Language
Processing, 2004.
[15] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and
L. Kaiser, “Multi-task sequence to sequence learning,”
CoRR, vol. abs/1511.06114, 2015. [Online]. Available:
http://arxiv.org/abs/1511.06114
Audio event detection on Google's Audio Set database: Preliminary results using
different types of DNNs
Javier Darna Sequeiros and Doroteo T. Toledano
10.21437/IberSPEECH.2018-14
samples [10], and 0.360 with the implementation of several
levels of attention [11].
This paper intends to be a first approach to Google Audio
Set. Our goal is to train several different architectures of
neural networks with this database and compare their
evaluation results with each other and with Google's baseline.
The rest of the paper is organized as follows: Section 2
discusses the Google Audio Set database in more detail, Section
3 defines the neural networks that were used to address the
audio event detection problem, Section 4 describes the tests
that were carried out on these neural networks, Section 5
interprets the results of those tests, and finally Section 6
concludes the work and proposes future research lines that arise
from the results of this paper.

2. The Google Audio Set database

The Google Audio Set database is available in two different
formats:
• Text files describing the video id, start time, stop
  time, and labels assigned to each segment.
• Features (embeddings) extracted at 1 Hz for each
  segment using a DNN trained by Google (the
  structure of this DNN is similar to the VGG
  networks used in image recognition).

In this paper we have used only the latter format so that
minimal preprocessing is required and results are easier to
reproduce. However, the official dataset from the Google Audio Set
webpage was not used because it was developed to be used
directly with TensorFlow (also developed by Google) and it was
very difficult and inefficient to use with other toolkits such as
Keras. Instead, we finally used an “unofficial” conversion to
.h5df format available from the Google+ user group
"audioset-users" [12]. This conversion includes, for each segment, the
extracted audio features as a uniform 128x10 array and the
presence or absence of each possible label as a boolean vector.
The Google Audio Set database consists of 3 subsets (in
any of the available formats):
• Balanced training, which has a balanced distribution
  of the classes but contains only a small fraction of
  the samples (about 22K segments).
• Unbalanced training, which contains all of the
  samples (2.0M segments) but suffers from a greatly
  unbalanced distribution of the classes.
• Evaluation, which contains about 20K segments.

In all the experiments of this paper the neural networks were
trained with one of the training sets and then tested with the
evaluation set in order to obtain the final results.

3. Proposed neural networks

All the models used in this paper to perform the tests are neural
networks based on one of three different architectures. Despite
the different architectures, all the models share the following
properties:
• They use Adam as their optimizer.
• Every unit that does not belong to the output layer
  uses ReLU as its activation function.
• The output layer is fully connected and has 527
  units.

Since audio event classification is a multi-label problem
(i.e. the same segment can, and typically does, contain more than
one type of audio), all the networks (except one, as will be
discussed later) use binary cross-entropy as their loss function
and the binary sigmoid as the activation function for the
output layer.

3.1. LSTM

Since the data has the form of a time series of 10 feature
vectors (embeddings), each one representing one second of
audio, a Long Short-Term Memory (LSTM) recurrent neural
network was immediately considered as an appropriate
network, as LSTMs are specialized in this kind of data. The
proposed LSTM model has the following architecture:
• The inputs are series of 10 128-dimensional feature
  vectors (embeddings).
• The first hidden layer is a unidirectional LSTM
  layer with 600 units, which outputs a single vector
  when the whole sequence has been processed. This
  layer is followed by a dropout layer with a 0.3
  probability.
• The last hidden layer is a fully connected layer with
  600 units.

3.2. CNN

Convolutional Neural Networks (CNNs) have proven to be
very powerful for processing images. In our case, the input
sequence of 10 128-dimensional feature vectors can be
considered as a 10x128 image or matrix, and therefore a CNN
could be appropriate in this case as well. The 128-dimensional
input vectors are themselves produced as the output of a
different neural network. This neural network uses a PCA
transformation to create an embedding of the data. Therefore,
there is no local proximity relationship between the elements
of the vector. However, there is a temporal proximity for each
individual feature, which translates into a local proximity
between the rows of the matrix. A convolutional neural
network with the following architecture was designed to take
advantage of this temporal proximity:
• The input is the previously mentioned 10x128
  matrix.
• The first hidden layer is a convolutional layer with
  16 filters and a kernel with dimension 3x1. Since
  there is no proximity relationship in the input
  vectors (columns of the input matrix), the second
  dimension of the kernel is always restricted to 1 in
  our tests.
• The second hidden layer is a maximum pooling
  layer with a 2x2-dimensional window followed by a
  dropout layer with a 0.3 probability.
• The last hidden layer is a fully connected layer with
  600 units.

3.3. MLP

Multi Layer Perceptrons (MLPs) are amongst the most
standard and versatile neural networks. In fact, MLPs can
approximate any multidimensional input-output mapping,
including those produced by other network architectures, so
they can be used as a reference model. The main advantage of
other architectures over MLPs is that MLPs include a huge
amount of weights, which can make training more difficult
and more prone to overfitting. Given their property of
universal function approximation, they can be used to test the
impact of other factors apart from the type of neural network
they are based on. For this last reason, several models based
on MLPs were developed and tested:
• Two MLPs with one hidden layer.
• An MLP with two hidden layers.
• An MLP with three hidden layers.

All these models share these properties:
• The input is a 1280-dimensional vector (the
  flattened version of the input matrix used for
  CNNs).
• The hidden layers have 1500 units each. After each
  one of them, there is a dropout layer with a 0.3
  probability.

One of the MLPs with one hidden layer has the following
particular properties:
• The hyperbolic tangent (tanh) is used as the
  activation function of the output layer.
• In this case, the Mean Squared Error (MSE) is used
  as the network's loss function.

Table 1: Classification of the models ordered by their
performance (in increasing mAP or mR order). 1 h. l., 2 h. l.
and 3 h. l. indicate the number of hidden layers; bal. and unbal.
indicate the training set (balanced or unbalanced
training set); bip. and bin. indicate the codification of the
targets (bipolar or binary) and correspondingly the
activation function of the output layer (tanh or
sigmoid).

Model, training set, codification    mAP       mR
MLP 1 h. l., bal., bip.              0.13704   0.15848
MLP 1 h. l., unbal., bip.            0.19696   0.22697
MLP 3 h. l., unbal., bin.            0.20686   0.23529
MLP 2 h. l., unbal., bin.            0.21203   0.24079
MLP 1 h. l., unbal., bin.            0.21342   0.24166
MLP 1 h. l., bal., bin.              0.21893   0.24249
CNN, bal., bin.                      0.22830   0.25595
MLP 2 h. l., bal., bin.              0.24422   0.27542
MLP 3 h. l., bal., bin.              0.25276   0.28706
LSTM, bal., bin.                     0.26652   0.30698

4. Test Description
In order to compare the performance of the different models, we evaluated them on the evaluation subset of Google Audio Set. Every model was trained with the balanced training set of Audio Set. In the case of the MLP with the bipolar sigmoid (tanh) output activation function, the target vectors were preprocessed so that they have a bipolar codification (the absence of a class is represented with the value -1 instead of 0), allowing us to test the effect of the codification of the data on performance.

All the models were trained with a minibatch size of 128. Training lasted at most 50 epochs; however, early stopping was used to interrupt the training process when the mAP had not increased for three epochs, thus preventing overfitting. No early stopping was used on the LSTM model, as its mAP grew at a notably irregular rate and early stopping kept interrupting the training process before the model could reach its stability phase. This phenomenon did not happen with the rest of the models.

As the main focus of these tests is to compare the different network architectures, the hyper-parameters were left at their default values (learning rate: 0.001, beta1: 0.9, beta2: 0.999, decay: 0).

After training, the networks were tested with the evaluation subset of the Google Audio Set, and the final results were obtained by calculating the mean Average Precision (mAP) and the mean Recall (mR).

In addition, another test was performed on all the models based on MLPs, in which they were trained with the unbalanced training set (much larger in terms of samples, but much more unbalanced too) instead of the balanced one.

5. Results
Table 1 presents the results obtained after performing the tests described above.

The first point to note is that the ranking of the different neural networks is the same whether the models are ordered by their mAP or their mR. Because of that, the choice of metric does not affect the interpretation of the results.

The models using bipolar data obtain the worst results. One possible reason could be that the hidden layers use an activation function unable to take negative values. However, by comparing both models using bipolar data, we can notice that the one using the unbalanced training set performs much better than the one using the balanced training set. This is surprising because the models using binary data show the exact opposite behavior. In these models the use of the unbalanced training set has a negative impact on their performance, and the effect becomes more intense the more layers the network has.

The MLPs using binary data and the balanced training set perform better the more hidden layers they have, which is the expected behavior when there is enough training data, as seems to be the case.

The CNN's results were quite limited, falling behind the MLPs with more than one hidden layer. This is probably because of the lack of local meaning of the different features included in the 128-dimensional feature vectors (embeddings), which limits the kernel to a single dimension.

Finally, the model with the best results is the LSTM network, with a mAP of 0.26652 and a mR of 0.30698. This result is interesting because, even though LSTM neural networks are known to be particularly effective with time series, in our case these time series are very short, with 10 elements, which could have limited the performance of this model.
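As an illustration of the training protocol described in Section 4, the following Python sketch shows the bipolar recoding of the targets and a patience-based early stopping loop (at most 50 epochs, stopping after three epochs without mAP improvement). The names to_bipolar, train_with_early_stopping, train_epoch and evaluate_map are hypothetical and are not taken from the original experiments; the two callbacks stand in for whatever framework performs the actual training and mAP evaluation.

```python
def to_bipolar(targets):
    """Recode binary multi-label targets {0, 1} as bipolar {-1, +1}."""
    return [[2 * t - 1 for t in row] for row in targets]


def train_with_early_stopping(train_epoch, evaluate_map, max_epochs=50, patience=3):
    """Train until the mAP has not improved for `patience` consecutive epochs.

    `train_epoch` and `evaluate_map` are hypothetical callbacks: the first
    runs one training epoch, the second returns the current mAP.
    Returns (best_map, epochs_run).
    """
    best_map = float("-inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for _ in range(max_epochs):
        train_epoch()
        epochs_run += 1
        current = evaluate_map()
        if current > best_map:
            # New best score: reset the patience counter.
            best_map = current
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_map, epochs_run


if __name__ == "__main__":
    # Simulated mAP curve: training stops after epoch 5.
    fake_maps = iter([0.10, 0.15, 0.14, 0.15, 0.13, 0.20])
    best, epochs = train_with_early_stopping(lambda: None, lambda: next(fake_maps))
    print(best, epochs)  # 0.15 5
```

With an irregular mAP curve such as the one reported for the LSTM, a loop of this kind would stop early, which is consistent with the decision to disable early stopping for that model.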
6. Conclusions and future work
After testing the performance of several deep neural networks, we were able to obtain a mAP of 0.26652 with a simple LSTM network. Despite these results being worse than the 0.314 mAP of the baseline established by Google, they allow us to draw some conclusions about creating models for Google Audio Set.

First of all, we can conclude that, of all the architectures tested, LSTM networks are the most appropriate for this problem, as a relatively simple network with one LSTM layer and a fully connected layer offered better results than a more complex network with three fully connected layers. Recurrent neural networks should therefore be a good starting point when looking for better performance, for example by adding more layers to the model or implementing more complex architectures.

The use of the balanced training subset seems to improve the performance of the models despite being less than 1/20 of the dataset. However, the unbalanced training subset was only used on MLPs due to time restrictions. Its effects on the other architectures should be studied in the future.

Transforming the target vectors to a bipolar codification decreases the performance of the models; however, there seems to be a positive interaction between this codification and the use of the unbalanced training set, which could be worth researching.

7. Acknowledgements
This work has been partially supported by project “DSSL” (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitiveness of Spain and FEDER.

8. References
[1] KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 2012. p. 1097-1105.
[2] TEMKO, Andrey, et al. CLEAR evaluation of acoustic event detection and classification systems. In International Evaluation Workshop on Classification of Events, Activities and Relationships. Springer, Berlin, Heidelberg, 2006. p. 311-322.
[3] SALAMON, Justin; JACOBY, Christopher; BELLO, Juan Pablo. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014. p. 1041-1044.
[4] IEEE AASP 2018 Challenge on Detection and Classification of acoustic Scenes and Events, http://dcase.community/challenge2018/index, (accessed: 05/10/2018).
[5] Freesound database, https://freesound.org, (accessed: 05/10/2018).
[6] CAKIR, Emre, et al. Polyphonic sound event detection using multi label deep neural networks. In Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015. p. 1-7.
[7] PARASCANDOLO, Giambattista; HUTTUNEN, Heikki; VIRTANEN, Tuomas. Recurrent neural networks for polyphonic sound event detection in real life recordings. arXiv preprint arXiv:1604.00861, 2016.
[8] Audio Set, a large scale dataset of manually annotated audio events, https://research.google.com/audioset/, (accessed: 20/05/2018).
[9] GEMMEKE, Jort F., et al. Audio set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017. p. 776-780.
[10] KONG, Qiuqiang, et al. Audio Set classification with attention model: A probabilistic perspective. arXiv preprint arXiv:1711.00927, 2017.
[11] YU, Changsong, et al. Multi-level Attention Model for Weakly Supervised Audio Classification. arXiv preprint arXiv:1803.02353, 2018.
[12] Audio Set Users Group, https://groups.google.com/forum/#!forum/audioset-users, (accessed: 20/05/2018).
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
68 10.21437/IberSPEECH.2018-15
• Set A: Pitch and Energy.
• Set B: Pitch, Energy and Spectral Centroid.
• Set C: Pitch, Energy, Spectral Centroid, ZCR and Spectral Spread.
• Set D: Pitch, Energy, Spectral Centroid, ZCR, Spectral Spread and 12 MFCC coefficients.
• Set E: Pitch, Energy, Spectral Centroid, ZCR, Spectral Spread and 16 LPC coefficients.
• Set F: Pitch, Energy, Spectral Centroid, ZCR, Spectral Spread and 21 Bark features.

The first set was selected according to the studies performed in [12], where the arousal state of the speaker is shown to affect the overall energy and pitch. In addition to time-dependent acoustic features such as pitch and energy, spectral features were selected for Sets B and C as a short-time representation of the speech signal [13]. For Sets D, E and F different Cepstral-based features were added, since they have proven useful for detecting stress in the speech signal [14].

Regarding emotion detection from language, the same procedure has to be carried out. First of all, a vectorial representation of the transcribed text is needed. In this case, we hope to capture some meaning of the utterance that might help in the detection of a specific emotional status. An appropriate representation should consider some semantic information, like the word embeddings word2vec [15], doc2vec [16] or GloVe [17]. Word2vec embeddings, the simplest model, are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. However, this technique represents each word of the vocabulary by a distinct vector, without parameter sharing. In particular, it ignores the internal structure of words, which is an important limitation for morphologically rich languages. For example, in Spanish, most verbs have more than forty different inflected forms, and this leads to a vocabulary where many word forms occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Thus, [18] proposes to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. The model (known as FastText) can be seen as an extension of the continuous skip-gram model [15] which takes into account subword information.

Additionally, a way of representing the emotional status is needed in order to establish a machine learning problem. A categorical emotion description (e.g. six basic emotions) is an easy way to proceed, but it provides a quite constrained model. Affective computing researchers have started exploring the dimensional representation of emotion [19] as an alternative. Dimensional emotion recognition aims to improve the understanding of human affect by modelling affect as a small number of continuously valued, continuous time signals. It has the benefit of being able to: (i) encode small changes in affect over time, and (ii) distinguish between many more subtly different displays of affect, while remaining within the reach of current signal processing and machine learning capabilities [20]. In our work, we represent the problem of dimensional emotion recognition as a regression one, where each emotional status is represented as a three-dimensional real-valued vector. The dimensions of this vector correspond to Valence (corresponding to the concept of polarity), Arousal (degree of calmness or excitement), and Dominance (perceived degree of control over a situation): the VAD model.

In order to solve the regression problem of emotional status detection, we propose to use deep learning. For emotion detection from speech, Long-Short Term Memory (LSTM) neural networks were tested. The underlying idea is to be able to learn the relationship between present and past information even when they are far apart in time. That is, these networks have memory and can handle temporal sequences of data, like the sequence of vectors extracted from an acoustic signal. For emotional status detection from text, a classical feedforward network was considered because of its simplicity. Such networks have proven to be efficient in similar tasks, like sentiment analysis [21].

3. Experiments
We have carried out two series of experiments for the evaluation of the regression processes. In the first one we present the most interesting results related to emotion detection from speech, and in the second we show the most interesting results on emotion detection from text.

3.1. Corpus
As far as we know, there is no Spanish three-dimensional corpus within the literature, so for the experiments we have created a small corpus using the VAD model. The corpus consists of 120 fragments between 3 and 5 seconds long taken from the Spanish TV program “La Sexta Noche”. This TV program consists of political debate, news and events, in which discussions commonly appear. Each fragment has been transcribed manually and tagged using crowdsourcing techniques (the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people). In this case each fragment has been labeled by 5 different annotators, using the following questionnaire:

1. In order to address the Valence: “How do you perceive the speaker?”
   • Excited
   • Slightly Excited
   • Neutral
2. In order to address the Arousal: “His mood is . . . ”
   • Positive (nice / constructive)
   • Slightly Positive
   • Slightly Negative
   • Negative (unpleasant / non-collaborative)
3. In order to address the Dominance: “How do you perceive the speaker about the situation in which he or she is in?”
   • More dominant / controlling the situation / . . .
   • He or she does not dominate the situation, nor is he or she cowed.
   • More cowed / defensive / . . .

Once the tags were generated by crowdsourcing, the answers collected from all the annotators were transferred to the three-dimensional model, making the average of each answer
for all fragments, where the first answer of each question was assigned the value 0, the last answer was assigned the value 1, and the rest of the answers a midpoint. Then the corpus was split into two sets: 70% of the fragments were used for training purposes and the remaining 30% for testing.

3.2. Baseline models and Evaluation Metrics
Both emotion detection problems, from speech and from text, have first been tested with Linear Regression (LR) [22] and Support Vector Regression (SVR) [23] (with three different types of kernels: linear, poly, and rbf), in order to compare them with Neural Networks.

Regarding the input of these baseline models, two different approaches have been analysed to handle the time sequence. On the one hand, we fit the models with the full information, considering each feature at each time step independently (full models); on the other hand, we fit them with the mean of each feature over the time steps (mean models).

In relation to the evaluation metric, the Mean Square Error (MSE) has been used, because it seems to provide a good interpretation of how far the prediction is from the true label. In this problem, MSE is proportional to the mean of the squared distances between the points of the true label on the three-dimensional model and the predicted points on the same three-dimensional model.

3.3. Experiments with acoustic features
In order to obtain the acoustic parameters, each audio has been divided into individual frames using a context window of 25 milliseconds and a step of 10 milliseconds (as shown in Figure 1), obtaining 300 frames per audio. A vector made up of the selected acoustic features was associated with each frame. Different experiments were carried out using the different feature sets (A, B, C, D, E, F) described in Section 2. Additionally, for each set, different experiments were also performed including the first, or both the first and the second, derivatives.

The network proposed to address the regression problem of emotional status with acoustic features is a Recurrent Neural Network (RNN). The network is composed of an LSTM layer (the architecture proposed by [24]) of 10 memory cell blocks, to get a representation of the audio over time. Subsequent layers are two Dense layers which aim to infer the VAD model from the representation given by the LSTM. The first Dense layer consists of 15 units with a ReLU activation function, while the second and last consists of 3 units with a sigmoid activation function.

The output layer uses a sigmoidal activation function to take advantage of its bounded output, which lies between 0 and 1. On the other hand, the hidden layer uses the ReLU activation function because it provides good results and avoids the vanishing gradient problem [25].

The layers of the proposed network consist of a small number of units or cells so as not to build a large network architecture, since we have a training corpus of limited size.

Table 1

         LR       SVR linear   SVR poly   SVR rbf   RNN
Set A    0.1682   0.1660       0.1661     0.1691    0.1670
Set B    0.1686   0.1653       0.1685     0.1710    0.1565
Set C    0.1690   0.1679       0.1718     0.1709    0.1576
Set D    0.1742   0.1894       0.1703     0.1710    0.1665
Set E    0.1699   0.2007       0.1842     0.1711    0.1664
Set F    0.1733   0.2256       0.1898     0.1710    0.1413

As shown in Table 1, the proposed network slightly improves the results of the baseline models in almost all the sets in the corpus. It can also be concluded that, by selecting a smaller set of parameters, better results are obtained with the baseline models. However, Set A seems to have insufficient information, and Bark features are of great help in the case of networks, providing the best results.

3.4. Experiments with language features
Regarding the word representation, FastText embeddings from SBWC (https://github.com/uchile-nlp/spanish-word-embeddings/blob/master/README.md) have been used in the experiments. These embeddings are a Skipgram model of 300 dimensions and 855380 different word vectors, trained on the Spanish Billion Word Corpus (http://crscardellino.me/SBWCE/), which has more than 1.4 billion words.

The regression problem of emotional status with language features has been addressed with a small Deep Neural Network (DNN). This network consists of three similar layers: the first two layers are composed of 5 units with a sigmoidal activation function, each followed by a Dropout layer with a keep probability of 0.5 in order to prevent overfitting [26]; the last layer, the output layer, is a Dense layer of 3 units with a sigmoidal activation function (the same as in the network proposed for the regression problem of emotional status with acoustic features).

Table 2: Best result obtained with baseline models and Deep Neural Network (DNN) in the regression problem of emotional status with language features. MSE error has been used.

         LR       SVR linear   SVR poly   SVR rbf   DNN
Mean     0.1906   0.1350       0.1165     0.1197    0.1203
Full     0.1356   0.1292       0.1199     0.1229    0.1196

As shown in Table 2, the proposed network achieves results similar to those of the baseline models. It is an interesting result given the small size of the training set and the great
impact it has when building neural networks. The obtained results suggest that, by increasing the annotated training corpus, neural networks might improve on the baseline models.

4. Conclusions
The main goal of this work was to develop an automatic emotion detection system from speech and language. The system operated on acoustic fragments extracted from a TV show and their corresponding transcriptions. Each fragment was annotated by means of a crowdsourcing platform using a 3-dimensional VAD model. Different neural network architectures were tested, and the obtained results show that an RNN can outperform baseline systems when considering emotion detection from speech. Moreover, using a simple feedforward neural network with a very small training corpus (84 sentences), results similar to those obtained with baseline models can be achieved.

For further work we propose to get a bigger annotated corpus by using crowdsourcing tools to better train the proposed neural networks. Additionally, the two knowledge sources (acoustic and text) might be merged to provide a more accurate emotion detection system.

5. Acknowledgements
This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R), and by the European Commission H2020 SC1-PM15 program under RIA 7 grant 69872.

6. References
[1] J. Irastorz and M. I. Torres, “Analyzing the expression of annoyance during phone calls to complaint services,” in 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), 2016, p. 103106.
[2] S. E. Eskimez, K. Imade, N. Yang, M. Sturge-Apple, Z. Duan, and W. B. Heinzelman, “Emotion classification: How does an automated system compare to naive human coders?” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, 2016, pp. 2274–2278.
[3] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: resources, features, and methods,” Speech Communication, pp. 1162–1181, 2006.
[4] A. Mencattini, E. Martinelli, F. Ringeval, B. W. Schuller, and C. D. Natale, “Continuous estimation of emotions in speech by dynamic cooperative speaker models,” IEEE Trans. Affective Computing, vol. 8, no. 3, pp. 314–327, 2017. [Online]. Available: https://doi.org/10.1109/TAFFC.2016.2531664
[5] C. Clavel, G. Adda, F. Cailliau, M. Garnier-Rizet, A. Cavet, G. Chapuis, S. Courcinous, C. Danesi, A.-L. Daquo, M. Deldossi et al., “Spontaneous speech and opinion detection: mining call-centre transcripts,” Language resources and evaluation, vol. 47, no. 4, pp. 1089–1125, 2013.
[6] S. Mohammad, C. Dunne, and B. Dorr, “Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009, pp. 599–608.
[7] J. R. Bellegarda, “Emotion analysis using latent affective folding and embedding,” in Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text. Association for Computational Linguistics, 2010, pp. 1–9.
[8] R. Justo, T. Corcoran, S. M. Lukin, M. Walker, and M. I. Torres, “Extracting relevant knowledge for the detection of sarcasm and nastiness in the social web,” Knowledge-Based Systems, vol. 69, pp. 124–133, 2014.
[9] S. Baccianella, A. Esuli, and F. Sebastiani, “Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining.” in Lrec, vol. 10, no. 2010, 2010, pp. 2200–2204.
[10] S. Volkova and Y. Bachrach, “Inferring perceived demographics from user emotional tone and user-environment emotional contrast,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 1567–1578.
[11] R. A. Calvo and S. Mac Kim, “Emotions in text: dimensional and categorical models,” Computational Intelligence, vol. 29, no. 3, pp. 527–543, 2013.
[12] C. E. Williams and K. N. Stevens, “Vocal correlates of emotional states,” Speech evaluation in psychiatry, pp. 221–240, 1981.
[13] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden markov models,” Speech communication, vol. 41, no. 4, pp. 603–623, 2003.
[14] S. E. Bou-Ghazale and J. H. Hansen, “A comparative study of traditional and newly proposed features for recognition of speech under stress,” IEEE Transactions on speech and audio processing, vol. 8, no. 4, pp. 429–442, 2000.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
[16] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 32. JMLR.org, 2014, pp. 1188–1196.
[17] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in EMNLP. ACL, 2014, pp. 1532–1543.
[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[19] H. Gunes and M. Pantic, “Automatic, dimensional and continuous emotion recognition,” Int. J. Synth. Emot., vol. 1, no. 1, pp. 68–99, Jan. 2010.
[20] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic, “Avec 2014: 3d dimensional affect and depression recognition challenge,” in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, ser. AVEC ’14. New York, NY, USA: ACM, 2014, pp. 3–10.
[21] R. Moraes, J. F. Valiati, and W. P. G. Neto, “Document-level sentiment classification: An empirical comparison between svm and ann.” Expert Syst. Appl., vol. 40, no. 2, pp. 621–633, 2013.
[22] G. A. Seber and A. J. Lee, Linear regression analysis. John Wiley & Sons, 2012, vol. 329.
[23] C.-C. Chang, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[24] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[25] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315–323.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
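
As a rough illustration of the annotation mapping of Section 3.1 and of the MSE metric of Section 3.2, the following Python sketch converts questionnaire answers into a VAD vector and scores a prediction. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the evenly spaced values for intermediate answers are an assumption, since the text only says that those answers receive "a midpoint".

```python
from statistics import mean

# Number of answer options per question, as listed in the questionnaire.
# Evenly spaced intermediate values are an assumption of this sketch.
N_OPTIONS = {"valence": 3, "arousal": 4, "dominance": 3}
DIMENSIONS = ("valence", "arousal", "dominance")


def answer_to_value(answer_index, n_options):
    """Map a 0-based answer index to [0, 1]: first answer -> 0, last -> 1."""
    return answer_index / (n_options - 1)


def fragment_to_vad(annotations):
    """Average the answers of all annotators into one (V, A, D) vector.

    `annotations` is a list of dicts, one per annotator, holding the
    0-based index of the chosen answer for each question.
    """
    return tuple(
        mean(answer_to_value(a[dim], N_OPTIONS[dim]) for a in annotations)
        for dim in DIMENSIONS
    )


def mse(y_true, y_pred):
    """Mean squared error over all fragments and all three dimensions."""
    squared = [
        (t - p) ** 2
        for true_vec, pred_vec in zip(y_true, y_pred)
        for t, p in zip(true_vec, pred_vec)
    ]
    return sum(squared) / len(squared)
```

For example, five annotators choosing the valence answers with indices 0, 1, 2, 1 and 1 would yield a valence of 0.5 under this mapping.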
72 10.21437/IberSPEECH.2018-16
Table 1: Summary of datasets found in the literature.

#   Name                        Language and Lexicon                                    Size            Data Type     # Repetitions  # Signers  Labelled
1   Superpixel [11]             ASL: 24 alphabet signs                                  131688 images   RGB           500            5          Yes
2   Fingerspelling [12]         ASL: 24 alphabet signs and digits 1-9                   31000 images    Depth         200            5          Yes
3   Massey University [13]      ASL: 24 alphabet signs and numbers 0-9                  2524 images     RGB           5              5          Yes
4   Padova Senz3D [3]           ASL: letters B, D, I, S, and digits 2, 3, 4, 5, 9, 10   2640 images     RGB + Depth   30 + 30        4          Yes
5   Padova Kinect [14]          ASL: letters A, D, I, L, W, Y, and digits 2, 5, 7       2800 images     RGB + Depth   10 + 10        14         Yes
6   HKU Kinect Gesture [15]     ASL: letters A, L, Y and digits 1-5                     3000 images     RGB           60             5          Yes
7   NTU Microsoft Kinect [16]   ASL: letters A, L, Y and digits 1-5                     2000 images     RGB + Depth   10 + 10        10         Yes
8   Cvpr15 [17]                 ASL: 7 words and digits 1-9                             68000 images    Depth         500            8          Yes
9   RWTH-50 [18]                ASL: 83 words                                           8844 videos     Grey scale    2              3          Yes
10  ASLLVD [19]                 ASL: words                                              992 videos      RGB           2              5          Yes
11  RWTH-104 [20]               ASL: 201 sentences                                      201 videos      Grey scale    1              3          Yes
12  German Spelling [21]        GSL: 30 alphabet signs and digits 1-5                   3080 videos     Grey scale    2              20         Yes
13  RWTH PHOENIX [22]           GSL: sentences                                          592383 images   RGB           1              NA         No
14  Grades Online [10]          SSL: 750 words                                          750 videos      RGB           1              2          Yes
15  SpreadTheSign [23]          SSL: words and sentences                                +2000 videos    RGB           1              NA         Yes
16  LSA64 [24]                  ArSL: 64 words                                          3200 videos     RGB           5              10         Yes
17  SKIG [25]                   Hand gestures: 10                                       2160 videos     RGB + Depth   18 + 18        6          Yes
18  Microsoft Gesture-RC [26]   Hand gestures: 12                                       594 videos      RGB + Depth   NA             NA         No
19  ChaLearn 2016 [27]          Hand gestures                                           140945 videos   RGB + Depth   1 + 1          NA         No
of one hand. 100 randomly-selected repetitions of each sign were used for training and 10 for testing.

Two further datasets were selected only as test datasets: the RGB images of the Massey University dataset (#3), and the depth images of the Padova Senz3D dataset (#4).

3.3. Data preprocessing
Image preprocessing in machine learning typically consists of cropping, resizing and feature extraction operations. Next we describe the preprocessing tasks needed for each dataset.

As said before, the CNN used required 224x224-pixel images as input features, so resizing had to be performed for the Superpixel, Fingerspelling and Massey University datasets. Cropping (hand segmentation) is not needed for these datasets.

Regarding the UVIGO dataset, hand segmentation is performed using the information provided by the Kinect2 sensors. Kinect2 gives access to RGB and depth image streams, along with 25 body joint coordinates. In the RGB stream our segmentation algorithm uses 4 body joints to easily locate a hand. Then it crops a square around the detected hand. For the depth stream we again use 4 joints to locate the hand. This hand is segmented in the xy plane in the same way as in the RGB images. The algorithm then applies another cropping along the z axis, using the depth values of the hand to set a threshold (Figure 1). This eliminates the background from the depth images.

Finally, we also performed a segmentation on the Padova dataset. The depth stream cropping described before was applied, but in this case, instead of using the joints provided by Kinect2, we directly used the depth values to distinguish the hand from its surroundings.

the highest accuracy (99.61%) but the lowest accuracy values over the other datasets (fourth row in Table 2).

The UVIGO CNN (which was trained using both RGB and depth images) gave better results when tested over the other datasets, which suggests that it generalizes better. It follows that training with RGB pictures or depth pictures separately did not perform well in recognizing depth and RGB pictures, respectively.

As seen in Table 1, #1 and #2 both have five signers, unlike UVIGO, which has two signers. More signers mean more variation, and therefore more generalization. However, these results seem to indicate that using RGB and depth images is as important as the number of signers.

Table 2: Test comparison for the trained CNN in terms of accuracy. The labels of the first column refer to the training material used to train the CNN.

Training data / Testing data   UVIGO     Superpixel   Fingerspelling
UVIGO                          89.76%    27.97%       20.96%
Superpixel                     17.39%    92.48%       7.81%
Fingerspelling                 10.00%    6.38%        99.61%

In a second experiment, the three training datasets were jointly used to train three CNNs.

Table 3 summarizes results for the training of these three CNNs: the “All RGB” label means a CNN trained with only RGB images from the UVIGO and Superpixel datasets, the “All depth” label means a CNN trained with only depth images from the UVIGO and Fingerspelling datasets, and the “All” label means a CNN trained with both types of images. They are also evaluated on the Massey (#3) and Padova (#4) datasets.

It can be seen that the CNNs based on both RGB and depth images outperformed the CNNs trained with only one kind of image. All (Table 3) exceeded the 50% precision mark on external datasets. All depth also did so for depth, but only for the Padova dataset because it consists of depth images.
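
The depth-stream segmentation described in Section 3.3 (a square crop around the hand in the xy plane, followed by a threshold along the z axis) can be sketched in Python as follows. This is only an illustrative sketch: the function names, the window half-size, the hand depth estimate and the margin are hypothetical parameters, since the text does not state how the threshold is derived from the depth values of the hand, and the hand position is assumed to have been obtained beforehand from the body joints.

```python
def crop_square(image, center_row, center_col, half_size):
    """Crop a square window around the hand, clipped to the image borders."""
    top = max(center_row - half_size, 0)
    left = max(center_col - half_size, 0)
    bottom = min(center_row + half_size + 1, len(image))
    right = min(center_col + half_size + 1, len(image[0]))
    return [row[left:right] for row in image[top:bottom]]


def remove_background(depth_crop, hand_depth, margin):
    """Set to 0 every pixel whose depth differs from the hand's by more than `margin`."""
    return [
        [d if abs(d - hand_depth) <= margin else 0 for d in row]
        for row in depth_crop
    ]


def segment_hand(depth_image, center_row, center_col, half_size, hand_depth, margin):
    """xy crop around the hand followed by z-axis thresholding."""
    crop = crop_square(depth_image, center_row, center_col, half_size)
    return remove_background(crop, hand_depth, margin)
```

The same two steps apply to the Padova case, with the hand position estimated from the depth values instead of from the Kinect2 joints.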
hands at the same time), and a prediction time of 0.054s. The whole process was therefore performed in 0.116s.

3.5. Dataset acquisition guidelines

As mentioned, training a sign recognition NN requires significant amounts of data, with sufficient diversity in sign execution and in recording environments to ensure robustness. Considering the results, dataset acquisition guidelines are described next to meet these requirements.

It is therefore recommended to record data from at least 10 different signers and to ensure a balance between male and female and native and non-native signers. It is also recommended to spread recording sessions over several days to ensure a variety of conditions (lighting, clothes, etc.) and settings.

It is highly recommended to record RGB and depth at the same time, preferably with a resolution of at least 640x480 pixels. Depth images are useful because a more precise segmentation is possible. Another advantage is that different lighting conditions, clothes and settings do not have any influence because they are not perceived by the sensors. Nonetheless, recording sex, height and anatomical differences is important.

To train the system for each sign the same way, it is highly recommended to formalize the number of images recorded per sign, signer and recording setting. 100 images per isolated gesture and 20 videos per continuous gesture for each signer is proposed as a good compromise. For a sufficient number of signers, this represents a sufficiently large and sufficiently heterogeneous set of empirical data to avoid training the network with images too similar to each other.

Formalization also requires efficient and convenient organization and annotation. It is therefore recommended to do the following: organize the dataset in a folder tree, with a single folder per signer, each containing a sub-folder per sign and sub-sub-folders for each repetition of that sign; identify signer folders using an identification code, and name sign folders after the sign name; identify each image with the signer identification code, the recording date (or code), the repetition number and the sign name; and finally, if feasible, assign each recorded sign a code and use that instead of its name.

It is recommended to describe the files in three sections: one with the lexicon and the associated codes; one with each signer’s name and respective code, together with information on date of birth, why they use SL and their dominant hand; and one with details on the recording session, equipment used, recorder name, recording date, file name and recording conditions. Using a spreadsheet (or CSV) is recommended because it makes it very easy to filter the data and retrieve key statistics.

4. Conclusions and further work

Although a few studies exist on ASLR, progress is still limited by the lack of datasets. The fact that the few available lack reliability and convenience makes the creation of new datasets mandatory in order to progress in this field.

We have described an experimental framework designed to study the reliability of existing datasets and the combination of RGB and depth data, and have experimentally trained CNNs. The system achieved over 50% precision for challenging datasets from both natural and artificial recording environments. This method can be used to create a complete and straightforward dataset suitable for research.

Future research will focus on improving system training to match or surpass accuracy rates for both depth and RGB images separately with all datasets included and on extending recognition to the Spanish SL. An expansion of the acquired dataset is also planned, as well as the creation of two computer applications, one for recording and another for real-time gesture recognition.

5. Acknowledgements

This work has received financial support from the Xunta de Galicia (Agrupación Estratéxica Consolidada de Galicia accreditation 2016-2019, Galician Research Network TecAnDaLi ED431D 2016/011 and Grupos de Referencia Competitiva GRC2014/024) and the European Union (FEDER/ERDF).

6. References

[1] J. Isaacs and S. Foo, “Hand pose estimation for American sign language recognition,” IEEE Thirty-Sixth Southeastern Symposium on System Theory, 2004, pp. 132-136.
[2] D. Guo, W. Zhou, M. Wang and H. Li, “Sign language recognition based on adaptive HMMS with data augmentation,” IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, 2016, pp. 2876-2880.
[3] A. Memo, L. Minto and P. Zanuttigh, “Exploiting Silhouette Descriptors and Synthetic Data for Hand Gesture Recognition”, Eurographics Italian Chapter Conference, Eurographics Association, pp. 15-23, 2015.
[4] Zuzanna Parcheta and Carlos-D. Martínez-Hinarejos, “Sign Language Gesture Recognition Using HMM”, L.A. Alexandre et al. (Eds.): IbPRIA 2017, LNCS 10255, pp. 419-426, 2017.
[5] Carlos D. Martínez-Hinarejos and Zuzanna Parcheta, “Spanish Sign Language Recognition with Different Topology Hidden Markov Models”, Proceedings of the Interspeech 2017, Stockholm, Sweden, pp. 3349-3353, 2017.
[6] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, T. Darrell, “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677-691, April 1 2017.
[7] E. Tsironi, P. Barros, C. Weber and S. Wermter, “An Analysis of Convolutional Long Short-Term Memory Recurrent Neural Networks for Gesture Recognition”, Neurocomputing, vol. 268, pp. 76-86, 2016.
[8] X. Shi, Z. Chen, H. Wang and D. Y. Yeung, “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting”, Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’15), vol. 1, pp. 802-810, 2015.
[9] O. Koller, H. Ney, and R. Bowden, “Automatic Alignment of HamNoSys Subunits for Continuous Sign Language Recognition”, LREC Workshop on the Representation and Processing of Sign Languages: Corpus Mining, Portorož, Slovenia, pp. 121-128, 2016.
[10] “Grades,” [Online]. Available: http://grades.uvigo.es/. [Accessed 12 07 2018].
[11] C. Wang, Z. Liu and S. C. Chan, “Superpixel-Based Hand Gesture Recognition With Kinect Depth Camera”, IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 29-39, Jan. 2015.
[12] B. Kang, S. Tripathi, T. Q. Nguyen, “Real-time Sign Language Fingerspelling Recognition using Convolutional Neural Networks from Depth Map”, 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, 2015, pp. 136-140.
[13] A. L. C. Barczak, N. H. Reyes, M. Abastillas, A. Piccio and T. Susnjak, “A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures”, Research Letters in the Information and Mathematical Sciences, vol. 15, pp. 12-20, 2011.
[14] G. Marin, F. Dominio, P. Zanuttigh, “Hand Gesture Recognition
with Leap Motion and Kinect Devices”, IEEE International
Conference on Image Processing (ICIP), Paris, 2014, pp. 1565-
1569.
[15] Z. Ren, J. Yuan, J. Meng and Z. Zhang, “Robust Part-Based Hand
Gesture Recognition Using Kinect Sensor”, IEEE Transactions
on Multimedia, vol. 15, no. 5, pp. 1110-1120, Aug. 2013.
[16] S. C. Chan, C. Wang and Z. Liu, “Hand gesture recognition based
on canonical formed superpixel earth mover's distance”, 2016
IEEE International Conference on Multimedia and Expo (ICME),
Seattle, WA, 2016, pp. 1-6.
[17] X. Sun, Y. Wei, S. Liang, X. Tang and J. Sun, “Cascaded Hand
Pose Regression”, IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Boston, MA, 2015, pp. 824-832.
[18] M. Zahedi, D. Keysers, T. Deselaers, and H. Ney, “Combination
of Tangent Distance and an Image Distortion Model for
Appearance-Based Sign Language Recognition”, Deutsche
Arbeitsgemeinschaft für Mustererkennung Symposium (DAGM),
Lecture Notes in Computer Science, Vienna, Austria, pp. 401-408,
Aug. 2005.
[19] C. Neidle, A. Thangali and S. Sclaroff, “Challenges in
Development of the American Sign Language Lexicon Video
Dataset (ASLLVD)”, 5th Workshop on the Representation and
Processing of Sign Languages: Interactions between Corpus and
Lexicon, LREC 2012, Istanbul, Turkey, May 27, 2012.
[20] P. Dreuw, D. Rybach, T. Deselaers, M. Zahedi, and H. Ney,
“Speech Recognition Techniques for a Sign Language
Recognition System”, INTERSPEECH 2007, Antwerp, Belgium,
pp. 2513-2516, Aug. 2007.
[21] P. Dreuw, T. Deselaers, D. Keysers, and H. Ney, “Modeling
Image Variability in Appearance-Based Gesture Recognition”,
ECCV Workshop on Statistical Methods in Multi-Image and
Video Processing (ECCV-SMVP), Graz, Austria, pp. 7-18, May
2006.
[22] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater,
and H. Ney, “RWTH-PHOENIX-Weather: A Large Vocabulary
Sign Language Recognition and Translation Corpus”, Language
Resources and Evaluation (LREC), Istanbul, Turkey, pp. 3785-
3789, 2012.
[23] European Sign Language Center, “Spread The Sign,” 2006.
[Online]. Available: www.spreadthesign.com. [Accessed 11 07
2018].
[24] F. Ronchetti, F. Quiroga, C. Estrebiu, L. Lanzarini and A. Rosete,
“LSA64: An Argentinian Sign Language Dataset”, XX II
Congreso Argentino de Ciencias de la Computación (CACIC),
2015.
[25] I. Shao and L. Liu, “Learning Discriminative Representations
from RGB-D Video Data”, Proceedings of the Twenty-Third
International Joint Conference on Artificial Intelligence
(IJCAI’13), pp. 1493-1500, May 2013.
[26] Microsoft Research Cambridge-12 Kinect, “Kinect Gesture Data
Set”, [Online]. Available: https://www.microsoft.com/en-
us/download/details.aspx?id=52283. [Accessed 12 07 2018].
[27] S. Escalera, X. Baro, H. J. Escalante and I. Guyon, “ChaLearn
Looking at People: A Review of Events and Resources”, 2017
International Joint Conference on Neural Networks (IJCNN),
Anchorage, AK, 2017, pp. 1594-1601, 2017.
[28] K. Simonyan and A. Zisserman, “Very Deep Convolutional
Networks for Large-Scale Image Recognition”, arXiv preprint
arXiv:1409.1556, 2014.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-17
Figure 1: Training stages of a hybrid HMM-DNN, triphone-based acoustic model on Kaldi.
plemented an HMM-DNN ASR system in Kaldi and also conducted a comparative study between HMM-based models in both Kaldi and HTK. The Castilian Spanish SpeechDat(II) FDB-4000 audio corpus was used, which contains 43 hours of recordings from 4,000 speakers. The results indicated a 34.02% decrease in WER when comparing the most accurate DNN-based and HMM-based models from Kaldi. A decrease of 53.79% for the HMM-based model in Kaldi could also be observed over their most accurate model from HTK.

In the second work, Zorrilla et al. [17] carried out several experiments using Kaldi in order to evaluate different deep-learning approaches for acoustic modeling on well-known Spanish data sets, namely Albayzin, Dihana, CORLEC-EHU and TC-STAR. In addition, the El País text corpus was used for language modeling. The authors found through experiments that all HMM-DNN hybrid acoustic models outperformed the HMM-GMM ones and work well even with non-task-specific language models.

During the research, we also found two works that tackle the ASR problem for BP using deep neural networks. Quintanilha et al. [18] presented an open-source, character-based, end-to-end bidirectional long short-term memory (BLSTM) neural network for LVCSR. Several experiments were conducted over a data set of approximately 14 hours of recorded audio, and the best performance, evaluated in terms of label error rate, was 31.53% without the use of any language model. Bonilla et al. [19], on the other hand, proposed an end-to-end deep-learning system for recognizing digits, which is compared to a simple multilayer perceptron (MLP) network. It is not clear, however, whether the system classifies characters, words or phonemes. The best result is reported as an accuracy rate of 97.5%, against 82.8% achieved by the MLP.

According to the literature review, it appears that no previous work has developed ASR resources with Kaldi for Brazilian Portuguese. Therefore, we believe this is the first attempt to build acoustic models for BP using the toolkit’s deep learning approaches.

3. Tools and Resources for BP using Kaldi

In order to build a speech recognition system, one must be provided with a language model (LM), a phonetic dictionary and an acoustic model (AM). The resources and tools used to build each one of these three components with Kaldi are detailed below. It is worth mentioning that the LM and the dictionary are the very same ones used in CMU Sphinx. The steps to train the AMs in particular are similar for both toolkits, but some differences will be pointed out along the text. For further information about acoustic model training for BP using CMU Sphinx tools, the reader is referred to [4].

3.1. Audio Corpora

Speech recognition is a data-driven technology, which means it requires a relatively large amount of labeled data (transcribed audio) to work properly. The corpora used to train the acoustic models with Kaldi are composed of seven data sets, as summarized in Table 1. The data sets contain audio files in an uncompressed, linear, signed PCM (namely, WAVE) format, and are sampled at 16 kHz with 16 bits per sample. It is important to note that the actual number of speakers in West Point was rather reduced due to the abundance of foreign words in the corpus. Besides, the Constitution and Consumer Protection Code corpora share the same speaker.

3.2. Phonetic Dictionary and Language Model

The phonetic dictionary maps every grapheme in the lexicon (orthographic representation) to one or more phonetic transcriptions. The software described in [24] was used to include the pronunciation mapping of each of the 14,518 words into the dictionary. The trigram language model used in this work is described in [3]. It was trained with the SRILM [25] toolkit with 1.6 million phrases from the CETENFolha [26] corpus, yielding a perplexity value of 170. The LM is available in ARPA format, but in order to be used in the Kaldi environment, it was converted to the FST format using the provided arpa2fst script.
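To make the perplexity figure quoted for the language model more concrete, the following minimal sketch computes perplexity from per-token base-10 log probabilities, which is the log base stored in ARPA files. The function name `perplexity` is hypothetical, and the sketch ignores details such as out-of-vocabulary words and end-of-sentence tokens that a toolkit like SRILM accounts for.

```python
import math

def perplexity(log10_probs):
    """Perplexity of a token sequence given per-token log10 probabilities."""
    if not log10_probs:
        raise ValueError("at least one token probability is required")
    avg_log10 = sum(log10_probs) / len(log10_probs)  # average log-likelihood per token
    return 10.0 ** (-avg_log10)

# A model that assigns probability 1/170 to every token has perplexity 170,
# i.e. it is as uncertain as a uniform choice among 170 words.
uniform = [math.log10(1.0 / 170.0)] * 5
ppl = perplexity(uniform)
```

Under this reading, a perplexity of 170 means that, on average, the trigram model is as uncertain about the next word as if it had to choose uniformly among 170 candidates.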
Table 1: Audio corpora used to train acoustic models.
Table 2: Kaldi DNN tools and parameters used for training.
Table 3: WER (%) achieved by CMU Sphinx and Kaldi toolkits.
testing and the six other corpora were used for training. Unfortunately, no clusters or graphics cards could be used for training the models. Therefore, due to the computational burden and the lack of hardware resources, it was not possible to develop DNN-based AMs for all combinations of HMM-GMM acoustic models with Kaldi.

Table 3 shows the results obtained with both CMU Sphinx and Kaldi. For Kaldi, the WER was evaluated across all triphone training steps in order to perform a more complete comparison to CMU Sphinx results, since neither the LDA+MLLT stage nor the fMLLR alignment were included for this toolkit. For Sphinx, as expected, the WER decreases as we increase both the number of Gaussians and the number of tied-states of the model. However, the values seem to converge after 4,000 senones and 8 Gaussians. The lowest WER value achieved was approximately 11.1% with 4,000 senones and 16 Gaussian densities.

For Kaldi, however, we found that the convergence shown in the CMU Sphinx results does not occur. As we increase the number of senones and the number of Gaussians, the WER values drop linearly. Besides, it can be seen that the lowest WER values for the first two triphone training steps (tri-∆ and tri-∆∆) are already lower than the best one achieved by CMU Sphinx: 9.31% and 9.23%, respectively. The global, lowest WER value obtained with Kaldi was 6.5% with 8,000 tied-states and 16 Gaussians at the tri-LDA-MLLT step, which is equivalent to 128,000 leaves on the decision tree, according to Kaldi’s parameter settings (which is basically the result of the product 8,000 × 16).

5. Conclusions and Future Works

This paper addressed the first attempt to develop a speech recognition system for large vocabulary (LVCSR) in Brazilian Portuguese using the Kaldi toolkit. Triphone-based, HMM-GMM acoustic models with different values of Gaussians and tied-states were trained with Kaldi and CMU Sphinx tools in order to establish a comparison in terms of word error rate (WER). The evaluation results showed that the systems perform better as we increase the number of Gaussian densities per mixture and the number of tied-states. For CMU Sphinx, the results obtained are in accordance with [4], in spite of the current WER achieved being lower, possibly due to the larger corpora used for training the models.

Results also showed that Kaldi clearly outperformed CMU Sphinx even without the use of its deep learning tools. An explanation might be the use of the Viterbi algorithm for training (rather than Baum-Welch), as well as the use of Viterbi alignments in between each training stage, which is said to improve or refine the parameters of the model [38]. With the use of DNNs, Kaldi presents an improvement of 57.21% over the best HMM-GMM-based acoustic model built with CMU Sphinx.

As future work, we plan to finish training the HMM-DNN triphone-based AMs with Kaldi and consequently make them publicly available (together with the recipe) [8] to the community. We also expect to test with 32 and 64 densities per mixture, now also evaluating the decoding time in terms of the real-time factor (xRT) as the WER possibly decreases. Furthermore, HTK’s latest release also has an implementation of deep learning algorithms, which may join the next comparisons.
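As a reminder of how the figures above are obtained, the sketch below computes the word error rate as the word-level edit distance divided by the number of reference words, and the relative WER reduction between two systems. The function names `wer` and `relative_reduction` and the example sentences are hypothetical; the scoring scripts actually used by Kaldi and CMU Sphinx may differ in details such as text normalization.

```python
def wer(reference, hypothesis):
    """Word error rate (%) via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]                             # deleting the first i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction (%) of new_wer with respect to baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# One substitution in a four-word reference gives a WER of 25%.
example = wer("o gato bebe leite", "o gato bebeu leite")
# Best HMM-GMM figures quoted in the text: about 11.1% (CMU Sphinx) vs 6.5% (Kaldi).
gmm_gain = relative_reduction(11.1, 6.5)
```

With the rounded figures quoted in the text, the best Kaldi HMM-GMM model corresponds to a relative WER reduction of roughly 41% over the best CMU Sphinx model.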
7. References

[1] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb 1989.
[2] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, 1st ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.
[3] N. Neto, C. Patrick, A. Klautau, and I. Trancoso, “Free tools and resources for brazilian portuguese speech recognition,” Journal of the Brazilian Computer Society, vol. 17, no. 1, pp. 53–68, Mar 2011. [Online]. Available: https://doi.org/10.1007/s13173-010-0023-1
[4] R. Oliveira, P. Batista, N. Neto, and A. Klautau, “Baseline acoustic models for brazilian portuguese using cmu sphinx tools,” in Computational Processing of the Portuguese Language. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 375–380.
[5] S. Young, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book. Cambridge University Engineering Department, version 3.4, 2006.
[6] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, “Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices,” in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 1, May 2006, pp. I–I.
[7] D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann, Y. Qian, P. Schwarz, and G. Stemmer, “The kaldi speech recognition toolkit,” in IEEE 2011 workshop, 2011.
[8] GitLab. (2018) Tutorial para treino de modelo acústico com kaldi. [Online]. Available: https://gitlab.com/fb-asr/fb-am-tutorial/kaldi-am-train
[9] P. K. Sahu and D. S. Ganesh, “A study on automatic speech recognition toolkits,” in 2015 International Conference on Microwave, Optical and Communication Engineering (ICMOCE), Dec 2015, pp. 365–368.
[10] A. Becerra, J. I. de la Rosa, and E. González, “A case study of speech recognition in spanish: From conventional to deep approach,” in 2016 IEEE ANDESCON, Oct 2016, pp. 1–4.
[11] B. Popović, S. Ostrogonac, E. Pakoci, N. Jakovljević, and V. Delić, “Deep neural network based continuous speech recognition for serbian using the kaldi toolkit,” in Speech and Computer. Cham: Springer International Publishing, 2015, pp. 186–192.
[12] P. Cosi, “A kaldi-dnn-based asr system for italian,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–5.
[13] B. Karan, J. Sahoo, and P. K. Sahu, “Automatic speech recognition based odia system,” in 2015 International Conference on Microwave, Optical and Communication Engineering (ICMOCE), Dec 2015, pp. 353–356.
[14] A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, and J. Glass, “A complete kaldi recipe for building arabic speech recognition systems,” in 2014 IEEE Spoken Language Technology Workshop (SLT), Dec 2014, pp. 525–529.
[15] I. Kipyatkova and A. Karpov, “Dnn-based acoustic modeling for russian speech recognition using kaldi,” in Speech and Computer. Cham: Springer International Publishing, 2016, pp. 246–253.
[16] S. Guiroy, R. de Cordoba, and A. Villegas, “Application of the kaldi toolkit for continuous speech recognition using hidden-markov models and deep neural networks,” in Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016, ser. LNCS, vol. 10077. Springer, November 2016.
[17] A. L. Zorrilla, N. Dugan, M. I. Torres, C. Glackin, G. Chollet, and N. Cannings, “Some asr experiments using deep neural networks on spanish databases,” in Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016, ser. LNCS, vol. 10077. Springer, November 2016.
[18] I. M. Quintanilha, L. W. P. Biscainho, and S. L. Netto, “Towards an end-to-end speech recognizer for portuguese using deep neural networks,” in XXXV Simpósio Brasileiro de Telecomunicações e Processamento de Sinais, September 2017, pp. 709–714. [Online]. Available: http://www.sbrt.org.br/sbrt2017/anais/1570360756.pdf
[19] D. A. Bonilla, N. Nedjah, and L. de Macedo Mourelle, “Reconhecimento automático de fala em português usando redes neurais artificiais profundas,” in Anais do 12 Congresso Brasileiro de Inteligência Computacional, C. J. A. Bastos Filho, A. R. Pozo, and H. S. Lopes, Eds. Curitiba, PR: ABRICOM, 2015, pp. 1–6.
[20] PCD Legal. (2018) PCD legal: Acessível para todos. [Online]. Available: http://www.pcdlegal.com.br/
[21] LDC. (2018) Cslu: Spoltech brazilian portuguese version 1.0. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2006S16
[22] LDC. (2018) West point brazilian portuguese speech. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2008S04
[23] PUC-Rio. (2018) Centro de estudos em telecomunicações (CETUC). [Online]. Available: http://www.cetuc.puc-rio.br/
[24] A. Siravenha, N. Neto, V. Macedo, and A. Klautau, “Uso de regras fonológicas com determinação de vogal tônica para conversão grafema-fone em Português Brasileiro,” 7th International Information and Telecommunication Technologies Symposium, 2008.
[25] A. Stolcke, “SRILM - an extensible language modeling toolkit,” International Conference on Spoken Language Processing, 2002. [Online]. Available: http://www.speech.sri.com/projects/srilm/
[26] Linguateca. (2018) Corpus de extractos de textos electrónicos nilc/folha de s. paulo (CETENFolha). [Online]. Available: https://www.linguateca.pt/cetenfolha/
[27] GitHub. (2018) Kaldi speech recognition toolkit. [Online]. Available: https://github.com/kaldi-asr/kaldi
[28] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug 1980.
[29] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, April 1967.
[30] L. R. Welch, “Hidden markov models and the baum-welch algorithm,” in IEEE Information Theory Society Newsletter, vol. 53, 2003, pp. 10–12.
[31] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley Interscience, 2000.
[32] R. A. Gopinath, “Maximum likelihood modeling with gaussian distributions for classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, vol. 2, May 1998, pp. 661–664 vol.2.
[33] S. Matsoukas, R. Schwartz, H. Jin, and L. Nguyen, “Practical implementations of speaker-adaptive training,” in DARPA Speech Recognition Workshop, 1997.
[34] M. J. F. Gales, “Maximum likelihood linear transformations for hmm-based speech recognition,” Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998. [Online]. Available: https://doi.org/10.1006/csla.1998.0043
[35] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in INTERSPEECH 2013, 2013, pp. 2345–2349.
[36] D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of dnns with natural gradient and parameter averaging,” http://arxiv.org/pdf/1410.7455v8, Tech. Rep., 2014.
[37] Kaldi. (2018) Dan’s dnn implementation. [Online]. Available: http://kaldi-asr.org/doc/dnn2.html
[38] E. Chodroff. (2018) Kaldi tutorial: Training overview. [Online]. Available: https://www.eleanorchodroff.com/tutorial/kaldi/training-overview.html
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-18
Figure 2: Voice conversion pipeline.
presented in Section 5.
Figure 4: Overview of the search on speech system.

MCEP features or obtaining a more complex representation that makes use of these features. Since raw acoustic features do not usually exhibit a good performance in QbESDR, Gaussian posteriorgrams were used for this purpose: a Gaussian posteriorgram represents each frame of a spoken utterance by means of a vector of dimension G, where each element of this vector is the posterior probability of one of the G Gaussians in a GMM given the frame. This representation was first proposed in [12] and used for QbESDR in [13, 14, 15], to cite some examples.

After feature extraction, given a query $Q = \{q_1, \ldots, q_n\}$ and a document $D = \{d_1, \ldots, d_m\}$ of n and m frames respectively, with vectors $q_i, d_j \in \mathbb{R}^G$ and $n \ll m$, DTW finds the best alignment path between these two sequences. Subsequence DTW [25] (S-DTW) was used in this system, since it allows the partial alignment of a short sequence (the query) with a longer sequence (the document). The first step consists in computing a cumulative cost matrix $M \in \mathbb{R}^{n \times m}$ for a given query and document as follows:

$$M_{i,j} = \begin{cases} c(q_i, d_j) & \text{if } i = 0 \\ c(q_i, d_j) + M_{i-1,0} & \text{if } i > 0,\ j = 0 \\ c(q_i, d_j) + M^{*}(i,j) & \text{else} \end{cases} \tag{2}$$

where $c(q_i, d_j)$ is a function that defines the cost between query vector $q_i$ and document vector $d_j$, and

$$M^{*}(i,j) = \min\left(M_{i-1,j},\ M_{i-1,j-1},\ M_{i,j-1}\right) \tag{3}$$

In this paper, the log cosine similarity was used as the cost function as in [10] since it empirically showed a superior performance compared with other metrics:

$$\mathrm{cost}(q_i, d_j) = -\log \frac{q_i \cdot d_j}{|q_i|\,|d_j|} \tag{4}$$

This metric is normalized in order to turn it into a cost function defined in the interval [0,1]:

$$c(q_i, d_j) = \frac{\mathrm{cost}(q_i, d_j) - \mathrm{cost}_{\min}(i)}{\mathrm{cost}_{\max}(i) - \mathrm{cost}_{\min}(i)} \tag{5}$$

where $\mathrm{cost}_{\min}(i) = \min_j \mathrm{cost}(q_i, d_j)$ and $\mathrm{cost}_{\max}(i) = \max_j \mathrm{cost}(q_i, d_j)$.

After computing M, the S-DTW algorithm is used to find the best alignment path between Q and D. According to this algorithm, the best alignment path ends at frame $b^{*}$:

$$b^{*} = \arg\min_{b \in 1,\ldots,m} M_{n,b} \tag{6}$$

Then, it is possible to backtrack the whole alignment path that starts at frame $a^{*}$.

A score must be assigned to each detection of a query Q in a document D in order to measure how likely the query is present in the document. In this system, the score is length-normalised by dividing the cumulative cost by the length of the warping path [26] and z-norm is applied afterwards [27].

3. Experimental framework

The evaluation framework used in this paper is that of the QbESDR task of Albayzin 2014 search on speech evaluation. It consists of a set of spoken documents extracted from TV broadcast news in Basque language under diverse background conditions [28]. The queries were recorded in an office environment, which serves to simulate a regular user querying a retrieval system via speech. Each query includes a basic and two additional examples from different speakers; in these experiments, only the basic example is used. Two different sets of queries are included in the dataset: development (dev) queries for parameter tuning and evaluation (eval) queries to assess system performance. Table 1 summarizes some statistics of the database.

Table 1: Summary of the experimental framework used in this paper.

                              Duration
Data          # recordings    Total         Min      Max       # hits
Documents     1841            3 h 11 min    3.00 s   30.12 s   -
dev queries   100             2 min 51 s    1.35 s   2.29 s    772
eval queries  100             2 min 52 s    1.31 s   2.25 s    855

The evaluation metric used in this work to assess QbESDR performance is the maximum term weighted value (MTWV) [29], in accordance with the experimental protocol defined for Albayzin 2014 search on speech evaluation. This metric was adopted instead of actual TWV in order to ignore the performance loss caused by calibration issues.

4. Experiments and results

Before presenting the experimental results, some details of the different modules of the system described in Section 2 must be mentioned. The GMMs of the gender classification system were trained using the FA sub-corpus of Albayzin database [30], which includes around 4 hours of speech uttered by 200 different speakers (100 male, 100 female). The features used were 19 Mel-frequency cepstral coefficients (MFCCs) augmented with energy, delta and acceleration coefficients, and only voiced frames were considered. The number of mixtures of the GMMs was empirically set to 1024. The parameters $f_a$, $f_b$ and $k$ of the VC strategy were set to 700 Hz, 3000 Hz and 0.5, respectively, according to [17]. In the search stage, the silence intervals before and after the queries were automatically removed using the voice activity detection approach described in [31]. The number of Gaussians G of the GMM used for Gaussian posteriorgram computation was empirically set to 128. It must be noted that the GMM of each experiment was trained with the features extracted from its corresponding documents, so they are different for each experiment.

The first experiment aimed at comparing system performance when extracting MCEP features from the converted waveforms (Synthesized) and when straightforwardly using the converted MCEP features (Converted). As shown in Figure 5, using the converted features leads to clearly better results for dev queries. In addition, experiments were run with different values of |α| in order to analyze the influence of this parameter in QbESDR results. The figure shows that the best results were obtained when converting male utterances to female voices with |α| = π/30. The worst performance was achieved with |α| = π/12 since such a coarse conversion leads to more distorted speech according to [17]. Results with |α| = π/36 are worse than those obtained with |α| = π/30 because the
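The S-DTW search of Section 2 (normalised log cosine cost, cumulative cost matrix, best end frame, backtracking and length normalisation) can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in this work: the function names `log_cosine_cost` and `sdtw` are hypothetical, rows are indexed from 0 so the last query frame is row n-1, a small `eps` is added to avoid log(0) and division by zero, and the z-norm step is omitted.

```python
import numpy as np

def log_cosine_cost(Q, D, eps=1e-10):
    """Normalised -log cosine cost between query frames (rows of Q) and document frames."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = np.clip(Qn @ Dn.T, eps, 1.0)        # posteriorgrams are non-negative
    cost = -np.log(sim)
    cmin = cost.min(axis=1, keepdims=True)    # per query frame minimum
    cmax = cost.max(axis=1, keepdims=True)    # per query frame maximum
    return (cost - cmin) / np.maximum(cmax - cmin, eps)

def sdtw(C):
    """Return (start frame, end frame, length-normalised cost) of the best subsequence match."""
    n, m = C.shape
    M = np.empty((n, m))
    M[0, :] = C[0, :]                         # the path may start at any document frame
    for i in range(1, n):
        M[i, 0] = C[i, 0] + M[i - 1, 0]
        for j in range(1, m):
            M[i, j] = C[i, j] + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])
    b = int(np.argmin(M[n - 1]))              # best end frame
    i, j, path_len = n - 1, b, 1
    while i > 0:                              # backtrack until the first query frame
        if j == 0:
            i -= 1
        else:
            steps = [(M[i - 1, j - 1], i - 1, j - 1),
                     (M[i - 1, j], i - 1, j),
                     (M[i, j - 1], i, j - 1)]
            _, i, j = min(steps, key=lambda s: s[0])
        path_len += 1
    return j, b, M[n - 1, b] / path_len

# Toy check: the query is an exact copy of document frames 5..8.
rng = np.random.default_rng(0)
D = rng.random((20, 8)) + 0.01
Q = D[5:9].copy()
a, b, score = sdtw(log_cosine_cost(Q, D))
```

In this toy example the query is cut directly from the document, so the search is expected to recover frames 5 to 8 with a normalised cost close to zero; with real posteriorgrams the score would be compared against a detection threshold after z-norm.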
[Figure 5 (plot, panel title: Dev): MTWV (y-axis) as a function of |α| (x-axis; π/12, π/24, π/30, π/36). Curves: F, Synthesized; F, Converted; M, Synthesized; M, Converted.]

Table 2: Eval results (MTWV) for all documents and queries and for different combinations of male (M) and female (F) documents (d) and queries (q) with different systems. Conversion parameters were tuned on dev queries.

              All     Fq-Fd   Mq-Md   Fq-Md   Mq-Fd
Original      0.1659  0.2753  0.1623  0.1354  0.1969
Converted     0.1937  0.2958  0.1958  0.2013  0.2249
Synthesized   0.0835  0.2042  0.0280  0.1672  0.0272
Queries only  0.1684  0.2753  0.1623  0.1490  0.1841
7. References [18] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicen-
cio, T. Kinnunen, and Z. Ling, “The Voice Conversion Challenge
[1] F. Metze, N. Rajput, X. Anguera, M. Davel, G. Gravier, C. V. 2018: promoting development of parallel and nonparallel meth-
Heerden, G. Mantena, A. Muscariello, K. Pradhallad, I. Szöke, ods,” in Proceedings of Odyssey, 2018, pp. 195–202.
and J. Tejedor, “The spoken web search task at MediaEval 2011,”
in Proceedings of the 37th International Conference on Acoustics, [19] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker veri-
Speech and Signal Processing (ICASSP), 2012, pp. 5165–5168. fication using adapted Gaussian mixture models.” Digital Signal
[2] F. Metze, E. Barnard, M. Davel, C. V. Heerden, X. Anguera, Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
G. Gravier, and N. Rajput, “The spoken web search task,” in Pro- [20] J. Hillenbrand and M. Clark, “The role of f0 and formant frequen-
ceedings of the MediaEval 2012 Workshop, 2012. cies in distinguishing the voices of men and women,” Attention,
[3] X. Anguera, F. Metze, A. Buzo, I. Szöke, and L. Rodriguez- Perception, & Psychophysics, vol. 71, no. 5, pp. 1150–1166, 2009.
Fuentes, “The spoken web search task,” in Proceedings of the Me- [21] F. Bahmaninezhad, C. Zhang, and J. Hansen, “Convolutional neu-
diaEval 2013 Workshop, 2013. ral network based speaker de-identification,” in Proceedings of
[4] J. Tejedor, D. Toledano, P. Lopez-Otero, P. Docio-Fernandez, and Odyssey, 2018, pp. 255–260.
C. Garcia-Mateo, “Comparison of ALBAYZIN query-by-example [22] T. Zorila, D. Erro, and I. Hernaez, “Improving the quality of stan-
spoken term detection 2012 and 2014 evaluations,” EURASIP dard GMM-based voice conversion systems by considering physi-
Journal on Audio, Speech, and Music Processing, vol. 2016, no. 1, cally motivated linear transformations,” Communications in Com-
pp. 1–19, 2016. puter and Information Science (ISSN: 1865-0929), vol. 328, pp.
[5] X. Anguera, L. Rodriguez-Fuentes, I. Szöke, A. Buzo, and 30–39, 2012.
F. Metze, “Query by example search on speech at Mediaeval
[23] D. Erro, A. Alonso, L. Serrano, E. Navas, and I. Hernáez, “In-
2014,” in Proceedings of the MediaEval 2014 Workshop, 2014.
terpretable parametric voice conversion functions based on Gaus-
[6] I. Szöke, L. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, sian mixture models and constrained transformations,” Computer
J. Proença, M. Lojka, and X. Xiong, “Query by example search Speech and Language, vol. 30, no. 1, pp. 3–15, 2015.
on speech at Mediaeval 2015,” in Proceedings of the MediaEval
2015 Workshop, 2015. [24] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo,
“Finding relevant features for zero-resource query-by-example
[7] J. Tejedor, D. Toledano, P. Lopez-Otero, P. Docio-Fernandez, search on speech,” Speech Communication, vol. 84, pp. 24–35,
J. Proença, F. Perdigão, F. Garcı́a-Granada, E. Sanchis, A. Pom- 2016.
pili, and A. Abad, “ALBAYZIN query-by-example spoken term
detection 2016 evaluation,” EURASIP Journal on Audio, Speech, [25] M. Müller, Information Retrieval for Music and Motion.
and Music Processing, vol. 2018, no. 2, pp. 1–25, 2018. Springer-Verlag, 2007.
[8] H. Sakoe and S. Chiba, “Dynamic programming algorithm op- [26] A. Abad, R. Astudillo, and I. Trancoso, “The L2F spoken web
timization for spoken word recognition,” IEEE Transactions on search system for Mediaeval 2013,” in Proceedings of the Medi-
Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43– aEval 2013 Workshop, 2013.
49, 1978.
[27] I. Szöke, L. Burget, F. Grézl, J. C̆ernocký, and L. Ondel, “Calibra-
[9] T. Hazen, W. Shen, and C. White, “Query-by-example spoken tion and fusion of query-by-example systems - BUT SWS 2013,”
term detection using phonetic posteriorgram templates,” in IEEE in Proceedings of the 37th International Conference on Acoustics,
Workshop on Automatic Speech Recognition & Understanding, Speech and Signal Processing (ICASSP), 2014, pp. 7899–7903.
ASRU, 2009, pp. 421–426.
[28] J. Tejedor, D. Toledano, L. Rodriguez-Fuentes, M. Pe-
[10] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, nagarikano, A. Varona, M. Diez, and G. Bordel, “The
and M. Diez, “High-performance query-by-example spoken term ALBAYZIN 2014 search on speech evaluation plan,” 2014.
detection on the SWS 2013 evaluation,” in Proceedings of the 37th [Online]. Available: http://iberspeech2014.ulpgc.es/images/
International Conference on Acoustics, Speech and Signal Pro- EvaluationPlanSearchonSpeech.pdf
cessing (ICASSP), 2014, pp. 7869–7873.
[29] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, “Results of
[11] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “Pho-
the 2006 spoken term detection evaluation,” in Proceedings of the
netic unit selection for cross-lingual query-by-example spoken
ACM SIGIR Workshop “Searching Spontaneous Conversational
term detection,” in Proceedings of IEEE Automatic Speech Recog-
Speech”, 2007, pp. 51–56.
nition and Understanding Workshop, 2015, pp. 223–229.
[12] Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting [30] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri,
via segmental DTW on Gaussian posteriorgrams,” in IEEE Work- J. Mariño, and C. Nadeu, “Albayzin speech database: design of
shop on Automatic Speech Recognition & Understanding, ASRU, the phonetic corpus,” in EUROSPEECH, vol. 1, 1993, pp. 175–
2009, pp. 398–403. 178.
[13] G. Mantena, S. Achanta, and K. Prahallad, “Query-by-example [31] S. Basu, “A linked-HMM model for robust voicing and speech de-
spoken term detection using frequency domain linear prediction tection,” in Proceedings of International conference on acoustics,
and non-segmental dynamic time warping,” IEEE/ACM Transac- speech and signal processing (ICASSP), vol. 1, 2003, pp. 816–
tions on Audio, Speech and Language Processing, vol. 22, no. 5, 819.
pp. 944–953, 2014. [32] D. Garcia-Romero and C. Espy-Wilson, “Analysis of i-vector
[14] M. Madhavi and H. Patil, “Vtln-warped gaussian posteriorgram lenth normalization in speaker recognition systems,” in Proceed-
for qbe-std,” in Proceedings of EUSIPCO, 2017, pp. 593–597. ings of Interspeech, 2011, pp. 249–252.
[15] ——, “Combining evidences from detection sources for query-
by-example spoken term detection,” in Proceedings of APSIPA
ASC, 2017, pp. 563–568.
[16] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo,
“Compensating gender variability in query-by-example search on
speech using voice conversion,” in Proceedings of Interspeech,
2017, pp. 2909–2913.
[17] C. Magariños, P. Lopez-Otero, L. Docio-Fernandez, D. Erro,
E. Banga, and C. Garcia-Mateo, “Piecewise linear definition of
transformation functions for speaker de-identification,” in Pro-
ceedings of First International Workshop on Sensing, Processing
and Learning for Intelligent Machines (SPLINE), 2016, pp. 1–5.
86
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{pablogj, ivinalsb, ortega, amiguel, lleida}@unizar.es
10.21437/IberSPEECH.2018-19
This capability becomes very useful to carry out long and short term analysis simultaneously. LSTM networks have been modified combining two of them in a Bidirectional LSTM (BLSTM) network. One processes the sequence in the forward direction while the other one processes the sequence backwards. This way the network is able to model causal and anticausal dependencies for the same sequence.

LSTM and BLSTM networks have been successfully applied to sequence modeling tasks in speech technologies such as ASR [19] [20], language modeling [21], speaker verification in text-dependent systems [22] or machine translation [23].

[Figure: network architecture. The inputs Xt−1, Xt, Xt+1 feed a first layer of BLSTM cells (BLSTM1), followed by a second layer of BLSTM cells (BLSTM2) and a linear layer at each time step.]
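The forward/backward idea described above can be sketched in a few lines. This is a minimal illustration, not the network used in this work: the function names `scan` and `bidirectional`, the `decay` parameter and the toy tanh recurrence (used here in place of a real LSTM cell) are assumptions made for the example.

```python
import math


def scan(xs, decay=0.5):
    """Toy recurrence h_t = tanh(decay * h_{t-1} + x_t).

    Hypothetical stand-in for an LSTM cell: it only shows how a state is
    carried along the sequence, without gates or learned weights.
    """
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(decay * h + x)
        states.append(h)
    return states


def bidirectional(xs):
    """Pair a forward (causal) and a backward (anticausal) state per frame."""
    forward = scan(xs)                # state at t depends on x_0 .. x_t
    backward = scan(xs[::-1])[::-1]   # state at t depends on x_t .. x_{T-1}
    return list(zip(forward, backward))


if __name__ == "__main__":
    for t, (f, b) in enumerate(bidirectional([1.0, 0.0, -1.0])):
        print(t, round(f, 3), round(b, 3))
```

In this sketch, changing a future input alters only the backward state of earlier frames, which is the anticausal dependency mentioned in the text.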
4. Experimental setup and results

4.1. Database and metric description

The database consists of broadcast news audio in Catalan. The full database includes 87 hours of audio sampled at 16 kHz and divided into 24 files. The database was split into two parts: two thirds of the total amount of data are reserved for training, while the remaining third is used for testing. Five acoustic classes were defined for the evaluation. The classes are distributed as follows: 37% for clean speech (sp), 5% for music (mu), 15% for speech over music (sm), 40% for speech over noise (sn) and 3% for others (ot). The class “others” is not evaluated in the final test. A more detailed description of the Albayzín 2010 audio segmentation evaluation can be found in [13].

The main metric used for evaluating our results is the Segmentation Error Rate (SER), inspired by the NIST metric for speaker diarization [28]. This metric can be interpreted as the ratio between the total length of the incorrectly labeled audio and the total length of the audio in the reference. Given the dataset to evaluate Ω, each document is divided into continuous segments and the segmentation error time for each segment n is defined as:

\Xi(n) = T(n)\,\bigl[\max\bigl(N_{ref}(n), N_{sys}(n)\bigr) - N_{correct}(n)\bigr]    (2)

where T(n) is the duration of the segment n, N_{ref}(n) is the number of reference classes that are present in segment n, N_{sys}(n) is the number of system classes that are present in segment n and N_{correct}(n) is the number of reference classes that are present in segment n and were correctly assigned by the segmentation system. This way, the SER is computed as follows:

SER = \frac{\sum_{n \in \Omega} \Xi(n)}{\sum_{n \in \Omega} \bigl(T(n)\,N_{ref}(n)\bigr)}    (3)

Alternatively, the metric originally proposed for the Albayzín 2010 evaluation will also be taken into account. This metric represents the relative error averaged over all the acoustic classes:

Error = \mathrm{mean}_i \left[ \frac{dur(miss_i) + dur(fa_i)}{dur(ref_i)} \right]    (4)

where dur(miss_i) is the total duration of all miss errors for the ith acoustic class, dur(fa_i) is the total duration of all false alarm errors for the ith acoustic class, and dur(ref_i) is the total duration of the ith acoustic class according to the reference. A collar of ±1s around each reference boundary is not scored in both cases, SER and average class error, to avoid uncertainty about when an acoustic class begins or ends, and to take into account inconsistent human annotations.

4.2. Experimental results

For the experimental evaluation of our system, different front-end configurations were assessed. The starting point of our feature space exploration consists of a simple 32 log Mel filter bank. Our next step was increasing the frequency resolution by using a higher number of analysis bands in the filter bank, testing 64, 80 and 96 bands. Chroma features were incorporated in order to help our system to discriminate classes that contain music. Eventually, first and second order derivatives were computed to take into account dynamic information in the audio signal.

Results obtained on the test partition for the different front-end configurations using our BLSTM segmentation-by-classification system are presented in Table 1 in terms of SER and the Albayzín evaluation metric. When the number of analysis bands is increased, the SER decreases, reaching its minimum using 80 bands. However, it can also be seen that using a higher number of bands can degrade the system performance. This is the case of the 96 bands configuration, which increases its error compared to the 80 bands configuration. By incorporating chroma features, the error in the class “Speech over music” decreases significantly when compared to the 80 Mel coefficient configuration, with a relative improvement of 12.10%. This is due to the capabilities of chroma features to capture musical dependencies, which helps our system discriminate this class in a more accurate way. The best result for this set of experiments is obtained using the first and second order derivatives of the log Mel filter bank and the chroma features, achieving a SER of 15.91%, which corresponds to an average class error of 25.84%. The performance of our segmentation system using the resegmentation module is evaluated in the following set of experiments.

Table 1: SER, error per class and average error for BLSTM segmentation-by-classification system on the test partition for different feature configurations (Mel: log Mel filter bank, Chr: chroma, ∆ + ∆∆: 1st and 2nd order derivatives)

Feats                  SER      Class Error (%)                    Avg
                                mu      sp      sm      sn
32 Mel                 17.67    18.09   30.89   33.07   35.14    29.30
64 Mel                 17.47    17.98   31.45   31.38   34.60    28.85
80 Mel                 16.87    18.14   29.63   30.23   33.45    27.86
96 Mel                 17.33    18.07   30.81   31.48   33.94    28.58
80 Mel + Chr           16.14    17.12   30.13   26.57   31.66    26.37
80 Mel + Chr + ∆+∆∆    15.91    16.28   28.82   26.32   31.94    25.84

Aiming to illustrate how the system performance is influenced by the inertia imposed by the resegmentation module, Fig. 2 shows the scatter plot of the relative improvement in performance versus the minimum segment length (Tmin) for different values of the down-sampling factor L. It can be seen that configurations that perform better have a minimum segment length between 0.5 and 1.5 seconds, which is in the order of magnitude of the 2 seconds collar applied in the evaluation.

The results on the test partition of the full segmentation system combining the BLSTMs and the resegmentation module for the best front-end configuration evaluated (80 Mel + chroma + derivatives) and for different values of the down-sampling factor, L, and the minimum segment length, Tmin, are shown in Table 2 in terms of SER and the Albayzín evaluation metric. If we compare the best result in this Table with the best result in Table 1, it can be seen that, by incorporating
the resegmentation module, the error is reduced significantly, getting a relative improvement of 21.68% in terms of SER. It can also be observed that, as long as the Tmin value used in the configuration stays in the order of magnitude of 1 second, the performance of the system is not highly affected by the variations in the down-sampling factor. The SER for the four parameter configurations evaluated goes from 12.46% to 12.57%, which makes an absolute difference of only 0.11% between the best and the worst case. This way, by forcing a certain amount of inertia in the output of the neural network, the system is able to achieve a SER of 12.46%, decreasing significantly the error when compared to the output of the RNN.

[Figure 2: scatter plot of the relative improvement (%) versus the minimum segment length Tmin for down-sampling factors L = 25, 35, 45, 55, 65 and 75.]

Table 2 (L: down-sampling factor, Tmin: minimum segment length)

L, Tmin      SER      Class Error (%)                    Avg
                      mu      sp      sm      sn
25, 1.25s    12.49    14.55   21.99   19.08   24.88    20.13
35, 1.4s     12.48    14.31   22.26   18.70   25.10    20.10
45, 0.9s     12.46    14.19   22.14   18.82   25.04    20.05
55, 0.55s    12.57    16.12   22.00   18.94   24.95    20.50

Finally, Table 3 shows the results obtained in the Albayzín 2010 database by different systems already presented in the literature. The winning team of the Albayzín 2010 evaluation proposed a segmentation-by-classification approach based on a hierarchical GMM/HMM including MFCCs, chroma and spectral entropy as input features [29]. The best result in this database so far uses a solution based on Factor Analysis combined with a Gaussian back-end and MFCCs with 1st and 2nd order derivatives as input features [10]. When compared with this result, it can be seen that our BLSTM-HMM system performs better in all the acoustic classes. This error reduction is equivalent to a relative improvement of 15.24% in terms of SER and 15.75% in terms of the average class error. The difference in [...] vs 18.82%) with a relative improvement of 20.25%.

5. Conclusions

A new approach for audio segmentation based on RNNs is presented in this paper, proving the capabilities of this kind of models in the audio segmentation task and achieving the best result so far in the Albayzín 2010 database. A segmentation-by-classification scheme has been followed, combining a classification system, which is mainly made of 2 BLSTM layers, with a smoothing back-end implemented through a Hidden Markov Model. Several front-end configurations were evaluated, proving the capabilities of chroma features for capturing musical structures when compared to a perceptual Mel filter bank. The combination of BLSTM and HMM has been proven to be appropriate, reducing significantly the system error by forcing a minimum segment length for the segmentation labels. Competitive results have been obtained with this new approach, resulting in a relative improvement of 15.75% when compared to the best result in the literature so far.

Regarding our contributions, front-end configuration seems to have a big impact in this task, especially when classifying classes that contain music. Just by modifying the input features we have achieved a significant improvement in the performance of our system. Furthermore, the introduction of RNNs in the audio segmentation task has been proven to be successful, improving the results obtained so far with traditional statistical models such as GMM/HMM or factor analysis. In future work we intend to improve these results further by introducing more complex neural architectures after the BLSTM layers.

6. Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
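As a compact recap of the evaluation metrics defined in Section 4.1, the sketch below computes the SER of Eqs. (2) and (3) and the average class error of Eq. (4). It is only an illustration under stated assumptions: the `Segment` container, the function names and the toy durations are hypothetical, segments are assumed to be already cut so that the reference and system class sets are constant inside each one, and the ±1s collar is not handled.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one continuous segment n."""
    duration: float   # T(n), in seconds
    ref: frozenset    # reference classes present in the segment
    sys: frozenset    # system classes present in the segment


def segment_error_time(seg):
    """Eq. (2): T(n) * [max(N_ref, N_sys) - N_correct]."""
    n_correct = len(seg.ref & seg.sys)
    return seg.duration * (max(len(seg.ref), len(seg.sys)) - n_correct)


def ser(segments):
    """Eq. (3): total error time over total reference time."""
    numerator = sum(segment_error_time(s) for s in segments)
    denominator = sum(s.duration * len(s.ref) for s in segments)
    return numerator / denominator


def average_class_error(miss, fa, ref):
    """Eq. (4): mean over classes of (miss + false alarm) / reference time.

    All three arguments map a class label to a duration in seconds.
    """
    return sum((miss[c] + fa[c]) / ref[c] for c in ref) / len(ref)


# Toy data (invented): 5 s of speech labelled as music by the system.
segments = [
    Segment(10.0, frozenset({"sp"}), frozenset({"sp"})),
    Segment(5.0, frozenset({"sp"}), frozenset({"mu"})),
    Segment(5.0, frozenset({"mu"}), frozenset({"mu"})),
]
toy_ser = ser(segments)
toy_avg = average_class_error(
    miss={"sp": 5.0, "mu": 0.0},
    fa={"sp": 0.0, "mu": 5.0},
    ref={"sp": 15.0, "mu": 5.0},
)
```

The two numbers differ on the same toy data because the SER weights errors by duration, while the class-averaged metric gives every class the same weight regardless of how much audio it covers.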
7. References

[1] S. Chen, P. Gopalakrishnan et al., “Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion,” in Proc. DARPA broadcast news transcription and understanding workshop, vol. 8, 1998, pp. 127–132.
[2] C.-H. Wu, Y.-H. Chiu, C.-J. Shia, and C.-Y. Lin, “Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs,” IEEE Transactions on audio, speech, and language processing, vol. 14, no. 1, pp. 266–276, 2006.
[3] P. Delacourt and C. J. Wellekens, “DISTBIC: A speaker-based segmentation for audio data indexing,” Speech communication, vol. 32, no. 1-2, pp. 111–126, 2000.
[4] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic segmentation, classification and clustering of broadcast news audio,” in Proc. DARPA speech recognition workshop, vol. 1997, 1997.
[5] A. Misra, “Speech/nonspeech segmentation in web videos,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[6] H. Meinedo and J. Neto, “A stream-based audio segmentation, classification and clustering pre-processing system for broadcast news using ANN models,” in Ninth European Conference on Speech Communication and Technology, 2005.
[7] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on. IEEE, 2015, pp. 1–6.
[8] G. Richard, M. Ramona, and S. Essid, “Combined supervised and unsupervised approaches for automatic segmentation of radiophonic audio streams,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 2, 2007, pp. II–461.
[9] Y. Patsis and W. Verhelst, “A speech/music/silence/garbage/classifier for searching and indexing broadcast news material,” in Database and Expert Systems Application, 2008. DEXA’08. 19th International Workshop on, 2008, pp. 585–589.
[10] D. Castán, A. Ortega, A. Miguel, and E. Lleida, “Audio segmentation-by-classification approach based on factor analysis in broadcast news domain,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, p. 34, 2014.
[11] J. Ajmera, I. McCowan, and H. Bourlard, “Speech/music segmentation using entropy and dynamism features in a hmm classification framework,” Speech communication, vol. 40, no. 3, pp. 351–363, 2003.
[12] L. Lu, H. Jiang, and H. Zhang, “A robust audio classification and segmentation method,” in Proceedings of the ninth ACM international conference on Multimedia. ACM, 2001, pp. 203–211.
[13] T. Butko and C. Nadeu, “Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2011, no. 1, p. 1, 2011.
[14] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[15] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8599–8603.
[16] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2012.
[17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[18] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” 1999.
[19] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[20] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
[21] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Thirteenth annual conference of the international speech communication association, 2012.
[22] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.
[23] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[24] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using chroma-based representations for audio thumbnailing,” in Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the. IEEE, 2001, pp. 15–18.
[25] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[26] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito, “Automatic differentiation in PyTorch,” Advances in Neural Information Processing Systems 30, pp. 1–4, 2017.
[27] F. Gustafsson, “Determining the initial states in forward-backward filtering,” IEEE Transactions on Signal Processing, vol. 44, no. 4, pp. 988–992, 1996.
[28] NIST, “The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan,” (Melbourne 28-29 May 2009).
[29] A. Gallardo Antolín and R. San Segundo Hernández, “UPM-UC3M system for music and speech segmentation,” 2010.
10.21437/IberSPEECH.2018-20

Abstract

State-of-the-art Natural Language Recognition systems allow transcribers to speed up the transcription of audio, video or image documents. These systems provide transcribers an initial draft transcription that can be corrected with less effort than transcribing the documents from scratch. However, even the drafts offered by the most advanced systems based on Deep Learning contain errors. Therefore, the supervision of those drafts by a human transcriber is still necessary to obtain the correct transcription. This supervision can be eased by using interactive and assistive transcription systems, where the transcriber and the automatic system cooperate in the amending process. Moreover, the interactive system can combine different sources of information in order to improve its performance, such as text line images and the dictation of their textual contents.

In this paper, the performance of a multimodal interactive and assistive transcription system is evaluated on one Spanish historical manuscript. Although the quality of the draft transcriptions provided by a Handwriting Text Recognition system based on Deep Learning is already good, the proposed interactive and assistive approach reveals an additional reduction of transcription effort. Besides, this effort reduction is increased when using speech dictations over an Automatic Speech Recognition system, allowing for a faster transcription process.

Index Terms: speech recognition, human-computer interaction, handwriting recognition, assistive transcription, deep learning

[...] used to integrate the transcriber feedback into the main stream of information for word correction, by using on-line handwriting and speech [7, 8, 9]. In this work, we explore the idea of using speech dictation for feeding the interactive system with an additional source of information of the full text to transcribe.

The rest of the paper is organised as follows: Section 2 presents our multimodal proposal; Section 3 introduces the experimental framework; Section 4 explains the performed experiments and the obtained results; finally, Section 5 offers the conclusions and future work lines.

2. Multimodal Computer Assisted Transcription of Text Images

This section presents our proposal, which is composed of two parts, multimodal recognition and interaction.

2.1. Multimodal Recognition Framework

The natural language recognition problem aims to recover the text represented in an input signal. In the case of HTR, this input signal is usually a segmented line of a digitalised handwritten document [10]. Then, given a handwritten text line image or a speech signal represented by a feature vector sequence x = (x_1, x_2, ..., x_{|x|}), the problem for HTR and ASR is finding the most likely word sequence ŵ [2], that is:

\hat{w} = \arg\max_{w \in W} P(w \mid x) = \arg\max_{w \in W} P(x \mid w)\,P(w)    (1)
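Equation (1) can be illustrated with a toy search over a small, explicit candidate set. This is only a sketch under stated assumptions: real HTR and ASR decoders do not enumerate W in this way, and the function name `decode`, the candidate words and all the scores below are invented for illustration. Log-probabilities are used so that the product in Eq. (1) becomes a sum.

```python
def decode(candidates, log_likelihood, log_prior):
    """Return the candidate w maximising log P(x | w) + log P(w)."""
    return max(candidates, key=lambda w: log_likelihood[w] + log_prior[w])


# Invented scores for three competing hypotheses of a single word.
candidates = ["Hospital", "Hostal", "Hogar"]
log_likelihood = {"Hospital": -5.9, "Hostal": -5.3, "Hogar": -9.0}  # log P(x | w)
log_prior = {"Hospital": -1.0, "Hostal": -3.0, "Hogar": -2.0}       # log P(w)

best = decode(candidates, log_likelihood, log_prior)
```

With these made-up numbers the hypothesis with the best morphological score alone is not the one selected once the language model term is added, which is the role P(w) plays in Eq. (1).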
[Figure: word lattice for the example text line, with scored hypotheses including Cristo, de, la, gloria, Agonía, Alegría, paciente, creciente, procedente, del, capital, Hostal, Hogar, este, esta; and system diagram with the blocks Preprocess, DepthConcat, Columnwise Concat, and WFST decoding and lattice generation constrained to a language model.]
2.2. Multimodal and Interactive Framework

In the CATTI framework the transcriber is involved in the transcription process, since he/she is responsible for validating and/or correcting the system hypothesis. The system takes into account the handwritten text image and the transcriber feedback in order to improve the proposed hypotheses [6]. An example of a CATTI operation is shown in Figure 2. In this example, in the traditional post-edition approach, a transcriber would have to correct two errors in the recognised hypothesis (Agonía and Hospital). However, using our interactive approach only one explicit user-correction is necessary to get the correct transcription.

Formally, in the traditional CATTI framework [6], the system uses a given feature sequence, x_{htr}, representing a handwritten text line image and a user validated prefix p of the transcription. In this work, in addition to x_{htr}, a sequence of feature vectors x_{asr}, which represents the speech dictation of the textual contents of the text line image, is used to improve the system performance. Therefore, the CATTI system should try to complete the validated prefix by searching for a most likely suffix ŝ taking into account both sequences of feature vectors. Following the assumptions presented in [15], the CATTI problem can be formulated as:

\hat{s} = \arg\max_{s} P(x_{htr} \mid p, s) \cdot P(x_{asr} \mid p, s) \cdot P(s \mid p)    (2)

where the concatenation of p and s is w. As in conventional HTR and ASR, P(x_{htr} | p, s) and P(x_{asr} | p, s) can be approximated by morphological models and P(s | p) by a language model conditioned by p. Therefore, the search must be performed over all possible suffixes of p [6].

This suffix search can be efficiently carried out by using lattices [6] obtained from the combination of the HTR and ASR recognition outputs. In each interaction step, the decoder parses the validated prefix p over the lattice and then continues searching for a suffix which maximises the posterior probability according to Equation (2). This process is repeated until a complete and correct transcription of the input text line image is obtained.

3. Experimental Framework

This section presents the datasets, the preprocess, the models, the system setup, and the evaluation metrics used in the experiments.

3.1. Datasets

The datasets used in this work correspond to a Spanish historical manuscript, a Spanish phonetic corpus, and a set of speech samples provided by five different native Spanish speakers.

3.1.1. Historical Manuscript: The Cristo Salvador Corpus

The Cristo Salvador corpus is a 19th century Spanish manuscript provided by Biblioteca Valenciana Digital (BiValDi), and it is publicly available for research purposes on the website of the Pattern Recognition and Human Language Technology (PRHLT) research center (https://www.prhlt.upv.es). It is a single writer book composed of 53 pages (the page 41 is presented in Figure 3)
[Figure 2: example of a CATTI operation. Inputs: text line image and its speech dictation.]

ITE-0   p
        ŝ     Cristo de la Alegría, procedente del Hogar: esta
ITE-1   m     ⇑
        p     Cristo de la
        ŝ     Agonía, procedente del Hogar: esta
ITE-2   m     ⇑
        p     Cristo de la Agonía, procedente del
        ŝ     Hostal: esta
        v     Hospital
        p     Cristo de la Agonía, procedente del Hospital
        ŝ     : esta
FINAL   v     #
        p≡t   Cristo de la Agonía, procedente del Hospital: esta
that were manually divided into lines (such as the line shown at
the top of Figure 4). This corpus presents some problematic im- Figure 4: Examples of an extracted text line image (top) and the
age features, such as smear, background variations, differences result of the preprocess given to the neural network (bottom).
in bright, and bleed-through (ink that trespasses to the other sur-
face of the sheet).
We followed the directives of the hard partition defined in
3.2. Preprocess and Feature Extraction
previous works [16, 17]. The first 30 pages (662 text lines) were
used for training the optical and language models, while the fol- All the text line images were scaled to 64 pixels in height and a
lowing 3 pages (78 text lines) were used for validation purposes. pre-processing was applied for correcting the slant and remov-
The test set was composed of the lines of the page 41 (24 lines, ing the background noise [19]. A text line image and the result-
222 words); this page was selected for being, according to pre- ing image after the image preprocess are presented in Figure 4.
liminary error recognition results, a representative page of the With respect to speech feature extraction, 39 Mel-
whole test set (the remaining 20 pages, 473 lines). This corpus Frequency Cepstral Coefficients composed of the first 12 cep-
contains 1213 lines, with a vocabulary of 3451 different words, strals and log frame energy with first and second order deriva-
and a set of 92 different characters, taking into account lower- tives were extracted from the audio files [20].
case and uppercase letters, numbers, punctuation marks, special
symbols, and blank spaces.

3.1.2. Speech Dataset: Albayzin and Cristo Salvador

The Spanish phonetic corpus Albayzin [18] was used for training the ASR
acoustical models. This corpus consists of a set of three sub-corpora
recorded by 304 speakers using a sampling rate of 16 kHz and 16-bit
quantisation. The training partition used in this work includes a set of
6800 phonetically balanced utterances, specifically, 200 utterances read by
four speakers, 25 utterances read by 160 speakers, and 50 sentences read by
40 speakers, with a total duration of about 6 hours. A set of 25 acoustical
classes (23 monophones, short silence, and long silence) was estimated from
this corpus.

Test data for ASR was obtained by recording five different native Spanish
speakers dictating the contents of the lines of page 41 (i.e., a total set
of 120 utterances, with a total duration of about 9 minutes), using a
sampling rate of 16 kHz and 16-bit encoding (to match the conditions of the
Albayzin data).

3.3. Models

Optical models are Convolutional Recurrent Neural Networks (CRNN), which
consist of a convolutional block and a recurrent block [21]. The
convolutional block is composed of 3 convolutional layers of 16, 32, and 48
feature maps. Each convolutional layer has kernel sizes of 3 × 3 pixels,
horizontal and vertical strides of 1 pixel, LeakyReLU as activation
function, and a maximum pooling layer with non-overlapping kernels of 2 × 2
pixels only at the output of the first two layers. Then, the recurrent block
is composed of 3 recurrent layers. Each recurrent layer is composed of 256
Bidirectional Long-Short Term Memory (BLSTM) units. Finally, a linear
fully-connected output layer is used after the recurrent block. Those models
were trained using Laia [22].

Acoustical models were trained using EESEN [13]. This acoustical model is a
Recurrent Neural Network (RNN) composed of 351 inputs for 9 neighbouring
frames of cepstral features, 6 hidden layers with 250 BLSTM units, and an
output layer with a softmax function [12].
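As an illustration of the acoustical model input described above, the
following minimal sketch stacks 9 neighbouring frames into a single input
vector. The per-frame size of 39 coefficients is only inferred from
351 / 9 = 39 and is not stated in the text; the symmetric context (4 frames
on each side) and the repetition of edge frames as padding are likewise
assumptions, and the function name splice is hypothetical. It is not the
EESEN implementation.

```python
FRAME_DIM = 39  # inferred from 351 inputs / 9 frames (assumption)
CONTEXT = 4     # 4 left + current + 4 right = 9 frames (assumed symmetric)


def splice(frames, context=CONTEXT):
    """Stack each frame with its neighbours, repeating edge frames as padding."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp index to the utterance
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 20 frames whose coefficients all equal the frame index.
frames = [[float(t)] * FRAME_DIM for t in range(20)]
inputs = splice(frames)
print(len(inputs), len(inputs[0]))  # number of frames, input size per frame
```

Each spliced vector has 9 × 39 = 351 components, which matches the number of
network inputs reported for the acoustical model.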
The lexicon models for both modalities are in HTK lexicon format, where each
word is modelled as a concatenation of characters for HTR or phonemes for
ASR.

The particularities of historical manuscripts, such as writing style, epoch
and subject, make it very difficult to find external resources that would
help to improve the models. In general, a part of the book is used to train
the models that are used to automatically transcribe the rest of the book.
Therefore, the language model (LM) was estimated directly from the
transcriptions of the pages included in the HTR training set using the SRILM
ngram-count tool [23]. This language model is a 2-gram with Kneser-Ney
back-off smoothing [24] interpolated with the whole lexicon in order to
avoid out-of-vocabulary words, and it presents a perplexity of 742.8 for the
test data.

3.4. System Setup

As previously stated, the decoding and lattice generation based on WFST for
both modalities were implemented using the EESEN recogniser [13]; however,
the multimodal lattice combination was performed using lattice-combine from
Kaldi [25].

In order to optimise the presented multimodal and interactive framework, the
values of the main variables were set up on a validation set, as well as the
limit of mouse actions for correcting each erroneous word in the interactive
transcription experiments, which was set to 3 [6].

3.5. Evaluation Metrics

The quality of the transcriptions is assessed using the Word Error Rate
(WER), which allows us to obtain a good estimation of the transcriber
post-edition effort. The WER is based on the Levenshtein edit distance [26]
and it can be defined as the minimum number of words that have to be
substituted, deleted and inserted to transform the transcription into the
reference text, divided by the number of words in the reference text.

The quality of the lattices can be defined as the quality of the best
hypotheses contained in them, which is known as the oracle error rate. Then,
the quality of the word lattices is estimated by the oracle WER, which
represents the smallest WER that can be obtained from the word sequences
contained in them.

The overall interactive performance is given by the Word Stroke Ratio (WSR),
which can also be computed by using the reference text. After each
hypothesis proposed by the system, the longest common prefix between the
hypothesis and the reference text is obtained and the first error from the
hypothesis is corrected. This process is iterated until a full match is
achieved. Therefore, the WSR can be defined as the number of user
corrections that are necessary to produce correct transcriptions using the
interactive system, divided by the total number of words in the reference
text. This definition makes the WER comparable to the WSR. The relative
difference between them gives us the effort reduction (EFR), which is an
estimation of the reduction of the transcription effort that can be achieved
by using the interactive system.

The statistical significance of the experimental results is estimated by
means of p-values with a threshold of significance of α = 0.05 that were
calculated through the Welch t-test [27] using the statistical computing
tool R [28].

Table 1: Experimental Results.

              Post-edition          CATTI
Experiment    WER     Oracle WER    WSR     EFR      P-value
HTR           8.9%    1.8%          4.1%    53.9%    0.0511
ASR           31.4%   8.5%          10.4%   −16.9%   0.3898
Multimodal    10.6%   0.8%          1.8%    79.8%    0.0004

concretely it presents a WER equal to 8.9% and an oracle WER equal to 1.8%.
In this case, speech recognition does not seem to be a good substitute for
handwriting recognition. The lattices obtained by the speech recognition
system present a WER equal to 31.4% and an oracle WER equal to 8.5%.

Regarding multimodality, the lattices obtained from the lattice combination
of both modalities present a WER equal to 10.6%. However, these multimodal
lattices present an oracle WER equal to 0.8%. Even though the combination
technique does not improve the unimodal HTR WER, it reduces the oracle WER
substantially. Therefore, an outstanding effect on interactive transcription
can be expected, since the oracle WER is related to the quality of the
alternatives offered by the interactive and assistive system (the lower the
oracle WER, the better the alternatives).

Concerning the CATTI results, 4.1% of estimated interactive human effort
(WSR) was required for obtaining the perfect transcription from the HTR
lattices, which represents 53.9% of relative effort reduction (EFR) over the
HTR baseline (WER equal to 8.9%, p = .051). On the other hand, no effort
reduction can be considered when only ASR is used at the input of the
interactive system. However, as expected, the multimodal combination not
only represents a 56.1% relative improvement in the estimated interactive
human effort (1.8% over 4.1%, p = .091), but its improvements are also
statistically significant when compared with the HTR baseline (EFR equal to
79.8%, p < .001).

5. Conclusions

In this paper, the use of multimodal combination for improving the CATTI
system presented in previous works has been studied. Multimodal combination
allows us to provide additional sources of information to the assistive
transcription system, such as speech dictation of the textual contents of
the document to transcribe.

The obtained results show that the combination technique used, even though
it does not improve the best hypothesis offered by the unimodal HTR system,
may produce new bigrams that increase the search alternatives. Moreover, the
adjustment of the word posterior probabilities can increase the
probabilities of the correct words, reaching better hypotheses that allow
the assistive transcription system to provide an additional and significant
reduction of the human effort.

In future work, we will study the use of other combination techniques, the
use of sentences in the handwritten text corpus instead of lines (in order
to make multimodality more natural), and the use of the information of
context given by the previous lines.
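As a worked check of the evaluation metrics of Section 3.5, the following
minimal sketch computes a word-level WER with the Levenshtein distance and
derives the EFR values from the WER and WSR figures of Table 1. The function
names are hypothetical and are not part of the toolkits used in this work;
taking the unimodal HTR WER (8.9%) as the baseline for all three experiments
is an assumption, chosen because it is consistent with the EFR column of
Table 1.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Edit operations divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(hypothesis.split(), ref_words) / len(ref_words)


def efr(baseline_wer, wsr):
    """Relative difference between the post-edition and interactive effort."""
    return (baseline_wer - wsr) / baseline_wer


HTR_WER = 8.9  # post-edition baseline from Table 1 (percent)
for name, wsr in [("HTR", 4.1), ("ASR", 10.4), ("Multimodal", 1.8)]:
    print(f"{name}: EFR = {100 * efr(HTR_WER, wsr):.1f}%")
```

With the values of Table 1 this gives 53.9%, -16.9% and 79.8%, in agreement
with the reported EFR column.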
7. References

[1] S. Chhabra, G. Gupta, M. Gupta, and G. Gupta, “Detecting Fraudulent Bank
Checks,” in Proc. of the 15th IFIP International Conference on Digital
Forensics, 2017, pp. 245–266.
[2] A. H. Toselli, A. Juan, D. Keysers, J. González, I. Salvador, H. Ney,
E. Vidal, and F. Casacuberta, “Integrated Handwriting Recognition and
Interpretation using Finite-State Models,” Int. Journal of Pattern
Recognition and Artificial Intelligence, vol. 18, no. 4, pp. 519–539, 2004.
[3] T. Bluche, H. Ney, and C. Kermorvant, “A Comparison of Sequence-Trained
Deep Neural Networks and Recurrent Neural Networks Optical Modeling for
Handwriting Recognition,” in Proc. of the 2nd SLSP, 2014, pp. 199–210.
[4] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and
J. Schmidhuber, “A Novel Connectionist System for Unconstrained Handwriting
Recognition,” IEEE Transaction on PAMI, vol. 31, no. 5, pp. 855–868, 2009.
[5] S. España-Boquera, M. Castro-Bleda, J. Gorbe-Moya, and
F. Zamora-Martínez, “Improving offline handwriting text recognition with
hybrid HMM/ANN models,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 33, no. 4, pp. 767–779, 2011.
[6] V. Romero, A. H. Toselli, and E. Vidal, Multimodal Interactive
Handwritten Text Transcription, ser. Series in Machine Perception and
Artificial Intelligence (MPAI). World Scientific Publishing, 2012.
[7] E. Granell, V. Romero, and C. D. Martínez-Hinarejos, “An Interactive
Approach with Off-line and On-line Handwritten Text Recognition Combination
for Transcribing Historical Documents,” in Proc. of the 12th IAPR-DAS, 2016,
pp. 269–274.
[8] C.-D. Martínez-Hinarejos, E. Granell, and V. Romero, “Comparing
different feedback modalities in assisted transcription of manuscripts,” in
Proc. of the 13th IAPR-DAS, 2018, pp. 115–120.
[9] A. Toselli, V. Romero, M. Pastor, and E. Vidal, “Multimodal interactive
transcription of text images,” Pattern Recognition, vol. 43, no. 5,
pp. 1824–1825, 2010.
[10] V. Romero, J. A. Sanchez, V. Bosch, K. Depuydt, and J. de Does,
“Influence of text line segmentation in handwritten text recognition,” in
Proc. of the 13th ICDAR, 2015, pp. 536–540.
[11] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press,
1998.
[12] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep
recurrent neural networks,” in Proc. of ICASSP, 2013, pp. 6645–6649.
[13] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech
recognition using deep RNN models and WFST-based decoding,” in Proc. of
ASRU, 2015, pp. 167–174.
[14] H. Xu, D. Povey, L. Mangu, and J. Zhu, “Minimum bayes risk decoding and
system combination based on a recursion for edit distance,” Computer Speech
& Language, vol. 25, no. 4, pp. 802–828, 2011.
[15] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, “Multimodality,
interactivity, and crowdsourcing for document transcription,” Computational
Intelligence, vol. 34, no. 2, pp. 398–419, 2018.
[16] V. Romero, A. H. Toselli, L. Rodríguez, and E. Vidal, “Computer
Assisted Transcription for Ancient Text Images,” in Image Analysis and
Recognition, ser. Lecture Notes in Computer Science, M. Kamel and
A. Campilho, Eds. Springer Berlin Heidelberg, 2007, vol. 4633,
pp. 1182–1193.
[17] V. Alabau, V. Romero, A. L. Lagarda, and C. D. Martínez-Hinarejos, “A
Multimodal Approach to Dictation of Handwritten Historical Documents,” in
Proc. 12th Interspeech, 2011, pp. 2245–2248.
[18] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri,
J. B. Mariño, and C. Nadeu, “Albayzin speech database: design of the
phonetic corpus,” in Proc. of EuroSpeech, 1993, pp. 175–178.
[19] M. Villegas, V. Romero, and J. A. Sánchez, “On the modification of
binarization algorithms to retain grayscale information for handwritten text
recognition,” in Proc. of IbPRIA, 2015, pp. 208–215.
[20] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition.
Prentice Hall, 1993.
[21] J. Puigcerver, “Are Multidimensional Recurrent Layers Really Necessary
for Handwritten Text Recognition?” in Proc. of the 14th ICDAR, vol. 1. IEEE,
2017, pp. 67–72.
[22] J. Puigcerver, D. Martin-Albo, and M. Villegas, “Laia: A deep learning
toolkit for HTR,” 2016. [Online]. Available:
https://github.com/jpuigcerver/Laia/
[23] A. Stolcke, “SRILM-an extensible language modeling toolkit,” in Proc.
of the 3rd Interspeech, 2002, pp. 901–904.
[24] R. Kneser and H. Ney, “Improved backing-off for m-gram language
modeling,” in Proc. of ICASSP, vol. 1, 1995, pp. 181–184.
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel,
M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and
K. Vesely, “The Kaldi Speech Recognition Toolkit,” in Proc. of ASRU, 2011.
[26] V. I. Levenshtein, “Binary codes capable of correcting deletions,
insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8,
pp. 707–710, February 1966.
[27] B. L. Welch, “The Generalization of ‘Student’s’ Problem when Several
Different Population Variances are Involved,” Biometrika, vol. 34, no. 1/2,
pp. 28–35, 1947.
[28] R Core Team, R: A Language and Environment for Statistical Computing,
R Foundation for Statistical Computing, Vienna, Austria, 2017,
https://www.R-project.org/. Last access: May 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-21
In this paper we present an empirical evaluation of the effectiveness of
pronunciation training for Japanese students of Spanish as a foreign
language. Corrective feedback in pronunciation training for second language
acquisition can be implicit or explicit, depending on whether or not the
learner is informed of the corrected form of the error [18]. Since it is
recognized as an essential component of this kind of system, an explanation
mode has been included, comprising textual and audiovisual material which
provides explanations and illustrations on the correct articulation of the
sounds included in each lesson. We also provide a guided path for users to
follow in order to complete all activities, based on their results.
Corrective feedback should preferably be unambiguous, understandable,
detectable and short, and should take account of learner characteristics,
both proficiency and literacy level [19]. We provide different types of
feedback depending on the tasks and results (see subsection 2.2).
In section 2, the experimental procedure is described, which includes the
participants and protocol stages, the speech material processing and the
description of Japañol. The Results section shows the users’ degree of
interaction with the CAPT system, the most difficult Spanish sounds found in
perception and production tasks, and the consistency, correlation and
improvement of the users’ human rater and ASR scores. We end the paper with
a discussion about the relevance of the results, the conclusions and future
work.

2. Experimental procedure

2.1. Protocol description

A total of eight native Japanese students between 20 and 22 years old were
selected as participants for our experiment. All participants qualified for
the same Spanish as a foreign language course for beginners (A2-B1) at the
Language Center of the University of Valladolid. In this way we guaranteed
(1) that all students had the same initial level of Spanish and (2) that our
experiment realistically reproduced the variety of persons that attend this
type of course.

Systematization of Spanish mispronunciations produced by Japanese speakers
was the first step to choose minimal pairs in Spanish [20]. A set of 56
words, chosen according to the pairs to be worked out along the game and
selected according to their phonetic difficulty, were spoken by each of the
8 participants, both before (pre-test) and after (post-test) the training
sessions. Participants exclusively used the CAPT system in the three
training sessions, which lasted a maximum of 45 minutes each and were
separated by a delay of at least 48 hours. A total of 84 minimal pairs
corresponding to mispronunciations were presented to participants, 12 in
each of the seven lessons. Students were asked not to complete more than
three lessons per session. The software application was installed under an
Android emulator (NOX App player²) in the computers of the Language Center
multimedia laboratory. At the beginning of the first session, students were
instructed on how to use the software. Then, they had to work individually
and did not have any communication with either instructors or classmates.
Each participant worked inside an individual cubicle equipped with a headset
with microphone.

Once the post-test was finished, a manual revision and isolation of the
recorded words of the pre-test and post-test was carried out in order to
elaborate a perceptual listening test. Five expert phoneticians and native
speakers assigned a correct/incorrect value to each word, plus some extra
annotations about the pronunciation which are not used in this work. All
data was presented to human raters randomly and without user association.
The audio files were also processed with the same ASR used in the
application (Google ASR). Human raters were asked to concentrate on the
specific sound of each word which should be generated correctly, ignoring
the bad pronunciations of the rest of the sounds.

² Official website (last visited June 24th, 2018): https://www.bignox.com/

2.2. Software tool of the CAPT system

Figure 1 shows the regular sequence of steps in order to complete a lesson
in Japañol. After the user’s authentication (step 1), seven lessons are
presented at the main menu of the application (step 2). Each lesson includes
a pair of Spanish sound contrasts and users achieve a particular score,
expressed as a percentage. Lessons are divided into five main training
modes: Theory, Exposure, Discrimination, Pronunciation and Mixed modes
(step 3), each of which proposes several task-types with a fixed number of
mandatory task-tokens. The final lesson score is the mean score of the last
three modes. Users are guided by the system in order to complete all
training modes of a lesson. When reaching a score below 60% in
Discrimination, Pronunciation or Mixed modes, users are recommended to go
back to Exposure mode as a feedback resource and then return to the failed
mode. Besides, the next lesson is enabled when users reach a minimum score
of 60%.

Figure 1: Standard flow to complete a lesson in Japañol.

The first training mode is Theory (step 4). A brief and simple video
describing the target contrast of the lesson is presented to the user as a
first contact with feedback. At the end of the video, the next mode becomes
available, but users may choose to review the material as many times as they
want. Exposure (step 5) is the second mode. Users strengthen the lesson
contrast experience previously introduced in Theory mode, in order to
support their assimilation. Three minimal pairs are displayed to the user.
In each one of them, both words are synthetically produced by Google TTS
five times (highlighting the current word), alternately and slowly. After
that, users must record themselves at least once per word and listen to
their own and the system’s sound. Words are represented with their
orthographic and phonemic forms. A replay button allows users to listen to
the specified word again. Step 6 refers to Discrimination mode, in which ten
minimal pairs are presented to the user consecutively. In each one of them,
one of the words is synthetically produced, randomly. The challenge of this
mode consists of identifying which word is produced. As feedback elements,
words have their orthographic and phonetic transcription representations.
Users can also request to listen to the target word again with a replay
button. Speed varies alternately between slow and normal speed rates.
Finally, the system changes the word color to green (success) or red
(failure) with a chime sound. Pronunciation is the fourth mode (step 7),
whose aim is to produce as well as possible both words, separately, of the
five minimal pairs presented with their phonetic transcription. Google’s ASR
determines automatically and in real time acceptable or non-acceptable
inputs. In each production attempt the tool displays a text message with the
recognized speech, plays a right/wrong sound and changes the word’s color to
green or red. The maximum number of attempts per word is five in order not
to discourage users. However, after three consecutive failures, the system
offers the user the possibility of requesting a word synthesis as explicit
feedback as many times as they want with a replay button. Mixed mode is the
last mode of each lesson (step 8). Nine production and perception tasks
alternate at random in order to further consolidate obtained skills and
knowledge.

Table 1: Events by user with the CAPT tool in the whole experiment. THE,
EXP, DIS, PRO and MIX correspond to Theory, Exposure, Discrimination,
Pronunciation and Mixed modes, respectively. n, m and M are the mean,
minimum and maximum values. Time (min) row represents the time spent in
minutes per person in each mode in the whole experiment. #Tries is the
number of times a mode is practiced by each user. Mand. and Req. mean
mandatory and requested TTS listenings. Productions use ASR.

Table 2: Confusion matrix of discrimination task-tokens (diagonal: right
discrimination task-tokens). The rows are the sounds expected by the tool
and the columns are the sounds selected by the user. success refers to the
success rate of the corresponding sound row. #Lis is the number of requested
listenings to the sound row.
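The scoring and guidance rules of Section 2.2 can be sketched as follows.
This is only an illustration and not the code of Japañol: the function and
constant names are hypothetical, and it is assumed that the 60% minimum that
enables the next lesson applies to the final lesson score.

```python
PASS_THRESHOLD = 60.0  # percent, as stated in Section 2.2
SCORED_MODES = ("Discrimination", "Pronunciation", "Mixed")


def lesson_score(mode_scores):
    """Final lesson score: mean score of the last three modes."""
    return sum(mode_scores[mode] for mode in SCORED_MODES) / len(SCORED_MODES)


def recommended_mode(mode, score):
    """Mode the user is advised to revisit after a failed mode, if any."""
    if mode in SCORED_MODES and score < PASS_THRESHOLD:
        return "Exposure"  # go back to Exposure, then return to the failed mode
    return None


def next_lesson_enabled(mode_scores):
    """Assumption: the 60% minimum is checked against the lesson score."""
    return lesson_score(mode_scores) >= PASS_THRESHOLD


scores = {"Discrimination": 80.0, "Pronunciation": 50.0, "Mixed": 65.0}
print(lesson_score(scores))                                   # 65.0
print(recommended_mode("Pronunciation", scores["Pronunciation"]))
print(next_lesson_enabled(scores))
```

In this example the user would be advised to revisit Exposure mode because
of the Pronunciation score, while the lesson score of 65% would still enable
the next lesson under the stated assumption.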
3. Results

Table 1 shows the users’ degree of interaction with Japañol. After all
training sessions, Japanese learners had spent an average of 100.12 minutes
performing the proposed tasks. Users spent 83.06% of the mean effective time
carrying out interactive exercises in EXP, DIS, PRO and MIX modes. On
average, users listened to the TTS system 628.51 times and used the ASR
system 247.13 times, giving a rate of 8.74 uses of the TTS/ASR per minute.
Table 1 also shows important differences in the use of the tool depending on
the user. For instance, the user who performed the tasks of the PRO mode in
the fastest way spent 22.43 minutes and the one who spent the most time
needed 72.85 minutes. This contrast can also be observed in the rest of the
modes and in the number of times they interacted with the tool. The
inter-user differences affect both the number of times the users make use of
the ASR (162 minimum vs. 355 maximum) and the number of times they requested
the use of TTS (79 vs. 359 times).

Tables 2 and 3 display the confusion matrices between the sounds of the
minimal pairs in perception and production events. The most confused pair in
discrimination tasks is [l]-[R] (54 times) and the least confused one is
[T]-[f] (5 times). The sounds with the lowest pronunciation success rate are
[s] and [fl] (both < 71%), and those with the highest pronunciation success
rate are [f] and [T] (both > 92%), corresponding to the lowest number of
requested listenings (5 and 20).

Table 3 shows the results of pronunciation events at first and last attempt
per word utterance. A produced word is correct if the ASR results provide
the desired word in first place. Users can try a maximum of five wrong
productions per word. We can observe a success improvement from first to
last attempt (last column), the highest ones being for the [fl] (48.4%) and
[fR] (27.7%) sounds. At first attempt, the most confused pair in production
tasks is [fl]-[fR] (88 times) and the least confused one is [l]-[rr] (20
times). The sounds with the lowest discrimination success rate are [fl] and
[s] (both < 35%), and those with the highest discrimination success rates
are [l] and [R] (both > 75%). At last attempt, the most confused pair in
production tasks is [s]-[T] (102 times) and the least confused one is
[l]-[R] (3 times). The sounds with the lowest production success rate are
[s] and [fR] (both < 65%). A higher number of requested listenings (first
column) appears for last attempt pronunciation at lower success rates. The
sounds with the highest production success rate are [l] and [R] (both
> 96%).

Table 3: Confusion matrix of pronunciation task-tokens at first and last
attempt per word sequence (diagonal: right pronunciation task-tokens at
first and last attempt per word sequence). The rows are the sounds expected
by the tool and the columns are the sounds produced by the user. success
refers to the success rate of the corresponding sound row. #Lis is the
number of requested listenings to the sound row.

Table 4 presents the scores for each user at each of the stages of the
experiment (pre-test, CAPT tool, post-test, and a delta score between
pre-test and post-test). EXP and ASR scores refer to the learners’
qualifications in both tests by human raters and Google ASR, respectively.
These scores are computed by summing up the number of correct words per
speaker and normalizing the result to the range [0,10]. The JAP score is
computed from the number of correct and incorrect task-tokens while doing
the required task-types of the training modes (Discrimination, Pronunciation
and Mixed modes) as a qualification to rank the participants.

Table 4: Different user scores in pre-test, post-test and game by experts,
ASR and game in a scale of [0, 10]. ID, EXP, ASR and JAP refer to user
identifier, experts, Google ASR and Japañol, respectively.

      Pre-test      Game    Post-test     ∆ (Pre/Post)
ID    EXP    ASR    JAP     EXP    ASR    EXP     ASR
07    9.8    5.1    9.4     9.4    7.5    -0.4    2.4
05    8.6    4.5    9.1     8.8    4.6    0.2     0.2
01    8.1    2.9    8.0     7.9    4.6    -0.3    1.7
02    7.3    3.7    8.4     7.8    4.8    0.4     1.2
03    6.9    1.9    6.3     7.6    2.7    0.8     0.8
08    6.8    2.8    6.5     7.1    3.4    0.4     0.6
06    5.8    0.8    5.9     6.5    3.2    0.8     2.4
04    5.4    2.1    6.9     7.3    2.9    1.9     0.8

Concerning EXP values, a consistency test among human raters based on the
Fleiss’ Kappa statistical indicator was carried out both for pre-test and
post-test evaluations. For logistic reasons, post-test evaluation was
carried out first and a moderate agreement [21] was found (Kappa
value=0.50). There is a considerable increment for pre-test, reaching a
substantial agreement (Kappa value=0.63). Comparing subjective and objective
scores, delta score values (last two columns) are positive in almost all
cases both in EXP and ASR (only two of them decrease in EXP, both in top
three). They also show a fair correlation with pre-test expert scoring
(r=-0.856) and post-test expert scores with ASR ones (r=-0.735). Pre-test
scores assigned by experts (EXP) show reasonable linear regression
correlation with those obtained by applying the ASR in the same test
(r=0.890). A similar correlation is found for post-test EXP and ASR
(r=0.834). Game scores (JAP) show good correlation with EXP post-test
results (r=0.912), clearly over the correlation found between JAP and
pre-test human rater results (r=0.867).

4. Discussion and Conclusion

We have described the results of an experiment in which native Japanese
learners of Spanish as a foreign language used a CAPT system that includes a
software tool, Japañol, which we believe is worth taking into account as a
possible teaching complement. The results have shown that Japanese learners
of Spanish have difficulty with [f] in onset cluster position in both
perception (Table 2) and production (Table 3). [s]-[T] present similar
results: learners tend to substitute [T] by [s], but this pronunciation is
accepted in Latin American Spanish. Regarding liquid consonants, Japanese
speakers are more successful at phonetically producing [l] and [R] than at
discriminating these phonemes. Japanese speakers have already acquired these
sounds since they are allophones of the same liquid phoneme in Japanese. For
this reason, it does not seem to be necessary to distinguish them in
Japanese, whereas it is in Spanish.

The applied methodology has proved coherent and efficient because easier
tasks (exposure, discrimination) were presented before tougher ones
(pronunciation) in time and repetition terms (Table 1). It has also led to
better final scores (Table 4). Participants consistently resorted to TTS
models when faced with difficulties both in perception and production modes
(#Lis column of Tables 2 and 3 and #Req.List. row of Table 1). The fact that
our CAPT tool makes use of ASR and TTS systems is the principal pedagogical
and operational concern. The quality of the synthetic voice in the rendering
of minimal pairs appears to have been adequate. Judging from the gathered
data, the TTS system employed seems to be beneficial for students [22, 23]:
the rate of success significantly increases after undertaking the exposure
activities imposed by feedback. The role of ASR is even more crucial as it
offers diagnosis to users as real-time automatic feedback. Nowadays, ASR
systems have some limitations such as isolated or infrequent words,
adaptation to technology and L1 transferred pronunciation utterances
accepted as correct L2 words (false positives). However, [24] demonstrated
the effectiveness of an ASR-based CAPT tool for training users in the
production of decontextualized isolated words and [25] reported L2 French
vowel /y/ production improvement after training with a mobile ASR system.

Human raters’ post-test scores fairly correlate with ASR and game ones,
which could be useful in the future to evaluate a large number of users
while reducing human costs (Table 4). The users with the lowest pre-test
scores improved more than the best ones. However, they did not reach better
results than the top ones. This is due to the fact that the tool does not
give extra activities when the limit is reached. As future work, we are
collaborating with Seisen University with a bigger group of students divided
into experimental, control and in-classroom groups. We are also considering
the possibility of applying the methodology to other more exotic and
minority languages. Finally, a comparison between a game-oriented version
and a learning-oriented one could be our next step.
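As an illustration of how the correlations between expert and ASR scores can
be obtained, the following minimal sketch computes the Pearson correlation
coefficient for the pre-test EXP and ASR columns of Table 4. The function
name pearson is hypothetical and the tool actually used for the reported
coefficients is not stated in the text. Since the table entries are rounded
to one decimal, the recomputed value (about 0.88) need not coincide exactly
with the reported one (r=0.890).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Pre-test scores from Table 4, in the row order 07, 05, 01, 02, 03, 08, 06, 04.
PRE_EXP = [9.8, 8.6, 8.1, 7.3, 6.9, 6.8, 5.8, 5.4]
PRE_ASR = [5.1, 4.5, 2.9, 3.7, 1.9, 2.8, 0.8, 2.1]

print(f"r(pre-test EXP, pre-test ASR) = {pearson(PRE_EXP, PRE_ASR):.3f}")
```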
5. References

[1] R. I. Thomson and T. M. Derwing, “The effectiveness of L2 pronunciation
instruction: A narrative review,” Applied Linguistics, vol. 36, no. 3,
p. 326, 2014.
[2] A. Kukulska-Hulme, “Language learning defined by time and place: A
framework for next generation designs,” in Left to My Own Devices: Learner
Autonomy and Mobile Assisted Language Learning, ser. Innovation and
Leadership in English Language Teaching, J. E. Díaz-Vera, Ed. Bingley, UK:
Emerald Group Publishing Limited, 2012, vol. 6, pp. 1–13. [Online].
Available: http://oro.open.ac.uk/30756/
[3] A. Pareja-Lora, C. Calle-Martínez, and P. Rodríguez-Arancón, New
perspectives on teaching and working with languages in the digital era.
Research-publishing. net, 2016.
[4] W. Li and D. Mollá-Aliod, “Computer processing of oriental languages.
language technology for the knowledge-based economy,” Lecture Notes in
Computer Science, vol. 5459, 2009.
[5] M. Carranza, “Diseño de aplicaciones para la práctica de la
pronunciación mediante dispositivos móviles y su incorporación en el aula de
ele,” El español entre dos mundos: Estudios de ELE en Lengua y Literatura,
pp. 279–297, 2014.
[6] A. Neri, C. Cucchiarini, and H. Strik, “The effectiveness of
computer-based speech corrective feedback for improving segmental quality in
L2-Dutch,” ReCALL, vol. 20, no. 02, pp. 225–243, 2008.
[7] J. Lee, J. Jang, and L. Plonsky, “The Effectiveness of Second Language
Pronunciation Instruction: A Meta-Analysis,” Applied Linguistics, vol. 36,
no. 3, p. 345, 2015.
[8] M. Carranza, “Errores y dificultades específicas en la adquisición de la
pronunciación del español le por hablantes de japonés y propuestas de
corrección,” Nuevos enfoques en la enseñanza del español en Japón,
pp. 51–78, 2012.
[9] A. Blanco-Canales and M. Nogueroles-López, “Descripción y categorización
de errores fónicos en estudiantes de español/l2. validación de la taxonomía
de errores aacfele,” Logos: Revista de Lingüística, Filosofía y Literatura,
vol. 23, no. 2, pp. 196–225, 2013.
[10] A. R. Bradlow, D. B. Pisoni, R. Akahane-Yamada, and Y. Tohkura,
“Training Japanese listeners to identify English /r/ and /l/: IV. Some
effects of perceptual learning on speech production,” The Journal of the
Acoustical Society of America, vol. 101, no. 4, pp. 2299–2310, 1997.
[11] J. M. Cabezas Morillo, “Las creencias de los estudiantes japoneses
sobre la pronunciación española: un análisis exploratorio,” Trabajo de
Máster, Universidad Pablo de Olavide, 2009. [Online]. Available:
http://www.mecd.gob.es/dam/jcr:5ebaa0b9-7b77-4a28-a732-b743be08dd06/2010-bv-11-05cabezas-pdf.pdf
[12] M. Carranza, C. Cucchiarini, J. Llisterri, M. J. Machuca, and A. Ríos,
“A corpus-based study of spanish L2 mispronunciations by Japanese speakers,”
in Edulearn14 Proceedings. 6th International Conference on Education and
New Learning Technologies. IATED Academy, 2014, pp. 3696–3705. [Online].
Available: http://liceu.uab.cat/~joaquim/publicacions/Carranza et al 14 Corpus Spanish L2.pdf
[13] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas,
C. González-Ferreras, and D. Escudero-Mancebo, “Measuring pronunciation
improvement in users of CAPT tool TipTopTalk!” Interspeech, pp. 1178–1179,
September 2016.
[14] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas,
C. González-Ferreras, and V. Cardeñoso-Payo, “Improving L2 production with a
gamified computer-assisted pronunciation training tool, TipTopTalk!”
IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and the V Iberian
SLTech Workshop events, pp. 177–186, 2016.
[15] D. Escudero-Mancebo, E. Cámara-Arenas, C. Tejedor-García,
C. González-Ferreras, and V. Cardeñoso-Payo, “Implementation and test of a
serious game based on minimal pairs for pronunciation training,” SLaTE,
pp. 125–130, 2015.
[16] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas,
C. González-Ferreras, and D. Escudero-Mancebo, “Playing around minimal pairs
to improve pronunciation training,” IFCASL 2015, 2015.
[17] C. Tejedor-García, D. Escudero-Mancebo, C. González-Ferreras,
E. Cámara-Arenas, and V. Cardeñoso-Payo, “Evaluating the Efficiency of
Synthetic Voice for Providing Corrective Feedback in a Pronunciation
Training Tool Based on Minimal Pairs,” SLaTE, 2017.
[18] Y. Sheen and R. Ellis, “Corrective feedback in language teaching,”
Handbook of research in second language teaching and learning, vol. 2,
pp. 593–610, 2011.
[19] B. Penning de Vries, C. Cucchiarini, H. Strik, and R. van Hout, “The
Role of Corrective Feedback in Second Language Learning: New Research
Possibilities by Combining CALL and Speech Technology,” Proceedings of SlaTE
2010, Tokyo, Japan, nov 2010.
[20] C. Tejedor-García and D. Escudero-Mancebo, “Uso de pares mínimos en
herramientas para la práctica de la pronunciación del español como lengua
extranjera,” Revista de la Asociación Europea de Profesores de Español. El
español por el mundo, no. 1, pp. 355–363, 2018.
[21] J. R. Landis and G. G. Koch, “The measurement of observer agreement for
categorical data,” biometrics, pp. 159–174, 1977.
[22] G. Smith, W. Cardoso, and C. G. Fuentes, “Text-to-speech synthesizers:
Are they ready for the second language classroom?” Concordia University,
2015.
[23] D. Liakin, W. Cardoso, and N. Liakina, “The pedagogical use of mobile
speech synthesis (TTS): focus on French liaison,” Computer Assisted Language
Learning, vol. 30, no. 3-4, pp. 325–342, 2017.
[24] A. Neri, O. Mich, M. Gerosa, and D. Giuliani, “The effectiveness of
computer assisted pronunciation training for foreign language learning by
children,” Computer Assisted Language Learning, vol. 21, no. 5,
pp. 393–408, 2008.
[25] D. Liakin, W. Cardoso, and N. Liakina, “Learning L2 pronunciation with
a mobile speech recognizer: French /y/,” CALICO Journal, vol. 32, no. 1,
p. 1, 2015.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Vicomtech, Spain
2 Pattern Recognition and Human Language Technologies Research Center,
Universitat Politècnica de València, Spain
[cbernath,aalvarez,harzelus]@vicomtech.org, [email protected]
10.21437/IberSPEECH.2018-22
and languages. Finally, the results achieved were compared with those obtained by LSTM-HMM based systems using the same training and evaluation datasets.

This paper is organized as follows. Section 2 describes the main architecture of the E2E systems developed within this work. Section 3 describes the training and evaluation data employed, whilst Section 4 presents the baseline ASR systems built for each language, including both E2E and LSTM-HMM architectures. The experiments performed are described in Section 5 and, finally, Section 6 draws conclusions and outlines future work.

2. System overview

All the E2E systems presented in this work were developed following the Deep Speech 2 architecture [2]. The core of the system is an RNN model that ingests speech spectrograms and outputs text transcriptions. Although Long Short-Term Memory (LSTM) units are widely used in RNN models, this architecture employs Gated Recurrent Units (GRU) [18], since they have been shown to train faster and to be less likely to diverge [19].

A sequence of two 2D convolutional neural network (CNN) layers is employed as a spectral feature extractor from the spectrograms. The first layer has 1 input channel and 32 output channels, and uses filters of size 41 × 11 with a stride of 2 × 2. The second layer takes as input the 32-channel output of the first layer and also produces 32 output channels, using filters of size 21 × 11 with a stride of 2 × 1. 2D batch normalization is applied to the output of both layers, followed by a hard tanh activation function.

The E2E systems use 5 bidirectional GRU layers, each composed of 800 hidden units. After the bidirectional recurrent layers, a fully connected layer is applied as the last layer of the whole model. The output corresponds to a softmax function which computes a probability distribution over characters at each timestep. The size of this output layer is equal to the total number of characters to predict.

During training, the CTC loss function is computed to measure the error of the predictions, and the gradient is computed with the backpropagation through time algorithm in order to update the network parameters. The optimizer is Stochastic Gradient Descent (SGD).

In addition, external language models (LMs) were integrated during the decoding of the E2E systems with the aim of improving on the initial results. To this end, modified Kneser-Ney smoothed n-gram models of several orders were estimated using the KenLM toolkit [20].

3. Corpora description

In this section, the acoustic and text data used to train and evaluate both E2E and LSTM-HMM systems are presented for each language.

3.1. English

The freely available LibriSpeech corpus [21], a read speech corpus based on audio-books from LibriVox¹, was used as the dataset. The original training, development and testing subsets were kept. These partitions are detailed in Table 1.

Table 1: Training, development and test subsets for English.

subset        hours
train-clean   464.2
train-other   496.7
dev-clean       5.4
test-clean      5.4
dev-other       5.3
test-other      4.1
test-noisy      5.4

The clean partitions correspond to those pools with a lower WER in the whole corpus, whilst the other ones contain the audios that are, a priori, the most difficult. Test-noisy was artificially created within this work using synthesis by superposition [22] of noise samples onto the test-clean subset. The noise samples correspond to audios from different acoustic environments selected from the YouTube platform.

The text data used to train the LMs consisted of 22,000 books from the Project Gutenberg² repository, totalling 803 million tokens and 900,000 unique words.

3.2. Spanish

The Spanish subset of the SAVAS corpus [23] was used as the main dataset. It is composed of broadcast news content from the Basque Country’s public broadcast corporation EiTB (Euskal Irrati Telebista), and includes audios in both clean (studio) and noisy (outside) conditions. This media dataset was then transferred through both landlines and mobile lines using different combinations, generating new telephone domain subsets, as summarized in Table 2.

Table 2: Training, development and test subsets for Spanish.

subset              hours
train-media         132.5
train-land-mobile   397.5
dev-media             4
test-media-clean      4
test-media-noisy      4
dev-land              4
test-land             4.6
dev-mobile            4
test-mobile           4.6

The text data were obtained by merging transcriptions of the training audios with generic domain news crawled from the Internet, for a total of 320 million words.

3.3. Basque

The Basque training data consisted of the Basque subset of the SAVAS corpus, and the audios were likewise gathered from Basque broadcast news programs. No telephone domain partition was generated in this case. Table 3 describes the main characteristics of the SAVAS Basque corpus.

The text data were also obtained by merging transcriptions and generic news crawled from the Internet. In total, 180 million words were employed for LM estimation.
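
As an illustration of the superposition used to build Test-noisy in Section 3.1, the following Python sketch adds a noise recording to a clean utterance. The source only states that noise samples were superposed on the test-clean subset; the target signal-to-noise ratio (`snr_db`), the looping of short noise clips and all function names are assumptions made for this example, not details reported by the authors.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the mixture has the requested SNR (dB).

    Assumption for this sketch: the noise clip is looped when it is shorter
    than the clean utterance and truncated when it is longer.
    """
    if not clean or not noise:
        raise ValueError("both signals must be non-empty")
    # Loop or truncate the noise to the length of the clean utterance.
    tiled = [noise[i % len(noise)] for i in range(len(clean))]
    noise_rms = rms(tiled)
    if noise_rms == 0.0:
        return list(clean)  # silent noise: nothing to add
    # Gain that scales the noise to the level implied by the target SNR.
    gain = rms(clean) / (noise_rms * 10 ** (snr_db / 20.0))
    return [c + gain * n for c, n in zip(clean, tiled)]
```

Plain superposition without level control corresponds to a fixed gain of 1; scaling to a target SNR is one common way to keep the difficulty of the noisy set comparable across utterances.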
¹ https://librivox.org/
² https://www.gutenberg.org/
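
To make the per-timestep character distribution of Section 2 concrete, the sketch below shows best-path (greedy) CTC decoding in Python: take the most probable symbol at each timestep, merge consecutive repeats and drop the blank symbol. This is a simplification for illustration only. The systems in this paper were decoded with a beam search and an external LM, and the alphabet and blank index used here are hypothetical.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_probs: one probability list per timestep (softmax output).
    alphabet:    symbol for each output index; alphabet[blank] is the CTC blank.
    """
    decoded = []
    previous = blank
    for probs in frame_probs:
        # Most probable output index at this timestep.
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit a symbol only when it is not blank and not a repeat.
        if best != blank and best != previous:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)


# Hypothetical 4-symbol alphabet; index 0 is the blank.
ALPHABET = ["<blank>", "c", "a", "t"]


def one_hot_like(index, size=4):
    """Toy softmax output that puts most of the mass on `index`."""
    return [0.91 if i == index else 0.03 for i in range(size)]


frames = [one_hot_like(i) for i in (1, 1, 0, 2, 2, 0, 3)]
print(ctc_greedy_decode(frames, ALPHABET))  # prints "cat"
```

A blank between two identical symbols keeps them separate, which is how CTC represents doubled letters.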
Table 3: Training, development and test subsets for Basque.

subset             hours
train-media        142.25
dev-media            4
test-media-clean     4
test-media-noisy     4

4. Baseline E2E and LSTM-HMM systems

Using the above described corpora, the E2E and LSTM-HMM based baseline systems were trained in order to be compared to the evolved ASR systems presented in Section 5. The characteristics of these baseline systems are described in the following subsections.

Table 4: WER (%) results for linear- and Mel-based systems for each language.

                Test(-media-)clean    Test(-media-)noisy
System          No LM     3-gram      No LM     3-gram
English
Baseline-EN     10.6       5.6        35.3      23.6
Mel-scale-EN    10.2       5.4        34.8      22.2
Spanish
Baseline-ES     24.0      10.3        39.5      19.2
Mel-scale-ES    21.9      10.3        36.7      18.9
Basque
Baseline-EU     23.8      12.9        38.8      17.3
Mel-scale-EU    21.2       8.9        36.2      19.2
Looking at the results obtained for the media test set in clean conditions, relative improvements of 7.5% and 8.3% can be observed for the Mixed-ES and the fine-tuned Mixed-FT-Media models, respectively, with respect to the baseline. In addition, the fine-tuned Mixed-FT-Media model achieved an improvement of 18.3% over the baseline system on the media noisy test set.

Regarding the results on the telephone domain presented in Table 6, the three models under evaluation perform better than the Baseline-ES model. As expected, the Mixed-FT-Phone model shows the best results, with a real improvement of 37.3% with respect to the baseline on the Test-mobile set, composed of audios transferred over the mobile channel. This improvement reaches 34.5% on the Test-land set. The Mixed-ES model also shows notable enhancements, with a slightly worse WER than the Mixed-FT-Phone model. The Mixed-FT-Media model performs satisfactorily as well, although slightly worse than the others.

Finally, it can also be observed that the Mixed-ES model performs almost as well as the fine-tuned models even though no fine-tuning technique was applied to it.

5.3. Best E2E models compared to LSTM-HMM systems

The aim of this task was to select the best E2E models presented above and compare them to (1) the state-of-the-art models in the literature, if available, and (2) LSTM-HMM based systems developed on top of the Kaldi toolkit. For this experiment, the selected models for each language were named Evolved-EN, Evolved-ES and Evolved-EU.

• Evolved-EN. The English baseline model, evolved through fine-tuning and speed-based data augmentation applied for 10 more epochs. Decoding was performed using an external 3-gram LM and a beam size of 1000.
• Evolved-ES. The Spanish fine-tuned Mixed-FT-Media model. It was decoded with two configurations: (1) an external 3-gram LM with a beam size of 600 and (2) a 5-gram LM with a beam size of 1000.
• Evolved-EU. The selected Basque model was the Mel-scale-EU model presented in subsection 5.1. It was decoded using external 3-gram and 5-gram LMs with beam sizes of 600 and 1000, respectively.

Table 7 compares the evolved model, the Kaldi based LSTM-HMM model and the results reported for reference systems in the literature for the English language. Since Test-noisy was created for this work, the results for this subset are only given for the first two models.

Table 7: WER comparison between our best model, the LSTM-HMM model and reference systems for English language.

                    Test-clean   Test-other   Test-noisy
Evolved-EN             4.9         15.4         21.0
LSTM-HMM               6.0         15.2         26.3
DeepSpeech-1 [2]       7.8         21.7          -
DeepSpeech-2 [2]       5.3         13.2          -
Human [2]              5.8         12.6          -
Wav2Letter [8]         6.9          -            -

As can be observed in Table 7, the evolved E2E system outperformed the LSTM-HMM system and all the reference systems on the clean test set. This shows the effect of applying Mel-scale based parametrization together with techniques such as fine-tuning and speed-based data augmentation. It should also be noted that the WER obtained is 0.9 percentage points lower than that achieved by human manual transcription.

In contrast, for the Test-other partition, the LSTM-HMM model performs better, with an error rate 0.2 percentage points lower than the proposed E2E model. With respect to the reference systems, the Evolved-EN model performs better than the Deep Speech 1 model, but its error is still 2.2 points higher than that of Deep Speech 2. It has to be considered that the Deep Speech 2 model was trained with 12,000 hours (including the LibriSpeech corpus), more than ten times the data used for the Evolved-EN system (960 h).

Table 8: WER comparison of Kaldi vs E2E for SAVAS Spanish.

               Test clean          Test noisy
               3-gram   5-gram     3-gram   5-gram
Spanish
Evolved-ES       8.5      7.2       10.9      9.3
LSTM-HMM         7.9      7.7       11.9     10.8
Basque
Evolved-EU       8.9      6.6       19.2     15.9
LSTM-HMM         7.8      6.0       10.8      8.2

Table 8 shows that the best Spanish E2E model improved on all the results obtained by the LSTM-HMM based system, with the exception of the clean test set when a 3-gram LM was applied. This shows the robustness of the fine-tuned mixed model trained with data from different acoustic domains. Finally, the scarcity of training data for Basque favoured the LSTM-HMM model over the E2E model, which was estimated using Mel spectrograms as the only enhancement technique.

6. Conclusions and Future Work

In this work, E2E ASR systems for English, Spanish and Basque have been developed and evaluated against reference and LSTM-HMM architectures. In addition, the positive impact of applying some enhancement techniques has been demonstrated through different experimental evaluations.

As main conclusions, it can be stated that Mel-scale based spectrograms outperform linear-scale ones, as was shown in all the experiments. Moreover, it was shown that robust hybrid E2E models that perform almost as well as in-domain models can be generated for acoustically different environments. With regard to the training data, as in the case of Basque compared to English, it was clear that more data leads to better results. However, when resources are limited, using an external LM of a high order can improve the performance, as was shown for all the languages under evaluation.

As future work, one important task will be the generation of new Spanish and Basque E2E models with more training data, since these models were still weak without an LM. Moreover, the current n-grams will be replaced by RNN based LMs, especially for Basque as an agglutinative language. Finally, following novel studies, new research efforts will be made to develop methodologies for semi-supervised learning within E2E architectures.
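
The comparisons in Section 5 mix two kinds of figures: differences in percentage points (for example, 0.9 points between Evolved-EN and the human reference on Test-clean) and relative improvements (for example, 18.3% over a baseline). The Python sketch below shows how WER is conventionally computed from a word-level edit distance and how the two kinds of difference relate. It is a generic illustration with hypothetical function names, not the scoring tool used by the authors.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


def absolute_gain(baseline_wer, new_wer):
    """Improvement in percentage points."""
    return baseline_wer - new_wer


def relative_gain(baseline_wer, new_wer):
    """Improvement as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

With the Test-clean values of Table 7, `absolute_gain(5.8, 4.9)` is approximately 0.9 points, while `relative_gain(5.8, 4.9)` is approximately 15.5%, which is why the two kinds of figure should not be compared directly.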
7. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
[3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ser. ICML’14. JMLR.org, 2014, pp. II–1764–II–1772.
[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
[6] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5060–5064, 2016.
[7] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.
[8] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[9] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-ctc: Automatic unit selection and target decomposition for sequence labelling,” arXiv preprint arXiv:1703.00096, 2017.
[10] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for english conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017.
[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[13] Y. Fujita, R. Takashima, T. Homma, R. Ikeshita, Y. Kawaguchi, T. Sumiyoshi, T. Endo, and M. Togami, “Unified asr system using lgm-based source separation, noise-robust feature extraction, and word hypothesis selection,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 416–422.
[14] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5200–5204.
[15] D. Palaz, R. Collobert et al., “Analysis of cnn-based speech recognition system using raw speech as input,” in Proceedings of INTERSPEECH, no. EPFL-CONF-210029, 2015.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[17] S. Braun, D. Neil, and S.-C. Liu, “A curriculum learning method for improved noise robustness in automatic speech recognition,” in Signal Processing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 548–552.
[18] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[19] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in International Conference on Machine Learning, 2015, pp. 2342–2350.
[20] K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[22] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[23] A. del Pozo, C. Aliprandi, A. Álvarez, C. Mendes, J. P. Neto, S. Paulo, N. Piccinini, and M. Raffaelli, “Savas: Collecting, annotating and sharing audiovisual language resources for automatic subtitling.” in LREC, 2014, pp. 432–436.
[24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[25] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based speech recognition with gated convnets,” CoRR, vol. abs/1712.09444, 2017. [Online]. Available: http://arxiv.org/abs/1712.09444
AHOLAB Signal Processing Laboratory, University of the Basque Country (UPV/EHU), Spain
[email protected], [email protected], [email protected], [email protected]
10.21437/IberSPEECH.2018-23
In short, the hypotheses of the experiment described in this paper are:

• Intelligibility, measured as Word Error Rate, is correlated with user rated listening effort.
• Healthy voices are more intelligible and less effortful, compared to oesophageal voices.
• Listeners familiar with oesophageal speech find it less effortful to process, compared to listeners that are not.
• ASR performs worse than HSR for healthy voices, but even more so for oesophageal voices.

We begin by describing the methodology, corpus and details of the listening test. This is followed by the analysis methods used and the results. Finally, the conclusions and future work are presented.

2. Methodology

2.1. Experimental Design

The main task for this experiment was a word recall and transcription task: participants listened to a sentence and then wrote what they had understood. According to [13], the strengths of sentence repetition tasks are that they are “fairly simple cognitive tasks” and that they are “consistent throughout the age span” in the area of neurophysiological tests. Moreover, sentence transcription tasks have been widely used for subjective intelligibility measurements. The work in [14] reports the agreement of sentence transcription tasks with a wide range of intelligibility quantification techniques, and in [15] the method is described as “human speech recognition”. Therefore, we chose this approach to calculate WER and, consequently, intelligibility.

We were also interested in the listening effort required by these utterances. We had the participants rate the sentences for listening effort on a 5 point Likert scale. The options were 'very little', 'a little', 'some', 'quite' and 'a lot'.

To avoid priming and sentence order bias, the sentences were played only once and in a random order.

2.2. Corpus and Stimuli

The parallel data used for this task are 100 phonetically balanced sentences selected from a bigger corpus [16], recorded by 35 healthy speakers and 32 oesophageal speakers. The recordings of oesophageal speakers were made in an acoustically isolated room with a studio microphone (Neumann TLM 103). The recordings of the healthy speakers come from variable sources because they were acquired through an online platform [17]. However, some of them were made in the aforementioned acoustically isolated room, although with a different microphone.

2.2.1. Selection of Speakers

For this experiment we chose oesophageal speakers based on two criteria: proficiency and accessibility. Proficient speakers were those who had undergone laryngectomy and had begun training to speak at least two years prior to the recording. Additionally, an oesophageal voice quality assessment tool [18], based on the factors (speaking rate, regularity etc.) of the A4S scale of [19], was used as a guide to assess proficiency. Accessibility of speakers was considered because the opportunity to obtain follow-up recordings could be useful for future research. Based on these criteria, we chose 4 speakers, three male and one female, making the selection gender inclusive (there are only 4 women in the whole database and only 2 of them fulfilled the two criteria of proficiency and accessibility).

The criteria for choosing healthy speakers were quality of recording and gender balance. One male and one female healthy speaker were chosen.

2.2.2. Selection of Sentences

A pilot listening test was conducted within the lab to assess the feasibility of this corpus for the sentence transcription task. The participants chosen for this pilot study were unfamiliar with the sentences of the corpus and thus not subject to priming. After the pilot test, the participants reported that some sentences were too long to remember and hence effortful to transcribe. Additionally, although semantically and syntactically correct, the sentences were rich in content and contained words that are difficult to guess, often including proper names, dates etc.

This led us to reconsider the length of the sentences, and we decided to choose a subset of shorter sentences that would be suitable for sentence transcription. We used the CorpusCRT tool [20], which generates a phonetically balanced subset of sentences based on the provided phonetic criteria. In this case, the criterion we used was a maximum of 40 phonemes, and this gave us a set of 30 sentences, each of which had a maximum of 10 words. Some examples of the sentences are the following: '¿Qué diferencia hay entre el caucho y la hevea?' (What is the difference between rubber and hevea?) and 'Unos días de euforia y meses de atonía.' (A few days of euphoria and months of atony.)

All the selected sentences (both from oesophageal and healthy speakers) were normalised to a common peak value (0.8) to achieve a homogeneous and comfortable level of loudness.

2.3. The Listening Test

We created six mutually exclusive sets of sentences such that each set contained 30 different sentences and exactly 5 sentences from each speaker. As a result, all 180 sentences (30 sentences from six speakers) were covered after every sixth participant. This ensured equal coverage of all sentences and speakers. Each participant was assigned one of these sets and listened to the sentences in a random order.

The participants were asked to use headphones for the study unless impossible. They were assured that it was not a test of hearing and that the test was being conducted to obtain their honest and uninhibited response. They were told that each sentence could be played only once and that they should pay close attention and type what they heard. If they missed some portions or were unsure of what they heard, they could put three dots (...) in that place. Additionally, they were asked to rate the amount of effort they experienced for that sample on the aforementioned Likert scale. The first two sentences presented were practice sentences (one healthy and one oesophageal), to familiarise the participant with the task. These sentences were sampled from the same corpus [21] but were different from the ones that appear in the actual test.

We collected the following information from the participants: age, presence of hearing impairment, the kind of audio equipment used (good quality headphones, normal quality headphones, good loudspeakers or bad equipment) and whether the listener had close contact with laryngectomees.

The listening test¹ was web based, which made it possible to reach out to a wide range of participants. However, this also meant differences in audio equipment; the effects of this on the responses are reported in the results section.

¹ https://aholab.ehu.eus/users/sneha/Listening test.php

2.4. Automatic Speech Recognition

To have an objective measure of the intelligibility (WER), we prepared an ASR system for Spanish using the Kaldi toolkit [22]. This approach was chosen because it allowed us to control the processing operations followed during recognition as well as basic aspects of the recognition process such as the lexicon and the language model. It is implemented following the s5 recipe for the Wall Street Journal database. The acoustic features used are 13 Mel-Frequency Cepstral Coefficients (MFCCs), to which mean and variance normalization (CMVN) is applied to mitigate the effects of the channel. The details of the training procedures are described in [23].

The audio material used to train the Spanish recogniser was healthy laryngeal speech, as described in [24]. However, due to the characteristics of the sentences used for the evaluation, some modifications were made to this ASR system. Although the acoustic models were maintained, a new lexicon was created from the 100 sentence corpus used in the experiment (701 words). This was done because, with the original lexicon (37,632 entries), as many as 23% of the words were out of vocabulary (OOV). This is due to the fact that the sentences are phonetically balanced, and many sentences containing proper names and unusual words were chosen to maximize the variability of the phonetic content. Together with this reduced lexicon, a unigram language model with equally probable words was used.

Although the final WER numbers obtained in this way are not comparable to a realistic ASR situation, the procedure serves our purpose of evaluating the intelligibility of the sentences, comparing the performance of healthy and oesophageal speakers, and establishing a baseline reference for future developments in the field (such as evaluating the improvements of speech modification algorithms).

Table 1: Mean WER and Effort

                                  Oesophageal   Healthy
Effort       Familiar                    2.61     1.25
             Not familiar                3.54     1.26
             Total mean effort           3.07     1.255
WER (in %)   Familiar                   17.39     7.42
             Not familiar               18.35     4.85
             Total mean WER             17.87     6.16

WER results (F(3,1256)=0.707, p=0.548) and on listening effort (F(3,1256)=0.705, p=0.549).

In addition, we present the WER results from the ASR system for all speakers.

3.1. Word Error Rates from HSR

Table 1 presents mean WERs and Figure 1 shows the speaker-wise WERs for familiar and unfamiliar listeners. OM, OF, HM and HF stand for Oesophageal Male, Oesophageal Female, Healthy Male and Healthy Female, respectively. Mean WER is always higher for oesophageal speech than for healthy speech, as expected. There is no major difference in WER between familiar and unfamiliar listeners in the case of oesophageal speech. This result corroborates the conclusions in [12]. For healthy speech there is a slight difference of around 3 points in the mean WER but, as can be seen in Figure 1, the difference is not meaningful.

The ANOVA results show that familiarity with oesophageal speech had no effect on WER (F(1,1590)=0.360, p=0.548). On the other hand, speaker type has a strong effect on WER (F(1,1590)=129.552, p<0.001).
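
The six listening sets of Section 2.3 can be built in several ways, and the source does not say which construction was used. A minimal Python sketch, assuming a rotation (Latin-square style) scheme in which blocks of five sentences are shifted across speakers from one set to the next, is given below. All function and parameter names are illustrative.

```python
def build_listening_sets(n_sentences=30, n_speakers=6):
    """Return one dict per set mapping sentence index -> speaker index.

    Assumed construction: sentences are split into n_speakers blocks and the
    block-to-speaker assignment is rotated by one position for each new set.
    """
    if n_sentences % n_speakers:
        raise ValueError("n_sentences must be a multiple of n_speakers")
    block = n_sentences // n_speakers  # sentences per speaker within one set
    sets = []
    for set_index in range(n_speakers):
        assignment = {
            sentence: (sentence // block + set_index) % n_speakers
            for sentence in range(n_sentences)
        }
        sets.append(assignment)
    return sets


listening_sets = build_listening_sets()
# Every (sentence, speaker) recording appears in exactly one of the six sets.
covered = {(sent, spk) for s in listening_sets for sent, spk in s.items()}
print(len(covered))  # prints 180
```

With this rotation each set contains all 30 sentences with exactly 5 from each speaker, and six consecutive participants together hear all 180 recordings. The playback order within a set would still be shuffled per participant.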
that familiarity with oesophageal speech has an effect on ef-
fort (F(1,1590)=84.94, p<0.001) and Speaker-type has a strong
effect on effort (F(1,1590)=1243.94, p<0.001).
Figure 2: Mean speaker-wise self-reported effort values for oesophageal (OM1,
OM2, OM3, OF1) and healthy (HM1, HF1) speakers. 1 corresponds to least
effortful and 5 to most effortful. Error bars show 95% confidence intervals.

Figure 3: Correlation between WER and user-rated listening effort.

3.3. Correlation of Intelligibility and Listening Effort
Correlation between intelligibility (WER) and self-reported effort is 0.479
(Pearson's r, p < 0.001). This is a weak but significant correlation that
indicates that sentences with more transcription errors are perceived as more
effortful. This relationship between WER and self-reported effort is
illustrated as a box-plot in Figure 3.

4. Conclusions and Future Work
Healthy voices are on average three times as intelligible as oesophageal
voices. The mean self-reported effort was also three times larger for
oesophageal speech compared to healthy voices. There was a significant
correlation between intelligibility and effort. Speaker type had an effect on
both intelligibility and effort. Listeners familiar with oesophageal speech
fared the same for intelligibility as listeners who were not. However, they
reported less effort in listening to oesophageal speech than the unfamiliar
listeners. The ASR system we chose for this task had poorer WER for
oesophageal voice compared to healthy voice.
The listening effort obtained through this study is based on the listener's
own interpretation of 'effort involved in listening'. This will provide us
with a reference for comparison when we perform objective listening effort
measurements in the future using physiological methods such as EEG and
pupillometry. If these subjective measures are found to be correlated with the
physiological measurements, then that opens the possibility of using the less
cumbersome self-report strategy to achieve our purpose of evaluation.
Both HSR intelligibility and ASR intelligibility play different but important
roles in oesophageal speech evaluation. While improved HSR would enable better
human-human interactions, an improved ASR performance would enable better
human-machine interactions (e.g. digital voice assistants). Lower listening
effort would also contribute towards better communication with humans.
Our main future work is to build an oesophageal voice restoration system aimed
at better ASR and HSR intelligibility and low listening effort.
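The correlation reported in Section 3.3 relates a per-sentence word error rate to a per-sentence effort rating. The following is a minimal sketch of both quantities, assuming WER is the word-level edit distance divided by the reference length and that the correlation is the plain sample Pearson coefficient. The function names are illustrative; this is not the MATLAB script or the JASP analysis used in the paper, and no p-value is computed.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def pearson_r(xs, ys) -> float:
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, one would compute `word_error_rate` for each listener transcription and pass the resulting list, together with the matching effort ratings, to `pearson_r`.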
6. References [18] N. Tits, “Exploring the parameters describing the quality and in-
telligibility of alaryngeal voices,” University of Mons, 2017.
[1] B. Weinberg, “Acoustical properties of esophageal and tracheoe-
sophageal speech,” Laryngectomee rehabilitation, pp. 113–127, [19] T. Drugman, M. Rijckaert, C. Janssens, and M. Remacle, “Tra-
1986. cheoesophageal speech: A dedicated objective acoustic assess-
ment,” Computer Speech & Language, vol. 30, no. 1, pp. 16–31,
[2] T. Most, Y. Tobin, and R. C. Mimran, “Acoustic and perceptual 2015.
characteristics of esophageal and tracheoesophageal speech pro-
duction,” Journal of communication disorders, vol. 33, no. 2, pp. [20] A. Sesma and A. Moreno, “Corpuscrt 1.0: Diseno de corpus
165–181, 2000. orales equilibrados,” UPC, Tech. Rep., Dec.2000.
[3] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, [21] D. Erro, I. Hernáez, E. Navas, A. Alonso, H. Arzelus, I. Jauk,
M. Schuster, and E. Nöth, “Peaks–a system for the automatic eval- N. Q. Hy, C. Magarinos, R. Pérez-Ramón, M. Sulı́r et al.,
uation of voice and speech disorders,” Speech Communication, “Zuretts: online platform for obtaining personalized synthetic
vol. 51, no. 5, pp. 425–437, 2009. voices,” Proceedings of eNTERFACE, pp. 1178–1193, 2014.
[4] C. Middag, T. Bocklet, J.-P. Martens, and E. Nöth, “Combin- [22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
ing phonological and acoustic asr-free features for pathological N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
speech intelligibility assessment,” in Twelfth Annual Conference “The kaldi speech recognition toolkit,” in IEEE 2011 workshop
of the International Speech Communication Association, 2011. on automatic speech recognition and understanding, no. EPFL-
CONF-192584. IEEE Signal Processing Society, 2011.
[5] C. Middag, J.-P. Martens, G. Van Nuffelen, and M. De Bodt,
“Dia: a tool for objective intelligibility assessment of pathological [23] S. P. Rath, D. Povey, K. Veselỳ, and J. Cernockỳ, “Improved fea-
speech,” in 6th International workshop on Models and Analysis of ture processing for deep neural networks.” in Interspeech, 2013,
Vocal Emissions for Biomedical Applications. Firenze University pp. 109–113.
Press, 2009, pp. 165–167. [24] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga,
[6] J. L. Miralles and T. Cervera, “Voice intelligibility in patients who “Aholab system for albayzin 2016 search-on-speech evaluation,”
have undergone laryngectomies,” Journal of Speech, Language, in IberSPEECH, 2016, pp. 33–42.
and Hearing Research, vol. 38, no. 3, pp. 564–571, 1995. [25] E. Polityko. Word error rate. [Online]. Available:
[7] T. Cervera, J. L. Miralles, and J. González-Àlvarez, “Acousti- https://www.mathworks.com/examples/matlab/community/19873-
cal analysis of spanish vowels produced by laryngectomized sub- word-error-rate , access date: 20th February 2018
jects,” Journal of speech, language, and hearing research, vol. 44, [26] JASP Team, “JASP (Version 0.8.6)[Computer software],” 2018.
no. 5, pp. 988–996, 2001. [Online]. Available: https://jasp-stats.org/ ,access date: 20th
[8] A. Mantilla, H. Pérez-Meana, D. Mata, C. Angeles, J. Alvarado, February 2018
and L. Cabrera, “Recognition of vowel segments in spanish
esophageal speech using hidden markov models,” in Computing,
2006. CIC’06. 15th International Conference on. IEEE, 2006,
pp. 115–120.
[9] R. McGarrigle, K. J. Munro, P. Dawes, A. J. Stewart, D. R. Moore,
J. G. Barry, and S. Amitay, “Listening effort and fatigue: What
exactly are we measuring? a british society of audiology cogni-
tion in hearing special interest group white paper,” International
journal of audiology, 2014.
[10] S. Bennett and B. Weinberg, “Acceptability ratings of normal,
esophageal, and artificial larynx speech,” Journal of Speech, Lan-
guage, and Hearing Research, vol. 16, no. 4, pp. 608–615, 1973.
[11] K. F. Nagle and T. L. Eadie, “Listener effort for highly intelligible
tracheoesophageal speech,” Journal of Communication Disorders,
vol. 45, no. 3, pp. 235–245, 2012.
[12] W. L. Cullinan, C. S. Brown, and P. D. Blalock, “Ratings of intel-
ligibility of esophageal and tracheoesophageal speech,” Journal
of communication disorders, vol. 19, no. 3, pp. 185–195, 1986.
[13] J. Meyers, K. Volkert, and A. Diep, “Sentence repetition test: Up-
dated norms and clinical utility,” vol. 7, pp. 154–9, 02 2000.
[14] K. M. Yorkston and D. R. Beukelman, “A comparison of tech-
niques for measuring intelligibility of dysarthric speech,” Journal
of communication disorders, vol. 11, no. 6, pp. 499–512, 1978.
[15] R. P. Lippmann, “Speech recognition by machines and humans,”
Speech communication, vol. 22, no. 1, pp. 1–15, 1997.
[16] I. Sainz, D. Erro, E. Navas, I. Hernáez, J. Sanchez,
I. Saratxaga, and I. Odriozola, “Versatile Speech Databases
for High Quality Synthesis for Basque,” in 8th international
conference on Language Resources and Evaluation (LREC),
2012, pp. 3308–3312. [Online]. Available: http://www.lrec-
conf.org/proceedings/lrec2012/pdf/126 Paper.pdf
[17] D. Erro, I. Hernáez, A. Alonso, D. Garcı́a-Lorenzo, E. Navas,
J. Ye, H. Arzelus, I. Jauk, N. Hy, C. Magariños, R. Pérez-Ramón,
M. Sulı́r, X. Tian, and X. Wang, “Personalized synthetic voices
for speaking impaired: Website and app,” in Proceedings of the
Annual Conference of the International Speech Communication
Association, INTERSPEECH, 2015.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
112 10.21437/IberSPEECH.2018-24
of classifying different speech dimensions is well researched, but focused on
specific aspects or reduced populations. Some works focus on speech
intelligibility of people with aphasia [15] or speech intelligibility in
general [16]. Others try to identify speech disorders in children with cleft
lip and palate [17]. In addition, the recognition of speech emotions and
autism spectrum disorders has been investigated [18]. The point is that all
these works include a subjective evaluation done by experts as a gold standard
to train the classification systems.
In this work, we analyse the difficulties of automatically predicting the
quality of the prosody of an oral production and propose a new approach that
will serve as a baseline for future work. Recordings of individuals with Down
syndrome collected in different sessions of use of the educational video game
PRADIA: Mystery in the city were used to obtain information about the relevant
features needed to make an automatic classification of the productions. The
speech corpus obtained over the course of the game was judged by a therapist,
who evaluated in real time the quality of the oral productions, and by a
prosody expert, who did an offline evaluation. The difference in the
experimental procedure will be used to investigate whether an automatic system
can rely only on prosodic variables to judge the oral productions of the
players (offline evaluation), or whether other features related to the game
dynamics should also be incorporated in the system. The judgments of the
expert are used to train an automatic classifier that predicts quality by
using acoustic information extracted from the recordings of the corpus.

Table 1: Corpus description. Concerning the therapist decision, Cont.R
(Continue Right) means that the activity was rightly resolved, Cont.
(Continue) means that the activity was satisfactorily resolved and Rep.
(Repeat) means that the activity was faultily resolved. Concerning the expert
judgment, Right means that the recording was rightly produced and Wrong means
that the recording was wrongly produced.

                      Therapist decision     Expert judgment
                         (real time)            (offline)
Speaker  #Utterances  Cont.R  Cont.  Rep.    Right  Wrong    Corpus
S01      120          70      33     17      87     33       C1
S02      106          90      16     0       81     25       C1
S03      97           93      3      1       78     19       C1
S04      131          19      51     61      75     56       C1
S05      151          21      54     76      77     74       C1
S06      30           x       x      x       19     11       C2
S07      34           x       x      x       13     21       C2
S08      28           x       x      x       23     5        C2
S09      43           x       x      x       20     23       C2
S10      33           x       x      x       29     4        C2
S11      57           x       x      x       31     26       C3
S12      12           x       x      x       7      5        C3
S13      7            x       x      x       2      5        C3
S14      11           x       x      x       3      8        C3
S15      33           x       x      x       19     14       C3
S16      10           x       x      x       6      4        C3
S17      8            x       x      x       5      3        C3
S18      11           x       x      x       6      5        C3
S19      10           x       x      x       6      4        C3
S20      10           x       x      x       6      4        C3
S21      9            x       x      x       1      8        C3
S22      7            x       x      x       3      4        C3
S23      8            x       x      x       3      5        C3
Total    966          293     157    155     465    302
Table 2: Description of the C1 subcorpus. For each speaker, this table shows
Chronological age (CA), Verbal mental age (VA), Short-term verbal memory
(STVM), and Non-verbal cognitive level (NVCL). Ages are expressed in months.
In addition, the mean percentage of success in perception (MPercT) and
production (MProdT) PEPS-C tasks are included.

Speaker  Gender  CA   VA  STVM      NVCL  MPercT  MProdT
S01      f       195  84  94        17    69.79%  48.30%
S02      m       204  99  134       18    76.04%  72.10%
S03      f       178  96  78        20    73.96%  74.65%
S04      m       190  60  below 74  10    60.42%  49.76%
S05      m       223  69  below 74  13    56.25%  54.84%

emotional and motivational level. The video game allows the therapist to
evaluate the result of the oral activities by pressing a specific key on the
keyboard of the computer where the game is installed. If the evaluation is
Cont.R (Continue with right result) or Cont. (Continue but the oral activity
could be better), the video game advances to the next activity. If the
evaluation is Rep. (Repeat), the game offers a new attempt in which the player
has to repeat the activity. For each activity, there is a predetermined number
of attempts: when the attempts are used up, the video game goes to the next
screen to avoid frustrating the player, even if the activity has not been
successfully completed (and the therapist continues judging with Rep.).
On the other hand, an expert in prosody evaluated the three subcorpora of oral
productions of 23 speakers with Down syndrome in an offline mode. Due to the
difficulty of the task and the different context of the evaluation, the
prosody expert used a reduced evaluation system (Right or Wrong production).
The judgments were made on a purely auditory basis, without any acoustic
analysis of the sentences, and the focus was on the intonational and prosodic
structure. Accordingly, factors of intelligibility, quality in pronunciation
or adjustment to the expected sentence were not taken into account. Even in
the case of speakers with low cognitive level and serious problems of
intelligibility, the main criterion was whether they had modeled prosody with
certain success, even if the message was not understood. Following the
categories of intonational phonology [24] and the learning objectives included
in PRADIA [14], criteria concerning intonation, accent and prosodic
organization were used to judge if the sentence was Right or Wrong: in short,
adjustment to the expected modality; respect for the difference between
lexical stress and accent (tonal prominence); and adjustment to the
organization in prosodic groups, relying mainly on the distinction between
function and content words.

2.3. Feature extraction
The openSmile toolkit [25] was used to extract acoustic features from each
recording of the C1, C2 and C3 subcorpora. The GeMAPS feature set [26] was
selected due to the variety of acoustic and prosodic features contained in
this set. This set contains frequency related features, energy related
features, spectral features and temporal features. The arithmetic mean and the
coefficient of variation were calculated on these features. Furthermore, 4
additional temporal features were added: the silence and sounding percentages,
silences per second and the mean silences. The complete description of these
features can be found in previous research [27]. In this work, only prosodic
features (frequency, energy and temporal) have been used because spectral
features improve the speaker identification, and classifiers can be adapted to
each speaker in the classification process. In total, 34 prosodic features
were employed.

2.4. Automatic classification
As explained in Section 2.2, the recordings were evaluated by the therapist
and the prosody expert. Since the final aim of the module is to decide if the
gamer can continue the game or should repeat the activity (without considering
degrees of failure), the evaluation of the expert was used to build the
classifier. Accordingly, the output of the different classifiers is Right (R)
or Wrong (W), based on the prosody expert scoring. The Weka machine learning
toolkit [28] was used and three different classifiers were used to compare
their performance: the C4.5 decision tree (DT), the multilayer perceptron
(MLP) and the support vector machine (SVM). In addition, the results of using
the recordings of the three corpora as well as all combinations of these
corpora were compared.
Furthermore, the stratified 10-fold cross-validation technique was used to
create the training and testing datasets. We also used feature selection
before training the classifiers: the features were selected by measuring the
information gain on the training set and discarding the ones whose information
gain equals zero (column #Feat. in Table 3).

3. Results
Table 1 and Table 2 show a high difference between speakers related to their
developmental level and prosodic skills. S04 and S05 have the lowest scores in
verbal mental age (60 and 69, respectively), short-term verbal memory (below
74 for both speakers) and non-verbal cognitive level (10 and 13,
respectively). In addition, both of them have the lowest mean percentage of
success in perception PEPS-C tasks (60.42% and 56.25%, respectively) and lower
mean percentage of success in production PEPS-C tasks (49.76% and 54.84%,
respectively). These low scores are related to the quality of the productions,
with a higher percentage of W assignments from the prosody expert (42.75% and
49%, respectively) and higher percentage of Rep. from the therapist (47% and
50%, respectively).
The classification results highly depend on the corpus and the classifier used
(Table 3). SVM classifier works better with all corpora and the worst results
are obtained using DT classifier (best case is 79.3% vs 64.94% baseline). The
best results are obtained in Case A and D by using any of the three
classifiers (UAR 0.83 with SVM classifier). The classification accuracy
decreases when the C3 corpus is included (C, E, F and G cases) as the number
of speakers substantially increases. Moreover, when the same features are used
to identify speakers instead of the quality of the utterance (column SR rate
of Table 3), scenarios Case C, E and G are the worst ones and scenarios Case A
and B are the best. In order to see the influence of the speaker in the
classification results, we present results per speaker in Table 4.
We focus on Case D to present results per speaker in Table 4. Only the samples
of corpus C1 are analyzed because they were evaluated by the two evaluators.
Comparing the R-W judgments of the expert with the classifier predictions,
there is a high recall in R-R case for all speakers (S01 83.91%, S02 87.65%,
S03 97.44%, S04 94.67%, S05 87.01%). The coincidence in W-W case is lower:
while S02 and S05 present a reasonable classification rate (72% and 70.27%,
respectively), results for S03 go down to 26.32%. Concerning this result, we
note that most of the utterances judged as wrong by the expert were rated as
right by the therapist (100% in cell W-Cont.R for S03). On average, we obtain
only 10.05% of false negatives. This will be discussed in the next section as
a positive result for
Table 3: Classification results depending on the corpus and the classifier
used. The prosody expert judgments were used to train the classifiers. BL
means the performance baseline of each group of samples (number of samples of
the most populated class divided by all the samples). DT means Decision trees,
SVM means Support vector machines and MLP means Multilayer Perceptron. CR
means the classification rate, AUC means the Area Under the Curve and UAR
means the Unweighted Average Recall. The number of samples (#Utt.), the number
of speakers (#SPK), the number of features (#Feat.) and the speaker
classification rate using SVM (SR rate) are presented. The output of the
different classifiers is Right or Wrong, based on prosody expert scoring.

                           DT                   SVM                  MLP
        Corpora   BL       CR      AUC   UAR    CR      AUC   UAR    CR      AUC   UAR    #Utt.  #Feat.  #SPK  SR rate
Case A  C1        65.79%   69.57%  0.68  0.74   78.49%  0.74  0.83   73.23%  0.7   0.79   605    21      5     69.92%
Case B  C2        61.90%   60.26%  0.58  0.61   72.68%  0.7   0.79   68.49%  0.67  0.73   168    16      5     88.01%
Case C  C3        50.78%   65.76%  0.66  0.66   61.58%  0.62  0.69   63.71%  0.64  0.64   193    7       13    30.05%
Case D  C1+C2     64.94%   70.77%  0.68  0.75   79.3%   0.76  0.83   72.57%  0.7   0.78   773    21      10    64.94%
Case E  C1+C3     62.16%   66.29%  0.65  0.69   72.31%  0.7   0.79   67.17%  0.65  0.74   798    20      18    52.26%
Case F  C2+C3     55.96%   60.94%  0.6   0.64   66.47%  0.66  0.75   64%     0.63  0.69   361    13      18    64.27%
Case G  C1+C2+C3  62.11%   66.88%  0.66  0.71   74.32%  0.71  0.81   69.37%  0.66  0.76   996    20      23    59.21%
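The quantities in Table 3 can be sketched in a few lines. The following is a minimal illustration of the majority-class baseline (BL), the classification rate (CR), the unweighted average recall (UAR) and a stratified split into 10 folds, written in plain Python under the assumption that labels are the strings "R" and "W". The paper itself used Weka; the function names here are hypothetical, and the information-gain feature selection and the classifiers are not reproduced.

```python
import random
from collections import Counter, defaultdict


def baseline_rate(labels):
    """Share of the most populated class (BL in Table 3)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def classification_rate(y_true, y_pred):
    """Fraction of samples whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, each class weighted equally."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Expert judgments for corpus C1, summed over S01-S05 in Table 1:
# 398 Right and 207 Wrong utterances.
c1_labels = ["R"] * 398 + ["W"] * 207
c1_baseline = baseline_rate(c1_labels)  # 398 / 605, the BL reported for Case A
```

UAR is the more informative figure when the classes are unbalanced, since a classifier that always answers "R" reaches the baseline CR but only 0.5 UAR.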
5. References
[1] G. E. Martin, J. Klusek, B. Estigarribia, and J. E. Roberts, “Language
characteristics of individuals with down syndrome,” Topics in language
disorders, vol. 29, no. 2, p. 112, 2009.
[2] P. A. Eadie, M. Fey, J. Douglas, and C. Parsons, “Profiles of grammatical
morphology and sentence imitation in children with specific language
impairment and down syndrome,” Journal of Speech, Language, and Hearing
Research, vol. 45, no. 4, pp. 720–732, 2002.
[3] E. Smith, K.-A. B. Næss, and C. Jarrold, “Assessing pragmatic
communication in children with down syndrome,” Journal of communication
disorders, vol. 68, pp. 10–23, 2017.
[4] G. Laws and D. V. Bishop, “Verbal deficits in down’s syndrome and specific
language impairment: a comparison,” International Journal of Language &
Communication Disorders, vol. 39, no. 4, pp. 423–451, 2004.
[5] R. D. Kent and H. K. Vorperian, “Speech impairment in down syndrome: A
review,” Journal of Speech, Language, and Hearing Research, vol. 56, no. 1,
pp. 178–210, 2013.
[6] B. Heselwood, M. Bray, and I. Crookston, “Juncture, rhythm and planning in
the speech of an adult with down’s syndrome,” Clinical Linguistics &
Phonetics, vol. 9, no. 2, pp. 121–137, 1995.
[7] S. J. Peppé, “Why is prosody in speech-language pathology so difficult?”
International Journal of Speech-Language Pathology, vol. 11, no. 4,
pp. 258–271, 2009.
[8] P. Martínez-Castilla, M. Sotillo, and R. Campos, “Prosodic abilities of
spanish-speaking adolescents and adults with Williams syndrome,” Language and
Cognitive Processes, vol. 26, no. 8, pp. 1055–1082, 2011.
[9] S. Peppé, J. McCann, F. Gibbon, A. O’Hare, and M. Rutherford, “Receptive
and expressive prosodic ability in children with high-functioning autism,”
Journal of Speech, Language, and Hearing Research, vol. 50, no. 4,
pp. 1015–1028, 2007.
[10] V. Stojanovik, “Prosodic deficits in children with down syndrome,”
Journal of Neurolinguistics, vol. 24, no. 2, pp. 145–155, 2011.
[11] O. Saz, S.-C. Yin, E. Lleida, R. Rose, C. Vaquero, and W. R. Rodríguez,
“Tools and technologies for computer-aided speech and language therapy,”
Speech Communication, vol. 51, no. 10, pp. 948–967, 2009.
[12] W. R. Rodríguez, O. Saz, and E. Lleida, “A prelingual tool for the
education of altered voices,” Speech Communication, vol. 54, no. 5,
pp. 583–600, 2012.
[13] “Pradia,” http://www.pradia.net, accessed: 2018-07-18.
[14] L. Aguilar and Gutiérrez-González, “Aprendizaje prosódico en un
videojuego educativo dirigido a personas con síndrome de down: definición de
objetivos y diseño de actividades,” Revista de Educación Inclusiva, under
revision.
[15] D. Le, K. Licata, C. Persad, and E. M. Provost, “Automatic assessment of
speech intelligibility for individuals with aphasia,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2187–2199, 2016.
[16] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner,
M. Schuster, and E. Nöth, “Peaks–a system for the automatic evaluation of
voice and speech disorders,” Speech Communication, vol. 51, no. 5,
pp. 425–437, 2009.
[17] A. Maier, F. Hönig, C. Hacker, M. Schuster, and E. Nöth, “Automatic
evaluation of characteristic speech disorders in children with cleft lip and
palate,” in Ninth Annual Conference of the International Speech Communication
Association, 2008.
[18] H.-y. Lee, T.-y. Hu, H. Jing, Y.-F. Chang, Y. Tsao, Y.-C. Kao, and
T.-L. Pao, “Ensemble of machine learning and acoustic segment model techniques
for speech emotion and autism spectrum disorders recognition,” in
INTERSPEECH, 2013, pp. 215–219.
[19] L. Dunn, L. Dunn, and D. Arribas, “Test de vocabulario en imágenes
peabody,” Madrid: TEA, 2006.
[20] S. Corral, D. Arribas, P. Santamaría, M. Sueiro, and J. Pereña, “Escala
de inteligencia de Wechsler para niños-IV,” Madrid: TEA Ediciones, 2005.
[21] J. Raven, J. C. Raven et al., Test de matrices progresivas:
manual/Manual for Raven’s progressive matrices and vocabulary scales. Paidós,
1993, no. 159.9.072.
[22] P. Martínez-Castilla and S. Peppé, “Developing a test of prosodic ability
for speakers of iberian spanish,” Speech Communication, vol. 50, no. 11-12,
pp. 900–915, 2008.
[23] C. González-Ferreras, D. Escudero-Mancebo, M. Corrales-Astorgano,
L. Aguilar-Cuevas, and V. Flores-Lucas, “Engaging adolescents with down
syndrome in an educational video game,” International Journal of
Human–Computer Interaction, vol. 33, no. 9, pp. 693–712, 2017.
[24] D. R. Ladd, Intonational phonology. Cambridge University Press, 2008.
[25] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in
opensmile, the Munich open-source multimedia feature extractor,” in
Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013,
pp. 835–838.
[26] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso,
L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva
minimalistic acoustic parameter set (GeMAPS) for voice research and affective
computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2,
pp. 190–202, 2016.
[27] M. Corrales-Astorgano, D. Escudero-Mancebo, and C. González-Ferreras,
“Acoustic characterization and perceptual analysis of the relative importance
of prosody in speech of people with down syndrome,” Speech Communication,
vol. 99, pp. 90–100, 2018.
[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and
I. H. Witten, “The weka data mining software: an update,” ACM SIGKDD
explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[29] J. Grieco, M. Pulsifer, K. Seligsohn, B. Skotko, and A. Schwartz, “Down
syndrome: Cognitive and behavioral functioning across the lifespan,” in
American Journal of Medical Genetics Part C: Seminars in Medical Genetics,
vol. 169, no. 2. Wiley Online Library, 2015, pp. 135–149.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
117 10.21437/IberSPEECH.2018-25
to handle speech reconstruction tasks. SEGAN was designed as
a speaker- and noise-agnostic model to generate clean/enhanced
versions of aligned noisy speech signals. From now on, we
change signal names for the new task, so we rather work with
natural, voiced (i.e. restored) and whispered speech signals. To
adapt the architecture to the task of voiced speech restoration,
we decide to remove the audio alignment requirement, as the
data we use has slight misalignments between input and output
speech (see Section 3.1 for more details). In addition, we intro-
duce a number of improvements that consistently stabilize and
facilitate its training after direct regularization over the wave-
form is removed. These modifications also refine the generated
quality at the generator output when regression is removed.
2.1. SEGAN
We now outline the most basic aspects of SEGAN, specifically highlighting the
ones that have been subject to change. For the sake of brevity we refer the
reader to the original paper and code [15] for more detailed explanations on
the old architecture and setup. The SEGAN generator network (G) embeds an
input noisy waveform chunk into the latent space via a convolutional encoder.
Then, the reconstruction is made in the decoder by ‘deconvolving’ back the
latent signals into the time domain. G features skip connections with constant
factors (acting as identity functions) so that low-level features can escape a
potentially unnecessary compression from the encoder. Such skip connections
also improve training stability, as they allow gradients to flow better across
the deep structure of G, which has a total of 22 layers. In the denoising
setup, an L1 regularization term helped center output predictions around 0,
discouraging G from exploring bizarre amplitude magnitudes that could make the
discriminator network (D) converge to easy discriminative solutions for the
fake adversarial case.

Figure 1: Generator network architecture. Skip connections with learnable
$a_l$ are depicted with purple boxes. These are summed to each intermediate
activation of the decoder. Encoder and decoder are like in the original SEGAN
[15], but with half the amount of layers and doubled pooling per layer.

2.2. Adapted SEGAN
The SEGAN architecture has been adapted to cope with misalignments in the
input/output signals as mentioned before, as well as to achieve a more stable
architecture and to produce better quality outputs. In the current setup,
similarly to the original SEGAN mechanism, we inject whisper data to G, which
compresses it and then recovers a version of the utterance with prosodic
information. To cope with misalignments, we get rid of the L1 regularization
term, as this was forcing a one-to-one correspondence between audio samples,
assuming input and output had the same phase. In its place we use a softer
regularization which works in the spectral domain, similar to the one used in
the parallel Wavenet [17]. We use a non-averaged version of this loss though,
as we work with large frames during training (16,384 samples per sequence),
and averaging the spectral frames over this large span could be ineffective.
Moreover, we calculate the loss as an absolute distance in decibels between
the generated speech and the natural one.
The spectral regularization is added to the adversarial loss coming from D
with a weighting factor λ. In SEGAN, D is a learnable comparative loss
function between natural or voiced signals and whispered ones. This means we
have a (natural, whispered) paired input as a real batch sample and (voiced,
whispered) as a fake batch sample. In contrast, G has to make (voiced,
whispered) true, thus being the adversarial objective. In the current setup,
we add an additional fake signal in D that will enforce the preservation of
intelligibility when we forward data through G: the (natural, random natural
shuffle) pair. This pair tries to send messages to G about a bad behavior
whenever the content between both chunks, the one coming from G and the
reference one, changes. Note that we are using the least-squares GAN form
(LSGAN) in the adversarial component, so our loss functions, for D and G
respectively, become

\[
\min_D V(D) = \frac{1}{3}\,\mathbb{E}_{x,\tilde{w} \sim p_{\mathrm{data}}(x,\tilde{w})}\big[(D(x,\tilde{w}) - 1)^2\big]
  + \frac{1}{3}\,\mathbb{E}_{z \sim p_z(z),\,\tilde{w} \sim p_{\mathrm{data}}(\tilde{w})}\big[D(G(z,\tilde{w}),\tilde{w})^2\big]
  + \frac{1}{3}\,\mathbb{E}_{x,x_r \sim p_{\mathrm{data}}(x)}\big[D(x,x_r)^2\big]
\]
\[
\min_G V(G) = \mathbb{E}_{z \sim p_z(z),\,\tilde{w} \sim p_{\mathrm{data}}(\tilde{w})}\big[(D(G(z,\tilde{w}),\tilde{w}) - 1)^2\big],
\]

where $\tilde{w} \in \mathbb{R}^T$ is the whispered utterance,
$x \in \mathbb{R}^T$ is the natural speech, $x_r \in \mathbb{R}^T$ is a
randomly chosen natural chunk within the batch,
$G(z,\tilde{w}) \in \mathbb{R}^T$ is the enhanced speech, and
$[D(x,\tilde{w}), D(G(z,\tilde{w}),\tilde{w}), D(x,x_r)]$ are the
discriminator decisions for each input pair. All of these signals are vectors
of length T samples except for D outputs, which are scalars. T is a
hyper-parameter fixed during training but it is variable during test
inference.
After removing the L1 regularization factor, the generator output can explore
large amplitudes whilst adapting to mimic the speech distribution. As a matter
of fact, this collapsed the training whenever the tanh activation was placed
in the output layer of G to bound its output to [−1, 1], because the amplitude
grew quickly with aggressive gradient updates and tanh would not allow G to
properly update anymore due to saturation. The way to correct this was
bounding the gradient of D by applying spectral normalization as proposed in
[18]. The discriminator does not have any batch normalization technique in
this implementation, and its architecture is the same as in our previous work.
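The two least-squares objectives of Section 2.2 can be written down compactly once the expectations are replaced by batch means. The following is a minimal sketch under that assumption, operating on plain lists of discriminator scores; the function and argument names are illustrative and not taken from the SEGAN code, and the spectral regularization term weighted by λ is deliberately left out.

```python
def _mean_squared_gap(scores, target):
    """Batch mean of (score - target)^2, a stand-in for the expectation."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def discriminator_loss(d_real, d_fake, d_shuffled):
    """V(D): three equally weighted least-squares terms.

    d_real     -- D(x, w~) for (natural, whispered) pairs, pushed towards 1
    d_fake     -- D(G(z, w~), w~) for (voiced, whispered) pairs, pushed towards 0
    d_shuffled -- D(x, x_r) for (natural, random natural) pairs, pushed towards 0
    """
    return (_mean_squared_gap(d_real, 1.0)
            + _mean_squared_gap(d_fake, 0.0)
            + _mean_squared_gap(d_shuffled, 0.0)) / 3.0


def generator_loss(d_fake):
    """V(G): G wants D to score its (voiced, whispered) pairs as real."""
    return _mean_squared_gap(d_fake, 1.0)
```

A perfect discriminator (scores of 1 on real pairs and 0 on both kinds of fake pairs) gives a discriminator loss of 0, while the generator loss for the same fake scores is 1; the two objectives pull the score of the generated pair in opposite directions.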
Figure 2: From left to right: natural speech, whispered speech (input to G)
and output from G as voiced signal.

The new design of G is shown in Figure 1. It remains a fully convolutional
encoder-decoder structure with skip connections, but with two changes. First,
we reduce the number of layers by increasing the pooling factor from 2 to 4 at
every encoder-decoder layer. This is in line with preliminary experiments on
the denoising task, where increasing pooling has been effective in improving
objective scores for that task. Second, we introduce learnable skip
connections, and these are now summed instead of concatenated to decoder
feature maps. We thus now have learnable vectors $a_l$ which multiply every
channel of their corresponding shuttle layer $l$ by a scalar factor
$\alpha_{l,k}$. These factors are all initialized to one. Hence, at the j-th
decoder layer input we have the addition of the l-th encoder layer response
following

\[
h_j = h_{j-1} + a_l \odot h_l ,
\]

where $\odot$ is an element-wise product along channels.

3. Experimental Setup
To evaluate the performance of our technique, a clinical application involving
the generation of audible speech from captured movement of the speech
articulators is tested. More details about the experimental setup in terms of
dataset, baseline and hyper-parameters for our proposed approach are given
below.

3.1. Task and Dataset
In our previous work [6, 13], a silent speech system aimed at helping
laryngectomy patients to recover their voices was described. The system
comprised an articulator motion capture device [19], which monitored the
movement of the lips and tongue by tracking the magnetic field generated by
small magnets attached to them, and a synthesis module, which generated speech
from articulatory data. To generate speech acoustics, recurrent neural
networks (RNNs) trained on parallel articulatory and speech data were used.
The speech produced by this system had a reasonable quality when evaluated on
normal speakers, but it was not completely natural owing to limitations when
estimating the pitch (i.e., the capturing device did not have access to any
information about the glottal excitation).
In this work, we are interested in determining whether the proposed adapted
SEGAN could improve those signals by generating more natural and realistic
prosodic contours. To evaluate this, we have articulatory and speech data
available, recorded CMU Arctic corpus [20] (25 minutes for each speaker,
approximately). Then, whispered speech was generated from the articulatory
data by using the RNN-based articulatory-to-speech synthesiser described in
[13]. In this work, these whispered signals are taken as the input to SEGAN,
which acts as a post-filter enhancing the naturalness of the signals. For each
whispered signal we have a natural version, which is the original speech
signal recorded by the subject. To simplify our first modeling approach we
used one male and one female speaker, namely M4 and F1, and built two
speaker-dependent SEGAN models. These speakers are selected for the better
level of intelligibility of their whisper data within their genders. We want
to note, however, that both female speakers are less intelligible in their
whisper form than male speakers. These two speakers’ data is split into two
sets: (1) training, with approximately 90% of the utterances and (2) test,
with the remaining approximately 10%. In order to have augmented data we
follow the same chunking method as in our previous work [15] but window
strides are one order of magnitude smaller. Hence we have a canvas of 16,384
samples (≈ 1 second at 16 kHz) every 50 ms, in contrast with the previous
0.5 s.

3.2. SEGAN Setup
We use the same kernel widths of 31 as we had in [15], both when encoding and
decoding and for both G and D networks. The feature maps are incremental in
the encoder and decremental in the decoder, having {64, 128, 256, 512, 1024,
512, 256, 128, 64, 1} in the generator and {64, 128, 256, 512, 1024} in the
discriminator convolutional structures. The discriminator has a linear layer
at the end with a single output neuron, as in the original SEGAN setup. The
latent space is constructed with the concatenation of the thought vector
$c \in \mathbb{R}^{\frac{T}{1024} \times 1024}$ with the noise vector
$z \in \mathbb{R}^{\frac{T}{1024} \times 1024}$, where
$z \sim \mathcal{N}(0, I)$. Both networks are trained with the Adam [21]
optimizer, with the two-timescale update rule (TTUR) [22], such that D will
have a four times faster learning rate to virtually emulate many iterations in
D prior to updating G. This way, we have D learning rate 0.0004 and G learning
rate 0.0001, with β1 = 0 and β2 = 0.9, which are the same schedules based on
recent successful approaches to faster and stable convergent adversarial
training [23]. All signals processed by the GAN, either in the input/output of
G or the input of D, are pre-emphasized with a 0.95 factor, as it proved to
help coping with some high-frequency artifacts in the de-noising setup. When
we generate
simultaneously for 6 healthy British subjects (2 females and voiced data out of G we de-emphasize it with the same factor to
4 males). Each speaker has recorded a random subset of the get the final result.
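The chunking scheme of Section 3.1 (a 16,384-sample canvas extracted every 50 ms at 16 kHz) can be sketched as a sliding window. This is only an illustration under stated assumptions: the function name `chunk_signal` is hypothetical, and dropping the final incomplete window is our choice, since the text does not say how the tail of an utterance is handled.

```python
def chunk_signal(samples, window=16384, stride_ms=50, sample_rate=16000):
    """Slice a waveform into overlapping fixed-length chunks.

    window=16384 and stride_ms=50 follow the values quoted in the text;
    discarding the trailing partial window is an assumption.
    """
    stride = int(sample_rate * stride_ms / 1000)  # 50 ms -> 800 samples at 16 kHz
    chunks = []
    # The last valid start index is len(samples) - window.
    for start in range(0, len(samples) - window + 1, stride):
        chunks.append(samples[start:start + window])
    return chunks


if __name__ == "__main__":
    dummy = [0.0] * (16384 + 1600)  # slightly more than one canvas of audio
    print(len(chunk_signal(dummy)))                  # chunks with a 50 ms stride
    print(len(chunk_signal(dummy, stride_ms=500)))   # chunks with the former 0.5 s stride
```

With the same audio, the 50 ms stride yields roughly ten times as many training chunks as the 0.5 s stride, which is the augmentation effect the text describes.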
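The pre-emphasis and de-emphasis steps mentioned in the SEGAN setup can be illustrated with a first-order filter pair. The text only gives the 0.95 factor, so the exact filter form below (the common `y[n] = x[n] - 0.95 x[n-1]`) and the function names are assumptions rather than the authors' implementation.

```python
def pre_emphasize(signal, coef=0.95):
    """First-order pre-emphasis: y[n] = x[n] - coef * x[n-1], y[0] = x[0].

    The filter form is assumed; the source states only the 0.95 factor.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coef * signal[n - 1])
    return out


def de_emphasize(signal, coef=0.95):
    """Inverse filter: x[n] = y[n] + coef * x[n-1], x[0] = y[0]."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        # Uses the already reconstructed previous sample (recursive filter).
        out.append(signal[n] + coef * out[n - 1])
    return out


if __name__ == "__main__":
    x = [0.1, 0.5, -0.2, 0.3]
    restored = de_emphasize(pre_emphasize(x))
    print(max(abs(a - b) for a, b in zip(x, restored)))  # reconstruction error
```

Because the two filters are exact inverses, de-emphasizing the generator output with the same factor recovers a signal in the original spectral balance.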
Figure 3: Histograms of pitch values in Hertz per utterance for male speaker (left) and female speaker (right). The three systems
appearing are natural signals; RNN baseline voiced predictions with vocoder features; and voiced speech using SEGAN.
Figure 4: Section of pitch contour of a test utterance of the male speaker calculated with Ahocoder from 4 different sources: Natural
data (blue); RNN baseline (orange); Voiced with seed 100 (green) and voiced with seed 200 (red). Here it is shown how changing the
seed indeed creates different plausible contours.
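As a rough illustration of how per-system pitch statistics such as those behind Figure 3 could be gathered, the sketch below converts logF0 tracks to Hertz, keeps only voiced frames, and concatenates them into one stream per system. It assumes natural-log F0 values and that unvoiced frames are marked by a non-positive value; both are assumptions, as are the function names, and the actual output convention of the pitch extractor should be checked.

```python
import math
import statistics


def voiced_f0_hz(log_f0_track, unvoiced_threshold=0.0):
    """Convert voiced logF0 frames to Hz.

    Assumption: frames with log F0 <= unvoiced_threshold are unvoiced
    (for example, a large negative sentinel value).
    """
    return [math.exp(v) for v in log_f0_track if v > unvoiced_threshold]


def pitch_summary(utterances):
    """Concatenate the voiced frames of all utterances of one system.

    Returns (mean, population standard deviation) in Hz; expects at
    least one voiced frame overall.
    """
    stream = []
    for track in utterances:
        stream.extend(voiced_f0_hz(track))
    return statistics.mean(stream), statistics.pstdev(stream)


if __name__ == "__main__":
    # Two toy utterances; -1e10 stands in for an unvoiced frame.
    toy = [[math.log(100.0), -1e10, math.log(200.0)], [math.log(150.0)]]
    print(pitch_summary(toy))
```

Comparing the standard deviation of each system's stream is one simple way to quantify the "broader variance" that the histograms show visually.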
3.3. Baseline

To assess the performance of SEGAN in this task we use as reference the RNN-based articulatory-to-speech system from our previous work [13] and the natural data for each modeled speaker. The recurrent model is used to predict both the spectral parameters (i.e., MFCCs) and the pitch parameters (i.e., fundamental frequency, aperiodicities and unvoiced-voiced decision) from the articulatory data, so the source is articulatory data and not whispered speech in that case. The STRAIGHT vocoder [24] is then employed to synthesise the waveform from the predicted parameters.

4. Results

We analyze the statistics of the generated pitch contours for the RNN, SEGAN and natural data. Figure 3 depicts the histograms of all contours extracted from predicted/natural waveforms. Ahocoder [25] was used to extract logF0 curves, which are then converted to the Hertz scale. Then, all voiced frames were selected and concatenated for each of the three systems, yielding one long stream per system and gender. It can be seen that, for both genders, the voiced histograms (corresponding to SEGAN) have a broader variance than the RNN ones and are closer to the shape of the natural signal. This is understandable if we consider that the RNN was trained with a regression criterion that optimizes its output towards the mean of the pitch distribution. This ends up producing a monotonic prosody effect, normally manifested as a robotic sound that can be heard in the audio samples referenced below. This indicates that the adversarial procedure can generate more natural pitch values.

Figure 4 shows pitch contours generated by SEGAN with different random seeds. Note that each random seed generates a different latent vector z, so the stochasticity creates novel curves that look plausible. It can also be noted that SEGAN made some errors in determining the correct voicing decision for some speech segments. We may enforce a better behavior in a future version of the system with an auxiliary unvoiced/voiced classifier at the output of G.

Finally, Figure 2 shows examples of waveforms and spectrograms for natural, whispered and voiced signals. We can appreciate how, for a small chunk of waveform, the generator network is able to refine low frequencies and get rid of high-frequency noise to approximate the natural data. Preliminary listening tests suggest that this model can achieve a good natural voiced version of the speech, but some artifacts intrinsic to the convolutional architecture (especially at high frequencies) have to be mitigated. This observation is in line with what was also noted in the WaveGAN work [26], and it is one of the potential reasons for the effectiveness of pre-emphasis. We refer the reader to the audio samples to get a feeling of the current quality of our system.¹

5. Conclusions

We presented a speaker-dependent end-to-end generative adversarial network that acts as a post-filter of whispered speech to deal with a pathological application. We adapted our previous speech enhancement GAN architecture to overcome misalignment issues and still obtained a stable GAN architecture to reconstruct voiced speech. The model is able to generate novel pitch contours by only seeing the whispered version of the speech at its input. The method generates richer curves than the baseline, which sounds monotonic in terms of prosody. Future lines of work include an even more end-to-end approach by going sensor-to-speech. Also, further study is required to alleviate the intrinsic high-frequency artifacts provoked by the type of decimation-interpolation architecture we base our design on.

6. Acknowledgements

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).

¹ http://veu.talp.cat/whispersegan/
7. References

[1] J. Chen, H. Yang, X. Wu, and B. C. Moore, “The effect of f0 contour on the intelligibility of speech in the presence of interfering sounds for Mandarin Chinese,” The Journal of the Acoustical Society of America, vol. 143, no. 2, pp. 864–877, 2018.
[2] S. Popham, D. Boebinger, D. P. Ellis, H. Kawahara, and J. H. McDermott, “Inharmonic speech reveals the role of harmonicity in the cocktail party problem,” Nature Communications, vol. 9, no. 1, p. 2122, 2018.
[3] T. Toda, A. W. Black, and K. Tokuda, “Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model,” Speech Communication, vol. 50, no. 3, pp. 215–227, 2008.
[4] K. Nakamura, M. Janke, M. Wand, and T. Schultz, “Estimation of fundamental frequency from surface electromyographic data: EMG-to-F0,” in Proc. of ICASSP. IEEE, 2011, pp. 573–576.
[5] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
[6] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, “Direct speech reconstruction from articulatory sensor data by machine learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2362–2374, 2017.
[7] R. W. Morris and M. A. Clements, “Reconstruction of speech from whispers,” Medical Engineering and Physics, vol. 24, no. 7, pp. 515–520, 2002.
[8] F. Ahmadi, I. V. McLoughlin, and H. R. Sharifzadeh, “Analysis-by-synthesis method for whisper-speech reconstruction,” in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on. IEEE, 2008, pp. 1280–1283.
[9] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, “Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 10, pp. 2448–2458, 2010.
[10] J. Li, I. V. McLoughlin, and Y. Song, “Reconstruction of pitch for whisper-to-speech conversion of Chinese,” in Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on. IEEE, 2014, pp. 206–210.
[11] A. K. Fuchs and M. Hagmüller, “Learning an artificial f0-contour for alt speech,” in Proc. of Interspeech, 2012.
[12] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, “Silent speech interfaces,” Speech Communication, vol. 52, no. 4, pp. 270–287, 2010.
[13] J. A. Gonzalez, L. A. Cheah, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, “Evaluation of a silent speech interface based on magnetic sensing and deep learning for a phonetically rich vocabulary,” in Proc. of Interspeech, 2017, pp. 3986–3990.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
[15] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech enhancement generative adversarial network,” in Proc. of Interspeech, 2017, pp. 3642–3646.
[16] S. Pascual, M. Park, J. Serrà, A. Bonafonte, and K.-H. Ahn, “Language and noise transfer in speech enhancement generative adversarial network,” in Proc. of ICASSP. IEEE, 2018, pp. 5019–5023.
[17] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
[19] M. J. Fagan, S. R. Ell, J. M. Gilbert, E. Sarrazin, and P. M. Chapman, “Development of a (silent) speech recognition system for patients following laryngectomy,” Medical Engineering & Physics, vol. 30, no. 4, pp. 419–425, 2008.
[20] J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004, pp. 223–224.
[21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[23] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
[24] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[25] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics Plus Noise Model Based Vocoder for Statistical Parametric Speech Synthesis,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 184–194, Apr. 2014.
[26] C. Donahue, J. McAuley, and M. Puckette, “Synthesizing audio with generative adversarial networks,” arXiv preprint arXiv:1802.04208, 2018.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Aholab
University of the Basque Country (UPV/EHU), Bilbao, Spain
{lserrano, david, xsarasola, sneha, ibon, eva, inma}@aholab.ehu.eus
10.21437/IberSPEECH.2018-26
In this paper we propose to take advantage of recent advances in machine learning to train a deep learning system that converts the audio of an oesophageal speaker into that of a healthy speaker. Though the literature focuses especially on spectral conversion, the importance of prosody in oesophageal speech cannot be left out. Good intonation is important for the utterances to be perceived as more natural and pleasant. Therefore, the proposed system not only converts the spectral features but also estimates f0. The contributions of the spectral and f0 conversions are then evaluated separately by means of word error rate (WER) and a perceptual test.
3. Evaluation
3.1. Objective evaluation: Kaldi ASR
Table 1: WER results for the different experiments
tion Technology (ISSPIT), 2013 IEEE International Symposium [23] H. Benisty and D. Malah, “Voice conversion using gmm with en-
on. IEEE, 2013, pp. 000 210–000 214. hanced global variance,” in Twelfth Annual Conference of the In-
[5] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, “Recon- ternational Speech Communication Association, 2011.
struction of normal sounding speech for laryngectomy patients [24] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, and Z. Yang, “Voice
through a modified celp codec,” IEEE Transactions on Biomed- conversion based on gaussian processes by coherent and asym-
ical Engineering, vol. 57, no. 10, pp. 2448–2458, 2010. metric training with limited training data,” Speech Communica-
[6] A. del Pozo and S. Young, “Continuous tracheoesophageal speech tion, vol. 58, pp. 124–138, 2014.
repair,” in Eusipco, 2006, pp. 1–5. [25] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegna-
[7] ——, “Repairing tracheoesophageal speech duration,” in Speech narayana, “Transformation of formants for voice conversion using
Prosody, 2008, pp. 187–190. artificial neural networks,” Speech communication, vol. 16, no. 2,
pp. 207–216, 1995.
[8] O. Schleusing, R. Vetter, P. Renevey, J.-M. Vesin, and
V. Schweizer, “Prosodic speech restoration device: Glottal exci- [26] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad,
tation restoration using a multi-resolution approach,” in Interna- “Spectral mapping using artificial neural networks for voice con-
tional Joint Conference on Biomedical Engineering Systems and version,” IEEE Transactions on Audio, Speech, and Language
Technologies. Springer, 2010, pp. 177–188. Processing, vol. 18, no. 5, pp. 954–964, 2010.
[9] H. Doi, K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, [27] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using
“Statistical approach to enhancing esophageal speech based deep bidirectional long short-term memory based recurrent neural
on gaussian mixture models,” in Acoustics Speech and Signal networks,” in Acoustics, Speech and Signal Processing (ICASSP),
Processing (ICASSP), 2010 IEEE International Conference on. 2015 IEEE International Conference on. IEEE, 2015, pp. 4869–
IEEE, 2010, pp. 4250–4253. 4873.
[10] ——, “Esophageal speech enhancement based on statistical voice [28] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J.
conversion with gaussian mixture models,” IEICE TRANSAC- Qian, H. M. Meng, and L. Deng, “Deep learning for acoustic
TIONS on Information and Systems, vol. 93, no. 9, pp. 2472– modeling in parametric speech generation: A systematic review
2482, 2010. of existing techniques and future trends,” IEEE Signal Processing
Magazine, vol. 32, no. 3, pp. 35–52, 2015.
[11] H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano,
“Alaryngeal speech enhancement based on one-to-many eigen- [29] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics plus noise
voice conversion,” IEEE/ACM Transactions on Audio, Speech, model based vocoder for statistical parametric speech synthesis,”
and Language Processing, vol. 22, no. 1, pp. 172–183, 2014. IEEE Journal of Selected Topics in Signal Processing, vol. 8,
no. 2, pp. 184–194, 2014.
[12] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on
maximum-likelihood estimation of spectral parameter trajectory,” [30] P. Alku, “Glottal wave analysis with pitch synchronous iterative
IEEE Transactions on Audio, Speech, and Language Processing, adaptive inverse filtering,” Speech communication, vol. 11, no. 2-
vol. 15, no. 8, pp. 2222–2235, 2007. 3, pp. 109–118, 1992.
[13] D. Erro, E. Navas, and I. Hernaez, “Parametric voice conver- [31] M. Kishimoto, T. Toda, H. Doi, S. Sakti, and S. Nakamura,
sion based on bilinear frequency warping plus amplitude scaling,” “Model training using parallel data with mismatched pause po-
IEEE Transactions on Audio, Speech and Language Processing, sitions in statistical esophageal speech enhancement,” in Signal
vol. 21, no. 3, pp. 556 – 566, 2013. Processing (ICSP), 2012 IEEE 11th International Conference on,
[14] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conver- vol. 1. IEEE, 2012, pp. 590–594.
sion using deep neural networks with layer-wise generative train- [32] D. Erro, E. Navas, and I. Hernáez, “Iterative MMSE estima-
ing,” IEEE/ACM Transactions on Audio, Speech and Language tion of vocal tract length normalization factors for voice trans-
Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014. formation,” in Thirteenth Annual Conference of the International
[15] S. H. Mohammadi and A. Kain, “An overview of voice conversion Speech Communication Association, 2012.
systems,” Speech Communication, vol. 88, pp. 65–82, 2017. [33] D. Erro, A. Alonso, L. Serrano, D. Tavarez, I. Odriozola, X. Sara-
[16] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice con- sola, E. del Blanco, J. Sánchez, I. Saratxaga, E. Navas et al., “Ml
version through vector quantization,” Journal of the Acoustical parameter generation with a reformulated mge training criterion-
Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990. participation in the voice conversion challenge 2016.” in Inter-
speech, 2016.
[17] L. M. Arslan, “Speaker transformation algorithm using segmental
codebooks (stasc) 1,” Speech Communication, vol. 28, no. 3, pp. [34] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M.
211–226, 1999. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, “Direct speech
reconstruction from articulatory sensor data by machine learning,”
[18] A. Bonafonte, A. Kain, J. v. Santen, and H. Duxans, “Including IEEE/ACM Transactions on Audio, Speech, and Language Pro-
dynamic and phonetic information in voice conversion systems,” cessing, vol. 25, no. 12, pp. 2362–2374, 2017.
in Eighth International Conference on Spoken Language Process-
ing, 2004. [35] I. Sainz, D. Erro, E. Navas, I. Hernáez, J. Sanchez,
I. Saratxaga, and I. Odriozola, “Versatile Speech Databases
[19] C.-H. Lee, C.-H. Wu, and J.-C. Guo, “Pronunciation variation for High Quality Synthesis for Basque,” in 8th international
generation for spontaneous speech synthesis using state-based conference on Language Resources and Evaluation (LREC),
voice transformation,” in Acoustics Speech and Signal Process- 2012, pp. 3308–3312. [Online]. Available: http://www.lrec-
ing (ICASSP), 2010 IEEE International Conference on. IEEE, conf.org/proceedings/lrec2012/pdf/126 Paper.pdf
2010, pp. 4826–4829.
[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
[20] H. Zen, Y. Nankaku, and K. Tokuda, “Continuous stochastic fea-
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
ture mapping based on trajectory hmms,” IEEE Transactions on
“The kaldi speech recognition toolkit,” in IEEE 2011 workshop
Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 417–
on automatic speech recognition and understanding, no. EPFL-
430, 2011.
CONF-192584. IEEE Signal Processing Society, 2011.
[21] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilis-
[37] S. P. Rath, D. Povey, K. Veselỳ, and J. Cernockỳ, “Improved fea-
tic transform for voice conversion,” IEEE Transactions on speech
ture processing for deep neural networks.” in Interspeech, 2013,
and audio processing, vol. 6, no. 2, pp. 131–142, 1998.
pp. 109–113.
[22] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice
[38] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga,
conversion using partial least squares regression,” IEEE Transac-
“Aholab system for albayzin 2016 search-on-speech evaluation,”
tions on Audio, Speech, and Language Processing, vol. 18, no. 5,
in IberSPEECH, 2016, pp. 33–42.
pp. 912–921, 2010.
126
1 https://www.leapmotion.com/ 2 http://chalearnlap.cvc.uab.es/dataset/13/description/
10.21437/IberSPEECH.2018-27
In this work, Convolutional Networks are used to classify gestures from the Spanish sign language. Some improvements in recognition accuracy are obtained with respect to the results of our previous studies [8, 9].

3. Experimental setup

3.1. Data preparation

The experimental data is the isolated gestures subset of the “Spanish sign language db”³, which is described in [8]. All gestures in this database are dynamic. Data was acquired using the Leap Motion sensor. Gestures are described with sequences of 21 variables per hand, which gives a sequence of 42-dimensional feature vectors per gesture. The isolated gesture subset is formed of samples corresponding to 92 words; for each word, 40 examples performed by 4 different people were acquired, giving a total of 3,680 acquisitions. The dataset was divided into 4 balanced partitions in order to conduct cross-validation experiments in the same fashion as those described in [9].

The data corresponding to each gesture is saved in a separate text file. Each row in the file contains the information (42 feature values) about the gesture in a given frame. Gestures have different numbers of frames; therefore, a preprocessing step is required because we need fixed-length data to train a neural network. Three different methods were used to fix the number of rows of each gesture to a common length, max length:

• Trim the data to a fixed length (keeping the last max length vectors) or pad with 0 (at the beginning) to get the same number of rows.
• Our implementation of the trace segmentation technique [25].
• Linear interpolation of the data using the interp1d method from the SciPy library.

3.2. Model

In our experiments we use the LeNet network [26]. LeNet is a convolutional network that has several applications, such as handwriting recognition or speech recognition. The architecture of this network is designed to ensure some degree of invariance to shift, scale, and distortion.

In our case, the input plane receives the gesture data, and each unit in a layer receives inputs from a set of units located in a small neighbourhood in the previous layer. Figure 1 shows the scheme of the LeNet architecture. The first layer is a 2-dimensional convolution layer which learns convolution filters, where each filter is 20 × 20 units. After that, the ReLU activation function is applied, followed by max-pooling with a kernel size of 2. Once again, a convolution layer is applied, but this time with 20 convolution filters. The ReLU activation function is then applied, after first applying max-pooling. A dropout layer with p = 0.5 is added and, finally, a fully-connected layer is used. The number of input features of this layer depends on the max length parameter, and it is calculated following Equation (1).

$$\frac{\text{max length} - 12}{4} \cdot 140 \qquad (1)$$

The number of output features is fixed to 1,000. Finally, there is an output layer with 92 neurons, corresponding to the number of classes.

4. Experiments and results

We conducted experiments using our LeNet network. In each experiment we varied the value of patience p ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50} and max length ∈ {40, 50, 60, 70, 80, 90, 97, 100}. Patience controls early stopping, a technique for controlling overfitting in neural networks by stopping training before the weights have converged: we stop the training when the performance has not improved for a given number of epochs. The original code and LeNet configuration are designed for speech recognition, where the maximum length was fixed to 97 by default. That is the reason why this value is used in the experiments. The experiments are compared against the baseline provided by [9], which used Hidden Markov Models (HMM) to obtain a classification error rate of 10.6 ± 0.9%. Confidence intervals were calculated using the bootstrapping method with 10,000 repetitions [27], and they are all around the same value for all the experiments (±0.9). All the results shown in the following are classification error rates obtained through cross-validation using the same four partitions as in [9].

For each experiment we used the three data preprocessing techniques described above. Table 1 shows gesture classification results using the trim/zero padding preprocessing technique. We added a colour scale to make it easier to see how classification performance depends on the patience and max length values. Dark colours correspond to higher error rates and light colours to lower error rates. We can appreciate that the best score is obtained with a patience of 30 or 40 epochs and a max length of 80 rows (9.3% error rate). In general, the higher the patience, the lower the error, but only up to a certain value. With respect to max length, the behaviour does not present a clear pattern.

Table 1: Results (classification error rate, %) with trim/zero padding preprocessing. Best results marked with “*”.

Patience \ Max length     40     50     60     70     80     90     97    100
 5                      12.0   13.1   12.0   12.5   11.6   12.7   12.6   11.4
10                      11.0   11.0   10.1   11.6   10.6   11.8   11.4   11.2
15                      10.8   11.2    9.8   11.3    9.8   11.4   11.1   10.3
20                      10.5   10.5   10.5   10.9    9.4   11.1   10.8    9.9
25                      10.4   10.0    9.6   10.9    9.5   10.5   11.1   10.2
30                      11.1   10.6    9.4   10.6   *9.3   11.9   10.2   10.3
35                      10.9   10.7   10.0   10.5    9.4   11.0   10.3   10.2
40                      10.7   10.4    9.9   10.5   *9.3   10.8   10.3    9.9
45                      11.2   10.1   10.4   10.5    9.7   11.0   10.3   10.1
50                      11.2   10.1   10.3   10.5    9.6   10.5   10.3   10.2

Table 2 shows the most frequently confused gestures in gesture classification using trim/zero padding as the preprocessing step. The confusions come from the global confusion matrix generated during the cross-validation experiments.

Table 3 shows the results with trace-segmentation preprocessing. In this case, the best score is obtained with a patience of 45 epochs and a max length of 100 rows (9.2% error rate). This result is slightly better than the best score obtained with the trim/zero padding technique. However, according to confidence

³ https://github.com/Sasanita/spanish-sign-language-db
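Two of the three length-normalisation methods listed in Section 3.1 are simple enough to sketch directly. The code below is a plain-Python illustration, not the authors' implementation: the function names are hypothetical, gestures are assumed to be lists of equal-length feature vectors, and the interpolation is written by hand as a stand-in for the `interp1d` call mentioned in the text.

```python
def trim_or_pad(frames, max_length):
    """Keep the last max_length frames, or left-pad with zero vectors.

    Follows the description in the text: trimming keeps the end of the
    gesture, padding is added at the beginning. Assumes max_length > 0.
    """
    if len(frames) >= max_length:
        return [list(f) for f in frames[-max_length:]]
    dim = len(frames[0]) if frames else 0
    padding = [[0.0] * dim for _ in range(max_length - len(frames))]
    return padding + [list(f) for f in frames]


def interpolate_frames(frames, max_length):
    """Resample a non-empty gesture to max_length frames, linearly in time."""
    n = len(frames)
    if n == 1 or max_length == 1:
        return [list(frames[0]) for _ in range(max_length)]
    out = []
    for i in range(max_length):
        pos = i * (n - 1) / (max_length - 1)  # position on the original time axis
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo                          # weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


if __name__ == "__main__":
    gesture = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2 features (toy data)
    print(trim_or_pad(gesture, 2))
    print(trim_or_pad(gesture, 5))
    print(interpolate_frames(gesture, 5))
```

Trimming discards the start of long gestures while interpolation keeps the whole trajectory at a different time resolution, which may be one reason the methods behave differently in the result tables.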
Figure 1: Schematic diagram of LeNet architecture [24].
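Equation (1) of Section 3.2 ties the input size of the fully-connected layer to the chosen sequence length. A small helper makes the dependence concrete for the max length values used in the experiments. The function name is hypothetical, and the use of floor division is an assumption: for lengths such as 97 the quotient in Equation (1) is not an integer, and the text does not say how it is rounded.

```python
def fc_input_features(max_length):
    """Input features of the fully-connected layer, after Equation (1).

    Floor division is an assumption for lengths where (max_length - 12)
    is not a multiple of 4.
    """
    return ((max_length - 12) // 4) * 140


if __name__ == "__main__":
    for length in (40, 50, 60, 70, 80, 90, 97, 100):
        print(length, fc_input_features(length))
```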
Table 2: Most frequently confused gestures in gesture classification using trim/zero padding preprocessing.

Table 4: Most frequently confused gestures in gesture classification using trace-segmentation preprocessing.
Patience \ Max length     40     50     60     70     80     90     97    100
 5                      12.5   12.6   13.2   11.9   11.5   11.1   12.4   11.4
10                      11.4   10.8   11.5   10.5   12.6   11.7   10.7   11.1
15                      11.2   11.3   11.3   11.3   10.6   10.3   10.9   10.0
20                      11.1    9.8   10.7   10.5   10.8   10.2   10.0   11.0

lowest error rate is 8.6%, which is significantly better than the HMM baseline (which did not happen with the other preprocessing techniques). Therefore, the previous result from [9] is improved by about 2% absolute. This result is considered to be very satisfactory.
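Claims of significance here rest on the bootstrap confidence intervals described in Section 4 (10,000 repetitions, about ±0.9). A minimal percentile-bootstrap sketch is given below. It is an illustration only: the 95% level, the percentile method, the per-sample 0/1 error encoding and the function name are all assumptions not stated in the text.

```python
import random


def bootstrap_error_interval(errors, repetitions=10000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for an error rate, in percent.

    errors: list of 0/1 flags, one per test sample (1 = misclassified).
    The confidence level and the percentile method are assumptions.
    """
    rng = random.Random(seed)
    n = len(errors)
    rates = []
    for _ in range(repetitions):
        # Resample the test set with replacement and recompute the error rate.
        resample = [errors[rng.randrange(n)] for _ in range(n)]
        rates.append(100.0 * sum(resample) / n)
    rates.sort()
    lower_idx = int((1 - confidence) / 2 * repetitions)
    upper_idx = int((1 + confidence) / 2 * repetitions) - 1
    return rates[lower_idx], rates[upper_idx]


if __name__ == "__main__":
    # Toy data: 10 errors in 100 samples; fewer repetitions keep the demo fast.
    flags = [1] * 10 + [0] * 90
    print(bootstrap_error_interval(flags, repetitions=1000))
```

Two systems can then be compared by checking whether their intervals overlap, which is presumably the sense in which 8.6% is called significantly better than 10.6 ± 0.9.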
Table 5: Results (classification error rate, %) with interpolation preprocessing. Best result marked with “*”.

Patience \ Max length     40     50     60     70     80     90     97    100
 5                      12.1   11.4   12.3   11.6   12.0   11.3   10.7   12.7
10                      11.1   11.4   11.2   10.6   10.9   10.1    9.9   10.6
15                      10.3    9.5   11.9   10.0   10.3   10.1   10.1    9.2
20                      10.5    9.3   10.3    9.4   11.1   10.0    9.5    9.0
25                       9.6    9.8    9.9    9.5    9.8    9.9   *8.6    9.6
30                      10.0   10.1   10.3    9.9    9.6    9.7    8.9    9.1
35                       9.7   10.0    9.8    9.6    9.8    9.2    8.7    9.0
40                      10.0    9.7   10.0    9.6    9.6    9.2    8.9    9.0
45                      10.0    9.7    9.9    9.4    9.8    9.5    9.2    9.6
50                      10.0    9.7    9.9    9.5    9.6    9.5    9.1    9.0

Figure 2: Pair of confused signs. (a) “hello” sign. (b) “five” sign.

Table 6: Most frequently confused gestures in gesture classification using interpolation preprocessing.

Num of confusions   Reference       Recognition
10                  sign            a lot
10                  eighteen        nineteen
 9                  red             eyes
 9                  he              brother
 8                  good morning    be born
 7                  one             no
 7                  do not know     no
 7                  hello           five

the error rate by 2% to a level of 8.6%. We also present tables containing the most frequently confused gestures for the different preprocessing methods.

As we obtained such a good result, we would like to continue working in this area. We now want to work on sign language sentence recognition, in order to create a system that recognises a signed sentence composed of several signs. For that, we will use Recurrent Neural Networks (RNN), which are commonly used in machine translation and speech recognition. We will also conduct experiments on isolated gesture classification to compare the performance of LeNet and RNN.

6. Acknowledgements

Work partially supported by MINECO under grant DI-15-08169, by Sciling under its R+D programme, and by MINECO/FEDER under project CoMUN-HaT (TIN2015-70924-C2-1-R). The authors would like to thank NVIDIA for their donation of the Titan Xp GPU that allowed us to conduct this research.

7. References

[1] American Speech-Language-Hearing Association, “Guidelines for Fitting and Monitoring FM Systems,” https://bit.ly/2udMuOs, 2002.
[2] M. Kaine-Krolak and M. E. Novak, “An Introduction to Infrared Technology: Applications in the Home, Classroom, Workplace, and Beyond ...” https://bit.ly/2udMuOs, 1995.
[3] National Institute on Deafness and Other Communication Disorders, “Cochlear Implants,” https://bit.ly/27R6aWd, 2016.
[4] National Institute on Deafness and Other Communication Disorders, “Hearing Aids,” https://bit.ly/1UwxUYN, 2013.
[5] S. Jones, “Alerting Devices,” https://bit.ly/2Eri6Y8, 2018.
[6] National Association of the Deaf, “Captioning for Access,” https://bit.ly/2NCpLVi.
[7] The Canadian Hearing Society, “Speech to text transcription (CART Communication Access Realtime Translation),” https://bit.ly/2N2QhFV.
[8] Z. Parcheta and C.-D. Martínez-Hinarejos, “Sign language gesture recognition using HMM,” in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2017, pp. 419–426.
[9] C.-D. Martínez-Hinarejos and Z. Parcheta, “Spanish Sign Language Recognition with Different Topology Hidden Markov Models,” Proc. Interspeech 2017, pp. 3349–3353, 2017.
[10] V. E. Kosmidou, P. C. Petrantonakis, and L. J. Hadjileontiadis, “Enhanced sign language recognition using weighted intrinsic-mode entropy and signer’s level of deafness,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 6, pp. 1531–1543, 2011.
[11] “MYO armband,” https://www.myo.com/.
[12] Z. Zhang, “Microsoft Kinect sensor and its effect,” IEEE Multimedia, vol. 19, no. 2, pp. 4–10, 2012.
[13] L. E. Potter, J. Araullo, and L. Carter, “The leap motion controller: a view on sign language,” in Proceedings of the 25th Australian computer-human interaction conference: augmentation, application, innovation, collaboration. ACM, 2013, pp. 175–178.
[14] D. Kim, O. Hilliges, S. Izadi, A. D. Butler, J. Chen, I. Oikonomidis, and P. Olivier, “Digits: freehand 3d interactions anywhere using a wrist-worn gloveless sensor,” in Proceedings of the 25th annual ACM symposium on User interface software and technology. ACM, 2012, pp. 167–176.
[15] T. Kuroda, Y. Tabata, A. Goto, H. Ikuta, M. Murakami et al., “Consumer price data-glove for sign language recognition,” in Proc. of 5th Intl Conf. Disability, Virtual Reality Assoc. Tech., Oxford, UK, 2004, pp. 253–258.
[16] J. D. Guerrero-Balaguera and W. J. Pérez-Holguín, “FPGA-based translation system from Colombian sign language to text,” Dyna, vol. 82, no. 189, pp. 172–181, 2015.
[17] F.-H. Chou and Y.-C. Su, “An encoding and identification approach for the static sign language recognition,” in Advanced Intelligent Mechatronics (AIM), 2012 IEEE/ASME International Conference on. IEEE, 2012, pp. 885–889.
[18] S. Celebi, A. S. Aydin, T. T. Temiz, and T. Arici, “Gesture Recognition Using Skeleton Data with Weighted Dynamic Time Warping,” in VISAPP, 2013, pp. 620–625.
[19] Y. Zou, J. Xiao, J. Han, K. Wu, Y. Li, and L. M. Ni, “GRfid: A device-free RFID-based gesture recognition system,” IEEE Transactions on Mobile Computing, vol. 16, no. 2, pp. 381–393, 2017.
[20] H. J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, and J. Wan, “Principal motion components for one-shot gesture recognition,” Pattern Analysis and Applications, vol. 20, no. 1, pp. 167–182, 2017.
[21] L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe,
and J. Dambre, “Beyond temporal pooling: Recurrence and tem-
poral convolutions for gesture recognition in video,” International
Journal of Computer Vision, vol. 126, no. 2-4, pp. 430–439, 2018.
[22] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bow-
den, “Neural sign language translation,” CVPR 2018 Proceedings,
2018.
[23] O. Koller, S. Zargaran, and H. Ney, “Re-sign: Re-aligned end-to-
end sequence modelling with deep recurrent cnn-hmms,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
2017.
[24] H. Dwyer, “Deep Learning With Dataiku Data Science Studio,”
https://bit.ly/2mfx8Wd.
[25] E. F. Cabral and G. D. Tattersall, “Trace-segmentation of isolated
utterances for speech recognition,” in 1995 International Confer-
ence on Acoustics, Speech, and Signal Processing, vol. 1, May
1995, pp. 365–368 vol.1.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the
IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[27] M. Bisani and H. Ney, “Bootstrap estimates for confidence inter-
vals in ASR performance evaluation,” in Proc. of ICASSP, vol. 1,
2004, pp. 409–412.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-28
[Figure 2 plot: H(f) [dB] versus frequency [kHz]; curves: realistic, simplified]
Figure 2: Vocal tract transfer function H(f) for the vowel [ɑ] with the realistic and simplified vocal tract geometries.
[Figure 3 plot: waveforms versus time [ms]; curves: Tense (Rd = 0.3), Modal (Rd = 1), Lax (Rd = 2.7)]
Figure 3: Glottal flow u_g(t) and its time derivative u'_g(t) according to the LF model [12].

the realistic geometry. Note, however, that these modes do not appear in the spectrum of the simplified configuration. The radial symmetry of this geometry prevents their onset [5, 6].
Finally, the inverse Discrete Fourier Transform was applied to the vocal tract transfer functions H(f) to obtain the vocal tract impulse responses h(t) of the two geometries (see Fig. 1).

2.3. Voice Source Signal
An LF model [12] was used to produce the voice source signal. This model approximates the glottal flow u_g(t) and its time derivative u'_g(t) in terms of four parameters (Tp, Te, Ta, Ee) that describe its time-domain properties (see Fig. 3). The control of this model can be simplified with the single glottal shape parameter Rd [16]. This is defined as

R_d = \frac{T_d}{T_0}\,\frac{1}{110} = \frac{U_0}{E_e}\,\frac{F_0}{110}, \qquad (4)

where Td is the declination time, T0 the period, and F0 the fundamental frequency. The declination time Td corresponds to the quotient between the glottal flow peak U0 and the negative amplitude of the differentiated glottal flow Ee.
In this work, we used Kawahara’s implementation of the LF model [20], which generates an aliasing-free excitation source signal. We adapted this model to our purposes, modifying the sampling frequency from its original value of 44100 Hz to 24 kHz. Moreover, we introduced the Rd glottal shape parameter. This allows one to easily control the voice source with a single parameter, which runs from Rd = 0.3 for a very adducted phonation to Rd = 2.7 for a very abducted phonation (see [16]). From the Rd range [0.3, 2.7], the two extreme values plus a middle one were chosen. We used Rd = 0.3 to generate a tense phonation, Rd = 2.7 for a lax production, and Rd = 1 for a normal (modal) voice quality. With regard to F0, a pitch curve was obtained from a real sustained vowel lasting 4.4 seconds. This pitch contour was placed around 120 Hz to generate all the source signals. Figure 4a shows four periods of the three simulated voice source waveforms. Moreover, the LTAS of the glottal source signals are represented in Fig. 4b. As observed, the phonation type changes the glottal pulse shape, thus modifying the spectral energy distribution of the source signal.

[Figure 4 plots: (a) Waveform, versus time [ms]; (b) Long-term average spectrum, LTAS [dB] versus frequency [kHz]; curves: Tense (Rd = 0.3), Modal (Rd = 1), Lax (Rd = 2.7)]
Figure 4: Glottal source for a tense (Rd = 0.3), a modal (Rd = 1), and a lax (Rd = 2.7) phonation.

3. Results
Six versions of vowel [ɑ] (see Fig. 1) have been generated using the three glottal source signals corresponding to a tense, a modal and a lax phonation, and the two impulse responses obtained from the 3D FEM simulations of the realistic and simplified vocal tract geometries. The six synthesised vowels are normalised with the same scaling factor to obtain reasonable sound pressure levels. This factor has been selected so as to produce 70 dB SPL in the realistic geometry with a modal phonation (Rd = 1). The LTAS have then been computed for each audio signal.
Figure 5 shows the obtained LTAS for the six generated vowels. As also seen in the vocal tract transfer functions (see Fig. 2), small differences between geometries are produced for frequencies below 5 kHz, whereas beyond this range higher order modes propagate in the realistic case, thus inducing larger deviations. This behaviour can be observed for the three phonation types. Essentially, the glottal source modifies the overall energy level and also introduces an energy decay in frequency (compare Fig. 2 with Fig. 5). This decay, known as the spectral tilt, strongly depends on the phonation type. The laxer the phonation, the larger the spectral tilt [16]. Furthermore, the voice source also affects the energy balance of the first harmonics (below ∼500 Hz). For instance, the lax phonation has the lowest overall energy values among all phonation types. However, one can see that the first harmonic (close to 120 Hz) has larger amplitude levels than the rest of the spectrum, in contrast to what occurs for the other phonations.
HFE levels have been computed by integrating the power spectral density in the 8 kHz octave band, as in [11]. In addition, the overall energy levels have been calculated following the same procedure but for the whole examined frequency range.
The obtained results are listed in Table 1. Note first that in the realistic case with a modal phonation (Rd = 1) the overall level is 70 dB SPL. Remember that this value was fixed to
[Figure 5 plot: LTAS [dB] versus frequency [kHz] for the Tense (Rd = 0.3), Modal (Rd = 1), and Lax (Rd = 2.7) phonations; curves per vocal tract: realistic, simplified]
Figure 5: Long-term average spectra (LTAS) of the FEM synthesised vowel [ɑ] using the realistic and simplified vocal tract geometries with a tense (Rd = 0.3), a modal (Rd = 1), and a lax (Rd = 2.7) phonation.
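The level computations described in the Results (HFE level from the power spectral density in the 8 kHz octave band, overall level over the whole examined range, and a common scaling factor targeting 70 dB SPL) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the trapezoidal integration, the flat toy PSD, and the use of the standard octave-band edges fc/√2 and fc·√2 are assumptions made for illustration only.

```python
import math

P_REF = 20e-6  # reference pressure for dB SPL, in Pa


def octave_band_edges(fc_hz):
    """Lower and upper edges of an octave band centred at fc_hz (assumed base-2 definition)."""
    return fc_hz / math.sqrt(2.0), fc_hz * math.sqrt(2.0)


def band_level_db(freqs_hz, psd_pa2_per_hz, f_lo, f_hi):
    """Level in dB SPL from trapezoidal integration of a one-sided PSD over [f_lo, f_hi].

    Only frequency bins lying entirely inside the band are counted.
    """
    power = 0.0
    for i in range(len(freqs_hz) - 1):
        f0, f1 = freqs_hz[i], freqs_hz[i + 1]
        if f0 >= f_lo and f1 <= f_hi:
            power += 0.5 * (psd_pa2_per_hz[i] + psd_pa2_per_hz[i + 1]) * (f1 - f0)
    return 10.0 * math.log10(power / P_REF ** 2)


def gain_for_target_level(current_db, target_db=70.0):
    """Linear amplitude factor that moves a signal from current_db to target_db."""
    return 10.0 ** ((target_db - current_db) / 20.0)


# Toy input (hypothetical): a flat PSD from 0 to 12 kHz with 1 Hz resolution.
freqs = [float(f) for f in range(0, 12001)]
psd = [1e-9] * len(freqs)

f_lo, f_hi = octave_band_edges(8000.0)
hfe_db = band_level_db(freqs, psd, f_lo, f_hi)
overall_db = band_level_db(freqs, psd, freqs[0], freqs[-1])
gain = gain_for_target_level(overall_db, 70.0)
```

In this sketch the same `gain` would be applied to every synthesised signal, mirroring the single scaling factor mentioned in the text; with a real signal the PSD would first have to be estimated, which is left out here.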
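As a worked illustration of Eq. (4), the following Python sketch converts between the declination time Td and the glottal shape parameter Rd. It assumes that Td = U0/Ee is expressed in milliseconds and F0 in hertz, which is one reading of the constant 110 and is not stated explicitly in the text; the function names are hypothetical.

```python
def rd_from_td(td_ms, f0_hz):
    """Rd from Eq. (4): Rd = Td * F0 / 110, assuming Td in ms and F0 in Hz."""
    return td_ms * f0_hz / 110.0


def td_ms_from_rd(rd, f0_hz):
    """Inverse of Eq. (4): declination time in ms for a given Rd and F0."""
    return rd * 110.0 / f0_hz


F0_HZ = 120.0  # approximate pitch level used for the source signals in the text

# Declination times implied by the three Rd values used in the paper.
td_by_phonation = {
    label: td_ms_from_rd(rd, F0_HZ)
    for label, rd in (("tense", 0.3), ("modal", 1.0), ("lax", 2.7))
}

for label, td in td_by_phonation.items():
    print(f"{label}: Td = {td:.3f} ms")
```

Under these unit assumptions, a larger Rd corresponds to a longer declination time at a fixed F0, in line with the tense-to-lax ordering described in Section 2.3.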
[2] B. H. Story, “Phrase-level speech simulation with an airway modulation model of speech production,” Comput. Speech Lang., vol. 27, no. 4, pp. 989–1010, 2013.
[3] P. Birkholz, “Modeling consonant-vowel coarticulation for articulatory speech synthesis,” PLoS ONE, vol. 8, no. 4, p. e60603, 2013.
[4] S. Stone, M. Marxen, and P. Birkholz, “Construction and evaluation of a parametric one-dimensional vocal tract model,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 26, no. 8, pp. 1381–1392, 2018.
[5] R. Blandin, M. Arnela, R. Laboissière, X. Pelorson, O. Guasch, A. V. Hirtum, and X. Laval, “Effects of higher order propagation modes in vocal tract like geometries,” The Journal of the Acoustical Society of America, vol. 137, no. 2, pp. 832–8, 2015.
[6] M. Arnela, S. Dabbaghchian, R. Blandin, O. Guasch, O. Engwall, A. Van Hirtum, and X. Pelorson, “Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds,” The Journal of the Acoustical Society of America, vol. 140, no. 3, pp. 1707–1718, 2016.
[7] B. B. Monson, A. J. Lotto, and B. H. Story, “Gender and vocal production mode discrimination using the high frequencies for speech and singing,” Frontiers in Psychology, vol. 5, p. 1239, 2014.
[8] T. Vampola, J. Horáček, and J. G. Švec, “FE modeling of human vocal tract acoustics. Part I: Production of Czech vowels,” Acta Acust. united with Acustica, vol. 94, no. 5, pp. 433–447, 2008.
[9] H. Takemoto, P. Mokhtari, and T. Kitamura, “Acoustic analysis of the vocal tract during vowel production by finite-difference time-domain method,” The Journal of the Acoustical Society of America, vol. 128, no. 6, pp. 3724–3738, 2010.
[10] M. Arnela, R. Blandin, S. Dabbaghchian, O. Guasch, F. Alías, X. Pelorson, A. Van Hirtum, and O. Engwall, “Influence of lips on the production of vowels based on finite element simulations and experiments,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2852–2859, 2016.
[11] B. B. Monson, A. J. Lotto, and S. Ternström, “Detection of high-frequency energy changes in sustained vowels produced by singers,” The Journal of the Acoustical Society of America, vol. 129, no. 4, pp. 2263–2268, 2011.
[12] G. Fant, J. Liljencrants, and Q. Lin, “A four-parameter model of glottal flow,” Speech Transmission Laboratory Quarterly Progress and Status Report (STL-QPSR), vol. 26, no. 4, pp. 1–13, 1985.
[13] T. Murtola, P. Alku, J. Malinen, and A. Geneid, “Parameterization of a computational physical model for glottal flow using inverse filtering and high-speed videoendoscopy,” Speech Communication, vol. 96, pp. 67–80, 2018.
[14] B. D. Erath, M. Zañartu, K. C. Stewart, M. W. Plesniak, D. E. Sommer, and S. D. Peterson, “A review of lumped-element models of voiced speech,” Speech Communication, pp. 667–690, 2013.
[15] A. Murphy, I. Yanushevskaya, A. N. Chasaide, and C. Gobl, “Rd as a Control Parameter to Explore Affective Correlates of the Tense-Lax Continuum,” in INTERSPEECH, 2017, pp. 3916–3920.
[16] G. Fant, “The LF-model revisited. Transformations and frequency domain analysis,” Speech Transmission Laboratory Quarterly Progress and Status Report (STL-QPSR), vol. 36, no. 2-3, pp. 119–156, 1995.
[17] M. Arnela and O. Guasch, “Finite element computation of elliptical vocal tract impedances using the two-microphone transfer function method,” The Journal of the Acoustical Society of America, vol. 133, no. 6, pp. 4197–4209, 2013.
[18] D. Aalto, O. Aaltonen, R.-P. Happonen, P. Jääsaari, A. Kivelä, J. Kuortti, J.-M. Luukinen, J. Malinen, T. Murtola, R. Parkkola, J. Saunavaara, T. Soukka, and M. Vainio, “Large scale data acquisition of simultaneous MRI and speech,” Applied Acoustics, vol. 83, pp. 64–75, 2014.
[19] H. Takemoto, S. Adachi, P. Mokhtari, and T. Kitamura, “Acoustic interaction between the right and left piriform fossae in generating spectral dips,” The Journal of the Acoustical Society of America, vol. 134, no. 4, pp. 2955–2964, 2013.
[20] H. Kawahara, K.-I. Sakakibara, H. Banno, M. Morise, T. Toda, and T. Irino, “A new cosine series antialiasing function and its application to aliasing-free glottal source models for speech and singing synthesis,” in INTERSPEECH, 2017, pp. 1358–1362.
10.21437/IberSPEECH.2018-29

Abstract
Recent advances in real-time magnetic resonance imaging (RT-MRI) for speech studies, providing a considerable increase in time resolution, potentially improve our ability to study the dynamic aspects of speech production. To take advantage of the sheer amount of resulting data, automated methods can be used to select, process, and analyze the data, and our previous work tackled these challenges for a European Portuguese (EP) corpus acquired at 14 frames per second (fps).
Aiming to further explore RT-MRI in the study of the dynamic characteristics of EP sounds, e.g., nasal vowels and diphthongs, we present a novel 50 fps RT-MRI corpus and assess the applicability, in this new context, of our previous proposals for processing and analyzing these data to extract relevant articulatory information. Importantly, at this stage, we were interested in assessing if and to what extent the new data and the proposed methods are able to support and corroborate the articulatory analysis obtained from the previous corpus. Overall, and although this new corpus poses novel challenges, it was possible to process and analyze the 50 fps data. A comparison of the automated analysis performed for the same sounds, for both corpora (i.e., 14 fps and 50 fps), yields similar results, corroborating previous results and demonstrating the envisaged replicability. Moreover, we updated the processing in order to be able to analyze dynamic information and provide first insights on the temporal organization of complex sounds such as nasal vowels and diphthongs.
Index Terms: speech production, dynamic, real-time magnetic resonance, European Portuguese, processing and analysis.

1. Introduction
A major challenge in modern linguistics is to understand how continuous and dynamic speech movements are related to perceptual categories like consonants and vowels. Physiological studies have shown that phonological segments can be defined based on the synchronization of primary articulators in speech (the lips, tongue, soft-palate, jaw, and vocal folds) with each other in time [1] using methods such as Electromagnetic Articulography (EMA) and Magnetic Resonance Imaging (MRI).
Many advances in real-time magnetic resonance imaging (RT-MRI) resolutions (spatial and temporal) have been driven by the need to investigate phonetic and phonological phenomena [2], such as vowel nasalization in Portuguese [3].
In the beginning of the application of RT-MRI to the study of speech production the temporal resolution was quite low, but this application was already successful in providing, for the first time, information about the entire tongue contour instead of a few points, as in EMA, and it significantly updated the information about the velum. The increase in temporal resolution improves the quality of the analysis and allows the description of a wider set of sounds, including faster sounds such as trills and more complex sounds such as diphthongs and nasal vowels. The significant increase in the number of participants is crucial to separate individual from language-specific characteristics. The state of the art, as summarized in [2], currently includes: (1) frame rates that already surpass 100 Hz; (2) databases recorded at high frame rates for several languages (English [4], German [5], French [5]); (3) a wide variety of RT-MRI analysis techniques, which can be grouped into four classes [2]: basis decomposition or matrix factorization techniques at the level of the raw or processed images, pixel or region-of-interest (ROI) based, grid-based, and contour-based; (4) initial studies regarding the dynamics of articulators and gestural timing relationships.
Despite all these developments, particularly in the extraction of information from the images (e.g., [5, 6]), studies going beyond the extraction of contours or articulators are very scarce. A representative example of work facilitating high-level analysis is [7]. As frame rates rise, more work is needed on such frameworks for quantitative systematic analysis, which will be essential to explore the greatly increased number of images.
Several studies used MRI for studying EP, as summarized in Section 2. Compared to the state of the art [2], these studies used a low frame rate, possibly providing insufficient information for an adequate characterization of the investigated sounds. Further limitations are: (1) the scarcity of publications and the missing analysis of some sounds, namely rhotics and diphthongs; (2) the preliminary character of the approaches to other sounds such as nasal vowels, due to the limited frame rate and number of participants.
The main objectives of this paper are the following: (1) add Portuguese to the small set of languages studied using the latest developments in RT-MRI; (2) contribute to a better understanding of the dynamical aspects in the production of speech sounds; (3) assess previous results obtained with lower temporal resolution; (4) profit from the improved temporal resolution to start covering sounds characterized by their dynamic nature which could not be covered in previous studies, such as diphthongs.
Even if essential for these objectives, the acquisition of
the novel data for new classes of sounds and more complex phonetic contexts with higher temporal resolution poses several challenges on how to process and analyze the resulting database. Our team has previously addressed this challenge, for different data [8, 9], by proposing methods to process [6] and analyze [7, 10] the image sequences to extract articulatory data. However, the nature of the novel database raises several new questions, opportunities and challenges, including the applicability of previous analyses.
The remainder of this article is organized as follows: Section 2 provides a short summary of related work in speech production studies using RT-MRI and other techniques, focusing on studies addressing EP; Section 3 presents information regarding a novel RT-MRI database acquired for EP, from corpus description to the methods considered for acquisition, processing and analysis of the image sequences; in Section 4, illustrative results are presented, including novel information regarding diphthong production; finally, the article concludes with a brief summary of the contributions and ideas for future work.

2. Related Work
Nowadays, speech production studies can be supported by a wide range of technologies including imaging modalities (e.g., Ultrasound, MRI) and other instrumental techniques (e.g., EMA [11, 12]). In this context, real-time magnetic resonance imaging has received particular attention due to its non-invasiveness, non-use of ionizing radiation, and the remarkable improvements in spatio-temporal resolution achieved in the last few years [13, 2].
From the first attempts to acquire real-time imaging with MRI to date, the area has evolved considerably, achieving sampling rates close to those obtained with EMA and Ultrasonography. This has been possible by exploiting several technological advancements that involve the use of high field strengths, more powerful gradients, dedicated coils, non-Cartesian K-space trajectories, a high degree of undersampling and more efficient image reconstruction algorithms, allowing for high sampling rates and improved image quality [5, 14].
For European Portuguese (EP), several of these techniques (e.g., EMA and MRI) have been used. These efforts include not only data acquisition but also data analysis. Early studies addressed dynamic aspects of nasal vowel production using EMA [15, 16, 17] and MRI (e.g., [9]).
Internationally, several research groups have been using MRI to gather information for different languages using different approaches. Comprehensive reviews of these studies can be found in [13, 2, 7].
The first MRI study for EP included 2D and 3D data regarding the static configuration of all EP vowels and consonants for one speaker [9]. A deeper study on EP laterals (3D) was conducted later with data from 7 participants [8, 18]. A study of co-articulation resistance in EP was presented in [19]. First results of RT-MRI for EP were presented in 2012 [3]. At this stage, an RT-MRI dataset which included nasal vowels, laterals, taps and trills was acquired with a frame rate of 14 fps. This represented an important first step towards a better characterization of the dynamic aspects involved in the production of these sounds of Portuguese [3]. The configuration of the vocal tract during the production of nasal vs. oral vowels was investigated using RT-MRI in [20]. More recently, RT-MRI was used for studying the temporal coordination of oral articulators and the velum during the production of nasal vowels [21], providing additional support for the delayed coordination of oral and nasal gestures in Portuguese [22, 21].
Despite all these relevant contributions, the sounds of Portuguese are not yet well described: diphthongs and trills are still missing, and nasal vowels need an improved characterization due to the lower frame rate and the small number of participants.
From the perspective of processing and quantitative information extraction, a lot has been achieved recently for EP by exploring, essentially, automatic techniques to deal with the quantity and diversity of the information obtained, both from 3D static [18] and real-time MRI [20, 6, 7, 10]. However, the problem of information extraction for speech production studies and the development of relevant articulation models is far from solved. The use of these imaging technologies poses challenges for extracting articulatory-relevant information that profits from the full range of available data.

3. Methods
Gathering novel insights on the dynamic nature of complex sounds, for EP, taking advantage of recent advances in real-time MRI, involves several steps from corpus definition to articulatory analysis, as described in what follows.

3.1. Corpus
The corpus consists of minimal pairs containing all stressed oral [i, e, ɛ, a, ɔ, o, u] and nasal vowels [ɐ̃, ẽ, ĩ, õ, ũ] in one and two syllable words. Nasal diphthongs /ɐ̃w, ẽj, ɐ̃j/ and the oral counterparts /aw, aj, oj/ as well as /ej, ew, iw, ow, uj/ in monosyllabic words were also included. Additional materials were recorded for further modeling of variability in the production of nasality.
All words were randomized and repeated in two prosodic conditions embedded in one of three carrier sentences alternating the verb as follows (Diga ’Say’, ouvi ’I heard’, leio ’I read’) as in ’Diga pote, diga pote baixinho’ (’Say pot, say pot gently’). So far, this corpus has been recorded from twelve native speakers (8m, 4f) of EP. The tokens were presented from a timed slide presentation with blocks of 13 stimuli each. Each stimulus could be seen for 3 seconds and there was a pause of about 60 seconds after each block of 13. The first three participants read 7 blocks for a total of 91 stimuli and the remaining nine participants had 9 blocks of 13 stimuli (total of 117 tokens).

3.2. RT-MRI Acquisition
RT-MRI recordings, as exemplified in Fig. 1, were conducted at the Max Planck Institute for Biophysical Chemistry, Göttingen, Germany, using a 3 Tesla Siemens Magnetom Prisma Fit MRI System equipped with high performance gradients (max. amplitude = 80 mT/m; slew rate = 200 T/m/s).
A standard 64-channel head coil was used with a mirror mounted on top of the coil. The speaker was lying down, in a comfortable position, and was instructed to read the required sentences. Real-time MRI measurements were based on a recently developed method, where highly under-sampled radial FLASH acquisitions are combined with nonlinear inverse reconstruction (NLINV), providing images at high spatial and temporal resolutions [23]. Acquisitions were made at 50 fps, resulting in images such as the ones presented in Fig. 1. Speech was synchronously recorded using an optical microphone (Dual Channel-FOMRI, Optoacoustics, Or Yehuda, Israel), fixed on the head coil, with the protective pop-screen placed directly against the speaker’s mouth.
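To make the practical effect of the frame rate concrete, the following Python sketch relates frames per second to frame period and to the number of images covering a segment. The helper names are hypothetical; the 14 fps and 50 fps rates come from the text, and the 200 ms duration is the approximate diphthong length mentioned in Section 4.1.

```python
def frame_period_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def frames_spanned(duration_ms, fps):
    """Number of frame periods that fit in a segment of the given duration."""
    return duration_ms * fps / 1000.0


def frame_times_ms(n_frames, fps):
    """Acquisition time of each frame, counted from the first frame."""
    return [i * frame_period_ms(fps) for i in range(n_frames)]


for fps in (14, 50):
    print(f"{fps} fps: one frame every {frame_period_ms(fps):.1f} ms, "
          f"{frames_spanned(200.0, fps):.1f} frames in a 200 ms segment")
```

At 50 fps a 200 ms segment spans 10 frame periods, whereas at 14 fps it spans fewer than three, which is the kind of difference that motivates the higher acquisition rate for dynamic sounds.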
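The stimulus counts given in Section 3.1 follow directly from the block structure, as the short Python check below shows. The variable names are hypothetical, and the total across all speakers is derived here from the stated per-participant figures rather than reported in the text.

```python
STIMULI_PER_BLOCK = 13

# (number of participants, blocks read by each) as stated in Section 3.1
groups = [(3, 7), (9, 9)]


def stimuli_per_participant(blocks):
    """Stimuli read by one participant who completed the given number of blocks."""
    return blocks * STIMULI_PER_BLOCK


per_participant = [stimuli_per_participant(blocks) for _, blocks in groups]
total_speakers = sum(n for n, _ in groups)
total_tokens = sum(n * stimuli_per_participant(blocks) for n, blocks in groups)

print(per_participant, total_speakers, total_tokens)
```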
[Figure 1 frames, labelled: d, i, g, ɐ, u]
Figure 1: Selected frames for multiple sounds occurring as in ’Diga vu’ (Say ’vu’).
end of the vowel and a possible consequence of coarticulation effects. For instance, the influence of the carrier sentence’s second ’diga’ (say), following the token, as in ’Diga vem, diga . . . ’ (Say come, say . . . ).

4.1. EP Diphthongs
Given the exploratory nature of this first effort regarding diphthongs, and to get a first grasp of what is happening, we started with oral diphthongs. As an example, Fig. 4 shows results for [aw] as in ’pau’ (stick). The diphthong is covered by 10 images (10 points in the graph), including the initial [p] and the diphthong (around 200 ms). Note that these representations only present lines, in the graph, for those regions where, along production, changes fall into the yellow and red stripes. There is a clear abrupt change in lip aperture (LA) at 20-30 % of diphthong duration, from [p] to [a], and a reduction, close to the end, for the [w].

Figure 4: Variation, over time, of the articulators during the production of the EP oral diphthong [aw], as in ’pau’ (stick).

The nasal counterpart of this diphthong, [ɐ̃w̃], as in ’pão’ (bread), is analyzed in Fig. 5, showing a gradual variation of lip aperture (LA), similar to the one observed in ’pau’. Additionally, there is movement of the tongue back, as opposed to the oral counterpart, probably due to the need for adjustment in the nasal passage, which is hinted at by the changes also noted at the velum.

Figure 5: Variation, over time, of the articulators during the production of the EP nasal diphthong [ɐ̃w̃], as in ’pão’ (bread).

Our corpus and methods also enable investigating the complex context of a diphthong after a nasal consonant, for instance, in comparison with other diphthongs. In Fig. 6, we compare the production of ’mão’ ([mɐ̃w̃], ’hand’) with ’pão’ ([pɐ̃w̃], ’bread’), showing that, as expected, the velum behaves

5. Conclusion
The major contribution of this paper is the presentation of a novel RT-MRI database for EP recorded with a frame rate of 50 Hz, adding to the very small number of languages with such a valuable resource for speech production studies. This database, after its completion and pre-processing, will be partially made available to other researchers. Additionally, we demonstrate the applicability of previously proposed methods for segmentation and analysis, by illustrating previous findings for oral and nasal vowels and performing a first exploration of EP diphthongs. The application of this methodology to more data and speakers will enable a detailed description of nasal sounds in European Portuguese and a better understanding of their implementation in production.
The work presented here can still profit from several improvements and provides the grounds for exploring new routes of speech production studies in EP. Even though the image quality is better than in our previous 14 fps corpus, the different nature of the corpus, with a large number of dental, alveolar and palatal contacts (in the support words and sentences, e.g., [t] before [ɐ̃w̃] as in ’sotão’, attic), poses new challenges to vocal tract segmentation, with a few segmentations still requiring a final manual revision, an aspect to improve as new speakers are included, by training better models [6].
Finally, now that the grounds for work have been established, future developments should be driven by addressing concrete hypotheses regarding EP nasals, such as the one of delayed coordination of oral and nasal gestures in Portuguese [27, 21, 22].

6. Acknowledgements
This work is partially funded by the project ’Synchrone Variabilität und Lautwandel im Europäischen Portugiesisch’, with funds from the German Federal Ministry of Education and Research, by IEETA Research Unit funding (UID/CEC/00127/2013), by Portugal 2020 under the Competitiveness and Internationalization Operational Program, and the European Regional Development Fund through project SOCA – Smart Open Campus (CENTRO-01-0145-FEDER-000010) and project MEMNON (POCI-01-0145-FEDER-028976). We thank Philip Hoole for the scripts for noise suppression and all the participants of the experiment for their time and voice.
7. References
[1] L. Goldstein and M. Pouplier, “The temporal organization of speech,” The Oxford handbook of language production, p. 210, 2014.
[2] V. Ramanarayanan, S. Tilsen, M. Proctor, J. Töger, L. Goldstein, K. S. Nayak, and S. Narayanan, “Analysis of speech production real-time MRI,” Computer Speech & Language, 2018.
[3] A. Teixeira, P. Martins, C. Oliveira, C. Ferreira, A. Silva, and R. Shosted, “Real-time MRI for portuguese,” in International Conference on Computational Processing of the Portuguese Language. Springer, 2012, pp. 306–317.
[4] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
[5] M. Labrunie, P. Badin, D. Voit, A. A. Joseph, J. Frahm, L. Lamalle, C. Vilain, and L.-J. Boë, “Automatic segmentation of speech articulators from real-time midsagittal MRI based on supervised learning,” Speech Communication, vol. 99, pp. 27–46, 2018.
[6] S. Silva and A. Teixeira, “Unsupervised segmentation of the vocal tract from real-time MRI sequences,” Computer Speech and Language, vol. 33, no. 1, pp. 25–46, Sep. 2015.
[7] S. Silva and A. Teixeira, “Quantitative systematic analysis of vocal tract data,” Computer Speech & Language, vol. 36, pp. 307–329, 2016.
[8] A. Teixeira, P. Martins, C. Oliveira, and A. Silva, “Production and modeling of the european portuguese palatal lateral,” in Computational Processing of the Portuguese Language, PROPOR 2012, Lecture Notes in Computer Science/LNAI, Vol. 7243, 2012.
[9] P. Martins, I. Carbone, A. Pinto, A. Silva, and A. Teixeira, “European Portuguese MRI based speech production studies,” Speech Communication, vol. 50, no. 11, pp. 925–952, 2008.
[10] S. Silva and A. J. S. Teixeira, “Critical articulators identification from RT-MRI of the vocal tract,” in Proc. Interspeech, Stockholm, Sweden, 2017, pp. 626–630.
[11] J. S. Perkell, M. H. Cohen, M. A. Svirsky, M. L. Matthies, I. Garabieta, and M. T. Jackson, “Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements,” The Journal of the Acoustical Society of America, vol. 92, no. 6, pp. 3078–3096, 1992.
[12] P. Hoole and N. Nguyen, “Electromagnetic articulography,” Coarticulation–Theory, Data and Techniques, Cambridge Studies in Speech Science and Communication, pp. 260–269, 1999.
[13] A. D. Scott, M. Wylezinska, M. J. Birch, and M. E. Miquel, “Speech MRI: Morphology and function,” Physica Medica, vol. 30, no. 6, pp. 604–618, 2014.
[14] J. Frahm, S. Schätz, M. Untenberger, S. Zhang, D. Voit, K. D. Merboldt, J. M. Sohns, J. Lotz, and M. Uecker, “On the temporal fidelity of nonlinear inverse reconstructions for real-time MRI–the motion challenge,” The Open Medical Imaging Journal, vol. 8, pp. 1–7, 2014.
[15] A. Teixeira and F. Vaz, “European Portuguese Nasal Vowels: An EMMA study,” in 7th European Conference on Speech Communication and Technology, EuroSpeech - Scandinavia, vol. 2. Aalborg, Denmark: CPK/ISCA, Sep. 2001, pp. 1843–1846.
[16] C. Oliveira and A. Teixeira, “On gestures timing in european portuguese nasals,” in ICPhS, 2007, pp. 405–408.
[17] S. Rossato, A. Teixeira, and L. Ferreira, “Les nasales du Portugais et du Français : une étude comparative sur les données EMMA,” in JEP’2006, Rennes, France, 2006.
[18] P. Martins, C. Oliveira, C. Ferreira, A. Silva, and A. Teixeira, “3D MRI and semi-automatic segmentation techniques applied to the study of european portuguese lateral sound,” in International Seminar on Speech Production (ISSP’11), Montreal, Jun. 2011.
[19] A. Teixeira, P. Martins, A. Silva, and C. Oliveira, “An MRI study of consonantal coarticulation resistance in portuguese,” in International Seminar on Speech Production (ISSP’11), Montreal, Canada, Jun. 2011.
[20] S. Silva, A. Teixeira, C. Oliveira, and P. Martins, “Segmentation and analysis of vocal tract from midsagittal real-time mri,” in International Conference Image Analysis and Recognition. Springer, 2013, pp. 459–466.
[21] A. R. Meireles, L. Goldstein, R. Blaylock, and S. Narayanan, “Gestural coordination of brazilian portuguese nasal vowels in CV syllables: A real-time MRI study,” in ICPhS, 2015.
[22] C. Oliveira, “Do grafema ao gesto. contributos linguísticos para um sistema de síntese de base articulatória,” Ph.D. dissertation, Universidade de Aveiro, 2009.
[23] M. Uecker, S. Zhang, D. Voit, A. Karaus, K.-D. Merboldt, and J. Frahm, “Real-time mri at a resolution of 20 ms,” NMR in Biomedicine, vol. 23, no. 8, pp. 986–994, 2010.
[24] P. Boersma, “Praat, a system for doing phonetics by computer,” Glot international, vol. 5, no. 9/10, pp. 341–347, 2001.
[25] P. Boersma and D. Weenink, “Praat: doing phonetics by computer [computer program], version 6.0.40,” 2018, retrieved 11 May 2018 from http://www.praat.org/.
[26] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Computer vision and image understanding, vol. 61, no. 1, pp. 38–59, 1995.
[27] C. Cunha, S. Silva, A. Teixeira, C. Oliveira, P. Martins, A. Joseph, and J. Frahm, “Analysis of nasal vowels and diphthongs in european portuguese,” in Workshop New Developments in Speech Sensing and Imaging (Labphon satellite event), Lisbon, Jun. 2018.
141
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Abstract

Although beamforming is a powerful tool for microphone array speech enhancement, its performance with small arrays, such as the case of a dual-microphone smartphone, is quite limited. The goal of this paper is to study different postfiltering approaches that allow for further noise reduction. These postfilters are applied to our previously proposed extended Kalman filter framework for relative transfer function estimation in the context of minimum variance distortionless response beamforming. We study two different postfilters based on Wiener filtering and non-linear estimation of the speech amplitude. We also propose several estimators of the clean speech power spectral density which exploit the speaker position with respect to the device. The proposals are evaluated when applying speech enhancement on a dual-microphone smartphone in different noisy acoustic environments, in terms of both perceptual quality and speech intelligibility. Experimental results show that our proposals achieve further noise reduction in comparison with other related approaches from the literature.

Index Terms: Postfiltering, Extended Kalman filter, Minimum variance distortionless response, Dual-microphone speech, Smartphone

This work has been supported by the Spanish MINECO/FEDER Project TEC2016-80141-P and the Spanish Ministry of Education through the National Program FPU (grant reference FPU15/04161).

DOI: 10.21437/IberSPEECH.2018-30

1. Introduction

Multi-channel speech processing is widely employed in devices with several microphones to improve the noise reduction performance, yielding better speech quality and/or intelligibility. The most common techniques for multi-channel processing are the beamforming algorithms, which apply a spatial filtering to the existing sound field [1, 2, 3]. Nevertheless, the performance of beamforming can be insufficient if a small number of microphones is considered, as in the case of dual-microphone smartphones [4, 5]. Thus, to improve the noise reduction performance, the enhanced signal at the output of the beamformer might be further processed by using postfiltering methods [6].

Several postfilters have been proposed in recent years, mainly based on multi-channel Wiener filtering, which can be decomposed into minimum variance distortionless response (MVDR) beamforming plus single-channel Wiener filtering [3]. For multi-channel Wiener filtering, Zelinski [7] assumed a spatially-white noise to estimate the speech and noise statistics. Marro et al. [8] further improved the postfilter architecture and also considered acoustic echo and reverberation. The previous spatially-white noise assumption was substituted in [9] by assuming a diffuse noise field. The postfilter presented in [10] took into account the noise reduction provided by the beamformer to obtain more accurate statistics. That work also explores non-linear postfilters based on minimum mean square error (MMSE) estimation of the speech amplitude. Gannot et al. [11] modified the beamformer using a generalized sidelobe canceler (GSC) structure and an optimal modified log-spectral amplitude (OMLSA) estimator [12] as a postfilter. Apart from the above, a statistical analysis of dual-channel postfilters in isotropic noise fields is presented in [13].

In the case of dual-microphone smartphones, other works have exploited the information of the secondary microphone to enhance the speech signal from the reference microphone. For example, Jeub et al. [14] proposed an estimator of the noise power spectral density (PSD) along with a modified single-channel Wiener filter that explicitly exploits the power level difference (PLD) of the speech signal between the microphones in close-talk (CT) conditions (when the loudspeaker of the smartphone is placed at the ear of the user). Nelke et al. [15] developed an alternative noise PSD estimator in far-talk (FT) conditions (when the user holds the device at a distance from her/his face). This method combines a single-channel speech presence probability estimator and the coherence properties of the dual-channel target signal and background noise. The noise PSD is employed to estimate the gain function to be applied on the reference channel. Such an algorithm was extended for multi-microphone devices in [16]. Nevertheless, all of these techniques make assumptions about the noise field properties that may not be accurate in practice, thereby leading to limited performance.

Recently, we proposed an estimator of the relative transfer function (RTF) between microphones based on an extended Kalman filter (eKF) framework [17]. Our method is capable of tracking the RTF evolution using prior information on the channel and noise statistics without making any assumption on the clean speech signal statistics. In that work we evaluated the performance of our estimator over MVDR beamforming applied on a dual-microphone smartphone configuration in CT and FT conditions, showing improvements in estimation accuracy with respect to other relevant methods from the literature. Despite this, the speech enhancement performance is still limited as a result of using beamforming with only two microphones.

In this paper we evaluate the use of postfiltering techniques to overcome shortcomings of our eKF-based MVDR approach for dual-microphone smartphones. We compare different algorithms and modify them to adapt them to our eKF-based method. Also, we propose different clean speech PSD estimators and make use of the available information about the RTF and noise to obtain the needed statistics for postfiltering. Our proposals are evaluated on a dual-microphone smartphone under several noisy acoustic environments in CT and FT conditions, achieving improvements in terms of noise reduction performance in comparison with other state-of-the-art approaches.

The remainder of this paper is organized as follows. In Section 2 we briefly revisit the eKF-based RTF estimation and its application to MVDR beamforming. Section 3 describes the proposed postfiltering approaches and clean speech PSD estimators for CT and FT conditions. Then, in Section 4 the experimental framework is presented along with our perceptual quality and speech intelligibility results. Finally, conclusions are summarized in Section 5.

2. Beamforming for dual-microphone smartphones

Before presenting the postfiltering approaches, it is worthwhile to review the RTF estimation and beamforming for dual-microphone smartphones that we proposed in [17]. First, we introduce the formulation proposed for the eKF-based RTF estimation. Next, we describe the MVDR beamforming approach for processing the dual-channel noisy speech signals using the noise statistics and RTF estimations.

2.1. Extended Kalman filter-based RTF estimation

Let us consider the following additive distortion model for the noisy speech signal in the short-time Fourier transform (STFT) domain,

$$Y_m(f,t) = X_m(f,t) + N_m(f,t), \qquad (1)$$

where $Y_m(f,t)$, $X_m(f,t)$ and $N_m(f,t)$ represent, respectively, noisy speech, clean speech and noise STFT coefficients at the m-th microphone ($m = 1, 2$), $f$ is the frequency bin and $t$ the frame index. Using the relative transfer function (RTF) $A_{21}(f,t) = X_2(f,t) / X_1(f,t)$ between both microphones, we can write the speech distortion model for the secondary microphone ($m = 2$) in terms of the reference microphone ($m = 1$) as

$$Y_2(f,t) = A_{21}(f,t)\,\big(Y_1(f,t) - N_1(f,t)\big) + N_2(f,t). \qquad (2)$$

We can also rewrite the previous complex variables as vectors stacking their real and imaginary parts, yielding $\mathbf{y}_m^{(t)}$, $\mathbf{a}_{21}^{(t)}$ and $\mathbf{n}_m^{(t)}$ (index $f$ is omitted for clarity). For example, we define the noisy speech vector for the m-th microphone as

$$\mathbf{y}_m^{(t)} = \big[\mathrm{Re}(Y_m(t)),\ \mathrm{Im}(Y_m(t))\big]^{\top}, \qquad (3)$$

where $[\cdot]^{\top}$ indicates transpose. Then, we set a dynamic model for $\mathbf{a}_{21}^{(t)}$ as follows,

$$\mathbf{a}_{21}^{(t)} = \mathbf{a}_{21}^{(t-1)} + \mathbf{w}^{(t)}, \qquad (4)$$

where $\mathbf{w}^{(t)}$ models the variability of the RTF between consecutive frames. Also, we redefine (2), using the previous vector notation, as

$$\mathbf{y}_2^{(t)} = \mathbf{h}\big(\mathbf{a}_{21}^{(t)}, \mathbf{n}_1^{(t)}; \mathbf{y}_1^{(t)}\big) + \mathbf{n}_2^{(t)} = \Big[\mathbf{C}\big(\mathbf{y}_1^{(t)} - \mathbf{n}_1^{(t)}\big),\ \mathbf{D}\big(\mathbf{y}_1^{(t)} - \mathbf{n}_1^{(t)}\big)\Big]\,\mathbf{a}_{21}^{(t)} + \mathbf{n}_2^{(t)}, \qquad (5)$$

where $\mathbf{C} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and $\mathbf{D} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$.

Assuming multivariate Gaussian variables and using the models in (4) and (5), in [17] we proposed an MMSE estimator of $\mathbf{a}_{21}^{(t)}$ using an extended Kalman filter (eKF) framework, which tracks the RTF using the observable noisy speech, estimated noise statistics and a priori information on the RTF statistics [17].

2.2. MVDR beamforming

Once the RTF is estimated, both noisy signals are combined using MVDR beamforming, whose weights can be expressed (omitting indices $f$ and $t$ for clarity) as [3]

$$\mathbf{F} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\mathbf{d}}{\mathbf{d}^{H}\boldsymbol{\Phi}_{nn}^{-1}\mathbf{d}}, \qquad (6)$$

where $(\cdot)^{H}$ indicates Hermitian transpose, $\mathbf{d} = [1, A_{21}]^{\top}$ is the steering vector and $\boldsymbol{\Phi}_{nn}$ is a noise spatial correlation matrix (i.e., obtained from $\mathbf{N} = [N_1, N_2]^{\top}$). We can also express the PSD of the noise at the output of the beamformer as [10]

$$\phi_v = \big(\mathbf{d}^{H}\boldsymbol{\Phi}_{nn}^{-1}\mathbf{d}\big)^{-1}. \qquad (7)$$

Finally, the enhanced signal for the reference microphone is estimated as $\hat{X}_{1,\mathrm{mvdr}} = \mathbf{F}^{H}\mathbf{Y}$, with $\mathbf{Y} = [Y_1, Y_2]^{\top}$.

3. Postfiltering for dual-microphone smartphones

The performance of MVDR beamforming when applied on smartphones with only two microphones is quite limited due to the reduced spatial information and the particular placement of the microphones [4]. Therefore, we analyze the use of postfiltering for enhancing the signal at the output of the beamformer and further improving the noise reduction performance. We propose two different postfilters based on Wiener filtering and the optimal modified log-spectral amplitude (OMLSA) estimator [12], and also address the estimation of the clean speech PSD at the reference microphone needed by the postfilters. In addition, the postfiltering gains are further processed using the musical noise reduction algorithm proposed in [18]. This postprocessing is applied to frequencies above 1 kHz. For ease of notation, we drop the indices $f$ and $t$ henceforth.

3.1. Wiener filtering

The multi-channel Wiener filter can be decomposed into an MVDR beamformer followed by a single-channel Wiener filter defined as

$$G_{\mathrm{wf}} = \frac{\xi}{1 + \xi}, \qquad (8)$$

where $\xi = \phi_{x_1} / \phi_v$ is the a priori signal-to-noise ratio (SNR) and $\phi_{x_1}$ is the clean speech PSD at the reference microphone. This Wiener filter is partially obtained from the enhanced signal at the output of the beamformer. Better performance can be achieved if the Wiener filter is fully calculated from the reference noisy signal when an overestimated noise is considered. Thus, we propose the following improved Wiener filter

$$G_{\mathrm{iwf}} = \frac{\hat{\phi}_{x_1}}{\hat{\phi}_{x_1} + \mu\,\hat{\phi}_v}, \qquad (9)$$

where $\hat{\phi}_{x_1}$ is an estimate of the clean speech PSD (discussed in Subsection 3.3), $\hat{\phi}_v$ is an estimate of the noise PSD, taken as $\phi_{n_1}$ (first element of the diagonal of $\boldsymbol{\Phi}_{nn}$) in order to use an overestimated version of the noise, and $\mu$ is a factor which provides an increased overestimation. As a result, the clean speech signal is estimated as $\hat{X}_{1,\mathrm{iwf}} = G_{\mathrm{iwf}}\,\hat{X}_{1,\mathrm{mvdr}}$.
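As a rough illustration of how the MVDR weights (6), the output noise PSD (7) and the improved Wiener gain (9) fit together for a single time-frequency bin, the following is a minimal sketch in plain Python. The function names (`mvdr_weights`, `beamform`, `improved_wiener_gain`) and the numeric values are hypothetical and chosen only for the example; the noise correlation matrix is assumed to be Hermitian and positive definite, and the RTF and PSD estimates are assumed to be given.

```python
def mvdr_weights(phi_nn, a21):
    """Two-microphone MVDR weights for one bin, following (6) and (7).

    phi_nn: 2x2 noise spatial correlation matrix (nested tuples),
            assumed Hermitian and positive definite.
    a21:    complex RTF estimate, so the steering vector is d = [1, a21]^T.
    Returns (weights, phi_v), where phi_v is the beamformer output noise PSD.
    """
    (n11, n12), (n21, n22) = phi_nn
    det = n11 * n22 - n12 * n21
    # Closed-form inverse of a 2x2 matrix.
    inv = ((n22 / det, -n12 / det), (-n21 / det, n11 / det))
    d = (1 + 0j, complex(a21))
    inv_d = tuple(inv[i][0] * d[0] + inv[i][1] * d[1] for i in range(2))
    # d^H Phi_nn^{-1} d is real for a Hermitian matrix; keep the real part.
    denom = (d[0].conjugate() * inv_d[0] + d[1].conjugate() * inv_d[1]).real
    weights = tuple(v / denom for v in inv_d)
    return weights, 1.0 / denom


def beamform(weights, y):
    """Beamformer output F^H Y for one bin."""
    return sum(w.conjugate() * y_m for w, y_m in zip(weights, y))


def improved_wiener_gain(phi_x1, phi_n1, mu=4.0):
    """Improved Wiener gain (9), with the noise PSD taken as mu * phi_n1."""
    return phi_x1 / (phi_x1 + mu * phi_n1)


# Hypothetical values for a single bin.
phi_nn = ((2.0, 0.5 + 0.2j), (0.5 - 0.2j, 1.0))
a21 = 0.3 - 0.4j
weights, phi_v = mvdr_weights(phi_nn, a21)
y = (0.8 + 0.1j, 0.2 - 0.3j)
x_mvdr = beamform(weights, y)
x_iwf = improved_wiener_gain(phi_x1=1.0, phi_n1=phi_nn[0][0].real) * x_mvdr
```

Because the weight vector [1, 0] also satisfies the distortionless constraint, the MVDR output noise PSD in (7) can never exceed the reference-microphone noise PSD, which is consistent with describing that PSD as an overestimate of the noise at the beamformer output.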
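Going back to the observation model of Section 2.1, the real-valued form (5) can be checked numerically against the complex model (2). The sketch below is only an illustration under that reading: the helper names (`stack`, `matvec`, `observe_y2`) and the test values are hypothetical, and the matrix D is interpreted as the real-valued counterpart of multiplication by the imaginary unit.

```python
C = ((1.0, 0.0), (0.0, 1.0))   # identity matrix
D = ((0.0, -1.0), (1.0, 0.0))  # rotation by 90 degrees (multiplication by j)


def stack(z):
    """Stack the real and imaginary parts of a complex number, as in (3)."""
    z = complex(z)
    return (z.real, z.imag)


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def observe_y2(a21, y1, n1, n2):
    """Real-valued observation model (5): [C(y1 - n1), D(y1 - n1)] a21 + n2."""
    x1 = tuple(p - q for p, q in zip(y1, n1))
    col_c = matvec(C, x1)  # first column of the 2x2 observation matrix
    col_d = matvec(D, x1)  # second column
    return (col_c[0] * a21[0] + col_d[0] * a21[1] + n2[0],
            col_c[1] * a21[0] + col_d[1] * a21[1] + n2[1])


# Hypothetical complex values for one bin and frame.
A21, Y1, N1, N2 = 0.3 - 0.4j, 1.0 + 2.0j, 0.1 - 0.2j, -0.05 + 0.07j
y2_real = observe_y2(stack(A21), stack(Y1), stack(N1), stack(N2))
y2_complex = stack(A21 * (Y1 - N1) + N2)
```

Both routes give the same pair of real numbers, which shows that (5) is linear in the RTF vector for fixed noise, while the product of the RTF and the noise term is what makes the joint model non-linear.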
3.2. Optimal modified log-spectral amplitude estimator

The OMLSA estimator proposed in [12] computes the postfilter gains as

$$G_{\mathrm{omlsa}} = G_{H_1}^{\,p_{\mathrm{SPP}}}\; G_{H_0}^{\,1 - p_{\mathrm{SPP}}}, \qquad (10)$$

where $p_{\mathrm{SPP}}$ is the speech presence probability, $G_{H_0}$ is a constant gain when speech is absent and $G_{H_1}$ is the gain when speech is present, computed as

$$G_{H_1} = G_{\mathrm{wf}} \exp\!\left( \frac{1}{2} \int_{\frac{\xi}{1+\xi}\gamma}^{\infty} \frac{e^{-t}}{t}\, dt \right), \qquad (11)$$

where $\gamma = |\hat{X}_{1,\mathrm{mvdr}}|^2 / \phi_v$ is the a posteriori SNR and $G_{\mathrm{wf}}$ was defined in (8). We modify this gain by substituting $G_{\mathrm{wf}}$ by $G_{\mathrm{iwf}}$, defined in (9), which yields the improved OMLSA gain $G_{\mathrm{iomlsa}}$. Finally, the clean speech signal is estimated as $\hat{X}_{1,\mathrm{iomlsa}} = G_{\mathrm{iomlsa}}\,\hat{X}_{1,\mathrm{mvdr}}$.

3.3. Clean speech PSD estimators

The previous postfilters require an estimation of the clean speech PSD, $\phi_{x_1}$. We propose two different estimators for CT and FT conditions, respectively, based on the noisy speech and noise statistics and the estimated RTF between microphones. Therefore, these estimators take advantage of the more accurate RTFs obtained by our eKF-based approach.

For close-talk (CT) conditions, the estimator is based on the PLD between microphones [14], which can be computed as

$$\Delta\hat{\phi}_{\mathrm{PLD}} = \max\,(\phi_{y_1} - \phi_{y_2},\ 0), \qquad (12)$$

where $\phi_{y_1}$ and $\phi_{y_2}$ are the noisy speech PSDs at the reference and secondary microphones, respectively. This estimator takes advantage of the more attenuated clean speech component at the secondary microphone. Assuming that the noise PSD is similar at both microphones so that its difference can be neglected compared to $\Delta\hat{\phi}_{\mathrm{PLD}}$, it can be easily shown that the clean speech PSD can be approximated as [14]

$$\hat{\phi}_{x_1}^{(\mathrm{CT})} = \frac{\Delta\hat{\phi}_{\mathrm{PLD}}}{1 - |A_{21}|^2}. \qquad (13)$$

Unlike CT conditions, in far-talk (FT) conditions speech power is similar at both microphones and the previous assumptions are inaccurate (i.e., noise PSD difference cannot be neglected compared to PLD between microphones) [14]. Therefore, a better estimator is obtained by considering the distortionless properties of MVDR beamforming, which imply that the clean speech PSD at the reference microphone is the same as the one at the beamformer output. Thus, we estimate the clean speech PSD at both the beamformer output and the reference microphone as

$$\hat{\phi}_{x_1}^{(\mathrm{FT})} = \mathbf{F}^{H}\,(\boldsymbol{\Phi}_{yy} - \boldsymbol{\Phi}_{nn})\,\mathbf{F}, \qquad (14)$$

where $\boldsymbol{\Phi}_{yy}$ is a noisy speech spatial correlation matrix (i.e., calculated from $\mathbf{Y}$), whose diagonal elements are the noisy speech PSDs $\phi_{y_1}$ and $\phi_{y_2}$. Although the clean speech estimate could be obtained from a direct subtraction of the first diagonal elements of matrices $\boldsymbol{\Phi}_{yy}$ and $\boldsymbol{\Phi}_{nn}$, the estimator defined in (14) has the advantage of using all the channels in the estimation.

4. Experimental evaluation

4.1. Experimental framework

To evaluate the proposed techniques, we simulated dual-channel noisy speech recordings on a Motorola Moto G smartphone. We consider two different modes of use: close-talk (CT) and far-talk (FT). These modes can be easily identified using the proximity sensor included in the smartphone. Clean speech signals were obtained from 18 speakers of the VCTK database [19] downsampled to 16 kHz. We simulated recordings in eight different noisy environments with different reverberations: car, street, babble, mall, bus, cafe, pedestrian street and bus station. The noise signals were added at six different SNR levels from -5 dB to 20 dB. Further details about this database can be found in [17].

For STFT computation, we choose a 25 ms square-root Hann window with 75% overlap. The noisy speech spatial correlation matrix $\boldsymbol{\Phi}_{yy}$ is estimated by a first-order recursive averaging with an averaging constant of 0.9. The noise spatial correlation matrix $\boldsymbol{\Phi}_{nn}$ is estimated by recursive averaging during time-frequency bins where speech is absent. Thus, we compute the speech presence probability $p_{\mathrm{SPP}}$ at each bin by using the Multi-Channel Speech Presence Probability (MC-SPP) noise tracking algorithm proposed in [20]. Finally, we use an overestimation factor $\mu = 4$ and a speech absent gain $G_{H_0} = 0.05$ for postfiltering implementation.

4.2. Results

The two proposed postfilters, Wiener filtering (eKF-WF) and OMLSA estimator (eKF-OMLSA), are evaluated in combination with MVDR beamforming, both using the eKF-based RTF estimator outlined in Subsection 2.1. The obtained results are compared with those achieved by the noisy speech at the reference microphone, and MVDR beamforming with eKF and no postfiltering (eKF-MVDR) [17]. Also, we evaluate two other state-of-the-art enhancement algorithms for dual-microphone smartphones, that is, the PLD-based Wiener filtering for close-talk conditions of [14] and the speech presence probability and coherence-based (SPPC) Wiener filtering for far-talk position of [15]. The musical noise suppressor of [18] is also applied to PLD and SPPC gains.

The resulting enhanced signals are assessed in terms of perceptual quality and speech intelligibility by means of the perceptual evaluation of the speech quality (PESQ) [21] and short-time objective intelligibility (STOI) [22] metrics, respectively. Clean speech at the reference microphone is taken as a reference for these performance metrics. The results for close-talk (CT) and far-talk (FT) conditions are shown in Tables 1 and 2, respectively.

In close-talk conditions, it is shown that the proposed postfilters outperform the other methods in terms of perceptual quality, with both Wiener filtering and OMLSA approaches obtaining similar performance on average. While the Wiener filtering approach achieves better PESQ results at low and medium SNRs, the OMLSA approach yields higher PESQ scores at high SNRs. It can also be seen that PLD is a better choice than eKF-MVDR, but the addition of postfiltering after beamforming leads to better PESQ results. On the other hand, intelligibility scores among the different evaluated techniques are similar, but MVDR beamforming without postfiltering obtains slightly better ones. That means that the superior perceptual quality achieved by postfiltering involves some speech distortion that slightly reduces intelligibility. In general, PLD and Wiener
postfiltering have a similar performance, while OMLSA shows slightly worse results, especially at low SNRs. Thus, eKF-WF seems the preferred strategy for close-talk conditions on average.

Regarding far-talk conditions, likewise, the proposed postfilters obtain the best results in terms of perceptual quality, with Wiener filtering being the best strategy for noise reduction, especially at high SNRs. The SPPC method outperforms MVDR beamforming with no postfiltering, but it does not achieve any improvements compared to eKF-WF and eKF-OMLSA. Moreover, SPPC introduces more speech distortion, yielding a poor performance in terms of speech intelligibility. MVDR beamforming with no postfiltering achieves the best STOI scores, as in CT conditions. On the other hand, the postfiltering approaches obtain similar results on average, although their performance is worse at low SNRs. The comparison of both postfilters indicates that eKF-OMLSA achieves better intelligibility on average and, especially, at low SNRs. To sum up, both eKF-WF and eKF-OMLSA perform similarly, with Wiener filtering achieving best perceptual quality and OMLSA better speech intelligibility in FT conditions.

Table 1: PESQ and STOI scores obtained for noisy and enhanced speech when using different dual-microphone enhancement techniques in close-talk (CT) conditions.

Metric  Method       SNR (dB)
                     -5     0      5      10     15     20     Avg.
PESQ    Noisy        1.10   1.12   1.18   1.31   1.52   1.80   1.34
        PLD          1.15   1.26   1.44   1.67   1.94   2.24   1.62
        eKF-MVDR     1.11   1.16   1.25   1.40   1.62   1.92   1.41
        eKF-WF       1.19   1.30   1.49   1.73   2.02   2.34   1.68
        eKF-OMLSA    1.19   1.29   1.46   1.72   2.03   2.36   1.68
STOI    Noisy        0.53   0.63   0.72   0.79   0.84   0.88   0.73
        PLD          0.54   0.64   0.73   0.80   0.85   0.89   0.74
        eKF-MVDR     0.56   0.65   0.74   0.80   0.85   0.89   0.75
        eKF-WF       0.53   0.63   0.72   0.80   0.85   0.89   0.74
        eKF-OMLSA    0.50   0.59   0.70   0.79   0.85   0.89   0.72

Table 2: PESQ and STOI scores obtained for noisy and enhanced speech when using different dual-microphone enhancement techniques in far-talk (FT) conditions.

Metric  Method       SNR (dB)
                     -5     0      5      10     15     20     Avg.
PESQ    Noisy        1.12   1.13   1.21   1.37   1.61   1.94   1.40
        SPPC         1.18   1.20   1.34   1.55   1.80   2.04   1.52
        eKF-MVDR     1.12   1.17   1.28   1.47   1.74   2.09   1.48
        eKF-WF       1.25   1.33   1.51   1.80   2.17   2.56   1.77
        eKF-OMLSA    1.24   1.34   1.53   1.80   2.14   2.49   1.76
STOI    Noisy        0.52   0.62   0.73   0.81   0.87   0.91   0.74
        SPPC         0.41   0.52   0.63   0.72   0.78   0.82   0.65
        eKF-MVDR     0.53   0.63   0.74   0.82   0.88   0.92   0.75
        eKF-WF       0.44   0.56   0.70   0.81   0.88   0.91   0.72
        eKF-OMLSA    0.49   0.59   0.71   0.81   0.87   0.91   0.73

5. Conclusions

In this paper we have proposed a postfiltering approach to our RTF extended Kalman filter framework for dual-microphone smartphones. Our proposals make use of the more accurate estimated RTFs and noise statistics in order to obtain the gain function for noise reduction. We evaluated two different postfilters based on Wiener filtering and the OMLSA estimator, and also proposed different clean speech PSD estimators for CT and FT conditions in order to compute the needed statistics. The proposed approaches were evaluated in terms of perceptual quality and speech intelligibility when they are used for enhancing noisy speech signals from a dual-microphone smartphone in adverse acoustic environments. Our results show improvements in terms of both perceptual quality and noise reduction of the enhanced signal while low speech distortion is introduced in comparison to a standalone MVDR beamformer. As future work, we will extend this study on postfiltering to general multi-microphone devices through our extended Kalman filter approach.

6. References

[1] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Springer, 2008, vol. 1.
[2] K. Kumatani, J. McDonough, and B. Raj, “Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127–140, 2012.
[3] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
[4] I. Tashev, S. Mihov, T. Gleghorn, and A. Acero, “Sound capture system and spatial filter for small devices,” in Proc. Interspeech, 2008, pp. 435–438.
[5] I. López-Espejo, A. M. Peinado, A. M. Gomez, and J. A. González, “Dual-channel spectral weighting for robust speech recognition in mobile devices,” Digital Signal Processing, vol. 75, pp. 13–24, 2018.
[6] M. Parchami, W. P. Zhu, B. Champagne, and E. Plourde, “Recent developments in speech enhancement in the short-time Fourier transform domain,” IEEE Circuits and Systems Magazine, vol. 16, no. 3, pp. 45–77, 2016.
[7] R. Zelinski, “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms,” in Proc. ICASSP, 1988, pp. 2578–2581.
[8] C. Marro, Y. Mahieux, and K. U. Simmer, “Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 240–259, 1998.
[9] I. A. McCowan and H. Bourlard, “Microphone array post-filter based on noise field coherence,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 709–716, 2003.
[10] S. Lefkimmiatis and P. Maragos, “A generalized estimation approach for linear and nonlinear microphone array post-filters,” Speech Communication, vol. 49, no. 7-8, pp. 657–666, 2007.
[11] S. Gannot and I. Cohen, “Speech enhancement based on the general transfer function GSC and postfiltering,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 6, pp. 561–571, 2004.
[12] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise environments,” Signal Processing, vol. 81, no. 11, pp. 2403–2418, 2001.
[13] C. Zheng, H. Liu, R. Peng, and X. Li, “A statistical analysis of two-channel post-filter estimators in isotropic noise fields,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 2, pp. 336–342, 2013.
[14] M. Jeub, C. Herglotz, C. Nelke, C. Beaugeant, and P. Vary, “Noise reduction for dual-microphone mobile phones exploiting power level differences,” in Proc. ICASSP, 2012, pp. 1693–1696.
[15] C. M. Nelke, C. Beaugeant, and P. Vary, “Dual microphone noise PSD estimation for mobile phones in hands-free position exploiting the coherence and speech presence probability,” in Proc. ICASSP, 2013, pp. 7279–7283.
[16] W. Jin, M. J. Taghizadeh, K. Chen, and W. Xiao, “Multi-channel noise reduction for hands-free voice communication on mobile phones,” in Proc. ICASSP, 2017, pp. 506–510.
[17] J. M. Martín-Doñas, I. López-Espejo, A. M. Gomez, and A. M. Peinado, “An extended Kalman filter for RTF estimation in dual-microphone smartphones,” in Proc. Eusipco, 2018, pp. 2488–2492.
[18] T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement systems,” in Proc. ICASSP, 2009, pp. 4409–4412.
[19] J. Yamagishi. (2012) English multi-speaker corpus for CSTR voice cloning toolkit. [Online]. Available: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
[20] M. Souden, J. Benesty, S. Affes, and J. Chen, “An integrated solution for online multichannel noise tracking and reduction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2159–2169, 2011.
[21] “P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codec,” ITU-T Std. P.862.2, 2007.
[22] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
DOI: 10.21437/IberSPEECH.2018-31
[Figure 1: Structure of the proposed speech/singing voice segmentation system. Stages: GMM-HMM VAD segmentation; (voice) pitch extraction and smoothing; note detection; (note labels) calculation of pitch parameters $P_V$ and $P_N$; singing/speech SVM classifier; outputs: singing, speech.]

outside the diagonal is 0.0001. This way, fast transitions and small discontinuities are removed. To classify the segments, the likelihood of observation provided by each model is calculated using expression (1).

$$P(o \mid s_i) = \sum_{j=1}^{M} w_{ij}\, \mathcal{N}(o \mid \mu_{ij}, \Sigma_{ij}) \qquad (1)$$

where $o$ is the MFCC vector, $w_{ij}$, $\mu_{ij}$, $\Sigma_{ij}$ are the weight, mean and diagonal covariance of the component $j$ of the state $s_i$ and $M$ is the number of Gaussian components.

pose the use of two parameters derived from pitch to do the classification: proportion of voiced segments ($P_V$) and percentage

$$P_N = \frac{N_{NF}}{N_{VF}} \qquad (3)$$

where $N_{VF}$ is the total number of voiced frames, $N_{NF}$ is the total number of frames labelled as a musical note and $N_T$ is the total number of frames, all of them calculated within the segment to be classified. The vector containing these two parameters is classified using an SVM.

Our algorithm to label the pitch curve with musical notes discretises the F0 curve in semitones expressed in cents and then searches for sequences of semitones that fulfil two conditions: to have enough duration and less variation range than a threshold. This algorithm is simpler than state-of-the-art algorithms [16] but our lack of labelled data made us create a method with minimum supervision. First we map the F0 value to cent scale with an offset to make all the possible values of $f_{0c}$ positive according to expression (4).

$$f_{0c} = 1200 \log_2\!\left(\frac{f_0}{f_{ref}}\right) + 5800 \qquad (4)$$

where $f_{ref}$ is 440 Hz, the frequency of A4 note.

To avoid possible instability due to vibrato, we apply a smoothing to the F0 curve. The smoothing consists of calculating the local maxima and minima envelopes and taking the average curve. The obtained smoothed pitch curve is rounded to the closest semitone value to discretise the sequence, as shown in Figure 2.

[Figure 2: F0 in cent scale and its semitone discretization; x-axis: seconds, y-axis: cents.]
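The mapping of expression (4) and the rounding to the closest semitone can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the envelope-based smoothing step is omitted, and ties at exactly half a semitone are assumed to round upwards.

```python
import math

F_REF_HZ = 440.0       # A4, the reference frequency in expression (4)
OFFSET_CENTS = 5800.0  # offset that keeps the cent values positive


def hz_to_cents(f0_hz):
    """Map an F0 value in Hz to the offset cent scale of expression (4)."""
    return 1200.0 * math.log2(f0_hz / F_REF_HZ) + OFFSET_CENTS


def round_to_semitone(cents):
    """Round a cent value to the closest multiple of 100 cents (one semitone).

    Assumption of this sketch: exact ties are rounded upwards.
    """
    return 100 * math.floor(cents / 100.0 + 0.5)


# Hypothetical voiced F0 track in Hz (one value per frame).
f0_track = [440.0, 452.0, 466.16, 220.0]
semitones = [round_to_semitone(hz_to_cents(f0)) for f0 in f0_track]
```

With this scale A4 maps to 5800 cents and each octave spans 1200 cents, so an F0 one octave below A4 maps to 4600 cents.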
search the non-overlapping group of subsequences that fulfil the predefined conditions of minimum length and maximum range in amplitude expressed in equations 5 and 6.

$$\mathrm{Len}(s) \geq L \qquad (5)$$

$$\max(s) - \min(s) \leq R \qquad (6)$$

where $s$ is the semitone subsequence, $R$ is the maximum amplitude range and $L$ is the minimum length.

The algorithm is defined in the next steps:

• Consider the whole pitch curve as a collection of $K$ voiced sequences of contiguous semitones expressed in cents, each one with its own length: $S = \{S_1, S_2, ..., S_K\}$.

• Define $R$ as the maximum allowed variation range and $L$ as the minimum length.

• Search the longest subsequence in each $S_i$ ($1 \leq i \leq K$) that fulfils the conditions.

• Label the longest subsequence found as a musical note. Between the possible semitones in the sequence, the most frequent one is selected as label.

• Split the remaining parts of the original sequence $S_i$ in two new sequences: the subsequence on the left of the note found ($S_{Li}$) and the subsequence on the right of the note found ($S_{Ri}$), as shown in Figure 3.

• If any of the newly generated subsequences fulfils the duration condition, $S_i$ in $S$ is substituted by them and the process begins again.

• When all sequences from $S$ have been analysed the process finishes.

value in Western music [19].

4. Experiments and results

4.1. Datasets

As few publicly available databases exist with speech and monophonic singing, we have used an excerpt of our Bertsolaritza database [20] to train the algorithms and the NUS Sung and Spoken Lyrics Corpus [21] to test them. In the Bertso database we manually labelled 20 audio files from 19 singers with a total duration of 60 minutes and 40 seconds. These audio files contain 32.8 minutes of singing voice and 2.87 minutes of speech. The 20 files were selected to cover the variability of the original Bertsolaritza database, considering recordings from different decades and gender. In the NUS database, each singer has recorded a singing and spoken version of 4 songs. The total number of different songs is 20 and 12 singers have made the recordings, 6 males and 6 females. As each recording contains either speech or singing voice, we used the VAD to obtain the voice segments and labelled them with the type of the recording.

Table 1 shows the distribution of singers and hosts by gender in the Bertso database. In most cases the speakers either sing or act as host, but some hosts give the topic for the improvised verses singing as well. These hosts appear in the recordings both singing and speaking.

Table 1: Number of speakers and singers in Bertso database

         Singer   Host   Singer and host   Total
Female   7        6      2                 15
Male     12       6      2                 20
Total    19       12     4

In the Bertso database the audio files originally were in mp3. Both databases had a 44100 Hz sample rate and have been downsampled to 16000 Hz and converted to Windows PCM files.¹

We analysed the Bertso database and the singing voice and speech segments never appear consecutively, i.e., there is always silence or noise between segments to classify. Therefore, considering the structure of both databases, each segment belongs to only one class. Additionally, we have studied the duration of the segments produced by speakers and singers: singing voice has longer durations than speech (mean duration of 3.69 and 1.51 seconds respectively in Bertso database and 3.87 and 1.92 seconds in NUS database).

The distribution of the proposed classifying features $P_V$ and $P_N$ in the databases can be seen in Figures 4 and 5. Speech is more scattered than singing voice, but taking into account both parameters a good discrimination of both classes can be achieved.

For the experiments, we split the Bertso database in 10 subsets for cross-validation tests. All the partitions considered include different singers in the train and test subsets. The NUS database is classified using the algorithms trained with Bertso database.

[Figure 4: Distribution of the classes in the Bertso database; x-axis: $P_V$; classes: Speech, Singing Voice.]

[Figure 5: Distribution of the classes in NUS database; x-axis: $P_V$, y-axis: $P_N$; classes: Speech, Singing Voice.]

¹ Examples of signals contained in the Bertso database can be accessed at http://bdb.bertsozale.eus/en/web/bertsoa/view/1323t4.
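As an illustration of the note-labelling procedure listed earlier (define R and L, extract the longest subsequence that fulfils the conditions, label it, and analyse the left and right remainders again), the following Python sketch may help. It is only a minimal sketch under our own assumptions: pitch values are given in semitones, the conditions are read as "max minus min at most R" and "length at least L", labels are obtained by rounding to the nearest semitone, and the names `longest_stable_run` and `label_notes` are hypothetical rather than taken from the described system.

```python
from collections import Counter


def longest_stable_run(seq, max_range, min_len):
    """Return (start, end) of the longest contiguous run whose values span at
    most max_range (R) and that has at least min_len (L) samples, else None."""
    best = None
    for start in range(len(seq)):
        lo = hi = seq[start]
        end = start
        for j in range(start, len(seq)):
            lo, hi = min(lo, seq[j]), max(hi, seq[j])
            if hi - lo > max_range:
                break
            end = j + 1
        if end - start >= min_len and (best is None or end - start > best[1] - best[0]):
            best = (start, end)
    return best


def label_notes(sequences, max_range, min_len):
    """Label notes in each pitch sequence S_i (values assumed to be in semitones).

    Returns tuples (sequence index, start, end, semitone label), end exclusive.
    """
    notes = []
    # Work list of (sequence index, offset in the original sequence, samples).
    pending = [(i, 0, list(s)) for i, s in enumerate(sequences)]
    while pending:
        idx, offset, seq = pending.pop()
        run = longest_stable_run(seq, max_range, min_len)
        if run is None:
            continue  # no subsequence fulfils the conditions
        start, end = run
        # The most frequent (rounded) semitone in the run becomes the label.
        label = Counter(round(v) for v in seq[start:end]).most_common(1)[0][0]
        notes.append((idx, offset + start, offset + end, label))
        # Left and right remainders are analysed again if they are long enough.
        for sub_offset, sub in ((offset, seq[:start]), (offset + end, seq[end:])):
            if len(sub) >= min_len:
                pending.append((idx, sub_offset, sub))
    return sorted(notes)


pitch = [60.1, 60.0, 59.9, 60.2, 64.0, 64.1, 63.9, 70.0]
print(label_notes([pitch], max_range=1.0, min_len=3))
```

The exhaustive search is quadratic in the sequence length, which seems acceptable for short voiced segments; a sliding-window search would be a natural refinement for longer inputs.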
4.2. Comparison with other methods
To compare our algorithm with other methods, we have tested them on our Bertso database and on the NUS database. We have selected methods that are suitable to work with segments of different duration, as is the case in the Bertso database. On the one hand, we have trained GMM classifiers with the parameters suggested in [12] (∆F0) and [13] (DFT of F0 distribution). On the other hand, we have also built a GMM classifier based on MFCC parameters. These methods are explained in more detail in the following subsections.

Table 2: Results of the GMM-HMM VAD for different number of Gaussian components

Gaussians   F-Score
2           0.965 +/- 0.005
4           0.967 +/- 0.006
8           0.969 +/- 0.007
16          0.972 +/- 0.008
32          0.973 +/- 0.008
64          0.974 +/- 0.008
7. References
[1] A. Loscos, P. Cano, and J. Bonada, “Low-delay singing voice alignment to text,” in ICMC, 1999.
[2] A. Mesaros and T. Virtanen, “Automatic recognition of lyrics in singing,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, pp. 1–11, 2010.
[3] D. Deutsch, R. Lapidis, and T. Henthorn, “The speech-to-song illusion,” Acoustical Society of America Journal, vol. 124, pp. 2471–2471, 2008.
[4] S. Falk and T. Rathcke, “On the speech-to-song illusion: Evidence from German,” in Speech Prosody 2010-Fifth International Conference, 2010.
[5] S. R. Livingstone, K. Peck, and F. A. Russo, “Acoustic differences in the speaking and singing voice,” Proceedings of Meetings on Acoustics, vol. 19, no. 35080, 2013.
[6] J. Merrill and P. Larrouy-Maestri, “Vocal features of song and speech: Insights from Schoenberg’s Pierrot lunaire,” Frontiers in Psychology, vol. 8:1108, 2017.
[7] W. Chou and L. Gu, “Robust singing detection in speech/music discriminator design,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, vol. 2. IEEE, 2001, pp. 865–868.
[8] B. Schuller, B. J. B. Schmitt, D. Arsić, S. Reiter, M. Lang, and G. Rigoll, “Feature selection and stacking for robust discrimination of speech, monophonic singing, and polyphonic music,” IEEE International Conference on Multimedia and Expo, ICME 2005, vol. 2005, pp. 840–843, 2005.
[9] D. Gerhard, “Pitch-based acoustic feature analysis for the discrimination of speech and monophonic singing,” Canadian Acoustics, vol. 30, no. 3, pp. 152–153, 2002.
[10] B. Schuller, G. Rigoll, and M. Lang, “Discrimination of speech and monophonic singing in continuous audio streams applying multi-layer support vector machines,” 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pp. 1655–1658, 2004. [Online]. Available: http://ieeexplore.ieee.org/document/1394569/
[11] W. H. Tsai and C. H. Ma, “Speech and singing discrimination for audio data indexing,” Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014, pp. 276–280, 2014.
[12] Y. Ohishi, M. Goto, K. Itou, and K. Takeda, “Discrimination between Singing and Speaking Voices,” Interspeech, pp. 1141–1144, 2005.
[13] B. Thompson, “Discrimination between singing and speech in real-world audio,” 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings, pp. 407–412, 2014.
[14] T. Hain and P. C. Woodland, “Segmentation and classification of broadcast news audio,” in Fifth International Conference on Spoken Language Processing, 1998.
[15] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proceedings of the institute of phonetic sciences, vol. 17, no. 1193. Amsterdam, 1993, pp. 97–110.
[16] M. Mauch, C. Cannam, R. Bittner, G. Fazekas, J. Salamon, J. Dai, J. Bello, and S. Dixon, “Computer-aided melody note transcription using the Tony software: Accuracy and efficiency,” 2015.
[17] Y. S. Moon and J. Kem, “Fast normalization-transformed subsequence matching in time-series databases,” vol. E90-D, no. 12, 2007, pp. 2007–2018.
[18] O. K. Kostakis and A. G. Gionis, “Subsequence Search in Event-Interval Sequences,” 2015, pp. 851–854.
[19] A. S. Bregman, Auditory scene analysis: The perceptual organization of sound. MIT press, 1994.
[20] X. Sarasola, E. Navas, D. Tavarez, D. Erro, I. Saratxaga, and I. Hernaez, “A singing voice database in Basque for statistical singing synthesis of bertsolaritza,” in LREC, 2016.
[21] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, “The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,” in 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Oct 2013, pp. 1–9.
[22] A. Savitzky and M. J. E. Golay, “Smoothing and differentiation of data by simplified least squares procedures,” Analytical Chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
[23] J. Sundberg, “The acoustics of the singing voice,” Scientific American, vol. 236, no. 3, pp. 82–91, 1977.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-32
form of one vector per time step t. Hence, the transformation $\mathbb{R}^L \to \mathbb{R}^H$ is applied independently at each time step t as

\[ h_t = \max(0, W x_t + b), \]

where $W \in \mathbb{R}^{H \times L}$, $b \in \mathbb{R}^H$, $x_t \in \mathbb{R}^L$, and $h_t \in \mathbb{R}^H$. After this projection, we have the recurrent core formed by an LSTM layer of size H and an additional LSTM output layer. The MUSA-RNN output is recurrent, as this prompted better results than using dynamic features to smooth cepstral trajectories in time [13].
Based on the Transformer architecture [17], we propose a pseudo-sequential processing network that can leverage distant element interactions within the input linguistic sequence to predict acoustic features. This is similar to what an RNN does, but discarding any recurrent connection. This will allow us to process all input elements in parallel at inference, hence substantially accelerating the acoustic predictions. In our setup, we do not face a sequence-to-sequence problem as stated previously, so we only use a structure like the Transformer encoder, which we call a linguistic-acoustic decoder.
The proposed SALAD architecture begins with the same embedding of linguistic and prosodic features, followed by a positioning encoding system. As we have no recurrent structure, and hence no processing order, this positioning encoding system will allow the upper parts of the network to locate their operating point in time, such that the network will know where it is inside the input sequence [17]. This positioning code $c \in \mathbb{R}^H$ is a combination of harmonic signals of varying frequency:

\[ c_{t,2i} = \sin\left(t / 10000^{2i/H}\right) \]
\[ c_{t,2i+1} = \cos\left(t / 10000^{2i/H}\right) \]

where i represents each dimension within H. At each time step t, we have a unique combination of signals that serves as a time stamp, and we can expect this to generalize better to long sequences than having an incremental counter that marks the position relative to the beginning. Each time stamp $c_t$ is summed to each embedding $h_t$, and this is input to the decoder core.
The decoder core is built with a stack of N blocks, depicted within the dashed blue rectangle in figure 1. These blocks are the same as the ones proposed in the decoder of [17], but we only have self-attention modules to the input, so it looks more like the Transformer encoder. The most salient part of this type of block is the multi-head attention (MHA) layer. This applies h parallel self-attention layers, which can have a more versatile feature extraction than a single attention layer with the possibility of smoothing intra-sequential interactions. After the MHA comes the feed-forward network (FFN), composed of two fully-connected layers. The first layer expands the attended features into a higher dimension $d_{ff}$, and this gets projected again to the embedding dimensionality H. Finally, the output layer is a fully-connected dimension adapter such that it can convert the hidden dimensions H to the desired amount of acoustic outputs, which in our case is 43 as discussed in section 3.2. As stated earlier, we may slightly degrade the quality of predictions with this output topology, as recurrence in the output layer helps to better capture the dynamics of acoustic features. Nonetheless, this can suffice for our objective of having a highly parallelizable and competitive system.

3. Experimental Setup
3.1. Dataset
For the experiments we use utterances of speakers from the TCSTAR project dataset [22]. This corpus includes sentences and paragraphs taken from transcribed parliamentary speech and transcribed broadcast news. The purpose of these text sources is twofold: enrich the vocabulary and facilitate the selection of the sentences to achieve good prosodic and phonetic coverage. For this work, we choose the same male (M1) and female (F1) speakers as in our previous works. These two speakers have the largest amount of data among the available ones. Their amount of data is balanced with approximately the following durations per split for both: 100 minutes for training, 15 minutes for validation, and 15 minutes for test.

3.2. Linguistic and Acoustic Features
The decoder maps linguistic and prosodic features into acoustic ones. This means that we first extract hand-crafted features out of the input textual query. These are extracted in the label format, following our previous work in [20]. We thus have a combination of sparse identifiers in the form of one-hot vectors, binary values, and real values. These include the identity of phonemes within a window of context, part of speech tags, distance from syllables to end of sentence, etc. For more detail we refer to [20] and references therein.
For a textual query of N words, we will obtain M label vectors, M ≥ N, each with 362 dimensions. In order to inject these into the acoustic decoder, we need an extra step though. As mentioned, the MUSA testbed follows the two-stage structure: (1) duration prediction and (2) acoustic prediction with the amount of frames specified in the first stage. Here we are only working with the acoustic mapping, so we enforce the duration with labeled data. For this reason, and similarly to what we did in previous works [14, 21], we replicate the linguistic label vector of each phoneme as many times as dictated by the ground-truth annotated duration, appending two extra dimensions to the 362 existing ones. These two extra dimensions correspond to (1) absolute duration normalized between 0 and 1, given the training data, and (2) relative position of current phoneme inside the absolute duration, also normalized between 0 and 1.
We parameterize the speech with a vocoded representation using Ahocoder [23]. Ahocoder is a harmonic-plus-noise high quality vocoder, which converts each windowed waveform frame into three types of features: (1) mel-frequency cepstral coefficients (MFCCs), (2) log-F0 contour, and (3) voicing frequency (VF). Note that F0 contours have two states: either they follow a continuous envelope for voiced sections of speech, or they are 0, for which the logarithm is undefined. Because of that, Ahocoder encodes this value with $-10^9$, to avoid numerically undefined values. This would result in a cumbersome output distribution to be predicted by a neural net using a quadratic regression loss. Therefore, to smooth the values out and normalize the log-F0 distribution, we linearly interpolate these contours and create an extra acoustic feature, the unvoiced-voiced flag (UV), which is the binary flag indicating the voiced or unvoiced state of the current frame. We will then have an acoustic vector with 40 MFCCs, 1 log-F0, 1 VF, and 1 UV. This equals a total number of 43 features per frame, where each frame window has a stride of 80 samples over the waveform. Real-numbered linguistic features are Z-normalized by computing statistics on the training data. All acoustic output features are normalized to fall within [0, 1].
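The log-F0 post-processing described in Section 3.2 (linear interpolation over unvoiced frames plus a binary UV flag) could be sketched as follows. This is a minimal pure-Python illustration, not the processing code used in the experiments: the function name `interpolate_log_f0` is hypothetical, and the handling of leading and trailing unvoiced frames and of fully unvoiced utterances is our own assumption, since the text does not specify it.

```python
UNVOICED = -1e9  # unvoiced marker as stated in the text (-10^9)


def interpolate_log_f0(log_f0, unvoiced=UNVOICED):
    """Return (interpolated contour, UV flags) for one utterance."""
    uv = [1 if value > unvoiced else 0 for value in log_f0]
    voiced = [i for i, flag in enumerate(uv) if flag]
    if not voiced:
        # Assumption: a fully unvoiced utterance gets a flat zero contour.
        return [0.0] * len(log_f0), uv
    contour = list(log_f0)
    first, last = voiced[0], voiced[-1]
    # Assumption: leading/trailing unvoiced frames copy the nearest voiced value.
    for i in range(first):
        contour[i] = log_f0[first]
    for i in range(last + 1, len(log_f0)):
        contour[i] = log_f0[last]
    # Linear interpolation across every unvoiced gap between voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for k in range(1, b - a):
            contour[a + k] = log_f0[a] + (log_f0[b] - log_f0[a]) * k / (b - a)
    return contour, uv


contour, uv = interpolate_log_f0([UNVOICED, 5.0, UNVOICED, UNVOICED, 5.3, UNVOICED])
print(contour, uv)
```

Keeping the UV flag as a separate output lets the regression target stay smooth while the voicing decision is still recoverable at synthesis time.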
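The replication of phoneme label vectors according to the annotated durations, also described in Section 3.2, might look like the sketch below. It is given under stated assumptions: durations are expressed in frames, the absolute duration is min-max normalized with bounds taken from the training data, and the relative position is computed per frame inside the phoneme. The name `expand_labels` and its parameters are hypothetical.

```python
def expand_labels(labels, durations, dur_min, dur_max):
    """Repeat each phoneme label vector once per frame and append two features.

    labels:    one label vector per phoneme (362 dimensions in the paper)
    durations: ground-truth duration of each phoneme, in frames (assumed unit)
    dur_min, dur_max: duration bounds observed in the training data (assumed)
    """
    frames = []
    for vector, n_frames in zip(labels, durations):
        # (1) absolute duration, normalized to [0, 1]
        dur_norm = (n_frames - dur_min) / (dur_max - dur_min)
        for k in range(n_frames):
            # (2) relative position inside the phoneme, normalized to [0, 1]
            rel_pos = k / (n_frames - 1) if n_frames > 1 else 0.0
            frames.append(list(vector) + [dur_norm, rel_pos])
    return frames


# Toy example with 2-dimensional labels instead of 362-dimensional ones.
for frame in expand_labels([[1, 0], [0, 1]], [3, 1], dur_min=1, dur_max=5):
    print(frame)
```

With 362-dimensional labels, each output frame would have 364 dimensions, matching the two extra dimensions mentioned in the text.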
Figure 1: Transition from RNN/LSTM acoustic model to SALAD. The embedding projections are the same. Positioning encoding
introduces sequential information. The decoder block is stacked N times to form the whole structure replacing the recurrent core.
FFN: Feed-forward Network. MHA: Multi-Head Attention.
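The positioning code mentioned in the caption, defined in Section 2 as sine and cosine signals of varying frequency, can be illustrated with a short sketch. It is a plain-Python illustration that assumes an even hidden size H; the function names `positioning_code` and `add_time_stamp` are hypothetical and do not come from the SALAD implementation.

```python
import math


def positioning_code(t, hidden_size):
    """Time stamp c_t with c[2i] = sin(t / 10000^(2i/H)), c[2i+1] = cos(same)."""
    code = [0.0] * hidden_size
    for i in range(hidden_size // 2):
        angle = t / (10000 ** (2 * i / hidden_size))
        code[2 * i] = math.sin(angle)
        code[2 * i + 1] = math.cos(angle)
    return code


def add_time_stamp(embedding, t):
    """Sum the time stamp c_t to the embedding h_t before the decoder core."""
    code = positioning_code(t, len(embedding))
    return [h + c for h, c in zip(embedding, code)]


print(positioning_code(0, 4))
print(add_time_stamp([1.0, 2.0, 3.0, 4.0], 0))
```

Because the code depends only on t and H, it can be computed for all time steps at once, which is consistent with the parallel processing that motivates the architecture.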
Table 1: Different layer sizes of the different models. Emb: linear embedding layer, and hidden size H for SALAD models in all layers but FFN ones. HidRNN: hidden LSTM layer size. $d_{ff}$: dimension of the feed-forward hidden layer inside the FFN.

Model         Emb   HidRNN   d_ff
Small RNN     128   450      -
Small SALAD   128   -        1024
Big RNN       512   1300     -
Big SALAD     512   -        2048

3.3. Model Details and Training Setup
We have two main structures: the baseline MUSA-RNN and SALAD. The RNN takes the form of an LSTM network for its known advantages of avoiding typical vanilla RNN pitfalls in terms of vanishing memory and bad gradient flows. Each of the two different models has two configurations, small (Small RNN/Small SALAD) and big (Big RNN/Big SALAD). This intends to show the performance difference with regard to speed and distortion between the proposed model and the baseline, but also their variability with respect to their capacity (RNN and SALAD models of the same capacity have an equivalent number of parameters although they have different connection topologies). Figure 1 depicts both models’ structure, where only the size of their layers (LSTM, embedding, MHA, and FFN) changes with the mentioned magnitude. Table 1 summarizes the different layer sizes for both types of models and magnitudes.
Both models have dropout [24] in certain parts of their structure. The RNN models have it after the hidden LSTM layer, whereas the SALAD model has many dropouts in different parts of its submodules, replicating the ones proposed in the original Transformer encoder [17]. The RNN dropout is 0.5, and SALAD has a dropout of 0.1 in its attention components and 0.5 in FFN and after the positioning codes.
Concerning the training setup, all models are trained with batches of 32 sequences of 120 symbols. The training is in a so-called stateful arrangement, such that we carry the sequential state between batches over time (that is, the memory state in the RNN and the position code index in SALAD). To achieve this, we concatenate all the sequences into a very long one and chop it into 32 long pieces. We then use a non-overlapped sliding window of size 120, so that each batch contains a piece per sequence, continuous with the previous batch. This makes the models learn how to deal with sequences longer than 120 outside of training, learning to use a conditioning state different than zero in training. Both models are trained for a maximum of 300 epochs, but they trigger a break by early-stopping with the validation data. The validation criterion for which they stop is the mel cepstral distortion (MCD; discussed in section 4) with a patience of 20 epochs.
Regarding the optimizers, we use Adam [25] for the RNN models, with the default parameters in PyTorch (lr = 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$). For SALAD we use a variant of Adam with adaptive learning rate, already proposed in the Transformer work, called Noam [17]. This optimizer is based on Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and a learning rate scheduled with

\[ lr = H^{-0.5} \cdot \min\left(s^{-0.5},\; s \cdot w^{-1.5}\right) \]

where we have an increasing learning rate for w warmup training batches, and it decreases afterwards, proportionally to the inverse square root of the step number s (number of batches). We use w = 4000 in all experiments. The parameter H is the inner embedding size of SALAD, which is 128 or 512 depending on whether it is the small or big model as noted in table 1. We also tested Adam on the big version of SALAD, but we did not observe any improvement in the results, so we stick to Noam following the original Transformer setup.
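The Noam learning rate schedule given above can be written directly as a small function. This sketch only evaluates the formula and is not tied to any optimizer implementation; the function name `noam_lr` is hypothetical, and we assume the step counter s starts at 1.

```python
def noam_lr(step, hidden_size, warmup=4000):
    """lr = H^-0.5 * min(s^-0.5, s * w^-1.5), with the step s counted from 1."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate grows linearly during warmup and decays as 1/sqrt(s) afterwards.
for s in (1, 1000, 4000, 16000):
    print(s, noam_lr(s, hidden_size=128))
```

Both arguments of the minimum coincide at s = w, so the peak learning rate is $(H \cdot w)^{-0.5}$, which is lower for the big model (H = 512) than for the small one (H = 128).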
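The stateful batching arrangement of Section 3.3 (concatenate all sequences, chop the result into as many pieces as the batch size, then slide a non-overlapped window over the pieces) can be sketched as follows. The generator name `stateful_batches` is hypothetical, and dropping the samples that do not fit evenly into the pieces is our assumption.

```python
def stateful_batches(sequences, batch_size, window):
    """Yield batches whose rows continue the same rows of the previous batch."""
    # Concatenate all the sequences into a single very long one.
    stream = [item for seq in sequences for item in seq]
    # Chop it into batch_size long pieces (assumption: the remainder is dropped).
    piece_len = len(stream) // batch_size
    pieces = [stream[i * piece_len:(i + 1) * piece_len] for i in range(batch_size)]
    # Non-overlapped sliding window: row r of each batch comes from piece r.
    for start in range(0, piece_len, window):
        yield [piece[start:start + window] for piece in pieces]


toy = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
for batch in stateful_batches(toy, batch_size=2, window=3):
    print(batch)
```

In the setup described in the text, `batch_size` would be 32 and `window` 120; the recurrent state (or the position code index) of row r is then carried from one batch to the next.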
Table 2: Male (top) and female (bottom) objective results.
A: voiced/unvoiced accuracy.
4. Results
In order to assess the distortion introduced by both models, we
took three different objective evaluation metrics. First, we have
the MCD measured in decibels, which tells us the amount of
distortion in the prediction of the spectral envelope. Then we
have the root mean squared error (RMSE) of the F0 prediction
in Hertz. And finally, as we introduced the binary flag that spec-
ifies which frames are voiced or unvoiced, we measure the accu-
racy (number of correct hits over total outcomes) of this binary
classification prediction, where classes are balanced by nature.
These metrics follow the same formulations as in our previous
works [14, 20, 21].
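For reference, the three objective metrics could be computed along the following lines. The exact formulations are those of [14, 20, 21] and are not reproduced here, so this sketch relies on commonly used definitions as an assumption: MCD as $(10/\ln 10)\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$ averaged over frames, F0 RMSE over the frames provided, and accuracy as the fraction of matching voiced/unvoiced flags. All function names are hypothetical.

```python
import math


def mcd_db(ref_frames, pred_frames):
    """Mean mel cepstral distortion in dB over aligned frames (assumed formula)."""
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, pred in zip(ref_frames, pred_frames):
        total += scale * math.sqrt(2.0 * sum((r - p) ** 2 for r, p in zip(ref, pred)))
    return total / len(ref_frames)


def f0_rmse_hz(ref_hz, pred_hz):
    """Root mean squared error between two F0 tracks given in Hertz."""
    squared = [(r - p) ** 2 for r, p in zip(ref_hz, pred_hz)]
    return math.sqrt(sum(squared) / len(squared))


def uv_accuracy(ref_flags, pred_flags):
    """Number of correct voiced/unvoiced decisions over total outcomes."""
    hits = sum(1 for r, p in zip(ref_flags, pred_flags) if r == p)
    return hits / len(ref_flags)


print(mcd_db([[1.0, 0.0]], [[0.0, 0.0]]))
print(f0_rmse_hz([100.0, 200.0], [103.0, 196.0]))
print(uv_accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
```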
Table 2 shows the objective results for the systems detailed
in section 3.3 over the two mentioned speakers, M and F. For
both speakers, RNN models perform better than the SALAD
ones in terms of accuracy and error. Even so, the smallest gap, occurring with the biggest SALAD model, is 0.3 dB in the case of the male speaker and 0.1 dB in the case of the female speaker, showing the competitive performance of these non-recurrent structures. On the other hand, Figure 3 depicts the inference speed on CPU for the 4 different models synthesizing different utterance lengths. Each dot in the plot indicates a test file synthesis. After we collected the dots, we used the RANSAC [26] algorithm (Scikit-learn implementation) to fit a linear regression robust to outliers. Each model line shows how latency grows with the generated utterance length, and RNN models have a much higher slope than the SALAD models. In fact, SALAD models remain pretty flat even for files of up to 35 s, having a maximum latency in their linear fit of 5.45 s for the biggest SALAD, whereas even the small RNN is over 60 s. We have to note that these measurements are taken with PyTorch [27] implementations of LSTM and other layers running on a CPU. If we run them on GPU we notice that both systems can work in real time. It is true that SALAD is still faster even on GPU; however, the big gap happens on CPUs, which motivates the use of SALAD when we have more limited resources.
We can also check the pitch prediction deviation, as it is the metric most affected by the model change. We show the test pitch histograms for ground truth, big RNN and big SALAD in figure 2. There we can see that SALAD’s failure is about focusing on the mean and ignoring the variance of the real distribution more than the RNN does. It could be interesting to try some sort of short-memory non-recurrent modules close to the output to alleviate this peaky behavior that makes pitch flatter (and thus less expressive), checking if this is directly related to the removal of the recurrent connection in the output layer. Audio samples are available online as qualitative results at http://veu.talp.cat/saladtts .

Figure 3: Inference time for the four different models with respect to generated waveform length. Both axes are in seconds.

Table 3: Maximum inference latency with RANSAC fit.

Model         Max. latency [s]
Small RNN     63.74
Small SALAD   4.715
Big RNN       64.84
Big SALAD     5.455

5. Conclusions
In this work we present a competitive and fast acoustic model replacement for our MUSA-RNN TTS baseline. The proposal, SALAD, is based on the Transformer network, where self-attention modules build a global reasoning within the sequence of linguistic tokens to come up with the acoustic outcomes. Furthermore, positioning codes ensure the ordered processing in substitution of the ordered injection of features that RNN has intrinsic to its topology. With SALAD, we get on average over an order of magnitude of inference acceleration against the RNN baseline on CPU, so this is a potential fit for applying text-to-speech on embedded devices like mobile handsets. Further work could be devoted to pushing the boundaries of this system to alleviate the observed flatter pitch behavior.

6. Acknowledgements
This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
7. References
[1] H. Zen, “Acoustic modeling in statistical parametric speech synthesis–from HMM to LSTM-RNN,” 2015.
[2] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7962–7966.
[3] H. Lu, S. King, and O. Watts, “Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis,” Proc. ISCA SSW8, pp. 281–285, 2013.
[4] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, “On the training aspects of deep neural network (DNN) for parametric TTS synthesis,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3829–3833.
[5] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, “Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning,” in Proc. INTERSPEECH, 2015, pp. 854–858.
[6] Q. Hu, Y. Stylianou, R. Maia, K. Richmond, J. Yamagishi, and J. Latorre, “An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis,” in Proc. INTERSPEECH, 2014, pp. 780–784.
[7] S. Kang, X. Qian, and H. Meng, “Multi-distribution deep belief network for speech synthesis,” in Proc. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8012–8016.
[8] S. Pascual and A. Bonafonte, “Prosodic break prediction with RNNs,” in Proc. International Conference on Advances in Speech and Language Technologies for Iberian Languages. Springer, 2016, pp. 64–72.
[9] S.-H. Chen, S.-H. Hwang, and Y.-R. Wang, “An RNN-based prosodic information synthesizer for mandarin text-to-speech,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 226–239, 1998.
[10] S. Achanta, T. Godambe, and S. V. Gangashetty, “An investigation of recurrent neural network architectures for statistical parametric speech synthesis,” in Proc. INTERSPEECH, 2015.
[11] Z. Wu and S. King, “Investigating gated recurrent networks for speech synthesis,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5140–5144.
[12] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, “Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks,” in Proc. INTERSPEECH, 2014, pp. 2268–2272.
[13] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4470–4474.
[14] S. Pascual and A. Bonafonte, “Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation,” in Proc. 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 2325–2329.
[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[20] S. Pascual, “Deep learning applied to speech synthesis,” Master’s thesis, Universitat Politècnica de Catalunya, 2016.
[21] S. Pascual and A. Bonafonte Cávez, “Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model,” in Proc. ISCA SSW9. IEEE, 2016, pp. 112–117.
[22] A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, and M.-N. Garcia, “Tc-star: Specifications of language resources and evaluation for speech synthesis,” in Proc. LREC Conf, 2006, pp. 311–314.
[23] D. Erro, I. Sainz, E. Navas, and I. Hernáez, “Improved HNM-based vocoder for statistical synthesizers,” in Proc. INTERSPEECH, 2011, pp. 1809–1812.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[26] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshop on The Future of Gradient-based Machine Learning Software & Techniques (NIPS-Autodiff), 2017.
Abstract
In this document, we describe the mobile application Japañol¹,
a learning tool which supports pronunciation training of Spanish
as a foreign language (L2) at a segmental level. The tool has
been specifically designed to be used by native Japanese speakers,
and is a branch of a previous gamified CAPT tool,
TipTopTalk!. In this case, a predefined cycle of actions related
to exposure, discrimination and production is presented to the
user, always under the minimal-pairs approach to pronunciation
training. It incorporates freely available ASR and TTS and pro-
vides feedback to the user by means of short video tutorials, to
reinforce learning progression.
Index Terms: computer-assisted pronunciation training, speech recognition, human-computer interaction, computational para-linguistics.

1. Introduction
The way we teach and learn foreign languages is adapting to technology. The development of new modes to engage people in learning, such as Computer-Assisted Pronunciation Training (CAPT), Computer-Assisted Language Learning (CALL) and Mobile-Assisted Language Learning (MALL), allows improving linguistic skills anytime and anywhere [1]. Also, these systems can give useful information on how learners perform and improve their pronunciation [2]. While there are many software tools that rely on speech technologies to provide users with L2 pronunciation training in the field of CAPT, Japañol [3] distinctively incorporates a well-designed cycle of all the relevant activities related to pronunciation training: exposure, discrimination, production and mixed mode.
This application represents an evolution of previous serious games [4, 5, 6] designed for pronunciation training of L2 by non-native speakers. All of them rely on the minimal pairs methodology [7] and are within the context of research projects related to the development and testing of software tools and games for foreign language learning (TIN2014-59852-R and VA050G18).
Previous versions were based on the free selection by users of exposure, discrimination and production tasks of, mainly, English or Spanish minimal pairs, in order to get achievements and increase points in leader-boards. With that approach, we were able to assess users’ pronunciation level in an L2 [8]. We have also analyzed how the introduction of corrective feedback [9, 10] increased pronunciation improvement among users after the first stages of use.
While the freedom of movement in game-oriented tools leads users to maximize their score by repeating those tasks they found easy, the continued use of the tool seemed to generate stagnation. Complementarily, learning-oriented tools should focus on users’ difficulties, offering guided and corrective feedback and achieving better pedagogical results in terms of effectiveness and efficiency [11]. This motivated this alternate version of the tool, which provides a fixed cycle of well-known and balanced learning activities, and which has been applied initially in an experiment whose results are presented in this same conference [12].

Figure 1: Components of the CAPT system.

2. System’s description
Figure 1 shows the architecture diagram of components in Japañol. It is a native Android application built from scratch and can be run using low-cost resources available at language laboratories such as computers, tablets, speakers and microphones. It uses Google speech technology, such as the offline Text-To-Speech tool and the Automatic Speech Recognition web service for Android, which offers an N-best list of probable results for each utterance. Japañol keeps a record of chronological events and results of the users with the system in log files. Both audio and log files are stored in a server, through a set of web services, in order to be later analyzed to extract results and conclusions. Lists of minimal pair words are available in a database accessible to Japañol.

3. Using the tool
Most current CAPT systems offer isolated pronunciation or discrimination activities as part of the training exercises. Very few combine these different modes as we do in our learning application. In Japañol we follow a learning methodology based on the Theory, Exposure, Discrimination, Pronunciation and Mixed modes.

1 This work has been partially supported by the Ministerio de Economía y Empresa (MINECO) and the European Regional Development Fund FEDER under project (TIN2014-59852-R) and by Consejería de Educación of Junta de Castilla y León under project (VA050G18).
5. Acknowledgements
Special thanks to the people of the Language Center of Univer-
sidad de Valladolid for their support for the evaluation campaign
with Japanese students. We also thank members of the research
team for the active support: Enrique Cámara, César González
Ferreras and Mario Corrales Astorgano from Universidad de
Valladolid, and Antonio Ríos Mestre and María Machuca Ayuso
from Universitat Autònoma de Barcelona. Special thanks also
to Takuya Kimura, from University of Seisen in Tokyo, who
provided Japanese versions of the screens and conducted the
experiment with Japanese students at his university.
6. References
[1] R. I. Thomson and T. M. Derwing, “The effectiveness of L2 pro-
nunciation instruction: A narrative review,” Applied Linguistics,
vol. 36, no. 3, p. 326, 2014.
[2] W. Li and D. Mollá-Aliod, “Computer processing of oriental languages. Language technology for the knowledge-based economy,” Lecture Notes in Computer Science, vol. 5459, 2009.
[3] Japañol Project. (2017) Eca-simm japañol project. [Online]. Available: https://eca-simm.uva.es/es/proyectos/capt/japanol/
[4] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, and D. Escudero-Mancebo, “Measuring pronunciation improvement in users of CAPT tool TipTopTalk!” Interspeech, pp. 1178–1179, September 2016.
[5] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, “Improving L2 production with a gamified computer-assisted pronunciation training tool, TipTopTalk!” IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop events, pp. 177–186, 2016.
[6] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, “TipTopTalk! mobile application for speech training using minimal pairs and gamification,” IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and IV Iberian SLTech Workshop events, pp. 425–432, 2016.

Figure 2: Standard flow to complete a lesson in Japañol.

The activities are organized as a sequence of lessons, each devoted to a specific segmental pronunciation difficulty associated with a minimal pair. In each lesson, a brief and clear explanation about the problem and valid pronunciations is provided, in the form of audiovisual material. Then, an exposure mode is entered, in which the user can listen to reference realizations of each valid utterance. After exposure, a discrimination mode is faced in which a sequence of 10 pairs of distinct words (part of a minimal pair) is presented while the user is required to select which one corresponds best with the listened utterance (generated using the TTS). A minimum number of 6 success hits has to be obtained in order to proceed to the next step. If not, the user is suggested to return to exposure mode again before facing
a new discrimination challenge for the same lesson. Once the
[7] C. Tejedor-Garcı́a, V. Cardeñoso-Payo, E. Cámara-Arenas,
minimum required number of right answers has been given, or C. González-Ferreras, and D. Escudero-Mancebo, “Playing
after a number of 5 tries to avoid discouraging the user, a pro- around minimal pairs to improve pronunciation training,” IFCASL
nunciation mode is entered. In this mode, the user has to say, in 2015, 2015.
sequence, 10 different words selected from the list associated to [8] C. Tejedor-Garcı́a and D. Escudero-Mancebo, “Uso de pares
the minimal pair in the lesson. The recorded speech is submit- mı́nimos en herramientas para la práctica de la pronunciación del
ted to Google Speech ASR and is accepted as valid only when español como lengua extranjera,” Revista de la Asociación Euro-
the first item of the N-best list provided by the ASR matches the pea de Profesores de Español. El español por el mundo, no. 1, pp.
target word. A minimum number of 6 correct pronunciations is 355–363, 2018.
required to pass. If the attempt fails, the user is recommended [9] D. Escudero-Mancebo, E. Cámara-Arenas, C. Tejedor-Garcı́a,
to return to exposure mode before attempting again. C. González-Ferreras, and V. Cardeñoso-Payo, “Implementation
and test of a serious game based on minimal pairs for pronuncia-
Finally, a mixed mode activity is required for each lesson. tion training,” SLaTE, pp. 125–130, 2015.
In this mode, a sequence of 10 random discrimination and pro- [10] A. Rauber, C. Tejedor-Garcı́a, V. Cardeñoso-Payo, E. Cámara-
duction tasks are presented and a 60% success is again required Arenas, C. González-Ferreras, D. Escudero-Mancebo, and
to proceed. This mode resembles the added difficulty found to A. Rato, “TipTopTalk!: A game to improve the perception and
switch from discrimination to production in a normal conversa- production of L2 sounds,” Abstracts of New Sounds Aarhus, 8th
tion. A diagram showing the sequence of activities to complete International Conference on Second Language Speech, p. 160,
a lesson is shown in Figure 2. 2016.
[11] C. Tejedor-Garcı́a, D. Escudero-Mancebo, C. González-Ferreras,
E. Cámara-Arenas, and V. Cardeñoso-Payo, “Evaluating the Ef-
4. Activities in the demonstration ficiency of Synthetic Voice for Providing Corrective Feedback in
a Pronunciation Training Tool Based on Minimal Pairs,” SLaTE,
2017.
The demonstration will consist on an interactive session show-
ing all different modes in the client application (see 3). People [12] C. Tejedor-Garcı́a, D. Escudero-Mancebo, E. Cámara-Arenas,
C. González-Ferreras, and V. Cardeñoso-Payo, “Improving pro-
will be able to ask for help during the presentation. At the be- nunciation of spanish as a foreign language for l1 japanese speak-
ginning, all attending people can download the application with ers with japañol capt tool,” IberSPEECH 2018: X Jornadas en
a given URL or taking a photo of a QR picture. Once down- Tecnologı́as del Habla and the V Iberian SLTech Workshop, pp.
loaded, the demonstration begins logging into the application iii–lll, 2018.
before entering the menu of lessons.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
model and, when found, produces observations (e.g., battery is draining too
fast); and, finally (3) a probabilistic reasoner that computes (based on
Bayesian inference) a numeric estimation for each metric (e.g. a value of 0.89
can be understood as the probability of being optimal in terms of power
consumption).

Although the RoQME project is focused on the robotics domain, the modeling
tools being developed are designed to be extensible and application domain
agnostic. Further details about the RoQME project can be found in [8, 9].

3. Applications to biometric systems

This section describes some potential applications of RoQME to biometric
systems, in particular, to ASV systems. As stated before, although the RoQME
project is focused on the robotics domain, we believe that it explores an issue
of great relevance for many other software systems.

3.1. Benchmarking

Using performance metrics, such as Equal Error Rate or Detection Cost Function,
to quantify the goodness of the results (either scores or binary decisions) is
an essential instrument to evaluate and contrast different algorithms and
technologies. However, when it comes to the integral quality of a real-world
system, these metrics fail to capture some important aspects, including:

• The interaction of all the different subsystems and their combined effect on
  False Acceptance and False Rejection ratios.
• The effect of usability and accessibility considerations on Failure to
  Acquire, Failure to Enroll, and False Rejection ratios. For example, if the
  instructions on the screen are not clear enough, the user might be unable to
  provide a valid voice sample. In this case, the performance of the system
  would be affected before any sample reaches the verification process.
• The impact of certain system configurations and decisions at
  design/deployment time, such as the thresholds of the different subsystems or
  how many authentication attempts are granted to verify a user. Regarding the
  latter, it is obvious that a system with no limits on the number of failed
  authentication attempts would be less secure than an identical system with
  appropriate limits.

We propose to integrate non-functional properties (i.e., security, usability,
reliability, etc.) into the system quality evaluation. In this sense, RoQME
would allow software engineers to benchmark complex biometric systems, in terms
of their overall performance, given a number of QoS metrics. Basically,
engineers would start by specifying the required non-functional properties
using the high-level modeling language provided by RoQME. After that, the
resulting models will automatically generate the code of a software component.
At runtime, this component will continuously update the QoS metrics associated
with the specified properties. Finally, this information would be considered
together with other performance metrics (e.g. Equal Error Rate) to compare
different systems, configurations, etc.

3.2. Dynamic assessment

Monitoring QoS metrics over time would allow engineers to detect any anomaly or
deviation from the expected behavior and foresee future problems. In this
sense, RoQME could be useful for the early identification of malfunctioning or
undesired behaviors. At runtime, RoQME can provide system administrators with
real-time QoS indicators about the degree of fulfillment in usability, resource
utilization or stability, to name just a few examples. This information can
then be used by the engineers to improve the system or adjust its
configuration.

3.3. Self-adaptation

Speaker verification systems could use the QoS metrics provided by RoQME to
automatically adjust their own configuration (or software) in order to optimize
their performance under changing circumstances. For instance, this approach
could be applied to estimate the voiceprint quality. This would allow the
system to dynamically change its operation, e.g., to ask the user to provide
additional voice samples if his/her voiceprint quality falls. Furthermore,
RoQME QoS metrics could play an important role in an unsupervised strategy for
adapting voiceprints, aimed at gradually improving their quality as the system
is used. This approach would make the system more robust, e.g., against aging
effects.

4. Acknowledgements

RoQME has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement N. 732410, in the form of financial
support to third parties of the RobMoSys project.

5. References

[1] NIST Speaker Recognition Evaluations. [Online]. Available:
    https://www.nist.gov/itl/iad/mig/speaker-recognition
[2] J. González-Rodríguez, “Evaluating automatic speaker recognition systems: An
    overview of the NIST speaker recognition evaluations (1996-2014),” Loquens,
    vol. 1, no. 1, 2014.
[3] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The 2016 speakers in the
    wild speaker recognition evaluation,” in 16th Annual Conference of the
    International Speech Communication Association (INTERSPEECH), 2016.
[4] K. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. A. van Leeuwen,
    H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, H. Li, T. Stafylakis,
    M. J. Alam, A. Swart, and J. Perez, “The reddots data collection for speaker
    recognition,” in 15th Annual Conference of the International Speech
    Communication Association (INTERSPEECH), 2015.
[5] T. Kinnunen, K.-A. Lee, H. Delgado, N. W. D. Evans, M. Todisco,
    M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a detection cost
    function for the tandem assessment of spoofing countermeasures and automatic
    speaker verification,” in ODYSSEY 2018 – The Speaker and Language
    Recognition Workshop, 2018.
[6] RoQME Integrated Technical Project. (2018-2019). RoQME: Dealing with
    non-functional properties through global Robot Quality-of-Service Metrics.
    [Online]. Available: http://robmosys.eu/roqme/
[7] RobMoSys EU H2020 Project. (2017-2020). RobMoSys: Composable Models and
    Software for Robotics Systems - Towards an EU Digital Industrial Platform
    for Robotics. [Online]. Available: http://robmosys.eu
[8] C. Vicente-Chicote, J. F. Inglés-Romero, and J. Martínez, “A component-based
    and model-driven approach to deal with non-functional properties through
    global qos metrics,” in 5th International Workshop on Interplay of
    Model-Driven and Component-Based Software Engineering (ModComp), in
    conjunction with MODELS’18, 2018.
[9] J. F. Inglés-Romero, J. M. Espín, R. Jiménez-Andreu, R. Font, and
    C. Vicente-Chicote, “Towards the use of quality of service metrics in
    reinforcement learning: A robotics example,” in 5th International Workshop
    on Model-driven Robot Software Engineering (MORSE), in conjunction with
    MODELS’18, 2018.
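As an illustration of the remark in Section 3.1 that the number of authentication attempts granted to a user affects security, the following sketch computes the chance that an impostor is accepted at least once within a given number of tries. It assumes independent attempts with the same per-attempt false acceptance rate, which is a simplification introduced here and not a claim of the paper; the function name is hypothetical.

```python
def false_accept_within(far_per_attempt, max_attempts):
    """Probability of at least one false acceptance in max_attempts tries.

    Assumes (for illustration only) that attempts are independent and share
    the same per-attempt false acceptance rate far_per_attempt.
    """
    if not 0.0 <= far_per_attempt <= 1.0:
        raise ValueError("far_per_attempt must be a probability in [0, 1]")
    if max_attempts < 0:
        raise ValueError("max_attempts must be non-negative")
    # Complement of "every attempt is correctly rejected".
    return 1.0 - (1.0 - far_per_attempt) ** max_attempts


if __name__ == "__main__":
    for attempts in (1, 3, 10, 100):
        print(attempts, round(false_accept_within(0.01, attempts), 4))
```

Under these assumptions the effective false acceptance rate grows with every extra attempt and tends to 1 when attempts are unlimited, which is the kind of deployment-time effect that an Equal Error Rate measured per attempt does not show.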
160
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
161
7. References
[1] C. González-Ferreras, D. Escudero-Mancebo, M. Corrales-
Astorgano, L. Aguilar-Cuevas, and V. Flores-Lucas, “Engaging
adolescents with down syndrome in an educational video game,”
International Journal of Human–Computer Interaction, vol. 33,
no. 9, pp. 693–712, 2017.
[2] M. Corrales-Astorgano, D. Escudero-Mancebo, and C. González-
Ferreras, “Acoustic characterization and perceptual analysis of the
relative importance of prosody in speech of people with down syn-
drome,” Speech Communication, vol. 99, pp. 90–100, 2018.
[3] F. Adell, L. Aguilar, M. Corrales-Astorgano, and D. Escudero-
Mancebo, “Proceso de innovación educativa en educación espe-
cial: Enseñanza de la prosodia con fines comunicativos con el
apoyo de un videojuego educativo,” I Congreso Internacional en
Humanidades Digitales, p. in press.
[4] C.-A. Mario, P. MartÃnez-Castilla, D. Escudero-Mancebo,
L. Aguilar, C. González-Ferreras, and V. Cardenoso-Payo, “To-
wards an automatic evaluation of the prosody of people with down
syndrome,” Iberspeech 2018, p. in press.
[5] P. Martı́nez-Castilla and S. Peppé, “Developing a test of prosodic
ability for speakers of iberian spanish,” Speech Communication,
vol. 50, no. 11-12, pp. 900–915, 2008.
[6] S. J. Peppé, P. Martı́nez-Castilla, M. Coene, I. Hesling, I. Moen,
and F. Gibbon, “Assessing prosodic skills in five european lan-
guages: Cross-linguistic differences in typical and atypical pop-
ulations,” International journal of speech-language pathology,
vol. 12, no. 1, pp. 1–7, 2010.
[7] S. Peppé and J. McCann, “Assessing intonation and prosody in chil-
dren with atypical language development: the peps-c test and the
revised version,” Clinical Linguistics & Phonetics, vol. 17, no. 4-5,
pp. 345–354, 2003.
162
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Silent Speech:
Restoring the Power of Speech to People whose Larynx has been Removed

Jose A. Gonzalez¹, Phil D. Green², Damian Murphy³, Amelia Gully³, and James M. Gilbert⁴

¹ University of Malaga, Spain
² University of Sheffield, U.K.
³ University of York, U.K.
⁴ University of Hull, U.K.
[email protected]
DOI: 10.21437/IberSPEECH.2018-33

Abstract

Every year, some 17,500 people in Europe and North America lose the power of
speech after undergoing a laryngectomy, normally as a treatment for throat
cancer. Several research groups have recently demonstrated that it is possible
to restore speech to these people by using machine learning to learn the
transformation from articulator movement to sound. In our project articulator
movement is captured by a technique developed by our collaborators at Hull
University called Permanent Magnet Articulography (PMA), which senses the
changes of magnetic field caused by movements of small magnets attached to the
lips and tongue. This solution, however, requires synchronous PMA-and-audio
recordings for learning the transformation and, hence, it cannot be applied to
people who have already lost their voice. Here we propose to investigate a
variant of this technique in which the PMA data are used to drive an
articulatory synthesiser, which generates speech acoustics by simulating the
airflow through a computational model of the vocal tract. The project goals,
participants, current status, and achievements of the project are discussed
below.

Index Terms: speech restoration, silent speech interfaces, permanent magnet
articulography, articulatory synthesis, magnetic resonance imaging

1. Introduction

A total laryngectomy is a clinical procedure in which the voice box is
surgically removed, most commonly as a treatment for throat cancer. This
procedure not only leaves the subject mute, but is also known to cause social
isolation and feelings of loss of identity, and can lead to clinical depression
[1, 2, 3]. Currently available methods for speaking after a laryngectomy
include the electro-larynx, a hand-held device which produces an unnatural,
electronic voice; oesophageal speech, which is difficult to master; and the
voice prosthesis, which is considered to be the current gold standard, but has
a short lifetime (4 to 8 weeks) due to Candida growth, thus requiring regular
hospital visits for valve replacement [4, 5, 6]. Other available methods such
as the Alternative and Augmentative Communication (AAC) devices [7], where the
user types words and the device synthesises them, are also limited by their
slow manual input and, therefore, are not suitable for anything other than
short conversations.

As an alternative to existing speech restoration methods, we are investigating
a new way to restore speech to those who are unable to speak [8, 9, 10, 11,
12]. The idea is to transform measurements of the lips and tongue movements
obtained using magnetic sensing into audible speech using a speaker-dependent
transformation, implemented by machine learning techniques (typically deep
neural networks [13]). For capturing articulator movement we use Permanent
Magnet Articulography (PMA) [14, 15, 16, 17], a technology developed by our
collaborators at the University of Hull in which small magnets are attached to
the lips and tongue and the magnetic field generated when the articulators move
is captured by sensors close to the mouth (see Fig. 1 for a picture of the PMA
system). The parameters of the transformation for converting articulator
movement into speech are currently estimated from simultaneous recordings of
audio and PMA signals acquired before the person loses her/his voice. Some
audio samples produced by the proposed restoration method can be found at
https://www.jandresgonzalez.com/is2017. As can be heard, the samples are mostly
intelligible and the speaker identity is clearly preserved.

Figure 1: Permanent Magnet Articulography (PMA) system. Upper-left and
lower-left: placement of magnets used to capture the movements of the lips and
tongue. Right: PMA headset consisting of micro-controller, battery and magnetic
sensors for detecting the variations of the magnetic field generated by the
magnets.

A limitation of the above approach for speech restoration is that simultaneous
speech-and-sensor recordings are required for estimating the mapping between
articulator movement and its acoustics. This makes the method unsuitable for
persons who have already lost their voice. The aim of this project is, thus, to
investigate a novel approach that would make simultaneous recordings
unnecessary. The idea is to predict, in real time, the position of the speech
articulators from the PMA signals. This is a non-trivial problem as the
magnetic field arriving at the sensors is a composite of the fields generated
by all the magnets attached to the articulators. From the estimated vocal tract
shapes speech can finally be synthesised by simulating airflow propagation
through the vocal tract using well-known,
established articulatory synthesis methods [18].

The following sections describe the detailed objectives of the project, the
participants, and its current status.

2. Project objectives

As previously mentioned, the goal of this project is to investigate and develop
a new method for speech restoration based on the PMA capturing technique and
machine learning, but without the need for parallel speech and sensor
recordings for training the machine learning techniques. We attempt to do this
by, instead, learning an alternative transformation which will map the
articulatory data captured by the PMA device into a physical model of the vocal
tract (e.g. 1D or 2D representation of the vocal tract). Then, we will be able
to generate audible speech from the estimated vocal tract shapes by using
well-known articulatory synthesis methods.

The detailed objectives of this project are:

• To train a direct transformation from PMA data to vocal tract shapes used by
  the articulatory synthesiser.
• To personalise the synthesiser so that the speech generated sounds like the
  user’s original voice.

With regard to the latter point, we expect to use MRI images of the subject’s
vocal tract to personalise the synthesiser. In this way, the acoustics
generated by the synthesiser will resemble the user’s original voice.

knowledge the support of NVIDIA Corporation with the donation of the Titan X
Pascal GPU used for this research.

6. References

[1] A. Byrne, M. Walsh, M. Farrelly, and K. O’Driscoll, “Depression following
    laryngectomy. A pilot study,” Brit J Psychiat., vol. 163, no. 2,
    pp. 173–176, 1993.
[2] D. S. A. Braz, M. M. Ribas, R. A. Dedivitis, I. N. Nishimoto, and
    A. P. B. Barros, “Quality of life and depression in patients undergoing
    total and partial laryngectomy,” Clinics, vol. 60, no. 2, pp. 135–142, 2005.
[3] H. Danker, D. Wollbrück, S. Singer, M. Fuchs, E. Brähler, and A. Meyer,
    “Social withdrawal after laryngectomy,” Eur Arch Oto-Rhino-L, vol. 267,
    no. 4, pp. 593–600, 2010.
[4] S. R. Ell, A. J. Mitchell, and A. J. Parker, “Microbial colonization of the
    groningen speaking valve and its relationship to valve failure,” Clin
    Otolaryngol Allied Sci., vol. 20, no. 6, pp. 555–556, 1995.
[5] S. R. Ell, “Candida: the cancer of silastic,” J Laryngol Otol, vol. 110,
    no. 03, pp. 240–242, 1996.
[6] J. M. Heaton and A. J. Parker, “Indwelling tracheo-oesophageal voice
    prostheses post-laryngectomy in sheffield, uk: a 6-year review,” Acta
    Otolaryngol., vol. 114, no. 6, pp. 675–678, 1994.
[7] M. Fried-Oken, L. Fox, M. T. Rau, J. Tullman, G. Baker, M. Hindal, N. Wile,
    and J.-S. Lou, “Purposes of AAC device use for persons with ALS as reported
    by caregivers,” Augment Altern Commun., vol. 22, no. 3, pp. 209–221, 2006.
[17] L. A. Cheah, J. Bai, J. A. Gonzalez, J. M. Gilbert, S. R. Ell,
P. D. Green, and R. K. Moore, “Preliminary evaluation of a silent
speech interface based on intra-oral magnetic sensing,” in Proc.
BioDevices, 2016, pp. 108–116.
[18] J. Mullen, D. M. Howard, and D. T. Murphy, “Waveguide physi-
cal modeling of vocal tract acoustics: flexible formant bandwidth
control from increased model dimensionality,” IEEE Trans. Audio
Speech Lang. Process., vol. 14, no. 3, pp. 964–971, 2006.
[19] A. J. Gully, H. Daffern, and D. T. Murphy, “Diphthong synthesis
using the dynamic 3d digital waveguide mesh,” IEEE/ACM Trans-
actions on Audio, Speech, and Language Processing, vol. 26,
no. 2, pp. 243–255, Feb 2018.
[20] A. J. Gully, T. Yoshimura, D. T. Murphy, K. Hashimoto,
Y. Nankaku, and K. Tokuda, “Articulatory text-to-speech
synthesis using the digital waveguide mesh driven by a deep
neural network,” in Proc. Interspeech, F. Lacerda, Ed. ISCA,
2017, pp. 234–238. [Online]. Available:
http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0900.html
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
¹ UPV/EHU
² HUC-Biocruces
[email protected], [email protected]
[email protected], [email protected]
DOI: 10.21437/IberSPEECH.2018-34

Abstract

RESTORE is a project aimed at improving the quality of communication for people
with difficulties producing speech, providing them with tools and alternative
communication services. At the same time, progress will be made in research on
techniques for restoration and rehabilitation of disordered speech. The
ultimate goal of the project is to offer new possibilities in the
rehabilitation and reintegration into society of patients with speech
pathologies, especially those laryngectomised, by designing new intervention
strategies aimed at favouring their communication with the environment and
ultimately increasing their quality of life.

Index Terms: alaryngeal voice, oesophageal speech, speaking aids, voice
rehabilitation, statistical parametric speech synthesis, voice bank

tems are designed to help the user to quickly build messages to be spoken out
loud (with a keyboard or more sophisticated input devices). Commercial AAC
systems usually have a limited choice of synthetic voices: they are usually
high quality voices sounding like a very healthy young person. Nevertheless,
statistics show a reality in which a great majority of the people affected are
elderly people, whose real voice would not match the prosthetic one. Similarly,
there is a lack of children’s voices, and according to the data in Spain there
are 32 700 children between 6 and 15 years affected by this disability.

RESTORE is a project aimed at improving the quality of communication for people
with difficulties producing speech, providing them with tools and alternative
communication services. At the same time, progress will be made in research on
techniques for restoration and rehabilitation of disordered speech. The
following goals were proposed for the project:
2. Tools to support the speech therapist

Total laryngectomy (TL) surgery completely removes the larynx of the patient
while separating the airway from the mouth, nose and oesophagus. Consequently,
patients who undergo a TL cannot produce speech sounds in a conventional manner
because their vocal cords have been removed. The rehabilitation process of a
patient starts immediately upon confirmation of the surgery. Through a
pre-surgical interview an orientation framework is offered both to the patient
and his or her family, where they will receive information about:

• The anatomical and physiological changes that result from surgery
• The way to communicate during the period immediately following surgery
• The speech therapy sessions that will follow surgery

The main objective of the rehabilitation after TL is to return to the patients
the possibility of oral communication for reintegration into their social, work
and personal life. After surgical intervention, in addition to medical
treatment and the importance of tracheostomy protection and care of the
tracheal cannula, the patient will be informed about the therapy process to
acquire alaryngeal voice. There are three possibilities for voice
rehabilitation:

• Oesophageal speech
• Tracheoesophageal speech
• Use of an electrolarynx

Oesophageal speech (ES) is preferred by medical doctors because it does not
require a voice prosthesis, but it is also the most effortful and difficult to
acquire. Tracheoesophageal speech is the most successful method and also
produces the most understandable speech, but requires a voice prosthesis placed
during total laryngectomy or later in a secondary puncture. Finally, the
electrolarynx is an external vibrating handheld device which is placed against
the neck or the face. The vibrating sound is modulated by the movements of the
articulators to produce understandable speech. It produces a robotic voice and
it is sometimes also used as a backup secondary method.

In Cruces University Hospital laryngectomised patients start rehabilitation 2
or 3 weeks after hospital discharge with the aim of learning to produce
oesophageal speech. The patient attends around 50 rehabilitation sessions
during a period of 4 months. If the final speaking method is tracheoesophageal,
the average learning period is only 5 days.

In RESTORE we have developed an interactive video aimed at helping the patient
during and after this rehabilitation period. This video considers the main
difficulties faced by the laryngectomised patient and proposes exercises and
advice to overcome them. Using a comic style representation of a food market,
the main character represents the patient, going through the different market
stalls, in each of which he or she will practice a new rehabilitation exercise.
Video recordings of real sessions with a speech therapist are also included, as
well as short interviews with laryngectomised persons that have succeeded in
the rehabilitation process and share their own feelings and experiences.

Clinical evaluation of the developed tool is currently taking place with
laryngectomised patients.

3. Personalisation of the synthetic voice

One of the goals of the RESTORE project is to improve the already existing
ZureTTS voice bank web portal⁴, making it more flexible and allowing it to
provide a personalised TTS service to people with oral disabilities.

⁴ aholab.ehu.eus/zureTTS

3.1. Voice bank and web portal

In the previous version of the voice bank web portal, each voice donor had to
record 100 sentences to get his or her personalised voice. This is not a
problem for healthy people, but many patients are not able to record such a
long corpus. Therefore the original 100-sentence corpus has been divided into
three corpora of 33, 33 and 34 sentences. Each donor can choose how many
corpora to record and the personalised voice will be produced with the
available speech material. Besides, two new languages have been incorporated
into the portal: Gaelic and the Navarro-Lapurdian dialect of Basque.

3.2. Recording protocol for pre-laryngectomised people

To be able to generate a personalised voice for laryngectomees, the ideal
situation is to have recordings of the patient made prior to the surgery. If
this is not the case, recordings made by close family members can be used to
produce a personalised synthetic voice for the patient, as voices are usually
similar among family members of the same gender [4] [5].

The first step to get these recordings is to introduce the recording procedure
in the hospital protocol. This protocol has established the following criteria
to select patients that can take part in the project:

1. Patients over 18 years old with a scheduled TL.
2. Close family members with voices similar to the patient.
3. Any patient over 18 years old without any speech pathology who comes to the
   otorhinolaryngology service of the Cruces University Hospital for any
   reason.

Once these criteria were fixed, the protocol was sent to the Basque Ethics
Committee for approval.

3.3. Personalisation for dysarthric voices

One person with ALS made use of the ZureTTS portal to obtain a synthetic voice.
However, the voice was already affected by the disease and the resulting
synthetic voice also showed the same problems as the original voice. The main
issue was the rhythm, which was very slow, with very long vowels and very
frequent long pauses, thus confirming other studies [6][7]. Also, some
consonants were poorly realised. The slow prosody caused the automatic
alignment algorithm to malfunction, thus contributing to the low quality of the
synthetic voice. Nevertheless, even when a new specific alignment method
adapted to the multiple pauses was applied, the synthetic voice was not of the
desired quality, mainly because of the long phones. To overcome this, prosody
transplantation from a healthy voice was performed with good results. This
voice was offered to the user, instead of the one automatically provided by the
system. We also tried to alleviate the pronunciation problems by applying the
adaptation techniques using only vowels described in [8][9], but the new
synthetic voice, although with an improved pronunciation, lost the personality
of the speaker. We are also experimenting with model surgery on some phones
[10], but the improvements are subtle.
3.4. Synthetic voices catalogue
If the person with oral disabilities is not able to record the corpus and there is no family member with a similar voice, he or she is still able to get a customised synthetic voice by selecting the preferred one among all the donated voices.
To make the selection of a personalised voice easier, a catalogue with the available voices has been included in the web portal. A subjective evaluation was carried out in which listeners rated all the synthetic voices according to several attributes. These attributes were: white - hoarse voice, sweet - dominant voice, warm - high-pitched voice, clean - nasal voice and monotonous - expressive voice. A two-dimensional representation of the voices using the two most discriminative dimensions (sweet-dominant and white-hoarse) has been integrated in the portal, as shown in Figure 1.
The voices can be easily modified in tone, rhythm, intensity and vocal tract length to get a synthetic voice that pleases the user. The customised voices can be obtained from the web portal in standard format for Android OS, iOS and Windows, so they can be directly used by Augmentative and Alternative Communication Devices.

4. Voice conversion
In the production of oesophageal speech the pharyngo-oesophageal segment is used as a substitutive vibrating element for the vocal folds. Due to the nature of the intervention, the air used to create the vibration of the oesophagus cannot come from the lungs and the trachea as happens during normal speech production. Instead, the air is swallowed from the mouth and introduced in the oesophagus, being then expelled in a controlled way while producing the vibration. These huge differences in the production mechanisms lead to a diminution of naturalness and intelligibility [11][12][13]. As a consequence, the communication with others is hindered. Moreover, these less intelligible voices are an added problem for the automatic speech recognition algorithms that are becoming ubiquitous in human-computer interaction technologies. One of the goals of this project is the development of techniques and algorithms aimed at modifying oesophageal speech in such a way as to improve the performance of a state-of-the-art ASR system with these modified signals as input. To achieve this goal we decided to experiment with voice conversion (VC) algorithms. In this section we summarize the efforts made within the project in this direction.

4.2. Improving the intelligibility of the oesophageal voice
We have tried several strategies to improve the intelligibility of the oesophageal speech. First, we tried a classical GMM based voice conversion, using parallel data. This was followed by DNN based approaches, using LSTM and more recently also including a WaveNet vocoder.
There are several specific problems that must be faced when processing and converting oesophageal signals:
• Oesophageal signals lack the regular periodicity (intonation) typical of laryngeal signals. Although they have a certain periodicity at certain segments, the fundamental frequency is very low (around 80 Hz) and very irregular. Usual F0 calculation algorithms generate many errors and do not result in a realistic measure of the local periodic segments. Thus a specific F0 detection algorithm has been used.
• The rhythm in general, the duration of syllables and the duration of the phones inside them, vary significantly in relation to healthy speech. Additionally, noises and pauses are inserted in between words and syllables, and even inside the syllable. Therefore, the alignment of healthy and oesophageal parallel sentences becomes a tricky task.
• Many phones (mainly corresponding to consonants) in the sound stream are not present in the signal or they are realised in a completely different way. This fact also complicates the parallel alignment task and generates many recognition errors.
• In general, a fundamental frequency curve must be estimated for the converted spectrum (except for the case of using WaveNet). Simple conversion of the source F0 values, as usually done in the VC field, is not feasible, which opens a wide range of research possibilities.
To evaluate the intelligibility of the resulting converted signals we have used a Kaldi-based ASR system [18] trained with material described in [19]. This approach was selected because it allows us to control the exact processing operations followed
during the recognition (such as the use of transformations like fMLLR) as well as basic aspects of the recognition process such as the lexicon and the language model. This turned out to be very important with our reduced set of 100 phonetically rich sentences, containing many very unlikely words, proper names etc. The procedure and results of the different ASR tests are described in [20][21].

5. Discussion and future work
This project represents an effort to promote modern technological advances in the area of speech processing among the group of people with oral disabilities. In particular we have put special emphasis on introducing the benefits of the advances in speech synthesis in the rehabilitation process of laryngectomised people.
In relation to the intelligibility of oesophageal speech, we plan to improve the techniques to evaluate not only human intelligibility but also the effort employed by the listener (’listening effort’).
At the time of writing this paper, only two laryngectomised persons have been recorded for the voice bank previous to TL. It must be taken into account that the elapsed time between the communication of the need to do the TL surgery and the surgery itself is very short (usually a few days), so the medical team does not have the necessary time to explain to the patient the future benefits of making the recordings. Additionally, these patients already have a very harsh voice and speak with difficulty (this is usually the reason why they have contacted the unit). This is why we consider the catalogue of synthetic voices, possibly including voices of the patient’s relatives, a very good alternative. We hope that this voice bank pilot experiment will continue growing and expanding the service to other hospitals after the end of the present project.

6. Acknowledgements
This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).

7. References
[1] J. P. Rodrigo, F. López, J. L. Llorente, C. Álvarez-Marcos, and C. E. Suarez, “Results of total laryngectomy as treatment for locally advanced laryngeal cancer in the organ-preservation era.” Acta otorrinolaringologica espanola, vol. 66, no. 3, pp. 132–8, 2015.
[2] S. Sociedad Española de Oncología Médica, “Las cifras del cáncer en España 2014,” 2014.
[3] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Nov 2013, pp. 1–4.
[4] H. S. Feiser and F. Kleber, “Voice similarity among brothers: evidence from a perception experiment,” in Proc. 21st Annual Conference of the International Association for Forensic Phonetics and Acoustics (IAFPA), 2012.
[5] I.-S. Ahn and M.-J. Bae, “On a similarity analysis to family voice,” Advanced Science Letters, vol. 24, no. 1, pp. 744–746, 2018.
[6] W. G, J. JY, L. JS, K. RD, and K. JF, “Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders,” Folia Phoniatrica et Logopaedica, vol. 53, no. 1, pp. 1–18, 2001.
[7] J. R. Green, Y. Yunusova, M. S. Kuruvilla, J. Wang, G. L. Pattee, L. Synhorst, L. Zinman, and J. D. Berry, “Bulbar and speech motor assessment in als: Challenges and future directions,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 14, no. 7-8, pp. 494–500, 2013.
[8] A. Alonso, D. Erro, E. Navas, and I. Hernaez, “Speaker Adaptation using only Vocalic Segments via Frequency Warping,” in INTERSPEECH 2015, 2015, pp. 2764–2768.
[9] A. Alonso, D. Erro, E. Navas, and I. Hernaez, “Study of the effect of reducing training data in speech synthesis adaptation based on frequency warping,” in Advances in Speech and Language Technologies for Iberian Languages, A. Abad, A. Ortega, A. Teixeira, C. García Mateo, C. D. Martínez Hinarejos, F. Perdigão, F. Batista, and N. Mamede, Eds. Cham: Springer International Publishing, 2016, pp. 3–13.
[10] A. Pierard, D. Erro, I. Hernaez, E. Navas, and T. Dutoit, “Surgery of speech synthesis models to overcome the scarcity of training data,” vol. 10077 LNAI, pp. 73–83.
[11] B. Weinberg, “Acoustical properties of esophageal and tracheoesophageal speech,” Laryngectomee rehabilitation, pp. 113–127, 1986.
[12] T. Most, Y. Tobin, and R. C. Mimran, “Acoustic and perceptual characteristics of esophageal and tracheoesophageal speech production,” Journal of communication disorders, vol. 33, no. 2, pp. 165–181, 2000.
[13] T. Drugman, M. Rijckaert, C. Janssens, and M. Remacle, “Tracheoesophageal speech: A dedicated objective acoustic assessment,” Computer Speech & Language, vol. 30, no. 1, pp. 16–31, 2015.
[14] R. Ishaq and B. G. Zapirain, “Esophageal speech enhancement using modified voicing source,” in Signal Processing and Information Technology (ISSPIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 000 210–000 214.
[15] A. Mantilla, H. Pérez-Meana, D. Mata, C. Angeles, J. Alvarado, and L. Cabrera, “Recognition of vowel segments in spanish esophageal speech using hidden markov models,” in Computing, 2006. CIC’06. 15th International Conference on. IEEE, 2006, pp. 115–120.
[16] A. Mantilla, H. Perez-Meana, D. Mata, C. Angeles, J. Alvarado, and L. Cabrera, “Analysis and recognition of voiced segments of esophageal speech,” in Electronics, Robotics and Automotive Mechanics Conference, 2006, vol. 2. IEEE, 2006, pp. 236–244.
[17] A. Mantilla-Caeiros, M. Nakano-Miyatake, and H. Perez-Meana, “A pattern recognition based esophageal speech enhancement system,” Journal of applied research and technology, vol. 8, no. 1, pp. 56–70, 2010.
[18] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[19] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga, “Aholab system for albayzin 2016 search-on-speech evaluation,” in IberSPEECH, 2016, pp. 33–42.
[20] L. Serrano, D. Tavarez, X. Sarasola, S. Raman, I. Saratxaga, E. Navas, and I. Hernaez, “LSTM based voice conversion for laryngectomees,” in Proc. IberSPEECH 2018, 2018.
[21] S. Raman, I. Hernaez, E. Navas, and L. Serrano, “Listening to laryngectomees: A study of intelligibility and self-reported listening effort of spanish oesophageal speech,” in Proc. IberSPEECH 2018, 2018.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
1 https://elpais.com/elpais/2016/02/18/media/1455822566_899475.html
2 https://d-lab.tech/challenge-3/
3. Recruiting of labelers and technical support
17 people were selected and hired for the labelling task. Most of the annotators are students, and they have different backgrounds. Following recommendations from SafeToNet, most of the labelers are psychology students. There are 4 male and 13 female annotators, all of them between 20 and 30 years old. Table 1 shows the background summary of the annotators:

CODE  BACKGROUND
F1    Philosophy student
F2    Teacher degree
F3    Psychology student
F4    Criminology student
F5    Biomedicine student
F6    Psychology student
F7    Social Communication student
F8    Psychologist degree
F9    Biology and Neuroscience student
F10   Statistics and economy student
F11   Children's education
F12   Psychology student
F13   Engineering student
M1    Engineering student
M2    Psychology student
M3    Physical activity and sports degree
M4    Engineering student
Table 1. Code and Background of the annotators

Annotators were contracted on a part-time basis (20 hours/week). They work from home and have some flexibility to manage their dedication to the project during each week. However, their daily dedication to the project can never exceed 5 hours. This limit was set to avoid tiredness and lack of concentration during the annotation procedure.
A training week (16th-20th July) was organized at UPC premises. A person from SafeToNet was in charge of the training during the first three days. Annotators were instructed, one category at a time, with several examples chosen from real posts. During the last two days, students carried out a simulation of real work using the annotation platform, labeling 400 posts each day. Those posts were carefully chosen to show several categories and several concern levels simultaneously. Another re-training week was necessary midway through the project to reach consensus and improve inter-annotator agreement.

4. Compilation and selection of posts
Posts were selected from several sources such as Twitter, teenagers' chats, blogs, forums, medical consultation web sites, etc. Data was manually or automatically downloaded, cleaned, formatted and selected. Posts were chosen to have a minimum of 50 characters (not counting, but also not discarding, @names and internet addresses) and a total maximum of 280. Posts in Latin American Spanish were discarded when possible. As expected, Catalan data was harder to collect. The amount of information on the internet in Spanish is huge compared with that on Catalan websites, blogs and consultation sites. In addition, Catalan speakers are bilingual, and it is very common to find Spanish posts on Catalan sites. Spanish and Catalan can be naturally mixed even in the same post.

5. Quality Assessment
To calculate the consistency between the annotators we used three indices: Cohen's kappa index, Accuracy, and Cronbach's alpha index.

5.1. Cohen's kappa index
The Cohen's kappa index [1] between two annotators labeling the same data measures the consistency between the annotations and compares it with the case of the annotation being random. To calculate the consistency between two annotators, the messages labeled by both of them are retrieved and the results of the annotation are compared for each characteristic. The Cohen's kappa index is calculated as:

$$\kappa = \frac{p_0 - p_e}{1 - p_e}$$

where $p_0$ is the relative agreement between raters (accuracy), and $p_e$ is the probability of chance agreement. Cohen's kappa index was calculated on binarized categories (category A: level of concern 1 or 2; category B: level of concern 3, 4, or 5).

5.2. Cronbach's alpha coefficient
The Cronbach alpha coefficient has been calculated for each category (y) as

$$\alpha_y = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{2} \sigma_{yi}^{2}}{\sigma_t^{2}}\right)$$

where
$K$ is the number of items (K=2: 2 annotators),
$\sigma_{yi}^{2}$ is the variance of each item (i.e. the vector of length nposts of category y of tagger i),
$\sigma_t^{2}$ is the variance of the total (i.e. the vector t=y1+y2 of length nposts).

6. Results

index             aggression  anxiety  depression  distress  sexuality  substance  violence
Accuracy          0.90        0.98     0.96        0.84      0.97       0.99       0.97
Cohen's kappa     0.70        0.62     0.63        0.58      0.84       0.80       0.70
Cronbach's alpha  0.76        0.53     0.61        0.59      0.80       0.84       0.71
Table 2. Mean indexes for all the categories

The table shows lower inter-annotator agreement for anxiety and distress. This is a consequence of the lack of a direct translation of "distress" in Spanish and Catalan; the annotators had difficulty distinguishing between the two categories.

7. Discussion
The data collection will be finished on October 15th so that the first prototype will be ready one month later. The results of the complete project will be presented at Mobile World Congress 2019.

8. Acknowledgements
We want to thank d-LAB and STN for the trust placed in us to carry out the project.

9. References
[1] J. Cohen. A Coefficient of Agreement for Nominal Scales. https://doi.org/10.1177/001316446002000104
[2] L.J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, Sep 1951, vol 16, Issue 3, pp 297-334
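As a rough illustration of the indices defined in Sections 5.1 and 5.2, the following Python sketch computes accuracy, Cohen's kappa on binarized concern levels, and Cronbach's alpha for two annotators. It assumes that each annotator's labels are given as equal-length lists of concern levels (1 to 5) over the posts both annotators labeled. The function names and the toy data are hypothetical, and the use of population variances for Cronbach's alpha is an assumption, since the text does not state which estimator was used.

```python
from collections import Counter
from statistics import pvariance


def binarize(levels):
    """Map concern levels to the two categories used for kappa:
    0 for category A (level 1 or 2), 1 for category B (level 3, 4 or 5)."""
    return [0 if level <= 2 else 1 for level in levels]


def accuracy(a, b):
    """Relative agreement p0: fraction of posts with identical labels."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa = (p0 - pe) / (1 - pe) for two annotators."""
    n = len(a)
    p0 = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both annotators pick the same label independently.
    pe = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if pe == 1.0:
        return float("nan")  # undefined when only one label is ever used
    return (p0 - pe) / (1 - pe)


def cronbach_alpha(y1, y2):
    """Cronbach's alpha for K = 2 annotators on one category."""
    k = 2
    total = [u + v for u, v in zip(y1, y2)]
    item_variances = pvariance(y1) + pvariance(y2)
    return k / (k - 1) * (1 - item_variances / pvariance(total))


# Hypothetical concern levels given by two annotators to the same 8 posts.
tagger_1 = [3, 4, 1, 2, 5, 1, 3, 4]
tagger_2 = [3, 2, 1, 1, 4, 2, 5, 3]
bin_1, bin_2 = binarize(tagger_1), binarize(tagger_2)
print(accuracy(bin_1, bin_2), cohen_kappa(bin_1, bin_2))
print(cronbach_alpha(tagger_1, tagger_2))
```

Because the ratio of variances in the alpha formula is unchanged when sample variances are used consistently instead of population variances, the choice of estimator does not affect the result.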
➢ OBJ2: Involve end-users and reach a degree of fit to their personalised needs and requirements, derived by the coach, which will enhance their well-being
➢ OBJ3: Supply the coach with non-intrusive, privacy-preserving, empathic, expressive interaction technologies
➢ OBJ4: Validate the coach efficiency and effectiveness across 3 distinct European societies (Norway, Spain, and France), with 200 to 250 subjects, who will be involved from the start
➢ OBJ5: Evaluate/validate the effectiveness of EMPATHIC designs against relevant user’s personalised acceptance and affordance criteria (such as the ability to adapt to users’ underlying mood) assessed through the Key Performance Indicators (KPI) listed in Section 1.1.2
➢ OBJ6: Drive the developed methodology and tools to industry acceptance and open-source access, identifying appropriate evaluation criteria to improve the “specification-capture-design-implementation” software engineering process of implementing socially-centred ICT products.
and continuous interaction with the end-users, the technologies to consider individual user profile, including cultural facts and interaction history, the current emotional status of the user and the coach strategies at each decision of the dialogue manager, at each text generated by the Natural Language Generator, at each inflexion of the TTS and at each movement of the personalised visual agent.

Technological Goals (Tg) and Actions:
1. Develop a simulated virtual coach and acquire an initial corpus of dialogues. A set of annotated dialogues will be designed and obtained through a Wizard-of-Oz (WoZ) technology to fulfil the initial end-users and data requirements of Scientific Goals #2, #3 and #4.
2. Integrate and provide a proof-of-concept of the technology running on different devices
3. Validation through Field trials. EMPATHIC will test representative realistic use cases for different user profiles in three different countries
Abstract
The transcription of digitalised documents is useful to ease the digital access to their contents. Natural language technologies, such as Automatic Speech Recognition (ASR) for speech audio signals and Handwritten Text Recognition (HTR) for text images, have become common tools for assisting transcribers, by providing a draft transcription from the digital document that they may amend. This draft is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch.
The work described in this thesis is focused on the improvement of the transcription offered by an HTR system from three scenarios: multimodality, interactivity and crowdsourcing.
The image transcription can be obtained by dictating its textual contents to an ASR system. Besides, when both sources of information (image and speech) are available, a multimodal combination is possible, and this can be used to provide assistive systems with additional sources of information. Moreover, speech dictation can be used in a multimodal crowdsourcing platform, where collaborators may provide their speech by using mobile devices.
Different solutions for each scenario were tested on two Spanish historical manuscripts, obtaining statistically significant improvements.
Index Terms: handwritten text recognition, automatic speech recognition, multimodality, combination, interactivity, crowdsourcing

1. Introduction and Motivation
Transcription of digitised historical documents is an interesting task for libraries in order to provide efficient information access to the contents of these documents. The transcription process is done by experts on ancient and historical handwriting called paleographers.
In recent years, the use of off-line Handwritten Text Recognition (off-line HTR) systems [1] has made it possible to speed up the manual transcription process. HTR systems are composed of modules and employ models similar to those of classical speech recognition systems. However, state-of-the-art off-line HTR systems [2] are far from being perfect, and human supervision is required to really produce a transcription of standard quality. The initial result of automatic recognition may make the paleographer task easier, since they are able to perform corrections on a good draft transcription.
In addition to using off-line HTR systems from text line images, other modalities of natural language recognition can be used to help paleographers in the transcription process, such as Automatic Speech Recognition (ASR) [3] from the dictation of the contents, or on-line HTR [4] from touchscreen pen strokes. In this context, a multimodal interactive assistive scenario [5], where the assistive system and the paleographer cooperate to generate the perfect transcription, would reduce the time and the human effort required for obtaining the final result.
The use of multimodal collaborative transcription applications (crowdsourcing) [6], where collaborators can employ speech dictation of text lines as a transcription source from their mobile devices, allows for a wider range of population where volunteers can be recruited, producing a powerful tool for massive transcription at a relatively low cost, since the supervision effort of paleographers may be dramatically reduced.
In this thesis 1 [7], the reduction of the required human effort for obtaining the actual transcription of digitalised historical manuscripts is studied in the following scenarios:
• Multimodality: An initial draft transcription of a handwritten text line image can be obtained by using an off-line HTR system. An alternative for obtaining this draft transcription is to dictate the contents of the text line image to an ASR system. Furthermore, when both sources (image and speech) are available, a multimodal combination is possible, and an iterative process can be used in order to refine the draft transcription. Multimodal combination can be used in interactive transcription systems for combining different sources of information at the system input (such as off-line HTR and ASR), as well as to incorporate the user feedback (on-line HTR). At the same time, the multimodal and iterative combination process can be used to improve the initial off-line HTR draft transcription by using the ASR contribution of different speakers in a collaborative scenario.
• Interactivity: The use of assistive technologies in the transcription process reduces the time and human effort required for obtaining the actual transcription. The assistive transcription system proposes a hypothesis, usually derived from a recognition process of the handwritten text line image. Then, the paleographer reads it and produces a feedback signal (first error correction, dictation, etc.), and the system uses it to provide an alternative hypothesis, starting a new cycle. This process is repeated until a perfect transcription is obtained. Multimodality can be incorporated into the assistive transcription system, in order to improve the human-computer interaction and to provide the system with additional sources of information.
1 Publicly available in the UPV institutional repository: http://hdl.handle.net/10251/86137
10.21437/IberSPEECH.2018-35
• Crowdsourcing: Open distributed collaboration to obtain initial transcriptions is another option for improving the draft transcription to be amended by the paleographer. However, current transcription crowdsourcing platforms are mainly limited to the use of non-mobile devices, since the use of keyboards in mobile devices is not friendly enough for most users. An alternative is the use of speech dictation of handwritten text lines as a transcription source in a crowdsourcing platform where collaborators may provide their speech by using their own mobile device. Multimodal combination allows the improvement of the initial handwritten text recognition hypothesis by using the contribution of speech recognition from several speakers, providing as a final result a better draft transcription to be amended by a paleographer with less effort. In this framework, since collaborators are usually a scarce resource, their acquisition effort should be optimised with respect to the quality of the draft transcriptions.
The rest of this paper is structured as follows: Section 2 offers the main scientific and technological goals; Section 3 summarises the contents of this thesis; Section 4 contains the main conclusions; Section 5 draws the current work derived from this thesis and the future work lines; finally, Section 6 presents the achievements and the scientific contributions.

2. Scientific and Technological Goals
The main scientific and technological goals of this thesis are the following:
• To study the unimodal and multimodal combination techniques, in order to propose a new multimodal combination technique for improving the transcription of digitalised historical manuscripts by using the speech dictation of their contents.
• To study the use of multimodal combination techniques in a computer assisted system to improve the computer-human interaction and to accelerate the interactive transcription process.
• To develop a multimodal crowdsourcing platform based on the studied multimodal combination techniques to ease and spread the transcription of digitalised historical manuscripts.

3. Thesis Overview
The thesis document [7] is structured in five parts to facilitate the reading experience. It starts with a first introductory part, followed by a part for each one of the three studied scenarios, and it finishes with a part which presents the general conclusions and future work lines. This section presents an overview of the contents of the three central parts (multimodality, interactivity, and crowdsourcing).

3.1. Multimodality
The integration of knowledge given by off-line HTR and ASR processes presents two limitations: both signals are asynchronous and each modality uses different basic linguistic units (usually, characters for off-line HTR and phonemes for ASR). An initial approach for solving these limitations was proposed in previous works [8, 9], where the output of the recognition process of one modality, in form of word-graph lattice, is used to modify the general language model in order to make the decoded sentences more likely; this modified language model is employed in the decoding for the other modality. This procedure can be used iteratively. This approach presents a few drawbacks: there is not a single hypothesis given that each modality provides its own, and it is not known beforehand which one is more accurate, and the initial modality must be chosen arbitrarily.
Chapter 3 of the thesis (Combining Handwriting and Speech) presents a new proposal based on the use of Confusion Networks for obtaining a single hypothesis from the combination of the hypotheses obtained from an off-line HTR recogniser and an ASR recogniser for decoding a text line image and the dictation of its contents. In the next chapter (Chapter 4), our multimodal proposal is tested and compared with other combination methods.
The experiments were performed on two different Spanish historical manuscripts: Cristo Salvador, which is a single writer book from the 19th century provided by Biblioteca Valenciana Digital, and Rodrigo [10], which corresponds to the digitisation of the book Historia de España del arçobispo Don Rodrigo, which was written in old Castilian (Spanish) in 1545. Both corpora are publicly available for research purposes on the website of the Pattern Recognition and Human Language Technology (PRHLT) research center 2. Acoustic models were trained by using the Spanish phonetic corpus Albayzin [11].
2 https://prhlt.upv.es/
The transcription quality is assessed using the Word Error Rate (WER) value, which allows us to obtain a good estimation for the paleographer post-edition effort, and the lattice quality by the oracle WER, which represents the WER of the best hypotheses contained in the word lattices (more details about corpora and evaluation metrics can be found in Chapter 2 of the thesis).

Table 1: Summary of the multimodal experimental results.

              Cristo Salvador       Rodrigo
Experiment    WER    Oracle WER     WER    Oracle WER
Off-line HTR  32.9%  27.5%          39.3%  28.2%
ASR           43.3%  27.4%          62.9%  29.5%
Multimodal    29.3%  13.4%          35.9%  14.8%

Table 1 summarises the results of the multimodal experiments. As it can be observed, the behaviour is similar for both corpora. The use of the ASR does not improve the WER of the draft offered by the off-line HTR system, although the word-graphs generated offer similar values of oracle WER for both modalities. However, by combining both modalities using our proposal, not only is the WER improved, but the oracle WER value of the multimodal word-graph lattices is also substantially reduced. Given that the oracle WER value is related to the quality of the alternatives offered by our interactive and assistive system, an outstanding effect on interactive transcription can be expected.

3.2. Interactivity
The result of combining the knowledge given by off-line HTR and ASR processes may make the paleographer task easier, since they are able to correct on an improved draft transcription. However, given that paleographer revision is required to
produce a transcription of standard quality, an interactive assis- the percentage of words corrected by means of the keyboard
tive scenario, where the automatic system and the paleographer (KBD). As it can be observed, the multimodal combination of
cooperate to generate the perfect transcription, would provide the on-line feedback with the input hypotheses reduces signif-
an additional reduction of the human effort and time required icantly the amount of words that are required to be corrected
for obtaining the final result. by using the keyboard, and most of the paleographer effort is
Chapter 5 of the thesis (Assistive Transcription) presents a concentrated in the more ergonomic touchscreen feedback.
multimodal interactive transcription system where the paleogra-
pher feedback is provided by means of touchscreen pen strokes, 3.3. Crowdsourcing
traditional keyboard, and mouse operations. The combination
of the different sources of information is based on the use of As an alternative to the keyboard, volunteers could employ
Confusion Networks derived from the decoding output of three voice as input for transcription. Nearly all mobile devices pro-
recognition systems: two HTR systems (off-line and on-line), vide this modality, which widens the range of population and
and an ASR system. Off-line HTR and ASR are used to derive situations where collaboration can be performed. The main
(by themselves or by combining their recognition results) the drawback is that the audio transcription, usually obtained by
initial hypothesis, and on-line HTR is used to provide feedback. In the next chapter (Chapter 6 of the thesis), our multimodal and interactive proposals are tested.
In this case, the interactive performance is given by the Word Stroke Ratio (WSR), the definition of which makes it comparable with the WER. The relative difference between them gives us the effort reduction (EFR), which is an estimation of the transcription effort reduction that can be achieved by using the interactive system (see Chapter 2 of the thesis for more details).

Table 2: Summary of the multimodal interactivity experimental results.

                 Cristo Salvador       Rodrigo
Experiment       WSR       EFR         WSR       EFR
Off-line HTR     30.2%     8.2%        36.2%     7.9%
ASR              35.1%     −6.7%       47.2%     −20.1%
Multimodal       14.1%     57.1%       27.0%     31.3%

Table 2 summarises the results of the assistive and interactive experiments. As can be observed, the estimated interactive human effort (WSR) required for obtaining the perfect transcription from the off-line HTR decoding represents about 8% of relative effort reduction (EFR) over the off-line HTR WER for both corpora (see Table 1). In the case of ASR, however, no effort reduction is obtained (the EFR is negative for both corpora). Regarding multimodality, as expected, the use of the proposed multimodal approach allows the interactive system to achieve more than 30% of relative effort reduction over the off-line HTR WER for both corpora.

ASR systems [3], presents an ambiguity not present in typed input. Even the state-of-the-art techniques [12], although more accurate than a few years ago, produce a considerable amount of errors in the recognition process, which makes it necessary to strike a balance between the number of collaborations and the quality they provide.
In any case, the need for final supervision by a paleographer makes it possible that, although not perfect, voice inputs combined with off-line HTR provide an initial draft transcription more accurate than that given only by off-line HTR. This was confirmed by the statistically significant improvements obtained in the experiments performed for the previous parts of this thesis, multimodal and interactive transcription. Thus, the use of speech collaborations will allow us to significantly reduce the final transcription effort.
Chapter 7 of the thesis (Collective Collaboration) explores how a crowdsourcing framework that allows the acquisition of text line dictations could decrease the transcription effort. The framework is based on the use of multimodal recognition, both employing and combining off-line HTR and ASR results, to improve the final transcription that is going to be offered to the paleographer. The multimodal recognition approach is based on language model interpolation and Confusion Network combination techniques. The crowdsourcing platform was implemented by using a client-server architecture. The client is a mobile application [13] that allows speech acquisition, and the server part performs the recognition and combination operations. In the next chapter (Chapter 8), our multimodal crowdsourcing proposal is tested in a supervised and in an unsupervised mode for the Rodrigo [10] corpus.
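As a rough illustration of the language model interpolation step mentioned above, the following minimal sketch linearly interpolates two unigram distributions. It is not the thesis implementation: the unigram simplification, the function names (`unigram_probs`, `interpolate`), the weight `lam`, and the toy token sequences are all assumptions made for this example.

```python
from collections import Counter


def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities for a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(p_base, p_new, lam):
    """Linear interpolation: lam * p_base + (1 - lam) * p_new over the joint vocabulary."""
    vocab = set(p_base) | set(p_new)
    return {
        word: lam * p_base.get(word, 0.0) + (1.0 - lam) * p_new.get(word, 0.0)
        for word in vocab
    }


# Toy, made-up data: a base model and a model estimated from a previous hypothesis.
base = unigram_probs("en el nombre de dios".split())
hyp = unigram_probs("en el nombre del rey".split())

# The weight 0.7 is an arbitrary illustrative value, not one taken from the thesis.
mixed = interpolate(base, hyp, lam=0.7)
```

Because both inputs are probability distributions and the weights sum to one, the interpolated model is again a distribution; words supported by both sources keep their mass, while words seen in only one source are down-weighted.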
[Figure 1: plot of Word Error Rate for the worst, median, and best speaker order.]

Table 3: Summary of the multimodal feedback and interactivity experimental results for Cristo Salvador. Columns: Experiment; WSR (Deletions, TS, KBD, Global); EFR.
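The EFR columns can be checked against the WSR values with a short calculation. The sketch below assumes that EFR is the relative difference (WER − WSR) / WER taken against the off-line HTR WER; the WER values used here (32.9% and 39.3%) do not appear in this section and are back-computed from Table 2, so they should be read as assumptions rather than reported figures.

```python
def effort_reduction(wer, wsr):
    """Relative difference between WER and WSR, in percent."""
    return 100.0 * (wer - wsr) / wer


# Assumed off-line HTR WER per corpus (back-computed from Table 2, not quoted from Table 1).
ASSUMED_WER = {"Cristo Salvador": 32.9, "Rodrigo": 39.3}

# WSR values as listed in Table 2.
WSR = {
    "Cristo Salvador": {"Off-line HTR": 30.2, "ASR": 35.1, "Multimodal": 14.1},
    "Rodrigo": {"Off-line HTR": 36.2, "ASR": 47.2, "Multimodal": 27.0},
}

efr = {
    corpus: {
        system: round(effort_reduction(ASSUMED_WER[corpus], wsr), 1)
        for system, wsr in systems.items()
    }
    for corpus, systems in WSR.items()
}

for corpus, systems in efr.items():
    for system, value in systems.items():
        print(f"{corpus:16s} {system:13s} EFR = {value:6.1f}%")
```

Under these assumptions the computed values agree with the EFR column of Table 2 to one decimal place, including the negative values for ASR, where the WSR exceeds the off-line HTR WER.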
shows the evolution, from the initial off-line HTR baseline until the speech of the last collaborator has been processed, for the lists that obtained the worst, the median and the best final results. As can be observed, the worst and the best final results do not present any statistically significant differences. These results show that, in the best case, only two speakers are needed to obtain significant improvements. Meanwhile, in the worst case at least four speakers are needed.

[Figure 2 plot: Word Error Rate (20% to 100%) against collaborators (HTR baseline, 5th to 25th), with curves for the ASR baseline, the crowdsourcing system output, and the crowdsourcing ASR output.]

Figure 2: Baseline values and the evolution of the system and ASR outputs for the whole unsupervised collaborations. The horizontal lines represent the corresponding average ASR WER.

From the unsupervised experiments, Figure 2 draws the baseline values for both modalities and the evolution of the system and ASR outputs. As can be observed, the language model interpolation reduces the error level in the next speech decoding process [8], and the combination with the speech decoding results allows the system output to converge to a better hypothesis with fewer errors to correct [16]. Besides, the ASR performance is considerably improved, reducing the average WER baseline value. Finally, after processing the speech of the last collaborator, the system output presented a WER of 25.3%, a statistically significant relative improvement of 35.6% over the off-line HTR baseline, and an estimated time reduction for the paleographer revision of about 5 minutes per page [10].

4. Main Conclusions

Regarding multimodality, the benefits of multimodal combination of the results obtained from off-line HTR with additional sources of information for the transcription of historical manuscripts have been confirmed.
With respect to interactivity, multimodality was applied on an interactive tool for transcribing historical handwritten documents. On the one hand, the multimodal hypotheses combination reduces the human time and workload required for transcribing historical books, due to the increased recognition accuracy and the better quality of the alternatives contained in the multimodal lattice. On the other hand, the use of multimodal combination improves the human-computer interaction (by using on-line touch-screen handwritten pen strokes), given that the multimodal combination makes it possible to correct errors in the interactive system hypothesis by using the information provided by the on-line handwritten text introduced by the user.
Finally, the proposed multimodal crowdsourcing framework is based on the iterative refinement of the language model and hypotheses combination. This framework uses a client/server architecture in order to allow collaborators to decide when and where to collaborate. The mobile application used for speech acquisition is publicly available [13].
The experiments showed that, in this framework, the number of collaborators is more important than the order in which their speech is processed. Through this experimentation, it has been shown that the use of speech is a good additional source of information for improving the transcription of historical manuscripts, and that this modality allows people to collaborate in this task using their own mobile device.

5. Current and Future Work

Currently, we are testing the performance of our assistive and interactive system with more robust modelling methods based on deep learning.
Regarding multimodality, we propose for future studies the use of whole sentences instead of lines of the handwritten text document because it might make multimodality more natural from the point of view of the paleographer or speaker who has to dictate the contents of the handwritten text images to the ASR system.
In the case of interactive transcription, we have already tested the use of speech not only as an additional source of information about the handwritten text line image to transcribe in the interactive and assistive system, but also as an additional modality for human-computer interaction [14]. Furthermore, our future work also aims at taking advantage of the real samples that are produced while the system is used, in order to adapt the feedback natural language recognisers to the user.
Finally, the proposed multimodal crowdsourcing framework and the multimodal interactive transcription system were integrated [15], and in the near future we are planning to test the integrated system with other datasets.

6. Scientific Contributions

The main contributions of this thesis can be summarised as follows: the evaluation of how to combine the decoding output of different natural language recognition systems, the integration of the combination of different signals in a computer assisted transcription system, and the development of a multimodal crowdsourcing platform for the transcription of historical manuscripts.
The scientific impact of this thesis was supported by eight publications at the time of the dissertation presentation. Specifically, the multimodality part was supported by two articles presented in two international conferences (ICDAR 2015 [16], and CAIP 2015 [17]), the interactivity part by two publications, one in an international conference and the other in a book chapter (DAS 2016 [18], Handwriting, Nova 2017 [19]), and the crowdsourcing part by four publications, two in international conferences, one in a book chapter, and one in a JCR international journal (DocEng 2016 [20], IberSPEECH 2016 [21, 13], IEEE/ACM TASLP [22]).
Moreover, at the time of writing this paper, an additional publication in an international conference supports the interactivity part (DAS 2018 [14]), and another in a JCR international journal supports the crowdsourcing part (COIN [15]).

7. Acknowledgments

Work partially supported by: Percepción - TSI-020601-2012-50 (MINETUR), SmartWays - RTC-2014-1466-4 (MINECO), STraDA - TIN2012-37475-C02-01 (MINECO), and CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER).
8. References

[1] A. Fischer, “Handwriting Recognition in Historical Documents,” Ph.D. dissertation, University of Bern, 2012.
[2] A. Manoj, P. Borate, P. Jain, V. Sanas, and R. Pashte, “A Survey on Offline Handwriting Recognition Systems,” International Journal of Scientific Research in Science, Engineering and Technology, vol. 2, no. 2, pp. 253–257, 2016.
[3] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[4] R. Plamondon and S. N. Srihari, “On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[5] A. H. Toselli, E. Vidal, and F. Casacuberta, “Computer Assisted Transcription of Text Images,” in Multimodal Interactive Pattern Recognition and Applications. Springer, 2011, ch. 3, pp. 61–98.
[6] A. Fornés, J. Lladós, J. Mas, J. M. Pujades, and A. Cabré, “A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts,” in Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH ’14), 2014, pp. 103–108.
[7] E. Granell, “Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing,” Ph.D. dissertation, Universitat Politècnica de València, 2017, supervisors: C.-D. Martínez-Hinarejos and V. Romero. Available: http://hdl.handle.net/10251/86137
[8] V. Alabau, V. Romero, A. L. Lagarda, and C.-D. Martínez-Hinarejos, “A Multimodal Approach to Dictation of Handwritten Historical Documents,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 2245–2248.
[9] V. Alabau, C.-D. Martínez-Hinarejos, V. Romero, and A. L. Lagarda, “An iterative multimodal framework for the transcription of handwritten historical documents,” Pattern Recognition Letters, vol. 35, pp. 195–203, 2014, Frontiers in Handwriting Processing.
[10] N. Serrano, F. Castro, and A. Juan, “The RODRIGO Database,” in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010, pp. 2709–2712. [Online]. Available: http://aclweb.org/anthology/L10-1330
[11] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri, J. B. Mariño, and C. Nadeu, “Albayzin speech database: Design of the phonetic corpus,” in Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech’93), 1993, pp. 175–178.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[13] E. Granell and C.-D. Martínez-Hinarejos, “Read4SpeechExperiments: A Tool for Speech Acquisition from Mobile Devices,” in Proceedings of the IX Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop (IberSPEECH’2016), 2016, pp. 411–417.
[14] C.-D. Martínez-Hinarejos, E. Granell, and V. Romero, “Comparing different feedback modalities in assisted transcription of manuscripts,” in Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS ’18), 2018, pp. 115–120.
[15] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, “Multimodality, interactivity, and crowdsourcing for document transcription,” Computational Intelligence, vol. 34, no. 2, pp. 398–419, 2018.
[16] E. Granell and C.-D. Martínez-Hinarejos, “Combining Handwriting and Speech Recognition for Transcribing Historical Handwritten Documents,” in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR’15), 2015, pp. 126–130.
[17] E. Granell and C.-D. Martínez-Hinarejos, “Multimodal Output Combination for Transcribing Historical Handwritten Documents,” in Proceedings of the 16th International Conference on Computer Analysis of Images and Patterns (CAIP), 2015, pp. 246–260.
[18] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, “An Interactive Approach with Off-line and On-line Handwritten Text Recognition Combination for Transcribing Historical Documents,” in Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS ’16), 2016, pp. 269–274.
[19] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, “Using Speech and Handwriting in an Interactive Approach for Transcribing Historical Documents,” in Handwriting: Recognition, Development and Analysis. Nova Science, 2017.
[20] E. Granell and C.-D. Martínez-Hinarejos, “A Multimodal Crowdsourcing Framework for Transcribing Historical Handwritten Documents,” in Proceedings of the 16th ACM Symposium on Document Engineering (DocEng), 2016, pp. 157–163.
[21] E. Granell and C.-D. Martínez-Hinarejos, “Collaborator Effort Optimisation in Multimodal Crowdsourcing for Transcribing Historical Manuscripts,” in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2016, pp. 234–244.
[22] E. Granell and C.-D. Martínez-Hinarejos, “Multimodal Crowdsourcing for Transcribing Handwritten Documents,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 2, pp. 409–419, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-36
[Figure 1 diagram: three hidden layers built from convolution layers (filter shapes 11x11 and 5x5) and subsampling layers (pool shapes 2x2 and 1x62), followed by a full connection.]

Figure 1: Representation of a CDNN architecture for LID.

Table 1: Configuration of the CDNN models for LID.

Table 2: CDNN systems performance for LID on the 8 languages subset of NIST LRE 2009.

                                     Performance
ID                     Size          EERavg (%)    Cavg
i-vector               ∼23M          16.94         0.1535
ConvNet 1              ∼198k         22.14         0.2406
ConvNet 2              ∼39k          25.90         0.2700
ConvNet 3              ∼39k          24.69         0.2616
ConvNet 4              ∼39k          23.48         0.2461
ConvNet 5              ∼78k          21.60         0.2282
ConvNet 6              ∼78k          21.11         0.2293
ConvNet 6+i-vector     -             15.96         0.1433

                                   Lang8
ID                     Size        Acc.      EERavg    Impr.
#1  i-vector           23M         65.02     16.94     -
#2  lstm 1×512         1.2M        57.51     17.82     -
#3  lstm 1×750         2.5M        63.39     15.61     ∼7.85
#4  lstm 1×1024        4.4M        65.63     15.10     ∼10.86
#5  lstm 2×256         850k        63.73     14.96     ∼11.69
#6  lstm 2×512         3.3M        70.90     12.51     ∼26.15

[Figure 2 diagram: a DNN produces bottleneck (BN) features, which feed a UBM / i-vector system (the BN feature-based system).]

Figure 2: Scheme of cepstral vs. BN based LID/SID systems.

We train our systems with sequences of 2 s of MFCC-SDCs with no stacking of acoustic frames. They consist of 1 or 2 LSTM hidden layers followed by a softmax output layer, which returns a probability for each input frame and language. For scoring, we average the output per frame, using just the last 10% of each utterance.
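The scoring rule described above (averaging the per-frame softmax outputs over the last 10% of each utterance) can be sketched as follows. This is only an illustration under stated assumptions: the function name `score_utterance`, the rounding of the tail length, and the toy posteriors are hypothetical and not taken from the thesis.

```python
import math


def score_utterance(frame_posteriors, tail_fraction=0.10):
    """Average per-frame language posteriors over the final part of an utterance."""
    n_frames = len(frame_posteriors)
    # Number of trailing frames to keep; at least one frame is always used (assumption).
    n_tail = max(1, math.ceil(n_frames * tail_fraction))
    tail = frame_posteriors[-n_tail:]
    n_langs = len(tail[0])
    return [sum(frame[k] for frame in tail) / n_tail for k in range(n_langs)]


N_LANGS = 8  # the 8-language subset mentioned in the text


def one_hot(index):
    """Toy posterior vector that puts all probability mass on one language."""
    return [1.0 if k == index else 0.0 for k in range(N_LANGS)]


# Made-up utterance of 200 frames: the early frames favour language 0,
# the final frames favour language 3.
frames = [one_hot(0)] * 180 + [one_hot(3)] * 20
scores = score_utterance(frames)
decision = max(range(N_LANGS), key=lambda k: scores[k])
```

In this toy case the decision follows the final frames only, which reflects the idea that a recurrent model has seen the most context by the end of the sequence.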
3.1. Analysis of Bottleneck Features for LID

Here, we analyze how the topology of the DNN trained for ASR influences the performance of the resulting BNFs for LID on the NIST LRE 2015 development dataset. We use a feedforward DNN with an input layer, three to five hidden layers, and the output layer. To feed the network, we use 20 MFCCs preprocessed with a context of 31 frames. The hidden layers are composed of 1500 units and the BN layer, of 80. The softmax output layer provides the probability that each input corresponds to a given phoneme state (3083 triphone states are used). We use stochastic gradient descent to optimize the cross-entropy.

3.1.1. Experiments and Results

First, we vary the number of layers in the DNN from 3 to 5 (Table 4). Although the 5-layer configuration gives better performance in terms of frame accuracy, the architecture with 4 hidden layers reaches the lowest EERavg for LID. The discriminative task (ASR) is therefore easier for the DNN when the classifier is more complex (5-layer DNN), which improves the frame accuracy. However, that network is not being forced to focus on obtaining a compact representation of the signal, which is then used for LID.

Table 4: Results of BNFs for LID (development set of NIST LRE 2015), varying the number of layers of the DNN.

Number of        DNN            EERavg (in %)
Hidden Layers    Frame Acc.     30s      10s      3s
3                47.82          5.52     9.04     14.34
4                49.55          4.33     7.81     13.76
5                50.46          5.22     8.57     14.15

Then, keeping the number of layers fixed to 4, we explore how the LID system performs depending on the position that the BN layer occupies in the DNN, which corresponds to different levels of extracted information closer to or further from the phonetic information (output layer). Results can be seen in Table 5. The closer the BN layer is to the input layer, the noisier the resulting representation would be, which might explain the drop in performance for the first and second layers with respect to the results of the last two layers. The best performance in terms of EERavg for LID is obtained when the bottleneck layer is located in the third layer, but that result is very close to the one obtained with the BN at the fourth layer. Performance of the DNN also drops when the BN layer moves from the third to the fourth layer. In this topology, the BN layer in the fourth position is connected directly to the output layer, resulting in a weight matrix that connects a small layer with just 80 hidden units with the output layer, of size 3083. These weights might be difficult to learn, which may explain this drop in performance of the DNN.

Table 5: Results of BNFs for LID (development set of NIST LRE 2015) with different position for the BN layer.

Position of      DNN                EERavg (in %)
BN Layer         Frame Accuracy     30s      10s      3s
First            49.17              9.37     12.24    16.59
Second           49.46              6.27     9.55     14.58
Third            49.55              4.33     7.81     13.76
Fourth           48.05              4.64     8.00     14.17

3.2. Analysis of Bottleneck Features for SID

We explore whether DNNs suboptimal for ASR can provide better BNFs for SID. We present here experiments with different features to feed the DNN, either optimized for ASR (“ASR feat.”) or for SID (“MFCC”). The ASR optimized features [10] are composed of 24 Mel-filter bank log outputs concatenated with 13 fundamental frequency (F0) features, with utterance mean normalization, which is what we used as default for ASR [11]. SID optimized features are the classical 20 MFCCs used for SID, either adding the derivatives or not, and normalized with short-term cepstral mean and variance normalization (ST-CMVN). We evaluate the systems on the NIST SRE 2010, condition 5, female task [12].

3.2.1. Experiments and Results

The aspect analyzed in this section is the DNN input features, which are either optimized for ASR or SID (“ASR feat.” vs. “MFCC”). Results of these experiments are summarized in Table 6.

Table 6: Results of BNFs for SID on the NIST SRE 2010.

Features         Norm.       Phone Acc (%)    EER (%)
ASR feat.        Utt. CMN    49.8             2.51
MFCC ∆+∆∆        ST-CMVN     49.6             1.99
MFCC (20 dim)    ST-CMVN     45.57            1.67

We see that the ASR features (with per-utterance mean normalization) yield better performance in terms of phone accuracy than the MFCCs, as expected since they are optimized for ASR. However, BNFs obtained from these DNNs do not seem to be as discriminative as the ones obtained with DNNs trained using MFCCs optimized for SID. Moreover, adding first and second derivatives to the MFCCs provides better phone accuracy but results in worse SID performance. As for LID, we see that better ASR performance (in terms of phone accuracy) does not necessarily correspond to better SID performance.

4. Utterance Level Representation: DNN-based Embeddings

Despite the success of BNFs for SID [13, 14, 15] and LID [16, 17, 18, 19, 20, 21], the variable length of this frame-wise representation poses a challenge in subsequent modeling. The classical i-vector compacts the utterance representation in a fixed-length vector. However, the aim of i-vectors is to capture information about sources of variability in the training data, but this information is not necessarily relevant to the target task.
In this section we use embeddings for LID (after their success for SID [22]), which are a fixed-length representation of an utterance extracted from a sequence summarizing DNN trained discriminatively for the target task (LID).
The DNN consists of a first part that works on a frame-by-frame basis from a given sequence of feature vectors, followed by a pooling layer, which in our case computes the mean and standard deviation over time of the activations of the previous layer. Finally, a number of hidden layers follow to capture the information contained in the input, providing a single vector of values per sequence (embeddings), which can be modeled by some other backend.
In particular, our DNN-embedding system takes stacked BNFs as input and uses bidirectional LSTM (BLSTM) layers for
the frame-level part. After pooling, two more fully connected layers are added, whose output values will serve as embeddings. Finally, the output layer consists of a softmax layer that provides a vector of language posterior probabilities for each utterance. An example of this architecture is depicted in Figure 3.

[Figure 3 diagram: input sequence, two BLSTM layers (256 each), pooling (mean, std), a fully connected layer, the embedding layers Emb_a and Emb_b, and the output layer. The BLSTM layers operate frame by frame; the layers after pooling operate at utterance (sequence) level and give the embedding representation.]

4.1. Analysis on NIST LRE 2015

For these experiments, we use the architecture described above. We feed the DNN with 30-dimensional stacked BNFs and the output softmax layer provides a 20-dimensional vector of language posterior probabilities for each utterance. As reference, we use an i-vector system that consists of a 2048-dimensional UBM trained on the same BNFs and 600-dimensional i-vectors.
First, we experiment with varying the size of the embedding layers (keeping the rest fixed to 256 for each layer up to the pooling). We start with 512 and 300-dimensional embeddings (DNN 1) and halve each twice (DNN 2 and 3, respectively). Results stacking both embeddings are shown in Table 7. We see a better performance of DNN 2 embeddings, which are half the size of those of DNN 1. This suggests that the larger embeddings contain more detrimental information about the channel, since all DNNs reached the same performance on the training data.
Motivated by this, we explore further dimensionality reduction via PCA in Table 8. We see that with smaller embeddings, we are able to get improvements even when reducing the dimensionality down to 25. Best results are achieved with embeddings from DNN 2, whose stacked dimensionality (256 + 150 = 406) is close to that of the typical i-vector (400 or 600). In addition, we achieved a performance of 17.44%, close to our i-vector baseline (16.93%), and score-level fusion of both gave us a Cavg of 15.69%.

Table 8: Results with PCA on top of embeddings for LID on the NIST LRE 2015 evaluation dataset. Column header: 14 lang Cavg × 100.

4.2. Analysis on NIST LRE 2017

After developing the embedding system for the NIST LRE 2017, where it was included in the primary submission of the BUT team (a fusion of 3 i-vector systems and the embedding system), we performed a post-evaluation analysis.
First, we compared the architecture with the same configuration as DNN 1 in the previous section with a larger one, where the fully connected layer has 1500 units and both embeddings are 512-dimensional. With that larger model, performance improved from 22.18% to 19.86% (Cprimary).
Moreover, we extended the training dataset by performing data augmentation through addition of noise, reverberation and tempo variations of the original audio files. Figure 4 shows the comparison of performance when training with up to 11 copies of the original data with different corruptions. In general, increasing the number of copies of the data yields improvements in performance. In particular, adding any noisy version of the data (combined or not with other corruptions) makes the system more robust against data mismatch, providing gains in performance. The only two cases in which data augmentation does not improve the system trained only on original data are the ones in which just reverberation or tempo variations are performed.

5. Conclusions

The main contributions of this Ph.D. Thesis are the following. First, the proposed end-to-end approaches for LID based on CDNNs and LSTMs, which provide an alternative to i-vectors with fewer parameters. Secondly, the systematic study of bottleneck feature DNN-based LID systems and the analysis of this approach for SID, which show that the optimal DNN configuration for BNFs for LID and SID might differ from the most beneficial one for ASR, the task for which the DNN is trained. Finally, the novel approach based on embeddings for LID, in line with previous works in SID, which provides a fixed-length representation of utterances directly learned by the DNN for the target task and is able to outperform the well-established i-vectors.
In terms of articles, two journal articles and seven peer-reviewed international conference papers were published from research directly resulting from this Ph.D. Thesis.
6. References

[1] A. Lozano-Diez, “Bottleneck and embedding representation of speech for DNN-based language and speaker recognition,” Ph.D. dissertation, June 2018. [Online]. Available: https://repositorio.uam.es/handle/10486/684191
[2] T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528–5531.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Intelligent Signal Processing. IEEE Press, 2001, pp. 306–351.
[4] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, and J. R. Deller, “Approaches to Language Identification Using Gaussian Mixture Models and Shifted Delta Cepstral Features,” in Proc. ICSLP, vol. 1, 2002, pp. 89–92.
[5] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, vol. 385. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-24797-2
[6] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[7] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” Journal of Machine Learning Research, vol. 3, pp. 115–143, Mar. 2003.
[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” arXiv preprint arXiv:1503.04069, 2015.
[9] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, vol. 4, April 2007, pp. IV–757–IV–760.
[10] F. Grézl, M. Karafiát, and L. Burget, “Investigation into bottle-neck features for meeting speech recognition,” in Proc. Interspeech 2009, no. 9. International Speech Communication Association, 2009, pp. 2947–2950.
[11] M. Karafiát, F. Grézl, K. Veselý, M. Hannemann, I. Szőke, and J. Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Proceedings of Interspeech 2014. International Speech Communication Association, 2014, pp. 3002–3006.
[12] NIST, “The NIST year 2010 speaker recognition evaluation plan,” www.itl.nist.gov/iad/mig/tests/sre/2010/NIST_SRE10_evalplan.r6.pdf, 2010.
[13] D. Garcia-Romero and A. McCree, “Insights into deep neural networks for speaker recognition,” in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, 2015, pp. 1141–1145.
[14] S. Yaman, J. Pelecanos, and R. Sarikaya, “Bottleneck features for speaker recognition,” in Proceedings of Odyssey 2012. International Speech Communication Association, 2012.
[15] A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey 2016. International Speech Communication Association, 2016.
[16] Y. Song, B. Jiang, Y. Bao, S. Wei, and L. R. Dai, “I-vector representation based on bottleneck features for language identification,” Electronics Letters, vol. 49, no. 24, pp. 1569–1570, November 2013.
[17] A. Lozano-Diez, R. Zazo, D. T. Toledano, and J. Gonzalez-Rodriguez, “An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition,” PLOS ONE, vol. 12, no. 8, pp. 1–22, 08 2017. [Online]. Available: https://doi.org/10.1371/journal.pone.0182580
[18] R. Fér, P. Matějka, F. Grézl, O. Plchot, and J. Černocký, “Multilingual bottleneck features for language recognition,” in Proceedings of Interspeech 2015, vol. 2015, no. 09, 2015, pp. 389–393.
[19] P. Matějka, L. Zhang, T. Ng, H. S. Mallidi, O. Glembek, J. Ma, and B. Zhang, “Neural network bottleneck features for language identification,” in Proceedings of Odyssey 2014. International Speech Communication Association, 2014.
[20] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, Oct 2015.
[21] B. Jiang, Y. Song, S. Wei, J.-H. Liu, I. V. McLoughlin, and L.-R. Dai, “Deep bottleneck features for spoken language identification,” PLOS ONE, vol. 9, no. 7, pp. 1–11, 07 2014. [Online]. Available: https://doi.org/10.1371/journal.pone.0100795
[22] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proceedings of Interspeech 2017, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Abstract

Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need speaker and/or phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data creates a large performance gap, in speaker recognition, between the two well-known cosine and PLDA i-vector scoring techniques. This thesis tries to solve the problems above by using DL technology in different ways, without any need for speaker or phonetic labels. We have proposed an effective DL-based backend for i-vectors which fills 46% of this performance gap, in terms of minDCF, and 79% in combination with a PLDA system with automatically estimated labels. We have also developed an efficient alternative vector representation of speech by keeping the computational cost as low as possible and avoiding phonetic labels. The proposed vectors are referred to as GMM-RBM vectors. Experiments on the core test condition 5 of the NIST SRE 2010 show that results comparable with conventional i-vectors are achieved with a clearly lower computational load in the vector extraction process. Finally, for the LID application, we have proposed a DNN architecture to model effectively the i-vector space of languages in the car environment. It is shown that the proposed DNN architecture outperforms GMM-UBM and i-vector/LDA systems by 37% and 28%, respectively, for short signals of 2-3 sec.
Index Terms: Deep Learning, Speaker Recognition, i-Vector, Deep Neural Network, Deep Belief Network, Restricted Boltzmann Machine

The Ph.D. thesis has been carried out under supervision of Prof. Javier Hernando in TALP Research Center, Department of Signal Theory and Communications, Universitat Politecnica de Catalunya - BarcelonaTech, Spain. The thesis was supported in part by the Spanish projects TEC2010-21040-C02-01, PCIN-2013-06, TEC2015-69266-P, and TEC2012-38939-C03-0 and the European project PCIN-2013-067.

1. Introduction

The successful use of Deep Learning (DL) in a large variety of signal processing applications, particularly in speech processing (e.g., [1, 2, 3]), has inspired the community to make use of DL techniques in speaker and language recognition as well. A possible use of DL techniques in speaker recognition is to combine them with the state-of-the-art i-vector [4]. However, the main problem is that the use of DL greatly increases the computational cost of the i-vector extraction process, and phonetic and/or speaker labels are required for training, which are not always accessible (e.g., [5, 6, 7, 8, 9]).
Another possible use of DL is to represent a speech signal with a single low dimensional vector using a DL architecture, rather than the traditional i-vector algorithm. These vectors are often referred to as speaker embeddings (e.g., [10, 7, 11, 12, 13]). The need for speaker labels for training the network is one of the disadvantages of these techniques. Moreover, speaker embeddings extracted from hidden layer outputs are not very compatible with the Probabilistic Linear Discriminant Analysis (PLDA) backend [14, 15], as the posterior distribution of hidden layer outputs is usually not truly Gaussian.
The first objective in this thesis is to make use of deep architectures for backend i-vector classification in order to fill the performance gap between the cosine (unlabeled-based) and PLDA (labeled-based) scoring baseline systems given unlabeled background data. The second one is to develop an efficient framework for vector representation of speech by keeping the computational cost as low as possible and avoiding speaker and phonetic labels. The last main objective is to make use of deep architectures for backend i-vector classification for Language Identification (LID) in intelligent vehicles. In this scenario, LID systems are evaluated using words or short sentences recorded in cars in four languages: English, Spanish, German, and Finnish.
The three main objectives are summarized in sections 2-4. Section 5 describes the experimental results. Section 6 lists the publications resulting from the Ph.D. thesis and section 7 concludes the paper.

2. Deep Learning Backend for i-Vector Speaker Verification

We have proposed the use of DL as a backend in which a two-class hybrid Deep Belief Network (DBN)-Deep Neural Network (DNN) is trained for each target speaker to increase the discrimination between target i-vector/s and the i-vectors of other speakers (non-targets/impostors) (Fig. 2). Proposed networks are initialized with speaker-specific parameters adapted from a global model, which is referred to as Universal Deep Belief Network (UDBN). Then the cross-entropy between the class labels and the outputs is minimized using the back-propagation algorithm.
DNNs usually need a large number of input samples to be trained efficiently. In speaker recognition, target speakers can be enrolled with only one sample (single session task) or mul-
The Author is now with EML European Media Laboratory GmbH, Hei- tiple samples (multi-session task). In both cases, the number
delberg, Germany. The full thesis manuscript can be found Online in of target samples is very limited. A network trained with such
http://hdl.handle.net/2117/118780. limited data is highly probable to overfit. On the other hand, the
184 10.21437/IberSPEECH.2018-37
number of target and impostor samples will be highly unbalanced, i.e., one or a few target samples against thousands of impostor samples. Learning from such unbalanced data will result in DNNs biased towards the majority class.

[Figure 1 block labels: Background i-vectors; UDBN & DBN Adaptation; Universal DBN; Target/Impostor Labels; Decision]
Figure 1: Block-diagram of the proposed DL-based backend on i-vectors for target speaker modeling.

[Figure 2 labels: Inputs: a mix of target i-vector/s and the cluster centroids of selected impostor i-vectors. Initialization: speaker-specific parameters adapted from UDBN. Outputs: posterior probability of target and non-target classes.]
Figure 2: Proposed deep learning architecture for training of each speaker model.

Fig. 1 shows the block diagram of the proposed approach. Two main contributions are proposed in this thesis to tackle the above problems. The balanced training block attempts to decrease the number of impostor samples and, on the contrary, to increase the number of target ones in a reasonable and effective way. The most informative impostor samples for target speakers are first selected by the proposed impostor selection algorithm. Afterwards, the selected impostors are clustered and the cluster centroids are considered as final impostor samples for each target speaker model. Impostor centroids and target samples are then divided equally into minibatches to provide balanced impostor and target data in each minibatch. On the other hand, the DBN adaptation block is proposed to compensate for the lack of input data. As DBN training does not need any labeled data, the whole set of background i-vectors is used to build a global model, which is referred to as Universal DBN (UDBN). The parameters of the UDBN are then adapted to the balanced data obtained for each target speaker. At the end, given the target/impostor labels, the adapted DBN and the balanced data, a DNN is discriminatively trained for each target speaker. More details can be found in [16].

3. RBMs for Vector Representation of Speech

Recently, the advances in DL have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need phonetic labels for the background data. This thesis proposes an alternative vector-based representation of speakers that is less computationally expensive and uses no phonetic or speaker labels.

RBMs are good candidates for this purpose because they have good representational power, are unsupervised, and have a low computational cost. It is assumed in this work that the inputs of the RBM, i.e., the visible units, are GMM supervectors and the outputs, i.e., the hidden units, are the low dimensional vectors we are looking for. The RBM is trained given the background GMM supervectors and will be referred to as URBM. The role of the URBM is to learn the total session and speaker variability among the background supervectors. Different types of units and activation functions can be used for training the URBM but we have proposed a variant of ReLU, which will be referred to as Variable ReLU (VReLU), for this application. It will be shown in section 5 that the proposed VReLU does not suffer from the problems with sigmoid and ReLU and works the best. After training the URBM, the visible-hidden connection weight matrix is used to transform unseen GMM supervectors to lower dimensional vectors which will be referred to as GMM-RBM vectors in this work.

In fact, the proposed VReLU is defined as follows and is compared with the ReLU function in Fig. 3,

\[
f(x) = \begin{cases} x, & x > \tau \\ 0, & x \le \tau \end{cases}, \qquad \tau \sim \mathcal{N}(0, 1) \tag{1}
\]

Given the GMM supervectors and the URBM parameters, the GMM-RBM vectors are extracted as follows,

\[
\omega_r = W \, \Sigma_{ubm}^{-1/2} \, N^{-1}(u) \, \tilde{F}(u) \tag{2}
\]

where \Sigma_{ubm} is the diagonal covariance matrix of the UBM, W is the connection weight matrix from the URBM, and N(u) and \tilde{F}(u) are the zeroth and centralized first order Baum-Welch statistics, respectively.

As in the case of i-vectors, the resulting GMM-RBM vectors are mean normalized and whitened using the mean vector and the whitening matrix obtained on the background data.

The comparison of equation 2 with that of the i-vector in equation 3 clearly implies that GMM-RBM vector extraction needs much less computational load. More details can be found in [16].

\[
\omega = \left( I + T^{t} \Sigma^{-1} N(u) T \right)^{-1} T^{t} \Sigma^{-1} \tilde{F}(u) \tag{3}
\]
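As a reading aid for equation 2, the following minimal Python sketch applies the transform to toy statistics. It assumes a diagonal UBM covariance stored per mixture component, a zeroth-order count shared by all feature dimensions of a component, and a weight matrix W stored as a list of rows. The function name `gmm_rbm_vector`, the argument layout, and the handling of empty components (their entries are set to zero) are illustrative assumptions and are not taken from the thesis. Mean normalization and whitening would follow as a separate step.

```python
import math


def gmm_rbm_vector(W, ubm_var, N, F_centered):
    """Sketch of equation 2: omega_r = W * Sigma_ubm^(-1/2) * N^(-1)(u) * F~(u).

    W          : list of rows, shape (hidden_dim, C * D)        (assumed layout)
    ubm_var    : list of C lists with D diagonal UBM variances
    N          : list of C zeroth-order Baum-Welch statistics
    F_centered : list of C lists with D centralized first-order statistics
    """
    z = []
    for var_c, n_c, f_c in zip(ubm_var, N, F_centered):
        for v, f in zip(var_c, f_c):
            if n_c <= 0.0:
                # Assumption: a component with no frames contributes nothing.
                z.append(0.0)
            else:
                # Diagonal Sigma^(-1/2) and N^(-1) act element by element.
                z.append(f / (n_c * math.sqrt(v)))
    # One matrix-vector product with the URBM weights.
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
    ubm_var = [[4.0, 1.0], [1.0, 0.25]]
    N = [2.0, 4.0]
    F_centered = [[4.0, 2.0], [8.0, 2.0]]
    print(gmm_rbm_vector(W, ubm_var, N, F_centered))
```

Because both \Sigma_{ubm} and N(u) are diagonal, the normalization is a single pass over the supervector and the extraction reduces to one matrix-vector product, whereas equation 3 requires a matrix inversion for every utterance.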
[Figure 3 panels: (a) ReLU, (b) VReLU, (c) VReLU; each panel plots y against x with the line y = x.]
Figure 3: Comparison of ReLU and proposed VReLU. τ is randomly selected from a normal distribution with zero mean and unit variance for each hidden unit and each input sample. (b) and (c) are two examples of VReLU with positive and negative τ.
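The behaviour shown in Figure 3 can be illustrated with a few lines of plain Python. This is only a sketch: the function names are hypothetical, and drawing one fresh τ from N(0, 1) per hidden unit and per input sample with `random.Random.gauss` follows the figure caption but is not the thesis code.

```python
import random


def relu(x):
    """Standard ReLU: passes x only when it is positive."""
    return x if x > 0.0 else 0.0


def vrelu(x, tau):
    """Variable ReLU: passes x only when it exceeds the threshold tau."""
    return x if x > tau else 0.0


def vrelu_layer(activations, rng):
    """Apply VReLU to a list of pre-activations, one random tau per unit.

    Calling this once per input sample gives a new tau per unit and per sample.
    """
    return [vrelu(a, rng.gauss(0.0, 1.0)) for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    pre_activations = [-2.0, -0.5, 0.3, 1.5]
    print([relu(a) for a in pre_activations])
    print(vrelu_layer(pre_activations, rng))
```

With a negative τ the unit can pass small negative inputs that ReLU would zero out, and with a positive τ it blocks small positive inputs, which corresponds to the two VReLU panels of the figure.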
than the input layer size. From the second hidden layer towards the output, the size of each layer will be half of the previous layer. For example, the configuration of a 3-hidden-layer DNN will be 400-512-256-128-4, where 400 is the size of the input i-vectors and 4 is the number of language classes. It will be shown in section 5 that, in this way, we can decrease the computational complexity to a great extent while keeping the classification accuracy.

Two forms of i-vectors are considered as inputs to DNNs, raw i-vectors and session-compensated i-vectors. LDA and WCCN are two commonly used techniques for session variability compensation among i-vectors. Although LDA performs better than WCCN for the LID application when cosine scoring is used, we will use only WCCN session-compensated i-vectors as the inputs to DNNs. This is because the number of language classes is very small in this application and, therefore, the maximum number of meaningful eigenvectors will also be small (number of classes minus one). We implemented different DNN architectures with LDA-projected i-vectors as inputs but no gain was observed. The use of raw i-vectors is advantageous as no language-labeled background data is required. More details can be found in [16].

5. Experimental Results

This section summarizes the main results obtained in the experiments for each main contribution presented in sections 2-4.

The full database provided in the National Institute of Standards and Technology (NIST) 2014 speaker recognition i-vector challenge [17] is used for the experiments in section 2. Rather than speech signals, i-vectors are given directly by NIST in this challenge to train, test, and develop the speaker recognition systems. This enables system comparison more readily with consistency in the front-end and in the amount and type of the background data [17]. Three sets of 600-dimensional i-vectors are provided: development, train, and test consisting of 36,572, 6530, and 9634 i-vectors, respectively. The number of target speaker models is 1306 and for each of them five i-vectors are available. Each target model will be scored against all the test i-vectors and, therefore, the total number of trials will be 12,582,004. Three baseline systems are considered in this work for evaluation: cosine, PLDA with actual labels, and PLDA with estimated labels. The size of hidden layers is set to 400. Table 1 compares the performance of the proposed DNN systems with other baseline systems in terms of minDCF and EER. The interesting point is that the combination of the DNN-3L and PLDA with estimated labels at the score level improves the results to a great extent. The resulting relative improvement compared to the cosine baseline system is 36% in terms of minDCF on the evaluation set. This improvement with no use of background labels is considerable compared to the 45% relative improvement which can be obtained by PLDA with actual labels.

Table 1: Performance comparison of the proposed DNN system with other baseline systems on NIST 2014 i-vector challenge.

                               Progress Set         Evaluation Set
                               EER (%)   minDCF     EER (%)   minDCF
Unlabeled Background Data
[1] cosine                     4.78      0.386      4.46      0.378
[2] PLDA (Estimated Labels)    3.85      0.300      3.46      0.284
[3] Proposed DNN-1L            5.13      0.327      4.61      0.320
[4] Proposed DNN-3L            4.55      0.305      4.11      0.300
Fusion [2] & [4]               2.99      0.260      2.70      0.243
Labeled Background Data
[5] PLDA (Actual Labels)       2.23      0.226      2.01      0.207
Fusion [2] & [5]               2.04      0.220      1.85      0.204
Fusion [4] & [5]               2.13      0.221      2.00      0.196
Fusion [2] & [4] & [5]         1.88      0.204      1.74      0.190
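The percentages quoted for Table 1 can be checked with a short calculation. The sketch below assumes that both the relative improvements and the "performance gap" are measured on evaluation-set minDCF, with the gap taken between the cosine baseline and PLDA with actual labels; the helper names are illustrative.

```python
def relative_improvement(baseline, system):
    """Relative reduction (in %) of an error metric with respect to a baseline."""
    return 100.0 * (baseline - system) / baseline


def gap_filled(unlabeled_baseline, labeled_baseline, system):
    """Share (in %) of the gap between two baselines that a system closes."""
    gap = unlabeled_baseline - labeled_baseline
    return 100.0 * (unlabeled_baseline - system) / gap


# Evaluation-set minDCF values copied from Table 1.
COSINE = 0.378
PLDA_ACTUAL = 0.207
DNN_3L = 0.300
FUSION_2_4 = 0.243

if __name__ == "__main__":
    print("Fusion [2] & [4] vs cosine: %.1f%%" % relative_improvement(COSINE, FUSION_2_4))
    print("PLDA (actual) vs cosine:    %.1f%%" % relative_improvement(COSINE, PLDA_ACTUAL))
    print("Gap filled by DNN-3L:       %.1f%%" % gap_filled(COSINE, PLDA_ACTUAL, DNN_3L))
    print("Gap filled by fusion:       %.1f%%" % gap_filled(COSINE, PLDA_ACTUAL, FUSION_2_4))
```

Under these assumptions the rounded results agree with the 36% and 45% relative improvements stated above, and with the 46% and 79% gap figures given in the abstract.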
For the experiments in Section 3, the NIST 2010 SRE [18], core test-common condition 5, is used for evaluation. Table 2 compares the performance of GMM-RBM vectors, which are obtained with URBMs trained with ReLU and VReLU, with traditional i-vectors on the evaluation set. The use of the proposed VReLU shows better performance than the use of ReLU in both cosine and PLDA scoring. Finally, the best results are achieved with score fusion of i-vectors and GMM-RBM vectors, which shows about 7-7.5% and 4-6.5% relative improvements in terms of EER and minDCF, respectively, compared to i-vectors. For score fusion, the BOSARIS toolkit [19] is used.

Table 2: Performance comparison of proposed GMM-RBM vectors and conventional i-vectors on the evaluation set core test condition-common 5 of NIST SRE 2010. GMM-RBM vectors and i-vectors are of the same size of 400.

                                           cosine               PLDA
                                           EER (%)   minDCF     EER (%)   minDCF
[1] i-Vector                               6.270     0.05450    4.096     0.04993
[2] GMM-RBM Vector (Trained with ReLU)     6.638     0.06228    4.517     0.05085
[3] GMM-RBM Vector (Trained with VReLU)    6.497     0.06099    3.907     0.05184
Fusion [1] & [3]                           5.791     0.05238    3.814     0.04673

For the experiments of Section 4, the database has been recorded within the scope of the EU project SpeechDat-Car (LE4-8334) [20]. Table 3 summarizes the results for all the techniques in four categories based on the test signal durations: less than 2 sec, between 2 and 3 sec, more than 3 sec, and all durations. The first two categories are more interesting because the decision should be made fast in this application. Both i-vector+DNN systems show superior performance compared to the i-vector + LDA baseline system. The frame-based GMM-UBM baseline system works better than other systems only for test signals shorter than 2 sec. However, the accuracy is still high in comparison to other categories.

Table 3: Comparison of LID systems for short signals recorded in car. Performance values are reported based on LER (%).

Duration of Test Signals (in sec)   t < 2    2 ≤ t < 3   t > 3    All
Number of Samples                   2,472    2,355       5,591    10,418
[1] GMM-UBM                         9.98     4.56        4.70     6.02
[2] i-Vector + Cosine               17.28    6.58        5.00     8.09
[3] i-Vector + WCCN + Cosine        14.50    5.03        3.42     6.31
[4] i-Vector + LDA + Cosine         12.41    3.96        2.32     5.03
[5] i-Vector + WCCN + DNN           12.06    3.30        2.30     4.60
[6] i-Vector + DNN                  11.01    2.87        2.58     4.54
Fusion [6] & [4]                    11.63    3.41        1.95     4.48
Fusion [6] & [1]                    10.20    3.04        2.49     4.41
Fusion [6] & [4] & [1]              11.12    3.37        1.96     4.39

6. Publications

1. O. Ghahabi and J. Hernando, Restricted Boltzmann machines for vector representation of speech in speaker recognition, Computer Speech & Language, vol. 47, pp. 16-29, 2018.
2. O. Ghahabi and J. Hernando, Deep Learning Backend for Single and Multisession i-Vector Speaker Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807-817, Apr. 2017, (Awarded the best RTTH doctorate student article in 2017).
3. O. Ghahabi, A. Bonafonte, J. Hernando, and A. Moreno, Deep neural networks for i-vector language identification of short utterances in cars, in Proc. INTERSPEECH, 2016, pp. 367-371.
4. O. Ghahabi and J. Hernando, Restricted Boltzmann machine supervectors for speaker recognition, in Proc. ICASSP, 2015, pp. 4804-4808.
5. O. Ghahabi and J. Hernando, Deep belief networks for i-vector based speaker recognition, in Proc. ICASSP, May 2014, pp. 1700-1704, (Awarded the Qualcomm travel grant).
6. O. Ghahabi and J. Hernando, i-vector modeling with deep belief networks for multi-session speaker recognition, in Proc. Odyssey, 2014, pp. 305-310.
7. O. Ghahabi and J. Hernando, Global impostor selection for DBNs in multi-session i-vector speaker recognition, in Advances in Speech and Language Technologies for Iberian Languages, ser. Lecture Notes in Artificial Intelligence. Springer, Nov. 2014.
8. P. Safari, O. Ghahabi, and J. Hernando, From features to speaker vectors by means of restricted Boltzmann machine adaptation, in Proc. Odyssey, 2016, pp. 366-371.
9. P. Safari, O. Ghahabi, and J. Hernando, Speaker recognition by means of restricted Boltzmann machine adaptation, in Proc. URSI, 2016, pp. 1-4.
10. P. Safari, O. Ghahabi, and J. Hernando, Feature classification by means of deep belief networks for speaker recognition, in Proc. EUSIPCO, 2015, pp. 2162-2166.
11. G. Raboshchuk, C. Nadeu, O. Ghahabi, S. Solvez, B. M. Mahamud, A. Veciana, and S. Hervas, On the acoustic environment of a neonatal intensive care unit: Initial description, and detection of equipment alarms, in Proc. INTERSPEECH, 2014, pp. 2543-2547, (Awarded the ISCA travel grant).

7. Conclusions

The main contributions of this thesis have been presented in three main works. In the first one, a hybrid architecture based on DBN and DNN has been proposed to discriminatively model each target speaker for i-vector speaker verification. It was shown that the proposed hybrid system fills approximately 46% of the performance gap between the cosine and the oracle PLDA scoring systems in terms of minDCF. In the second work, a new vector representation of speech has been presented for text-independent speaker recognition. Gaussian Mixture Model (GMM) supervectors have been transformed by a Universal RBM (URBM) to lower dimensional vectors, referred to as GMM-RBM vectors. The experimental results show that the performance of GMM-RBM vectors is comparable with that of traditional i-vectors but with much less computational load. In the third work, a DNN architecture has been proposed for i-vector LID of short utterances recorded in cars. It has been shown that for test signals with a duration of 2-3 sec the proposed DNN architecture outperforms GMM-UBM and i-vector/LDA baseline systems by 37% and 28%, respectively.
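The relative improvements quoted for 2-3 sec signals can be checked against Table 3, assuming they compare the LER of system [6] (i-Vector + DNN, 2.87%) with the GMM-UBM baseline (4.56%) and the i-Vector + LDA + Cosine baseline (3.96%):

```latex
\[
\frac{4.56 - 2.87}{4.56} \approx 0.371 \;(\approx 37\%),
\qquad
\frac{3.96 - 2.87}{3.96} \approx 0.275 \;(\approx 28\%).
\]
```

Under this reading, both figures are relative reductions of the language error rate rather than absolute differences in percentage points.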
8. References

[1] A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp. 2846–2849.
[2] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[3] A. Senior, H. Sak, and I. Shafran, “Context Dependent Phone Models For LSTM RNN Acoustic Modelling,” in Proc. ICASSP, 2015, pp. 4585–4589.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[5] Y. Lei, N. Scheffer, L. Ferre, and M. Mclaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP, 2014, pp. 1714–1718.
[6] F. Richardson, D. Reynolds, and N. Dehak, “Deep Neural Network Approaches to Speaker and Language Recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
[7] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, Oct. 2015.
[8] T. Pekhovsky, S. Novoselov, A. Sholokhov, and O. Kudashev, “On autoencoders in the i-vector space for speaker recognition,” pp. 217–224, 2016.
[9] J. Villalba, N. Brümmer, and N. Dehak, “Tied variational autoencoder backends for i-vector speaker recognition,” in Proc. Interspeech, 2017, pp. 1004–1008.
[10] E. Variani, X. Lei, E. McDermott, I. Lopez Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP, 2014, pp. 4052–4056.
[11] S. Wang, Y. Qian, and K. Yu, “What does the speaker embedding encode?” Proc. Interspeech, pp. 1497–1501, 2017.
[12] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” Proc. Interspeech, pp. 1517–1521, 2017.
[13] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” Proc. Interspeech, pp. 999–1003, 2017.
[14] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. ICCV, 2007, pp. 1–8.
[15] P. Kenny, “Bayesian speaker verification with heavy tailed priors,” in Proc. Odyssey, 2010.
[16] O. Ghahabi, “Deep learning for i-vector speaker and language recognition,” PhD dissertation, Universitat Politecnica de Catalunya, 2018, [Online]. Available: http://hdl.handle.net/2117/118780.
[17] NIST. (2014) The NIST speaker recognition i-vector machine learning challenge. [Online]. Available: http://nist.gov/itl/iad/mig/upload/sre-ivectorchallenge 2013-11-18 r0.pdf.
[18] NIST, “The NIST year 2010 speaker recognition evaluation plan,” 2010, [Online]. Available: https://www.nist.gov/itl/iad/mig/speaker recognition evaluation 2010.
[19] N. Brummer and E. Villiers, “BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing,” 2011, [Online]. Available: https://sites.google.com/site/bosaristoolkit/.
[20] A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri, S. Euler, and J. Allen, “Speechdat-car. a large speech database for automotive environments,” in Proc. LREC, 2000.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
189 10.21437/IberSPEECH.2018-38
Figure 1: Clustering and synthesis framework.

Table 1: Perplexities (PP) for different feature sets for Expressions (Ex) and Characters (Ch) in comparison to the database.

                        PP/Ex    PP/Ch
DB                      140.4    8.3
silRate                 10.6     4.7
sylRate                 9.4      4.0
meanF0                  9.8      4.2
Rhythm-Pitch            8.7      3.4
Rhythm-Pitch-JShimm     8.6      3.4
MFCCiVec                9.0      3.5
F0iVec                  11.2     4.2
iVecC                   8.8      3.8
Rhythm-iVecC            8.2      3.3
Rhythm-JShimm-iVecC     8.5      3.5
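The perplexities in Table 1 are derived from entropy as PP = 2^H̄(X). A minimal sketch of such a computation is given below. It assumes that H̄ is the size-weighted mean, over clusters, of the Shannon entropy (in bits) of the annotated labels inside each cluster; this is one common reading and may differ in detail from the definition used in the cited work. The function names are illustrative.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy (base 2) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_perplexity(cluster_ids, labels):
    """Perplexity 2^H, with H the size-weighted mean entropy over clusters.

    cluster_ids : cluster index assigned to each annotated sentence
    labels      : expression (or character) label of each sentence
    """
    groups = {}
    for cid, lab in zip(cluster_ids, labels):
        groups.setdefault(cid, []).append(lab)
    total = len(labels)
    mean_h = sum(len(g) / total * entropy_bits(g) for g in groups.values())
    return 2.0 ** mean_h


if __name__ == "__main__":
    # Toy data: two clusters, the second one mixes two labels.
    clusters = [0, 0, 0, 1, 1, 1, 1]
    labels = ["happy", "happy", "happy", "sad", "angry", "sad", "angry"]
    print(cluster_perplexity(clusters, labels))
```

A perplexity of 1 means that every cluster contains a single label, and larger values mean that clusters mix more labels, so lower values in the table indicate feature sets whose clusters align better with the annotation.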
The idea is: use different feature sets to cluster expressive speech, use the data in the clusters to train expressive voices, and synthesize a dialogue using these expressive voices. The corpus is a juvenile narrative audiobook recorded in European Spanish, with a total of 7900 sentences and a duration of 8.8 hours. The clustering algorithm is k-means, concretely VQ, as in [12]. Many different feature combinations were tested. The results of some of them are presented below. Some features were combined into sets in order to simplify the notation.

• silRate is silence rate and sylRate is syllable rate (#/sec) (extracted with Ogmios [4]).
• Rhythm is silence and syllable rates (#/sec), duration means and variation, computation based on segmentation.
• Pitch is F0 means, variance and range.
• JShimm is local jitter and shimmer (extracted with [3]).
• MFCCiVec are i-vectors calculated on the basis of MFCCs; F0iVec are i-vectors calculated on the basis of F0. iVecC: F0 and MFCC based i-vectors (the acoustic features for the i-vectors were extracted with the AHOCoder [9], the i-vectors themselves were extracted using Kaldi [29]).

2.2. Objective evaluation

An objective evaluation was performed: a small part of the corpus was labeled with expression and character (speaker) labels (only for evaluation purposes) and, with the aid of these labels, the perplexity of the clusters was calculated, derived from entropy as in [33, 42]: PP = 2^{\bar{H}(X)}.

A summary of the results is shown in Table 1. Due to space limitations only a few selected results are shown. The top row shows the perplexity calculated for the annotated part of the corpus. The part below it shows the performance of some “traditional features” and the part at the bottom, the performance of sets which included i-vectors.

As can be seen, the combination of Rhythm and iVecC outperformed all other combinations in the given task.

In order to verify these results, an additional objective evaluation was conducted. For this evaluation, OpenSmile feature sets, as in openSMILE Book [10], were compared to the proposed sets. OpenSmile is a set of feature extraction tools widely used for emotional and expressive speech analysis and synthesis. It extracts thousands of features and statistics about them and is considered to be state-of-the-art feature extraction for expressive speech. Table 2 compares some OpenSmile feature sets to the winning i-vector set: Rhythm & iVecC.¹ Also here, the proposed combination of Rhythm & iVecC outperformed all OpenSmile sets.

Table 2: Perplexities for different feature combinations, including openSMILE, for expressions (Ex) and for characters (Ch).

                PP/Ex    PP/Ch
DB              140.4    8.3
is09            10.5     3.9
is10            10.8     4.0
emobase         10.5     4.0
emolarge        8.8      3.7
Rhythm-iVecC    8.2      3.3

¹ is09, is10, emobase, emolarge are feature sets by OpenSmile used in different experiments. For further details please refer to openSMILE Book [10].

2.3. Subjective evaluation

Two subjective experiments were conducted. For these experiments, for a given dialogue from the same audiobook (test set excluded from the training set), a set of synthetic voices was trained using the data in the clusters. The underlying system was an HMM-based TTS [24] where the average voice was trained using the whole corpus (approx. 10 h) and the cluster data was used to perform adaptation. A total of 16 sentences was presented to a number of participants. The task: design your own audiobook dialogue using synthetic voices instead of the real ones.

The interface is a website, where the participants could choose 1 of 10 synthetic voices for each sentence in a dialogue. The website design aimed to create the right atmosphere of the book story and a more enjoyable experience. An introduction text was also provided in case the participants were not familiar with the story.

The experiments differ in that in the first experiment the participants had an example of the original character voice, while in the second they did not. Also, in the first experiment, the synthetic voices were chosen manually by their resemblance to the original voices, and mixed with other random voices. In the second, the choice was made automatically by acoustic distance for half of the sentences; the rest was random.
The idea behind this: if some voices are especially suitable for some characters, the participants would tend to prefer them.

In the first experiment, 19 persons participated; in the second, 11 persons. Due to space limitations only the results for the second experiment are shown in Table 3.

Table 3: Relative preferences for the voices v0-v9 over the whole paragraph for the narrator (Narr) and the two present characters (Ch2 and Ch3).

        v0     v1     v2     v3     v4     v5     v6     v7     v8     v9
Narr    0.42   0.06   0.00   0.03   0.04   0.23   0.04   0.10   0.06   0.01
Ch2     0.13   0.16   0.14   0.23   0.03   0.09   0.05   0.03   0.10   0.03
Ch3     0.18   0.13   0.13   0.31   0.00   0.18   0.00   0.00   0.00   0.05

The results show that there is an actual preference for some voices over others. Of course, not all participants imagine the book characters in the same way, especially if they don’t know the book. So individual preferences are out of the scope of this task. But the results show clearly that the approach as well as the proposed feature sets are suitable for the task. The task itself can be interpreted as a simulation of a real-life application of expressive speech. Further details on the experiments and the results are published in [17, 16, 14].

3. Semantics-to-Acoustics Mapping

When we humans read a text aloud, let’s say a good-night story to a small child, we will probably read the story with an expressive voice, imitating book characters, their emotions in different situations etc., so as to engage the child. Although we would all likely read slightly differently, the “quality” of our reading will possibly be judged by our expressive abilities. The question for this section is: what in the text provides us with the necessary information to adequately adapt our reading style, and can it be taught to a machine?

The key approach is the automatic representation of text. Such representations are often called embeddings and there is a large number of techniques to calculate them, for instance [21, 2, 26, 34]. The basic idea is to represent text in terms of word co-occurrences of defined text units (e.g. sentences, paragraphs, etc.). This way each unit is represented as a co-occurrence vector of its own words. More modern techniques train neural networks to predict a certain feature and then extract the vector representations from intermediate layers of the network, like [26] and [34].

The assumption is that these representations also codify expressive information, especially if the underlying network is trained using an “expressive” criterion (see Section 4). So the task is to convert this vector representation into acoustics.

3.1. Experimental framework

The framework is: in a given text corpus, in this case an audiobook, for each sentence of the text a semantic embedding is calculated. This embedding is then used to predict an acoustic feature vector, concretely from the above experiment, which for its part is the centroid of a data cluster. As in the experiments above, these data clusters are used for adaptation in an HMM TTS. The performance of the system is evaluated in two subjective experiments. For the embeddings the toolkit word2vec [40, 25] was used to calculate the word embeddings; the sentence embeddings were calculated as centroids of the word embeddings in the vector space. The vector space has been trained with the Wikicorpus [30].

3.2. Subjective evaluation

The task in the first experiment was to read two book paragraphs, automatically predicting expressiveness for each sentence. For each sentence in the paragraph an embedding was calculated. It was used to predict an acoustic feature vector as in Section 2. Two prediction models were compared: a nearest-neighbour classifier and a neural network. Both paragraphs were extracted from different books of the same series, so as to preserve the characters and the ambience. The expressive readings were also compared to a neutral reading. The task was implemented as a preference test. The participants had the option to indicate that two systems performed equally.

A total of 21 persons participated in the experiment. Table 4 shows the results of this experiment for both paragraphs (P1 and P2). The results show a clear preference for the expressive systems.

Table 4: Prediction method preferences by users for the first two tasks. DNN method, nearest neighbor (NN) method, neutral voice.

      DNN    NN     neutral   DNN=NN   NN=neutral
P1    0.19   0.43   0.0       0.38     0.0
P2    0.29   0.14   0.04      0.48     0.05

In a further test the prediction system was used as a “search engine” for expressive training data. For this purpose, a keyword called seed was used to predict an acoustic vector, which in turn was used as a cluster centroid for acoustic data and for voice adaptation. For example, “Mysterious secret in silent obscurity” was used as seed to find training data for a suspense voice. Other trained emotions were angry, happy and sadness, as well as a neutral voice. Seven sentences were synthesized with each of these voices and, again, a preference test was presented to the 21 participants. The synthesized sentences were chosen to reflect their expressive meaning. For example, “Finally, the holidays begin!” is supposed to be happy and the expectation was that the participants would choose the happy voice for it. Table 5 presents the results for this experiment.

Table 5: Task 3. Voice preference by users for each sentence.

            happy   angry   suspense   neutral
Happy1      0.29    0.38    0.24       0.10
Happy2      0.52    0.24    0.10       0.14
Angry1      0.14    0.48    0.24       0.14
Angry2      0.38    0.43    0.14       0.05
Suspense    0.0     0.05    0.81       0.14
Sadness     0.19    0.05    0.43       0.43
Neutral     0.10    0.05    0.43       0.43

The preferences for voices are pretty clear. It is interesting to remark that happy and angry voices are sometimes interchangeable; the same can be said about the sad and neutral voices in
their respective contexts. Further details on these experiments general preference results will be shown here. More details can
can be found in [15, 14]. be found in [14, 18].
Table 6 shows the preferences divided by the sentiment.
4. NN-based expressive TTS with sentiment For positive and negative sentences, the word level system per-
formed best, although for negative sentences with high variance.
Looking back at the experiment in the previous chapter there For neutral sentences, the word context and tree distance system
are few spontaneous suggestions which can be made as to de- performed best. Possibly it is due to the fact that it probably has
velop the approach and improve the results. First, nowadays an equilibrating effect.
HMM-based synthesis is almost completely replaced by syn- T-tests show that for negative sentences, there is a signifi-
thesis based on neural networks. NN-based TTS provides new cant difference between the system without sentiment and the
possibilities of leveraging semantic vectors and avoiding clus- word level system, and no significant difference for the other
tering, which is an advantage itself since all data is always taken systems. For neutral sentences, there is a significant difference
into account in the training process. For this experiment the between the system without sentiment and the word context and
DNN-based TTS as described in [36] was used. tree distance system, but not for the other systems. For positive
The second point is the usage of embeddings which are sentences, there is only significant difference for the one-tailed
more suitable to represent expressiveness. The authors in t-test between the system without sentiment and the word level
[20, 34] propose the Stanford Sentiment Parser. The system system.
is trained on movie reviews and predicts the positivity or nega-
tivity of the sentence. Table 6: System preferences for positive, negative and neutral
In previous work, neural network based systems have al- sentences. ws: without sentiment, wcd: word context and tree
ready been combined with semantic vector input, though not distance, wl: word level
for expressive speech. To name a few, [37] use word embed-
dings to substitute TOBI and POS tags in RNN-based synthesis
achieving significant system improvement. [38] enhance the in-
put to NN-based systems with continuous word embeddings, ws wcd wl
and also try to substitute the conventional linguistic input by positive mean 1.84 1.85 1.71
the word embeddings. They do not achieve performance im- positive variance 0.54 0.76 0.54
provement, however, when they use phrase embeddings com- negative mean 2.06 1.96 1.84
bined with phonetic context, they do achieve significant im- negative variance 0.52 0.67 1.1
provement in a DNN-based system. [38] enhances word vectors neutral mean 2 1.83 1.96
with prosodic information, i.e. updates them, achieving signifi- neutral variance 0.71 0.6 0.95
cant improvements.
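The reading of Table 6 can be checked mechanically. The short sketch below is only an illustration: it copies the mean scores from Table 6 and assumes, in line with the text, that a lower mean indicates a more preferred system. The names `TABLE6_MEANS` and `best_per_sentiment` are ours and do not belong to any system described here.

```python
# Mean preference scores copied from Table 6 (columns: ws, wcd, wl).
# Assumption: a lower mean indicates a more preferred system.
TABLE6_MEANS = {
    "positive": {"ws": 1.84, "wcd": 1.85, "wl": 1.71},
    "negative": {"ws": 2.06, "wcd": 1.96, "wl": 1.84},
    "neutral": {"ws": 2.00, "wcd": 1.83, "wl": 1.96},
}


def best_system(means):
    """Return the system name with the lowest mean score."""
    return min(means, key=means.get)


def best_per_sentiment(table):
    """Map each sentiment class to its best-scoring system."""
    return {sentiment: best_system(means) for sentiment, means in table.items()}


if __name__ == "__main__":
    for sentiment, system in best_per_sentiment(TABLE6_MEANS).items():
        print(sentiment, system)
```

Under that assumption the word level system comes first for positive and negative sentences and the word context and tree distance system for neutral ones, which matches the description of the table.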
7. References

[1] R. Barra-Chicote, J. Yamagishi, S. King, J. Montero, and J. Macias-Guarasa. Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52:394–404, 2010.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[3] P. Boersma and D. Weenink. Praat: doing phonetics by computer (version 5.4.07), 2015.
[4] T. Bonafonte, P. Aguero, J. Adell, J. Perez, and A. Moreno. Ogmios: the UPC text-to-speech synthesis system for spoken translation. In Proceedings of TC-STAR Workshop on Speech-to-Speech Translation, pages 199–204, 2006.
[5] F. Burckhardt and W.F. Sendelmeier. Verification of acoustical correlates of emotional speech using formant synthesis. In Proceedings of ISCA Workshop on Speech and Emotion, pages 151–156, 2000.
[6] J.E. Cahn. Generation of affect in synthesized speech. In Proceedings of American Voice I/O Society, pages 251–256, 1989.
[7] L. Chen, M.J.F. Gales, N. Braunschweiler, M. Akamine, and K. Knill. Integrated expression prediction and speech synthesis from text. IEEE Journal of Selected Topics in Signal Processing, 8(2):323–335, 2014.
[8] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 19(4):788–798, 2011.
[9] D. Erro, I. Sainz, E. Navas, and I. Hernaez. Improved HNM-based vocoder for statistical synthesizers. In Proceedings of Interspeech, pages 1809–1812, 2011.
[10] F. Eyben. The opensmile book, 2016.
[11] F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan, M. Gales, and K. Knill. Unsupervised clustering of emotion and voice styles for expressive TTS. In Proceedings of ICASSP, pages 4009–4012, 2012.
[12] R.M. Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984.
[13] W. Hamza, R. Bakis, E. Eide, M. Picheny, and J. Pitrelli. The IBM expressive speech synthesis system. In Proceedings of ICSLP, pages 2577–2580, 2004.
[14] I. Jauk. Unsupervised Learning for Expressive Speech Synthesis. PhD thesis, Universitat Politècnica de Catalunya, 2017.
[15] I. Jauk and A. Bonafonte. Direct expressive voice training based on semantic selection. In Proceedings of Interspeech, pages 3181–3185, 2016.
[16] I. Jauk and A. Bonafonte. Prosodic and spectral ivectors for expressive speech synthesis. In Proceedings of Speech Synthesis Workshop 9, pages 59–63, 2016.
[17] I. Jauk, A. Bonafonte, P. López-Otero, and L. Docio-Fernandez. Creating expressive synthetic voices by unsupervised clustering of audiobooks. In Interspeech 2015, pages 3380–3384, 2015.
[18] I. Jauk, J. Lorenzo-Trueba, J. Yamagishi, and A. Bonafonte. Expressive speech synthesis using sentiment embeddings. In Proceedings of Interspeech, pages 3062–3066, 2018.
[19] R. Kehrein. The prosody of authentic emotions. In Proceedings of Speech Prosody, pages 423–426, 2002.
[20] D. Klein and C.D. Manning. Accurate unlexicalized parsing. ACL, pages 423–430, 2003.
[21] K. Kuttler. An introduction to linear algebra. Brigham Young University, 2007.
[22] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. iVectors for continuous emotion recognition. In Proceedings of Iberspeech 2014, pages 31–40, 2014.
[23] J. Lorenzo Trueba. Design and Evaluation of Statistical Parametric Techniques in Expressive Text-To-Speech: Emotion and Speaking Styles Transplantation. PhD thesis, E.T.S.I. Telecomunicación (UPM), 2016.
[24] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai. Speech synthesis from HMMs using dynamic features. In Acoustics, Speech, and Signal Processing (ICASSP), pages 389–392, 1996.
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013.
[26] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. ArXiv e-prints, October 2013.
[27] J.M. Montero, J. Gutierrez-Arriola, J. Colas, E. Enriquez, and J.M. Pardo. Analysis and modeling of emotional speech in Spanish. In Proceedings of ICPhS, pages 671–674, 1999.
[28] I.R. Murray and J.L. Arnott. Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. The Journal of the Acoustical Society of America, pages 1097–1108, 1993.
[29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.
[30] S. Reese, G. Boleda, L. Cuadros, M. Padró, and G. Rigau. Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus. In Proceedings of 7th Language Resources and Evaluation Conference (LREC10), pages 1418–1421, 2010.
[31] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000.
[32] B. Schuller, R. Müller, M. Lang, and G. Rigoll. Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. In Proceedings of Interspeech, pages 805–808, 2005.
[33] C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[34] R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
[35] E. Szekely, J. Cabral, P. Cahill, and J. Carson-Berndsen. Clustering expressive speech styles in audiobooks using glottal source parameters. In Proceedings of Interspeech, pages 2409–2412, 2011.
[36] S. Takaki and J. Yamagishi. Constructing a deep neural network based spectral model for statistical speech synthesis. Recent Advances in Nonlinear Speech Processing, 48:117–125, 2016.
[37] P. Wang, Y. Qian, F.K. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883, 2015.
[38] X. Wang, S. Takaki, and J. Yamagishi. Investigating of using continuous representation of various linguistic units in neural network based text-to-speech synthesis. IEICE Transactions on Information and Systems, E99-D(10):2471–2480, 2016.
[39] O. Watts. Unsupervised Learning for Text-to-Speech Synthesis. PhD thesis, University of Edinburgh, 2012.
[40] word2vec Tool for computing continuous distributed representations of words.
[41] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Modeling of various speaking styles and emotions for HMM-based speech synthesis. In Proceedings of Eurospeech, pages 2461–2464, 2003.
[42] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
DOI: 10.21437/IberSPEECH.2018-39
[Figure 1: pipeline diagram of the submitted runs. Block labels: Face tracking, Recognition, Speaker segmentation, Clustering, Mapping, Resegmentation. Hyper-parameters: θfa, θsc, λsc. Outputs: pface, c1face, c2face, pspeaker, c1speaker, c2speaker.]
One can then assign each cluster to the closest target t* by comparing the cluster embedding x to each target embedding x_t, computed as the average embedding extracted from all their enrolment pictures:

$$t^* = \operatorname*{argmin}_{t \in \{1 \ldots T\}} d(x, x_t) \qquad (3)$$

If d(x, x_t*) is greater than a tunable threshold θfa, the cluster is considered to be a non-target person and is therefore not returned by the system. This approach is denoted by pface in Figure 1 and constitutes the “face” part of the PLUMCOT primary submission and of all ODESSA submissions.
A variant of this cluster-wise face recognition approach is to perform recognition directly at the level of face tracks (i.e. without prior face clustering). This variant is denoted by c2face in Figure 1 and constitutes the “face” part of PLUMCOT contrastive submission #2.

3. Speaker diarization

This section describes the building blocks of the “speaker” part of PLUMCOT runs. They all rely on the speaker diarization approach introduced in [10].

3.2. Speaker embedding

The embedding architecture used is the one introduced in [2] and further improved in [13]. In the embedding space, using the triplet loss paradigm, two sequences x_i and x_j of the same speaker (resp. two different speakers) are expected to be close to (resp. far from) each other according to their angular distance. The embeddings are trained on the VoxCeleb corpus.

3.3. Speech turn clustering

As proposed in [10], we use the Affinity Propagation (AP) algorithm [14] to cluster speech turns. Contrary to other clustering methods, AP does not require a prior choice of the number of clusters. All speech segments are potential cluster centers (exemplars). Taking as input the pair-wise similarities between all pairs of speech segments, AP selects the exemplars and associates all other speech segments to an exemplar. In our case, the similarity between the ith and jth speech segments is the negative angular distance between their embeddings. AP has two hyper-parameters: preference θsc and damping factor λsc.
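As an illustration of the input that Affinity Propagation expects, the sketch below builds the pair-wise similarity matrix from a list of embeddings, using the negative angular distance described above. It is a minimal sketch and not the implementation used for the submissions: the function names are ours, and placing the preference on the diagonal follows the standard AP formulation of [14], which we assume applies here.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def similarity_matrix(embeddings, preference):
    """Negative angular distance for every pair of segments.

    The diagonal holds the preference, i.e. the a priori suitability
    of each segment to be chosen as an exemplar.
    """
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = preference
            else:
                sim[i][j] = -angular_distance(embeddings[i], embeddings[j])
    return sim
```

A matrix of this kind could then be handed to an off-the-shelf AP implementation, for instance scikit-learn's `AffinityPropagation` with `affinity="precomputed"`, passing the preference and damping factor explicitly; whether the submitted systems do exactly this is not stated in the paper.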
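Returning to the cluster-wise face recognition rule of Equation 3, the decision can be sketched as follows. This is a hedged illustration only: the Euclidean distance, the target names and the threshold value are assumptions made for the example, and `assign_target` is a hypothetical helper rather than part of any toolkit mentioned in this paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors (assumed choice for d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_embedding(vectors):
    """Component-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_target(cluster_embedding, enrolment, threshold, distance=euclidean):
    """Return the closest target (Equation 3), or None for a non-target.

    `enrolment` maps each target name to the embeddings extracted from
    its enrolment pictures; each target is represented by their average.
    """
    targets = {name: mean_embedding(vs) for name, vs in enrolment.items()}
    best = min(targets, key=lambda name: distance(cluster_embedding, targets[name]))
    if distance(cluster_embedding, targets[best]) > threshold:
        return None  # farther than the threshold: treated as non-target
    return best
```

With two hypothetical targets, a cluster lying close to one target's average embedding is labelled with that target, while a cluster far from both is rejected and not returned.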
4. Multimodal fusion

This section describes our attempts at improving speaker diarization with face clustering, and vice versa. Those two approaches were respectively submitted as PLUMCOT primary run (4.1) and PLUMCOT first contrastive run (4.2).

4.1. Improving speaker diarization with face clustering

Let us assume that there are N speakers according to speaker diarization, and M persons according to face clustering (or recognition). Let K ∈ R^(N×M) be the co-occurrence matrix of the output of both pipelines: K_ij is the overall duration in which speaker i ∈ {1 … N} is speaking and person j ∈ {1 … M} is visible.
The main intuition motivating this approach arises from the following observation about broadcast news videos: most of the time, the camera is pointing at the current speaker. Therefore, the proposed approach simply updates each speaker cluster by assigning it to the most co-occurring face cluster:

$$i \leftarrow \operatorname*{argmax}_{j \in \{1 \ldots M\}} K_{ij} \qquad (4)$$

Thanks to the joint optimization (described in Section 5) of stopping criteria for both face clustering and speaker diarization, we anticipate that this approach will “choose” to favour smaller (but purer) speaker clusters than the purely monomodal speaker diarization pipeline. A speaker divided into several small clusters may then be merged back together thanks to (a hopefully better) face clustering and Equation 4.

4.2. Filtering face detection with speech activity detection

[Figure 2: Face tracks within long non-speech regions (red) are removed. The diagram shows speech turns and face tracks on a shared time axis, with a long silence marked.]

Our face detection and tracking module tends to detect lots of non-target faces, leading to a huge amount of false alarms (e.g. in crowds, in credits at the end of TV shows, etc.). As depicted in Figure 2, we propose a very simple solution to this problem: filtering face tracks in long non-speech regions.

5. Hyper-parameters joint optimization

As mentioned in Section 4.1, the various modules of PLUMCOT runs are jointly optimized. For instance, the “speaker” part of PLUMCOT primary run is the combination of two modules with their own set of hyper-parameters: face clustering (θfc) and speaker diarization (θsc and λsc). Instead of tuning the former for optimal face clustering performance and the latter for optimal speaker diarization separately, the whole pipeline (including the assignment step described in Equation 4) is jointly optimized.
Practically, we use the Covariance Matrix Adaptation Evolution Strategy minimization method [15] available in the chocolate library (chocolate.readthedocs.io) to automatically select the set of hyper-parameters that minimizes the speaker diarization error rate for the “speaker” part and the face diarization error rate for the “face” part.

6. Submissions

Figure 1 summarizes the primary and two contrastive runs of the PLUMCOT consortium. All of them have been introduced in the previous sections of this paper.
The ODESSA consortium mostly focused on the monomodal speaker diarization aspects of the task. Therefore, ODESSA submissions to the “speaker” part of the multimodal diarization challenge rely on the same systems used for its open-set submissions to the speaker diarization challenge: the fusion at similarity level of various speech turn representations (such as neural embeddings and binary keys). More information can be found in [16]. All three ODESSA submissions use the same “face” part as the PLUMCOT primary submission.

7. Experimental protocol

7.1. RTVE2018 corpus

The RTVE2018 dataset is a collection of diverse TV shows aired between 2015 and 2018 on the public Spanish National Television (RTVE). The development subset of the RTVE2018 database contains a single 2-hour show, “La noche en 24H”, labeled with speaker and face timestamps. It also contains 11 additional files (for a total duration of 14 hours) labeled with speaker timestamps only. Enrollment files for the target persons are also provided: they consist of a few pictures and one short video for each target.
The evaluation set contains 3 video files of almost 2 hours each of TV shows labeled with speaker and face timestamps. However, at the time of the submission of this paper, we have no results on the test set, so none are reported here.

7.2. Evaluation metric

The evaluation metric used for this task is the diarization error rate (DER) defined as follows:

$$\mathrm{DER} = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}} \qquad (5)$$

where false alarm is the duration of non-speech incorrectly classified as speech, missed detection is the duration of speech incorrectly classified as non-speech, confusion is the duration of speaker confusion, and total is the total duration of speech in the reference. Note that this metric does take overlapping speech into account, potentially leading to increased missed detection in case the speaker diarization system does not include an overlapping speech detection module. DER is a standard metric for evaluating and comparing speaker diarization systems, but it can also be applied to face clustering by replacing speech turns by face tracks.

7.3. Implementation details

7.3.1. Face clustering and recognition

As already stated in Section 2, we use the pre-trained face detector and face embedding available in the dlib library [7], wrapped in our pyannote.video toolkit (github.com/pyannote/pyannote-video). All hyper-parameters of the face
clustering and recognition pipeline are jointly optimized in order to minimize the (face) diarization error rate on the only annotated video of the RTVE2018 development set provided by the organizers of the challenge.

7.3.2. Speaker diarization

Feature extraction. All modules in the speaker diarization pipeline share the same feature extraction step: 19 MFCC coefficients (with their first and second derivatives, and the first and second derivatives of the energy) are extracted every 10 ms on 25 ms windows. The only exception is the re-segmentation step, which does not use any derivatives.
Segmentation. Both speech activity and speaker change detection modules are trained with the Catalan broadcast news database from the 3/24 TV channel proposed for the 2010 Albayzin Audio Segmentation Evaluation [17]. We use the exact same configuration as the one described in [10]: stacked bi-directional LSTMs and multi-layer perceptron on 3.2 s sliding windows.
Speaker embedding. Speaker embeddings are trained using the VoxCeleb1 dataset [18]. We use the exact same architecture as the one used in [13] (stacked bi-directional LSTMs on a 3 s window) and the training process introduced in [19] (triplet loss with angular distance).
Speaker diarization pipeline. Once every module is trained, hyper-parameters of the speaker diarization pipeline are jointly optimized in order to minimize the diarization error rate on the development set (dev2) of the RTVE2018 corpus provided by the organizers of the challenge.

8. Results and discussion

Table 1 summarizes the performance of each submission on the development set. Official results on the test set were not available at the time of writing the paper.

Consortium   Run             Speaker   Face
PLUMCOT      primary         6.86      28.15
             contrastive 1   10.59     28.15
             contrastive 2   10.68     31.01
ODESSA       primary         7.21      28.15
             contrastive 1   9.29      28.15
             contrastive 2   11.46     28.15

Table 1: Diarization error rate (%) on the development set

Comparing the “speaker” parts of the PLUMCOT primary run (DER = 6.86%) and contrastive run #1 (DER = 10.59%) shows that speaker diarization can be greatly improved when guided by face clustering: this amounts to a relative improvement of 35%. Face clustering also helps significantly for face recognition: it is improved from DER = 31.01% for track-wise face recognition (c2face) to DER = 28.15% for cluster-wise face recognition (pface).
There is no difference between the “face” parts of the primary run and contrastive run #1, possibly because the faces found during long silences had already been removed by the recognition threshold θfa applied against the enrollment data.
While cluster-wise face recognition (DER = 28.15%, pface) is better than raw face clustering (DER = 46.02%, not shown in Table 1) for the “face” part, the latter does lead to better “speaker” performance than the former when jointly optimized with the speaker diarization pipeline (pspeaker gets DER = 6.86% while c2speaker only gets DER = 10.68%). This shows the benefit of the joint optimization of hyper-parameters: a better “face” system does not necessarily lead to a better multimodal “speaker” pipeline.
As described in detail in [16], the ODESSA “speaker” primary run is the combination at similarity level of three different representations (x-vector trained on NIST SRE data, triplet loss embedding trained on VoxCeleb and binary key). This complex system reaches DER = 7.21%, which still falls short of the simpler multimodal PLUMCOT primary run (that combines triplet loss speaker embedding and neural face embedding) with DER = 6.86%. One could hope that combining both approaches would help us get even closer to perfect diarization.

9. Conclusion and future work

We have conducted experiments on monomodal face clustering and speaker diarization and shown an improvement of the results when we combine them into a multimodal approach. It has also been shown that combining two monomodal approaches tuned separately does not automatically lead to the best results: one should rather tune them jointly using a global optimization process.
While results of the multimodal approaches are promising, there is still room for improvement. In particular, we plan to investigate the use of the talking-face detection approach introduced in [20] to improve the module in charge of mapping face clusters with speaker clusters.
Finally, we would like to highlight the fact that the code for most monomodal building blocks is available for other researchers to use (github.com/pyannote/pyannote-audio and github.com/pyannote/pyannote-video).

10. Acknowledgements

This work was partly supported by ANR through the ODESSA (ANR-15-CE39-0010) and PLUMCOT (ANR-16-CE92-0025) projects.

11. References

[1] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, “Albayzin evaluation: Iberspeech-rtve 2018 multimodal diarization challenge,” 06 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-Multimodal-v1.3.pdf
[2] H. Bredin and G. Gelly, “Improving speaker diarization of tv series using talking-face detection and clustering,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 157–161.
[3] Y. Yusoff, W. Christmas, and J. Kittler, “A study on automatic shot change detection,” in European Conference on Multimedia Applications, Services, and Techniques. Springer, 1998, pp. 177–189.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
[5] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[7] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Ma-
chine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.
[8] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning
large face datasets,” in Image Processing (ICIP), 2014 IEEE In-
ternational Conference on. IEEE, 2014, pp. 343–347.
[9] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recog-
nition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
[10] R. Yin, H. Bredin, and C. Barras, “Neural Speech Turn Seg-
mentation and Affinity Propagation for Speaker Diarization,” in
19th Annual Conference of the International Speech Communica-
tion Association, Interspeech 2018, Hyderabad, India, September
2018.
[11] G. Gelly and J.-L. Gauvain, “Minimum Word Error Training of RNN-based Voice Activity Detection,” in 16th Annual Conference of the International Speech Communication Association, Interspeech 2015, Dresden, Germany, September 2015, pp. 2650–2654.
[12] R. Yin, H. Bredin, and C. Barras, “Speaker Change Detection in
Broadcast TV using Bidirectional Long Short-Term Memory Net-
works,” in 18th Annual Conference of the International Speech
Communication Association, Interspeech 2017, Stockholm, Swe-
den, August 2017.
[13] G. Wisniewski, H. Bredin, G. Gelly, and C. Barras, “Combining speaker turn embedding and incremental structure prediction for low-latency speaker diarization,” in 18th Annual Conference of the International Speech Communication Association, Interspeech 2017, Stockholm, Sweden, September 2017.
[14] B. J. Frey and D. Dueck, “Clustering by Passing Messages Between Data Points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.
[15] C. Igel, N. Hansen, and S. Roth, “Covariance matrix adapta-
tion for multi-objective optimization,” Evolutionary Computation,
vol. 15, no. 1, pp. 1–28, 2007.
[16] J. Patino, H. Delgado, R. Yin, H. Bredin, C. Barras, and N. Evans,
“ODESSA at Albayzin Speaker Diarization Challenge 2018,” in
IberSPEECH, Barcelona, Spain, November 2018.
[17] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The albayzin
2012 audio segmentation evaluation.” Iberspeech 2012, 2012.
[18] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-
scale speaker identification dataset,” in 18th Annual Conference
of the International Speech Communication Association, Inter-
speech 2017, Stockholm, Sweden, September 2017.
[19] H. Bredin, “TristouNet: Triplet Loss for Speaker Turn Embed-
ding,” in ICASSP 2017, IEEE International Conference on Acous-
tics, Speech, and Signal Processing, New Orleans, USA, March
2017.
[20] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in
the wild,” in Workshop on Multi-view Lip-reading, ACCV, 2016.
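As a closing illustration of the assignment step of Equation 4 in the paper above, the sketch below computes the co-occurrence between speech turns and face tracks and relabels each speaker cluster with its most co-occurring face cluster. It is a minimal sketch under stated assumptions: segments are given as hypothetical (start, end, label) tuples in seconds, and a speaker that never co-occurs with any face keeps its own label, a case the paper does not discuss.

```python
def overlap(a, b):
    """Duration of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def cooccurrence(speech_turns, face_tracks):
    """K[speaker][person]: total time the speaker talks while the person is visible."""
    K = {}
    for s_start, s_end, speaker in speech_turns:
        row = K.setdefault(speaker, {})
        for f_start, f_end, person in face_tracks:
            duration = overlap((s_start, s_end), (f_start, f_end))
            if duration > 0:
                row[person] = row.get(person, 0.0) + duration
    return K


def map_speakers_to_faces(K):
    """Equation 4: assign each speaker cluster to its most co-occurring face cluster."""
    mapping = {}
    for speaker, row in K.items():
        # Assumption: with no co-occurring face, the speaker label is kept.
        mapping[speaker] = max(row, key=row.get) if row else speaker
    return mapping
```

Two speaker clusters that map to the same face cluster end up with the same label, which is how a speaker split into several small clusters can be merged back together.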
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge
Miquel India, Itziar Sagastiberri, Ponç Palau, Elisa Sayrol, Josep Ramon Morros, Javier Hernando
DOI: 10.21437/IberSPEECH.2018-40
Our approach uses tracking by detection: first, all the faces in the video sequence are detected. For this, we use a detector based on HOG+SVM (Histogram of Oriented Gradients + Support Vector Machine) [7] from the dlib [8] library.
Once the faces have been detected, a KLT (Kanade-Lucas-Tomasi) tracker [9, 10, 11] is used to relate the detections in successive frames. We used the implementation (https://github.com/MediaevalPersonDiscoveryTask/Baseline2015) provided for the baseline system of the Multimodal Person Discovery in Broadcast TV task in MediaEval 2015 [2].
As mentioned previously, a face track provides the spatial location of a set of faces of a given individual, which are used for feature extraction, and the temporal interval where this person is visible in the video.
In the video system, the track is the basic unit of recognition: we output a result for each track that is classified as belonging to one of the known persons. Tracks classified as unknown are discarded and no output is provided.
To characterize each track, we follow a two-step process: first, a feature vector is extracted for each detected face in the track. Then, the final feature vector for the track is obtained by averaging all the track’s feature vectors.
These feature vectors are obtained using the last fully connected layer of a Deep Neural Network based on the ResNet-34 architecture [12], trained using the metric learning triplet loss process described in FaceNet [13]. This learns a mapping from the detected faces to a compact space where the feature vectors (i.e. 128-dimensional FaceNet embeddings) originating from the faces of a given individual are located in a separate and compact region of the space. Thus, the vectors are highly discriminative, allowing standard techniques to be used for classification/verification. We have used the off-the-shelf dlib [8] implementation, without any adaptation or fine-tuning to the task identities.
A similar method is used to extract the feature vectors for the images and videos of the enrollment set. For each person, 10 still images and one short clip were provided. For each still image, we detect the face and extract a feature vector. The short video is processed similarly to the test video: scene detection, face detection and face tracking. A feature vector is extracted for each resulting track. This results in a variable number of enrollment vectors for each person, depending on the number of tracks in the short video. These vectors are associated with the name of the corresponding person and used as a person model.
To decide the track identity, we used a k-NN classifier with a cosine distance metric. A global threshold is applied to determine if the track belongs to any of the persons in the database. If this is the case, the identity corresponding to the nearest vector in the database is used as the track identity. This simple method is possible because of the highly discriminative properties of the 128-dimensional FaceNet embeddings.

2.2. Speaker System

The speaker system works as a tracking algorithm that uses speaker embeddings to compare speech signal segments with the speech utterances of the enrollment identities. These representations are created with a DNN which is fed with i-vectors and trained with a multi-objective loss (Figure 2). This loss is based on a triplet margin loss and a regularization function which minimizes the variance of both positive and negative tuple distances.

Figure 2: Speaker front-end diagram.

I-vectors are low-rank vectors, typically of dimension between 400 and 600, representing a speech utterance. Given a speech signal, acoustic features like Mel Frequency Cepstral Coefficients (MFCC) are extracted. These feature vectors are modeled by a set of Gaussian Mixtures (GMM) adapted from a Universal Background Model (UBM). The mean vectors of the adapted GMM are stacked to build the supervector M, which can be written as:

$$M = \mu + T\omega \qquad (1)$$

where µ is the speaker- and session-independent mean supervector from the UBM, T is the total variability matrix, and ω is a hidden variable. The mean of the posterior distribution of ω is referred to as the i-vector. This posterior distribution is conditioned on the Baum-Welch statistics of the given speech utterance. The T matrix is trained using the Expectation-Maximization (EM) algorithm given the centralized Baum-Welch statistics from background speech utterances. More details can be found in [14].
Given an i-vector, a DNN is used to extract a more discriminative speaker vector. This DNN is composed of 2 hidden layers of 400 nodes, where the activations of the second layer are used as a speaker embedding. This neural network is fed with i-vectors, and an initial L2 normalization is applied to these inputs before the first hidden layer. After each hidden layer, a batch normalization layer is used as a regularizer. Initially, the DNN is pretrained as a speaker classifier. Therefore, a softmax layer is added at the output of the network and the DNN is trained by minimizing the cross-entropy loss. Following this pretraining, the softmax layer is removed and the DNN is trained with the following multi-objective loss:

$$Loss = \frac{1}{N} \sum_{i=1}^{N} TLoss_i + \frac{\lambda}{2} RLoss_i \qquad (2)$$

$$TLoss_i = \max(0, d(A_i, P_i) - d(A_i, N_i) + margin) \qquad (3)$$
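The triplet term of Equation 3 can be sketched in a few lines. This is only an illustration, not the training code of the system: we read A_i, P_i and N_i as anchor, positive and negative embeddings (the usual triplet convention), take d to be the Euclidean distance, and pick an arbitrary margin for the example; the regularization term is left out.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin, distance=euclidean):
    """Hinge on d(A, P) - d(A, N) + margin, as in Equation 3."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def mean_triplet_loss(triplets, margin):
    """Average of the triplet term over a batch of (A, P, N) triplets."""
    losses = [triplet_margin_loss(a, p, n, margin) for a, p, n in triplets]
    return sum(losses) / len(losses)
```

The loss is zero as soon as the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this ordering contribute to the batch average.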
\[ RLoss_i = \left| d(A_i, P_i) - \frac{1}{N} \sum_{j=1}^{N} d(A_j, P_j) \right| + \left| d(A_i, N_i) - \frac{1}{N} \sum_{j=1}^{N} d(A_j, N_j) \right| \qquad (4) \]

where TLoss corresponds to the triplet margin loss [13] and RLoss corresponds to our proposed regularization function. TLoss is computed with a hinge loss, and d(x, y) refers to the Euclidean (L2) distance between a pair of vectors. RLoss, on the other hand, forces the DNN to minimize the variance of both positive and negative tuple distances. Hence, in each batch we estimate the means of the positive and negative scores. These means are then used to minimize the distance between each positive or negative pair distance and its corresponding mean. We add a λ penalty term so as to balance the magnitude of the regularization function in the global loss.

The i-vector framework combined with the DNN is used as a front-end block in the speaker tracking system. This front-end extracts features from the speech signal and compares them with the signals of the speaker targets. In our approach, a sliding window strategy has been used to extract speaker embeddings from 3-second speech segments with a 0.25-second shift. For the enrollment identities, we have used the whole signal to extract an embedding for each target. The cosine distance metric is then used to evaluate the similarity between each speech segment embedding and each target. The target with the highest similarity is then assigned to the corresponding speech segment. In order to classify the non-interest or unknown speakers, a threshold is imposed on the assignment between the best candidate and the speech segment. If the score of the most similar target is below the threshold, the speech segment is automatically tagged as an unknown identity.

2.3. Fusion System

A fusion system has been considered in order to combine the previous information sources. Speaker and video diarization are first performed individually. The results of both modalities are then fused so as to obtain a better speaker assignment. In order to combine both outputs properly, we made the following assumptions:

• Speaker segments and face tracks of the same person are temporally correlated. Hence, it is very likely that the person who appears in the video is the one who is speaking.
• Some speakers never come into view during the show, and there are other people who are shown on the screen but do not speak. These faces and speakers correspond for the most part to the unknown identities.

According to these assumptions, an algorithm has been designed based on weighting temporal overlaps between the tracks of the face system and the speech segments of the speaker system (Figure 3). As shown in the figure, the intersection between face tracks and speaker segments produces a new multimodal segmentation. The temporal segments where face and speech do not overlap are discarded. We use this new segmentation to combine the assignments of both modalities:

• The segments where the corresponding face/speaker segments have the same target assigned are automatically tagged with that identity.
• When the speaker and face assignments are not the same, we produce a new score combining the distances of both modalities between the segment and the enrollment targets. First we extract the scores of the multimodal segment for each modality. The range of these scores is different for each source, hence they need to be normalized. This normalization is performed with a softmax activation which has a different temperature parameter τ for each modality. A new set of scores is then produced by averaging the scores of both modalities. Given these new multimodal scores, a new threshold is used to determine whether the segment corresponds to the most similar target or to an unknown identity.

Figure 3: Fusion scheme. Green boxes refer to the segments whose ID assignment has been directly propagated. Orange boxes refer to segments whose ID has been assigned after the score combination.

3. Optimization and Experimental Results

In the following section we describe the setup of the proposed approaches and we present the results of these systems for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge.

3.1. Speaker System

The speaker front-end block has been trained on the VoxCeleb2 [15] database. Feature extraction is performed with 20-dimensional MFCCs plus delta features. The UBM has been trained as a GMM with 1024 mixtures and the T matrix size is 400. For the whole i-vector framework we have used the Alize [16, 17] toolkit and only the first 1000 speakers of the VoxCeleb2 development partition. The DNN is composed of two hidden layers of size 400. The pretraining has been performed using the same data used for the i-vector framework. For the triplet-based DNN training, the whole VoxCeleb2 development partition has been used. In order to obtain a good estimate of the positive and negative pair means, the batch size has been set to 1024. The λ for the RLoss has been set to 1. Both network trainings have been performed with the Adam optimizer. The learning rate has been set to 0.01 and the pretraining has been regularized with an additional weight decay of 0.001. For the target assignment, the decision threshold has been tuned to improve DER results on the RTVE2018 development set. A final threshold of 0.08 over the cosine distance (in the range [-1, 1]) has been obtained.

3.2. Video system

The method described in Section 2.1 has been used to obtain the results. We have filtered short tracks (tracks shorter than 1 s) because they are likely to belong to non-important faces.
This also reduces the computational load of the system. For each track, a 128-D feature vector has been generated. The final identity decision is determined by a k-NN classifier. As the number of enrollment vectors is low, a value of k = 1 has been used. Judging by the small Speaker Error Rate value in Table 1, this approach is effective, thanks to the discriminating power of the embeddings. The principal challenge in this task was the high number of tracks belonging to persons that are not in the enrollment set. To reject these tracks, a global threshold th has been used. This threshold has been determined as the value providing the best (lowest) DER in the development set. A final value of th = 0.47 over the cosine distance (in the range 0 to 1) with the nearest neighbor has been obtained.

3.3. Fusion system

Given the scores between speaker/face segments and the target vectors, a softmax activation has been used to normalize the scores of each modality. In order to obtain similar scores, the softmax of each modality has been applied with a different temperature parameter τ. For speech, τ = 3 has been used, and for the face modality τ has been set to 2. For the fusion system the target/non-target distance threshold has been set to 0.03.

3.4. Results

The proposed systems have been evaluated on the RTVE2018 database for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The development partition is composed of one video, with a duration of around 2 hours. Enrollment data (10 still images and a short video) is provided for a total of 34 identities. The test partition is composed of three test videos, with a total duration of around 4 hours, with enrollment data for 39 identities. The metric used to evaluate the systems is the Diarization Error Rate (DER), which is the sum of three different errors: Miss Speech (MISS), False Alarm (FA) and Speaker Error Rate (SER). In this challenge, the presented approaches are evaluated individually in each modality. Hence, a diarization result must be produced for both the speaker and face sources.

Table 1 shows the results of the presented approaches on the development partition. The first two rows correspond to the speaker and face systems evaluated with their corresponding speaker/face ground truth. The fusion system corresponds to the combination approach described in Section 2.3. Therefore, the third and fourth rows of the table correspond to the fusion system evaluated with the speaker and the face ground truth.

System                Miss    FA      SER     DER
Monomodal Speaker     3.5%    5.7%    31.9%   41.13%
Monomodal Face        37.9%   0.5%    1.9%    40.24%
Fusion (Spk. Eval.)   26.6%   2.3%    38.2%   66.99%
Fusion (Face Eval.)   51.7%   0.3%    26.9%   78.92%

Table 1: DER results on the development partition.

The speaker system shows a 41.13% DER, where the main source of error is the SER at 31.9%. The threshold used to decide whether a segment corresponds to a target or to an unknown identity produces a low MISS but leads to a higher FA and SER. We noticed that our system failed in segments where music was present in the background and for those targets whose enrollment signal was very different from the show in terms of channel variability. Adapting the model to the RTVE corpus could have reduced the error caused by these factors. On the other hand, using an initial speaker segmentation of the signal instead of a sliding window strategy could also lead to better system performance.

For the face modality, the main source of error is the large amount of missed face time (37.9%). By contrast, the FA and the SER are very low. The missed face time error could have two different causes. On the one hand, a threshold that is too low could cause many false rejections of valid tracks (i.e. tracks belonging to valid enrollment identities). On the other hand, this error could arise because the face detection/tracking failed to extract valid tracks. To determine which of these errors is predominant, we set the rejection threshold to its maximum value (th = 1), meaning that all tracks should be accepted. After doing that, we found that the missed face error was still very high (37.2%). This indicates that the errors are mainly produced by the tracking step.

The fusion system presents worse results than the speaker and the face systems used individually. Both fusion systems, evaluated with the speaker and the video ground truths, present higher MISS and SER than the monomodal systems. We have noticed that the multimodal segmentation does not improve the results because it automatically discards a lot of speaker segments and face tracks. On the one hand, it discards the segments where there is no overlap between speaker segments and face tracks. On the other hand, when more than one face appears in the video, the system automatically discards the face track of the person who is not speaking. Therefore, our assumptions would work better when the aim is to find who is shown and speaking at the same time, but not for this kind of multimodal evaluation.

4. Conclusions

We have presented two monomodal and one multimodal technologies to perform person identification in broadcast videos. A quantitative analysis has been performed on the RTVE 2018 dataset as provided in the Albayzin challenge. From the experiments it can be seen that the monomodal systems should be improved. For the speaker approach, it would be interesting to explore transfer learning methods to adapt our generic model to a smaller scenario and to include a speaker segmentation algorithm in the system. For the face modality, we plan to improve the face detection and tracking step, as it has been shown to be the main source of error for the face modality. There is also considerable room for improvement in the multimodal fusion system. Instead of fusing the outputs of the monomodal systems, an end-to-end multimodal system could work better if a large amount of data is available.

5. Acknowledgements

This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P) and the project MALEGRA (TEC2016-75976-R), financed by the Spanish Ministerio de Economía, Industria y Competitividad and the European Regional Development Fund (ERDF).

6. References

[1] O. G. G. Bernard and J. Kahn, “The first official repere evaluation,” in SLAM-Interspeech, 2013.
[2] J. Poignant, H. Bredin, and C. Barras, “Person discovery in broadcast tv at mediaeval 2015,” in Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[3] H. Bredin, C. Barras, and C. Guinaudeau, “Person discovery in broadcast tv at mediaeval 2016,” in Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
[4] N. Le, H. Bredin, G. Sargent, M. India, P. Lopez-Otero, C. Barras, C. Guinaudeau, G. Gravier, G. Barbosa da Fonseca, I. Lyon Freire, J. Z. Patrocinio, S. J. F. Guimares, M. Gerard, J. Morros, J. Hernando, L. Docio-Fernandez, C. Garcia-Mateo, S. Meignier, and J. Odobez, “Towards large scale multimedia indexing: A case study on person discovery in broadcast news,” in CBMI 2017.
[5] M. India, D. Varas, V. Vilaplana, J. Morros, and J. Hernando, “Upc system for the 2015 mediaeval multimodal person discovery in broadcast tv task,” in Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[6] M. India, G. Marti, E. Sayrol, J. Morros, J. Hernando, C. Cortillas, and G. Bouritsas, “Upc system for the 2016 mediaeval multimodal person discovery in broadcast tv task,” in Working Notes Proceedings of the MediaEval 2016 Workshop, 2016, pp. 1–3.
[7] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR’05, vol. 1, June 2005, pp. 886–893.
[8] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[9] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI’81, 1981, pp. 674–679.
[10] C. Tomasi and T. Kanade, “Detection and tracking of point features,” International Journal of Computer Vision, Tech. Rep., 1991.
[11] J. Shi and C. Tomasi, “Good features to track,” in CVPR 1994, June 1994, pp. 593–600.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR 2016, June 2016, pp. 770–778.
[13] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR 2015, June 2015, pp. 815–823.
[14] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[15] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
[16] J.-F. Bonastre, F. Wils, and S. Meignier, “Alize, a free toolkit for speaker recognition,” in ICASSP’05, vol. 1, 2005, pp. I–737.
[17] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. W. Evans, B. G. Fauve, and J. S. Mason, “Alize/spkdet: a state-of-the-art open source software for speaker recognition,” in Odyssey, 2008, p. 20.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-41
different areas with different video content. It is quite common that the shots change at a different pace in the different areas, so a solution that can make local decisions on shot changes is quite useful. However, we left the local detection of shot changes for future improvements of the system. For this version we just set a permissive global threshold that allows detection of a total or partial change of shot, movement and fading as a single event.

2.2. Face processing

The face processing subsystem comprises several sequential operations that are briefly explained in Figure 1 and in the subsections below.

2.2.1. Face Detection and Geometric Normalization

Face detection is a fundamental step in the sequential processing. We have used the detector based on the Multi-Task Cascaded Convolutional Neural Network [5], which jointly finds a bounding box for the face and five landmark points useful for normalizing the face. This face detector is quite robust to pose, expression and illumination changes. False negatives are typical in extreme poses with yaw angles beyond +/-60º and pitch angles beyond +/-40º, which are not so uncommon in interview and debate content. This approach also produces some false positives in areas where textured objects with skin colors appear, like hands, arms and other non-human objects.

Once a face is detected (whether a true detection or not), its bounding box (BB) is saved with several parameters that allow tracking and identity assignment during the process. An overlapping function between the current BB and the BBs of the previous frame allows linking the BBs belonging to the same person and backtracking when the shot has finished.

The detected face is passed to a geometric normalization step that prepares the face to be fed in a standardized way to the face recognition block.

2.2.2. Face recognition

We have used the face recognizer based on dlib’s implementation [6] of the Microsoft ResNet DNN [1]. This DNN finds an embedded space where similar faces are grouped together and far from different faces. This network is also trained to be quite robust to pose, expression and illumination changes. In this case, the network makes many false identification assignments when poses are beyond +/-50º in yaw and +/-20º in pitch. Extreme facial expressions also produce false assignments, although the network is quite robust for neutral and smiling faces (the great bulk of face images found in internet-based datasets for face recognition). However, TV content offers more facial expressions of emotion than neutral and smiling, so false assignments are also quite common in these cases.

After a face is detected for the first time in a shot, meaning that no previous BB is linked to the current one, a candidate ID is assigned to the BB if it surpasses a distance threshold for new IDs (Th_newID) when comparing the embedded vector against all the embedded vectors of the enrollment set. The closest ID is kept for that BB. A confidence value is also assigned to that ID in that specific BB. If the BB is linked to a previous BB, the embedded vector is compared just with the cluster of enrolled vectors of the previous candidate ID. If it surpasses a second threshold (Th_previousID), then the same ID is assigned to the BB. The rationale for keeping two different thresholds and having Th_newID > Th_previousID is that assigning an ID to a new detection should be more restrictive than assigning it to a previous one located in an overlapped area. It is important to highlight that a shot change can make a face with a different ID appear in the same position as another face in the last frame before the shot change. So, a change of shot restarts the tracking process.

2.2.3. Face clusters augmentation

The unreliable robustness of the face recognizer with extreme poses and expressions led us to define a strategy to cope with potentially erroneous ID assignments.

A previous study of the capacity of dlib’s facial recognition network advised us to use a restrictive Th_newID, so that comparisons of faces with extreme poses between the current embedding and the embeddings in the enrollment dataset will not surpass it. This way, it is less likely that a wrong ID assignment is propagated through a tracked BB. Only when the head moves to a more frontal pose, or a similar extreme pose is stored in the enrollment dataset, is the ID assigned to the BB and propagated. But what happens if the face in the shot is always in an extreme pose? We need a way to enrich the enrollment dataset with new samples of a specific ID that appear in the content but are different enough from the enrollment dataset. Given that Th_previousID is more relaxed, during a shot where a previous match has been assigned, several different face samples with almost any pose and expression can enrich the enrollment dataset. The criterion to enrich the dataset is the increase of variance of the ID face cluster. Then, the enriched dataset is ready to use in the next frame.

2.2.4. Face ID backtracking

Once a shot change is detected, an online post-processing is run to reassign IDs or even delete potentially wrong assignments in the past shot. This block is based on heuristic rules defined after observing the typical behavior of the previous processing blocks in the development scenarios. If the detector and recognizer were ideal, a shot without rapid movements or with slow camera movement should contain just a number of BB tracks that matches the number of persons in the shot; the first BB should appear in the first shot frame and the last BB of its BB track should appear in the last frame of the shot. But things are not perfect, so the rationale of the backtracking post-processing is based on the following observations:

• False negatives of the detector break the paths of the BBs, so there will be more paths than in the ideal case.
• False initial matchings in every new track will yield tracks with incorrect IDs.
• The number of persons in the shot can be roughly estimated by the average number of detections. This assumption is broken when extreme poses appear during a long period, producing an underestimated number of persons in the shot.
• Every ID matched in the shot can appear in several broken paths. The accumulated confidence of that ID gives a rough approximation of its probability of being in that shot.

Using these observations we defined a rule to keep tracks of a number of persons just above the estimated average and with the IDs with the largest accumulated confidence. Those IDs are finally assigned to the time intervals within and across shots and written to an RTTM file.
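The backtracking rule of Section 2.2.4 can be illustrated with a small sketch. It assumes that "just above the estimated average" means rounding the average number of detections per frame up to the next integer, and that each broken BB path contributes one (ID, confidence) pair; both the data layout and the function name `select_shot_ids` are hypothetical choices for this example, not the authors' implementation.

```python
import math
from collections import defaultdict

def select_shot_ids(detections_per_frame, path_ids):
    """Pick the IDs to keep for one shot.

    detections_per_frame: list with the number of face detections per frame.
    path_ids: list of (candidate_id, confidence) pairs, one per broken path.
    """
    if not detections_per_frame:
        return []

    # Rough estimate of the number of persons in the shot.
    average = sum(detections_per_frame) / len(detections_per_frame)
    n_keep = math.ceil(average)  # assumption: round the average upwards

    # Accumulate the confidence of each ID over all its broken paths.
    accumulated = defaultdict(float)
    for candidate_id, confidence in path_ids:
        accumulated[candidate_id] += confidence

    # Keep the IDs with the largest accumulated confidence.
    ranked = sorted(accumulated, key=lambda i: (-accumulated[i], i))
    return ranked[:n_keep]
```

For example, with detection counts `[2, 2, 1, 2]` the estimate is two persons, so only the two IDs with the highest accumulated confidence would be kept for the shot.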
[Flow-diagram residue, video branch. Recoverable labels: Video file; New shot detected; Th_previousID?; Find ID matching; Back to obtain new frame; End of shot detected.]

Figure 2: Block diagram of the speaker diarization and recognition subsystem. [Recoverable labels: Audio file; Obtain speech segments; Same short-term cluster same color; Pure speaker clusters in temporal segmentation.]
3.3. On-line Identity Assignment

The clusters obtained in the previous step need to be assigned to the enrollment identities. Keeping the timestamps of each embedding in the clustering process allows an online ID assignment approach to be designed. Temporal segments are defined as consecutive timestamps with embeddings associated with the same cluster. The ID assigned to a time segment is the enrollment ID of the best-matching enrollment cluster, as long as this distance is less than a threshold. This threshold is defined after observing the typical behavior of the system in the development scenarios. A confidence value for that ID in that specific temporal segment is stored to be used jointly with the face-based confidence value in the fusion process.

4. Fusion

Once a decision was made for both modalities, a multimodal fusion approach was implemented in order to correct potentially wrong speech-based ID assignments. Given a temporal segment that has been assigned a speaker identity ID1, if a high-confidence single face identity ID2 has been detected in more than 60% of the video frames in that speaker ID1 segment, the speaker ID1 is changed to the face identity ID2.

This rule doesn’t apply if ID1 and ID2 have different gender (as given by the enrollment name).

Much more elaborate fusion rules can be applied at this stage, but only this one was tested for the competition.

The results over the Development video provided by the organizers of the competition are presented in Table 1.

Table 1: DER results on the Development video

Modality         DER
Face             36.20%
Speaker          14.25%
Speaker Fusion   7.51%
Average DER      21.85%

5. Computational Cost

The computational cost of the proposed audiovisual diarization system was measured in terms of the real-time factor (RT). This measure represents the amount of time needed to process one second of audiovisual content: xRT = P/I, where I is the duration of the processed video and P is the time required for processing it. The whole development video was processed to compute the RT, thus taking into account many different audiovisual situations. The duration of this video is I = 7410 s, and the time needed to process it was P = 33457 s, leading to RT = 4.51. This computation time was obtained by running this experiment on an Intel(R) Core(™) i5 CPU [email protected] GHz with 12 GB RAM. Even though the process is running more than 4 times slower than real-time, the code is not optimized at all (some parts are coded in Matlab) and the machine is just using 1 CPU and no GPU. We are working to speed up the process and expect to have it running in real-time in the next months.

6. Conclusions and future work

We have presented the GTM-UVIGO System deployed for the Albayzin Multimodal Diarization Competition at Iberspeech 2018. The system uses state-of-the-art DNN algorithms for face detection and verification and also for speaker diarization and verification. The application scenario is studied to implement ad-hoc post-processing strategies to fine-tune the ID assignments made by the video and audio parts. Specifically, the information on shot changes is exploited to avoid tracking faces across shots. Confidence matrices are used in a fusion strategy that allows changing pre-assigned speaker identities. This framework leaves a lot of room for improvement in each of the fundamental processing stages and also in the ad-hoc rules for fine-tuning. One of the main future lines consists of increasing the robustness of face matchings for extreme poses and expressions and the robustness of speaker ID assignments in overlapped speech. From a video processing point of view, a better characterization of the display montage would allow the application of post-processing rules less prone to errors. Finally, speeding up some critical DNN parts by using GPUs and efficiently coding some other parts would allow real-time processing of the system.

Acknowledgements

This work has received financial support from the Spanish Ministerio de Economía y Competitividad through project ‘TraceThem’ TEC2015-65345-P, from the Xunta de Galicia (Agrupación Estratéxica Consolidada de Galicia accreditation 2016-2019) and the European Union (European Regional Development Fund ERDF).

References

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in INTERSPEECH 2017 – 97th Annual Conference of the International Speech Communication Association, Proceedings, pp. 999–1003, 2017.
[3] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5115–5119.
[4] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey and A. McCree, “Speaker diarization using deep neural network embeddings,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 4930-4934.
[5] K. Zhang, Z. Zhang, Z. Li and Y. Qiao, “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, Oct. 2016.
[6] D. E. King, “Dlib-ml: A Machine Learning Toolkit,” Journal of Machine Learning Research, 10:1755-1758, 2009.
[7] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1-4, 2011.
[8] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large scale speaker identification dataset,” INTERSPEECH 2017 – 97th Annual Conference of the International Speech Communication Association, 2017.
[9] Chris Biemann, “Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems,” in First Workshop on Graph Based Methods for Natural Language Processing (TextGraphs-1), pp. 73-80, 2006.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-42
beddings extractor [4] to the task of speaker clustering.
We used a multi-bandwidth speaker embedding DNN in our
submission. Speaker embedding DNNs were trained following
the protocol of [4]. Specifically, Kaldi was used to generate ex-
amples for training the DNN with a duration ranging between
2.0 and 3.5 seconds of speech. DNNs were trained using Ten-
sorflow over 6 epochs using a mini batch size of 128 exam-
ples, and dropout probability linearly increasing to 10% then
back to 0% in the final 2 epochs. The embeddings network
starts with five frame-level hidden layers, all using rectified lin-
ear unit (ReLU) activation and batch normalization. The first
three layers incrementally add time context with stacking of [-
2,-1,0,1,2], [-2,0,2], and [-3,0,3] instances of the input frame.
A statistics pooling layer then stacks the mean and standard de-
Figure 1: Flow diagram of components used in the STAR-LAB viation of the frames per audio segment, resulting in a 3000
team submissions to the IberSPEECH-RTVE 2018 Speaker Di- dimensional segment-level representation. The final two hid-
arization Challenge. den layers of 512 nodes operate at the segment-level and use
ReLU activation and batch normalization prior to the output
layer, which targets speaker labels for each audio segment us-
3.2. Speech Activity Detection
ing log softmax as the output. The embeddings are extracted
We used a DNN-based Speech Activity Detection (SAD) model from the first segment-level hidden layer of 512 nodes. This
leveraging short-term normalization. The SAD model was system used PLDA classification for clustering after applying
trained on clean telephone and microphone data from a ran- an LDA dimensionality reduction. We applied length and mean
dom selection of files from the provided Mixer datasets (2004- normalization to embeddings prior to use in PLDA. As a simple
2008), Fisher, Switchboard, Mixer6, SRE’18 unlabeled and method of domain adaptation, we mean normalized the chunked
SRE’16 unlabeled data. A 5 minute DTMF tone (acquired from embeddings from an audio file using the mean of all chunks.
YouTube), and a selection of noise and music samples with and The embeddings VB initialization process was performed
without speech added were added to the pool of data. In all, as follows. The audio was first segmented into 1.5 second seg-
11,668 files were used to train the SAD model. ments with 0.2 second shift. Following a similar strategy to VB
The system uses 19-dimensional MFCC features, which ex- diarization, we initialized a speaker cluster posterior matrix, q,
cluded C0 and used 24 filters over a bandwidth of 200-3300 Hz. to for 13 speakers. The number of speakers was selected from
These features were mean and variance normalized using a slid- previous experiments over the development data. We calculated
ing window of 3 seconds, and concatenated over a window of 31 for each speaker cluster, a weighted-average embedding based
frames. The resulting 620-dimensional feature vector formed on q and the 1.5s embeddings segments. These per-cluster em-
the input to a DNN which consisted of two hidden layers of beddings were compared using PLDA against each individual
sizes 500 and 100. The output layer of the DNN consisted of two nodes trained to predict the posteriors for the speech and non-speech classes. These posteriors are converted into log-likelihood ratios using Bayes rule (assuming a prior of 0.5), and thresholded at a value of -1.5, -2.0 and -3.0 for the primary and both contrastive systems, respectively. A padding of 0.5 seconds was applied over the final segmentation to smooth the transitions between speech/non-speech.
We applied cross-talk removal on all interview data from the NIST SRE corpora to suppress the interviewer speech that bled through to the target speaker channel. This was especially important for distant microphone channels in which each speaker had similar energy. Cross-talk removal involved using the SAD Log-Likelihood Ratios (LLRs) from the target microphone as well as the close-talking interviewer microphone, and removing any detected speech from the target channel that was detected in the interviewer channel with more than 3.5 in LLR value of the target channel.
For all augmented system training data, the SAD alignments from the raw audio were used rather than running SAD on the degraded signals directly, as done in [4].

embedding segment. We scaled the log-likelihood ratios (LLRs) that resulted from PLDA by 0.05 and performed Viterbi decoding of the LLRs to result in a new q and speaker priors. This process was iterated 10 times before using the resulting q and speaker priors in the subsequent VB diarization based on BN+MFCC features.

3.4. Variational Bayes diarization
Our diarization approach was based on the work of [10]. This approach uses an i-vector subspace to produce a frame-level diarization output. Our i-vector subspace was trained using concatenated BN and MFCC features [11], resulting in a feature with 140 dimensions. With VB diarization, we have used a left-to-right HMM structure of three states per speaker, as proposed in [12], in order to smooth the transitions between speakers.
The initialization of the VB diarization approach is done with the speaker posteriors estimated from the speaker embeddings initialization. We performed a maximum of 20 iterations of VB diarization.
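The SAD decision rule described in this section (posterior to log-likelihood ratio with a prior of 0.5, thresholding, then padding) can be illustrated with a minimal sketch. The function names, the frame rate of 100 frames per second (so that 50 frames stand for 0.5 seconds) and the choice to pad on both sides of each speech frame are assumptions made for illustration; the paper does not specify them.

```python
import math


def posterior_to_llr(p_speech, prior=0.5, eps=1e-10):
    """Convert a speech posterior into a log-likelihood ratio via Bayes' rule."""
    p = min(max(p_speech, eps), 1.0 - eps)  # guard against log(0)
    posterior_log_odds = math.log(p) - math.log(1.0 - p)
    prior_log_odds = math.log(prior) - math.log(1.0 - prior)
    # Posterior odds = likelihood ratio * prior odds, so subtract the prior term.
    return posterior_log_odds - prior_log_odds


def sad_decisions(posteriors, threshold=-1.5, pad_frames=50):
    """Threshold frame-level LLRs, then extend each speech frame by pad_frames.

    pad_frames=50 assumes 100 frames per second (hypothetical), i.e. 0.5 s.
    """
    speech = [posterior_to_llr(p) > threshold for p in posteriors]
    padded = list(speech)
    for i, is_speech in enumerate(speech):
        if is_speech:
            lo = max(0, i - pad_frames)
            hi = min(len(speech), i + pad_frames + 1)
            for j in range(lo, hi):
                padded[j] = True
    return padded
```

With a prior of 0.5 the prior term vanishes, so the LLR is simply the logit of the posterior; lowering the threshold from -1.5 to -3.0 (as in the contrastive systems) accepts more frames as speech.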
Table 1: Development results for each system submission

System Name     Miss Sp   FA Sp   SpkErr   DER
Primary         2.1%      1.9%    13.4%    17.38%
Contrastive 1   2.0%      2.3%    13.4%    17.81%
Contrastive 2   2.0%      2.9%    12.2%    17.08%

Table 2: Computational requirements of STAR-LAB submissions based on RT factor (higher than 1.0 is slower than real time) and maximum resident memory needed to diarize 10 minutes of development file millennium-20170522.

5. Computation
We benchmarked the computational requirements of the STAR-LAB system on a single core. The machine was an Intel Xeon E5630 Processor operating at 2.53GHz. The approximate processing speed and resource requirements are listed in Table 2.
These calculations are based on total CPU time divided by the total duration of the audio.

6. Acknowledgments
We'd like to thank Lukas Burget and the BUT team for their Python implementation of Variational Bayes diarization which was leveraged in this work [10].

7. References
[1] Mitchell McLaren, Luciana Ferrer, Diego Castan, Mahesh Nandwana, and Ruchir Travadi, “The sri-con-usc nist 2018 sre system description,” in NIST 2018 Speaker Recognition Evaluation, 2018.
[2] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
[3] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017, pp. 999–1003.
[4] M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, and E. Yilmaz, “How to train your speaker embeddings extractor,” in Speaker Odyssey, 2018, pp. 327–334.
[5] Patrick Kenny, “Bayesian analysis of speaker diarization with eigenvoice priors,” in Tech. Rep. CRIM, 2008.
[6] S.J.D. Prince and J.H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. IEEE International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[7] C. Kim and R. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” in Proc. ICASSP, 2012, pp. 4101–4104.
[8] Mitchell McLaren and Yun Lei, “Improved speaker recognition using dct coefficients as features,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4430–4434.
[9] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. on Speech and Audio Processing, vol. 19, pp. 788–798, 2011.
[10] VB diarization with eigenvoice and HMM priors, 2013, http://speech.fit.vutbr.cz/software/vb-diarization-eigenvoice-and-hmm-priors.
[11] Mitchell McLaren, Yun Lei, and Luciana Ferrer, “Advances in deep neural network approaches to speaker recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4814–4818.
[12] M. Diez, L. Burget, and P. Matejka, “Speaker diarization based on bayesian hmm with eigenvoice priors,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 147–154.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-43
[Figure: diarization pipeline. Stages and options: feature extraction (MFCC, ICMC); speech activity detection (Bi-LSTM); segmentation (1-second, Bi-LSTM); segment/cluster representation (binary key, triplet-loss, x-vectors); clustering (AHC, AP); resegmentation (GMM); output: diarization hypothesis.]
sponse, constant Q transform (IIR-CQT) [17]. This is a richer, multi-resolution time-frequency representation for audio signals, which provides a greater frequency resolution at lower frequencies and a higher time resolution at higher frequencies.

2.2. Speech activity detection and segmentation
All submissions share a common speech activity detection (SAD) module [18], where SAD is modelled as a supervised binary classification task (speech vs. non-speech), and addressed as a frame-wise sequence labelling task using a bi-directional long short-term memory (LSTM) network operating on MFCC features. As for segmentation, two systems were explored: (i) a straightforward uniform segmentation which splits speech content into 1 second segments and (ii) segmentation via the detection of speaker change points. The speaker change detection (SCD) module is that proposed in [19]. Similarly to the SAD module, SCD is also modelled here as a supervised binary sequence labelling task (change vs. non-change).

2.3. Segment/cluster representation
Binary key. This technique was initially proposed for speaker recognition [8, 20] and applied to speaker diarization [21, 9, 16]. It represents speech segments as low-dimensional, speaker-discriminative binary or integer vectors, which can be clustered using some sort of similarity measure. The core model to perform this mapping is a binary key background model (KBM) which is trained on the test segment before diarization. The KBM is actually a collection of diagonal-covariance Gaussian models selected from a pool of Gaussians learned on a sliding window over the test data. The window rate is adjusted dynamically to ensure a minimum number of Gaussians. Then, a selection process is performed to keep a percentage p of the Gaussians in the pool to ensure sufficient coverage of all the speakers in the test audio stream. The KBM is then used to binarise an input sequence of acoustic features, which are then accumulated to obtain a cumulative vector, which is the final representation. Refer to [21] for more details.
Triplet-loss neural embedding. The embedding architecture used is the one introduced in [10] and further improved in [22]. In the embedding space, using the triplet loss paradigm, two sequences x_i and x_j of the same speaker (resp. two different speakers) are expected to be close to (resp. far from) each other according to their angular distance.
x-vector. This method [11] uses a deep neural network (DNN) which maps variable length utterances to fixed-dimensional embeddings. The network consists of three main blocks. The first is a set of layers which implements a time-delay neural network (TDNN) [23] which operates at the frame level. The second is a statistics pooling layer that collects statistics (mean and variance) at the utterance level. Finally, a number of fully connected layers are followed by the output layer with as many outputs as speakers in the training data. Neurons of all layers use ReLU activations except the output layer neurons, which use softmax. The network is trained to discriminate between speakers in the training set. Once trained, the network is used to extract utterance-level embeddings for utterances from unseen speakers. The embedding is just the output of one of the fully connected layers after the statistics pooling layer.

2.4. Clustering
Agglomerative hierarchical clustering. The AHC clustering uses a bottom-up agglomerative clustering algorithm as follows. First, and assuming that the input audio stream is represented as a matrix of segment-level embeddings, a number of clusters M_init are initialised by a uniform splitting of the segment-level embedding matrix. Cluster embeddings are estimated as the mean of the segment embeddings. An iterative process including: (i) segment-to-cluster assignment, (ii) closest cluster pair merging and (iii) cluster embedding re-estimation by averaging embeddings of cluster members is then applied. All comparisons are performed using the cosine similarity between embeddings. The clustering solutions generated after (i) are stored at every iteration. The output solution is selected by finding a trade-off between the number of clusters and the within-class sum of squares (WCSS) among all solutions. This is accomplished through an elbow criterion, as described in [21].
Affinity propagation. As proposed in [24], an affinity propagation (AP) algorithm [25] is our second clustering method. In contrast to other clustering methods, AP does not require a prior choice of the number of clusters. All speech segments are potential cluster centres (exemplars). Taking as input the pair-wise similarities between all pairs of speech segments, AP will select the exemplars and associate all other speech segments to an exemplar. In our case, the similarity between the i-th and j-th speech segments is the negative angular distance between their embeddings.

2.5. Re-segmentation
A resegmentation process is performed to refine time boundaries of the segments generated in the clustering step. It uses Gaussian mixture models (GMM) to model the clusters, and maximum likelihood scoring at feature level. Since the log-likelihoods at frame level are noisy, an average smoothing within a sliding window is applied to the log-likelihood curves obtained with each cluster GMM. Then, each frame is assigned to the cluster which provides the highest smoothed log-likelihood.

3. System fusion
Two approaches to fusion were explored. The first operates at the similarity matrix level suited to combine speaker diarization systems that are aligned at the segment level. The second operates at the hypothesis level and can be applied to systems with
[Figure 2 shows: for Systems A and B, segmental embeddings (dim(emb) by N) and cluster embeddings (dim(emb) by M) yield M by N cosine similarity matrices, which are weighted (α), summed and used for segment-to-cluster assignment.]
Figure 2: Illustration of the segment-to-cluster similarity matrix fusion.
[Figure 3 shows: the label sequence of System A (A1, A2, A1) and that of System B (B1, B2, B1, B2, B1) combined into joint labels such as A1B1, A1B2, A2B1 and A2B2.]
Figure 3: Illustration of the fusion of two diarization hypotheses.
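As a rough illustration of the segment-to-cluster similarity matrix fusion depicted in Figure 2, the sketch below builds one M by N cosine similarity matrix per system, combines them with a weight α and assigns each segment to its most similar cluster. The convex combination α·S_A + (1 − α)·S_B, the function names and the toy embeddings are assumptions; the figure only indicates a weight α and a sum.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(cluster_embs, segment_embs):
    """M x N matrix of cluster-to-segment cosine similarities for one system."""
    return [[cosine(c, s) for s in segment_embs] for c in cluster_embs]


def fuse_and_assign(sim_a, sim_b, alpha=0.5):
    """Weighted sum of two M x N matrices, then segment-to-cluster assignment.

    The (1 - alpha) weight on the second system is an assumption.
    """
    n_clusters, n_segments = len(sim_a), len(sim_a[0])
    fused = [[alpha * sim_a[m][n] + (1.0 - alpha) * sim_b[m][n]
              for n in range(n_segments)] for m in range(n_clusters)]
    assignment = [max(range(n_clusters), key=lambda m: fused[m][n])
                  for n in range(n_segments)]
    return fused, assignment


# Toy example: both systems share M=2 clusters and N=3 segments, but use
# embeddings of different dimensionality (2 for system A, 3 for system B).
sim_a = similarity_matrix([[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]])
sim_b = similarity_matrix([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                          [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
_, labels_balanced = fuse_and_assign(sim_a, sim_b, alpha=0.5)
_, labels_trust_a = fuse_and_assign(sim_a, sim_b, alpha=0.9)
```

Because the fusion happens on similarities rather than embeddings, the two systems may use embeddings of different dimensionality, as long as they share the same segments and cluster indices. In the toy example the systems disagree on the third segment, and the value of α decides which one prevails.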
Table 1: Summary of ODESSA Primary (P) and contrastive (C1/C2) submissions for the closed- and open-set (denoted by c and o subscript, respectively) conditions, including feature extraction, segmentation and training data used, segment representation, clustering and fusion. Performance (DER, %) is shown in the last column.

Condition  Sys.  Features  Segmentation  Segment rep. / train data  Clustering  Fusion                        DER
Closed     P_c   -         -             -                          -           C1_c, C2_c, Hyp-level         10.17
Closed     C1_c  ICMC      1-second      BK / -                     AHC         -                             12.33
Closed     C2_c  MFCC      BiLSTM        EMB / 3/24 data            AP          -                             14.10
Open       P_o   -         1-second      -                          AHC         C1_c, C1_o, C2_o, Sim-level⁵  7.21
Open       C1_o  MFCC      1-second      x-vector / SRE-data        AHC         -                             9.29
Open       C2_o  MFCC      BiLSTM        EMB / Voxceleb             AP          -                             11.46
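The re-segmentation step of Section 2.5 (sliding-window average smoothing of per-cluster log-likelihood curves, followed by a frame-wise argmax) can be sketched as follows. The per-frame log-likelihoods are taken as given here; in the described system they would come from the cluster GMMs. The window length and all names are illustrative assumptions.

```python
def smooth(values, window):
    """Moving average over a centred window, truncated at the edges."""
    half = window // 2
    out = []
    for t in range(len(values)):
        lo = max(0, t - half)
        hi = min(len(values), t + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def resegment(loglik_per_cluster, window=5):
    """Assign each frame to the cluster with the highest smoothed log-likelihood.

    loglik_per_cluster: one list of frame-level log-likelihoods per cluster.
    """
    smoothed = [smooth(curve, window) for curve in loglik_per_cluster]
    n_frames = len(smoothed[0])
    return [max(range(len(smoothed)), key=lambda k: smoothed[k][t])
            for t in range(n_frames)]


# Cluster 0 fits every frame well except for one noisy frame (index 3).
loglik = [
    [-1.0, -1.0, -1.0, -4.0, -1.0, -1.0, -1.0],  # cluster 0
    [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0],  # cluster 1
]
raw_labels = resegment(loglik, window=1)     # no smoothing
smooth_labels = resegment(loglik, window=5)  # smoothing removes the spurious turn
```

Without smoothing, the single noisy frame produces a spurious one-frame speaker turn; with a 5-frame window the frame stays with cluster 0.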
8. References
[1] A. Stolcke, G. Friedland, and D. Imseng, “Leveraging speaker diarization for meeting recognition from distant microphones,” in Proc. ICASSP. IEEE, 2010, pp. 4390–4393.
[2] J. Patino, R. Yin, H. Delgado, H. Bredin, A. Komaty, G. Wisniewski, C. Barras, N. Evans, and S. Marcel, “Low-latency speaker spotting with online diarization and detection,” in Proc. Odyssey 2018, 2018, pp. 140–146.
[3] “NIST Rich Transcription Evaluation,” 2009. [Online]. Available: https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
[4] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “First DIHARD Challenge Evaluation Plan,” 2018, https://zenodo.org/record/1199638.
[5] A. Ortega, I. Viñals, A. Miguel, and E. Lleida, “The Albayzin 2016 Speaker Diarization Evaluation,” in Proc. IberSPEECH, 2016.
[6] A. Ortega, I. Viñals, A. Miguel, E. Lleida, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, “Albayzin Evaluation: IberSPEECH-RTVE 2018 Speaker Diarization Challenge,” 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-SpeakerDiarization-v1.3.pdf
[7] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, “RTVE2018 Database Description,” 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf
[8] X. Anguera and J.-F. Bonastre, “A novel speaker binary key derived from anchor models,” in Proc. INTERSPEECH, 2010, pp. 2118–2121.
[9] J. Patino, H. Delgado, N. Evans, and X. Anguera, “EURECOM submission to the Albayzin 2016 Speaker Diarization Evaluation,” in Proc. IberSPEECH, Nov 2016.
[10] H. Bredin, “TristouNet: Triplet Loss for Speaker Turn Embedding,” in Proc. ICASSP, New Orleans, USA, March 2017.
[11] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in Proc. ICASSP, April 2018, pp. 5329–5333.
[12] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier, “Step-by-step and integrated approaches in broadcast news speaker diarization,” Computer Speech & Language, vol. 20, no. 2-3, pp. 303–330, 2006.
[13] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge,” Proc. INTERSPEECH, pp. 2808–2812, 2018.
[14] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug 1980.
[15] H. Delgado, M. Todisco, M. Sahidullah, A. K. Sarkar, N. Evans, T. Kinnunen, and Z. H. Tan, “Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 179–185.
[16] J. Patino, H. Delgado, and N. Evans, “The EURECOM Submission to the First DIHARD Challenge,” in Proc. INTERSPEECH 2018, 2018, pp. 2813–2817. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2172
[17] P. Cancela, M. Rocamora, and E. López, “An efficient multi-resolution spectral transform for music analysis,” in Proc. ISMIR, 2009, pp. 309–314.
[18] G. Gelly and J.-L. Gauvain, “Minimum Word Error Training of RNN-based Voice Activity Detection,” in Proc. INTERSPEECH, 2015, pp. 2650–2654.
[19] R. Yin, H. Bredin, and C. Barras, “Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks,” in Proc. INTERSPEECH, Stockholm, Sweden, August 2017. [Online]. Available: https://github.com/yinruiqing/change detection
[20] G. Hernandez-Sierra, J. R. Calvo, J.-F. Bonastre, and P.-M. Bousquet, “Session compensation using binary speech representation for speaker recognition,” Pattern Recognition Letters, vol. 49, pp. 17–23, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865514001779
[21] H. Delgado, X. Anguera, C. Fredouille, and J. Serrano, “Fast single- and cross-show speaker diarization using binary key speaker modeling,” IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 12, pp. 2286–2297, 2015.
[22] G. Gelly and J.-L. Gauvain, “Spoken Language Identification using LSTM-based Angular Proximity,” in Proc. INTERSPEECH, August 2017.
[23] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. INTERSPEECH, September 2015, pp. 3214–3218.
[24] R. Yin, H. Bredin, and C. Barras, “Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization,” in Proc. INTERSPEECH, Hyderabad, India, September 2018.
[25] B. J. Frey and D. Dueck, “Clustering by Passing Messages Between Data Points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.
[26] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Proc. INTERSPEECH, August 2017.
[27] Z. Zajíc, M. Kunešová, J. Zelinka, and M. Hrúz, “ZCU-NTIS Speaker Diarization System for the DIHARD 2018 Challenge,” in Proc. INTERSPEECH, 2018, pp. 2788–2792.
[28] I. Viñals, P. Gimeno, A. Ortega, A. Miguel, and E. Lleida, “Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge,” in Proc. INTERSPEECH, 2018, pp. 2803–2807.
[29] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Žmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mošner, and P. Matějka, “BUT System for DIHARD Speech Diarization Challenge 2018,” in Proc. INTERSPEECH, 2018, pp. 2798–2802.
EML European Media Laboratory GmbH, Berliner Straße 45, 69120 Heidelberg, Germany
10.21437/IberSPEECH.2018-44
The Rich Transcription Time Marked (RTTM) files containing the segment information are generated automatically by another SD system and are provided in the challenge along with the audio recordings.
Table 1: The performance of the EML diarization system on the dev2 partition of RTVE2018 database.

Response Time   DER(%) La noche 24H   DER(%) Millenium   DER(%) Total   Total Processing Time   ×RT
2 sec           32.65                 12.24              22.12          00:12:37                0.014
4 sec           24.28                 10.32              17.02          00:11:12                0.012
The first stage of the clustering algorithm is similar to the Mean Shift based algorithm proposed in [5] and used successfully in [6]. In the second stage, the closer clusters obtained in the first stage are combined. The second stage can be iterated for a few iterations or until no further merge is possible. In both stages, speaker vectors are joined based on the cosine similarity considering a threshold which is set to 0.350 and 0.300 for stages 1 and 2, respectively. After clustering, the centroids of the top 2048 clusters with the highest number of members are considered as the final background speaker vectors.
Given the background speaker vectors, we perform a semi S-normalization on the scores before decision. The test speaker vector and the speaker models are both compared with background speaker vectors using cosine similarity. The cosine score between the test speaker vector and the speaker model is normalized once with the mean score of the top 10 closest background speaker vectors to the test speaker vector and another time with the mean score of the top 10 closest background speaker vectors to the speaker model. The final score is the average of these two normalized scores.

4. Performance Measurement
As in the NIST RT Diarization evaluations, the Diarization Error Rate (DER) will be used for the performance measurement in the challenge. The DER includes the time that is assigned to the wrong speaker, missed speech time and false alarm speech time.
The speaker error time is the amount of time that has been assigned to an incorrect speaker. This error can occur in segments where the number of system speakers is greater than the number of reference speakers, but also in segments where the number of system speakers is lower than the number of reference speakers whenever the number of system speakers and the number of reference speakers are greater than zero.
The missed speech time refers to the amount of time that speech is present but not labeled by the diarization system in segments where the number of system speakers is lower than the number of reference speakers.
The false alarm time is the amount of time that a speaker has been labeled by the diarization system but is not present in segments where the number of system speakers is greater than the number of reference speakers.
As defined in the challenge [7], consecutive segments of the same speaker with a silence of less than 2 sec are merged and considered as a single segment. A forgiveness collar of 0.25 sec, before and after each reference boundary, will be considered in order to take into account both inconsistent human annotations and the uncertainty about when a speaker begins or ends. Overlap regions where more than one speaker is present are also taken into account for the evaluation.
The tool used for evaluating the diarization systems is the one developed for the RT diarization evaluations by NIST, md-eval-v22.pl, available on the web site of the evaluation: http://catedrartve.unizar.es/reto2018. The command line for the evaluation will be as follows,

    md-eval-v22.pl -b -c 0.25 -r reference.rttm -s system.rttm    (1)

5. Experimental Results
We have used the human-revised labeled data (Sec. 2.3) for the evaluation of the diarization system. It contains 12 audio recordings from two different channels with a total duration of approximately 16 hours. The audio recordings include speech, music, silence, noise, cross talks (sometimes more than two speakers at the same time), and speech over music.
Two different participating conditions are proposed in this challenge, a closed-set condition in which only data provided within the Albayzin evaluation can be used for training and an open-set condition in which external data can also be used for training as long as they are publicly accessible to everyone. We have participated only in the closed-set condition.
The EML speaker diarization system is primarily designed for an Online application for which the robustness, computational cost, and the response time are important. In the primary submitted system, the decision about the identity of the speakers is made approximately every 2 sec without looking at the future in the audio recording. As the algorithm divides the speaker vectors into two halves before creating a new speaker model, the resolution for the speaker change point detection is about 1 sec. However, as the response time is not important in the challenge, we can increase it to a longer time, but this would correspond to losing fast speaker turns in the audio recording.
The development set (Sec. 2.3) includes audio recordings from two Spanish programs, La noche en 24H and Millenium. The experimental results showed higher average DER on La noche en 24H recordings than on Millenium recordings. It could be due to longer duration of audio signals (2 hours each compared to 1 hour each for Millenium), faster speaker turns, more cross talks, more music in the background, or something else which needs more investigation.
Table 1 summarizes the average DER on each program, the total DER obtained on the entire dev2 set of RTVE2018 dataset considering the duration of audio recordings, and the total computational time used for processing all the recordings from scratch including the feature extraction. The processing is made using a single core of an Intel(R) Xeon(R) CPU @2.10GHz.

6. Conclusions
We used the EML Online speaker diarization system as a participation in the recent Albayzin speaker diarization evaluation. We tried to take advantage of all the unlabeled and labeled data provided in the challenge in the closed-set condition. The system showed a reasonable performance on the development data with a very low computational cost.
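The semi S-normalization described earlier in this paper (two cohort means, one around the test speaker vector and one around the speaker model, and the average of the two normalized scores) can be sketched as below. The text does not say how a score is "normalized with" a cohort mean; this sketch assumes the mean is subtracted, without any division by a standard deviation. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_mean(vec, background, top_n):
    """Mean cosine score of the top_n background vectors closest to vec."""
    scores = sorted((cosine(vec, b) for b in background), reverse=True)[:top_n]
    return sum(scores) / len(scores)


def semi_s_norm(test_vec, model_vec, background, top_n=10):
    """Average of two mean-normalized scores (assumed: raw score minus cohort mean)."""
    raw = cosine(test_vec, model_vec)
    norm_by_test = raw - top_mean(test_vec, background, top_n)
    norm_by_model = raw - top_mean(model_vec, background, top_n)
    return 0.5 * (norm_by_test + norm_by_model)


# Toy background set of three speaker vectors (the paper uses 2048 centroids).
background = [[0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
same_speaker_score = semi_s_norm([1.0, 0.0], [1.0, 0.0], background, top_n=2)
```

Restricting each cohort to the closest background vectors makes the normalization adapt to the neighbourhood of each vector; in the toy example both cohort means are zero, so the normalized score equals the raw cosine score of 1.0.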
7. Acknowledgement
We would like to thank Wei Zhou for the efficient implementation of the Online algorithm.

8. References
[1] E. Lleida, A. Ortega, A. Miguel, V. Bazan, C. Perez, M. Zotano, and A. Prada, “RTVE2018 database description,” 2018, [Online]. Available: http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf.
[2] O. Ghahabi, W. Zhou, and V. Fischer, “A robust voice activity detection for real-time automatic speech recognition,” in Proc. ESSV, 2018.
[3] M. Zelenak, H. Schulz, and F. J. Hernando Pericas, “Albayzin 2010 evaluation campaign: speaker diarization,” in VI Jornadas en Tecnología del Habla, 2010, pp. 301–304.
[4] O. Ghahabi and J. Hernando, “Deep learning backend for single and multisession i-vector speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807–817, 2017.
[5] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 1, pp. 217–227, 2014.
[6] S. Novoselov, T. Pekhovsky, and K. Simonchik, “STC speaker recognition system for the nist i-vector challenge,” in Odyssey: The Speaker and Language Recognition Workshop, 2014, pp. 231–240.
[7] A. Ortega, I. Vinals, A. Miguel, E. Lleida, V. Bazan, C. Perez, M. Zotano, and A. Prada, “Albayzin evaluation: IberSpeech-RTVE 2018 speaker diarization challenge,” 2018, [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-SpeakerDiarization-v1.3.pdf.
ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{ivinalsb, pablogj, ortega, amiguel, lleida}@unizar.es
Abstract
This paper addresses domain mismatch scenarios in the diarization task. This research has been carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSpeech 2018. This evaluation seeks the improvement of the diarization task in broadcast corpora, known to contain multiple unknown speakers. These speakers are set to contribute in different scenarios, genres, media and languages. The evaluation offers two different conditions: a closed one with restrictions in the resources to train and develop diarization systems, and an open condition without restrictions to check the latest improvements in the state of the art.
Our proposal is centered on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, supposing that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a Fully Bayesian PLDA, a generative model with latent variables as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized in terms of the hypothesized speakers to compensate for different modeling capabilities.
Index Terms: adaptation, diarization, broadcast, i-vector, PLDA, Variational Bayes

1. Introduction
The production of broadcast content has grown steadily in recent years, making tools to process and label all these new data increasingly necessary. One of the required tasks is diarization, the indexation of some audio according to the active speaker. Hence the goal of diarization is the differentiation among the speakers by means of generic labels, leaving the identification of each speaker for further work, if necessary. Originally developed for telephone conversations, the technique is now of interest in new domains such as broadcast audio and meetings, which add new challenges not present in the original scenario.
Multiple approaches have been proposed to the diarization problem since its origins, most of them following two main strategies: the Top-Down philosophy, which obtains the correct labels by dividing an initial hypothesis with only one speaker, and the Bottom-Up strategy, which initially divides the input audio into acoustic segments containing only one speaker each, and combines them afterwards. Further information is available in [1][2]. Both philosophies need to characterize the speakers in the different parts of the audio to make any decision. For this reason diarization applies many methods developed for speaker recognition. Successful diarization systems considering these technologies are: Agglomerative Hierarchical Clustering (AHC) [3] with ∆BIC [4], streams of eigenvoices [5] resegmented with HMMs [6], i-vectors [7] clustered with PLDA [8], [9] in [10]. Neural Networks are also contributing, firstly providing more reliable acoustic information [11] and more recently a new representation: embeddings such as x-vectors [12].
When moving from telephone data to other scenarios, such as broadcast or meetings, new difficulties arise. Especially relevant are the estimation of the number of speakers and the domain mismatch. The first problem is caused by the presence of an unknown number of speakers in the audio. This difficulty increases if the contributions per speaker are significantly unbalanced. Our proposed solution to deal with this problem is [13], which makes use of a penalized version of the Evidence Lower Bound (ELBO) from a Variational Bayes solution as a reliability metric. Another important problem is the domain mismatch. Broadcast data consists of several shows, belonging to multiple genres and with many differences in terms of locations, audio quality or postprocessing details. This large variability in the audio makes a single system likely to lack the precision needed to cover the whole range of possibilities. However, show-specific systems are also unfeasible for practical reasons. Our approach [14] combines both strategies: a single system is adapted in an unsupervised manner to each show to be diarized, using the same data that is being evaluated, and the diarization labels are obtained afterwards.
The paper is organized as follows: Section 2 describes the evaluation and the available data. The ViVoLab system is presented in Section 3. Section 4 presents the obtained results. Finally, some conclusions are included in Section 5.

This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the 2015 FPI fellowship, the project TIN2017-85854-C4-1-R and Gobierno de Aragón /FEDER (research group T36 17R).

2. RTVE 2018 Challenge
The RTVE 2018 Challenge is part of the 2018 edition of the Albayzin [15], [16], [17], [18], [19] evaluations. These evaluations are designed to promote the evolution of speech technologies in Iberian languages. In particular, the RTVE 2018 Challenge is focused on the extraction of relevant information from broadcast data in the Spanish language. This information, such as the identity of the person on screen and their speech, is intended to help describe and label the multimedia data for further work. To accomplish all these goals, the evaluation provides around 500 hours of shows from the Spanish Public TV corporation Radio Televisión Española (RTVE). The considered audio tries to cover the widest possible range of Spanish variability, including varieties of Spanish from Spain and Latin America. In addition to the provided audio some metadata is provided, with different levels of reliability.
The database is divided into 4 subsets, with different functionality:
10.21437/IberSPEECH.2018-45
[Figure 1 shows the processing stages: VAD, MFCC, segmentation, i-vectors, clustering.]
Figure 1: Schematic of ViVoLab diarization system
best compare the results from each hypothesis, we make use of a penalized version of the Evidence Lower Bound (ELBO) given by the PLDA model [13]. This metric, related to the likelihood, indicates how well the speaker labels represent the given data, but it is penalized in terms of the hypothesized number of speakers to avoid unnecessary subclustering, i.e., estimating more clusters than the real number of speakers.
4. Results
The ViVoLab submission to the RTVE 2018 Diarization Challenge consists of two systems, a primary and a contrastive one. Both follow the pipeline described previously, with the only difference that our primary system performs unsupervised adaptation while the contrastive one does not. All the models were trained with 3/24 and CARTV data and no extra knowledge was considered, not even that provided by the RTVE data.
The results obtained by the previously described configurations are shown in Table 1.
Table 1: DER (%) results for the primary and contrastive 1 systems on the development set. Results are presented with ground truth VAD and our BLSTM VAD for comparison. Results include the DER term as well as its contributions: missed speech (MISS), false alarm speech (F.A.) and speaker error (SPK). Overlap is considered for evaluation purposes.
SYSTEM MISS (%) F.A. (%) SPK (%) DER (%)
from a new domain. Besides, when moving from ground truth VAD to a noisy one, a new sort of audio mismatch is introduced: i-vectors with non-speech. In this situation the unsupervised adaptation contributes by learning from the evaluation data and adapting the model to the new conditions. This adaptation keeps the system from being degraded by the new scenario. In real conditions with noisy VAD the adaptive solution obtains a 24% relative improvement with respect to the non-adaptive contrastive system.
Regarding the estimation of the number of speakers, the performed analysis indicates that our systems are likely to subcluster, i.e., hypothesize a larger number of clusters than the real value. If some of these extra clusters collect outlier segments, the main clusters stay pure and clean, which compensates for any loss in performance. Therefore some subclustering could help to avoid relevant mistakes (primary system vs contrastive system using ground truth VAD). However, our results also indicate that excessive subclustering causes a strong degradation of the performance (contrastive system with BLSTM VAD). Further research should be done to provide a deeper understanding.
6. References
[1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, “Speaker Diarization: A Review of Recent Research,” IEEE Transactions On Audio Speech And Language Processing, vol. 20, no. 2, pp. 356–370, 2012.
Spoken Language Technology Workshop (SLT), pp. 165–170,
2016.
[13] I. Viñals, P. Gimeno, A. Ortega, A. Miguel, and E. Lleida, “Es-
timation of the Number of Speakers with Variational Bayesian
PLDA in the DIHARD Diarization Challenge,” Interspeech,
pp. 2803–2807, 2018.
[14] I. Viñals, A. Ortega, J. Villalba, A. Miguel, and E. Lleida,
“Domain Adaptation of PLDA models in Broadcast Diariza-
tion by means of Unsupervised Speaker Clustering,” Interspeech,
pp. 2829–2833, 2017.
[15] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The Albayzin
2012 Audio Segmentation Evaluation,” 2012.
[16] J. Tejedor and D. T. Toledano, “The ALBAYZIN 2014 Search on
Speech Evaluation,” 2014.
[17] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The Albayzin
2014 Audio Segmentation Evaluation,” 2014.
[18] J. Tejedor and D. T. Toledano, “The ALBAYZIN 2016 Search on
Speech Evaluation,” 2016.
[19] A. Ortega, I. Viñals, A. Miguel, and E. Lleida, “The Albayzin
2016 Speaker Diarization Evaluation,” 2016.
[20] M. Zelenák, H. Schulz, and J. Hernando, “Speaker diarization of
broadcast news in Albayzin 2010 evaluation campaign,” Eurasip
Journal on Audio, Speech, and Music Processing, vol. 2012, no. 1,
pp. 1–9, 2012.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of I-
vector Length Normalization in Speaker Recognition Systems,”
in Proceedings of the Annual Conference of the International
Speech Communication Association, INTERSPEECH, pp. 249–
252, 2011.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-46
Abstract
This document describes the three systems submitted by the AuDIaS-UAM team for the Albayzin 2018 IberSPEECH-RTVE speaker diarization evaluation. Two of our systems (primary and contrastive 1 submissions) are based on embeddings, which are a fixed-length representation of a given audio segment obtained from a deep neural network (DNN) trained for speaker classification. The third system (contrastive 2) uses the classical i-vector as the representation of the audio segments. The resulting embeddings or i-vectors are then grouped using Agglomerative Hierarchical Clustering (AHC) in order to obtain the diarization labels. The new DNN-embedding approach for speaker diarization has obtained a remarkable performance over the Albayzin development dataset, similar to the performance achieved with the well-known i-vector approach.
Index Terms: speaker diarization, embeddings, i-vectors, AHC
1. Introduction
The AuDIaS-UAM submission for the Speaker Diarization (SD) evaluation consisted of three different systems, two of them based on embeddings (also known as x-vectors) [1] extracted from a Deep Neural Network (DNN) trained for speaker classification, and a third one based on the classical total variability i-vector model [2].
Our systems are submitted for the closed-set condition since they are trained using the training and development datasets made available for this evaluation, briefly described in Section 2.
For all our systems, we extract frame-level features as described in Section 3. Then, using those features, we train either a DNN or an i-vector extractor in order to obtain a fixed-length representation of an audio segment (regardless of its duration), as presented in Sections 4 and 5, respectively. These models are trained using a segmentation based on the reference labels (RTTM files) provided by the organizers for training and development purposes.
We kept three development files from the RTVE dataset as a held-out set for diarization performance evaluation. For these recordings, in order to discard fragments where just music (without speech) is present, we developed a DNN-based music detector described in Section 6. Then, diarization labels are obtained by means of Agglomerative Hierarchical Clustering (AHC) performed over non-music segments. This last step is summarized in Section 7. Finally, in Section 8 we show results of the AuDIaS-UAM submitted systems over our development dataset.
2. Training and Development Datasets
We used the three datasets provided by the organizers for this evaluation: Aragón Radio, 3/24 TV channel and RTVE 2018 (dev2) [3]. For training, we used all data from the first two databases and 8 files (out of 12) from RTVE 2018. This set was segmented according to the time alignments specified by the RTTM files.
In order to evaluate the speaker diarization performance of our systems, we used 3 files from the RTVE 2018 development set (approximately 4 hours) not used for training.
All the audio files were down-sampled to 16 kHz.
3. Feature Extraction
All our systems are based on MFCC features extracted using Kaldi [4]. Each feature vector consists of 20 MFCCs (including C0), computed every 10 ms with a 25 ms “Povey” window (default in Kaldi, similar to the Hamming window).
For the i-vector system, these 20-dimensional features are normalized using cepstral mean normalization over a 3 s sliding window, and augmented with their first and second derivatives (∆ and ∆∆), providing a final 60-dimensional feature vector.
For the DNN-embedding systems, the raw 20-dimensional MFCC feature vectors are used to feed the network without applying channel compensation or adding temporal information. However, global zero-mean and unit-variance normalization is performed over the whole training set.
4. DNN-based Embedding Systems
Two of our submitted systems are based on DNN-based embeddings [1]. An embedding is a fixed-length representation of a given utterance or audio segment learned directly by a DNN. Typically, this DNN is trained for speaker classification.
In our case, we used an architecture based on Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks similar to the one used in [5] for language recognition, whose configuration was adjusted to the available data and the speaker diarization task. The architecture (sequence-summarizing DNN) consists of a frame-level part composed of two BLSTM layers (with 128 cells each) and a fully-connected layer of 500 hidden units. Then, a pooling layer computes the mean and standard deviation over time of the outputs of the previous layer, followed by two fully connected layers (embeddings a and b, respectively) of 50 hidden units each and a softmax output layer with 3124 output units, working on an utterance-level basis. All the layers (except the output layer) use sigmoid non-linear activation. A graphical representation of the architecture is depicted in Figure 1.
The size of the output layer (3124) corresponds to the number of speakers considered in our training dataset. However, it should be pointed out that due to the lack of actual speaker identification labels and the segmentation based on the RTTM labels for diarization, each recording was considered to have
different speakers than the rest (which is usually not the case). Then, even though two segments might be labeled as spoken by different speakers, they could belong to the same person.
The DNN was trained using stochastic gradient descent to minimize the multi-class cross-entropy criterion for speaker classification. For training purposes, the network was fed with 3 s long sequences of 20-dimensional MFCC feature vectors.
After training, embeddings were extracted for each 3 s fragment of the development and test recordings, with a shift of 0.5 s. Each segment was forwarded through the network up to the first embedding layer (embedding a), providing a 50-dimensional embedding every 0.5 s (corresponding to 3 s sequences).
This system was implemented using Keras [6].
Embeddings obtained from this system were used for the primary system and the contrastive system 1, which differ in the clustering stage as described in Section 7.
Figure 1: Architecture used for the DNN-based embedding systems used as primary and contrastive 1 submissions to the speaker diarization evaluation. (Diagram: input sequence of 20-dim MFCC, BLSTM 128, BLSTM 128, fully connected 500 [frame-by-frame embedding]; pooling (mean, std), Emb_a 50, Emb_b 50, output of 3124 speaker posterior probabilities [utterance (sequence) level representation].)
5. I-vector System
As contrastive system 2, we used the classical total variability i-vector [2] modeling.
To develop this system, a UBM of 1024 Gaussian components was trained using the 60-dimensional MFCC+∆+∆∆ features described in Section 3, and a 50-dimensional total variability subspace was derived from the Baum-Welch statistics of the training segments (obtained according to RTTM timestamps). The configuration was taken from previous speaker diarization systems developed in our research group.
After training, each development and test recording was processed in order to obtain a stream of i-vectors every 0.5 s (as with embeddings) with a sliding window of length 3 s.
This system was implemented using Kaldi [4].
The speaker diarization was performed using clustering on top of the resulting streams of i-vectors (see Section 7).
6. Music Detection
In order to discard segments where just music was present, we developed a music/speech classifier based on DNNs [7].
This system is trained using 150 h of audio from Google Audio Set [8], a dataset consisting of 10-second audio segments extracted from YouTube videos. Our architecture is composed of six two-dimensional convolutional neural network (CNN) layers which operate on the Mel-spectrogram of the audio signals, followed by one LSTM layer and a final fully-connected layer prior to the output layer.
The output layer has 4 output units which correspond to the classification into music, speech, speech + music and none of them. For this evaluation we used just the probabilities of belonging to the music class in order to filter out segments that contain only music. This way, just the embeddings or i-vectors corresponding to speech (or speech with music) segments were used to perform the clustering stage.
In order to extract the probability of a segment containing music, Mel-spectrograms corresponding to test recordings have been computed and split into a stream of 10-second segments to fit the input size of the music/speech classifier. The separation between consecutive segments is 0.5 s, so that each Mel-spectrogram can be matched with an embedding or i-vector in the stream.
7. Agglomerative Hierarchical Clustering
In order to obtain the speaker diarization labels, we used Agglomerative Hierarchical Clustering (AHC) over the resulting stream of either embeddings or i-vectors (depending on the system) for a given development or test recording. Embeddings or i-vectors corresponding to music segments according to our music detector were discarded prior to the clustering step. This stage was implemented in Python using the scikit-learn toolbox [9].
Thus, AHC is applied to the resulting sequence of vectors corresponding to a specific audio file. We used cosine distance for i-vectors and Euclidean distance for embeddings. The number of clusters is controlled by the distance threshold for merging clusters, whose value was optimized on the development set. Average linkage was used.
This clustering was applied once for the contrastive systems. However, for the primary system we first applied AHC over the whole set of vectors with a lower threshold to allow a larger number of clusters, and then a second AHC stage was applied to group the centroids of the previous clusters. This was done in order to help the clustering group speaker identities instead of vectors that are similar due to their closeness in time. The centroids were computed as the mean vector over all the points labeled as belonging to the same cluster in the first AHC stage.
Finally, for all the systems, we post-processed the clustering labels by filtering out clusters that grouped less than 10 s of audio. This was done in order to reduce false alarms from clusters that do not correspond to a distinct speaker but merely collect segments lying farther away than the chosen threshold.
8. Development Results
Table 1 shows the results obtained by our systems in our development set (three files from the RTVE 2018 dev2 partition).
In our development set, both systems based on embeddings obtained similar results, especially in terms of missed and false alarm speaker time. The difference in these two metrics with respect to the third system (based on i-vectors) is also not significant, while the performance differs mainly due to the speaker error time. This might be related to the system settings and thresholds selected for the clustering, which would merge different speakers into one cluster or vice versa.
Even though the i-vector system obtained better performance in our development dataset than the embedding-based systems, we submitted one of the embedding-based systems as primary system due to the novelty of this technique with respect to the well-known i-vectors for speaker diarization and related tasks.
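The two-stage clustering of Section 7 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the submitted implementation (which used scikit-learn [9]): the naive average-linkage routine `ahc_average`, the wrapper `two_stage_ahc`, the toy data and both threshold values are hypothetical and chosen only for the example. The 10 s post-filtering step is not included.

```python
import numpy as np


def ahc_average(X, threshold, metric="euclidean"):
    """Naive average-linkage AHC: keep merging the closest pair of clusters
    until the smallest average inter-cluster distance exceeds `threshold`."""
    X = np.asarray(X, dtype=float)
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        D = 1.0 - Xn @ Xn.T
    else:  # Euclidean distance
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


def two_stage_ahc(X, first_threshold, second_threshold, metric="euclidean"):
    """Stage 1: over-cluster with a low threshold.
    Stage 2: cluster the stage-1 centroids (mean vectors) and map labels back."""
    X = np.asarray(X, dtype=float)
    first = ahc_average(X, first_threshold, metric)
    centroids = np.stack([X[first == k].mean(axis=0)
                          for k in range(first.max() + 1)])
    second = ahc_average(centroids, second_threshold, metric)
    return second[first]


# Toy stream of 5-dimensional "embeddings" from two well-separated speakers.
rng = np.random.default_rng(0)
speaker_a = rng.normal(0.0, 0.1, size=(20, 5))
speaker_b = rng.normal(5.0, 0.1, size=(20, 5))
vectors = np.vstack([speaker_a, speaker_b])
labels = two_stage_ahc(vectors, first_threshold=0.2, second_threshold=2.0)
print(sorted(set(labels.tolist())))
```

In this sketch the first stage deliberately produces many small clusters, and the second stage only has to separate their centroids, which is one way to read the motivation given above for clustering speaker identities rather than temporally adjacent vectors.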
Table 1: Performance of the AuDIaS-UAM submission for the Albayzin IberSPEECH-RTVE speaker diarization evaluation over the
development dataset (approximately 4 hours, 3 different recordings from 2 different shows). The performance is shown in % of scored
speaker time.
9. Acknowledgements
This work was supported by project DSSL: Redes Profundas y
Modelos de Subespacios para Detección y Seguimiento de Lo-
cutor Idioma y Enfermedades Degenerativas a partir de la Voz
(TEC2015-68172-C2-1-P), funded by Ministerio de Economı́a
y Competitividad, Spain and FEDER.
10. References
[1] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep
neural network embeddings for text-independent speaker verifica-
tion,” in Proceedings of Interspeech 2017, 2017.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,
“Front-end factor analysis for speaker verification,” IEEE
Transactions on Audio, Speech & Language Processing,
vol. 19, no. 4, pp. 788–798, 2011. [Online]. Available:
http://dx.doi.org/10.1109/TASL.2010.2064307
[3] “Albayzin evaluation: Iberspeech-rtve 2018 speaker diariza-
tion challenge,” http://catedrartve.unizar.es/reto2018/EvalPlan-
SpeakerDiarization-v1.3.pdf.
[4] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz,
J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recogni-
tion toolkit,” in IEEE 2011 Workshop on Automatic Speech Recog-
nition and Understanding. IEEE Signal Processing Society, Dec.
2011, IEEE Catalog No.: CFP11SRW-USB.
[5] A. Lozano-Diez, O. Plchot, P. Matějka, and J. Gonzalez-Rodriguez,
“Dnn based embeddings for language recognition,” in Proceedings
of ICASSP, April 2018.
[6] “Keras: The python deep learning library,” https://keras.io/.
[7] D. de Benito Gorrón, “Detección de voz y música
en un corpus a gran escala de eventos de audio,”
https://repositorio.uam.es/handle/10486/684843, June 2018.
[8] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set:
An ontology and human-labeled dataset for audio events,” in Proc.
IEEE ICASSP 2017, New Orleans, LA, 2017.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Per-
rot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,”
Journal of Machine Learning Research, vol. 12, pp. 2825–2830,
2011.
1. Introduction
This is the first participation of the CENATAV Voice Group in the Albayzin Challenges, participating in the Speaker Diarization Challenge (SDC) task and developing a diarization system focused on robust feature extraction. A speaker diarization system answers the question "Who spoke when?" on an audio stream, a problem that has been of interest to the scientific community since the last century, with the emergence of the first works on speaker segmentation and clustering [1][2]. Diarization can be used as a stage that enriches and improves the results of other systems. For example, a Rich Transcription System uses diarization to add the information about who is speaking to the speech transcription, and a Speaker Recognition System uses it when the test signal has several speakers, so that diarization finds the segments of the test signal with only one speaker [3].
An issue for diarization is the environment where the speech is recorded, because noise is a natural condition in real applications. The proposed system is focused on robust feature extraction techniques for improving the results in a real application. Robust techniques such as Mean Hilbert Envelope Coefficients (MHEC), Medium Duration Modulation Coefficients (MDMC) and Power-Normalized Cepstral Coefficients (PNCC) are analysed.
The system was mainly developed on the S4D tool [4], with the following structure: robust feature extraction, segmentation (Gaussian divergence and Bayesian Information Criterion), speech activity detection (Support Vector Machine), clustering (Hierarchical Agglomerative Clustering) and, as the last stage, re-segmentation (Viterbi algorithm). The system is described in the next sections.
Figure 1: Gammatone filter-bank of dimension 40
$$h_i(t) = \gamma \, t^{\tau-1} \exp(-2\pi \, erb_i \, t) \cos(2\pi f_{c_i} t + \theta) \qquad (1)$$
where:
• γ: amplitude.
• τ: filter order.
• erb: equivalent rectangular bandwidth.
• f_{c_i}: center frequency of channel i.
• θ: phase.
The following gammatone filter parameters were set in the proposed system, following Glasberg and Moore’s recommendation [6], where:
• $f_{c_i} = -(EarQ \cdot minB) + \dfrac{f_{max} + EarQ \cdot minB}{\exp(i \cdot 0.5 / EarQ)}$
• $erb_i = \left( \left( f_{c_i} / EarQ \right)^{\tau} + minB^{\tau} \right)^{1/\tau}$
• EarQ = 9.26449
• minB = 24.7
10.21437/IberSPEECH.2018-47
EarQ is the asymptotic filter quality at high frequencies and minB is the minimum bandwidth for low-frequency channels. The parameters θ, γ and τ were set to 0, 1 and 4, respectively.
2.1. Mean Hilbert Envelope Coefficients
A gammatone filter modulates a signal in amplitude and frequency [7], so demodulating the filter output is a way of recovering the transmitted information. Mean Hilbert Envelope Coefficients (MHEC) extract this information by applying the Hilbert transform to estimate the analytic signal, separating the AM component from the modulated signal and assuming that the FM component does not exist [7]. The extraction process is shown in Figure 2.
FM component at the gammatone filter-bank output, rather the power at each sub-band is computed and transformed into the cepstral domain. There are two main approaches to PNCC, the short-time and the medium-time approaches; the former is the one used in this paper. For more details about this technique, see [11]. Figure 4 shows the process for computing PNCC.
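As a rough illustration of the gammatone parameters given with Eq. (1), the sketch below computes the center frequencies, the equivalent rectangular bandwidths and the impulse responses of a 40-channel bank. Several choices are assumptions made only for the example and are not stated in the text: the exponent of the center-frequency formula is read as i·0.5/EarQ, the channel index i runs from 1 to 40, f_max is taken as 7000 Hz (the upper band edge listed in Table 1), the sampling rate is 16 kHz and each response is truncated to 25 ms. The function names are hypothetical.

```python
import numpy as np

EAR_Q = 9.26449   # asymptotic filter quality at high frequencies
MIN_B = 24.7      # minimum bandwidth for low-frequency channels (Hz)


def center_frequencies(n_channels=40, f_max=7000.0, step=0.5):
    """Center frequency of each channel (assumed 1-based index i)."""
    i = np.arange(1, n_channels + 1)
    return -(EAR_Q * MIN_B) + (f_max + EAR_Q * MIN_B) / np.exp(i * step / EAR_Q)


def erb(fc, order=4):
    """Equivalent rectangular bandwidth for center frequency fc."""
    return ((fc / EAR_Q) ** order + MIN_B ** order) ** (1.0 / order)


def gammatone_impulse_response(fc, fs=16000, duration=0.025,
                               gamma=1.0, order=4, theta=0.0):
    """Sampled gammatone impulse response with gamma=1, order=4, theta=0."""
    t = np.arange(int(round(duration * fs))) / fs
    envelope = gamma * t ** (order - 1) * np.exp(-2.0 * np.pi * erb(fc, order) * t)
    return envelope * np.cos(2.0 * np.pi * fc * t + theta)


fcs = center_frequencies()
bank = np.stack([gammatone_impulse_response(fc) for fc in fcs])
print(bank.shape, round(float(fcs[0]), 1), round(float(fcs[-1]), 1))
```

Under these assumptions the center frequencies decrease monotonically from just below f_max towards the low end of the band, and filtering a signal with each row of `bank` would give the sub-band outputs on which the envelope-based features operate.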
6. Re-segmentation: an HMM is trained on the whole signal and a Viterbi re-segmentation is done for redefining the change points.
4. Experiment
The tested robust feature extraction, LFCC and LPCC algorithms are our own implementations, while the MFCC implementation is from the SIDEKIT tool [13] and the SVM from the pyAudioAnalysis tool [12]. The Gaussian Divergence, BIC, HAC and Re-segmentation algorithms are from the S4D tool [4]. The proposed systems were submitted under the closed condition and they do not use training data, with the exception of the SVM, for which a portion of the Albayzin SDC 2016 Database was used. The experiment was developed on the RTVE-2018 SDC Development Database, tuning the default thresholds of BIC and HAC in S4D, and comparing robust (MHEC, MDMC, PNCC) and classic (MFCC, LFCC, LPCC) feature extraction techniques. Tables 1 and 2 show the best configuration for each feature. The pre-emphasis (0.97), window length (0.025 s), window shift (0.01 s), compression (logarithmic) and normalization (Cepstral Mean Normalization) are the same for every feature.
Table 1: Robust features configuration
Parameters              MHEC          MDMC          PNCC
Filter-bank dimension   40            40            40
Bandwidth (Hz)          0 - 7000      0 - 7000      0 - 7000
Cepstral coefficients   12 + ∆ + ∆∆   12 + ∆ + ∆∆   15 + ∆ + ∆∆
Figure 6: Diarization Error Rate of the proposed system by feature.
The computational cost was measured in terms of the real-time factor. This measure represents the time needed to process one second of signal. The experiment was done on an Intel Core i3-3110M CPU 2.40 GHz X 4 with 3.7 GB of memory. The computational cost of each submitted system is shown in Table 3; the system with LFCC is the most efficient.
Table 3: Real-time factor of each system submitted
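The real-time factor described above is simply processing time divided by audio duration. The following minimal sketch shows one way to measure it; `real_time_factor` and `toy_pipeline` are hypothetical names, the toy function only stands in for a full diarization run, and the 16 kHz sampling rate is an assumption.

```python
import time


def real_time_factor(process, signal, sample_rate):
    """Seconds of processing needed per second of audio."""
    start = time.perf_counter()
    process(signal)
    elapsed = time.perf_counter() - start
    return elapsed / (len(signal) / sample_rate)


def toy_pipeline(signal):
    # Hypothetical stand-in for feature extraction, segmentation and clustering.
    return sum(abs(x) for x in signal)


audio = [0.0] * (16000 * 10)   # 10 s of silence at an assumed 16 kHz
rtf = real_time_factor(toy_pipeline, audio, 16000)
print(f"real-time factor: {rtf:.4f}")
```

A value below 1 means the system runs faster than real time on the machine where it was measured, so figures are only comparable when obtained on the same hardware.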
putacionales e Informáticas (CICCI 2018), La Habana, Cuba,
2018.
[4] S. Meignier. (2015) S4D. [Online]. Available: http://www-lium.univ-lemans.fr/s4d/
[5] N. T. Hieu, “Speaker diarization in meetings domain,” Ph.D. dis-
sertation, School of Computer Engineering of the Nanyang Tech-
nological University, 2014.
[6] M. Slaney, “An efficient implementation of the patterson-
holdsworth auditory filter bank,” Apple Computer, Tech. Rep.,
1993.
[7] S. O. Sadjadi and J. H. L. Hansen, “Mean hilbert envelope coef-
ficients (MHEC) for robust speaker and language identification,”
Speech Communication, vol. 72, pp. 138–148, 2015.
[8] E. L. Campbell, G. Hernandez, and J. R. Calvo, “Feature extrac-
tion of automatic speaker recognition, analysis and evaluation in
real environment,” in International Workshop on Artificial Intelligence and Pattern Recognition 2018, Lecture Notes in Computer Science, 2018.
[9] V. Mitra, H. Franco, M. Graciarena, and D. Vergyri, “Medium-
duration modulation cepstral feature for robust speech recogni-
tion,” in IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014,
2014, pp. 1749–1753.
[10] A. Potamianos and P. Maragos, “A comparison of the energy op-
erator and the hilbert transform approach to signal and speech de-
modulation,” 1995.
[11] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients
(PNCC) for robust speech recognition,” IEEE/ACM Trans. Audio,
Speech & Language Processing, vol. 24, no. 7, pp. 1315–1329,
2016.
[12] T. Giannakopoulos. (2018) pyaudioanalysis. [Online]. Available:
https://github.com/tyiannak/pyAudioAnalysis
[13] A. Larcher, S. Meignier, and K. A. LEE. (2017) Sidekit. [Online].
Available: http://www-lium.univ-lemans.fr/sidekit/
Intelligent Voice Limited, St Clare House, 30-33 Minories, EC3N 1BP, London, UK
10.21437/IberSPEECH.2018-48
tures are centered using a short-term (3s window) cepstral mean
and variance normalization (ST-CMVN).
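The short-term normalization mentioned above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the function name `sliding_cmvn` is hypothetical, the window is assumed to be centered on each frame, and a 10 ms frame shift is assumed so that 3 s corresponds to 300 frames (the frame shift is not stated in this excerpt).

```python
import numpy as np


def sliding_cmvn(feats, window=300, eps=1e-8):
    """Normalize each frame by the mean and standard deviation of the
    surrounding window (window is truncated at the signal edges)."""
    feats = np.asarray(feats, dtype=float)
    n_frames = feats.shape[0]
    half = window // 2
    out = np.empty_like(feats)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        segment = feats[lo:hi]
        out[t] = (feats[t] - segment.mean(axis=0)) / (segment.std(axis=0) + eps)
    return out


# Toy features: 500 frames of 20 coefficients with a constant offset.
rng = np.random.default_rng(0)
feats = rng.normal(5.0, 2.0, size=(500, 20))
normalized = sliding_cmvn(feats)
print(normalized.shape)
```

Because the statistics are local, a constant channel offset added to all frames leaves the normalized features unchanged, which is the property this kind of normalization is meant to provide.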
Figure 2: Deep neural network architecture used to extract speaker embeddings.
boundaries. This could be either in the feature space like MFCC or in the factor analysis subspace [24]. Speaker diarization in the factor analysis space allows us to take advantage of speaker-specific information. By contrast, lower-level acoustic features such as MFCCs are not quite as good for discerning speaker identities, but they can provide sufficient temporal resolution to witness local speaker changes. The proposed framework for diarization provides a stronger speaker representation at the frame level. As a result, when combined with an HMM to refine the speaker posterior probabilities through limiting the speaker transitions [24], the system is able to detect speaker change points. The speaker log likelihoods for the HMM are computed by the spectral clustering algorithm as described in Section 2.3.
Table 1: DER (%) on the development data (dev2) as well as evaluation data of the IberSPEECH-RTVE 2018 speaker diarization challenge.
        DER(%)   Err(%)   FA(%)   Miss(%)
dev2    15.96    10.5     3.6     1.8
eval    30.96    25.2     4.8     0.9
3. Experiments
3.1. Training Data
Switchboard corpora (LDC2001S13, LDC2002S06, LDC2004S07, LDC98S75, LDC99S79) and NIST SRE 2004-2010, which consist of conversational telephone and microphone speech data at 8 kHz sampling frequency from around 5k speakers, were used for training the system. Augmentation increases the amount and diversity of the existing training data. Our strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. We use the simulated RIRs described in [30], and the reverberation itself is performed with the multi-condition training tools in the Kaldi ASpIRE recipe [31]. For additive noise, we use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres and 60 hours of speech from twelve languages [32]. Both MUSAN and the RIR datasets are freely available from http://www.openslr.org. We use a 3-fold augmentation that combines the original “clean” training list with two augmented copies [33]. To augment a recording, we choose randomly between one of the following:
• babble: Three to seven speakers are randomly picked from MUSAN speech, summed together, then added to the original signal (13-20dB SNR).
• music: A single music file is randomly selected from MUSAN, trimmed or repeated as necessary to match duration, and added to the original signal (5-15dB SNR).
• noise: MUSAN noises are added at one second intervals throughout the recording (0-15dB SNR).
• reverb: The training recording is artificially reverberated via convolution with simulated RIRs.
3.2. Performance Metrics
We measured performance with Diarization Error Rate (DER), the standard metric for diarization. It is measured as the total percentage of reference speaker time that is not correctly attributed to a speaker. More concretely, DER is defined as:
$$DER = \frac{FA + MISS + ERROR}{TOTAL} \qquad (1)$$
where FA is the total system speaker time not attributed to a reference speaker, MISS is the total reference speaker time not attributed to a system speaker, and ERROR is the total reference speaker time attributed to the wrong speaker. Following the conventions traditionally used in evaluating diarization performance [11], a forgiveness collar of 0.25 seconds is applied before and after each reference boundary prior to scoring. The DER is reported based on the NIST RT Diarization evaluations [34].
4. Results
The IberSPEECH-RTVE 2018 Speaker Diarization is a new challenge in the ALBAYZIN evaluation series. This evaluation consists of segmenting broadcast audio documents according to different speakers and linking those segments which originate from the same speaker. We used two Intel Xeon CPUs (E5-2670 @ 2.60GHz and 8 cores), 64G of DDR3 memory, 400G disk storage and an NVIDIA TITAN X GPU (12G of memory) to train the network. The Keras API with the TensorFlow backend has been used for system development. Training takes almost a week to process around half a million segments of 10-20 seconds each. To process a single 20 minute recording, the system execution time is around 7 seconds. We report the performance of our proposed diarization framework on the development set (dev2) using the provided speaker marks and also the result of the submitted system on the evaluation set in Table 1. Our system was trained on publicly accessible data which differs entirely from both the development and evaluation data (open-set condition). The results indicate the effectiveness of the proposed approach on challenging domains.
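As a small worked example of the DER definition in Section 3.2, the sketch below recombines the error components listed in Table 1. The function name `diarization_error_rate` is hypothetical, and the components are assumed to be already expressed as percentages of the total reference speaker time, so TOTAL is set to 100.

```python
def diarization_error_rate(false_alarm, miss, speaker_error, total):
    """DER in percent: (FA + MISS + ERROR) / TOTAL."""
    if total <= 0:
        raise ValueError("total reference speaker time must be positive")
    return 100.0 * (false_alarm + miss + speaker_error) / total


# Components taken from Table 1 (percent of reference speaker time).
dev2 = diarization_error_rate(false_alarm=3.6, miss=1.8, speaker_error=10.5, total=100.0)
eval_ = diarization_error_rate(false_alarm=4.8, miss=0.9, speaker_error=25.2, total=100.0)
print(f"dev2: {dev2:.1f}%  eval: {eval_:.1f}%")
```

The recombined values (15.9% and 30.9%) agree with the reported DERs of 15.96% and 30.96% up to the rounding of the individual components.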
5. Conclusion
The IberSPEECH-RTVE 2018 Speaker Diarization has proven to be a highly challenging contest, especially in the detection of the number of speakers and in dealing with background noise. We have presented our system and reported the results on the development set as well as the evaluation set of the challenge. We found deep neural network embeddings much better at discerning speaker identities, especially for speech acquired without constraints on recording equipment and environment. Our strategy of employing additive noises and reverberation for data augmentation plays an important role in the success of our system on challenging domains. We will perform research on the evaluation set once the labels are released to gain insights on the real effects of the approaches presented in the paper.
6. Acknowledgement
The research leading to the results presented in this paper has been (partially) supported by the EU H2020 research and innovation program under grant number 769872.
7. References
[1] A. Khosravani and M. M. Homayounpour, “Nonparametrically trained plda for short duration i-vector speaker verification,” Computer Speech & Language, vol. 52, pp. 105–122, 2018.
[2] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “First dihard challenge evaluation plan,” 2018. [Online]. Available: https://zenodo.org/record/1199638
[3] X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, and J. Hernando, “Hybrid speech/non-speech detector applied to speaker diarization of meetings,” in Speaker and Language Recognition Workshop, 2006. IEEE Odyssey 2006: The. IEEE, 2006, pp. 1–6.
[4] D. Dimitriadis and P. Fousek, “Developing on-line speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
[5] S. Thomas, G. Saon, M. Van Segbroeck, and S. S. Narayanan, “Improvements to the ibm speech activity detection system for the darpa rats program,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4500–4504.
[6] S. Chen, P. Gopalakrishnan et al., “Speaker, environment and channel change detection and clustering via the bayesian information criterion,” in Proc. DARPA broadcast news transcription and understanding workshop, vol. 8. Virginia, USA, 1998, pp. 127–132.
[7] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic segmentation, classification and clustering of broadcast news audio,” in Proc. DARPA speech recognition workshop, vol. 1997, 1997.
[13] S. Shum, N. Dehak, and J. Glass, “On the use of spectral and iterative methods for speaker diarization,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[14] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, “Speaker diarization with lstm,” arXiv preprint arXiv:1710.10468, 2017.
[15] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 4, pp. 788–798, 2011.
[16] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” Proc. Interspeech 2017, pp. 999–1003, 2017.
[17] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. of Interspeech, 2017.
[18] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.
[19] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” arXiv preprint arXiv:1710.10467, 2017.
[20] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matejka, and L. Burget, “End-to-end dnn based speaker recognition inspired by i-vector and plda,” arXiv preprint arXiv:1710.02369, 2017.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
[23] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 1, pp. 217–227, 2014.
[24] G. Sell and D. Garcia-Romero, “Diarization resegmentation in the factor analysis subspace,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4794–4798.
[25] F. Valente and C. Wellekens, “Variational bayesian methods for audio indexing,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 307–319.
[26] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4930–4934.
[27] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
[8] V. Gupta, “Speaker change point detection using deep neural
nets,” in Acoustics, Speech and Signal Processing (ICASSP), 2015
IEEE International Conference on. IEEE, 2015, pp. 4420–4424. [28] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering:
Analysis and an algorithm,” in Advances in neural information
[9] R. Yin, H. Bredin, and C. Barras, “Speaker change detection processing systems, 2002, pp. 849–856.
in broadcast tv using bidirectional long short-term memory net-
works,” in Proc. Interspeech 2017, 2017, pp. 3827–3831. [29] T. Caliński and J. Harabasz, “A dendrite method for cluster anal-
ysis,” Communications in Statistics-theory and Methods, vol. 3,
[10] M. Hrúz and Z. Zajı́c, “Convolutional neural network for speaker no. 1, pp. 1–27, 1974.
change detection in telephone speaker diarization system,” in
Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE [30] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur,
International Conference on. IEEE, 2017, pp. 4945–4949. “A study on data augmentation of reverberant speech for robust
speech recognition,” in Acoustics, Speech and Signal Processing
[11] P. Kenny, D. Reynolds, and F. Castaldo, “Diarization of telephone (ICASSP), 2017 IEEE International Conference on. IEEE, 2017,
conversations using factor analysis,” IEEE Journal of Selected pp. 5220–5224.
Topics in Signal Processing, vol. 4, no. 6, pp. 1059–1070, 2010.
[31] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
[12] G. Sell and D. Garcia-Romero, “Speaker diarization with plda i- N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
vector scoring and unsupervised calibration,” in Spoken Language “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on
Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 413– automatic speech recognition and understanding. IEEE Signal
417. Processing Society, 2011.
234
[32] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and
noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[33] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan-
pur, “X-vectors: Robust dnn embeddings for speaker recognition,”
Submitted to ICASSP, 2018.
[34] “The 2009 (rt-09) rich transcription meeting recognition eval-
uation plan,” http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/
rt09-meeting-eval-plan-v2.pdf, accessed on June 2, 2016.
235
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
{hzili1,pgarci27,jvillal7}@jhu.edu,[email protected],[email protected]
DOI: 10.21437/IberSPEECH.2018-49

We present JHU’s speaker diarization system for the IberSPEECH Diarization Evaluation. Our main goal is to test our current diarization system on other databases and in other possible scenarios. We provide a system solution for the open and closed condition scenarios, using Kaldi. We follow a basic pipeline with specific characteristics for each scenario. The pipeline for our system is as follows.

• Audio feature extraction
• Speech activity detection (SAD)
• Embedding extraction
• PLDA
• Score fusion (if possible)

After computing the MFCCs for each case, the system trained a TDNN SAD model on the Albayzin2016 labeled data following the Aspire recipe in Kaldi [3]. The network consists of 5 TDNN layers and 2 layers of statistics pooling [4]. The overall context of the neural network is around 1s, with around 0.8s of left context and 0.2s of right context. This approach is suitable for our purposes since it can include a wider context without affecting the number of parameters. For this special case, we trained the DNN with two classes: speech and non-speech. The speech segments include both clean voice and voice with noise. Other parts of the audio are considered as non-speech, which may include music, noise and silence. A simple Viterbi decoding on an HMM with duration constraints of 0.3s for speech and 0.1s for silence is used to get speech activity labels for the test
data recordings. An energy-based SAD was also tried in our experiments, but the results were worse overall.

4. Embeddings
We computed two different sets of embeddings depending on the condition. For the closed condition we focused on i-vectors. This approach computes i-vectors in the traditional way; it trains a T-matrix with Albayzin2016-only data. Afterwards, we obtained the i-vectors for the Albayzin2018 dev2 and test sets. We tried other DNN possibilities, but due to the small amount of data available the results were not promising.
For the open condition we examined four types of embeddings. The i-vectors-basic, trained on data set 3, obtained baseline results for Albayzin2018 dev2 and test. These i-vectors are of dimension 400.
The BNF-i-vectors (of dimension 600) use the bottleneck features computed from data set 2 to refine the GMM alignments. The rest of the i-vector pipeline remains the same; the T-matrix was also trained on data set 2.
We explored two types of DNN-based embedding architectures. The first one, the default Kaldi recipe for Voxceleb, is a TDNN for x-vector-basic [5, 6]. In this approach, each MFCC frame is passed through a sequence of TDNN layers. Then, a pooling layer accounts for the utterance-level processing and computes the mean and standard deviation of the TDNN output over time. This intermediate representation, known as the embedding, is projected to a lower dimension (512 in this case). The DNN outputs are the posterior probabilities of the training speakers. The objective function is cross entropy. We employed data set 3 for training the TDNN. The augmentation is performed as described in [7] using MUSAN noises (footnote 2).
For the second x-vector approach the pre-pooling layers are changed to factorized TDNNs (TDNN-F) with skip connections [8]. This architecture reduces the number of parameters in the network by factorizing the weight matrix of each TDNN layer into the product of two low-rank matrices. The first factor is forced to be semi-orthogonal, which prevents the loss of information when projecting from high to low dimension. As in other architectures, skip connections are an option for this TDNN-F: some layers receive as input the output of the previous layer and of other prior layers. The best solution so far is to have skip connections between low-rank interior layers in the TDNN-F. The x-vectors are of dimension 600.

5. PLDA and Score Fusion
For the closed condition we observed that the number of speakers estimated by the current approach was very high for the Albayzin2018 dev2 set. We decided to use PCA as in [9]. With this tuning strategy the system was able to take into account every recording for PCA rotation, instead of only the global PCA. This strategy also kept the number of speakers in a desirable range.
For the open condition, we used the traditional PLDA workflow, and the PLDA was trained on the Albayzin2016 data. We obtained four different types of scores, one for each of the four types of embeddings. We fused the four systems with equal weights.

6. Clustering
The system performed Agglomerative Hierarchical Clustering (AHC) to obtain a segmentation of the recordings following the Callhome diarization recipe [3]. To obtain an accurate estimate of the number of speakers and a better speaker segmentation, the system scans several thresholds until it finds an optimum on a hold-out dataset. We evaluated this approach using the Albayzin2018 dev2 dataset.

7. Experiments
In this section, we describe some experiments that give an indication of the overall performance of our system. We evaluated our systems with the Diarization Error Rate (DER), which is the most common metric for speaker diarization. The diarization error can be decomposed into speaker error, false alarm speech, missed speech and speaker overlap. Our DER tolerated errors within 250ms of a speaker transition and only scored the non-overlapping part of the segments because our model outputs a single label for each frame. Our systems were evaluated on the Albayzin2018 dev2 set.
We employed the Albayzin2018 dev2 set in the initial part of our experiments. This set was divided into two parts; we tuned the parameters on one part and computed the DER on the other, similarly to the Kaldi Callhome diarization recipe [3].
The DER results of the different systems for the open and closed conditions are shown in Table 1 and Table 2, respectively. We compared three different calibration strategies: supervised calibration, oracle calibration and more than 10s. In the supervised calibration we chose the optimal thresholds on the held-out cross-validation set. For oracle calibration, we used the oracle number of speakers for AHC. However, unlike traditional speaker diarization datasets such as CALLHOME, the utterances in the Albayzin2018 dev2 set were long and contained more speakers. The speech segments of some speakers were so limited that we didn’t want to create a new cluster for these speakers. The third column shows the DER results when we clustered with the oracle number of speakers that have more than 10 seconds of speech in the segment. It should be noted that, since the number of speakers is unknown for the test set, our final submission only used supervised calibration, and we report the DER results of oracle calibration on the development set just for reference.
As shown in Table 1, x-vector based systems outperform the i-vector based ones, which is consistent with previous studies. Among the four systems, the TDNN-F based x-vector performs the best. It outperforms the basic x-vector and i-vector by 1.27% and 3.90% absolute. Equal-weighted score fusion further reduces the DER to 9.39%. It is interesting that the DER of x-vector based systems degrades when clustering with the actual number of speakers while it improves for the i-vector based systems. This indicates that i-vector based systems require more prior knowledge of the number of speakers.
Table 2 shows the DER results for the closed condition. The i-vector system achieves a DER of 24.03%, which is further improved to 22.26% if clustering with the oracle number of speakers. However, the performance of the x-vector is not as good as that of the i-vector. We believe the reason is that we cannot obtain enough data to train a discriminative neural network. Even after data augmentation with the music, noise and speech we extracted from the Albayzin2016 dataset, the training set only contained 332 hours of speech, which was much smaller than the usual amount of data used to train the x-vector system. Besides, since the recordings were from TV programs, a large number of speakers didn’t have enough speech. The score fusion didn’t improve the system performance for the closed condition.

2 http://www.openslr.org/resources/17
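The threshold scan described in Section 6 can be illustrated with a small sketch. It is not the Kaldi recipe used by the authors: the average-linkage rule, the pairwise error used as a stand-in for DER, and the names `ahc`, `pairwise_error` and `pick_threshold` are all assumptions made for illustration, and the similarity matrix stands in for pairwise PLDA scores.

```python
import numpy as np


def ahc(scores, threshold):
    """Average-linkage AHC on a symmetric similarity matrix (higher = more similar).

    Clusters are merged greedily until no pair of clusters has an average
    similarity of at least `threshold`. Returns one cluster label per segment.
    """
    n = scores.shape[0]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = scores[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:  # stopping criterion: nothing similar enough is left
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


def pairwise_error(hyp, ref):
    """Fraction of segment pairs whose same/different-speaker decision is wrong.

    Hypothetical stand-in for DER, used only to keep the sketch short.
    """
    hyp, ref = np.asarray(hyp), np.asarray(ref)
    same_h = hyp[:, None] == hyp[None, :]
    same_r = ref[:, None] == ref[None, :]
    iu = np.triu_indices(len(hyp), k=1)
    return float(np.mean(same_h[iu] != same_r[iu]))


def pick_threshold(scores, ref, candidates):
    """Scan candidate thresholds and keep the one with the lowest held-out error."""
    errors = [pairwise_error(ahc(scores, t), ref) for t in candidates]
    return candidates[int(np.argmin(errors))]


# Toy held-out recording: segments 0-2 belong to one speaker, 3-4 to another.
ref = np.array([0, 0, 0, 1, 1])
scores = np.where(ref[:, None] == ref[None, :], 0.9, -0.5)
tuned = pick_threshold(scores, ref, [-1.0, 0.0, 2.0])
```

A threshold that is too low merges all segments into one speaker, while one that is too high leaves every segment in its own cluster, which is why the optimum is searched on held-out data.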
Table 1: DER (%) comparison of different systems for the open condition
Table 2: DER (%) comparison of different systems for the closed condition
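To make the DER decomposition from Section 7 concrete, the following is a simplified frame-level sketch. It assumes a single label per frame (0 for non-speech), ignores the 250 ms collar and overlapping speech, and finds the speaker mapping by brute force, which is only practical for a handful of speakers. The function name `frame_der` is hypothetical and this is not the scoring tool used in the evaluation.

```python
from itertools import permutations

import numpy as np


def frame_der(ref, hyp):
    """Frame-level DER for single-label sequences; label 0 means non-speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_speech = ref != 0
    total = int(ref_speech.sum())
    missed = int(np.sum(ref_speech & (hyp == 0)))
    false_alarm = int(np.sum(~ref_speech & (hyp != 0)))

    # Frames where both reference and hypothesis contain speech.
    both = ref_speech & (hyp != 0)
    ref_spk = sorted(set(ref[both].tolist()))
    hyp_spk = sorted(set(hyp[both].tolist()))

    # Enumerate one-to-one speaker mappings and keep the best one.
    if len(ref_spk) <= len(hyp_spk):
        pairings = (list(zip(ref_spk, p)) for p in permutations(hyp_spk, len(ref_spk)))
    else:
        pairings = (list(zip(p, hyp_spk)) for p in permutations(ref_spk, len(hyp_spk)))
    best_correct = 0
    for pairing in pairings:
        correct = sum(int(np.sum(both & (ref == r) & (hyp == h))) for r, h in pairing)
        best_correct = max(best_correct, correct)
    speaker_error = int(both.sum()) - best_correct

    return {
        "missed": missed / total,
        "false_alarm": false_alarm / total,
        "speaker_error": speaker_error / total,
        "der": (missed + false_alarm + speaker_error) / total,
    }


# Toy example: 10 frames, two reference speakers (1 and 2).
ref = [0, 1, 1, 1, 2, 2, 2, 2, 0, 0]
hyp = [0, 7, 7, 9, 9, 9, 9, 0, 7, 0]
result = frame_der(ref, hyp)
```

In this toy example one frame is missed, one is a false alarm and one is attributed to the wrong speaker, out of seven reference speech frames.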
From our experiments, we also observe that the SAD is of vital importance. Since we don’t know the oracle SAD marks, the quality of the SAD is directly associated with the DER performance. Three different SAD models were evaluated, among which the 5-layer TDNN model trained on in-domain data performs the best. It outperforms the TDNN model trained on Librispeech with the same network architecture by 2.27% absolute. The energy-based SAD is simple but its performance is worse than that of the TDNN models by a large margin. In our final system, we use the TDNN SAD trained on Albayzin2016 for both the open and closed conditions.

Table 3: DER (%) of basic x-vector system with different SAD

SAD                                  DER
energy based SAD                     21.52
TDNN SAD trained on Librispeech      15.66
TDNN SAD trained on Albayzin2016     13.39

8. Future Work
Although we largely reduce the diarization error with in-domain SAD and system fusion, there are still many problems to investigate. The first is the overlap problem. Our current system cannot handle overlapping speech, since it predicts a single speaker label for each frame. However, solving this problem is not easy. The whole diarization procedure might have to change to predict multiple labels for one frame. Second, as discussed above, the number of speakers estimated by the supervised calibration is not very close to the actual number. Besides, clustering with the oracle number of speakers sometimes even degrades the system. Whether there exist better methods to control the clustering process, especially for the condition with many speakers, requires further study. Third, due to the time limit, we didn’t include the re-segmentation process in our system. We will add this part later to see if it can further boost the system performance.

9. Conclusions
This is the submission for the JHU Diarization system. We tried our out-of-the-box system in this new scenario that contained broadcast news in a new language. Two main solutions were proposed for the closed and the open conditions: i-vector and x-vector. The i-vector was shown to be best suited for the closed condition, due to the small amount of training data. For the open condition, the best results were obtained by the x-vector based system. However, score fusion before the clustering gave noticeable improvements. We are still planning to do some re-segmentation in future versions.

10. Acknowledgements
The authors would like to thank David Snyder for his help in this project.

11. References
[1] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[2] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[3] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU2011. Waikoloa, HI, USA: IEEE, dec 2011, pp. 1–4.
[4] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, “Acoustic modelling from the signal domain using cnns.” in INTERSPEECH, 2016, pp. 3434–3438.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors : Robust DNN Embeddings for Speaker Recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. Alberta, Canada: IEEE, apr 2018, pp. 5329–5333.
[6] G. Sell, D. Snyder, A. Mccree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, “Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge,” in Proceedings of the 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, Hyderabad, India, sep 2018, pp. 2808–2812. [Online]. Available: http://www.danielpovey.com/files/2018 interspeech dihard.pdf http://dx.doi.org/10.21437/Interspeech.2018-1893
[7] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[8] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, “Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks,” in Proceedings of the 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, Hyderabad, India, sep 2018. [Online]. Available: http://danielpovey.com/files/2018 interspeech tdnnf.pdf
[9] C. Vaquero, A. Ortega, J. Villalba, A. Miguel, and E. Lleida, “Confidence measures for speaker segmentation and their relation to speaker verification,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
DOI: 10.21437/IberSPEECH.2018-50
sources (transcriptions of European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, online courses and transcriptions of the MAVIR sessions included in the development set (footnote 2) [14]). Specifically, two fourgram-based language models were trained following the Kneser-Ney discounting strategy using the SRILM toolkit [15], and the final LM was obtained by mixing both LMs using the SRILM static n-gram interpolation functionality. One of the LMs was trained using the RTVE2018 subtitles data provided for the Albayzin 2018 Text-to-Speech challenge and the other LM was built using the other text corpora. The LM vocabulary size was limited to the most frequent 300K words and, for each search task, the set of OOV keywords was removed from the language model.

2.2. PS system
A system based on phonetic search following the probabilistic retrieval model for information retrieval was developed for the STD task:

• Indexing. First, the phone transcription of each document is obtained, and then the documents are indexed in terms of phone n-grams of different size [16, 5]. According to the probabilistic retrieval model, each document is represented by means of a language model [4]. In this case, given that the phone transcriptions have errors, several hypotheses for the best transcription are used to improve the quality of the language model [6]. The start time and duration of each phone are also stored in the index.

• Search. First, a phonetic transcription of the query is obtained using the grapheme-to-phoneme model for Spanish included in Cotovia [17]. Then, the query is searched within the different indices, and a score for each document is computed following the query likelihood retrieval model [18]. It must be noted that this model sorts the documents according to how likely they are to contain the query, but the start and end times of the match are required in this task. To obtain these times, the phone transcription of the query is aligned to that of the document by computing their minimum edit distance, and this allows the recovery of the start and end times since they are stored in the index. In addition, the minimum edit distance is used to penalize the score returned by the query likelihood retrieval model as described in [6].

The minimum and maximum size of the n-grams were set to 1 and 5, respectively, according to [5]. The different hypotheses for the phone transcriptions of the documents were extracted from the phone lattice obtained employing the LVCSR system described above, and the number of hypotheses to be used for indexing was empirically set to 40. Indexing and search were performed using Lucene (footnote 3).

2.3. Fusion
Discriminative calibration and fusion were applied in order to combine the outputs of the different STD systems [19]. The global minimum score produced by the system for all queries was used to hypothesize the missing scores. After normalization, calibration and fusion parameters were estimated by logistic regression on a development dataset in order to obtain improved discriminative and well-calibrated scores [20]. Calibration and fusion training was performed using the Bosaris toolkit [21].

3. Query-by-example spoken term detection system
The primary system submitted for the QbESTD evaluation consists of the fusion of four systems. Three of those systems follow the same scheme: first, feature extraction is performed in order to represent the queries and documents by means of feature vectors; then, the queries are searched within the documents using a search approach based on DTW; finally, a score normalization step is performed. The other system is an adaptation of the PS system described above to the QbESTD task.

3.1. DTW-based systems

3.1.1. Speech representation
Three different approaches for speech representation were used; given a query Q with n frames (and equivalently, a document D with m frames), these representations result in a set Q = {q1, . . . , qn} of n vectors of dimension U (and equivalently, a set D = {d1, . . . , dm} of m vectors of dimension U):

• Phoneme posteriorgram (PhnPost). One subsystem relies on phoneme posteriorgrams [7] for speech representation: given a speech document and a phoneme recogniser with U phonetic units, the a posteriori probability of each phonetic unit is computed for each time frame, leading to a set of vectors of dimension U that represent the probability of each phonetic unit at every time instant. The English (EN) phone decoder developed by the Brno University of Technology was used to obtain phoneme posteriorgrams; in this decoder, each phonetic unit has three different states and a posterior probability is output for each of them, so they were combined in order to obtain one posterior probability for each unit [22]. After obtaining the posteriors, a Gaussian softening was applied in order to have Gaussian distributed probabilities [23].

• Low-level descriptors (LLD). A large set of features, summarised in Table 1, was used to represent the queries and documents; these features, obtained using the OpenSMILE feature extraction toolkit [24], were extracted every 10 ms using a 25 ms window, except for F0, probability of voicing, jitter, shimmer and HNR, for which a 60 ms window was used.

• Gaussian posteriorgram (GP). Gaussian posteriorgrams [9] were used to represent the audio documents and queries. Given a Gaussian mixture model (GMM) with U Gaussians, the a posteriori probability of each Gaussian is computed for each time frame, leading to a set of vectors of dimension U that represent the probability of each Gaussian at every time instant. In this system, 19 MFCCs were extracted from the waveforms, accompanied by their energy, delta and acceleration coefficients. Feature extraction and Gaussian posteriorgram computation were performed using the Kaldi toolkit [1]. The GMM was trained using MAVIR training and development data, as well as RTVE development recordings.

2 http://cartago.lllf.uam.es/mavir/index.pl?m=descargas
3 http://lucene.apache.org
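The Gaussian posteriorgram representation described in Section 3.1.1 can be sketched in a few lines. The authors computed it with Kaldi; the NumPy version below is only an illustration that assumes a diagonal-covariance GMM whose parameters are already trained, and the function name `gaussian_posteriorgram` and the toy parameter values are hypothetical.

```python
import numpy as np


def gaussian_posteriorgram(frames, weights, means, variances):
    """Posterior probability of each Gaussian for each frame.

    frames:    (T, F) feature vectors, one row per time frame
    weights:   (U,)   mixture weights
    means:     (U, F) component means
    variances: (U, F) diagonal covariances
    Returns a (T, U) matrix whose rows sum to one.
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, U, F)
    log_lik = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                             # (T, U)
    log_joint = np.log(weights) + log_lik
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


# Toy GMM with U = 2 Gaussians over F = 2 features.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
frames = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
post = gaussian_posteriorgram(frames, weights, means, variances)
```

Each row of `post` plays the role of one vector of dimension U in the sets Q and D used by the DTW search.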
Table 1: Acoustic features used in the proposed search on speech system.
Description # features
Sum of auditory spectra 1
Zero-crossing rate 1
Sum of RASTA style filtering auditory spectra 1
Frame intensity 1
Frame loudness 1
Root mean square energy and log-energy 2
Energy in frequency bands 250-650 Hz (energy 250-650) and 1000-4000 Hz 2
Spectral Rolloff points at 25%, 50%, 75%, 90% 4
Spectral flux 1
Spectral entropy 1
Spectral variance 1
Spectral skewness 1
Spectral kurtosis 1
Psychoacoustical sharpness 1
Spectral harmonicity 1
Spectral flatness 1
Mel-frequency cepstral coefficients 16
MFCC filterbank 26
Line spectral pairs 8
Cepstral perceptual linear predictive coefficients 9
RASTA PLP coefficients 9
Fundamental frequency (F0) 1
Probability of voicing 1
Jitter 2
Shimmer 1
log harmonics-to-noise ratio (logHNR) 1
LCP formant frequencies and bandwidths 6
Formant frame intensity 1
Deltas 102
Total 204
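The subsequence DTW search that Section 3.1.2 defines through Eqs. (1) to (4) can be illustrated with a short NumPy sketch. It uses zero-based indices, so the last query row is n-1 rather than n. Dividing the cumulative cost by the path length K is one assumed form of the length normalization, since the text only cites [27] for it; z-norm and the search for several candidate matches are left out. The names `pearson_cost` and `sdtw` are hypothetical.

```python
import numpy as np


def pearson_cost(Q, D, eps=1e-12):
    """Cost c(q_i, d_j) = (1 - r(q_i, d_j)) / 2 for all pairs, as an (n, m) matrix."""
    Qc = Q - Q.mean(axis=1, keepdims=True)
    Dc = D - D.mean(axis=1, keepdims=True)
    Qn = Qc / (np.linalg.norm(Qc, axis=1, keepdims=True) + eps)
    Dn = Dc / (np.linalg.norm(Dc, axis=1, keepdims=True) + eps)
    r = Qn @ Dn.T  # Pearson correlation between every query and document frame
    return (1.0 - r) / 2.0


def sdtw(Q, D):
    """Subsequence DTW: returns (start frame a, end frame b, normalized cost)."""
    c = pearson_cost(Q, D)
    n, m = c.shape
    M = np.empty((n, m))
    M[0, :] = c[0, :]                      # first row: a match may start anywhere
    for i in range(1, n):
        M[i, 0] = c[i, 0] + M[i - 1, 0]    # first column
        for j in range(1, m):
            M[i, j] = c[i, j] + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])

    b = int(np.argmin(M[n - 1, :]))        # end of the best warping path

    # Backtrack from (n-1, b) to the first row to recover the start frame.
    i, j, K = n - 1, b, 1
    while i > 0:
        if j == 0:
            i -= 1
        else:
            step = int(np.argmin([M[i - 1, j - 1], M[i - 1, j], M[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        K += 1
    return j, b, M[n - 1, b] / K


# Toy example: the query is an exact copy of document frames 5..8.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 6))
Q = D[5:9].copy()
a, b, score = sdtw(Q, D)
```

To look for further occurrences of the query, the entry of the last row at the detected end frame can be set to infinity and the arg-min and backtracking repeated, as the section describes.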
3.1.2. Search algorithm
The search stage was carried out using the subsequence DTW (S-DTW) [25] variant of the classical DTW approach. To perform S-DTW, first a cost matrix $M \in \mathbb{R}^{n \times m}$ must be defined, in which the rows and columns correspond to the query and document frames, respectively:

\[
M_{i,j} =
\begin{cases}
c(q_i, d_j) & \text{if } i = 0 \\
c(q_i, d_j) + M_{i-1,0} & \text{if } i > 0,\ j = 0 \\
c(q_i, d_j) + M^{*}(i,j) & \text{else}
\end{cases}
\tag{1}
\]

where $c(q_i, d_j)$ is a function that defines the cost between the query vector $q_i$ and the document vector $d_j$, and

\[
M^{*}(i,j) = \min\left(M_{i-1,j},\ M_{i-1,j-1},\ M_{i,j-1}\right)
\tag{2}
\]

Pearson’s correlation coefficient r [26] was the metric used to define the cost function by mapping it into the interval [0,1] applying the following transformation:

\[
c(q_i, d_j) = \frac{1 - r(q_i, d_j)}{2}
\tag{3}
\]

Once matrix M is computed, the end of the best warping path between Q and D is obtained as

\[
b^{*} = \underset{b \in 1,\ldots,m}{\arg\min}\ M(n, b)
\tag{4}
\]

The starting point of the path ending at $b^{*}$, namely $a^{*}$, is computed by backtracking, hence obtaining the best warping path $P(Q, D) = \{p_1, \ldots, p_k, \ldots, p_K\}$, where $p_k = (i_k, j_k)$, i.e. the k-th element of the path is formed by $q_{i_k}$ and $d_{j_k}$, and K is the length of the warping path.
It is possible that a query Q appears several times in a document D, especially if D is a long recording. Hence, not only the best warping path must be detected but also others that are less likely. One approach to overcome this issue consists in detecting a given number of candidate matches $n_c$: every time a warping path that ends at frame $b^{*}$ is detected, $M(n, b^{*})$ is set to $\infty$ in order to ignore this element in the future.
A score must be assigned to every detection of a query Q in a document D. First, the cumulative cost of the warping path $M_{n,b^{*}}$ is length-normalized [27] and, after that, z-norm is applied so that all the scores of all the queries have the same distribution [28].

3.2. PS system
The system described in Section 2.2 was also used for QbESTD. Since, in this experimental setup, the queries are spoken, the LVCSR system described in Section 2.1 was used to obtain phone transcriptions of the queries. In this system, the number of transcription hypotheses of the documents was empirically set to 50.

3.3. Fusion
The fusion strategy described in Section 2.3 was used to combine the QbESTD systems described in this section.

4. Preliminary Results
The systems described in the previous sections were evaluated in terms of the average term weighted value (ATWV) and maximum term weighted value (MTWV), which are the evaluation metrics defined for Albayzin 2018 evaluation. The results included in this section were achieved using the development data provided by the organizers. Since two different datasets (MAVIR and RTVE) were used for development, and in order to avoid overfitting when choosing the decision threshold, the groundtruth labels of MAVIR and RTVE were joined into a single set (namely MAVIR+RTVE) to compute the decision
threshold, which was subsequently applied to each dataset individually.

4.1. STD experiments
Table 2 shows the results achieved using the systems described in Section 2. The table also includes an additional system, namely LVCSR-NP, which consists of the aforementioned LVCSR without using the proxy words strategy for OOV terms; this means that the LVCSR-NP system does not detect any OOV terms. Comparing the LVCSR-NP and LVCSR systems, it can be seen that using the proxy words strategy is beneficial, especially when dealing with MAVIR data. The table also shows that the PS system outperforms the LVCSR system on the RTVE dataset, and it also leads to a better overall result. The combination of both systems achieves a significant improvement in all the experimental conditions, which suggests that both strategies are strongly complementary.

Table 2: STD results on development data

                     MAVIR             RTVE              MAVIR+RTVE
System               MTWV    ATWV      MTWV    ATWV      MTWV    ATWV
LVCSR (con1)         0.5314  0.5179    0.5976  0.5798    0.5992  0.5991
PS (con2)            0.4828  0.4739    0.6286  0.5993    0.6173  0.6167
LVCSR-NP (con3)      0.5068  0.4079    0.5801  0.5794    0.5704  0.5700
Fusion (pri)         0.5470  0.5290    0.6550  0.6183    0.6826  0.6791

4.2. QbESTD experiments
Table 3 shows the results achieved by the QbESTD systems described in Section 3. The best performance on MAVIR data was achieved with the PS system, which also exhibited the lowest performance on RTVE data. The PhnPost and LLD systems achieved almost the same results for RTVE and MAVIR+RTVE data.
The table also displays the results obtained when fusing the three DTW approaches (Fusion DTW) and when fusing the four systems (Fusion). The MTWV is always higher when fusing the four systems but, for the individual datasets, the ATWV is higher when fusing only the DTW systems. Nevertheless, the overall result is better when combining the four systems, so this system was selected as the primary (pri) for this evaluation, while the fusion of the three DTW systems was presented as contrastive (con1).

Table 3: QbESTD results on development data

                     MAVIR             RTVE              MAVIR+RTVE
System               MTWV    ATWV      MTWV    ATWV      MTWV    ATWV
PhnPost (con2)       0.1971  0.1742    0.7145  0.7081    0.5180  0.5160
LLD                  0.2017  0.1774    0.7136  0.7114    0.5156  0.5136
GP                   0.1877  0.1628    0.6731  0.6718    0.4856  0.4841
PS (con3)            0.2383  0.2029    0.3540  0.3528    0.3519  0.3507
Fusion DTW (con1)    0.2699  0.2649    0.7211  0.7076    0.5471  0.5451
Fusion (pri)         0.2896  0.2470    0.7273  0.6964    0.6195  0.6174

5. Conclusions and future work
This paper presented the systems developed for the STD and QbESTD tasks of the Albayzin 2018 Search on Speech evaluation. The STD system consists of a fusion of an LVCSR system with a phonetic search approach based on the probabilistic retrieval model for information retrieval. The LVCSR system relied on the proxy words approach for OOV words, which were also managed by the phonetic search system. The QbESTD system is a fusion of three DTW-based systems with the phonetic search system used in the STD task.
The performance obtained in the STD and QbESTD tasks is not straightforwardly comparable because the queries used to compute the evaluation metrics are not the same for both tasks, but the results suggest that spoken queries lead to better results on the RTVE dataset. This might be caused by a greater amount of OOV words, so this will be investigated by further analysis of the results.
In future work, a system that combines word-level and phone-level representations with the probabilistic retrieval model for information retrieval will be assessed. This idea is motivated by the fact that, according to the results exhibited in the STD task, the LVCSR and phonetic search systems are strongly complementary, and designing smart combination strategies might improve the performance of logistic regression fusion.
The DTW-based systems for QbESTD used in this paper are language-independent, i.e. the system can be used regardless of the language spoken in the recordings. Given that an LVCSR system for Spanish was trained for the STD system, the use of the activations of the LVCSR network will be investigated in future work in order to assess QbESTD performance in a language-dependent setting.

6. Acknowledgements
This work has received financial support from i) “Ministerio de Economía y Competitividad” of the Government of Spain and the European Regional Development Fund (ERDF) under the research projects TIN2015-64282-R and TEC2015-65345-P, ii) Xunta de Galicia (projects GPC ED431B 2016/035 and GRC 2014/024), and iii) Xunta de Galicia - “Consellería de Cultura, Educación e Ordenación Universitaria” and the ERDF through the 2016-2019 accreditations ED431G/01 (“Centro singular de investigación de Galicia”) and ED431G/04 (“Agrupación estratéxica consolidada”).

7. References
[1] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[2] D. Can and M. Saraclar, “Lattice indexing for spoken term detection.” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
[3] G. Chen, O. Yilmaz, J. Trmal, D. Povey, and S. Khudanpur, “Using proxies for OOV keywords in the keyword search task,” in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2013, pp. 416–421.
[4] J. Ponte and W. Croft, “A language modeling approach to information retrieval,” in Proceedings of ACM SIGIR, 1998, pp. 275–281.
[5] P. Lopez-Otero, J. Parapar, and A. Barreiro, “Efficient query-by-example spoken document retrieval combining phone multigram
representation and dynamic time warping,” Information Processing and Management, vol. 56, pp. 43–60, 2019.
[6] P. Lopez-Otero, J. Parapar, and A. Barreiro, “Probabilistic information retrieval models for query-by-example spoken document retrieval,” Speech Communication (submitted), 2018.
[7] T. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 421–426.
[8] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “Finding relevant features for zero-resource query-by-example search on speech,” Speech Communication, vol. 84, pp. 24–35, 2016.
[9] Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 398–403.
[10] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), no. 8, 2013, pp. 2345–2349.
[11] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in Proceedings of ICASSP, 2014, pp. 2494–2498.
[12] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiát, S. Kombrink, P. Motlícek, Y. Qian, K. Riedhammer, K. Veselý, and N. T. Vu, “Generating exact lattices in the WFST framework,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012, pp. 4213–4216.
[13] C. Garcia-Mateo, J. Dieguez-Tirado, L. Docio-Fernandez, and A. Cardenal-Lopez, “Transcrigal: A bilingual system for automatic indexing of broadcast news,” in Proc. Int. Conf. on Language Resources and Evaluation, 2004.
[14] A. M. Sandoval and L. C. Llanos, “MAVIR: a corpus of spontaneous formal speech in Spanish and English,” in Iberspeech 2012: VII Jornadas en Tecnología del Habla and III SLTech Workshop, 2012.
[15] A. Stolcke, J. Zheng, W. Wang, and V. Abrash, “SRILM at Sixteen: Update and outlook,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, December 2011.
[16] J. Parapar, A. Freire, and A. Barreiro, “Revisiting n-gram based models for retrieval in degraded large collections,” in Proceedings of the 31st European Conference on Information Retrieval Research: Advances in Information Retrieval, ser. Lecture Notes in Computer Science, vol. 5478. Springer International Publishing, 2009, pp. 680–684.
[17] E. Rodríguez-Banga, C. Garcia-Mateo, F. Méndez-Pazó, M. González-González, and C. Magariños, “Cotovía: an open source TTS for Galician and Spanish,” in Proceedings of Iberspeech 2012, 2012, pp. 308–315.
[18] C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[19] A. Abad, L. J. Rodríguez-Fuentes, M. Peñagarikano, A. Varona, and G. Bordel, “On the calibration and fusion of heterogeneous spoken term detection systems,” in Proceedings of Interspeech, 2013, pp. 20–24.
[20] N. Brümmer and D. van Leeuwen, “On calibration of language recognition scores,” in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, 2006, pp. 1–8.
[21] N. Brümmer and E. de Villiers, “The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing,” Tech. Rep., 2011. [Online]. Available: https://sites.google.com/site/nikobrummer
[22] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez, “GTTS systems for the SWS task at MediaEval 2013,” in Proceedings of the MediaEval 2013 Workshop, 2013.
[23] A. Varona, M. Penagarikano, L. Rodriguez-Fuentes, and G. Bordel, “On the use of lattices of time-synchronous cross-decoder phone co-occurrences in a SVM-phonotactic language recognition system,” in 12th Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 2901–2904.
[24] F. Eyben, M. Wöllmer, and B. Schuller, “OpenSMILE - the Munich versatile and fast open-source audio feature extractor,” in Proceedings of ACM Multimedia (MM), 2010, pp. 1459–1462.
[25] M. Müller, Information Retrieval for Music and Motion. Springer-Verlag, 2007.
[26] I. Szöke, M. Skácel, and L. Burget, “BUT QUESST2014 system description,” in Proceedings of the MediaEval 2014 Workshop, 2014.
[27] A. Abad, R. Astudillo, and I. Trancoso, “The L2F spoken web search system for Mediaeval 2013,” in Proceedings of the MediaEval 2013 Workshop, 2013.
[28] I. Szöke, L. Burget, F. Grézl, J. Černocký, and L. Ondel, “Calibration and fusion of query-by-example systems - BUT SWS 2013,” in Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7899–7903.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
DOI: 10.21437/IberSPEECH.2018-51
Figure 2: Query detection stages.

$$
c(x_n, y_m) = \frac{1 - r(x_n, y_m)}{2}, \qquad (2)
$$

where c(x_n, y_m) represents the cost matrix used during the S-DTW search.

Therefore, the cost c(x_n, y_m) ranges from 0 (when r = 1), through 0.5 (when r = 0), to 1 (when r = −1). Figure 3 represents the cost matrix example with the standard Pearson correlation coefficient computation.

Figure 3: Cost matrix example with the standard Pearson correlation coefficient.

The final cost used during the search has been modified as follows: when r ≤ 0, r is assigned the value 0. Next, c(x_n, y_m) = 1 − r(x_n, y_m). Therefore, for all the Pearson correlation coefficient values lower than or equal to 0, the cost equals 1.

Figure 4: Cost matrix example with the modified Pearson correlation coefficient.

2.3. Subsequence Dynamic Time Warping-based search

The S-DTW algorithm [26] has been used to hypothesize query detections within the utterances. From the cost matrix c(x_n, y_m), the accumulated cost matrix employed within the search is computed as given in Equation 3:

$$
D_{n,m} =
\begin{cases}
c(x_n, y_m) & \text{if } n = 0 \\
c(x_n, y_m) + D_{n-1,0} & \text{if } n > 0,\ m = 0 \\
c(x_n, y_m) + D^{*}(n, m) & \text{else,}
\end{cases}
\qquad (3)
$$

where

$$
D^{*}(n, m) = \min\left(D_{n-1,m},\ D_{n-1,m-1},\ D_{n,m-1}\right), \qquad (4)
$$

which implies that only horizontal, vertical, and diagonal path movements are allowed in the search.
Figure 5 shows the accumulated cost matrix from the cost
matrix presented in Figure 3 (i.e., with the standard Pearson cor-
relation coefficient computation), and Figure 6 shows that of the
cost matrix presented in Figure 4 (i.e., with the modified Pear-
son correlation coefficient). The accumulated cost matrix from
the modified Pearson correlation coefficient shows more cost in
non-occurrence regions, which favors the final query detection.
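The cost definitions and the accumulated cost recursion of Equations 2 to 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the evaluation: it assumes that x_n indexes the query frames and y_m the utterance frames, that both are stored as NumPy arrays with one feature vector per row, and that no frame is constant (so the Pearson normalisation is well defined). The function names are hypothetical.

```python
import numpy as np

def pearson_cost(X, Y, modified=True):
    """Cost matrix between query frames X (N x d) and utterance frames Y (M x d).

    modified=False: c = (1 - r) / 2      (standard Pearson-based cost, Eq. 2)
    modified=True : c = 1 - max(r, 0)    (r <= 0 is clipped to 0 before use)
    """
    # Centre and length-normalise each frame so a dot product yields Pearson's r.
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Xc = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Yc = Yc / np.linalg.norm(Yc, axis=1, keepdims=True)
    r = Xc @ Yc.T
    if modified:
        return 1.0 - np.maximum(r, 0.0)
    return (1.0 - r) / 2.0

def accumulated_cost(c):
    """Accumulated cost matrix D of Eq. 3 for a cost matrix c of shape (N, M)."""
    N, M = c.shape
    D = np.empty((N, M))
    D[0, :] = c[0, :]                      # n = 0: a path may start at any utterance frame
    for n in range(1, N):
        D[n, 0] = c[n, 0] + D[n - 1, 0]    # m = 0: only vertical moves are possible
        for m in range(1, M):
            # Eq. 4: vertical, diagonal and horizontal predecessors
            D[n, m] = c[n, m] + min(D[n - 1, m], D[n - 1, m - 1], D[n, m - 1])
    return D
```

With the modified cost, every frame pair with r ≤ 0 contributes the maximum cost of 1, which is what raises the accumulated cost in non-occurrence regions.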
Figure 7: Query detection example from the accumulated cost
matrix in Figure 6.
5. Acknowledgements

This work was partially supported by the project “DSSL: Redes Profundas y Modelos de Subespacios para Detección y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz” (TEC2015-68172-C2-1-P, MINECO/FEDER).

6. References

[1] J. G. Fiscus, J. Ajot, J. S. Garofolo, and G. Doddington, “Results of the 2006 spoken term detection evaluation,” in Proc. of SSCS, 2007, pp. 45–50.
[2] NIST, The Ninth Text REtrieval Conference (TREC 9), Accessed: February, 2018. [Online]. Available: http://trec.nist.gov
[3] H. Joho and K. Kishida, “Overview of the NTCIR-11 SpokenQuery&Doc Task,” in Proc. of NTCIR-11, 2014, pp. 1–7.
[4] X. Anguera, F. Metze, A. Buzo, I. Szöke, and L. J. Rodriguez-Fuentes, “The spoken web search task,” in Proc. of MediaEval, 2013, pp. 921–922.
[5] X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze, “Query by example search on speech at Mediaeval 2014,” in Proc. of MediaEval, 2014, pp. 351–352.
[6] NIST, Draft KWS14 Keyword Search Evaluation Plan, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, December 2013. [Online]. Available: https://www.nist.gov/sites/default/files/documents/itl/iad/mig/KWS14-evalplan-v11.pdf
[7] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, A. Cardenal, J. D. Echeverry-Correa, A. Coucheiro-Limeres, J. Olcoz, and A. Miguel, “Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 21, pp. 1–27, 2015.
[8] J. Tejedor, D. T. Toledano, X. Anguera, A. Varona, L. F. Hurtado, A. Miguel, and J. Colás, “Query-by-example spoken term detection ALBAYZIN 2012 evaluation: overview, systems, results, and discussion,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, no. 23, pp. 1–17, 2013.
[9] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2016, no. 1, pp. 1–19, 2016.
[10] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, L. Serrano, I. Hernaez, A. Coucheiro-Limeres, J. Ferreiros, J. Olcoz, and J. Llombart, “Albayzin 2016 spoken term detection evaluation: an international open competitive evaluation in Spanish,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2017, no. 22, pp. 1–23, 2017.
[11] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, J. Proença, F. Perdigão, F. García-Granada, E. Sanchis, A. Pompili, and A. Abad, “Albayzin query-by-example spoken term detection 2016 evaluation,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 2, pp. 1–25, 2018.
[12] N. Sakamoto, K. Yamamoto, and S. Nakagawa, “Combination of syllable based N-gram search and word search for spoken term detection through spoken queries and IV/OOV classification,” in Proc. of ASRU, 2015, pp. 200–206.
[13] R. Konno, K. Ouchi, M. Obara, Y. Shimizu, T. Chiba, T. Hirota, and Y. Itoh, “An STD system using multiple STD results and multiple rescoring method for NTCIR-12 SpokenQuery&Doc task,” in Proc. of NTCIR-12, 2016, pp. 200–204.
[14] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “Phonetic unit selection for cross-lingual Query-by-Example spoken term detection,” in Proc. of ASRU, 2015, pp. 223–229.
[15] A. H. H. N. Torbati and J. Picone, “A nonparametric Bayesian approach for spoken term detection by example query,” in Proc. of Interspeech, 2016, pp. 928–932.
[16] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised bottleneck features for low-resource Query-by-Example spoken term detection,” in Proc. of Interspeech, 2016, pp. 923–927.
[17] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Pairwise learning using multi-lingual bottleneck features for low-resource Query-by-Example spoken term detection,” in Proc. of ICASSP, 2017, pp. 5645–5649.
[18] S. Oishi, T. Matsuba, M. Makino, and A. Kai, “Combining state-level spotting and posterior-based acoustic match for improved query-by-example spoken term detection,” in Proc. of Interspeech, 2016, pp. 740–744.
[19] M. Obara, K. Kojima, K. Tanaka, S. wook Lee, and Y. Itoh, “Rescoring by combination of posteriorgram score and subword-matching score for use in Query-by-Example,” in Proc. of Interspeech, 2016, pp. 1918–1922.
[20] C.-C. Leung, L. Wang, H. Xu, J. Hou, V. T. Pham, H. Lv, L. Xie, X. Xiao, C. Ni, B. Ma, E. S. Chng, and H. Li, “Toward high-performance language-independent Query-by-Example spoken term detection for MediaEval 2015: Post-Evaluation analysis,” in Proc. of Interspeech, 2016, pp. 3703–3707.
[21] H. Xu, J. Hou, X. Xiao, V. T. Pham, C.-C. Leung, L. Wang, V. H. Do, H. Lv, L. Xie, B. Ma, E. S. Chng, and H. Li, “Approximate search of audio queries by using DTW with phone time boundary and data augmentation,” in Proc. of ICASSP, 2016, pp. 6030–6034.
[22] J. Tejedor and D. T. Toledano, The ALBAYZIN 2018 Search on Speech Evaluation Plan, Universidad San Pablo CEU, Universidad Autónoma de Madrid, Madrid, Spain, June 2018. [Online]. Available: http://iberspeech2018.talp.cat/index.php/albayzin-evaluation-challenges/search-on-speech-evaluation/
[23] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “GTM-UVigo systems for Albayzin 2016 search on speech evaluation,” in Proc. of IberSPEECH, 2016, pp. 306–314.
[24] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. dissertation, FIT, BUT, Brno, Czech Republic, 2008.
[25] I. Szöke, M. Skacel, and L. Burget, “BUT QUESST 2014 system description,” in Proc. of MediaEval, 2014, pp. 621–622.
[26] M. Müller, Information Retrieval for Music and Motion. New York: Springer-Verlag, 2007.
DOI: 10.21437/IberSPEECH.2018-52

Abstract

This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has also been tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.

Index Terms: Spoken Term Detection, Query-by-Example Spoken Term Detection, Bottleneck features, Dynamic Time Warping

1. Introduction

The main goal of the participation of GTTS in the Albayzin 2018 Search on Speech Evaluation was to upgrade the QbE-STD systems that we developed for MediaEval 2013 and 2014 [1] [2] by: (1) using an external VAD module; (2) replacing phonetic posteriors by bottleneck features as frame-level features (which might imply changing some aspects of the methodology); and (3) handling the case of no development data by applying cross-condition calibration and heuristic thresholding. Also, as a proof-of-concept attempt to overcome the issue of OOV words, we have developed STD systems by applying public TTS APIs to synthesize a number of spoken instances of each term, then computing an average query (following the approach developed for MediaEval 2013 [3]) and using it to perform QbE-STD.

QbE-STD results on development data reveal that calibration models are not well estimated, probably due to a lack of detections for target trials. In particular, the threshold given by the application model parameters (P_target = 0.0001, C_miss = 1 and C_fa = 0.1) is too high compared to the MTWV threshold.

In the case of STD, the lack of detections is even more remarkable, with P_miss ≈ 0.8 for extremely low thresholds. This could be due to an acoustic mismatch between the synthesized queries and the test audio signals, which might be blocking the DTW-based search, since the score yielded by the best match would fall below the established threshold. The low number of detections (especially for target trials) not only yields high miss error rates but also causes the calibration model to be poorly estimated. This may explain why the MTWV threshold for STD scores on development data is much lower than that provided by the application model.

In both cases (QbE-STD and STD), the parameters of the DTW-based search should be further tuned in order to get a larger number of target and non-target detections.

2. QbE-STD systems

In the design of our previous QbE-STD systems, the front-end exploited existing software (e.g. the BUT phone decoders for Czech, Hungarian and Russian [4]), whereas the backend (search, calibration and fusion) was almost entirely developed by our group (and collaborators) [3][5]. For this evaluation, we have applied an external VAD module and extracted a new set of frame-level features.

2.1. Voice Activity Detection (VAD)

In our previous QbE-STD systems, VAD was performed by using the posteriors provided by the BUT phone decoders: first, the posteriors of non-phonetic units were added; then, if this aggregated non-phonetic posterior was higher than any other (phonetic) posterior, the frame was labeled as non-speech; otherwise it was labeled as speech. In this evaluation, we are not using posteriors anymore, so we cannot apply the same procedure. Instead, we apply the Python interface to a VAD module developed by Google for the WebRTC project [6], based on Gaussian distributions of speech and non-speech features. Given an audio file, our VAD module produces two output files: the first one (required by the feature extraction module) is an HTK .lab file specifying speech and non-speech segments, whereas the second one (used by the search procedure) is a text file (.txt) containing a sequence of 1’s and 0’s (one per line) indicating speech/non-speech frames.

2.2. Bottleneck features

We have updated the feature extraction module of our previous QbE-STD systems with the stacked bottleneck features (sBNF) recently presented by BUT/Phonexia [7]. Actually, three different neural networks are applied, each one trained to classify a different set of acoustic units and later optimized for language
recognition tasks. The first network was trained on telephone speech (8 kHz) from the English Fisher corpus [8] with 120 monophone state targets (FisherMono); the second one was also trained on the Fisher corpus but with 2423 triphone tied-state targets (FisherTri); the third network was trained on telephone speech (8 kHz) in 17 languages taken from the IARPA Babel program [9], with 3096 stacked monophone state targets for the 17 languages involved (BabelMulti).

The architecture of these networks consists of two stages. The first one is a standard bottleneck network fed with low-level acoustic features spanning 10 frames (100 ms), the bottleneck size being 80. The second stage takes as input five equally spaced BNFs of the first stage, spanning 31 frames (310 ms), and is trained on the same targets as the first stage, with the same bottleneck size (80). The bottleneck features extracted from the second stage are known as stacked bottleneck features (sBNF). Alternatively, instead of sBNF, the extractor can output target posteriors.

The operation of BUT/Phonexia sBNF extractors requires an external VAD module providing speech/non-speech information through an HTK .lab file. If no external VAD is provided, a simple energy-based VAD is computed internally. In this evaluation, we have applied the WebRTC VAD described above.

Our first aim was to replace old BUT by new BUT/Phonexia posteriors, but the huge size of FisherTri (2423) and BabelMulti (3096) targets required some kind of selection, clustering or dimensionality reduction approach. So, given that (at least theoretically) the same information is conveyed by sBNF’s, with a suitably low dimensionality (80), we decided to switch from posteriors to sBNF’s. We were aware that this change may make us pay a high price. Posteriors have a clear meaning, they can be linearly combined and their values suitably fall within the range [0,1], which makes the − log cos(α) distance also range in [0,1] (α being the angle between two vectors of posteriors), with very good results reported in our previous works.

On the other hand, bottleneck layer activations have no clear meaning, we don’t really know if they can be linearly combined (e.g. for computing an average query from multiple query instances), and their values are unbounded, so the − log cos(α) distance no longer applies. Is there any other distance working fine with sBNF? This evaluation poses a great opportunity to address these issues.

2.3. DTW-based search

To perform the search of spoken queries in audio documents, we basically follow the DTW-based approach presented in [3]. In the following, we summarize the approach and the modifications introduced for this evaluation.

Given two sequences of sBNF’s corresponding to a spoken query and an audio document, we first apply VAD to discard non-speech frames, while keeping the timestamp of each frame. To avoid memory issues, audio documents are split into chunks of 5 minutes, overlapping by 5 seconds, and processed independently. This chunking process is key to the speed and feasibility of the search procedure.

Let us consider the VAD-filtered sequences corresponding to a query q = (q[1], q[2], . . . , q[m]) and an audio document x = (x[1], x[2], . . . , x[n]), of length m and n, respectively. Since sBNF’s (theoretically) range from −∞ to +∞, we define the distance between any pair of vectors, q[i] and x[j], as follows:

$$
d(q[i], x[j]) = -\log\left(1 + \frac{q[i] \cdot x[j]}{|q[i]| \cdot |x[j]|}\right) + \log 2 \qquad (1)
$$

Note that d(v, w) ≥ 0, with d(v, w) = 0 if and only if v and w are aligned and pointing in the same direction, and d(v, w) = +∞ if and only if v and w are aligned and pointing in opposite directions.

The distance matrix computed according to Eq. 1 is normalized with regard to the audio document x, as follows:

$$
d_{norm}(q[i], x[j]) = \frac{d(q[i], x[j]) - d_{min}(i)}{d_{max}(i) - d_{min}(i)} \qquad (2)
$$

where:

$$
d_{min}(i) = \min_{j=1,\ldots,n} d(q[i], x[j]) \qquad (3)
$$

$$
d_{max}(i) = \max_{j=1,\ldots,n} d(q[i], x[j]) \qquad (4)
$$

In this way, matrix values are in the range [0, 1] and a perfect match would produce a quasi-diagonal sequence of zeroes. This can be seen as test normalization since, given a query q, distance matrices take values in the same range (and with the same relative meaning), no matter the acoustic conditions, the speaker, etc. of the audio document x.

Note that the chunking process described above makes the normalization procedure differ from that applied in [3], since d_min(i) and d_max(i) are not computed for the whole audio document but for each chunk independently. On the other hand, considering chunks of 5 minutes might be beneficial, since normalization is performed in a more local fashion, that is, more suited to the speaker(s) and acoustic conditions of each particular chunk.

The best match of a query q of length m in an audio document x of length n is defined as that minimizing the average distance in a crossing path of the matrix d_norm. A crossing path starts at any given frame of x, k1 ∈ [1, n], then traverses a region of x which is optimally aligned to q (involving L vector alignments), and ends at frame k2 ∈ [k1, n]. The average distance in this crossing path is:

$$
d_{avg}(q, x) = \frac{1}{L} \sum_{l=1}^{L} d_{norm}(q[i_l], x[j_l]) \qquad (5)
$$

where i_l and j_l are the indices of the vectors of q and x in the alignment l, for l = 1, 2, . . . , L. Note that i_1 = 1, i_L = m, j_1 = k1 and j_L = k2. The optimization procedure is Θ(n · m · d) in time (d: size of feature vectors) and Θ(n · m) in space. For details, we refer to [3].

The detection score is computed as 1 − d_avg(q, x), thus ranging from 0 to 1, being 1 only for a perfect match. The starting time and the duration of each detection are obtained by retrieving the time offsets corresponding to frames k1 and k2 in the VAD-filtered audio document.

This procedure is iteratively applied to find not only the best match but also less likely matches in the same audio document. To that end, a queue of search intervals is defined and initialized with [1, n]. Let us consider an interval [a, b], and assume that the best match is found at [a′, b′]; then the intervals [a, a′ − 1] and [b′ + 1, b] are added to the queue (for further processing) only if the following conditions are satisfied: (1) the score of the current match is greater than a given threshold T (in this evaluation, T = 0.85); (2) the interval is long enough (in this evaluation, half the query length: m/2); and (3) the number of matches (those already found + those waiting in the queue) is less than a given threshold M (in this evaluation, M = 7). An example is shown in Figure 1. Finally, the list of matches for each query is ranked according to the scores and truncated to the N highest scores (in this evaluation, N = 1000, though this limit was effectively applied only in a few cases).
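As an illustration of Eqs. 1 to 5, the following sketch computes the distance matrix, applies the per-query-frame (test) normalization, and turns an alignment path into a detection score. It is not the code used in the evaluation: the function names are hypothetical, queries and documents are assumed to be NumPy arrays with one sBNF vector per row, the normalization assumes finite distances and a non-constant row for every query frame, and the path itself is taken as given (the optimization that finds it is described in [3]).

```python
import numpy as np

def log_cos_distance(Q, X):
    """Eq. 1: d = -log(1 + cos(angle)) + log 2 for every pair (q[i], x[j])."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Qn @ Xn.T, -1.0, 1.0)
    with np.errstate(divide="ignore"):     # cos = -1 gives d = +inf, as noted in the text
        return -np.log1p(cos) + np.log(2.0)

def normalize_per_query_frame(d):
    """Eqs. 2-4: min-max normalization of each row i over the document frames j."""
    d_min = d.min(axis=1, keepdims=True)
    d_max = d.max(axis=1, keepdims=True)
    return (d - d_min) / (d_max - d_min)

def detection_score(d_norm, path):
    """Score 1 - d_avg (Eq. 5) for a path given as a list of (i, j) index pairs."""
    d_avg = np.mean([d_norm[i, j] for i, j in path])
    return 1.0 - d_avg
```

When documents are processed in chunks, `normalize_per_query_frame` would be applied to the distance matrix of each chunk separately, which is the local normalization discussed above.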
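[Figure 1 graphic: normalized distance matrix between the spoken query q (frames 1 to m) and the audio document x (frames 1 to n), with matches marked at x[k1, k2] and x[k3, k4]; the segment x[k2 + 1, n] is labelled "not searched" because length(x[k2 + 1, n]) < m/2.]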
Figure 1: Example of the iterative DTW procedure: (1) the best match of q in x[1, n] is located in x[k1 , k2 ]; (2) since the score is greater
than the established threshold T , the search continues in the surrounding segments x[1, k1 − 1] and x[k2 + 1, n]; (3) x[k2 + 1, n] is
not searched, because it is too short; (4) the best match of q in x[1, k1 − 1] is located in x[k3 , k4 ]; (5) but its score is lower than T , so
the surrounding segments x[1, k3 − 1] and x[k4 + 1, k1 − 1] are not searched. The search procedure outputs the segments x[k1 , k2 ]
and x[k3 , k4 ].
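The iterative procedure illustrated in Figure 1 can be sketched as a queue of search intervals. The sketch below is only illustrative: `best_match` is a hypothetical callable standing in for the DTW-based search of Section 2.3, assumed to return the score and the 1-based, inclusive boundaries of the best match inside a given interval, and the default values are the ones reported for this evaluation.

```python
from collections import deque

def iterative_search(best_match, n, m, T=0.85, M=7, N=1000):
    """Iteratively search a document of n frames for a query of m frames.

    best_match(a, b) -> (score, a1, b1): hypothetical callable returning the
    best match of the query inside frames [a, b] of the document.
    """
    queue = deque([(1, n)])
    matches = []
    while queue:
        a, b = queue.popleft()
        score, a1, b1 = best_match(a, b)
        matches.append((score, a1, b1))
        if score <= T:
            continue                       # condition (1): stop expanding around weak matches
        for lo, hi in ((a, a1 - 1), (b1 + 1, b)):
            long_enough = (hi - lo + 1) >= m / 2              # condition (2)
            below_limit = len(matches) + len(queue) < M       # condition (3)
            if long_enough and below_limit:
                queue.append((lo, hi))
    matches.sort(reverse=True)             # rank by score, highest first
    return matches[:N]
```

As in Figure 1, a match whose score does not exceed T is still reported; it only prevents the surrounding segments from being searched.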
2.4. Calibration and fusion of system scores

The scores produced by our systems are transformed according to a discriminative calibration/fusion approach commonly applied in speaker and language recognition, which we adapted to STD tasks for MediaEval 2013, in collaboration with Alberto Abad, from L2F, the Spoken Language Systems Laboratory, INESC-ID Lisboa. In the following paragraphs, we just summarize the procedure. For further details, see [5].

First, the so-called q-norm (query normalization) is applied, so that zero-mean and unit-variance scores are obtained per query. Then, if n different systems are fused, detections are aligned so that only those supported by k or more systems (1 ≤ k ≤ n) are retained for further processing (in this evaluation, we use k = 2). To build the full set of trials (potential detections) we assume a rate of 1 trial per second (which is consistent with the evaluation script). Now, let us consider one of those detections of a query q supported by at least k systems, and a system A that did not provide a score for it. There could be different ways to fill up this hole. We use the minimum score that A has output for query q in other trials. In fact, the minimum score for the query q is hypothesized for all target and non-target trials of query q for which system A has not output a detection score. When a single system is considered (n = 1), the majority voting scheme is skipped but qnorm and the filling up of missing scores are still applied. In this way, a complete set of scores is prepared, which, together with the ground truth (target/non-target labels) for a development set of queries, can be used to discriminatively estimate a linear transformation that will hopefully produce well-calibrated scores.

The calibration/fusion model is estimated on the development set and then applied to both the development and test sets, using the BOSARIS toolkit [10][11]. Under this approach, the Bayes optimal threshold, given the effective prior (in this evaluation, P̂_target = C_miss P_target / (C_miss P_target + C_fa (1 − P_target)) = 0.001), would be applied and, at least theoretically, no further tunings would be necessary. In practice, however, if a system yielded a small number of detections, we would be using hypothesized scores for most of the trials. As a result, the calibration/fusion model would be poorly estimated and the Bayes optimal threshold (in this evaluation, 6.9) would not produce good results.

3. STD systems

In this evaluation, we have exploited some publicly available Text-to-Speech (TTS) APIs to perform text-in-audio search as audio-in-audio search. This is just a proof-of-concept aimed at overcoming the Out-Of-Vocabulary (OOV) word issue.

We have applied the Google TTS (gTTS) Python library and command-line interface (CLI) tool [12], which provides two different female (es-ES and es-US) voices, and the Cocoa interface to speech synthesis in MacOS [13], which provides 5 different voices (three male, two female) including both European and American Spanish.

In this way, for each textual term, we synthesize 7 spoken queries: q_1, q_2, . . . , q_7. These spoken queries are downsampled to 8 kHz, and VAD and sBNF extraction are applied to them as described in Sections 2.1 and 2.2. The longest query is then taken as reference and optimally aligned to the other queries by means of a standard DTW procedure. Let us consider the sequence of VAD-filtered sBNF vectors for the reference query, q_l of length m_l, and the sequence corresponding to another synthesized query, q_i of length m_i. The alignment starts at [1, 1] and ends at [m_l, m_i] and involves L alignments, such that each feature vector of q_l is aligned to a sequence of vectors of q_i. This is repeated for all the synthesized queries, such that we end up with a set of feature vectors S_j aligned to each feature vector q_l[j], for j = 1, 2, . . . , m_l. Then, each q_l[j] is averaged with the feature vectors in S_j to get a single average query, as follows:

$$
q_{avg}[j] = \frac{1}{1 + |S_j|} \left( q_l[j] + \sum_{v \in S_j} v \right), \qquad j = 1, 2, \ldots, m_l \qquad (6)
$$

Finally, the average query obtained in this way is used to search for occurrences in the audio documents, just in the same way (using the same configuration) as we do in the QbE-STD task.
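The query averaging of Eq. 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the submitted systems: the text only says that a standard DTW procedure is used, so the Euclidean local distance and the tie-breaking in the backtracking are choices made here, and the function names are hypothetical. Each query is a NumPy array with one feature vector per row.

```python
import numpy as np

def dtw_path(ref, other):
    """Standard DTW between ref (m_l x d) and other (m_i x d).

    Returns the aligned 0-based index pairs from (0, 0) to (m_l - 1, m_i - 1).
    The Euclidean local distance is an assumption of this sketch.
    """
    cost = np.linalg.norm(ref[:, None, :] - other[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end of both sequences to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(acc[i - 1, j - 1], i - 1, j - 1),
                 (acc[i - 1, j], i - 1, j),
                 (acc[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return path[::-1]

def average_query(queries):
    """Eq. 6: average each reference frame with all frames aligned to it."""
    ref = max(queries, key=len)            # the longest query is the reference
    sums = ref.astype(float).copy()        # starts with q_l[j]
    counts = np.ones(len(ref))             # the "1 +" term of Eq. 6
    for q in queries:
        if q is ref:
            continue
        for j, k in dtw_path(ref, q):
            sums[j] += q[k]                # add v in S_j
            counts[j] += 1                 # grow |S_j|
    return sums / counts[:, None]
```

The resulting array has the length of the reference query and can be passed to the same search procedure as any spoken query.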
4. Experimental setup and results

Since BUT/Phonexia sBNF extractors operate on 8 kHz signals, all the query and test audio signals have been downsampled to 8 kHz and stored as 16-bit little-endian signed-integer single-channel WAV files. Audio conversion was performed under MacOS, using the following command:

    afconvert audio.<ext> -o audio_8k.wav -d LEI16@8000 -c 1 -f WAVE

where <ext> represents any audio format extension (such as mp3, aac, wav, etc.).

Table 2: ATWV/MTWV performance on the development sets of MAVIR and RTVE for the STD systems submitted by GTTS-EHU. The ATWV threshold is set to 6.9. Along with the MTWV score, the MTWV threshold is shown in parentheses.

System   MAVIR                     RTVE
         MTWV (Thr)      ATWV      MTWV (Thr)      ATWV
con2     0.0396 (5.12)   0.0000    0.0933 (5.30)   0.0026
con3     0.0463 (4.82)   0.0000    0.0951 (4.80)   0.0000
con4     0.0512 (4.31)   0.0000    0.0916 (4.98)   0.0000
con1     0.0398 (5.28)   0.0000    0.0843 (5.21)   0.0000
pri      0.0464 (5.43)   0.0000    0.0809 (5.91)   0.0265
5. Conclusions and future work
Two different datasets have been provided for training and
development of QbE-STD and STD systems in the Albayzin The QbE-STD and STD results obtained by our systems on the
2018 Search on Speech Evaluation [14]: MAVIR [15] and development sets of MAVIR and RTVE may indicate that the
RTVE [16]. We have not used the training dataset at all, nor search procedure is detecting few occurrences of the queries,
the dev1 set of RTVE. Only the dev set of MAVIR and the dev2 yielding high miss error rates and making it difficult the esti-
set of RTVE have been used to estimate the calibration/fusion mation of good (robust) calibration/fusion models, since most
models, and later to search for query occurrences. Search has of the trials are missing from system output and we have to hy-
been also performed on the test sets: COREMAH [17], MAVIR pothesize scores for them. To get a larger amount of target and
and RTVE, applying the calibration/fusion models and the opti- non-target detections, three of the parameters of the DTW-based
mal (MTWV) thresholds obtained on development. In the case search must be relaxed: the maximum amount M of hits per au-
of COREMAH (for which no development data was provided), dio chunk (currently, M = 7), the minimum score T required to
MAVIR calibration/fusion models have been used and heuristic keep searching (currently, T = 0.85) and the maximum number
thresholding has been applied, by making Yes decisions for the N of detections per query for the whole set of audios (currently,
15% of detections with the highest scores. N = 1000).
One primary (pri) and four contrastive (con1, con2, con3 QbE-STD detection scores, though badly calibrated, seem
and con4) systems have ben submitted to each combination to work fine for RTVE but not so well for MAVIR. The differ-
of task (QbE-STD, STD), condition (development, test) and ence in performance might be related to a higher acoustic vari-
dataset (COREMAH, MAVIR, RTVE). BabelMulti, Fisher- ability or more adverse conditions (reverberation, noise, etc.)
Mono and FisherTri sBNF’s were used for contrastive systems for MAVIR.
2, 3 and 4, respectively. The concatenation of the three sBNF’s The trick of synthesizing spoken queries from textual terms
(with 80 × 3 = 240 dimensions) was used as acoustic repre- to perform STD as QbE-STD seems to be failing, with two
sentation for contrastive system 1. Finally, the primary system possible causes: (1) an acoustic mismatch between the synthe-
was obtained as the discriminative fusion of the four contrastive sized queries and the test audios might lead to low scores and
systems. block the iterative DTW detection procedure; and (2) the use of
bottleneck layer activations as frame-level acoustic representa-
Tables 1 and 2 show ATWV/MTWV performance of the 5 tion might be incompatible with the query averaging procedure
GTTS-EHU systems on the development sets of MAVIR and (which worked fine with phone posteriors).
RTVE for the QbE-STD and STD tasks, respectively. In all Future developments may involve some sort of data aug-
cases, ATWV is obtained for the Bayes optimal threshold (6.9). mentation in QbE-STD, such as the use of pseudo-relevance
Along with the MTWV score, the MTWV threshold is shown feedback, that is, the use of top matching query occurrences
too (in parentheses). As noted above, the systems eventually as additional examples. Also, though query averaging is com-
submitted for MAVIR and RTVE (in all tasks and conditions) putationally cheap, using it with sBNF representations might
were applied the MTWV threshold. be unfeasible and other more expensive strategies to take ad-
vantage of the synthesized queries should be explored, such as
Table 1: ATWV/MTWV performance on the development sets carrying out multiple searches and fusing the results [18]. Alter-
of MAVIR and RTVE for the QbE-STD systems submitted by natively, we could return to posteriors, by combining the high-
GTTS-EHU. The ATWV threshold is set to 6.9. Along with the dimensional sets of posteriors provided by BUT/Phonexia net-
MTWV score, the MTWV threshold is shown in parentheses. works with some sort of feature clustering or feature selection
approach.
MAVIR RTVE
MTWV (Thr) ATWV MTWV (Thr) ATWV 6. Acknowledgements
con2 0.1291 (4.06) 0.0000 0.4893 (4.75) 0.0097
con3 0.1327 (4.12) 0.0000 0.5722 (5.14) 0.0802 We thank Javier Tejedor and Doroteo T. Toledano for organizing
con4 0.1590 (4.09) 0.0000 0.5227 (5.09) 0.0437 a new edition of the Search on Speech Evalution and for their
con1 0.1278 (4.14) 0.0000 0.5159 (5.13) 0.0421 help during the development process. We also thank Eduardo
pri 0.1577 (4.59) 0.0000 0.5352 (5.69) 0.3043 Lleida and the ViVoLab team for the huge effort of collecting
and annotating RTVE broadcasts for the ALBAYZIN 2018 eval-
uations. This work has been partially funded by the UPV/EHU
under grant GIU16/68.
7. References
[1] X. Anguera, L. J. Rodriguez-Fuentes, F. Metze, I. Szöke, A. Buzo, and M. Penagarikano, “Query-by-example spoken term detection on multilingual unconstrained speech,” in Interspeech 2014, Singapore, September 14-18 2014, pp. 2459–2463.
[2] X. Anguera, L. J. Rodriguez-Fuentes, A. Buzo, F. Metze, I. Szöke, and M. Penagarikano, “Quesst2014: Evaluating query-by-example speech search in a zero-resource setting with real-life queries,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), Brisbane, Australia, April 19-24 2015, pp. 5833–5837.
[3] L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez, “High-performance query-by-example spoken term detection on the sws 2013 evaluation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9 2014, pp. 7819–7823.
[4] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. dissertation, Faculty of Information Technology, Brno University of Technology, http://www.fit.vutbr.cz/, Brno, Czech Republic, 2008.
[5] A. Abad, L. J. Rodriguez-Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel, “On the calibration and fusion of heterogeneous spoken term detection systems,” in Interspeech 2013, Lyon, France, August 25-29 2013, pp. 20–24.
[6] Python interface to the WebRTC (https://webrtc.org/) Voice Activity Detector (VAD), https://github.com/wiseman/py-webrtcvad.
[7] A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotny, F. Grezl, P. Schwarz, L. Burget, and J. H. Cernocky, “BUT/Phonexia Bottleneck Feature Extractor,” in Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables D’Olonne, France, June 26-29 2018, pp. 283–287.
[8] C. Cieri, D. Miller, and K. Walker, “The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text,” in LREC, 2004, pp. 69–71.
[9] Babel Program, Intelligence Advanced Research Projects Activity (IARPA), https://www.iarpa.gov/index.php/research-programs/babel.
[10] N. Brümmer and E. de Villiers, The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classifier Score Processing, 2011, https://sites.google.com/site/bosaristoolkit/.
[11] N. Brümmer and E. de Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv.org, 2013, presented at the NIST SRE’11 Analysis Workshop, Atlanta (USA), December 2011, https://arxiv.org/abs/1304.2865.
[12] gTTS (Google Text-to-Speech): Python library and CLI tool to interface with Google Translate’s text-to-speech API, Download and installation: https://pypi.org/project/gTTS/. Documentation: https://gtts.readthedocs.io/en/latest/.
[13] NSSpeechSynthesizer: The Cocoa interface to speech synthesis in macOS (AppKit module of PyObjC bridge), Apple Developer Objective C Documentation: https://developer.apple.com/documentation/appkit/nsspeechsynthesizer. Stackoverflow example of use (with Python 2): https://stackoverflow.com/questions/12758591/python-text-to-speech-in-macintosh.
[14] J. Tejedor and D. T. Toledano, The ALBAYZIN 2018 Search on Speech Evaluation Plan, IberSpeech 2018: X Jornadas en Tecnologías del Habla and V Iberian SLTech Workshop, Barcelona, Spain, November 21-23 2018, http://iberspeech2018.talp.cat/wp-content/uploads/2018/06/EvaluationPlanSearchonSpeech.pdf.
[15] MAVIR corpus, Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, http://www.lllf.uam.es/ESP/CorpusMavir.html.
[16] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, RTVE2018 Database Description, Vivolab, Aragon Institute for Engineering Research (I3A), University of Zaragoza and Corporación Radiotelevisión Española, June 2018, http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf.
[17] COREMAH corpus, Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, http://www.lllf.uam.es/coremah/.
[18] T. Hazen, W. Shen, and C. White, “Query-By-Example Spoken Term Detection Using Phonetic Posteriorgram Templates,” in ASRU, Merano, Italy, December 13-17, 2009, pp. 421–426.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-53
search will be performed. These lattices and the transcriptions obtained are the primary input to the STD module.
The lattices are processed using the lattice indexing technique described in [13], where the lattices of all the test utterances are converted from individual weighted finite state transducers (WFST) to a single generalized factor transducer structure in which the start-time, end-time and lattice posterior probability of each word are stored as a 3-dimensional cost. This factor transducer is an inverted index of all word sequences seen in the lattices.
Thus, given a list of terms to detect, a simple finite state machine that accepts the terms is created and composed with the factor transducer to obtain all their occurrences in the test utterances, along with the utterance ID, start-time, end-time and lattice posterior probability of each occurrence. All those occurrences are sorted according to their posterior probabilities and a YES/NO decision is assigned to each instance.
2.2. Train and development data
Here is a description of the databases used to train and develop our system for the STD task.
• TC-STAR: We obtained, free of charge from ELDA-ELRA, a set of audio recordings and transcriptions from the Spanish partition of TC-STAR 2005-2007 [14], corresponding to the Evaluation Package database. It contains 26:40 hours of audio and consists of 17163 expressions with 241412 words. This set was used for training the acoustic models and the language model.
• MAVIR: This corpus was provided by the challenge organizers, and corresponds to talks held by the MAVIR consortium in 2006, 2007 and 2008 [6]. The Spanish training data is contained in “SoS2018_training”; it contains 4:20 hours and consists of 5 talks segmented into 2400 expressions with 44423 words. This set was used for training the acoustic models and the language model. The MAVIR development data is contained in “SoS2018_development (1)” and amounts to about one hour in two talks.
• RTVE: The Challenge organizers provided this corpus and its structure is explained in [15]. The corpus is divided into four partitions: one “train” partition, two development partitions (“dev1” and “dev2”) and one “test” partition. The audio files of the “train” partition do not have human-revised transcriptions. Partition “dev1” contains about 53 hours of audio with the corresponding human-revised transcriptions and can be used for either development or training. We therefore manually segmented and speaker-labeled the transcriptions (“trn” files) and audio (“aac” files) of twelve RTVE programs from this partition (four “20H” programs and eight “La Noche en 24H” programs) into expressions shorter than 40 seconds, obtaining a training dataset of 2013 expressions with 139,983 words and a duration of 13:45 hours. This set was used for training the acoustic models and the language model. The development data is in the partition “dev2” and amounts to about 15 hours in twelve RTVE programs.
During the manual segmentation of the transcriptions and audio of the RTVE programs, we observed that the transcript files omitted many speech expressions that occur in the audio, and that some speaker voices overlap, which is typical of the spontaneity and level of improvisation of the conversations among the journalists participating in the programs. We removed the overlapping voices when segmenting the audio and its corresponding transcription; however, we transcribed only some of the untranscribed speech expressions, namely those we detected by listening carefully and replaying the audio file many times during the segmenting process.
All the textual training material of the three databases was revised and corrected, converted to uppercase, with numbers and acronyms replaced by their transcriptions, and finally grouped into a database of 23029 different words and 44:45 hours of duration.
2.3. Vocabulary and Lexicon
The dictionary used by the LVCSR is composed only of words from the transcriptions of the training data. The Multilingual G2P transcriber [16] was used to obtain the phonetic transcription of each word. We obtained a general lexicon of 23029 different words.
2.4. Language models
To train the language model used by the LVCSR, we used only the transcriptions of the training data corpus, which consist of 21575 expressions and 23029 different words. This text was supplied to the SRILM tool to create an ARPA-format trigram language model with 23002 unigrams, 156778 bigrams and 38628 trigrams.
2.5. INV and OOV Terms
This Challenge evaluation defines two sets of terms for the STD task: an in-vocabulary (INV) set of terms and an out-of-vocabulary (OOV) set of terms. The OOV set of terms is meant to be composed of out-of-vocabulary words for the LVCSR system, so these OOV terms must be removed from the system dictionary and consequently from the lexicon and the language model.
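The removal of OOV terms from the dictionary and from the language model training text can be sketched as follows. This is only an illustration under stated assumptions: the lexicon is represented as a mapping from uppercase words to pronunciations, OOV terms are treated as single words, and masking them in the text with an unknown-word token is a hypothetical choice that the paper does not describe. The function names are also hypothetical.

```python
def remove_oov_from_lexicon(lexicon, oov_words):
    """Return a copy of the lexicon without the entries listed as OOV terms."""
    oov = {w.upper() for w in oov_words}
    return {word: pron for word, pron in lexicon.items() if word.upper() not in oov}


def mask_oov_in_text(sentences, oov_words, unk="<unk>"):
    """Replace OOV terms in the LM training text by an unknown-word token
    (an assumed strategy, so that the LM never sees the OOV words)."""
    oov = {w.upper() for w in oov_words}
    return [
        " ".join(unk if tok.upper() in oov else tok for tok in sentence.split())
        for sentence in sentences
    ]


# Toy usage with made-up entries.
lexicon = {"CASA": "k a s a", "MADRID": "m a d r i d"}
filtered = remove_oov_from_lexicon(lexicon, ["madrid"])
masked = mask_oov_in_text(["LA CASA DE MADRID"], ["madrid"])
```

Comparison is done in uppercase because the training text was converted to uppercase.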
3. Experimental results
Table 1 contains the STD scores obtained with the three proposed models (GMM-HMM, S-GMM and DNN-HMM) on the DEV sets of the MAVIR and RTVE corpora, evaluated with the NIST STDEval-0.7 tool provided by the Challenge organizers.
This tool provides the probabilities of False Acceptances (Pfa) and Misses (Pmiss) of the STD system, and two metrics that integrate both probabilities [17]:
• Actual Term Weighted Value (ATWV), which integrates Pfa and Pmiss for each term and averages over all the terms, representing the term weighted value for a threshold set by the system and tuned on development data.
• Maximum Term Weighted Value (MTWV), which is the maximum TWV achieved by the system over all possible thresholds; it does not depend on the tuned threshold and represents an upper bound of the system performance.
Compared with the results obtained on the same MAVIR dataset in the Albayzin 2016 spoken term detection evaluation [5], our results, for all the proposed models, are similar to those obtained by the best system on the DEV set (first line of table 9 of [5]) and the TEST set (first line of table 10 of [5]). Additionally, the proposed DNN-HMM model surpasses the best system in that evaluation.
However, the results show that the evaluated methods perform worse on the RTVE set, probably because of:
• the differences between the speech of the training set and their transcriptions, explained above, and
• the differences in spontaneity and level of improvisation between the MAVIR and RTVE sets.
4. Conclusions and future work
Due to the inexperience of the participants in the use of the Kaldi tool, the lack of time, and difficulties in exploiting the computational resources available to us, it was impossible for us to carry out, before the deadline, the evaluation of the proposed systems applying the Kaldi proxy method on the DEV and TEST sets.
Taking into account that the DNN-HMM model obtained the best results on the RTVE set, with the lowest Pmiss and the lowest performance gap between the MTWV and ATWV metrics (an indication of well-calibrated term detection scores), we decided to submit as primary system the one obtained with the DNN-HMM model, and as contrastive systems 1 and 2 those obtained with the S-GMM and GMM-HMM models, respectively.
We consider the participation in this Challenge very useful. The experience acquired by all the participants will serve us in future competitions and, more importantly, in the development of our research in the fields of ASR and STD.
We propose to continue and conclude the experiments evaluating the Kaldi proxy method and refining the transcripts of the RTVE database samples, to make them available to the Spanish Thematic Network on Speech Technology (RTTH).
5. Acknowledgments
References
[1] NIST. The spoken term detection (STD) 2006 evaluation plan. National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 10 edn., 2006. http://www.nist.gov/speech/tests/std.
[2] Tejedor, J., Toledano, D.T., Anguera, X., Varona, A., Hurtado, L.F., Miguel, A., Colas, J.: Query-by-example spoken term detection Albayzin 2012 evaluation: overview, systems, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing, 2013, 2013(1):23.
[3] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C., Cardenal, A., Echeverry-Correa, J.D., Coucheiro-Limeres, A., Olcoz, J., Miguel, A.: Spoken term detection Albayzin 2014 evaluation: overview, systems, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing, 2015, 2015(1):21.
[4] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C.: Comparison of Albayzin query-by-example spoken term detection 2012 and 2014 evaluations. EURASIP Journal on Audio, Speech, and Music Processing, 2016(1):1.
[5] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Serrano, L., Hernaez, I., Coucheiro-Limeres, A., Ferreiros, J., Olcoz, J., Llombart, J.: Albayzin 2016 spoken term detection evaluation: an international open competitive evaluation in Spanish. EURASIP Journal on Audio, Speech, and Music Processing, 2017(1):22.
[6] Sandoval, A.M., Llanos, L.C.: MAVIR: a corpus of spontaneous formal speech in Spanish and English. In: Iberspeech 2012: VII Jornadas en Tecnología del Habla, 2012. MAVIR corpus: http://www.lllf.uam.es/ESP/CorpusMavir.html.
[7] RTVE corpus: http://catedrartve.unizar.es/reto2018.html.
[8] COREMAH corpus: http://www.lllf.uam.es/coremah/.
[9] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[10] Stolcke, A. et al.: SRILM-an extensible language modeling toolkit. In: ISCA Seven International Conference of Speech Technologies, ICSLP 2002, pp. 901–904.
[11] Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., Glembek, O., Goel, N., Karafiat, M., Rastrow, A., Rose, R., Schwarz, P., and Thomas, S.: The subspace Gaussian mixture model: A structured model for speech recognition. Computer Speech & Language, 25(2):404–439, 2011.
[12] Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlícek, P., Qian, Y., Riedhammer, K., Veselý, K., Vu, N.T.: Generating exact lattices in the WFST framework. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4213–4216, 2012.
[13] Can, D., Saraclar, M.: Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech and Language Processing 19(8), 2338–2347, 2011.
[14] TC-STAR Technology and Corpora for Speech to Speech Translation. http://www.tcstar.org/pages/main.htm
[15] Lleida E., Ortega A., Miguel A., Bazán V., Pérez C., Zotano M., De Prada A.: RTVE2018 Database Description
[16] Multilingual Grapheme to Phoneme. https://github.com/jcsilva/multilingual-g2p
[17] Fiscus, J. G., Ajot, J., Garofolo, J. S., & Doddington, G.: Results of the 2006 spoken term detection evaluation. In Proc. of workshop on searching spontaneous conversational speech, pp. 45–50, 2007.
10.21437/IberSPEECH.2018-54
This paper describes the joint collaboration between the Machine Learning and Language Processing (MLLP) research group from the Universitat Politècnica de València (UPV) and the Human Language Technology and Pattern Recognition (HLTPR) research group from the RWTH Aachen University for the participation in the IberSpeech-RTVE 2018 Speech-to-Text transcription challenge, which will be held during the IberSpeech 2018 conference in Barcelona, Spain. Our participation consisted of the submission of three systems: one primary and one contrastive for the closed system training condition, and one primary for the open condition.
The rest of the paper is structured as follows. First, Section 2 describes the RTVE database that was provided by the organizers of the challenge. Second, in Section 3 we describe the ASR systems we developed under the closed training conditions. Next, Section 4 details the ASR system that participated in the open training conditions. Finally, Section 5 provides a summary of the work and gives some concluding remarks.
3. Closed-condition system
3.1. Speech data filtering
Under the closed training conditions, it is extremely important to make the most of the provided training data, especially when it is scarce and/or noisy. This is the case of the RTVE database: on the one hand, the train set comprises only 463 hours of audio, which is not much compared with the amount of data used to train current state-of-the-art systems [1, 2]. On the other hand, training data is not provided with verbatim transcripts but with approximate subtitles. This becomes a major concern when using recurrent neural networks for acoustic modeling, as the accuracy drops significantly when using noisy training data. Therefore, a robust speech data filtering procedure becomes a key point to achieve high ASR performance.
After examining some random samples from train, dev1 and dev2 sets, we first-hand checked that the provided subtitles are far from being verbatim transcripts, but also noted the
presence of 1) subtitle/transcription gaps, 2) subtitle files with no timestamps, 3) audio files considerably larger than their corresponding subtitle files, 4) subtitle files covering timestamps that exceed the length of their corresponding audio file, or 5) human transcription errors in dev1, among others.
For all these reasons, we applied the following speech data filtering pipeline. As subtitle timestamps 1) are not reliable in the train set, 2) are not given in dev1, and 3) are given only at speaker-turn level in dev2, we first force-aligned each audio file to its corresponding subtitle/transcript text. We did this using a preexisting hybrid CD-DNN-HMM ASR system [3] in which the search space was constrained to recognize the exact text (no language model was involved in this procedure), with the only freedom of exploring the different word pronunciations given by the lexicon model and of using an optional silence phoneme at the beginning of each word. In this way we computed the best alignment between the input frame sequence and the sequence of HMM states inferred from the subtitle/transcript text. Then, we applied a heuristic post-filtering based on state-level frame occupation and word-level alignment scores: if either an HMM state is aligned to more frames than the observed average state frame occupation plus two times the observed standard deviation, or the average alignment score of a word is lower than a given threshold, then the corresponding word alignment is considered noisy and the word is removed. Next, we completely discarded those files in which more than two thirds of the words were filtered out in the previous step. Finally, we built a clean training corpus by joining words into segments whose boundaries were delimited by large-enough silences and deleted words.

Table 1: Number of raw, aligned raw, aligned speech, and filtered speech hours as a result of applying the speech data filtering pipeline to the whole RTVE database.

                  Raw    Aligned Raw    Aligned Speech    Filtered Speech
    train         463    438            252               187
    dev1-train     43     31             24                18
    dev1-dev       14     12              9                 7
    dev2           15     12              9                 6
    Overall       535    493            294               218

Table 1 shows the result of applying this speech data filtering pipeline to the train, dev1 and dev2 sets. The second column shows the raw audio length in hours of each set. The third refers to the amount of raw hours that could be aligned to the corresponding subtitle/transcript text by our alignment system. It must be noted that there were some audio files that the system was unable to align. This happens when none of the active hypotheses can reach the final HMM state at the last time step, due to an excessive histogram pruning or due to a non-matching transcript. For this reason, 42 hours of audio data could not be aligned. The fourth column gives the total amount of aligned speech data after removing non-speech events that were aligned to the silence phoneme. Surprisingly, the original 438 raw hours from the train set were reduced to 252 hours of speech data, i.e. we detected 186 hours of non-speech events. After some manual analysis of the alignments, we found that a significant portion of these 186 hours is explained by non-subtitled speech, whose corresponding audio frames were in practice aligned to the silence phoneme. Finally, the fifth column shows the number of hours of clean speech data after applying the described heuristic post-filtering procedure and after discarding files that showed a high word rejection rate. Starting with the original 535 hours of raw audio, we aligned 294 hours of speech, from which we rejected 76 hours of noisy data, ending up with 218 hours of speech suitable for acoustic training.
3.2. Acoustic modeling
The acoustic models (AM) used during the development of the MLLP-RWTH c1-dev closed system were trained using filtered speech data from train and dev1-train sets, that is, 205 hours of training speech data. We extracted 16-dim. MFCC features augmented by the full first and second time derivatives, resulting in 48-dim. features.
Our acoustic models were based on the hybrid approach [4, 5]. We first trained a conventional context-dependent Gaussian mixture model hidden Markov model (CD-GMM-HMM) with three left-to-right states. The state-tying schema was estimated following a phonetic decision tree approach [6], resulting in 8.9K tied states. The GMM acoustic model was used to force align the training data. We then trained a context-dependent feed-forward DNN-HMM using a context window of 11 frames, six hidden layers with ReLU activation functions and 2048 units per layer. We used the transLectures-UPV toolkit (TLK) [7] to train both GMM and DNN acoustic models.
Apart from the feed-forward model, we also trained a BLSTM-HMM model [5]. The DNN was used to refine the alignment between input acoustic features and HMM states. We then trained the BLSTM-HMM model using the open source toolkit TensorFlow [8] and TLK. The BLSTM network consisted of four bidirectional hidden layers with 512 LSTM cells per layer and per direction.
In order to increase the amount of training data, the final submitted system (MLLP-RWTH p-final closed) was retrained on a total of about 218 hours from sets train, dev1 and dev2.
3.3. Language modeling
Our language model (LM) for the closed condition consists of a combination of several n-gram models and a recurrent neural network (RNN) model. Also, since TV shows of each audio file are known in advance, we performed an LM adaptation at the n-gram model level.
First, we extracted sentences from all .srt and .trn files. Then we applied a common text processing pipeline to normalize capitalization, remove punctuation marks, expand contractions (e.g. sr. → señor) and transliterate numbers. As already mentioned, we split dev1 into two subsets, dev1-train and dev1-dev, in order to include dev1-train in training. Thus, in this section, we will refer to the combination of train and dev1-train simply as train. For LSTM LM training, we concatenated the train and subs-C24H sets into a single training file and removed redundancy by discarding repeated sentences. Also, sentences were shuffled after each epoch to allow better generalization. To carry out TV-show LM adaptation experiments, we randomly extracted 500 sentences of each TV show from the train set to be used as validation data in the adaptation process. Table 2 provides corpus statistics after normalization.

Table 2: Corpus statistics of the text data used for LM training.

                 Sentences    Running words    Vocabulary
    train        340K         4.3M             80K
    subs-C24H    3.1M         57M              160K
    RNN-train    1.8M         35M              176K

    dev1-dev     9.9K         160K             13K
    dev2         7.7K         150K             12K

Second, to define our closed-condition system’s vocabulary, we first computed the vocabulary of both train and subs-C24H sets, and then removed singletons, so that language models can properly model unknown word probabilities. After applying these two steps, the resulting vocabulary had 132K words. The out-of-vocabulary ratios of dev1-dev and dev2 sets were 0.36% and 0.53%, respectively.
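The vocabulary definition and the out-of-vocabulary ratio described above can be illustrated with a short sketch. It assumes whitespace-tokenized, already normalized sentences, and it computes the OOV ratio over running words, which is an assumption since the text does not state how the ratio was measured. The function names and the toy sentences are hypothetical.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=2):
    """Collect word counts and drop singletons (words seen fewer than min_count times)."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    return {word for word, count in counts.items() if count >= min_count}


def oov_ratio(sentences, vocab):
    """Fraction of running words in the sentences that are not in the vocabulary."""
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy usage: "gato", "perro" and "hola" are singletons and are removed.
train_text = ["el gato come", "el perro come", "hola"]
vocab = build_vocabulary(train_text)
ratio = oov_ratio(["el gato come pan"], vocab)
```

In this toy case two of the four running words of the evaluation sentence are outside the vocabulary, so the ratio is 0.5.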
Table 3: Perplexities of the different LM components.

                                       dev1-dev    dev2
    (a) N-gram train                   139.6       183.0
    (b) N-gram subs-C24H               161.2       193.4
    (c) N-gram show-specific           184.0       294.3
    (d) N-gram general (a+b)           107.0       147.5
    (e) N-gram adapt (a+b+c)            99.5       139.1
    (f) RNN                             92.3       110.7
    (g) RNN+N-gram general (d+f)        78.2       101.8
    (h) RNN+N-gram adapt (e+f)          68.9        99.2

Third, we trained two standard Kneser-Ney smoothed 4-gram LMs on the train and subs-C24H sets using the SRILM toolkit [9]. Rows (a) and (b) of Table 3 show the perplexities obtained with these models on the dev1-dev and dev2 sets. In addition to these two general n-gram LMs, we trained one n-gram LM for each TV show. Row (c) of Table 3 shows the averaged perplexity of the corresponding TV-show-specific LM for each file.
Next, we trained an RNN LM using the Variance Regularization (VR) criterion [10]. This criterion reduces the computational cost during the test phase. Our models were trained on GPU devices using the CUED-RNNLM toolkit [11]. The network setup was optimized to minimize perplexity on the dev1-dev set. It consisted of a 1024-unit embedding layer and a hidden LSTM layer of 1024 units. The output layer is a 132K-unit softmax, whose size corresponds to the vocabulary size. The perplexities obtained with this network are shown in row (f) of Table 3.
Then, the combination of the LMs was done in two steps. Firstly, we performed a linear interpolation of n-gram models. For the general, non-adapted models, we interpolated the LMs estimated on the train and the subs-C24H sets by minimizing the perplexity on dev1-dev [12]. Row (d) of Table 3 shows the perplexities for this particular LM combination. For each show-specific LM, we performed a three-way interpolation: the individual show-specific LM, the train LM and the subs-C24H LM. In this case, interpolation weights were optimized individually for each TV show so that the perplexity was minimized on the corresponding 500-sentence show-specific validation set, similarly to the approach followed in [13, 14]. Secondly, we combined the interpolated n-gram LMs with the RNN LM. Other than the static interpolation of n-gram LMs, the result of this step is not a new monolithic model, but a set of interpolation

search pruning parameters were optimized on the dev1-dev set. Table 4 shows the results on both dev1-dev and dev2 sets. As expected, the BLSTM acoustic model outperformed the feed-forward model by 12.2% relative.

Table 4: Comparison of the CD-FFDNN-HMM and BLSTM-HMM acoustic models using the general n-gram language model. Results in WER % and relative WER % improvement.

              dev1-dev    dev2
              WER         WER     ∆WER
    FFDNN     29.7        27.1    -
    BLSTM     26.5        23.8    12.2

Next, we analyzed the contribution of different LM combinations during search, keeping the acoustic model fixed to the best BLSTM neural network. Specifically, we carried out recognition experiments using (1) the general n-gram LM, (2) the RNN LM, (3) the interpolation of the RNN LM with the general n-gram LM, and (4) the interpolation of the RNN LM with the adapted, show-specific n-gram LMs. Table 5 shows perplexities and WERs for the dev1-dev and dev2 sets over these four different LM setups.

Table 5: Comparison of different language model combinations using the BLSTM-HMM acoustic model in terms of perplexity, WER % and relative WER % improvement.

                              dev1-dev        dev2
                              PPL    WER      PPL    WER     ∆WER
    n-gram general            107    26.5     148    23.8    -
    RNN                        92    26.2     111    23.0    3.4
    RNN + n-gram general       78    25.3     102    22.4    5.9
    RNN + n-gram adapt         69    24.8      99    22.4    5.9

The best results were obtained with the combination of RNN and n-gram models, showing a consistent 6% relative improvement in both sets over the baseline general n-gram LM. It is worth noting that in terms of WER, the improvement from using adapted models does not translate to dev2. As dev1-dev and dev2 contain different shows with strongly varying amounts of show-specific text data available for training, not all shows benefit from adaptation equally. Anyway, since the adaptation does not degrade the system performance, and given the good im-
weights to be used on-the-go by the ASR decoder during search. provement seen on dev1-dev, we decided to use the combination
Perplexities for the combination of the RNN LM with the gen- of RNN LMs plus adapted n-gram LMs for the final system.
eral and the adapted n-gram LMs can be found in Rows (g) Looking at the system outputs, after carrying out error anal-
and (h) of Table 3. ysis, we realized that our VAD module [15] was discarding a
Finally, to take the most of the provided data, the final sub- significant amount of speech regions in the audio files. This
mitted system (MLLP-RWTH p-final closed) was trained using significantly affected the WER by increasing the number of
the same hyper-parameters values estimated during the devel- deletions. For this reason, we decided to explore other au-
opment stage, but using also dev1-dev and dev2 sets as part of dio segmentation approaches and compare its performance in
the training data. terms of WER. Concretely, we compared the following ap-
proaches: (1) our baseline MLLP-UPV VAD system, based on
3.4. Experiments and results a speech/non-speech GMM-HMM classifier that ranked second
in the Albayzin-2012 audio segmentation challenge [15]; (2)
In this section we describe the experiments carried out to de- The LIUM Speaker Diarization Tools, a VAD system based
termine the best closed training condition system. Our exper- on Generalized Likelihood Ratio between speech/non-speech
iments were devoted to assess three components of the sys- Gaussian models [16]; (3) The well-known CMUseg audio seg-
tem: acoustic models, language models and voice activity de- mentation system using the standard configuration [17]; (4) Ap-
tection (VAD) modules. In all cases we used the TLK toolkit ply a fast pre-recognition step to segment the audio file by the
decoder [7] for recognizing test data using a one-pass decoding recognized silences, using the best CD-FFDNN-HMM acoustic
setup. model and a pruned version of the general n-gram LM; and (5)
First, we compared the performance of the CD-FFDNN- Use the segments generated in (4), and apply VAD the system
HMMs and BLSTM-HMMs acoustic models described in Sec- (1) to classify those segments into speech/non-speech. It is im-
tion 3.2. In both cases we used the general n-gram language portant to note that (3) and (4) are not VAD systems but just au-
model described in Section 3.3. Grammar scale factor and dio segmenters, so all detected segments are considered speech,
259
i.e. all audio is passed through to the ASR. Table 6 shows the WER for each of the five audio segmentation/VAD techniques, including the ratio of audio that is dropped by the VAD prior to decoding.

Table 6: Comparison of different audio segmentation/VAD techniques using the BLSTM-HMM acoustic model and the combination of the RNN LM + adapted n-gram LM. Results in WER %, relative WER % improvement and ratio of dropped audio.

                            dev1-dev              dev2
                        % drop.    WER    % drop.    WER   ∆WER
MLLP-UPV (1)               10.9   24.8        5.9   22.4      -
LIUM (2)                    7.1   23.7        3.9   20.8    7.1
CMUseg (3)                    0   23.2          0   20.9    6.7
Pre-Recognition (4)           0   22.9          0   20.6    8.0
 + MLLP-UPV (5)             3.2   22.3        3.3   20.0   10.7

As we expected, the baseline VAD system (1) was discarding too many segments, as it was too aggressive compared to other techniques. With either (2) or (3) we obtained a consistent improvement. It was further increased up to 8% by using (4). We then decided to combine this segmentation with our baseline VAD system (1), which led us to achieve an 11% relative WER improvement. In absolute terms, we gained 2.4 WER points on dev2, with a final WER of 20.0%. This setup constituted our contrastive closed-condition system (MLLP-RWTH c1-dev closed), whilst our primary system (MLLP-RWTH p-final closed) was the result of re-training the same acoustic and language models with all available data, as stated in Sections 3.2 and 3.3.

Finally, we analyzed the speed of the submitted system in terms of Real Time Factor (RTF). We studied how tightening the pruning parameters affects the RTF and the WER. Also, to assess the cost of the fast pre-recognition step used to segment the audio signal, we repeated this comparison using the LIUM VAD system instead. Results of this analysis are shown in Table 7.

Table 7: Speed analysis in terms of RTF and its effect on the WER % over the dev2 set, either with the submitted system or removing the pre-recognition step and using LIUM VAD instead.

                          RTF    WER
Submitted system (1)      1.5   20.0
 + inc. prune             0.8   20.3
(1) with LIUM VAD         1.0   20.9
 + inc. prune             0.4   21.3

First, a more aggressive pruning speeds up the submitted system by 88% while degrading the WER by 0.3% absolute. Next, if we replace the pre-recognition step of the submitted system by the LIUM VAD module, we get a speed-up of 50% at the cost of 0.9 WER points. Finally, we could afford a very significant speed-up, by a factor of 3.75 (RTF 1.5 to 0.4), if we tighten the pruning parameters when using LIUM VAD, with a WER loss of 1.3 absolute points. This would still be a competitive system, scoring 21.3% WER on dev2.

4. Open-condition system

The main motivation for participating in the open-condition track was the desire to evaluate a system developed in recent months for a different purpose, not related to the IberSpeech challenge. In order to achieve this goal, we decided to keep the amount of parameter optimization as low as possible. This system is based on the software developed at RWTH Aachen University: RASR [18, 19] and RETURNN [20, 21].

The ASR system is based on a hybrid LSTM-HMM acoustic model. It was trained on a total of approx. 3800 hours of transcribed speech from several sources, covering a variety of domains and acoustic conditions. The collection consists of subtitled videos crawled from Spanish and Latin American websites.

We used a pronunciation lexicon with a vocabulary size of 325k with one or more pronunciation variants. The acoustic model takes 80-dim. MFCC features as input and estimates state posterior probabilities for 5000 tied triphone states. The state tying was obtained by estimating a classification and regression tree (CART) on all available training data. Acoustic modeling was done using a bi-directional LSTM network with four layers and 512 LSTM units in each layer. About 30% of activations are dropped in each layer for regularization purposes [22]. During training we minimized the cross-entropy of a network-generated distribution in the softmax output layer at aligned label positions using a Viterbi alignment defined over the 5000 tied triphone states of the CART. We used the Adam learning rate schedule [23] with integrated Nesterov momentum and further reduced the learning rate following a variant of the Newbob scheme. We split input utterances into overlapping chunks of roughly 10 seconds and perform an L2 normalization of the gradients for each chunk. With the normalized gradients the network is updated in a stochastic gradient descent manner where batches containing up to 50 chunks are distributed over eight GPU devices and recombined into a common network after roughly 500 chunks have been processed by all devices.

The language model for the single-pass HMM decoding is a 5-gram count model trained with Kneser-Ney smoothing on a large body of text data collected from multiple publicly available sources. Its perplexity on dev1-dev and dev2 is 173.5 and 173.2 respectively. This open-track system has reached a WER of 18.3% and 15.6% on dev1-dev and dev2 without any speaker or domain adaptation or model tuning.

5. Conclusions

In this paper we have presented the description of the three systems that participated in the IberSpeech-RTVE 2018 Speech-to-Text transcription challenge. Two of them, one primary (MLLP-RWTH p-final closed) and one contrastive (MLLP-RWTH c1-dev closed), were submitted to the closed training conditions, while the other one (MLLP-RWTH p-prod open) participated in the open training track. On the one hand, our best development closed-condition ASR system (MLLP-RWTH c1-dev closed), consisting of a BLSTM-HMM acoustic model trained on a reliable set of 205 hours of training speech data, and a combination of both RNN and TV-show adapted n-gram language models, achieved a competitive mark of 20.0% WER on the dev2 set. Our final, primary closed-condition ASR system (MLLP-RWTH p-final closed) should offer a similar or even better performance, as it followed the same system design setup but was trained with all available data, including both development sets. On the other hand, our general-purpose open-condition ASR system (MLLP-RWTH p-prod open), without carrying out any speaker, domain or model adaptation of any kind, scored 15.6% WER on the dev2 set.

6. Acknowledgements

The research leading to these results has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 761758 (X5gon) and the Spanish government’s TIN2015-68326-R (MINECO/FEDER) research project MORE. This work was also financed by grant FPU14/03981 from the Spanish Ministry of Education, Culture and Sport. Finally, we would also like to thank our colleagues at RWTH Aachen for many fruitful discussions: Eugen Beck, Tobias Menne and Albert Zeyer.
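
As a supplementary illustration of the linear LM interpolation described in Section 3.3, the following minimal sketch estimates interpolation weights that reduce perplexity on a held-out set. It uses expectation-maximization, which is one common way to fit such weights; whether the authors used exactly this procedure is an assumption. The function names and the per-word probabilities are hypothetical and are not taken from the paper.

```python
import math


def mixture_probs(weights, component_probs):
    """Linearly interpolate the per-word probabilities of several LMs."""
    return [sum(w * p for w, p in zip(weights, probs)) for probs in component_probs]


def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def em_interpolation_weights(component_probs, iterations=50):
    """Fit mixture weights by EM, starting from uniform weights.

    component_probs holds, for each held-out word, a tuple with the
    probability assigned to that word by each component LM.
    """
    n_models = len(component_probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        posteriors = [0.0] * n_models
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for k in range(n_models):
                # Responsibility of model k for this word.
                posteriors[k] += weights[k] * probs[k] / mix
        # New weight = average responsibility over the held-out words.
        weights = [c / len(component_probs) for c in posteriors]
    return weights


# Hypothetical probabilities from three LMs (e.g. train, subs-C24H and
# show-specific) for five held-out words; illustrative values only.
held_out = [
    (0.010, 0.002, 0.050),
    (0.200, 0.150, 0.100),
    (0.001, 0.004, 0.030),
    (0.050, 0.060, 0.020),
    (0.300, 0.100, 0.400),
]

uniform = [1.0 / 3] * 3
tuned = em_interpolation_weights(held_out)
ppl_uniform = perplexity(mixture_probs(uniform, held_out))
ppl_tuned = perplexity(mixture_probs(tuned, held_out))
print(tuned, ppl_uniform, ppl_tuned)
```

Each EM iteration cannot decrease the held-out likelihood, so the tuned mixture is never worse than the uniform starting point on the data it was fitted to. In the three-way, per-show setting, the same routine would simply be run once per show on that show's validation set.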
7. References

[1] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 4774–4778.
[2] W. Xiong, L. Wu, J. Droppo, X. Huang, and A. Stolcke, “The Microsoft 2017 conversational speech recognition system,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 5934–5938.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.
[4] H. Bourlard and C. J. Wellekens, “Links between Markov models and multilayer perceptrons,” in Advances in Neural Information Processing Systems I, D. Touretzky, Ed. San Mateo, CA, USA: Morgan Kaufmann, 1989, pp. 502–510.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[6] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proc. Workshop on Human Language Technology, Plainsboro, NJ, USA, Mar. 1994, pp. 307–312.
[7] M. del Agua, A. Giménez, N. Serrano, J. Andrés-Ferrer, J. Civera, A. Sanchis, and A. Juan, “The translectures-UPV toolkit,” in Advances in Speech and Language Technologies for Iberian Languages, Nov. 2014, pp. 269–278.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[9] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in Proc. of the Int. Conf. on Spoken Language Processing, Denver, CO, USA, Sep. 2002, pp. 901–904.
[10] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Improving the training and evaluation efficiency of recurrent neural network language models,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Brisbane, Australia, Apr. 2015, pp. 5401–5405.
[11] X. Chen, X. Liu, Y. Qian, M. J. F. Gales, and P. C. Woodland, “CUED-RNNLM – An open-source toolkit for efficient training and evaluation of recurrent neural network language models,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Mar. 2016, pp. 6000–6004.
[12] F. Jelinek and R. L. Mercer, “Interpolated estimation of Markov source parameters from sparse data,” in Proc. Workshop on Pattern Recognition in Practice, Amsterdam, Netherlands, Apr. 1980, pp. 381–397.
[13] A. Martínez-Villaronga, M. A. del Agua, J. Andrés-Ferrer, and A. Juan, “Language model adaptation for video lectures transcription,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp. 8450–8454.
[14] A. Martínez-Villaronga, M. A. del Agua, J. A. Silvestre-Cerdà, J. Andrés-Ferrer, and A. Juan, “Language model adaptation for lecture transcription by document retrieval,” in Proc. IberSpeech, Nov. 2014.
[15] J. A. Silvestre-Cerdà, A. Giménez, J. Andrés-Ferrer, J. Civera, and A. Juan, “Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System,” in Proc. IberSpeech, Madrid, Spain, Nov. 2012, pp. 596–600.
[16] S. Meignier and T. Merlin, “LIUM SpkDiarization: an open source toolkit for diarization,” in Proc. CMU SPUD Workshop, Dallas, TX, USA, Mar. 2010.
[17] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic segmentation, classification and clustering of broadcast news audio,” in Proc. DARPA Speech Recognition Workshop, 1997, pp. 97–99.
[18] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, “RASR - the RWTH Aachen University open source speech recognition toolkit,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, Dec. 2011.
[19] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney, “RASR/NN: The RWTH neural network toolkit for speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, May 2014, pp. 3313–3317.
[20] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney, “RETURNN: The RWTH extensible training framework for universal recurrent neural networks,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, Mar. 2017, pp. 5345–5349.
[21] A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in Annual Meeting of the Assoc. for Computational Linguistics, Melbourne, Australia, Jul. 2018.
[22] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of the Int. Conf. on Machine Learning, San Diego, CA, USA, May 2015.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
DOI: 10.21437/IberSPEECH.2018-55

Abstract

Deep Neural Networks (DNN) are a fundamental part of current ASR. State-of-the-art systems are hybrid models in which acoustic models (AM) are designed using neural networks. However, there is an increasing interest in developing end-to-end Deep Learning solutions where a neural network is trained to predict character/grapheme or sub-word sequences which can be converted directly to words. Though several promising results have been reported for end-to-end ASR systems, it is still not clear if they are capable of unseating hybrid systems.

In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech to Text Transcription Challenge. The hybrid ASR is based on Kaldi, and Wav2Letter is the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23% for the hybrid system (lowercase format without punctuation). A major limitation of Wav2Letter has been its high training computational demand (between 6 hours and 1 day/epoch, depending on the training set). This forced us to stop the training process to meet the Challenge deadline. But we believe that with more training time it will provide results competitive with the hybrid system.

Index Terms: TV shows Speech-to-Text transcription, ASR systems, Hybrid DNN-HMM, End-to-end Deep Learning.

1. Introduction

Deep Neural Networks (DNN) have become a fundamental part of current ASR systems. State-of-the-art approaches are generally hybrid models in which acoustic models (AM) are designed using neural networks to create HMM class posterior probabilities. These HMM-based neural network acoustic models (DNN-HMM) are combined with conventional pronunciation (PM) and language (LM) models [1].

The main limitation of hybrid ASR systems is the high complexity associated with the boot-strapping process for training the DNN-HMM models, which requires phoneme alignments for framewise cross entropy, and a sophisticated beam search decoder [2].

Though several approaches are being proposed to overcome these limitations, such as training without requiring a phoneme alignment, or avoiding the lexicon [3], there is an increasing interest in working towards end-to-end solutions.

In end-to-end deep learning ASR systems [4] a neural network is trained to predict character/grapheme or sub-word sequences which can be converted directly to words, or even word sequences directly. They present the important advantage of integrating the conventionally separate acoustic, pronunciation and language models (AM, PM, LM) into a unified neural network modeling framework. Using a simplified training process, acoustic, pronunciation and language modeling components are integrated to generate the hypothesized graphemes, sub-words or word sequences. This also greatly simplifies the decoding.

Most end-to-end ASR approaches [5] are typically based on a Connectionist Temporal Classification (CTC) framework [6, 7], a Sequence-to-Sequence attention-based encoder-decoder [8, 9] or a combination of both [10].

Several sequence-to-sequence Neural Network models have been proposed, such as the Recurrent Neural Network Transducer (RNN-T) [11], Listen, Attend and Spell (LAS) [9], and Monotonic Alignments [12]. In attention-based encoder-decoder schemes [4] the listener encoder module plays a role similar to that of a conventional acoustic model, the attender learns alignments between the source and the target sequence, and the decoder works as a language model. Better performance has been reported [4] by modelling longer units such as word piece models (WPM) and using Multi-head attention (MHA).

Several research works have also aimed to re-use HMM-based models to improve end-to-end systems. For example, well-trained tied-triphone acoustic models (AM) can be used as an initial model for a character-based end-to-end system [5], or tied-triphone CTC models can be trained from scratch, although in this case a lexicon was required. However, these methods have important limitations as they demand a complex system development, high computation and a large amount of data for training, thus losing the attractiveness of end-to-end systems.

Therefore, even with the promising results already reported for many end-to-end ASR systems, it is still not clear if they are capable of unseating the current state-of-the-art hybrid DNN-HMM ASR systems.

In this paper, our aim is to contribute to the research towards the development of end-to-end ASR systems as an alternative to state-of-the-art hybrid ASR systems. For this purpose, we develop and compare two open-source ASR systems, one hybrid and one end-to-end, for a specific speech-to-text task. As hybrid system the DNN-HMM Kaldi Toolkit [13] is used, while Wav2Letter [14] is the end-to-end framework. The speech-to-text task we work on is the RTVE IberSpeech 2018 Challenge¹. This task represents a highly demanding domain corresponding to the automatic transcription of TV shows and broadcast news, under specific conditions such as different noisy environments and the lack of accurate transcriptions for training.

¹ http://iberspeech2018.talp.cat/index.php/albayzin-evaluation-challenges/

The rest of the paper is structured as follows. In Section
2, we describe the end-to-end and the hybrid ASR systems, which we have evaluated in the RTVE IberSpeech 2018 Challenge. Section 3 details the datasets we have used, how we have preprocessed them and the experimental protocols we have followed. Results are shown and discussed in Section 4. Finally, we summarize our results and conclusions in Section 5.

2. Deep Learning ASR Systems

2.1. End-to-end Speech Recognition System

As representative of open-source end-to-end ASR systems, we have chosen Wav2Letter. The Wav2Letter ASR system is based on a neural network architecture composed of convolutional units [15], with a Gated Linear Units (GLUs) implementation [16, 17]. The acoustic modeling is based on Mel-Frequency Spectral Coefficients (MFSC) which feed the Gated CNNs that generate letter scores at their outputs. These scores are processed by an alternative to CTC, the Auto Segmentation Criterion (ASG), leading to letter-based sequences (see Figure 1). In order to train the acoustic model, the feature extraction module computes 40-dimensional MFSCs, chosen for their robustness to small time-warping distortions, as reported in [17].

Figure 1: Architecture of the end-to-end ASR system based on the Wav2Letter system. Adapted from [17].

As mentioned above, the Neural Network architecture is trained to infer the segmentation of each letter in the training transcriptions using the Auto Segmentation Criterion (ASG), an alternative criterion to Connectionist Temporal Classification (CTC). CTC takes into account all possible letter sequences, allowing a special blank state, which represents possible garbage frames between letters or the separation between repeated letters. In ASG blank states are replaced by the number of repetitions of the previous letter, so a simpler graph is obtained [14]. Besides this graph, which scores letter sequences depicting the right transcription, another graph is used to score all letter sequences. Finally, a beam-search decoder (as described in [14]) is used at the last stage. It depends on beam thresholding, histogram pruning and an optional language model.

2.2. Hybrid Speech Recognition System

In order to compare with end-to-end ASR, we have built a hybrid ASR system using the open-source Kaldi Toolkit [13]. The ASR architecture consists of the classical sequence of three main modules: an acoustic model, a dictionary or pronunciation lexicon and an N-gram language model. These modules are combined for training and decoding using Weighted Finite-State Transducers (WFST) [18]. The acoustic modeling is based on Deep Neural Networks and Hidden Markov Models (DNN-HMM).

For the implementation of Kaldi DNN-HMM acoustic modelling we followed the so-called chain model [19], based on a subsampled time-delay neural network (TDNN) [20]. This implementation uses a 3-fold reduced frame rate at the output of the network; this represents a significant reduction in decoding computation and the corresponding test time. The reduced frame rate requires HMMs that are traversable in one transition; we use fixed transition probabilities in the HMM, and do not train them. Additionally, training the DNN-HMM with a sequence-level objective function allowed its implementation as a maximum mutual information (MMI) criterion without lattices on GPU, doing a full forward-backward on a decoding graph derived from a phone n-gram language model.

Starting from the available transcript of the training speech data, training acoustic models is an iterative process of audio re-alignment starting from GMM/HMM monophone models and progressing to more accurate triphone models through re-training. For all of our experiments we used a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT [21]. For initial GMM/HMM alignments, speaker-independent acoustic models were obtained using fMLLR. The input features to the neural network in DNN-HMM models were represented by a fixed transform that decorrelates a vector of (40*7)-dimensional features obtained by packing seven frames of 40-dimensional MFCC(spliced)+LDA+MLLT+fMLLR features, corresponding to 3 frames on each side of the central frame.

To improve robustness, mainly to speaker variability, speaker adaptive training (SAT) based on i-vectors was also implemented [22]. Speaker adaptive models were obtained by fine-tuning DNNs to a speaker-normalized feature space. On each frame a 100-dimensional i-vector is appended to the 40-dimensional acoustic space. In this extended acoustic space i-vectors may supply information about different sources of variability, such as speaker identity, so the network itself can do any feature normalization that is needed. To overcome some issues reported when test signals have substantially different energy levels than the training data, in our experiments the test-signal energies were normalized to the average of the training data.

3. Experimental Setup

3.1. Datasets

3.1.1. RTVE2018 Database

In this evaluation, we investigate the performance of end-to-end and hybrid ASR systems on RTVE voice contents², a collection of TV shows and broadcast news from 2015 to 2018.

² http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf

The training partition consists of audio files with subtitles, with the following limitations:

• Subtitles have been generated through a re-speaking procedure that sometimes summarizes what has been said, producing imprecise transcriptions.
• Transcriptions have not been supervised by humans.
• Timestamps are not properly aligned with the speech signal.

To avoid the use of these low-quality transcriptions, which could cause confusion in the acoustic space, audio data was initially aligned by a baseline alignment system. This system was the same hybrid system described in Section 2.2, but trained using our own labeled databases, explained in Subsection 3.1.2. To improve the quality of these automatic transcriptions, they underwent a manual supervision process.
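
As a side note on the feature pipeline of Section 2.2, the dimensions quoted there can be checked with a short sketch. It only illustrates frame splicing and the resulting vector sizes; it is not Kaldi code. The helper name, the dummy frames, the edge padding by frame repetition and the truncation used as a stand-in for the LDA+MLLT projection are all assumptions made for illustration.

```python
def splice(frames, left, right):
    """Concatenate each frame with its `left` previous and `right` next frames.

    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# 20 dummy 13-dimensional MFCC frames; frame t is filled with the value t.
mfcc = [[float(t)] * 13 for t in range(20)]

# Splicing across 9 frames (4 on each side): 13 * 9 = 117 dimensions.
spliced = splice(mfcc, 4, 4)

# Placeholder for the LDA+MLLT projection 117 -> 40. A real system applies
# an estimated linear transform; truncation here is NOT LDA, it only
# reproduces the output dimension.
reduced = [vec[:40] for vec in spliced]

# DNN input: 3 frames on each side of the central frame, 40 * 7 = 280.
dnn_input = splice(reduced, 3, 3)

# Appending a 100-dimensional i-vector to a 40-dimensional frame gives 140.
ivector = [0.0] * 100
with_ivector = [vec + ivector for vec in reduced]

print(len(spliced[0]), len(reduced[0]), len(dnn_input[0]), len(with_ivector[0]))
```

The sketch makes explicit why the text speaks of a (40*7)-dimensional vector: the context window is applied after the dimensionality reduction, not to the raw 13-dimensional MFCCs.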
The RTVE training partition consists of 460 hours. However, due to our limitations in the manual supervision process, only two training datasets have been prepared for our experiments: RTVE train350 (350 hours of the train set) and RTVE train100 (a 100-hour subset of RTVE train350). Validation datasets were extracted from 10% of the training set for each of the RTVE train350 and RTVE train100 partitions. These validation datasets have been designed to cover the different scenarios in the whole RTVE training data: political and economic news, in-depth interviews, debate, live magazines, weather information, game and quiz shows. Consequently, for testing purposes, two development datasets have been defined as follows:

• RTVE dev1: 5 hours have been selected in a balanced way in order to have an hour of each show type (e.g. 20H dev1 is one hour of the 20H program).
• RTVE dev2: 1 hour has been selected carrying out the same procedure as for RTVE dev1.

3.1.2. Other Databases

Acoustic models have also been evaluated under the open training condition, that is, by using additional datasets. To this end, two additional datasets have been used to train the system.

• VESLIM: It consists of 103 hours of clean Spanish speech, where speakers read some sentences. More details in [23].
• OWNMEDIA: It contains 162 hours of TV programs, interviews, lectures and similar multimedia contents. This dataset contains manual transcriptions.

3.2. Training

For the hybrid ASR system, our AM was trained following a process based on the Switchboard Kaldi recipe (TDNN Chain models).

In order to create the PM for the hybrid ASR system, a set of 29 real phonemes (without silence phones) is used. For the end-to-end system, instead, the vocabulary contains 38 graphemes representing the standard Spanish alphabet plus stressed vowels, the apostrophe, and symbols for repetitions and separation.

3.3. Resources

Experiments have been carried out using several computation resources. A server with 2 Xeon E5-2630v4, 2.2 GHz, 10C/20TH and 3 Nvidia GTX 1080 Ti GPUs was used for the hybrid ASR system: GPU for the DNN training, and CPU for the HMM training and final decoding.

In order to train end-to-end models, more RAM was required, so an Nvidia Quadro P5000 GPU (16 GB) was used for training and letter decoding.

4. Results

4.1. Hybrid ASR System

Table I: WER on the reference TV show (LM-20171107) for acoustic models over closed and open training conditions (different train data sizes and language models).

Hybrid systems                              WER(%)
RTVE train100 + LM subtitles                 26.01
RTVE train350 + LM subtitles                 24.21
RTVE train350 + LM supervised                25.95
RTVE train350 + LM subtsuperv                23.47
RTVE train350 + Others + LM subtsuperv       23.20
RTVE train350 + Others + LM open             22.23

For testing on the closed training condition, we used the full volume of RTVE training data that has been manually revised (RTVE train350).

In addition to the AM, the LM has been trained with different corpora. Four LMs were generated: LM subtitles (based on the subtitles given in the RTVE database for the Challenge), LM supervised (based on transcriptions of the RTVE training data supervised by humans), LM subtsuperv (based on the two corpora mentioned above) and LM open (based on several corpora: news between 2015 and 2018, interviews, film captions and the two mentioned before).

As shown in Table I, by adding supervised transcriptions from the training set to a subtitles-based LM, we achieved a WER of 23.47%, a 3% relative improvement over the same system using a language model trained only with subtitles. This improvement is mainly due to the fact that supervised transcriptions contain some conversational language features (e.g. false starts, truncated words, filler words, syntactic structure changes while talking, etc.) that are generally omitted in subtitles because of the re-speaking procedure. In contrast, an LM trained only with supervised transcriptions did not provide better results. In this case, there were some tags in these transcriptions where words could not be confidently revised/transcribed (e.g. foreign names, mispronunciation, background noise, etc.). Inserting tags meant including the “unk” symbol in the LM, and results were not as good as expected.

We next evaluated the systems under the open condition, where we increased the amount of data for both AM and LM training. We combined the RTVE training dataset and our own databases (see Section 3.1.2), resulting in a total of over 600 hours of speech. As can be seen in Table I, the hybrid system provided a slight improvement. But, more importantly, WER went down to 22.23% when transcriptions from additional corpora were incorporated to train the LM. As a result, increasing both the amount of audio and transcription data will enable us to obtain the maximum performance, and to cover as much as possible of the information appearing in test files.

The division of development datasets according to the different show types makes a deeper error analysis possible. Table II shows that models applied to TV programs such as 20H and Millennium obtain the best results, a low WER of 14-17% for the best models. This could be explained because contents are
First, we compared the performance of our hybrid system in- daily news having good acoustic conditions (clean voice, only
creasing training data volume from 100 to 350 hours, and by one speaker at time) and being better featured in LM. However,
adding additional training data from our external datasets. CA (Comando Actualidad) dataset contains some challenging
Evaluation plans mentioned that a reference TV show (LM- scenarios (interviews at the street, background noise, overlap-
20171107) has been used to obtain results with some commer- ping, music). As a result, models achieve a high value of WER
cial systems. This show is a live magazine covering Spanish (49.51%).
current events and it has been used to obtain first results. As Furthermore, it has to be emphasized that reference master
expected, more than 14% relative improvement in WER is ob- of transcriptions was given without any review from our part. To
tained we adding all the available data. evaluate the possible impact of transcription errors in the refer-
Table II: WER(%) on the different datasets of models over a closed and open training conditions (different train data volume and
language models). The duration of each dataset is an hour.
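The WER figures and relative improvements reported in this section can be reproduced with a short sketch. It assumes the usual definition of WER as the word-level edit distance divided by the number of reference words; the function names and the example sentences are illustrative and are not taken from the paper or from its scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using a word-level Levenshtein distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return 100.0 * previous[-1] / len(ref_words)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over four words.
example_wer = word_error_rate("buenas tardes a todos", "buenas tarde todos")
print(f"WER = {example_wer:.1f}%")
```

A relative improvement is always measured against the baseline WER, so the same absolute gain counts for more when the baseline is already low.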
neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.
[3] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of generalized full-sum HMM training,” in Proc. Interspeech, 2017, pp. 944–948.
[4] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.
[5] S. Kim, M. L. Seltzer, J. Li, and R. Zhao, “Improved training for online end-to-end speech recognition systems,” arXiv preprint arXiv:1711.02212, 2017.
[6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[7] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
[8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[10] T. Hori, S. Watanabe, and J. Hershey, “Joint CTC/attention decoding for end-to-end speech recognition,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 518–529.
[11] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” arXiv preprint arXiv:1704.00784, 2017.
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[14] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[15] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
[16] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in International Conference on Machine Learning, 2017, pp. 933–941.
[17] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based speech recognition with gated convnets,” CoRR, 2017.
[18] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
[19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751–2755.
[20] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[21] S. P. Rath, D. Povey, K. Veselý, and J. Černocký, “Improved feature processing for deep neural networks,” in Interspeech, 2013, pp. 109–113.
[22] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 11, pp. 1938–1949, 2015.
[23] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, “Automatic phonetic segmentation,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617–625, 2003.
[24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
267 10.21437/IberSPEECH.2018-56
2.1. RTVE2018 dataset

The RTVE2018 dataset was released by RTVE and comprises a collection of TV shows drawn from diverse genres and broadcast by the public Spanish National Television (RTVE) from 2015 to 2018. The actual number of hours provided in the original dataset as training and development sets is presented in Table 1.

Table 1: Duration of each partition of the original RTVE2018 dataset

subset   duration
train    462 h. 9 min.
dev1     62 h. 23 min.
dev2     15 h. 13 min.
total    539 h. 45 min.

The main problem of this dataset was that a large share of the audios had imperfect transcriptions and, therefore, could not be used as such for training and evaluation purposes. With the aim of recovering only the correctly aligned segments, a very costly process was carried out: alignment and re-alignment techniques were applied first, followed by manual revision and automatic recognition. The alignment and re-alignment processes consist of two steps. In the first step, we tried to align the original audios with their corresponding transcriptions using 4 different beam values (10, 100, 1,000 and 10,000). For the train partition, only 101 hours and 47 minutes were aligned after this initial step, so 360 hours and 22 minutes were definitively excluded from any training process. A second re-alignment step was then performed over this new subset of 101 hours and 47 minutes, using beam and retry-beam values of 1 and 2 respectively, obtaining a total of 86 hours and 29 minutes of audio segments that were considered as nearly correctly aligned. The alignment processes were performed using a feed-forward DNN-HMM acoustic model trained with the Kaldi toolkit [15] and estimated over contents from the broadcast domain. Finally, a small partition of the nearly correctly aligned hours was revised manually, whilst the remainder was recognized using a recognition architecture different from the one employed for the alignments. In this case, the recognition was performed using an E2E based recognition system trained with the same contents from the broadcast domain. Only the recognition outputs that matched the reference exactly were tagged as perfect segments. The same cleaning methodology was also applied to the dev1 and dev2 partitions.

The total number of hours discarded after the first step, and the hours tagged as nearly perfect and completely perfect, are summarized in Table 2. These hours correspond to audio segments that lasted more than one second, since the shorter ones were also discarded.

Table 2: RTVE2018 dataset after the alignment and re-alignment processes

subset   alignment        re-alignment       revision+E2E
         (discarded)      (nearly perfect)   (perfect)
train    360 h. 22 min.   86 h. 29 min.      56 h. 27 min.
dev1     7 h. 43 min.     44 h. 34 min.      29 h. 39 min.
dev2     9 h. 27 min.     5 h. 8 min.        4 h. 3 min.
total    377 h. 32 min.   136 h. 11 min.     90 h. 9 min.

As can be seen in Table 2, a high number of hours were discarded from the original RTVE2018 dataset. In the end, a total of 136 hours and 11 minutes were considered as nearly perfect audios, whilst only 90 hours and 9 minutes can be considered perfectly aligned, including the re-aligned train, dev1 and dev2 partitions. Both subsets were finally used to build and tune the acoustic models (AM). The development set for the tuning of the systems was extracted from the completely perfect subset, as shown in Table 3.

Table 3: New perfectly and nearly perfectly aligned subsets for train and development. The 4 hours from dev correspond to the same contents in both subsets.

subset                     train            dev
Perfectly aligned          86 h. 9 min.     4 h.
Nearly perfectly aligned   132 h. 11 min.   4 h.

In terms of text data, a total of 3.5 million sentences and 61 million words were compiled. This data was used to estimate the language models (LM) and the punctuation and capitalization modules.

2.2. Open dataset

The open dataset was used to build the ASR systems for the open condition. In addition to the perfectly re-aligned subset from the RTVE2018 dataset, 4 different corpora were prepared for training. The SAVAS corpus [16] is composed of broadcast news contents from the Basque Country’s public broadcast corporation EiTB (Euskal Irrati Telebista), and includes annotated and transcribed audios in both clean (studio) and noisy (outside) conditions. The Youtube RTVE Series corpus includes Spanish broadcast contents of RTVE shows and series gathered from the Youtube platform. The audio contents were downloaded along with the automatic transcriptions provided by the platform. These audios and their corresponding automatic transcriptions were then split and re-aligned following the same methodology described in Section 2.1. Finally, the Albayzin [17] and Multext [18] corpora were also included. The development set corresponded to the in-domain new dev partition shown in Table 3. The total number of hours available for the open condition is summarized in Table 4.

Table 4: The open dataset description

corpus         #hours
RTVE2018       132 h. 11 min.
SAVAS          160 h. 58 min.
Youtube RTVE   197 h. 13 min.
Albayzin       6 h. 5 min.
Multext        53 min.
Total          497 h. 20 min.

Regarding text data, data selection techniques were applied on general news data gathered from digital newspapers, using the LM created with the in-domain RTVE2018 text data as a reference. A total of 3.5 million sentences and 71 million words were selected with a maximum perplexity threshold value of 120. Hence, summing the in-domain and new text data, a total of 132 million words were employed to estimate the LM and the punctuation and capitalization modules for the open condition.
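The perplexity-based text selection described for the open condition can be illustrated with a minimal sketch. It assumes a toy add-one-smoothed unigram model as the in-domain reference LM, purely to keep the example self-contained; the paper does not state which LM order or toolkit was used for selection, and the function names and sentences below are hypothetical. Only the threshold value of 120 comes from the text.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram log-probability function (toy reference LM)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab_size))

    return logprob


def perplexity(sentence, logprob):
    """Per-word perplexity of a sentence under the reference LM."""
    words = sentence.split()
    if not words:
        return float("inf")
    average = sum(logprob(word) for word in words) / len(words)
    return math.exp(-average)


def select_sentences(candidates, logprob, threshold=120.0):
    """Keep candidate sentences whose perplexity does not exceed the threshold."""
    return [s for s in candidates if perplexity(s, logprob) <= threshold]


# Hypothetical in-domain text and out-of-domain candidates.
in_domain = ["el presidente habla hoy", "el tiempo hoy"]
reference_lm = train_unigram(in_domain)
candidates = ["el tiempo hoy", "zzz qqq"]
# A much lower threshold is used here because the toy vocabulary is tiny.
print(select_sentences(candidates, reference_lm, threshold=10.0))
```

With a real n-gram LM the same loop applies unchanged; only the scoring function would differ.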
3. Main architectures

Two main architectures were employed to build the systems for both closed and open conditions.

3.1. LSTM-HMM based systems

These systems include a bidirectional LSTM-HMM acoustic model and n-gram language models for decoding and rescoring purposes. The AMs and final graphs were estimated using the Kaldi toolkit. The AM corresponded to a hybrid LSTM-HMM implementation, where bidirectional LSTMs were trained to provide posterior probability estimates for the HMM states. This model was constructed with a sequence of 3 LSTM layers, using 640 memory units in the cell and 1024 fully connected hidden layer outputs. The number of steps used in the estimation of the LSTM state before prediction of the first label was fixed to 40 in both contexts. Furthermore, modified Kneser-Ney smoothed 3-gram and 9-gram models were used for decoding and re-scoring of the lattices respectively. Both LMs were estimated using the KenLM toolkit [19].

3.2. E2E based systems

The E2E systems were developed following the Deep Speech 2 architecture [2]. The core of the system is basically an RNN model, in which speech spectrograms are ingested and text transcriptions are provided as output.

Initially, a sequence of 2 layers of 2D convolutional neural networks (CNN) is employed as a spectral feature extractor from spectrograms. A 2D batch normalization function is then applied to the output of both layers, in addition to a hard tanh function as an activation function. The E2E systems were set up using 5 bidirectional Gated Recurrent Unit (GRU) [20] layers as the recurrent network. Each hidden layer is composed of 800 hidden units. After the bidirectional recurrent layers, a fully connected layer is applied as the last layer of the whole model. The output corresponds to a softmax function which computes a probability distribution over the characters. During the training process, the CTC loss function is computed to measure the error of the predictions, whilst the gradient is estimated using the backpropagation through time algorithm with the aim of updating the network parameters. The optimizer is Stochastic Gradient Descent (SGD).

In addition, an external LM was integrated for decoding with the aim of rescoring the initial lattices. To this end, modified Kneser-Ney smoothed 5-gram models were estimated using the KenLM toolkit.

4. Systems descriptions

A total of 6 systems based on the above described architectures were submitted to the challenge, three systems per condition.

4.1. Closed condition

4.1.1. Primary system

The primary system submitted to the closed condition was called ’Vicomtech-PRHLT p-K1 closed’ and it is a bidirectional LSTM-HMM based system combined with a 3-gram LM for decoding and a 9-gram LM for re-scoring lattices. The AM was trained for 10 epochs, with an initial and final learning rate of 0.0006 and 0.00006 respectively, using a mini-batch size of 100 and 20,000 samples per iteration. The AM was trained with the nearly perfectly aligned partition (see Table 3), which was 3-fold augmented through the speed based augmentation technique. Each audio was transformed randomly depending on a modification parameter ranging between 0.9 and 1.1. A total of 396 hours and 33 minutes were therefore used for training. The LMs were estimated with the in-domain texts compiled from the RTVE2018 dataset.

4.1.2. Contrastive systems

The first contrastive system was called ’Vicomtech-PRHLT c1-K2 closed’ and it was set up using the same configuration as the primary system, but the AM was estimated using the 3-fold augmented acoustic data of the perfectly aligned partition (see Table 3). A total of 258 hours and 27 minutes were employed for training.

The same data was used to build the second contrastive system, tagged as ’Vicomtech-PRHLT c2-E1 closed’. It was an E2E recognition system which follows the architecture described above, and it was evolved for 30 epochs. The LM was a 5-gram with an alpha value of 1.5 and a beam-width of 1000 during decoding.

4.2. Open condition

4.2.1. Primary system

The primary system of the open condition was called ’Vicomtech-PRHLT p-E1 open’ and it was based on the E2E architecture described in Section 3.2. This system was an evolution of an already existing E2E model, which was built using the 3-fold augmented SAVAS, Albayzin, and Multext corpora for 28 epochs. This model reached a WER of 7.2% on a 4-hour test set of the SAVAS corpus.

For this challenge, it was evolved for 2 new epochs using the same corpora in addition to the 3-fold augmented nearly perfectly aligned corpus obtained from the RTVE2018 dataset (see Table 3). A total of 897 hours were used for training. The LM was a 5-gram trained with the text data from the open dataset, with an alpha value of 0.8 and a beam-width of 1000 during decoding.

4.2.2. Contrastive systems

The first contrastive system was called ’Vicomtech-PRHLT c1-E2 open’ and, like the primary system, it was based on the previously explained E2E architecture. This system was also an evolution of the already existing E2E model, but in this case, it was evolved for one epoch using the 3-fold augmented SAVAS, Albayzin, Multext, nearly perfectly aligned partition and Youtube RTVE corpora. The total duration of the training audios was 1488 hours. The LM was a 5-gram trained with the text data from the open dataset, with an alpha value of 0.8 and a beam-width of 1000 during decoding.

The second contrastive system was composed of a bidirectional LSTM-HMM acoustic model combined with a 3-gram LM for decoding and a 9-gram LM for re-scoring lattices. The AM was evolved for 10 epochs, with an initial and final learning rate of 0.0006 and 0.00006 respectively, using a mini-batch size of 100 and 20,000 samples per iteration, and it was trained with the same data as the primary system of the open condition. The LMs were estimated with the text data from the open dataset.
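The role of the alpha value quoted for the E2E systems can be sketched as follows. The example assumes that each hypothesis is ranked by its acoustic log-probability plus alpha times its LM log-probability, in the spirit of Deep Speech 2 decoding; any further terms the actual decoder may use (such as a word-count bonus) are left out, and the hypotheses and scores are invented for illustration.

```python
def combined_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    """Weighted sum of acoustic and language model log-probabilities."""
    return am_logprob + alpha * lm_logprob


def best_hypothesis(hypotheses, alpha: float) -> str:
    """Return the text of the hypothesis with the highest combined score.

    Each hypothesis is a (text, am_logprob, lm_logprob) tuple.
    """
    best = max(hypotheses, key=lambda h: combined_score(h[1], h[2], alpha))
    return best[0]


# Hypothetical beam contents: the second entry is acoustically slightly
# better but much less likely according to the LM.
beam = [
    ("hola que tal", -4.0, -6.0),
    ("ola que tal", -3.5, -9.0),
]

for alpha in (0.0, 0.8, 1.5):
    print(alpha, best_hypothesis(beam, alpha))
```

A larger alpha makes the decoder trust the LM more, which is consistent with the higher value (1.5) used when the acoustic model had less training data.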
5. Results

The results obtained over the development set shown in Table 3 are presented in Table 5. The development set is composed of audio segments from all the TV shows included in the original RTVE2018 dataset and lasts a total of 4 hours.

Table 5: WER results for each submitted system over the generated development set

type  system                        cond.   WER
P     Vicomtech-PRHLT p-K1 closed   Closed  22.6
C1    Vicomtech-PRHLT c1-K2 closed  Closed  22.8
C2    Vicomtech-PRHLT c2-E1 closed  Closed  26.6
P     Vicomtech-PRHLT p-E1 open     Open    20.7
C1    Vicomtech-PRHLT c1-E2 open    Open    20.5
C2    Vicomtech-PRHLT c2-K1 open    Open    22.0

5.1. Processing time and resources

The decodings of the 6 recognition systems were performed on an Intel Xeon CPU E5-2683v4 2.10 GHz 4xGPU server with 256GB DDR4 2400MHz RAM memory. Each GPU corresponds to an NVIDIA Geforce GTX 1080 Ti 11GB graphics acceleration card.

Table 6 presents the processing time and computational resources needed by each submitted system for the decoding of the released test set of almost 40 hours of audios. It should be noted that the LSTM-HMM based systems were decoded using CPU cores, whilst the E2E systems took advantage of the GPU cards.

Table 6: Processing time and computational resources needed by each submitted system

system                   RAM    CPU cores  GPU   Time
Vicom-PRHLT p-K1 close   12GB   20         -     24h
Vicom-PRHLT c1-K2 close  12GB   20         -     24h
Vicom-PRHLT c2-E1 close  4GB    8          7GB   12h
Vicom-PRHLT p-E1 open    5GB    8          7GB   8h
Vicom-PRHLT c1-E2 open   5GB    8          7GB   9h
Vicom-PRHLT c2-K1 open   19GB   12         -     40h

6. Conclusions

In this paper, the ASR systems submitted to the IberSPEECH-RTVE Speech to Text Transcription Challenge 2018 have been presented. In the beginning, one of the most costly tasks was the processing of the released RTVE2018 dataset, since a high number of transcriptions were imperfect or did not fit exactly to the related spoken audio. Furthermore, the type of contents posed a notable difficulty to the task, given that the TV shows included most of the main challenges for any speech recognition engine, including spontaneous speech, accents, background noise, and/or overlapped speakers, among others. Thus, the cleaning process of the dataset became a crucial task to exploit the data correctly.

Looking at the results obtained on the internally generated development set and presented in Table 5, it can be clearly deduced that LSTM-HMM based systems performed better when less training data was available. In fact, the primary system in the closed condition achieved an error 4 percentage points lower than the E2E based second contrastive system. In this condition, it is also remarkable how the primary system, trained with nearly correctly aligned audios, achieved better results than the first contrastive LSTM-HMM based system, which was built with perfectly aligned contents, albeit with more training data in the primary system. It suggests that in this case, exploiting more data, although they were not aligned exactly, helped systems to perform better.

In the open condition, the E2E based systems achieved better results than the LSTM-HMM based one. This could be expected since more training data were available to train the models. Even if the first contrastive system obtained a slightly better performance than the primary one, a qualitative evaluation of the results gave us the intuition that the primary system was more robust against spontaneous speech. In this sense, the alpha value (0.8) of the E2E systems, which defines the weight of the LM against the AM, was lower than the alpha value (1.5) employed in the E2E system of the closed condition, given that the AM performed better and the global system obtained higher precision, especially with spontaneous speech.

Finally, it should be remarked that all the error rates achieved in this work are lower than, or at least in the range of, the reference WER values given in the evaluation plan. These WER values were obtained by commercial ASR systems over one TV show in the dataset, and ranged between 22% and 27%.

7. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
[3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ser. ICML’14. JMLR.org, 2014, pp. II–1764–II–1772.
[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[6] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5060–5064, 2016.
[7] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.
[8] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[9] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-CTC: Automatic unit selection and target decomposition for sequence labelling,” arXiv preprint arXiv:1703.00096, 2017.
[10] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for English conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017.
[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] S. Braun, D. Neil, and S.-C. Liu, “A curriculum learning method for improved noise robustness in automatic speech recognition,” in Signal Processing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 548–552.
[15] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[16] A. del Pozo, C. Aliprandi, A. Álvarez, C. Mendes, J. P. Neto, S. Paulo, N. Piccinini, and M. Raffaelli, “SAVAS: Collecting, annotating and sharing audiovisual language resources for automatic subtitling,” in LREC, 2014, pp. 432–436.
[17] F. Casacuberta, R. Garcia, J. Llisterri, C. Nadeu, J. Pardo, and A. Rubio, “Development of Spanish corpora for speech research (Albayzin),” in Workshop on International Cooperation and Standardization of Speech Databases and Speech I/O Assessment Methods, Chiavari, Italy, 1991, pp. 26–28.
[18] E. Campione and J. Véronis, “A multilingual prosodic database,” in Fifth International Conference on Spoken Language Processing, 1998.
[19] K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
272 10.21437/IberSPEECH.2018-57
2.1. Two-step time alignment procedure

A two-step utterance level time alignment procedure is used, which includes forced alignment of plain text transcripts and word level time alignment using the NIST sclite ASR scoring utility in the Speech Recognition Scoring Toolkit (SCTK) [13].

2.1.1. Forced alignment of plain text transcripts

Time alignments in the srt type subtitle and stm type reference files of the train, dev1 and dev2 subsections of the provided training and development data are cleaned up to produce plain text transcriptions where utterances are separated by a newline character. An iterative forced alignment script [14], which accepts plain text transcripts and ASR word lattice results as input, is used to compute the utterance level time alignments. Punctuation cleanup is also applied to the plain text transcripts in order to improve the accuracy of the alignment procedure, since ASR word lattice results do not contain any punctuation. In each iteration, the alignment script uses confidence regions from the results of the previous iteration to narrow down the search space. The number of iterations is set to three based on previous experience.

2.1.2. Word level time alignment

The results from the forced alignment procedure described in the previous section have been used to generate stm type reference files to be used as the reference input to the NIST sclite ASR scoring utility. The ASR word lattice results are converted to ctm format, which is accepted by the sclite utility as a hypothesis file with word alternative level time information. The NIST sclite utility was called with the “-o sgml” option in order to generate word level logs of correct words, substitutions, deletions and insertions as seen in Table 1. Information in these logs is used to find the word level timings of all the words in the reference files, using linear time interpolation for the deletions. The computed word level timings are used to verify and update the utterance level start and end times computed with the forced alignment script described in the previous section.

... set and the results are not as accurate as the results obtained by the two-step process discussed in this work.

2.2. Data selection for training and testing

Utterance level time alignment information computed with the two-step procedure described in the previous section is used to convert the plain text transcripts into the stm reference format accepted by the NIST sclite utility, where each utterance was labeled as a different speaker in order to obtain utterance level ratios of correct, substituted, deleted and inserted words. The ASR best path results are used as the hypothesis input in ctm format. With the observation that the ratio of insertions is high for wrongly aligned utterances, the two criteria below are applied in order to select the utterances with correct alignments:

1) %insertions < (%correct + %substitutions + %deletions)
2) %correct > 0

The provided ground truth transcriptions are assumed to be correct, and no analysis was carried out to test their correctness. 278 hours of data are selected from the train, dev1 and dev2 subsections of the provided data with the described data preparation and selection mechanism. 17 hours of data from the dev2 subsection are preserved for testing and 261 hours of data from the train and dev1 subsections are used for training the ASR model.

The selected training data is combined with 91 hours of other European Spanish ASR training data from the databases Albayzin, Dihana, CORLEC-EHU and TC-STAR [5] to obtain 352 hours of training data for the ASR task.

3. ASR model training

The Kaldi framework [11] was used to train the acoustic model of the ASR system submitted to the Iberspeech 2018 Speech to Text Transcription Challenge by the Empathic team. The component diagram for ASR system training and testing is given in Figure 1.
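The two selection criteria of Section 2.2 can be expressed as a small filter. This is a minimal sketch that assumes the utterance level percentages have already been read from the sclite output; the `UtteranceScore` container, the function names and the example values are hypothetical and not part of the authors' scripts.

```python
from dataclasses import dataclass


@dataclass
class UtteranceScore:
    """Utterance level sclite ratios, in percent (hypothetical container)."""
    utt_id: str
    correct: float
    substitutions: float
    deletions: float
    insertions: float


def is_well_aligned(score: UtteranceScore) -> bool:
    """Apply criteria 1) and 2) from Section 2.2."""
    reference_side = score.correct + score.substitutions + score.deletions
    return score.insertions < reference_side and score.correct > 0


def select_utterances(scores):
    """Return the ids of the utterances that pass both criteria."""
    return [score.utt_id for score in scores if is_well_aligned(score)]


# Invented example values for three utterances.
scores = [
    UtteranceScore("utt1", correct=90.0, substitutions=5.0, deletions=5.0, insertions=2.0),
    UtteranceScore("utt2", correct=10.0, substitutions=20.0, deletions=70.0, insertions=150.0),
    UtteranceScore("utt3", correct=0.0, substitutions=60.0, deletions=40.0, insertions=10.0),
]
print(select_utterances(scores))
```

In this example the second utterance is rejected by the insertion criterion and the third because no word was recognized correctly.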
coefficients and iVectors are used as input features. Gaussian
posteriors used for the iVector estimation are based on the input
features with Cepstral Mean and Variance Normalization (CMVN).
GMM-HMM iterative phone alignment was performed before starting the
neural network training of the DNN-HMM training stage, in which the
nnet3 chain implementation of the Kaldi framework was used with a
frame-subsampling-factor of 3, reducing the number of output frames
to 1/3 of the input frames. A 3-fold data augmentation is applied to
the input acoustic features for the DNN-HMM training stage using the
reverberation algorithm implemented in the Kaldi framework, with the
noise databases RWCP, AIR and Reverb2014, in order to create
multi-condition data totalling 1057 hours.

A sub-sampled Time-delayed Deep Neural Network (TDNN) [16] with 6
layers and with ReLU and pnorm activation functions is used. Details
of the neural network architecture and of the stochastic gradient
descent (SGD) based greedy layer-wise supervised training can be
found in the system submission article for the Kaldi Aspire recipe
[17].

TDNN training for 2 epochs and 508 iterations takes 13 hours 30
minutes with 3 NVidia GPUs (Quadro K6000, GeForce Titan X, GeForce
Titan XP). Last iteration training and validation accuracies are
0.171631 and 0.199849 respectively. In order to avoid over-training,
a selective system combination is carried out over all the
iterations, skipping the first 100 iterations and considering the
recorded accuracies of the individual iteration results.

utility where ground truth transcriptions for selected utterances of
the dev2 subsection are used in stm reference format. Words from the
ASR results which are not in the start-end time interval of any
reference utterance are omitted in the scoring process. An average
WER of 23.9 is obtained in the final model testing. The base model
WER obtained using the same testing method and decoding parameters is
35.3.

The real time factor of the decoding process, including VAD, feature
extraction and lattice post-processing, is 0.022 using a 4-core Intel
i7-4820K CPU @ 3.70GHz and a single NVidia GeForce GTX 1080Ti GPU.

The provided test data for the Iberspeech 2018 competition is
processed with the produced ASR model using the same procedure
described above, and the resulting transcriptions are submitted in
plain text format.

Figure 1: ASR system training / testing components.
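The scoring step described for model testing (dropping hypothesis words that fall outside every reference utterance, then computing WER) can be sketched as follows. This is not the sclite implementation: the function names are hypothetical, and deciding membership by the word midpoint is an assumption, since the text does not say how partially overlapping words are treated.

```python
from typing import List, Sequence, Tuple

TimedWord = Tuple[str, float, float]  # (word, start, end) in seconds
Interval = Tuple[float, float]        # (start, end) of a reference utterance


def keep_words_in_reference(
    hyp: Sequence[TimedWord], refs: Sequence[Interval]
) -> List[str]:
    """Drop hypothesis words whose midpoint lies outside every reference interval."""
    kept = []
    for word, start, end in hyp:
        mid = (start + end) / 2.0
        if any(r_start <= mid <= r_end for r_start, r_end in refs):
            kept.append(word)
    return kept


def wer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Word error rate in percent, 100 * (S + D + I) / N, via edit distance.

    Assumes a non-empty reference.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one hypothesis word lies outside the reference span.
hyp_words = [("hola", 0.1, 0.4), ("ruido", 5.0, 5.3), ("mundo", 0.5, 0.9)]
kept = keep_words_in_reference(hyp_words, refs=[(0.0, 1.0)])
print(kept, wer(["hola", "mundo"], kept))
```

Filtering before scoring keeps words recognized in unannotated regions from being counted as insertions.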
process was helpful for the benchmarking of the produced ASR
model with the base model, and some other experiment results
prior to the production of the final ASR model.
Much higher WER values are obtained (average WER of 58.3 for the
final model and 105.7 for the base model) when scoring is performed
against the full ground truth text of the dev2 subsection with txt
formatted ASR hypothesis files, that is, ignoring the incorrect time
information provided with the data. A detailed analysis of the word
level sclite logs shows that these values are not reliable: without
time information, sclite misaligns such long reference and hypothesis
files. This observation motivates the two-step time alignment process
applied in this work prior to model training and testing.
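For reference, the standard WER definition explains how a value above 100, such as the 105.7 reported for the base model, can arise: insertions count in the numerator but not in the number of reference words. The counts in the second line are purely hypothetical and chosen only to reproduce a value of that size.

```latex
\mathrm{WER} = 100 \cdot \frac{S + D + I}{N}
\qquad\text{e.g.}\qquad
100 \cdot \frac{300 + 100 + 657}{1000} = 105.7
```

Here $S$, $D$ and $I$ are the numbers of substitutions, deletions and insertions, and $N$ is the number of words in the reference.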
A different value of frame-subsampling-factor is used in model
testing than in training (frame-subsampling-factor=3 in training and
frame-subsampling-factor=2 in testing), since it yields more accurate
results when the audio is decoded with Viterbi decoding using a LM.
5. Conclusion
The acoustic model building process, with data preparation and
selection based on a two-step time alignment procedure and utterance
level thresholding on WER values, yielded a well-performing acoustic
model when the new training data was merged with the training data of
the base model. The two-step time alignment procedure, together with
the utterance level data selection mechanism described in Section 2,
enabled the use of the provided data in the acoustic model training
step of ASR system generation. Model testing on the development data
using an adapted version of the base LM showed a significant
reduction in WER (from 35.3 to 23.9) compared to the base model used
in the experiments.
6. Acknowledgements
The research leading to the results presented in this paper has
been (partially) funded by the EU H2020 research and innovation
program under grant number 769872.
7. References
[1] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly,
A. W. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep
Neural Networks for Acoustic Modeling in Speech Recognition: The
Shared Views of Four Research Groups,” IEEE Signal Processing
Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with
recurrent neural networks,” in Proceedings of the 31st International
Conference on Machine Learning (ICML'14), June 21-26, Beijing, China,
2014, pp. II-1764-II-1772.
[3] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber,
“Connectionist temporal classification: labelling unsegmented
sequence data with recurrent neural networks,” in Proceedings of the
23rd international conference on Machine learning. ACM, 2006, pp.
369–376.
[4] K. Vesely, A. Ghoshal, L. Burget, and D. Povey,
“Sequence-discriminative training of deep neural networks,” in
Proceedings of INTERSPEECH, 2013, pp. 2345–2349.
[5] A. L. Zorrilla, N. Dugan, M. I. Torres, C. Glackin, G. Chollet
and N. Cannings, “Some ASR experiments using Deep Neural Networks on
Spanish databases,” IberSpeech 2016 Conference, November 23-25,
Lisbon, Portugal, Proceedings, 2016, pp. 149-158.
[6] SRILM Toolkit, http://www.speech.sri.com/projects/srilm
[7] Asunción Moreno, Dolors Poch, Antonio Bonafonte, Eduardo Lleida,
Joaquim Llisterri, José B. Mariño, and Climent Nadeu, “Albayzin
speech database: design of the phonetic corpus,” in EUROSPEECH. 1993,
ISCA.
[8] José Miguel Benedí, Eduardo Lleida, Amparo Varona, María José
Castro, Isabel Galiano, Raquel Justo, Iñigo López De Letona, and
Antonio Miguel, “Design and acquisition of a telephone spontaneous
speech dialogue corpus in Spanish: Dihana,” in Fifth LREC, 2006, pp.
1636–1639.
[9] Luis J. Rodríguez and M. Inés Torres, “Spontaneous speech events
in two speech databases of human-computer and human-human dialogs in
Spanish,” Language and Speech, vol. 49, no. 3, pp. 333–366, 2006.
[10] Henk van den Heuvel, Khalid Choukri, Christian Gollan, Asuncion
Moreno, Djamel Mostefa, “TC-STAR: New language resources for ASR and
SLT purposes,” LREC 2006.
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N.
Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G.
Stemmer, K. Vesely, “The Kaldi Speech Recognition Toolkit,” IEEE 2011
Workshop on Automatic Speech Recognition and Understanding, Hawaii,
IEEE Signal Processing Society, 2011.
[12] M. Karafiat, L. Burget, P. Matejka, O. Glembek, and J. Cernocky,
“iVector-based discriminative adaptation for automatic speech
recognition,” in 2011 IEEE Workshop on Automatic Speech Recognition &
Understanding. IEEE, Dec. 2011, pp. 152–157.
[13] Speech Recognition Scoring Toolkit (SCTK), National Institute of
Standards and Technology, US Department of Commerce.
[14] N. Dugan, Forced alignment Python script, Intelligent Voice LTD,
https://github.com/IntelligentVoice/Aligner
[15] Kaldi aspire recipe:
https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire
[16] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural
network architecture for efficient modeling of long temporal
contexts,” in Proceedings of INTERSPEECH, 2015.
[17] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey, and S.
Khudanpur, “JHU ASpIRE system: Robust LVCSR with TDNNs, i-vector
Adaptation, and RNN-LMs,” in Proceedings of the IEEE Automatic Speech
Recognition and Understanding Workshop, 2015.
[18] N. Otsu, “A threshold selection method from gray-level
histograms,” IEEE Trans. Systems, Man and Cybernetics, vol. 9, pp.
62–66, 1979.
[19] H. Xu, D. Povey, L. Mangu, J. Zhu, “Minimum Bayes Risk decoding
and system combination based on a recursion for edit distance,”
Computer Speech & Language, vol. 25, no. 4, October 2011, pp.
802-828.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
277 10.21437/IberSPEECH.2018-58
Figure 1: TDNN used in acoustic models [14]
ropean and Spanish Parliaments from the TC-STAR database,
subtitles, books, newspapers, online courses and the
transcriptions of the Mavir sessions included in the development
set of the Albayzin 2016 Spoken Term Detection Evaluation¹. The
vocabulary size of this corpus is approximately 250K words.
• The RTVE subtitles provided by the organizers of the evaluation.
This text comprises approximately 60M words and its vocabulary
size is approximately 173K words.
Table 2 shows the main characteristics of these text resources.

Table 3: Average WER on Development set.
               Primary              Contrastive
TV show        Dev1      Dev2       Dev1      Dev2
20H            12.13%    –          12.86%    –
AP             22.82%    –          24.30%    –
CA             53.95%    –          54.35%    –
LM             30.81%    –          32.48%    –
LN24H          27.22%    28.85%     28.48%    29.85%
millennium     –         25.52%     –         24.78%
Average        26.65%    25.30%     27.61%    26.47%
Average             26.37%               27.37%
[14] V. Peddinti, D. Povey and S. Khudanpur. 2015. A time delay
neural network architecture for efficient modeling of long temporal
contexts. In Proceedings of INTERSPEECH 2015.
[15] A. Stolcke. 2002. SRILM - An extensible language modeling
toolkit. Proceedings of the International Conference on Statistical
Language Processing, Denver, Colorado.
[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N.
Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G.
Stemmer, and K. Veselý. 2011. The Kaldi Speech Recognition Toolkit.
In ASRU.
[17] L. Docío, A. Cardenal and C. García. 2006. TC-STAR 2006
automatic speech recognition evaluation: The uvigo system. In Proc.
of TC-STAR Workshop on Speech-to-Speech Translation, ELRA, París,
France.
[18] C. García, J. Tirado, L. Docío and A. Cardenal. 2004.
Transcrigal: A bilingual system for automatic indexing of broadcast
news. In IV International Conference on Language Resources and
Evaluation.
[19] E. Rodríguez Banga, C. García Mateo, F.J. Méndez Pazó, M.
González, C. Magariños Iglesias. 2012. Cotovía: an open source TTS
for Galician and Spanish. In IberSPEECH 2012 – VII Jornadas en
Tecnología del Habla and III Iberian SLTech Workshop.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
281 10.21437/IberSPEECH.2018-59
formance of healthy adults.
More recently, Miranda [16] investigated the influence of
education in the macro-linguistic dimension of discourse eval-
uation, considering concepts analysis, local, global and topic
coherence, and cohesion. The study was performed on a pop-
ulation of 87 healthy, elderly Portuguese participants. Results
corroborated the ones obtained by Mackenzie et al. [15], con-
firming the effect of literacy in this type of analysis.
the hierarchy, the coreference information is used to guide the
assignment of a subtopic to the corresponding level in the
hierarchy. To this purpose, we constrain the results provided by
the coreference system to those mentions whose referent is the
subject of the sentence. We are not interested in considering
other coreferential expressions because a subtopic, being a
specialization of a topic, typically refers to the subject of the
sentence.
each topic cluster. The highest result determines the cluster for
the new sentence. In the following step, we need to assign the
current sentence embedding to a level in the current hierarchy. This
implies establishing whether we are dealing with a new or a repeated
topic and its level of specialization (i.e., subtopic, sub-subtopic,
etc.). This is achieved by first identifying, in the current
hierarchy, the sub-graph whose nodes belong to the same cluster as
the current sentence (e.g., the sub-graph corresponding to the mother
cluster). Then, we compute the cosine similarity between the current
sentence and each node of this sub-graph. The new sentence is
considered a son of the closest node if the similarity is lower than
a threshold. Otherwise, it is considered a repeated topic. If there
is no sub-graph, the sentence embedding is added as a new topic. If
the new topic turns out to be a coreferential expression, this kind
of information supersedes the cosine metric strategy, and the new
topic is added directly as a son of its referent.

Figure 3: Variation of the classification accuracy while increasing
the number of features. (Axes: Features, in selection order 1, 12, 8,
10, 7, 2, 13, 4, 3, 9, 6, 16, 14, 11, 5, 15; Accuracy, from 0.62 to
0.78.)
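The placement rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the plain-list vector representation and the threshold value are hypothetical, and the coreference override is left out.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def place_sentence(
    sentence: Vector, subgraph: Sequence[Vector], threshold: float
) -> Tuple[str, Optional[int]]:
    """Decide where a sentence embedding goes in the topic hierarchy.

    Returns ("new_topic", None) when the cluster has no sub-graph yet,
    ("child", i) when the closest node i is less similar than the threshold,
    and ("repeated", i) otherwise.
    """
    if not subgraph:
        return ("new_topic", None)
    sims = [cosine(sentence, node) for node in subgraph]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] < threshold:
        return ("child", best)
    return ("repeated", best)


# Hypothetical 2-dimensional embeddings and threshold.
nodes = [[1.0, 0.0], [0.0, 1.0]]
print(place_sentence([1.0, 0.1], nodes, threshold=0.95))
print(place_sentence([1.0, 0.5], nodes, threshold=0.95))
print(place_sentence([1.0, 0.0], [], threshold=0.95))
```

In a fuller version, a sentence flagged as coreferential would bypass this function and be attached directly under its referent, as the text describes.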
Although the algorithm developed resembles the analysis performed in
standard clinical practice, the aim of this work is not to compare
the automatic method with the manual one. Instead, our focus is on
understanding whether pragmatic features related to topic coherence
analysis may be relevant to discriminate AD. The types of features
computed, as well as the results of the classification experiments,
are described in the following sections.

5. Topic coherence features

Through the multistage approach and the final hierarchy of topics we
identified sixteen measurements: (1-4) the number of topics,
subtopics, sub-subtopics and sub-sub-subtopics introduced, (5-6) the
proportion of dependent and independent clauses to the total number
of sentences, (7) the total number of coreferential mentions, (8) the
total number of topics, subtopics, sub-subtopics and
sub-sub-subtopics repeated, (9-11) the number of sentences that were
classified as not-related, incomplete, or no-content in the first
step of the main algorithm, (12) the coefficient of variation (the
ratio of the standard deviation to the mean) of the cosine similarity
between two temporally consecutive topics, (13) the length of the
longest path from the root node to all leaves, (14) the average
number of outgoing edges of all nodes, (15) the total number of
sentences, (16) the ratio of dependent to independent clauses.

6. Results and discussion

Classification experiments were performed with a Random Forest
classifier, using the remaining 70% of the data of the Cookie Theft
corpus, since 30% of the data had been retained to model the topic
hierarchy. A stratified k-fold cross validation per subject strategy
was implemented, with k equal to 10.

Initial results, using the set of features described previously,
provided an average accuracy of 74% in distinguishing AD patients
from healthy controls. Then, in order to understand the importance of
each feature, we implemented a forward feature selection method. This
is an iterative approach in which the model is trained with a varying
number of features. Starting with no features, at each iteration we
test the accuracy of the model by adding, one at a time, each of the
features that were not selected in a previous iteration. The accuracy
is evaluated with a stratified 10-fold cross validation. The feature
that yields the best accuracy is retained for further processing. The
results of this method are shown in Figure 3. With this approach, we
identified the first six features as the most relevant in
discriminating the disease.

Our results, achieved with 70% of the data, provided an average
accuracy of 77% and an average F-score of 77% in classifying AD.
Interestingly, the number of topics was the first feature selected,
providing, alone, an average accuracy of 67%. Comparing these results
with the current state of the art, we acknowledge that Fraser et al.
[8] achieved a higher accuracy (81%) using a set of lexicosyntactic
features. On the other hand, we also note that these results are
slightly better than the ones achieved by Yancheva et al. [11]
(F-score 74%) using only a set of 12 semantic features. However, when
the authors combine lexicosyntactic and semantic information, the
F-score improves to 80%. These considerations are interesting for
multiple reasons. On the one hand, they confirm the relevance of
pragmatic features related to topic coherence in the task of
classifying AD. On the other hand, they also highlight that
lexicosyntactic features are extremely important in characterizing
the disease and should be used in a complementary way with other
features.

7. Conclusions

In this work, we approached the problem of exploiting topic coherence
analysis to automatically classify AD. To this purpose, we proposed
an algorithm inspired by the type of assessment conducted by
clinicians to construct the topic hierarchy of a picture description
task, from which we extract a reduced set of pragmatic features for
automatic classification. Initial experimental results show AD
classification performance comparable to current state of the art
approaches using different types of consolidated linguistic features.
As future work, we plan to integrate the proposed pragmatic features
with lexicosyntactic features and to explore the extension of this
kind of analysis to other types of discourse production tasks,
including open-domain tasks.

8. Acknowledgments

The authors want to express their gratitude to Dr. Filipa Miranda for
her precious advice and help provided during the development of this
work. This work was supported by Portuguese national funds through
Fundação para a Ciência e a Tecnologia (FCT), under Grants
SFRH/BD/97187/2013 and Project with reference UID/CEC/50021/2013.
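The forward feature selection procedure of Section 6 can be sketched as a greedy loop. In the paper the score is the accuracy of a Random Forest under stratified 10-fold cross validation; here `score_fn` is a hypothetical placeholder for that evaluation, and the toy scorer below exists only to make the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Greedy forward selection.

    Starting with no features, each iteration scores every remaining
    feature added to the current set and keeps the one with the best score.
    Returns the features in selection order with the score reached.
    """
    selected: List[str] = []
    remaining = list(features)
    history: List[Tuple[str, float]] = []
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        best_score, best_feature = max(scored, key=lambda pair: pair[0])
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((best_feature, best_score))
    return history


# Toy scorer (an assumption for illustration): each feature adds a fixed gain.
gains = {"n_topics": 5, "n_subtopics": 3, "n_sentences": 1}
toy_score = lambda subset: sum(gains[f] for f in subset)
print(forward_selection(list(gains), toy_score))
```

Plotting the returned scores against the selection order gives a curve of the kind shown in Figure 3, from which a cut-off such as the first six features can be read.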
9. References
[1] R. Brookmeyer, E. Johnson, K. Ziegler-Graham, and H. M. Arrighi,
“Forecasting the global burden of alzheimers disease,” Alzheimer’s &
dementia, vol. 3, no. 3, pp. 186–191, 2007.
[2] K. E. Forbes, A. Venneri, and M. F. Shanks, “Distinct patterns of
spontaneous speech deterioration: an early predictor of alzheimer’s
disease.” Brain Cogn, vol. 48, no. 2-3, pp. 356–361, Mar-Apr 2002.
[3] J. Reilly, J. Troche, and M. Grossman, “Language processing in
dementia,” The handbook of Alzheimer’s disease and other dementias,
pp. 336–368, 2011.
[4] V. Taler and N. A. Phillips, “Language performance in alzheimer’s
disease and mild cognitive impairment: a comparative review.” J Clin
Exp Neuropsychol, vol. 30, no. 5, pp. 501–556, Jul 2008.
[5] G. Oppenheim, “The earliest signs of alzheimer’s disease.” J
Geriatr Psychiatry Neurol, vol. 7, no. 2, pp. 116–120, Apr-Jun 1994.
[6] D. Kempler, “Language changes in dementia of the alzheimer type,”
Dementia and communication, pp. 98–114, 1995.
[7] S. H. Ferris and M. R. Farlow, “Language impairment in alzheimers
disease and benefits of acetylcholinesterase inhibitors,” in Clinical
interventions in aging, 2013.
[8] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, “Linguistic features
identify alzheimer’s disease in narrative speech.” J Alzheimers Dis,
vol. 49, no. 2, pp. 407–422, 2016.
[9] S. O. Orimaye, J. S.-M. Wong, and K. J. Golden, “Learning
predictive linguistic features for alzheimer’s disease and related
dementias using verbal utterances,” in Proceedings of the 1st
Workshop on Computational Linguistics and Clinical Psychology
(CLPsych). sn, 2014, pp. 78–87.
[10] M. Yancheva, K. C. Fraser, and F. Rudzicz, “Using linguistic
features longitudinally to predict clinical scores for alzheimer’s
disease and related dementias,” in Proceedings of SLPAT 2015: 6th
Workshop on Speech and Language Processing for Assistive
Technologies, 2015, pp. 134–139.
[11] M. Yancheva and F. Rudzicz, “Vector-space topic models for
detecting alzheimer’s disease.” in Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), 2016, pp. 2337–2346.
[12] K. C. Fraser and G. Hirst, “Detecting semantic changes in
alzheimer’s disease with vector space models,” in Proceedings of LREC
2016 Workshop. Resources and Processing of Linguistic and
Extra-Linguistic Data from People with Various Forms of
Cognitive/Psychiatric Impairments (RaPID-2016), Monday 23rd of May
2016, no. 128. Linköping University Electronic Press, Linköpings
universitet, 2016, pp. 1–8.
[13] M. Mentis and C. A. Prutting, “Analysis of topic as illustrated
in a head-injured and a normal adult,” Journal of Speech, Language,
and Hearing Research, vol. 34, no. 3, pp. 583–595, 1991.
[14] M. Brady, C. Mackenzie, and L. Armstrong, “Topic use following
right hemisphere brain damage during three semi-structured
conversational discourse samples,” Aphasiology, vol. 17, no. 9, pp.
881–904, 2003.
[15] C. Mackenzie, M. Brady, J. Norrie, and N. Poedjianto, “Picture
description in neurologically normal adults: Concepts and topic
coherence,” Aphasiology, vol. 21, no. 3-4, pp. 340–354, 2007.
[16] F. Miranda, “Influência da escolaridade na dimensão
macrolinguística do discurso,” Master’s thesis, Universidade Católica
Portuguesa, Instituto de ciências da saúde, 2015.
[17] L. Santos, E. A. Corrêa Júnior, O. Oliveira Jr, D. Amancio, L.
Mansur, and S. Aluísio, “Enriching complex networks with word
embeddings for detecting mild cognitive impairment from speech
transcripts,” in Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 2017, pp. 1284–1296.
[Online]. Available: http://www.aclweb.org/anthology/P17-1118
[18] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L.
McGonigle, “The natural history of alzheimer’s disease: description
of study cohort and accuracy of diagnosis,” Archives of Neurology,
vol. 51, no. 6, pp. 585–594, 1994.
[19] B. MacWhinney, S. Bird, C. Cieri, and C. Martell, “Talkbank:
Building an open unified multimodal database of communicative
interaction,” 4th International Conference on Language Resources and
Evaluation, pp. 525–528, 2004.
[20] B. MacWhinney, “The childes project: Tools for analyzing talk,
3rd edition.” Lawrence Erlbaum Associates, Mahwah, New Jersey, 2000.
[21] H. Goodglass, E. Kaplan, and B. Barresi, The Boston Diagnostic
Aphasia Examination, Baltimore: Lippincott, Williams & Wilkins, 2001.
[22] H. Ulatowska and S. Chapman, Discourse Analysis and
Applications. Studies In Adult Clinical Populations. Hillsdale:
Lawrence Erlbaum Associates, 1994, ch. Discourse macrostructure in
aphasia, pp. 29–46.
[23] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in
Proceedings of the 41st Annual Meeting on Association for
Computational Linguistics-Volume 1. Association for Computational
Linguistics, 2003, pp. 423–430.
[24] S. Feng, R. Banerjee, and Y. Choi, “Characterizing stylistic
elements in syntactic structure,” in Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning. Association for
Computational Linguistics, 2012, pp. 1522–1533.
[25] S. Boytcheva, P. Dobrev, and G. Angelova, “Cgextract: Towards
extraction of conceptual graphs from controlled english.”
Contributions to ICCS-2001, 9th International Conference of
Conceptual Structures, 2001.
[26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard,
and D. McClosky, “The Stanford CoreNLP natural language processing
toolkit,” in Association for Computational Linguistics (ACL) System
Demonstrations, 2014, pp. 55–60. [Online]. Available:
http://www.aclweb.org/anthology/P/P14/P14-5010
[27] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin,
“Advances in pre-training distributed word representations,” in
Proceedings of the International Conference on Language Resources and
Evaluation (LREC 2018), 2018.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
286 10.21437/IberSPEECH.2018-60
English usually offers the largest corpora and bilingual
dictionaries, they used the English embeddings to serve as the shared
embedding space. Artetxe et al. [12] built a generic framework that
generalizes previous work on cross-linguistic embeddings, and they
concluded that the best systems were the ones with an orthogonality
constraint and a global pre-processing with length normalization and
dimension-wise mean centering. Smith et al. [7] also proved that
translation matrices should be orthogonal, for which they applied
singular value decomposition (SVD) on the transformation matrices.
Besides, they also introduced a novel “inverted softmax” method for
identifying translation pairs. All the works listed above applied
supervised learning. However, in 2017 Conneau et al. [8] introduced
an unsupervised way of aligning monolingual word embedding spaces
between two languages without using any parallel corpora. This
unsupervised procedure holds the current state-of-the-art results on
Dinu’s benchmark word translation task. For a comparison of the
different results see Table 1 and Table 2.

3. Proposed method

We propose a method that learns linear mappings between word
translation pairs in the form of translation matrices that map
pre-trained word embeddings into a universal vector space. During
training, the cosine similarity of word translation pairs, calculated
in the universal space, is maximized. The method is applicable to any
number of languages. Since exactly one translation matrix is learned
for each language, independent of the number of languages applied
during training, the number of learned parameters remains linear in
the number of languages when new languages are introduced.

Let $L$ be a set of languages, and $TP$ a set of translation pairs
where each entry is a tuple of the form $(w_1, w_2)$, where $w_1$ is
a word in language $L_1$ and $w_2$ is a word in language $L_2$, and
both $L_1$ and $L_2$ are in $L$. Then, consider the following
objective to optimize:

$$\frac{1}{|TP|} \cdot \sum_{L_1, L_2 \in L} \; \sum_{(w_1, w_2) \in TP} \mathrm{cos\_sim}(w_1 \cdot T_1,\; w_2 \cdot T_2) \qquad (2)$$

where $T_1$ and $T_2$ are translation matrices mapping $L_1$ and
$L_2$ to the universal space. Since the objective is normalized by
the number of translation pairs in the $TP$ set, the optimal value of
this function is 1. Off-the-shelf optimizers are programmed to find
local minimum values, so during the training process the loss
function is multiplied by −1. Word vectors are always normalized, so
the cos_sim reduces to a simple dot product.

At test time, first, both source and target language words are mapped
into the universal space, and a look-up space is defined from the
200k most frequent mapped target language words. Then, the system is
evaluated with the Precision metric, more specifically with Precision
@1, @5, and @10, where Precision @N denotes the percentage of cases
in which the real translation of a source word is found among the N
closest word vectors in the look-up space. The distance assigned to
the word vectors when searching in the look-up space is the cos_sim.

Previous works, such as Mikolov et al. [7] or Conneau et al. [8],
suggested restricting the transformation matrix to an orthogonal one.
From an arbitrary transformation matrix $T$ an orthogonal $T'$ can be
obtained by applying the SVD procedure. Our experiments showed that
by applying SVD on the transformation matrices the learning is
significantly faster. Best results were obtained when applying the
SVD only once, at the beginning of the learning process.

4. Experiments

4.1. Experimental setup

4.1.1. Pre-trained word embeddings

For pre-trained word embeddings we took the fastText embeddings
published by Conneau et al. [8]. These embeddings were trained by
applying their novel method where words are represented as a bag of
character n-grams [13]. This model outperformed Mikolov’s [14] CBOW
and skipgram baseline systems, which did not take any sub-word
information into account. Conneau’s pre-trained word vectors trained
on Wikipedia are available for 294 languages¹.

Some experiments were also run using the same embedding that was used
by Dinu et al. [1] in their experiments. These word vectors were
trained with word2vec and then the 200k most common words in both the
English and Italian corpora were extracted. The English word vectors
were trained on the WackyPedia/ukWaC and BNC corpora, while the
Italian word vectors were trained on the WackyPedia/itWaC corpus.
This word embedding will be referred to as the WaCky embedding.

4.1.2. Gold dictionaries

First, we ran the experiments on Dinu’s English-Italian benchmark
data [1]. It is an English-Italian gold dictionary split into a train
and a test set, which was built from Europarl en-it² [15]. For the
test set they used 1,500 English words split into 5 frequency bins,
300 randomly chosen in each bin. The bins are defined in terms of
rank in the frequency-sorted lexicon: [1-5K], [5K-20K], [20K-50K],
[50K-100K], and [100K-200K]. Some of these 1500 English words have
multiple Italian translations in the Europarl dictionary, so the
resulting test set contains 1869 word pairs altogether, with 1500
different English and 1849 different Italian words. For the training
set the top 5k entries were extracted and care was taken to avoid any
overlap with test elements on the English side. On the Italian side,
however, an overlap of 113 words is still present. In the end the
train set contains 5k word pairs with 3442 different English and 4549
different Italian words.

Then, we built another gold dictionary similar to Dinu’s, but this
time the translation pairs were extracted from the PanLex [2]
database. PanLex is a nonprofit organization that aims to build a
multilingual lexical database from available dictionaries made by
domain experts in all languages. A confidence value is assigned to
each translation pair, which can be used for filtering the extracted
data. These confidence values are in the range [1, 9], with 9 meaning
high and 1 meaning low confidence. During the extraction process,
translations with a confidence value below 7 and those for which no
word vector was found in the fastText embedding were dropped. Then,
training and test sets were constructed following Dinu’s steps,
except that only those English words were taken for which only one
Italian translation was present. Experiments showed that otherwise
serious noise was brought into the system, since in many cases one
English word might have up to 10 different Italian translations.

¹ http://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
² http://opus.lingfil.uu.se/
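As an illustration of Equation (2) and of the Precision @N evaluation described in Section 3, the following minimal sketch uses plain Python lists. The function names, the tuple layout of the translation pairs and the toy vectors are assumptions made for the example; it does not reproduce the authors' training code, and the SVD orthogonalization step is omitted.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
Pair = Tuple[str, Vector, str, Vector]  # (lang1, w1, lang2, w2)


def map_to_universal(w: Vector, T: Matrix) -> List[float]:
    """Row vector times translation matrix: w · T."""
    return [sum(w[i] * T[i][j] for i in range(len(w))) for j in range(len(T[0]))]


def cos_sim(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def objective(pairs: Sequence[Pair], matrices: Dict[str, Matrix]) -> float:
    """Equation (2): mean cosine similarity of the pairs in the universal space."""
    total = 0.0
    for lang1, w1, lang2, w2 in pairs:
        total += cos_sim(
            map_to_universal(w1, matrices[lang1]),
            map_to_universal(w2, matrices[lang2]),
        )
    return total / len(pairs)


def loss(pairs: Sequence[Pair], matrices: Dict[str, Matrix]) -> float:
    """Negated objective, so that a minimizer maximizes the similarity."""
    return -objective(pairs, matrices)


def precision_at_n(
    queries: Sequence[Tuple[Vector, Set[str]]], lookup: Dict[str, Vector], n: int
) -> float:
    """Fraction of queries with a gold translation among the n nearest words.

    Query and look-up vectors are assumed to be already mapped into the
    universal space.
    """
    hits = 0
    for vec, gold in queries:
        ranked = sorted(lookup, key=lambda w: cos_sim(vec, lookup[w]), reverse=True)
        if gold & set(ranked[:n]):
            hits += 1
    return hits / len(queries)


# Toy 2-dimensional example with identity translation matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
pairs = [("en", [3.0, 4.0], "it", [3.0, 4.0]), ("en", [1.0, 0.0], "it", [1.0, 0.0])]
print(objective(pairs, {"en": identity, "it": identity}))
```

Allowing a set of gold translations per query reflects the fact that some English words in Dinu's test set have several Italian translations.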
4.2. Results

4.2.1. Parameter adjustment using Dinu’s data

First, parameter adjustment was performed using Dinu’s data, which gave 0.1 as the best learning rate and 64 as the best batch size, where the batch size is the number of translation pairs used in one iteration. With SVD applied only once at the beginning, the results of our best system are significantly worse than state-of-the-art results on this benchmark data, but they are comparable with or even better than some of the previous models discussed in Section 2. For comparison see Table 1 and Table 2.

Eng-Ita @1 @5 @10
Mikolov et al. 0.338 0.483 0.539
Faruqui et al. 0.361 0.527 0.581
Dinu et al. 0.385 0.564 0.639
Smith et al. (2017) 0.431 0.607 0.651
Conneau et al. (2017) - WaCky 0.451 0.607 0.651
Conneau et al. (2017) - fastText 0.662 0.804 0.834
Proposed method - fastText 0.377 0.565 0.625
Proposed method - WaCky 0.220 0.333 0.373
Table 1. Comparing English-Italian results on Dinu’s data.

Ita-Eng @1 @5 @10
Mikolov et al. 0.249 0.410 0.474
Faruqui et al. 0.310 0.499 0.570
Dinu et al. 0.246 0.454 0.541
Smith et al. (2017) 0.380 0.585 0.636
Conneau et al. (2017) - WaCky 0.383 0.578 0.628
Conneau et al. (2017) - fastText 0.587 0.765 0.809
Proposed method - fastText 0.310 0.502 0.547
Proposed method - WaCky 0.103 0.163 0.190
Table 2. Comparing Italian-English results on Dinu’s data.

4.2.2. Experiments with the PanLex data

Using the PanLex database, some experiments were made with different training set sizes. 3k training examples proved to be the best, as Table 3 shows.

eng-ita ita-eng
Prec. @1 @5 @10 @1 @5 @10
1k 0.1500 0.2847 0.3340 0.1391 0.2761 0.3256
3k 0.2127 0.3473 0.3933 0.2232 0.3650 0.4152
5k 0.1980 0.3193 0.3620 0.2212 0.3555 0.4030
10k 0.1613 0.2807 0.3227 0.1879 0.3012 0.3372
Table 3. Experiments with different training set sizes

4.2.3. Comparison of systems trained on Dinu’s and PanLex data

In the next step, some experiments were made to determine which data is more apt for learning linear mappings between embeddings. In order to compare all the experiments objectively, subsets of the original test sets were created. These subsets do not contain any English word present either in the Dinu training set or in the PanLex training set. Table 4 summarizes the number of word pairs in the old and the new test sets. It should be noted that this reduction principally affects the most common English words, and therefore worse scores are expected compared to the previous train-on-Dinu-test-on-Dinu or train-on-PanLex-test-on-PanLex top results. Scores on Dinu’s test set are shown in Table 5 and on the PanLex data in Table 6. The obtained results show that training on the PanLex data cannot beat the system trained on Dinu’s data, which performs better both on Dinu’s and on the PanLex test sets. Not even combining the two training sets succeeds in achieving significantly better results, although on the PanLex test set it does improve the scores in the Italian-English direction.

test set No. word pairs in old No. word pairs in new
Dinu 1869 1455
PanLex 1500 1242
Table 4. Word reduction of the new test sets

4.2.4. Continuing the training with PanLex data

Another experiment was conducted to continue the baseline system trained on Dinu’s data with the PanLex data. In other words, it is the same as initializing the translation matrices of the PanLex training process with previously learned ones. The baseline system reaches its best performance between 2000 and 4000 epochs, depending on which precision value is regarded. Table 7 shows that on the English-Italian task there is no improvement at all, while on the Italian-English task with the best setting slightly better scores are achieved on precision @1 and @10 values.

4.2.5. Experiments using three languages

Finally, a multilingual experiment was carried out where the system was trained on three languages - English, Italian, and Spanish - at the same time. During training the system learns three different translation matrices, one for English-universal, one for Italian-universal, and one for Spanish-universal space mapping. For example, in order to learn the English-universal translation matrix, both the English-Italian and the English-Spanish dictionaries are used, according to Equation (2). Batches are homogeneous, but two consecutive batches always differ in the language origins of the contained data. That is, first an English-Italian batch is fed to the system, then an English-Spanish batch, after that an Italian-Spanish batch, and so on. First, bilingual models were trained in order to compare them later with the multilingual system. The results of the bilingual models are summarized in Table 8. Results are best on the Italian-Spanish task. Next, the system was trained using all three languages at the same time. During the training process the model was evaluated on the bilingual test datasets, the results of which are shown in Table 9. The obtained results show that no advantage was achieved by extending the number of languages, since the multilingual model performs worse than any of the pairwise bilingual models.

5. Conclusions and future work

This paper proposes a novel method for finding linear mappings between word embeddings in different languages. As a proof of concept a framework was developed which enabled basic parameter adjustments and flexible configuration for initial experimentation.
eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
train:Dinu - test:old 0.3770 0.5647 0.6245 0.3103 0.5018 0.5474
train:Dinu - test:new 0.3560 0.5407 0.5978 0.2917 0.4792 0.5215
train:PanLex - test:new 0.1360 0.2309 0.2594 0.1361 0.2556 0.2965
train:Dinu+PanLex - test:new 0.2930 0.4349 0.4861 0.2910 0.4556 0.5090
Table 5. Comparing Dinu’s and PanLex data on Dinu’s test set
eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
train:PanLex - test:old 0.1960 0.3087 0.3440 0.1838 0.3059 0.3443
train:PanLex - test:new 0.1812 0.2858 0.3196 0.1668 0.2835 0.3213
train:Dinu - test:new 0.2295 0.4171 0.4839 0.2227 0.3763 0.4199
train:Dinu+PanLex - test:new 0.2295 0.3712 0.4275 0.2498 0.4026 0.4495
Table 6. Comparing Dinu’s and PanLex data on the PanLex test set
eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
original 0.3770 0.5647 0.6245 0.3103 0.5018 0.5474
cont from 2000 0.3426 0.5256 0.5802 0.3229 0.4882 0.5535
cont from 3000 0.3535 0.5416 0.5970 0.3229 0.4840 0.5465
cont from 4000 0.3510 0.5273 0.5911 0.3118 0.4701 0.5243
Table 7. Continuing the baseline system with the PanLex data.
L1-L2 L2-L1
Precision @1 @5 @10 @1 @5 @10
eng-ita 0.2080 0.3280 0.3687 0.2082 0.3386 0.3904
eng-spa 0.2840 0.4320 0.4800 0.2883 0.4331 0.4836
spa-ita 0.3920 0.5340 0.5813 0.3655 0.5291 0.5750
Table 8. Results of bilingual models trained pairwise on the three different languages.
L1-L2 L2-L1
Precision @1 @5 @10 @1 @5 @10
eng-ita 0.1573 0.2667 0.3127 0.1638 0.2942 0.3386
eng-spa 0.1947 0.2973 0.3447 0.2350 0.3538 0.4064
spa-ita 0.2520 0.3640 0.4160 0.2568 0.3723 0.4162
Table 9. Bilingual results of the multilingual model trained using three different languages at the same time.
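The precision @1, @5 and @10 scores reported in the tables above can be illustrated with a short sketch. It assumes that candidate translations are ranked by cosine similarity between the mapped source vector and the target-language vectors, which is a common choice but is not confirmed by this section; the function names `cosine` and `precision_at_k` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_k(mapped_src, tgt_vocab, gold, ks=(1, 5, 10)):
    """Fraction of source words with a correct translation among the top k.

    mapped_src: source vectors already mapped into the target space
    tgt_vocab:  dict target word -> vector
    gold:       one set of acceptable target words per source vector
    """
    hits = {k: 0 for k in ks}
    for vec, correct in zip(mapped_src, gold):
        # Rank all target words by similarity to the mapped source vector.
        ranked = sorted(tgt_vocab, key=lambda w: cosine(vec, tgt_vocab[w]),
                        reverse=True)
        for k in ks:
            if correct & set(ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / len(gold) for k in ks}


# Toy data (assumed for illustration only).
tgt_vocab = {"cane": [1.0, 0.0], "gatto": [0.0, 1.0], "casa": [1.0, 1.0]}
mapped_src = [[0.9, 0.1], [0.6, 0.8]]
gold = [{"cane"}, {"gatto"}]

scores = precision_at_k(mapped_src, tgt_vocab, gold, ks=(1, 2))
print(scores)
```

In this toy case the first word is translated correctly at rank 1 and the second only at rank 2, so precision @1 is lower than precision @2, mirroring the monotone growth from @1 to @10 seen in every table.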
An interesting finding was that the system learned much faster when an initial SVD was applied on the translation matrices. Results obtained with these settings on Dinu’s data showed that the proposed model did learn from the data. Although the obtained precision scores were far from current state-of-the-art results on this benchmark data, they were comparable with results of previous attempts. The proposed model performed much better using the fastText embeddings [8] than using Dinu’s WaCky embeddings [1].

Thereafter, an English-Italian dataset was extracted from the PanLex database, from which training and test datasets were constructed roughly following the same steps that Dinu et al. [1] took. The system was evaluated on both Dinu’s and the PanLex test sets, and in both cases the matrices trained on Dinu’s data were the ones reaching higher scores. On the PanLex data, experiments with different training set sizes were executed, out of which the 3K training set gave the best results. Continuing the training of the matrices obtained by using Dinu’s data with the PanLex dataset brought a slight improvement on the Italian-English scores, but English-Italian scores only got worse.

Finally, the system was trained on three different languages at the same time. The obtained pairwise precision values proved to be worse than the results obtained when the system was trained in bilingual mode. However, these results are still promising considering that a completely new approach was implemented, and they showed that the system definitely learned from data which is available for a wide range of languages.

The approach is quite promising, but in order to reach state-of-the-art performance the system has to deal with some mathematical issues, for example dimension reduction in the universal space. Further experimentation in multilingual mode with an extended number of languages could also provide meaningful outputs. By involving expert linguistic knowledge, various sets of languages could be constructed using either only very close languages or, on the contrary, very distant languages. Thanks to the PanLex database, bilingual dictionaries can easily be extracted, which can then be directly used for multilingual experiments.
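The multilingual training schedule described in Section 4.2.5 (homogeneous batches, with the language pair changing from one batch to the next) can be sketched as follows. This is only an illustration under stated assumptions: the function name `round_robin_batches`, the toy dictionaries, and the choice to wrap around when a dictionary is exhausted are all hypothetical and not taken from the paper.

```python
from itertools import cycle, islice


def round_robin_batches(dictionaries, batch_size):
    """Yield (language_pair, batch) forever, one language pair per batch.

    dictionaries: dict mapping a language pair to its list of translation
    pairs. Each batch is homogeneous (a single language pair), and
    consecutive batches come from different language pairs.
    """
    # Assumption: each dictionary is read cyclically, so short dictionaries
    # simply wrap around instead of ending the schedule.
    iterators = {pair: cycle(entries) for pair, entries in dictionaries.items()}
    for pair in cycle(dictionaries):
        yield pair, list(islice(iterators[pair], batch_size))


# Toy dictionaries (assumed for illustration only).
dictionaries = {
    ("eng", "ita"): [("dog", "cane"), ("cat", "gatto"), ("house", "casa")],
    ("eng", "spa"): [("dog", "perro"), ("cat", "gato")],
    ("ita", "spa"): [("cane", "perro"), ("gatto", "gato")],
}

schedule = list(islice(round_robin_batches(dictionaries, 2), 4))
for pair, batch in schedule:
    print(pair, batch)
```

With three language pairs, the fourth batch returns to English-Italian, matching the order English-Italian, English-Spanish, Italian-Spanish, "and so on" given in the text.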
6. Acknowledgements
This work is a collaboration of the Universitat Politècnica de
València (UPV) and the Budapest University of Technology and
Economics (BUTE). Work partially supported by the Spanish
MINECO and FEDER funds under project TIN2017-85854-C4-2-R.
7. References
[1] G. Dinu, A. Lazaridou, and M. Baroni, “Improving zero-shot
learning by mitigating the hubness problem,” arXiv preprint
arXiv:1412.6568, 2014.
[2] D. Kamholz, J. Pool, and S. M. Colowick, “Panlex: Building a
resource for panlingual lexical translation.” in LREC, 2014, pp.
3145–3150.
[3] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similari-
ties among languages for machine translation,” arXiv preprint
arXiv:1309.4168, 2013.
[4] M. Faruqui and C. Dyer, “Improving vector space word represen-
tations using multilingual correlation,” in Proceedings of the 14th
Conference of the European Chapter of the Association for Com-
putational Linguistics, 2014, pp. 462–471.
[5] H. Youn, L. Sutton, E. Smith, C. Moore, J. F. Wilkins, I. Mad-
dieson, W. Croft, and T. Bhattacharya, “On the universal struc-
ture of human lexical semantics,” Proceedings of the National
Academy of Sciences, vol. 113, no. 7, pp. 1766–1771, 2016.
[6] S. Ruder, I. Vulić, and A. Søgaard, “A survey of cross-lingual
word embedding models,” arXiv preprint arXiv:1706.04902,
2017.
[7] S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla, “Offline
bilingual word vectors, orthogonal transformations and the
inverted softmax,” arXiv preprint arXiv:1702.03859 (published at
ICLR 2017), 2017.
[8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and
H. Jégou, “Word translation without parallel data,” arXiv preprint
arXiv:1710.04087, 2017.
[9] C. Xing, D. Wang, C. Liu, and Y. Lin, “Normalized word embed-
ding and orthogonal transform for bilingual word translation,” in
Proceedings of the 2015 Conference of the North American Chap-
ter of the Association for Computational Linguistics: Human Lan-
guage Technologies, 2015, pp. 1006–1011.
[10] A. Lazaridou, G. Dinu, and M. Baroni, “Hubness and pollution:
Delving into cross-space mapping for zero-shot learning,” in Pro-
ceedings of the 53rd Annual Meeting of the Association for Com-
putational Linguistics and the 7th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers), vol. 1,
2015, pp. 270–280.
[11] W. Ammar, G. Mulcaire, Y. Tsvetkov, G. Lample, C. Dyer, and
N. A. Smith, “Massively multilingual word embeddings,” arXiv
preprint arXiv:1602.01925, 2016.
[12] M. Artetxe, G. Labaka, and E. Agirre, “Learning principled bilin-
gual mappings of word embeddings while preserving monolingual
invariance,” in Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, 2016, pp. 2289–2294.
[13] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enrich-
ing word vectors with subword information,” arXiv preprint
arXiv:1607.04606, 2016.
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient esti-
mation of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
[15] J. Tiedemann, “Parallel data, tools and interfaces in opus.” in
LREC, 2012, pp. 2214–2218.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
DOI: 10.21437/IberSPEECH.2018-61
2. General description

TransDic is a multiplatform tool that can be used in Linux, Mac OS or Windows. It is a command-line tool, which requires a set of arguments to be specified for its execution (for example, the language or dialect, or the phonetic alphabet for the transcription). It can therefore be used both as an independent tool and integrated in other tools or procedures.

The TransDic internal structure includes three levels, as illustrated in Figure 1:
• The tool itself, TransDic.
• The processing core, shared by other applications (such as TransText, for the phonetic transcription of texts, or TexAFon, a full text-processing tool for text-to-speech applications), which includes the letter-to-sound module and other text processing modules, some of them (the text processing module, for example) also used by TransDic.
• A set of language/dialect modules, including language or dialect dependent dictionaries and rules. Every dialect is thus considered an ‘autonomous’ language, with its own rules and dictionaries.

North (es_exn), Extremadura South (es_exs), Canarias (es_can), Castilla-La Mancha (es_clm), Madrid (es_mad) and Murcia (es_mur)

Other arguments also allow the user to specify the phonetic alphabet to be used (IPA or SAMPA), whether the transcription should include pronunciation variants and syllable marks, and the format of the output dictionary.

It is important to note that the pronunciation variants to be considered in transcription are not specified directly using arguments, as in Saga, for example. In this case the selection of a given dialect determines the phonetic phenomena and pronunciation variants that will be taken into account.

The output of TransDic is another UTF-8 text file containing the phonetised dictionary, which can be generated in two different formats: default (Figure 2) or HTK (Figure 3).

boxejar  b o k . s e d . ʒ ˈa  b o k . s e j . ʒ ˈa  b o k . s e d . ʒ ˈa ɾ  b o k . s e j . ʒ ˈa ɾ
boxers  b o k . s ˈe s  b o k . s ˈe ɾ s
branca  b ɾ ˈa n . k a
braçalet  b ɾ a . s a . l ˈɛ t
brioix  b ɾ i . ˈɔ j s
brisa  b ɾ ˈi . z a

estas [estas] ˈe s t a s
sea [sea] s ˈe a
tenía [tenía] t e n "i a
nunca [nunca] n "u n k a
poder [poder] p o D ˈe r
aquí [aquí] a k "i
3.1. Phonetic characterization of dialects

Existing literature on dialectological studies of Spanish spoken in Spain ([19,20,21,22], among many others) and Catalan ([23], a recent work also among many others) usually describes dialectal pronunciations in a very detailed way, paying attention to both general and local phenomena. For this work, however, only those phenomena which are general enough in the geographical area of the dialect were considered. Within these general phenomena, a distinction was made between ‘primary’ (the most frequent in their corresponding geographical area) and ‘secondary’ (not as frequent as primary ones, but general enough to be taken into account in a general description of the dialect in question), in order to keep the number of pronunciation variants within a reasonable number. Only geographical variants were considered, not social, stylistic or individual ones.

The goal of this phase was then to define a set of ‘primary’ and ‘secondary’ pronunciations for each considered dialect, and to establish the relations between them (that is, which secondary pronunciations are alternative pronunciations to the primary ones). This task was carried out through an extensive literature review for both languages. It was not an easy task at all from a linguistic point of view, as the information provided in the literature is frequently incomplete, with limited information about the frequency or extension of the described phenomena.

The detailed description of all the defined dialect sets is out of the scope of this paper, so only one is presented here in some detail, the one for Canarias Spanish, based on the description provided mainly in [19,21,24,25]. Table 1 presents the subset of primary pronunciations which are different from standard Spanish in this dialect, the subset of secondary pronunciations and the relation between both subsets (that is, which secondary realisations are pronunciation variants of the primary ones). As can be observed in this table, the relation between primary and secondary pronunciations can follow different patterns:
• One primary pronunciation has no secondary pronunciations associated, which means that no alternative realisations are considered. This is the case, for example, of the seseo.
• One primary pronunciation has one (or more) secondary pronunciation(s) associated. This means that all pronunciations, primary and secondary ones, are possible in the same context, although the primary one is considered as more frequent than the secondary one(s). For example, elision of syllable-final orthographical <s> (secondary pronunciation) is a possible alternative realisation to the pronunciation as [h], considered primary (that is, most frequent) in Canarias Spanish.
• One secondary pronunciation has no primary pronunciation associated. This means that it is an alternative to a standard pronunciation, which is also a primary pronunciation in that dialect. This is the case of yeísmo, the realisation of orthographical <ll> as [j], which according to the literature has been considered as less frequent than the realisation as [ʎ], the default pronunciation in standard Spanish.

These lists of phonetic phenomena were used for the implementation phase, explained in the following subsection.

Table 1: Primary and secondary phenomena considered for Canarias Spanish.
Primary | Secondary
Pronunciation of orthographical <c,z> as [s] (seseo): <caza> [ˈkasa] | -
Pronunciation of orthographical <s> at the end of a syllable as [h] (aspiración): <los> [loh] | Elision of syllable-final orthographical <s>: <los> [lo]
Pronunciation of orthographical <g, j> as [h] (aspiración): <ajo> [ˈaho] | -
Elision (no pronunciation) of orthographical <d> in words ending with <-ado>: <cansado> [kanˈsao] | Pronunciation of intervocalic <d> as [ð], as in standard Spanish: <cansado> [kanˈsaðo]
- | Elision of intervocalic <d> different from those of <-ado> words: <comido> [koˈmio]
Elision of word-final <r,l>: <comer> [koˈme], <Raquel> [raˈke] | Pronunciation of word-final <r,l> as [r,l], as in standard Spanish: <comer> [koˈmeɾ], <Raquel> [raˈkel]
Elision of word-final <d>: <corred> [koˈre] | Pronunciation of word-final <d> as [d], as in standard Spanish: <corred> [koˈred]
- | Pronunciation of orthographical <ll> as [j] (yeísmo): <calle> [ˈkaje]

3.2. Implementation

The implementation of the new dialect modules was done in all cases taking as starting point the rules and dictionaries for the corresponding standard dialect and then making the necessary modifications to include the defined primary and secondary phenomena. Some changes were also made in the language-independent core of TexAFon to allow the generation of several pronunciation variants for a single input token.

To implement primary phenomena, new context-dependent Python rules were developed to replace the standard ones in those cases in which the primary pronunciation was different from the standard. Figure 4 presents an example of a rule for a primary pronunciation in Andalusian Spanish. In some cases, the inclusion of these rules also required modifying the exception dictionary, to make some entries coherent with the new rule.

The implementation of the secondary phenomena was done in a second phase and involved the modification of the primary context-dependent rules to allow the generation of secondary transcriptions for the same context. Figure 5 illustrates the result of the modification of the same rule presented in Figure 4 to include deletion of <s> as a secondary variant.

Finally, if the user has specified with the corresponding argument that transcription variants should be generated, TransDic produces all transcription variants for the input word. These transcription variants are derived from the character-by-character pronunciation variants generated by the transcription rules: the language-independent letter-to-sound
module takes the obtained pronunciations for each character and combines them to generate the word transcription variants. So, for example, for the word ‘casas’ two different transcription variants would be generated ([kˈasah], [kˈasa]) using the previously described rules, whereas in the case of ‘llover’ the output variants would be four ([ʎoβˈeɾ], [joβˈeɾ], [ʎoβˈe], [joβˈe]), as the input word includes two characters with transcription variants which are combined to create the different alternative word transcriptions.

    if ch == "s" and nch == "NIL":
        salida.append(["hh", 0, False])
        return salida

Figure 4: Python implementation of a phonetic transcription rule for the pronunciation of orthographical <s> as [h] (aspiración) at the end of a word in Andalucía Spanish (Fodge, 2014).

    if ch == "s" and nch == "NIL":
        salida.append(["hh", 0, False])
        # Update deletion of word final syllable final s
        salida.append(["", 0, False])
        return salida

Figure 5: Python implementation of a secondary pronunciation transcription rule for <s> deletion (in bold) in Andalucía Spanish (Fodge, 2014).

3.3. Evaluation

The procedure to evaluate the transcription produced by the new dialect modules was similar for Spanish and Catalan: lists of isolated words, representative of the implemented phenomena (307 words for Spanish; 99 for Catalan), were processed using each dialect module to obtain the corresponding output transcription. These output transcriptions were then revised manually to detect possible errors. Both evaluations were carried out using a different tool of the TexAFon package, but the evaluated rules were the same as those used in TransDic [17,18].

In the case of Spanish, the transcription performance was perfect: no errors were detected. Some errors were found, however, in the case of Catalan dialects. An error measure was computed in this case, using the procedure described in [26]: the sum of all phone substitutions (Sub), deletions (Del) and insertions (Ins) divided by the total number of phones in the reference transcription (N). Table 2 presents the obtained error values.

The number of generated variants was also evaluated. Phonetised dictionaries were generated with TransDic for all dialects using two reference word lists in Spanish (1,000 most frequent words in the CREA corpus [27]) and Catalan (a cleaned version of the CesCa corpus [28], 1,648 words) and the mean number of variants per entry was computed. Tables 3 and 4 present the results.

The results of these two evaluations indicate that TransDic performs reasonably well as far as both transcription quality and number of generated variants are concerned. Spanish modules provide a more accurate transcription than Catalan ones, but they tend to generate more variants per input entry than Catalan modules. In any case, the mean number of variants per entry is always below 2 in both languages.

Table 2: Error measures obtained for the Catalan dialects.
Dialect Value
Ribagorza 8
Pallarés 11
Tortosa 9
Central area 8
North Valencia 9
Central Valencia 10
South Valencia 10
Alicante Valencian 9

Table 3: Mean number of variants per entry obtained for the Spanish dialects.
Dialect Mean number of variants
Standard 1.021
Western Andalucía 1.76
Eastern Andalucía 1.354
Extremadura North 1.354
Extremadura South 1.76
Canarias 1.468
Castilla-La Mancha 1.207
Madrid 1.207
Murcia 1.713

Table 4: Mean number of variants per entry obtained for the Catalan dialects.
Dialect Mean number of variants
Standard 1
Ribagorza 1.006
Pallarés 1
Tortosa 1
Central area 1
North Valencia 1.269
Central Valencia 1.003
South Valencia 1.003
Alicante Valencia 1.003

4. Conclusions

This paper has presented TransDic, a tool for the generation of phonetised dictionaries for Catalan and Spanish. Its most innovative features are that it allows transcription in several Spanish dialects spoken in Spain (Saga [12], for example, was developed considering mainly the American Spanish dialects) and that it allows the creation of phonetic dictionaries containing a reasonable number of pronunciation variants. The knowledge-based approach used in TransDic, based on a careful selection of the phonetic phenomena considered for transcription, does not lead to an overgeneration of variants, a classical problem in this kind of approach. Finally, another interesting feature of TransDic is that it is available for free download from https://sites.google.com/site/juanmariagarrido/research/resources/tools/transdic. TransText [14], a phonetisation tool which allows the transcription of texts in the same dialects as TransDic, is also available for download from https://sites.google.com/site/juanmariagarrido/research/resources/tools/transtext.

Some expected improvements for the tool in the future are the inclusion of new types of rules for variants (for example, rules for informal pronunciations), and a deeper evaluation of the output by native speakers of each dialect.
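The way word-level variants arise from per-character variants, as described in Section 3.2 for ‘casas’ and ‘llover’, can be illustrated with a minimal sketch. It assumes that the combination is a plain Cartesian product over the pronunciations available at each position; the function name `combine_variants` and the per-position lists are hypothetical and are not TransDic's actual data structures (the rules in Figures 4 and 5 use a different representation).

```python
from itertools import product


def combine_variants(per_position_variants):
    """Build every word transcription by choosing one pronunciation
    per position (Cartesian product), keeping the positions in order."""
    return ["".join(choice) for choice in product(*per_position_variants)]


# Hypothetical per-position pronunciations; an empty string stands for elision.
casas = [["k"], ["ˈa"], ["s"], ["a"], ["h", ""]]
llover = [["ʎ", "j"], ["o"], ["β"], ["ˈe"], ["ɾ", ""]]

print(combine_variants(casas))
print(combine_variants(llover))
```

One position with two alternatives yields two transcriptions for ‘casas’, while two such positions yield 2 × 2 = 4 for ‘llover’. This multiplicative growth is why restricting the phenomena per dialect keeps the mean number of variants per entry low.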
5. References

[1] H. Al-haj, R. Hsiao, I. Lane, and A.W. Black, “Pronunciation modeling for dialectal arabic speech recognition”. IEEE Workshop on Automatic Speech Recognition & Understanding ASRU 2009, pp. 525–528, 2009.
[2] J. Zheng, Pronunciation Variation Modeling for Automatic Speech Recognition, Ph.D. Thesis, University of Colorado, Boulder, 2014.
[3] P. Taylor, “Hidden Markov Models for Grapheme to Phoneme Conversion”, Proceedings of the European Conference on Speech Communication and Technology, Lisboa, Portugal, September 2005, pp. 1973-1976, 2005.
[4] M. Bisani and H. Ney, “Investigations on joint-multigram models for grapheme-to-phoneme conversion”, Proceedings of the 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, pp. 105-108, 2002.
[5] S. Deligne, F. Yvon, and F. Bimbot, “Variable-length sequence matching for phonetic transcription using joint multigrams”, Proceedings of the Fourth European Conference on Speech Communication and Technology, EUROSPEECH 1995, Madrid, Spain, September 18-21, pp. 2243-2246, 1995.
[6] A. Laurent, P. Deleglise, and S. Meignier, “Grapheme to phoneme conversion using an smt system”, Proceedings of the 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, pp. 716-719, 2009.
[7] M. Gerosa and M. Federico, “Coping with out-of-vocabulary words: open versus huge vocabulary ASR”, Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 4313-4316, 2009.
[8] T. Holter and T. Svendsen, “Maximum likelihood modeling of pronunciation variation”, Speech Communication, vol. 29, no. 2-4, pp. 177–191, 1999.
[9] P. Bonaventura, F. Giuliani, J. M. Garrido, and I. Ortín, “Grapheme-to-Phoneme Transcription Rules for Spanish, with Application to Automatic Speech Recognition and Synthesis”, Proceedings of the Workshop ‘Partially Automated Techniques Transcribing Naturally Occurring Continuous Speech’, Université de Montréal, Montreal, Quebec, Canada, pp. 33-39, 1998.
[10] X. López, Transcriptor fonético automático del español, 2004. http://www.aucel.com/pln/transbase.html
[11] Molino de Ideas, Transcriptor fonético, 2012. http://www.fonemolabs.com/transcriptor.html.
[12] TALP-UPC, SAGA - Phonetic transcription software for all Spanish variants, 2017, https://github.com/TALP-UPC/saga.
[13] P. Pachès, C. de la Mota, M. Riera, M. P. Perea, A. Febrer, M. Estruch, J. M. Garrido, M. J. Machuca, A. Ríos, J. Llisterri, I. Esquerra, J. Hernando, J. Padrell, and C. Nadeu, “Segre: An automatic tool for grapheme-to-allophone transcription in Catalan”, Proceedings of the Workshop on Developing Language Resources for Minority Languages: Reusability and Strategic Priorities, LREC, pp. 52-61, 2000.
[14] J. M. Garrido, M. Codina, and K. Fodge. “TransText, un transcriptor fonético automático de libre distribución para español y catalán”, Actas del Workshop “Subsidia: herramientas y recursos para las ciencias del habla”, in press.
[15] J. M. Garrido, Y. Laplaza, M. Marquina, C. Schoenfelder, and S. Rustullet. “TexAFon: a multilingual text processing tool for text-to-speech applications”, Proceedings of IberSpeech 2012, Madrid, Spain, pp. 281-289, 2012.
[16] J. M. Garrido, Y. Laplaza, B. Kolz, and M. Cornudella, “TexAFon 2.0: A text processing tool for the generation of expressive speech in TTS applications”, Proceedings of LREC 2014, Ninth International Conference on Language Resources and Evaluation, Reykjavik (Iceland), pp. 3494-3500, 2014.
[17] K. Fodge, Introducing Spanish dialects in a linguistic processing module for improved ASR and novel speech synthesis capabilities, Master Thesis, Barcelona: Pompeu Fabra University, 2014.
[18] M. Codina, Automatic Phonetic Transcription of dialectal variance in Catalan. Master Thesis, Barcelona: Pompeu Fabra University, 2016.
[19] P. García Mouton, Lenguas y dialectos de España, Madrid: Arco Libros, 1994.
[20] M. Alvar (Director), Manual de dialectología hispánica. El español de España. Barcelona: Ariel, 1999.
[21] F. Moreno Fernández, La lengua española en su geografía. Madrid: Arco Libros, 2009.
[22] J. A. Samper Padilla, “Sociophonological Variation and Change in Spain”, In M. Díaz Campos (Ed.), The Handbook of Hispanic Sociolinguistics. West Sussex: Wiley-Blackwell, pp. 98–117, 2011.
[23] J. Veny and M. Massanell, Dialectologia catalana. Aproximació pràctica als parlars catalans. Barcelona: Universitat de Barcelona, 2015.
[24] M. M. Azevedo, Introducción a la lingüística española, Upper Saddle River: Prentice Hall, 2009.
[25] J. I. Hualde, A. Olarrea, and A. M. Escobar, Introducción a la lingüística hispánica. Cambridge: Cambridge University Press, 2001.
[26] C. Van Bael, L. Boves, H. Van den Heuvel, and H. Strik, “Automatic Phonetic Transcription of Large Speech Corpora”. Computer Speech & Language, vol. 21, no. 4, pp. 652-668, 2007.
[27] Real Academia Española, Banco de datos (CREA). Corpus de referencia del español actual. http://corpus.rae.es/lfrecuencias.html.
[28] A. Llauradó, M. A. Martí, and L. Tolchinsky, “Corpus CesCa: Compiling a corpus of written Catalan produced by school children”. International Journal of Corpus Linguistics, vol. 17, no. 3, pp. 428–441, 2012.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
ViVoLAB, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
Documentation and analysis of multimedia resources usually
DOI: 10.21437/IberSPEECH.2018-62
2.1. Wide Residual Networks 1D

WRNs are usually used in image tasks, where two spatial dimensions are the context to work in. In automatic speech recognition we can use spectrograms as an image to adopt this two-dimensional schema [12, 13]. However, in text processing we work only with the temporal dimension, because word relations are only meaningful along this dimension. WRN is a structure based on 2D-convolutional networks. In order to adapt this structure to our environment, we will work with 1D-convolutional layers. In the image context, convolutional layers are characterized by the number of output channels, C, but in the text context we consider channels as a temporal sequence of vectors, where each element of these vectors is the response of one of the C filters at this moment of the sequence.

A WRN consists of several blocks, which in this paper we will call Wide Residual Blocks (WRB). In each of these WRBs we increase the number of channels in the output, that is, we widen the convolutional output. These WRBs are in turn built from several Residual Blocks (RB). These blocks are the basic elements of the WRNs, and they have two paths. The first path consists of two convolutional layers with batch normalization [14] and ReLU non-linearity [15]. The second path is a residual connection between block input and output. In this work we use a sum of the block input and the convolutional path as residual connection. In the case where the number of channels of the two paths is different, we adjust the number of channels with a convolutional layer in the residual path. In Figure 2a we show the complete structure of the WRNs that we use in this work. We will characterize the WRB by its widen factor, which is the factor we apply to the number of channels at block input to gain width. In Figure 2a, we describe convolutional layers by their kernel dimension. This kernel dimension is the number of words that the convolutional filter takes to compute each element in the output. In this work the kernel dimension of the first convolution layer is 5. Convolutional layers in WRB use a kernel of dimension 3. There is a special case when we use a convolutional layer with kernel dimension of one. This kind of convolutional layer is sometimes called 1x1 convolution, position independent convolution or position wise fully connected. It is like a fully connected layer with all its weights shared through time. In other words, each element of the input is evaluated with same

pass, or also called mini-batch. Each sequence has T elements, where each element is a vector $x_{WRN,n,t}$, $n \in N$, $t \in T$, with $C_W$ the convolutional filter responses. In Figure 2b we show that each $x_{WRN,n,t}$ is evaluated through the position independent convolution layer and a Softmax non-linearity to obtain the final classification.

To train the network we use the cross-entropy as cost function J(x, y). In the case of the many-to-one structure we compute the mini-batch error as:

\[ J_{\text{many-to-one}} = \frac{1}{N} \sum_{n=0}^{N-1} J(x_n, y_n), \qquad (1) \]

where each example sequence of T elements corresponds to one example of N in the mini-batch. In our case we implement the many-to-many structure. We also use the cross-entropy error loss function J(x, y), but the mini-batch cost is computed for each of the T elements of each of the N examples of the mini-batch as:

\[ J_{\text{many-to-many}} = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{T} \sum_{t=0}^{T-1} J(x_{n,t}, y_{n,t}) \qquad (2) \]

2.2. Long-Short Term Memory Cells

In this study we will compare WRN with LSTM structures and BLSTM structures. In these cases we use the same input $X_{in}$ as described in section 2. In Figure 3 we show that the input sequences are processed by an LSTM or BLSTM, obtaining a sequence with the same length as the input. In the case of an LSTM or a BLSTM, the sequence $X_{in}$ passes through the embedding layer as in a WRN, obtaining a sequence of the same length in which each element $x_t$ is a vector of embedding dimension C, as we represent in Figure 3. This sequence is evaluated by the LSTM or the BLSTM, obtaining a sequence where each element has dimension H or 2∗H respectively, with H the hidden dimension of the LSTM or the BLSTM layers. With this structure we use the same many-to-many philosophy as in WRN, and we use the same loss function described in equation 2.

To obtain a combination of WRN and BLSTM we just substitute the position independent convolution and the Softmax in WRN by the BLSTM structure shown in Figure 3. Then we
weights along all sequence. In this layers we also describe in- will remove or add WRBs to the structure to get different results
put dimension and output dimension of each sequence element discussed in section 3
as [input dimension × output dimension].
Usually in classification we adopt the many-to-one struc-
ture where we use an input sequence to classify the central el-
3. Experiment and Results
ement of the sequence. In convolutional networks for classifi- For this experiment we have collected 1,181,413 articles from
cation, usually we use pooling and reduce operations in order electronic edition of 32 diverse Spanish magazines and news-
to classify this central element. A problem that has this archi- papers, form January 2017 up to February 2018. Each article
tecture to classify two contiguous elements, we need to process has 506 words on average. For our purposes all texts were case
two sequences that only differ in one element. This produce that lowered; numbers, dates, hours and Roman numbers were auto-
we recompute T − 1 times the same element in the sequence. In matically transcribed and normalized; some units and abbrevia-
order to reduce this massive excess of computation in long con- tions like Km, m, Kg, min, were changed by their transcription;
text sequences we will use a many-to-many paradigm. In our all symbols were deleted; and final question and exclamation
experiments we select unique very long sequences from the text marks were substituted by dots. This data set was segmented in
as the context, and we compute the convolutions to the whole train set (980,000 articles), development set (101,000 articles)
sequence only once. For this our WRNs do not use any pooling and test set (100,413 articles), randomly chosen. The dictio-
strategy, the stride is always 1 and use the appropriated padding nary for this work is composed by all words in train text which
in order to fix the edge effects in convolutional layers. There- appear at least 50 times in order to avoid misspelled and rare
fore, at the output of the WRBs XW RN in figure 2a, we have the words. This supposes that all our experiments use a dictionary
same sequence length as the input, T . In Figure 2b it is repre- with 116,737 words including especial delimiters.
sented the shape of the WRNs output sequence. N is the number In all of the architectures tested in this work we use as first
of sequence examples evaluated at the same time in a forward layer an embedding layer with input dimension the number of
Figure 2: Diagram of the Wide Residual Network architecture. The left figure (a) breaks down the different blocks in the WRN, and the right figure (b) shows the dimensions and schema of the linear classification block in the WRN.
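As an illustration of the residual block described in Section 2 and Figure 2, the following is a minimal pure-Python sketch, not the authors' PyTorch implementation. It assumes a simplified block: batch normalization is omitted, the ReLU is applied before each convolution (the exact ordering in Figure 2 is an assumption), and all function names are hypothetical. It shows the two properties the text relies on: stride 1 with zero padding keeps the sequence length T, and a kernel-1 convolution in the residual path adjusts the number of channels.

```python
import random


def conv1d_same(x, weight, bias):
    """1D convolution over time with stride 1 and zero padding.

    x:      list of T vectors, each with c_in values
    weight: weight[k][c_in][c_out], with an odd kernel size k
    bias:   list of c_out values
    """
    kernel, pad = len(weight), len(weight) // 2
    steps = len(x)
    out = []
    for t in range(steps):
        y = list(bias)
        for k in range(kernel):
            src = t + k - pad
            if 0 <= src < steps:  # positions outside the sequence act as zeros
                for ci, value in enumerate(x[src]):
                    for co in range(len(y)):
                        y[co] += value * weight[k][ci][co]
        out.append(y)
    return out


def relu(x):
    return [[max(0.0, v) for v in vec] for vec in x]


def residual_block(x, w1, b1, w2, b2, w_skip=None):
    """Two-convolution path summed with a residual connection.

    w_skip is an optional kernel-1 convolution used only when the
    number of channels of the two paths differs.
    """
    h = conv1d_same(relu(x), w1, b1)
    h = conv1d_same(relu(h), w2, b2)
    if w_skip is None:
        skip = x
    else:
        skip = conv1d_same(x, w_skip, [0.0] * len(w_skip[0][0]))
    return [[a + b for a, b in zip(hv, sv)] for hv, sv in zip(h, skip)]


def random_weight(kernel, c_in, c_out):
    return [[[random.uniform(-0.1, 0.1) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(kernel)]


random.seed(0)
# Hypothetical toy sizes: T = 6 words, 2 input channels widened to 4.
x = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(6)]
out = residual_block(
    x,
    random_weight(3, 2, 4), [0.0] * 4,
    random_weight(3, 4, 4), [0.0] * 4,
    w_skip=random_weight(1, 2, 4),
)
print(len(out), len(out[0]))  # 6 4
```

The output keeps one vector per input word, which is what allows the many-to-many classification of every position in a single forward pass.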
Figure 3: BLSTM Classification Architecture.

words in the dictionary, and 1024 as output dimension. The first convolutional layer in the structures has 512 output channels, a stride of one and a padding of two in order to cover edge effects. In the wide residual blocks we use 8 as widen factor. The dimension of each WRB is the widen factor times 16 for the first block, times 32 for the second block, times 64 for the third block and times 128 for the fourth block. The LSTM and BLSTM architectures are composed of two layers of 1024 cells. All the architectures were trained for 500 iterations where, for this experiment, one training iteration processes 3,000 random articles from the training set. All trainings were carried out by back-propagation with the Adam optimizer [16], configured with the default hyper-parameters of the PyTorch software [17].

In order to understand the behaviour of each architecture, we have studied the response of the output when we modify the input. We want to get an idea of the contribution of each word from the input to the classification at a particular output time. To do so, we have cancelled each word of the input in turn, by forcing to zero the output of the embedding layer for this word, and recorded the value of the network output after a forward pass of this modified input. We assume that the words that contribute most to a correct classification are those whose cancellation disturbs the classification most. In Figure 4 we can see the output values for the "Period" class before the Softmax non-linearity of the LSTM, BLSTM and WRN, for an output time where a "Period" should be predicted. The "Period" output is shown as each word of the input is sequentially cancelled. One of the first things we can look at is the temporal distance of the first and last cancelled words that cause a significant perturbation in the "Period" classification. The LSTM only shows perturbations for words before the moment at which we want the classification. We can see that this architecture only uses past information for classification. This behavior may lead to worse performance, because future words should be useful in the punctuation task: we cannot say that a sentence has ended if we do not know when the text changes sense, subject or action. The BLSTM architecture uses words from both the past and the future. The words that cause the largest perturbation even come from the future context. The WRN behaves like the BLSTM, but it uses fewer words for the classification, because the perturbations are closer to the classification time.

For evaluation purposes we use three measures, precision, recall and F1-measure, for each class of interest:

$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}} \qquad (3)$$

$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}} \qquad (4)$$

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

For these measures only relevant cases are taken into account, as we see in equations 3, 4 and 5, which only consider the True Positives, False Positives and False Negatives for each class of interest.

For this paper we start by considering as baselines an LSTM and a BLSTM. In Table 1 we show that BLSTM provides better results
Figure 4: Variation of the "Period" class output (vertical axis: Period class network output) caused by the cancellation of each input word (horizontal axis: canceled word in input sequence), for a true period label after the word "ambiente", marked by the vertical line. The grey shading shows the context that affects the classification. From top to bottom, the panels show the same output for the LSTM, BLSTM and WRN architectures.
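The word-cancellation analysis behind Figure 4 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: `score_fn` is a hypothetical stand-in for a trained network that maps a sequence of embedding vectors to one pre-Softmax "Period" score per position, and `causal_toy_scores` is a toy function that only mimics a model using past context. Neither comes from the paper.

```python
def cancellation_profile(score_fn, embedded, target_t):
    """Zero each word's embedding in turn and record the score at target_t.

    Returns the unmodified score and one score per cancelled word.
    """
    baseline = score_fn(embedded)[target_t]
    profile = []
    for i in range(len(embedded)):
        occluded = [
            [0.0] * len(vec) if j == i else vec
            for j, vec in enumerate(embedded)
        ]
        profile.append(score_fn(occluded)[target_t])
    return baseline, profile


def causal_toy_scores(embedded):
    """Toy stand-in for a past-only model: the score at t sums words 0..t."""
    scores, running = [], 0.0
    for vec in embedded:
        running += sum(vec)
        scores.append(running)
    return scores


# Hypothetical embeddings for a four-word input; classify after word 1.
embedded = [[1.0, 0.5], [0.25, 0.25], [0.5, 0.5], [0.75, 0.0]]
baseline, profile = cancellation_profile(causal_toy_scores, embedded, target_t=1)
print(baseline, profile)  # 2.0 [0.5, 1.5, 2.0, 2.0]
```

In this toy case, cancelling words after the target position leaves the score unchanged, which is the pattern the text reports for the LSTM. A model with access to future context would also show perturbations to the right of the target.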
Table 1: Precision, Recall and F1-measure (%) for the "Period" and "Comma" classes across the different architectures.

| Architecture | Period Precision | Period Recall | Period F1 | Comma Precision | Comma Recall | Comma F1 |
|---|---|---|---|---|---|---|
| LSTM | 73.78 | 82.18 | 77.75 | 87.74 | 81.36 | 84.43 |
| BLSTM | 87.12 | 88.96 | 88.03 | 91.28 | 89.78 | 90.52 |
| Convolutional 1WB | 84.05 | 87.47 | 85.73 | 89.76 | 86.88 | 88.30 |
| Convolutional 2WB | 86.22 | 89.11 | 87.64 | 91.04 | 88.60 | 89.80 |
| Convolutional 3WB | 89.66 | 83.75 | 86.60 | 88.01 | 92.51 | 90.20 |
| Convolutional 4WB | 73.79 | 96.22 | 83.53 | 95.92 | 72.22 | 82.40 |
| WRN 1WRB | 83.10 | 88.66 | 85.79 | 90.67 | 85.94 | 88.24 |
| WRN 2WRB | 86.20 | 88.91 | 87.53 | 91.13 | 88.89 | 90.00 |
| WRN 3WRB | 87.41 | 89.02 | 88.21 | 91.17 | 89.84 | 90.50 |
| WRN 4WRB | 76.30 | 87.79 | 81.64 | 89.23 | 78.76 | 83.67 |
| WRN 1WRB + BLSTM | 89.33 | 92.36 | 90.82 | 93.86 | 91.38 | 92.60 |
| WRN 2WRB + BLSTM | 89.28 | 90.74 | 90.00 | 92.62 | 91.43 | 92.02 |
| WRN 3WRB + BLSTM | 86.93 | 89.92 | 88.40 | 81.81 | 89.32 | 90.55 |
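The measures in equations 3, 4 and 5 can be computed per class from the true positive, false positive and false negative counts. The sketch below is a minimal illustration with hypothetical function names and hypothetical counts; the only values taken from the paper are the BLSTM "Period" precision and recall from Table 1, used to check the reported F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts (equations 3 to 5)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from_percentages(precision, recall):
    """F1 when precision and recall are already given as percentages."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one class.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)

# BLSTM "Period" row of Table 1: precision 87.12, recall 88.96.
blstm_period_f1 = round(f1_from_percentages(87.12, 88.96), 2)
print(blstm_period_f1)  # 88.03
```

True negatives never enter these formulas, which is why positions with no punctuation, the majority case, do not inflate the scores.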
than LSTM, achieving an F1 of 88.03 for the "Period" class and 90.52 for "Comma". These improvements are obtained thanks to the use of past and future context in the BLSTM.

Residual connections usually work better when the network is very deep, but they also help in shallower networks. In order to see how residual connections help the classification, we trained convolutional networks with exactly the same architecture as the WRNs but without residual connections. In Table 1 we show that, as we increase the number of blocks in convolutional networks without residual connections, they start to get worse results than the same configuration in WRNs. With only one block they perform similarly, but with three blocks the network without residual connections performs worse. We achieve the best result with a WRN using three blocks: an F1 of 88.21 for "Period" and 90.50 for "Comma", which is about the same performance as the baseline BLSTM.

In the last experiments we concatenate the output of a WRN with the input of the BLSTM, without its embedding layer. These two architectures are compatible because both are designed to work with sequences, processing element by element, so we do not need to apply any transformation to the inputs or outputs. In Table 1 we present the results of concatenating a WRN of one, two or three blocks with a BLSTM of the same architecture as the baseline. All those configurations perform better than either of the two architectures alone. The WRN of one block gives the best result, with an F1 of 90.82 for "Period" and 92.60 for "Comma".

4. Conclusions and future work

In this work, we have shown the use of Wide Residual Networks for punctuation recovery. We show that architectures that use past and future context obtain better results. WRNs perform similarly to the BLSTM in our experiments, so other aspects of the architecture should be taken into account, such as training time or resources consumed. The best results are obtained when WRNs and BLSTMs are concatenated. This suggests using WRNs as feature extractors, followed by structures more powerful than a linear classifier to perform the final classification. In problems that require sequential processing, BLSTM structures, which use past and future context, are the best option.

For future work, it would be interesting to differentiate between question marks and "periods". In Spanish this is a challenging task, since we need to predict not only the closing question mark but also the opening question mark, a symbol not used in other languages such as English. Future work will also address architecture improvements such as hyper-parameter configuration or the inclusion of attention mechanisms.

5. Acknowledgements

This work is supported by the Spanish Government and the European Union under project TIN2017-85854-C4-1-R, and by Gobierno de Aragón / FEDER (research group T36 17R). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
6. References
[1] F. Batista, H. Moniz, I. Trancoso, and N. Mamede, “Bilingual ex-
periments on automatic recovery of capitalization and punctuation
of automatic speech transcripts,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 2, pp. 474–485,
2012.
[2] O. Tilk and T. Alumäe, “Lstm for punctuation restoration in
speech transcripts,” in Sixteenth annual conference of the inter-
national speech communication association, 2015.
[3] ——, “Bidirectional recurrent neural network with attention
mechanism for punctuation restoration.” in Interspeech, 2016, pp.
3047–3051.
[4] W. Lu and H. T. Ng, “Better punctuation prediction with dynamic
conditional random fields,” in Proceedings of the 2010 conference
on empirical methods in natural language processing. Associa-
tion for Computational Linguistics, 2010, pp. 177–186.
[5] K. Xu, L. Xie, and K. Yao, “Investigating lstm for punctuation
prediction,” in Chinese Spoken Language Processing (ISCSLP),
2016 10th International Symposium on. IEEE, 2016, pp. 1–5.
[6] X. Che, C. Wang, H. Yang, and C. Meinel, “Punctuation predic-
tion for unsegmented transcript based on word vector.” in LREC,
2016.
[7] J. Yi, J. Tao, Z. Wen, and Y. Li, “Distilling knowledge from an en-
semble of models for punctuation prediction,” Proc. Interspeech
2017, pp. 2779–2783, 2017.
[8] W. Gale and S. Parthasarathy, “Experiments in character-level
neural network models for punctuation,” in Proceedings Inter-
speech, 2017, pp. 2794–2798.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 770–778.
[10] S. Zagoruyko and N. Komodakis, “Wide residual networks,”
arXiv preprint arXiv:1605.07146, 2016.
[11] H. Li, Z. Xu, G. Taylor, and T. Goldstein, “Visualizing the loss
landscape of neural nets,” arXiv preprint arXiv:1712.09913, 2017.
[12] H. K. Vydana and A. K. Vuppala, “Residual neural networks
for speech recognition,” in Signal Processing Conference (EU-
SIPCO), 2017 25th European. IEEE, 2017, pp. 543–547.
[13] L. D. Jahn Heymann and R. Haeb-Umbach, “Wide residual blstm
network with discriminative speaker adaptation for robust speech
recognition,” in Proceedings of the 4th International Workshop on
Speech Processing in Everyday Environments (CHiME16), 2016,
pp. 12–17.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” arXiv
preprint arXiv:1502.03167, 2015.
[15] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural
networks,” in Proceedings of the fourteenth international confer-
ence on artificial intelligence and statistics, 2011, pp. 315–323.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” arXiv preprint arXiv:1412.6980, 2014.
[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito,
Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differ-
entiation in pytorch,” in NIPS-W, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain
10.21437/IberSPEECH.2018-63
pus for the task. One of those approaches uses a stack of Long Short Term Memory (LSTM) units to capture long distance relations between tokens [7], while the other uses multiple parallel Convolutional Neural Networks (CNNs) with different context window sizes to capture different functional patterns [8]. We have shown that the CNN-based approach leads to better results on every level of the DIHANA corpus. However, while wider context windows are better for predicting the generic dialog acts of the top level, the task-specific bottom levels are more accurately predicted when using narrower windows. Furthermore, recently, we have shown that the performance can be improved by using a Recurrent Convolutional Neural Network (RCNN)-based segment representation approach that is able to capture long distance relations and discards the need for selecting specific window sizes for convolution [9]. Additionally, we have shown that a character-based segment representation approach achieves similar or better results than an equivalent word-based approach on the Switchboard corpus and the top level of the DIHANA corpus, and that the information captured by both approaches is complementary [15].

In [3] we have also shown that context information concerning the dialog history and the classification of the upper levels is relevant for the task. Concerning the dialog history, we have explored the use of information from up to three preceding segments in the form of their classifications. On the first two levels, similarly to what happened in previous studies on the influence of context on dialog act recognition [16, 8], we have observed that the first preceding segment is the most important and that the influence decays with distance. On the other hand, since the bottom level refers to information that is explicitly referred to in the segment, it is not influenced by information from the preceding segments, at least at the same level. Recently, we have shown that the representation of information from the preceding segments used in previous studies does not take the sequentiality of those segments into account, and that the whole dialog history can be summarized in order to capture that information as well as relations with more distant segments [9]. Additionally, in this paper, we further explore the relations between levels by using an end-to-end approach to predict the labels of the three levels in parallel and capture those relations implicitly.

3. Experimental Setup

This section presents our experimental setup, starting with a description of the corpus, followed by an overview of the aspects addressed by our experiments and the network architecture used, and a description of the training and evaluation approaches.

3.1. Dataset

The DIHANA corpus [2] consists of 900 dialogs between 225 human speakers and a Wizard of Oz telephonic train information system. There are 6,280 user turns and 9,133 system turns, with a vocabulary size of 823 words. The turns were manually transcribed, segmented, and annotated with dialog acts [17]. The total number of annotated segments is 23,547, with 9,715 corresponding to user segments and 13,832 to system segments.

The dialog act annotations are hierarchically decomposed in three levels [18]. The top level, Level 1, represents the generic intention of the segment, while the others refer to task-specific information. There are 11 Level 1 labels, out of which two are exclusive to user segments and four to system segments. Overall, the most common label is Question, covering 27% of the segments, followed by the Answer and Confirmation labels, covering 18% and 15%, respectively. This is consistent with the information-transfer nature of the dialogs.

Although they share most labels, the two task-specific levels focus on different information. While Level 2 is related to the information that is implicitly focused in the segment, Level 3 is related to the kind of information that is explicitly referred to in the segment. There are 10 labels common to both levels and three additional ones on Level 3. The most common Level 2 labels are Departure Time, Fare, and Day, which are present in 32%, 14%, and 8% of the segments, respectively. On the other hand, the Level 3 label distribution is more balanced, with the most common labels, Destination, Day, and Origin, being present in 16%, 16%, and 13% of the segments, respectively.

While a segment has a single Level 1 label, it may have multiple or no labels in the other levels. In this sense, only 63% of the segments have Level 2 labels, and that percentage is even lower, 52%, when considering Level 3 labels. This is mainly due to the fact that Level 1 labels concerning dialog structuring or communication problems cannot be paired with any labels in the remaining levels.

3.2. End-to-End Neural Network Architecture

In our experiments, we incrementally built the architecture of our network by assessing the performance of different approaches for each step. However, due to space constraints, we are not able to show individual figures for all of those approaches. Thus, in Figure 1 we show the architecture of the final network and use it to refer to the alternatives we explored.

At the top are our two complementary segment representation approaches. On the left is the word-based approach, which captures information concerning both word sequences and functional patterns using the adaptation of the RCNN by Lai et al. [19] which we introduced in [9]. In our adaptation we replaced the simple Recurrent Neural Networks (RNNs) used to capture the context surrounding each token by Gated Recurrent Units (GRUs), in order to capture relations with more distant tokens. To represent each word, in our experiments we used 200-dimensional Word2Vec [20] embeddings trained on the Spanish Billion Word Corpus [21]. On the right is the character-based approach we introduced in [15], which uses three parallel CNNs with different window sizes to capture relevant patterns concerning affixes, lemmas, and inter-word relations. We performed experiments using each approach individually, as well as their combination, which is depicted in Figure 1.

The representation of the segment can then be combined with context information concerning the dialog history and speaker changes. We provide the latter in the form of a flag stating whether the speaker changed in relation to the previous segment, as in [8, 9]. To provide information from the preceding segments we use the approach we introduced in [9], which summarizes the dialog history by passing the sequence of dialog act labels through a GRU. We performed experiments using a single summary combining information concerning the three levels, as well as using per-level summaries which summarize the sequence of preceding labels of each level individually.

To predict the dialog act labels for the segment, the combined representation is passed through two dense layers. While the first reduces its dimensionality and identifies the most relevant information present in that representation, the second generates the labels. In terms of the dimensionality reduction layer, we experimented using a single layer that captures the most relevant information that is generic to the three levels, as well as per-level dimensionality reduction layers, which capture the
[Figure 1 diagram. Blocks shown: word inputs w0 ... wn into an RCNN (word-based segment representation); character inputs c0 ... cn into three CNNs with w = 3, w = 5 and w = 7 (character-based segment representation); combined segment representation; GRU dialog history; speaker change; and, for each of Levels 1 to 3, a Dense (Reduction) layer followed by a Dense output layer (Softmax for the L1 label, Sigmoid for the L2 and L3 labels).]

Figure 1: The architecture of our network. w_i refers to the embedding representation of the i-th word, while c_i refers to the embedding representation of the i-th character. The circles represent the concatenation of the different inputs.

most relevant information for each level, as depicted in Figure 1. Since the first level poses a single-label classification problem, the output layer uses the softmax activation and the categorical cross entropy loss function. On the other hand, since the other levels pose multi-label classification problems, the corresponding output layers use the sigmoid activation and the binary cross entropy loss function which, given the possibility of multiple labels, is actually the Hamming loss function [22]. In both cases, for performance reasons, we use the Adam optimizer [23].

In [3], we have shown that the prediction of dialog act labels of a certain level is improved when information concerning the upper levels is available. Thus, as shown in Figure 1, we also performed experiments that considered the output from the upper levels in the dimensionality reduction layers.

3.3. Training and Evaluation

To implement our networks we used Keras [24] with the TensorFlow [25] backend. We used mini-batching with batches of size 512 and the training phase stopped after 10 epochs without improvement. The results presented in the next section refer to the average (µ) and standard deviation (σ) of the results obtained over 10 runs. On each run, we performed 5-fold cross-validation using the folds defined in the first experiments on the DIHANA corpus [10, 11]. In terms of evaluation metrics we use accuracy. This metric is penalizing for the multi-label classification scenarios of Level 2 and Level 3, since it does not account for partial correctness [26]. However, due to space constraints and since we are focusing on the combined prediction of the three levels, we do not report results for specialized metrics.

4. Results

In this section we present the results of our experiments on each level, as well as on the combination of the three levels. Both the results on Level 1 and on the combination of the three levels can be directly compared with those reported in [3]. However, that is not the case for the remaining levels. In [3], since we explored each level independently and the annotation scheme does not allow segments with a Level 1 label concerning dialog structuring or communication problems to have labels in the remaining levels, we did not consider those segments when training and evaluating the Level 2 and Level 3 classifiers. Contrarily, since in this study we use a single classifier to predict the labels for all levels, those segments are also considered.

We started by exploring the word- and character-based segment representation approaches, as well as their combination. Thus, in these experiments, we did not provide context information to the network and we used a single dimensionality reduction layer for the three levels. In Table 1 we can see that, as we have previously shown in [15], the character-based approach leads to better results than the word-based one on Level 1. Additionally, the results achieved by the word-based approach are above those reported in [15], which confirms that the word-level RCNN-based segment representation approach leads to better results than the CNN-based one we used in [15] and [3]. On the other hand, the slight performance decrease of the character-based approach can be explained by the combined prediction of the three levels, which does not allow the classifier to specialize in predicting Level 1 labels. On the remaining levels, the character-level approach still performs better. However, the difference is smaller than on Level 1, which is explained by the more prominent relation of the labels of these levels with specific words. Furthermore, the combination of both approaches leads to the best results on every level.

Table 1: Accuracy (%) results according to the segment representation approach.

| Approach | Level 1 µ | Level 1 σ | Level 2 µ | Level 2 σ | Level 3 µ | Level 3 σ | All µ | All σ |
|---|---|---|---|---|---|---|---|---|
| Word-Based | 92.18 | .13 | 79.36 | .18 | 79.35 | .20 | 75.82 | .17 |
| Character-Based | 95.31 | .07 | 81.25 | .47 | 81.24 | .54 | 78.67 | .51 |
| Combined | 95.64 | .07 | 82.46 | .20 | 82.44 | .21 | 79.88 | .17 |

By using per-level dimensionality reduction layers, the classifier is able to select the information that is most relevant for predicting the labels of each level. Thus, as shown in Table 2, this adaptation leads to improved results on the two bottom levels and on the combination of the three levels. However, the performance on Level 1 did not improve, which suggests that the combined segment representation captures more information concerning specific words in detriment of functional patterns relevant for the prediction of some Level 1 labels. Providing information concerning the output generated for the upper levels leads to further improvement, in line with that reported in [3] in spite of not using gold standard labels.

As stated in Section 2, context information from the pre-
Table 2: Accuracy (%) results according to the dimensionality reduction approach.

| Approach | Level 1 µ | Level 1 σ | Level 2 µ | Level 2 σ | Level 3 µ | Level 3 σ | All µ | All σ |
|---|---|---|---|---|---|---|---|---|
| Single Reduction | 95.64 | .07 | 82.46 | .20 | 82.44 | .21 | 79.88 | .17 |
| Per-Level Reduction | 95.64 | .05 | 83.21 | .11 | 83.17 | .18 | 80.23 | .16 |
| Output Waterfall | 95.65 | .05 | 83.29 | .21 | 83.36 | .19 | 80.49 | .24 |

ceding segments has been proved important in many studies on dialog act recognition. The results in Table 3 confirm this importance for all levels. However, there are different conclusions to draw depending on the level. Concerning the first level, we can see that considering the dialog history leads to an average accuracy improvement of 3.72 percentage points, which is above the 3.42 reported in [15]. Considering that the classifier fails to predict the correct Level 1 label for less than one percent of the segments, this is a relevant improvement which is explained by the representation of the dialog history in the form of a summary. In [3] we have shown that the dialog history is not relevant when only Level 3 is considered, since it refers to information that is explicitly referred to in the segment. However, in Table 3 we can see an average accuracy improvement of 12.86 percentage points on Level 3 when considering the dialog history. This is explained by the fact that the provided information concerns all levels and, as shown in [3], information from the preceding segments concerning the upper levels is relevant when predicting Level 3 labels. On the one hand, what is implicitly targeted at a given time is expected to be explicitly referred to in the future. Thus, there is a relation between the Level 2 labels of preceding segments and the Level 3 labels of the current one. On the other hand, the dialogs feature multiple question-answer pairs for which the labels on the lower levels are the same. Thus, when the Level 1 label of the preceding segment is Question, the Level 3 labels of that segment are typically present in the current segment as well. This relation between levels is further confirmed by the improved performance when using a single summary for the whole dialog history in comparison to when using independent per-level summaries.

if we post-process the results to enforce that restriction, the improvement on the combination of the three levels is of just .03 percentage points when considering all segments and .1 when considering user segments only. This shows that the network is able to learn that restriction based on the training examples.

Overall, the average accuracy of our best approach on the combination of the three levels is 95.67%. This result is 3.33 percentage points above the 92.34% we achieved in [3] using the hierarchical combination of independent classifiers for each level. Furthermore, it is even above the 93.98% achieved when considering the single-label simplification of the problem, which only considers the label combinations present in the corpus. This shows that the network is able to capture relevant relations between levels while still being able to identify the most important information for each level using the per-level dimensionality reduction layers.

5. Conclusions

In this paper we have presented our approach on dialog act recognition on the DIHANA corpus using an end-to-end classifier to predict the labels for the three levels defined in the annotation scheme. This way, the relations between levels are captured implicitly, contrarily to what happened in our previous approach on the task, which used independent per-level classifiers. Additionally, we have used approaches on segment and context information representation which have recently been proved more appropriate for the task.

First, we have shown that character-based segment representation also performs better than word-based representation on the multi-label classification problems and that the combination of both approaches surpasses each individual approach. In this sense, on the combination of all levels, the combined approach surpassed the word- and character-level approaches by around four and one percentage points, respectively.

Then, we have shown that it is important to have per-level dimensionality reduction layers in order to specialize the segment representation for each level. Additionally, the performance is improved when a cue for the hierarchical relation between the levels is provided by considering the output for the upper levels when predicting the labels for each level.

Table 3: Accuracy (%) results when using context information
from the preceding segments. Furthermore, we have shown that the relation between lev-
els is also important when providing context information con-
Level 1 Level 2 Level 3 All cerning the dialog history, as a combined summary of the clas-
Approach µ σ µ σ µ σ µ σ sifications of the preceding segments led to better results than
No Information 95.65 .05 83.29 .21 83.36 .19 80.49 .24 three independent per-level summaries.
Single Summary 99.37 .01 96.15 .11 96.22 .15 95.53 .14 Finally, by providing information concerning speaker
Per-Level Summary 99.34 .03 95.80 .11 95.87 .13 95.14 .14
changes and forcing the segments with Level 1 labels concern-
ing dialog structuring or communication problems to have no
As shown in previous studies, using information concern- labels on the remaining levels, we achieved 95.67% accuracy
ing speaker changes slightly improves the performance, up to on the combination of the three levels, which is over three per-
95.64% average accuracy on the combination of the three levels. centage points above our previous approach and even surpasses
More importantly, as discussed in [3], the system segments are the results achieved on the simplified single-label classification
scripted and, thus, are easier to predict than the user segments. problem approached by previous studies.
Furthermore, a dialog system is aware of its own dialog acts As future work, we intend to explore how our approach can
and must only predict those of its conversational partners. As be adapted to perform automatic segmentation of the turns in-
expected, the performance decreases if the classifier is trained stead of relying on a priori segmentation and assess the impact
and evaluated on user segments only. The average decrease on on the overall dialog act recognition performance.
the combination of the three levels is of 4.5 percentage points.
However, on Level 1 it is of just .67 percentage points.
Since we use a single classifier to predict the labels for the
6. Acknowledgements
three-levels, there is no explicit restriction that segments with This work was supported by national funds through Fundação
Level 1 labels concerning dialog structuring or communication para a Ciência e a Tecnologia (FCT) with reference
problems cannot have labels in the remaining levels. However, UID/CEC/50021/2013.
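As an illustration of the restriction discussed above (segments whose Level 1 label concerns dialog structuring or communication problems are forced to have no labels on the remaining levels), the following is a minimal sketch and not the authors' implementation. The label names in RESTRICTED_LEVEL1, the Prediction type, and both function names are hypothetical, and the sketch assumes that accuracy on the combination of the three levels means that the labels of all three levels must match the annotation.

```python
from typing import List, Set, Tuple

# Hypothetical Level 1 labels concerning dialog structuring or
# communication problems; the actual label set is an assumption.
RESTRICTED_LEVEL1 = {"Opening", "Closing", "Not-Understood", "Undefined", "Waiting"}

# (Level 1 label, set of Level 2 labels, set of Level 3 labels)
Prediction = Tuple[str, Set[str], Set[str]]


def apply_restriction(pred: Prediction) -> Prediction:
    """Drop Level 2 and Level 3 labels when Level 1 is a restricted label."""
    level1, level2, level3 = pred
    if level1 in RESTRICTED_LEVEL1:
        return (level1, set(), set())
    return (level1, set(level2), set(level3))


def combined_accuracy(preds: List[Prediction], golds: List[Prediction]) -> float:
    """Percentage of segments whose labels match on all three levels."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(1 for p, g in zip(preds, golds) if p == g)
    return 100.0 * correct / len(golds)


# Toy data (invented for illustration, not taken from the corpus).
golds = [
    ("Opening", set(), set()),
    ("Question", {"Departure-Hour"}, {"Destination"}),
]
preds = [
    ("Opening", {"Departure-Hour"}, set()),  # spurious Level 2 label
    ("Question", {"Departure-Hour"}, {"Destination"}),
]
restricted = [apply_restriction(p) for p in preds]
print(combined_accuracy(preds, golds), combined_accuracy(restricted, golds))
```

In this toy case the restriction removes the spurious Level 2 label of the first segment, so that segment becomes an exact match on all three levels.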
7. References

[1] P. Král and C. Cerisara, “Dialogue Act Recognition Approaches,” Computing and Informatics, vol. 29, no. 2, pp. 227–250, 2010.
[2] J.-M. Benedí, E. Lleida, A. Varona, M.-J. Castro, I. Galiano, R. Justo, I. L. de Letona, and A. Miguel, “Design and Acquisition of a Telephone Spontaneous Speech Dialogue Corpus in Spanish: DIHANA,” in LREC, 2006, pp. 1636–1639.
[3] E. Ribeiro, R. Ribeiro, and D. M. de Matos, “Hierarchical Multi-Label Dialog Act Recognition on Spanish Data,” Traitement Automatique des Langues (submitted to), vol. 59, no. 1, 2019.
[4] N. Kalchbrenner and P. Blunsom, “Recurrent Convolutional Neural Networks for Discourse Compositionality,” in Workshop on Continuous Vector Space Models and their Compositionality, 2013, pp. 119–126.
[5] J. Y. Lee and F. Dernoncourt, “Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks,” in NAACL-HLT, 2016, pp. 515–520.
[6] Y. Ji, G. Haffari, and J. Eisenstein, “A Latent Variable Recurrent Neural Network for Discourse Relation Language Models,” in NAACL-HLT, 2016, pp. 332–342.
[7] H. Khanpour, N. Guntakandla, and R. Nielsen, “Dialogue Act Classification in Domain-Independent Conversations Using a Deep Recurrent Neural Network,” in COLING, 2016, pp. 2012–2021.
[8] Y. Liu, K. Han, Z. Tan, and Y. Lei, “Using Context Information for Dialog Act Classification in DNN Framework,” in EMNLP, 2017, pp. 2160–2168.
[9] E. Ribeiro, R. Ribeiro, and D. M. de Matos, “Deep Dialog Act Recognition using Multiple Token, Segment, and Context Information Representations,” CoRR, vol. abs/1807.08587, 2018. [Online]. Available: http://arxiv.org/abs/1807.08587
[10] V. Tamarit and C.-D. Martínez-Hinarejos, “Dialog Act Labeling in the DIHANA Corpus using Prosody Information,” in V Jornadas en Tecnología del Habla, 2008, pp. 183–186.
[11] C.-D. Martínez-Hinarejos, J.-M. Benedí, and R. Granell, “Statistical Framework for a Spanish Spoken Dialogue Corpus,” Speech Communication, vol. 50, no. 11–12, pp. 992–1008, 2008.
[12] C. D. Martínez-Hinarejos, J.-M. Benedí, and V. Tamarit, “Unsegmented Dialogue Act Annotation and Decoding with N-Gram Transducers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 198–211, 2015.
[13] B. Gambäck, F. Olsson, and O. Täckström, “Active Learning for Dialogue Act Classification,” in INTERSPEECH, 2011, pp. 1329–1332.
[14] D. Jurafsky, E. Shriberg, and D. Biasca, “Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual,” University of Colorado, Institute of Cognitive Science, Tech. Rep. Draft 13, 1997.
[15] E. Ribeiro, R. Ribeiro, and D. M. de Matos, “A Study on Dialog Act Recognition using Character-Level Tokenization,” in AIMSA, 2018.
[16] ——, “The Influence of Context on Dialogue Act Recognition,” CoRR, vol. abs/1506.00839, 2015. [Online]. Available: http://arxiv.org/abs/1506.00839
[17] N. Alcácer, J. M. Benedí, F. Blat, R. Granell, C. D. Martínez, and F. Torres, “Acquisition and Labelling of a Spontaneous Speech Dialogue Corpus,” in SPECOM, 2005, pp. 583–586.
[18] C.-D. Martínez-Hinarejos, E. Sanchis, F. García-Granada, and P. Aibar, “A Labelling Proposal to Annotate Dialogues,” in LREC, 2002, pp. 1566–1582.
[19] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for Text Classification,” in AAAI, 2015, pp. 2267–2273.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NIPS, 2013, pp. 3111–3119.
[21] C. Cardellino, “Spanish Billion Word Corpus and Embeddings,” http://crscardellino.me/SBWCE/, 2016.
[22] J. Díez, O. Luaces, J. J. del Coz, and A. Bahamonde, “Optimizing Different Loss Functions in Multilabel Classifications,” Progress in Artificial Intelligence, vol. 3, no. 2, pp. 107–118, 2015.
[23] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, 2015.
[24] F. Chollet et al., “Keras: The Python Deep Learning Library,” https://keras.io/, 2015.
[25] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,” https://www.tensorflow.org/, 2015.
[26] M. S. Sorower, “A Literature Survey on Algorithms for Multi-Label Learning,” Oregon State University, Tech. Rep., 2010.
Index
García, Eneritz, 267
García-Mateo, Carmen, 35, 72, 277
García-Perera, L. Paola, 236
Garrido, Juan-María, 291
Ghahabi, Omid, 184, 216
Gilbert, James M., 163
Gimeno, Pablo, 87, 220
Giménez, Adrià, 257
Glackin, Cornelius, 172, 231, 272
Golik, Pavel, 257
Gomez Suárez, Javier, 166
Gomez, Angel M., 45, 142
Gomez-Alanis, Alejandro, 45
Gonzalez-Dominguez, Javier, 179
Gonzalez-Rodriguez, Joaquin, 179
González López, José Andrés, 45, 117, 163
González-Ferreras, César, 112
Gordeeveva, Olga, 172
Granell, Emilio, 92, 174, 267
Green, Phil D., 163
Guasch, Oriol, 132
Guinaudeau, Camille, 194
Gully, Amelia, 163

Hernaez, Inma, 50, 107, 122, 166
Hernandez, Gabriel, 227
Hernando, Javier, 10, 199
Hernández-Gómez, Luis A., 262
Huang, Zili, 236

India Massana, Miquel Angel, 199
Inglés-Romero, Juan Francisco, 159

Jauk, Igor, 170, 189
Jorge, Javier, 257
Joseph, Arun, 137
Juan, Alfons, 257
Justo, Raquel, 68, 172

Khan, Umair, 10
Khosravani, Abbas, 231
Kimura, Takuya, 97
Kyslitska, Daria, 172
Külebi, Baybars, 25

Labrador, Beltran, 224
Lindner, Fred, 172
Lleida, Eduardo, 1, 6, 87, 220, 296
Llombart, Jorge, 296
Lozano-Diez, Alicia, 179, 224
López Otero, Paula, 82, 240
López-Espejo, Iván, 142
López-Gonzalo, Eduardo, 262

Machuca, María J., 97
Martinez Hinarejos, Carlos David, 92, 127, 174, 267
Martins de Matos, David, 281, 301
Martins, Paula, 137
Martín-Doñas, Juan M., 142
Martínez, Carlos David, 102
Martínez-Castilla, Pastora, 112
Martínez-Villaronga, Adrià, 257
Maurice, Benjamin, 194
McLaren, Mitchell, 208
Miguel, Antonio, 1, 6, 87, 220, 296
Mingote, Victoria, 1
Montalvo, Ana R., 254
Montenegro, César, 172
Moreno, Asuncion, 170
Morros, Josep Ramon, 199
Municio Martín, Jose Antonio, 166
Murphy, Damian, 163

Nandwana, Mahesh Kumar, 208
Navas, Eva, 50, 107, 122, 147, 166
Ney, Hermann, 257

Odriozola, Igor, 50
Oliveira, Catarina, 137
Ortega, Alfonso, 1, 6, 87, 220, 296

Palau, Ponç, 199
Parcheta, Zuzanna, 127
Pascual, Santiago, 30, 117, 152
Patino, Jose, 194, 211
Pavão Martins, Isabel, 281
Peinado, Antonio M., 45, 142
Peláez-Moreno, Carmen, 15
Pereira, Victor, 170
Perero-Codosero, Juan M., 262
Peñagarikano, Mikel, 249
Piñeiro-Martín, Andrés, 35
Pompili, Anna, 281
Povey, Daniel, 236

R. Costa-Jussà, Marta, 60
Raman, Sneha, 107, 122
Ramirez, Jose M., 254
Ramirez, Pablo, 224
Ramos-Muguerza, Eduardo, 204
Recski, Gábor, 286
Reiner, Miriam, 172
Ribeiro, Eugénio, 301
Ribeiro, Ricardo, 301
Rituerto-González, Esther, 15
Roble, Alejandro, 254
Rodríguez-Fuentes, Luis J., 249
Romero, Verónica, 92, 174
Ríos, Antonio, 97

S. Kornes, Maria, 172
Safari, Pooyan, 10
Sagastiberri, Itziar, 199
Salamea, Christian, 55
Sampaio Neto, Nelson, 77
San-Segundo, Rubén, 55
Sanchez, Jon, 50
Sanchis, Albert, 257
Santana, Riberto, 172
Sarasola, Xabier, 122, 147
Saratxaga, Ibon, 122, 147
Sayrol, Elisa, 199
Schlögl, Stephan, 172
Serrano, Luis, 50, 107, 122, 147
Serrà, Joan, 117, 152
Silva, Samuel, 137
Silvestre-Cerdà, Joan Albert, 257
Socoró, Joan Claudi, 132

T. Toledano, Doroteo, 64, 224, 245
Tapias Merino, Daniel, 262
Tarrés, Laia, 170
Tavarez, David, 122, 147
Teixeira, António, 137
Tejedor, Javier, 245
Tejedor-García, Cristian, 97, 157
Tenorio-Laranga, Jofre, 172
Tilves Santiago, Darío, 72
Torres, M. Inés, 68, 172

Varona, Amparo, 249
Vicente-Chicote, Cristina, 159
Villalba, Jesús, 236
Viñals, Ignacio, 6, 87, 220

Wanner, Leo, 40

Yin, Ruiqing, 194, 211

Öktem, Alp, 20, 25