Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
© 2023 Association for Computational Linguistics
ISBN 978-1-959429-84-5
Introduction

The International Conference on Spoken Language Translation (IWSLT) is the premier annual scientific conference for the study, development and evaluation of spoken language translation technology. Launched in 2004 and spun out from the C-STAR speech translation consortium before it (1992-2003), IWSLT is the main venue for scientific exchange on all topics related to speech-to-text translation, speech-to-speech translation, simultaneous and consecutive translation, speech dubbing, and cross-lingual communication, including all multimodal, emotional, paralinguistic, and stylistic aspects and their applications in the field. The conference organizes evaluations around challenge areas, and presents scientific papers and system descriptions. IWSLT is organized by the Special Interest Group on Spoken Language Translation (SIGSLT), which is supported by ACL, ISCA and ELRA.

This year, IWSLT featured nine shared tasks in spoken language translation: (i) simultaneous and (ii) offline translation, (iii) automatic subtitling and (iv) dubbing, (v) speech-to-speech translation, (vi) multilingual, (vii) dialect and (viii) low-resource speech translation, and (ix) formality control. Each shared task was coordinated by one or more chairs. The resulting evaluation campaigns attracted a total of 31 teams from academia, research centers, and industry. System submissions resulted in system papers that will be presented at the conference. Following our call for papers, this year 51 submissions were received. In a blind review process, 8 research papers were selected out of 15 for oral presentation (57%), in addition to 37 system papers.

The program committee is excited about the quality of the accepted papers and expects lively discussion and exchange at the conference. The conference chairs and organizers would like to express their gratitude to everyone who contributed to and supported IWSLT. In particular, we wish to thank our Diamond sponsors Apple and Translated, our Gold sponsor aiXplain, and our Silver sponsor AppTek. We thank the shared task chairs, organizers, and participants, the program committee members, as well as all the authors who went the extra mile to submit system and research papers to IWSLT and make this year's conference a big success. We also wish to express our sincere gratitude to ACL for hosting our conference and for arranging the logistics and infrastructure that allow us to hold IWSLT 2023 as a hybrid conference.
Organizing Committee
Conference Chairs
Marcello Federico, AWS AI Labs, USA
Alex Waibel, CMU, USA
Program Chair
Marine Carpuat, UMD, USA
Sponsorship Chair
Sebastian Stüker, Zoom, Germany
Evaluation Chairs
Jan Niehues, KIT, Germany
Publicity Chair
Atul Kr. Ojha, University of Galway, Ireland
Program Committee
Sweta Agrawal, University of Maryland, USA
Duygu Ataman, University of Zurich, Switzerland
Laurent Besacier, Naver Labs, France
Roldano Cattoni, FBK, Italy
Alexandra Chronopoulou, LMU Munich, Germany
Josep Maria Crego, Systran, France
Mattia Di Gangi, AppTek, Germany
Qianqian Dong, ByteDance AI Lab, China
Akiko Eriguchi, Microsoft, USA
Carlos Escolano, Universitat Politècnica de Catalunya, Spain
Markus Freitag, Google, USA
Hirofumi Inaguma, Meta AI, USA
Tom Ko, ByteDance AI Lab, China
Surafel Melaku Lakew, Amazon AI, USA
Yves Lepage, Waseda University, Japan
Xutai Ma, Meta AI, USA
Wolfgang Macherey, Google, USA
Prashant Mathur, AWS AI Labs, USA
Evgeny Matusov, AppTek, Germany
Kenton Murray, Johns Hopkins University, USA
Maria Nadejde, AWS AI Labs, USA
Matteo Negri, FBK, Italy
Xing Niu, AWS AI Labs, USA
Raghavendra Reddy Pappagari, Johns Hopkins University, USA
Juan Pino, Meta AI, USA
Elijah Rippeth, UMD, USA
Elizabeth Salesky, Johns Hopkins University, USA
Rico Sennrich, University of Zurich, Switzerland
Matthias Sperber, Apple, USA
Sebastian Stüker, Zoom, Germany
Katsuhito Sudoh, NAIST, Japan
Brian Thompson, AWS AI Labs, USA
Marco Turchi, Zoom, Germany
David Vilar, Google, Germany
Changhan Wang, Meta AI, USA
Krzysztof Wolk, Polish-Japanese Academy of Information Technology, Poland
Table of Contents
Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology
Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab and Jan Niehues . . 62
The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation
Tasks
Yichao Du, Guo Zhengsheng, Jinchuan Tian, Zhirui Zhang, Xing Wang, Jianwei Yu, Zhaopeng
Tu, Tong Xu and Enhong Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation
Zhihang Xie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource
Track
Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito and Ioan Calapodescu . . . . . . . . 144
Improving Neural Machine Translation Formality Control with Domain Adaptation and Reranking-based Transductive Learning
Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen, Zhiqiang
Rao, Zhengzhe YU, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei, Jiawei Zheng, Ming Zhu, Lizhi
Lei, Hao Yang and Yanfei Jiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via Pre-Training and Domain
Adaptation
Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao, Guo JiaXin, Daimeng Wei, Hengchao
Shang, Wang Minghan, Xiaoyu Chen, Zhengzhe YU, Li ShaoJun, Lei LiZhi and Hao Yang . . . . . . . 187
Submission of USTC’s System for the IWSLT 2023 - Offline Speech Translation Track
Xinyuan Zhou, Jianwei Cui, Zhongyi Ye, Yichi Wang, Luzhen Xu, Hanyi Zhang, Weitai Zhang
and Lirong Dai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
I2R’s End-to-End Speech Translation System for IWSLT 2023 Offline Shared Task
Muhammad Huzaifah, Kye Min Tan and Richeng Duan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task
Yuchen Han, Xiaoqian Liu, Hao Chen, Yuhao Zhang, Chen Xu, Tong Xiao and Jingbo Zhu . . 211
ON-TRAC Consortium Systems for the IWSLT 2023 Dialectal and Low-resource Speech Translation
Tasks
Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol,
Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison, Sameer Khurana and Yannick
Estève . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource Speech Translation Task
Santosh Kesiraju, Karel Beneš, Maksim Tikhonov and Jan Černocký . . . . . . . . . . . . . . . . . . . . . . 227
Improving Low Resource Speech Translation with Data Augmentation and Ensemble Strategies
Akshaya Vishnu Kudlu Shanbhogue, Ran Xue, Soumya Saha, Daniel Zhang and Ashwinkumar
Ganesan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Speech Translation with Style: AppTek’s Submissions to the IWSLT Subtitling and Formality Tracks in
2023
Parnia Bahar, Patrick Wilken, Javier Iranzo-Sánchez, Mattia Di Gangi, Evgeny Matusov and
Zoltán Tüske . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks
John E. Ortega, Rodolfo Zevallos and William Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks
Jonathan Mbuya and Antonios Anastasopoulos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Learning Nearest Neighbour Informed Latent Word Embeddings to Improve Zero-Shot Machine Translation
Nishant Kambhatla, Logan Born and Anoop Sarkar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation
Task
Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang and
Guoqing Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Language Model Based Target Token Importance Rescaling for Simultaneous Neural Machine Translation
Aditi Jain, Nishant Kambhatla and Anoop Sarkar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Tagged End-to-End Simultaneous Speech Translation Training Using Simultaneous Interpretation Data
Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Katsuhito Sudoh and Satoshi Nakamura . . . . . 363
The HW-TSC’s Simultaneous Speech-to-Text Translation System for IWSLT 2023 Evaluation
Jiaxin GUO, Daimeng Wei, Zhanglin Wu, Zongyao Li, Zhiqiang Rao, Minghan Wang, Hengchao
Shang, Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Lizhi Lei and Hao Yang . . . . . . . . . . . . 376
The HW-TSC’s Simultaneous Speech-to-Speech Translation System for IWSLT 2023 Evaluation
Hengchao Shang, Zhiqiang Rao, Zongyao Li, Zhanglin Wu, Jiaxin GUO, Minghan Wang, Daimeng Wei, Shaojun Li, Zhengzhe YU, Xiaoyu Chen, Lizhi Lei and Hao Yang . . . . . 383
Towards Efficient Simultaneous Speech Translation: CUNI-KIT System for Simultaneous Track at IWSLT 2023
Peter Polak, Danni Liu, Ngoc-Quan Pham, Jan Niehues, Alexander Waibel and Ondřej Bojar 389
Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23
Ioannis Tsiamas, Gerard I. Gállego, Jose Fonollosa and Marta R. Costa-jussà . . . . . 397
The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and
Speech-to-Speech Task
Wuwei Huang, Mengge Liu, Xiang Li, Yanzhi Tian, Fengyu Yang, Wen Zhang, Jian Luan, Bin
Wang, Yuhang Guo and Jinsong Su . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Improving Formality-Sensitive Machine Translation Using Data-Centric Approaches and Prompt Engineering
Seugnjun Lee, Hyeonseok Moon, Chanjun Park and Heuiseok Lim . . . . . . . . . . . . . . . . . . . . . . . . 420
UM-DFKI Maltese Speech Translation
Aiden Williams, Kurt Abela, Rishu Kumar, Martin Bär, Hannah Billinghurst, Kurt Micallef,
Ahnaf Mozib Samin, Andrea DeMarco, Lonneke van der Plas and Claudia Borg . . . . . . . . . . . . . . . . . 433
SRI-B’s Systems for IWSLT 2023 Dialectal and Low-resource Track: Marathi-Hindi Speech Translation
Balaji Radhakrishnan, Saurabh Agrawal, Raj Prakash Gohil, Kiran Praveen, Advait Vinay Dhopeshwarkar and Abhishek Pandey . . . . . 449
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator
Loss
Yihong Liu, Alexandra Chronopoulou, Hinrich Schütze and Alexander Fraser . . . . . . . . . . . . . . 491
Program

Friday, July 14, 2023
FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal (1), Sweta Agrawal (2), Antonios Anastasopoulos (1), Luisa Bentivogli (3), Ondřej Bojar (4), Claudia Borg (5), Marine Carpuat (2), Roldano Cattoni (3), Mauro Cettolo (3), Mingda Chen (6), William Chen (7), Khalid Choukri (8), Alexandra Chronopoulou (9), Anna Currey (10), Thierry Declerck (11), Qianqian Dong (12), Kevin Duh (13), Yannick Estève (14), Marcello Federico (10), Souhir Gahbiche (15), Barry Haddow (16), Benjamin Hsu (10), Phu Mon Htut (10), Hirofumi Inaguma (6), Dávid Javorský (4), John Judge (17), Yasumasa Kano (18), Tom Ko (12), Rishu Kumar (4), Pengwei Li (6), Xutai Ma (6), Prashant Mathur (10), Evgeny Matusov (19), Paul McNamee (13), John P. McCrae (20), Kenton Murray (13), Maria Nadejde (10), Satoshi Nakamura (18), Matteo Negri (3), Ha Nguyen (14), Jan Niehues (21), Xing Niu (10), Atul Kr. Ojha (20), John E. Ortega (22), Proyag Pal (16), Juan Pino (6), Lonneke van der Plas (23), Peter Polák (4), Elijah Rippeth (2), Elizabeth Salesky (13), Jiatong Shi (7), Matthias Sperber (24), Sebastian Stüker (25), Katsuhito Sudoh (18), Yun Tang (6), Brian Thompson (10), Kevin Tran (6), Marco Turchi (25), Alex Waibel (7), Mingxuan Wang (12), Shinji Watanabe (7), Rodolfo Zevallos (26)

Affiliations: (1) GMU, (2) UMD, (3) FBK, (4) Charles U., (5) U. Malta, (6) Meta, (7) CMU, (8) ELDA, (9) LMU, (10) AWS, (11) DFKI, (12) ByteDance, (13) JHU, (14) Avignon U., (15) Airbus, (16) U. Edinburgh, (18) NAIST, (19) AppTek, (20) U. Galway, (21) KIT, (22) Northeastern U., (23) IDIAP, (24) Apple, (25) Zoom, (26) U. Pompeu Fabra
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 1-61, July 13-14, 2023. © 2023 Association for Computational Linguistics
Team          Organization
Alexa AI      Amazon Alexa AI, USA (Vishnu et al., 2023)
AppTek        AppTek, Germany (Bahar et al., 2023)
BIGAI         Beijing Institute of General Artificial Intelligence, China (Xie, 2023)
BIT           Beijing Institute of Technology, China (Wang et al., 2023b)
BUT           Brno University of Technology, Czechia (Kesiraju et al., 2023)
CMU           Carnegie Mellon University, USA (Yan et al., 2023)
CUNI-KIT      Charles University, Czechia, and KIT, Germany (Polák et al., 2023)
FBK           Fondazione Bruno Kessler, Italy (Papi et al., 2023)
GMU           George Mason University, USA (Mbuya and Anastasopoulos, 2023)
HW-TSC        Huawei Translation Services Center, China (Li et al., 2023; Wang et al., 2023a; Guo et al., 2023; Shang et al., 2023; Rao et al., 2023)
I2R           Institute for Infocomm Research, A*STAR, Singapore (Huzaifah et al., 2023)
JHU           Johns Hopkins University, USA (Hussein et al., 2023; Xinyuan et al., 2023)
KIT           Karlsruhe Institute of Technology, Germany (Liu et al., 2023)
KU            Kyoto University, Japan (Yang et al., 2023)
KU X Upstage  Korea University X Upstage, South Korea (Wu et al., 2023; Lee et al., 2023)
Matesub       Translated Srl, Italy (Perone, 2023)
MineTrans     U. of Sci. and Techn. of China, Tencent AI Lab, State Key Lab. of Cognitive Intelligence (Du et al., 2023)
NAIST         Nara Institute of Science and Technology, Japan (Fukuda et al., 2023)
NAVER         NAVER Labs Europe, France (Gow-Smith et al., 2023)
NiuTrans      NiuTrans, China (Han et al., 2023)
NPU-MSXF      Northwestern Polytechnical U., Nanjing U., MaShang Co., China (Song et al., 2023)
NeuroDub      NeuroDub, Armenia
NeMo          NVIDIA NeMo, USA (Hrinchuk et al., 2023)
ON-TRAC       ON-TRAC Consortium, France (Laurent et al., 2023)
QUESPA        Northeastern U., USA, U. Pompeu Fabra, Spain, and CMU, USA (Ortega et al., 2023)
UPC           Universitat Politècnica de Catalunya, Spain (Tsiamas et al., 2023)
SRI-B         Samsung R&D Institute Bangalore, India (Radhakrishnan et al., 2023)
UCSC          U. of California, Santa Cruz, USA (Vakharia et al., 2023)
UM-DFKI       U. of Malta, Malta, and DFKI, Germany (Williams et al., 2023)
USTC          U. of Science and Technology of China (Deng et al., 2023; Zhou et al., 2023)
Xiaomi        Xiaomi AI Lab, China (Huang et al., 2023)

Table 1: Participating teams and their organizations.
• Multilingual SLT, focusing on speech translation of technical presentations from English into Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and Turkish.

• Speech-to-speech translation, focusing on natural-speech to synthetic-speech translation of recorded utterances from English to Chinese.

• Automatic Dubbing, focusing on dubbing of short video clips from German to English.

• Dialect SLT, focusing on speech translation of recorded utterances from Tunisian Arabic to English.

• Low-resource SLT, focusing on speech translation of recorded utterances from Irish to English, Marathi to Hindi, Maltese to English, Pashto to French, Tamasheq to French, and Quechua to Spanish.

• Formality Control for SLT, focusing on formality/register control for spoken language translation from English to Korean, Vietnamese, EU Portuguese, and Russian.

The shared tasks attracted 38 submissions by 31 teams (see Table 1), representing both academic and industrial organizations. The following sections report on each shared task in detail, in particular: the goal and automatic metrics adopted for the task, the data used for training and testing, the received submissions, and a summary of the results. Detailed results for some of the shared tasks are reported in a corresponding appendix.

2 Offline SLT

Offline speech translation is the task of translating audio speech in one language into text in a different target language, without any specific time or structural constraints (as, for instance, in the simultaneous, subtitling, and dubbing tasks). Under this general problem definition, the goal of
the offline ST track (one of the speech tasks with the longest tradition at the IWSLT campaign) is to constantly challenge a technology in rapid evolution by gradually introducing novelty aspects that raise the difficulty bar.

2.1 Challenge

In continuity with last year, participants were given three sub-tasks corresponding to three language directions, namely English→German/Japanese/Chinese. Participation was allowed both with cascade architectures combining automatic speech recognition (ASR) and machine translation (MT) systems as core components, or by means of end-to-end approaches that directly translate the input speech without intermediate symbolic representations. Also this year, one of the main objectives was indeed to measure the performance difference between the two paradigms, a gap that recent research (Bentivogli et al., 2021) and IWSLT findings (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022b) indicate as gradually decreasing. The other main objective of this round was to assess the ability of SLT technology to deal with complex scenarios involving different types of input characterized by phenomena like spontaneous speech, noisy audio conditions and overlapping speakers. In light of this, the main novelty of this year's offline SLT task lies in a richer variety of speech data to be processed. To this aim, in addition to the classic TED talks test set, two novel test sets were released:

• ACL presentations, in which a single speaker is presenting on a stage. Although similar to the TED talks scenario, additional challenges posed by this test set include the presence of non-native speakers, different accents, variable recording quality, terminology, and controlled interactions with a second speaker.

• Press conferences and interviews, in which two persons interact on different topics. Inherent challenges, therefore, include the presence of spontaneous speech, non-native speakers, different accents, and controlled interaction with a second speaker.

All the test sets were used for evaluation in the English-German sub-task, while only TED Talks and ACL presentations were used to test the submissions to the English-Japanese and English-Chinese sub-tasks.

2.2 Data and Metrics

Training and development data. Participants were offered the possibility to submit systems built under three training data conditions:

1. Constrained: the allowed training data is limited to a medium-sized framework in order to keep the training time and resource requirements manageable. The complete list [1] of allowed training resources (speech, speech-to-text-parallel, text-parallel, text-monolingual) does not include any pre-trained language model.

2. Constrained with large language models (constrained+LLM): in addition to all the constrained resources, a restricted selection [1] of large language models is allowed to give participants the possibility to leverage large language models and medium-sized resources.

3. Unconstrained: any resource, pre-trained language models included, can be used with the exception of evaluation sets. This setup is proposed to allow the participation of teams equipped with high computational power and effective in-house solutions built on additional resources.

The development data allowed under the constrained condition consist of the dev set from IWSLT 2010, as well as the test sets used for the 2010, 2013-2015 and 2018-2020 IWSLT campaigns. Besides this TED-derived material, additional development data were released to cover the two new scenarios included in this round of evaluation. For the ACL domain, 5 presentations from the ACL 2022 conference with translations and transcriptions were provided. Due to additional constraints, these references were generated by human post-editing of automatic transcriptions and translations. For the press conferences and interviews domain, 12 videos (total duration: 1h:3m) were selected from publicly available interviews from the Multimedia Centre of the European Parliament (EPTV) [2].

[1] See the IWSLT 2023 offline track web page: https://iwslt.org/2023/offline
[2] https://multimedia.europarl.europa.eu
Test data. Three new test sets were created for the three language directions. The new test sets include heterogeneous material drawn from each scenario. For the traditional TED scenario, a new set of 42 talks not included in the current public release of MuST-C was selected to build the en-de test set. [3] Starting from this material, the talks for which Japanese and Chinese translations are available were selected to build the en-zh and en-ja test sets (respectively, 38 and 37 talks). Similar to the 2021 and 2022 editions, we consider two different types of target-language references, namely:

• The original TED translations. Since these references come in the form of subtitles, they are subject to compression and omissions to adhere to the TED subtitling guidelines. [4] This makes them less literal compared to standard, unconstrained translations;

• Unconstrained translations. These references were created from scratch [5] by adhering to the usual translation guidelines. They are hence exact translations (i.e. literal and with proper punctuation).

For the ACL presentation scenario, paper presentations from ACL 2022 were transcribed and translated into the target languages. A detailed description of the data set can be found in Salesky et al. (2023). There are 5 presentations in each of the dev and test sets, with a total duration of 1h per split. Talks were selected to include diverse paper topics and speaker backgrounds. This test set is shared with the Multilingual task (§5).

For the press conferences and interviews scenario, the test set comprises 10 EPTV videos of variable duration (6m on average), amounting to a total of 1h:1m. The details of the new test sets are reported in Table 2.

                   Talks / Videos   Duration
English-German
  TED              42               3h:47m:53s
  ACL              5                59m:22s
  EPTV             10               1h:1m
English-Chinese
  TED              37               3h:2m:22s
  ACL              5                59m:22s
English-Japanese
  TED              38               3h:19m:34s
  ACL              5                59m:22s

Table 2: Statistics of the official test sets for the IWSLT 2023 offline speech translation task.

Metrics. Systems were evaluated with respect to their capability to produce translations similar to the target-language references. The similarity was measured in terms of BLEU and COMET (Rei et al., 2020a) metrics. The submitted runs were ranked based on the BLEU calculated on the concatenation of the three test sets by using automatic resegmentation [6] of the hypotheses based on the reference translations. For the BLEU computed on the concatenation of the three test sets, the new unconstrained references have been used for the TED data. As observed in the IWSLT 2022 manual evaluation of simultaneous speech-to-text translation (Macháček et al., 2023), COMET correlates best with human judgments and the BLEU correlation is also satisfactory. Moreover, to meet the requests of last year's participants, a human evaluation was performed on the best-performing submission of each participant.

2.3 Submissions

This year, 10 teams participated in the offline task, submitting a total of 37 runs. Table 3 provides a breakdown of the participation in each sub-task, showing the number of participants, the number of submitted runs and, for each training data condition (constrained, constrained+LLM, unconstrained), the number of submitted runs obtained with cascade and direct systems.

• BIGAI (Xie, 2023) participated both with cascade and direct models for en-de, en-ja, and en-zh translations, which were trained under the constrained+LLM condition. The cascade is the concatenation of an ASR model and an MT system. The ASR consists of the first 12 Transformer layers

[3] This set of 42 TED talks is also referred to as the "Common" test set (not to be confused with MuST-C "tst-COMMON") because it serves in both the Offline and Simultaneous (https://iwslt.org/2023/simultaneous) tasks.
[4] http://www.ted.com/participate/translate/subtitling-tips
[5] We would like to thank Meta for providing us with this new set of references.
[6] Performed with mwerSegmenter - https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
English-German (6 participants, 16 runs)
  Constrained:       2 runs  (cascade 1, direct 1)
  Constrained+LLM:  12 runs  (cascade 1, direct 11)
  Unconstrained:     2 runs  (cascade 2, direct -)

English-Chinese (7 participants, 16 runs)
  Constrained:       5 runs  (cascade 3, direct 2)
  Constrained+LLM:   3 runs  (cascade 1, direct 2)
  Unconstrained:     8 runs  (cascade 7, direct 1)

English-Japanese (3 participants, 5 runs)
  Constrained:       2 runs  (cascade 1, direct 1)
  Constrained+LLM:   2 runs  (cascade 1, direct 1)
  Unconstrained:     1 run   (cascade 1, direct -)

Table 3: Breakdown of the offline task submissions per language direction: number of participants, total runs, and runs per training data condition, split into cascade and direct systems.
to increase robustness to ASR noise (through synthetic noise generation and data augmentation).

• MineTrans (Du et al., 2023) participated with en-zh cascade systems trained under the constrained and unconstrained conditions. The submitted runs are obtained with a pipeline of ASR, punctuation recognition, and MT components. The ASR is an RNN-Transducer. For the unconstrained condition, GigaSpeech is added to the training data allowed in the constrained setting. In both conditions, pre-processing and filtering techniques are applied to improve data quality, while SpecAugment is used for data augmentation. Before being passed to the MT component, the unpunctuated ASR output is processed by means of a BERT-based punctuation recognition model. For the MT component, two strategies are implemented. The first one relies on different Transformer-based models for supervised training. A base Transformer and an M2M 100 model are used for the constrained condition. A translation model trained on additional in-house corpora is used for the unconstrained condition. The second strategy adopted for the MT component relies on a large language model (ChatGPT) for prompt-guided translation.

• NiuTrans (Han et al., 2023) participated with a direct en-zh system trained under the constrained condition. It consists of two separate encoders for speech and text with an adapter in between, followed by a decoder. The speech encoder is pre-trained with an ASR encoder, while the textual encoder and the decoder are initialized with pre-trained MT components. Different architectures with variable size were tested both for ASR (enhanced with CTC loss and inter-CTC loss to speed up convergence) and MT (used to generate pseudo-references so as to increase the size of the SLT data). The final system is an ensemble aiming at maximizing the diversity between models.

• NeuroDub [7] participated with a cascade en-de system trained under the unconstrained condition. It consists of a 4-staged process including the ASR, the punctuation module performing both sentence extraction and punctuation placement, the speaker- and gender-distinction component, and the translation model. Every stage is trained on data crawled from the web.

[7] Unofficial participant, as no system paper is available.

• NeMo (Hrinchuk et al., 2023) participated with direct systems for all language directions in the constrained training data condition. Pre-trained models and synthetic training data are exploited in different ways to cope with the scarcity of direct ST data. A Conformer-based ASR model trained on all allowed speech-to-text data is used to initialize the SLT encoder. A Transformer-based NMT model trained on all allowed parallel data and fine-tuned on TED talks is used to generate synthetic translation alternatives for all available speech-to-text and text-to-text data. A TTS model based on FastPitch (Łańcucki, 2021) and trained on the English transcripts of all TED-derived data is used to generate the synthetic speech version of English texts in the available text corpora. The submitted SLT systems are based on a Conformer-based encoder followed by a Transformer decoder trained on this mix of (gold and synthetic) speech-to-text and text-to-text data.

• Xiaomi (Huang et al., 2023) participated with a direct en-zh system trained under the constrained+LLM condition. It consists of a speech encoder, a text encoder, and a text decoder, with all parameters initialized using the pre-trained HuBERT and mBART models. The speech encoder is composed of a feature extractor based on convolutional neural networks and a Transformer encoder. In addition to the cross-entropy loss, ASR, MT, and a contrastive loss, which tries to learn an encoder that produces similar representations for similar instances independently from the modalities, are added. Self-training is also used to leverage unlabelled data. In addition to the allowed datasets, a large set of pseudo references are generated by translating the
transcripts of the ASR corpora. During training, a second fine-tuning is performed on MuST-C as in-domain data. The final system is an ensemble of the two best-performing models.

2.4 Results

Also this year, the submissions to the IWSLT Offline translation task were evaluated both with automatic metrics and through human evaluation. The results for each sub-task are shown in detail in the Appendix.
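As a rough illustration of the automatic side of this evaluation, the sketch below shows how a system's output, once realigned to the reference segmentation (e.g., with mwerSegmenter, as mentioned in Section 2.2), might be scored with sacreBLEU and COMET. This is only a sketch under stated assumptions, not the official scoring pipeline: the file names and the COMET checkpoint are illustrative choices.

```python
# Hedged sketch, not the official IWSLT scoring pipeline.
# Assumes the `sacrebleu` and `unbabel-comet` packages, and that the system
# output has already been resegmented to match the references (mwerSegmenter).
from sacrebleu.metrics import BLEU
from comet import download_model, load_from_checkpoint


def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


sources = read_lines("tst2023.en.txt")         # source transcripts (needed by COMET)
references = read_lines("tst2023.de.ref.txt")  # reference translations
hypotheses = read_lines("hyp.de.reseg.txt")    # resegmented system output

# Corpus-level BLEU over the concatenated test sets.
bleu = BLEU()
print("BLEU:", round(bleu.corpus_score(hypotheses, [references]).score, 2))

# Reference-based COMET, roughly matching the metric of Rei et al. (2020a).
model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", round(model.predict(data, batch_size=16, gpus=0).system_score, 4))
```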
ify the overall evaluation of the systems. In the English-to-Chinese task, there are two situations where the metrics differ significantly. The ranking for the USTC end-to-end system compared to the HW-TSC systems is different with respect to COMET, which rewards the HW-TSC submissions. A similar situation is visible for NiuTrans and Xiaomi, where BLEU favors the NiuTrans translations, while COMET assigns higher scores, and ranking, to the Xiaomi submissions.

Data conditions. For the different data conditions, the gains from using additional large language models or additional data are not clear. HW-TSC submitted three primary systems for each data condition and they all perform very similarly. However, for en-zh the unconstrained system by USTC was clearly the best, and for en-de the best system other than HW-TSC was also an unconstrained one. The additional benefit of the pre-trained models is even less clear. There is no clear picture that the systems with or without this technology perform better.

Domains. One new aspect this year is the evaluation of the systems on three different test sets and domains. First of all, the absolute performance on the different domains is quite different. The systems perform clearly worse on the EPTV test sets. For the relationship between ACL and TED, the picture is not as clear. While the BLEU scores on ACL are higher, the COMET scores are lower. Only for English-to-Japanese are both metrics higher on the ACL test set. One explanation could be that the references for the ACL talks are generated by post-editing an MT output. This could indicate that the post-edited references inflate the BLEU score, while the COMET score seems to be more robust to this phenomenon. When comparing the different systems, the tendency is the same in all cases. However, some perform slightly better in one condition. For example, the end-to-end system from USTC performs very well on TED compared to other systems but less well on ACL.

2.4.2 Human Evaluation

At the time of writing, human evaluation is still in progress. Its results will be reported at the conference and they will appear in the updated version of this paper in Appendix A.

3 Simultaneous SLT

Simultaneous speech translation means that the system starts translating before the speaker finishes the sentence. The task is essential to enable people to communicate seamlessly across different backgrounds, in low-latency scenarios such as translation in international conferences or travel.

This year, the task included two tracks: speech-to-text and speech-to-speech, covering three language directions: English to German, Chinese and Japanese.

3.1 Challenge

There are two major updates compared with previous years:

• Removal of the text-to-text track. The task focuses on the real-world live-translation setting, where speech is the input medium.

• Addition of a speech-to-speech track. Translation into synthetic speech has gained increasing attention within the research community, given its potential application to real-time conversations.

To simplify the shared task, a single latency constraint is introduced for each track: 2 seconds of Average Lagging for speech-to-text, and 2.5 seconds of starting offset for speech-to-speech. The participants can submit no more than one system per track / language direction, as long as the latency of the system is under the constraint. The latency of the system is measured on the open MuST-C tst-COMMON test set (Di Gangi et al., 2019a).

The participants made submissions as Docker images, which were later run by the organizers on the blind test set in a controlled environment. An example implementation was provided with the SimulEval toolkit (Ma et al., 2020a).

3.2 Data

The training data condition of the simultaneous task follows the "constrained with large language models" setting of the Offline translation task, as described in Section 2.2.

The test data has two parts:

Common (TED talks). The same as in the Offline task, as described in Section 2.2. For English to German, Chinese and Japanese.
Non-Native. See Appendix A.1.1. For English to German.

3.3 Evaluation

Two attributes are evaluated in the simultaneous task: quality and latency.

For quality, we conducted both automatic and human evaluation. The BLEU score (Papineni et al., 2002a) is used for automatic quality evaluation. For speech output, the BLEU score is computed on the transcripts from the Whisper (Radford et al., 2022) ASR model. The ranking of the submissions is based on the BLEU score on the Common blind test set. Furthermore, we conducted a BLASER (Chen et al., 2022) evaluation on the speech output. We also conducted human evaluation on speech-to-text translation quality, including general human evaluation for all three language pairs, and task-specific human evaluation on German and Japanese outputs.

For latency, we only conducted automatic evaluation. We report the following metrics for each speech-to-text system:

• Average Lagging (AL; Ma et al., 2019, 2020b)

• Length Adaptive Average Lagging (LAAL; Polák et al., 2022; Papi et al., 2022)

• Average Token Delay (ATD; Kano et al., 2023)

• Average Proportion (AP; Cho and Esipova, 2016)

• Differentiable Average Lagging (DAL; Cherry and Foster, 2019)

We also measured the computation-aware version of the latency metrics, as described by Ma et al. (2020b). However, due to the new synchronized SimulEval agent pipeline design, the actual computation-aware latency can be smaller with carefully designed parallelism.

For speech-to-speech systems, we report start-offset and end-offset. The latency metrics will not be used for ranking.

3.4 Submissions

The simultaneous shared task received submissions from six teams, with all teams participating in at least one language direction in speech-to-text translation. Among the teams, five teams entered the English-to-German track; four teams entered the English-to-Chinese track; three teams entered the English-to-Japanese track. Even though this year is our first time introducing the simultaneous speech-to-speech track, four teams out of six submitted speech-to-speech systems.

• CMU (Yan et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for English-German translation. Their speech-to-text model combined self-supervised speech representations, a Conformer encoder, and an mBART decoder. In addition to the cross-entropy attentional loss, the translation model was also trained with CTC objectives. They used machine translation pseudo labeling for data augmentation. Simultaneous decoding was achieved by chunking the speech signals and employing incremental beam search. For their speech-to-speech system, they incorporated a VITS-based text-to-speech model, which was trained separately.

• HW-TSC (Guo et al., 2023; Shang et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for all three language directions. Their model was a cascaded system that combined a U2 ASR, a Transformer-based machine translation model, and a VITS-based text-to-speech model for speech-to-speech translation. The MT model was multilingual and offered translation in all three directions by conditioning on language embeddings. For data augmentation, they adopted data diversification and forward translation techniques. Their simultaneous decoding policy employed chunk-based incremental decoding with stable hypotheses detection. They also utilized additional TTS models for the speech-to-speech track.

• NAIST (Fukuda et al., 2023) participated in the speech-to-text track for all three language directions and in English-to-Japanese speech-to-speech translation. Their system consisted of a HuBERT encoder and an mBART decoder. They employed three techniques to improve translation quality: inter-connection to combine pre-trained representations, prefix alignment fine-tuning for simultaneous decoding, and local agreement
to find stable prefix hypotheses. They also utilized an additional Tacotron2-based TTS model for speech-to-speech translation with the wait-k decoding policy.

• FBK (Papi et al., 2023) participated in the English-to-German speech-to-text translation track, using an end-to-end Conformer-based speech-to-text model. Considering computational latency, their focus was on the efficient usage of offline models. They employed three simultaneous policies, including local agreement, encoder-decoder attention, and EDATT v2, to achieve this.

• CUNI-KIT (Polák et al., 2023) participated in the English-to-German speech-to-text translation track. Their system utilized WavLM and mBART as the base framework. The key highlights of their system were in the decoding strategy and simultaneous policies. They applied empirical hypotheses filtering during decoding and adopted CTC to detect the completion of block inference.

• Xiaomi (Huang et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for English-Chinese translation. Their end-to-end system utilized HuBERT and mBART with a wait-k decoding strategy and an Information-Transport-based architecture. They further enhanced their system by applying data filtering on long sentences and misaligned audio/text, data augmentation with pseudo labeling, and punctuation normalization. They also incorporated contrastive learning objectives.

3.5 Automatic Evaluation

We rank the system performance based on BLEU scores. The detailed results can be found in Appendix B.2.

3.5.1 Speech-to-Text

English-German. On the Common test set, the ranking is HW-TSC, CUNI-KIT, FBK, NAIST, CMU, as shown in Table 17. Meanwhile, on the Non-Native test set, the ranking differs considerably. While HW-TSC performs best on the Common test set, they end up second to last on Non-Native. The situation is reversed for NAIST and CMU, who end up at the tail of the Common scoring but reach the best scores on the Non-Native set. We attribute this to the better robustness of NAIST and CMU towards the noise in the Non-Native test set.

English-Chinese. The ranking is HW-TSC, CUNI-KIT, Xiaomi, NAIST, as shown in Table 18.

English-Japanese. The ranking is HW-TSC, CUNI-KIT, NAIST, as shown in Table 19.

3.5.2 Speech-to-Speech

Despite the great novelty and difficulty of the speech-to-speech track, there are 5 submissions in total: 2 in German, 2 in Chinese and 1 in Japanese. The full results can be seen in Table 20. For English-to-German, the ranking is CMU, HW-TSC. For English-to-Chinese, HW-TSC is the only participant. For English-to-Japanese, the ranking is HW-TSC, NAIST.

We also provide the BLASER scores, which directly predict the quality of translations based on speech embeddings. We note that since reference audios are not available in our datasets, we use text LASER (Heffernan et al., 2022) to embed the reference text to compute the scores. While the BLASER scores indicate the same quality ranking for English to German as the BLEU scores, on the Japanese output they are similar. It is possible that BLASER is adequately developed on Japanese outputs.

3.6 Human Evaluation

In the Simultaneous task, speech-to-text track, English-German and English-Japanese were manually evaluated, each with a different scoring method.

3.6.1 English-German

For English-to-German, we used the same human evaluation method as last year, originally inspired by Javorský et al. (2022). We evaluated (1) the best system selected by BLEU score, and (2) the transcription of human interpretation, the same as used in last year's evaluation (more details can be found in Anastasopoulos et al. (2022a), Section 2.6.1).

Figure 1 plots automatic and manual evaluation in relation to each other. We confirm the generally good correlation with BLEU (Pearson .952 across the two test set parts), as observed by Macháček et al. (2023), although individual system results are rather interesting this year.
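As a small, hedged illustration of the system-level correlation check mentioned above, the following sketch computes the Pearson correlation between automatic (BLEU) and manual scores across system/test-set pairs. The score values are made-up placeholders, not actual results from this campaign.

```python
# Minimal sketch of a system-level BLEU-vs-manual correlation check.
# The numbers below are illustrative placeholders only.
from scipy.stats import pearsonr

bleu_scores = [31.2, 29.8, 28.5, 27.9, 26.4]   # hypothetical per-system BLEU
manual_scores = [4.1, 4.2, 4.0, 3.6, 3.4]       # hypothetical manual ratings

r, p = pearsonr(bleu_scores, manual_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```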
Figure 1: Manual and automatic evaluation of Simultaneous speech-to-text English-to-German translation on the Common (TED talks) and Non-Native test sets. The error bars were obtained by bootstrap resampling; see the caption of Table 22.

On the Common test set, HW-TSC performed best in terms of BLEU, but the manual scoring seems to prefer CUNI-KIT and FBK. CMU and NAIST are worst in BLEU but on par with HW-TSC in terms of manual scores.

The situation is very different on the Non-Native test set: CMU and NAIST score best both in manual scores and in BLEU, while CUNI-KIT and especially FBK get much worse scores, again both manual and automatic.

The Non-Native test set is substantially harder with respect to sound conditions, and the striking drop observed for both CUNI-KIT and FBK can be an indication of some form of overfitting towards the clean input of Common (TED talks).

Appendix A.1.1 presents details of the human evaluation, and results are shown in Table 22.

3.6.2 English-Japanese

For English-to-Japanese, we also followed last year's methodology. We hired a professional interpreter for human evaluation using the JTF Translation Quality Evaluation Guidelines (JTF, 2018) based on Multidimensional Quality Metrics (MQM; Lommel et al., 2014). We applied the error weighting by Freitag et al. (2021a). Appendix A.1.2 presents details of the human evaluation.

The human evaluation results are shown in Table 23. The error score almost correlates with BLEU against the additional reference, but the difference in the error scores was very small between HW-TSC and CUNI-KIT in spite of the 0.8 BLEU difference.

3.7 Final remarks

This year, we simplified the conditions by focusing solely on low-latency systems to reduce the burden of submission and evaluation. We also introduced the novel and challenging speech-to-speech track, and were happy to receive 5 submissions.

We note potential modifications for future editions:

• Providing a further simplified submission format.

• Ranking with better-designed metrics to address the overfitting towards BLEU scores.

• Aligning more with the offline task on more test domains and evaluation metrics.

4 Automatic Subtitling

In recent years, the task of automatically creating subtitles for audiovisual content in another language has gained a lot of attention, as we have
seen a surge in the amount of movies, series and user-generated videos which are being streamed and distributed all over the world.

For the first time, this year IWSLT proposed a specific track on automatic subtitling, where participants were asked to generate subtitles of audio-visual documents belonging to different domains with increasing levels of complexity.

4.1 Challenge

The task of automatic subtitling is multi-faceted: starting from speech, not only does the translation have to be generated, but it must be segmented into subtitles compliant with constraints that ensure a high-quality user experience, like a proper reading speed, synchrony with the voices, the maximum number of subtitle lines and characters per line, etc. Most audio-visual companies define their own subtitling guidelines, which can differ slightly from each other. Participants were asked to generate subtitles according to some of the tips listed by TED, in particular:

• the maximum subtitle reading speed is 21 characters / second;

• lines cannot exceed 42 characters, white spaces included;

• never use more than two lines per subtitle.

It was expected that participants used only the audio track from the provided videos (dev and test sets), the video track being of low quality and provided primarily as a means to verify time synchronicity and other aspects of displaying subtitles on screen.

The subtitling track requires automatically subtitling, in German and/or Spanish, audio-visual documents where the spoken language is always English, and which were collected from the following sources:

• TED talks from the MuST-Cinema [8] corpus;

• press interviews from the Multimedia Centre of the European Parliament (EPTV) [9];

• physical training videos offered by Peloton [10];

• TV series from ITV Studios. [11]

4.2 Data and Metrics

Data. This track proposed two training conditions to participants: constrained, in which only a pre-defined list of resources is allowed, and unconstrained, without any data restrictions. The constrained setup allowed the use of the same training data as in the Offline Speech Translation task (see Section 2.2 for the detailed list), with the obvious exclusion of the parallel resources not involving the English-{German, Spanish} pairs. In addition, two monolingual German and Spanish text corpora built on OpenSubtitles, enriched with subtitle breaks, document meta-info on genre and automatically predicted line breaks, have been released.

For each language and domain, a development set and a test set were released. Table 4 provides some information about these sets.

Domain    Set    AV docs   hh:mm   Ref. subtitles (de)   Ref. subtitles (es)
TED       dev    17        04:11   4906                  4964
TED       test   14        01:22   1375                  1422
EPTV      dev    12        01:03   960                   909
EPTV      test   10        01:01   891                   874
Peloton   dev    9         03:59   4508                  4037
Peloton   test   8         02:43   2700                  2661
ITV       dev    7         06:01   4489                  4763
ITV       test   7         05:08   4807                  4897

Table 4: Statistics of the dev and test sets for the subtitling task.

The evaluation was carried out from three perspectives, subtitle quality, translation quality and subtitle compliance, through the following automatic measures:

• Subtitle quality vs. reference subtitles:
  - SubER, primary metric, used also for ranking (Wilken et al., 2022) [12];
  - Sigma (Karakanta et al., 2022b) [13].

• Translation quality vs. reference translations:
  - BLEU [14] and CHRF [15] via sacreBLEU;
  - BLEURT (Sellam et al., 2020).

[8] https://ict.fbk.eu/must-cinema
[9] https://multimedia.europarl.europa.eu
[10] https://www.onepeloton.com
[11] https://www.itvstudios.com
[12] https://github.com/apptek/SubER
[13] https://github.com/fyvo/EvalSubtitle
[14] sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
[15] sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
Automatic subtitles are realigned to the reference subtitles using mwerSegmenter (Matusov et al., 2005a) [16] before running sacreBLEU and BLEURT.

• Subtitle compliance: [17]
  - rate of subtitles with a reading speed higher than 21 char/sec (CPS);
  - rate of lines longer than 42 char, white spaces included (CPL);
  - rate of subtitles with more than two lines (LPB).

[16] https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
[17] https://github.com/hlt-mt/FBK-fairseq/blob/master/examples/speech_to_text/scripts/subtitle_compliance.py

4.3 Submissions

Three teams submitted automatically generated subtitles for the test sets of this task.

• AppTek (Bahar et al., 2023) submitted runs in the constrained setup for both language pairs. The primary submissions came from a cascade architecture composed of the following modules: a neural encoder-decoder ASR, followed by a neural machine translation model trained on the data allowed in the constrained track, with the source (English) side lowercased and normalized to resemble raw ASR output, as well as adapted to the IWSLT subtitling domains, followed by a subtitle line segmentation model (intelligent line segmentation by AppTek). A contrastive run was generated for the en→de pair only by a direct speech translation system with CTC-based timestamp prediction, followed by the intelligent line segmentation model of AppTek. The system was trained on the constrained allowed data plus forward-translated synthetic data (translations of allowed ASR transcripts) and synthetic speech data for selected sentences from the allowed parallel data. For the en→de pair, AppTek also submitted a run in the unconstrained setup, where a cascade architecture was employed consisting of: a neural encoder-decoder CTC ASR, followed by a neural punctuation prediction model and an inverse text normalization model, followed by an MT model adapted to the IWSLT domains (sentences similar in embedding similarity space to the development sets of the four domains TED, EPTV, ITV, Peloton), followed by a subtitle line segmentation model (intelligent line segmentation by AppTek).

• FBK (Papi et al., 2023) submitted primary runs for the two language pairs, generated by a direct neural speech translation model, trained in the constrained setup, that works as follows: i) the audio is fed to a Subtitle Generator that produces the (un-timed) subtitle blocks; ii) the computed encoder representations are passed to a Source Timestamp Generator to obtain the caption blocks and their corresponding timestamps; iii) the subtitle timestamps are estimated by the Source-to-Target Timestamp Projector from the generated subtitles, captions, and source timestamps.

• Matesub (Perone, 2023) submitted primary runs for the two language pairs, automatically generated by the back-end subtitling pipeline of Matesub, the web-based tool that supports professionals in the creation of high-quality subtitles (https://matesub.com/). The Matesub subtitling pipeline is based on a cascade architecture, composed of ASR, text segmenter and MT neural models, which allows covering any pair from about 60 languages and their variants, including the two language pairs of the task. Since Matesub is production software, its neural models are trained on more resources than those allowed for the constrained condition; therefore the submissions fall into the unconstrained setup.

4.4 Results

Scores of all runs as computed by automatic metrics are shown in Tables 24 and 25 in the Appendix. Averaged over the 4 domains, AppTek achieved the lowest SubER scores with their primary submission for en→de in both the constrained and unconstrained condition, with the overall best results for the latter. For en→es, Matesub obtained the overall lowest SubER with their unconstrained system.

We observe that in terms of domain difficulty, the TV series (from ITV) pose the most challenges for automatic subtitling. This has to do with the diverse acoustic conditions in which speech is found in movies and series - background music, noises,
shouts, and cross-talk. All of this makes the task of recognizing speech quite challenging, which results in error accumulation in the downstream components. The unconstrained systems by AppTek and Matesub perform significantly better on this domain, which shows the importance of training on additional data that is more representative of real-life content.

The second-hardest domain is the fitness videos from Peloton. Here, despite generally clear single-speaker audio with reduced background noise, the challenge is the MT: some of the fitness- and sports-specific terminology and slang pose significant challenges in translation to their German and Spanish equivalents.

Surprisingly, even the EPTV interviews pose significant challenges for subtitling, despite the fact that the topics discussed in the interviews are found in abundance in the allowed speech-to-text and text-to-text parallel data for the constrained condition (Europarl, Europarl-ST). Here, issues such as spontaneous speech with many pauses, as well as speaker separation, may have been the cause of some of the errors.

The TED talks, which have been the main domain for the IWSLT evaluations in the past years, are the easiest to subtitle automatically. Whereas the current level of subtitle quality for TED talks may require minimal human corrections or can even be shown unedited on the screen, for the other three domains the automatic subtitles will require significant post-editing. This shows the importance of running evaluations not only under very controlled conditions as in the case of TED talks, but on a variety of real-life content where multiple research challenges in speech translation are yet to be overcome.

This year's direct speech translation systems seem to be too weak to compete with the cascaded approaches. In particular, a full end-to-end approach like the one from FBK that directly generates subtitle boundaries is currently inferior in comparison with the systems that adopt a specific solution for segmenting the text (intelligent line segmentation by AppTek and a neural text segmenter by Matesub). Such specific solutions lead to almost perfect subtitle compliance. But even in terms of pure speech translation quality, as measured e.g. with BLEU and BLEURT, the cascaded systems currently provide better translations, even under constrained training data conditions.

Regarding the automatic metrics used in the evaluation, we observed that the metric Sigma provides scores which are not consistent with the other measures: for example, German subtitles from Matesub seem to be the worst as measured by Sigma, but this is unlikely based on the values of the other metrics. Yet the pure MT quality metrics also exhibit some discrepancies in how the performance of the same system on the four domains is ranked. This ranking sometimes differs depending on whether you choose BLEU, ChrF, or BLEURT as the "primary" metric. The two most striking cases are:

• the en→de AppTek unconstrained primary submission, for which the BLEU score for the ITV test data was 14.43 and for Peloton 10.47, but the BLEURT scores were very similar: 0.4069 and 0.4028;

• the en→de FBK constrained primary system, for which the BLEU score was 7.73 on the Peloton part of the test data vs. 8.05 on the ITV part, but the BLEURT scores showed a better quality for the Peloton translations: 0.3137 vs. 0.2255.

All of these discrepancies highlight the importance of human evaluation, which we have not conducted this time. One of the reasons for this is that in most prior research (Matusov et al., 2019; Karakanta et al., 2022a) the automatic subtitling quality is evaluated in post-editing scenarios, which are too expensive to be run on significant amounts of data as they require professional subtitle translators. On the other hand, as mentioned above, for 3 out of 4 domains the quality of the automatically generated subtitle translations is low, so that an evaluation of user experience when watching subtitles would also be challenging, especially if the users would have to assign evaluation scores to individual subtitles or sentences. With all of this in mind, we decided to postpone any human evaluation to the next edition of the subtitling track at IWSLT.

Overall, this first edition of the subtitling track emphasised the crucial role of the following components related to speech processing: noise reduction and/or speech separation, speaker diarization, and sentence segmentation. So far they have been underestimated in speech translation research. Current automatic solutions do not reach the level of quality that is necessary in subtitling. Therefore, we encourage further research
into these areas, for which subtitle translation is a good test case.

5 Multilingual SLT

The NLP and speech communities are rapidly expanding with increasing focus on broader language coverage and multilinguality. However, despite the community's efforts on ASR and SLT, research is rarely focused on applying these efforts to data within the scientific domain. It is clear from recent initiatives to caption technical presentations at NLP and speech conferences that transcription and translation in the technical domain is needed, desired, and remains a disproportionate challenge for current ASR and SLT models compared to standard datasets in these spaces. Motivated by the ACL 60-60 initiative (https://www.2022.aclweb.org/dispecialinitiative) to translate the ACL Anthology to up to 60 languages for the 60th anniversary of ACL, which will be reported on at this year's ACL conference co-located with IWSLT, this year's Multilingual Task evaluates the ability of current models to translate technical presentations to a set of ten diverse target languages.

5.1 Challenge

Translating technical presentations combines several challenging conditions: domain-specific terminology, recording conditions varying from close-range microphones to laptop microphones with light background noise or feedback, diverse speaker demographics, and, importantly, unsegmented speech typically 10-60 minutes in duration. This task focuses on one-to-many translation from English to ten target languages. Providing English ASR was optional though encouraged. In-domain data is scarce, particularly parallel data, though all language pairs are covered by current publicly available corpora; further challenging for current domain adaptation techniques, monolingual data is typically available for the source language (English) only. We present two conditions: constrained (using only the out-of-domain data allowed and provided for other tasks this year) and unconstrained (allowing any additional data, including crawled data, which may facilitate e.g. domain adaptation). To evaluate submissions, we use evaluation sets curated from presentations at ACL 2022 which were professionally transcribed and translated with the support of ACL and the 60-60 initiative, as described in Salesky et al. (2023).

5.2 Data and Metrics

Data. We use the ACL 60-60 evaluation sets created by Salesky et al. (2023) to evaluate this challenge task. The data comes from ACL 2022 technical presentations and is originally spoken in English, and then transcribed and translated to ten target languages from the 60/60 initiative: Arabic, Mandarin Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and Turkish. The resulting dataset contains parallel speech, transcripts, and translations for ten language pairs, totaling approximately one hour for the development set and one hour for the evaluation set.

During the evaluation campaign, the only in-domain data provided is the development set. To simulate the realistic use case where recorded technical presentations would be accompanied by a research paper, in addition to the talk audio we provide the corresponding paper title and abstract, which are likely to contain a subset of relevant keywords and terminology and could be used by participants to bias or adapt their systems. Constrained training data follows the Offline task (see Sec. 2.2), with pretrained models and out-of-domain parallel speech and text provided for all 10 language pairs. The unconstrained setting allowed participants to potentially crawl additional in-domain data to assist with adaptation, as was done by one team (JHU). For the official rankings, we use the official evaluation set, which was held blind until after the evaluation campaign.

To mimic realistic test conditions where audio for technical presentations would be provided as a single file, rather than gold-sentence-segmented, for both the development and evaluation sets we provided the full unsegmented wav files, as well as an automatically generated baseline segmentation using SHAS (Tsiamas et al., 2022) to get participants started. Two teams used the baseline segmentation, while one (JHU) used longer segments, which improved the ASR quality of their particular pretrained model. To evaluate translation quality of system output using any input segmentation, we provided gold sentence-segmented transcripts and translations, against which system output could be scored as described below in 'Metrics.'
Metrics. Translation output was evaluated using multiple metrics for analysis: translation output using chrF (Popović, 2015a), BLEU (Papineni et al., 2002b) as computed by SacreBLEU (Post, 2018), and COMET (Rei et al., 2020b), and ASR output using WER. For BLEU we use the recommended language-specific tokenization in SacreBLEU for Chinese, Japanese, and Korean, and the metric default otherwise. Translation metrics were calculated with case and punctuation. WER was computed on lowercased text with punctuation removed. NFKC normalization was applied to submitted systems and references. All official scores were calculated using automatic resegmentation of the hypothesis based on the reference transcripts (ASR) or translations (SLT) by mwerSegmenter (Matusov et al., 2005b), using character-level segmentation for resegmentation for those languages which do not mark whitespace. The official task ranking is based on average chrF across all 10 translation language pairs.
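To make this scoring setup concrete, the following is a minimal sketch of how the translation and ASR metrics described above could be computed with the sacrebleu and jiwer Python packages, assuming system output has already been resegmented to the reference sentences (e.g. with mwerSegmenter); the example segments are invented placeholders, not the official scoring scripts.

```python
# Hedged sketch of the metric computation described above (not the official
# evaluation code). Assumes hypotheses were already resegmented to the
# reference segmentation.
import string
import sacrebleu
import jiwer

# Toy resegmented system output and gold sentence-segmented references.
hyps = ["Wir stellen ein neues Modell für Sprachübersetzung vor."]
refs = ["Wir stellen ein neues Modell für die Sprachübersetzung vor."]

# Translation metrics keep case and punctuation; for Chinese/Japanese/Korean
# targets a language-specific tokenizer ("zh", "ja-mecab", "ko-mecab") would
# be passed to BLEU instead of the default.
bleu = sacrebleu.corpus_bleu(hyps, [refs])   # default "13a" tokenizer
chrf = sacrebleu.corpus_chrf(hyps, [refs])   # chrF, the official ranking metric
print(f"BLEU={bleu.score:.2f} chrF={chrf.score:.2f}")

# ASR: WER on lowercased text with punctuation removed.
def normalize(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation))

asr_refs = ["We introduce a new model for speech translation."]
asr_hyps = ["we introduce a new model for speech translation"]
wer = jiwer.wer([normalize(r) for r in asr_refs],
                [normalize(h) for h in asr_hyps])
print(f"WER={wer:.3f}")
```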
5.3 Submissions

We received 11 submissions from 3 teams, as described below:

• BIT (Wang et al., 2023b) submitted a single constrained one-to-many multilingual model to cover all 10 language pairs, trained using a collection of multiple versions of the MuST-C dataset (Di Gangi et al., 2019b). They use English ASR pre-training with data augmentation from SpecAugment (Park et al., 2019), and multilingual translation finetuning for all language pairs together. The final model is an ensemble of multiple checkpoints. No adaptation to the technical domain is performed.

• JHU (Xinyuan et al., 2023) submitted two cascaded systems, one constrained and one unconstrained, combining multiple different pretrained speech and translation models and comparing different domain adaptation techniques. Their unconstrained system uses an adapted Whisper (Radford et al., 2022) ASR model combined with NLLB (NLLB Team et al., 2022), M2M-100 (Fan et al., 2020), or mBART-50 (Tang et al., 2020) MT models depending on the language pair, while the constrained system uses wav2vec 2.0 (Baevski et al., 2020a) and mBART-50 or M2M-100. They compare using talk abstracts to prompt Whisper against training in-domain language models on either the small amount of highly relevant data in the talk abstract or larger LMs trained on significantly more data they scraped from the ACL Anthology and release with their paper. They see slight improvements over the provided SHAS (Tsiamas et al., 2022) segments by using longer segments closer to what Whisper observed in training. They show that prompting Whisper is not competitive with in-domain language models, and provide an analysis of technical term recall and other fine-grained details.

• KIT (Liu et al., 2023) submitted multiple constrained multilingual models, both end-to-end and cascaded, which combine several techniques to adapt to the technical domain given the absence of in-domain training data, using pretrained speech and translation models as initializations (WavLM: Chen et al. 2021, DeltaLM: Ma et al. 2021, mBART-50: Tang et al. 2020). These include kNN-MT to bias generated output to the technical domain; data diversification to enrich provided parallel data; adapters for lightweight finetuning to the language pairs for translation (though they note that this does not necessarily stack with data diversification); and, for their cascaded model, adaptation of the ASR model to the target technical domain using n-gram re-weighting, noting that it is typically easier to adapt or add lexical constraints to models with separate LMs, as opposed to encoder-decoder models. Additional techniques (ensembling, updated ASR encoder/decoder settings, knowledge distillation, synthesized speech) are also used for further small improvements.

5.4 Results

All task results are shown in Appendix B.4. The official task ranking was determined by the average chrF across all 10 target languages after resegmentation to the reference translations (Table 26). Scores for all submissions by individual language pairs are shown in Table 28 (chrF), Table 29 (COMET), and Table 30 (BLEU).

Overall, the majority of approaches combined strong pretrained speech and translation models to do very well on the ACL 60-60 evaluation data. For this task, cascaded models performed consistently better than direct/end-to-end approaches; all of the top 6 submissions were cascades, and 4 of the 5 lowest-performing systems were direct. Optional English ASR transcripts were submitted for 3 systems (JHU-unconstrained, KIT-primary, JHU-constrained), all of which were cascades; we see that WER aligns with speech translation performance in these cases. The only unconstrained model, from JHU, utilized larger pretrained models and crawled in-domain language modeling data for ASR to great success, and was the top system on all metrics (Table 26). The remaining submissions were all constrained (here meaning that they used the white-listed training data and smaller pretrained models). The KIT-primary system was the best performing constrained model. While BIT trained models from scratch on TED to reasonable performance on MuST-C, large pretrained models and domain adaptation were key for high performance on the technical in-domain test set. chrF and BLEU result in the same system rankings, while COMET favors the end-to-end models slightly more, though this does not affect the top 3 systems (JHU-unconstrained, KIT-primary, KIT-contrastive1).

Domain adaptation techniques had a consistent positive impact on system performance. The KIT team submitted constrained systems only and thus were limited to the dev bitext and talk abstracts for domain adaptation. Despite its small size (<500 sentences), they were able to generate consistent improvements of up to ∼1 chrF and ∼1 BLEU using kNN-MT (primary/contrastive1 vs contrastive2); with this method, extending the dev data to include the abstracts for the evaluation set talks (primary vs contrastive1) had negligible effect on all 3 metrics. The JHU submissions saw that decoding with interpolated in-domain language models outperformed knowledge distillation or prompting pretrained models with information for each talk in this case; small talk-specific LMs did provide slight improvements in WER, but significant improvements of 2-3 WER were gained by extending the limited highly relevant data from talk abstracts and the dev set to the larger domain-general data crawled from the 2021 ACL conference and workshop proceedings.

Without in-domain target-language monolingual data, conventional techniques for adaptation of end-to-end ST models did not apply (finetuning, backtranslation, ...). The data diversification applied by KIT via TTS 'backtranslation' (contrastive5, contrastive7) did not affect chrF or BLEU, but did provide small (0.5-0.6) improvements on COMET.

In addition to the overall evaluation set, we look at the recall of specific terminology annotated for the ACL evaluation sets. For the three submissions (JHU-unconstrained, KIT-primary, JHU-constrained) which provided supplementary ASR, we first investigate terminology recall and propagation between ASR and downstream ST. Recall that the overall WER of these systems was 16.9, 23.7, and 34.1, respectively. Of the 1107 labeled terminology words and phrases from the ACL 60-60 evaluation set annotations, 87.8% / 77.3% / 71.7% of individual instances were correctly transcribed by these systems, respectively. Of these, 12.0% / 7.4% / 7.9% were then maintained and correctly translated to each target language respectively on average. We plot the official task metric (chrF) against terminology recall in Figure 2 for all primary submissions. We see that there were consistent differences across languages in how terminology was maintained, which generally but not fully corresponds to overall performance (e.g. Dutch, Turkish). While the domain adaptation techniques used ensured strong transcription performance for the JHU and KIT submissions, this was not generally maintained for translation, with a significant drop, converging with BIT, which did not perform domain adaptation.

Figure 2: Official task metric performance (chrF) vs terminology recall for teams' primary submissions (JHU-unconstrained, KIT-primary, JHU-constrained, BIT), per target language (ar, de, fa, fr, ja, nl, pt, ru, tr, zh).
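As an illustration of how such terminology recall can be measured, the sketch below counts how many annotated terms appear verbatim in the ASR output; the term list and transcript are invented, and the actual ACL 60-60 annotations and matching rules may be more elaborate.

```python
# Hedged sketch of terminology recall over ASR output (illustrative only;
# the real annotation format and matching rules may differ).
def terminology_recall(transcripts, terms):
    """Fraction of annotated terms found verbatim in the ASR transcripts."""
    text = " ".join(t.lower() for t in transcripts)
    found = sum(1 for term in terms if term.lower() in text)
    return found / len(terms) if terms else 0.0

transcripts = ["we finetune a pretrained transformer encoder for speech translation"]
terms = ["transformer", "speech translation", "beam search"]
print(f"terminology recall = {terminology_recall(transcripts, terms):.2%}")  # 66.67%
```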
Additional work is needed to ensure targeted lexical terms are correctly transcribed and translated, both in general as well as comparably across different languages.

While the JHU submissions finetuned to each target language individually, the KIT systems finetuned multilingually; no contrastive systems were submitted with which to ablate this point, but both teams' papers describe consistently worse performance when finetuning multilingually rather than bilingually, which KIT was able to largely mitigate with language adapters in development in isolation, though in their final submission on the evaluation set language adapters were consistently slightly worse (contrastive4 'with' vs contrastive3 'without'). The degree to which one-to-many models can benefit from multilingual training remains to be seen.

The Offline task additionally used the ACL 60-60 evaluation sets as part of its broader evaluation for 3 language pairs (en→de, ja, zh), enabling a wider comparison across 25 total systems. We show the Multilingual task submissions compared to the Offline submissions on these languages in Table 27. On these three language pairs, performance is generally higher than on the remaining language pairs in the Multilingual task. We again consistently see stronger performance on this task from cascaded models, and from unconstrained submissions or those with larger pretrained LLMs, though there are notable outliers such as the HW-TSC constrained model. The Offline submissions did not perform domain adaptation specifically to the technical ACL domain, but appear to benefit from better domain-general performance in some cases, particularly for submissions targeting only Chinese. We note slight differences in system rankings between metrics (COMET and BLEU) and target languages, particularly for Japanese and Chinese targets, possibly highlighting the difference in metric tokenization for these pairs.

6 Speech-to-Speech Translation

Speech-to-speech translation (S2ST) involves translating audio in one language to audio in another language. In the offline setting, the translation system can assume that the entire input audio is available before beginning the translation process. This differs from streaming or simultaneous settings, where the system only has access to partial input. The primary objective of this task is to encourage the advancement of automated methods for offline speech-to-speech translation.

6.1 Challenge

The participants were tasked with creating speech-to-speech translation systems that translate from English to Chinese using various methods, such as a cascade system (ASR + MT + TTS, or end-to-end speech-to-text translation + TTS) or an end-to-end / direct system. They were also allowed to use any techniques to enhance the performance of the system, apart from using unconstrained data.

6.2 Data and Metrics

Data. This task allowed the same training data as the Offline task on English-Chinese speech-to-text translation. More details are available in Sec. 2.2. In addition to the Offline task data, the following training data was allowed to help build English-Chinese speech-to-speech models and Chinese text-to-speech systems:

• GigaS2S, target synthetic speech for the Chinese target text of GigaST (Ye et al., 2023), generated with an in-house single-speaker TTS system;

• AISHELL-3 (Shi et al., 2020), a multi-speaker Chinese TTS dataset.

It is worth noting that several datasets allowed for the Offline task, such as Common Voice (Ardila et al., 2019), already contain multi-speaker Chinese speech and text data that could help for this task.

Metrics. All systems were evaluated with both automatic and human evaluation metrics.

Automatic metrics. To automatically evaluate translation quality, the speech output was automatically transcribed with a Chinese ASR system (Yao et al., 2021; https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md), and then BLEU (Papineni et al., 2002a), chrF (Popović, 2015b), COMET (Rei et al., 2022; https://huggingface.co/Unbabel/wmt22-comet-da) and SEScore2 (Xu et al., 2022; https://github.com/xu1998hz/SEScore2) were computed between the generated transcript and the human-produced text reference. BLEU and chrF were computed using SacreBLEU (Post, 2018), with the signatures nrefs:1|case:mixed|eff:no|tok:zh|smooth:exp|version:2.3.1 (BLEU) and nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1 (chrF).
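For reference, the two signatures above correspond roughly to the following metric configurations in the sacrebleu Python API; this is a hedged sketch with invented example strings, not the organizers' scoring script.

```python
# Sketch of metric objects matching the printed signatures
# (BLEU: tok:zh, smooth:exp; chrF: nc:6, nw:0).
from sacrebleu.metrics import BLEU, CHRF

asr_transcripts = ["今天天气很好"]   # ASR transcript of the generated Chinese speech
references = [["今天天气很好"]]      # human text reference (one reference stream)

bleu = BLEU(tokenize="zh", smooth_method="exp")
chrf = CHRF(char_order=6, word_order=0)

print(bleu.corpus_score(asr_transcripts, references).score)
print(chrf.corpus_score(asr_transcripts, references).score)
```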
Furthermore, the output speech could be evaluated directly using BLASER (Chen et al., 2022). More information can be found in stopes (Andrews et al., 2022; https://github.com/facebookresearch/stopes/tree/main/demo/iwslt_blaser_eval).

Human evaluation. Output speech translations were evaluated with respect to translation quality and speech quality.

• Translation quality: Bilingual annotators were presented with the source audio, source transcript and the generated target audio, and then gave scores on the translation quality between 1 and 5 (worst-to-best). There were 4 annotators per sample and we retained the median score.

• Output speech quality: In addition to translation quality (capturing meaning), the quality of the speech output was also human-evaluated. The annotators were requested to give an overall score by considering three dimensions: naturalness (voice and pronunciation), clarity of speech (understandability), and sound quality (noise and other artifacts). Each sample was assessed by 4 annotators and scored on a scale of 1-5 (worst-to-best), with a minimum score interval of 0.5.

The detailed guidelines for output speech quality evaluation were similar to last year's (Anastasopoulos et al., 2022a).

6.3 Submissions

We received eight submissions from five teams. The MineTrans team submitted four systems and each of the other teams submitted one system.

• HW-TSC (Wang et al., 2023a) submitted a cascaded system composed of an ensemble of Conformer and Transformer-based ASR models, a multilingual Transformer-based MT model and a diffusion-based TTS model. The primary focus of their submission is to investigate the modeling ability of the diffusion model for TTS tasks in high-resource scenarios. The diffusion TTS model takes raw text as input and generates the waveform by iteratively denoising pure Gaussian noise. Based on the results, they conclude that the diffusion model outperforms normal TTS models and brings a positive gain to the entire S2ST system.

• KU (Yang et al., 2023) submitted a cascade system composed of a speech-to-text translation (ST) model and a TTS model. Their ST model comprises an ST decoder and an ASR decoder. The two decoders can exchange information with each other through an interactive attention mechanism. For the TTS part, they use FastSpeech2 as the acoustic model and HiFi-GAN as the vocoder.

• NPU-MSXF (Song et al., 2023) submitted a cascaded system of separate ASR, MT, and TTS models. For ASR, they adopt ROVER-based model fusion and data augmentation strategies to improve the recognition accuracy and generalization ability. They then use a three-stage fine-tuning process to adapt a pre-trained mBART50 model to translate the output of the ASR model. The three-stage fine-tuning is based on curriculum learning and involves three sets of data: (1) the original MT data, (2) the MT data in ASR transcription format and (3) the ASR outputs. For TTS, they leverage a two-stage framework, using network bottleneck features as a robust intermediate representation for disentangling speaker timbre and linguistic content. Based on this two-stage framework, a pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre of the source speech to the translated speech.

• Xiaomi (Huang et al., 2023) submitted a cascade system composed of a speech-to-text translation (ST) model and a TTS model. The ST model is the same as the one they submitted to the Offline SLT track. It is based on an encoder-decoder architecture built from the pre-trained HuBERT and mBART models. For the TTS model, they use the Tacotron2 framework. It is first trained with the AISHELL-3 dataset and then finetuned with the GigaS2S dataset. Furthermore, they implement several popular techniques, such as data filtering, data augmentation, speech segmentation, and model ensembling, to improve the overall performance of the system.
• MineTrans (Du et al., 2023) submitted three end-to-end S2ST systems (MineTrans E2E, including primary, contrastive1, and contrastive2) and a cascade S2ST system (MineTrans Cascade). Their end-to-end systems adopt the speech-to-unit translation (S2UT) framework. The end-to-end S2UT model comprises a speech encoder, a length adapter and a unit decoder. The S2UT model is trained to convert the source speech into units of the target speech. A unit-based HiFi-GAN vocoder is finally applied to convert the units into a waveform. Based on their results, they conclude that the widely used multi-task learning technique is not important for model convergence once large-scale labeled training data is available, which means that the mapping from source speech to target speech units can be learned directly and easily. Furthermore, they apply other techniques, such as consistency training, data augmentation, speech segmentation, and model ensembling, to improve the overall performance of the system. Their cascade system consists of ASR, MT and TTS models. Their ASR and MT models replicate those used for the Offline SLT submission. Their TTS model is a combination of FastSpeech2 and HiFi-GAN.

6.4 Results

Results as scored by automatic metrics are shown in Table 31 and human evaluation results are shown in Table 32 in the Appendix.

Overall results. According to the automatic metrics used in the evaluation, Xiaomi obtained the highest score in ASR-BLEU, ASR-chrF, ASR-COMET and ASR-SEScore2. NPU-MSXF obtained the second highest score, followed subsequently by HW-TSC, MineTrans E2E, KU and MineTrans Cascade. The BLEU, chrF, COMET and SEScore2 rankings were exactly the same. The scores for the test-expanded data were lower than those for the test-primary data, likely due to a domain mismatch with the training data. For human evaluation along the translation quality perspective, Xiaomi obtained the highest score, followed by NPU-MSXF, then HW-TSC and MineTrans E2E, then MineTrans Cascade, and finally KU. This ranking was mostly consistent with the automatic ranking, showing that automatic metrics were useful in evaluating the translation quality of systems. For human evaluation along the speech quality perspective, NPU-MSXF obtained the highest score, followed by HW-TSC, Xiaomi, MineTrans E2E, MineTrans Cascade and KU. With an equal weighting of translation quality and speech quality, NPU-MSXF obtained the highest overall score in human evaluation, followed by Xiaomi and the others.

S2ST approaches. This year, all systems but MineTrans E2E were cascaded systems, with three systems adopting an ASR + MT + TTS approach and two systems adopting an end-to-end S2T + TTS approach. This showed that the cascade approach was still dominant in the community. Although MineTrans E2E performed better than MineTrans Cascade in all evaluation metrics, we could not draw conclusions on the comparison between cascade and end-to-end approaches given the limited data points. Future challenges can encourage more direct or end-to-end submissions.

6.5 Conclusion

This is the second time that speech-to-speech translation (S2ST) has been presented as one of the IWSLT tasks. S2ST is an important benchmark for general AI, as other NLP tasks, e.g. dialogue systems, question answering and summarization, can also be implemented in a speech-to-speech manner. Compared to the setting last year, the size of the training data set available to the participants is much larger. The BLEU scores obtained in this challenge are high in general, compared to MT and ST for the same language direction. Although not required by the task, NPU-MSXF is the only team that implemented speaker timbre transfer in their system. We plan to include evaluation metrics addressing this aspect in the next edition.

7 Dialect SLT

The Dialect Speech Translation shared task is a continuation of last year's task. We use the same training data as in 2022 and evaluated systems on the 2022 evaluation set to measure progress; in addition, we added a new 2023 evaluation set as a blind test. From the organizational perspective, we merged the call for this shared task with the Low-Resource tasks (Section 8) in order to encourage cross-submission of systems.
7.1 Challenge

Diglossic communities are common around the world. For example, Modern Standard Arabic (MSA) is used for formal spoken and written communication in most parts of the Arabic-speaking world, but local dialects such as Egyptian, Moroccan, and Tunisian are used in informal situations. Diglossia poses unique challenges to speech translation because local "low" dialects tend to be low-resource, with little ASR and MT training data, and may not even have a standardized writing system, while resources from "high" dialects like MSA provide opportunities for transfer learning and multilingual modeling.

7.2 Data and Metrics

Participants were provided with the following datasets:

• (a) 160 hours of Tunisian conversational speech (8kHz), with manual transcripts;

• (b) 200k lines of manual translations of the above Tunisian transcripts into English, making a three-way parallel dataset (i.e. aligned audio, transcript, translation) that supports end-to-end speech translation models;

• (c) 1200 hours of Modern Standard Arabic (MSA) broadcast news with transcripts for ASR, available from MGB-2;

• (d) approximately 42,000k lines of bitext in MSA-English for MT from OPUS (specifically: OpenSubtitles, UN, QED, TED, GlobalVoices, News-Commentary).

In 2022, we constructed three conditions: the basic condition trains on (a) and (b), provided by the Linguistic Data Consortium (LDC); the dialect adaptation condition trains on (a), (b), (c), (d); the unconstrained condition can use any additional data and pre-trained models. In 2023, due to the coordinated organization with the other Low-Resource tasks this year, we renamed the basic condition the "constrained condition", and the other two conditions were merged into the "unconstrained condition".

All train and test sets are time-segmented at the utterance level. Statistics are shown in Table 5. There are three test sets for evaluation with BLEU (SacreBLEU signature for the dialect speech translation task: nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0):

• test1: Participants are encouraged to use this for internal evaluation since references are provided. This is part of LDC2022E01, released to participants for training and development, obtained by applying the standard data split and preprocessing (https://github.com/kevinduh/iwslt22-dialect).

• test2: official evaluation for 2022, from LDC2022E02.

• test3: official evaluation for 2023, from LDC2023E09.

7.3 Submissions

We received submissions from four teams:

• GMU (Mbuya and Anastasopoulos, 2023) participated in five language pairs in the Low-Resource tasks as well as this task. They focused on investigating how different self-supervised speech models (wav2vec 2.0, XLSR-53, and HuBERT) compare when used to initialize an end-to-end (E2E) speech translation architecture.

• JHU (Hussein et al., 2023) submitted both cascaded and E2E systems, using transformer and branchformer architectures. They investigated the incorporation of pretrained text MT models, specifically mBART50 and distilled NLLB-200. Further, they explored different ways of combining systems and of handling orthographic variation and channel mismatch.

• ON-TRAC (Laurent et al., 2023) participated in two language pairs in the Low-Resource task as well as this task. For this task, they focused on using SAMU-XLS-R as the multilingual, multimodal pretrained speech encoder and mBART as the text decoder.

• USTC (Deng et al., 2023) proposed a method for synthesizing pseudo Tunisian-MSA-English paired data. For the cascaded system, they explored ASR with different feature extraction (VGG, GateCNN) and neural architectures (Conformer, Transformer). For E2E, they proposed using SATE and a hybrid SATE architecture to take advantage
Dataset              Speech (#hours)   Text (#lines)                          Use
                                       Tunisian    MSA      English
LDC2022E01 train     160               200k        -        200k              Constrained condition
LDC2022E01 dev       3                 3833        -        3833              Constrained condition
LDC2022E01 test1     3                 4204        -        4204              Participant's internal evaluation
LDC2022E02 test2     3                 4288        -        4288              Evaluate progress from 2022
LDC2023E09 test3     3                 4248        -        4248              Official evaluation for 2023
MGB2                 1100              -           1.1M     -                 Unconstrained condition
OPUS                 -                 -           42M      42M               Unconstrained condition
Any other data       -                 -           -        -                 Unconstrained condition

Table 5: Data conditions and test sets for the dialect speech translation task (speech in hours, text in lines).
Language            Pair      Train Set   Dev Set   Test Set   Additional Data
Irish–English       ga–eng    9.46        1.03      0.44       n/a
Marathi–Hindi       mr–hi     15.3        3.7       4.4        monolingual audio with transcriptions (ASR), monolingual text
Maltese–English     mlt–eng   2.5         -         1.35       monolingual audio with transcriptions (ASR), monolingual text
Pashto–French       pus–fra   61          2.5       2          n/a
Tamasheq–French     tmh–fra   17          -         -          untranscribed audio, data in other regional languages
Quechua–Spanish     que–spa   1.60        1.03      1.03       60 hours of monolingual audio with transcriptions (ASR) and MT data (not transcribed)

Table 6: Training, development and test data details (in hours) for the language pairs of the low-resource shared task.
nantly spoken in the state of Maharashtra in India. It is one of the 22 scheduled languages of India and the official language of Maharashtra and Goa. As per the 2011 Census of India (https://censusindia.gov.in/nada/index.php/catalog/42561), it has around 83 million speakers, which covers 6.86% of the country's total population. Marathi is the third most spoken language in India.

The provided Marathi–Hindi corpus consists of 22.33 hours of Marathi speech data (see Table 6) from the news domain, extracted from News On Air (https://newsonair.gov.in) and translated into Hindi text (https://github.com/panlingua/iwslt2023_mr-hi). The dataset was manually segmented and translated by Panlingua (http://panlingua.co.in/). Additionally, the participants were informed that they may use monolingual Marathi audio data (with transcriptions) from Common Voice (Ardila et al., 2020a; https://commonvoice.mozilla.org/en/datasets), as well as the corpus provided by He et al. (2020) (https://www.openslr.org/64/) and the Indian Language Corpora (Abraham et al., 2020; https://www.cse.iitb.ac.in/~pjyothi/indiccorpora/).

Maltese–English Maltese is a Semitic language with about half a million native speakers, spoken in Malta; it is an official language of Malta and of the EU. It is written in Latin script.

The provided data was divided into three parts. First, around 2.5 hours of audio with Maltese transcriptions and an English translation were released, along with about 7.5 hours of audio with only Maltese transcriptions. Last, the participants were directed to several monolingual Maltese textual resources. The provided datasets were taken from the MASRI corpus (Hernandez Mena et al., 2020).

Pashto–French Pashto is spoken by approximately forty to sixty million people in the world. It is particularly spoken by the Pashtun people in the south, east and southwest of Afghanistan, where it is one of the two official languages, as well as in the north and northwest of Pakistan, but also in Iran, Tajikistan and India (Uttar Pradesh and Kashmir).

The corpus was provided in its entirety by ELDA and is available in the ELRA catalogue: the TRAD Pashto Broadcast News Speech Corpus (ELRA catalogue, 2016b), which consists of the audio files, and the TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA catalogue, 2016a), which contains their transcriptions.

This dataset is a collection of about 108 hours of broadcast news with transcriptions in Pashto and translations into French text. The dataset is built from recordings collected from 5 sources: Ashna TV, Azadi Radio, Deewa Radio, Mashaal Radio and Shamshad TV. The original training data contains 99 hours of speech in Pashto, which corresponds to 29,447 utterances translated into French. The training data for this task corresponds to 61 hours of speech (Table 6).

Tamasheq–French Tamasheq is a variety of Tuareg, a Berber macro-language spoken by nomadic
tribes across North Africa, in Algeria, Mali, Niger and Burkina Faso. It accounts for approximately 500,000 native speakers, being mostly spoken in Mali and Niger. This task is about translating spoken Tamasheq into written French. Almost 20 hours of spoken Tamasheq with French translations are freely provided by the organizers. A major challenge is that no Tamasheq transcription is provided, as Tamasheq is a traditionally oral language.

The provided corpus is a collection of radio recordings from Studio Kalangou (https://www.studiokalangou.org/) translated to French. It comprises 17 hours of clean speech in Tamasheq, translated into the French language. The organizers also provided a 19-hour version of this corpus, including 2 additional hours of data that was labeled by annotators as potentially noisy. Both versions of this dataset share the same validation and test sets. Boito et al. (2022a) provide a thorough description of this dataset.

In addition to the 17 hours of Tamasheq audio data aligned to French translations, and in light of recent work on self-supervised models for speech processing, we also provide participants with unlabeled raw audio data in the Tamasheq language, as well as in four other languages spoken in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and Zarma (100 hours). All this data comes from the radio broadcasts of Studio Kalangou and Studio Tamani (https://www.studiotamani.org/).

Note that this language pair is a continuation of last year's shared task. An additional separate test set was provided this year.

Quechua–Spanish Quechua is an indigenous language spoken by more than 8 million people in South America. It is mainly spoken in Peru, Ecuador, and Bolivia, where the official high-resource language is Spanish. It is a highly inflective language based on its suffixes, which agglutinate and are found to be similar to those of other languages like Finnish. The average number of morphemes per word (synthesis) is about two times larger than in English: English typically has around 1.5 morphemes per word and Quechua has about 3 morphemes per word.

There are two main regional divisions of Quechua, known as Quechua I and Quechua II. This data set consists of two main types of Quechua spoken in Ayacucho, Peru (Quechua Chanka, ISO: quy) and Cusco, Peru (Quechua Collao, ISO: quz), which are both part of Quechua II and, thus, considered "southern" languages. We label the data set with que, the ISO norm for Quechua II mixtures.

The constrained setting allowed a Quechua-Spanish speech translation dataset along with additional parallel (text-only) data for machine translation compiled from previous work (Ortega et al., 2020). The audio files for training, validation, and test purposes consisted of excerpts of the Siminchik corpus (Cardenas et al., 2018) that were translated by native Quechua speakers. For the unconstrained setting, participants were directed to another, larger data set from the Siminchik corpus, which consisted of 60 hours of fully transcribed Quechua audio (monolingual).

8.2.1 Metrics

We use standard lowercase BLEU as well as chrF++ to automatically score all submissions. Additional analyses for some language pairs are provided below.

Due to the exceptionally hard setting, which currently leads to generally less competent translation systems, we did not perform a human evaluation of the outputs.

8.3 Submissions

Below we discuss all submissions for all language pairs, given that there were several overlaps. A brief summary per language pair follows:

• Irish–English received four submissions from one team (GMU);

• Marathi–Hindi received submissions from four teams (Alexa AI, BUT, GMU, and SRI-B);

• Maltese–English received five submissions from one team (UM-DFKI);

• Pashto–French received submissions from two teams (GMU, ON-TRAC);

• Tamasheq–French received submissions from four teams (Alexa AI, GMU, NAVER, and ON-TRAC);

• Quechua–Spanish received three submissions (GMU, NAVER, and QUESPA).
Below we discuss each team's submission in detail:

• Alexa AI (Vishnu et al., 2023) submitted one primary and three contrastive systems, all in the unconstrained condition (Table 44), for Tamasheq–French, and one primary and five contrastive systems in the unconstrained condition for Marathi–Hindi. For Marathi–Hindi, their systems relied on an end-to-end speech translation approach, using the wav2vec 2.0 base model finetuned on 960 hours of English speech (Baevski et al., 2020b) as the encoder baseline; it was also finetuned on 94 hours of Marathi audio data. The team focused on evaluating three strategies: data augmentation, an ensemble model, and post-processing techniques. For Tamasheq–French, they reuse the same end-to-end AST model proposed by the ON-TRAC Consortium in last year's IWSLT edition (Boito et al., 2022b). This model consists of a speech encoder that is initialized with the wav2vec 2.0 (Baevski et al., 2020a) base model pre-trained on 243 hours of Tamasheq audio data released by the ON-TRAC Consortium (https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-tamasheq-only). The decoder of this model is a shallow stack of 2 transformer layers with 4 attention heads. A feed-forward layer is placed between the encoder and the decoder to match the dimension of the encoder output with that of the decoder input. In this work, they focus on leveraging different data augmentation techniques, including audio stretching, back translation, paraphrasing, and a weighted loss. Another important endeavor of their work is experimenting with different LLM-based post-processing approaches, such as re-ranking, sentence correction, and token masking. Besides, they also ensemble AST models trained with different seeds and data augmentation methods, which is shown to improve the performance of their systems. Their primary system scores 9.30 BLEU on the 2023 test set.

• BUT (Kesiraju et al., 2023) submitted one primary and one contrastive system using the ESPnet (Inaguma et al., 2021) toolkit. The primary system was built with the end-to-end and bilingual ASR model, while the contrastive one was built as a cascade which uses various backbone models, including ASR, the bilingual ASR, a transformer-based seq2seq MT model, an LM for re-scoring, and XLM.

• GMU (Mbuya and Anastasopoulos, 2023) focused on end-to-end speech translation systems. An end-to-end (E2E) transformer-based encoder-decoder architecture (Vaswani et al., 2017) was used for the primary constrained submission. For the unconstrained submissions, they explored self-supervised pre-trained speech models and used wav2vec 2.0 (Baevski et al., 2020a) and HuBERT (Hsu et al., 2021) for the low-resource task. They used wav2vec 2.0, with the last three layers removed, for their primary submission. HuBERT was used for the contrastive1 submission, without removing any layer. For contrastive2, the end-to-end with ASR (E2E-ASR) architecture uses the same architecture as the E2E one; the difference is that a pre-trained ASR model was used to initialize its encoder.

• ON-TRAC (Laurent et al., 2023) participated in Pashto–French (one primary and three contrastive systems, for both the constrained and unconstrained settings) and Tamasheq–French (one primary and five contrastive systems, all of which are unconstrained, c.f. Table 44). For Pashto–French, the primary cascaded system is based on an upgraded convolutional model (Gehring et al., 2017), while contrastive3 is based on small basic transformers. For the primary and contrastive1 systems, SAMU-XLS-R (Khurana et al., 2022) was used as a pre-trained encoder covering 100 and 53 languages, respectively. The two constrained contrastive E2E systems share the same encoder-decoder architecture using transformers (Vaswani et al., 2017); the difference lies in whether or not a transformer language model trained from scratch on the provided dataset is used.

All of their systems for Tamasheq–French are based on the same end-to-end encoder-decoder architecture. In this architecture,
the encoder is initialized with a pre-trained semantic speech representation learning model named SAMU-XLS-R (Khurana et al., 2022), while the decoder is initialized with the decoder of the pre-trained mBART model. Their work relies heavily on different versions of the SAMU-XLS-R model, which are pre-trained on different combinations of multilingual corpora of 53, 60, and 100 languages. In addition, they leverage training data from higher-resource corpora, such as CoVoST-2 (Wang et al., 2020a) and Europarl-ST (Iranzo-Sánchez et al., 2020), for training their end-to-end models. Their primary system, which scores 15.88 BLEU on the Tamasheq–French 2023 test set, was trained on the combination of CoVoST-2, Europarl-ST and the IWSLT 2022 test set, with the encoder initialized with the SAMU-XLS-R model trained on the data gathered from 100 languages.

• NAVER (Gow-Smith et al., 2023) submitted one primary and two contrastive systems to the Tamasheq–French track, as well as one primary and two contrastive systems for the unconstrained condition in the Quechua–Spanish track. In their work for the Tamasheq–French track, they concentrate on parameter-efficient training methods that can perform both ST and MT in a multilingual setting. In order to do so, they initialize their models with a pre-trained multilingual MT model (mBART (Liu et al., 2020) or NLLB (NLLB Team et al., 2022)), which is then fine-tuned on the ST task by inputting features extracted with a frozen pre-trained speech representation model (wav2vec 2.0 or HuBERT (Hsu et al., 2021)). The encoder of their translation model is slightly modified: they stack several modality-specific layers at the bottom. In addition, adapter layers are inserted between layers of the pre-trained MT model on both the encoder and decoder sides. While these new components get fine-tuned during the training process, the pre-trained components of the MT model are frozen. One of the appealing characteristics of their approach is that it allows the same model to do both speech-to-text and text-to-text translation (or transcription). Furthermore, their method maximizes knowledge transfer to improve low-resource performance. Their primary system, which is an ensemble of 3 different runs on the combination of both ST and ASR data, scores 23.59 BLEU on the 2023 test set.

For the Quechua–Spanish track, the overall architecture of their systems consists of first initializing a PLM, which was then fine-tuned on the speech translation task by inputting features from a frozen pre-trained speech representation. Similar adaptations were done with an MT model to control for domain and length mismatch issues. One of the interesting takeaways from their approaches is that their contrastive 2 system (1.3 billion parameters (NLLB Team et al., 2022)) outperformed their contrastive 1 system (3.3 billion parameters (NLLB Team et al., 2022)) despite having fewer parameters. NAVER's primary submission was an ensemble approach that included the use of PLMs for both the ASR (Baevski et al., 2020a) and MT (NLLB Team et al., 2022) systems and included training on both Tamasheq and Quechua data. Their submissions to QUE–SPA did not include the use of mBART or HuBERT (Hsu et al., 2021), as was done for the other language pairs that NLE submitted to.

• QUESPA (Ortega et al., 2023) submitted a total of six systems to both conditions (constrained and unconstrained), including a primary, contrastive 1, and contrastive 2 system for each condition. They also report having tried several other combinations but did not submit those systems. For the constrained condition, their primary system scored second best, slightly below team GMU, with a BLEU score of 1.25 and chrF2 of 25.35. They also scored third best for the constrained condition with 0.13 BLEU and 10.53 chrF2 using their contrastive 1 system. It is worth noting that chrF2 was used by the organizers when BLEU scores were below five. For their constrained systems, a direct speech translation system was submitted, similar to the GMU team's primary approach, using Fairseq (Wang et al., 2020b). QUESPA extracted mel filter bank (MFB) features similar to the S2T approach in previous work (Wang et al., 2020b). The main difference between QUESPA's submission and GMU's submissions was that the GMU team
increased the number of decoder layers to 6, which resulted in a slightly better system for GMU. The other systems submitted for the constrained setting were cascade systems where ASR and MT were combined in a pipeline setting. Their contrastive 1 and 2 system submissions for the constrained task respectively used wav2letter++ (Pratap et al., 2019) and a conformer architecture similar to previous work (Gulati et al., 2020), along with an OpenNMT (Klein et al., 2017) translation system trained on the constrained ST and MT data. Both of those systems performed poorly, scoring less than 1 BLEU. For the unconstrained condition, the three systems presented by QUESPA consisted of pipeline approaches with PLMs that were fine-tuned on the additional 60 hours of Siminchik audio data along with the constrained data. Their primary and contrastive 1 unconstrained ASR systems were trained using the 102-language FLEURS (Conneau et al., 2023) model and used an MT system based on NLLB (NLLB Team et al., 2022), which happens to include Quechua as one of its languages. Their contrastive 2 ASR system was based on wav2letter++ (Pratap et al., 2019), while their contrastive 2 MT system was identical to the MT systems used for their primary and contrastive 1 submissions.

• SRI-B (Radhakrishnan et al., 2023) submitted four systems. For Marathi–Hindi, they submitted one primary and one contrastive system in the constrained setting and one primary and one contrastive system in the unconstrained setting. They used end-to-end speech translation networks comprising a conformer encoder and a transformer decoder for both the constrained and unconstrained settings.

8.4 Results

Irish–English As discussed earlier, only the GMU team participated in the GA–ENG translation track, submitting one primary system to the constrained condition, one primary system to the unconstrained condition, and two contrastive systems to the unconstrained condition. The end-to-end and the end-to-end-with-ASR models were submitted as the primary constrained and the contrastive2 unconstrained systems, respectively; both achieved 15.1 BLEU. They did not perform well in comparison to the wav2vec 2.0 and HuBERT models. Detailed results for this track can be found in Tables 36 and 37.

Marathi–Hindi The results for this translation track can be found in Tables 38 and 39. Overall we see varying performance among the systems submitted to this track, with some performing much better on the test set. Out of the 16 submissions, the SRI-B team's primary system achieved the best result of 31.2 BLEU and 54.8 chrF++ on the constrained condition, while the BUT team's primary system achieved the best results of 39.6 BLEU and 63.3 chrF++ on the unconstrained condition. In both the constrained and unconstrained conditions, the GMU systems achieved the lowest results, of 3.3 and 5.9 BLEU and 16.8 and 20.3 chrF++, respectively.

Maltese–English The results for this translation track can be found in Table 42. UM-DFKI used contrastive approaches in training their ASR system. For their contrastive1 system, the fine-tuning consisted of using Maltese, Arabic, French and Italian corpora. Their contrastive2, contrastive3, and contrastive4 approaches respectively use a subset of an Arabic, French and Italian ASR corpus along with the Maltese data. The best result of 0.7 BLEU was achieved with their contrastive1 system.
Tamasheq–French The results for this translation track can be found in Tables 43 and 44. Compared to last year's edition, this year has witnessed a growing interest in this low-resource translation track in terms of both the quantity and the quality of submissions. Almost all submissions achieve better results than last year's best system (5.7 BLEU on test2022 (Boito et al., 2022b)). Furthermore, it is notable that cascaded systems do not appear to be favored in this track: none of the submitted systems is of this kind.

This year, this language pair remains a challenging low-resource translation track. There is only one submission to the constrained condition, from GMU, with an end-to-end model scoring 0.48 BLEU on this year's test set. For this reason, all the participants are in favor of exploiting pre-trained models, hence being subject to the unconstrained condition. Among these pre-trained models, self-supervised learning (SSL) speech models remain a popular choice for initializing the speech encoder. Using a wav2vec 2.0 model pre-trained on unlabelled Tamasheq data to initialize their speech encoder, GMU gains +7.55 BLEU compared with their Transformer-based encoder-decoder model trained from scratch (their primary constrained system). On the decoder side, pre-trained models such as mBART or NLLB are commonly leveraged for initializing the decoder of the end-to-end ST model. Besides, data augmentation and ensembling are also beneficial, as shown by Alexa AI, who consistently achieve ∼9 BLEU in all of their settings.

Outstanding BLEU scores can be found in the work of the ON-TRAC team. An interesting pre-trained model named SAMU-XLS-R is shown to bring significant improvements. This is a multilingual, multimodal semantic speech representation learning framework (Khurana et al., 2022) which fine-tunes the pre-trained speech transformer encoder XLS-R (Babu et al., 2021) using semantic supervision from the pre-trained multilingual semantic text encoder LaBSE (Feng et al., 2022). Exploiting this pre-trained model and training end-to-end ST models on combinations of different ST corpora, they achieve more than 15 BLEU in all of their settings.

NAVER tops this translation track with a multilingual parameter-efficient training solution that allows them to leverage strong pre-trained speech and text models to maximize performance on low-resource languages. Being able to train on both ST and ASR data thanks to the multilingual nature of their models, all of their submissions outperform the second team, ON-TRAC, by considerable margins. Their primary system, which is an ensemble of 3 different runs, uses NLLB-1.3B as the pre-trained MT system and wav2vec 2.0 Niger-Mali (https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-Niger-Mali) as the speech representation extractor. After being trained on a combination of ST corpora (Tamasheq-French, mTEDx fr-en, mTEDx es-fr, mTEDx es-en, mTEDx fr-es (Salesky et al., 2021)) and ASR corpora (TED-LIUM v2 (Rousseau et al., 2014), mTEDx fr, mTEDx es), this system establishes an impressive state-of-the-art performance for the Tamasheq-French language pair, scoring 23.59 BLEU on the 2023 test set.

Quechua–Spanish The QUE–SPA results for all systems submitted to this low-resource translation track can be found in Tables 45 and 46 of the appendix. To our knowledge, this first edition of the QUE–SPA language pair in the low-resource track of IWSLT has witnessed the best BLEU scores achieved by any known system in research for Quechua. The two best performing systems, at 1.46 BLEU (constrained) and 15.70 BLEU (unconstrained), show that there is plenty of room to improve on the approaches presented here. Nonetheless, submissions from the three teams (GMU, NAVER, and QUESPA) have shown that it is possible to use PLMs to create speech translation systems with as little as 1.6 hours of parallel speech data. This is a notable characteristic of this task and surpasses previous work in the field.

We have found that the NLLB (NLLB Team et al., 2022) system's inclusion of Quechua in recent years has had a greater impact than expected in terms of ease of use. Similarly, Fairseq (Wang et al., 2020b) seems to be the preferred toolkit for creating direct S2T systems, cascaded or not. The QUE–SPA submissions for the unconstrained condition preferred the use of a cascaded system in a pipeline approach where pre-trained models were fine-tuned first for ASR and then for MT.

The constrained setting leaves much room for improvement. Nonetheless, GMU and QUESPA's near-identical submissions have shown that
increase of 3 layers during decoding can be powerful and should be explored further. It would be worthwhile for the organizers of the QUE–SPA track to obtain more parallel data including translations for future iterations of this task.

The unconstrained setting clearly can benefit from an ensembling technique and training with multiple languages – in these submissions, the training of a model with an additional language like Tamasheq alongside Quechua does not seem to have a negative impact on performance. It is, however, hard to ascertain whether the slight performance gain of less than 1 BLEU point of the NLE team's submission compared to QUESPA's submission was due to the ensembling, the freezing of the models, or the language addition.

As a final takeaway, the NLE team's submissions scored quite well under the unconstrained condition. It should be noted that for other language pairs NLE's high system performance was also due to the ensembling of systems that were executed using different initialization parameters on at least three unique runs. As an aside, small gains were achieved under the constrained condition when comparing the GMU submission to the QUESPA system due to the increase in decoding layers. QUESPA's inclusion of a language model on top of a state-of-the-art dataset (Fleurs) allowed them to achieve scores similar to NAVER's without additional tuning or ensembling. State-of-the-art performance was achieved by all three teams that submitted systems.

General Observations As in previous years, the low-resource shared task proved particularly challenging for the participants, but there are several encouraging signs that further reinforce the need for more research in the area. First, more teams than ever participated in the shared task, showing a continued interest in the field. Second, we note that for the language pair that was repeated from last year (Tamasheq–French), almost all submissions outperformed last year's best submission, with an increase of more than 17 BLEU points in the unconstrained setting. Last, we highlight the breadth of different approaches employed by the participants, ranging from the use of finetuned pre-trained models, to pre-training from scratch, to parameter-efficient fine-tuning, as well as cascaded pipeline systems, all of which seem to have benefits to offer, to a certain extent, to different language pairs.

Limitations As noted by some participants, the Irish–English and Maltese–English translation track data has limitations. For Irish–English, the speech translation systems can achieve very high BLEU scores on the test set if the built systems have used wav2vec 2.0 and/or the Irish ASR model which is trained on the Common Voice (Ardila et al., 2020b) dataset. Similarly, the GMU team has achieved high BLEU scores especially when they used wav2vec 2.0 and HuBERT models. We plan to continue this translation track next year by updating the test and training data to thoroughly investigate the data quality as well as the reasons for the high BLEU scores. For Maltese–English, some participants reported issues with the data quality, which we hope to resolve in future iterations of the shared task.

9 Formality Control for SLT

Different languages encode formality distinctions in different ways, including the use of honorifics, grammatical registers, verb agreement, pronouns, and lexical choices. While machine translation (MT) systems typically produce a single generic translation for each input segment, SLT requires adapting the translation output to be appropriate to the context of communication and target audience. This shared task thus challenges machine translation systems to generate translations of different formality levels.

9.1 Challenge

Task Given a source text, X, in English and a target formality level, l ∈ {F, IF}, the goal in formality-sensitive machine translation (Niu et al., 2017) is to generate a translation, Y, in the target language that accurately preserves the meaning of the source text and conforms to the desired formality level, l. The two formality levels typically considered are "F" for formal and "IF" for informal, resulting in two translations: Y_F and Y_IF respectively. For example, the formal and informal translations for the source text "Yeah Did your mom know you were throwing the party?" (originally informal) in Korean are shown in the table below:

Source: Yeah Did your mom know you were throwing the party?
Korean Informal: 그, 어머님은 [F]네가[/F] 그 파티 연 거 [F]아셔[/F]?
Korean Formal: 그, 어머님은 [F]님이[/F] 그 파티 연 거 [F]아세요[/F]?

Table 7: Contrastive formal and informal translations into Korean. Grammatical formality markers are annotated with [F]text[/F].

This shared task builds on last year's offering, which evaluated systems' ability to control formality on the following translation tasks: translation from English (EN) into Korean (KO) and Vietnamese (VI) in the supervised setting, and from English (EN) into Portugal Portuguese (PT) and Russian (RU) in the zero-shot setting. Results showed that formality control is challenging in zero-shot settings and for languages with many grammatical and lexical formality distinctions. This year's edition invited participants to advance research in effective methods for bridging the gap in formality control for zero-shot cases and for languages with rich grammatical and lexical formality distinctions.
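The [F]text[/F] convention shown in Table 7 lends itself to simple span extraction. The following is a minimal sketch, not the official scoring code, of how such an annotated reference could be split into plain text plus the marked formality phrases (the example sentence is an invented English illustration of the format):

```python
import re
from typing import List, Tuple

# Phrasal formality markers are wrapped as [F]text[/F] in the contrastive references.
MARKER = re.compile(r"\[F\](.*?)\[/F\]")

def split_reference(annotated: str) -> Tuple[str, List[str]]:
    """Return the plain reference text and the list of marked formality phrases."""
    phrases = MARKER.findall(annotated)
    plain = MARKER.sub(lambda m: m.group(1), annotated)
    return plain, phrases

if __name__ == "__main__":
    # Hypothetical English-like example, only to illustrate the annotation format.
    formal_ref = "Yeah, did [F]your mother[/F] know [F]you were[/F] throwing the party?"
    plain, phrases = split_reference(formal_ref)
    print(plain)    # reference with the markers stripped
    print(phrases)  # ['your mother', 'you were']
```

The same extraction step is what a reference-based control metric (such as the Matched-Accuracy described in Section 9.4) needs as input.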
9.2 Data and Metrics

Participants were provided with test data, as well as MT quality and formality control metrics. In addition, we provided training data, consisting of formal and informal translations of texts for the supervised language pairs (EN-KO, EN-VI).

9.2.1 Formality Annotated Dataset

We provide targeted datasets comprising source segments paired with two contrastive reference translations, one for each formality level (informal and formal), for EN-VI and EN-KO in the supervised setting and EN-RU and EN-PT in the zero-shot setting (see Example 7; https://github.com/amazon-science/contrastive-controlled-mt/tree/main/IWSLT2023). The sizes and properties of the released datasets for all the language pairs are listed in Table 8. Formal translations tend to be longer than informal texts for Vietnamese compared to other language pairs. The number of phrasal formality annotations ranges from 2 to 3.5 per segment, with Korean exhibiting a higher diversity between the formal and informal translations as indicated by the TER score.

9.2.2 Training Conditions

We allowed submissions under the constrained and unconstrained data settings described below:

Constrained (C) Participants were allowed to use the following resources: Textual MuST-C v1.2 (Di Gangi et al., 2019b), CCMatrix (Schwenk et al., 2021), OpenSubtitles (Lison and Tiedemann, 2016), and the dataset from the constrained setting of the Formality Control track at IWSLT22 (Anastasopoulos et al., 2022a).

Unconstrained (U) Participants could use any publicly available datasets and resources; the use of pre-trained language models was also allowed. Additionally, automatically annotated bitext with formality labels was also allowed.

9.3 Formality Classifier

We release a multilingual classifier (MC) trained to predict the formality of a text for all the language pairs: EN-KO, EN-VI, EN-RU, and EN-PT. We finetune an xlm-roberta-base (Conneau et al., 2020) model on human-written formal and informal translations following the setup from Briakou et al. (2021). Our classifier achieves an accuracy of > 98% in detecting the formality of human-written translations for the four target languages (Table 10). Participants were allowed to use the classifier both for model development and for evaluation purposes as discussed below.

9.4 Automatic Metrics

We evaluate the submitted system outputs along the following two dimensions:

1. Overall translation quality, evaluated using SacreBLEU v2.0.0 (Papineni et al., 2002b; Post, 2018) and COMET (Rei et al., 2020b), on both the shared task-provided test sets based on Topical-Chat (Gopalakrishnan et al., 2019) and on the FLORES devtest (NLLB Team et al., 2022; Goyal et al., 2022).

2. Formality control, evaluated using:
   • Matched-Accuracy (mACC), a reference-based corpus-level automatic metric that leverages phrase-level formality markers from the references to classify a system-generated hypothesis as formal, informal, or neutral (Nadejde et al., 2022).
   • Classifier-Accuracy (cACC), a reference-free metric that uses the multilingual formality classifier discussed above to label a system-generated hypothesis as formal or informal.
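As a rough illustration of the classifier described in Section 9.3, the sketch below fine-tunes xlm-roberta-base with a two-way formal/informal head using the HuggingFace Trainer. The toy sentences and hyper-parameters are placeholders and not the organizers' actual setup; in practice the training pairs would be the human-written formal/informal translations released with the task.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = {"informal": 0, "formal": 1}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS))

# Placeholder training pairs standing in for the released contrastive translations.
train = Dataset.from_dict({
    "text": ["Could you please let me know?", "Lemme know, ok?"],
    "label": [LABELS["formal"], LABELS["informal"]],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="formality-clf",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```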
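A simplified view of the two control metrics follows, under the assumption that a hypothesis is labelled by the marked phrases it contains (mACC) or by an external classifier (cACC). The official definitions in Nadejde et al. (2022) differ in details such as tokenization and the handling of ties, so this is only a sketch.

```python
from typing import Callable, List

def matched_label(hyp: str, formal_phrases: List[str],
                  informal_phrases: List[str]) -> str:
    """Label a hypothesis from the [F]-marked phrases of the two references."""
    has_f = any(p in hyp for p in formal_phrases)
    has_if = any(p in hyp for p in informal_phrases)
    if has_f and not has_if:
        return "formal"
    if has_if and not has_f:
        return "informal"
    return "neutral"

def m_acc(hyps, formal_markers, informal_markers, target: str) -> float:
    """Corpus-level matched accuracy for the requested target level."""
    labels = [matched_label(h, f, i)
              for h, f, i in zip(hyps, formal_markers, informal_markers)]
    return sum(label == target for label in labels) / len(labels)

def c_acc(hyps, classify: Callable[[str], str], target: str) -> float:
    """Reference-free accuracy; `classify` stands in for the Section 9.3 classifier."""
    return sum(classify(h) == target for h in hyps) / len(hyps)
```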
LANGUAGE   TYPE    SIZE   LENGTH (SOURCE / FORMAL / INFORMAL)   # PHRASAL ANNOTATIONS (FORMAL / INFORMAL)   TER(F, IF)
EN-VI      Train   400    20.35 / 28.52 / 25.48                 2.71 / 1.49                                 23.70
EN-VI      Test    600    21.82 / 29.59 / 26.77                 2.79 / 1.55                                 23.00

Table 8 (EN-VI rows shown): Sizes and properties of the released contrastive datasets.

Table 9: Formality Track Submissions Summary. Most participants train bilingual systems but leverage a diverse set of formality encoding mechanisms for control.
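The TER(F, IF) column above measures how much the formal and informal references differ. A quick way to reproduce such a number, assuming a setup close to (but not necessarily identical with) the organizers', is sacrebleu's TER implementation; the sentences below are invented:

```python
from sacrebleu.metrics import TER

formal_refs = ["Could you please let me know whether you are coming?"]
informal_refs = ["Lemme know if you're coming, ok?"]

# Corpus-level TER between the formal and informal references.
print(TER().corpus_score(formal_refs, [informal_refs]).score)
```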
[...] RU. These are Transformer-Big models trained on a large public dataset from the OPUS collection (Tiedemann, 2012), automatically marked with formality using a sequence of regular expressions. The formality level is encoded with a pseudo-token at the beginning of each training source sentence with one of 3 values: formal, informal, or no style.

• HW-TSC (Wang et al., 2023a) describes a system that uses a multi-stage pre-training strategy on task-provided data to train strong bilingual models. Using these bilingual models, they employ beam re-ranking on the outputs generated using the test source. The generated hypotheses are ranked using the formality classifier and phrasal annotations, iteratively fine-tuning the model on this data until test performance converges. Initial formality control is enabled by a special token and re-affirmed through classifier output and annotations from training.

• KU X Upstage (Lee et al., 2023) uses large-scale bilingual transformer-based MT systems trained on high-quality datasets and mBART for the supervised and zero-shot settings respectively. They generate a formality-controlled translation dataset for supervision in the zero-shot setting using GPT-4 and filter the generated source-translation pairs using the formality classifier. All bilingual models are then finetuned independently for the two target formality directions to generate formality-controlled outputs, resulting in #(Language pairs) × 2 (Formal/Informal) models.

• UCSC (Vakharia et al., 2023) focused on using a single multilingual translation model for all the language pairs under the unconstrained setting. They finetune the pre-trained model, mBART-large-50 (Tang et al., 2020), using the provided contrastive translations (§ 9.2.1) with an added style embedding intervention layer.

9.6 Results

Tables 47 and 48 in the Appendix show the main automatic evaluation results for the shared task.

Overall Results For the supervised language pairs in both constrained and unconstrained settings, most submitted systems were successfully able to control formality. The average mAcc scores ranged from 78 to 100. Controlling formality in Korean was found to be more challenging than translating with formality control in Vietnamese, as reflected by the relatively lower mAcc scores, which we believe to be due to the variation in formality expression of Korean honorific speech reflected in pretraining data.

HW-TSC consistently achieves the best scores across the board for all language pairs and both settings due to the use of transductive learning. Interestingly, the constrained submission by HW-TSC achieves better or competitive results compared to their unconstrained system, suggesting that the use of a pre-trained language model or additional resources is not necessary to generate high-quality formality-controlled translations. Generally, the systems generate higher quality outputs in the formal setting relative to the informal setting for both supervised language pairs according to BLEU and COMET, which might be due to the bias of the dataset used during pre-training, which is typically news and hence more formal.

In the zero-shot unconstrained setting, this formality bias is even more prominent. We observe a much wider distribution in the formality scores for English-Portuguese (mAcc: F 90-100, IF: 58-100), possibly due to the high ambiguity in the informal language and the confounding dialectal influence of Brazilian Portuguese dominant in the pre-training corpora, which is known to use formal register even in typically informal contexts (Costa-jussà et al., 2018). HW-TSC and AppTek achieve the best translation quality for English-Portuguese and English-Russian respectively. The lowest scoring submission in both quality and formality control (UCSC) did not include any fine-tuning or adaptation of the base mBART model to the two zero-shot language pairs: English-Russian and English-Portuguese. This suggests that formality information is not transferred from the unrelated language pairs, EN-KO and EN-VI, and that some language-specific supervision is needed to mark grammatical formality appropriately in Russian and Portuguese.
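Several of the submissions above encode the requested formality as a special token or pseudo-token on the source side. The sketch below illustrates that data preparation step with assumed tag names and a toy, regex-based labeling rule; it is not the exact scheme of any participating team.

```python
import re
from typing import Optional

TAGS = {"formal": "<formal>", "informal": "<informal>", None: "<no_style>"}

def label_target_formality(target_sentence: str) -> Optional[str]:
    # Toy stand-in for a "sequence of regular expressions": for Russian, the
    # capitalized polite pronoun suggests formal register, the familiar one informal.
    if re.search(r"\bВы\b", target_sentence):
        return "formal"
    if re.search(r"\bты\b", target_sentence, flags=re.IGNORECASE):
        return "informal"
    return None

def tag_source(source_sentence: str, target_sentence: str) -> str:
    """Prepend the pseudo-token encoding the formality of the target side."""
    return f"{TAGS[label_target_formality(target_sentence)]} {source_sentence}"

print(tag_source("Did you know?", "Вы знали?"))  # "<formal> Did you know?"
print(tag_source("Did you know?", "ты знал?"))   # "<informal> Did you know?"
```

At inference time, the same tag is simply prepended to the test source to request the desired formality level.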
Figure 3: Formality Classifier Scores’ Distribution on the submitted system outputs in the Unconstrained setting:
HW-TSC can precisely match the target formality as depicted by the peaky distribution.
How well do systems match the desired target formality? We show the distribution of the scores generated using the formality classifier for formality-controlled MT, despite the edit-focused nature of the contrastive translations. We recommend that future work on formality-controlled machine translation targets these challenges.

10 Automatic Dubbing

10.1 Challenge

This task focuses on automatic dubbing: translating the speech in a video into a new language such that the new speech is natural when overlayed on the original video (see Figure 5). Participants were given German videos, along with their text transcripts, and were asked to produce dubbed videos where the German speech has been translated into English speech.

Automatic dubbing is a very difficult and complex task (Brannon et al., 2023), and for this shared task we focus on the property which is perhaps most characteristic of dubbing: isochrony. Isochrony refers to the property that the speech translation is time-aligned with the original speaker's video. When the speaker's mouth is moving, a listener should hear speech; likewise, when their mouth isn't moving, a listener should not hear speech.

To make this task accessible for small academic teams with limited training resources, we make some simplifications. First, we assume the input speech has already been converted to text using an ASR system and the desired speech/pause times have been extracted from the input speech. Second, to alleviate the challenges of training a TTS model, the output is defined to be phonemes and their durations. These phonemes and durations are played through an open-source FastSpeech2 (Ren et al., 2022) text-to-speech model (https://github.com/mtresearcher/FastSpeech2) to produce the final speech.

10.2 Data and Metrics

Official training and test data sets were provided by the organizers (https://github.com/amazon-science/iwslt-autodub-task/tree/main/data). The training data was derived from CoVoST2 (Wang et al., 2021) and consists of:
1. Source (German) text
2. Desired target speech durations (e.g. 2.1s of speech, followed by a pause, followed by 1.3s of speech)
3. Target (English) phonemes and durations corresponding to a translation which adheres to the desired timing

The test data was produced by volunteers and consists of videos of native German speakers reading individual sentences from the German CoVoST-2 test set (each volunteer provided their consent to use this data for the automatic dubbing task). This test set was divided into two subsets: Subset 1, where there are no pauses in the speech, and Subset 2, where there is one or more pause in the speech. More details on this data are presented in (Chronopoulou et al., 2023).

10.3 Submissions

Despite high initial interest, we received only one submission, which was from the Huawei Translation Services Center (HW-TSC) (Rao et al., 2023). However, we had two systems (Chronopoulou et al., 2023; Pal et al., 2023) built for the task for which we had not yet performed human evaluation, so we still had enough systems for an interesting comparison.

• Interleaved (Baseline): Our first baseline and the basis for this shared task is from Chronopoulou et al. (2023). They propose to jointly model translations and speech timing, giving the model the freedom to change the translation to fit the timing, to make sacrifices in translation quality to meet timing constraints, or to relax timing constraints to improve translation quality. This is achieved by simply binning target phoneme durations and interleaving them with target phonemes during training and inference. To avoid teaching the model that speech durations should be prioritized over translation quality (median speech overlap is just 0.731 in a large corpus of human dubs; Brannon et al., 2023), noise with standard deviation 0.1 is added to the target phrase durations to simulate the source durations used at inference.

• Factored (Baseline): Pal et al. (2023) build on the first baseline by using target factors (García-Martínez et al., 2016), where alongside predicting phoneme sequences as the target, we also predict durations for each phoneme as a target factor. Additionally, they propose auxiliary counters, which are similar to target factors except the model is not [...]
Figure 5: To illustrate, here's an example in which "hallo! wie gehts?" is translated to "hi! how are you?" such that the output will fit in the desired target speech durations of 0.4s and 1.3s, with a pause in between.
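The timing example in Figure 5 can be made concrete with a small sketch of the interleaved target representation used by the first baseline; the phoneme inventory, bin edges, and pause handling below are illustrative assumptions rather than the exact scheme of Chronopoulou et al. (2023).

```python
import bisect

BIN_EDGES = [0.05, 0.10, 0.20, 0.40]   # seconds; defines five duration bins

def duration_bin(seconds: float) -> str:
    """Map a phoneme duration to a discrete bin token such as <d2>."""
    return f"<d{bisect.bisect_right(BIN_EDGES, seconds)}>"

def interleave(phonemes_with_durations, pause_after=None):
    """Build a target sequence like: HH <d1> AY <d3> <pause> HH <d1> ..."""
    tokens = []
    for i, (phoneme, seconds) in enumerate(phonemes_with_durations):
        tokens.extend([phoneme, duration_bin(seconds)])
        if pause_after is not None and i == pause_after:
            tokens.append("<pause>")
    return " ".join(tokens)

# "hi!" (0.4s total), a pause, then "how are you?" (1.3s total), roughly as in Figure 5.
example = [("HH", 0.08), ("AY", 0.32),            # hi
           ("HH", 0.10), ("AW", 0.15),            # how
           ("AA", 0.15), ("R", 0.10),             # are
           ("Y", 0.10), ("UW", 0.70)]             # you
print(interleave(example, pause_after=1))
```

A model trained on such sequences can trade timing against translation quality simply by emitting different duration tokens.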
• HW-TSC: In contrast to our three baselines, Rao et al. (2023) took a more traditional approach to dubbing and followed the prior works on verbosity control (Lakew et al., 2021, 2019) to first generate a set of translation candidates and later re-rank them. Their system consists of four parts: 1) voice activity detection followed by pause alignment, 2) generating a list of translation candidates, 3) phoneme duration prediction, followed by 4) re-ranking/scaling the candidates based on the durations (see Figure 6). With the last step in the pipeline, the top-scored candidate is ensured to have the best speech overlap with the source speech amongst all candidate translations.

10.4 Evaluation & Metric

The dubbed English videos were judged by a mixture of native and non-native speakers, all of whom were researchers in automatic dubbing. For each video in the test set, one judge was shown the four system outputs in random order and asked to rate them from 1-6. The judges were not given a defined rubric or guidelines to follow but were asked to be consistent.

As a metric we opted for the mean opinion score (MOS) methodology (https://en.wikipedia.org/wiki/Mean_opinion_score), where the scores for a system as judged by humans are averaged into one score. Feedback from the judges indicates that the baseline and submitted systems often produce poor translations (perhaps due to the small amount of training data used by each system), and the voice quality from the FastSpeech 2 model was far from perfect. However, they felt that having all systems share the same voice made it much easier to compare across dubbing systems.
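As a rough illustration of the final re-ranking step in the HW-TSC pipeline described above, the sketch below keeps the candidate whose predicted per-segment durations best overlap the source speech segments. The overlap score used here is a simple assumption, not the exact criterion of Rao et al. (2023).

```python
def speech_overlap(candidate_durations, source_durations):
    """Symmetric overlap of speaking time, segment by segment (1.0 = perfect fit)."""
    inter = sum(min(c, s) for c, s in zip(candidate_durations, source_durations))
    union = sum(max(c, s) for c, s in zip(candidate_durations, source_durations))
    return inter / union

def rerank(candidates, source_durations):
    # candidates: list of (translation, predicted per-segment durations)
    return max(candidates, key=lambda cand: speech_overlap(cand[1], source_durations))

cands = [("hi! how are you doing today?", [0.6, 1.9]),
         ("hi! how are you?",             [0.4, 1.2])]
best = rerank(cands, source_durations=[0.4, 1.3])
print(best[0])   # the candidate whose timing best fits the source segments
```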
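The MOS numbers reported below (Tables 11 and 12) are plain averages of the 1-6 judgments with a 95% confidence interval attached. A minimal sketch, assuming a normal-approximation interval (the text does not state how the CIs were computed):

```python
import math
from statistics import mean, stdev

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score and half-width of an approximate 95% confidence interval."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width

ratings = [4, 3, 5, 3, 4, 2, 4, 3, 5, 4]   # toy 1-6 judgments for one system
m, ci = mos_with_ci(ratings)
print(f"MOS = {m:.2f} ± {ci:.2f}")
```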
When we looked at the distribution of scores per annotator (judge) level, the numbers showed that each annotator had a bias towards dubbing: some liked dubbing more than others, which is intuitive but has not been studied before in the context of automatic dubbing. As shown in Table 11, it is clear that annotator A2 had a significantly higher preference for dubbing as compared to annotator A4 in terms of MOS.

Annotator   MOS↑   CI
A1          3.34   ±0.16
A2          3.74   ±0.19
A3          3.53   ±0.13
A4          3.07   ±0.15

Table 11: MOS (on a scale of 1-6) with confidence interval (CI) at 95% per annotator, showing the biases towards general-purpose dubbed content.

We also looked at MOS for the two different subsets to understand whether it was difficult for the submitted systems to dub the videos. As it turns out, Subset 1 has a significantly higher MOS of 3.54 (±0.11) compared to Subset 2 with a MOS of 3.31 (±0.11). This shows it is significantly more difficult for all systems to dub Subset 2 than Subset 1.

10.5 Results

Results are shown in Table 12. All three dubbing systems outperform the non-isochronic Text2Phone baseline (Chronopoulou et al., 2023), as expected. The factored baseline improves over the interleaved baseline, consistent with the automatic metric results reported by Pal et al. (2023). The HW-TSC system (Rao et al., 2023) outperforms all the baselines in terms of mean opinion score, making it the clear winner of the IWSLT 2023 dubbing shared task. Unfortunately, since the HW-TSC system was unconstrained (it trains on additional bitext compared to the baselines) and uses fundamentally different approaches than the baselines, it is not possible to attribute its performance to any single factor.

System        Constrained?   MOS↑ Mean   CI
Text2Phone    Yes            3.16        ±0.19
Interleaved   Yes            3.33        ±0.18
Factored      Yes            3.43        ±0.19
HW-TSC        No             3.77        ±0.19

Table 12: Mean opinion score for baselines 1) Text2Phone, 2) Interleaved (Chronopoulou et al., 2023), 3) Factored (Pal et al., 2023), and 4) the submitted system of HW-TSC (Rao et al., 2023).

Lip-sync is an important feature of dubbing: it is important that the final generated audio is in sync with the lip movements of the on-screen speaker in the original video. As an analysis, we looked at Lip-Sync Error Distance (LSE-D) (Chung and Zisserman, 2016), following the evaluation methodology in Hu et al. (2021). LSE-D is not a perfect metric but it is an indication of the amount of lip-sync errors in the video. From Table 13, Subset 1 consistently has a lower lip-sync error than Subset 2 in all cases, indicating that it is difficult to generate lip-synced dubs for Subset 2. This result is also in line with the MOS scores we obtained for the two subsets, where the annotators preferred dubs for Subset 1. Secondly, original videos show a significantly lower lip-sync error distance (12.x vs. 7.x) than dubbed videos, showing that automatic dubbing research still has a long way to go to reach the lip-sync quality of original videos.

System        Subset 1   Subset 2
Original      7.39       7.67
Text2Phone    11.64      13.31
Interleaved   11.71      12.35
Factored      11.73      12.48
HW-TSC        12.11      12.77

Table 13: Results of Lip-Sync Error Distance (LSE-D) via the SyncNet pre-trained model (Chung and Zisserman, 2016). Lower is better.

Acknowledgements

Claudia Borg, Thierry Declerck, Rishu Kumar and John Judge acknowledge H2020 LT-Bridge Project (GA 952194). Rishu Kumar would also like to thank the EMLCT programme (https://mundus-web.coli.uni-saarland.de/). Atul Kr. Ojha and John P. McCrae would like to thank Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2 Insight 2, and Panlingua Language Processing LLP for providing the Marathi-Hindi speech translation data and for their support. John Judge would also like to acknowledge the support of SFI under grant SFI/13/RC/2106 P2 ADPAT. Ondřej Bojar would like to acknowledge the grant 19-26934X
(NEUREM3) of the Czech Science Foundation. Antonios Anastasopoulos and Milind Agarwal are supported by the US National Science Foundation CCRI-Planning 2234895 award, as well as a National Endowment for the Humanities PR-276810-21 award.

Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Ne-
gri, Jan Niehues, Juan Pino, Elizabeth Salesky,
Sebastian Stüker, Katsuhito Sudoh, Marco Turchi,
Alexander Waibel, Changhan Wang, and Matthew
References

Wiesner. 2021. FINDINGS OF THE IWSLT 2021
EVALUATION CAMPAIGN. In Proceedings of the
Basil Abraham, Danish Goel, Divya Siddarth, Ka-
18th International Conference on Spoken Language
lika Bali, Manu Chopra, Monojit Choudhury, Pratik
Translation (IWSLT 2021), pages 1–29, Bangkok,
Joshi, Preethi Jyoti, Sunayana Sitaram, and Vivek
Thailand (online). Association for Computational
Seshadri. 2020. Crowdsourcing speech data for low-
Linguistics.
resource languages from low-income workers. In
Proceedings of the 12th Language Resources and Pierre Andrews, Guillaume Wenzek, Kevin Heffernan,
Evaluation Conference, pages 2819–2826. Onur Çelebi, Anna Sun, Ammar Kamran, Yingzhe
Guo, Alexandre Mourachko, Holger Schwenk, and
Yasuhiro Akiba, Marcello Federico, Noriko Kando, Hi- Angela Fan. 2022. stopes-modular machine trans-
romi Nakaiwa, Michael Paul, and Jun’ichi Tsujii. lation pipelines. In Proceedings of the The 2022
2004. Overview of the IWSLT04 Evaluation Cam- Conference on Empirical Methods in Natural Lan-
paign. In Proceedings of the International Work- guage Processing: System Demonstrations, pages
shop on Spoken Language Translation, pages 1–12, 258–265.
Kyoto, Japan.
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, On-
Antonios Anastasopoulos, Loı̈c Barrault, Luisa Ben- drej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir
tivogli, Marcely Zanon Boito, Ondřej Bojar, Durrani, Marcello Federico, Christian Federmann,
Roldano Cattoni, Anna Currey, Georgiana Dinu, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay
Kevin Duh, Maha Elbayad, Clara Emmanuel, Yan- Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Eliz-
nick Estève, Marcello Federico, Christian Fed- abeth Salesky, Xing Shi, Sebastian Stüker, Marco
ermann, Souhir Gahbiche, Hongyu Gong, Ro- Turchi, and Changhan Wang. 2020. Findings of the
man Grundkiewicz, Barry Haddow, Benjamin Hsu, IWSLT 2020 Evaluation Campaign. In Proceedings
Dávid Javorský, Vĕra Kloudová, Surafel Lakew, of the 17th International Conference on Spoken Lan-
Xutai Ma, Prashant Mathur, Paul McNamee, Kenton guage Translation (IWSLT 2020), Seattle, USA.
Murray, Maria Nǎdejde, Satoshi Nakamura, Mat-
teo Negri, Jan Niehues, Xing Niu, John Ortega, Rosana Ardila, Megan Branson, Kelly Davis, Michael
Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Henretty, Michael Kohler, Josh Meyer, Reuben
Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Morais, Lindsay Saunders, Francis M Tyers, and
Turchi, Yogesh Virkar, Alexander Waibel, Chang- Gregor Weber. 2019. Common voice: A massively-
han Wang, and Shinji Watanabe. 2022a. Findings of multilingual speech corpus. arXiv preprint
the IWSLT 2022 evaluation campaign. In Proceed- arXiv:1912.06670.
ings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 98–157, Rosana Ardila, Megan Branson, Kelly Davis, Michael
Dublin, Ireland (in-person and online). Association Henretty, Michael Kohler, Josh Meyer, Reuben
for Computational Linguistics. Morais, Lindsay Saunders, Francis M Tyers, and
Gregor Weber. 2020a. Common voice: A
Antonios Anastasopoulos, Loı̈c Barrault, Luisa Ben- massively-multilingual speech corpus. In LREC.
tivogli, Marcely Zanon Boito, Ondřej Bojar,
Rosana Ardila, Megan Branson, Kelly Davis, Michael
Roldano Cattoni, Anna Currey, Georgiana Dinu,
Kohler, Josh Meyer, Michael Henretty, Reuben
Kevin Duh, Maha Elbayad, Clara Emmanuel, Yan-
Morais, Lindsay Saunders, Francis Tyers, and Gre-
nick Estève, Marcello Federico, Christian Fed-
gor Weber. 2020b. Common voice: A massively-
ermann, Souhir Gahbiche, Hongyu Gong, Ro-
multilingual speech corpus. In Proceedings of The
man Grundkiewicz, Barry Haddow, Benjamin Hsu,
12th Language Resources and Evaluation Confer-
Dávid Javorský, Vĕra Kloudová, Surafel Lakew,
ence, pages 4218–4222.
Xutai Ma, Prashant Mathur, Paul McNamee, Kenton
Murray, Maria Nǎdejde, Satoshi Nakamura, Mat- Mikel Artetxe and Holger Schwenk. 2019. Mas-
teo Negri, Jan Niehues, Xing Niu, John Ortega, sively multilingual sentence embeddings for zero-
Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias shot cross-lingual transfer and beyond. Transac-
Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco tions of the Association for Computational Linguis-
Turchi, Yogesh Virkar, Alexander Waibel, Chang- tics, 7:597–610.
han Wang, and Shinji Watanabe. 2022b. Findings of
the IWSLT 2022 Evaluation Campaign. In Proceed- Arun Babu, Changhan Wang, Andros Tjandra, Kushal
ings of the 19th International Conference on Spoken Lakhotia, Qiantong Xu, Naman Goyal, Kritika
37
Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquer-
et al. 2021. XLS-R: Self-supervised cross-lingual izo, and Luis Camacho. 2018. Siminchik: A speech
speech representation learning at scale. arXiv corpus for preservation of southern quechua. ISI-
preprint arXiv:2111.09296. NLP 2, page 21.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Mauro Cettolo, Marcello Federico, Luisa Ben-
and Michael Auli. 2020a. wav2vec 2.0: A frame- tivogli, Jan Niehues, Sebastian Stüker, K. Su-
work for self-supervised learning of speech repre- doh, K. Yoshino, and Christian Federmann. 2017.
sentations. In Advances in Neural Information Pro- Overview of the IWSLT 2017 Evaluation Campaign.
cessing Systems, volume 33, pages 12449–12460. In Proceedings of the 14th International Workshop
on Spoken Language Translation (IWSLT 2017),
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, pages 2–14, Tokyo, Japan.
and Michael Auli. 2020b. wav2vec 2.0: A frame-
work for self-supervised learning of speech repre- Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa
sentations. Advances in Neural Information Pro- Bentivogli, Roldano Cattoni, and Marcello Federico.
cessing Systems, 33:12449–12460. 2015. The IWSLT 2015 Evaluation Campaign. In
Proceedings of the 12th International Workshop on
Parnia Bahar, Patrick Wilken, Javier Iranzo-Sánchez, Spoken Language Translation (IWSLT 2015), Da
Mattia Di Gangi, Evgeny Matusov, and Zoltán Nang, Vietnam.
Tüske. 2023. Speech Translation with Style:
AppTek’s Submissions to the IWSLT Subtitling and Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa
Formality Tracks in 2023. In Proceedings of the Bentivogli, and Marcello Federico. 2013. Report on
20th International Conference on Spoken Language the 10th IWSLT Evaluation Campaign. In Proceed-
Translation (IWSLT). ings of the Tenth International Workshop on Spoken
Language Translation (IWSLT 2013), Heidelberg,
Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Germany.
Karakanta, Alberto Martinelli, and Marco Turchi
Matteo Negri. 2021. Cascade versus Direct Speech Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa
Translation: Do the Differences Still Make a Dif- Bentivogli, and Marcello Federico. 2014. Report
ference? In Proceedings of the 59th Annual Meet- on the 11th IWSLT Evaluation Campaign, IWSLT
ing of the Association for Computational Linguis- 2014. In Proceedings of the Eleventh International
tics, Bangkok, Thailand. Association for Computa- Workshop on Spoken Language Translation (IWSLT
tional Linguistics. 2014), Lake Tahoe, USA.
Marcely Zanon Boito, Fethi Bougares, Florentin Bar-
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa
bier, Souhir Gahbiche, Loı̈c Barrault, Mickael Rou-
Bentivogli, and Marcello Federico. 2016. The
vier, and Yannick Estéve. 2022a. Speech resources
IWSLT 2016 Evaluation Campaign. In Proceedings
in the tamasheq language. Language Resources and
of the 13th International Workshop on Spoken Lan-
Evaluation Conference (LREC).
guage Translation (IWSLT 2016), Seattle, USA.
Marcely Zanon Boito, John Ortega, Hugo Riguidel,
Antoine Laurent, Loı̈c Barrault, Fethi Bougares, Fi- Mingda Chen, Paul-Ambroise Duquenne, Pierre An-
ras Chaabani, Ha Nguyen, Florentin Barbier, Souhir drews, Justine Kao, Alexandre Mourachko, Holger
Gahbiche, and Yannick Estève. 2022b. ON-TRAC Schwenk, and Marta R. Costa-jussà. 2022. Blaser:
Consortium Systems for the IWSLT 2022 Dialect A text-free speech-to-speech translation evaluation
and Low-resource Speech Translation Tasks. In metric.
Proceedings of the 19th International Conference on
Spoken Language Translation (IWSLT). Sanyuan Chen, Chengyi Wang, Zhengyang Chen,
Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki
William Brannon, Yogesh Virkar, and Brian Thomp- Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu,
son. 2023. Dubbing in Practice: A Large Scale Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian,
Study of Human Localization With Insights for Au- Micheal Zeng, and Furu Wei. 2021. Wavlm: Large-
tomatic Dubbing. Transactions of the Association scale self-supervised pre-training for full stack
for Computational Linguistics, 11:419–435. speech processing. IEEE Journal of Selected Top-
ics in Signal Processing, 16:1505–1518.
Eleftheria Briakou, Sweta Agrawal, Joel Tetreault, and
Marine Carpuat. 2021. Evaluating the evaluation Colin Cherry and George Foster. 2019. Thinking slow
metrics for style transfer: A case study in multi- about latency evaluation for simultaneous machine
lingual formality transfer. In Proceedings of the translation. arXiv preprint arXiv:1906.00048.
2021 Conference on Empirical Methods in Natural
Language Processing, pages 1321–1336, Online and Kyunghyun Cho and Masha Esipova. 2016. Can neu-
Punta Cana, Dominican Republic. Association for ral machine translation do simultaneous translation?
Computational Linguistics. arXiv preprint arXiv:1606.02012.
38
Alexandra Chronopoulou, Brian Thompson, Prashant and Speech-to-Speech Translation Tasks. In Pro-
Mathur, Yogesh Virkar, Surafel M. Lakew, and Mar- ceedings of the 20th International Conference on
cello Federico. 2023. Jointly Optimizing Transla- Spoken Language Translation (IWSLT).
tions and Speech Timing to Improve Isochrony in
Automatic Dubbing. ArXiv:2302.12979. Matthias Eck and Chiori Hori. 2005. Overview of the
IWSLT 2005 evaluation campaign. In Proceedings
J. S. Chung and A. Zisserman. 2016. Out of time: au- of the International Workshop on Spoken Language
tomated lip sync in the wild. In Workshop on Multi- Translation, pages 1–22, Pittsburgh, PA.
view Lip-reading, ACCV.
ELRA catalogue. 2016a. Trad pashto broadcast
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, news speech corpus. https://catalogue.
Vishrav Chaudhary, Guillaume Wenzek, Francisco elra.info/en-us/repository/browse/
Guzmán, Edouard Grave, Myle Ott, Luke Zettle- ELRA-S0381/. ISLRN: 918-508-885-913-7,
moyer, and Veselin Stoyanov. 2020. Unsupervised ELRA ID: ELRA-S0381.
cross-lingual representation learning at scale. In ELRA catalogue. 2016b. Trad pashto-french parallel
Proceedings of the 58th Annual Meeting of the Asso- corpus of transcribed broadcast news speech - train-
ciation for Computational Linguistics, pages 8440– ing data. http://catalog.elda.org/en-us/
8451, Online. Association for Computational Lin- repository/browse/ELRA-W0093/. ISLRN:
guistics. 802-643-297-429-4, ELRA ID: ELRA-W0093.
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi
Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Ma, Ahmed El-Kishky, Siddharth Goyal, Man-
Rivera, and Ankur Bapna. 2023. Fleurs: Few-shot deep Baines, Onur Celebi, Guillaume Wenzek,
learning evaluation of universal representations of Vishrav Chaudhary, Naman Goyal, Tom Birch, Vi-
speech. In 2022 IEEE Spoken Language Technol- taliy Liptchinsky, Sergey Edunov, Edouard Grave,
ogy Workshop (SLT), pages 798–805. IEEE. Michael Auli, and Armand Joulin. 2020. Beyond
english-centric multilingual machine translation.
Marta R. Costa-jussà, Marcos Zampieri, and Santanu
Pal. 2018. A neural approach to language variety Marcello Federico, Luisa Bentivogli, Michael Paul,
translation. In Proceedings of the Fifth Workshop and Sebastian Stüker. 2011. Overview of the IWSLT
on NLP for Similar Languages, Varieties and Di- 2011 Evaluation Campaign. In Proceedings of the
alects (VarDial 2018), pages 275–282, Santa Fe, International Workshop on Spoken Language Trans-
New Mexico, USA. Association for Computational lation, pages 11–27, San Francisco, USA.
Linguistics.
Marcello Federico, Mauro Cettolo, Luisa Ben-
Pan Deng, Shihao Chen, Weitai Zhang, Jie Zhang, tivogli, Michael Paul, and Sebastian Stüker. 2012.
and Lirong Dai. 2023. The USTC’s Dialect Speech Overview of the IWSLT 2012 Evaluation Campaign.
Translation System for IWSLT 2023. In Proceed- In Proceedings of the International Workshop on
ings of the 20th International Conference on Spoken Spoken Language Translation, pages 11–27, Hong
Language Translation (IWSLT). Kong, HK.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and
Matteo Negri, and Marco Turchi. 2019a. MuST-C: W. Wang. 2022. Language-agnostic BERT Sentence
a Multilingual Speech Translation Corpus. In Pro- Embedding. In Proceedings of the 60th ACL.
ceedings of the 2019 Conference of the North Amer- Cameron Shaw Fordyce. 2007. Overview of the
ican Chapter of the Association for Computational IWSLT 2007 evaluation campaign. In Proceedings
Linguistics: Human Language Technologies, Vol- of the International Workshop on Spoken Language
ume 1 (Long and Short Papers), pages 2012–2017, Translation, pages 1–12, Trento, Italy.
Minneapolis, Minnesota.
Markus Freitag, George Foster, David Grangier, Viresh
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Ratnakar, Qijun Tan, and Wolfgang Macherey.
Matteo Negri, and Marco Turchi. 2019b. MuST-C: 2021a. Experts, errors, and context: A large-scale
a Multilingual Speech Translation Corpus. In Pro- study of human evaluation for machine translation.
ceedings of the 2019 Conference of the North Amer- Transactions of the Association for Computational
ican Chapter of the Association for Computational Linguistics, 9:1460–1474.
Linguistics: Human Language Technologies, Vol-
ume 1 (Long and Short Papers), pages 2012–2017, Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu
Minneapolis, Minnesota. Association for Computa- Lo, Craig Stewart, George Foster, Alon Lavie, and
tional Linguistics. Ondřej Bojar. 2021b. Results of the WMT21 met-
rics shared task: Evaluating metrics with expert-
Yichao Du, Guo Zhengsheng, Jinchuan Tian, Zhirui based human evaluations on TED and news domain.
Zhang, Xing Wang, Jianwei Yu, Zhaopeng Tu, Tong In Proceedings of the Sixth Conference on Machine
Xu, and Enhong Chen. 2023. The MineTrans Sys- Translation, pages 733–774, Online. Association for
tems for IWSLT 2023 Offline Speech Translation Computational Linguistics.
39
Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Yuka Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson,
Ko, Tomoya Yanagita, Kosuke Doi, Mana Maki- Clara Rivera, Anna Katanova, Alexander Gutkin,
nae, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Isin Demirsahin, Cibu Johny, Martin Jansche,
Nakamura. 2023. NAIST Simultaneous Speech-to- Supheakmungkol Sarin, and Knot Pipatsrisawat.
speech Translation System for IWSLT 2023. In Pro- 2020. Open-source multi-speaker speech cor-
ceedings of the 20th International Conference on pora for building Gujarati, Kannada, Malayalam,
Spoken Language Translation (IWSLT). Marathi, Tamil and Telugu speech synthesis sys-
tems. In Proceedings of the Twelfth Language Re-
Mercedes Garcı́a-Martı́nez, Loı̈c Barrault, and Fethi sources and Evaluation Conference, pages 6494–
Bougares. 2016. Factored neural machine transla- 6503, Marseille, France. European Language Re-
tion architectures. In Proceedings of the 13th Inter- sources Association.
national Conference on Spoken Language Transla-
tion, Seattle, Washington D.C. International Work- Kevin Heffernan, Onur Çelebi, and Holger Schwenk.
shop on Spoken Language Translation. 2022. Bitext mining using distilled sentence rep-
resentations for low-resource languages. In Find-
Jonas Gehring, Michael Auli, David Grangier, Denis ings of the Association for Computational Linguis-
Yarats, and Yann N. Dauphin. 2017. Convolutional tics: EMNLP 2022, pages 2101–2112, Abu Dhabi,
sequence to sequence learning. United Arab Emirates. Association for Computa-
tional Linguistics.
Karthik Gopalakrishnan, Behnam Hedayatnia, Qin-
lang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Carlos Daniel Hernandez Mena, Albert Gatt, Andrea
Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. DeMarco, Claudia Borg, Lonneke van der Plas,
2019. Topical-Chat: Towards knowledge-grounded Amanda Muscat, and Ian Padovani. 2020. MASRI-
open-domain conversations. In Proc. Interspeech HEADSET: A Maltese corpus for speech recogni-
2019, pages 1891–1895. tion. In Proceedings of the Twelfth Language Re-
sources and Evaluation Conference, pages 6381–
Edward Gow-Smith, Alexandre Berard, 6388, Marseille, France. European Language Re-
Marcely Zanon Boito, and Ioan Calapodescu. sources Association.
2023. NAVER LABS Europe’s Multilingual
Speech Translation Systems for the IWSLT 2023 Oleksii Hrinchuk, Vladimir Bataev, Evelina Bakhtu-
Low-Resource Track. In Proceedings of the 20th rina, and Boris Ginsburg. 2023. NVIDIA NeMo Of-
International Conference on Spoken Language fline Speech Translation Systems for IWSLT 2023.
Translation (IWSLT). In Proceedings of the 20th International Conference
on Spoken Language Translation (IWSLT).
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-
Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Kr- Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert
ishnan, Marc’Aurelio Ranzato, Francisco Guzmán, Tsai, Kushal Lakhotia, Ruslan Salakhutdinov,
and Angela Fan. 2022. The Flores-101 evaluation and Abdelrahman Mohamed. 2021. Hubert:
benchmark for low-resource and multilingual ma- Self-supervised speech representation learn-
chine translation. Transactions of the Association ing by masked prediction of hidden units.
for Computational Linguistics, 10:522–538. IEEE/ACM Trans. Audio, Speech and Lang.
Proc., 29:3451–3460.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki
Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Chenxu Hu, Qiao Tian, Tingle Li, Wang Yuping, Yux-
Zhengdong Zhang, Yonghui Wu, and Ruoming uan Wang, and Hang Zhao. 2021. Neural dubber:
Pang. 2020. Conformer: Convolution-augmented Dubbing for videos according to scripts. In Thirty-
transformer for speech recognition. Interspeech, Fifth Conference on Neural Information Processing
pages 5036–5040. Systems.
Jiaxin Guo, Daimeng Wei, Zhanglin Wu, Zongyao Li, Wuwei Huang, Mengge Liu, Xiang Li, Yanzhi Tian,
Zhiqiang Rao, Minghan Wang, Hengchao Shang, Fengyu Yang, Wen Zhang, Jian Luan, Bin Wang,
Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Yuhang Guo, and Jinsong Su. 2023. The Xiaomi
Lizhi Lei, and Hao Yang. 2023. The HW-TSC’s Si- AI Lab’s Speech Translation Systems for IWSLT
multaneous Speech-to-Text Translation system for 2023 Offline Task, Simultaneous Task and Speech-
IWSLT 2023 evaluation. In Proceedings of the to-Speech Task. In Proceedings of the 20th Interna-
20th International Conference on Spoken Language tional Conference on Spoken Language Translation
Translation (IWSLT). (IWSLT).
Yuchen Han, Xiaoqian Liu, Hao Chen, Yuhao Zhang, Amir Hussein, Cihan Xiao, Neha Verma, Matthew
Chen Xu, Tong Xiao, and Jingbo Zhu. 2023. The Wiesner, Thomas Thebaud, and Sanjeev Khudanpur.
NiuTrans End-to-End Speech Translation System 2023. JHU IWSLT 2023 Dialect Speech Translation
for IWSLT23 English-to-Chinese Offline Task. In System Description. In Proceedings of the 20th In-
Proceedings of the 20th International Conference on ternational Conference on Spoken Language Trans-
Spoken Language Translation (IWSLT). lation (IWSLT).
40
Muhammad Huzaifah, Kye Min Tan, and Richeng Guillaume Klein, Yoon Kim, Yuntian Deng, Jean
Duan. 2023. I2R’s End-to-End Speech Translation Senellart, and Alexander Rush. 2017. OpenNMT:
System for IWSLT 2023 Offline Shared Task. In Open-source toolkit for neural machine translation.
Proceedings of the 20th International Conference on In Proceedings of ACL 2017, System Demonstra-
Spoken Language Translation (IWSLT). tions, pages 67–72, Vancouver, Canada. Association
for Computational Linguistics.
Hirofumi Inaguma, Brian Yan, Siddharth Dalmia,
Pengcheng Guo, Jiatong Shi, Kevin Duh, and Shinji Surafel M Lakew, Yogesh Virkar, Prashant Mathur,
Watanabe. 2021. ESPnet-ST IWSLT 2021 Offline and Marcello Federico. 2021. Isometric mt: Neural
Speech Translation System. In Proceedings of the machine translation for automatic dubbing. arXiv
18th International Conference on Spoken Language preprint arXiv:2112.08682.
Translation (IWSLT).
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Surafel Melaku Lakew, Mattia Di Gangi, and Marcello
Javier Jorge, Nahuel Roselló, Adrià Giménez, Al- Federico. 2019. Controlling the output length of
bert Sanchis, Jorge Civera, and Alfons Juan. 2020. neural machine translation. In Proc. IWSLT.
Europarl-st: A multilingual corpus for speech trans-
lation of parliamentary debates. In Proc. of 45th Intl. Antoine Laurent, Souhir Gahbiche, Ha Nguyen,
Conf. on Acoustics, Speech, and Signal Process- Haroun Elleuch, Fethi Bougares, Antoine Thiol,
ing (ICASSP 2020), pages 8229–8233, Barcelona Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière,
(Spain). Lucas Maison, Sameer Khurana, and Yannick
Estève. 2023. ON-TRAC consortium systems for
Dávid Javorský, Dominik Macháček, and Ondřej Bo- the IWSLT 2023 dialectal and low-resource speech
jar. 2022. Continuous rating as reliable human translation tasks. In Proceedings of the 20th Inter-
evaluation of simultaneous speech translation. In national Conference on Spoken Language Transla-
Proceedings of the Seventh Conference on Machine tion (IWSLT).
Translation (WMT), pages 154–164, Abu Dhabi,
United Arab Emirates (Hybrid). Association for Seugnjun Lee, Hyeonseok Moon, Chanjun Park,
Computational Linguistics. and Heuiseok Lim. 2023. Improving Formality-
Sensitive Machine Translation using Data-Centric
Japan Translation Federation JTF. 2018. JTF Transla- Approaches and Prompt Engineering. In Proceed-
tion Quality Evaluation Guidelines, 1st Edition (in ings of the 20th International Conference on Spoken
Japanese). Language Translation (IWSLT).
Yasumasa Kano, Katsuhito Sudoh, and Satoshi Naka-
mura. 2023. Average Token Delay: A Latency Met- Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao,
ric for Simultaneous Translation. In Proceedings of Guo JiaXin, Daimeng Wei, Hengchao Shang, Wang
Interspeech 2023. To appear. Minghan, Xiaoyu Chen, Zhengzhe YU, Li Shao-
Jun, Lei LiZhi, and Hao Yang. 2023. HW-TSC at
Alina Karakanta, Luisa Bentivogli, Mauro Cettolo, IWSLT2023: Break the Quality Ceiling of Offline
Matteo Negri, and Marco Turchi. 2022a. Post- Track via Pre-Training and Domain Adaptation. In
editing in automatic subtitling: A subtitlers’ per- Proceedings of the 20th International Conference on
spective. In Proceedings of the 23rd Annual Con- Spoken Language Translation (IWSLT).
ference of the European Association for Machine
Translation, pages 261–270, Ghent, Belgium. Euro- Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu
pean Association for Machine Translation. Wang, Shuohui Chen, Daniel Simig, Myle Ott, Na-
man Goyal, Shruti Bhosale, Jingfei Du, et al. 2021.
Alina Karakanta, François Buet, Mauro Cettolo, and Few-shot learning with multilingual language mod-
François Yvon. 2022b. Evaluating subtitle seg- els. arXiv preprint arXiv:2112.10668.
mentation for end-to-end generation systems. In
Proceedings of the Thirteenth Language Resources
Pierre Lison and Jörg Tiedemann. 2016. OpenSub-
and Evaluation Conference, pages 3069–3078, Mar-
titles2016: Extracting large parallel corpora from
seille, France. European Language Resources Asso-
movie and TV subtitles. In Proceedings of the Tenth
ciation.
International Conference on Language Resources
Santosh Kesiraju, Karel Beneš, Maksim Tikhonov, and and Evaluation (LREC’16), pages 923–929, Por-
Jan Černocký. 2023. BUT Systems for IWSLT 2023 torož, Slovenia. European Language Resources As-
Marathi - Hindi Low Resource Speech Translation sociation (ELRA).
Task. In Proceedings of the 20th International Con-
ference on Spoken Language Translation (IWSLT). Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz
Ugan, Ngoc-Quan Pham, Tuan Nam Nguyen,
Sameer Khurana, Antoine Laurent, and James Glass. Tu Anh Dinh, Carlos Mullov, Alexander Waibel,
2022. Samu-xlsr: Semantically-aligned multimodal and Jan Niehues. 2023. KIT’s Multilingual Speech
utterance-level cross-lingual speech representation. Translation System for IWSLT 2023. In Proceed-
IEEE Journal of Selected Topics in Signal Process- ings of the 20th International Conference on Spoken
ing, pages 1–13. Language Translation (IWSLT).
41
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey In Proceedings of the Second International Work-
Edunov, Marjan Ghazvininejad, Mike Lewis, and shop on Spoken Language Translation, Pittsburgh,
Luke Zettlemoyer. 2020. Multilingual denoising Pennsylvania, USA.
pre-training for neural machine translation. Trans-
actions of the Association for Computational Lin- Evgeny Matusov, Patrick Wilken, and Yota Geor-
guistics, 8:726–742. gakopoulou. 2019. Customizing neural machine
translation for subtitling. In Proceedings of the
Arle Lommel, Hans Uszkoreit, and Aljoscha Bur- Fourth Conference on Machine Translation (Volume
chardt. 2014. Multidimensional Quality Met- 1: Research Papers), pages 82–93, Florence, Italy.
rics (MQM): A Framework for Declaring and Association for Computational Linguistics.
DescribingTranslation Quality Metrics. Revista
Tradumàtica: tecnologies de la traducció, 12:455– Jonathan Mbuya and Antonios Anastasopoulos. 2023.
463. GMU Systems for the IWSLT 2023 Dialect and
Low-resource Speech Translation Tasks. In Pro-
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, ceedings of the 20th International Conference on
Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Spoken Language Translation (IWSLT).
Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and
Haifeng Wang. 2019. STACL: Simultaneous trans- Maria Nadejde, Anna Currey, Benjamin Hsu, Xing
lation with implicit anticipation and controllable la- Niu, Marcello Federico, and Georgiana Dinu. 2022.
tency using prefix-to-prefix framework. In Proceed- CoCoA-MT: A dataset and benchmark for con-
ings of the 57th Annual Meeting of the Association trastive controlled MT with application to formality.
for Computational Linguistics, pages 3025–3036, In Findings of the Association for Computational
Florence, Italy. Association for Computational Lin- Linguistics: NAACL 2022, pages 616–632, Seattle,
guistics. United States. Association for Computational Lin-
guistics.
Shuming Ma, Li Dong, Shaohan Huang, Dong-
dong Zhang, Alexandre Muzio, Saksham Singhal, J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi,
Hany Hassan Awadalla, Xia Song, and Furu Wei. T. Ha, E. Salesky, R. Sanabria, L. Barrault, L. Spe-
2021. DeltaLM: Encoder-decoder pre-training for cia, and M. Federico. 2019. The IWSLT 2019 Eval-
language generation and translation by augmenting uation Campaign. In Proceedings of the 16th Inter-
pretrained multilingual encoders. arXiv. national Workshop on Spoken Language Translation
(IWSLT 2019), Hong Kong, China.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang,
Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An
Jan Niehues, Roldano Cattoni, Sebastian Stüker,
evaluation toolkit for simultaneous translation. In
Mauro Cettolo, Marco Turchi, and Marcello Fed-
Proceedings of the 2020 Conference on Empirical
erico. 2018. The IWSLT 2018 Evaluation Cam-
Methods in Natural Language Processing: System
paign. In Proceedings of the 15th International
Demonstrations, pages 144–150, Online. Associa-
Workshop on Spoken Language Translation (IWSLT
tion for Computational Linguistics.
2018), pages 2–6, Bruges, Belgium.
Xutai Ma, Juan Pino, and Philipp Koehn. 2020b.
SimulMT to SimulST: Adapting simultaneous text Xing Niu, Marianna Martindale, and Marine Carpuat.
translation to end-to-end simultaneous speech trans- 2017. A study of style in machine translation: Con-
lation. In Proceedings of the 1st Conference of the trolling the formality of machine translation output.
Asia-Pacific Chapter of the Association for Compu- In Proceedings of the 2017 Conference on Empiri-
tational Linguistics and the 10th International Joint cal Methods in Natural Language Processing, pages
Conference on Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for
582–587, Suzhou, China. Association for Computa- Computational Linguistics.
tional Linguistics.
NLLB Team, Marta R. Costa-jussà, James Cross,
Dominik Macháček, Ondřej Bojar, and Raj Dabre. Onur Çelebi, Maha Elbayad, Kenneth Heafield,
2023. MT Metrics Correlate with Human Ratings Kevin Heffernan, Elahe Kalbassi, Janice Lam,
of Simultaneous Speech Translation. In Proceed- Daniel Licht, Jean Maillard, Anna Sun, Skyler
ings of the 20th International Conference on Spoken Wang, Guillaume Wenzek, Al Youngblood, Bapi
Language Translation (IWSLT). Akula, Loic Barrault, Gabriel Mejia-Gonzalez,
Prangthip Hansanti, John Hoffman, Semarley Jar-
Evgeny Matusov, Gregor Leusch, Oliver Bender, and rett, Kaushik Ram Sadagopan, Dirk Rowe, Shan-
Hermann Ney. 2005a. Evaluating machine transla- non Spruit, Chau Tran, Pierre Andrews, Necip Fazil
tion output with automatic sentence segmentation. Ayan, Shruti Bhosale, Sergey Edunov, Angela
In Proc. of the International Workshop on Spoken Fan, Cynthia Gao, Vedanuj Goswami, Francisco
Language Translation (IWSLT), pages 138–144. Guzmán, Philipp Koehn, Alexandre Mourachko,
Christophe Ropers, Safiyyah Saleem, Holger
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Schwenk, and Jeff Wang. 2022. No language left be-
Hermann Ney. 2005b. Evaluating machine transla- hind: Scaling human-centered machine translation.
tion output with automatic sentence segmentation. arXiv preprint.
42
John E Ortega, Richard Castro Mamani, and Michael Paul, Marcello Federico, and Sebastian Stüker.
Kyunghyun Cho. 2020. Neural machine translation 2010. Overview of the IWSLT 2010 Evaluation
with a polysynthetic low resource language. Ma- Campaign. In Proceedings of the International
chine Translation, 34(4):325–346. Workshop on Spoken Language Translation, pages
3–27, Paris, France.
John E. Ortega, Rodolfo Zevallos, and William Chen.
2023. QUESPA Submission for the IWSLT 2023 Simone Perone. 2023. Matesub: the Translated Sub-
Dialect and Low-resource Speech Translation Tasks. titling Tool at the IWSLT2023 Subtitling task. In
In Proceedings of the 20th International Conference Proceedings of the 20th International Conference on
on Spoken Language Translation (IWSLT). Spoken Language Translation (IWSLT).
Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Peter Polák, Danni Liu, Ngoc-Quan Pham, Jan
Mathur, Alexandra Chronopoulou, and Marcello Niehues, Alexander Waibel, and Ondřej Bojar. 2023.
Federico. 2023. Improving isochronous machine Towards Efficient Simultaneous Speech Transla-
translation with target factors and auxiliary counters. tion: CUNI-KIT System for Simultaneous Track at
IWSLT 2023. In Proceedings of the 20th Interna-
Sara Papi, Marco Gaido, and Matteo Negri. 2023. Di- tional Conference on Spoken Language Translation
rect Models for Simultaneous Translation and Auto- (IWSLT).
matic Subtitling: FBK@IWSLT2023. In Proceed-
ings of the 20th International Conference on Spoken Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen,
Language Translation (IWSLT). Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bo-
jar, and Alexander Waibel. 2022. CUNI-KIT system
Sara Papi, Marco Gaido, Matteo Negri, and Marco for simultaneous speech translation task at IWSLT
Turchi. 2022. Over-generation cannot be rewarded: 2022. In Proceedings of the 19th International Con-
Length-adaptive average lagging for simultaneous ference on Spoken Language Translation (IWSLT
speech translation. In Proceedings of the Third 2022), pages 277–285, Dublin, Ireland (in-person
Workshop on Automatic Simultaneous Translation, and online). Association for Computational Linguis-
pages 12–17, Online. Association for Computational tics.
Linguistics.
Maja Popović. 2015a. chrF: character n-gram F-score
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- for automatic MT evaluation. In Proceedings of the
Jing Zhu. 2002a. Bleu: a method for automatic eval- Tenth Workshop on Statistical Machine Translation,
uation of machine translation. In Proceedings of pages 392–395, Lisbon, Portugal. Association for
the 40th annual meeting on association for compu- Computational Linguistics.
tational linguistics. Association for Computational Maja Popović. 2015b. chrf: character n-gram f-score
Linguistics. for automatic mt evaluation. In Proceedings of the
tenth workshop on statistical machine translation,
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
pages 392–395.
Jing Zhu. 2002b. Bleu: a method for automatic eval-
uation of machine translation. In Proceedings of the Matt Post. 2018. A call for clarity in reporting BLEU
40th Annual Meeting of the Association for Com- scores. In Proceedings of the Third Conference on
putational Linguistics, pages 311–318, Philadelphia, Machine Translation: Research Papers, pages 186–
Pennsylvania, USA. Association for Computational 191, Brussels, Belgium. Association for Computa-
Linguistics. tional Linguistics.
Daniel S. Park, William Chan, Yu Zhang, Chung- Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai,
Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchin-
Quoc V. Le. 2019. SpecAugment: A Simple sky, and Ronan Collobert. 2019. Wav2letter++:
Data Augmentation Method for Automatic Speech A fast open-source speech recognition system. In
Recognition. Interspeech 2019. ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
Michael Paul. 2006. Overview of the IWSLT 2006 (ICASSP), pages 6460–6464. IEEE.
Evaluation Campaign. In Proceedings of the In-
ternational Workshop on Spoken Language Trans- Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
lation, pages 1–15, Kyoto, Japan. man, Christine McLeavey, and Ilya Sutskever. 2022.
Robust speech recognition via large-scale weak su-
Michael Paul. 2008. Overview of the IWSLT 2008 pervision.
Evaluation Campaign. In Proceedings of the In-
ternational Workshop on Spoken Language Trans- Balaji Radhakrishnan, Saurabh Agrawal, Raj Prakash
lation, pages 1–17, Waikiki, Hawaii. Gohil, Kiran Praveen, Advait Vinay Dhopesh-
warkar, and Abhishek Pandey. 2023. SRI-B’s sys-
Michael Paul. 2009. Overview of the IWSLT 2009 tems for IWSLT 2023 Dialectal and Low-resource
Evaluation Campaign. In Proceedings of the In- track: Marathi-Hindi Speech Translation. In Pro-
ternational Workshop on Spoken Language Trans- ceedings of the 20th International Conference on
lation, pages 1–18, Tokyo, Japan. Spoken Language Translation (IWSLT).
43
Zhiqiang Rao, Hengchao Shang, Jinlong Yang, Daimeng Wei, Zongyao Li, Lizhi Lei, and Hao Yang. 2023. Length-Aware NMT and Adaptive Duration for Automatic Dubbing. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ricardo Rei, José GC de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André FT Martins. 2022. Comet-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020a. COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020b. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2022. FastSpeech 2: Fast and high-quality end-to-end text to speech.
Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2014. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC.
Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and Jan Niehues. 2023. Evaluating Multilingual Speech Translation Under Realistic Conditions with Resegmentation and Terminology. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, and Matt Post. 2021. The Multilingual TEDx Corpus for Speech Recognition and Translation. In Proc. Interspeech 2021, pages 3655–3659.
Gabriele Sarti, Phu Mon Htut, Xing Niu, Benjamin Hsu, Anna Currey, Georgiana Dinu, and Maria Nadejde. 2023. RAMP: Retrieval and attribute-marking enhanced prompting for attribute-controlled translation.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online. Association for Computational Linguistics.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
Hengchao Shang, Zhiqiang Rao, Zongyao Li, Zhanglin Wu, Jiaxin Guo, Minghan Wang, Daimeng Wei, Shaojun Li, Zhengzhe Yu, Xiaoyu Chen, Lizhi Lei, and Hao Yang. 2023. The HW-TSC's Simultaneous Speech-to-Speech Translation system for IWSLT 2023 evaluation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2020. AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines. arXiv preprint arXiv:2010.11567.
Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, and Guoqing Zhao. 2023. The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Ioannis Tsiamas, Gerard I. Gállego, Jose Fonollosa, and Marta R. Costa-jussà. 2023. Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. In Proc. Interspeech 2022, pages 106–110.
Priyesh Vakharia, Shree Vignesh S, Pranjali Basmatkar, and Ian Lane. 2023. Low-Resource Formality Controlled NMT Using Pre-trained LM. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of NIPS 2017.
44
Akshaya Vishnu Kudlu Shanbhogue, Ran Xue, Soumya Saha, Daniel Zhang, and Ashwinkumar Ganesan. 2023. Amazon Alexa AI's Low-Resource Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020a. CoVoST: A diverse multilingual speech-to-text translation corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4197–4203.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020b. fairseq S2T: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171.
Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. 2021. CoVoST 2 and Massively Multilingual Speech Translation. In Proc. Interspeech 2021, pages 2247–2251.
Minghan Wang, Yinglu Li, Jiaxin Guo, Zongyao Li, Hengchao Shang, Daimeng Wei, Min Zhang, Shimin Tao, and Hao Yang. 2023a. The HW-TSC's Speech-to-Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhipeng Wang, Yuhang Guo, and Shuoying Chen. 2023b. BIT's System for Multilingual Track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Patrick Wilken, Panayota Georgakopoulou, and Evgeny Matusov. 2022. SubER - a metric for automatic evaluation of subtitle quality. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Aiden Williams. 2022. The applicability of Wav2Vec 2.0 for low-resource Maltese ASR. B.S. thesis, University of Malta.
Aiden Williams, Kurt Abela, Rishu Kumar, Martin Bär, Hannah Billinghurst, Kurt Micallef, Ahnaf Mozib Samin, Andrea DeMarco, Lonneke van der Plas, and Claudia Borg. 2023. UM-DFKI Maltese Speech Translation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen, Zhiqiang Rao, Zhengzhe YU, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei, Jiawei Zheng, Ming Zhu, Lizhi Lei, Hao Yang, and Yanfei Jiang. 2023. Improving Neural Machine Translation Formality Control with Domain Adaptation and Reranking-based Transductive Learning. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhihang Xie. 2023. The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Henry Li Xinyuan, Neha Verma, Bismarck Bamfo Odoom, Ujvala Pradeep, Matthew Wiesner, and Sanjeev Khudanpur. 2023. JHU IWSLT 2023 Multilingual Speech Translation System Description. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, shen huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders.
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, and William Yang Wang. 2022. SEScore2: Retrieval augmented pretraining for text generation evaluation. arXiv preprint arXiv:2212.09305.
Brian Yan, Jiatong Shi, Soumi Maiti, William Chen, Xinjian Li, Yifan Peng, Siddhant Arora, and Shinji Watanabe. 2023. CMU's IWSLT 2023 Simultaneous Speech Translation System. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhengdong Yang, Shuichiro Shimizu, Sheng Li, Wangjin Zhou, and Chenhui Chu. 2023. The Kyoto Speech-to-Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. 2021. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv preprint arXiv:2102.01547.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2023. GigaST: A 10,000-hour pseudo speech translation corpus. In Interspeech 2023.
Xinyuan Zhou, Jianwei Cui, Zhongyi Ye, Yichi Wang, Luzhen Xu, Hanyi Zhang, Weitai Zhang, and Lirong Dai. 2023. Submission of USTC's system for the IWSLT 2023 - Offline Speech Translation Track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Adrian Łańcucki. 2021. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6588–6592, Toronto, Canada. IEEE.
45
Appendix A. Human Evaluation
46
A Human Evaluation
Human evaluation was carried out for the Simultaneous and Offline SLT shared tasks. At the time of
writing, only the former evaluation has been completed, and it is reported here. The human evaluation of
the Offline task will be reported at the conference and possibly in an updated version of this report.
47
2018), distributed by the Japan Translation Federation (JTF). The guidelines are based on MQM but include
some modifications that account for properties of the Japanese language.
We hired a Japanese-native professional interpreter as the evaluator, whereas last year's evaluator was a
translator (Anastasopoulos et al., 2022a). The evaluator checked translation hypotheses along with their
source speech transcripts and chose the corresponding error category and severity for each translation
hypothesis using a spreadsheet. Here, we asked the evaluator to focus only on Accuracy and Fluency errors,
because other error types (Terminology, Style, and Locale convention) are less critical in the evaluation
of simultaneous translation. Finally, we calculated the cumulative error score for each system based on
the error weighting presented by Freitag et al. (2021a), where Critical and Major errors are not distinguished.
48
Appendix B. Automatic Evaluation Results and Details
49
B.1 Offline SLT
⋅ Systems are ordered according to the BLEU score computed on the concatenation of the three test sets (Joint BLEU, third column).
⋅ The “D” column indicates the data condition in which each submitted run was trained, namely: Constrained (C), Constrained+LLM (C+), Unconstrained (U).
⋅ For the BLEU scores computed on the TED test set, “Orig” and “New” respectively indicate the results computed on the original (subtitle-like) TED translations and the unconstrained (exact, more literal) translations as references.
⋅ Direct systems are indicated by gray background.
⋅ “*” indicates a late submission.
⋅ “+” indicates an unofficial submission.
System D | Joint: BLEU, COMET | TED: BLEU (New, Orig, Both), COMET (New, Orig) | ACL: BLEU, COMET | EPTV: BLEU, COMET
HW-TSC C 32.4 0.8213 34.8 30.2 42.1 0.8327 0.8208 38.1 0.8090 16.7 0.3829
HW-TSC U 32.3 0.8209 34.9 30.9 42.4 0.8331 0.8223 36.9 0.8073 16.9 0.3819
HW-TSC C+ 31.9 0.8210 34.4 30.6 41.9 0.8332 0.8230 37.2 0.8063 16.8 0.3823
NeuroDub+ U 30.4 0.8089 31.8 25.8 38.5 0.8205 0.8082 41.1 0.7956 15.4 0.3784
NEMO C 28.5 0.7759 30.5 26.4 37.7 0.7977 0.7871 31.9 0.7171 15.6 0.3680
UPC C+ 27.9 0.7892 29.8 25.5 36.6 0.8098 0.7985 32.1 0.7473 15.6 0.3746
I2R C+ 22.4 0.7070 24.0 20.3 29.5 0.7248 0.7172 23.9 0.6841 13.3 0.3506
BIGAI∗ C+ 20.3 0.6945 22.3 19.3 27.4 0.7128 0.7055 19.6 0.6295 11.5 0.3555
Table 14: Official results of the automatic evaluation for the Offline Speech Translation Task, English to German.
Table 15: Official results of the automatic evaluation for the Offline Speech Translation Task, English to Japanese.
Table 16: Official results of the automatic evaluation for the Offline Speech Translation Task, English to Chinese.
50
B.2 Simultaneous SLT
Table 17: Simultaneous Speech-to-Text Translation, English to German. Except for AP, the latency is measured in
seconds. Numbers in brackets are computation aware latency.
Table 18: Simultaneous Speech-to-Text Translation, English to Chinese. Except for AP, the latency is measured in
seconds. Numbers in brackets are computation aware latency.
Table 19: Simultaneous Speech-to-Text Translation, English to Japanese. Except for AP, the latency is measured
in seconds. Numbers in brackets are computation aware latency.
51
Target Language Team ASR-BLEU BLASER Start Offset End Offset ATD
German CMU 22.62 0.122 2.37 5.21 4.22
German HW-TSC 19.74 -0.442 2.04 5.09 3.75
Japanese HW-TSC 15.53 -1.70 2.37 3.48 3.56
Japanese NAIST 10.19 -1.68 2.58 4.32 3.49
Chinese HW-TSC 31.68 -0.696 1.92 3.12 3.23
Table 20: Simultaneous Speech-to-Speech from English Speech. The latency is measured in seconds. The BLEU
scores are computed based on transcript from the default Whisper (Radford et al., 2022) ASR model for each
language direction.
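A minimal sketch of the ASR-BLEU protocol described in the caption: transcribe the synthesized target speech with Whisper and score the transcripts against the text references with SacreBLEU. The file names, example references, and the choice of Whisper model size are illustrative assumptions, not the organizers' exact setup.

```python
# Sketch of ASR-BLEU for speech-to-speech output (English->Chinese direction).
import whisper
from sacrebleu.metrics import BLEU

asr = whisper.load_model("large")  # model size is an assumption

hyp_audio_files = ["sys_output_0001.wav", "sys_output_0002.wav"]  # hypothetical paths
references = ["参考译文一", "参考译文二"]                              # hypothetical references

# Transcribe each synthesized output speech segment.
transcripts = [asr.transcribe(path, language="zh")["text"] for path in hyp_audio_files]

# Character-level tokenization for Chinese BLEU.
bleu = BLEU(tokenize="zh")
print(bleu.corpus_score(transcripts, [references]).score)
```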
Common Non-native
Number of audios 42 43
Mean audio length (seconds) 400.3 208.8
Mean ratings per audio 65.6 36.5
Table 21: Human evaluation for the English-to-German task on two test sets: the Common one (used also in
automatic scoring) and the Non-native one. We show the size of the test sets, and the number of ratings collected.
On average, our annotators provide a quality judgement every 6 seconds.
Common Non-native
CUNI-KIT 3.10 3.04→3.16 1.63 1.54→1.72
FBK 3.08 3.02→3.14 1.26 1.20→1.30
HWTSC 2.91 2.85→2.98 2.04 1.92→2.15
NAIST 2.84 2.78→2.91 2.27 2.18→2.34
CMU 2.79 2.72→2.87 2.38 2.30→2.46
Interpreter – 2.79 2.71→2.87
Table 22: Human evaluation results for English-to-German Simultaneous task on the 1–5 (worst-to-best) scale,
with 95% confidence intervals. We calculate a mean score for each annotated audio file, then a mean across
annotators (for each audio), then a mean across all audio files for each system. To compute confidence intervals,
we take the scores for annotated audios, perform 10,000x bootstrap resampling, compute the mean score for each
resample, then compute [2.5, 97.5] percentiles across the resampled means.
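The confidence-interval procedure from the caption of Table 22 can be sketched as follows, assuming the per-audio scores (already averaged across annotators) are available as a list; the example numbers are hypothetical.

```python
# Sketch of the bootstrap CI described above: resample per-audio scores with
# replacement 10,000 times and take [2.5, 97.5] percentiles of the resampled means.
import numpy as np

def bootstrap_ci(per_audio_scores, n_resamples=10_000, seed=0):
    scores = np.asarray(per_audio_scores, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    resampled_means = scores[idx].mean(axis=1)
    return np.percentile(resampled_means, [2.5, 97.5])

# Hypothetical per-audio means for one system.
per_audio = [3.2, 2.9, 3.4, 3.0, 3.1, 2.8]
low, high = bootstrap_ci(per_audio)
print(f"mean={np.mean(per_audio):.2f}  95% CI=[{low:.2f}, {high:.2f}]")
```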
Table 23: Human evaluation results on two talks (107 lines) in the English-to-Japanese Simultaneous speech-to-
text translation task. Error weights are 5 for Critical and Major errors and 1 for Minor errors.
52
B.3 Automatic Subtitling
Team Condition System Domain SubER Sigma BLEU ChrF BLEURT CPS CPL LPB
(Subtitle quality: SubER, Sigma; Translation quality: BLEU, ChrF, BLEURT; Subtitle compliance: CPS, CPL, LPB)
AppTek U prmry ALL 70.64 73.35 15.38 38.36 .4376 87.74 100.00 100.00
ted 59.72 74.33 23.74 49.14 .5683 92.58 100.00 100.00
eptv 73.98 67.09 15.81 45.21 .5229 86.65 100.00 100.00
pltn 77.63 72.79 10.47 33.18 .4069 88.98 100.00 100.00
itv 69.83 74.48 14.43 35.27 .4028 86.01 100.00 100.00
Matesub U prmry ALL 75.41 65.22 14.81 39.50 .4591 84.97 99.25 100.00
ted 67.70 62.01 20.37 50.05 .5500 90.55 98.61 100.00
eptv 87.04 57.73 12.08 43.59 .4705 88.59 99.20 100.00
pltn 79.72 68.27 10.06 34.46 .4264 89.17 99.29 100.00
itv 73.11 67.04 14.92 37.13 .4501 80.21 99.47 100.00
AppTek C prmry ALL 77.05 72.50 12.74 34.31 .3420 93.35 100.00 100.00
ted 59.61 74.29 26.78 50.93 .5539 97.33 100.00 100.00
eptv 76.25 68.49 14.43 42.37 .4604 95.76 100.00 100.00
pltn 80.72 69.56 9.40 31.20 .3419 93.45 100.00 100.00
itv 80.87 72.62 9.08 27.74 .2612 91.14 100.00 100.00
FBK C prmry ALL 79.70 75.73 11.22 33.32 .3172 69.98 83.50 99.98
ted 63.85 76.79 21.48 50.31 .5511 71.39 79.83 100.00
eptv 79.76 69.04 13.20 42.69 .4722 74.95 82.08 99.91
pltn 83.71 74.02 7.73 30.17 .3137 70.02 84.20 99.96
itv 82.67 77.17 8.05 26.10 .2255 67.75 85.12 100.00
AppTek C cntrstv ALL 83.53 70.39 9.73 30.51 .2914 89.60 100.00 100.00
ted 68.47 72.97 19.07 46.17 .4921 90.53 100.00 100.00
eptv 81.69 66.36 11.46 39.25 .4150 94.57 100.00 100.00
pltn 86.37 69.79 7.08 27.89 .2780 91.50 100.00 100.00
itv 87.25 68.29 6.70 23.85 .2204 86.85 100.00 100.00
Table 24: Automatic evaluation results for the Subtitling Task: en→de. C and U stand for constrained and uncon-
strained training condition, respectively; prmry and cntrstv for primary and contrastive systems.
Team Condition System Domain SubER Sigma BLEU ChrF BLEURT CPS CPL LPB
(Subtitle quality: SubER, Sigma; Translation quality: BLEU, ChrF, BLEURT; Subtitle compliance: CPS, CPL, LPB)
Matesub U prmry ALL 68.11 68.37 22.34 47.38 .5059 86.07 99.52 100.00
ted 45.94 66.85 40.36 65.72 .7047 92.62 99.48 100.00
eptv 74.47 59.59 21.06 54.11 .5728 90.15 99.44 100.00
pltn 74.87 70.99 15.96 41.86 .4666 88.27 99.60 100.00
itv 71.25 71.06 18.50 41.07 .4592 81.93 99.51 100.00
AppTek C prmry ALL 71.68 74.99 18.67 40.21 .3637 95.42 100.00 100.00
ted 45.81 74.50 39.37 62.11 .6562 97.20 100.00 100.00
eptv 66.60 73.31 23.57 51.94 .5379 96.27 100.00 100.00
pltn 76.00 74.63 14.03 36.95 .3664 95.18 100.00 100.00
itv 80.20 75.90 11.37 29.75 .2487 94.67 100.00 100.00
FBK C prmry ALL 73.31 74.44 17.79 39.54 .3419 77.00 91.34 99.99
ted 45.68 74.31 40.21 65.09 .6737 78.95 88.14 100.00
eptv 68.47 69.63 23.92 52.19 .5490 79.81 88.05 100.00
pltn 78.45 75.78 12.84 35.89 .3513 77.79 92.67 99.96
itv 82.00 76.16 9.33 27.14 .2063 74.67 92.94 100.00
Table 25: Automatic evaluation results for the Subtitling Task: en→es. Legenda in Table 24.
53
B.4 Multilingual Speech Translation
Below we show the Multilingual task (§5) results and overall rankings, ordered according to the
average chrF across all 10 target languages after resegmentation to the reference translations.
We also compare to the Offline submissions on the ACL 60-60 evaluation set
on the 3 language pairs used for the Offline task.
Finally, we show the scores for each metric (chrF, COMET, BLEU) per language pair for all systems.
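As a sketch of how this ranking can be reproduced once system outputs have been resegmented to the reference segmentation (e.g. with mwerSegmenter, as in the Offline task), each language pair can be scored with chrF via SacreBLEU and the scores averaged across the ten pairs; the file naming below is hypothetical.

```python
# Sketch: average chrF over the ten target languages, assuming hypothesis files
# have already been resegmented to match the reference segmentation.
from pathlib import Path
from sacrebleu.metrics import CHRF

LANGS = ["ar", "de", "fa", "fr", "ja", "nl", "pt", "ru", "tr", "zh"]
chrf = CHRF()

def read_lines(path):
    return Path(path).read_text(encoding="utf-8").splitlines()

scores = {}
for lang in LANGS:
    hyps = read_lines(f"system_output.reseg.{lang}")  # hypothetical paths
    refs = read_lines(f"reference.{lang}")
    scores[lang] = chrf.corpus_score(hyps, [refs]).score

print({lang: round(score, 1) for lang, score in scores.items()})
print("average chrF:", round(sum(scores.values()) / len(scores), 1))
```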
Table 26: Overall task ranking with metrics averaged across all ten language pairs on the evaluation set.
We show the official task metric (chrF) as well as the unofficial metrics (COMET, BLEU, and English WER).
All metrics are calculated after resegmentation to reference transcripts and translations. Direct / end-to-end systems
are highlighted in gray.
System Task Constrained? COMET (de) BLEU (de) COMET (ja) BLEU (ja) COMET (zh) BLEU (zh)
USTC Off. 85.4 (1) 58.0 (1)
HW-TSC Off. ✓ 80.9 (2) 38.1 (3) 84.4 (3) 30.1 (7) 84.0 (2) 53.0 (2)
JHU Mult. 81.3 (1) 41.2 (1) 84.7 (1) 33.9 (4) 82.0 (3) 46.5 (11)
HW-TSC Off. 80.7 (3) 36.9 (6) 84.7 (1) 30.7 (6) 84.0 (2) 52.8 (3)
HW-TSC Off. ✓ + LLM 80.6 (4) 37.2 (5) 84.6 (2) 30.7 (6) 84.0 (2) 53.0 (2)
NeuroDub Off. 79.6 (5) 41.1 (2)
USTC Off. 80.0 (4) 52.5 (4)
KITpr Mult. ✓ + LLM 74.9 (6) 37.5 (4) 82.0 (4) 35.7 (1) 79.3 (5) 49.4 (6)
KITc1 Mult. ✓ + LLM 74.6 (8) 36.5 (7) 82.0 (4) 35.2 (2) 79.3 (5) 49.7 (5)
KITc2 Mult. ✓ + LLM 74.3 (9) 36.5 (7) 81.6 (6) 34.0 (3) 78.6 (10) 49.4 (6)
KITc3 Mult. ✓ + LLM 74.7 (7) 36.1 (9) 81.4 (7) 33.3 (5) 78.4 (11) 48.6 (7)
KITc4 Mult. ✓ + LLM 74.2 (10) 36.4 (8) 81.7 (5) 33.9 (4) 78.4 (11) 48.2 (8)
KITc5 Mult. ✓ + LLM 74.9 (6) 33.8 (10) 80.3 (8) 27.3 (8) 79.1 (6) 46.7 (10)
UPC Off. ✓ + LLM 74.7 (7) 32.1 (12)
KITc6 Mult. ✓ + LLM 73.9 (11) 32.9 (11) 80.0 (9) 26.6 (9) 78.9 (7) 45.7 (13)
KITc7 Mult. ✓ + LLM 73.9 (11) 32.9 (11) 80.3 (8) 25.6 (10) 78.8 (8) 46.0 (12)
Xiaomi Off. ✓ + LLM 78.7 (9) 46.5 (11)
NiuTrans Off. ✓ 77.3 (12) 47.1 (9)
NeMo Off. ✓ 71.7 (12) 31.9 (13) 77.7 (10) 24.9 (11) 74.0 (13) 41.8 (14)
I2R Off. ✓ + LLM 68.4 (13) 23.9 (14)
JHU Mult. ✓ + LLM 59.0 (15) 23.7 (15) 69.3 (11) 18.9 (12) 67.9 (15) 37.4 (16)
MINE-Trans Off. 70.0 (14) 39.9 (15)
BIGAI* Off. ✓ + LLM 63.0 (14) 19.6 (16) 67.7 (12) 10.4 (13) 65.3 (16) 27.4 (18)
MINE-Trans Off. ✓ 63.5 (17) 31.8 (17)
BIT Mult. ✓ 47.2 (16) 11.1 (17) 56.2 (13) 8.0 (14) 55.7 (18) 19.8 (19)
Table 27: Submissions from all tracks on the ACL 60-60 evaluation sets on the three language pairs shared across
tracks (En → De, Ja, Zh), ordered by average metric ranking. Direct / end-to-end systems are highlighted in gray.
54
Submission ar de fa fr ja nl pt ru tr zh Avg.
JHUunconstrained 62.4 67.6 57.8 73.4 42.0 71.6 75.0 56.8 62.5 42.2 61.1
KITprimary 56.9 64.8 55.4 67.8 42.3 67.6 69.6 51.2 57.3 42.5 57.5
KITcontrastive1 56.9 64.6 55.6 67.8 42.0 67.6 69.6 51.2 56.7 42.7 57.5
KITcontrastive2 56.1 63.6 52.9 67.3 40.8 66.5 69.2 50.6 55.6 41.3 56.4
KITcontrastive4 56.2 63.3 53.0 67.2 40.7 66.5 68.8 50.4 55.1 40.3 56.2
KITcontrastive3 55.5 63.7 52.1 66.9 40.3 66.0 68.9 50.0 55.2 40.6 55.9
KITcontrastive5 55.3 61.3 53.8 65.2 35.9 63.7 67.3 48.6 54.9 39.2 54.5
KITcontrastive7 54.7 60.3 54.0 64.4 34.5 63.4 67.2 47.8 54.2 38.2 53.9
KITcontrastive6 54.6 60.3 52.7 64.3 35.5 62.7 66.4 48.2 53.8 38.4 53.7
JHUconstrained 45.2 53.4 44.5 62.4 26.8 62.1 62.2 46.8 46.3 30.8 48.1
BIT 28.9 36.8 28.8 45.2 14.5 41.7 43.0 28.4 25.9 17.2 31.0
Table 28: chrF with resegmentation for each target language on the evaluation set, sorted by the system average.
Direct / end-to-end systems are highlighted in gray.
Submission ar de fa fr ja nl pt ru tr zh Avg.
JHUunconstrained 82.7 81.3 80.6 81.4 84.7 84.1 84.9 78.9 82.5 82.0 82.3
KITprimary 78.0 74.9 75.8 74.4 82.0 77.7 78.4 72.5 76.6 79.3 77.0
KITcontrastive1 77.7 74.6 75.7 74.5 82.0 77.6 78.4 72.2 76.4 79.3 76.8
KITcontrastive5 78.5 74.9 75.9 74.6 80.3 76.8 78.5 71.6 76.9 79.1 76.7
KITcontrastive7 78.2 73.9 76.3 74.2 80.3 76.7 80.3 71.3 76.2 78.8 76.6
KITcontrastive2 77.3 74.3 74.9 74.3 81.6 77.3 78.4 72.1 75.8 78.6 76.5
KITcontrastive4 77.2 74.2 75.0 74.3 81.7 77.3 78.2 72.0 75.5 78.4 76.4
KITcontrastive3 76.9 74.7 74.6 74.2 81.4 76.9 78.2 71.8 75.7 78.4 76.3
KITcontrastive6 77.8 73.9 75.2 73.3 80.0 75.4 77.7 70.8 75.7 78.9 75.9
JHUconstrained 67.9 59.0 66.1 63.2 69.3 66.2 67.8 62.0 64.0 67.9 65.3
BIT 52.8 47.2 48.7 52.2 56.2 53.8 54.8 47.7 48.0 55.7 51.7
Table 29: COMET with resegmentation for each target language on the evaluation set, sorted by the system average.
Direct / end-to-end systems are highlighted in gray.
ar de fa fr ja nl pt ru tr zh Avg.
JHUunconstrained 33.4 41.2 35.0 50.0 33.9 44.8 51.7 27.9 28.1 46.5 39.3
KITprimary 25.9 37.5 29.8 41.3 35.7 40.4 44.3 22.4 21.8 49.4 34.9
KITcontrastive1 25.6 37.5 30.1 41.1 35.2 40.6 44.5 22.6 21.3 49.7 34.8
KITcontrastive2 24.7 36.5 28.0 42.4 34.0 38.8 43.8 21.9 20.6 49.4 34.0
KITcontrastive4 24.4 36.4 28.4 42.1 33.9 38.9 43.0 21.6 20.3 48.2 33.7
KITcontrastive3 24.0 36.1 27.6 41.9 33.3 38.2 43.6 21.5 20.1 48.6 33.5
KITcontrastive5 23.7 33.8 28.7 39.6 27.3 35.9 40.7 19.6 20.6 46.7 31.7
KITcontrastive7 23.4 32.9 28.6 38.8 25.6 36.0 40.9 19.1 20.1 46.0 31.1
KITcontrastive6 23.0 32.9 28.3 38.9 26.6 35.0 39.7 19.7 19.1 45.7 30.9
JHUconstrained 15.0 23.7 21.9 33.1 18.9 31.3 33.2 17.2 12.8 37.4 24.5
BIT 5.7 11.1 7.4 19.7 8.0 16.3 18.6 6.3 4.1 19.8 11.7
Table 30: BLEU with resegmentation for each target language on the evaluation set, sorted by the system average.
BLEU scores in grey are calculated using language-specific tokenization (ja) or at the character-level (zh); see §5.2
for specific tokenization details. Direct / end-to-end systems are highlighted in gray.
55
B.5 Speech-to-Speech Translation
Table 31: Official results of the automatic evaluation for the English to Chinese Speech-to-Speech Translation
Task.
Table 32: Official results of the human evaluation for the English to Chinese Speech-to-Speech Translation Task.
56
B.6 Dialectal SLT
Table 33: Automatic evaluation results for the Dialect Speech Translation task, Unconstrained Condition. Systems
are ordered in terms of the official metric BLEU on test3. We also report brevity penalty (bp) and unigram precision
(pr1) of BLEU, chrF, and TER.
Table 34: Automatic evaluation results for the Dialect Speech Translation task, Constrained Condition.
ASR System test2 WER↓ test2 CER↓ test3 WER↓ test3 CER↓
Orig Norm Orig Norm Orig Norm Orig Norm
JHU / constrained / primary 70.3 43.7 30.7 22.7 74.0 44.9 33.1 24.8
JHU / unconstrained / primary 69.3 40.6 29.0 20.7 72.9 41.6 31.5 22.9
USTC / constrained / primary 49.5 40.8 24.2 20.9 52.3 43.2 27.1 23.8
USTC / unconstrained / primary 47.4 39.3 23.1 20.0 49.2 40.5 25.2 22.1
2022best:ON-TRAC/unconstrained 65.7 41.5 28.1 21.1 - - - -
Table 35: Word Error Rate (WER) and Character Error Rate (CER) of the ASR component of submitted cascaded
systems on test2 and test3. The original version (Orig) matches the minimal text pre-processing provided by the
organizer’s data preparation scripts, and results in relatively high WER. As a diagnostic, we ran additional Arabic-
specific normalization (Norm) for e.g. Alif, Ya, Ta-Marbuta on the hypotheses and transcripts before computing
WER/CER. We are grateful to Ahmed Ali for assistance on this.
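The Arabic-specific normalization mentioned in the caption can be approximated as below (collapsing Alif variants, mapping Alif Maqsura to Ya and Ta-Marbuta to Ha) before computing WER, here with the jiwer package. The exact rules used for the official diagnosis are not reproduced here, and the example sentence is illustrative.

```python
# Illustrative approximation of Arabic normalization prior to WER scoring.
import re
import jiwer

def normalize_arabic(text: str) -> str:
    text = re.sub("[إأآا]", "ا", text)  # Alif variants -> bare Alif
    text = re.sub("ى", "ي", text)       # Alif Maqsura -> Ya
    text = re.sub("ة", "ه", text)       # Ta-Marbuta -> Ha
    return text

reference  = "ذهبت إلى المدرسة"   # hypothetical example
hypothesis = "ذهبت الى المدرسه"

print("WER (orig):", jiwer.wer(reference, hypothesis))
print("WER (norm):", jiwer.wer(normalize_arabic(reference), normalize_arabic(hypothesis)))
```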
57
B.7 Low-Resource SLT
Table 36: Automatic evaluation results for the Irish to English task, Constrained Condition.
Table 37: Automatic evaluation results for the Irish to English task, Unconstrained Condition.
Table 38: Automatic evaluation results for the Marathi to Hindi task, Constrained Condition.
Table 39: Automatic evaluation results for the Marathi to Hindi task, Unconstrained Condition.
58
Pashto→French (Unconstrained Condition)
BLEU
Team System valid test
ON-TRAC primary 24.82 24.87
ON-TRAC contrastive1 23.38 23.87
GMU primary 11.99 16.87
GMU contrastive1 11.27 15.24
ON-TRAC contrastive2 12.26 15.18
ON-TRAC contrastive3 12.16 15.07
GMU contrastive2 9.72 13.32
Table 40: Automatic evaluation results for the Pashto to French task, Unconstrained Condition.
Table 41: Automatic evaluation results for the Pashto to French task, Constrained Condition.
Table 42: Automatic evaluation results for the Maltese to English task, Unconstrained Condition.
Table 43: Automatic evaluation results for the Tamasheq to French task, Constrained Condition.
59
Tamasheq→French (Unconstrained Condition)
Team System BLEU chrF2 TER
NAVER primary 23.59 49.84 64.00
NAVER contrastive1 21.31 48.15 66.41
NAVER contrastive2 18.73 46.11 70.32
ON-TRAC primary 15.88 43.88 73.85
ON-TRAC contrastive1 16.35 44.22 74.26
ON-TRAC contrastive2 15.46 43.59 75.30
ON-TRAC contrastive3 15.49 43.74 75.07
ON-TRAC contrastive4 16.25 44.11 74.26
ON-TRAC contrastive5 15.54 43.91 75.08
Alexa AI primary 9.30 32.29 81.25
Alexa AI contrastive1 8.87 32.04 81.03
Alexa AI contrastive2 9.50 33.67 80.85
Alexa AI contrastive3 9.28 32.86 82.33
GMU primary 8.03 33.03 87.81
GMU contrastive1 1.30 23.63 96.72
GMU contrastive2 2.10 24.33 94.58
Table 44: Automatic evaluation results for the Tamasheq to French task, Unconstrained Condition.
Table 45: Automatic evaluation results for the Quechua to Spanish task, Constrained Condition. ChrF2 scores
were only taken into account for those systems that scored less than 5 points BLEU.
Table 46: Automatic evaluation results for the Quechua to Spanish task, Unconstrained Condition. ChrF2 scores
were only taken into account for those systems that scored less than 5 points BLEU.
60
B.8 Formality Control for SLT
Model (Constrained) Formality | EN-KO: BLEU COMET mACC cACC | EN-VI: BLEU COMET mACC cACC
CoCoA (baseline) F | 11.1 0.5044 28.5 55 | 43.2 0.6189 99 99
CoCoA (baseline) IF | 11.1 0.5125 80.4 58 | 41.5 0.6021 98 99
Table 47: Results for the Formality Track (Supervised Setting). Most systems perform well in this setting, though
MT quality on formal (F) outputs tends to be higher than on informal (IF) outputs.
Model (Constrained) Formality | EN-PT: BLEU COMET mACC cACC | EN-RU: BLEU COMET mACC cACC
Table 48: Results for the Formality Track (Zero-shot Setting). Appreciable differences in formality control exist
between formal (F) and informal (IF), suggesting that formality bias exists in participant systems.
61
Evaluating Multilingual Speech Translation Under Realistic Conditions
with Resegmentation and Terminology
Abstract
1 Introduction
The NLP and speech communities are rapidly expanding, which has motivated increased interest in multilingual scientific communication and accessibility. From the automatic captioning at NAACL 2019 provided by Microsoft to the current ACL 60-60 initiative1 for the 60th anniversary of ACL in 2022, it is clear that transcription and translation in the technical domain is needed, desired, and still a disproportionate challenge for current models compared to standard datasets in these spaces. Translating technical presentations presents challenging conditions, from domain-specific terminology and adaptation, to recordings often captured with a laptop microphone and light background noise, diverse speaker demographics, as well as unsegmented speech typically 10-60 minutes in duration. We have curated evaluation sets from presentations at ACL 2022 which have been professionally transcribed and translated with the support of ACL and the 60-60 initiative. In this paper we describe the methodology to create this dataset, considerations and methods to evaluate speech translation models with it, and open challenges we believe this dataset may support research towards. We release all data and intermediate steps to support further research in this space.
Figure 1: Multilingual translation of ACL presentations.
We present the ACL 60/60 evaluation sets to enable greater development of tools by the field for the field. Specifically, we hope that this data enables further research into speech translation and other NLP applications in the technical domain with resegmentation and terminology, given a diverse speaker set and realistic recording conditions, with the goal of increased accessibility and multilinguality. Our dataset is publicly available through the ACL Anthology.2
2 Evaluation under realistic conditions
To evaluate transcription and translation under realistic conditions may require different metrics than with e.g. provided segmentation. Here we present the necessary metrics in order to discuss the dataset creation process.
2.1 Resegmentation
While most offline speech translation models are trained with provided segmentation, in an application setting segmentation is unlikely to be provided.
1 https://www.2022.aclweb.org/dispecialinitiative
2 https://aclanthology.org/2023.iwslt-1.2
62
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 62–78
July 13-14, 2023 © 2023 Association for Computational Linguistics
Most models are typically unable to maintain output quality given audio of typical talk lengths (10+ minutes), necessitating the use of automatic segmentation methods. In order to evaluate output with variable segmentation, resegmentation to a fixed reference is necessary.
The standard tool within the field for many years has been mwerSegmenter (Matusov et al., 2005), which resegments model output to match a reference segmentation for downstream evaluation with various metrics. This is done by dynamically resegmenting the output using a given tokenization to minimize word error rate to the reference.3 We use mwerSegmenter for all scores in this paper and suggest that resegmentation be the scoring standard for the ACL 60/60 dataset.
3 We use word-level tokenization for all languages except Japanese and Chinese here, where we use character-level.
2.2 Evaluation metrics
We compare a variety of evaluation metrics to analyze both transcription and translation quality using the evaluation sets, as well as the results of intermediate steps in corpus creation such as post-editing.
For translation, we compare chrF (Popović, 2015), which is tokenization-agnostic and more appropriate for a wider array of target languages than BLEU; BLEU (Papineni et al., 2002) as computed by SacreBLEU (Post, 2018); and the model-based metric COMET (Rei et al., 2020), which often has higher correlation with human judgements (Mathur et al., 2020) though is limited by language coverage in pretrained models. For BLEU we use the suggested language-specific tokenizers in SacreBLEU for our non-space-delimited target languages, Japanese (MeCab4) and Chinese (character-level).
To analyze both automatic and post-editing transcription quality, we use word error rate (WER). We note that we use case-sensitive and punctuation-sensitive WER here as these are both maintained in system output during dataset creation in order to be post-edited and translated. For downstream evaluation of ASR model quality using the final dataset, it may be desired to compute WER without case and without punctuation; if so, the scores would not be directly comparable to those presented here. We also use translation error rate (TER) (Snover et al., 2006) to assess the expected level of editing necessary to match the final reference quality.5
We caution against using any one translation metric in isolation, and suggest chrF and COMET as the standard evaluation metrics for this dataset.
4 https://taku910.github.io/mecab/
5 We calculate TER with --ter-normalized and --ter-asian-support in SacreBLEU.
3 Creating the ACL 60/60 evaluation sets
3.1 Languages
All data is originally spoken in English and then transcribed and translated to ten diverse languages from the 60/60 initiative for which publicly available speech translation corpora are available (see Table 5: §A.3): Arabic, Mandarin Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and Turkish. The resulting dataset contains three-way parallel (speech, transcripts, translations) one-to-many data for ten language pairs, and multi-way parallel text data for 100 language pairs.
3.2 Data selection
Data was selected from the ACL 2022 paper presentations for which prerecorded audio or video presentations were provided to the ACL Anthology. Talks were selected such that each of the two evaluation sets, development and evaluation, would have approximately one hour total duration. Oral presentations were advised to be up to 12 minutes per recording, resulting in 5 talks for each set with relatively balanced durations of ∼11.5 minutes each. From the 324 available recordings, the final 10 were selected in order to balance speaker demographics, accents, and talk content, while lightly controlling for recording conditions. The majority of recordings were created using laptop microphones in quiet conditions, but background noise, microphone feedback, speech rate and/or volume in some cases affected understanding of the content. We selected talks with representative but minimal noise where conditions did not affect understanding of the content. We aimed for a gender balance representative of conference participation,6 resulting in a 3:7 female:male speaker ratio. This is also a global field with a wide variety of native and non-native English accents, which remains a necessary challenge for speech models to address to mitigate performance biases (Sanabria et al., 2023; Feng et al., 2021; Koenecke et al., 2020; Tatman and Kasten, 2017). Talks were chosen and assigned to each set to maximize accent diversity, aiming for L1s from all continents with language families frequently represented in the ACL community while balancing topic diversity and gender. We note native language and country where available. Talks were chosen to cover a diverse set of tracks and topics and therefore diverse technical vocabulary representative of the needs of the field. Where presentations were chosen within the same track, they covered different focuses and methodology, e.g. math word problems versus release note generation or few-shot adaptation for structured data. Metadata for all talks with exact durations and track and speaker annotations are shown in Table 3 in §A.1.
6 Aggregate conference participation statistics provided by ACL 2022; see §A.2.
Holding out speakers and topics per set optimizes for overall system generalization but reduces the match between dev and eval sets; this e.g. reduces the benefit of finetuning on the dev set to maximize test set performance, and overfitting the model or chosen hyperparameters to the dev set will adversely affect test set performance. However, high performance on both sets is more likely to indicate generalizable systems and representative performance beyond these data points than if the dev and eval data were more closely matched.
3.3 Automatic transcription
The first pass through the data used automatic segmentation and transcription to provide initial transcripts. We used the Azure API speech-to-text service,7 which has the best cost and quality balance of currently available models. In addition to transcription, the service performs speaker diarization, with implicit voice activity detection (VAD), segmenting the initially ∼11.5 minute audio files into segments of approximately 30 seconds or less based on pauses, speech, and non-speech phenomena. Figure 2 shows the resulting distribution of segment lengths. Evaluating these initial automatic transcripts against the final released version with resegmentation (§2.1), the automatic transcription yielded a WER of 15.4 and 22.4 for the development and evaluation sets, respectively.
7 https://azure.microsoft.com/en-us/products/cognitive-services/speech-to-text
Figure 2: Distribution of English segment lengths via speech duration (seconds) and text length (word count) for each of three segmentations: VAD, subtitles, and sentences. (a) Speech segment length distribution; (b) Text segment length distribution.
3.4 Human post-editing: Transcription
We contracted with aiXplain Inc. to professionally post-edit the ASR output. There was a three-tier review process: an initial annotator post-edited per segment, followed by a quality assurance (QA) annotator who went through each full talk to ensure quality and consistency, and then finally 10-20% of the segments were randomly chosen for a final check. In addition to semantic content, annotators may theoretically also fix segmentation boundaries but in practice this rarely occurs. The annotators provided additional information about the speakers, namely gender (male, female) and age (child, young adult, adult, elderly). The annotators were also shown the video of the presentation to aid them in recognizing technical terms, which may appear in the slides. Disfluencies were standardized such that false starts and repetitions were kept where there were perceivable pauses between them, and two hesitation spelling variations (ah, um) were used. The annotator guidelines and LabelStudio interface are shown in §A.4. After the professional post-editing pass, a domain expert verified and corrected the technical terms.
Post-editing analysis. ASR output is strongly monotonic with respect to the original speech, and accordingly most post-edits are for incorrectly transcribed
64
[ASR reference (REF) vs. hypothesis (HYP) examples:]
REF: we find a BILSTM ** CRF model using flare
HYP: we find a BIAS TM CRF model using flare
     S D
REF: also FASTTEXT CHARACTER EMBEDDINGS
HYP: also FASTTEX KITCHEN BEDDINGS
     S S S
REF: multilingual BERT PERFORMS better than BETO
HYP: multilingual BIRD PERFORM better than BETTER
     S S S
Figure 4: Example of tagged terminology from dev. Terminology lists were not exhaustive; [text-to-speech] did not appear, leading [text] and [speech] to be tagged separately.
65
Set Metric ar de fa fr ja nl pt ru tr zh
dev chrF 75.3 72.8 54.9 80.0 56.9 82.7 82.3 59.3 69.0 60.5
dev BLEU 54.1 48.3 25.3 63.0 50.7 63.6 65.9 30.5 39.1 65.9
dev COMET 86.2 83.6 76.8 84.5 89.1 88.1 87.9 82.5 85.9 87.4
eval chrF 77.2 71.7 56.3 83.7 53.6 86.6 84.8 65.3 77.0 62.7
eval BLEU 55.4 48.5 27.1 68.3 47.3 71.5 68.7 39.4 51.6 67.9
eval COMET 86.2 83.6 79.5 84.5 89.1 88.1 87.9 82.5 85.9 87.4
Table 1: Evaluating the initial commercial MT from ground-truth transcripts against the final released references.
BLEU scores in grey are calculated using language-specific tokenization (ja) or at the character-level (zh); see §2.2.
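The metric setup of §2.2 used for Table 1 can be sketched roughly as follows with SacreBLEU and the COMET toolkit; the example sentences and the COMET checkpoint name are illustrative assumptions rather than the exact configuration used to produce the table.

```python
# Sketch of the §2.2 metrics for one language pair (English->Japanese):
# chrF and BLEU via SacreBLEU (ja-mecab tokenizer), plus a neural COMET score.
from sacrebleu.metrics import BLEU, CHRF
from comet import download_model, load_from_checkpoint

sources    = ["We evaluate speech translation under realistic conditions."]
hypotheses = ["現実的な条件で音声翻訳を評価します。"]
references = ["現実的な条件下で音声翻訳を評価する。"]

print("chrF:", CHRF().corpus_score(hypotheses, [references]).score)
print("BLEU:", BLEU(tokenize="ja-mecab").corpus_score(hypotheses, [references]).score)

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))  # checkpoint is an assumption
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", comet.predict(data, batch_size=8, gpus=0).system_score)
```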
We compare the distribution of segment lengths for each of the three approaches (VAD, subtitles, and sentences) in terms of both duration (seconds) and number of words (English) in Figure 2. VAD results in the most uneven distribution, with segments ranging from <1 second to >30 seconds. Subtitles result in more uniform but distinctly shorter segments, with 58% containing less than 10 words and 19% shorter than two seconds, likely too short for some downstream tasks or metrics. Sentences result in less extreme segment lengths. Examples of each segmentation are shown in §A.8. The final data contains 468 sentences in the development set and 416 sentences in the evaluation set.
3.6 Machine translation
The first translation pass used publicly available bilingual MT models to translate the final sentence segments. We used the ModernMT API9 for the 9 of 10 language pairs supported, and the Azure API10 for English-Farsi. We evaluate the commercial machine translation output against the final released translation references (§3.7) using the metrics discussed in §2.2, shown in Table 1.
9 https://www.modernmt.com/api/
10 https://azure.microsoft.com/en-us/products/cognitive-services/translator
Each metric suggests a different story about translation quality and the degree to which it is language-specific. While COMET suggests relatively consistent performance across languages, chrF and BLEU do not. chrF and BLEU suggest significantly worse performance for a subset of target languages, including all but one of the non-Latin-script and non-Indo-European languages. BLEU yields 1.7× greater variance than chrF. By all metrics, though, MT quality was consistent between the development and evaluation sets. We see in the next section that the amount of post-editing required to create the final references, however, is not necessarily indicated by these metrics.
3.7 Human post-editing: Translation
Post-editing has become the industry standard due to its increased productivity, typically reducing processing time and cognitive load compared to direct translation, particularly for domain-specific texts (O’Brien, 2007; Groves and Schmidtke, 2009; Tatsumi, 2009; Plitt and Masselot, 2010).
We contracted with Translated to professionally post-edit the MT output. There was a two-tier review process: an initial annotator who was a native speaker of the target language post-edited per segment, followed by a second to review the output and consistency of the first. Annotator guidelines and the post-editing interface are shown in §A.5.
Technical terms. Terminology was not handled separately during the MT step nor automatically tagged, given that the MT systems may omit or incorrectly translate technical terms. We did not use constrained decoding given the terminology lists’ translations, as their validity could be context-dependent and some terms had multiple possible translations. Instead, translation post-editors were instructed to correct the translations of tagged terminology on the source if they were not maintained and then tag the appropriate target translations for each tagged source span. Capitalized acronyms and terminology not on the lists and unknown to the translators were left in English.
Post-editing analysis. While the metrics in the previous section give a sense for the automatic translation quality, they do not necessarily reflect the effort required to post-edit the translations to final reference quality. Using TER to assess the degree of post-editing necessary, we see in Figure 5 that this varies by language. Most noticeably, we see that Farsi, Russian, and Japanese as target languages required the highest amount of post-editing.
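A minimal sketch of the TER-based post-editing-effort measurement described above, using the SacreBLEU TER implementation with the normalization options noted in footnote 5; the MT output and post-edited reference strings are illustrative.

```python
# Sketch: TER between initial MT output and the final post-edited reference,
# with normalization and Asian-language support enabled as in footnote 5.
from sacrebleu.metrics import TER

mt_outputs  = ["音声翻訳システムを現実的な条件で評価した。"]   # hypothetical MT output
post_edited = ["我々は音声翻訳システムを現実的な条件下で評価した。"]  # hypothetical post-edit

ter = TER(normalized=True, asian_support=True)
print("TER:", ter.corpus_score(mt_outputs, [post_edited]).score)
```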
66
[Figure 5: TER by target language, dev vs. eval.]
Figure 6: Degree of reordering done in MT post-editing.
Figure 7: Range in TER by talk per language.
Figure 8: Correlation in TER across languages.
70
International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 1–34, Online. Association for Computational Linguistics.
Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTEV: Comprehensive evaluation of spoken language translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79, Online. Association for Computational Linguistics.
Luisa Bentivogli, Mauro Cettolo, Marcello Federico, and Christian Federmann. 2018. Machine translation human evaluation: an investigation of evaluation based on post-editing and its relation with direct assessment. In Proceedings of the 15th International Conference on Spoken Language Translation, pages 62–69, Brussels. International Conference on Spoken Language Translation.
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805.
Oliver Čulo and Jean Nitzke. 2016. Patterns of terminological variation in post-editing and of cognate use in machine translation in contrast to human translation. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 106–114.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Siyuan Feng, Olya Kudina, Bence Mark Halpern, and Odette Scharenborg. 2021. Quantifying bias in automatic speech recognition. ArXiv, abs/2103.15122.
Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, and Thierry Charnois. 2016. Semantic annotation of the ACL Anthology corpus for the automatic analysis of scientific literature. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3694–3701, Portorož, Slovenia. European Language Resources Association (ELRA).
Toni Giorgino. 2009. Computing and visualizing dynamic time warping alignments in R: The dtw package. Journal of Statistical Software, 31(7).
Declan Groves and Dag Schmidtke. 2009. Identification and analysis of post-editing patterns for MT. In Proceedings of Machine Translation Summit XII: Commercial MT User Program, Ottawa, Canada.
Francisco Guzman, Hassan Sajjad, Stephan Vogel, and Ahmed Abdelali. 2013. The AMARA corpus: building resources for translating the web's educational content. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.
Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.
J. Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van Durme. 2019. Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 839–850, Minneapolis, Minnesota. Association for Computational Linguistics.
J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and A. Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.
Yiping Jin, Min-Yen Kan, Jun-Ping Ng, and Xiangnan He. 2013. Mining scientific terms and their definitions: A study of the ACL Anthology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 780–790, Seattle, Washington, USA. Association for Computational Linguistics.
Allison Koenecke, Andrew Joo Hun Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences of the United States of America, 117:7684–7689.
71
Jérôme Louradour. 2023. whisper-timestamped. https://github.com/linto-ai/whisper-timestamped.
Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020. Results of the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 688–725, Online. Association for Computational Linguistics.
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.
Sharon O'Brien. 2007. An empirical investigation of temporal and technical post-editing effort. The Information Society, 2:83–136.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Silvio Picinini and Nicola Ueffing. 2017. A detailed investigation of bias errors in post-editing of MT output. In Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track, pages 79–90, Nagoya, Japan.
Marcis Pinnis, Rihards Kalnins, Raivis Skadins, and Inguna Skadina. 2016. What can we really learn from post-editing? In Conferences of the Association for Machine Translation in the Americas: MT Users' Track, pages 86–91, Austin, TX, USA. The Association for Machine Translation in the Americas.
Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. In Prague Bulletin of Mathematical Linguistics.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans, Louisiana. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, and Peter Bell. 2023. The Edinburgh International Accents of English Corpus: Towards the democratization of English ASR.
Carolina Scarton, Mikel L. Forcada, Miquel Esplà-Gomis, and Lucia Specia. 2019. Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.
Anne-Kathrin Schumann and Héctor Martínez Alonso. 2018. Automatic annotation of semantic term types in the complete ACL Anthology reference corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Sukanta Sen, Ondřej Bojar, and Barry Haddow. 2022. Simultaneous translation for unsegmented input: A sliding window approach.
Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.
Rachael Tatman and Conner Kasten. 2017. Effects of talker dialect, gender & race on accuracy of Bing Speech and YouTube automatic captions. In Interspeech.
Midori Tatsumi. 2009. Correlation between automatic evaluation metric scores, post-editing speed, and some other factors. In Proceedings of Machine Translation Summit XII: Posters, Ottawa, Canada.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta Ruiz Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. In Interspeech.
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020. CoVoST: A diverse multilingual speech-to-text translation corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4197–4203, Marseille, France. European Language Resources Association.
72
Vilém Zouhar, Martin Popel, Ondřej Bojar, and Aleš
Tamchyna. 2021. Neural machine translation quality
and post-editing performance. In Proceedings of the
2021 Conference on Empirical Methods in Natural
Language Processing, pages 10204–10214, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
73
A Appendix
A.1 Additional Metadata for ACL 60/60 Evaluation Sets
Below we list the duration for talks in the evaluation sets, along with additional demographic metadata
about the presenting author (speaker) and content (conference track). Conference tracks are taken from the
ACL 2022 handbook. Gender annotations were checked with speakers’ listed pronouns13 and validated
by speakers where available. For speaker demographics and accent we list L1 and native country where
available, as well as country of affiliation as a rough proxy.
Gender # %
Woman 909 28.7
Man 2164 68.3
Non-binary / Genderqueer / Third gender 14 <1
Genderfluid / Gender non-conforming <10 <1
Prefer not to say 77 2.4
Specify your own <10 <1
TOTAL 3170 100
13
Though we note pronouns do not always indicate gender.
74
A.3 Publicly Available Corpora
Below are the current publicly available multi-way parallel speech translation corpora with English as the
speech source. We note that for MuST-C not all target languages are available in all versions of the corpus
as successive versions added additional language coverage. For full coverage v1.2 or above is required.
Table 5: Current publicly available aligned speech translation corpora covering the ACL 60/60 language pairs.
Target languages are abbreviated using ISO 639-1 codes as follows – Arabic: ar, German: de, Farsi: fa, French: fr,
Japanese: ja, Dutch: nl, Portuguese: pt, Russian: ru, Turkish: tr, Mandarin Chinese: zh.
• Accuracy. Only type the words that are spoken in the audio file. Phrases or words you don’t
understand should NOT be omitted. Instead, they should be annotated using the label “#Unclear”.
• Keep everything verbatim. Include every utterance and sound exactly as you hear it. All filler words
should be included (ex. #ah, #hmm). If the user corrects himself/herself, all the utterances should be
transcribed and the corrected words need to be preceded with a # mark (ex. She says #said that).
• Do not paraphrase. Do not correct the speaker’s grammar nor rearrange words. Also, do not cut
words that you think are off-topic or irrelevant. Any words not spoken should not be included. Type
the actual words spoken. If the speaker makes a grammatical mistake, the transcript must reflect the
mistake (ex. If the speaker says: “he were”, it should be transcribed as is without correction).
• Repeat repeated words in the transcript. For example, if the user says: I I said, you must include both
instances of I.
• Do not add additional information such as page numbers, job numbers, titles or your comments in
your submission.
• Foreign words should be transliterated using Latin letters.
• All abbreviations need to be spelled out. For example, doctor should NOT be written as Dr. Similarly, percent should NOT be written as %.
• All numbers and special symbols (ex. %, $, +, @, =, etc.), or combinations of both, must be spelled out as words and must match what the speaker says exactly (see the checker sketch after this list).
• All proper names (ex. Google, NATO, Paris) should be transliterated in English.
• Proper punctuation needs to be placed in the text (ex. He, the boy, .). Please pay special attention
and do not miss/omit these punctuation marks: , . ? ! : )(
• Personally identifiable information (like phone number, address, IDs) should be marked in the text as
<PII></PII>. For example: My address is <PII>address</PII>
• Use double dashes “--” to indicate truncated words, attached whether at the beginning or the end of
the word (ex. transfor–).
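To make a few of the rules above machine-checkable, here is a small, hypothetical checker that an annotator or reviewer could run over a transcript line. The rule subset and its encoding are assumptions made for illustration; this is not part of the official annotation tooling.

```python
import re

# Minimal, hypothetical checker for a few of the transcription rules above:
# digits/symbols should be spelled out, <PII> tags must be balanced, and
# truncated words should end in "--". Not the official annotation tooling.
def check_transcript(line):
    issues = []
    if re.search(r"[0-9%$+@=]", line):
        issues.append("numbers/symbols should be spelled out as words")
    if line.count("<PII>") != line.count("</PII>"):
        issues.append("unbalanced <PII> tags")
    for token in line.split():
        if token.endswith("-") and not token.endswith("--"):
            issues.append(f"possible truncated word without '--': {token}")
    return issues

print(check_transcript("She says #said that it costs five dollars"))  # -> []
```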
Figure 10: LabelStudio interface for transcription post-editing.
• Any term found in the 60-60 terminology list should be translated using the translation given in the list.
• Any abbreviation not found in the terminology list should be kept in its English form.
• A term in the terminology list may have one or more translations, separated by ‘:::’. The translator should pick the appropriate one based on the context (a small parsing sketch follows this list).
• If the translator thinks that none of the given translations for a specific term makes sense in the given context, they can use a better translation if they are very confident; if not, they should keep the word in its English form.
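The ‘:::’-separated format described above can be consumed with a few lines of code. The sketch below is a hypothetical reader for such a terminology file; the file name and tab-separated layout are assumptions, and the fallback to the English form mirrors the last two rules.

```python
# Hypothetical sketch: look up candidate translations for a glossary term whose
# alternatives are separated by ':::' in the 60-60 terminology list.
# The tab-separated file layout is an assumption, not the released resource.
def load_terminology(path):
    terms = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, translations = line.rstrip("\n").split("\t", 1)
            terms[term.lower()] = translations.split(":::")
    return terms

def candidates_for(term, terminology):
    # Fall back to the English form when the term is not in the list.
    return terminology.get(term.lower(), [term])

# terminology = load_terminology("terminology_60_60.tsv")
# print(candidates_for("attention", terminology))
```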
14 https://site.matecat.com/
Figure 11: Matecat interface for translation post-editing.
Commercial VAD 66.6 68.5 52.7 74.1 46.2 73.6 73.7 53.9 60.6 49.8 62.0
SHAS 66.5 68.6 52.8 73.7 46.9 73.8 73.5 54.3 59.9 49.7 62.0
Sentences 64.0 66.1 51.3 69.0 43.9 71.0 71.9 55.8 63.8 46.0 60.3
eval
Commercial VAD 63.5 66.3 51.1 69.0 43.7 70.4 72.0 55.1 62.9 47.1 60.1
SHAS 64.4 66.4 51.5 69.6 42.0 71.4 72.4 55.7 63.1 45.4 60.2
Table 6: Cascaded ST by language for different source speech segmentations, resegmented and scored with chrF.
If one of the segments created by the VAD does not adhere to the above guidelines, an English model is used to force-align the long audio segment with its transcript to obtain the timestamp of each token, and the segment is then split into shorter subsegments. Note that these guidelines are applied automatically; this means that if a VAD segment conforms to the guidelines it is not resegmented, and the resulting subtitle segments may differ from manually created subtitles, where semantic coherence may be prioritized over longer segments within these guidelines, or where text may be lightly changed from what is spoken to optimize subtitle quality (not allowed here).
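As a concrete illustration of this resegmentation step, the following sketch splits an over-long VAD segment using token-level timestamps from a forced aligner. The aligner interface and the 20-second ceiling are assumptions; only the greedy packing of tokens into shorter subsegments reflects the procedure described above.

```python
# Hypothetical sketch: split an over-long VAD segment into subsegments using
# token timestamps obtained from forced alignment. The 20 s limit is assumed.
def split_segment(tokens, max_dur=20.0):
    """tokens: list of (word, start_sec, end_sec) for one VAD segment."""
    subsegments, current = [], []
    for word, start, end in tokens:
        if current and end - current[0][1] > max_dur:
            subsegments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        subsegments.append(current)
    return [(seg[0][1], seg[-1][2], " ".join(w for w, _, _ in seg))
            for seg in subsegments]

# Each returned tuple is (subsegment start, subsegment end, subsegment text).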
15 https://partnerhelp.netflixstudios.com/hc/en-us/articles/217350977-English-Timed-Text-Style-Guide
16 https://www.ted.com/participate/translate/subtitling-tips
17 Varies by program audience, commonly between 17 and 21.
A.8 Segmentation Examples
Examples of each transcript segmentation approach discussed (VAD, subtitles, and sentences) for sample
data from the development set. Examples were chosen to show segments from the longest and shortest
VAD quartiles, and the resulting subtitles following subtitle guidelines from §A.7.
Figure 12: Examples of each discussed transcript segmentation approach for sample data from the development set.
The MINETRANS Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks
Yichao Du♭‡, Zhengsheng Guo♮, Jinchuan Tian♮, Zhirui Zhang♮, Xing Wang♮, Jianwei Yu♮, Zhaopeng Tu♮, Tong Xu♭‡ and Enhong Chen♭‡
♭ University of Science and Technology of China   ♮ Tencent AI Lab   ‡ State Key Laboratory of Cognitive Intelligence
♭ [email protected]   ♭ {tongxu, cheneh}@ustc.edu.cn   ♮ [email protected]
♮ {zhengshguo, tyriontian, tomasyu, brightxwang, zptu}@tencent.com
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 79–88
July 13-14, 2023 c 2023 Association for Computational Linguistics
leverages the standard sequence-to-sequence model to learn the mapping between source speech and discrete units directly. We found that with a large-scale dataset, such as 10,000 hours of training data, the previous multi-task learning technique (Jia et al., 2019; Lee et al., 2021a,b; Popuri et al., 2022; Dong et al., 2022) is not necessary for model convergence, and this approach can successfully handle the mapping between source speech and discrete units. We also explore various initialization strategies and several techniques to improve model performance, including (1) different self-supervised pre-trained speech encoders and pre-trained text-to-unit models, and (2) data filtering and augmentation, consistency training, and model ensembles. To the best of our knowledge, we are the first and only team to successfully train and submit an end-to-end S2ST model on this challenging track. Our code is open-sourced at: https://github.com/duyichao/MINETrans-IWSLT23.

The remainder of this paper is organized as follows: Section 2 describes data preparation, including data statistics, data preprocessing, and data filtering. Section 3 describes our solution for the offline speech translation track. Section 4 describes our solution for the speech-to-speech track. Section 5 concludes the paper.

2 Data Preparation

2.1 Data Statistics

Table 1 lists statistics of the speech corpora we used for MINETRANS training, which can be divided into four categories: unlabeled speech, ASR, TTS, and S2ST corpora.

Corpus                         Utterances (k)   Duration (h)   S2T CST.   S2ST CST.
Unlabeled  VoxPopuli                   22,905         28,708     ✓          ✓
ASR        MuST-C v1&v2                   342            617     ✓          –
           Common Voice v11.0           1,680          3,098     ✓          –
           Librispeech                    281            960     ✓          –
           Europarl-ST                     34             81     ✓          –
           GigaSpeech                   8,030         10,000     ×          –
MT         NewsCommentary                  32              –     ✓          –
           OpenSubtitles                9,969              –     ✓          –
           MuST-C v1&v2                   543              –     ✓          –
           In-house                         –              –     ×          –
TTS        AISHELL-3                       88             85     –          ✓
           GigaSS-S                       210            244     –          ✓
S2ST       GigaSS                       7,635          9,000     –          ✓
           CoVoST synthetic               288            288     –          ✓
           MuST-C synthetic               358            587     –          ✓
Table 1: Statistics of the training data. "CST." indicates that a corpus is on the constrained-corpus list of the corresponding S2T or S2ST task; "–" indicates that the column does not apply to that corpus.

Unlabeled Speech. As shown in Table 1, we integrate source-side speech from VoxPopuli (Wang et al., 2021a) and GigaSS2 to build a large-scale unlabeled English speech corpus for self-supervised training of the speech encoders Wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021), which are used to initialize the S2UT model in the S2ST track. Similarly, we also integrate target speech from GigaSS and AISHELL-3 (Shi et al., 2020) to train the Chinese HuBERT, which is used for discretizing Chinese speech.
2 https://github.com/SpeechTranslation/GigaS2S

ASR Corpus. To train data-constrained English ASR models, we merge MuST-C (Gangi et al., 2019), Common Voice v11 (Ardila et al., 2019), Librispeech (Panayotov et al., 2015), and Europarl-ST (Iranzo-Sánchez et al., 2019), resulting in approximately 4,500 hours of labeled ASR data, as shown in Table 1. For MuST-C and Europarl-ST, we collect source speech for all translation directions and de-duplicate it based on audio identifiers. In addition, GigaSpeech (Chen et al., 2021) is used to build the data-unconstrained ASR model; it includes 10k hours of data covering various sources (audiobooks, podcasts, and streaming media), speaking styles (read and spontaneous), and topics (arts, science, sports, etc.). Of these corpora, we use MuST-C as the in-domain data for the Offline track and the rest as out-of-domain data.

MT Corpus. To train data-constrained English-to-Chinese MT models, MuST-C v1&v2 are considered in-domain corpora, while the OpenSubtitles2018 (Lison et al., 2018) and NewsCommentary3 corpora are considered out-of-domain. Additionally, we utilize in-house corpora to train data-unconstrained MT models, although we cannot provide further details about them.
3 https://opus.nlpl.eu/News-Commentary.php

TTS Corpus. To ensure that the target speech timbre matches the S2ST track, we consider the single-speaker GigaSS-S, a small subset of GigaSS, as in-domain and the multi-speaker AISHELL-3 (Shi et al., 2020) as out-of-domain. These corpora are used to train the TTS model and its corresponding vocoder.

S2ST Corpus. The full version of GigaSS is used to train our end-to-end S2UT model; it is a large-scale S2ST corpus derived from GigaSpeech (Chen et al., 2021) via MT and TTS. We also construct S2ST pseudo-data, the details of which are presented in Section 4.1.2.

2.2 Data Pre-processing and Filtering

In general, a simple way to improve model performance is to provide the models with better data. However, through a careful review of the data, we identified issues with the quality of the original data. To address this, we performed the following pre-processing and filtering:

• We convert all audio data to mono-channel 16 kHz wav format. Since the sentences in spoken translation are generally short, we discard sentences with text longer than 100 tokens or speech longer than 3,000 frames. Then 80-dimensional log-mel filter bank acoustic features are extracted with a step size of 10 ms and a window size of 25 ms. The acoustic features are normalized by global channel mean and variance.

• We use an ASR model pre-trained on Librispeech to filter audio with very poor quality, i.e., a word error rate (WER) of more than 75.

• Since the annotation format is not uniform across the datasets, we remove non-printing characters, speaker names, laughter, applause and other events. In addition, we also normalize punctuation marks.

• For the English-to-Chinese direction of MuST-C, we first merge the v1 and v2 versions and then remove duplicates based on audio identifiers.

3 Offline Speech Translation

3.1 Cascaded MINETRANS S2T System

3.1.1 Speech Recognition

A standard RNN-Transducer (Graves, 2012) model is used for speech recognition. It consists of an acoustic encoder, a prediction network and a joint network. The acoustic encoder contains 18 Conformer (Gulati et al., 2020) layers with the following dimensions: the attention size is 512, the feed-forward size is 2048, the number of attention heads is 4, and the convolutional kernel size is 31. The prediction network is a standard 1-layer LSTM with a hidden size of 1024. The joint network is linear with a size of 512. The input acoustic features are 80-dim Fbank plus 3-dim pitch, which are down-sampled by a 2-layer CNN with a factor of 6 along the time axis before being fed into the acoustic encoder. The overall parameter budget is 126M. During training, SpecAugment (Park et al., 2019) is consistently adopted for data augmentation. Training on the GigaSpeech and MuST-C datasets lasts for 50 epochs each and consumes 32 Nvidia V100 GPUs. The Adam optimizer is adopted, with a peak learning rate of 5e-3, 25k warmup steps and an inverse square root decay schedule (Vaswani et al., 2017a). Model weights from the last 10 epochs are averaged before decoding. The default decoding method described in Graves (2012) is adopted with a beam size of 10. External language models are not used in any form.

ASR Output Adaptation. In the realm of automatic speech recognition (ASR) and machine translation (MT), it is common for ASR output to lack punctuation, whereas MT models are sensitive to punctuation. To address this issue, we propose an ASR output adaptation method that incorporates a punctuation model between ASR and MT. Specifically, we adopt a BERT-based punctuation model that can automatically recover the original punctuation.
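The punctuation model is only characterized as BERT-based above. The sketch below illustrates just the adaptation step, assuming some token-level tagger that predicts the punctuation mark (if any) following each ASR token; the tagger itself is an assumption, and only the stitching of its predictions into punctuated MT input reflects the described pipeline.

```python
# Hypothetical sketch of the ASR-output adaptation step: given per-token
# punctuation predictions from some BERT-based tagger (not specified here),
# rebuild a punctuated sentence to feed into the MT model.
def restore_punctuation(tokens, predicted_marks):
    """tokens: ASR tokens without punctuation; predicted_marks: one of
    '', ',', '.', '?' per token, as produced by a punctuation tagger."""
    pieces = [token + mark for token, mark in zip(tokens, predicted_marks)]
    text = " ".join(pieces)
    return text[0].upper() + text[1:] if text else text

asr_output = ["so", "today", "i", "want", "to", "talk", "about", "speech", "translation"]
marks      = ["",   ",",     "",  "",     "",  "",     "",      "",       "."]
print(restore_punctuation(asr_output, marks))
# -> "So today, i want to talk about speech translation."
```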
The objective of this approach is to bridge the disparity between ASR and MT, leading to improved overall performance in speech translation tasks.

Speech Segmentation. Speech translation is a multi-faceted task that requires bridging the gap between automatic speech recognition (ASR) and machine translation (MT) systems. To address these challenges, we employ several text augmentation techniques to improve the quality and accuracy of our training data. Specifically, we use speech-based audio segmentation (SHAS; Tsiamas et al., 2022) to identify and segment meaningful units of speech that can be accurately translated by the MT system.

3.1.2 Machine Translation

In our systems, we adopt four different translation strategies:

• TRANSFORMER is a system trained on the constrained data. We train the Transformer-base (Vaswani et al., 2017b) model on the constrained general data and fine-tune the model on the in-domain MuST-C data.

• M2M-1004 (Fan et al., 2021) is a multilingual model trained for many-to-many multilingual translation. We employ a supervised in-domain fine-tuning strategy to fine-tune the 1.2B-parameter M2M-100 model on the downstream MuST-C data.
4 https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100

• CHATGPT is a large language model product developed by OpenAI. Previous studies (Jiao et al., 2023; Wang et al., 2023) have demonstrated that ChatGPT is a good translator for high-resource languages. We therefore use suitable translation prompts with ChatGPT to carry out the translation task.

• IN-HOUSE MODEL. We fine-tune our in-house translation model (Huang et al., 2021) on the MuST-C data. Our in-house model is a Transformer-big (Vaswani et al., 2017b) model with a deep encoder (Dou et al., 2018).

Data Re-Annotation. We have identified two issues with the annotation of the English-to-Chinese translation direction in the MuST-C v2.0 test set5. Firstly, we have observed samples with incorrect literal translations. For example, for the parallel sentence pair "I remember my first fire. ||| 记得我第一场火", the English word "fire" should be translated as "火灾 (huo zai)" rather than "火 (huo)". Secondly, we have noticed inconsistencies in the punctuation annotation, as most Chinese translations lack proper full stops. To address these issues, we employed a professional translator to accurately translate the English sentences. We will release this data to facilitate future research in the field.
5 https://ict.fbk.eu/MuST-C/

Domain Augmentation. The MuST-C v2.0 training data contains a considerable number of bilingual sentence pairs that are only partially aligned. In the pair "Thank you so much Chris. ||| 非常谢谢，克里斯。的确非常荣幸", we are unable to locate the corresponding translation of the Chinese phrase "的确非常荣幸" in the English sentence. As Koehn and Knowles (2017) and Wang et al. (2018) pointed out, such data noise (partially aligned data) has been shown to hurt the performance of neural machine translation (NMT). To address this issue, we employ a data rejuvenation strategy (Jiao et al., 2020). Specifically, we first fine-tune the model using the raw parallel data and then rejuvenate the low-quality bilingual samples to enhance the training data.

3.2 Experiment

The cascaded MINETRANS S2T system we propose comprises an automatic speech recognition (ASR) model and a machine translation (MT) model. In our evaluation, we assess the performance of each component separately: the ASR system is evaluated with the word error rate (WER), while the BLEU score is used to evaluate the machine translation model.

The ASR results obtained on the MuST-C dataset, with and without fine-tuning, are presented in Table 2. When the GigaSpeech ASR system is used without fine-tuning, we observe a WER of 10.0 on the MuST-C test set. However, when the system is fine-tuned on the MuST-C dataset, performance improves considerably, with the error rate decreasing from 10.0 to 5.8. This highlights the effectiveness of fine-tuning on the MuST-C dataset in enhancing the overall performance of our system.
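As a concrete illustration of this component-wise evaluation, WER and BLEU can be computed with standard packages. The snippet below is a generic sketch using jiwer and sacreBLEU on toy data; it is not the authors' evaluation script.

```python
import jiwer
import sacrebleu

# Generic sketch of the component-wise evaluation described above:
# WER for the ASR module, corpus BLEU for the MT module (toy data).
asr_refs = ["i remember my first fire", "thank you so much chris"]
asr_hyps = ["i remember my first fire", "thank you so much christs"]
print("WER:", jiwer.wer(asr_refs, asr_hyps))

mt_refs = ["我记得我的第一场火灾。", "非常感谢，克里斯。"]
mt_hyps = ["记得我第一场火。", "非常谢谢，克里斯。"]
# For Chinese output, sacreBLEU's "zh" tokenizer should be used.
bleu = sacrebleu.corpus_bleu(mt_hyps, [mt_refs], tokenize="zh")
print("BLEU:", round(bleu.score, 2))
```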
System               Dev   Test
GigaSpeech           9.3   10.0
 + MuST-C Finetune   4.8    5.8
Table 2: ASR performance measured in terms of word error rate.

We evaluate the various translation strategies on the MuST-C test set. The experimental results are presented in Table 3. In the constrained scenario, TRANSFORMER achieved a test BLEU score of 25.04, whereas M2M-100 attained a marginally higher score of 25.40. In the unconstrained setting, CHATGPT demonstrated strong performance with a BLEU score of 28.25, while the IN-HOUSE MODEL obtained the highest BLEU score of 30.91. These results emphasize the importance of in-domain data for achieving optimal performance in spoken language translation.

System            Dev     tst-COMMON
TRANSFORMER       13.93    25.04
M2M-100           16.53    25.40
CHATGPT             —      28.25
IN-HOUSE MODEL    21.52    30.91
Table 3: Offline speech translation performance measured in terms of the BLEU score.

4 Speech-to-Speech Translation

4.1 End-to-End MINETRANS S2ST System

As shown in Figure 1, we construct an end-to-end S2UT (Lee et al., 2021a) model comprising a speech encoder, a length adapter, and a unit decoder. Following Lee et al. (2021a), we encode target speech as discrete units via our trained Chinese HuBERT and remove consecutive repeated units to generate a reduced unit sequence. Unlike Lee et al. (2021a), our S2UT model directly learns the mapping between source speech and discrete units without any auxiliary recognition tasks (i.e., ASR and MT tasks), whose hyper-parameters are difficult to tune. We then leverage a unit-based HiFi-GAN vocoder for unit-to-waveform conversion (Polyak et al., 2021). Next, we detail our efforts on pre-training for model initialization, data augmentation, consistency training and model ensembling, which are used to improve the translation quality of our system.

Figure 1: The overall architecture of the end-to-end S2ST system (speech encoder, length adapter, unit decoder, and unit HiFi-GAN vocoder from source waveform to target waveform).

4.1.1 Pretrained Models

Previous experience (Dong et al., 2022; Popuri et al., 2022) has shown that better initialization can reduce learning difficulty, so we explore pre-training of both the speech encoder and the unit decoder.

Speech Encoder Pre-training. We use Wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021), which are trained in a self-supervised manner, as speech encoders. Due to the data limitation of the S2ST track, we use the unlabeled speech described in Table 1 to train the speech encoders:

• Wav2vec 2.0 uses a multi-layer convolutional neural network to encode audio and then a Transformer-based context encoder to construct contextual representations. The model is trained with a contrastive loss over masked spans of the context encoder input. In this paper, we replace the Transformer with a Conformer to obtain better performance.

• HuBERT has the same model architecture as Wav2vec 2.0. However, its training process differs primarily in the use of a cross-entropy loss and, additionally, in the construction of targets through a separate clustering process.

Unit Decoder Pre-training. We use a standard sequence-to-sequence model for the text-to-unit (T2U) task on GigaSS, and the decoder of this model is then used to initialize the unit decoder of the S2UT model.
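The reduced unit sequences mentioned in §4.1 are obtained by collapsing consecutive repeated discrete units. A minimal sketch of that reduction step follows; the unit IDs are dummies, and the unit extraction itself is assumed to come from a pre-trained HuBERT encoder with a k-means codebook.

```python
from itertools import groupby

# Collapse consecutive repeated discrete units into a "reduced" unit sequence,
# as used for the S2UT targets described above.
def reduce_units(units):
    return [unit for unit, _ in groupby(units)]

frame_level_units = [17, 17, 17, 4, 4, 231, 231, 231, 231, 4]
print(reduce_units(frame_level_units))  # [17, 4, 231, 4]
```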
The T2U model contains 12 Transformer layers for the encoder and the decoder, respectively. More specifically, we set the size of the self-attention layer, the feed-forward network, and the number of attention heads to 1024, 4096, and 8, respectively.

4.1.2 Model Finetuning

We combine the pre-trained speech encoder and unit decoder, adding a randomly initialized length adapter between the pre-trained modules. The length adapter consists of a one-dimensional convolutional layer with a stride of 2, which mitigates the length difference between the source audio and the reduced target unit sequence, as well as the mismatch between their representations.

Consistency Training. To further improve the consistency of our model, we employ the R-Drop algorithm (Liang et al., 2021) with the weight α set to 5. R-Drop reduces the inconsistency between training and inference introduced by dropout, thereby improving generalization. Specifically, it randomly drops out parts of the model during training, forcing it to learn more robust representations that are less sensitive to small changes in the input. For a more detailed description of the R-Drop algorithm and its implementation, please refer to Liang et al. (2021).

4.1.3 Unit-based Vocoder

We utilize the unit-based HiFi-GAN (Polyak et al., 2021) vocoder to convert discrete units into waveforms for the speech-to-unit model. Following the setup of Lee et al. (2021a), we augment the vocoder with a duration prediction module for the reduced unit output, which consists of two 1D convolutional layers, each with ReLU activation, followed by layer normalization and a linear layer.

4.1.4 Ensemble

Model ensembling can reduce the inconsistency of the system to some extent, and we consider an ensemble of four variants of the S2UT model:

• W2V2-CONF-LARGE: The speech encoder is initialized with the Conformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized randomly.

• W2V2-CONF-LARGE+T2U: The speech encoder is initialized with the Conformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized from the T2U model.

• W2V2-TRANS-LARGE+T2U: The speech encoder is initialized with the Transformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized from the T2U model.

• HUBERT-TRANS-LARGE+T2U: The speech encoder is initialized with the Transformer-based HuBERT LARGE model. The unit decoder is initialized from the T2U model.

4.1.5 Data Augmentation

We utilize well-trained FastSpeech 2 (Ren et al., 2020) TTS models (see Section 4.2 for details) to generate speech for the MuST-C and CoVoST Chinese texts and thus construct pseudo-corpora. These pseudo-corpora are used as training data together with the original labeled S2ST corpus.

4.2 Experiments

4.2.1 Implementation Details

All end-to-end S2UT models are implemented with the FAIRSEQ6 (Ott et al., 2019) toolkit. We use a pre-trained Chinese HuBERT model and a k-means model to encode the Chinese target speech into a vocabulary of 250 units. The Chinese HuBERT and k-means models are learned from the TTS data in Table 1. The architectural details of the S2UT models are given in Section 4.1.4. During training, we use the Adam optimizer with a learning rate of 5e-5 and 8K warm-up updates. The label smoothing and dropout ratios are set to 0.15 and 0.2, respectively. In practice, we train S2UT models on 8 Nvidia Tesla A100 GPUs for 150K update steps. The batch size on each GPU is set to 1200K, and we accumulate gradients over every 9 batches. For the first 5K steps of S2UT training, we freeze the speech encoder. The unit HiFi-GAN vocoder is trained with the SPEECH-RESYNTHESIS7 toolkit for 500k steps. For FastSpeech 2 and HiFi-GAN, we follow the PaddleSpeech AISHELL-3 recipe8 for training. During inference, we average the model parameters of the 30 best checkpoints based on performance on the GigaSS dev set, and adopt a beam search strategy with a beam size of 10.
6 https://github.com/facebookresearch/fairseq
7 https://github.com/facebookresearch/speech-resynthesis
8 https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3
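The implementation details above mention averaging the parameters of the 30 best checkpoints before decoding. A generic PyTorch sketch of such checkpoint averaging follows; the file paths and the "model" key are assumptions about the checkpoint layout, and fairseq also ships its own averaging script.

```python
import torch

# Generic parameter averaging over a list of checkpoint files, in the spirit of
# "average the model parameters of the 30 best checkpoints" described above.
def average_checkpoints(paths):
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# averaged = average_checkpoints([f"checkpoints/checkpoint.best_{i}.pt" for i in range(30)])
# torch.save({"model": averaged}, "checkpoints/checkpoint.avg30.pt")
```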
ID  Model                        BLEU   chrF
1   W2V2-CONF-LARGE              27.7   23.4
2   W2V2-CONF-LARGE+T2U          27.8   23.7
3   W2V2-TRANS-LARGE+T2U         25.2   22.3
4   HUBERT-TRANS-LARGE+T2U       26.2   23.2
5   HUBERT-TRANS-LARGE+T2U*      25.7   22.6
6   Ensemble(1, 2, 4)            28.0   23.9
7   Ensemble(2, 4, 5)            27.2   23.0
Table 4: ASR-BLEU and ASR-chrF on the GigaSS validation set. '*' indicates adding the GigaST test set to the training data and fine-tuning on it for one round.

4.2.2 Results

To evaluate the speech-to-speech translation system, we use a Chinese ASR system9 trained on WenetSpeech (Zhang et al., 2021) to transcribe the speech output with the ctc_greedy_search mode. Based on this, we report case-sensitive BLEU and chrF scores between the produced transcript and a textual human reference using sacreBLEU. The results on the GigaSS validation set are shown in Table 4. Comparing W2V2-CONF-LARGE+T2U and W2V2-TRANS-LARGE+T2U, initializing with the Conformer-based pre-trained speech encoder yields better performance. In addition, we find that adding the GigaST test set to the training data leads to a slight performance degradation on the validation set, possibly because the annotations of the test set are calibrated by humans and their style differs from that of the training data.
9 https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md

5 Conclusion

This paper presents the MINETRANS systems for two challenge tracks of IWSLT 2023: Offline Speech Translation (S2T) and Speech-to-Speech Translation (S2ST). For the S2T track, MINETRANS employs a cascaded system to investigate the limits of translation performance in both constrained and unconstrained settings. We explore two machine translation strategies: supervised in-domain fine-tuning and prompt-guided translation using a large language model. For the S2ST track, MINETRANS builds an end-to-end model based on the speech-to-unit (S2U) framework. To the best of our knowledge, we are the first and only team to successfully train and submit an end-to-end S2ST model on this track. This model uses our trained HuBERT to encode the target speech as discrete units and leverages a standard sequence-to-sequence model to directly learn the mapping between source speech and discrete units, without the need for auxiliary recognition tasks such as ASR and MT. We use several techniques to improve MINETRANS's performance, including speech encoder pre-training on large-scale data, data filtering, data augmentation, speech segmentation, consistency training, and model ensembling.

Acknowledgements

This work is supported by grants from the National Natural Science Foundation of China (No. 62222213, U20A20229, 62072423) and the USTC Research Funds of the Double First-Class Initiative (No. YD2150002009). The authors would like to thank the anonymous reviewers for their valuable comments. Zhirui Zhang and Tong Xu are the corresponding authors.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Y. Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, B. Hsu, Dávid Javorský, Věra Kloudová, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur,
Paul McNamee, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John E. Ortega, Juan Miguel Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander H. Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In IWSLT.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. In International Conference on Language Resources and Evaluation.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems.

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Weiqiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. ArXiv, abs/2106.06909.

Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, and Yu Zhang. 2022. Leveraging pseudo-labeled data to improve direct speech-to-speech translation. In Interspeech.

Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. 2018. Exploiting deep representations for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4253–4262.

Yichao Du, Weizhi Wang, Zhirui Zhang, Boxing Chen, Tong Xu, Jun Xie, and Enhong Chen. 2022. Non-parametric domain adaptation for end-to-end speech translation. In Conference on Empirical Methods in Natural Language Processing.

Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen, Jun Xie, and Tong Xu. 2021. Regularizing end-to-end speech translation with triangular decomposition agreement. In AAAI Conference on Artificial Intelligence.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond English-centric multilingual machine translation. The Journal of Machine Learning Research, 22(1):4839–4886.

Mattia Antonino Di Gangi, R. Cattoni, L. Bentivogli, Matteo Negri, and M. Turchi. 2019. MuST-C: a multilingual speech translation corpus. In NAACL.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proc. Interspeech 2020, pages 5036–5040.

Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv preprint arXiv:2305.04118.

Oleksii Hrinchuk, Vahid Noroozi, Ashwinkumar Ganesan, Sarah Campbell, Sandeep Subramanian, Somshubra Majumdar, and Oleksii Kuchaiev. 2022. NVIDIA NeMo offline speech translation systems for IWSLT 2022. In IWSLT.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Guoping Huang, Lemao Liu, Xing Wang, Longyue Wang, Huayang Li, Zhaopeng Tu, Chengyan Huang, and Shuming Shi. 2021. TranSmart: A practical interactive machine translation system. arXiv preprint arXiv:2105.13072.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Alberto Sanchís, Jorge Civera Saiz, and Alfons Juan-Císcar. 2019. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.

Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019. Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Wenxiang Jiao, Xing Wang, Shilin He, Irwin King, Michael Lyu, and Zhaopeng Tu. 2020. Data rejuvenation: Exploiting inactive training examples for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2255–2266.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics.
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Miguel Pino, and Wei-Ning Hsu. 2021a. Direct speech-to-speech translation with discrete units. In Annual Meeting of the Association for Computational Linguistics.

Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, and Wei-Ning Hsu. 2022. Direct speech-to-speech translation with discrete units. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3327–3339, Dublin, Ireland. Association for Computational Linguistics.

Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Juan Miguel Pino, Jiatao Gu, and Wei-Ning Hsu. 2021b. Textless speech-to-speech translation on real data. ArXiv, abs/2112.08352.

Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, M. Zhang, and Tie-Yan Liu. 2021. R-Drop: Regularized dropout for neural networks. ArXiv, abs/2106.14448.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In International Conference on Language Resources and Evaluation.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. In INTERSPEECH.

H. Ney. 1999. Speech translation: coupling of recognition and translation. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 1:517–520 vol.1.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, S. Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and S. Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. ArXiv, abs/2104.00355.

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Miguel Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. 2022. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In Interspeech.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and high-quality end-to-end text to speech. ArXiv, abs/2006.04558.

Yao Shi, Hui Bu, Xin Xu, Shaojing Zhang, and Ming Li. 2020. AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines. In Interspeech.

Matthias Sperber, Graham Neubig, J. Niehues, and A. Waibel. 2017. Neural lattice-to-sequence models for uncertain inputs. In EMNLP.

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. 2021a. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Annual Meeting of the Association for Computational Linguistics.

Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210.

Minghan Wang, Yuxia Wang, Chang Su, Jiaxin Guo, Yingtao Zhang, Yujiao Liu, M. Zhang, Shimin Tao, Xingshan Zeng, Liangyou Li, Hao Yang, and Ying Qin. 2021b. The HW-TSC's offline speech translation system for IWSLT 2022 evaluation. In IWSLT.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2591–2600.
Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao,
Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen,
Chenchen Zeng, Di Wu, and Zhendong Peng. 2021.
Wenetspeech: A 10000+ hours multi-domain man-
darin corpus for speech recognition. ICASSP 2022
- 2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 6182–
6186.
Peidong Zhang, Boxing Chen, Niyu Ge, and Kai Fan.
2019. Lattice transformer for speech translation. In
ACL.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Dan Liu,
Junhua Liu, and Lirong Dai. 2022a. The ustc-nelslip
offline speech translation systems for iwslt 2022. In
IWSLT.
Ziqiang Zhang, Junyi Ao, Shujie Liu, Furu Wei, and
Jinyu Li. 2022b. The yitrans end-to-end speech
translation system for iwslt 2022 offline shared task.
ArXiv, abs/2206.05777.
Improving End-to-End Speech Translation by Imitation-Based Knowledge
Distillation with Synthetic Transcripts
Architecture   Hypotheses   #   Decoding Setup                            Source Transcripts   dev-BLEU↑
RNN            full         1   AST                                       –                    11.9
RNN            full         2   ASR transcribes, NMT expert translates    –                    21.8
RNN            partial      3   AST starts, NMT expert completes          gold                 21.9
RNN            partial      4   AST starts, NMT expert completes          synthetic            15.6
Transformer    full         5   AST                                       –                    16.7
Transformer    full         6   ASR transcribes, NMT expert translates    –                    25.4
Transformer    partial      7   AST starts, NMT expert completes          gold                 25.4
Transformer    partial      8   AST starts, NMT expert completes          synthetic            19.9
Table 3: Feasibility experiment: BLEU score on CoVoST2 development set of NMT expert’s completion of AST
model full or partial hypotheses with greedy decoding; gold denotes the usage of the dataset’s source language
transcripts as NMT inputs and synthetic denotes synthetic transcripts created by the respective ASR model.
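Rows 3–4 and 7–8 of Table 3 correspond to the setting where the AST student starts a hypothesis and the NMT expert completes it. The sketch below is a hedged illustration of that completion step using a generic Hugging Face seq2seq translation model; the model choice is an assumption, and the paper's expert is actually a fairseq WMT19 model.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical illustration of "AST starts, NMT expert completes": translate
# the (possibly synthetic) source transcript while forcing the expert decoder
# to continue from the student's partial hypothesis. Model name is assumed.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "Said he would consider it."        # transcript fed to the expert
partial_hypothesis = "Sagte, er"             # prefix produced by the AST student

inputs = tokenizer(source, return_tensors="pt")
prefix_ids = tokenizer(text_target=partial_hypothesis, return_tensors="pt").input_ids[:, :-1]  # drop </s>
start = torch.tensor([[model.config.decoder_start_token_id]])
decoder_input_ids = torch.cat([start, prefix_ids], dim=1)

outputs = model.generate(**inputs, decoder_input_ids=decoder_input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```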
Architecture   Models
CoVoST2 (dev test)   MuST-C (dev test)
Standard   13.6   10.0   14.6   14.1
ours   baseline   ours   baseline

these two new corpora. As Table 6 shows, Transformer KD+ trained on translated gold transcripts outperforms its coun-
                                        CoVoST2                              MuST-C
IL Algorithm   Model       Data    BLEU↑ (dev/test)   TER↓ (dev/test)   BLEU↑ (dev/test)   TER↓ (dev/test)
Dagger         Standard    gold    18.4 / 14.2        69.1 / 77.1       19.5 / 19.4        70.8 / 69.4
               IKD+        gold    21.8 / 18.4        63.7 / 70.0       23.2 / 23.3        67.4 / 65.6
               SynthIKD+   synth   21.8 / 18.5        63.6 / 69.8       23.5 / 23.5        67.2 / 65.6
Warm-start AggreVaTe, sentence-BLEU reward-to-go
               Standard    gold    18.7 / 14.6        68.2 / 76.0       19.9 / 19.9        70.2 / 68.1
               Standard    synth   18.7 / 14.6        68.2 / 75.9       20.0 / 19.7        70.1 / 68.7
               IKD+        gold    22.1 / 18.5        63.1 / 69.6       23.5 / 23.4        67.4 / 65.7
               SynthIKD+   synth   22.1 / 18.5        63.1 / 69.7       23.5 / 23.6        67.0 / 65.6
Warm-start AggreVaTe, TER reward-to-go
               Standard    gold    18.7 / 14.7        67.8 / 75.4       20.0 / 19.9        70.0 / 68.5
               Standard    synth   18.7 / 14.6        67.9 / 75.6       19.9 / 19.6        69.8 / 68.4
               IKD+        gold    22.0 / 18.5        63.1 / 69.4       23.3 / 23.4        67.3 / 65.5
               SynthIKD+   synth   22.1 / 18.5        63.1 / 69.6       23.5 / 23.6        67.0 / 65.3
Table 5: Comparison of Dagger with warm-started AggreVaTe with a maximum of 50 epochs on CoVoST2 and
MuST-C.
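The BLEU↑ and TER↓ figures of the kind shown in Table 5 can be reproduced with sacreBLEU's metric classes. The following is a generic sketch on toy data, not the paper's outputs; the actual evaluation used case-sensitive detokenized scoring as described in Appendix A.

```python
from sacrebleu.metrics import BLEU, TER

# Generic corpus-level BLEU and TER scoring of the kind reported in Table 5 (toy data).
hypotheses = ["Der König hatte Glamis Castle eingenommen.", "Er sagte, er werde es sich überlegen."]
references = [["Der König hatte Glamis Castle in Besitz genommen.", "Er sagte, er würde darüber nachdenken."]]

bleu = BLEU()
ter = TER()
print("BLEU:", round(bleu.corpus_score(hypotheses, references).score, 1))
print("TER:", round(ter.corpus_score(hypotheses, references).score, 1))
```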
Figure 3: NMT expert top-8 output probabilities when translating the incorrect synthetic transcript “The king had
taken possession of Glamis Castle and plywood it.”
Figure 4: NMT expert top-8 output probabilities when translating the incorrect synthetic transcript “Slow down!”
correction. At the next timestep, however, the last symbol in the prefix is the subword unit "ge" and, as Figure 3b shows, the expert, driven by its decoder's language modeling capability, puts the highest probabilities on subword units that are most likely to produce a fluent output (the correct "pl@@", and the less probable "pflan@@" and "kl@@") rather than paying attention to the (wrong) information in the synthetic transcript.

Similar situations can be observed in samples with entirely wrong synthetic transcripts. In Figure 4, the expert has received the synthetic transcript "Slow down!" as input, which shares no meaning with the gold transcript "Said he'd consider it." As shown in Figure 4a, the expert assigns the highest probability to "@@low" if it is given the prefix "S" (as the expert has a shared vocabulary, it can complete the output this way), which turns the partial translation into an exact copy of the transcript. Again, the top-8 predictions do not share similar meaning with the transcript. After the expert has received the prefix "Sagte,", as in Figure 4b, it still attempts to complete y<t by generating output symbols that would turn y into a valid translation of this wrong transcript ("langsam" (slow), "ruhig" (quiet), "langs@@"), with the rest of the options being mostly driven by language modeling rather than reproducing the source semantics ("ent@@", "verlan@@").

Overall, with SynthIKD+ training, the expert induces smoothed output distributions and fluency in the student more than it enforces the student to predict one-hot labels produced by the expert, as is done by sequence-level KD.

5 Conclusion

We showed that a pretrained NMT model can successfully be used as an oracle for an AST student, without requiring gold source language transcripts as in previous approaches to imitation learning for AST.
This widens the applicability of imitation learning approaches to datasets that do not contain manual transcripts, or to pre-trained ASR models for which training transcripts are not available. Our qualitative analysis suggests an explanation of the fact that the NMT oracle is robust against mismatches between manual and synthetic transcripts: its large language modeling capability allows it to continue the prefix solely based on its learned contextual knowledge.

6 Limitations

There are several limitations of this study. First, it is done on one language pair, although we believe this should not qualitatively change the results. Second, only one set of standard model sizes was evaluated for the AST student and the NMT expert; we expect the results to be in line with reported findings for NMT (Ghorbani et al., 2021). Finally, while alluding to the potential of using large pre-trained ASR models instead of manual transcripts for IL-based AST, our current work must be seen as a proof-of-concept experiment in which we train ASR models on a few hundred hours of audio and discard the manual transcripts in IL training, showing the feasibility of our idea.

Acknowledgements

The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Alexandre Berard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-end automatic speech translation of audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 6224–6228. IEEE.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online. Association for Computational Linguistics.

Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. 2021. Scaling laws for neural machine translation. CoRR, abs/2109.07740.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

Luca Hormann and Artem Sokolov. 2021. Fixing exposure bias with imitation learning needs powerful oracles. CoRR, abs/2109.04114.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. 2020. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6121–6133, Online. Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-End Speech Translation with Knowledge Distillation. In Proc. Interspeech 2019, pages 1128–1132.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-Training for End-to-End Speech Translation. In Proc. Interspeech 2020, pages 1476–1480.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. CoRR, abs/2212.04356.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

Stéphane Ross and Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635, Fort Lauderdale, FL, USA. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.

Yun Tang, Juan Pino, Xian Li, Changhan Wang, and Dmitriy Genzel. 2021a. Improving speech translation by understanding and learning from the auxiliary text translation task. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4252–4261, Online. Association for Computational Linguistics.
Yun Tang, Juan Pino, Changhan Wang, Xutai Ma, and Dmitriy Genzel. 2021b. A general multi-task learning framework to leverage text data for speech to text tasks. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6209–6213.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu,
Dmytro Okhonko, and Juan Pino. 2020. Fairseq
S2T: Fast speech-to-text modeling with fairseq. In
Proceedings of the 1st Conference of the Asia-Pacific
Chapter of the Association for Computational Lin-
guistics and the 10th International Joint Conference
on Natural Language Processing: System Demon-
strations, pages 33–39, Suzhou, China. Association
for Computational Linguistics.
Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino.
2021. CoVoST 2 and Massively Multilingual Speech
Translation. In Interspeech, pages 2247–2251.
Chaojun Wang and Rico Sennrich. 2020. On exposure
bias, hallucination and domain shift in neural ma-
chine translation. In ACL.
Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui
Wu, and Zhifeng Chen. 2017. Sequence-to-sequence
models can directly translate foreign speech. In In-
terspeech 2017, 18th Annual Conference of the Inter-
national Speech Communication Association, Stock-
holm, Sweden, August 20-24, 2017, pages 2625–2629.
ISCA.
Ronald J. Williams and David Zipser. 1989. A learning
algorithm for continually running fully recurrent neu-
ral networks. Neural Computation, 1(2):270–280.
Jeremy H.M. Wong and Mark J.F. Gales. 2016. Se-
quence Student-Teacher Training of Deep Neural
Networks. In Proc. Interspeech 2016, pages 2761–
2765.
Rong Ye, Mingxuan Wang, and Lei Li. 2021. End-to-
end speech translation via cross-modal progressive
training. In Proc. of INTERSPEECH.
Biao Zhang, Barry Haddow, and Rico Sennrich. 2022a.
Revisiting end-to-end speech-to-text translation from
scratch. In International Conference on Machine
Learning, pages 26193–26205. PMLR.
Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol
Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yan-
ping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min
Ma, William Chan, Jiahui Yu, Yongqiang Wang, Lian-
gliang Cao, Khe Chai Sim, Bhuvana Ramabhadran,
Tara N. Sainath, Francoise Beaufays, Zhifeng Chen,
Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and
Yonghui Wu. 2022b. BigSSL: Exploring the frontier
of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing, 16(6):1519–1532.

Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. 2021. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In International Conference on Machine Learning, pages 12736–12746. PMLR.

Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations (ICLR).
A Models, Meta-parameters, and Training Settings

We use the speech-to-text module of the fairseq framework (Ott et al., 2019; Wang et al., 2020) for all experiments and train both RNNs with convolutional layers for time dimension reduction as in Berard et al. (2018) and small Transformers as in Wang et al. (2020), which consist of a convolutional subsampler of two convolutional blocks, followed by 12 encoder layers and 6 decoder layers. The dimension of the self-attention layer is 256 and the number of attention heads is set to 4. For the NMT oracle, we use the trained Transformer model from Facebook's submission to WMT19 (Ng et al., 2019),⁵ which is based on the big Transformer (Vaswani et al., 2017) with 6 encoder and decoder layers, 16 attention heads, a model dimension of 1024, and a larger feed-forward layer size of 8192. This NMT oracle had been trained on all available WMT19 shared task en-de training data and on back-translated English and German portions of the News crawl dataset.

For all models we use Adam (Kingma and Ba, 2015) with gradient clipping at norm 10 and stop training if the development set loss has not improved for 10 epochs. For RNN architectures, we return the best model on the development set, and for Transformers, we create each model by averaging over the last 10 checkpoints. For inference, a beam size of 5 was used and we report case-sensitive detokenized BLEU (Papineni et al., 2002) computed with sacreBLEU (Post, 2018). We tested for statistical significance with the paired approximate randomization test (Riezler and Maxwell, 2005).

For all experiments, we preprocess the datasets as follows: We extract log mel-scale filterbanks with a povey window, 80 bins, a pre-emphasis filter of 0.97, a frame length of 25 ms and a frame shift of 10 ms. We discard samples with fewer than five or more than 3000 frames, subtract the mean of the waveform from each frame, and zero-pad the FFT input. For the text data, we normalize punctuation, remove non-printable characters, use the Moses tokenizer (Koehn et al., 2007) for tokenization, and segment the text data into subword units with byte-pair encoding (Sennrich et al., 2016). We used a random seed of 1 for all experiments.

We list the final used and best performing hyperparameters in Table A.2. Parameters that do not differ between the training methods are not repeated in the table. We determine the batch size by defining a maximum number of input frames in the batch.

⁵ As the WMT19 submission consists of an ensemble of models, we use the model1.pt for our experiments.

B Europarl-ST

We performed additional experiments on the Europarl-ST dataset (Iranzo-Sánchez et al., 2020), which provides 83 hours of speech training data. We train RNNs with a learning rate of 0.002 and a max-tokens size of 40,000 for a total of 80,000 updates. All other hyper-parameters are the same as listed for MuST-C in Table A.2. We only trained RNNs on the Europarl-ST dataset due to the small amount of available training data. We present the results in Table A.1.

Both the improvements over standard training and those from training on the gold-translated and synthetic-translated training data correspond with the results presented in the main body of this work. Hence, the results presented here hold for relatively small datasets, too.

Model                               BLEU↑ dev   BLEU↑ test
original dataset
  Standard                          13.8        14.4
  KD+                               17.4        17.8
  SynthKD+                          17.5        18.0
  IKD+                              17.0        17.1
  SynthIKD+                         17.0        17.0
translated gold training set
  Standard                          15.3        15.3
  KD+                               18.2        18.4
  IKD                               16.8        17.0
  IKD+                              17.1        17.5
synthetic translated training set
  Standard                          14.7        15.3
  KD+                               17.0        16.8
  IKD                               16.1        16.0
  IKD+                              16.3        16.6

Table A.1: Results on Europarl-ST

C Additional Example of NMT Expert Correction

Here we give another example of the NMT expert predicting the correct output token despite receiv-
Model         Hyperparameter             CoVoST2               MuST-C
RNN
  standard    learning rate              1e-3                  1e-3
              max-tokens                 60000                 40000
              scheduler                  fixed                 fixed
              warmup-updates             20000                 20000
              encoder freezing updates   10000                 10000
              dropout                    0.2                   0.2
  KD+         learning rate              1e-3                  2e-3
              max-tokens                 50000                 30000
              warmup-updates             25000                 20000
              max-update                 250000                250000
              encoder-freezing updates   20000                 10000
              scheduler                  inverse square root   inverse square root
Transformer
  ASR         learning rate              2e-3                  1e-3
              max-tokens                 50000                 40000
              max-update                 60000                 100000
              scheduler                  inverse square root   inverse square root
              warmup-updates             10000                 10000
              dropout                    0.15                  0.1
  AST
    standard  learning rate              2e-3                  2e-3
              max-update                 30000                 100000
              encoder-freezing updates   1000                  -
    KD+       max-tokens                 50000                 20000

Table A.2: List of hyperparameters that depend on model and dataset; we list only parameters which differ from the previous model's.
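As an illustration of the feature-extraction settings from Appendix A (80-bin log mel filterbanks, povey window, 0.97 pre-emphasis, 25 ms frames, 10 ms shift), the following minimal sketch uses torchaudio's Kaldi-compatible front end. The audio file path and the choice of torchaudio itself are assumptions made for illustration; they are not taken from the paper.

```python
# Minimal sketch (not the authors' code): Kaldi-style 80-dim log mel filterbanks
# with the settings listed in Appendix A, via torchaudio's compliance interface.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("example.wav")  # hypothetical input file

features = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,               # 80 bins
    frame_length=25.0,             # 25 ms frames
    frame_shift=10.0,              # 10 ms shift
    preemphasis_coefficient=0.97,  # pre-emphasis filter of 0.97
    window_type="povey",           # povey window
)

# Discard utterances outside the 5..3000 frame range, as described in Appendix A.
if not (5 <= features.shape[0] <= 3000):
    features = None
print(None if features is None else features.shape)  # (num_frames, 80)
```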
Pan Deng¹, Shihao Chen¹, Weitai Zhang¹,², Jie Zhang¹, Lirong Dai¹
¹University of Science and Technology of China, Hefei, China
²iFlytek Research, Hefei, China
{pdeng, shchen16, zwt2021}@mail.ustc.edu.cn; {jzhang6, lrdai}@ustc.edu.cn
Dataset               Condition   Ta-En   MSA-En
Tunisian              A           0.2M    -
OPUS                  B           -       42M
OPUS + Private data   C           -       61M
Filtered
Tunisian              A           0.2M    -

time dimension with mask parameters (mT, T) = (2, 70). Afterwards, we filtered out audio data that is longer than 3k frames. Further, we introduced
Figure 1: The data augmentation method for Tunisian-English Text, where * indicates the pseudo text.
Conformer model (Simonyan and Zisserman, 2014; Gulati et al., 2020), VGG-Transformer model (Vaswani et al., 2017) and GateCNN-Conformer model (Dauphin et al., 2017). These ASR models differ in their feature extractor modules (VGG, GateCNN) and acoustic modules (Conformer, Transformer). We chose diverse models with the expectation that increasing the variability of ASR models would improve the final ASR performance when using model ensemble methods. For dialect transfer in condition B/C, we pre-trained an ASR model using MSA data, which was then fine-tuned using the Tunisian data. Note that for condition A, we initially attempted to pre-train a phoneme recognition model for Tunisian but found it to be useless after fine-tuning the pre-trained model.

3.2 Data Augmentation for MT

We considered various data augmentation techniques for MT. To augment the Tunisian-English (Ta-En) dialect MT data, we used the back translation and forward translation (BTFT) method to create a synthetic parallel corpus that can be merged with the true bilingual data. To accomplish dialect transfer from MSA to Tunisian, we constructed a pivot MT model that converts MSA to Tunisian and produces abundant synthetic Ta-En data.

BTFT: Two MT models were first trained from Tunisian to English (Ta2En) and from English to Tunisian (En2Ta) using the MT data of condition A. The Tunisian text and English text were then respectively fed to the corresponding MT models for inference, resulting in paired Tunisian-to-synthetic-English text and paired synthetic-Tunisian-to-English text. It is worth noting that the Ta2En model implements the forward translation approach, similarly to the sequence-level knowledge distillation method (Kim and Rush, 2016), while the En2Ta model employs the backward translation (Sennrich et al., 2016a) approach. Ultimately, the obtained synthetic data and the original data were merged to form the BTFT dataset.

Dialect Transfer: In the IWSLT 2022 dialect ST track, Yang et al. (2022) presented an effective Ta2En-bt-tune model that generates synthetic Tunisian-English data by converting MSA to pseudo-Tunisian with an MSA2Ta MT model. As shown in Figure 1, we modified this approach by introducing a multi-step pre-training technique that improves the quality of the pseudo-Tunisian text and enhances the downstream translation tasks. Our dialect transfer method is outlined as follows (a schematic sketch follows the list):
(1) Firstly, the En2MSA (English to MSA) model was pre-trained using condition B/C MT data and then fine-tuned using the MT data from condition A to create the En2Ta model.
(2) The En2MSA and En2Ta models were utilized separately with the English texts from condition A and condition B/C as inputs to generate paired Ta-MSA-En triple text data for conditions A/B/C. The pseudo-text in condition A is the MSA* text, whereas the pseudo-text in condition B/C is the Tunisian* text (* representing pseudo-text). Notably, during this step, the pseudo-Tunisian* text derived from condition B/C is marked as the first iteration.
(3) Next, we trained an MSA2Ta (MSA to Tunisian) model, which serves as a pivot MT model. We pre-trained the model with the MSA-Ta* data of condition B/C and fine-tuned it using the MSA*-Ta data of condition A from step 2.
(4) Lastly, we input the MSA text of condition B/C to the MSA2Ta model for inference, generating the second iteration of the pseudo-Tunisian text (marked as pseudo-Tunisian**). We re-created the paired Ta-MSA-En triple text data by merging the pseudo-Tunisian** text with the primary MSA-English text from condition B/C.
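The following is a minimal orchestration sketch of the multi-step dialect transfer above. The helper functions pretrain, finetune, and translate are hypothetical stand-ins for ordinary MT training and inference calls; nothing here is the paper's actual code.

```python
# Hypothetical pipeline sketch of the multi-step dialect transfer (Section 3.2).
# pretrain/finetune/translate are illustrative stubs, not real library calls.

def pretrain(src, tgt, data):            # returns a model identifier (stub)
    return f"{src}2{tgt}-pretrained"

def finetune(model, src, tgt, data):     # returns a fine-tuned model id (stub)
    return f"{model}+ft-{src}2{tgt}"

def translate(model, texts):             # returns pseudo-translations (stub)
    return [f"<{model}>{t}" for t in texts]

# Condition A: small true Ta-En bitext; condition B/C: large MSA-En bitext.
ta_en_A = [("ta sent", "en sent")]
msa_en_BC = [("msa sent", "en sent")]

# (1) En2MSA pre-training on B/C, then fine-tuning on A gives En2Ta.
en2msa = pretrain("en", "msa", msa_en_BC)
en2ta = finetune(en2msa, "en", "ta", ta_en_A)

# (2) Pseudo MSA* for condition A and pseudo Tunisian* for condition B/C.
msa_star_A = translate(en2msa, [en for _, en in ta_en_A])
ta_star_BC = translate(en2ta, [en for _, en in msa_en_BC])

# (3) Pivot model MSA2Ta: pre-train on MSA-Ta* (B/C), fine-tune on MSA*-Ta (A).
msa2ta = pretrain("msa", "ta", list(zip([m for m, _ in msa_en_BC], ta_star_BC)))
msa2ta = finetune(msa2ta, "msa", "ta",
                  list(zip(msa_star_A, [t for t, _ in ta_en_A])))

# (4) Second-iteration pseudo Tunisian** from the B/C MSA text, merged into triples.
ta_star_star_BC = translate(msa2ta, [m for m, _ in msa_en_BC])
ta_msa_en_BC = list(zip(ta_star_star_BC, *zip(*msa_en_BC)))
print(ta_msa_en_BC[0])
```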
[Figure 2 diagram: Tunisian Speech → Speech Encoder → CTC Layer → Adaptor → Ta2En MT → English Translation]
Figure 2: The top figure shows the SATE model (Xu et al., 2021), which implements a forward dialect transfer system from MSA to Tunisian through pre-training and fine-tuning techniques. The bottom part shows the Hybrid SATE model with a hierarchical text encoder, which can be used to reversely transfer from Tunisian to MSA.
3.3 End-to-end ST Model

The end-to-end ST approaches can mitigate the issue of error propagation that often appears in low-resource scenarios. We developed an E2E ST system utilizing the SATE model (Xu et al., 2021) due to its effectiveness and simplicity of implementation, which is shown in Figure 2. In particular, we suggest two dialect transfer approaches for condition B/C, specifically the forward dialect transfer system from MSA to Tunisian and the reverse dialect transfer method from Tunisian to MSA.

3.3.1 Forward dialect transfer system

The forward dialect transfer system aims to transfer information from MSA to Tunisian by pre-training the ASR and MT models on the MSA dataset, respectively. These models are then fine-tuned using the Tunisian dataset to transfer from MSA to Tunisian. Note that the forward dialect transfer system is treated as a transfer of model parameters. In order to create an E2E ST system, we utilize the SATE model with pre-trained Tunisian ASR and MT models, followed by fine-tuning the SATE model with the Tunisian ST dataset.

During training, the SATE model utilizes multi-task optimization, including the CTC loss of the source language L^Ta_CTC, the cross-entropy loss for the target language L^En_CE, and the knowledge distillation (KD) losses for both the source and target languages, i.e., L^Ta_KD and L^En_KD. The overall loss function reads

L = λ1 L^Ta_CTC + λ2 L^En_CE + λ3 L^Ta_KD + λ4 L^En_KD,    (1)

with four respective hyper weight parameters. The SATE model utilizes an adaptor to map speech features into the text feature space but suffers from inconsistent in-between sequence lengths. For this, we proposed a robust training method. Specifically, the Tunisian ASR model was first decoded by retaining both the repeated tokens and blank symbols of the CTC output. The resulting output was then combined with its corresponding English text to fine-tune the Ta2En MT model. The modified Ta2En MT model was well-suited to initialize the MT module of the SATE model.

3.3.2 Reverse dialect transfer system

It is a common issue that the Tunisian Arabic dialect is considered as being non-standardized at the linguistic level (Ben Abdallah et al., 2020). To address this, we proposed a reverse dialect transfer system that converts the Tunisian dialect to MSA, serving as a regularization of the dialect, which is illustrated in Figure 2. We modified the SATE model with a hierarchical text encoder (resulting in Hybrid SATE) to enable the reverse dialect transfer system. The proposed Hybrid SATE model primarily comprises a speech encoder, a Ta2MSA text encoder and an MSA2En MT module.

In order to initialize the model parameters for the Ta2MSA text encoder module in the Hybrid SATE model, we trained a Ta2MSA MT model. Based on the generated Ta-MSA* data in condition A and the Ta**-MSA paired data in condition B/C from Section 3.2, we first pre-trained a Ta2MSA MT model with the Ta**-MSA data from condition B/C. Notably, the Ta2MSA MT model is equipped with a CTC layer on top of its encoder and is trained with an additional CTC loss for MSA. Then, we fine-tuned the model using the Ta-MSA* data from condition A. Finally, the encoder with the attached CTC layer of the Ta2MSA MT model was used to initialize the Ta2MSA text encoder.

The Hybrid SATE model is optimized with an additional CTC loss for MSA, denoted as L^MSA_CTC, resulting in the overall loss function

L = λ1 L^Ta_CTC + λ2 L^En_CE + λ3 L^Ta_KD + λ4 L^En_KD + λ5 L^MSA_CTC.    (2)
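To make the length-preserving CTC decoding step in §3.3.1 concrete, here is a minimal sketch of greedy CTC decoding that keeps repeated tokens and blank symbols (so the output length matches the frame-level encoder output), contrasted with the usual collapsing rule. The tensor shapes and blank index are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: greedy CTC paths with and without the usual collapsing,
# as used to length-match the adaptor output in the robust training method.
import torch

def ctc_greedy_path(log_probs, blank_id=0, keep_blanks_and_repeats=True):
    """log_probs: (time, vocab) frame-level CTC log-probabilities."""
    path = log_probs.argmax(dim=-1).tolist()            # one token id per frame
    if keep_blanks_and_repeats:
        return path                                      # same length as input frames
    collapsed, prev = [], None
    for tok in path:
        if tok != prev and tok != blank_id:
            collapsed.append(tok)
        prev = tok
    return collapsed

log_probs = torch.randn(6, 5).log_softmax(dim=-1)        # toy 6-frame, 5-symbol output
print(ctc_greedy_path(log_probs))                        # length 6, blanks/repeats kept
print(ctc_greedy_path(log_probs, keep_blanks_and_repeats=False))
```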
3.4 Model Ensemble Method

As training a single model can lead to implicit model bias, it is expected that a model ensemble decoding method can improve system robustness, especially in low-resource ST scenarios. We implemented synchronous decoding with multiple models and averaged the posterior probabilities predicted by each model at each time step. Consistent with single-model decoding, the beam search decoding strategy was used with a beam size of 10. Subsequently, multiple models decoded the next tokens based on the same historical tokens. It should be noted that either E2E ST or MT models can be used for the model ensemble. Consequently, we can form ensembles of E2E ST and cascaded ST systems by using transcriptions from the ASR models as inputs for the MT models.

Model                B dev   B test   C dev   C test
VGG-Conformer        14.3    13.2     12.5    12
VGG-Transformer      16.6    15.5     14.2    13.3
GateCNN-Conformer    15.1    14.2     14.3    13.4

Table 4: The WER of the MSA MGB2 corpus.

Model                A dev   A test1   B dev   B test1   C dev   C test1
VGG-Conformer        48.5    55.4      45.4    53.2      42      49.7
VGG-Transformer      49.2    57        49      56.8      44.7    52.1
GateCNN-Conformer    46.6    53.4      47.2    53.7      46.1    53.3
Ensemble             44.5    51.7      43.4    50.9      40.8    48.7

Table 5: The original WER on Tunisian. Due to the non-standard orthography and grammar of Tunisian, the original WER is considerably higher than the normalized WER in Table 11.
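As a minimal illustration of the posterior-averaging step in the ensemble decoding above (greedy rather than beam search, with toy stand-in models; none of this is the authors' code):

```python
# Sketch of step-wise ensemble decoding: average each model's posterior at every
# step and pick the next token from the averaged distribution (greedy variant of
# the synchronous beam-search decoding described in Section 3.4).
import torch

def ensemble_next_distribution(models, history):
    probs = [m(history).softmax(dim=-1) for m in models]  # per-model posteriors
    return torch.stack(probs).mean(dim=0)                 # averaged posterior

def greedy_ensemble_decode(models, bos_id=0, eos_id=1, max_len=20):
    history = [bos_id]
    for _ in range(max_len):
        next_id = int(ensemble_next_distribution(models, history).argmax())
        history.append(next_id)
        if next_id == eos_id:
            break
    return history

# Toy stand-in "models": fixed random logits over an 8-token vocabulary.
torch.manual_seed(0)
models = [(lambda h, W=torch.randn(8): W) for _ in range(3)]
print(greedy_ensemble_decode(models))
```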
Data & Method      MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
Baseline           26.3     23.0       19.4              16.7
BTFT data          28.2     24.0       20.3              17.1
+ Constrained FT   28.5     24.3       20.6              17.3

Table 6: The BLEU score of MT and cascaded MT experiments in condition A.

Model      Pretrain Model   MT BLEU dev   MT BLEU test1
En2Ta      -                12.4          10.0
En2Ta      En2MSA           16.6          12.5
MSA2Ta*    -                 8.3           6.8
MSA*2Ta    MSA2Ta*          12.1           9.6

Table 7: The BLEU score of different pivot MT models using Ta-MSA*-En triple text data of condition A.

Model              MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
MSA2En-large       -        -          -                 -
+ BTFT data FT     29.3     26.0       22.2              19.0
+ Constrained FT   30.1     26.2       22.5              19.2
Ta*2En-large       16.3     15.6       13.3              11.4
+ BTFT data FT     29.9     26.5       22.5              19.3
+ Constrained FT   30.4     26.6       22.8              19.5
Ta**2En-large      16.7     15.5       13.3              12.0
+ BTFT data FT     30.4     26.6       23.1              19.2
+ Constrained FT   30.8     27.0       23.2              19.5

Table 8: The BLEU score of the MT and the cascaded ST systems in condition C.

Model                    MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
Condition A Best         28.5     24.3       20.6              17.3
+ Error Adaptation FT    28.3     23.9       20.5              17.1
Condition C Best         30.8     27.0       23.2              19.5
+ Error Adaptation FT    30.7     26.6       23.3              19.7

Table 9: The BLEU score of the MT and the cascaded ST systems in condition A/C when using the error adaptation fine-tune method.

that combining the training data with BTFT data brings a considerable performance gain for both MT and cascaded ST. The MT model trained on the BTFT data is further fine-tuned on the original true paired Ta-En data. In order to prevent excessive over-fitting while fine-tuning, we proposed a constrained fine-tune method, as depicted in Figure 3. Specifically, the student model is constrained by the teacher model using KL divergence loss to

[Figure 3 diagram: teacher and student outputs tied to the ground truth via loss_CE terms and to each other via a loss_KD term]
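The constrained fine-tuning idea sketched in Figure 3 (cross-entropy against the gold target plus a KL term tying the student to a frozen teacher) can be written as follows; the exact weighting and interfaces are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a constrained fine-tuning objective: cross-entropy against the gold
# target plus a KL-divergence term that keeps the student close to a frozen teacher.
import torch
import torch.nn.functional as F

def constrained_ft_loss(student_logits, teacher_logits, gold_ids, kd_weight=0.5):
    # student_logits, teacher_logits: (batch, vocab); gold_ids: (batch,)
    loss_ce = F.cross_entropy(student_logits, gold_ids)
    loss_kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),   # teacher is frozen
        reduction="batchmean",
    )
    return loss_ce + kd_weight * loss_kd

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
gold_ids = torch.randint(0, 10, (4,))
print(constrained_ft_loss(student_logits, teacher_logits, gold_ids))
```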
Table 10: The BLEU scores of our E2E ST in condition A/B/C, where the speech encoder and MT module represent
the sub-modules, and MT and MT-Macaron represent MT large and MT macaron models, respectively.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech 2019, pages 2613–2617.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2619–2630, Online. Association for Computational Linguistics.

Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, and Shinji Watanabe. 2022. CMU's IWSLT 2022 dialect speech translation system. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 298–307, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Jinyi Yang, Amir Hussein, Matthew Wiesner, and Sanjeev Khudanpur. 2022. JHU IWSLT 2022 dialect speech translation system description. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 319–326, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng, Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and Lirong Dai. 2022. The USTC-NELSLIP offline speech translation systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 198–207, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

A Appendix. Model configurations

The detailed model configurations for the ASR systems are as follows:

• Condition A: The model configurations are almost identical to the ESPnet (Inaguma et al., 2020) baseline. There are a 12-layer encoder and a 6-layer decoder. The attention module of both the encoder and decoder comprises 256 hidden units and 4 attention heads. The size of the FFN module is 1024 for the encoder but 2048 for the decoder. We use two VGG blocks as the feature extractor for both the VGG-Conformer and the VGG-Transformer models. For the GateCNN-Conformer model, the feature extractor has a 6-layer GateCNN.

• Condition B/C: The model difference between condition A and condition B/C lies in the model size. For condition B/C, the attention module has 512 hidden units and 8 attention heads, and the size of the FFN is 4096 (see the sketch after this list).
Condition   Training Stage                       lr     Max-tokens   Warmup   Dropout rate   Training steps
A           Stage1: BTFT Pretrain                5e-4   12000        4000     0.3            120000
            Stage2: Constrained Fine-tune        -      4096         -        0.3            40000
B/C         Stage1: MSA-En Pretrain              1e-3   40000×8      4000     0.1            200000
            Stage2: Ta**-En Pretrain             5e-4   40000×8      None     0.1            20000
            Stage3: BTFT Fine-tune               4e-5   6144         4000     0.3            120000
            Stage4: Constrained Fine-tune        -      2048         -        0.3            80000
            Stage5: Error Adaptation Fine-tune   1e-5   4096         None     0.3            10000

Table 12: Hyperparameters in different stages ("-" means reuse from the former stage and "×" the GPU numbers).
KIT’s Multilingual Speech Translation System for IWSLT 2023
Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham,
Tuan-Nam Nguyen, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues
Karlsruhe Institute of Technology
Lang.    Original # sent. (M)   After Diversification # sent. (M)   # tokens (M)
ar       26.0                   65.2                                865.0
zh       11.2                   21.5                                254.3
nl       33.1                   82.1                                1162.7
fr       38.9                   91.6                                1427.8
de       23.0                   54.4                                860.0
ja*      2.6                    27.2                                832.7
fa       5.8                    11.3                                162.1
pt       29.0                   72.3                                1024.3
ru       22.1                   51.5                                685.3
tr       36.7                   89.7                                1021.2
Total    228.4                  566.8                               8295.4

Table 3: MT data overview. *: For ja, the original data of 2.6M sentences did not include JParaCrawl, which was announced later as allowed data.

As preprocessing, we perform truecasing, deduplication, length ratio filtering, and histogram filtering using the statistics by Fan et al. (2021). Then we perform subword segmentation using Sentencepiece (Kudo and Richardson, 2018) based on the vocabulary of mBART50 (Tang et al., 2020).

Data Diversification Different from last years' shared tasks (Anastasopoulos et al., 2021, 2022), no monolingual (non-English) data is provided. This means conventional data augmentation techniques like backward translation are not directly applicable. On the other hand, forward translation from existing English monolingual data may introduce undesirable errors in the translation targets, especially on lower-resource languages. In this light, we use data diversification (Nguyen et al., 2020), a data augmentation method that enriches existing parallel data by forward and backward translating the training bitext. As the model has seen the parallel data in training, the synthetic translations are expected to have relatively high quality. Moreover, either the source or target side of the synthetic data is from the original bitext. The diversified data amount after deduplication is shown in Table 3. Here we perform one round of forward and backward translation, as Nguyen et al. (2020) have empirically shown further rounds do not lead to substantial gains.

2.4 Casing/Punctuation Restoration Data

The ASR outputs are lower-cased and unpunctuated, while the MT model expects cased and punctuated inputs. We randomly sample 1.5 million English sentences from the MT training data (Table 3), and remove the casing and punctuation marks as training source data. We then train a model to restore the casing and punctuation marks.

2.5 Speech Translation Data

The speech translation data are shown in Table 4. We additionally use our trained MT model to create forward translations based on the following transcript-only datasets: Common Voice, TEDLIUM, and VoxPopuli. The TTS data described in §2.2 is also used.

Lang.   Corpus / Data Source   Hours   # Utterances
ar      CoVoST                 429     289k
        MuST-C                 463     212k
        TTS                    283     203k
zh      CoVoST                 429     289k
        MuST-C                 596     358k
        TTS                    204     183k
nl      MuST-C                 434     248k
        europarl-ST            75      32k
        TTS                    1138    713k
fr      MuST-C                 485     275k
        europarl-ST            76      32k
        TTS                    1768    998k
de      CoVoST                 429     289k
        MuST-C                 440     269k
        europarl-ST            77      33k
        TTS                    1891    779k
ja      CoVoST                 429     289k
        MuST-C                 541     329k
        TTS                    73      56k
fa      CoVoST                 429     289k
        MuST-C                 347     182k
        TTS                    89      88k
pt      MuST-C                 377     206k
        europarl-ST            75      32k
        TTS                    1678    639k
ru      MuST-C                 482     265k
        TTS                    331     331k
tr      CoVoST                 429     289k
        MuST-C                 446     236k
        TTS                    428     511k
all     Common Voice           1488    948k
        TEDLIUM                453     268k
        VoxPopuli              502     177k

Table 4: ST data overview. The last section "all" indicates forward translated synthetic targets from transcript-only corpora, which are available for all 10 languages.

3 Cascaded System

For the cascaded system, we introduce our ASR (§3.1) and MT (§3.2) models.

3.1 Automatic Speech Recognition Module

Baseline Models The first baseline is our ASR model for last year's offline track (Pham et al., 2022). It is a Wav2vec 2.0 (Baevski et al., 2020) with LARGE configuration pretrained on 960 hours
of Librispeech data. This year, after seeing initial favourable results compared to Wav2vec, we opt for WavLM (Chen et al., 2022) as the audio encoder. We use the LARGE configuration with 24 layers. We use the mBART50 (Tang et al., 2020) decoder along with the WavLM encoder. As the ASR model only needs to transcribe English,² we trim the mBART50 vocabulary from 256k down to 62k tokens by removing all non-alphabetic tokens.

² BART, the English-only predecessor of mBART, is not among the allowed pretrained models.

In-Domain TTS Data We also use the synthesized TTS data. Compared to the same model without TTS data, the word error rate (WER) improves from 11.6% to 10.7% on ACL dev, but degrades from 8.4% to 9.0% on the TEDLIUM test set. There are two potential explanations: First, the noisy TTS speech may be helpful for handling the non-native utterances prominent in the ACL dev set. Second, the target side of the TTS data is more relevant to the ACL domain, as we selected them based on n-gram overlap with ACL data. This in turn improves ASR performance on the ACL dev set.

As shown in Table 5, compared to last year's submission, this year's ASR model achieves consistent gains across domains on ACL dev, tst-COMMON, and tst2020.

Model                          ACL dev   tstCom.   tst2020
ASR 2022 (Pham et al., 2022)   12.5      5.4       5.6
WavLM + mBART50                10.7      3.9       4.8

Table 5: ASR results in WER(↓) in comparison to our submission last year (Pham et al., 2022), which used Wav2vec trained with CTC and a 5-gram LM. By using the WavLM audio encoder and the mBART decoder, we achieve consistent gains across domains (ACL and TED, i.e., tst*).

Language Model (LM) Adaptation Aside from using TTS data, we also investigate other methods to adapt towards the ACL domain using the provided paper abstracts. On preliminary experiments with Connectionist Temporal Classification (CTC) + n-gram LM models, we integrate ACL abstract 5-gram statistics into the language models. As shown in the upper section of Table 6, this improves on ACL dev (WER 13.8% → 13.0%) while preserving the performance on TED talks (tst-COMMON WER stays at 7.6%).

As our final system is an encoder-decoder model (WavLM + mBART50), adapting the LM alone is less straightforward. We create pseudo ASR training data with ACL data on the transcript side. Specifically, we use our TTS model to synthesize speech from the ACL dev and test abstracts. As the amount of ACL abstract data is very limited (less than 100 sentences in total), we heavily upsampled them, so that they make up 60% of the training data. As shown in the lower section of Table 6, this leads to a minor improvement of WER for ACL dev. However, the gain does not carry over to ST performance when later cascading with our MT model. Therefore, our final ASR system did not use the abstracts. The lack of improvement could be related to the low amount of ACL abstract data, which requires heavy upsampling of the TTS data and as a result hinders the ability to transcribe real speech.

The contrast between the two sets of experiments may be related to diminishing gains as WER improves, i.e., for the Wav2vec + CTC + LM model, gaining over a WER of 13.8% is easier than starting from a 10.7% WER. Another interpretation of the difference could be that adding specific constraints to "end-to-end" ASR models is more challenging than for the counterparts with separate LMs.

Model                             ACL dev   tst-COMMON
Wav2vec + CTC + 5-gram            13.8      7.6
+ ACL abstract 5-gram             13.0      7.6
WavLM + mBART50                   10.7      3.9
+ ACL abstract TTS (upsampled)    10.5      4.3

Table 6: ASR adaptation results in WER(↓). On preliminary experiments with Wav2vec + CTC + LM models, we improve ASR performance on ACL dev by integrating n-gram statistics from the ACL abstracts. For the WavLM + mBART50 model, adding synthesized audio-transcript data based on the ACL dev abstracts does not give a consistent gain.

Casing/Punctuation Restoration We take a sequence-to-sequence approach to the casing and punctuation restoration problem. Specifically, we train a punctuation model initializing from DeltaLM-base (Ma et al., 2021) to restore the casing and punctuation information, using the training data described in §2.4.

3.2 Machine Translation Module

Baseline Model We start with the pretrained DeltaLM (Ma et al., 2021) with LARGE configuration.
Table 7: MT, cascaded ST, and end-to-end ST results in BLEU on ACL dev (en→X) and TED (en→de).

ID                                 de   ja   zh   ar   nl   fr   fa   pt   ru   tr   Avg.   tst2019  tst2020
From ground-truth transcripts (MT alone)
(1) base                           39.8 44.2 47.4 30.4 45.7 48.9 23.6 51.1 19.5 22.9 37.4   29.5     32.9
(2) data divers. all               41.6 44.5 49.8 33.6 50.7 51.1 25.4 52.5 21.5 24.6 39.5   30.0     33.7
(3) (1) + data divers.; adapter    41.4 45.8 48.8 33.3 49.8 51.5 25.2 54.1 21.9 24.1 39.6   29.5     33.2
(4) ensemble (2) + (3)             41.7 46.1 49.6 33.7 50.8 52.1 25.9 54.3 23.1 24.8 40.2   30.4     33.7
(5) (4) + kNN-MT                   43.7 47.3 49.8 35.4 52.3 52.8 27.2 55.3 23.9 27.1 41.5   30.4     33.4
From ASR outputs (cascaded ST)
(1) base                           34.3 38.2 41.6 25.3 36.6 39.9 19.1 40.7 16.7 18.9 31.1   26.5     28.0
(2) data divers. all               35.4 38.6 44.3 26.8 39.2 41.5 20.5 42.6 18.7 19.5 32.7   27.0     29.3
(3) (1) + data divers.; adapter    35.5 39.0 43.6 26.4 38.9 41.9 20.2 43.0 19.3 19.6 32.7   26.7     28.3
(4) ensemble (2) + (3)             36.1 39.8 44.4 26.9 39.8 42.3 20.7 43.5 19.2 19.7 33.2   26.9     28.7
(5) (4) + kNN-MT                   36.8 40.2 44.6 28.2 40.8 42.0 21.8 44.5 19.7 21.1 34.0   26.9     28.5
End-to-end ST
(6) WavLM + mBART50 decoder        31.7 29.2 40.7 25.0 36.7 40.5 19.5 43.0 16.9 18.5 30.2   27.0     29.3
(7) (6) + TTS                      33.2 29.2 40.5 25.5 37.9 41.0 20.1 43.9 16.5 18.9 30.7   27.0     29.1
(8) ensemble (6) + (7)             34.0 29.9 41.7 25.5 38.2 42.0 20.2 44.4 18.3 20.2 31.4   27.3     29.6
The pretrained model has 24 and 12 encoder and decoder Transformer layers respectively. It uses postnorm layer normalization. It is a fully multilingual model where all parameters are shared across languages. The target language tokens are prepended to the source sentences. We use temperature-based sampling (Arivazhagan et al., 2019) with τ = 5.0 to counteract the data imbalance between languages. When training, we use a relatively large effective batch size of 128k, as preliminary experiments with smaller batch sizes showed more instabilities in training. This might be a side effect of the postnorm layer normalization (Nguyen and Salazar, 2019). The results of the baseline are shown in Row (1) of Table 7, with an average score of 37.4 BLEU³ on ACL dev.

³ By default using tok.13a from sacreBLEU (Post, 2018), except for zh and ja where we use tok.zh and tok.ja-mecab-0.996-IPA.
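As a small worked example of temperature-based sampling with τ = 5.0 (following the formulation of Arivazhagan et al. (2019); the three language counts below are taken from Table 3, but the selection of languages is only for illustration):

```python
# Sketch: temperature-based sampling probabilities p_l ∝ (n_l / N) ** (1 / tau).
# With tau = 5.0, lower-resource languages are up-sampled relative to their raw share.
def sampling_probs(sizes, tau=5.0):
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Per-language sentence counts in millions (fr, ja, fa rows of Table 3).
sizes = {"fr": 38.9, "ja": 2.6, "fa": 5.8}
print(sampling_probs(sizes))   # ja and fa receive a larger share than their raw ratio
```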
Data Diversification As motivated in §2.3, we use data diversification as an alternative data augmentation method in the absence of monolingual target data for backtranslation. As data diversification needs forward and backward translations of the training data, we additionally train a 10-to-English model to create the backward translations. Row (2) of Table 7 shows the results after data diversification on all language pairs. On average, this data augmentation approach improves MT quality by 2.1 BLEU (37.4 → 39.5), and ST quality by 1.6 BLEU (31.1 → 32.7).

Adapters for Incremental Data Retraining on the new training data after diversification (Row (2) of Table 7) is time-consuming and costly. To adapt the initial model (Row (1) of Table 7) rapidly towards the augmented data, we use adapters (Bapna and Firat, 2019; Philip et al., 2020). In this case, the adapters are target-language-specific. The adapters are inserted after each encoder and decoder layer. We initialize from the trained baseline (Row (1) in Table 7), freeze the trained parameters and update the adapters only. We use the efficient implementation from Baziotis et al. (2022). As shown in Row (3) of Table 7, only training the adapters on the new diversified training data performs on par with the re-training setup in Row (2) (39.6 on MT and 32.7 on ST on average for ACL dev). These results demonstrate that adapters are suitable for fast and effective incremental learning when additional training data emerges later.

To our surprise, adding adapters to the model trained with full data diversification (Row (2) from Table 7) does not bring further gain. A similar observation was reported by Pires et al. (2023), who opted for training the full network from scratch along with adapters instead. In our case, it therefore would be interesting to see the impact of training on data diversification with adapters from scratch.
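A minimal sketch of the target-language-specific adapter layout described above (a bottleneck residual adapter after each Transformer layer, with the base model frozen). The dimensions and module layout are illustrative assumptions, not the implementation of Baziotis et al. (2022).

```python
# Sketch: bottleneck adapters inserted after each (frozen) base layer, one adapter
# per target language; only the adapter parameters are updated during training.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=1024, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))    # residual bottleneck

class LayerWithAdapters(nn.Module):
    def __init__(self, layer, languages, dim=1024):
        super().__init__()
        self.layer = layer                               # frozen base layer
        for p in self.layer.parameters():
            p.requires_grad = False
        self.adapters = nn.ModuleDict({l: Adapter(dim) for l in languages})

    def forward(self, x, lang):
        return self.adapters[lang](self.layer(x))        # adapter after the layer

layer = LayerWithAdapters(nn.Linear(1024, 1024), languages=["de", "ja", "zh"])
print(layer(torch.randn(2, 1024), lang="de").shape)
```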
Multilingual vs Bilingual To investigate the impact of interference from multiple target languages, in preliminary experiments we also compare the multilingual and bilingual translation performance for selected language pairs. As shown in Table 8,
compared to bilingual models, the multilingual model lags behind especially on higher-resource languages. Adding the adapters partly closes this gap. Note that the score difference to the main result table (Table 7) is because the preliminary experiments did not fully use diversified data for all languages.

               ACL dev                  tst-COMMON
Model          en-de   en-ru   en-fa    en-de   en-ru   en-fa
bilingual      41.0    20.0    24.2     34.3    22.7    16.0
multilingual   39.8    19.5    23.6     34.1    21.9    15.9
+ adapters     40.9    20.2    23.7     34.7    22.2    16.3

Table 8: Comparison of bilingual vs multilingual translation performance in BLEU (↑) on German (de), Russian (ru), and Farsi (fa), which are high-, mid-, and low-resource in the training data (Table 3). The multilingual system falls behind the bilingual system, while adapters partly close the gap. Note the score difference to the main result table (Table 7) is because the experiments here did not fully use diversification.

Ensemble Although the models in Row (2) and (3) in Table 7 are trained on the same data and share the same base architecture, we expect their representations to be sufficiently different, as (3) additionally uses adapters. We therefore ensemble these two models. The results are in Row (4) of Table 7. On MT and ST, for ACL, ensembling shows an improvement of 0.6 and 0.5 BLEU respectively over the single models in Row (2) and (3). On TED, however, ensembling does not seem to impact the scores compared to the single models. One explanation is that the adapter model from Row (3) performs worse than its non-adapter counterpart (Row (2)) on TED, which limits the overall effectiveness of ensembling.

kNN-MT We also adapt the MT model to the target domain of scientific talks. A challenge is that we do not have sufficient training data to fully fine-tune the MT model towards the desired domain or style. In this case, we use kNN-MT (Khandelwal et al., 2021) to adapt the model at inference time. In kNN-MT, bitexts are passed through a trained MT model. For each target token, its decoder hidden state is stored in a datastore. At inference time, based on the current decoder hidden state, k candidate target tokens are retrieved from the datastore using a nearest neighbor lookup. The retrieved token distribution is then interpolated with the MT target distribution, which in turn generates the output tokens. Hyperparameters for kNN-MT include the number of retrieved neighbors k, the temperature for smoothing the kNN distribution T, and the interpolation weight w.
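The retrieval-and-interpolation step can be sketched as follows. This uses a toy in-memory datastore with exact nearest-neighbor search rather than FAISS, and the hyperparameter names simply mirror k, T, and w above; it is an illustration, not the system's implementation.

```python
# Sketch of kNN-MT interpolation: retrieve the k nearest datastore entries by
# decoder hidden state, turn their distances into a distribution with temperature T,
# and interpolate with the base MT distribution using weight w.
import torch

def knn_mt_distribution(hidden, keys, values, mt_probs, k=8, T=50.0, w=0.3):
    # hidden: (dim,); keys: (n, dim); values: (n,) target token ids; mt_probs: (vocab,)
    dists = torch.cdist(hidden.unsqueeze(0), keys).squeeze(0)       # (n,)
    top = torch.topk(-dists, k=min(k, keys.size(0)))                # nearest first
    weights = torch.softmax(-dists[top.indices] / T, dim=0)         # kNN distribution
    knn_probs = torch.zeros_like(mt_probs)
    knn_probs.index_add_(0, values[top.indices], weights)           # scatter onto vocab
    return (1.0 - w) * mt_probs + w * knn_probs

vocab, dim = 16, 8
keys = torch.randn(100, dim)                     # stored decoder hidden states
values = torch.randint(0, vocab, (100,))         # their target tokens
mt_probs = torch.softmax(torch.randn(vocab), 0)  # base MT distribution at this step
print(knn_mt_distribution(torch.randn(dim), keys, values, mt_probs).sum())  # ≈ 1.0
```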
In our experiments, we use systems (2) and (3) from Table 7 for creating the datastores. As different models' hidden states (which serve as keys in the datastore) also differ substantially, the datastore is MT-model-dependent. To use kNN-MT when ensembling systems (2) and (3), we therefore need two datastores for systems (2) and (3) respectively. The kNN-MT candidate tokens are interpolated with the output vocabulary distribution before the ensembling operation.

We use the hyperparameters k = 8, T = 50, w = 0.3, after an initial search with T ∈ [10, 50, 100], w ∈ [0.1, 0.3, 0.5]. Our implementation mostly follows Zheng et al. (2021), which uses the FAISS toolkit (Johnson et al., 2019) for efficient kNN operations. Comparing the inference speed of systems (4) and (5), with the same batch size of 64 sentences,⁴ using kNN-MT takes roughly 50% more time on an Nvidia Titan RTX GPU with 24GB memory.

⁴ System (5) requires more GPU memory than system (4). The latter would be able to use a larger batch size of 128 sentences.

Source (ASR output): ... in a zero shot evaluation setup, meaning that pre trained word embedding models are applied out of the box without any additional fine tuning
w/o kNN-MT (Table 7 row (4)): ... in einer Null-Shot-Bewertungs-Setup (zero-shot evaluation setup), was bedeutet, dass vorgebildete (pre-educated) Wort-Einbettungsmodelle ohne zusätzliche Feinabstimmung direkt angewendet werden.
w/ kNN-MT (Table 7 row (5)): ... in einer Null-Shot-Bewertung (zero-shot evaluation), was bedeutet, dass vortrainierte (pretrained) Wort-Einbettungsmodelle ohne zusätzliche Feinabstimmung direkt angewendet werden.

Source (ASR output): Hello. My name is Ramachandra, and I will present our paper.
w/o kNN-MT (Table 7 row (4)): 你好 (Hello; addressing a single person),我叫拉玛钱德拉 我要发表 (publish)我们的论文
w/ kNN-MT (Table 7 row (5)): 大家好 (Hi all; addressing a group of audience),我叫拉玛钱德拉, 我要介绍 (introduce)我们的论文。

Table 9: Examples of kNN-MT improving translation quality for en→de (upper) and en→zh (lower). kNN-MT creates more accurate terminology translations ("pre trained" for en→de) and creates more context-appropriate translations ("Hello" for en→zh).

Naively using all ACL dev bitext as the datastore would lead the model to copying the oracle targets. To simulate the scenario on the blind test set, when
translating the i-th talk, we use the other talks' bitexts (j ∈ [n], j ≠ i) as the datastore, where n is the total number of talks.

As shown in Row (5) of Table 7, kNN-MT brings an additional gain of 1.3 BLEU on MT and 0.8 BLEU on ST. These results show that a datastore as small as hundreds of sentence pairs can be effectively used for inference-time domain adaptation.

Table 9 shows two examples of kNN-MT improving translation quality; apart from generic improvements in fluency and accuracy, in these examples kNN-MT also helps generate correct terminology and context-appropriate greetings.

4 End-to-End System

For the end-to-end system, similar to our ASR model, after seeing initial favourable results of WavLM over Wav2vec, we choose WavLM as the audio encoder. Following last year's submission (Pham et al., 2022), we use the mBART50 decoder. The results are shown in Row (6) of Table 7. Contrasting Row (6) and (7) reveals that adding the TTS data does not substantially change ST performance. However, ensembling the two models trained with and without TTS data (Row (8)) improves over the single models (on average +0.7 for ACL, +0.4 for TED), despite them having the identical architecture.

Compared to the strongest cascaded system (Row (5)), the end-to-end system falls behind by 2.6 BLEU on ACL dev. On TED, however, it appears to slightly outperform the cascaded system. One explanation is that the MT model of the cascaded system has not been separately adapted to TED texts (although parts of the full training data do cover TED data), which was shown essential in improving performance on TED test sets (Zhang et al., 2022; Pham et al., 2022). The end-to-end system, on the other hand, has seen a larger proportion of TED data in training (Table 4).

Similar to the previous year (Polák et al., 2022), we also adapt our end-to-end offline model for the simultaneous track (Polák et al., 2023).

5 Conclusion

In this paper, we described our systems for the multilingual speech translation track of IWSLT 2023, which translates English speech into 10 target languages. To tackle the task of translating scientific conference talks, which feature non-native input speech and terminology-dense contents, our systems have several novelties. Lacking suitable training data for the target domain, we used kNN-MT for inference-time adaptation and showed an improvement of +0.8 BLEU for the cascaded speech translation system. We also used adapters to integrate incremental data from augmentation, and achieved performance on par with re-training on all data. In our experiments, we observed that cascaded systems are more easily adaptable towards desired target domains due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks. For future work, we are interested in the feasibility of applying the adaptation approaches shown effective on MT to end-to-end ST.

Acknowledgement We thank the anonymous reviewers for detailed and insightful feedback. Part of this work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research of Germany. Part of this work was supported by the Federal Ministry of Education and Research of Germany under grant agreement 01EF1803B (RELATER).

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estéve, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4218–4222. European Language Resources Association.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George F. Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.

Christos Baziotis, Mikel Artetxe, James Cross, and Shruti Bhosale. 2022. Multilingual machine translation with hyper-adapters. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1170–1185, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. 2021. Beyond english-centric multilingual machine translation. The Journal of Machine Learning Research, 22:107:1–107:48.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia A. Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer - 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18-22, 2018, Proceedings, volume 11096 of Lecture Notes in Computer Science, pages 198–208. Springer.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchís, Jorge Civera, and Alfons Juan. 2020.
Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pages 8229–8233. IEEE.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5530–5540. PMLR.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.

Sai Koneru, Danni Liu, and Jan Niehues. 2022. Cost-effective training in low-resource neural machine translation. CoRR, abs/2201.05700.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).

Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. CoRR, abs/2106.13736.

Makoto Morishita, Katsuki Chousa, Jun Suzuki, and Masaaki Nagata. 2022. JParaCrawl v3.0: A large-scale English-Japanese parallel corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6704–6710, Marseille, France. European Language Resources Association.

Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Conference on Spoken Language Translation, IWSLT 2019, Hong Kong, November 2-3, 2019. Association for Computational Linguistics.

Xuan-Phi Nguyen, Shafiq R. Joty, Kui Wu, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE.

Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2022. Effective combination of pretrained models - KIT@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 190–197, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Jerin Philip, Alexandre Berard, Matthias Gallé, and Laurent Besacier. 2020. Monolingual adapters for zero-shot neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4465–4470, Online. Association for Computational Linguistics.

Telmo Pessoa Pires, Robin M. Schmidt, Yi-Hsiu Liao, and Stephan Peitz. 2023. Learning language-specific layers for multilingual machine translation. CoRR, abs/2305.02665.

Peter Polák, Danni Liu, Ngoc-Quan Pham, Jan Niehues, Alexander Waibel, and Ondřej Bojar. 2023. Towards efficient simultaneous speech translation: CUNI-KIT system for simultaneous track at IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Toronto, Canada (in-person and online). Association for Computational Linguistics.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea
Vedaldi. 2017. Learning multiple visual domains
with residual adapters. In Advances in Neural Infor-
mation Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, De-
cember 4-9, 2017, Long Beach, CA, USA, pages 506–
516.
Nils Reimers and Iryna Gurevych. 2020. Making
monolingual sentence embeddings multilingual us-
ing knowledge distillation. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 4512–4525,
Online. Association for Computational Linguistics.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. CoRR,
abs/2008.00401.
Jörg Tiedemann. 2012. Parallel data, tools and inter-
faces in OPUS. In Proceedings of the Eighth In-
ternational Conference on Language Resources and
Evaluation (LREC’12), pages 2214–2218, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Changhan Wang, Anne Wu, and Juan Miguel Pino. 2020.
Covost 2: A massively multilingual speech-to-text
translation corpus. CoRR, abs/2007.10310.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and
Lirong Dai. 2022. The USTC-NELSLIP offline
speech translation systems for IWSLT 2022. In Pro-
ceedings of the 19th International Conference on
Spoken Language Translation (IWSLT 2022), pages
198–207, Dublin, Ireland (in-person and online). As-
sociation for Computational Linguistics.
Xin Zheng, Zhirui Zhang, Junliang Guo, Shujian Huang,
Boxing Chen, Weihua Luo, and Jiajun Chen. 2021.
Adaptive nearest neighbor machine translation. In
Proceedings of the 59th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 11th
International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 368–374,
Online. Association for Computational Linguistics.
The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation
Zhihang Xie
Beijing Institute of General Artificial Intelligence
For end-to-end speech translation, the mod- Table 3: WER scores on test speech datasets
els have similar architecture as the PT36 mod-
LibriSpeech TEDLIUM MuSTC
els in Zhang et al.’s (2022b) work instead of the
27.23 32.17 34.73
PT48 models to reduce computational complex-
ity. Within a PT36 model, the speech module
Table 4: BLEU scores on tst-COMMON datasets
and the translation module are initialized with the
ASR12 model and the MT24 model respectively. Model en→de en→ja en→zh
The adapter module that connects the two modules MT24 31.04 14.74 22.80
is not trained from random initialization, because + finetune 33.00 17.11 23.44
it has been trained with the ASR12 model on the PT36 26.45 14.28 19.65
first stage. The training loss combines the cross
entropy loss for machine translation and the CTC
loss for speech recognition with a hyperparameter the update frequency of 8. The parameters in the
to balance the weights between the two losses. Wav2Vec2 module and the linear layer are sepa-
3.3 Speech Resegmentation
Past years' systems (Anastasopoulos et al., 2021; Antonios et al., 2022) have shown that speech resegmentation has a great impact on translation performance at the corpus level. During evaluation, audio clips are split into segments with a simple two-stage strategy using the WebRTC VAD toolkit (https://github.com/wiseman/py-webrtcvad). In the split stage, long audios are processed with three levels of settings, with the aggressiveness mode increasing from 1 to 3 and the frame size decreasing from 30 ms to 10 ms. In this way, most segments are no longer than a maximum duration dur_max, and the remaining outliers are split by brute force into ⌊duration / (0.75 × θ)⌋ chunks. In the merge stage, consecutive segments are merged into final segments no shorter than a minimum duration dur_min.
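A minimal sketch of this split-and-merge strategy with py-webrtcvad is shown below; it assumes 16 kHz, 16-bit mono PCM input, and the helper names and the exact merging policy are assumptions, not the authors' code.

```python
import webrtcvad

def split_long_audio(pcm_bytes, sample_rate=16000, mode=1, frame_ms=30):
    """Split a long audio buffer into voiced segments with WebRTC VAD.

    `mode` is the VAD aggressiveness (0-3); the paper raises it from 1 to 3
    while shrinking `frame_ms` from 30 ms to 10 ms for audios that stay too long.
    Returns a list of (start_sec, end_sec) tuples.
    """
    vad = webrtcvad.Vad(mode)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    segments, start, in_speech = [], 0.0, False
    for i in range(0, len(pcm_bytes) - bytes_per_frame + 1, bytes_per_frame):
        t = i / (2 * sample_rate)
        voiced = vad.is_speech(pcm_bytes[i:i + bytes_per_frame], sample_rate)
        if voiced and not in_speech:
            start, in_speech = t, True
        elif not voiced and in_speech:
            segments.append((start, t))
            in_speech = False
    if in_speech:
        segments.append((start, len(pcm_bytes) / (2 * sample_rate)))
    return segments

def merge_segments(segments, dur_min=20.0, dur_max=90.0):
    """Merge consecutive VAD segments until each merged segment is at least
    `dur_min` seconds long, while never exceeding `dur_max` seconds."""
    merged, cur = [], None
    for seg in segments:
        if cur is None:
            cur = seg
        elif (seg[1] - cur[0]) <= dur_max and (cur[1] - cur[0]) < dur_min:
            cur = (cur[0], seg[1])
        else:
            merged.append(cur)
            cur = seg
    if cur is not None:
        merged.append(cur)
    return merged
```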
4 Experiments

4.1 Settings
All the models are implemented with the SpeechBrain toolkit (Ravanelli et al., 2021). The total number of parameters in a PT36 model is about 794.0M, with 183.2M in the speech module and 610.9M in the translation module. The feature extractor processes the speech waveform with seven 512-channel convolution layers, whose kernel sizes and strides are [10,3,3,3,3,2,2] and [5,2,2,2,2,2,2]. There are 12 Transformer layers with 16 attention heads, a model dimension of 1024 and an inner dimension of 4096 in the speech encoder, the text encoder and the decoder. The adapter module has three Conv1D layers with kernel sizes and strides of [3,3,3] and [2,2,2].

In the first stage, the ASR12 model is finetuned on the speech corpora using 16 NVIDIA A100 GPUs for 21 epochs with a batch size of 3 and an update frequency of 8. The parameters in the Wav2Vec2 module and the linear layer are separately optimized with the Adam optimizer (Kingma and Ba, 2014). The learning rates are initialized to 1e-4 and 4e-4 with annealing factors of 0.9 and 0.8, and are updated based on the improvement of the training loss between the previous and the current epoch. During training, the speech waveform is perturbed with a random speed rate between 0.9 and 1.1, and the speech features are augmented with the SpecAugment technique (Park et al., 2019).

In the second stage, three MT24 models are finetuned on the translation corpora with a batch size of 12 and an update frequency of 4. The en→de MT24 model is trained using 8 A100 GPUs for 2 epochs, and the other two models are trained using 4 A100 GPUs for 6 epochs and 3 epochs. The model parameters are optimized with the Adam optimizer, and the initial learning rates are set to 5e-5 with an annealing factor of 0.9.

In the third stage, three PT36 models are finetuned on the corresponding MuST-C datasets, each trained using 4 A100 GPUs for 10 epochs with a batch size of 12 and an update frequency of 4. The learning rates are initialized to 3e-5 for the W2V module and 5e-5 for the mBART module, with annealing factors of 0.9. The loss weights are set to 0.1 for the ASR module and 0.9 for the MT module, since the performance of the ASR module is not good enough.

Table 3: WER scores on test speech datasets.
LibriSpeech   TEDLIUM   MuST-C
27.23         32.17     34.73

Table 4: BLEU scores on tst-COMMON datasets.
Model        en→de   en→ja   en→zh
MT24         31.04   14.74   22.80
+ finetune   33.00   17.11   23.44
PT36         26.45   14.28   19.65

4.2 Speech Recognition
Table 3 lists WER scores on the test speech datasets, where 34.73% is the average WER over the three MuST-C datasets. The performance of the ASR12 model is clearly much worse than that of other systems (Zhang et al., 2022b; Wang et al., 2021b), which report WERs around 10%. Due to the extremely large vocabulary size, the model requires a long time to train.
As a result, the model was still far from convergence at the time of this submission.

4.3 Sentence-level Translation
The tst-COMMON datasets are used to evaluate translation performance at the sentence level. The BLEU scores are calculated with the SacreBLEU toolkit (https://github.com/mjpost/sacrebleu), where Japanese texts are tokenized by the MeCab morphological analyzer (https://github.com/taku910/mecab) and Chinese texts are tokenized into characters. The BLEU scores on the three datasets are listed in Table 4.

For machine translation, compared with the base MT24 models, the performance of the finetuned MT24 models improves by 1.96 (~6.3%), 2.37 (~16.1%) and 0.64 (~2.8%) BLEU on the en→de, en→ja and en→zh translations. This indicates that adding out-of-domain corpora such as OpenSubtitles and NewsCommentary can boost machine translation quality.

For speech translation, compared with the finetuned MT24 models, the performance of the PT36 models degrades by a large margin: 6.55 (~19.8%), 2.83 (~16.5%) and 3.79 (~16.2%) BLEU on en→de, en→ja and en→zh. Compared with the base MT24 models, the gaps are still relatively large: 4.59 (~14.8%), 0.46 (~3.1%) and 3.15 (~13.8%) BLEU.

4.4 Corpus-level Translation
The translation performance of the en→de PT36 model is further evaluated on past years' test sets with challenging scenarios. For consistency, all test audios are resegmented using the method described in Section 3.3. Statistics on short segments in the tst2020 dataset are shown in Table 5. The number of brute-force segments drops to zero when dur_min is set to more than 15 s.

Table 5: Statistics on short segments in the tst2020 dataset with different dur_min and dur_max settings.

Table 6: BLEU scores calculated on past years' IWSLT en→de test sets with hypotheses automatically resegmented by the mwerSegmenter toolkit (Ansari et al., 2021) based on source transcriptions and target translations.

Table 6 lists BLEU scores on past years' test sets with different dur_min and dur_max settings. Performance improves as the segment duration gets longer, which means that more contextual information is provided to the model. When dur_min and dur_max are set to 20 s and 90 s, the best BLEU scores are achieved on most test sets, with an increase of 3.93 (~18.7%) in mean BLEU. Further investigation on long audio segments shows that avoiding brute-force segmentation is another factor in this improvement. Comparing experiment 2 and experiment 3, the mean BLEU score increases by 0.95 (~3.9%) points when the number of brute-force segments drops from 69 to 0. Comparing experiment 3 and experiment 4, the mean BLEU score increases by merely 0.22 (~0.8%) points.

4.5 Submissions
The three PT36 models are finally evaluated on the tst2023 datasets (Agarwal et al., 2023), which feature more challenging scenarios such as presentations and interviews. Test audios are resegmented with dur_min and dur_max set to 20 s and 90 s. Official metrics are presented in Table 7 for the en→de subsets, Table 8 for en→ja and Table 9 for en→zh. Comparing the performance between the in-domain TED sets and the out-of-domain ACL sets, the BLEU scores decrease by 2.7 (~12.1%), 0.3 (~2.8%) and 5.6 (~16.9%) points on en→de, en→ja and en→zh.

Table 7: Official metrics on the tst2023 en→de subsets with hypotheses automatically resegmented by the mwerSegmenter toolkit (Ansari et al., 2021) based on source transcriptions and target translations.
         TED                                                              ACL
         COMET (ref2)  COMET (ref1)  BLEU (ref2)  BLEU (ref1)  BLEU (both)  COMET   BLEU
         0.7201        0.7228        10.7         13.2         16.8         0.6769  10.4
Noticeably, the performance is almost halved (~48.4%), with only 11.5 BLEU on the en→de Sub subset. The results indicate that the proposed PT36 models have inadequate ability to handle non-native speakers, different accents, spontaneous speech and controlled interaction with a second speaker.

5 Conclusion
This paper describes the end-to-end speech translation systems for the IWSLT 2023 offline tasks. Built upon pretrained models, the systems are further trained on a large amount of parallel data using a three-stage finetuning strategy. The PT36 model consists of an ASR12 module with an adapter module for ASR and an MT24 module for MT. The training loss sums the CTC loss for ASR and the cross-entropy loss for MT. Experiments demonstrate that the proposed methods have the potential to achieve reasonable performance. However, due to limited resources, some modules have not been trained well, which has a negative impact on subsequent tasks. Therefore, the end-to-end models still underperform SOTA systems.

References
Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTev: Comprehensive evaluation of spoken language translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79.
Table 9: Official metrics on the tst2023 en→zh subsets.
         TED                                                              ACL
         COMET (ref2)  COMET (ref1)  BLEU (ref2)  BLEU (ref1)  BLEU (both)  COMET   BLEU
         0.7428        0.7014        33.0         23.3         38.6         0.6534  27.4
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Akhbardeh Farhad, Arkhangorodsky Arkady, Biesialska Magdalena, Bojar Ondřej, Chatterjee Rajen, Chaudhary Vishrav, Marta R Costa-jussa, España-Bonet Cristina, Fan Angela, Federmann Christian, et al. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88. Association for Computational Linguistics.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer.

Oleksii Hrinchuk, Vahid Noroozi, Ashwinkumar Ganesan, Sarah Campbell, Sandeep Subramanian, Somshubra Majumdar, and Oleksii Kuchaiev. 2022. NVIDIA NeMo offline speech translation systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 225–231.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv preprint arXiv:2010.12829.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al. 2021. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.

Akshaya Shanbhogue, Ran Xue, Ching Yun Chang, and Sarah Campbell. 2022. Amazon Alexa AI's system for IWSLT 2022 offline speech translation shared task. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 169–176.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021a. Voxpop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. arXiv preprint arXiv:2101.00390.
Minghan Wang, Yuxia Wang, Chang Su, Jiaxin Guo,
Yingtao Zhang, Yujia Liu, Min Zhang, Shimin Tao,
Xingshan Zeng, Liangyou Li, et al. 2021b. The hw-
tsc’s offline speech translation systems for iwslt 2021
evaluation. arXiv preprint arXiv:2108.03845.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, et al. 2022a. The ustc-
nelslip offline speech translation systems for iwslt
2022. In Proceedings of the 19th International Con-
ference on Spoken Language Translation (IWSLT
2022), pages 198–207.
Enhancing Video Translation Context with Object Labels
Jeremy Gwinnup1,2 , Tim Anderson2 , Brian Ore2 , Eric Hansen2 , Kevin Duh1
1
Johns Hopkins University, 2 Air Force Research Laboratory
{jeremy.gwinnup.1, timothy.anderson.20, brian.ore.1, eric.hansen.5}@us.af.mil,
[email protected]
1 Introduction
Video streams are rich sources of content, and the application of machine translation to videos presents open research challenges. Specifically, we are interested in translating the speech content present in videos, using the visual modality as auxiliary input to improve translation quality. Intuitively, visual signals may help disambiguate under-specified words or correct speech recognition errors.

There has been much research in speech translation, which focuses on speech input, and in multimodal machine translation, which focuses on visual and textual inputs; this work combines aspects of both areas. We assume a cascaded pipeline, where the speech in a video input is first passed to a speech recognition component, then the text transcripts together with the video frames are passed to a multimodal machine translation (MMT) system. Our contribution is an MMT system that augments text-based training data with labels obtained from a computer vision object detector (Fig. 1).

In contrast to more complex multimodal fusion techniques that combine vision and translation neural networks into end-to-end models, our modular approach is simple to implement, requiring no changes to the underlying MT toolkit.

Figure 1: Demonstration of augmenting source data with detected object labels to provide additional context.
src: And then you're going to stir it so have your stirrer available. PERSON CUP BOTTLE
tgt: E então você vai mexer, então tenha seu agitador disponível.

2 Object Class Label Augmentation
When considering the translation of instructional videos, the speaker's narration may use ambiguous language when describing the steps of the task, as the viewer may be able to infer the intent through objects or actions in the scene. If MT systems are trained only on the speaker's words and translations, these cues from the scene are not present. We propose to address this omission by analyzing clips of the video and augmenting the text data with the objects found in each clip.

Augmentation Process: To augment training data with object labels, an object recognition model
is run over each video clip in the data set in order to generate lists of objects present. The detected labels for the video clips corresponding to each utterance are collated and collapsed in order to keep the final sentence length to a manageable size: we are interested in the presence of an object class rather than in how many times that class occurs in the scene or in the time slices of the video clip.

Once processed, the per-clip labels are appended to the source side of the training, dev and test sets as "context markers". We do not apply these labels to the target side, as we wish to generate coherent sentences in the target language. This processing pipeline is illustrated in Figure 2.

[Figure 2: Processing pipeline. YOLOv5 detections for a clip (e.g. PERSON, CUP, BOTTLE, BOTTLE, BOTTLE, ...) are collapsed into the unique object classes (PERSON, CUP, BOTTLE) that are appended to the source text.]

In particular, we note in the example in Figure 1 that the transcription discusses a stirrer but does not give context as to what kind of stirrer: a laboratory sample stirrer, a paint stirrer, or, in this case, a stirrer to mix a drink. Using the object labels from the example, we see that the stirrer in this case refers to a drink, adding valuable context.

The augmented How2 corpus will be available for download at a future date.

[Figure 3: Training segments with N object classes detected (bar chart).]

Figure 3 shows the number of object classes detected per training segment: segments with one or two classes are the most common, with higher class counts forming a long tail. Full class object counts are shown in Table 1.

Observing the most-detected class labels in training segments (shown in Figure 4), we see that PERSON is by far the most common object class with over 164k occurrences, while CUP and BOTTLE are the next most common with around 23.8k occurrences each. As How2 is comprised of instructional videos in which the authors demonstrate how to perform a task, PERSON's high occurrence rate seems reasonable. The figure shows the top 15 object classes detected; the full list of detection counts is shown in Table 2.

While the above analyses focus on the training portion of the dataset, similar distributions are present in both the validation and test sets.
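The collapse-and-append step described above is simple enough to illustrate with a short sketch. This is not the authors' code; the function names are hypothetical, and per-utterance lists of YOLOv5 detections are assumed to be available already.

```python
def collapse_labels(detections):
    """Collapse repeated per-frame detections (e.g. many BOTTLE hits) into a
    de-duplicated list of object class names, preserving first-seen order."""
    seen = []
    for label in detections:
        if label not in seen:
            seen.append(label)
    return seen

def augment_source(src_sentence, detections):
    """Append the collapsed object classes to the source sentence as context markers.
    Only the source side is augmented; targets are left untouched."""
    classes = collapse_labels(detections)
    return src_sentence if not classes else src_sentence + " " + " ".join(classes)

# Example (mirroring Figure 1):
# augment_source("And then you're going to stir it so have your stirrer available.",
#                ["PERSON", "CUP", "BOTTLE", "BOTTLE", "BOTTLE"])
# -> "And then you're going to stir it so have your stirrer available. PERSON CUP BOTTLE"
```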
Number of object classes detected per training segment (the data plotted in Figure 3):

Classes  Segments     Classes  Segments     Classes  Segments
0        15,544       6        7,508        12       143
1        44,496       7        4,300        13       79
2        41,950       8        2,259        14       42
3        32,077       9        1,166        15       14
4        21,428       10       626          16       7
5        13,011       11       293          17       3

The 300-hour subset contains the full set of annotations; this work focuses on that subset. This portion consists of 13,493 videos with a total run-time of 305.1 hours, from which 189,276 utterances are extracted. These videos and segments are then segregated into training, validation and test sets as shown in Table 3, and the segments are used to train systems in downstream tasks such as MT.

Table 3: How2 data splits.
             Videos   Hours   Sentences
train        13,168   298.2   184,949
validation   150      3.2     2,022
test         175      3.7     2,305
[Figure 4: Detection counts for the top 15 object classes in training segments (bar chart, y-axis from 0 to 180,000); PERSON is by far the most frequent class, followed by CUP and BOTTLE.]

We examine three methods to prune over-prevalent or under-represented object class labels: naïve dropping of the N most-represented labels, inverse document frequency (IDF) thresholding, and normalized term frequency-inverse document frequency (TF-IDF) thresholding. For the first method, object labels are simply removed in most-common-first order; e.g., drop-3 removes the three most common classes: PERSON, CUP, and BOTTLE.
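As an illustration only (not the authors' implementation), IDF-based pruning over per-segment label sets could be sketched as follows; the log base, the threshold value and the thresholding direction (dropping labels that occur in too many segments) are assumptions.

```python
import math
from collections import Counter

def idf_prune(segment_labels, idf_threshold=4.0):
    """Drop object class labels whose inverse document frequency is below a threshold,
    i.e. labels that appear in too many segments (such as PERSON).

    segment_labels: list of per-segment label lists, e.g. [["PERSON", "CUP"], ["BOTTLE"], ...]
    Returns the pruned per-segment label lists.
    """
    n_segments = len(segment_labels)
    doc_freq = Counter(label for labels in segment_labels for label in set(labels))
    idf = {label: math.log(n_segments / df) for label, df in doc_freq.items()}
    return [[label for label in labels if idf[label] >= idf_threshold]
            for labels in segment_labels]
```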
(Cieri et al., 2004–2005), TEDLIUM-v3 (Hernandez et al., 2018), and ATC (Godfrey, 1994); the language models (LM) were estimated on 1 billion words from Fisher, News-Crawl 2007-2017 (Kocmi et al., 2022), News-Discuss 2014-2017 (Kocmi et al., 2022), and TED. This system used Mel frequency cepstral coefficient (MFCC) features as input to a factorized time delay neural network (TDNN) with residual-network-style skip connections. Initial decoding was performed using a finite state transducer (FST) built from a bigram LM, and the resulting lattices were rescored with an RNN LM. The vocabulary included 100k words.

4.5 Results
Armed with an array of label pruning strategies, we run a series of experiments to determine the effectiveness of each method.

4.5.1 Marian Label Augmented Systems
Marian label augmentation and pruning results are shown in Table 4, reporting scores for BLEU (Papineni et al., 2002), chrF2 (Popović, 2015) and TER (Snover et al., 2006) as calculated by SacreBLEU (Post, 2018), and COMET (Rei et al., 2020) with the default wmt20-comet-da model.

We note that drop-3, tfidf at 0.20, and idf at 4.0 each yield a +0.9-1.0 gain in BLEU over the baseline. We also report the number of labels pruned at each experimental threshold, noting that drop and tfidf remove approximately 42-43% of object class labels at maximum performance, while idf removes a much larger 74.73%.

As we see from the results, each of the three label pruning methods yields improvements over both the text-only and the non-pruned augmented systems.

Using the compare-mt (Neubig et al., 2019) tool, we take a closer look at various characteristics of the translation hypotheses of each of these five systems to see if any trends emerge. Table 5 shows averaged sentence-BLEU scores for hypotheses of varying output lengths. The intuition is that these averaged scores will help determine whether a given system or pruning strategy is better at certain output lengths.

From these averaged scores, we note that plain label augmentation tends to improve over the baseline for hypothesis lengths between 30 and 60 tokens but performs worse outside that range. Of the three pruning strategies, drop 3 tends to bring the most improvement, especially for shorter hypotheses, and idf 4.0 tends to help the longer sequences.

4.5.2 Nmtpytorch Baseline Experiments
For the nmtpytorch baseline comparison systems, we note that the maximum training sequence length has an effect on system performance, most likely due to the shallow RNN architecture. Table 6 shows that using the default 120 max token limit from Sanabria et al. (2018) yields better performance (+0.9-1.1 BLEU) with both the visual perturbation and our label augmentation approach. These results show our approach yields a similar performance gain.

4.5.3 ASR Noise Experiments
For the ASR-based experiments shown in Table 7, we see improvements of +0.7 BLEU with both the clean and noisy Kaldi systems. We expect that the speech-recognition-based systems would not perform as well as the gold-standard systems, but the use of object labels can help mitigate this loss in performance.

4.6 Analyzing Attention Outputs
We use Marian's ability to output soft attention weights to compare an augmented system against its baseline counterpart, as shown in Figure 5. For this example, line 221 of the test set, the baseline system scores a sentence-BLEU of 30.66 versus the augmented system's 61.32. We note the attention contributions of the object labels to the output tokens. Utilizing this feature as part of an unaltered MT toolkit allows for quick and easy analysis of the benefits of object label augmentation.

5 Related Work
Perhaps most closely related to our approach is ViTA (Gupta et al., 2021), which adds object labels extracted from images in an image captioning translation task. While the motivation for adding object labels is similar, there are important differences with our setup: 1) We work on video narration of an author's task demonstration, where objects appear at different points in the clip, which differs significantly from static image captions. 2) Our work focuses on training MT systems from scratch as opposed to fine-tuning existing models.

For a broad survey of multimodal translation, refer to Sulubacak et al. (2020). Specifically for video translation on How2, Sanabria et al. (2018) investigate an MT system that adds a 2048-dimensional feature vector, averaging features for every 16 frames to create a global feature vector for
System BLEU chrF2 TER COMET Dropped Labels
Marian baseline 57.9 75.0 29.6 0.6819 –
nmtpy baseline 56.2 74.2 30.7 0.6234 –
nmtpy visual 55.9 74.0 31.1 0.6090 –
drop 0 57.6 74.9 29.9 0.6732 0 (0%)
drop 1 58.6 75.4 28.9 0.6785 164,605 (33.55%)
drop 2 58.7 75.5 28.9 0.6840 188,475 (38.41%)
drop 3 58.9 75.7 28.7 0.6907 212,284 (43.26%)
drop 4 58.5 75.3 29.1 0.6766 230,090 (46.89%)
drop 5 58.5 75.2 29.3 0.6687 247,106 (50.36%)
tfidf 0.10 58.3 75.1 29.5 0.6778 162,762 (33.17%)
tfidf 0.20 58.8 75.4 28.8 0.6817 205,938 (41.97%)
tfidf 0.30 58.8 75.5 29.0 0.6812 398,643 (81.24%)
idf 3.0 58.4 75.2 29.2 0.6832 212,284 (43.26%)
idf 4.0 58.9 75.5 29.0 0.6887 366,695 (74.73%)
idf 5.0 58.5 75.4 29.0 0.6857 428,655 (87.36%)
Table 4: Marian system scores for How2 en–pt test set, measured in BLEU, chrF2, TER and COMET. There are
490,697 object class labels present in the entire augmented training corpus.
Table 6: Max token length effect on BLEU for the nmtpytorch baseline, visual perturbation and our label augmented systems.
System       Max Tok   BLEU
nmtpy base   120       55.0
nmtpy vis    120       56.1
nmtpy aug    120       55.9
nmtpy base   250       56.2
nmtpy vis    250       55.9
nmtpy aug    250       55.7

Table 7: Results for clean and noisy Kaldi systems for both baseline and augmented conditions.
System                  BLEU   COMET
Kaldi clean base        52.0   0.556
Kaldi clean aug         52.7   0.583
Kaldi 5 dB noise base   50.8   0.459
Kaldi 5 dB noise aug    51.5   0.459

word embedding space to produce image-based first and last words to influence word choice in their bidirectional RNN systems.

While there are a few examples of object detection as a separate task (including our work), Baltrusaitis et al. (2019) note the rapid jump to joint representations as neural networks became popular tools for a variety of multimodal tasks, explaining the prevalence of work following that approach.

6 Future Work
Having proven our object label augmentation technique on How2, future work includes applying label augmentation to other datasets such as the VATEX (Wang et al., 2020) video description and VISA (Li et al., 2022) ambiguous subtitles datasets. Further research into the effects of ASR-degraded speech, and examining task-agnostic image-language models such as CLIP (Radford et al., 2021) for label augmentation, may also be useful.

7 Conclusion
We present a straightforward method to improve MT context quality by augmenting training data with objects detected in corresponding video clips. Using these augmented corpora, we realize gains of up to +1.0 BLEU over baselines without changes to the underlying MT toolkits used to build models. We additionally show improvements of up to +0.7 BLEU with object label augmentation when substituting ASR speech for gold-standard inputs.

The views expressed are those of the authors and do not necessarily reflect the official policy or position of the Department of the Air Force, the Department of Defense, or the U.S. government. Distribution Statement A. Approved for public release: distribution is unlimited. Originator reference number RH-22-123269. Case number AFRL-2022-3098.

References
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell., 41(2):423–443.

Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.

Christopher Cieri, David Graff, Owen Kimball, David Miller, and Kevin Walker. 2004–2005. Fisher English Training Part 1 and 2 Speech and Transcripts. Linguistic Data Consortium, Philadelphia.

John Godfrey. 1994. Air Traffic Control Complete. Linguistic Data Consortium, Philadelphia.

Kshitij Gupta, Devansh Gautam, and Radhika Mamidi. 2021. ViTA: Visual-linguistic translation by aligning object tags. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 166–173, Online. Association for Computational Linguistics.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, pages 198–208, Cham. Springer International Publishing.

Glenn Jocher, Alex Stoken, Ayush Chaurasia, Jirka Borovec, NanoCode012, TaoXie, Yonghye Kwon, Kalen Michael, Liu Changyu, Jiacong Fang, Abhiram V, Laughing, tkianai, yxNONG, Piotr Skalski, Adam Hogan, Jebastin Nadar, imyhxy, Lorenzo Mammana, AlexWang1900, Cristi Fati, Diego Montes, Jan Hajek, Laurentiu Diaconu, Mai Thanh Minh, Marc, albinxavi, fatih, oleg, and wanghaoyang0106. 2021. ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Yihang Li, Shuichiro Shimizu, Weiqi Gu, Chenhui Chu, and Sadao Kurohashi. 2022. VISA: an ambiguous subtitles dataset for visual scene-aware machine translation. CoRR, abs/2201.08054.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft COCO: Common objects in context.

Pranava Swaroop Madhyastha, Josiah Wang, and Lucia Specia. 2017. Sheffield MultiMT: Using object posterior predictions for multimodal machine translation. In Proceedings of the Second Conference on Machine Translation, pages 470–476, Copenhagen, Denmark. Association for Computational Linguistics.

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, and John Wieting. 2019. compare-mt: A tool for holistic comparison of language generation systems. CoRR, abs/1903.07926.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.

David Snyder, Guoguo Chen, and Daniel Povey. 2015. MUSAN: A Music, Speech, and Noise Corpus. ArXiv:1510.08484v1.

Umut Sulubacak, Ozan Çağlayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, 34.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2020. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research.
Length-Aware NMT and Adaptive Duration for Automatic Dubbing
Zhiqiang Rao, Hengchao Shang, Jinlong Yang, Daimeng Wei, Zongyao Li,
Jiaxin Guo, Shaojun Li, Zhengzhe Yu, Zhanglin Wu, Yuhao Xie, Bin Wei,
Jiawei Zheng, Lizhi Lei and Hao Yang
Huawei Translation Service Center, Beijing, China
{raozhiqiang,shanghengchao,yangjinlong7,weidaimeng,lizongyao,
guojiaxin1,lishaojun18,yuzhengzhe,wuzhanglin2,xieyuhao2,weibin29,
zhengjiawei15,leilizhi,yanghao30}@huawei.com
Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

William Brannon, Yogesh Virkar, and Brian Thompson. 2022. Dubbing in practice: A large scale study of human localization with insights for automatic dubbing.

Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, and Marcello Federico. 2023. Jointly optimizing translations and speech timing to improve isochrony in automatic dubbing.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Johanes Effendi, Yogesh Virkar, Roberto Barra-Chicote, and Marcello Federico. 2022. Duration modeling of neural TTS for automatic dubbing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8037–8041.

Marcello Federico, Robert Enyedi, Roberto Barra-Chicote, Ritwik Giri, Umut Isik, Arvindh Krishnaswamy, and Hassan Sawaf. 2020. From speech-to-speech translation to automatic dubbing. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 257–264, Online. Association for Computational Linguistics.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, and Marcello Federico. 2022. Isometric MT: Neural machine translation for automatic dubbing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6242–6246.

Zongyao Li, Jiaxin Guo, Daimeng Wei, Hengchao Shang, Minghan Wang, Ting Zhu, Zhanglin Wu, Zhengzhe Yu, Xiaoyu Chen, Lizhi Lei, Hao Yang, and Ying Qin. 2022. HW-TSC's participation in the IWSLT 2022 isometric spoken language translation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 361–368, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Adam Lopez. 2008. Statistical machine translation. ACM Comput. Surv., 40(3).

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Jongseok Park, Kyubyong Kim. 2019. g2pe. https:
//github.com/Kyubyong/g2p.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao,
Zhou Zhao, and Tie-Yan Liu. 2022. Fastspeech 2:
Fast and high-quality end-to-end text to speech.
NAVER LABS Europe’s Multilingual Speech Translation Systems
for the IWSLT 2023 Low-Resource Track
[email protected] [email protected]
Figure 1: An illustration of our multilingual ST architecture as described in Section 2. The bold arrow path
corresponds to the speech-to-text training path. At decoding time, we can choose between producing speech-to-text
or text-to-text translations. Figure best seen in color.
less training data and compute.

This paper is organized as follows. We first describe the architecture and training settings of our multilingual ST systems in Section 2. We next list the resources we use in Section 3. Section 4 presents our results in both low and high-resource settings. Lastly, we highlight the zero-shot potential of our approach in Section 5 and present our concluding remarks in Section 6.

2 System Description
In this work we focus on a parameter-efficient training solution that allows us to input the features from a pre-trained speech representation model into a pre-trained multilingual MT model, producing translations from both speech and text in multilingual settings. This setting also allows us to leverage automatic speech recognition (ASR; i.e. speech-to-transcript) data. The general architecture is presented in Figure 1. The architecture is considered parameter-efficient because only a small portion of its parameters is trained (the bottom encoder layers and small adapter layers).

Architecture. We initialize our models with a pre-trained multilingual MT model, which we adapt to the ST task by inputting features extracted with a frozen pre-trained speech representation model. The MT model is also frozen, except for the bottom 2 or 3 encoder layers and small adapter modules (those introduced by Bapna and Firat (2019), with bottleneck dimension 64) added after each encoder and decoder layer. As we show in our results, the fine-tuned encoder layers are able to map the speech features into the representation space of the pre-trained MT model, and the adapters can help with domain adaptation (and possibly help alleviate the length mismatch). At inference, this model can be used for MT with very little memory overhead: the convolutional layers and adapters are disabled, and the bottom encoder layers are swapped with those of the initial pre-trained model.
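The paper does not provide an implementation; the following PyTorch-style sketch, with assumed dimensions (a 1024-dimensional MT model and 768-dimensional speech features), illustrates the kind of components involved: a stride-2 convolutional downsampler feeding the MT encoder, and a bottleneck adapter of dimension 64 inserted after a Transformer layer. It is a sketch under these assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ConvDownsampler(nn.Module):
    """Downsample a sequence of frozen speech features by a factor of 2 and
    project it to the MT model's hidden size (dimensions are assumptions)."""
    def __init__(self, feat_dim=768, model_dim=1024):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, model_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, speech_feats):                  # (batch, time, feat_dim)
        x = self.conv(speech_feats.transpose(1, 2))   # (batch, model_dim, time/2)
        return x.transpose(1, 2)                      # (batch, time/2, model_dim)

class Adapter(nn.Module):
    """Bottleneck adapter (Bapna and Firat, 2019) with a residual connection."""
    def __init__(self, model_dim=1024, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(model_dim)
        self.down = nn.Linear(model_dim, bottleneck)
        self.up = nn.Linear(bottleneck, model_dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))
```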
Training settings. We train on 4 V100 GPUs (80GB) for up to 200 000 updates, with a maximum batch size of 4 000 source features (or 80 seconds of audio) and accumulated gradients over two batches (a total of 32 000 features per update, or 640 seconds of audio; in practice, with padding, each update corresponds to approximately 80 utterances or 530 seconds of audio). We sample language pairs with a temperature of 3, i.e. with probability $p_k = u_k^{1/3} / \sum_i u_i^{1/3}$, where $u_k$ is the utterance count for language pair $k$. We validate every 5 000 updates and perform early stopping on validation BLEU for the language pair(s) of interest, with a patience of 5, averaging model weights across the last 3 checkpoints (while all the configurations presented in this paper use checkpoint averaging, we later re-trained our contrastive submission for Taq-Fr and found virtually the same results without it). We find best results using a single convolutional layer with stride 2, which downsamples the sequence of speech features by a factor of 2. The other hyperparameters are listed in Appendix Section A.1.
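As a small illustration of the temperature-based sampling above (not the authors' code, and the example pair names and counts are placeholders), the language-pair sampling probabilities can be computed as:

```python
def sampling_probs(utterance_counts, temperature=3.0):
    """Temperature-based sampling: p_k = u_k^(1/T) / sum_i u_i^(1/T)."""
    weights = {pair: count ** (1.0 / temperature) for pair, count in utterance_counts.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Example: sampling_probs({"taq-fr": 5025, "fr-en": 31207, "es-en": 37168})
```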
Table 1: Speech representation models. The top portion presents Tamasheq-dedicated models, while the bottom lists large general-purpose multilingual models.
Model                              # params   Transformer layers   Feature dimension
Tamasheq (Boito et al., 2022b)     95M        12                   768
Niger-Mali (Boito et al., 2022b)   95M        12                   768
mHuBERT-Tamasheq                   95M        12                   768
XLSR-53 (Conneau et al., 2021)     317M       24                   1024
XLS-R (Babu et al., 2022)          317M       24                   1024

Table 2: Speech Translation (ST) and Speech Recognition (ASR) data provided by the organizers (train+valid). The ASR data is outside of the constrained setting.
Task   Source     Target     hours:minutes   # utterances
ASR    Quechua    Quechua    51:39           8,301
ST     Quechua    Spanish    2:42            698
ST     Tamasheq   French     15:43           5,025

3 Resources

3.1 Pre-trained Speech Representation Models
We experiment with different versions of two speech representation models: HuBERT (Hsu et al., 2021) and wav2vec 2.0 (Baevski et al., 2020). We do not fine-tune these models in any of our configurations, but instead use them as feature extractors (see Figure 1). Because of this, our models are sensitive to the layer we extract features from. Pasad et al. (2021) argue that, for wav2vec 2.0 models that are not fine-tuned on ASR, speech features from middle layers tend to have a higher abstraction from the speech signal, which is beneficial to downstream tasks. The results from Boito et al. (2022b) seem to confirm that this observation holds for low-resource ST. To the best of our knowledge, there is no similar investigation for HuBERT models (we hypothesize that layer selection is less important for HuBERT architectures due to the multi-iteration approach that increases signal abstraction at each iteration).

Table 1 presents the speech representation models we experiment with. The Tamasheq model is a monolingual wav2vec 2.0 Base model trained on 243 h of Tamasheq speech. The Niger-Mali model is a wav2vec 2.0 Base model trained on the same Tamasheq speech data plus 111 h of French, 109 h of Fulfulde, 100 h of Hausa, and 95 h of Zarma, for 658 h in total. The data for both models is sourced from the Niger-Mali audio collection (Boito et al., 2022a). The unreleased mHuBERT-Tamasheq model uses this same audio collection for training, while also including Common Voice (Ardila et al., 2020) data in four other languages (English, French, Arabic and Kabyle), resulting in 5 069 h of speech. XLSR-53 (56k hours) and XLS-R (500k hours) are massively multilingual wav2vec 2.0 Large models covering 53 and 128 languages, respectively. Neither of these two multilingual models has seen Tamasheq or Quechua speech during training (Appendix Table 16 lists all models with links for downloading checkpoints, when available).

3.2 Pre-trained Multilingual MT Models
To initialize our ST models, we first experimented with mBART for many-to-many translation (mBART50NN; Tang et al., 2020), but found the NLLB-200 models (Costa-jussà et al., 2022) to give better results. We experiment with the dense NLLB models of various sizes: the distilled 600M-parameter and 1.3B-parameter versions, and the 3.3B-parameter version. We end up using the larger versions in our submissions (1.3B and 3.3B). Note that NLLB covers 202 languages, including Tamasheq and Quechua, which is not the case for mBART. At the same model size, despite covering more languages, NLLB is also a stronger machine translation model overall than mBART. Also, unlike mBART, it is not English-centric.

Contrary to Tang et al. (2021), we keep the original mBART or NLLB vocabularies of size 250k and do not train any embeddings. Instead, like Berard et al. (2021), we find that it is possible to filter the vocabulary at test time to only cover the languages of interest, significantly reducing the memory footprint of the model with a minor reduction in performance (with NLLB, 44k tokens are enough for 100% coverage of the training data (mTEDx, TED-LIUM, Quechua, Tamasheq), or 35k when restricting to our Taq-Fr setting; this represents a reduction of more than 200M parameters). We can also filter the vocabulary and embeddings before ST fine-tuning and achieve the same performance as with the full vocabulary without needing to train any embeddings. See Table 14 in the Appendix for a comparison of these approaches. In order to study the zero-shot translation capabilities of our models (i.e., translating to languages and language pairs unseen at training), we do not apply vocabulary filtering to the configurations presented in the main paper.
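A minimal sketch of such test-time vocabulary filtering is shown below. It assumes a generic encoder-decoder whose input and output embeddings can be rebuilt from a precomputed list of token ids covering the languages of interest; all names are illustrative and this is not the authors' code.

```python
import torch
import torch.nn as nn

def filter_vocabulary(embedding: nn.Embedding, keep_ids):
    """Build a smaller embedding table restricted to `keep_ids`.

    Returns the new embedding module and an old-id -> new-id mapping, which must
    also be applied to the tokenizer and to the output projection so that decoding
    only produces tokens from the reduced vocabulary.
    """
    keep_ids = sorted(set(keep_ids))
    new_emb = nn.Embedding(len(keep_ids), embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(embedding.weight[torch.tensor(keep_ids)])
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    return new_emb, old_to_new
```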
Table 3: ASR and ST data in English, French and Spanish sourced from TED talks (unconstrained setting).
Task   Source    Target    hours:minutes   # utterances
ASR    English   English   208:00          91,003
ASR    French    French    218:59          117,081
ASR    Spanish   Spanish   214:15          103,076
ST     French    English   57:39           31,207
ST     French    Spanish   42:14           21,862
ST     Spanish   English   79:37           37,168
ST     Spanish   French    9:34            4,568

Table 4: Results on the official test sets for the IWSLT 2023 Low-Resource Task. We also show results on the IWSLT 2022 Taq-Fr test set. Note that all Quechua models are trained on Tamasheq data, but the reverse is not true (see Appendix Table 15). Lines 3 and 4 correspond to the same model.
Submission              Taq-Fr IWSLT 2022   Taq-Fr IWSLT 2023   Que-Es IWSLT 2023
Taq-Fr  primary         20.75               23.59               ✗
Taq-Fr  contrastive 1   19.06               21.31               ✗
Taq-Fr  contrastive 2   18.58               18.73               17.74
Que-Es  primary         18.58               18.73               17.74
Que-Es  contrastive 1   16.84               ✗                   15.67
Que-Es  contrastive 2   16.21               ✗                   15.25

3.3 Datasets
We tackle the low-resource setting by building multilingual systems that utilize both ASR and ST data in the languages of interest (Tamasheq and Quechua) and in high-resource directions whose target language is of interest (French and Spanish). Note that we also include X→English data, as we initially planned to participate in the Irish-English task. Including more data in high-resource languages has several advantages. Firstly, it has a regularization effect that prevents us from immediately overfitting the low-resource training data. Secondly, it enables knowledge transfer from common target languages and from similarly-sounding source languages (manual inspection revealed that audio from both datasets presents some degree of target language borrowing, e.g. Spanish words present in the Quechua speech and French words present in the Tamasheq speech). Thirdly, as we build multilingual ST systems by mapping the speech representation vectors into the same space as the multilingual MT model, our goal is to produce a model that is as multilingual as possible, not one specializing in a single language. Our results show that training on multiple languages at once achieves this effect, while also producing good zero-shot ST results.

Table 2 presents statistics for the datasets provided by the IWSLT 2023 organizers. The Que-Es dataset is an unreleased dataset prepared for this year's challenge (we are aware that the dataset reference is Que-Spa; we chose the ISO 639-1 two-letter abbreviation for Spanish for consistency with the other datasets used in this work). It corresponds to a translated subset of the Quechua ASR data ("Siminchik") from Cardenas et al. (2018). The Taq-Fr dataset was introduced by Boito et al. (2022a). Table 3 presents statistics for the datasets in high-resource languages. English ASR data comes from TED-LIUMv2 (Rousseau et al., 2014), and the other data comes from mTEDx (Salesky et al., 2021). Appendix Table 15 lists the datasets used in each of our submissions. In Section 4.3, we also run experiments in the setting of the IWSLT 2021 Multilingual Task to measure how good our approach is on high-resource languages. The datasets used for this setting are presented in Appendix Table 10.

4 Experiments and Results
All our submissions to the low-resource ST task are in the unconstrained setting, due to the use of pre-trained models and to training on data in other languages. The datasets used in each submission are listed in Appendix Table 15. This section is organized as follows. We present our Taq-Fr results (4.1) with a detailed ablation study justifying our architectural choices. We then present our Que-Es results (4.2). Lastly, we evaluate and analyze our approach in a high-resource setting (4.3).

4.1 Tamasheq-French Results
We submit two systems that have Taq-Fr as the only low-resource language pair (primary and contrastive 1). Additionally, we take our primary submission for Que-Es, which has also been trained on Taq-Fr, and submit it as contrastive 2. The top portion of Table 4 gives the test BLEU scores, and the top portion of Appendix Table 11 presents the validation BLEU scores. Table 12 shows statistics (average and standard deviation) over multiple runs when applicable.

System description. The contrastive 1 model uses the Niger-Mali wav2vec 2.0 model (8th layer) as its speech feature extractor. It was initialized with NLLB 1.3B, whose bottom 3 encoder layers were finetuned. We took three runs of this setting with different random seeds and picked the best performing one on the validation set (in terms of
Taq-Fr BLEU) as our contrastive submission. We then ensembled the three runs as our primary submission. Finally, contrastive 2 is the ensemble model used as the primary submission to the Que-Es task, which covers both low-resource languages and combines XLS-R Large with NLLB 3.3B.

Results. Our primary submission significantly outperforms the previous state of the art of 13.2 BLEU (+7.5 BLEU) on the IWSLT 2022 test set by Khurana et al. (2022) (here we are referencing the model pre-trained using the Niger-Mali dataset that was presented at JSALT 2022: https://www.clsp.jhu.edu/jsalt-2022-closing-presentations/). It also ranks first in this year's edition, with +7.7 BLEU over the second best primary submission. Our contrastive submissions rank second and third (beating the second best primary submission by +5.4 and +2.8 BLEU).

4.1.1 Ablation Study
In Appendix Table 18 we compare our contrastive 1 model (the non-ensembled version of our primary submission) with other architectures trained on the same data to validate our choice of hyperparameters.

Speech features. The wav2vec 2.0 models trained with Tamasheq (Niger-Mali and Tamasheq) largely outperform the well-known massively multilingual models (XLSR-53 and XLS-R) on Taq-Fr (e.g. +2.5 BLEU for Tamasheq compared to XLS-R L). These models are larger and trained on considerably more data, but do not include any Tamasheq speech. Similar to previous works (Pasad et al., 2021; Boito et al., 2022b), when extracting features from wav2vec 2.0 we find that the 8th layer gives better results than the 11th (penultimate) layer (+2.5 BLEU for Niger-Mali). For HuBERT, on the contrary, features from the 11th layer give the best results (+0.2 BLEU compared to the 8th layer). When using the right layer, we find that wav2vec 2.0 outperforms HuBERT (+2.7 BLEU for Niger-Mali compared to mHuBERT-Taq). Finally, Niger-Mali is as good on Taq-Fr as the Tamasheq wav2vec 2.0, but performs considerably better on Fr-En (+4.1 BLEU), probably because it was trained with French audio. The best Fr-En performance is achieved with XLS-R L. We find worse performance on Fr-En with XLS-R XL (-2.0 BLEU), but this may be due to layer selection.

Pre-trained MT model. The larger the model used for initialization, the better the performance (even more so for Fr-En). However, we find that the gain from using NLLB 3.3B over NLLB 1.3B is too small to justify the increase in model size and decoding latency (3 times slower). At the same model size, NLLB 600M performs considerably better than mBART (+1.7 BLEU on Taq-Fr, +3.6 BLEU on Fr-En).

Trained parameters. Fine-tuning too many encoder layers results in overfitting, which hurts Taq-Fr and Fr-En performance. On the other hand, fine-tuning just 1 or 2 layers instead of 3 does not result in a large BLEU drop. Similarly, adapter modules are not always needed. Disabling decoder adapters does not degrade Taq-Fr performance (+0.2 BLEU), but results in a slight drop in Fr-En performance (-0.9 BLEU), which could be attributed to a domain adaptation effect (to the mTEDx domain). Disabling encoder adapters has more impact on performance for Taq-Fr (-0.8 BLEU), with a similar effect on Fr-En (-1.0 BLEU). Section 4.3 shows that these adapters are important for domain adaptation.

Convolutions. The number of convolutional layers does not impact performance much (a range of 1.1 BLEU on Taq-Fr and 3.2 BLEU on Fr-En for 0 to 3 layers), but it can have a large impact on decoding speed: each layer divides the input length by a factor of 2, resulting in a roughly 3.5× speed-up from 0 to 3 layers. Interestingly, even though it was trained on much shorter sequences, the MT model seems to adapt quite well to any input length, even without any convolutions (without any convolution, the speech-feature-to-target-token ratio is 12:1): we achieve a better Taq-Fr result without any convolutions, but a worse Fr-En result. However, models with fewer convolutional layers seem to converge faster (as shown in Appendix Figure 2).

Stacked layers. While our approach described in Section 2 fine-tunes some parameters of the pre-trained MT model, we can instead plug new Transformer layers at the bottom of the encoder, without changing any existing parameters. These "stacked layers" result in slightly larger models but are conceptually simpler, as they try to map the speech features into the same representation space as the input text embeddings of the MT model. Appendix Table 17 compares this architecture with the one used in our submission to the Taq-Fr task. We see
148
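To make the two ideas above concrete, namely strided convolutions that halve the feature length and new Transformer layers stacked under a frozen pre-trained MT encoder, here is a minimal PyTorch sketch. It is illustrative only: dimensions, kernel size and layer counts are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeechFrontEnd(nn.Module):
    """Illustrative front-end: each stride-2 convolution halves the length of the
    wav2vec 2.0 feature sequence before it enters new (trainable) Transformer
    layers stacked below a frozen pre-trained MT encoder."""

    def __init__(self, feat_dim=1024, model_dim=1024, n_convs=1, n_stacked=1):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(feat_dim if i == 0 else model_dim, model_dim,
                      kernel_size=5, stride=2, padding=2)
            for i in range(n_convs)
        )
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=16,
                                           dim_feedforward=4096, batch_first=True)
        self.stacked = nn.TransformerEncoder(layer, num_layers=n_stacked)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)       # (batch, feat_dim, time) for Conv1d
        for conv in self.convs:
            x = torch.relu(conv(x))     # time dimension divided by 2 per layer
        x = x.transpose(1, 2)           # back to (batch, time', model_dim)
        return self.stacked(x)          # ready to be fed to the frozen MT encoder

# Example: 1 convolution + 1 stacked layer, as in the lightweight variant discussed above.
front_end = SpeechFrontEnd(n_convs=1, n_stacked=1)
speech_feats = torch.randn(2, 120, 1024)   # e.g. layer-8 wav2vec 2.0 features
print(front_end(speech_feats).shape)       # torch.Size([2, 60, 1024])
```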
We see that it performs similarly well (sometimes better) and that it does not add any noticeable decoding latency. We can even reach the same Taq-Fr performance as our contrastive submission by just adding a single Transformer layer plus one convolution layer and small adapters (28M trained parameters in total). Finally, disabling all adapters only results in a small BLEU drop, suggesting that it is indeed possible to map the speech features into the text input space with only one Transformer layer. This is surprising, considering that the input to this layer is 6 times as long as the target sequence on average.

4.2 Quechua-Spanish Results

The test and validation scores of our submissions to the Que-Es task are reported in the second half of Table 4 and 11, respectively. Because these models are also trained on Taq-Fr data, we additionally report their performance on that task.

System description. As we do not have a speech feature extractor specialized to Quechua speech, our contrastive 1 submission uses a massively multilingual wav2vec 2.0 model: XLS-R Large (18th layer). Compared to our Tamasheq submission, it is also initialized with a larger MT model (NLLB 3.3B), which we found to perform better in this setting. The training settings are the same as for the Tamasheq models, except that we only fine-tune the bottom 2 encoder layers (instead of 3) and validate every 2,500 updates, since this larger model tends to converge faster. Another difference is that we train on both Tamasheq and Quechua data (in addition to the mTEDx and TED-LIUM data). Like in our Tamasheq submission, we train 3 models with different random seeds and ensemble them as our primary submission. Our contrastive 2 submission uses a single model with the same training settings, but starts from a smaller pre-trained MT model (NLLB 1.3B).

Results. Our primary submission in the Que-Es task also ranked first, with 17.7 BLEU on the official test set. The full ranking results were not communicated in time for this camera-ready. They will be made available later through the conference findings paper (Agarwal et al., 2023).

Data contamination. We found shortly after our submission that all the audio files used in the official test and validation sets are also present in the ASR training data shared by the organizers for the unconstrained setting. This means that our Que-Es ST models are evaluated in an unrealistic setting, where they are tasked to translate Quechua utterances of which they already know the Quechua transcription. For this reason, we filtered the ASR data to remove all audio files also present in the validation and test sets for Que-Es, and we re-trained models on this filtered data.¹³ While our official submission results presented in Table 4 use the "contaminated" dataset for comparison with the other submissions, we think any future comparison to our work should be done with the updated results in Appendix Table 11. Note that similar care should be taken with the results of other participants.

4.3 Results and Analysis in a High-Resource Setting

The results of our ablation studies (Section 4.1.1) seem to indicate that our models are reasonably good on Fr-En translation, even though we do early stopping and tune our hyper-parameters based on Taq-Fr performance. Here, we further investigate the performance of our approach on high-resource ST by training models in the setting of the IWSLT 2021 Multilingual Task (Anastasopoulos et al., 2021). This task evaluates the performance of multilingual ST models in 4 training directions, for which in-domain training data is provided, and 3 zero-shot directions, for which no training data is provided.

We use XLS-R Large as the speech feature extractor, experiment with both NLLB 1.3B and NLLB 3.3B as the MT model, and perform early stopping based on the average validation BLEU across the 4 official training directions. We train our models on all the mTEDx language pairs that are not zero-shot, along with TED-LIUM (English ASR) and the Tamasheq and Quechua data (see Table 15). Note that the use of pre-trained models and English ASR means our models fall into the unconstrained setting.

Table 5 presents our results on this task, compared with the best unconstrained submission (FAIR; Tang et al., 2021).¹⁴ We find that both our models outperform FAIR's ensemble submission in the training directions, even though they require substantially less compute and data to train, and they are not ensembled.

¹³ In the updated version, we use NLLB 1.3B by default instead of NLLB 3.3B, like for Taq-Fr. Appendix Table 11 presents uncontaminated results.
¹⁴ SacreBLEU signature (Post, 2018): nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.1.0
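As a concrete illustration of the decontamination step described in the "Data contamination" paragraph above, the sketch below removes from an ASR training manifest every utterance whose audio id also occurs in the validation or test sets. The file names and the tab-separated manifest format are hypothetical, not the task organizers' data layout.

```python
from pathlib import Path

def decontaminate(train_manifest, held_out_manifests, output_manifest):
    """Drop from the ASR training manifest every utterance whose audio id also
    appears in a validation or test manifest (assumed format: one tab-separated
    line per utterance, audio file id in the first column)."""
    held_out_ids = set()
    for manifest in held_out_manifests:
        for line in Path(manifest).read_text(encoding="utf-8").splitlines():
            held_out_ids.add(line.split("\t")[0])

    kept, dropped = [], 0
    for line in Path(train_manifest).read_text(encoding="utf-8").splitlines():
        if line.split("\t")[0] in held_out_ids:
            dropped += 1
        else:
            kept.append(line)

    Path(output_manifest).write_text("\n".join(kept) + "\n", encoding="utf-8")
    print(f"kept {len(kept)} utterances, removed {dropped} contaminated ones")

# Hypothetical usage:
# decontaminate("que_asr_train.tsv", ["que_es_valid.tsv", "que_es_test.tsv"],
#               "que_asr_train.filtered.tsv")
```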
Model | Total params | Trained params | Es-En | Fr-En | Fr-Es | Pt-En | Pt-Es | It-En | It-Es
FAIR at IWSLT 2021 (Tang et al., 2021) | 700M | – | 40.4 | 36.4 | 34.4 | 29.0 | 34.4 | 28.4 | 34.6
FAIR at IWSLT 2021 (Tang et al., 2021), ensemble | 3×700M | – | 42.2 | 38.7 | 36.5 | 31.0 | 38.2 | 29.4 | 37.3
XLS-R + NLLB 1.3B | 317M + 1.38B | 70M | 43.7 | 39.4 | 38.0 | 31.5 | 35.9 | 28.9 | 35.0
XLS-R + NLLB 3.3B | 317M + 3.36B | 115M | 44.0 | 39.9 | 38.3 | 33.1 | 38.1 | 29.3 | 36.9
XLS-R + NLLB 1.3B, ASR + MT cascade | – | – | 41.8 | 35.6 | 34.4 | 29.7 | 35.8 | 29.3 | 35.2

Table 5: Results on the IWSLT 2021 Multilingual task. Es-En, Fr-En, Fr-Es and Pt-En are the training directions; Pt-Es, It-En and It-Es are the zero-shot directions. We report BLEU scores on the IWSLT 2021 test sets. Our NLLB 1.3B and 3.3B models took respectively 34 and 46 h to train on 4 V100 GPUs, while FAIR's models each took 7 days to train on 8 V100 GPUs. Also note that FAIR's models were trained on much larger amounts of data, including data for the "zero-shot" directions (which, in their case, is only zero-shot w.r.t. the in-domain TED data).
In the zero-shot directions, our NLLB 1.3B version performs worse than FAIR's ensemble, which is not surprising since they used training data for the zero-shot language directions (from other datasets), whilst we do not.¹⁵ We find that using the larger NLLB 3.3B model for initialization considerably improves our zero-shot results.

[…] of dimension 256 in the bottom layers and training only those; 4) adding adapters of dimension 256 in the bottom layers and training both those and the convolutional layer. We keep the same training settings as before, except that: we train on Taq-Fr data only; we train only the parameters mentioned above; we validate more often (every 1,000 updates); and we disable checkpoint averaging. Table 6 shows the performance of these four incremental training methods, compared to training on the entire language set from scratch. Even though incremental training does not perform quite as well, it appears to be a viable option that can achieve decent results. Lastly, we highlight that our experiments were limited to these four incremental learning settings (without hyper-parameter search), and that better results may be obtained with other parameter-efficient adaptation methods, or with more regularization.

Model | New params | Taq-Fr
Joint training | 0 | 21.06
Adapters 64 (all) | 6.4M | 17.60
Adapters 256 (all) | 15.9M | 18.18
Adapters 256 (bottom) | 1.6M | 19.24
Conv + Adapters 256 (bottom) | 2.5M | 19.13

Table 6: BLEU scores on the Taq-Fr validation set, when training jointly with IWSLT 2021 and Tamasheq data, versus incremental (2-stage) training. The "New params" column gives the number of Tamasheq-specific parameters added.
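For readers unfamiliar with the adapter variants compared in Table 6, below is a minimal PyTorch sketch of a residual bottleneck adapter (dimension 64 or 256) attached to frozen layers. It is an illustration under stated assumptions, not the code used for the submissions; the helper names are hypothetical.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the only new, trainable parameters when
    adapting a frozen Transformer layer."""

    def __init__(self, model_dim=1024, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(model_dim)
        self.down = nn.Linear(model_dim, bottleneck)
        self.up = nn.Linear(bottleneck, model_dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

def add_adapters_and_freeze(model, layers, model_dim=1024, bottleneck=64):
    """Freeze the whole pre-trained model, then create one adapter per selected
    layer; only the returned adapters would be updated during training.
    `model` and `layers` are placeholders for the MT encoder/decoder layers."""
    for p in model.parameters():
        p.requires_grad = False
    return nn.ModuleList(Adapter(model_dim, bottleneck) for _ in layers)
```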
Table 7: BLEU and chrF results for Taq-{Fr, En, Ko} using contrastive 1 and its variants (models trained without
adapters or with larger adapters), on the IWSLT 2022 Taq-Fr test set or silver-standard Korean and English references
obtained with MT. The last row is a cascade of speech translation followed by text translation (Taq→Fr→X).
[…] representation space, and that the adapters further improve performance by allowing domain adaptation of the MT model (which is hard to do at the very bottom layers). Note that the encoder adapters seem to be the most important ones, which is consistent with the findings of Cooper Stickland et al. (2021) that adapting the encoder is the most effective strategy for domain adaptation. Lastly, we highlight that adapting the MT model directly with MT data (mTEDx's transcriptions and translations) gives even better results (+4.6 BLEU on average), but this cross-modality domain transfer is an interesting by-product of our parameter-efficient approach.

5 Zero-Shot Capabilities

Throughout this paper we have argued that one advantage of the multilingual models we propose is their potential for zero-shot translation, a setting in which a system produces translations in an unseen language pair by leveraging its existing knowledge of both languages. In Section 4.3 we showed that our models are competitive with the best submission to IWSLT 2021 on the three zero-shot high-resource language pairs, despite the fact that these pairs were not truly zero-shot for that system. In this section, we further illustrate the zero-shot capabilities of our models by translating Tamasheq speech in two settings: 1) target language seen during both MT pre-training and ST adaptation (English); 2) target language only seen during MT pre-training (Korean).

Evaluation settings. To score BLEU and chrF¹⁶ in the chosen target languages, we use a commercial translation service to translate the French side of the IWSLT 2022 test set to English and Korean. Note that this is only a silver standard made of synthetic data, and thus the evaluation will inevitably be biased.¹⁷ Our goal is solely to assess whether our systems have some zero-shot ST abilities. We evaluate our Taq-Fr contrastive 1 system, and variants of this system with fewer or larger adapters. We compare with a cascade baseline, in which we first perform Taq-Fr ST, followed by Fr-En or Fr-Ko MT using the text-to-text path from Figure 1. In this setting, the adapters are disabled during MT.

Results. In Table 7, we measure the zero-shot translation capabilities of our approach on this silver-standard test set. We evaluate four models: our contrastive 1 submission presented in Section 4.1, and variants of this model with increased adapter size, adapters only in the encoder, or no adapters. We compare against a cascade baseline that is not zero-shot, which consists in translating the Tamasheq speech into French text and then translating this text into English or Korean.

We observe that, in the case of English, which was seen during ST adaptation, adapters can be helpful (+2 BLEU over the cascade baseline). On the other hand, for Korean, unseen during ST adaptation, systems with adapters in the decoder (first two rows) perform worse, as they likely bring some degree of language confusion. Results are even worse with larger adapters, with over 40% of output sentences being in the wrong language. In this setting, the best results are achieved with only encoder adapters or no adapters at all (-1 BLEU compared to the baseline).

Appendix Table 13 measures the percentage of output sentences in the correct language and the percentage of Hangul versus Latin characters in each system's outputs. We find that models with […]

¹⁶ SacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:X|smooth:exp|version:2.3.1 (En: X=13a, Ko: X=ko-mecab-0.996/ko-0.9.2-KO). chrF signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1
¹⁷ For instance, we observe that these generated translations contain both the Korean transliteration in Hangul of named entities and the original version in the Latin script. This will likely penalize our produced translations during scoring.
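The script statistics reported in Appendix Table 13 can be approximated with a few lines of Python. The sketch below is an assumption about how such ratios could be computed (counting precomposed Hangul syllables versus ASCII letters); it is not the authors' evaluation script, and the example sentence is purely illustrative.

```python
def script_ratios(sentence):
    """Share of Hangul syllables vs. Latin letters among alphabetic characters,
    a rough proxy for whether an output is in the intended target language."""
    hangul = sum(1 for ch in sentence if "\uac00" <= ch <= "\ud7a3")
    latin = sum(1 for ch in sentence if ch.isascii() and ch.isalpha())
    total = hangul + latin
    return (hangul / total, latin / total) if total else (0.0, 0.0)

print(script_ratios("Niamey에서 시작되었습니다."))
# mostly Hangul, with the named entity kept in Latin script
```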
Utterance id: 2016-11-23_id_7
Ref: Chers auditeurs, rappelez-vous que vous écoutez Studio Kalangou en ce moment.
Fr: Chers auditeurs, n'oubliez pas que vous êtes avec le Studio Kalangou.
En: Well, listeners, don't forget that you are with Studio Kalangou right now.
Ko: [Hangul output garbled in the extraction; omitted]

Utterance id: 2016-06-27_id_5
Ref: Les examens du BEPC sont terminés et les corrections ont commencé hier après-midi dans la ville de Niamey.
Fr: Les examens du BEPC sont terminés et sur toute l'étendue du territoire, les travaux de leur suivi ont débuté hier après-midi à Niamey.
En: The BEPC exams are over and throughout the country, the monitoring activities started yesterday afternoon in Niamey.
Ko: [Hangul output garbled in the extraction; omitted]

Utterance id: 2016-10-27_id_39
Ref: D'autres informations que nous apportons aujourd'hui concernent un projet appelé aniamey.com qui informe que l'État du Nigéria a refoulé des Nigériens, au nombre de 53, qui arrivent (), qui habitent dans la ville de Mina sur le territoire du Niger ou Neja.
Fr: D'autres informations que nous apportons aujourd'hui concernent les informations apportées par un programme dénommé Niamey Point Com qui a apporté des informations selon lesquelles le Nigeria a accueilli 53 Nigériens qui habitent la ville de Mena qui se trouve sur le territoire du Niger ou le Niger.
En: Today, we're going to talk about the information about a program called Niamey Point Com, which reports that Nigeria has brought back 53 Nigerians who live in the town of Mena in Niger.
Ko: [Hangul output garbled in the extraction; omitted]

Table 8: Decoding examples for the Taq-Fr, Taq-En and Taq-Ko language pairs, accompanied by the French reference (Ref). The utterance id corresponds to the suffix of the audio files in the IWSLT 2022 test set.
Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2022. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.

Alexandre Berard. 2021. Continual learning in multilingual NMT via language-specific embeddings. In Proceedings of the Sixth Conference on Machine Translation, pages 542–565, Online. Association for Computational Linguistics.

Alexandre Berard, Dain Lee, Stephane Clinchant, Kweonwoo Jung, and Vassilina Nikoulina. 2021. Efficient inference for multilingual neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8563–8583, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Steven Bird. 2011. Bootstrapping the language archive: New prospects for natural language processing in preserving linguistic heritage. Linguistic Issues in Language Technology, 6(4).

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022a. Speech resources in the Tamasheq language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2066–2071, Marseille, France. European Language Resources Association.

Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, and Yannick Estève. 2022b. ON-TRAC consortium systems for the IWSLT 2022 dialect and low-resource speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 308–318, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho. 2018. Siminchik: A speech corpus for preservation of southern quechua. ISI-NLP 2, page 21.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.

Asa Cooper Stickland, Alexandre Berard, and Vassilina Nikoulina. 2021. Multilingual domain adaptation for NMT: Decoupling language and domain information with adapters. In Proceedings of the Sixth Conference on Machine Translation, pages 578–598, Online. Association for Computational Linguistics.

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation. IEEE Journal of Selected Topics in Signal Processing, 16(6):1493–1504.

Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. 2021. Layer-wise analysis of a self-supervised speech representation model. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 914–921. IEEE.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2014. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3935–3939, Reykjavik, Iceland. European Language Resources Association (ELRA).

Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, and Matt Post. 2021. Multilingual TEDx corpus for speech recognition and translation. In Proceedings of Interspeech.
A Appendix
A.1 Hyperparameters
Hyper-parameter | Value
Batch size | 4,000
Data-parallel GPUs | 4
Update freq | 2
Max learning rate | 0.0005
Initial LR | 10⁻⁷
Schedule | inverse square root
Warmup steps | 10,000
Adam betas | 0.9, 0.999
Mixed precision | True
Label smoothing | 0.2
Weight decay | 0.0
Dropout | 0.3†
Checkpoint averaging | 3
Patience | 5
Early stopping metric | BLEU
Beam size | 5

[Figure 2: training loss as a function of training steps for models with 0, 1, 2, or 3 convolutional layers.]
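The learning-rate schedule in the table (maximum learning rate 0.0005, initial learning rate 10⁻⁷, 10,000 warmup steps, inverse square root decay) can be written as a small function. The sketch below follows the usual fairseq-style formulation; the exact implementation details are assumptions, not the authors' code.

```python
import math

def inverse_sqrt_lr(step, max_lr=5e-4, init_lr=1e-7, warmup=10_000):
    """Inverse-square-root schedule with linear warmup: ramp linearly from
    init_lr to max_lr over `warmup` steps, then decay proportionally to
    1 / sqrt(step)."""
    if step < warmup:
        return init_lr + (max_lr - init_lr) * step / warmup
    return max_lr * math.sqrt(warmup) / math.sqrt(step)

for s in (0, 5_000, 10_000, 40_000, 160_000):
    print(s, f"{inverse_sqrt_lr(s):.2e}")
# 0 1.00e-07, 5000 2.50e-04, 10000 5.00e-04, 40000 2.50e-04, 160000 1.25e-04
```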
Task | Source | Target | hours:minutes | # utterances
ASR | French | French | 218:59 | 117,081
ASR | Italian | Italian | 118:39 | 50,895
ASR | Portuguese | Portuguese | 179:33 | 91,257
ASR | Spanish | Spanish | 214:15 | 103,076
ST | French | English | 57:39 | 31,207
ST | French | Spanish | 42:14 | 21,862
ST | French | Portuguese | 26:53 | 14,322
ST | Portuguese | English | 63:13 | 31,868
ST | Spanish | French | 9:34 | 4,568
ST | Spanish | English | 79:37 | 37,168
ST | Spanish | Italian | 11:50 | 5,616
ST | Spanish | Portuguese | 47:01 | 22,012

Table 10: Statistics for all the mTEDx languages (train+valid) seen by our systems for the IWSLT 2021 evaluation setup described in Section 4.3.

Train vocab | Inference vocab | Inference params | Taq-Fr BLEU | Fr-En BLEU | Speed
Full (256k) | Full (256k) | 1.38B | 19.1 | 36.6 | 12.5×
Full (256k) | Filtered (35k) | 1.19B | 18.9 | 35.8 | 13.0×
Filtered (35k) | Filtered (35k) | 1.19B | 20.0 | 35.5 | 13.0×

Table 14: Speech translation performance on the IWSLT 2022 Taq-Fr and mTEDx Fr-En test sets of our contrastive Taq-Fr submission (non-ensemble version of our primary submission) with several vocabulary filtering strategies: no filtering (first row, corresponds to our submission); inference-time filtering (second row); or training-time filtering (third row). See Table 18 for an explanation of the "speed" column.

BLEU scores of our primary and contrastive submissions on the Taq-Fr validation set and the Que-Es validation and test sets (✗ = not applicable):

Submission | Taq-Fr valid | Que-Es valid | Que-Es test
Taq-Fr primary | 26.13 | ✗ | ✗
Taq-Fr contrastive 1 | 24.53 | ✗ | ✗
Taq-Fr contrastive 2 | 22.88 | 20.29 | 17.74
Que-Es primary | 22.88 | 20.29 | 17.74
Que-Es contrastive 1 | 20.81 | 19.03 | 15.67
Que-Es contrastive 2 | 21.31 | 16.78 | 15.25
Que-Es (updated) primary | 22.36 | 16.52 | 15.70
Que-Es (updated) contrastive 1 | 20.97 | 15.15 | 15.55
Que-Es (updated) contrastive 2 | 20.31 | 16.30 | 13.17
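The vocabulary filtering compared in Table 14 amounts to keeping only the embedding rows of tokens that can actually occur in the training or inference data. The sketch below illustrates that idea with toy tensors; it is an assumption for illustration, not the filtering script used for the submission.

```python
import torch

def filter_vocabulary(embeddings, full_vocab, corpora):
    """Keep only the embedding rows of tokens that appear in the given corpora
    (plus special tokens starting with '<'). `full_vocab` maps token -> row index;
    everything here is illustrative."""
    seen = {tok for text in corpora for tok in text.split()}
    keep = [tok for tok in full_vocab if tok in seen or tok.startswith("<")]
    new_index = {tok: i for i, tok in enumerate(keep)}
    rows = torch.tensor([full_vocab[tok] for tok in keep])
    return embeddings[rows], new_index

emb = torch.randn(8, 4)                     # toy 8-token embedding table
vocab = {"<s>": 0, "</s>": 1, "▁le": 2, "▁chat": 3, "▁dort": 4,
         "▁maison": 5, "▁rouge": 6, "▁vite": 7}
small_emb, small_vocab = filter_vocabulary(emb, vocab, ["▁le ▁chat ▁dort"])
print(small_emb.shape, len(small_vocab))    # torch.Size([5, 4]) 5
```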
IWSLT 2023 TED-LIUM v2 mTEDx ASR mTEDx ST
Submission Taq-Fr Que-Es Que-Que En-En Fr-Fr Es-Es It-It Pt-Pt Fr-En Fr-Es Es-Fr Es-En Fr-Pt Pt-En Es-It Es-Pt
Taq-Fr primary ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Taq-Fr contrastive 1 ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Taq-Fr contrastive 2 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es primary ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es contrastive 1 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es contrastive 2 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
IWSLT 2021 setup ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Table 15: Extensive list of datasets used for training (✓) each system presented in this paper.
Model URL
mHuBERT-Tamasheq Unavailable
Tamasheq https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-tamasheq-only
Niger-Mali https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-Niger-Mali
XLSR-53 https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
XLS-R large and xlarge https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr
Table 16: Downloading sources for the speech representation models checkpoints used in our experiments.
Table 17: Training stacked layers (i.e. adding and training new bottom encoder layers) versus fine-tuning the
existing bottom layers; with or without adapters. The other hyper-parameters are identical to our contrastive
submission (underlined scores).
Speech features | MT model | Conv. layers | FT layers | Adapters | Total params | Trained params | Taq-Fr BLEU | Fr-En BLEU | Speed
Tamasheq (layer 11) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 16.8 | 32.5 | 11.6×
Tamasheq (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.3 | 31.6 | 12.0×
mHuBERT-Taq (layer 11) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 16.4 | 37.1 | 12.1×
mHuBERT-Taq (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 16.2 | 36.7 | 12.1×
Niger-Mali (layer 11) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 16.6 | 34.6 | 11.8×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.1 | 36.6 | 12.5×
XLSR-53 (layer 18) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 15.9 | 38.0 | 12.4×
XLS-R L (layer 18) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 16.8 | 39.4 | 12.7×
XLS-R XL (layer 46) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 15.4 | 37.4 | 11.7×
Niger-Mali (layer 8) | mBART (600M) | 1 | 3 | enc+dec (64) | 0.61B | 41M | 16.3 | 28.9 | 22.9×
Niger-Mali (layer 8) | NLLB (600M) | 1 | 3 | enc+dec (64) | 0.62B | 41M | 18.0 | 32.5 | 24.2×
Niger-Mali (layer 8) | NLLB (1.3B) | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.1 | 36.6 | 12.5×
Niger-Mali (layer 8) | NLLB (3.3B) | 1 | 3 | enc+dec (64) | 3.36B | 165M | 19.3 | 37.3 | 4.5×
Niger-Mali (layer 8) | NLLB 1.3B | 3 | 3 | enc+dec (64) | 1.38B | 70M | 18.5 | 33.4 | 25.5×
Niger-Mali (layer 8) | NLLB 1.3B | 2 | 3 | enc+dec (64) | 1.38B | 70M | 19.4 | 35.4 | 19.5×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.1 | 36.6 | 12.5×
Niger-Mali (layer 8) | NLLB 1.3B | 0 | 3 | enc+dec (64) | 1.38B | 70M | 19.6 | 34.4 | 7.1×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 24 | enc+dec (64) | 1.37B | 508M | 16.7 | 30.7 | 11.9×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 4 | enc+dec (64) | 1.38B | 91M | 19.6 | 36.8 | 12.3×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.1 | 36.6 | 12.5×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 2 | enc+dec (64) | 1.38B | 49M | 19.0 | 36.2 | 12.0×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 1 | enc+dec (64) | 1.38B | 28M | 18.2 | 35.1 | 12.0×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 1 | enc (64) | 1.37B | 25M | 19.1 | 34.2 | 12.4×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 1 | none | 1.37B | 22M | 17.5 | 33.3 | 12.6×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (256) | 1.40B | 88M | 18.8 | 35.8 | 12.2×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (128) | 1.38B | 76M | 19.2 | 36.3 | 12.1×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc+dec (64) | 1.38B | 70M | 19.1 | 36.6 | 12.5×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | enc (64) | 1.37B | 67M | 19.3 | 35.7 | 12.7×
Niger-Mali (layer 8) | NLLB 1.3B | 1 | 3 | none | 1.37B | 64M | 18.3 | 35.6 | 13.1×
Table 18: Ablation study on Taq-Fr ST, with various speech feature extractors, pre-trained MT models used for
initialization, and trained parameters. The total parameter counts do not include the parameters of the speech feature
extractors. The BLEU scores reported are on the IWSLT 2022 Taq-Fr and mTEDx Fr-En test sets. The speed metric
is relative to real time (i.e., seconds in the test set divided by seconds spent decoding) and does not include feature
extraction time. It is obtained by decoding the Taq-Fr test set on a single T4 with a batch size of 10 utterances
(averaged over 3 decoding runs). The underlined numbers all correspond to the same model, which is our first
contrastive submission to the task (the non-ensemble version of our primary submission). All of these models are
trained with the same data (see Table 15) and early stopping is done based on Taq-Fr valid BLEU scores. The
numbers inside parentheses in the Adapters column correspond to the bottleneck dimension of the trained adapter
modules. Adapters are not added in the encoder layers that are being fine-tuned. These models took between 15 and
47 h each to train on 4 V100 GPUs, with an average training time of 26 h.
Model | Adapters | Es-En | Fr-En | Fr-Es | Pt-En | Pt-Es | It-En | It-Es (directions as in Table 5)
MT NLLB 3.3B | none | 47.4 | 39.5 | 39.2 | 39.8 | 48.6 | 34.0 | 42.4
MT NLLB 1.3B | none | 47.9 | 38.9 | 39.6 | 39.8 | 48.5 | 33.8 | 41.9
MT NLLB 1.3B | enc+dec | 50.2 | 40.7 | 42.2 | 42.1 | 51.0 | 37.6 | 45.2
MT NLLB 1.3B | enc | 49.9 | 41.3 | 42.6 | 41.9 | 50.6 | 36.5 | 44.9
MT NLLB 1.3B | dec | 48.8 | 39.2 | 41.0 | 41.1 | 49.7 | 35.6 | 43.9
MT NLLB 1.3B (DA) | enc+dec | 51.3 | 43.2 | 45.2 | 44.7 | 53.2 | 37.8 | 47.1

Table 19: Top half: speech translation BLEU scores on the IWSLT 2021 test sets, when deactivating encoder adapters, decoder adapters, or both in an ST model at inference time. The ST model is the same one as in Table 5, trained with encoder and decoder adapters. Bottom half: text-to-text MT BLEU scores when using the ST adapters in the initial model and disabling the ST bottom layers and convolutions. [Only the text-to-text MT rows were recovered here.]
Direct Models for Simultaneous Translation and Automatic Subtitling:
FBK@IWSLT2023
Automatic Subtitling Both the classic encoder-decoder architecture and the triangle architecture are composed of 12 layers of Conformer encoder and 6 layers of Transformer decoder (which is replicated twice in the triangle model). The dimension of the feed-forward layers is 2,048 and d = 512 in the attention. The kernel size of the point- and depth-wise convolutions in the convolutional modules is 31. The dropout was set to 0.1. CTC loss with compression is added with weight 0.5 to the cross entropy loss with label smoothing (0.1 smoothing factor) and optimized with Adam (β1 = 0.9, β2 = 0.98). The source vocabulary is of size 8,000 and the target vocabulary of size 16,000 (<eob> and <eol> included); both are obtained by SentencePiece models. The ST pre-training was done by setting the learning rate to 0.002 with an inverse square-root scheduler and 25,000 warm-up updates. The SubST fine-tuning was done by setting a constant learning rate of 0.001. A second fine-tuning was done with the same settings as (Papi et al., 2022a), but we restored the punctuation of the ASR datasets which do not contain any (i.e., the TEDLIUM corpus (Hernandez et al., 2018)) by using bert-restore-punctuation,³ before machine-translating and segmenting the target texts into subtitles. We trained the standard architecture with 40,000 maximum tokens on 4 NVIDIA A100 GPUs with 40GB of RAM and we set the update frequency to 2. For the triangle architecture, we set maximum tokens to 20,000 to fit the architecture in memory and the update frequency to 4, to hold the same total batch size of 320,000 tokens. Maximum updates were set to 100,000 for both the pre-training and training phases.

3.3 Evaluation Settings

Simultaneous We exploit the SimulEval tool (Ma et al., 2020a). To be comparable with the previous years, all the results except this year's submission are shown for SimulEval v1.0.2, which adopts BLEU (Post, 2018)⁴ to measure translation quality and Average Lagging or AL (Ma et al., 2019) to measure latency. Instead, for this year's submission, we adopt the latest version of SimulEval (1.1.0) with BLEU measured with sacrebleu 2.3.0 and we also report Length-Adaptive Average Lagging or LAAL (Papi et al., 2022c) and Average Token Delay or ATD (Kano et al., 2022) as additional latency metrics. All the evaluations were run on a single NVIDIA K80 with 12GB of RAM, by applying global CMVN to the audio input, whose features were estimated on the MuST-C v2 training set. Computational aware metrics ("_CA") refer to the single NVIDIA K80 setting and also consider the model computational time in the delay calculation.

Automatic Subtitling We adopt the following metrics: SubER-cased (henceforth, SubER) (Wilken et al., 2022) for overall subtitle quality, Sigma (Karakanta et al., 2022) for the subtitle segmentation quality, and BLEU⁵ for translation quality. We also compute the conformity percentage of 42 characters per line (CPL) and 21 characters per second (CPS), or reading speed, as suggested on the track website.⁶ We neglected the conformity computation of the subtitles with more than two lines since our model only produces subtitles with two lines or less, thus always being 100% conformant. Conformity scores are computed by using the script released for the paper (Papi et al., 2022a).⁷ Dev/test audios are segmented with SHAS (Tsiamas et al., 2022). No audio cleaning is applied.

4 Results

4.1 Simultaneous Translation

Since we directly employ an offline model for the simultaneous inference, we show in Table 1 the results of the offline ASR pre-training and ST training. Although the model with 12 encoder layers (row 0) obtains lower, hence better, WER compared to the 16 encoder-layers model (row 1), the highest, hence better, BLEU in ST is achieved by the bigger architecture. The performance is also slightly enhanced by adding the CTC compression (row 3) during training, which is particularly useful also for the SimulST scenario since it speeds up inference (by about 12-15%). Therefore, we select this model for the final submission. Compared to our last year's submission (row 5), our 16 encoder-layers model scores +0.4 BLEU even if, at this time, we have not fine-tuned it on the in-domain (TED talks) datasets.

³ https://huggingface.co/felflare/bert-restore-punctuation
⁴ case:mixed|eff:no|tok:13a|smooth:exp|version:1.5.1
⁵ case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
⁶ https://iwslt.org/2023/subtitling#automatic-evaluation
⁷ Script available at: https://github.com/hlt-mt/FBK-fairseq/blob/master/examples/speech_to_text/scripts/subtitle_compliance.py
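The training objective described above (CTC loss with weight 0.5 added to label-smoothed cross entropy, Adam with β1 = 0.9, β2 = 0.98) can be sketched in a few lines of PyTorch. This is a hedged illustration with toy tensors and vocabulary sizes, not FBK's fairseq implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes: batch of 2, 50 encoder frames, 20 target tokens.
src_vocab, tgt_vocab, blank_id = 8000, 16000, 0
enc_logits = torch.randn(50, 2, src_vocab, requires_grad=True)  # (time, batch, vocab) for CTC
dec_logits = torch.randn(2, 20, tgt_vocab, requires_grad=True)  # decoder output
src_tokens = torch.randint(1, src_vocab, (2, 30))                # transcript used as CTC target
tgt_tokens = torch.randint(0, tgt_vocab, (2, 20))                # translation target

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
ce = nn.CrossEntropyLoss(label_smoothing=0.1)

ctc_loss = ctc(enc_logits.log_softmax(-1), src_tokens,
               input_lengths=torch.full((2,), 50),
               target_lengths=torch.full((2,), 30))
ce_loss = ce(dec_logits.reshape(-1, tgt_vocab), tgt_tokens.reshape(-1))

loss = ce_loss + 0.5 * ctc_loss    # CTC added with weight 0.5, as described above
loss.backward()

optimizer = torch.optim.Adam([enc_logits, dec_logits], lr=2e-3, betas=(0.9, 0.98))
```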
Our model also performs better than the NAIST last year's system (+11.1 BLEU) […] models such as wav2vec 2.0 and mBART50. Compared to last year's cascade model by UPV, we score -1.7 BLEU. This system, however, also outperformed the CUNI-KIT system by 0.7 BLEU points, indicating that a gap between direct and cascade architectures still exists.

id | Model | WER% (↓) | BLEU (↑)
1 | 12 encoder layers | 9.7 | 31.6
2 | 16 encoder layers | 9.9 | 31.9
3 | + CTC compress. | - | 32.1
4 | CUNI-KIT 2022† | - | 33.1
5 | FBK 2022 | - | 31.7
6 | NAIST 2022‡ | - | 21.0
7 | UPV 2022 (Cascade)* | 9.5 | 33.8

[Table 1: offline ASR (WER) and ST (BLEU) results discussed in Section 4.1.]

Figure 1: Comparison between the LA, EDATT, and ALIGNATT policies described in Section 2.1 on MuST-C v2 en→de tst-COMMON. Solid curves represent AL, dashed curves represent AL_CA.
en-de
Model | SubER | BLEU | Sigma | CPL | CPS
(Papi et al., 2022a) | 59.9 | 23.4 | 77.9 | 86.9 | 68.9
Triangle | 60.8 | 22.6 | 74.6 | 84.5 | 67.7

en-es
Model | SubER | BLEU | Sigma | CPL | CPS
[data rows not recovered from the extraction]
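The CPL and CPS columns above are conformity percentages against the 42-characters-per-line and 21-characters-per-second constraints. The following sketch shows one plausible way to compute them; the subtitle representation is hypothetical and this is not the released compliance script.

```python
def conformity(subtitles, max_cpl=42, max_cps=21):
    """Percentage of subtitles meeting the CPL and CPS constraints.
    `subtitles` is a list of (start_sec, end_sec, text) with lines split by <eol>;
    this simplified format is an assumption for illustration."""
    cpl_ok = cps_ok = 0
    for start, end, text in subtitles:
        lines = text.split("<eol>")
        if all(len(line.strip()) <= max_cpl for line in lines):
            cpl_ok += 1
        chars = len(text.replace("<eol>", " ").strip())
        if chars / max(end - start, 1e-6) <= max_cps:
            cps_ok += 1
    n = len(subtitles)
    return 100 * cpl_ok / n, 100 * cps_ok / n

subs = [(0.0, 3.0, "Well, listeners, don't forget<eol>you are with Studio Kalangou.")]
print(conformity(subs))   # (100.0, 100.0)
```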
5 Conclusions

We presented the FBK's systems built to participate in the IWSLT 2023 Evaluation Campaigns for simultaneous speech translation (en-de) and automatic subtitling (en-{de, es}). Our submissions are characterized by the use of direct speech translation models to address both tasks, without any further modification or adaptation for the simultaneous task, and with fine-tuning on subtitle-like translations for the automatic subtitling task. Our SimulST system achieves a lower computational-aware latency with up to 3.5 BLEU gain compared to the last two years' winners. Our automatic subtitling system achieves 3.7 and 1.7 SubER improvement on en-de and en-es respectively, compared to the only solution published in the literature based on a direct system.

Acknowledgements

This work has been supported by the project "AI@TN" funded by the Autonomous Province of Trento, Italy.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023).

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online).

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online).

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana.

Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 1–34, Online.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2873–2887, Online.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on end-to-end learning for speech and audio processing, Barcelona, Spain.

Ondřej Bojar, Dominik Macháček, Sangeet Sagar, Otakar Smrž, Jonáš Kratochvíl, Peter Polák, Ebrahim Ansari, Mohammad Mahmoudi, Rishu Kumar, Dario Franceschini, Chiara Canton, Ivan Simonini, Thai-Son Nguyen, Felix Schneider, Sebastian Stüker, Alex Waibel, Barry Haddow, Rico Sennrich, and Philip Williams. 2021. ELITR multilingual live subtitling: Demo and strategy. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 271–277, Online.
Chih-Chiang Chang and Hung-Yi Lee. 2022. Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation. In Proc. Interspeech 2022, pages 5175–5179.

Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4618–4624, Online.

Marcello Federico, Yogesh Virkar, Robert Enyedi, and Roberto Barra-Chicote. 2020. Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. In Proc. Interspeech 2020, pages 1481–1485.

Ryo Fukuda, Yuka Ko, Yasumasa Kano, Kosuke Doi, Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Nakamura. 2022. NAIST simultaneous speech-to-text translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 286–292, Dublin, Ireland (in-person and online).

Marco Gaido, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2021a. CTC-based compression for direct speech translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 690–696, Online.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2021b. On Knowledge Distillation for Direct Speech Translation. In Proceedings of CLiC-IT 2020, Online.

Marco Gaido, Matteo Negri, and Marco Turchi. 2022a. Direct speech-to-text translation models as students of text-to-text models. Italian Journal of Computational Linguistics.

Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe Fiameni, Matteo Negri, and Marco Turchi. 2022b. Efficient yet competitive speech translation: FBK@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 177–189, Dublin, Ireland (in-person and online).

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 369–376, Pittsburgh, Pennsylvania.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, pages 198–208, Cham. Springer International Publishing.

Javier Iranzo-Sánchez, Javier Jorge Cano, Alejandro Pérez-González-de Martos, Adrián Giménez Pastor, Gonçal Garcés Díaz-Munío, Pau Baquero-Arnal, Joan Albert Silvestre-Cerdà, Jorge Civera Saiz, Albert Sanchis, and Alfons Juan. 2022. MLLP-VRAIN UPV systems for the IWSLT 2022 simultaneous speech translation and speech-to-speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 255–264, Dublin, Ireland (in-person and online).

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Average token delay: A latency metric for simultaneous translation. arXiv preprint arXiv:2211.13173.

Alina Karakanta, François Buet, Mauro Cettolo, and François Yvon. 2022. Evaluating Subtitle Segmentation for End-to-end Generation Systems. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), pages 3069–3078, Marseilles, France.

Alina Karakanta, Marco Gaido, Matteo Negri, and Marco Turchi. 2021a. Between flexibility and consistency: Joint generation of captions and subtitles. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 215–225, Bangkok, Thailand (online).

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020a. Is 42 the answer to everything in subtitling-oriented speech translation? In Proceedings of the 17th International Conference on Spoken Language Translation, pages 209–219, Online.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020b. MuST-Cinema: a speech-to-subtitles corpus. In Proc. of the 12th Language Resources and Evaluation Conference, pages 3727–3734, Marseille, France.
Alina Karakanta, Sara Papi, Matteo Negri, and Marco Turchi. 2021b. Simultaneous speech translation for live subtitling: from delay to display. In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW), pages 35–48, Virtual.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen, and Jörg Tiedemann. 2020. MT for subtitling: User evaluation of post-editing productivity. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 115–124, Lisboa, Portugal.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.

Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, and Lirong Dai. 2021a. The USTC-NELSLIP systems for simultaneous speech translation task at IWSLT 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 30–38, Bangkok, Thailand (online).

Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. 2021b. Cross attention augmented transducer networks for simultaneous translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 39–55, Online and Punta Cana, Dominican Republic.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy.

Maite Melero, Antoni Oliver, and Toni Badia. 2006. Automatic Multilingual Subtitling in the eTITLE Project. In Proceedings of ASLIB Translating and the Computer 28.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021. An empirical study of end-to-end simultaneous speech translation decoding strategies. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7528–7532. IEEE.

Alp Öktem, Mireia Farrús, and Antonio Bonafonte. 2019. Prosodic phrase alignment for machine dubbing. ArXiv, abs/1908.07226.

Sara Papi, Marco Gaido, Alina Karakanta, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2022a. Direct speech translation for automatic subtitling. arXiv preprint arXiv:2209.13192.

Sara Papi, Marco Gaido, Matteo Negri, and Andrea Pilzer. 2023a. Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP. arXiv preprint arXiv:2303.16166.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2021. Dealing with training and test segmentation mismatch: FBK@IWSLT2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 84–91, Bangkok, Thailand (online).

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022b. Does simultaneous speech translation need simultaneous models? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 141–153, Abu Dhabi, United Arab Emirates.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022c. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12–17, Online.

Sara Papi, Alina Karakanta, Matteo Negri, and Marco Turchi. 2022d. Dodging the data bottleneck: Automatic subtitling with automatically segmented ST corpora. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 480–487, Online only.
Sara Papi, Matteo Negri, and Marco Turchi. 2022e. Attention as a guide for simultaneous speech translation. arXiv preprint arXiv:2212.07850.

Sara Papi, Matteo Negri, and Marco Turchi. 2023b. AlignAtt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation. In Proc. of Interspeech 2023, Dublin, Ireland.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech 2019, pages 2613–2617.

Stelios Piperidis, Iason Demiros, Prokopis Prokopidis, Peter Vanroose, Anja Hoethker, Walter Daelemans, Elsa Sklavounou, Manos Konstantinou, and Yannis Karavidas. 2004. Multimodal, multilingual resources in the subtitling process. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online).

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020. fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, pages 2625–2629, Stockholm, Sweden.

Patrick Wilken, Panayota Georgakopoulou, and Evgeny Matusov. 2022. SubER - a metric for automatic evaluation of subtitle quality. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-person and online).

Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2461–2474, Online.

Shaolei Zhang and Yang Feng. 2022. Information-transport-based policy for simultaneous translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 992–1013, Abu Dhabi, United Arab Emirates.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020. Simultaneous translation policies: From fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2847–2853, Online.
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
Dominik Macháček¹, Ondřej Bojar¹, and Raj Dabre²
[…]els of speech translation quality, BLEU, chrF2, BERTScore and COMET can be used for reliable assessment of human judgement of SST quality at least on the level of test sets. chrF2, BERTScore and COMET are reliable also at the document level.

Translation vs Interpreting Reference There is an open question whether SST should rather mimic offline translation, or simultaneous interpreting. As […]

[…] sentences in the document to one single sequence, and then apply the metric on it, as if it was one sentence. mWERSegmenter is a tool for aligning translation candidates to references when their sentence segmentation differs. It finds the alignment with the minimum WER when comparing tokens in aligned segments. For translation, we also apply the default sentence alignment (SENT).
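To make the contrast concrete, the sketch below scores the same toy document once with the default per-sentence alignment and once as a single concatenated sequence, using sacreBLEU's Python API. It is an illustration of the single-sequence idea described above, not the authors' evaluation pipeline, and the example sentences are invented.

```python
import sacrebleu

hyps = ["the exams are over", "marking started yesterday in Niamey"]
refs = ["the exams have finished", "corrections began yesterday afternoon in Niamey"]

# Default sentence alignment: candidates and references are scored pair by pair.
sent_chrf = sacrebleu.corpus_chrf(hyps, [refs])

# Single-sequence scoring: join the whole document into one sequence first,
# so the metric sees complete "sentences" even when SST segmentation differs.
single_hyp = [" ".join(hyps)]
single_ref = [[" ".join(refs)]]
doc_chrf = sacrebleu.corpus_chrf(single_hyp, single_ref)

print(sent_chrf.score, doc_chrf.score)
```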
In Table 2, we report the correlations of metric, reference and alignment variants and their significance, with more details in Appendix D.

4.1 Recommendations

Taking CR as the golden truth of human quality, we make the following recommendations of the most correlating metric, reference and sentence alignment method for SST evaluation.

Which metric? COMET, because it correlates significantly better with CR than BERTScore does. From the fallback options, chrF2 should be slightly preferred over BLEU.

Which reference? The metrics give significantly higher correlations with CR with translations than with interpreting as a reference. The difference between the translation reference and two references (TRANSL+INTP) is insignificant. Therefore, we recommend translation as a reference for SST.

Which alignment method? With an unaligned reference, COMET and BERTScore correlate significantly more with SINGLESEQ than with MWER, probably because the neural metrics are trained on full, complete sentences, which are often split into multiple segments by mWERSegmenter. chrF2 correlates insignificantly better with MWER than with SINGLESEQ.

5 Conclusion

We found correlation of offline MT metrics to human judgements of simultaneous speech translation. The most correlating and thus preferred metric is COMET, followed by BERTScore and chrF2. We recommend a text translation reference over interpreting, and single-sequence alignment for neural metrics and mWERSegmenter for n-gram metrics.

Furthermore, we used only one example of human interpreting. A precise in-depth study of human interpretations is needed to re-assess the recommendation of translation or interpreting as reference in SST.

Acknowledgements

We are thankful to Dávid Javorský and Peter Polák for their reviews. This research was partially supported by the grants 19-26934X (NEUREM3) of the Czech Science Foundation, SVV project number 260 698, and 398120 of the Grant Agency of Charles University.

References

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? CoRR, abs/1606.02012.
… Spoken Language Translation (IWSLT 2022), pages 286–292, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe Fiameni, Matteo Negri, and Marco Turchi. 2022. Efficient yet competitive speech translation: FBK@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 177–189, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado. Association for Computational Linguistics.

Javier Iranzo-Sánchez, Javier Jorge Cano, Alejandro Pérez-González-de Martos, Adrián Giménez Pastor, Gonçal Garcés Díaz-Munío, Pau Baquero-Arnal, Joan Albert Silvestre-Cerdà, Jorge Civera Saiz, Albert Sanchis, and Alfons Juan. 2022. MLLP-VRAIN UPV systems for the IWSLT 2022 simultaneous speech translation and speech-to-speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 255–264, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Dávid Javorský, Dominik Macháček, and Ondřej Bojar. 2022. Continuous rating as reliable human evaluation of simultaneous speech translation. In Proceedings of the Seventh Conference on Machine Translation, pages 154–164, Abu Dhabi. Association for Computational Linguistics.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Dominik Macháček and Ondřej Bojar. 2020. Presenting simultaneous translation in limited space. In Proceedings of the 20th Conference Information Technologies – Applications and Theory (ITAT 2020), Hotel Tyrapol, Oravská Lesná, Slovakia, September 18-22, 2020, volume 2718 of CEUR Workshop Proceedings, pages 34–39. CEUR-WS.org.

Dominik Macháček, Jonáš Kratochvíl, Tereza Vojtěchová, and Ondřej Bojar. 2019. A speech test set of practice business presentations with additional relevant texts. In Statistical Language and Speech Processing, pages 151–161, Cham. Springer International Publishing.

Dominik Macháček, Matúš Žilinec, and Ondřej Bojar. 2021. Lost in Interpreting: Speech Translation from Source or Interpreter? In Proc. Interspeech 2021, pages 2376–2380.

Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Unbabel’s participation in the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics.

Elizabeth Salesky, Marcello Federico, and Marta Costa-jussà, editors. 2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). Association for Computational Linguistics, Dublin, Ireland (in-person and online).

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. In Proc. Interspeech 2022, pages 106–110.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC’s simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 247–254, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

A Highlights of IWSLT22 Findings

The Findings of IWSLT22 (Anastasopoulos et al., 2022) are available in PDF. The most up-to-date version (version 2) is 61 pages long.2 We highlight the relevant parts of the Findings with page numbers in Table 3 so that we can refer to them easily.

Note that the Findings are a part of the conference proceedings (Salesky et al., 2022) as a chapter in a book. The order of the Findings pages in the PDF does not match the page numbers at the footers.

Also note that in Section 2.4 on page 4 (in PDF, 101 in Proceedings), there is a description of MLLP-VRAIN which corresponds to the system denoted as UPV in all other tables and figures.

B Metric Signatures

We found two definitions that can yield different results in certain situations: (1) The rating (as clicked by the evaluator) is valid at the instant time point when the evaluator clicked the rating button. The final score is the average of all clicks, each click having equal weight. We denote this interpretation as CR.

(2) The rating is assigned to the time interval from the click time to the next click, or between the last click and the end of the document. The length of the interval is considered in averaging. The final score is the average of ratings weighted by the lengths of the intervals during which they are valid. We denote this interpretation as CRi.3

To express them rigorously, let us have a document of duration T, and n ratings (r_i, t_i), where i ∈ {1, ..., n} is an index, r_i ∈ {1, ..., 4} is the rated value and 0 ≤ t_1 < ... < t_n ≤ T are the times when the ratings were recorded. Then, the definitions are as follows:

\[ \mathrm{CR} = \frac{1}{n} \sum_{i=1}^{n} r_i \]
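The corresponding formula for CRi is not legible in this extraction. One way to write the weighted-interval average described above, under the assumption that the span before the first click is not counted, is:

\[ \mathrm{CRi} = \frac{\sum_{i=1}^{n-1} r_i \,(t_{i+1} - t_i) \;+\; r_n \,(T - t_n)}{T - t_1} \]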
Marker                         | PDF page | Numbered page | Description
Section 2                      | 3-5      | 100-102       | Simultaneous Speech Translation Task
Figure 1                       | 6        | 103           | Quality-latency trade-off curves
Section 2.6.1                  | 5        | 102           | Description of human evaluation
Figure 5                       | 8        | 105           | Manual scores vs BLEU (plot)
Two Test Sets (paragraph)      | 39       | 136           | Non-Native subset
Test data (paragraph)          | 9        | 106           | Common (native) subset of test data
Automatic Evaluation Results   | 44       | 141           | Latency and BLEU results (table)
A1.1 (appendix)                | 38-39    | 135-136       | Details on human evaluation
Table 17                       | 48       | 145           | Test subsets duration
Table 18                       | 48       | 145           | Manual scores and BLEU (table)

Table 3: Relevant parts of the IWSLT22 Findings, with PDF and proceedings page numbers.
Figure 2: Relation between weighted interval averaging of continuous rating (CRi, y-axis) and average of all ratings (CR, x-axis) for each annotation of each document (blue data points).
Figure 3: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on both subsets from most correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients to CR.
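For orientation only: the diagonal entries of Figures 3–5 are Pearson correlations between a metric variant and CR over documents. A minimal sketch of how such a correlation can be computed is given below; the per-document values shown are dummy placeholders (the paper's actual data is not reproduced here), and the significance test for differences between correlations is not part of this snippet.

```python
# Minimal sketch: Pearson correlation of one metric variant with CR.
# The score lists are hypothetical dummy values, not the paper's data.
from scipy.stats import pearsonr

comet_scores = [0.62, 0.71, 0.55, 0.80, 0.47]   # metric score per document (dummy)
cr_scores    = [2.9, 3.4, 2.5, 3.8, 2.2]        # continuous rating per document (dummy)

r, p_value = pearsonr(comet_scores, cr_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```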
Figure 4: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on the Common subset from most correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients to CR.
Figure 5: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on the Non-Native subset from most correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients to CR.
Improving Neural Machine Translation Formality Control with Domain
Adaptation and Reranking-based Transductive Learning
Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen,
Zhiqiang Rao, Zhengzhe Yu, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei,
Jiawei Zheng, Ming Zhu, Lizhi Lei, Hao Yang, Yanfei Jiang
Huawei Translation Service Center, Beijing, China
{wuzhanglin2,lizongyao,weidaimeng,shanghengchao,guojiaxin1,chenxiaoyu35,
raozhiqiang,yuzhengzhe,yangjinlong7,lishaojun18,xieyuhao2,weibin29,
zhengjiawei15,zhuming47,leilizhi,yanghao30,jiangyanfei}@huawei.com
Abstract

This paper presents Huawei Translation Service Center (HW-TSC)’s submission to the IWSLT 2023 formality control task, which provides two training scenarios, supervised and zero-shot, each containing two language pairs, and sets constrained and unconstrained conditions. We train the formality control models for these four language pairs under these two conditions respectively, and submit the corresponding translation results. Our efforts are divided into two fronts: enhancing general translation quality and improving formality control capability. According to the different requirements of the formality control task, we use a multi-stage pre-training method to train a bilingual or multilingual neural machine translation (NMT) model as the basic model, which can improve the general translation quality of the base model to a relatively high level. Then, while affecting the general translation quality of the basic model as little as possible, we adopt domain adaptation and reranking-based transductive learning methods to improve the formality control capability of the model.

1 Introduction

Machine translation (MT) (Lopez, 2008; Vaswani et al., 2017) models typically return one single translation for each input sentence. This means that when the input sentence is ambiguous, the MT model must choose a translation from among various valid options, without regard to the intended use case or target audience. Therefore, there is a need to control certain attributes (Schioppa et al., 2021) of the text generated in a target language, such as politeness (Sennrich et al., 2016a; Feely et al., 2019) or formality (Niu et al., 2017, 2018; Viswanathan et al., 2020).

The lack of gold translations with alternate formality for supervised training and evaluation has led researchers to rely on synthetic supervision training and manual evaluation in past work (Niu and Carpuat, 2020). Fortunately, the IWSLT formality control task now provides a new benchmark1 (Nădejde et al., 2022; Agarwal et al., 2023) by contributing high-quality training datasets and test datasets for multiple language pairs.

This paper presents HW-TSC’s submission to the IWSLT 2023 formality control task. How formality distinctions are expressed grammatically and lexically can vary widely by language. Thus, we participate in the formality control task for all four language pairs to investigate a general formality control method that can be applied to different language pairs. In addition, we also investigate the difference in formality control between constrained and unconstrained conditions by introducing the mBART model (Liu et al., 2020) under the unconstrained condition.

2 Data

2.1 Pre-training Data

We use the CCMatrix2 and OpenSubtitles3 bilingual data given by the organizers to train an NMT model from scratch or fine-tune the mBART model as the general basic model. The bilingual data size of each language pair is shown in Table 1.

Language pair   CCMatrix   OpenSubtitles
EN-KO           19.4M      1.4M
EN-VI           50.1M      3.5M
EN-PT           173.7M     33.2M
EN-RU           139.9M     25.9M

Table 1: The bilingual data size of each language pair.

1 https://github.com/amazon-science/contrastive-controlled-mt
2 https://opus.nlpl.eu/CCMatrix.php
3 https://opus.nlpl.eu/OpenSubtitles-v2018.php
In order to achieve a better training effect, we also use several data pre-processing methods to clean the bilingual data: we remove duplicate data, use Moses4 to normalize punctuation, filter extremely long sentences, use langid5 (Lui and Baldwin, 2011, 2012) to filter sentences that do not meet the language requirements, and use fast-align6 (Dyer et al., 2013) to filter unaligned sentence pairs.

2.2 Formality-annotated Data

The formality-annotated data is provided by the organizers, and the data size of each language pair is shown in Table 2.

Setting      Language pair   Train   Test
Supervised   EN-KO           400     597
Supervised   EN-VI           400     598
Zero-shot    EN-PT           0       599
Zero-shot    EN-RU           0       600

Table 2: The formality-annotated data size of each language pair.

For supervised language pairs, we split the formality-annotated training data into a train set and a dev set with a ratio of 3:1, and use the formality-annotated train set and a small amount of bilingual data for formality control training. For zero-shot language pairs, we use the formality-annotated train sets from the other two supervised language pairs for formality control training.

3 Model

3.1 Constrained Model

Transformer (Vaswani et al., 2017) is the state-of-the-art model in recent machine translation evaluations. Two lines of research improve this kind of model: the first uses wider networks (e.g., Transformer-Big (Vaswani et al., 2017)), and the other uses deeper representations (e.g., Deep Transformer (Wang et al., 2019; Wu et al., 2022; Wei et al., 2022)). Under the constrained conditions, we combine these two improvements, adopt the Deep Transformer-Big model structure, and train a one-to-many multilingual NMT model (Johnson et al., 2017; Zhang et al., 2020) from scratch using the bilingual data of the four language pairs provided by the organizers. Deep Transformer-Big features pre-layer-normalization, a 25-layer encoder, a 6-layer decoder, 16-head self-attention, 1024-dimensional embeddings and a 4096-dimensional FFN.

3.2 Unconstrained Model

Recently, multilingual denoising pre-training (Liu et al., 2020; Tang et al., 2021) has produced significant performance gains across a wide variety of machine translation tasks. As the earliest sequence-to-sequence model using multilingual denoising pre-training, mBART (Liu et al., 2020) has also achieved good results in various machine translation-related tasks. Under unconstrained conditions, we use the mBART50 1n model7 as the initial model for the unconstrained formality control task. The mBART50 1n model adopts the Transformer structure, which features a 12-layer encoder, a 12-layer decoder, 16-head self-attention, 1024-dimensional embeddings and a 4096-dimensional FFN, with an additional layer-normalization layer (Xu et al., 2019) on top of both the encoder and decoder.

4 Method

In our implementation, we first use a multi-stage pre-training method to train a general NMT model with relatively high translation quality. Then, we use a domain adaptation method to fine-tune the NMT model so that the model gains basic formality control capability. Finally, we use the reranking-based transductive learning (RTL) method to further improve the formality control capability of the model.

4.1 Multi-stage Pre-training

There are four different types of formality control tasks: the constrained supervised task, constrained zero-shot task, unconstrained supervised task, and unconstrained zero-shot task. For these four tasks, we formulate different pre-training strategies and collectively refer to these strategies as the multi-stage pre-training method.

4 https://github.com/moses-smt/mosesdecoder
5 https://github.com/saffsd/langid.py
6 https://github.com/clab/fast_align
7 https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.1n.tar.gz
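One-to-many multilingual models of the kind described in Sections 3.1 and 4.1 are commonly trained by prepending a target-language token to each source sentence (Johnson et al., 2017). The snippet below is an illustrative sketch only, not the authors' actual preprocessing; the token strings are assumptions.

```python
# Illustrative sketch (not the authors' code): prepend a target-language token
# to each source sentence for one-to-many multilingual NMT training,
# in the spirit of Johnson et al. (2017). Token format is an assumption.
LANG_TOKENS = {"ko": "<2ko>", "vi": "<2vi>", "pt": "<2pt>", "ru": "<2ru>"}

def tag_source(src_sentence: str, tgt_lang: str) -> str:
    """Return the source sentence with the target-language token prepended."""
    return f"{LANG_TOKENS[tgt_lang]} {src_sentence}"

print(tag_source("How are you today?", "ko"))   # -> "<2ko> How are you today?"
```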
Under the constrained condition, we adopt the Deep Transformer-Big model structure and use the bilingual data of all four language pairs to train a one-to-many multilingual NMT model from scratch, which is used as the basic model for the constrained zero-shot task. For the constrained supervised task, we use the bilingual data of this task to further pre-train the multilingual NMT model and obtain a bilingual NMT model as the basic model.

Under the unconstrained condition, we further pre-train the mBART50 1n model using bilingual data from all four language pairs as the basic model for the unconstrained zero-shot task. For the unconstrained supervised task, we use the bilingual data of this task to further pre-train the pre-trained model, and use the final pre-trained bilingual model as the basic model.

[…] and the formality phrases from the formality-annotated training data for reranking. The implementation details are shown in Algorithm 1. For the zero-shot task, due to the lack of formality-annotated training data, we just use a reference-free formality classifier for reranking. The formality classifier under the constrained condition comes from self-training (Axelrod et al., 2011), while the formality classifier under the unconstrained condition comes from the organizer8 (Briakou et al., 2021).
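Algorithm 1 itself is not reproduced in this extraction. Purely as a sketch of the reranking idea behind RTL (n-best candidates rescored with a formality signal, with the best candidate kept, e.g., as a pseudo-target for further fine-tuning), and with formality_score standing in as a hypothetical classifier rather than the authors' actual one:

```python
# Sketch only: rerank n-best translation candidates with a formality signal,
# as in reranking-based transductive learning. `formality_score` is a
# hypothetical stand-in for the formality classifier described in the text.
from typing import Callable, List, Tuple

def rerank(nbest: List[Tuple[str, float]],
           formality_score: Callable[[str], float],
           want_formal: bool,
           alpha: float = 1.0) -> str:
    """Pick the candidate maximizing model score plus a weighted formality score.

    nbest: list of (candidate_translation, model_log_prob).
    want_formal: True to prefer formal candidates, False for informal ones.
    """
    sign = 1.0 if want_formal else -1.0
    best = max(nbest, key=lambda c: c[1] + alpha * sign * formality_score(c[0]))
    return best[0]  # selected candidate, usable as a pseudo-reference
```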
EN-VI                        | To Formal: M-Acc, C-F, BLEU, COMET   | To Informal: M-Acc, C-F, BLEU, COMET | Flores: BLEU, COMET
AWS-baseline                 | 99.40%, 99.16%, 43.2, 0.6189         | 98.10%, 98.49%, 41.5, 0.6021         | -, -
Multilingual pre-training    | 10.86%, 1.67%, 25.6, 0.2023          | 89.14%, 98.33%, 30.0, 0.2873         | 42.3, 0.6653
+ Bilingual pre-training     | 8.80%, 3.01%, 24.8, 0.1782           | 91.20%, 96.99%, 28.9, 0.2630         | 42.4, 0.6706
+ Domain adaptation          | 98.17%, 97.83%, 49.1, 0.7248         | 99.37%, 99.83%, 48.0, 0.6952         | 41.3, 0.6576
+ RTL                        | 99.59%, 100.00%, 49.5, 0.7296        | 99.38%, 100.00%, 48.1, 0.7034        | 41.7, 0.6614
+ Iterative RTL              | 100.00%, 99.83%, 51.3, 0.7522        | 100.00%, 100.00%, 49.8, 0.7209       | 41.8, 0.6730
UMD-baseline                 | 96.00%, 99.67%, 26.7, 0.3629         | 96.00%, 98.16%, 25.3, 0.3452         | -, -
mBART50 1n                   | 3.82%, 1.51%, 26.7, 0.3516           | 96.18%, 98.49%, 31.0, 0.4426         | 34.7, 0.6040
+ Multilingual pre-training  | 9.44%, 1.84%, 25.4, 0.2089           | 90.56%, 98.16%, 29.9, 0.2975         | 42.2, 0.6673
+ Bilingual pre-training     | 12.20%, 2.51%, 25.2, 0.1579          | 87.80%, 97.49%, 29.4, 0.2445         | 42.4, 0.6698
+ Domain adaptation          | 99.02%, 99.50%, 47.8, 0.7181         | 99.36%, 100.00%, 47.4, 0.6930        | 43.2, 0.6916
+ RTL                        | 99.22%, 100.00%, 47.7, 0.7190        | 99.16%, 100.00%, 47.8, 0.7053        | 43.4, 0.7033
+ Iterative RTL              | 100.00%, 100.00%, 48.2, 0.7214       | 100.00%, 100.00%, 48.3, 0.7102       | 43.4, 0.6983

Table 3: The overall translation quality and formality control accuracy of EN-VI models.

Table 4: The overall translation quality and formality control accuracy of EN-KO models.
process as the iterative RTL method.

5 Experiments

5.1 Training Details

We use the PyTorch-based Fairseq framework9 (Ott et al., 2019) to pre-train or fine-tune the NMT models, and use the Adam optimizer (Kingma and Ba, 2014) with parameters β1=0.9 and β2=0.98. During the multi-stage pre-training phase, each model uses 8 GPUs for training, the warm-up steps are 4000, the batch size is 4096, the learning rate is 5 × 10−4, the label smoothing rate (Szegedy et al., 2016) is 0.1, and the dropout is 0.1. In the domain adaptation and RTL phases, each model only uses 1 GPU for training without warm-up, the batch size is 1024, the learning rate is 3 × 10−5, the label smoothing rate is 0.1, and the dropout is 0.3.

5.2 Evaluation Metrics

We evaluate the translation results of the formality control models along the following two dimensions:

• We use SacreBLEU v2.0.0 10 (Papineni et al., 2002; Post, 2018) and COMET (eamt22-cometinho-da)11 (Rei et al., 2022) to evaluate the overall translation quality of the formality control model on the official formality test sets and the FLORES-200 devtest sets12 (Goyal et al., 2022) (a minimal example of this computation follows after this list).

• We also use the reference-based corpus-level automatic metric Matched-Accuracy (M-Acc) and the reference-free automatic metric (C-F) that uses a multilingual formality classifier provided by the organizer to evaluate the formality control accuracy of the model on the official formality test sets, respectively.

9 https://github.com/facebookresearch/fairseq
10 https://github.com/mjpost/sacrebleu
11 https://github.com/Unbabel/COMET
12 https://github.com/facebookresearch/flores/tree/main/flores200
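As a rough illustration of the quality evaluation in Section 5.2, assuming the sacrebleu and unbabel-comet Python packages are installed; the exact COMET checkpoint identifier can differ between package versions, and the example sentences are dummy values, not the authors' data.

```python
# Sketch: corpus-level BLEU with sacrebleu and COMET scoring with unbabel-comet.
# Example data and the checkpoint identifier are assumptions, not the authors' setup.
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["this is an example sentence .", "another example ."]
hyps = ["dies ist ein Beispielsatz .", "noch ein Beispiel ."]
refs = ["dies ist ein Beispielsatz .", "ein weiteres Beispiel ."]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.1f}")

model = load_from_checkpoint(download_model("Unbabel/eamt22-cometinho-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
out = model.predict(data, batch_size=8, gpus=0)   # CPU; result layout varies by comet version
print("COMET system score:", out.system_score)
```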
EN-RU                        | To Formal: M-Acc, C-F, BLEU, COMET   | To Informal: M-Acc, C-F, BLEU, COMET | Flores: BLEU, COMET
Multilingual pre-training    | 99.27%, 67.83%, 29.7, 0.4265         | 0.73%, 32.17%, 23.7, 0.3869          | 32.2, 0.7790
+ Domain adaptation          | 99.71%, 90.67%, 33.8, 0.5977         | 85.49%, 70.67%, 31.2, 0.5333         | 27.8, 0.7040
+ RTL                        | 99.74%, 100.00%, 34.5, 0.6155        | 97.14%, 100.00%, 33.4, 0.6019        | 29.4, 0.7261
+ Iterative RTL              | 100.00%, 100.00%, 36.5, 0.6472       | 100.00%, 100.00%, 35.6, 0.6442       | 29.0, 0.7153
UMD-baseline                 | 96.20%, 92.00%, 22.0, 0.3492         | 84.10%, 85.17%, 21.6, 0.3475         | -, -
mBART50 1n                   | 100.00%, 91.67%, 25.6, 0.2916        | 0.00%, 8.33%, 19.3, 0.2351           | 25.0, 0.5950
+ Multilingual pre-training  | 98.15%, 67.00%, 28.9, 0.4263         | 1.85%, 33.00%, 23.1, 0.3904          | 32.1, 0.7638
+ Domain adaptation          | 99.49%, 98.17%, 31.8, 0.5336         | 99.73%, 99.83%, 30.8, 0.5214         | 30.7, 0.7386
+ RTL                        | 98.76%, 100.00%, 32.3, 0.5575        | 99.73%, 99.83%, 31.6, 0.5363         | 30.9, 0.7417
+ Iterative RTL              | 100.00%, 100.00%, 33.7, 0.5804       | 100.00%, 99.83%, 32.4, 0.5558        | 31.0, 0.7521

Table 5: The overall translation quality and formality control accuracy of EN-RU models.

Table 6: The overall translation quality and formality control accuracy of EN-PT models.
5.3 Evaluation Results

Based on the above evaluation metrics, we evaluate the formality control models trained at different phases for each language pair under constrained and unconstrained conditions, and compare them with the constrained baseline (AWS-baseline) (Nădejde et al., 2022) and the unconstrained baseline (UMD-baseline) (Lin et al., 2022) provided by the organizers.

5.3.1 EN-VI & EN-KO

The formality control task for the EN-VI and EN-KO language pairs is supervised, and we adopt the same training methods on these two language pairs. Table 3 and Table 4 show the evaluation results of the models trained at different phases for these two language pairs. From the experimental results, the multi-stage pre-training method can improve the translation quality of the model on the FLORES-200 devtest sets, while the domain adaptation and RTL methods are effective in improving the formality control capability of the model. Besides, the domain adaptation and RTL methods have relatively little impact on the general translation quality of the model on the FLORES-200 devtest sets. Finally, we submit the Iterative RTL model as the primary system.

5.3.2 EN-RU & EN-PT

The formality control tasks for the EN-RU and EN-PT language pairs are zero-shot, and we only use one-stage pre-training on these two tasks. Table 5 and Table 6 show the evaluation results of the models trained in different phases for these two language pairs. The experimental results show that the domain adaptation and RTL methods are still effective in improving the zero-shot formality control capability of the multilingual model. Finally, we again submit the Iterative RTL model as the primary system.

6 Conclusions

This paper presents HW-TSC’s submission to the IWSLT 2023 formality control task, in which we participate in both constrained and unconstrained tasks for all four language pairs. For the formality control task, we use a multi-stage pre-training method to improve the general translation quality of the basic model. We also adopt domain adaptation and RTL methods to improve the model’s formality control capability. Experimental results show that these methods are extremely effective, but how to improve general translation quality more effectively and achieve formality control with fewer training resources is still worthy of further research.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu
Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362.

Eleftheria Briakou, Sweta Agrawal, Joel Tetreault, and Marine Carpuat. 2021. Evaluating the evaluation metrics for style transfer: A case study in multilingual formality transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1321–1336.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391.

Zi-Yi Dou, Xinyi Wang, Junjie Hu, and Graham Neubig. 2019. Domain differential adaptation for neural machine translation. EMNLP-IJCNLP 2019, page 59.

Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Weston Feely, Eva Hasler, and Adrià de Gispert. 2019. Controlling Japanese honorifics in English-to-Japanese neural machine translation. In Proceedings of the 6th Workshop on Asian Translation, pages 45–53, Hong Kong, China. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ann Lee, Michael Auli, and Marc’Aurelio Ranzato. 2021. Discriminative reranking for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7250–7264.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys (CSUR), 40(3):1–49.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.

Xing Niu and Marine Carpuat. 2020. Controlling neural machine translation formality with synthetic supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8568–8575.

Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
Xing Niu, Sudha Rao, and Marine Carpuat. 2018. Multi-task neural models for translating between styles within and across languages. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1008–1021, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for Contrastive Controlled MT with application to formality. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, USA. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Ana C Farinha, José G.C. de Souza, Pedro G. Ramos, André F.T. Martins, Luisa Coheur, and Alon Lavie. 2022. Searching for COMETINHO: The little metric that could. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 61–70, Ghent, Belgium. European Association for Machine Translation.

Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676–6696, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.

Weiwei Shi, Yihong Gong, Chris Ding, Zhiheng Ma, Xiaoyu Tao, and Nanning Zheng. 2018. Transductive semi-supervised deep learning using min-max features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 299–315.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Aditi Viswanathan, Varden Wang, and Antonina Kononova. 2020. Controlling formality and style of machine translation output using AutoML. In Information Management and Big Data: 6th International Conference, SIMBig 2019, Lima, Peru, August 21–23, 2019, Proceedings 6, pages 306–313. Springer.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822.

Daimeng Wei, Zhiqiang Rao, Zhanglin Wu, Shaojun Li, Yuanchang Luo, Yuhao Xie, Xiaoyu Chen, Hengchao Shang, Zongyao Li, Zhengzhe Yu, et al. 2022. HW-TSC’s submissions to the WMT 2022 general machine translation shared task. In Proceedings of the Seventh Conference on Machine Translation, Online. Association for Computational Linguistics.

Zhanglin Wu, Jinlong Yang, Zhiqiang Rao, Zhengzhe Yu, Daimeng Wei, Xiaoyu Chen, Zongyao Li, Hengchao Shang, Shaojun Li, Ming Zhu, et al. 2022. HW-TSC translation systems for the WMT22 biomedical translation task. In Proceedings of the Seventh Conference on Machine Translation, Online. Association for Computational Linguistics.

Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. Advances in Neural Information Processing Systems, 32.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In 2020 Annual Conference of the Association for Computational Linguistics, pages 1628–1639. Association for Computational Linguistics (ACL).
HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via
Pre-Training and Domain Adaptation
Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao, Guo JiaXin,
Daimeng Wei, Hengchao Shang, Wang Minghan, Xiaoyu Chen
Zhengzhe YU, Li ShaoJun, Lei LiZhi, Hao Yang
Huawei Translation Service Center, Beijing, China
{lizongyao,wuzhanglin2,raozhiqiang,xieyuhao2,guojiaxin1,
weidaimeng,shanghengchao,wangminghan,chenxiaoyu35,
yuzhengzhe,lishaojun18,leilizhi,yanghao30}@huawei.com
Abstract

This paper describes HW-TSC’s submissions to the IWSLT 2023 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese and English to Japanese. We participated in all three tracks (Constrained training, Constrained with Large Language Models training, Unconstrained training), using cascaded architecture models. We use data augmentation, pre-training models and other means to improve the quality of ASR, and use a variety of techniques including R-Drop, deep models, domain data selection, etc. to improve the quality of NMT. Compared with last year’s best results, we improve by 2.1 BLEU on the MuST-C English-German test set.

1 Introduction

The goal of the Offline Speech Translation Task is to examine automatic methods for translating audio speech in one language into text in the target language. In recent years, end-to-end systems and cascade systems have been the fundamental pipelines for speech translation tasks. A traditional cascade system is composed of consecutive parts: automatic speech recognition (ASR) is responsible for generating transcripts from audio, and a machine translation (MT) model translates the ASR outputs from the source language into the target language. ASR models like Conformer (Gulati et al., 2020) and S2T-Transformer (Synnaeve et al., 2019) are commonly used, and MT models like Transformer (Vaswani et al., 2017) can be considered a standard configuration. End-to-end systems use a single model to directly recognize speech into target text in another language.

The cascade system suffers from some "missing information" due to the two encoding and decoding processes of ASR and MT. At the same time, the disadvantage of the end-to-end system is the lack of sufficient training data. However, with a fully trained cascade system, the accuracy of ASR and MT reaches a higher level, so in terms of results, the BLEU of the cascaded system is higher than that of the end-to-end system. Currently, the mainstream speech translation systems in industry are still based on the cascade architecture. We use the cascade system for this task, mainly to further improve the performance of speech translation.

In this work, we carefully filter and preprocess the data, and adopt various enhancement techniques, such as pre-training models, data augmentation and domain adaptation, to optimize the performance of ASR. We build machine translation systems with techniques like back translation (Edunov et al., 2018), domain adaptation and R-Drop (Wu et al., 2021), which have been proved to be effective practices.

The main contributions of this paper can be summarized as follows:

1) According to the characteristics of the three different tracks (constrained, constrained with large language models (LLM), and unconstrained), we use different strategies to optimize the results of ASR. After careful fine-tuning, the ASR systems of the three tracks achieve good WER performance.

2) We explored multilingual machine translation models, tried a variety of model enhancement strategies, and finally achieved good results on the MuST-C test set.

Section 2 focuses on our data processing strategies, while Section 3 describes the training techniques of ASR, including model architecture and training strategy. Section 4 describes the training techniques of MT, and Section 5 presents our experiment results.
2 Datasets and Preprocessing

2.1 ASR Data

Six different datasets are used in the training of our ASR models: MuST-C V2 (Cattoni et al., 2021), LibriSpeech (Panayotov et al., 2015), TED-LIUM 3 (Hernandez et al., 2018), CoVoST 2 (Wang et al., 2020), VoxPopuli (Wang et al., 2021), and Europarl-ST (Iranzo-Sánchez et al., 2020), as described in Table 1. We use exactly the same data processing strategy to train our ASR models following the configuration of Wang et al. (2022). We extend one data augmentation method (Zhang et al., 2022): adjacent utterances are concatenated to generate longer training speeches.

Dataset       Duration (h)
LibriSpeech   960
MuST-C        590
CoVoST        1802
TEDLIUM3      453
Europarl      161
VoxPopuli     1270

Table 1: Data statistics of our ASR corpora.

Tsiamas et al. (2022) propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. For the test sets, we use SHAS to split long audio into shorter segments.

2.2 MT Data

We used all provided data, including text-parallel, speech-to-text-parallel and text-monolingual data, and use exactly the same data processing strategy to process our MT data following Wei et al. (2021). Data sizes before and after cleaning are listed in Table 2.

3 ASR Model

3.1 Constrained training

In this track, we trained the constrained ASR models using the Conformer (Gulati et al., 2020) and U2 (Zhang et al., 2020b) model architectures. The first is a standard auto-regressive ASR model built upon the Transformer architecture. The second is a unified model that can perform both streaming and non-streaming ASR, supported by the dynamic chunking training strategy. The model configurations are as follows:

1) Conformer: The encoder is composed of 2 layers of VGG and 16 layers of Conformer, and the decoder is composed of 6 layers of Transformer. The embedding size is 1024, the hidden size of the FFN is 4096, and the number of attention heads is 16.

2) U2: Two convolution subsampling layers with kernel size 3*3 and stride 2 are used in the front of the encoder. We use 12 Conformer layers for the encoder and 6 Transformer layers for the decoder. The embedding size is 1024, the hidden size of the FFN is 4096, and the number of attention heads is 16.

During the training of the ASR models, we set the batch size to a maximum of 20,000 frames per card. Inverse sqrt is used for lr scheduling, with warm-up steps set to 10,000 and peak lr set to 5e-4. Adam is used as the optimizer. All ASR models are trained on 8 A100 GPUs for 100 epochs, and the parameters of the last 5 epochs are averaged. Audio features are normalized with utterance-level CMVN for Conformer, and with global CMVN for U2. All audio inputs are augmented with spectral augmentation (Park et al., 2019), and Connectionist Temporal Classification (CTC) is added to make the models converge better.

3.2 Constrained with Large Language Models training

Large language models (LLM) are currently the mainstream method in the field of artificial intelligence. In ASR, pre-trained models have been proved to be an effective means of improving quality; in particular, models such as wav2vec (Schneider et al., 2019) and HuBERT (Hsu et al., 2021) have been proposed in recent years. Li et al. (2020) combine the encoder of wav2vec2 (Baevski et al., 2020) and the decoder of mBART50 (Tang et al., 2020) to fine-tune an end-to-end model. We adopt a similar strategy, but combine the encoder of wav2vec2 and the decoder of mBART50 to fine-tune an ASR model (w2v2-mBART). Due to the modality mismatch between pre-training and fine-tuning, in order to better train the cross-attention, we freeze the self-attention of the encoder and decoder. We first use all the constrained data for fine-tuning, and only use the MuST-C data after 30 epochs of training.
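As an illustration of the freezing step described for w2v2-mBART above (a sketch only; the real model is built in fairseq, and the module name patterns below are assumptions rather than the authors' code):

```python
# Sketch: freeze self-attention parameters while leaving cross-attention
# (encoder-decoder attention) trainable. Module name patterns are assumptions;
# the actual w2v2-mBART implementation may name its submodules differently.
import torch.nn as nn

def freeze_self_attention(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        if "self_attn" in name:          # self-attention in encoder and decoder layers
            param.requires_grad = False  # frozen
        # cross-attention (e.g. "encoder_attn") and embeddings stay trainable
```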
Language pair   Raw Data   Filtered Data   LaBSE-filtered Data   Domain Selection
En2De           19.8M      14.5M           5.8M                  0.4M
En2Zh           8.1M       5.5M            2.2M                  0.4M
En2Ja           16.4M      14.1M           5.6M                  0.4M

Table 2: Bilingual data sizes before and after filtering used in the tasks.
3.3 Unconstrained training

Whisper (Radford et al., 2022) is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Even though it enables transcription in multiple languages, we only use its speech recognition feature, transcribing audio files to English text. In this task, we use it as a pre-trained model, and use the MuST-C dataset for fine-tuning to improve its performance in specific domains. We trained for 2 epochs with a small learning rate of 10e-6.

[…] Second, we use LaBSE (Feng et al., 2020) to filter the bilingual data, and use the filtered data for incremental training; Table 2 gives the amount of filtered data for each language. Then, for the three languages, backward models are trained separately, and the monolingual data are used for back translation (BT). Finally, we combine back translation and forward translation (FT) for iterative joint training (Zhang et al., 2018). After the above several stages, a base model with better performance is obtained, which can be used for further optimization.
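The LaBSE-based filtering step above scores source-target pairs by multilingual sentence-embedding similarity and keeps only high-scoring pairs. A minimal sketch with the sentence-transformers LaBSE checkpoint is shown below; the threshold and example sentences are assumptions, not the authors' settings.

```python
# Sketch: LaBSE-based bitext filtering via cosine similarity of sentence embeddings.
# The threshold value is an assumption; the authors' exact setup is not described here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_bitext(pairs, threshold=0.75):
    """Keep (src, tgt) pairs whose LaBSE cosine similarity exceeds the threshold."""
    src_emb = model.encode([s for s, _ in pairs], convert_to_tensor=True, normalize_embeddings=True)
    tgt_emb = model.encode([t for _, t in pairs], convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [pair for pair, sim in zip(pairs, sims) if float(sim) >= threshold]

kept = filter_bitext([("Hello world.", "Hallo Welt."), ("Good morning.", "Ein ganz anderer Satz.")])
```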
Table 4: The experimental results of ASR. We present WER performance of tst-COM, tst2018, tst2019 and tst2020.
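The Whisper systems whose WER is reported in Table 4 are the unconstrained models of Section 3.3. For orientation only, off-the-shelf transcription with the openai-whisper package looks roughly like the sketch below; the audio path is hypothetical, and the MuST-C fine-tuning step is not part of this snippet.

```python
# Sketch: English transcription with the off-the-shelf Whisper medium model.
# The audio file path is hypothetical; fine-tuning on MuST-C is done separately.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("talk_segment.wav", language="en", task="transcribe")
print(result["text"])
```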
System               En2De   En2Ja   En2Zh
One2Many             36.22   15.43   29.05
+ LaBSE bitext       37.58   15.48   29.48
+ Domain adaptation  41.55   17.08   29.27
+ Iter FTBT          43.03   17.86   29.82
+ Dev fine-tuning    43.66   20.88   30.48

Table 5: The BLEU of MT using the tst-COM golden transcription.

System               En2De   En2Ja   En2Zh
One2Many             31.54   14.08   26.69
+ LaBSE bitext       32.65   13.88   27.14
+ Domain adaptation  35.96   15.4    27.15
+ Iter FTBT          36.38   15.81   27.98
+ Dev fine-tuning    37.83   18.6    28.86
+ Robustness         38.71   20.34   28.93

Table 6: The BLEU of MT using the tst-COM transcription by the Whisper fine-tuning model.

In this task, we use the TED and MuST-C data as in-domain data. We score all the training bilingual data through Equation 1, filter out 80%–90% of the data according to the score distribution, and use the remaining 0.4M in-domain data to continue training the previous model.

4.5 Robustness to ASR Noise

We use two methods to improve the robustness of the system to ASR output noise.

Synthetic Noise Generation. We refer to the method proposed in Guo et al. (2022) to synthesize part of the noise data to enhance the robustness of the model.

ASR Transcript Data. Some triplet data are provided in this task, including audio, source and target. We use the trained ASR to transcribe the audio files to obtain source′, and finally build MT training data of the form (source′, target). The source′ transcribed by ASR may contain errors, but using it in MT increases the robustness of the MT encoder. When using the data generated in this way, we follow the tagged BT method (Caswell et al., 2019) and add a special token at the beginning of the source sentence, as sketched below.
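A minimal illustration of this tagging; the tag string and example sentences are assumptions, not the authors' exact choices.

```python
# Sketch: build tagged (source', target) pairs from ASR transcripts,
# in the spirit of tagged back-translation. The tag string is an assumption.
ASR_TAG = "<asr>"

def make_tagged_pairs(asr_transcripts, targets):
    """Pair each ASR transcript with its target and prepend a tag to the source side."""
    return [(f"{ASR_TAG} {src}", tgt) for src, tgt in zip(asr_transcripts, targets)]

pairs = make_tagged_pairs(["i think er this is grate"], ["Ich denke, das ist großartig."])
# -> [("<asr> i think er this is grate", "Ich denke, das ist großartig.")]
```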
5 Experiments and Results

We use the open-source fairseq (Ott et al., 2019) for training, use word error rate (WER) to evaluate the ASR models, and report case-sensitive SacreBLEU (Post, 2018) scores for machine translation. We evaluated our systems on the MuST-C tst-COMMON (tst-COM) test sets.

Table 3 shows our results on the three languages for the three tracks (Constrained, Constrained with LLM, Unconstrained). After a series of optimizations, although the ASR results of the three systems are somewhat different, the BLEU scores of all systems are very close. Since there is no test set for IWSLT 2022, we only compared with last year's teams on tst-COM. Compared with last year's best results (Zhang et al., 2022), we improve by 2.1 BLEU on the MuST-C En2De test set; on En2Zh and En2Ja, we achieve results close to last year's best.

We analyze the main reasons for the similar results of the three systems: 1. The three systems use the same MT, and our MT system has the ability to correct wrong input after its robustness is enhanced. 2. The same data is used to fine-tune the three ASR systems, so their WERs are relatively close.
190
5.1 Automatic Speech Recognition

We compare the results of different model architectures; the overall experimental results for ASR are reported in Table 4. We evaluated our systems on the tst-COM and IWSLT tst2018/tst2019/tst2020 test sets. For long audio in the test sets, we use SHAS for segmentation. We compute the WER after the reference and hypothesis are lowercased and the punctuation is removed.
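A minimal sketch of this WER normalization, assuming the jiwer package (the paper does not state which WER implementation is used):

```python
# Hedged sketch: lowercase and strip punctuation before computing WER,
# mirroring the normalization described above.
import string
import jiwer

def normalized_wer(reference: str, hypothesis: str) -> float:
    table = str.maketrans("", "", string.punctuation)
    ref = reference.lower().translate(table)
    hyp = hypothesis.lower().translate(table)
    return jiwer.wer(ref, hyp)

print(normalized_wer("Hello, world!", "hello world"))  # 0.0 after normalization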
In Table 4, all ASR systems achieve good performance, and the results are relatively close. Conformer and U2 are trained using constrained data. w2v2-mBART is obtained by fine-tuning pre-trained models that are allowed in the constrained condition. Whisper denotes transcribing the long audio without segmentation using the native Whisper medium model, and Whisper fine-tuning denotes the Whisper medium model after fine-tuning on the MuST-C dataset. The WER of Conformer and U2 is relatively close; when submitting the results of the constrained track, we use Conformer as the final ASR system. The experimental results show that pre-trained models exhibit their advantages: w2v2-mBART achieves better results than training with constrained data alone. Whisper itself performs very well in the general domain, and after fine-tuning it obtains even better results in the specific domain. However, it is difficult to fine-tune Whisper while improving performance across all domains, and its WER on tst2019 and tst2020 deteriorates.

5.2 Neural Machine Translation

We evaluate the performance of the MT model in detail on the MuST-C test set. Table 5 shows the results of each optimization strategy using the golden transcription as the source; Table 6 uses the transcription generated by the fine-tuned Whisper model as the source. The results show that there is a gap in BLEU between the golden transcription and the ASR transcription, which is mainly due to errors (punctuation, capitalization, vocabulary, etc.) in the ASR output. On the En2De test set, this gap is particularly wide.

One2Many is a multilingual model trained with the R-drop strategy, and it achieves relatively good performance on the test set. LaBSE filtering brings a small improvement, while domain adaptation brings a large improvement, which proves the effectiveness of our strategy. Iterative joint training with FT and BT (Iter FTBT) is also an effective means to improve quality. After dev fine-tuning, the results are already very competitive. After further improving the robustness of the system to ASR output, our BLEU scores on En2De, En2Ja, and En2Zh are 38.71, 20.34, and 28.93, respectively.
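The One2Many model above is trained with R-drop (Wu et al., 2021). A minimal sketch of the R-drop regularizer, where `model(src, tgt)` returning per-token logits is an assumed interface and the weight alpha is an illustrative hyperparameter:

```python
# Hedged sketch of R-drop: run the same batch through the model twice (different
# dropout masks) and penalize the symmetric KL divergence between the two outputs.
import torch
import torch.nn.functional as F

def r_drop_loss(model, src, tgt, alpha=5.0):
    logits1 = model(src, tgt)          # [batch, len, vocab], dropout active
    logits2 = model(src, tgt)          # second pass with a different dropout mask
    ce = 0.5 * (F.cross_entropy(logits1.transpose(1, 2), tgt) +
                F.cross_entropy(logits2.transpose(1, 2), tgt))
    p = F.log_softmax(logits1, dim=-1)
    q = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return ce + alpha * kl
```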
6 Conclusion

This paper presents our offline speech translation systems for the IWSLT 2023 evaluation. We explored different strategies in the pipeline of building the cascade system. In the data preprocessing, we adopt efficient cleansing approaches to build the training set collected from different data sources. We tried various ASR training strategies and achieved good performance. For the MT system, we used various methods such as multilingual machine translation, R-drop, domain adaptation, and robustness enhancement. Finally, compared with last year's best results, we improved by 2.1 BLEU on the MuST-C English-German test set.

References

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. arXiv preprint arXiv:1906.06442.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Pengzhi Gao, Zhongjun He, Hua Wu, and Haifeng Wang. 2022. Bi-SimCut: A simple strategy for boosting neural machine translation. arXiv preprint arXiv:2206.02368.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented Transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Bao Guo, Mengge Liu, Wen Zhang, Hexuan Chen, Chang Mu, Xiang Li, Jianwei Cui, Bin Wang, and Yuhang Guo. 2022. The Xiaomi text-to-text simultaneous speech translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 216–224.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings, pages 198–208. Springer.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233. IEEE.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv preprint arXiv:2010.12829.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert. 2019. End-to-end ASR: From supervised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390.

Changhan Wang, Anne Wu, and Juan Pino. 2020. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, et al. 2022. The HW-TSC's simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 247–254.
Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation with linear associative unit. arXiv preprint arXiv:1705.00861.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019a. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787.

Wei Wang, Isaac Caswell, and Ciprian Chelba. 2019b. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. arXiv preprint arXiv:1906.01130.

Daimeng Wei, Zongyao Li, Zhanglin Wu, Zhengzhe Yu, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Minghan Wang, Lizhi Lei, Min Zhang, Hao Yang, and Ying Qin. 2021. HW-TSC's participation in the WMT 2021 news translation shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 225–231, Online. Association for Computational Linguistics.

Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu, et al. 2021. R-drop: Regularized dropout for neural networks. Advances in Neural Information Processing Systems, 34:10890–10905.
The USTC’s Offline Speech Translation Systems for IWSLT 2023
Xinyuan Zhou2, Jianwei Cui1, Zhongyi Ye2, Yichi Wang1,
Luzhen Xu1, Hanyi Zhang2, Weitai Zhang1,2, Lirong Dai1
1 University of Science and Technology of China, Hefei, China
2 iFlytek Research, Hefei, China
{jwcui,wangyichi,lzxu,zwt2021}@mail.ustc.edu.cn
[email protected]
{xyzhou15,zyye7,hyzhang56}@iflytek.com
Corpus (EN-ZH)   Duration (h)   Sample Scale
MuST-C              593           2
CovoST2            1092           2
KD                16000           2
TTS               27000           1

Table 4: Speech Translation Corpora.

text are utilized to enhance our speech translation dataset, referred to as the TTS Corpus in Table 4.

3 Cascaded Speech Translation

3.1 Automatic Speech Recognition

We implement the ASR model in the cascaded condition via Supervised Hybrid Audio Segmentation (SHAS) and Whisper.

Supervised Hybrid Audio Segmentation. Supervised Hybrid Audio Segmentation (SHAS) (Tsiamas et al., 2022) is used to split long audio into short segments with quality comparable to manual segmentation. Hence, we use SHAS as a Voice Activity Detection (VAD) component in the ASR system, as well as a speech segmentation tool in the speech translation system. This way, the output of the ASR system can be directly fed into the text translation component.

Whisper. We incorporate the pre-trained Whisper (Radford et al., 2022) as the ASR model of the cascaded system to reduce errors in the intermediate source language text.

Whisper scales weakly supervised speech-to-text tasks to 680,000 hours of labeled audio data and expands the pre-training scope from English-only speech recognition to multilingual and multitask settings. In comparison with previous unsupervised pre-training approaches (Baevski et al., 2020), Whisper not only improves the quality of the audio encoder but also trains a high-quality pre-trained decoder, enhancing usefulness and robustness. Results demonstrate that the pre-trained Whisper model can be transferred well to different or even zero-shot datasets without any dataset-specific fine-tuning.

We used the large version of the pre-trained Whisper model, which contains 32 layers and a total of 1550M parameters.

3.2 Neural Machine Translation

We adopted the same strategy as last year (Zhang et al., 2022a) and built machine translation models based on the Transformer (Vaswani et al., 2017) implemented in the fairseq (Ott et al., 2019) toolkit. Each single model was trained on 16 NVIDIA V100 GPUs. Our experiments utilized several crucial technologies, including back translation, sentence-level knowledge distillation, domain adaptation, robust MT training, and ensembling.

Back Translation. Back-translation (Sennrich et al., 2016a) is a proficient technique for enhancing translation accuracy. It generates synthetic sentence pairs by translating target-side monolingual data and has gained significant popularity in both academic research and commercial applications. We train NMT models with the bilingual data and translate Chinese sentences into English.

Knowledge Distillation. Sentence-level knowledge distillation (Kim and Rush, 2016), also known as self-training, is an effective method for enhancing performance. We expand our training dataset by leveraging a trained NMT model to translate English sentences into Chinese. This approach has proven highly beneficial in improving model accuracy.

Domain Adaptation. Due to the critical importance of high-quality, domain-specific translation (Saunders, 2022), we fine-tune the NMT model using a mix of in-domain data (such as MuST-C, TED-LIUM 3, etc.) and out-of-domain data. Additionally, the labelled English sentences from the speech recognition training data are also utilized as augmented in-domain self-training data by translating them.

We adopt a denoising-based approach (Wang et al., 2018) to assess and select data for domain-specific MT and use it to denoise NMT training. The denoising technique addresses data quality issues and reduces the adverse effects of noise on NMT training.

Robust MT Training. To enhance the robustness of the MT model to ASR errors in cascaded ST, the ASR output adaptive training approach (Zhang et al., 2022a) is introduced. The English transcripts of all speech translation datasets are fed into a trained ASR model to generate source-side text, which is then paired with the text on the target side. We improve the robustness of the MT model through three methods: 1) fine-tuning the MT model with the synthetic data; 2) incorporating a KL loss during fine-tuning to prevent over-fitting; and 3) distilling the model using clean source text and ASR output.
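A minimal sketch of method 2) above, assuming the KL term is computed against a frozen copy of the original MT model (the paper does not specify the exact reference distribution, so this is one plausible instantiation):

```python
# Hedged sketch: cross-entropy on (ASR-noisy source, target) plus a KL penalty that
# keeps the fine-tuned model close to the original MT model, to limit over-fitting.
import torch
import torch.nn.functional as F

def robust_finetune_loss(model, ref_model, noisy_src, tgt, beta=1.0):
    logits = model(noisy_src, tgt)                 # [batch, len, vocab]
    with torch.no_grad():
        ref_logits = ref_model(noisy_src, tgt)     # frozen pre-fine-tuning model
    ce = F.cross_entropy(logits.transpose(1, 2), tgt)
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")
    return ce + beta * kl
```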
models.

• E15D6-v1: 15 layers for the encoder and 6 layers for the decoder. The embedding size is 1024, the FFN size is 8192, and the attention head is

• Macaron: A version with the macaron architecture (Lu et al., 2019) based on the data of E18D6. 36 layers for the encoder and an FFN size of 2048.

3.3 End-to-End Speech Translation

In the end-to-end condition, we ensemble the encoder-decoder and the Stacked Acoustic-and-Textual Encoding extension (SATE-ex) models described in Section 3.4.

Encoder-Decoder. The encoder-decoder-based end-to-end ST model processes the speech in the source language with its encoder and generates text in the target language with its decoder. The encoder and decoder are initialized using the corresponding parts of the cascade ASR and MT models. As regards model architecture, we investigate four variants in end-to-end ST.

• VGG-C: The encoder of VGG-C is initialized by the ASR VGG-Conformer architecture, which consists of 2 layers of VGG and 12 layers of Conformer; the ASR VGG-Conformer is trained using the data in Section 2.1. The decoder of VGG-C is 6 layers of Transformer with an embedding size of 1024, 16 attention heads, and an FFN size of 8192.

• VGG-C-init: The encoder is VGG-Conformer, initialized by the ASR VGG-Conformer architecture. The decoder is 6 layers of Transformer, initialized by the NMT E15D6-v2 variant.

• VGG-T: The encoder of VGG-T is initialized by the ASR VGG-Transformer architecture, which consists of 2 layers of VGG and 16 layers of Transformer. The decoder of VGG-T is 6 layers of Transformer with an embedding size of 1024, 16 attention heads, and an FFN size of 8192.

• VGG-T-init: The VGG-Transformer encoder is initialized by the ASR VGG-Transformer architecture. The decoder is 6 layers of Transformer, initialized by the NMT E15D6-v2 variant.

3.4 Stacked Acoustic-and-Textual Encoding Extension

To further improve the performance of end-to-end ST, we propose the Stacked Acoustic-and-Textual Encoding extension (SATE-ex) based on SATE (Xu et al., 2021).

SATE. The MT encoder captures the long-distance dependency structure, while the ASR encoder focuses on local dependencies in the input sequence. Thus, an encoder-decoder model initialized with the ASR encoder and the MT decoder may have inconsistent intermediate representations.

SATE stacks two encoders, an acoustic encoder and a textual encoder. The acoustic encoder processes the acoustic input, while the textual encoder generates global attention representations for translation. Moreover, an adapter is placed after the acoustic encoder, which maps the acoustic representation to the latent space of the textual encoder while retaining acoustic information. By doing so, SATE can maintain consistency of representations across different pre-trained components. Besides, multi-teacher knowledge distillation has been
developed to preserve pre-training knowledge during fine-tuning (Hinton et al., 2015).

[Figure 1: SATE-ex architecture. Visible labels include the textual decoder with Cross-Attention, Linear, and Softmax layers, and the losses L_ASR, L_KD-Trans, and L_Trans.]

SATE-ex. Figure 1 shows the SATE-ex architecture, comprising the acoustic encoder, acoustic decoder, textual encoder, and textual decoder components. These components are initialized with their corresponding components in the cascade ASR and MT models. Notably, the textual decoder in SATE-ex has a Cross-Attention module (highlighted in yellow) that processes the acoustic decoder's output. By doing so, this approach fuses the last-layer decoding hidden states of the ASR decoder into the textual decoder, alongside the Connectionist Temporal Classification (CTC) decoding hidden states of the ASR that are injected through the adapter and the textual encoder. Similar to Zhang et al. (2020), this idea helps fuse and complement different decoding strategies, which can improve internal recognition accuracy, reduce the propagation of intermediate representation errors, and thereby enhance translation performance.

The loss function of SATE-ex, similar to SATE (Xu et al., 2021), computes a CTC loss L_CTC, an ASR loss L_ASR, and a translation loss L_Trans. Additionally, the multi-teacher knowledge distillation losses L_KD-CTC and L_KD-Trans are used to preserve pre-trained knowledge during fine-tuning.
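Written out, the overall objective is a weighted sum of these terms; the weights are assumed hyperparameters, since the paper does not give the exact combination:

```latex
% Hedged sketch of the SATE-ex objective; the weights \lambda_i are assumptions.
\mathcal{L}_{\text{SATE-ex}}
  = \lambda_1 \mathcal{L}_{\text{CTC}}
  + \lambda_2 \mathcal{L}_{\text{ASR}}
  + \lambda_3 \mathcal{L}_{\text{Trans}}
  + \lambda_4 \mathcal{L}_{\text{KD-CTC}}
  + \lambda_5 \mathcal{L}_{\text{KD-Trans}}
```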
Adaptation Training. To further eliminate the intermediate representation mismatch between the pre-trained ASR and MT components, we adopt adaptation training to fine-tune the MT part of SATE-ex (including the textual encoder and textual decoder) before end-to-end training. Specifically, we first generate greedy CTC decodings, without removing duplicates and blanks, through the acoustic encoder. Then, we pair these CTC decodings with the target-language text to fine-tune the textual encoder and textual decoder. Note that the textual decoder here does not contain the Cross-Attention module (highlighted in yellow) in Figure 1.
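A minimal sketch of this greedy CTC decoding step, assuming frame-level logits from the acoustic encoder's CTC head; unlike standard CTC decoding, repeated labels and blanks are deliberately kept:

```python
# Hedged sketch: frame-wise argmax over the CTC output, keeping blanks and
# repeated labels so the sequence length matches the acoustic representation.
import torch

def greedy_ctc_no_collapse(ctc_logits: torch.Tensor) -> torch.Tensor:
    """ctc_logits: [time, vocab] frame-level logits from the CTC head."""
    return ctc_logits.argmax(dim=-1)   # one label per frame, blanks included

frame_labels = greedy_ctc_no_collapse(torch.randn(200, 5000))
```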
System    tst2018   tst2019   tst2020   tst2022   tst-COM
ASR*       95.59     97.55     95.71     96.67     98.04
Whisper    95.75     98.34     97.17     97.86     97.01

Table 5: The recognition accuracy of the ASR fusion model and the pre-trained Whisper. ASR* indicates the ASR fusion model.

(5, 54, 0.1). We also provide the results of MT as a reference (Systems #1-5).

4.1 Automatic Speech Recognition

We evaluate the recognition performance of the ASR fusion model and the pre-trained Whisper. The ASR fusion model comprises three model structures, each trained with and without Text-to-Speech (TTS) data, resulting in a total of six ASR models. These models are fused to obtain the final ASR* model. The three ASR structures are listed below.

• VGG-Conformer: 2 layers of VGG and 12 layers of Conformer in the encoder, 6 layers of Transformer in the decoder.

• VGG-Transformer: 2 layers of VGG and 16 layers of Transformer in the encoder, 6 layers of Transformer in the decoder.

• GateCNN-Transformer: 6 layers of GateCNN and 12 layers of Conformer in the encoder, 6 layers of Transformer in the decoder.

The recognition results of the ASR fusion model and the pre-trained Whisper are presented in Table 5. The results indicate that Whisper has superior recognition performance compared to the ASR fusion model, with an average improvement of 0.51%. However, the ASR fusion model slightly outperforms Whisper on the tst-COM dataset, which could be due to the ASR fusion model's upsampling, making its data distribution closer to tst-COM.
Table 6: The BLEU scores of machine translation (MT), cascaded, end-to-end, and ensemble systems. * indicates fusion models. The SHAS parameters are (min, max, threshold).

System #7 uses the large version of Whisper3 as the ASR, while the MT* is consistent with System #6. As shown, on the Dev set, using Whisper to reduce errors in the source language text improves the performance of ST. However, on tst-COM, the cascade model with ASR* performs better, presumably due to the closer match between the data distribution of ASR* and that of tst-COM.

4.3 End-to-End Systems

In the end-to-end setting, we adopt the encoder-decoder and SATE-ex architectures. Systems #12-15 are built on the encoder-decoder, with the specific parameters given in Section 3.3. Systems #8-11 adopt the SATE-ex architecture. SATE-ex-T uses the VGG-Conformer ASR model in Section 4.2 to initialize the acoustic encoder and decoder, and the E18D6 MT model in Section 3.2 to initialize the textual encoder and decoder. SATE-ex-M uses the Macaron MT model in Section 3.2 to initialize the textual encoder and decoder.

It can be seen that the results of the ensemble SATE-ex (System #16) outperform those of the ensemble encoder-decoder (System #17). However, the performance of a single SATE-ex model is slightly worse than that of a single encoder-decoder model, which we attribute to the lack of fine-tuning for the single SATE-ex model. In future work, we will investigate SATE-ex in more detail.

4.4 Ensemble Systems

We ensemble the two cascade models (Systems #6 and #7) and the end-to-end model (System #18) separately. The results are shown as Systems #19 and #20 in Table 6. It can be seen that the ensemble systems achieve excellent performance.

3 https://github.com/openai/whisper
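Since the MT models are built with fairseq (Section 3.2), one common way to realize such a fusion at decoding time is fairseq's built-in checkpoint ensembling, where multiple model paths are joined with a colon; the data directory and checkpoint names below are placeholders:

```python
# Hedged sketch: fairseq-generate averages the output distributions of all models
# passed via --path (colon-separated), one standard way to decode an ensemble.
import subprocess

subprocess.run([
    "fairseq-generate", "data-bin/mustc_en_zh",          # placeholder data path
    "--path", "ckpt_mt1.pt:ckpt_mt2.pt:ckpt_mt3.pt",     # placeholder checkpoints
    "--beam", "5",
    "--remove-bpe", "sentencepiece",
])
```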
4.5 System Description

Our system is primarily based on the full dataset allowed by IWSLT 2022, supplemented with Whisper large and SHAS for audio segmentation, the latter trained on MuST-C. We trained six ASR models and six MT models on the IWSLT 2022 training data for model fusion. Additionally, we trained four encoder-decoder end-to-end ST models and four SATE-ex end-to-end ST models for end-to-end model fusion.

For the end-to-end system, we use a fusion of the eight end-to-end models mentioned above. For the cascaded systems, we build two cascades: one with ASR based on Whisper and the other with ASR based on the six-model fusion. The MT side uses the six-model fusion in both cascades. The submitted systems are based on these two cascades, each combined with the eight-model-fusion end-to-end system.

The system structure and SHAS parameter (min, max, threshold) settings of the five submitted systems are shown below.

• Primary Cascade: System #7 with SHAS parameters set to (5, 54, 0.1).

• Contrastive1: System #20 with SHAS parameters set to (1, 18, 0.5).

• Contrastive2: System #19 with SHAS parameters set to (1, 18, 0.5).

• Contrastive3: System #6 with SHAS parameters set to (5, 54, 0.1).

• Primary e2e: System #18 with SHAS parameters set to (1, 18, 0.5).

5 Conclusion

This paper summarizes our results on the IWSLT 2023 Offline Speech Translation task. We employ various model architectures and data augmentation techniques to build speech translation systems in cascaded and end-to-end settings. The experimental results demonstrate the effectiveness of strategies such as pre-trained Whisper models, adaptation training, and the Stacked Acoustic-and-Textual Encoding extension (SATE-ex). In future work, we will further investigate SATE-ex and explore multimodal representation learning in speech translation.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. 2021. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. In Proc. Interspeech 2021, pages 3645–3649.

Edresson Casanova, Christopher Shulby, Alexander Korolev, Arnaldo Candido Junior, Anderson da Silva Soares, Sandra Aluísio, and Moacir Antonelli Ponti. 2022. ASR data augmentation using cross-lingual multi-speaker TTS and cross-lingual voice conversion. arXiv preprint arXiv:2204.00618.

Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 949–959, San Diego, California. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.
Hang Le, Florentin Barbier, Ha Nguyen, Natalia Tomashenko, Salima Mdhaffar, Souhir Gabiche Gahbiche, Benjamin Lecouteux, Didier Schwab, and Yannick Estève. 2021. ON-TRAC' systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 169–174, Bangkok, Thailand (online). Association for Computational Linguistics.

Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, and Lirong Dai. 2021. The USTC-NELSLIP systems for simultaneous speech translation task at IWSLT 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 30–38, Bangkok, Thailand (online). Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. Proc. Interspeech 2019, pages 1128–1132.

Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Danielle Saunders. 2022. Domain adaptation and multi-domain adaptation for neural machine translation: A survey. Journal of Artificial Intelligence Research, 75:351–424.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143, Brussels, Belgium. Association for Computational Linguistics.

Che Wanxiang, Feng Yunlong, Qin Libo, and Liu Ting. 2020. N-LTP: An open-source neural Chinese language technology platform with pretrained models. arXiv preprint.

Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2619–2630.

Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, and Xin Lei. 2020. Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481.

Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng, Mohan Shi, Yifan Song, et al. 2022a. The USTC-NELSLIP offline speech translation systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 198–207.

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, and Furu Wei. 2022b. SpeechUT: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. arXiv preprint arXiv:2210.03730.
I2R’s End-to-End Speech Translation System
for IWSLT 2023 Offline Shared Task
                        BLEU
Model                   tst2020   tst2019   MuST-C v3   MuST-C v2   CoVoST v2
in-domain
1  base (best)          25.70     22.68     30.29       30.56       27.92
2  base (avg 5)         24.81     22.25     29.98       30.29       28.11
extended-domain
3  base (best)          22.80     21.17     29.33       29.50       28.63
4  base (avg 3)         23.21     21.20     29.61       29.95       29.30
   Ensemble (1 + 2 + 4) 24.99     22.64     29.99       30.35       29.13
The NiuTrans End-to-End Speech Translation System
for IWSLT23 English-to-Chinese Offline Task
Yuchen Han1∗, Xiaoqian Liu1∗, Hao Chen1 , Yuhao Zhang1 ,
Chen Xu1 , Tong Xiao1,2 , Jingbo Zhu1,2
1 School of Computer Science and Engineering, Northeastern University, Shenyang, China
2 NiuTrans Research, Shenyang, China
{hanyuchen114,yoohao.zhang}@gmail.com,[email protected]
{liuxiaoqian0319,xuchennlp}@outlook.com
{xiaotong,zhujingbo}@mail.neu.edu.cn
* Authors contributed equally.

Abstract

This paper describes the NiuTrans end-to-end speech translation system submitted for the IWSLT 2023 English-to-Chinese offline task. Our speech translation models are composed of pre-trained ASR and MT models under the stacked acoustic and textual encoding framework. Several pre-trained models with diverse architectures and input representations (e.g., log Mel-filterbank and waveform) were utilized. We propose an iterative data augmentation method to iteratively improve the performance of the MT models and generate pseudo ST data through the MT systems. We then trained ST models with different structures and data settings to enhance ensemble performance. Experimental results demonstrate that our NiuTrans system achieved a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set, outperforming the previous year's submission by 0.12 BLEU despite using less MT training data.

1 Introduction

End-to-end speech translation (E2E ST) directly translates speech in the source language into text in the target language without generating an intermediate representation. It has gained significant attention in recent years due to several advantages over cascade methods, including low latency and the ability to avoid error propagation (Berard et al., 2016; Weiss et al., 2017). In this paper, we describe our NiuTrans E2E ST system that participated in the IWSLT23 English-to-Chinese offline track; an overview of our system is shown in Figure 1.

[Figure 1: Overview of our system. The diagram shows labeled ASR, MT, and ST data (FBank/wav), the pseudo MT and pseudo ST data generated iteratively, and the train/decode/initialize relations between the ASR (ensemble), MT, and ST models.]

To improve the performance of our system, we aim to maximize the diversity of our ensemble of E2E ST models. Our E2E ST models are based on the stacked acoustic and textual encoding (SATE) method (Xu et al., 2021a), which is a framework to make the best use of pre-trained automatic speech recognition (ASR) and machine translation (MT) components. Using this framework, we explore multiple architectures of pre-trained ASR and MT models with varying numbers of parameters and input representations such as FBank features or waveform data.

Pseudo data is a crucial component of E2E ST and is often generated by ensemble MT systems (Gaido et al., 2020). This year, we focused more on the performance of the MT models and developed an Iterative Data Augmentation method to leverage text data from all corpora, improving the MT models and enabling the generation of multiple pseudo datasets. We then used these pseudo datasets to train diverse E2E ST models for optimal performance. Our best ST ensemble system includes models with different input representations, architectures, and training corpora, achieving a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set.

The remainder of the paper is organized as follows: Section 2 describes the data processing, data
augmentation, and speech segmentation. Section 3 outlines the construction of the vocabulary and the structures of our ASR, MT, and ST models. The experimental settings and final results are presented in Section 4. Finally, Section 5 concludes the submission.

2 Data

2.1 Data Processing

Our system was built under the "constrained" training condition. The training data can be divided into three categories: ASR, MT, and ST corpora. We used the NiuTrans toolkit (Xiao et al., 2012) to segment English and Chinese text in all corpora.

Task   Corpus             Sentence (M)   Hour
ASR    LibriSpeech        0.28           960
       Europarl-ST        0.03           77
       TED LIUM           0.26           448
       ST TED             0.16           235
       VoxPopuli          0.17           478
       MuST-C V1 En-De    0.07           138
       MuST-C V2 En-Zh    0.36           572
       CoVoST v2 En-Zh    0.28           416
       Total              1.61           3324
MT     News Commentary    0.31           -
       OpenSubtitle       8.62           -
       MuST-C V2 En-Zh    0.36           -
       CoVoST V2 En-Zh    0.28           -
       Tatoeba            0.05           -
       Total              9.62           -
ST     MuST-C En-Zh       0.36           572
       CoVoST V2 En-Zh    0.28           416
       Total              0.64           988

Table 1: Details about the size of all labeled corpora. The unit of sentence counts is million (M).

Task   Corpus            Sentence (M)   Hour
MT     ASR corpora+MT    1.38           -
ST     ASR corpora+MT    1.61           3323
ST     Audio+ASR+MT      1.4e-2         3

Table 2: Details about the size of all pseudo corpora.
ASR corpora. We followed previous work (Xu et al., 2021b) and standardized all audio samples to a single channel and a sample rate of 16,000 Hz. For the Common Voice corpus, we selected only the cleaner parts according to the CoVoST v2 En-Zh corpus. In the MuST-C v1 En-De corpus, we removed repetitive items by comparing against the MuST-C v2 En-Zh transcriptions. We used the LibriSpeech corpus to train an ASR model and scored the Common Voice, TED LIUM, and ST TED corpora. Data with a WER greater than 0.75 were removed, and frames with lengths less than 5 or greater than 3000 were filtered. In addition, utterances with more than 400 characters were removed.

MT corpora. Following the methodology of Zhang et al. (2020), we cleaned the parallel texts of the OpenSubtitle corpus and used fast-align to score all sentences. We averaged the scores by sentence length and filtered out sentences with scores below -6.0. In the News Commentary v16 corpus, we used langid (Lui and Baldwin, 2012) to filter out sentences with incorrect language identification results. In the Tatoeba corpus, we converted 90% of the sentences from traditional Chinese to simplified Chinese using OpenCC1.
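A minimal sketch of the language-identification and script-conversion steps above, assuming the langid and opencc Python packages; the surrounding pipeline and file handling are assumptions, as the paper only names the tools:

```python
# Hedged sketch: drop pairs whose detected languages are not (en, zh) and convert
# traditional-Chinese targets to simplified Chinese with OpenCC.
import langid
from opencc import OpenCC

t2s = OpenCC("t2s")  # traditional -> simplified converter

def clean_pair(src: str, tgt: str):
    if langid.classify(src)[0] != "en" or langid.classify(tgt)[0] != "zh":
        return None                      # wrong language: filter the pair out
    return src, t2s.convert(tgt)

print(clean_pair("Good morning.", "早安。"))
```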
guage using MT models is a simpler and more
ST corpora. For the MuST-C v2 En-Zh and CoV- effective way to augment the ST corpus than gener-
oST v2 En-zh corpus, we only filtered frames by ating source speech features from the source texts
length, similar to the ASR corpora. For the pseudo in the MT corpus using TTS models. Based on
ST data, we removed sentences containing repeated this, we propose an Iterative Data Augmentation
n-gram words (n is 2 to 4) more than four times. (IDA) method, which aims to use text data from all
Additionally, sentences with length ratios outside corpora to improve the performance of MT models
the range of 0.25 to 4 and those with incorrect lan- and generate high-quality ST corpus iteratively, as
guage identification results were filtered out. illustrated in Algorithm 1.
1
https://github.com/BYVoid/OpenCC We also discovered incomplete transcriptions in
212
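A minimal sketch of the pseudo-data filters above (repeated n-grams and length ratio); the tokenization and exact counting rules are assumptions:

```python
# Hedged sketch: flag sentences where any 2- to 4-gram repeats more than four
# times, and drop pairs whose source/target length ratio falls outside [0.25, 4].
from collections import Counter

def has_repeated_ngram(tokens, n_range=(2, 4), max_repeat=4):
    for n in range(n_range[0], n_range[1] + 1):
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        if counts and max(counts.values()) > max_repeat:
            return True
    return False

def keep_pair(src_tokens, tgt_tokens):
    ratio = len(src_tokens) / max(len(tgt_tokens), 1)
    if not (0.25 <= ratio <= 4):
        return False
    return not (has_repeated_ngram(src_tokens) or has_repeated_ngram(tgt_tokens))
```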
2.2 Data Augmentation

We only used SpecAugment (Bahar et al., 2019) and did not use speed perturbation for ASR data augmentation, because speed perturbation requires more training resources but brings limited improvement. It is also worth noting that we did not use back-translation in either MT or E2E ST, as there was no target-side monolingual data available.

The MT model or ensemble MT systems represent the upper limit for E2E ST. Translating the transcripts in the ASR corpora into the target language using MT models is a simpler and more effective way to augment the ST corpus than generating source speech features from the source texts in the MT corpus using TTS models. Based on this, we propose an Iterative Data Augmentation (IDA) method, which aims to use text data from all corpora to improve the performance of the MT models and to generate a high-quality ST corpus iteratively, as illustrated in Algorithm 1.

We also discovered incomplete transcriptions in a few sentences from the TED LIUM, ST-TED, and VoxPopuli corpora. Therefore, we generated pseudo transcriptions using the ASR model and then translated them using the best MT ensemble systems.

training corpora for the SPM. The vocabulary size for English and Chinese is 10k and 44k, respectively.
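The fragment above refers to SentencePiece (SPM) vocabularies of 10k (English) and 44k (Chinese). A minimal training sketch, where the corpus file names, model type, and character coverage are assumptions:

```python
# Hedged sketch: train separate SentencePiece models for the two languages with
# the vocabulary sizes mentioned above; input file names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.en", model_prefix="spm_en", vocab_size=10000, model_type="unigram")
spm.SentencePieceTrainer.train(
    input="corpus.zh", model_prefix="spm_zh", vocab_size=44000, model_type="unigram",
    character_coverage=0.9995)  # common setting for Chinese; an assumption here

sp = spm.SentencePieceProcessor(model_file="spm_en.model")
print(sp.encode("End-to-end speech translation", out_type=str))
```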
3.2 ASR Models
Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu, Xuebo Liu, and Min Zhang. 2022. ODE Transformer: An ordinary differential equation-inspired model for sequence generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8335–8351. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 25–30. The Association for Computer Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 186–191. Association for Computational Linguistics.

Weiqiao Shan, Zhiquan Cao, Yuchen Han, Siming Wu, Yimin Hu, Jie Wang, Yi Zhang, Hou Baoyu, Hang Cao, Chenghao Gao, Xiaowen Liu, Tong Xiao, Anxiang Ma, and Jingbo Zhu. 2022. The NiuTrans machine translation systems for WMT22. In Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, pages 366–374. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 464–468. Association for Computational Linguistics.

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. In Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 106–110. ISCA.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1810–1822. Association for Computational Linguistics.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 2625–2629. ISCA.

Tong Xiao, Jingbo Zhu, Hao Zhang, and Qiang Li. 2012. NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 19–24. The Association for Computer Linguistics.

Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2619–2630. Association for Computational Linguistics.

Chen Xu, Xiaoqian Liu, Xiaowen Liu, Laohu Wang, Canan Huang, Tong Xiao, and Jingbo Zhu. 2021b. The NiuTrans end-to-end speech translation system for IWSLT 2021 offline task. In Proceedings of the 18th International Conference on Spoken Language Translation, IWSLT 2021, Bangkok, Thailand (online), August 5-6, 2021, pages 92–99. Association for Computational Linguistics.

Chen Xu, Yuhao Zhang, Chengbo Jiao, Xiaoqian Liu, Chi Hu, Xin Zeng, Tong Xiao, Anxiang Ma, Huizhen Wang, and Jingbo Zhu. 2023. Bridging the granularity gap for acoustic modeling. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics.

Yuhao Zhang, Canan Huang, Chen Xu, Xiaoqian Liu, Bei Li, Anxiang Ma, Tong Xiao, and Jingbo Zhu. 2022a. The NiuTrans's submission to the IWSLT22 English-to-Chinese offline speech translation task. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 232–238. Association for Computational Linguistics.
Yuhao Zhang, Ziyang Wang, Runzhe Cao, Binghao Wei, Weiqiao Shan, Shuhan Zhou, Abudurexiti Reheman, Tao Zhou, Xin Zeng, Laohu Wang, Yongyu Mu, Jingnan Zhang, Xiaoqian Liu, Xuanjun Zhou, Yinqiao Li, Bei Li, Tong Xiao, and Jingbo Zhu. 2020. The NiuTrans machine translation systems for WMT20. In Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, pages 338–345. Association for Computational Linguistics.

Yuhao Zhang, Chen Xu, Bojie Hu, Chunliang Zhang, Tong Xiao, and Jingbo Zhu. 2022b. Improving end-to-end speech translation by leveraging auxiliary speech and text data. CoRR, abs/2212.01778.
ON-TRAC consortium systems for the IWSLT 2023 dialectal and
low-resource speech translation tasks
Antoine Laurent1 , Souhir Gahbiche5 , Ha Nguyen2 , Haroun Elleuch4 ,
Fethi Bougares4 , Antoine Thiol5 , Hugo Riguidel1,3 , Salima Mdhaffar2 ,
Gaëlle Laperrière2 , Lucas Maison2 , Sameer Khurana6 , Yannick Estève2
1 LIUM - Le Mans University, France, 2 LIA - Avignon University, France, 3 Systran - France,
4 ELYADATA - Tunis, Tunisia, 5 Airbus - France,
6 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
translation). This dataset is made available by LDC under reference LDC2022E01. The goal of this track is to train speech translation systems under two training conditions: constrained, in which only the provided dataset resources are allowed, and unconstrained, where participants may use any public or private resources.

4.2 End-to-end ST

We used the end-to-end translation model presented in Section 3.2. The model was trained directly on the Tunisian-to-English task (no pre-training of the encoder-decoder model), using SAMU-XLS-R trained on 100 languages. We used adapters (Houlsby et al., 2019) inside the encoder to keep the semantic information while fine-tuning.
collection: 224 h in Tamasheq and 417 h in geo-
4.3 Results graphically close languages (French from Niger,
Table 1 presents our ST results for the Tunisian Fulfulde, Hausa, and Zarma).3 For all this data, the
to English Dialectal and Low-resource track. Our speech style is radio broadcasting, and the dataset
primary system obtained a BLEU of 20.7 on our presents no transcription.
validation set. As shown in the tables, the official
5.2 Models
evaluation scores appear to be low compared to
the good result obtained on the validation set. We For the Tamasheq to French task, we performed
suspect that our test submission was not conform several experiments. First of all, we did the same
to the evaluation specifications. We speculate that experiment that was done for Pashto-French and
this difference between validation and test scores is Tunisian-English tasks. We used the end-to-end
due to the fact we did not remove the punctuation translation model presented in section 3.2, directly
nor the disfluencies tokens from the case-sensitive trained on the Tamasheq→French task. Directly
translation we submitted, while the evaluation is means that we used SAMU−XLS−R-xx (xx corre-
made lowercase and no punctuation. We mistak- sponds to the number of languages in the training
enly expected this normalization step to be applied set, equals to 53, 60 and 100) to initialise the en-
by the organizers instead of the participant. We coder and performed the training of the encoder-
were able to ask the organizers to evaluate our nor- decoder model using the Tamasheq→French train-
malized output after the evaluation period. The ing set.
results are reported in Table 1. Test2 refers to the We used the CoVoST-2 (Wang et al., 2020)
IWSLT 2022 evaluation campaign test, and test3 X →EN speech-translation dataset in which we
refers to the one of IWSLT 2023. This normaliza- translated the EN text into French (using Mbart
tion before the training of our translation model is Many-to-Many). Additionally, we exploited the Eu-
expected to further improve our results because we roparl benchmark, which comprises 72 translation
believe that the post-deadline fix more accurately tasks (denoted as X →Y), with the source language
reflects our system’s true performance. set (X ) consisting of nine languages: FR, DE, ES,
IT, PL, PT, RO, NL, and EN. The target language
System Description valid test2 test3 set (Y) is equivalent to the source language set. For
primary SAMU−XLS−R 100 20.7 9.6 8.8
the specific training data distribution of each of the
post-deadline fix SAMU−XLS−R 100 20.7 18.2 16.3 72 translation tasks, refer to (Iranzo-Sánchez et al.,
2019).
Table 1: Results for Tunisian Arabic to English We trained a translation model using CoVost-2
translation systems in terms of %BLEU for low- X→FR,EN and Europarl X→FR, namely models
resource (LR)track. 3
https://demo-lia.univ-avignon.fr/
studios-tamani-kalangou/
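To make the adapter-based fine-tuning mentioned in Section 4.2 concrete, the sketch below shows a minimal Houlsby-style adapter (a bottleneck feed-forward block with a residual connection) wrapped around a frozen pre-trained encoder layer. The dimensions and wiring are illustrative assumptions of how such adapters are typically inserted, not the exact ON-TRAC implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al., 2019): down-project, non-linearity,
    up-project, residual connection. Only these parameters are trained."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class EncoderLayerWithAdapter(nn.Module):
    """Wraps a frozen pre-trained encoder layer and applies an adapter to its output."""
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        self.adapter = Adapter(dim)
        for p in self.layer.parameters():  # keep the pre-trained weights fixed
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))
```

Only the small adapter bottlenecks receive gradients, which is what allows the semantic information in the pre-trained SAMU-XLS-R encoder to be preserved during task-specific fine-tuning.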
System         Description                                         valid    test 2023
primary        samu100l[cv2_xx→(en,fr)+europarl_xx→fr] + test22    21.39    16.00
contrastive1   samu100l[cv2_xx→(en,fr)+europarl_xx→fr]             21.41    16.52
contrastive2   samu60l[cv2_xx→(en,fr)+europarl_xx→fr] + test22     20.80    15.84
contrastive3   samu60l[cv2_xx→(en,fr)+europarl_xx→fr]              20.66    15.35
contrastive4   samu100l continue training + test22                 21.39    16.30
contrastive5   samu100l continue training                          20.78    15.60
baseline       best system from IWSLT2022                          8.34     5.70
Alexei Baevski, Michael Auli, and Abdelrahman Mohamed. 2019. Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? CoRR, abs/2106.01045.

Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR, abs/1612.01744.

Kaushal Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. 2023. Effectiveness of mining audio and text pairs from public data for improving asr systems for low-resource languages. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estéve. 2022. Speech resources in the tamasheq language. Language Resources and Evaluation Conference (LREC).

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.

Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, et al. 2022. Xtreme-s: Evaluating cross-lingual speech representations. arXiv preprint arXiv:2203.10752.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

ELRA catalogue. 2016a. Trad pashto broadcast news speech corpus. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0381/. ISLRN: 918-508-885-913-7, ELRA ID: ELRA-S0381.

ELRA catalogue. 2016b. Trad pashto-french parallel corpus of transcribed broadcast news speech - training data. http://catalog.elda.org/en-us/repository/browse/ELRA-W0093/. ISLRN: 802-643-297-429-4, ELRA ID: ELRA-W0093.

Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, et al. 2021a. Task agnostic and task specific self-supervised learning from speech with LeBenchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, and Laurent Besacier. 2021b. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. In Interspeech, pages 1439–1443.

F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. 2022. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th ACL.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In Proc. ICML.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2019. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. arXiv:1911.03167.

Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. 2022. Indicsuperb: A speech processing universal performance benchmark for indian languages. arXiv preprint arXiv:2208.11761.

Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aaron van den Oord. 2020. Learning robust and multilingual speech representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1182–1192, Online. Association for Computational Linguistics.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. Samu-xlsr: Semantically-aligned multimodal utterance-level cross-lingual speech representation. IEEE Journal of Selected Topics in Signal Processing, pages 1–13.

Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. arXiv preprint arXiv:2011.00747.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv:2010.12829.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. Mls: A large-scale multilingual dataset for speech research. arXiv:2012.03411.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision.

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. 2021. SUPERB: Speech Processing Universal PERformance Benchmark. In Interspeech, pages 1194–1198.

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. 2022. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6182–6186. IEEE.
BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource
Speech Translation Task
Duration in hours (number of utterances)

Dataset   Language   Training            Dev           Test
GV        hi         97.9 (37,152)       4.9 (1885)    2.8 (1032)
ILC       mr         109.2 (92,471)      -             -
MCV       hi         5.3 (4481)          2.8 (2179)    4.1 (2962)
MCV       mr         12.0 (7321)         3.0 (1678)    3.2 (1827)
MUCS      hi         95.1 (99,925)       5.6 (3843)    -
MUCS      mr         93.8 (79,432)       5.0 (4675)    -
MSSC      mr         3.0 (1569)          -             -
SL        hi         1478.6 (764,237)    -             -
SL        mr         894.8 (466,203)     -             -
Total     hi         1676.8 (898,369)    13.3 (7895)   6.9 (3994)
Total     mr         1112.8 (638,159)    8.0 (6353)    3.2 (1827)

Table 1: Statistics of the data used for training ASR systems. The dev and test splits are only used for internal evaluation of the ASR systems.
Duration in hours (# utterances)

Training       Dev           Test
15.9 (7990)    3.7 (2103)    4.4 (2164)

Table 3: Statistics of Marathi-Hindi IWSLT2023 speech translation data.

Figure 1: End-to-end framework for speech translation. x is the input speech (features), z is the target text translation. (Diagram blocks: Encoder, CTC, Decoder; losses Lctc(x, z) and Latt(x, z).)

3.3 MT
The MT model is a transformer based seq2seq model initialized from XLM. Both the encoder and decoder parameters are initialized from the XLM encoder, except for the cross-attention parameters in the decoder that are randomly initialized. The model is then fine-tuned on the same 1.6 M parallel sentences with a batch size of 64 and a maximum of 1000 epochs. The model achieved 23.0 and 22.6 BLEU scores on the internal valid and test sets (Table 2), respectively.
3.4 LM for re-scoring
For Hindi, we used an LSTM of three layers of 4096 units each, with no dropout. The model was trained on 217 M sub-word tokens obtained by tokenizing the monolingual Hindi corpus into a 10k Unigram vocabulary (Kudo, 2018). The model achieved a validation perplexity of 46. Thereafter, we fine-tuned it on text data from Shrutilipi (SL) for 500 steps.
For Marathi, we used an LSTM of 2 layers of 2048 units each, again with no dropout. This model also utilized a 10k Unigram vocabulary and was trained on 8.2 M tokens. This model achieved a validation perplexity of 120.

4 Speech translation systems
Here, we briefly describe both the end-to-end and cascade systems.

4.1 End-to-end
The E2E models are initialized from pre-trained ASR models. We use both the encoder and decoder from the ASR, as it provides a better initialization since the representations from the encoder are readily compatible with the decoder (Bansal et al., 2019). The model is then trained for direct speech translation, with the auxiliary CTC objective also for translation (Zhang et al., 2022; Yan et al., 2023; Kesiraju et al., 2023).

    Lst = λ · Lctc(x, z) + (1 − λ) · Latt(x, z)    (2)

The effect of various initializations and their influence on downstream speech translation is discussed later in Section 5.
The E2E speech translation was also trained using the ESPnet toolkit. Our changes to the original toolkit, along with the training recipes, are available online.9
A beam search based joint decoding (Karita et al., 2019) is used that relies on the weighted average of log-likelihoods from both the CTC and transformer decoder modules, and produces the most likely hypotheses according to

    ẑ = arg max_z [ β · log pctc(z | x) + (1 − β) · log patt(z | x) ]    (3)

We found λ = {0.1, 0.3} and β = {0.1, 0.3} suitable for joint training and decoding, respectively.

4.2 Cascade systems
For the cascade speech translation systems, we first decode n-best hypotheses from the ASR model and obtain the 1-best from the Marathi LM rescorer. These are then passed directly to the MT system, which gives us n-best translation hypotheses in the target language Hindi. These are then re-scored by the Hindi LM to give us the 1-best translation hypotheses.

9 https://github.com/BUTSpeechFIT/espnet/tree/main/egs2/iwslt23_low_resource/st1
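As a concrete illustration of the joint objective in Eq. (2) and the joint decoding score in Eq. (3), the sketch below combines CTC and attention terms with the interpolation weights λ and β. The tensor shapes and helper signatures are assumptions for illustration only; they are not the BUT training recipe.

```python
import torch
import torch.nn.functional as F

def joint_st_loss(ctc_log_probs, att_logits, targets, target_lens, lam=0.3, pad_id=0):
    """Eq. (2): L_st = lam * L_ctc + (1 - lam) * L_att (illustrative shapes).

    ctc_log_probs: (T, B, V) log-probabilities from the encoder CTC head
    att_logits:    (B, L, V) logits from the attention decoder
    targets:       (B, L) target token ids, padded with pad_id
    """
    input_lens = torch.full((targets.size(0),), ctc_log_probs.size(0), dtype=torch.long)
    l_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens, zero_infinity=True)
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=pad_id)
    return lam * l_ctc + (1 - lam) * l_att

def joint_decode_score(ctc_score, att_score, beta=0.3):
    """Eq. (3): per-hypothesis joint score used to rank beam candidates."""
    return beta * ctc_score + (1 - beta) * att_score
```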
Model   Training data (hrs)   Model type           Sub-word vocab per language   Dev WER (mr / hi)   Test WER (mr / hi)
H1      198†                  Mono (hi)            1000                          -    / 30.7         -    / 35.9
H2      1676                  Mono (hi)            8000                          -    / 24.7         -    / 28.4
M1      218†                  Mono (mr)            1000                          14.3 / -            42.4 / -
M2      1112                  Mono (mr)            8000                          19.0 / -            36.0 / -
B1      416†                  Bilingual (mr, hi)   1000                          11.1 / 31.5         31.9 / 35.1
B2      2789                  Bilingual (mr, hi)   8000                          16.0 / 24.2         23.7 / 26.9

Table 4: Word-error-rates (WER) of various mono and bilingual ASR systems, trained on various amounts of data. † implies that the training data contains everything from Table 1 except Shrutilipi (SL).
A further fine-tuning of the MT system using 1-best hypotheses from the Marathi to Hindi IWSLT training set did not improve the results. Due to time constraints, we did not try various strategies (Bentivogli et al., 2021) or hyperparameter tuning for the cascade systems.

4.3 Re-scoring n-best hypotheses
We have utilized the language models to re-score up to 100-best hypotheses in both languages. Using BrnoLM10, we have introduced the language model scores. Here, we have tuned two hyperparameters: the weight of the LM score (additive to the 1.0 weight of the acoustic system) and an insertion bonus, added for each token of the hypothesis in the LM tokenization. For the E2E system, we achieved optimal results with LM weight 1.2 and insertion bonus 5.5. For the Marathi ASR in the cascade system, the optimal setting was 0.3 and 3.5. For the translation system in the cascade, we did not achieve any improvement by re-scoring the output with the Hindi LM.

5 Results and analysis
Here, we present the performance of various backbone models, along with analysis showing the effectiveness of various factors such as initializations, data augmentation, auxiliary objectives and joint decoding.

5.1 Performance of ASR systems
From Table 4 we can see that the bilingual models (B1, B2) perform better than their monolingual counterparts (H1, M1, H2, M2). Here, H1, M1 and B1 are smaller models with dmodel = 256, whereas H2, M2 and B2 are bigger ones with dmodel = 512. All the ASR models were trained with joint CTC and attention loss, where a CTC weight of 0.3 was found to be optimal. The same weight was used during joint decoding. Since we retained the original punctuation in the text, the WER is slightly affected.

5.2 Performance of ST
Here we present the results of speech translation systems based on the end-to-end architecture. As shown in Table 5, all the ST models were initialized either from mono or bilingual ASR systems and fine-tuned using the speech translation data (with or without data augmentation). While most of these systems can be considered direct end-to-end, using an external LM for re-scoring the n-best makes an exception. Using a Marathi monolingual ASR model would be suboptimal because the internal language model represented in the decoder of the ASR would not be suitable for generating linguistically acceptable text sequences in Hindi.
Fig. 2 shows the effect of CTC weight during joint training and decoding. We can see that 0.3 is the optimal weight both for training and decoding. Since we have a separate vocabulary for each of the two languages, the posterior probabilities from CTC during joint decoding will only correspond to tokens from the target language Hindi. This is important: since both languages come from the same family with high phonetic similarity and use the same Devanagari script, the non-autoregressive CTC decoder does not accidentally provide higher scores for tokens from the source language Marathi. The latter scenario can happen when using a joint sub-word vocabulary for both languages.

10 https://github.com/BUTSpeechFIT/BrnoLM
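To make the n-best re-scoring of Section 4.3 concrete, the sketch below re-ranks hypotheses by adding a weighted LM score and a per-token insertion bonus to the system score. The hypothesis structure and the lm_score callable are illustrative assumptions, not the BrnoLM API.

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=1.2, insertion_bonus=5.5):
    """Re-rank n-best hypotheses.

    hypotheses: list of (tokens, system_score) pairs, where system_score is the
                log-probability assigned by the E2E/ASR/MT system (weight 1.0).
    lm_score:   callable returning the LM log-probability of a token sequence
                (placeholder for an external LM such as BrnoLM).
    """
    def total(hyp):
        tokens, system_score = hyp
        return (system_score
                + lm_weight * lm_score(tokens)
                + insertion_bonus * len(tokens))
    return sorted(hypotheses, key=total, reverse=True)

# Usage: best_tokens, best_score = rescore_nbest(nbest, my_lm.log_prob)[0]
```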
ST Model initialization   Speed perturb   Dev BLEU   Dev CHRF2
H1                        ✗               16.3       45.0
H2                        ✓               24.9       51.0
Cascade                   -               21.7       48.2

Table 5: Speech translation results on the Marathi - Hindi dev set. All the ST models are fine-tuned on training data from Table 3.

Figure 2: Effect of hyperparameters in joint training and decoding for direct speech translation. The model is initialized from B2 and trained on augmented training data. (Plot: BLEU on the dev set versus the CTC weight β during joint decoding, for training weights λ = 0.1 and λ = 0.3.)
The sacrebleu library (Post, 2018) was used to compute BLEU and CHRF2 scores on the dev sets.
From Table 5, we can see that independent improvements come from using bilingual ASR trained on more data, data augmentation (speed perturbation) and LM re-scoring. In the case of the cascade system, the LM re-scoring did not improve the results. We believe this is because the Marathi LM was trained on a much smaller amount of data (400K sentences). We plan to rerun these experiments in the near future.
Finally, our primary submission was based on B2 + ST fine-tuning with data augmentation + LM re-scoring, which obtained 39.6 BLEU and 63.3 CHRF2 scores on the official test set. Our contrastive system was based on B2 + MT + LM re-scoring, which obtained 28.6 BLEU and 54.4 CHRF2 scores.

6 Conclusions
In this paper, we presented the systems submitted to the IWSLT 2023 Marathi-Hindi low resource track. Our main efforts were along the end-to-end direct speech translation system, initialized from a bilingual ASR. The model was jointly trained with CTC and attention objectives directly for translation. The joint decoding provided additional benefits. These strategies, combined with speed perturbation for data augmentation and re-scoring the n-best hypotheses using an external LM, provided further significant improvements. We also submitted a cascade system which uses the same bilingual ASR as the backbone, followed by an MT system. Both systems performed competitively, while the one based on end-to-end provided superior results in terms of BLEU. It is yet to be investigated whether large pre-trained MT systems would close the gap between cascade and end-to-end systems.
Guillaume Lample and Alexis Conneau. 2019. Cross-
lingual language model pretraining. CoRR,
abs/1901.07291.
Tanvina Patel and Odette Scharenborg. 2022. Using
cross-model learnings for the Gram Vaani ASR Chal-
lenge 2022. In Proc. Interspeech 2022, pages 4880–
4884.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš
Burget, Ondřej Glembek, K. Nagendra Goel, Mirko
Hannemann, Petr Motlíček, Yanmin Qian, Petr
Schwarz, Jan Silovský, Georg Stemmer, and Karel
Veselý. 2011. The kaldi speech recognition toolkit.
In Proceedings of ASRU 2011, pages 1–4. IEEE Sig-
nal Processing Society.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth
Bheemaraj, Mayank Jobanputra, Raghavan AK,
Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Ma-
halakshmi J, Divyanshu Kakwani, Navneet Kumar,
Aswin Pradeep, Srihari Nagaraj, Kumar Deepak,
Vivek Raghavan, Anoop Kunchukuttan, Pratyush Ku-
mar, and Mitesh Shantadevi Khapra. 2022. Samanan-
tar: The largest publicly available parallel corpora
collection for 11 indic languages. Transactions of the
Association for Computational Linguistics, 10:145–
162.
Shashank Siripragada, Jerin Philip, Vinay P. Nambood-
iri, and C V Jawahar. 2020. A multilingual parallel
corpora collection effort for Indian languages. In
Proceedings of the 12th Language Resources and
Evaluation Conference, pages 3743–3751, Marseille,
France. European Language Resources Association.
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki
Hayashi, Jiro Nishitoba, Yuya Unno, Nelson En-
rique Yalta Soplin, Jahn Heymann, Matthew Wiesner,
Nanxin Chen, Adithya Renduchintala, and Tsubasa
Ochiai. 2018. ESPnet: End-to-end speech process-
ing toolkit. In Proceedings of Interspeech, pages
2207–2211.
Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham
Neubig, Florian Metze, Alan W Black, and Shinji
Watanabe. 2023. CTC alignments improve autore-
gressive translation. In Proceedings of the 17th Con-
ference of the European Chapter of the Association
for Computational Linguistics, pages 1623–1639,
Dubrovnik, Croatia. Association for Computational
Linguistics.
Biao Zhang, Barry Haddow, and Rico Sennrich. 2022.
Revisiting End-to-End Speech-to-Text Translation
From Scratch. In International Conference on Ma-
chine Learning, volume 162 of Proc. of Machine
Learning Research, pages 26193–26205. PMLR.
CMU’s IWSLT 2023 Simultaneous Speech Translation System
Brian Yan*1 Jiatong Shi*1 Soumi Maiti1 William Chen1
Xinjian Li1 Yifan Peng2 Siddhant Arora1 Shinji Watanabe1,3
1 Language Technologies Institute, Carnegie Mellon University, USA
2 Electrical and Computer Engineering, Carnegie Mellon University, USA
3 Human Language Technology Center of Excellence, Johns Hopkins University, USA
{byan, jiatongs}@cs.cmu.edu
Figure 1: Offline ST model architecture based on the joint CTC/attention framework with a WavLM front-end and mBART decoder.

Figure 2: Incremental encoding strategy which processes chunks of input speech by re-computing representations corresponding to earlier chunks.
front-end features to train ASR models. In these models, a pre-encoder module (Chang et al., 2021) applies feature dimension down-sampling and a learned weighted combination of WavLM layers before feeding to a Conformer encoder (Gulati et al., 2020). The pre-encoder and encoder modules from ASR are then used to initialize our ST models.
To leverage unpaired text data, we use the mBART decoder (Tang et al., 2020) as an initialization for our ST models. Following (Li et al., 2020), we freeze all feed-forward layers during fine-tuning and use a post-encoder down-sampling layer to reduce the computational load.
We fine-tune our ST models using the following interpolated loss function: L = λ1·LASR_CE + λ2·LASR_CTC + λ3·LST_CE + λ4·LST_CTC. Here, the cross-entropy (CE) losses are used to train attentional decoders. Note that in Figure 1, we omit the ASR attentional decoder and CTC components as these function as training regularizations and do not factor into the inference procedure. We perform fine-tuning on in-domain data consisting primarily of MuST-C (Di Gangi et al., 2019).
To leverage additional in-domain data, we apply MT pseudolabeling to TEDLIUM ASR data (Zhou et al., 2020). We also use the same MT model to apply sequence-level knowledge distillation to the MuST-C data. The MT model is a pre-trained DeltaLM-large (Ma et al., 2021) fine-tuned on the corpora listed in Section 2. The pseudo-labels and distilled sequences were then translated from English to German using a beam size of 10.

3.2 Simultaneous Speech Translation (SST)
We adapt our offline ST model for streaming inference by using a chunk-based processing of input speech. As shown in Figure 2, our scheme uses a fixed duration (e.g. 2 seconds) to compute front-end and encoder representations on chunks of input speech. With each new chunk, we re-compute front-end and encoder representations using the incrementally longer input speech.
To produce incremental translation outputs, we apply several modifications to the offline joint CTC/attention beam search. As shown in Algorithm 1, we run beam search for each chunk of input. Unless we know that the current chunk is the final chunk, we perform end-detection using the heuristics introduced by (Tsunoo et al., 2021).

Algorithm 1 Beam search step with rewinding of unreliable hypotheses on non-final chunks and incremental pruning upon end-detection.
 1: procedure BEAMSTEP(hyps, prevHyps, isFinal)
 2:   newHyps = {}; endDetected = False
 3:   for y1:l−1 ∈ hyps do
 4:     attnCnds = top-k(PAttn(yl | X, y1:l−1), k = p)
 5:     for c ∈ attnCnds do
 6:       y1:l = y1:l−1 ⊕ c
 7:       αCTC = CTCScore(y1:l, X1:T)
 8:       αAttn = AttnScore(y1:l, X1:T)
 9:       β = LengthPen(y1:l)
10:       PBeam(y1:l | X) = αCTC + αAttn + β
11:       newHyps[y1:l] = PBeam(·)
12:       if (!isFinal) and (c is <eos> or repeat) then
13:         endDetected = True
14:         newHyps = prevHyps            ▷ rewind
15:       else if l is maxL then
16:         endDetected = True
17:       end if
18:     end for
19:   end for
20:   if endDetected then                 ▷ incremental pruning
21:     newHyps = top-k(PBeam(·), k = 1)
22:   else                                ▷ standard pruning
23:     newHyps = top-k(PBeam(·), k = b)
24:   end if
25:   return newHyps, endDetected
26: end procedure
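The sketch below illustrates the chunk-based incremental inference loop described above and in Algorithm 1: encoder states are re-computed over the incrementally longer input, a beam step is run, and on end-detection the beam is pruned to the 1-best hypothesis, which becomes the incremental output. The encode and beam_step helpers are placeholders standing in for the ESPnet-ST-v2 components, not their actual API.

```python
def simultaneous_decode(stream_chunks, encode, beam_step, chunk_sec=2.0):
    """Incremental joint CTC/attention decoding over fixed-duration speech chunks.

    stream_chunks: list of waveform chunks (each ~chunk_sec of audio).
    encode:        placeholder for front-end + encoder over the full audio prefix.
    beam_step:     placeholder implementing one Algorithm-1 style beam update,
                   returning (new_hyps, end_detected).
    """
    audio_prefix = []          # all audio received so far
    hyps = [([], 0.0)]         # beam of (tokens, score), seeded with an empty hypothesis
    outputs = []               # committed incremental translations

    for i, chunk in enumerate(stream_chunks):
        audio_prefix.append(chunk)
        is_final = (i == len(stream_chunks) - 1)
        enc = encode(audio_prefix)           # re-encode the incrementally longer prefix
        end_detected = False
        while not end_detected:
            prev_hyps = hyps                 # rewind target if the step detects an end
            hyps, end_detected = beam_step(enc, hyps, prev_hyps, is_final)
        # prune to the 1-best and emit it; the next chunk continues from this hypothesis
        best = max(hyps, key=lambda h: h[1])
        hyps = [best]
        outputs.append(best[0])
    return outputs
```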
MODEL                                                 QUALITY      LATENCY
OFFLINE SPEECH TRANSLATION (ST)                       BLEU ↑       -
Multi-Decoder CTC/Attn (Yan et al., 2023b)            30.1         -
WavLM-mBART CTC/Attn (Ours)                           32.5         -
SIMUL SPEECH TRANSLATION (SST)                        BLEU ↑       AL ↓     LAAL ↓
Time-Sync Blockwise CTC/Attn (Yan et al., 2023b)      26.6         1.93     1.98
WavLM-mBART CTC/Attn (Ours)                           30.4         1.92     1.99
SIMUL SPEECH-TO-SPEECH TRANSLATION (SS2ST)            ASR-BLEU ↑   SO ↓     EO ↓
WavLM-mBART CTC/Attn + VITS (Ours)                    26.7         2.33     5.67
If any of the hypotheses in our beam propose a next candidate which is the special end-of-sequence token or a token which already appeared in the hypothesis, then this strategy determines that the outputs have likely covered all of the available input. At this point, the current hypotheses should be considered unreliable and thus the algorithm rewinds hypotheses to the previous step.
After the end has been detected within the current chunk, we prune the beam to the 1-best hypothesis and select this as our incremental output – this pruning step is necessary to avoid re-translation. When the next input chunk is received, beam search continues from this 1-best hypothesis.

3.3 Simultaneous Speech-to-Speech Translation (S2ST)
The simultaneous S2ST model is created by feeding incremental text outputs to a German text-to-speech model. We use the end-to-end TTS model VITS (Kim et al., 2021) and train a single-speaker German TTS model using the CommonVoice dataset (Ardila et al., 2020). VITS consists of a text encoder, a flow-based stochastic duration predictor from text, a variational auto-encoder for learning latent features from audio, and a generator-discriminator based decoder for generating speech from the latent features. We use characters as input to the TTS model.
We select a suitable speaker from the CommonVoice German dataset and train a single-speaker TTS. As CommonVoice may contain many noisy utterances which can hurt the performance of TTS, we use data selection to obtain a high-quality subset. The data selection process involves identifying the speaker who has the highest number of utterances with high speech quality. To determine the speech quality, we use the speech enhancement metric DNSMOS (Reddy et al., 2021), which provides an estimation of the speech quality. We evaluate the speech quality for the top five speakers with the largest number of utterances. To establish the high-quality subset, we set a threshold of 4.0 for selecting sentences that meet the desired quality level. Based on this criterion, we choose the second speaker, who has approximately 12 hours of high-quality data.
Finally, we combine our trained German TTS model with the SST module during inference. We feed incremental translation text outputs to TTS and synthesize translated speech.

4 Experimental Setup
Our models were developed using the ESPnet-ST-v2 toolkit (Yan et al., 2023b). Our ST/SST model uses WavLM-large as a front-end (Chen et al., 2022). A linear pre-encoder down-samples from 1024 to 80 feature dimensions. Our encoder is a 12-layer Conformer with 1024 attention dim, 8 attention heads, and 2048 linear dim (Gulati et al., 2020). A convolutional post-encoder then down-samples along the length dimension by a factor of 2. Our decoder follows the mBART architecture and we initialize it using the mBART-large-50-many-to-many model (Tang et al., 2020). Our ST CTC branch uses the same 250k vocabulary as the mBART decoder to enable joint decoding. Our TTS model consists of 6 transformer encoder layers for the text encoder, 4 normalizing flow layers for the duration predictor, 16 residual dilated convolutional blocks as the posterior encoder and a multi-period HiFiGAN (Kong et al., 2020) style decoder. We train the VITS model for 400 epochs with the AdamW (Loshchilov and Hutter, 2019) optimizer.
237
onds for SST and 2.5 seconds for SS2ST. For both Acknowledgements
SST and SS2ST we use beam size 5, CTC weight
Brian Yan and Shinji Watanabe are supported by
0.2, and no length penalty/bonus. To account for
the Human Language Technology Center of Ex-
incremental outputs which end in a prefix of a word
cellence. This work used the Extreme Science
rather than a whole word, we delay outputs for scor-
and Engineering Discovery Environment (XSEDE)
ing by 1 token. There are two exceptions to this
(Towns et al., 2014), which is supported by Na-
token delay: if the last token is a valid German
tional Science Foundation grant number ACI-
word or a punctuation, then we do not delay.
1548562; specifically, the Bridges system (Nys-
We evaluate translation quality using BLEU
trom et al., 2015), as part of project cis210027p,
score (Papineni et al., 2002) for ST/SST and ASR-
which is supported by NSF award number ACI-
BLEU score for SS2ST. ST/SST references are
1445606, at the Pittsburgh Supercomputing Center.
case-sensitive and punctuated while SS2ST refer-
This work also used GPUs donated by the NVIDIA
ences are case-insensitive and un-punctuated. The
Corporation.
ASR model used for ASR-BLEU is Whisper-small
(Radford et al., 2022). We evaluate translation la-
tency for SST using average lagging (AL) (Ma References
et al., 2020) and length-adaptive average lagging
Milind Agarwal, Sweta Agrawal, Antonios Anasta-
(LAAL) (Papi et al., 2022). We evaluate translation sopoulos, Ondřej Bojar, Claudia Borg, Marine
latency for SS2ST using start (SO) and end-offset Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda
(EO) (Ma et al., 2020). Chen, William Chen, Khalid Choukri, Alexandra
Chronopoulou, Anna Currey, Thierry Declerck, Qian-
qian Dong, Yannick Estève, Kevin Duh, Marcello
5 Results Federico, Souhir Gahbiche, Barry Haddow, Benjamin
Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Ja-
Table 1 shows the quality and latency of our SST vorský, John Judge, Yasumasa Kano, Tom Ko, Rishu
and SS2ST models as measured on En-De tst- Kumar, Pengwei Li, Xutail Ma, Prashant Mathur,
COMMON. We also show the ST performance of Evgeny Matusov, Paul McNamee, John P. McCrae,
Kenton Murray, Maria Nadejde, Satoshi Nakamura,
our model for reference. As a baseline, we compare Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu,
to the IWSLT-scale ST and SST systems developed Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino,
in Yan et al. (2023b) – our systems show improved Lonneke van der Plas, Peter Polák, Elijah Rippeth,
quality, primarily due to the use of WavLM and Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Se-
bastian Stüker, Katsuhito Sudoh, Yun Tang, Brian
mBART self-supervised representations. Thompson, Kevin Tran, Marco Turchi, Alex Waibel,
From ST to SST, we observe a 6% quality degra- Mingxuan Wang, Shinji Watanabe, and Rodolfo Ze-
dation. Note that the average duration of tst- vallos. 2023. Findings of the IWSLT 2023 Evaluation
COMMON utterances is around 5 seconds, mean- Campaign. In Proceedings of the 20th International
Conference on Spoken Language Translation (IWSLT
ing the corresponding latency gain is 60%. From 2023). Association for Computational Linguistics.
SST to SS2ST, we observe a 12% quality degrada-
tion. Note that both the TTS model and the Whis- Rosana Ardila, Megan Branson, Kelly Davis, Michael
Kohler, Josh Meyer, Michael Henretty, Reuben
per ASR model powering the ASR-BLEU metric Morais, Lindsay Saunders, Francis Tyers, and Gre-
contribute to this gap. gor Weber. 2020. Common voice: A massively-
multilingual speech corpus. In Proceedings of the
6 Conclusion Twelfth Language Resources and Evaluation Confer-
ence, pages 4218–4222, Marseille, France. European
Language Resources Association.
We describe our English to German simultane-
ous speech-to-text and speech-to-speech transla- Xuankai Chang, Takashi Maekaku, Pengcheng Guo,
tion systems for the IWSLT 2023 shared task. We Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subra-
start by building large-scale offline speech-to-text manian, Tianzi Wang, Shu-wen Yang, Yu Tsao,
Hung-yi Lee, et al. 2021. An exploration of self-
systems which leverage self-supervised speech and supervised pretrained representations for end-to-end
text representations. We then adapt these offline speech recognition. In 2021 IEEE Automatic Speech
models for online inference, enabling simultaneous Recognition and Understanding Workshop (ASRU),
speech-to-text translation. Finally, we feed stream- pages 228–235. IEEE.
ing text outputs to a down-stream TTS model, en- Sanyuan Chen, Chengyi Wang, Zhengyang Chen,
abling simultaneous speech-to-speech translation. Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki
238
Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Wavlm: Large-scale self-supervised pre-training for Jing Zhu. 2002. Bleu: a method for automatic evalu-
full stack speech processing. IEEE Journal of Se- ation of machine translation. In Proceedings of the
lected Topics in Signal Processing, 16(6):1505–1518. 40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Pennsylvania, USA. Association for Computational
Matteo Negri, and Marco Turchi. 2019. MuST-C: a Linguistics.
Multilingual Speech Translation Corpus. In Proceed-
ings of the 2019 Conference of the North American Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
Chapter of the Association for Computational Lin- man, Christine McLeavey, and Ilya Sutskever. 2022.
guistics: Human Language Technologies, Volume 1 Robust speech recognition via large-scale weak su-
(Long and Short Papers), pages 2012–2017, Min- pervision. arXiv preprint arXiv:2212.04356.
neapolis, Minnesota. Association for Computational
Linguistics. Chandan KA Reddy, Vishak Gopal, and Ross Cutler.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki 2021. Dnsmos: A non-intrusive perceptual objec-
Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, tive speech quality metric to evaluate noise suppres-
Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. sors. In IEEE International Conference on Acoustics,
2020. Conformer: Convolution-augmented Trans- Speech and Signal Processing (ICASSP), pages 6493–
former for speech recognition. In Proceedings of 6497. IEEE.
Interspeech, pages 5036–5040.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
Conditional variational autoencoder with adversarial gela Fan. 2020. Multilingual translation with extensi-
learning for end-to-end text-to-speech. In Interna- ble multilingual pretraining and finetuning.
tional Conference on Machine Learning. PMLR.
Jörg Tiedemann, Santhosh Thottingal, et al. 2020. Opus-
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. mt–building open translation services for the world.
Hifi-gan: Generative adversarial networks for effi- In Proceedings of the 22nd Annual Conference of
cient and high fidelity speech synthesis. volume 33, the European Association for Machine Translation.
pages 17022–17033. European Association for Machine Translation.
Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing
J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither,
Tang, Juan Pino, Alexei Baevski, Alexis Conneau,
A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka,
and Michael Auli. 2020. Multilingual speech trans-
G. D. Peterson, R. Roskies, J. R. Scott, and
lation with efficient finetuning of pretrained models.
N. Wilkins-Diehr. 2014. Xsede: Accelerating scien-
arXiv preprint arXiv:2010.12829.
tific discovery. Computing in Science & Engineering,
Ilya Loshchilov and Frank Hutter. 2019. Decoupled 16(5):62–74.
weight decay regularization. In International Confer-
ence on Learning Representations. Emiru Tsunoo, Yosuke Kashiwagi, and Shinji Watanabe.
2021. Streaming transformer asr with blockwise
Shuming Ma, Li Dong, Shaohan Huang, Dong- synchronous beam search. In 2021 IEEE Spoken
dong Zhang, Alexandre Muzio, Saksham Singhal, Language Technology Workshop (SLT), pages 22–29.
Hany Hassan Awadalla, Xia Song, and Furu Wei. IEEE.
2021. DeltaLM: Encoder-decoder pre-training for
language generation and translation by augmenting Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R
pretrained multilingual encoders. Hershey, and Tomoki Hayashi. 2017. Hybrid
ctc/attention architecture for end-to-end speech recog-
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, nition. IEEE Journal of Selected Topics in Signal
Jiatao Gu, and Juan Pino. 2020. Simuleval: An eval- Processing, 11(8):1240–1253.
uation toolkit for simultaneous translation. In Pro-
ceedings of the EMNLP. Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham
Nicholas A Nystrom, Michael J Levine, Ralph Z Neubig, Florian Metze, Alan W Black, and Shinji
Roskies, and J Ray Scott. 2015. Bridges: a uniquely Watanabe. 2023a. CTC alignments improve autore-
flexible hpc resource for new communities and data gressive translation. In Proceedings of the 17th Con-
analytics. In Proceedings of the 2015 XSEDE Confer- ference of the European Chapter of the Association
ence: Scientific Advancements Enabled by Enhanced for Computational Linguistics, pages 1615–1631,
Cyberinfrastructure, pages 1–8. Dubrovnik, Croatia. Association for Computational
Linguistics.
Sara Papi, Marco Gaido, Matteo Negri, and Marco
Turchi. 2022. Over-generation cannot be rewarded: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma,
Length-adaptive average lagging for simultaneous Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick
speech translation. In Proceedings of the Third Work- Fernandes, Dan Berrebbi, Tomoki Hayashi, et al.
shop on Automatic Simultaneous Translation, pages 2023b. Espnet-st-v2: Multipurpose spoken language
12–17. translation toolkit. arXiv preprint arXiv:2304.04596.
239
Wei Zhou, Wilfried Michel, Kazuki Irie, Markus Kitza,
Ralf Schlüter, and Hermann Ney. 2020. The rwth asr
system for ted-lium release 2: Improving hybrid hmm
with specaugment. In ICASSP 2020-2020 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 7839–7843. IEEE.
240
Improving Low Resource Speech Translation with Data Augmentation and
Ensemble Strategies
for Tmh→Fra with audio stretching (Yang et al., 2021).
• The baseline model for Tmh→Fra is trained with a back-translation corpus generated using the NLLB-200 machine translation model (Team et al., 2022).
• For Tmh→Fra, we build a separate training corpus of paraphrases and show that model performance improves when trained on this dataset (Bhavsar et al., 2022).
• We show how a weighted cross entropy loss further improves the performance of the Tmh→Fra translation model. The model trained with this loss, additional data generated using paraphrases and audio stretching is shown to perform 5.2% better than the baseline.
• An ensemble of models trained on the above strategies shows the best performance, with a BLEU score that is 17.2% higher than the average BLEU score of the individual models within the ensemble.
• In the case of Mr→Hi, our best independent ensemble model shows a 23% improvement over the average BLEU score of the individual models within the ensemble.
Apart from these contributions, we also explore post-processing techniques with large language models (LLMs), focusing on re-ranking generated translations (Kannan et al., 2018), correcting the grammar of translations and masking tokens so that the LLM can complete the translated sentence. These methods, though, did not yield any noticeable improvement.
The paper is organized as follows: Section 2 describes our speech translation system, Section 3.1 has details about the datasets for various language pairs, Section 3.2 contains the analysis of our experimental results and we finally conclude in Section 4.

2 Speech Translation System

2.1 Baseline Model
Our base model for the Tmh→Fra ST task is an end-to-end speech translation system which employs an encoder-decoder architecture (Vaswani et al., 2017). We initialize the audio feature extractor and the 6-layer transformer encoder from a pretrained wav2vec 2.0 base model (Baevski et al., 2020). We reuse the wav2vec 2.0 model pretrained on 243 hours of Tamasheq audio data released by the ON-TRAC Consortium Systems (Boito et al., 2022b).
During initialization, the last 6 layers of the pretrained wav2vec 2.0 model are discarded. We use a shallow decoder which consists of 2 transformer layers with 4 attention heads. Between the encoder and decoder, we use one feed-forward layer to match the dimension of the encoder output and the decoder input.
During training, the model directly performs the speech to text translation task without generating intermediate source language text. The training loss is the cross entropy loss between ground truth and hypothesis with label smoothing of 0.1. Each experiment is trained for 200 epochs and checkpoints are selected based on the best validation BLEU.
For the Marathi-Hindi speech-to-text (ST) model, we chose a Wav2Vec 2.0 base model finetuned on 960 h of English speech (Baevski et al., 2020) as the encoder baseline. We also used the same encoder model finetuned on 94 hours of Marathi audio data (Chadha et al., 2022) in our experiments. For these models, the last 6 layers of the pretrained models were discarded, while the decoder architecture and other hyperparameters were kept the same as the Tmh→Fra models.4 For the audio encoder, we also experimented with the Wav2vec 2.0 XLS-R 0.3B model (Babu et al., 2021) and another XLS-R 0.3B model specifically finetuned on Marathi audio (Bhattacharjee, 2022). Because the XLS-R base model was trained on audio from a range of Indian languages including Marathi and Hindi, we chose to incorporate XLS-R in our experimentation. For the XLS-R based models, we utilized the first 12 out of 24 encoder layers to initialize the encoder, followed by a linear projection layer to transform the output features of 1024 dimensions to the desired decoder dimensionality of 256. We trained all Marathi-Hindi ST models for 300 epochs and we chose the best checkpoint based on the validation BLEU score.

2.2 Data Augmentation

2.2.1 Audio Stretching
We apply audio stretching directly on waveform data using the torchaudio library (Yang et al., 2021).5 For each audio sample, we alter the speed of the audio with a rate uniformly sampled from [0.8, 1.2] with a probability of 0.8 while maintaining the audio sample rate.

4 Detailed hyperparameters used can be found in A.1.
5 https://github.com/pytorch/audio
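As a rough illustration of the audio stretching step, the sketch below applies sox-style speed perturbation through torchaudio, drawing the rate from [0.8, 1.2] and applying it with probability 0.8 while keeping the original sample rate. This is one plausible way to implement the described augmentation, not necessarily the authors' exact code.

```python
import random
import torch
import torchaudio

def stretch_audio(wav: torch.Tensor, sample_rate: int,
                  low: float = 0.8, high: float = 1.2, p: float = 0.8) -> torch.Tensor:
    """Randomly change the speed of a waveform (shape: [channels, time]).

    With probability p, a speed factor is sampled uniformly from [low, high];
    the "rate" effect resamples back so the output keeps the original sample rate.
    """
    if random.random() > p:
        return wav
    rate = random.uniform(low, high)
    effects = [["speed", f"{rate:.3f}"], ["rate", str(sample_rate)]]
    stretched, _ = torchaudio.sox_effects.apply_effects_tensor(wav, sample_rate, effects)
    return stretched

# Usage: wav, sr = torchaudio.load("sample.wav"); augmented = stretch_audio(wav, sr)
```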
2.2.2 Back-Translation
We use the NLLB-200 machine translation model to generate variations of the target text in French (Team et al., 2022). The original French data is first translated into English, and then translated back into French. For French to English translation, only the 1-best prediction is used. For English to French translation, we take the top 5 results with a beam size of 5.
We also tried to generate synthetic transcriptions of the Tamasheq audio by translating the French text into Tamasheq. However, we noticed that the translation quality was unstable and decided not to use it for the experiment.

2.2.3 Paraphrasing
We use a French paraphrase model (Bhavsar, 2022), which is a fine-tuned version of the mBART model (Liu et al., 2020), to generate variations of the target text in French. We take the top 5 paraphrases using beam search with a beam size of 5.

2.2.4 Weighted Loss
As the quality of synthetically generated sentences varies, we apply a sentence-level weight to the corresponding sample's cross entropy loss during training:

    l = Σ_{i=1}^{N} w_i · CE(y_i, ŷ_i)    (1)

where N is the size of the corpus, and y_i, ŷ_i, w_i are the ground truth, prediction, and loss weight for sample i, respectively. For back-translation data, the weights are directly taken from the prediction score of NLLB-200. For paraphrasing data, we calculate the perplexity of each generated paraphrase and then take the exponential of the perplexity as the weight. For the original training data (clean and full), weights are set to 1.

2.3 Ensemble Model
Ensemble decoding (Liu et al., 2018; Zhang and Ao, 2022) is a method of combining probability values generated by multiple models while decoding the next token. We provide equal weight to N different ensemble models as shown in Eq. (2):

    log P(y_t | x, y_{1...t−1}) = (1/N) Σ_{i=1}^{N} log P_{θ_i}(y_t | x, y_{1...t−1})    (2)

where y_t denotes the decoded token at time t, x denotes the input and θ_i denotes the i-th model in the ensemble.
We apply the following ensemble decoding strategies:
• Independent ensemble: we ensemble checkpoints having the highest BLEU scores on the validation set, over N training runs. The N different models have the same architecture, but are initialized with different seed values.
• Data-augmented ensemble: we ensemble checkpoints having the highest BLEU scores on the validation set, over N training runs. The N different models have the same architecture, but are trained with different data augmentation strategies.
We additionally attempt a checkpoint ensemble, where N different checkpoints having the highest validation BLEU within the same training run are ensembled. Since we notice only marginal improvements with the checkpoint ensemble, we decide not to explore it in depth for our experiments.

2.4 Post Processing with LLMs
We further explore a set of post-processing strategies by leveraging large language models (LLMs) to 1) re-rank the top-k generated samples; 2) correct the grammar of the output; and 3) guess the missing tokens of the sentence. The strategy is based on the observation that translation outputs from the validation set often carry incomplete sentences and broken grammar. We found that LLMs are a good fit to address this problem as they have brought promising improvements in sentence re-ranking and rewriting tasks (Liu et al., 2023). We summarize our proposed strategies as follows:

2.4.1 Re-ranking
The re-ranking approach takes the top 5 results from the best-performing candidate, and re-ranks these outputs with language models. We first explore performing shallow fusion (Kannan et al., 2018) with a language model (GPT2-Fr).6 Additionally, we leverage an LLM (French finetuned-Alpaca 7B7) to guess the most probable sentence from a radio broadcast news setting with the prompt:

    quelle phrase est plus susceptible d'apparaître dans un journal télévisé
    (which sentence is most likely to appear in a TV news broadcast)

6 https://github.com/aquadzn/gpt2-french
7 https://github.com/bofenghuang/vigogne
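Returning to the equal-weight ensemble of Eq. (2) in Section 2.3, the following sketch shows how it can be realized in a greedy decoding loop: at each step the per-model log-probabilities for the next token are averaged before selecting the token. The model interface (a callable returning next-token log-probabilities given the input and the prefix) is an assumption for illustration; in beam search the same averaged scores would feed the beam expansion.

```python
import torch

def ensemble_greedy_decode(models, x, bos_id, eos_id, max_len=200):
    """Equal-weight ensemble decoding (Eq. 2), greedy variant.

    models: list of callables; models[i](x, prefix) -> log-probabilities over the
            vocabulary for the next token (shape: [vocab_size]).  This interface
            is a placeholder for the actual decoder forward pass.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        # average log-probabilities over the N ensemble members
        avg_log_probs = torch.stack([m(x, prefix) for m in models]).mean(dim=0)
        next_token = int(avg_log_probs.argmax())
        prefix.append(next_token)
        if next_token == eos_id:
            break
    return prefix[1:]
```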
2.4.2 Sentence Correction
The sentence correction approach rewrites the whole output prediction by correcting the grammatical and spelling errors. We use two LLMs for this task - the aforementioned Alpaca model and Bloom 7B - with the following prompt:8

    Corrigez la faute de frappe et la grammaire de la phrase sans changer la structure
    (correct the typo and the grammar of the sentence without changing the structure)

2.4.3 Token Masking
The token masking approach first masks the translation output with <blank> tokens for out-of-vocabulary (OOV) tokens. For example, the predicted output "...Les questions sont [pi];." is replaced with "<blank> Les questions sont <blank>.", where [pi] is a common token we observed in the prediction output that does not carry meaning. We then apply the prompt shown in Table 5 to let the LLMs complete the sentence.

3.1.2 Marathi-Hindi Corpus
For Marathi-Hindi we use the data from Panlingua (2023) containing approximately 25 hours of speech. The audio recordings are sourced from the news domain. The statistics of the dataset are shown in Table 3.

Data Split   Hours   # Utterances
train        16      7,990
valid        3.7     2,103
test         4.5     2,164

Table 3: Data statistics for the mr→hi corpus. Hours shows the number of hours of audio samples available while # Utterances is the associated number of utterances.

3.2 Experimental Results
In this section, we compare the effects of data augmentation, ensembling and post-processing strategies on the tmh→fra task on the test 2022 dataset. We additionally compare results on the mr→hi task on the validation dataset.
Table 1: Impact of Data Augmentation on tmh→fra models. The table shows the BLEU scores for different strategies in comparison to the baseline trained on the clean and full dataset. Back-Translation + audio stretching and Paraphrase dataset augmentation improve the BLEU score. Back-Translation alone can improve model performance when combined with a weighted loss.

Table 4: Impact of Post Processing on the tmh→fra corpus. The post-processing steps outlined are applied to an Ensembled Wav2Vec2 model. The post-processing with an LLM does not provide any additional benefit.
Re-ranking
  Instruct: quelle phrase est plus susceptible d'apparaître dans un journal télévisé
  Input: top-k hypotheses
  Output: best hypothesis picked by the LLM

Token Masking
  Instruct: complétez la phrase en remplaçant les jetons <blank>?
  Input: Donc, on dirait que l'organisation de l'UENA, elle est <blank>
  Output: Donc, on dirait que l'organisation de l'UENA, elle est un organisme de bienfaits

Sentence Correction
  Instruct: Corrigez la faute de frappe et la grammaire de la phrase sans changer la structure
  Input: Les a été libérés et ceux qui sont rentrés.
  Output: Ils ont été libéré et ceux rentrant.

Table 5: Prompt Designs. Example LLM prompts for post-processing the tmh→fra corpus.
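To illustrate how prompts such as those in Table 5 could be applied programmatically, the sketch below wraps a hypothesis in the sentence-correction instruction and sends it to a causal LLM via Hugging Face transformers. The model name is a placeholder and the "Phrase:"/"Correction:" framing is an assumption mirroring Table 5; this is an illustrative sketch rather than the authors' pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/french-instruction-llm"  # placeholder, not a specific released model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def correct_sentence(hypothesis: str) -> str:
    """Apply the sentence-correction instruction from Table 5 to one hypothesis."""
    prompt = (
        "Corrigez la faute de frappe et la grammaire de la phrase sans changer la structure\n"
        f"Phrase: {hypothesis}\n"
        "Correction:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # keep only the text generated after the prompt (approximate for a sketch)
    return text[len(prompt):].strip()
```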
Ensemble Models (Refer Table 1)   Ensemble Type                       Test2022 BLEU
cb-ensemble                       Independent                         10.32
fb-ensemble                       Independent                         10.79
ft+ftw+fta+ftaw                   Data Augmented Back-translation     10.95
fp+fpw+fpa+fpaw                   Data Augmented Paraphrase           11.26

Number of models   Avg Test BLEU
4                  10.83
3                  10.60
2                  10.23
1 (No Ensemble)    9.24

Table 6: Impact of Ensembling tmh→fra ST models. Ensembling models trained with different seeds increases the BLEU score. Increasing the number of models in the ensemble also increases performance.

leads to significant performance degradation compared to the ensemble baseline. We attribute this observation to the fact that the pretrained LLMs lack context-specific data of the Tamasheq corpus. For example, when asked to correct the output sentence, LLMs tend to re-frame the phrases towards more generic topics like sports or events.
Second, we find that the re-ranking and token masking strategies both lead to a slight degradation compared to the baseline. This is due to the fact that both approaches make less aggressive changes to the original output. In general, we find LLMs do not perform well when the predicted text deviates too much from the ground truth.
Finally, we perform the same set of strategies but using the translated English output from the original French translation. We present the best performing candidates (Translation+Reranking in Table 4). We find that this strategy caused the worst performance degradation due to error propagation
#        Model                          Vocab size   Validation BLEU
mwb      wav2vec2-base-960h             1k           11.41
mwbm1k   wav2vec2-base-marathi          1k           13.19
mwbm3k   wav2vec2-base-marathi          3k           11.85
mwx      wav2vec2-xls-r-300m            1k           15.94
mwxm     wav2vec2-xls-r-300m-marathi    1k           10.76

Table 7: Model performance on the mr→hi task. Average BLEU scores are shown for the models which we trained with multiple seeds. Moving to the XLS-R model as the encoder improved BLEU by 40% over the baseline. Complete results are in Table 13.
vallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loic Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Esteve, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, David Javorsky, Vera Kloudova, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stuker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alex Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In IWSLT 2022.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.

Joydeep Bhattacharjee. 2022. XLS-R Marathi pre-trained model. https://huggingface.co/infinitejoy/wav2vec2-large-xls-r-300m-marathi-cv8. Accessed: 2023-04-15.

Nidhir Bhavsar. 2022. French paraphrase model. https://huggingface.co/enimai/mbart-large-50-paraphrase-finetuned-for-fr. Accessed: 2023-04-12.

Nidhir Bhavsar, Rishikesh Devanathan, Aakash Bhatnagar, Muskaan Singh, Petr Motlicek, and Tirthankar Ghosal. 2022. Team Innovators at SemEval-2022 for task 8: Multi-task training with hyperpartisan and semantic relation for multi-lingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1163–1170, Seattle, United States. Association for Computational Linguistics.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022a. Speech resources in the Tamasheq language. Language Resources and Evaluation Conference (LREC).

Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, et al. 2022b. ON-TRAC consortium systems for the IWSLT 2022 dialect and low-resource speech translation tasks. IWSLT.

Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, and Vivek Raghavan. 2022. Vakyansh: ASR toolkit for low resource Indic languages.

Yao-Fei Cheng, Hung-Shin Lee, and Hsin-Min Wang. 2021. AlloST: Low-resource speech translation without source transcription. arXiv preprint arXiv:2105.00171.

Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhijeng Chen, and Rohit Prabhavalkar. 2018. An analysis of incorporating an external language model into a sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5828. IEEE.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Yuchen Liu, Long Zhou, Yining Wang, Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. A comparable study on model averaging, ensembling and reranking in NMT. In Natural Language Processing and Chinese Computing.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Chenggang Mi, Lei Xie, and Yanning Zhang. 2022. Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing. Neural Networks, 148:194–205.

Language Processing LLP Panlingua. 2023. Dataset for Marathi-Hindi speech translation shared task@IWSLT-2023. Contributor/©holder: Panlingua Language Processing LLP, India and Insight Centre for Data Analytics, Data Science Institute, University of Galway, Ireland.

Mihaela C. Stoian, Sameer Bansal, and Sharon Goldwater. 2020. Analyzing ASR pretraining for low-resource speech-to-text translation. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7909–7913. IEEE.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.
A Appendix
A.1 Hyperparameters and Computing Resource
• encoder
– n layers: 6
– hidden dim: 1024 for mr-hi xls-r model, 768 for
tmh-fra model and other mr-hi model
– n head: 12
– activation: gelu
• decoder
– n layers: 2
– hidden dim: 256
– n head: 4
– activation: gelu
• training
– optimizer: AdamW (Loshchilov and Hutter,
2019)
– lr: 1e-3
– encoder lr: 1e-5
– label smoothing: 0.1
– batch size: 4
• computing resource: AWS g5.12xlarge instance (4x
NVIDIA A10G Tensor Core GPUs)
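For concreteness, the separate learning rates for the pre-trained encoder (1e-5) and the rest of the model (1e-3) can be realized with optimizer parameter groups. The sketch below is illustrative only: the tiny stand-in modules are ours and do not reflect the actual architecture or training code.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the ST model: a pre-trained encoder fine-tuned with a
# small learning rate and a shallow decoder trained with a larger one.
class TinySTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 768)    # placeholder for the wav2vec2 encoder
        self.decoder = nn.Linear(768, 1000)  # placeholder for the 2-layer decoder

model = TinySTModel()
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-5},  # encoder lr
        {"params": model.decoder.parameters(), "lr": 1e-3},  # base lr
    ]
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1
```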
A.2 Full Results
# Data Data Augmentation Vocab size Loss Seed Test2022 BLEU
cb1 clean baseline 1k baseline v1 8.98
cb2 clean baseline 1k baseline v2 8.91
cb3 clean baseline 1k baseline v3 8.82
cb4 clean baseline 1k baseline v4 8.69
fb1 full baseline 1k baseline v1 9.53
fb2 full baseline 1k baseline v2 9.10
fb3 full baseline 1k baseline v3 9.21
fb4 full baseline 1k baseline v4 9.17
# Model Vocab size Seed Validation BLEU
mwbm1k1 wav2vec2-base-marathi 1k v1 13.19
mwbm1k2 wav2vec2-base-marathi 1k v2 13.15
mwbm1k3 wav2vec2-base-marathi 1k v3 13.39
mwbm1k4 wav2vec2-base-marathi 1k v4 13.01
mwbm3k1 wav2vec2-base-marathi 3k v1 11.63
mwbm3k2 wav2vec2-base-marathi 3k v2 11.71
mwbm3k3 wav2vec2-base-marathi 3k v3 11.80
mwbm3k4 wav2vec2-base-marathi 3k v4 12.26
mwx1 wav2vec2-xls-r-300m 1k v1 16.31
mwx2 wav2vec2-xls-r-300m 1k v2 15.35
mwx3 wav2vec2-xls-r-300m 1k v4 16.09
mwx4 wav2vec2-xls-r-300m 1k v4 16.00
Speech Translation with Style: AppTek’s Submissions to the IWSLT
Subtitling and Formality Tracks in 2023
Abstract
AppTek participated in the subtitling and formality tracks of the IWSLT 2023 evaluation. This paper describes the details of our subtitling pipeline - speech segmentation, speech recognition, punctuation prediction and inverse text normalization, text machine translation and direct speech-to-text translation, intelligent line segmentation - and how we make use of the provided subtitling-specific data in training and fine-tuning. The evaluation results show that our final submissions are competitive, in particular outperforming the submissions by other participants by 5% absolute as measured by the SubER subtitle quality metric. For the formality track, we participated with our En-Ru and En-Pt production models, which support formality control via prefix tokens. Except for informal Portuguese, we achieved near perfect formality level accuracy while at the same time offering high general translation quality.

1 Introduction

This paper presents AppTek's submissions to the subtitling and formality tracks of the IWSLT 2023 evaluation campaign. In the subtitling track, we participate in constrained and unconstrained conditions and in both language pairs English-to-German (En-De) and English-to-Spanish (En-Es). In the formality track, we participate in the zero-shot unconstrained condition for English-to-Portuguese (En-Pt) and English-to-Russian (En-Ru).

This paper is organized as follows: Section 2 briefly describes our data preparation. Section 3 presents AppTek's pipeline for subtitle translation. Its different components, namely audio segmentation, speech translation (ST), automatic speech recognition (ASR), machine translation (MT) models, and our subtitle segmentation algorithm are described in Sections 3.1-3.5. Section 3.6 contains experiments and an analysis of our subtitling systems. Section 4 presents AppTek's approach to formality-controlled machine translation. Finally, Section 4.1 shows the results of our formality track submission.

2 Data Preparation

2.1 Text Data

We use all of the allowed "speech-to-text parallel" and "text-parallel" data, including Europarl, Europarl-ST, News Commentary, CORDIS News, Tatoeba, TED2020, IWSLT TED, MuST-C v3, CoVoST v2, and OpenSubtitles¹. We apply common parallel data filtering steps based on language identification, sentence length ratios between source and target sentences and additional heuristics. After filtering, we obtain 13.5M sentence pairs with 152M running words (counted on the English side) for En-De and 16.5M sentence pairs with 183M words for En-Es.

Next, we clone this data and process the En side of the clone with our text normalization tool NEWTN. It implements elaborate regular expressions to convert numbers, dates, monetary amounts, and other entities with digits into their spoken form. It is also used to remove punctuation and word case information. After training on such source data, our MT systems are able to directly translate from raw ASR output that lacks punctuation and casing into properly formatted written target language text.

For the parallel corpora which have document labels, we also create a version in which we concatenate two subsequent sentences from the same document using a separator symbol. Our past experience shows that adding such data is beneficial even if we do not add the context of the previous sentence at inference time.

Finally, for each language pair, we extract about 4M words of bilingual phrases (based on unsupervised word alignment) as additional training "sentence" pairs to make sure that the MT system can cope well with incomplete sentences or too fine-grained automatic sentence segmentation.
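The language-identification and length-ratio filtering mentioned above can be sketched as follows. This is a minimal illustration: `detect_language` stands in for whatever sentence-level language identifier is actually used, and the threshold values are assumptions, not the ones from AppTek's pipeline.

```python
def keep_pair(src: str, tgt: str, detect_language, tgt_lang: str = "de",
              max_ratio: float = 2.0) -> bool:
    """Illustrative parallel-data filter: language ID plus sentence length ratio."""
    if detect_language(src) != "en" or detect_language(tgt) != tgt_lang:
        return False
    src_len = max(len(src.split()), 1)
    tgt_len = max(len(tgt.split()), 1)
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio  # discard pairs with an extreme length mismatch
```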
∗ equal contribution
¹ The filtered version provided by the track organizers.
2.2 Speech Data

We use all the allowed datasets marked as "speech" and "speech-to-text parallel", including Europarl-ST, How2, MuST-C, TED-LIUM, LibriSpeech, Mozilla Common Voice, VoxPopuli, CoVoST, and IWSLT TED. After removing very short (< 0.1s) and long (> 120s) segments, we obtain about 3590 hours of speech with transcripts. From each dataset, we only take the train sets, where applicable. The English text is processed to be lower-cased, punctuation-free using NEWTN, and split into 10k byte-pair-encoding (BPE) tokens (Sennrich et al., 2016).

2.3 Direct Speech Translation Data

All data marked as "speech-to-text parallel", i.e. Europarl-ST, MuST-C, CoVoST, and IWSLT TED – except MuST-Cinema – is utilized for direct speech translation. It results in a total of approximately 1220 hours of speech with transcripts and corresponding translations after only keeping segments between 0.1 and 120 seconds. As for our data processing, on the English text we carried out the same scheme as for speech data, while following almost the same German data processing scheme as described in Section 2.1, plus tokenization using the Moses toolkit (Koehn et al., 2007). Then 10k and 20k BPEs are used on the English and German texts, respectively. The dev set for the direct model is chosen to be the concatenation of the IWSLT dev2010, MuST-C, Europarl-ST, and CoVoST dev sets, resulting in a large dev set of 33 hours.

2.3.1 Synthetic Data

To leverage more training data for our direct model, we translate the English transcripts of the allowed "speech" data (Jia et al., 2019) using our constrained machine translation model described in Section 3.4 with output length control "short" (Wilken and Matusov, 2022). Combining the real ST data with the synthetic data, we obtain about 4100 hours of translated-speech parallel utterances.

3 Subtitle Translation

3.1 Audio Segmentation

We use the SHAS method (Tsiamas et al., 2022) for audio segmentation. SHAS scores every audio frame with a binary classifier (speech/no-speech), followed by a probabilistic divide-and-conquer (pDAC) algorithm that iteratively splits audio at the positions with the lowest probability of the speech class. For the unconstrained condition, we use the English segmentation model published by the authors of SHAS, which is an XLS-R 300M model (Babu et al., 2022) fine-tuned for the frame classification task on the MuST-C train set. For the constrained condition, we train our own frame classifier with Wav2Vec2 (Baevski et al., 2020), pre-trained on LibriSpeech, followed by fine-tuning for the frame classification task using MuST-C.

A hyper-parameter search was conducted to find the number of layers (constrained model), as well as the inference parameters (max. segment length and pDAC threshold) that optimize the performance of the downstream speech translation pipeline. We found that the pDAC threshold, which is the minimum probability required to keep a frame, has significant effects on the translation quality, and that the optimal value can vary depending on the task and acoustic conditions.

3.2 Direct Speech Translation

3.2.1 Attention Encoder-Decoder

We train an attention-based model (Bahdanau et al., 2015) composed of a Conformer encoder (Gulati et al., 2020) and a Transformer decoder (Vaswani et al., 2017). The encoder consists of 12 layers with a size of 512, a feed-forward size of 2048, and 8 heads, whereas the decoder has 6 layers with the same hidden size and number of heads. For fast yet stable convergence, we apply a layer-wise network construction scheme (Zeyer et al., 2018, 2019). Specifically, we start with 2 layers of halved hidden dimensions in both encoder and decoder (18M parameters) and linearly scale the model depth and width to full size (125M parameters) in the first 5 sub-epochs, where each sub-epoch is one-twentieth of the whole training data. Also, L2-norm regularization and dropout are scaled up from 0 to 0.0001 and 0.1, respectively. Label smoothing is enabled only afterwards. We apply Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and dynamic learning rate scheduling based on dev set loss.

Audio log mel 80-dimensional features are extracted every 10ms. The first layer of the Conformer is composed of 2 convolution layers with strides of 3 and 2 over time, giving a reduction factor of 6. We use SpecAugment (Park et al., 2019; Bahar et al., 2019b) and speed perturbation in a random interval of [0.9, 1.1] as data augmentation.
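As an illustration of the front-end just described, an 80-dimensional log-mel feature extractor with a 10 ms hop can be set up with torchaudio as follows. The window length and FFT size are assumptions on our part; the paper only specifies 80 dimensions and the 10 ms frame shift.

```python
import torch
import torchaudio

# 16 kHz audio, 25 ms window (400 samples), 10 ms hop (160 samples), 80 mel bins.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
waveform = torch.randn(1, 16000)             # one second of dummy audio
features = torch.log(mel(waveform) + 1e-10)  # (1, 80, frames) log-mel features
```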
In order to train a single direct speech translation model that also supports time alignment between the source label sequence and time frames, we add the source CTC loss (Graves et al., 2006; Kim et al., 2017; Bahar et al., 2019a) on top of the encoder in training.

We also add a second shallow 1-layer Transformer decoder (with 14M parameters) in order to generate better source transcripts for time alignment. Given this network with a shared speech encoder and two independent decoders, multi-task learning is employed to train all model parameters jointly. The final objective function is computed as a sum of the 3 losses (source CTC, source enc-dec, and target enc-dec).

3.2.2 Forced Alignment

CTC relies on Viterbi alignment to obtain the best path going through the source token at position n at time frame t. It is therefore possible to obtain word timings from CTC which can be used for subtitle generation. To do so, we first generate the source transcripts using the source decoder of the network and then use them to run forced alignment on the CTC output. The model's alignments are on BPE level; we therefore combine the timings of all subwords belonging to a word to obtain the final word-level timestamps.

We experimented with this approach and were able to generate accurate timestamps appropriate for creating subtitles in the source language. However, as we decided against using the source template approach for the constrained systems (see Section 3.5), only the timings of the first and last word in a segment are used for the target subtitles of the constrained submission. We plan to explore how to make better use of the CTC timings from this model in future experiments. In particular, we plan to add silence modeling to obtain information about pauses within speech segments, which can then be reflected in the subtitle timings.

3.3 Automatic Speech Recognition

Constrained We train a Conformer-Transformer model for the constrained task mainly following Section 3.2.1 using 3590 hours of speech. Layer-wise network construction, SpecAugment, and CTC loss are applied. Since the model is not trained for multiple tasks (no additional decoder is added), it has better performance in terms of WER compared to the source decoder part of the ST model. The final checkpoint achieves a WER of 9.6% on the concatenated dev set of 33h.

Unconstrained We train an attention-based encoder-decoder model to run ASR decoding and also a CTC model which is used to generate word timings by force-aligning the audio with the decoded hypotheses. Here, the CTC model uses an explicit word boundary <space> symbol between words. It serves as silence modeling. Both models are trained on the same training set of 15K hours of speech, mixing publicly available data with a commercial license and in-house data.

The 185M-parameter attention-based model uses a 31-layer Conformer encoder of hidden size 384; 8 heads with 64 dimensions per head; Macaron-style (Lu et al., 2019) feed-forward layers with size 2048; convolutional layers with 1024 channels and kernel size 31. The decoder is a single-headed attention-based model (Tüske et al., 2020), and consists of 4 stacked projected long short-term memory (pLSTM) recurrent layers with layer size 2048 (Hochreiter and Schmidhuber, 1997; Sak et al., 2014). The first two LSTMs operate on the embedding of the label sequence only. The other two decoder LSTM layers also process the acoustic information extracted by the encoder using a single-head, additive, location-aware cross-attention. The decoder predicts 1K BPE units. Decoding is done using an external neural LM consisting of 4 stacked LSTM layers of size 3072 with the same output vocabulary as the ASR models. The 273M-parameter language model is trained on 2.4B running words segmented to BPE units. The language model data are selected from a wide range of various domains, e.g. books, movies, news, reviews, Wikipedia, talks, etc. ASR transcription is obtained after decoding with beam search limited to 16 hypotheses without any vocabulary constraints. The CTC model uses the same encoder structure as the attention-based model.

3.4 Machine Translation

3.4.1 Unconstrained Condition

For the unconstrained subtitling pipeline we use AppTek's production MT systems, which have been trained on large amounts of parallel data, mostly from the OPUS collection (Tiedemann, 2012). Both En-De and En-Es systems are Transformer Big systems that support additional API parameters which can in particular control the genre (e.g. patents, news articles, dialogs) and length (automatic, short, long, etc.). The control is implemented via pseudo-tokens in the beginning of the source or target sentence (Matusov et al., 2020).
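To illustrate the pseudo-token mechanism just mentioned, a control token can simply be prepended to the input before translation. The token spellings below are placeholders of our own; the actual tokens used in AppTek's production systems are not public.

```python
def add_control_tokens(source: str, genre: str = "dialogs", length: str = "short") -> str:
    """Prepend hypothetical genre/length pseudo-tokens to a source sentence."""
    return f"<genre_{genre}> <len_{length}> {source}"

print(add_control_tokens("Let's get started with today's workout."))
# -> "<genre_dialogs> <len_short> Let's get started with today's workout."
```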
For the IWSLT experiments, we set the genre to "dialogs" because it reflects best the spoken spontaneous style in the dev 2023 data. When not mentioned otherwise, we set the length to "short". This yields more condensed translations, similar to how human subtitlers would translate to comply with a given reading speed limit.

3.4.2 Constrained Condition

For the constrained condition we use the parallel training data prepared as described in Section 2.1. As the dev data for learning rate control, we use the Europarl-ST and MuST-C dev sets.

Our MT model is a variant of the Transformer Big model (Vaswani et al., 2017) with additional encoder layers and using relative positional encoding (Shaw et al., 2018). We use a batch size of 800 words, but the effective batch size is increased by accumulating gradients over 8 batches. We add the same length control feature as for the unconstrained system by classifying the training data into 5 bins of target-to-source length ratios and adding the class label as a target-side prefix token.

We apply SentencePiece (Kudo and Richardson, 2018) segmentation with a vocabulary size of 10K for En and 20K for De/Es and use a translation factor to predict the casing of the target words (Wilken and Matusov, 2019). Our MT models have been trained for 100 sub-epochs with 1M lines in each; thus, all of the prepared data has been observed in training 1-3 times. For each sub-epoch, we select sentence pairs proportionally to the following distribution and then randomly mix them:

20% Europarl and Europarl-ST data
20% TED data (MuST-C, IWSLT, TED2020)
20% OpenSubtitles (other)
10% News (Commentary+CORDIS), Tatoeba, CoVoST
15% Concatenated neighboring sentence pairs²
 5% OpenSubtitles (documentaries)
 5% OpenSubtitles (sports)
 5% Bilingual phrases

² See Section 2.1.

3.4.3 Length ROVER

For all final submissions, we optimize the length control of MT by using a length ROVER (Wilken and Matusov, 2022). For each segment we create 3 translations: without forcing the target-side length token, forcing length bin 2 ("short"), and forcing length bin 1 ("extra short"). From those translations we select the first – given the order above – that provides a translation with a target-to-source character ratio of less than 1.1. This is motivated by the fact that translations need to be fitted into the source subtitle template (Section 3.5.1). We note that the reading speed compliance of our submission could have been increased even further by exploiting timing information to select the MT length variants.

System          MuST-C   TED    EPTV   ITV    Peloton
English-to-German
unconstrained   33.7     27.1   19.0   30.6   23.9
+ fine-tuning   35.0     27.7   20.3   31.0   24.4
constrained     32.3     34.2   18.4   27.2   20.3
+ fine-tuning   32.9     –      19.0   28.1   21.5
English-to-Spanish
baseline        37.2     46.1   34.1   24.5   23.6
+ fine-tuning   38.2     46.4   34.8   25.5   24.7

Table 1: BLEU scores in % for text-only MT fine-tuning experiments on the MuST-C tst-COMMON set and on AppTek's aligned subsets of the 2023 subtitling track dev data.

3.4.4 Fine-tuning Experiments

For our fine-tuning experiments, we first select "in-domain" training data in terms of similarity to the seed data – the dev 2023 set – from the real parallel data, as well as the synthetic data described in Section 2.3.1. The selection is done by clustering distributed sentence representations in the embedding space, and then keeping sentence pairs from the clusters which correspond to the seed data clusters. This is done considering both source and target seed data sentences, but independently, so that no sentence-level alignment of seed data is necessary. For details on this data selection method, please refer to our 2020 submission to the offline speech translation track (Bahar et al., 2020). With this method, we create two versions of the in-domain data: one using all 4 parts of the dev 2023 set as seed data (in-domain A: En-De: 1.9M lines, 27M En words; En-Es: 1.7M lines, 25M words), and one, for En-De only, using just the ITV and Peloton dev 2023 parts as seed data (in-domain B: 1.5M lines, 20M words).

We then use the dev 2023 set as a dev set in fine-tuning of the MT model for learning rate control. Since the dev 2023 data is not aligned at sentence level, but is available as (in part) independently created subtitle files, we had to sentence-align it. To do so, we first extracted full sentences from the English subtitles based on sentence-final punctuation marks, translated these sentences with the (constrained) baseline MT, and then
re-segmented the target side into sentences that match the source sentences using Levenshtein alignment as implemented by the SubER tool (Wilken et al., 2022). The source-target segments obtained this way are kept in the final dev set only if the BERT F-score (Zhang et al., 2019) for a given pair is > 0.5 for the TED, EPTV, and Peloton sets and > 0.55 for the ITV set. With this method, the obtained dev set contains 7645 sentence-like units with 27.7K words for TED, 2.3K for EPTV, 20.7K for Peloton, and 13.9K for ITV.

We perform fine-tuning for up to 20 sub-epochs ranging in size from 100K to 400K sentence pairs using a small learning rate between 10^-6 and 10^-5, and select the best configuration for each of the four dev 2023 domains.

The fine-tuning results are shown in Table 1. Despite the fact that no real in-domain data, not even the dev 2023 set, is used as training data in fine-tuning, we are able to improve MT quality in terms of BLEU scores (Papineni et al., 2002; Post, 2018), as well as BERT and other scores skipped due to space constraints. The improvements are more pronounced for the constrained system, but the absolute scores are generally better with the unconstrained setup³. However, since the TED talk and Europarl domains are covered well in the data allowed for the constrained condition, the difference between our unconstrained and constrained system for the TED and EPTV domains is small. It is worth noting that for the ITV and Peloton domains we could only improve MT quality by fine-tuning on the in-domain B set that did not include any TED-related data, and also not using any TED or EPTV dev data for learning rate control.

3.5 Subtitle Creation

3.5.1 Source Template Approach

To create subtitle files from translation hypotheses, the text has to be segmented into blocks with start/end time information. One challenge is to transfer timings extracted from the source speech to the target subtitles. An approach to generate timings that is also used in human subtitling workflows (Georgakopoulou, 2019) is to first create subtitles in the source language – a so-called subtitle template – and to keep the same subtitle blocks during translation. This creates a nice viewing experience, since subtitles appear on the screen only during the actual speech. However, the source template constraints might be sub-optimal in terms of target language reading speed.

We use the source template approach for the unconstrained submission. To create subtitles in the original language of the videos (English), we start with a timed word list provided by the ASR system. We train a 3-layer bidirectional LSTM model (hidden size 256, embedding dim 128) to jointly add basic punctuation marks ( .,!? ) and casing information to the word list. As training data, we use 14M English sentences from the Gigaword and OpenSubtitles corpora. The model operates on full words and has two softmax output layers, one with the four punctuation tokens and "no punctuation" as target classes (to be added after the word), the other one with lower-cased, capitalized, all-upper, and mixed-cased classes as targets.

In addition, we train an inverse text normalization model to convert spoken forms of numbers, dates, currencies, etc. into the proper written form. This model is a Transformer Big trained on data where the source data is processed using our text normalization tool NEWTN, see Section 2.1. Applying it to the transcriptions helps MT to produce proper digits also on the target side. This has a slight positive effect on automatic scores (0.8% SubER for Peloton, only up to 0.4% for the other domains), but mainly helps subjectively perceived quality and also reduces the number of characters.

The resulting timed, punctuated, and cased word list is split into sentences using punctuation ( .!? ) and pauses between words longer than 3 seconds. Those are fed into a subtitle segmentation algorithm similar to the one described in (Matusov et al., 2019). Its core component is an LSTM segmentation model that is trained on English OpenSubtitles XML data, which includes subtitle block boundary information⁴, to estimate the probability of a subtitle break after each word of a given input sentence. Within a beam search framework, this model is combined with hard subtitling constraints such as the character limit per line to create valid subtitles. Here, we adjust it for the creation of subtitles from timed words by including minimum and maximum subtitle duration as constraints, and not forcing any predefined number of subtitles.

³ The BLEU score of the constrained system on the En-De TED part is higher because, as we found out shortly before submission, some of the dev 2023 TED talks were part of the allowed TED2020 training corpus. Hence, further fine-tuning did not help for this system on this set. The unconstrained system had not been trained on this corpus.
⁴ https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en.zip
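The sentence-splitting step described above (sentence-final punctuation or a pause longer than 3 seconds) can be sketched as a simple pass over the timed word list. The data layout below (word, start, end) and the function name are ours, not the actual implementation.

```python
def split_into_sentences(timed_words, max_pause=3.0):
    """Split a list of (word, start_sec, end_sec) tuples into sentences at
    sentence-final punctuation marks or at pauses longer than max_pause."""
    sentences, current = [], []
    for i, (word, start, end) in enumerate(timed_words):
        current.append((word, start, end))
        next_start = timed_words[i + 1][1] if i + 1 < len(timed_words) else None
        long_pause = next_start is not None and next_start - end > max_pause
        if word.endswith((".", "!", "?")) or long_pause:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```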
After segmentation, we use the start time of the first word and the end time of the last word in each subtitle block as the subtitle start and end time. The subtitle template defined this way is then translated using the fine-tuned MT system described in Section 3.4.4, employing the length ROVER (Section 3.4.3) to avoid long translations that do not fit the template. Sentences as defined above are used as translation units; note that they may span several subtitle blocks. To insert the translations back into the template, we again apply the subtitle segmentation algorithm, this time with the exact settings as in (Matusov et al., 2019).

3.5.2 Template-Free Approach

By definition, the source template approach is not desirable for direct speech translation without intermediate source text representation. Also, the constrained condition does not include English OpenSubtitles data with subtitle breaks. We hence fall back to a simpler subtitle creation approach for our constrained direct and cascade systems. We use the segments provided by the audio segmenter as translation units. For the cascade system, we translate the transcription of each segment with the fine-tuned constrained MT, also using the length ROVER (Section 3.4.3). End-of-line and end-of-block tokens are inserted into the translated text of each segment using the subtitle segmentation algorithm configured similarly to the case of template creation in the previous section but without duration-based constraints. Timestamps for the additional subtitle block boundaries are then created by linearly interpolating the audio segment timings according to character count ratios. Assuming the translation of an audio segment with start time Tstart and end time Tend is split into N blocks with c_1, ..., c_N characters, respectively, the start time of block n is set to Tstart + (Tend − Tstart) · (Σ_{n′=1..n−1} c_{n′}) / (Σ_{n′=1..N} c_{n′}). This method leads to reasonable timings in most cases but can create temporary time shifts between speech and subtitles inside long audio segments.

3.5.3 Subtitle Post-Processing

To all subtitles, we apply a final post-processing that splits rare cases of subtitles with more than 2 lines (same segmentation method as for the template-free approach) and shifts subtitle end times to later in time if needed to comply with the maximum reading speed of 21 characters per second. The latter is only possible if there is a large enough gap after a given subtitle and will therefore not guarantee low enough reading speed in all cases.

system      TED    EPTV   Peloton   ITV
SHAS 0.31   21.1   14.9   12.1      15.6
SHAS 0.50   22.4   14.9   11.6      13.9
SHAS 0.71   20.8   14.6   10.8      10.7
ASR Segm.   19.8   14.8   11.3      13.5

Table 2: Impact of different segmentation schemes on the translation quality (BLEU in %).

3.6 Results

We first decide which audio segmentation to use based on dev set results using our final ASR and MT unconstrained systems. We set different pDAC thresholds for the unconstrained SHAS (0.31, 0.50, and 0.71) and compare them with an in-house segmenter optimized for ASR. The results in Table 2 show that a low threshold of 0.31 leads to better translations overall. There is, however, variation depending on the domain: it is 1.3 BLEU points worse than SHAS 0.50 on TED, but as good or up to 1.7 BLEU points better in all other domains. Results for ITV are highly sensitive to the threshold. We attribute this to the fact that in TV series speech is often mixed with music and other sounds, and a lower threshold is required not to miss speech segments. Given these results, we use SHAS 0.31 as our segmenter for unconstrained experiments. For the constrained experiments, we use SHAS 0.31 everywhere except on TED with SHAS 0.50.

Table 3 compares the performance of the final constrained cascade (separate ASR + MT) and direct En-De subtitling systems as well as the unconstrained cascade system. All metrics are computed using the SubER tool⁵ (Wilken et al., 2022) directly on subtitle files. To calculate the BLEU and ChrF (Popović, 2015) metrics, it performs an alignment of hypothesis to reference sentences similar to (Matusov et al., 2005). On all metrics, the constrained cascade system outperforms our direct model. We observe imperfections in the direct model's output such as repetitions. This can be partially attributed to the fact that it has been trained jointly for 3 tasks, leading to sub-optimal optimization for the final translation process. The lack of length control of our direct ST model is another reason for the gap between the two constrained systems. For the cascade systems, we find length control via the length ROVER to be crucial, giving consistent improvements of 4 to 5% points in SubER compared to no length control at all.

⁵ https://github.com/apptek/SubER
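As a concrete illustration of the length ROVER selection referred to above (Section 3.4.3), the following sketch picks among the three length-controlled candidates. The fallback to the last (shortest) candidate when none meets the ratio is our assumption; the paper does not specify this case.

```python
def length_rover(source: str, candidates: list[str], max_ratio: float = 1.1) -> str:
    """Pick the first candidate (ordered: no length token, 'short', 'extra short')
    whose target-to-source character ratio is below max_ratio."""
    for candidate in candidates:
        if len(candidate) / max(len(source), 1) < max_ratio:
            return candidate
    return candidates[-1]  # fallback (assumed): keep the shortest variant
```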
system    constr.   SubER (↓)   BLEU   ChrF
TED
cascade   yes       63.0        26.0   53.9
direct    yes       75.9        17.1   47.6
cascade   no        64.3        22.1   51.0
EPTV
cascade   yes       78.7        13.5   45.2
direct    yes       85.1        10.9   42.6
cascade   no        75.8        14.8   44.1
Peloton
cascade   yes       87.6         9.9   32.0
direct    yes       86.1         6.8   26.9
cascade   no        71.9        11.6   34.3
ITV
cascade   yes       83.6         8.5   26.1
direct    yes       90.9         5.7   21.0
cascade   no        71.4        14.8   35.2

Table 3: En-De subtitle translation results in % (constrained and unconstrained setting) on the dev2023 sets.

Domain    SubER (↓)   BLEU (↑)   ChrF (↑)
TED       48.8        37.8       61.8
EPTV      70.2        20.4       50.6
Peloton   79.0        12.2       36.2
ITV       82.1         9.2       26.8

Table 4: Subtitle translation results in % on the dev2023 sets for En-Es via the constrained cascade system.

As seen in Table 3, the unconstrained system outperforms both constrained systems except on the TED set. This is due to a data overlap: some TED talks present in the dev set have also been part of the constrained training data. To analyze the impact of the source template approach, we re-create the subtitles of the unconstrained system using the template-free approach. We find that this deteriorates the SubER scores for TED, Peloton and ITV by 0.7, 3.6 and 3.8% points, respectively, while actually giving better results for EPTV by 0.7%. In general, the results in Table 3 show a higher automatic subtitling quality for the TED domain, which represents the case of well recorded and prepared speech, but also show the need to focus research on harder conditions such as interviews and TV series. Table 4 contains the scores we are able to achieve for En-Es under constrained conditions. Also here, acceptable subtitle quality can only be reached for TED and EPTV content, but not for the more challenging Peloton and ITV content.

4 Formality Control

AppTek's production systems support formality or, as we call it, style control for selected language pairs (Matusov et al., 2020). This year, we decided to test these systems in the unconstrained condition of the IWSLT formality track for En-Pt and En-Ru. Each of these two systems is trained in a Transformer Big setup (Vaswani et al., 2017). The formality level is encoded with a pseudo-token in the beginning of each training source sentence with one of 3 values: formal, informal, no style. The system is trained on large public data from the OPUS collection (Tiedemann, 2012) that has been partitioned into the 3 style classes as follows.

First, we write a sequence of regular expressions for the target language (in this case, European Pt and Ru) which try to match sentences containing formal or informal features. Thus, for Russian, we try to match either the formal or informal second-person pronoun that corresponds to English "you", including their possessive forms. For Portuguese, we additionally match the forms of the most common verbs which agree with the corresponding pronoun. The regex list for Russian is given in Table 5⁶.

Each list of regular expressions uses standard regex syntax and makes either case-sensitive or insensitive matches. For each sentence pair from the parallel data, the regex list is processed from top to bottom. As soon as a match in the target sentence is found, the FORMAL or INFORMAL label is assigned to the sentence pair. The sentence pair is labeled with NO_STYLE if there is no match. If document information is available and at least 5% of the document sentence pairs are labeled as formal/informal according to the regex rules (with no sentences labeled with the opposite class), then all of the sentence pairs in the document are assigned the corresponding label. Such data is useful to model stylistic traits which are not limited to the choice of second-person pronouns. Note that document annotations are available for some of the IWSLT data, including TED talks, OpenSubtitles (each subtitle file corresponds to a document), individual sessions of the European Parliament, etc.

We further smooth the three style classes to ensure that e.g. sentences containing second-person pronouns can be translated well even when no style is specified at inference time. To this end, 5 to 8% of the sentence pairs which had been assigned to one of the 3 style classes as described above are randomly re-assigned to one of the other two classes.

⁶ We released the En-Pt and En-Ru lists of regular expressions as part of our evaluation submission.
INFORMAL IGNORECASE \b(ты|теб[яе]|тобой|тво[йеёяю]|твоей|твоего|твоему|твоим|тво[ёе]м)\b
FORMAL IGNORECASE \b(вы|вами?|ваш[ае]?|вашей|вашего|вашему?|вашу|вас|вашим)\b
Table 5: The regular expressions used to partition En-Ru training data into formal, informal, and (in case of no
match) “no style” classes.
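To make the top-to-bottom labeling procedure concrete, the following is a minimal Python sketch using the En-Ru expressions from Table 5. The function and label names are ours, not AppTek's internal code, and the example sentences are illustrative.

```python
import re

# Regex list processed top to bottom; the first match on the target sentence wins.
# Patterns are taken from Table 5 (En-Ru); both are case-insensitive (IGNORECASE).
RULES = [
    ("INFORMAL", re.compile(r"\b(ты|теб[яе]|тобой|тво[йеёяю]|твоей|твоего|твоему|твоим|тво[ёе]м)\b", re.IGNORECASE)),
    ("FORMAL",   re.compile(r"\b(вы|вами?|ваш[ае]?|вашей|вашего|вашему?|вашу|вас|вашим)\b", re.IGNORECASE)),
]

def label_sentence_pair(target_sentence: str) -> str:
    """Assign FORMAL / INFORMAL / NO_STYLE based on the first matching rule."""
    for label, pattern in RULES:
        if pattern.search(target_sentence):
            return label
    return "NO_STYLE"

print(label_sentence_pair("Как вы себя чувствуете?"))  # -> FORMAL
print(label_sentence_pair("Я приду завтра."))           # -> NO_STYLE
```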
For En-Ru, the training data that had been partitioned into style classes in this way included about 40M sentence pairs. At the time this model was trained in early 2022, the larger CCMatrix corpus (Schwenk et al., 2021) was not included. For En-Pt, we did use a filtered version of CCMatrix in training, so that the total number of parallel sentence pairs was 140M. The filtering of CCMatrix and other large crawled data included removing sentence pairs with low cross-lingual sentence embedding similarity as given by the LABSE scores (Feng et al., 2022). All of our parallel training data is also filtered based on sentence-level language identification scores and other heuristics.

When training the Transformer Big model, we balanced the contribution of formal, informal, and "no style" data by adding them in equal proportions (number of lines) to each sub-epoch.

language pair / requested style   BLEU [%]   COMET    M-Acc [%]
En-Pt  formal                     34.6       0.6089   99
       informal                   42.4       0.6776   64
En-Ru  formal                     35.4       0.6165   99
       informal                   33.3       0.6026   98

Table 6: Automatic evaluation results for AppTek's submission to the formality track of IWSLT 2023.

4.1 Results

We did not perform any experiments, but just set the API parameter style=formal or style=informal and translated the evaluation data with AppTek's production systems, trained as described above. The results in terms of automatic error metrics, as reported by the track organizers, are summarized in Table 6.

Among the 5 participants of the unconstrained condition, we obtain the best results for En-Ru in terms of BLEU and COMET (Rei et al., 2020), while producing the correct formality level for more than 98% of the sentences. The second-best competitor system obtains a formality accuracy of 100%, but scores 1.7% absolute lower in BLEU for the formal and 0.9% BLEU absolute lower for the informal class. For En-Pt, our system scores second in terms of automatic MT quality metrics and correctly produced the formal style for 99% of the sentences in the evaluation data. However, when the informal style was requested, our system could generate it in only 64% of the cases. We attribute this low score to the imperfect regular expressions we defined for informal Portuguese pronouns and corresponding verb forms, since some of them are ambiguous. However, we find it difficult to explain that e.g. the BLEU score of AppTek's "informal" MT output with respect to the informal reference is almost 8% absolute higher than for our "formal" output with respect to the formal reference. This may indicate that the human reference translation also has not always followed the requested style, the informal one in particular.

5 Conclusion

We described AppTek's submissions to the subtitling and formality tracks of IWSLT 2023.

For the subtitling track, we obtained good results, outperforming the other two evaluation participants either with our constrained or unconstrained cascaded approach on all 4 domains. Part of this success is due to our subtitle creation process, in which we employ AppTek's intelligent line segmentation models. However, the results varied by domain, with the domain of movie subtitles posing the most challenges for ASR, and the domain of fitness-related videos (Peloton) being hardest for MT. Yet our biggest overall challenge, especially for the direct (end-to-end) submission, was speech segmentation and creating sentence-like units, on real ITV movies in particular, in which there is music, background noise, and multiple speakers. In the future, we plan to improve this component of our speech translation technology. We also plan to include length control in our direct models, which showed to be an important factor for those applications with time constraints.

Our formality track participation was a one-shot attempt at a zero-shot task that showed the competitiveness of the formality control that we have implemented in AppTek's production systems. However, our approach currently requires the creation of manual regular expression rules for partitioning the parallel training data into formality classes, and the participation in the IWSLT evaluation revealed some weaknesses of this approach for one of the involved target languages. In the future, we plan to further improve our approach, reducing or eliminating the need for writing rules.
References

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2022. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Proc. Interspeech 2022, pages 2278–2282.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019a. A comparative study on end-to-end speech to text translation. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 792–799, Sentosa, Singapore.

Parnia Bahar, Patrick Wilken, Tamer Alkhouli, Andreas Guta, Pavel Golik, Evgeny Matusov, and Christian Herold. 2020. Start-before-end and end-to-end: Neural speech translation by AppTek and RWTH Aachen University. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 44–54.

Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019b. On using SpecAugment for end-to-end speech translation. In International Workshop on Spoken Language Translation (IWSLT).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Panayota Georgakopoulou. 2019. Template files: The holy grail of subtitling. Journal of Audiovisual Translation, 2(2):137–160.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning (ICML), volume 148, pages 369–376, Pittsburgh, PA, USA.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. 21th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 5036–5040.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 7180–7184. IEEE.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, pages 4835–4839, New Orleans, LA, USA.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-yan Liu. 2019. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations.

Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation, pages 148–154, Pittsburgh, PA, USA.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy. Association for Computational Linguistics.

Evgeny Matusov, Patrick Wilken, and Christian Herold. 2020. Flexible customization of a single neural machine translation system with multi-dimensional metadata inputs. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track), pages 204–216, Virtual. Association for Machine Translation in the Americas.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Haşim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. In Proc. Interspeech 2022, pages 106–110.

Zoltán Tüske, George Saon, Kartik Audhkhasi, and Brian Kingsbury. 2020. Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard. In Interspeech, pages 551–555, Shanghai, China.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Patrick Wilken, Panayota Georgakopoulou, and Evgeny Matusov. 2022. SubER - a metric for automatic evaluation of subtitle quality. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Patrick Wilken and Evgeny Matusov. 2019. Novel applications of factored neural machine translation. arXiv preprint arXiv:1910.03912.

Patrick Wilken and Evgeny Matusov. 2022. AppTek's submission to the IWSLT 2022 isometric spoken language translation task. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 369–378, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2019. A comparison of transformer and LSTM encoder decoder models for ASR. In IEEE Automatic Speech Recognition and Understanding Workshop, pages 8–15, Sentosa, Singapore.

Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. In 19th Annual Conf. Interspeech, Hyderabad, India, 2-6 Sep., pages 7–11.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
QUESPA Submission for the IWSLT 2023
Dialect and Low-resource Speech Translation Tasks
enables zero-shot cross-lingual transfer for many low-resource languages, including Quechua.

We provide reference to previous work that includes either direct or end-to-end ST models (Berard et al., 2016; Weiss et al., 2017). More traditional approaches typically use a cascade approach which first transcribes using an ASR model and then translates using an MT model. While recent work (Bentivogli et al., 2021; Anastasopoulos et al., 2021; Antonios et al., 2022) has shown that direct ST approaches are worthy, traditional approaches work well for low-resource situations too. In our system submissions, all of our systems with the exception of the primary constrained one used the cascade approach.

3 Quechua-Spanish

In this section we present our experiments for the QUE–SPA dataset provided in the low-resource ST track at IWSLT 2023. This is the first time that this dataset has been officially introduced in its current state, which contains 1 hour and 40 minutes of constrained speech audio along with its corresponding translations and nearly 60 hours of ASR data (with transcriptions) from the Siminchik (Cardenas et al., 2018) corpus. AmericasNLP 2022's task used a smaller part of the dataset, but the data was not presented or compiled with the same offering and, as of this writing, the results have not been published. This dataset aggregates the QUE–SPA MT corpus from previous neural MT work (Ortega et al., 2020). The audio and corresponding transcriptions along with their translations are mostly made of radio broadcasting, similar to the work from Boito et al. (2022), which contains 17 hours of speech in the Tamasheq language.

We present the six submissions for both the constrained and unconstrained settings as follows:

1. a primary constrained system that uses a direct ST approach;

2. a contrastive 1 constrained system consisting of a wav2letter (Pratap et al., 2019) ASR system and a neural MT system created from scratch;

3. a contrastive 2 constrained system consisting of a conformer-based (Gulati et al., 2020) ASR system and a neural MT system created from scratch;

4. a primary unconstrained system consisting of a multi-lingual PLM ASR model, a Quechua recurrent neural-network language model, and a fine-tuned neural MT system based on a PLM;

5. a contrastive 1 unconstrained system consisting of a multi-lingual PLM ASR model and a fine-tuned neural MT system based on a PLM;

6. a contrastive 2 unconstrained system consisting of a wav2letter ASR system and a fine-tuned neural MT system based on a PLM.

We present the experimental settings and results for all systems, starting off with the constrained systems in Section 3.1 and continuing with the unconstrained systems in Section 3.2. We then describe the other less successful approaches in Section 3.3. Finally, we offer results and discussion in Section 4.

3.1 Constrained Setting

The IWSLT 2023 constrained setting for QUE–SPA consists of two main datasets. First, the speech translation dataset consists of 1 hour and 40 minutes divided into 573 training files, 125 validation files, and 125 test files, where each file is a .wav file with a corresponding transcription and human-validated translation from Siminchik (Cardenas et al., 2018). Secondly, there is an MT data set combined by previous work (Ortega et al., 2020) which consists of 100 daily magazine article sentences and 51140 sentences which are of religious context in nature.

3.1.1 Primary System

The Primary System consists of a direct ST approach. Since the constrained setting does not allow for external data, we used only the data provided. We use the Fairseq (Ott et al., 2019) toolkit to perform direct ST using the 573 training files, a total of 1.6 hours of audio. The system extracts log mel-filter bank (MFB) features and is based on the S2T approach by Wang et al. (2020). We generate a 1k unigram vocabulary for the Spanish text using SentencePiece (Kudo and Richardson, 2018), with no pre-tokenization. Our model consists of a convolutional feature extractor and transformer encoder-decoder (Vaswani et al., 2017) with 6 encoder layers and 3 decoder layers. Error is measured using cross entropy and optimization is done using Adam. Our model was run for 500 epochs with a learning rate of .0002.
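As an illustration of the vocabulary step above, a 1k unigram SentencePiece model without pre-tokenization can be trained with the Python API as follows; the file names are placeholders, not the actual files from the shared task.

```python
import sentencepiece as spm

# Train a 1k unigram vocabulary on the Spanish target text (file name is a placeholder).
spm.SentencePieceTrainer.train(
    input="train.spa.txt",
    model_prefix="spm_spa_1k",
    vocab_size=1000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_spa_1k.model")
print(sp.encode("Hola, ¿cómo estás?", out_type=str))
```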
3.1.2 Contrastive 1 System

The Contrastive 1 System is a cascade system where first ASR is performed to produce transcriptions that are then translated using a separate MT system. For the ASR system, we used the wav2letter++ (Pratap et al., 2019) model. The wav2letter++ model consists of an RNN with 30M parameters (2 spatial convolution layers, 5 bidirectional LSTM layers, and 2 linear layers) and a CNN with 100M parameters (18 temporal convolution layers and 1 linear layer). We use the convolutional gated linear unit (GLU) (Dauphin et al., 2017) architecture proposed in the wav2letter (WSJ) recipe (Collobert et al., 2016). Our experiments using wav2letter++ took 134 epochs to train, using Stochastic Gradient Descent (SGD) with Nesterov momentum and a minibatch of 8 utterances. The initial learning rate was set to 0.006 for faster convergence, and it was annealed with a constant factor of 3.6 after each epoch, with momentum set to 0. The model was optimized using the Auto Segmentation Criterion (ASG) (Collobert et al., 2016). During development, the ASR system's WER was 72.15 on the validation set. The MT system was created from scratch using the OpenNMT framework (Klein et al., 2020) with the MT data provided for the constrained task along with the ASR training data. More specifically, the MT system's encoder and decoder are based on a transformer (Vaswani et al., 2017) (encode/decode) architecture of 6 layers. Hidden layer and vector sizes were 512. Dropout was set to 0.1. Optimization was done using the Adam optimizer. Tokenization was done using SentencePiece (Kudo and Richardson, 2018). Both source and target vocabularies were 50k. The initial BLEU score on the validation set was 21.13.

3.1.3 Contrastive 2 System

Similar to the Contrastive 1 System, the Contrastive 2 system is a cascade approach. The ASR system, however, is distinct. It is derived using MFB features similar to previous work (Berrebbi et al., 2022). It uses a conformer (Gulati et al., 2020) instead of the transformer encoder. Training was performed using a hybrid CTC/attention loss (Watanabe et al., 2017). The model was optimized using Adam (Kingma and Ba, 2015) and a Noam learning rate scheduler (Vaswani et al., 2017) with 4000 warmup steps. The MT system is identical to the OpenNMT MT system mentioned for the Contrastive 1 submission covered in Section 3.1.2.

3.2 Unconstrained Setting

For the unconstrained setting in IWSLT 2023, an additional 60 hours of speech data with their corresponding transcriptions was made available by the organizers. This allowed for greater monolingual fine-tuning of the ASR data. Additionally, for both the ASR and MT components of all three of our submitted unconstrained systems, PLMs were used along with fine-tuning. The three submissions were cascade systems.

3.2.1 Primary System

The Primary System for the unconstrained setting consists of two systems, the ASR and the MT system. Both systems are fine-tuned. First, the ASR system is a multi-lingual model pre-trained on the 102-language FLEURS (Conneau et al., 2023) dataset. The model consists of a conformer (Gulati et al., 2020) encoder and transformer decoder and is trained using a hybrid CTC/attention loss (Watanabe et al., 2017) and hierarchical language identification conditioning (Chen et al., 2023). The model inputs are encoded representations extracted from a pre-trained XLS-R 128 model (Babu et al., 2021) with its weights frozen, augmented with SpecAug (Park et al., 2019) and speed perturbation (Ko et al., 2015). In order to jointly decode, we also trained an RNN language model. The RNN consists of 2 layers with a hidden size of 650, trained using SGD with a flat learning rate of 0.1. The word-error rate on the validation set was 15. For the MT system, we use the Fairseq (Ott et al., 2019) toolkit for translation. The Flores 101 model (Guzmán et al., 2019) was used as the PLM; it is based on a transformer (Vaswani et al., 2017) architecture used at WMT 2021 by Facebook. Fine-tuning was performed using the same training ASR+MT data from the constrained task as was used for training the Constrained Contrastive 1 system in Section 3.1.2.
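The paper does not state which toolkit was used to extract the frozen XLS-R representations; as an illustration only, the same idea can be sketched with the Hugging Face transformers implementation of XLS-R. The checkpoint name, file path, and layer choice below are assumptions, not the authors' configuration.

```python
# Illustrative sketch: extract frozen XLS-R features for one utterance.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
xlsr.eval()
for p in xlsr.parameters():          # weights stay frozen
    p.requires_grad = False

wav, sr = torchaudio.load("example.wav")                  # hypothetical file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = xlsr(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of layer-wise representations; a downstream
# conformer encoder can consume one layer or a learned combination of them.
features = out.hidden_states[-1]     # shape: (1, frames, 1024)
```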
3.2.2 Contrastive 1 System

The Contrastive 1 system is nearly identical to the Primary System for the unconstrained setting. The MT system is identical to that of the Primary System submission for the unconstrained setting. For the ASR system, a FLEURS approach is used that is identical to the unconstrained Primary System in Section 3.2.1. The only difference is that this Unconstrained Contrastive 1 system does not use a language model.

2 https://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
264
3.2.3 Contrastive 2 System

The Contrastive 2 System is also a cascade (ASR+MT) system. The MT system is identical to that of the Primary System submission for the unconstrained setting. The ASR system architecture is identical to the Constrained Contrastive 1 System in Section 3.1.2, but with other hyperparameters. This experiment took 243 epochs to train, using Stochastic Gradient Descent (SGD) with Nesterov momentum and a minibatch of 16 utterances. The initial learning rate was set to 0.002 for faster convergence, and it was annealed with a constant factor of 1.2 after each epoch, with momentum set to 0. In this system, we add the additional 60 hours of monolingual transcribed speech data from the unconstrained setting of the IWSLT 2023 low-resource task to the 1.6 hours provided for the constrained setting.

3.3 Other Approaches

As noted in Section 2, there have been other successful approaches worth visiting. While we could not exhaustively attempt to use all of those approaches, we did focus on several that are worth noting.

For ASR approaches, we focused on experimenting with different model architectures. This included using different encoders (transformer, conformer) and decoders (auto-regressive Transformer, CTC-only). Regardless, all of the ASR systems achieved at best 100 WER in the constrained setting, limiting the effectiveness of any cascaded approach. In the unconstrained setting, we also looked at different ways to incorporate pre-training. For example, we tried directly fine-tuning a pre-trained XLS-R model (Babu et al., 2021; Baevski et al., 2020) instead of using extracted layer-wise features from a frozen model. These approaches were somewhat more successful, achieving up to 20.4 WER on the validation set; however, the top three systems reported performed better with ASR.

For MT approaches, several attempts were made to experiment with other systems. For example, the OpenNMT (Klein et al., 2020) toolkit now offers PLMs that include the Flores 101 (Guzmán et al., 2019) dataset. However, since Quechua was not included in the language list, the performance was extremely low on the validation set (0.06 BLEU). The Hugging Face version of the Flores 200 dataset was also tested and resulted in 23.5 on its own data. However, when testing on the validation set, the score was 6.27 BLEU. The Flores 200 model is made available as the NLLB task on Fairseq; however, we experienced several conflicts with the machine infrastructure causing complexity with the Stopes tokenization that prevented us from moving forward.

For direct ST approaches, we were also unsuccessful using w2v feature encoding without major modification. Overall, the cascade approaches seemed to work better for this task and, thus, we made a decision to use those instead. The results for the constrained task, nonetheless, show that the direct S2T approach worked well using MFB features.

4 Results and Discussion

Team QUESPA BLEU and CHRF Scores

Constrained
System         Description         BLEU    CHRF
primary        mfb+s2t             1.25    25.35
contrastive 1  w2vl+onmt           0.13    10.53
contrastive 2  conformer+onmt      0.11    10.63

Unconstrained
System         Description         BLEU    CHRF
primary        fleurs+lm+floresmt  15.36   47.89
contrastive 1  fleurs+floresmt     15.27   47.74
contrastive 2  w2vl+floresmt       10.75   42.89

Table 1: Team QUESPA results for the Quechua to Spanish low-resource task at IWSLT 2023.

Results are presented in Table 1. For the constrained task, we were unable to create a system that would be viable for deployment. Notwithstanding, we believe that the primary submission, which used MFB features along with the default Fairseq S2T recipe, could be used to further research in the field. The other systems, based on wav2letter (Pratap et al., 2019) and a conformer (Gulati et al., 2020), resulted in a near-zero BLEU score and are probably only valid as proof of the non-functional status of the two systems when performing ASR on the QUE–SPA language pair. It is clear that with 1.6 hours of data for training, few constrained systems will perform better than 5 BLEU, as seen in previous IWSLT tasks.
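The paper reports BLEU and CHRF but does not show the scoring call; as a hedged illustration, the two metrics in Table 1 can be computed with the sacrebleu Python API. The file names below are placeholders for the system output and the Spanish reference translations.

```python
# Illustrative scoring sketch with sacrebleu.
import sacrebleu

with open("hyp.spa") as f:
    hyps = [line.strip() for line in f]
with open("ref.spa") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}  CHRF = {chrf.score:.2f}")
```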
For the unconstrained setting, our findings have shown that for both the ASR and MT models, the use of a PLM with fine-tuning is necessary. We were unable to create a system from scratch that would perform as well as those presented in previ-
265
Figure 1: The best-performing unconstrained speech translation pipeline.
266
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, et al. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? CoRR, abs/2106.01045.

Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR, abs/1612.01744.

Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel López-Francisco, Jonathan Amith, and Shinji Watanabe. 2022. Combining spectral and self-supervised features for low resource speech recognition and translation. In Proc. Interspeech 2022, pages 3533–3537.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022. Speech resources in the Tamasheq language. Language Resources and Evaluation Conference (LREC).

Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho. 2018. Siminchik: A speech corpus for preservation of southern Quechua. ISI-NLP 2, page 21.

William Chen and Brett Fazio. 2021. Morphologically-guided segmentation for translation of agglutinative low-resource languages. Proceedings of Machine Translation Summit XVIII.

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, and Shinji Watanabe. 2023. Improving massively multilingual ASR with auxiliary CTC objectives. arXiv preprint arXiv:2302.12829.

Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. 2016. Wav2letter: An end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805.

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941. PMLR.

Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios Gonzales, Ivan Meza-Ruiz, et al. 2022. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond English-centric multilingual machine translation. The Journal of Machine Learning Research, 22(1):4839–4886.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proc. Interspeech 2020, pages 5036–5040.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLoRes evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111.
267
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Computational Linguistics, 48(3):673–732.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015, Conference Track Proceedings.

Guillaume Klein, François Hernandez, Vincent Nguyen, and Jean Senellart. 2020. The OpenNMT neural machine translation toolkit: 2020 edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 102–109.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Proc. Interspeech 2015, pages 3586–3589.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo A. Giménez-Lugo, Ricardo Ramos, et al. 2021. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. NAACL-HLT 2021, page 202.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

Atul Kr. Ojha, Valentin Malykh, Alina Karakanta, and Chao-Hong Liu. 2020. Findings of the LoResMT 2020 shared task on zero-shot for low-resource languages. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 33–37.

John E. Ortega, Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Machine Translation, 34(4):325–346.

John E. Ortega and Krishnan Pillaipakkamnatt. 2018. Using morphemes from agglutinative languages like Quechua and Finnish to aid in low-resource translation. Technologies for MT of Low Resource Languages (LoResMT 2018), page 1.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation.

Vineel Pratap, Awni Y. Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2019. Wav2letter++: A fast open-source speech recognition system. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6460–6464.

Annette Rios. 2015. A basic language technology toolkit for Quechua. Ph.D. thesis, University of Zurich.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020. fairseq S2T: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech 2017, pages 2625–2629.

Marion Weller-Di Marco and Alexander Fraser. Findings of the WMT 2022 shared tasks in unsupervised MT and very low resource supervised MT.
268
GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech
Translation Tasks
resource setting has limited training data, while the dialectal one lacks standard orthography and formal grammar. Both shared tasks allowed the submission of models trained under constrained and unconstrained conditions. In the constrained condition, models are only trained on data provided by the organizers. In contrast, models in the unconstrained condition can be trained on any available resources, including pre-trained models.

2.1 Data

Six low-resource languages were made available, and one dialectal. However, due to data quality issues (see Section 5) we do not report results on the Maltese to English task. Table 1 shows the data details for each language pair. The organizers shared additional data for specific languages, including data for automatic speech recognition (ASR) and machine translation (MT). However, our approach used the data described in Table 1. The exception is Tamasheq-French, where we used the provided 234 hours of unlabeled Tamasheq audio to pre-train a self-supervised speech model.

For the unconstrained condition, we used data from MuST-C (Di Gangi et al., 2019) to train an ASR model whose encoder we used to initialize the speech translation training. We used publicly available pre-trained self-supervised models (Wav2vec 2.0 (Baevski et al., 2020), XLSR-53 (Conneau et al., 2020), and Hubert (Hsu et al., 2021)). The Wav2vec 2.0 and Hubert checkpoints we used were trained on the Librispeech 960hr English-only data (Panayotov et al., 2015), while XLSR-53 was trained on 53 different languages (Conneau et al., 2020). No source language of our language pairs appears in any of these self-supervised models except for Tamasheq-French, where the Wav2vec 2.0 model we used was pre-trained on Tamasheq audio-only data. Though Tunisian Arabic is not part of the XLSR-53, the XLSR-53 contains Arabic data that may be related to Tunisian Arabic.

1 English to French only

3 Proposed Methods

Our methods consist of three different architectures. The first is an end-to-end transformer-based architecture (E2E) trained on only the provided data. The second architecture, which we name E2E-ASR, is the same as the first, except that we initialize the encoder with an ASR encoder. The third architecture uses self-supervised speech models as an encoder and a transformer-based decoder. We used three different self-supervised models, Wav2vec 2.0, XLSR-53, and Hubert, and refer to these architectures as W2V-E2E, XLSR-E2E, and Hubert-E2E respectively.

We used the Fairseq ST (Wang et al., 2020) framework for all our experiments and modified this framework to accommodate our new custom model architectures.

3.1 End-to-end and End-to-end with ASR

For the End-to-end (E2E) architecture, we used a transformer-based encoder-decoder architecture (Vaswani et al., 2017) (s2t_transformer_s) as implemented in the Fairseq S2T framework (Wang et al., 2020). The E2E architecture consists of a 6-block transformer encoder and a 6-block transformer decoder and is optimized using the cross-entropy loss with label smoothing. We used this model architecture to train the model for the primary constrained category (primary-constrained).

The End-to-end with ASR (E2E-ASR) architecture, similar to (Stoian et al., 2019) and (Bansal et al., 2019), uses the same architecture as the E2E. The difference is that we use a pre-trained ASR model to initialize its encoder. We used a transformer-based architecture identical to the one
270
for E2E to train the ASR on the English data of the English-French MuST-C dataset (Di Gangi et al., 2019). We chose this architecture for the ASR model to facilitate the transfer of the ASR encoder weights to initialize the E2E-ASR encoder. The decoder of the E2E-ASR was randomly initialized; we did not use the ASR decoder because it was trained on a different language with a different vocabulary. We used this model architecture to train the model for the second contrastive unconstrained category (contrastive2-unconstrained).
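The paper does not include its exact Fairseq invocation; the sketch below only illustrates how an S2T model's encoder can be initialized from a pre-trained ASR checkpoint in Fairseq. The data directory, checkpoint path, and any flag values not stated in the text are assumptions.

```python
# Illustrative sketch: initialize the E2E-ASR speech translation encoder
# from an ASR checkpoint using Fairseq's speech_to_text task.
import subprocess

subprocess.run([
    "fairseq-train", "data/st_corpus",
    "--task", "speech_to_text",
    "--arch", "s2t_transformer_s",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--optimizer", "adam", "--lr", "2e-3",
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "10000",
    # Reuse the encoder of an ASR model trained on the MuST-C English data;
    # the decoder is left randomly initialized.
    "--load-pretrained-encoder-from", "checkpoints/asr_mustc/checkpoint_best.pt",
], check=True)
```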
strained category (primary unconstrained) for the
3.2 Self-Supervised Approaches low-resource task. The XLSR-53 was used as the
The self-supervised approach uses self-supervised primary unconstrained category (primary uncon-
speech models as acoustic encoders with a strained) for the dialectal transfer task.
transformer-based decoder. The use of these self- The Wav2vec 2.0 we used for all the low-
supervised models is motivated by the scarcity of resource languages (except Tamasheq-French) was
data in the low-resource setting. However, we trained on the English raw audio of the Librispeech
found these models useful even for the dialectal 960hr data (Panayotov et al., 2015). However, due
task. The self-supervised architecture is illustrated to the availability of Tamasheq raw audio, we also
in figure 1. trained a Wav2vec 2.0 model on Tamasheq raw au-
We used three different self-supervised models, dio that used this model on the Tamasheq to French
Wav2vec 2.0, XLSR-53, and Hubert, which cor- language pair. The XLSR-53 model we used was
respond to the respective architectures W2V-E2E, trained on 53 raw audio data from 53 different lan-
XLSR-E2E, and Hubert-E2E. These models con- guages.
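The checkpoints in the paper are Fairseq models, but the same layer-removal idea can be illustrated with the Hugging Face Wav2Vec2 implementation; the snippet below is only a sketch, and the checkpoint name and downstream wiring are assumptions rather than the authors' code.

```python
# Illustrative sketch: drop the top three transformer blocks of a Wav2vec 2.0
# context network before fine-tuning it as a speech translation encoder.
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# The context network exposes its transformer blocks as a ModuleList, so
# truncating it keeps the remaining (lower) layers and their weights.
encoder.encoder.layers = nn.ModuleList(encoder.encoder.layers[:-3])
encoder.config.num_hidden_layers = len(encoder.encoder.layers)

print(f"Context network now has {len(encoder.encoder.layers)} blocks")
# The truncated encoder can then be fine-tuned jointly with a randomly
# initialized transformer decoder on the speech translation data.
```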
3.2.2 Using Hubert

Unlike Wav2vec 2.0 and XLSR-53, we did not remove any layers from the Hubert model. We rather fine-tuned the out-of-the-box pre-trained Hubert model on the English raw audio data of Librispeech 960hr. As discussed by Pasad et al. (2022), Hubert does not follow the autoencoder pattern, given that the higher layers appear to encode more phonetic and word information. The choice of not removing top layers for the Hubert model was also corroborated through our empirical experiments, where we achieved the highest BLEU score for the Hubert model when we did not remove any top layers.

We used the Hubert model for the first contrastive unconstrained category (contrastive1 unconstrained) for the low-resource and dialectal tasks.

3.3 Data

The input to the E2E and E2E-ASR architectures consists of 80-channel log-mel filterbank features computed on a 25 ms window with a 10 ms shift. We used raw audio as input for all the architectures using self-supervised models. For the translation text, we use the byte pair encoding (BPE) (Sennrich et al., 2016) algorithm with the SentencePiece toolkit from the Fairseq ST framework (Wang et al., 2020).

2 We refer the reader to the following papers (Baevski et al., 2020), (Conneau et al., 2020) and (Hsu et al., 2021) for more details on these models.
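For illustration, 80-channel log-mel filterbank features with the window and shift described above can be computed with torchaudio's Kaldi-compatible frontend; the audio path is a placeholder, and the authors' exact feature pipeline inside Fairseq may differ.

```python
# Illustrative sketch: 80-dim log-mel filterbank features, 25 ms window,
# 10 ms shift, matching the description above.  "utt.wav" is a placeholder.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("utt.wav")
fbank = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,
    frame_length=25.0,   # milliseconds
    frame_shift=10.0,    # milliseconds
)
print(fbank.shape)       # (num_frames, 80)
```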
271
Figure 1: Self-supervised model architecture. This is an end-to-end architecture that uses self-supervised speech
models as the encoder. The encoder is one of the Wav2vec 2.0, XLSR, or Hubert models. We removed the top 3
layers of the Wav2vec 2.0 and XLSR models.
272
Language  Task  System                  Architecture  dev/valid  test1  test2  test3
ga-eng    LR    primary constr.         E2E           -          -      15.1   -
ga-eng    LR    primary unconstr.       W2V-E2E       -          -      66.5   -
ga-eng    LR    contrastive1 unconstr.  Hubert-E2E    -          -      77.4   -
ga-eng    LR    contrastive2 unconstr.  E2E-ASR       -          -      15.1   -
mr-hi     LR    primary constr.         E2E           0.77       -      3.3    -
mr-hi     LR    primary unconstr.       W2V-E2E       4.76       -      7.7    -
mr-hi     LR    contrastive1 unconstr.  Hubert-E2E    5.78       -      8.6    -
mr-hi     LR    contrastive2 unconstr.  E2E-ASR       4.07       -      5.9    -
pus-fra   LR    primary constr.         E2E           2.66       -      5.92   -
pus-fra   LR    primary unconstr.       W2V-E2E       11.99      -      16.87  -
pus-fra   LR    contrastive1 unconstr.  Hubert-E2E    11.27      -      15.24  -
pus-fra   LR    contrastive2 unconstr.  E2E-ASR       9.72       -      13.32  -
tmh-fra   LR    primary constr.         E2E           1.24       1.0    0.48   -
tmh-fra   LR    primary unconstr.       W2V-E2E       12.07      7.63   8.03   -
tmh-fra   LR    contrastive1 unconstr.  Hubert-E2E    4.79       2.77   1.3    -
tmh-fra   LR    contrastive2 unconstr.  E2E-ASR       5.24       3.77   2.1    -
que-spa   LR    primary constr.         E2E           1.46       -      1.46   -
que-spa   LR    primary unconstr.       W2V-E2E       1.2        -      1.78   -
que-spa   LR    contrastive1 unconstr.  Hubert-E2E    1.84       -      1.86   -
que-spa   LR    contrastive2 unconstr.  E2E-ASR       1.63       -      1.63   -
aeb-eng   DT    primary constr.         E2E           11.49      8.94   5.0    4.5
aeb-eng   DT    primary unconstr.       XLSR-E2E      19.35      16.31  16.6   14.6
aeb-eng   DT    contrastive1 unconstr.  Hubert-E2E    17.69      14.52  15.0   13.4
aeb-eng   DT    contrastive2 unconstr.  W2V-E2E       16.7       14.4   14.1   12.9
Table 3: BLEU score for all the submitted systems. LR and DT indicate low-resource and dialectal transfer,
respectively. dev/valid refers to the validation or development sets we used during training. test1 refers to the test set
we used during training (some language pairs did not have this set). test2 refers to the blind test set. Some language
pairs (i.e., aeb-eng) had an additional blind test set called test3. The "-" character indicates that we do not have
BLEU results for that category. We did not report the dev/valid results for the Irish to English (ga-eng) task due to
the data quality issue discussed in section 5.
273
pre-training, we still see the same pattern for self-supervised pre-training.

Particularly for Tamasheq-French, which had a baseline BLEU score of 5.7 for the best IWSLT 2022 system (Anastasopoulos et al., 2022), we nevertheless improved upon the baseline by more than 2 BLEU on the blind test set.

4.2 Dialectal Task

Unlike the low-resource task, the highest BLEU for the dialectal task was achieved by using the XLSR-53 model (XLSR-E2E). Therefore, we used this architecture for our primary unconstrained setting. Table 3 shows the results for Tunisian Arabic-English.

For this task, Wav2vec 2.0 and Hubert had comparable BLEU scores. However, surprisingly, they did not perform as well as XLSR-53. This finding was counterintuitive given that the XLSR-53 model did not perform as well as Wav2vec 2.0 or Hubert on all the low-resource languages. The XLSR-53 model was also reported to have poor performance by Zanon Boito et al. (2022) on a low-resource language. Based on our experiments, we think that the poor performance of the XLSR-53 model for the low-resource task was related to its size. We speculate that the XLSR-53 model, given its size, may fail to adapt while fine-tuning it on little data. However, fine-tuning it on a lot of data, as in the case of Tunisian Arabic-English, may yield overall improvement.

It is also possible that the best performance of the XLSR-53 model on the Tunisian Arabic-English data is because it was trained on more languages. It will be interesting to investigate the impact of model size and multilinguality for self-supervised pre-trained speech models to improve the performance of speech translation downstream tasks. In addition, we think there may be room to study further the speech representations of the XLSR-53 model across layers so that they can be better adapted in low-resource settings.

5 Data Quality Issues

The low-resource shared tasks of IWSLT 2023 consist of six tasks, each task corresponding to one language pair. As we worked on these shared tasks, we noticed issues with the data of two tasks: Maltese to English and Irish to English.

The Maltese to English data had a number of issues that made it hard to work with. For instance, the metadata of about 1001 out of 1698 samples mentioned zero or negative duration for audio samples (start_time >= end_time) while the aligned utterances had several words in most cases. Therefore, we were not able to align most audio data with their utterances.
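A check of this kind is straightforward to reproduce; the sketch below assumes a hypothetical metadata file with start_time and end_time columns, and the actual field names and file format in the released data may differ.

```python
# Illustrative sketch: count and drop samples whose metadata implies a
# non-positive duration (start_time >= end_time).  The CSV layout and the
# column names are assumptions about the released metadata, not a spec.
import csv

kept, broken = [], []
with open("maltese_metadata.csv", newline="") as f:
    for row in csv.DictReader(f):
        start, end = float(row["start_time"]), float(row["end_time"])
        (broken if start >= end else kept).append(row)

print(f"{len(broken)} of {len(broken) + len(kept)} samples have non-positive duration")
```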
The Irish to English data had an issue with the development set. Initially, the samples in the development set were also present in the training set. The organizers later fixed this issue by updating the development set data. However, no matter how we trained our models, we never achieved more than a 1 BLEU score on the updated development set. After troubleshooting our model on the training data, we were confident that we should have gotten a BLEU score that was well above 1. We proceeded with submitting our system for this task. However, we are very suspicious of the high BLEU score reported on the blind test, as shown in Table 3, as it suggests that there is an overlap between the training and test sets.

6 Conclusion

In this paper, we presented the GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks. Our approach mainly focused on using self-supervised pre-trained speech models to improve the performance of speech translation on downstream tasks. The self-supervised pre-trained speech models used in this paper are Wav2vec 2.0, XLSR-53, and Hubert. We showed that the Wav2vec 2.0 and Hubert models have comparable results in the low-resource and dialectal transfer tasks. However, Wav2vec 2.0 performs well when we remove the top three layers, while the Hubert model has no such requirement.

Our experiments showed that the XLSR-53 model performs poorly in the low-resource setting compared to the Wav2vec 2.0 and Hubert models. However, in the dialectal task, the XLSR-53 model outperforms the Wav2vec 2.0 and Hubert models. In the future, we plan to conduct an in-depth analysis to understand the advantages and limitations of these self-supervised pre-trained speech models while fine-tuning them on downstream speech translation tasks.

Acknowledgements

We are thankful to the organizers of the IWSLT 2023 low-resource and dialectal shared tasks. This work was generously supported by NSF grant
274
IIS-2125466 and by a Meta Sponsored Research Award. We are also thankful to the Office of Research Computing at George Mason University (https://orc.gmu.edu), funded in part by grants from the National Science Foundation (Awards Number 1625039 and 2018631), for the computing resources we used to train our models.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. ArXiv, abs/2006.11477.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 58–68, Minneapolis, Minnesota. Association for Computational Linguistics.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022. Speech resources in the Tamasheq language. In International Conference on Language Resources and Evaluation.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

ELRA. ELRA catalogue (http://catalog.elra.info), TRAD Pashto-French parallel corpus of transcribed broadcast news speech - training data, ISLRN: 802-643-297-429-4, ELRA ID: ELRA-W0093; TRAD Pashto broadcast news speech corpus, ISLRN: 918-508-885-913-7, ELRA ID: ELRA-S0381.

Qingkai Fang, Rong Ye, Lei Li, Yang Feng, and Mingxuan Wang. 2022. STEMM: Self-learning with speech-text manifold mixup for speech translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7050–7062, Dublin, Ireland. Association for Computational Linguistics.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Yannick Estève, and Laurent Besacier. 2020. Investigating self-supervised pre-training for end-to-end speech translation. In ICML 2020 Workshop on Self-supervision in Audio and Speech.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Ankita Pasad, Bowen Shi, and Karen Livescu. 2022. Comparative layer-wise analysis of self-supervised speech models. ArXiv, abs/2211.03929.
275
Sravya Popuri, Peng-Jen Chen, Changhan Wang,
Juan Miguel Pino, Yossi Adi, Jiatao Gu, Wei-Ning
Hsu, and Ann Lee. 2022. Enhanced direct speech-to-
speech translation using self-supervised pre-training
and data augmentation. In Interspeech.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1715–1725,
Berlin, Germany. Association for Computational Lin-
guistics.
Matthias Sperber and Matthias Paulik. 2020. Speech
translation and the end-to-end promise: Taking stock
of where we are. In Annual Meeting of the Associa-
tion for Computational Linguistics.
Mihaela Stoian, Sameer Bansal, and Sharon Goldwater.
2019. Analyzing asr pretraining for low-resource
speech-to-text translation. ICASSP 2020 - 2020 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 7909–7913.
Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang,
Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li,
Abdelrahman Mohamed, Michael Auli, and Juan
Pino. 2022. Unified speech-text pre-training for
speech translation and recognition. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 1488–1499, Dublin, Ireland. Association for
Computational Linguistics.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. ArXiv, abs/1706.03762.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu,
Dmytro Okhonko, and Juan Miguel Pino. 2020.
Fairseq s2t: Fast speech-to-text modeling with
fairseq. In AACL.
Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui
Wu, and Z. Chen. 2017. Sequence-to-sequence mod-
els can directly translate foreign speech. In Inter-
speech.
Anne Wu, Changhan Wang, Juan Miguel Pino, and
Jiatao Gu. 2020. Self-supervised representations
improve end-to-end speech translation. ArXiv,
abs/2006.12124.
Marcely Zanon Boito, John Ortega, Hugo Riguidel, An-
toine Laurent, Loïc Barrault, Fethi Bougares, Firas
Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gah-
biche, and Yannick Estève. 2022. ON-TRAC con-
sortium systems for the IWSLT 2022 dialect and
low-resource speech translation tasks. In Proceed-
ings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 308–318,
Dublin, Ireland (in-person and online). Association
for Computational Linguistics.
276
The HW-TSC’s Speech-to-Speech Translation System for IWSLT 2023
Minghan Wang, Yinglu Li, Jiaxin Guo, Zongyao Li, Hengchao Shang, Daimeng Wei,
Chang Su, Min Zhang, Shimin Tao, Hao Yang
1 Huawei Translation Services Center, Beijing, China
{wangminghan,liyinglu,guojiaxin1,lizongyao,shanghengchao,
weidaimeng,suchang8,zhangmin186,taoshimin,yanghao30}@huawei.com
[0, T]. During inference, the model iteratively samples x_{t-1} from x_t:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t, c) \right) + \sigma_t z    (8)

\epsilon_\theta = \frac{1}{\sqrt{1 - \bar{\alpha}_t}} \left( x_t - \sqrt{\bar{\alpha}_t} \, \hat{x}_\theta(x_t, t, c) \right)    (9)

where \sigma_t = \sqrt{1 - \alpha_t} and z \sim \mathcal{N}(0, I). In our experiments, to allow for flexible determination of the maximum step T, we choose to use a continuous t ranging from 0 to 1. During training, t is uniformly sampled, and we use the cosine noise scheduler (Nichol and Dhariwal, 2021).
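Equations 8 and 9 correspond to one reverse-diffusion step; the following is a minimal PyTorch sketch of that step, assuming a model that predicts x̂_0 from (x_t, t, c) as described above and a noise schedule that provides α_t and ᾱ_t. It is an illustration, not the authors' implementation.

```python
# Minimal sketch of one reverse (denoising) step following Eq. 8 and 9.
import torch

def sampling_step(model, x_t, t, c, alpha_t, alpha_bar_t):
    x0_hat = model(x_t, t, c)                                   # x̂_θ(x_t, t, c)
    # Eq. 9: recover the predicted noise from the predicted x0.
    eps = (x_t - torch.sqrt(alpha_bar_t) * x0_hat) / torch.sqrt(1.0 - alpha_bar_t)
    # Eq. 8: posterior mean plus σ_t-scaled Gaussian noise (σ_t = sqrt(1 - α_t)).
    mean = (x_t - (1.0 - alpha_t) / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    sigma_t = torch.sqrt(1.0 - alpha_t)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + sigma_t * z
```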
In addition to modeling the denoising process, DTS also needs to predict the length of the target audio in advance, as DTS is essentially a non-autoregressive (NAR) model. However, unlike previous TTS models that predict the duration of each phoneme, we directly model the total number of frames in the target audio, which is more convenient. Specifically, we use the text representation after average pooling, denoted as h_c, as the input to the classifier \phi to predict the length distribution. Then, we calculate the cross-entropy loss with the frame number N_{x_0} of x_0:

L_{length} = CE(\phi(h_c; \theta), N_{x_0})    (10)

Figure 1: The architecture of the DTS model, which takes C = [c_1, ..., c_M] as the encoder input to predict the frame length N. For the decoder, it takes x_t and t as input, conditions on C, and predicts x_0 for the sampling of x_{t-1} according to Eq. 8 and 9.

2.4.2 Model Architecture

The DTS model is essentially a parameterized denoising function \hat{x}(x_t, t, c) which takes x_t and t as input, conditions on c, and predicts the x_0 used for the sampling of x_{t-1}. The model makes some modifications on top of the Transformer model to make it more suitable for speech synthesis. As shown in Figure 1, the main modifications are as follows:

• On top of the Encoder, we add a two-layer FFN network to predict the length of the target audio.

• In the input part of the Decoder, we use two 1D convolutions with a proper setting of kernel size, stride, and padding, so the sequence length before and after convolution remains unchanged (see the sketch after this list).

• As the Diffusion model depends on the time step t, we additionally introduce a Timestep Embedding, using the same implementation as (Ho et al., 2020).

• To make the time step encoding more comprehensive, we add a layerwise time encoding at each layer, added to the encoded hidden states from the last layer.

• In the output part of the decoder, we add two 1D deconvolutions to restore the hidden state back to the waveform. We use deconvolution because we found that using only a linear projection leads to a lack of dependency between the generated waveform and the previous waveform, resulting in noticeable jitter, which can be significantly eliminated by using deconvolution.
279
Model WER-all-punct WER-all WER-code-switch WER-zh
FastSpeech 2 13.18 10.75 15.70 8.37
DTS-Mel 13.32 10.28 15.66 7.69
DTS-Wave 12.68 9.82 15.33 7.17
Table 2: This table shows the performance of our TTS models on the GigaS2S dev set, using ground truth transcripts
as input. We compare our models against FastSpeech 2 (Ren et al., 2021), which serves as the baseline. Additionally,
we present a DTS model trained to predict mel-spectrograms (DTS-Mel) for comparison with DTS for waveform
(DTS-Wave). The table reports the word error rate (WER) for the entire set with punctuation (WER-all-punct), WER
for all samples without punctuation (WER-all), WER for code-switch samples without punctuation (WER-code-
switch), and WER for Chinese-only samples without punctuation (WER-zh). The results indicate that DTS-Wave
outperforms the other models, achieving the lowest WER values in all categories.
Model Input     BLEU   ChrF
ASR output      29.0   25.4
Ground Truth    30.7   27.3

Table 4: The performance of our MT models with ground truth input and ASR outputs as the input.

3 Experiment

3.1 Experimental Setup

For the ASR and MT parts of our S2S system, we directly used the same settings as in the Offline track. For the TTS part, we trained the model on the GigaS2S dataset for 360k steps, with a maximum learning rate of 1e-4, warmup of 20000 steps, and a batch size of 32 samples per GPU. The maximum and minimum audio lengths were restricted to 25 seconds and 0.5 seconds, respectively. The model has 12 layers in the encoder and 16 layers in the decoder, with a hidden dimension of 512 and an FFN dimension of 2048. DTS can directly generate waveforms, but since audio waveforms are usually long, we pre-segment them into equally sized non-overlapping frames. In this way, the model learns to generate the waveform frame by frame, and we only need to flatten these frames to get the final output. In our experiments, we used a frame length of 1200 and a sampling rate of 24000. At inference time, we set the sampling step to 100. In addition to the raw waveform, DTS can also learn to generate mel-spectrograms, simply by changing wave frames to spectrogram frames. This is also evaluated in our experiment.
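As a small illustration of the frame segmentation described above (the paper only states the frame length of 1200 samples at 24 kHz; the padding of a partial final frame is an assumption):

```python
# Illustrative sketch: split a waveform into non-overlapping 1200-sample
# frames and flatten them back.  Zero-padding of the last frame is an
# assumption; the paper does not describe how partial frames are handled.
import torch

FRAME_LEN = 1200  # samples per frame at a 24 kHz sampling rate

def to_frames(wave: torch.Tensor) -> torch.Tensor:
    pad = (-wave.numel()) % FRAME_LEN
    wave = torch.nn.functional.pad(wave, (0, pad))
    return wave.view(-1, FRAME_LEN)           # (num_frames, 1200)

def to_waveform(frames: torch.Tensor) -> torch.Tensor:
    return frames.reshape(-1)                  # flatten frames back to audio

wave = torch.randn(24000 * 3)                  # 3 seconds of audio
frames = to_frames(wave)
assert torch.equal(to_waveform(frames)[: wave.numel()], wave)
```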
3.2 Experimental Results

In the experiments, we tested the performance of each module in our S2S system separately. In addition to testing with the cascaded results as input, we also conducted independent tests with ground truth input. For the three modules, we mainly used the dev set of GigaS2S for evaluation. In terms of evaluation metrics, we used WER for ASR and BLEU and ChrF for MT. For TTS, we used a Whisper-medium (Radford et al., 2022) model to transcribe the TTS-generated audio back into text for automatic evaluation and calculated WER.
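For illustration, this round-trip evaluation can be sketched with the Hugging Face ASR pipeline and the jiwer package; the model identifier and file names are placeholders, and the authors do not specify their exact tooling beyond Whisper-medium.

```python
# Illustrative sketch: transcribe TTS output with Whisper-medium and score
# WER against the MT text that was synthesized.  File names are placeholders.
# Note: WER for Chinese is typically computed at the character level;
# segmentation details are omitted here.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")

hypotheses, references = [], []
for wav_path, target_text in [("tts_0001.wav", "你好 世界")]:   # placeholder pair
    hypotheses.append(asr(wav_path)["text"])
    references.append(target_text)

print("WER:", jiwer.wer(references, hypotheses))
```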
ASR Results We evaluated the results of two ASR models trained on the same corpus separately, as well as the ensemble version. As shown in Table 3, the ensemble results were slightly better.

MT Results In the evaluation of MT, we considered two scenarios: using ground truth transcripts as input and using the output of the previous ASR module as input. The experimental results showed that the robustness of MT was relatively good; even when there were errors in the ASR output, the difference in BLEU score was not significant, as shown
280
in Table 4.

TTS Results In the TTS experiments, because the development set of GigaS2S contains code-switching samples, we evaluated not only the WER of the entire set but also separately evaluated the cases without code-switching. As for the models, we chose FastSpeech 2 as the baseline. In addition, we trained an additional DTS based on mel-spectrograms for comparison with the waveform-based DTS. Both FS2 and DTS-Mel used the Griffin-Lim vocoder. As shown in Table 2, DTS-Wave outperformed the other two models, especially on Chinese monolingual data.

Full Pipeline Results In addition to testing each module separately, we also tested the final metrics of the entire pipeline. We compared the speech generated by the three TTS models with the MT results as input by computing BLEU and ChrF against the ground truth translation. Table 5 shows that a difference exists, but it is not significant. Therefore, we can conclude that the quality of the speech generated by TTS does affect the final performance of the S2S system in terms of automatic evaluation, but the impact is still limited.

4 Conclusion

In this paper, we present the system we developed for the IWSLT 2023 speech-to-speech competition. The system includes relatively simple and effective ASR and MT modules, as well as a TTS module proposed by us based on the diffusion model. In the experiments, we demonstrate that the denoising diffusion process can effectively learn the end-to-end TTS task, simplifying both training and inference. However, its generation speed is relatively slow. In our future work, we will continue to optimize its quality and generation efficiency, and further explore the application of diffusion in end-to-end S2S tasks.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Khalid Choukri, Alexandra Chronopoulou, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Benjamin Hsu, John Judge, Tom Ko, Rishu Kumar, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Matteo Negri, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Elijah Rippeth, Elizabeth Salesky, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Mingxuan Wang, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vera Kloudová, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Miguel Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 98–157. Association for Computational Linguistics.

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3670–3674. ISCA.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 878–891. Association for Computational Linguistics.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 5036–5040. ISCA.

Jiaxin Guo, Yinglu Li, Minghan Wang, Xiaosong Qiao, Yuxia Wang, Hengchao Shang, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying
281
Qin. 2022. The HW-TSC's speech to speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 293–297. Association for Computational Linguistics.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. 2021. R-Drop: Regularized dropout for neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 10890–10905.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022a. The HW-TSC's simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 247–254. Association for Computational Linguistics.

Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022b. The HW-TSC's offline speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 239–246. Association for Computational Linguistics.

Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. GigaST: A 10,000-hour pseudo speech translation corpus. CoRR, abs/2204.03939.
282
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein† Cihan Xiao† Neha Verma† Thomas Thebaud†
Matthew Wiesner‡ Sanjeev Khudanpur†‡
† Center for Language and Speech Processing, and
‡ Human Language Technology Center of Excellence,
Johns Hopkins University
{ahussei6, cxiao7, nverma7, tthebau1, wiesner, khudanpur}@jhu.edu
283
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 283–290
July 13-14, 2023 c 2023 Association for Computational Linguistics
Condition          ASR                                                     MT
(A) Basic          166 hours of manually transcribed Tunisian              212K lines of manual English translation
                   telephone speech                                        of the Tunisian transcripts
(B) Unconstrained  1200 hours of Modern Standard Arabic broadcast          Any other English, Arabic dialects,
                   speech (MGB-2) (Ali et al., 2016); 250 hours of         or multilingual models beyond
                   Levantine Arabic telephone conversations                English and Arabic
                   (LDC2006S29, LDC2006T07)
brevity, we will refer to these conditions as (A) and (B) respectively.

2.1 Data description

The data we used for conditions (A) and (B) are listed in Table 1, and the sizes of the training, development-testing and test partitions are listed in Table 2. The development and test sets for the Tunisian data are provided by the organizers of IWSLT 2023. The data is 3-way parallel: Tunisian Arabic transcripts and English translations are available for each Tunisian Arabic audio utterance. We use the development set for model comparison and hyperparameter tuning, and the test1 set for evaluating our ST systems. Finally, the task organizers provided blind evaluation (test2, test3) sets for final comparison of submissions.

                     ASR (hours)    MT (lines)
train (condition A)  160            ∼202k
train (condition B)  1200+160+250   -
dev                  3.0            3833
test1                3.3            4204
test2                3.6            4288
test3                3.5            4284

Table 2: Details for the train, dev and test sets for constrained condition (A) and unconstrained condition (B).

3 Methods

In this section we describe our cascaded (§3.1) and end-to-end (E2E) (§3.2) speech translation systems, as well as our strategy for combining both approaches (§3.3).

3.1 Cascaded ASR-MT

3.1.1 Automatic Speech Recognition

To train ASR models for the E2E and cascaded systems, we use the ESPnet (Watanabe et al., 2018) toolkit. Our ASR architecture uses a Branchformer encoder (Peng et al., 2022) and a Transformer decoder (Vaswani et al., 2017), and follows the hybrid CTC/attention (Watanabe et al., 2017) approach. Each Branchformer encoder block consists of two branches that work in parallel. One branch uses self-attention to capture long-range dependencies, while the other branch uses a multi-layer perceptron with convolutional gating (Sakuma et al., 2021) to capture local dependencies. To mitigate orthographic variations (or inconsistencies) in the ASR transcripts, we augment the training data during the fine-tuning stage by reusing the audio training samples paired with their ASR transcripts, which tend to be orthographically more consistent. We refer to this approach as pseudo-labeling.

Condition (A). We train the ASR model described previously using the constrained Tunisian Arabic audio and transcripts.

Condition (B). The ASR Branchformer in this condition is pretrained on our MGB-2 standard Arabic data (Ali et al., 2016) and then fine-tuned on the provided Tunisian Arabic data. The MGB-2 MSA data differ from the Tunisian data in channel and dialect. Since the Tunisian data are telephone conversations sampled at 8kHz, we downsample the MGB-2 speech from 16kHz to 8kHz, which we previously found was more effective than upsampling the telephone conversations to 16kHz (Yang et al., 2022). We also added additional telephone speech from the Levantine Arabic dialect (Maamouri et al., 2006). Note that Levantine Arabic is very different from Tunisian, and the hope here is to benefit from matched genre and channel conditions, not dialect.

We did not explicitly attempt to reduce the dialect mismatch. However, we mitigated some of the spurious orthographic variations in transcripts of dialectal speech by using pseudo-labels for training instead of the manual transcripts, as noted above, in the final fine-tuning step.
tems, we use the ESPnet (Watanabe et al., 2018)
3.1.2 Machine Translation
toolkit. Our ASR architecture uses a Branchformer
encoder (Peng et al., 2022), a Transformer de- Condition (A). We train an MT model on
coder (Vaswani et al., 2017) and follows the hy- Tunisian Arabic transcripts paired with their En-
glish translations. The MT architecture is similar to
1
https://catalog.ldc.upenn.edu/LDC2006S29 §3.1.1 model architecture, and uses a Branchformer
2
https://catalog.ldc.upenn.edu/LDC2006T07
encoder and Transformer decoder.
284
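The two-branch Branchformer block described in §3.1.1 can be sketched as follows. This is a minimal PyTorch illustration only, not the ESPnet implementation used here; the layer sizes, the gating details, and the concatenate-and-project merge rule are assumptions made for clarity.

```python
# Minimal sketch of a Branchformer-style encoder block (illustrative assumptions,
# not the ESPnet code): one self-attention branch for long-range context and one
# MLP branch with convolutional gating for local context, merged by a projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGatingMLP(nn.Module):
    """MLP branch with convolutional gating for local dependencies."""
    def __init__(self, d_model: int, d_hidden: int = 1024, kernel: int = 31):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        # depthwise convolution over time produces a gate from half the channels
        self.gate_conv = nn.Conv1d(d_hidden // 2, d_hidden // 2, kernel,
                                   padding=kernel // 2, groups=d_hidden // 2)
        self.down = nn.Linear(d_hidden // 2, d_model)

    def forward(self, x):                                   # x: (batch, time, d_model)
        x = F.gelu(self.up(x))
        a, b = x.chunk(2, dim=-1)                           # split channels
        gate = self.gate_conv(b.transpose(1, 2)).transpose(1, 2)
        return self.down(a * gate)                          # gated local features

class BranchformerBlock(nn.Module):
    """Two parallel branches: self-attention (global) and gated MLP (local)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp = ConvGatingMLP(d_model)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        y = self.norm_attn(x)
        glob, _ = self.attn(y, y, y)                        # global branch
        loc = self.cgmlp(self.norm_mlp(x))                  # local branch
        return x + self.merge(torch.cat([glob, loc], dim=-1))  # residual merge

# quick shape check
block = BranchformerBlock()
print(block(torch.randn(2, 50, 256)).shape)                 # torch.Size([2, 50, 256])
```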
Condition (B). We experiment with two main pre-trained models: mBART and NLLB-200. In the first setting, we use the mBART25 model, which was shown to be slightly better for MSA versus the newer mBART50 model (Liu et al., 2020a; Tang et al., 2020). mBART25 also contains French, Turkish, Italian, and Spanish, all of which contribute loanwords to Tunisian (Zribi et al., 2014). Although these loanwords are transcribed in the Arabic script in our data, there is prior evidence that multilingual language models can benefit from cross-lingual transfer even between different scripts of the same language (Pires et al., 2019).

For NLLB-200, we use the distilled 1.3 billion parameter version of the model, due to space constraints. This model is a dense Transformer distilled from the original NLLB-200 model, which is a 54 billion parameter Mixture-of-Experts model that can translate into and out of 200 different languages. We note that this model supports Tunisian Arabic, the aforementioned contact languages, MSA, as well as other closely related Maghrebi dialects (Moroccan, Egyptian, Maltese). The breadth of language coverage seen during the training of NLLB-200 makes this model an attractive choice for a dialect speech translation task.

We fine-tune these models on the provided ∼200K lines of Tunisian Arabic-English data. The source side is normalized as described in Section 4. We preprocess all data with the provided pretrained sentencepiece vocabularies released with the models, with no pre-tokenization. Results on MT systems are included in Table 8.

3.2 End-to-End Speech Translation

For the constrained condition we adopt the hierarchical multi-decoder architecture proposed by Yan et al. (2022).

Condition (A). The system consists of a multi-task learning approach, which combines ASR and MT sub-nets into one differentiable E2E system where the hidden representation of the speech decoder is fed as input to the MT encoder. Additionally, we examine the effect of text normalization on the E2E-ST system and pre-trained MT initialization.

Condition (B). For the unconstrained condition, we propose a novel E2E-ST system that incorporates the combination of a pretrained ASR module and a pretrained MT module. Specifically, we combine the Branchformer ASR module described in Section 3.1 with mBART (Liu et al., 2020b), which was fine-tuned on Tunisian data. We modify the ESPnet ST recipe to incorporate the mBART model trained by the fairseq (Ott et al., 2019) framework. The architecture of the model is shown in Figure 1. In contrast to the modified Hierarchical Multi-Decoder architecture for Condition (A), to fully exploit the effect of MT pretraining, we removed the speech attention from the MT decoder that attends to the hierarchical encoder's hidden representations.

Specifically, the ASR encoder module in the proposed architecture takes in a sequence of audio features x_1, x_2, ..., x_T and generates a sequence of hidden representations of length N, optimized with respect to the ASR CTC objective. The ASR decoder takes in the ASR encoder's hidden representations and autoregressively produces a sequence of logits of length L trained by the label-smoothing loss. The hierarchical speech encoder module is trained directly by the ST CTC loss for generating auxiliary frame-level labels in the target language to aid the ST decoding process. The primary innovation of the proposed system lies in the fully-connected layer that maps the ASR decoder's output hidden representations to representations that resemble the outputs of mBART's encoder's embedding layer, making the full system differentiable. The ST encoder subsequently encodes the input representations and feeds them into its decoder. The ST decoder, slightly different from the vanilla mBART decoder, optionally runs hybrid/joint CTC decoding at inference time, with the ST-CTC auxiliary labels and the autoregressively generated ST outputs with target length M, i.e. y^ST_1, y^ST_2, ..., y^ST_M.
Figure 1: E2E model architecture with mBART MT module. The fully-connected (FC) layer applies a linear
transformation to the ASR decoder’s final hidden representation, which is then used to replace mBART’s encoder’s
embedding layer’s output.
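To make the coupling in Figure 1 concrete, the sketch below shows the kind of fully-connected bridge the text describes: a single linear layer that maps the ASR decoder's final hidden states to vectors of the MT model's embedding dimension, so they can stand in for the output of mBART's encoder embedding layer and keep the whole system differentiable. The module name and the dimensions (256 for the ASR decoder, 1024 for mBART) are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' ESPnet/fairseq code): a linear "bridge"
# from ASR-decoder hidden states into the mBART embedding space, so the pretrained
# MT encoder can be attached on top and trained end to end.
import torch
import torch.nn as nn

class ASRToMTBridge(nn.Module):
    def __init__(self, d_asr: int = 256, d_mt: int = 1024):
        super().__init__()
        # output replaces the MT encoder's embedding-layer output
        self.fc = nn.Linear(d_asr, d_mt)

    def forward(self, asr_decoder_states: torch.Tensor) -> torch.Tensor:
        # asr_decoder_states: (batch, asr_len, d_asr), one vector per ASR token
        return self.fc(asr_decoder_states)                  # (batch, asr_len, d_mt)

bridge = ASRToMTBridge()
fake_states = torch.randn(2, 37, 256)          # hidden states from the ASR decoder
mt_encoder_inputs = bridge(fake_states)        # fed to the pretrained MT encoder
print(mt_encoder_inputs.shape)                 # torch.Size([2, 37, 1024])
```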
3.3 System Combination

and the same NLLB-200 MT component as in our best cascaded system. In Table 6, the 5 combined systems are referred to as A3, B1, B3, B4, and B5, in order.

3.3.1 Minimum Bayes Risk

We applied Minimum Bayes Risk decoding (Kumar and Byrne, 2004) to combine the hypotheses produced by five systems. For a given speech utterance x_i, and for a given system s^j_{θ_j} (j ∈ S, with θ_j the set of parameters used by the j-th trained system), we define the translation hypothesis as y_i^j = f_{θ_j}(x_i) and let p_i^j be the probability that the hypothesis y_i^j would be output. We use this probability as a self-confidence score. Let L be a similarity metric used to compare two hypotheses, outputting a scalar that rises if the two hypotheses are more similar. Then, for a given speech utterance x_i, and for a given set of systems S, we define the best output as the one minimizing the distance to the others while having the highest confidence:

$$y_i^{\mathrm{mbr}} = \max_{y_i^j} \sum_{j \in F} p_i^j \sum_{k \in F} L(y_i^j, y_i^k) \qquad (1)$$

3.3.2 Variations of MBR

Baseline MBR. For our first combination, we compute the outputs according to the MBR using the BLEU score of sacrebleu (Post, 2018a) as the L similarity metric; the posterior probabilities p_i^j used are the log-likelihood ratios given by the end-to-end systems and the MT systems.

Unscored MBR. For our second combination, we use the same technique but with a constant p_i^j = 1 for every system, as a simplified version of the Generalized MBR (Duh et al., 2011).

COMET-MBR. For our third combination, we utilized the comet-mbr framework, which employs the COMET score between the source and hypothesis as the similarity metric (L), using the same equation (1) without the use of posterior probabilities (Fernandes et al., 2022). We used wmt20-comet-da for MBR scoring (Rei et al., 2020). Despite Tunisian Arabic not being a COMET-supported language, we observed an improvement compared to our single best system, suggesting that this approach may extend to dialects of languages covered by COMET.

4 Experiments

In this section, we describe our experiments on the ASR, MT, and ST tasks. In order to reduce the orthographic variation in the Tunisian speech transcription, we performed additional text normalization similar to (Yang et al., 2022), which showed significant improvements on the ASR, MT and ST tasks. The normalization is performed on both the Tunisian and MSA transcripts and includes removing diacritics and single-character words, and Alif/Ya/Ta-Marbuta normalization (see (Yang et al., 2022) for more details).

4.1 ASR

First we augment the raw audio segments by applying speed perturbation with three speed factors of 0.9, 1.0 and 1.1 (Ko et al., 2015). Then we transform the augmented audio to a sequence of 83-dimensional feature frames for the E2E model:
80-dimensional log-mel filterbank coefficients with 3 pitch features (Ghahremani et al., 2014). We normalize the features by the mean and the standard deviation calculated on the entire training set. In addition, we augment the features with the SpecAugment approach (Park et al., 2019), with mask parameters (mT, mF, T, F) = (5, 2, 27, 0.05) and bi-cubic time-warping. The E2E Branchformer-based ASR model was trained using the Adam optimizer for 50 epochs with dropout-rate 0.001 and warmup steps of 25000 for condition (A) and 40000 for condition (B). The BPE vocabulary size is 500 for condition (A) and 2000 for condition (B). Table 3 summarizes the best set of parameters that were found for the Branchformer architecture. We note here that the Branchformer has 28.28M parameters, which is approximately one-fourth the number of parameters in the Conformer (Yang et al., 2022), which has 116.15M.

Att heads   CNN kernel   Enc layers   Dec layers   dk    FF
4           31           12           6            256   2048

Table 3: Values of condition (A) and (B) hyperparameters. CNN: CNN module kernel, Att: attention, Enc: encoder, Dec: decoder, FF: fully connected layer.

MGB2-tune: the pretrained model on MGB-2 is fine-tuned on Tunisian data from condition (A) by updating all model parameters with 1/10 of the learning rate that was used during training, similar to (Hussein et al., 2021). In addition, we examine the effect of adding ASR outputs to the ground-truth source during finetuning (pseudo labeling) and of adding additional telephone data (Tel). The ASR results are summarized in Table 4 and compared to the state-of-the-art Conformer results from (Yang et al., 2022). MD refers to the hierarchical multi-decoder ST architecture adopted from (Yan et al., 2022), and MD-ASR refers to the ASR sub-module of the ST. It can be observed that the Branchformer provides slightly better results compared to the previous best Conformer of similar size on both conditions (A) and (B). In addition, it can also be seen that pseudo labeling provides a 2% relative improvement. We found that there is high inconsistency between different transcribers since there is no standard orthography in the Tunisian dialect. By incorporating the ASR predictions in this way, we aim to provide the model with more examples of the Tunisian dialect and help it better generalize to variations in the spoken language. To confirm this hypothesis we take a closer look at the most frequent top four substitutions shown in Table 5. The words are transliterated using Buckwalter transliteration (BW, https://en.wikipedia.org/wiki/Buckwalter_transliteration) to make them readable for non-Arabic speakers. It can be seen that the ASR substitutions are present both as hypotheses and as correct references, which indicates that the assumption of reference inconsistency holds true. Finally, channel matching using more telephone data provides an additional 2.5% relative improvement.

ASR-ID   Model                           dev    test1   test2   test3
                                                WER (↓)
A1       Conformer (Yang et al., 2022)   40.8   44.8    43.8    -
A2       Branchformer                    40.1   44.5    -       -
B1       MGB2-tune (Yang et al., 2022)   38.8   43.8    42.8    -
B2       MGB2-tune Branchformer          38.3   43.1    -       -
B3         + Pseudo                      37.5   42.6    -       -
B4         + Tel                         36.5   41.7    40.6    41.6
B5       E2E-MD-ASR                      40.6   45.1    43.7    44.9
B6       E2E-mBART-ASR                   37.7   43.2    41.5    42.6

Table 4: WER (%) of ASR models on the dev, test1, test2 and test3 sets. A* and B* IDs are the ASR models developed under condition (A) and condition (B) respectively. B5 refers to the ASR sub-module of the MD-ASR system under the constrained condition and B6 refers to the ASR sub-module of the E2E-mBART system, both described in Section 3.2.

Count: REF / HYP (BW)   English translation
69: Ayh / Ay            yes
61: Ay / Ayh            yes
18: Akhw / khw          it's
17: khw / Akhw          it's
 8: gdwA / gdwh         tomorrow
 7: gdwh / gdwA         tomorrow

Table 5: Top 6 substitutions with inconsistencies for the ASR system, transliterated using Buckwalter (BW). The number of times each error occurs is followed by the word in the reference and the corresponding hypothesis.

4.2 MT

We train the MT models as described in Section 3.1.2. For condition (A) the MT system parameters are shown in Table 7. In this condition, our MT system is finetuned on the training Tunisian data where the source data is mixed with ASR outputs, in order to be more robust to noisy source data. We use 5000 Byte-pair encoding (BPE) units shared between Tunisian Arabic and English. We train
                                         Pretrained           BLEU (↑)
ST-ID   Type                             ASR     MT     dev    test1   test2   test3
A1      Cascade                          A2      A3     18.9   15.6    -       -
A2      E2E-MD (Yan et al., 2022)        A2      -      20.6   17.1    -       -
A3      E2E-MD+norm                      A2      -      20.7   17.5    19.1    17.6
B1      E2E-mBART                        B4      B2     20.7   17.5    17.5    17.1
B2      Cascade-mBART                    B4      B2     20.9   17.9    -       -
B3      Cascade-Base-NLLB200             B4      B3     22.2   19.2    21.2    18.7
B4      Cascade-B5-ASR-NLLB200           B5      B3     21.1   18.3    19.9    18.2
B5      Cascade-B6-ASR-NLLB200           B6      B3     22.2   18.8    20.7    18.3
B6      MBR with scores                  -       -      21.7   18.8    18.7    17.1
B7      MBR no scores                    -       -      22.7   19.6    20.6    18.8
B8      comet-mbr                        -       -      22.7   19.6    21.6    19.1

Table 6: Results of cascaded, E2E, and combined systems measured by BLEU score on the dev, test1, test2 and test3 sets. E2E-MD is the hierarchical multi-decoder described in §3.2. "norm" indicates the use of text normalization (§4), which is used with all systems except A2. "Pretrained" indicates the use of pretrained ASR and MT systems from Tables 4 and 8. A* and B* IDs are the models developed under condition (A) and condition (B) respectively.
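Systems B6–B8 in Table 6 are the MBR combinations described in §3.3. As a rough illustration of Equation (1), the sketch below scores each candidate hypothesis by its confidence times its summed similarity to the other candidates, with sentence-level BLEU from sacrebleu standing in for L; setting all confidences to 1 corresponds to the unscored variant (B7). This is a simplified sketch under those assumptions, not the authors' implementation.

```python
# Simplified sketch of the MBR combination in Equation (1); not the authors' code.
from sacrebleu import sentence_bleu

def mbr_select(candidates, confidences=None):
    """candidates: hypothesis strings from the combined systems.
    confidences: optional self-confidence scores p_i^j (log-likelihoods in the
    paper); defaults to the unscored variant with all scores equal to 1.0."""
    if confidences is None:
        confidences = [1.0] * len(candidates)
    best, best_score = None, float("-inf")
    for j, hyp in enumerate(candidates):
        # similarity of this hypothesis to every other system's hypothesis
        sim = sum(sentence_bleu(hyp, [candidates[k]]).score
                  for k in range(len(candidates)) if k != j)
        score = confidences[j] * sim
        if score > best_score:
            best, best_score = hyp, score
    return best

hyps = ["he goes to the market tomorrow",
        "he will go to the market tomorrow",
        "tomorrow he goes to market"]
print(mbr_select(hyps))   # picks the hypothesis most similar to the others
```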
References

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020a. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation.

Mohamed Maamouri et al. 2006. Levantine Arabic QT training data set 5, speech (LDC2006S29). Web download.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech, pages 2613–2617.

Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe. 2022. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pages 17627–17643. PMLR.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Matt Post. 2018a. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Matt Post. 2018b. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.

Jin Sakuma, Tatsuya Komatsu, and Robin Scheibler. 2021. MLP-based architecture with variable length input for automatic speech recognition.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Yalta, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. ArXiv, abs/1804.00015.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11:1240–1253.

Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, and Shinji Watanabe. 2022. CMU's IWSLT 2022 dialect speech translation system. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 298–307, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Jinyi Yang, Amir Hussein, Matthew Wiesner, and Sanjeev Khudanpur. 2022. JHU IWSLT 2022 dialect speech translation system description. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 319–326.

Inès Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Belguith, and Nizar Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2355–2361, Reykjavik, Iceland. European Language Resources Association (ELRA).
Learning Nearest Neighbour Informed Latent Word Embeddings
to Improve Zero-Shot Machine Translation
Figure 2: A NN-informed embedding for an arbitrary subword shire is produced by averaging across nearby
subwords from various languages, and combining with a semantic representation extracted from this average.
             De–It          De–Nl          De–Ro          It–Nl          It–Ro          Nl–Ro         zero    sup.
             ←      →       ←      →       ←      →       ←      →       ←      →       ←      →
Base M2M     15.64  15.28   18.46  18.14   14.42  14.98   18.16  18.79   17.91  20.14   15.81  16.41   17.01   30.62
SRA (2019)   16.44  16.45   18.44  19.15   15.07  15.83   19.30  19.10   18.52  21.52   16.83  17.66   17.85   30.41
SF (2019)    16.34  15.77   18.37  18.16   14.74  15.25   18.60  19.18   18.54  21.64   16.09  16.94   17.46   30.50
LV (2021)    16.82  15.81   18.74  18.64   15.12  16.32   18.92  19.29   18.70  22.13   16.21  18.22   17.91   30.51
CL (2021b)   17.31  16.21   19.70  19.57   15.32  16.25   18.90  20.09   19.07  22.44   17.14  17.99   18.33   30.29
DP (2021)    16.62  15.64   19.64  18.78   15.07  15.96   19.01  20.15   18.67  21.56   16.46  18.18   17.97   30.49
Ours         17.41  16.89   19.71  19.21   15.60  16.22   19.30  20.10   19.60  21.88   17.25  18.40   18.47   30.62

Table 1: BLEU on the IWSLT17 test set (mean of 3 runs). "zero" and "sup." are average zero-shot and supervised results.
the averaged embedding EMB_µ(w) as query:

$$\mathrm{EMB}_{latent}(w) = \mathrm{Softmax}\big(\mathrm{EMB}_{\mu}(w) \cdot W_{sem}^{T}\big)\, W_{sem} \qquad (3)$$

A residual connection from EMB_µ(w) gives the final NN-informed word embedding:

$$\mathrm{EMB}_{knn}(w) = \mathrm{EMB}_{latent}(w) + \mathrm{EMB}_{\mu}(w) \qquad (4)$$

EMB_knn(w) is a drop-in replacement for a conventional word embedding EMB(w).

Modelling Prediction Consistency. Given a source sentence represented using conventional word embeddings and using NN-informed embeddings, following Kambhatla et al. (2022b) we model the loss with respect to target sentence y_i as:

$$\mathcal{L}^{i} = \underbrace{\alpha_1\, \mathcal{L}^{i}_{NLL}\big(p_{\Theta}(y_i \mid x_i)\big)}_{\text{source x-entropy}} + \underbrace{\alpha_2\, \mathcal{L}^{i}_{NLL}\big(p_{\Theta}(y_i \mid kNN(x_i))\big)}_{\text{k-NN embeds. source x-entropy}} + \underbrace{\beta\, \mathcal{L}^{i}_{dist}\big(p_{\Theta}(y_i \mid x_i),\, p_{\Theta}(y_i \mid kNN(x_i))\big)}_{\text{agreement loss}} \qquad (5)$$

where kNN(x_i) denotes the set of k-nearest neighbors to token x_i. This loss combines three terms: the first two are conventional negative log-likelihoods, while the third is an agreement loss measuring pairwise symmetric KL divergence between the output distributions for x_i and kNN(x_i). This agreement-loss term performs co-regularization by allowing explicit interactions between source sentences with and without NN-informed embeddings.

3 Experiments

3.1 Datasets

We conduct experiments on 2 multilingual datasets, each with a BPE (Sennrich et al., 2016) vocabulary size of 32k subwords:

IWSLT17 (Cettolo et al., 2012) is an English-centric dataset (https://wit3.fbk.eu/2017-01) totalling 1.8M parallel sentences. It has 8 supervised directions to and from German, Italian, Dutch and Romanian, each with about 220,000 parallel sentences, and 12 zero-shot directions. We use the official validation and test sets.

Ted59 (Qi et al., 2018) is a massively multilingual English-centric dataset (github.com/neulab/word-embeddings-for-nmt) with 116 translation directions totalling 10.8M parallel sentences. The imbalanced data (from 0.25M to just 2000 parallel samples for some language pairs) makes it ideal to study the effects of our method. Following (Aharoni et al., 2019; Raganato et al., 2021) we evaluate on 16 supervised pairs and 4 zero-shot pairs (Arabic ↔ French, Ukrainian ↔ Russian).

3.2 Baselines and Related Work

We compare against methods for encoder manifold alignment. These include strong baselines such as sentence representation alignment (SRA; Arivazhagan et al. 2019), softmax forcing (SF; Pham et al. 2019), the contrastive multilingual model (CL; Pan et al. 2021b), the multilingual Transformer with disentangled positional embedding (DP; Liu et al. 2021), and latent variable based denoising (LV; Wang et al. 2021), along with the vanilla many-to-many zero-shot model (M2M). On TED59, we compare against CL and 3 explicit multilingual alignment techniques proposed by Raganato et al. (2021): word alignment, language tag alignment, and the union of the two. We also implement and compare against Raganato et al.'s (2021) sparse 1.5entmax cross-attention variant.

3.3 Model and Implementation Details

All models use the configuration in Vaswani et al. (2017) and the fairseq toolkit (Ott et al., 2019). See reproducibility details in Appendix A.
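As a concrete sketch of Equations (3) and (4), the module below averages a subword's embedding with its cached approximate-nearest-neighbour embeddings (the EMB_µ step suggested by Figure 2), queries a learned semantic memory W_sem with a softmax, and adds the residual connection. The exact averaging rule, dimensions, and module layout are assumptions made for illustration, not the authors' fairseq code.

```python
# Sketch of the NN-informed embedding of Equations (3)-(4); illustrative only.
import torch
import torch.nn as nn

class NNInformedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d: int = 512, mem_size: int = 1000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)                        # W_emb
        self.w_sem = nn.Parameter(torch.randn(mem_size, d) * 0.02)    # W_sem

    def forward(self, tokens, neighbour_ids):
        # tokens: (batch, len); neighbour_ids: (batch, len, k) cached ANN ids
        e = self.emb(tokens)                                  # (B, L, d)
        e_nn = self.emb(neighbour_ids).mean(dim=2)            # average of k neighbours
        emb_mu = 0.5 * (e + e_nn)                             # assumed form of EMB_mu
        attn = torch.softmax(emb_mu @ self.w_sem.T, dim=-1)   # Eq. (3): query W_sem
        emb_latent = attn @ self.w_sem
        return emb_latent + emb_mu                            # Eq. (4): residual

layer = NNInformedEmbedding(vocab_size=32000)
toks = torch.randint(0, 32000, (2, 7))
nbrs = torch.randint(0, 32000, (2, 7, 3))                     # k = 3 as in the paper
print(layer(toks, nbrs).shape)                                # torch.Size([2, 7, 512])
```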
                              Θ      En→X    X→En    Zero-Shot   Acc0
Aharoni et al. – 106 langs    473M   20.11   29.97    9.17       -
Aharoni et al. – 59 langs     93M    19.54   28.03    -          -
Transformer M2M reimp.        93M    18.98   27.22    7.12       74.10
Contrastive (2021b)           93M    19.09   27.29    8.16       73.90
Ours                          77M    19.01   27.11   10.03       95.81
Raganato et al. (2021):
  ZS + 1.5entmax (ibid.)      93M    18.90   27.21   10.02       87.81
  ↳ Word Align (ibid.)        93M    18.99   27.58    8.38       73.12
  ↳ LangID Align (ibid.)      93M    18.98   27.48    6.35       65.01
  ↳ Word + LangID Align       93M    19.06   27.37   11.94       97.25
Ours + 1.5entmax              77M    18.94   27.42   12.11       98.90

Table 2: Average BLEU scores on the TED59 dataset (Θ = parameter count). Our model produces zero-shot translations in the correct output language with high accuracy (Acc0).
We use ScaNN (Guo et al., 2020) for efficient ANN search with k = 3 (we use asymmetric hashing with 2-dimensional blocks and a quantization threshold of 0.2, and re-order the top 100 ANN candidates). To increase training speed, we cache each subword's ANNs for 400 iterations before recomputing them. We only (periodically) cache subword IDs: the embedding EMB_µ(·) is always computed directly from W_emb. We set λ = 0.5, α1 = α2 = 1, and β = 5. The attentional latent semantic representation layer has 512 dimensions (the same as the embedding layer) and a size N of 1000 for IWSLT17 (the smaller dataset) and 5000 for TED59 (the larger dataset). We did not tune this hyperparameter and chose the values based on the size of the datasets. For evaluation, we report sacreBLEU (Post, 2018).

3.4 Results

Main Results. Tables 1 and 2 show our main results. On IWSLT17, our latent k-NN embedding model outperforms several strong baselines, including sentence-representation alignment and contrastive learning, by an average of 0.62 and 0.11 BLEU respectively across the 12 zero-shot pairs. Compared to the baseline many-to-many model, our method yields a 1.5 BLEU gain on average. Our method is able to improve zero-shot performance without deteriorating supervised performance.

On the TED59 dataset, we follow Raganato et al. (2021) in comparing against two multilingual model variants: the standard Transformer, and the Transformer with sparse entmax instead of standard softmax cross-attention. Our approach gains ∼3 BLEU points against the baseline, and 2 BLEU against the stronger contrastive model. Further, our model consistently outperforms strong, explicitly alignment-based methods.

Target-language Accuracy. To supplement the evaluation, we provide the accuracy score for target language identification in zero-shot scenarios, called Acc0 (we utilize FastText (Joulin et al., 2017) as a language identification tool to compare the translation language with the reference target language and keep count of the number of matches). While the classical many-to-many NMT models (Johnson et al., 2017; Aharoni et al., 2019) enable zero-shot translations, several studies have shown that these models fail to reliably generalize to unseen language pairs, ending up with an off-target translation issue (Zhang et al., 2020): the model ignores the language label and the wrong target language is produced as a result. We observe significant improvements in target language accuracy, up to nearly 99% (absolute).

4 Analysis

Ablation Study. Table 3 reports ablations on the IWSLT17 test set. We find that kNN embeddings alone yield improvements over the baseline many-to-many model. By contrast, absent the other parts of our model, the attentional semantic layer deteriorates model performance. Only in combination with the agreement loss do we observe a benefit from this component.

Embedding Analysis. Figure 3 visualizes subword representations from models trained on IWSLT17. Each subword is colored according to the language in which it is most frequent. The overall layout of the two spaces is similar, although the
baseline model (left) exhibits a clear ring-shaped gap dividing the embeddings into two groups. With ANN embeddings (right), this gap is eliminated and the layout of the embeddings appears more homogeneous. Quantitatively, the average distance from a subword to its neighbors exhibits a smaller variance in the ANN model than in the baseline, which further supports the reading that ANN training creates a more homogeneous representation space in which subwords are more uniformly distributed.

Figure 3: t-SNE visualization of subword embeddings from IWSLT17 models trained without (left) and with (right) ANN embeddings. Points are colored according to the language where the corresponding subword is most frequent. ANN embeddings decrease the separation between some monolingual subspaces, and remove others entirely.

ID   Component                    dev.2010   test.2010
1    many-to-many (zero-shot)     15.95      18.46
2    1 + attn. semantic repr.     15.43      17.83
3    1 + kNN embeds               17.11      19.69
4    2 + kNN embeds               16.60      19.08
5    3 + agreement loss           17.99      20.91
6    4 + agreement loss           18.31      21.01

Table 3: Effect of different components of our model on the IWSLT17 datasets. We report sacreBLEU scores on the two official validation sets with beam size 1.

Table 4 shows nearest neighbors for a random sample of subwords (additional examples in Table 5 in Appendix B). With ANN training, a subword's nearest neighbors are generally its synonyms (e.g. _wonderful, _large, _tremendous, and _big as neighbors to _great) or derived forms (e.g. _încep, _începem, _început, _începe beside _înceapă). In the baseline, it is more likely to find neighbors with no apparent relation, such as _erzählen 'tell' and _stemmen 'hoist' or 'accomplish' beside _America. This suggests that ANN embeddings help a model to better organize its subword embedding space into coherent, semantically-related subspaces.

We quantify this trend by labeling each subword according to the language in which it is most frequently attested. In the baseline model, we find that on average only 2.7 of a subword's 6 nearest neighbors come from the same language as that subword. This average rises to 3.6 in the ANN model, demonstrating that ANN training significantly increases the number of same-language neighbors on average.

In the ANN model, a few rare subwords (?, ž, ć) are disproportionately common among the nearest neighbors of many other subwords. We speculate that these tokens may act as pivots for information to flow between their many neighbours. Their high centrality means that these tokens provide avenues for information to flow between a large number of subwords, even those which never occur in sentences together. Because these tokens are rare, there is also very little penalty for the model to "corrupt" their representations with information from neighboring subwords.

5 Other Related Work

A vast body of work addresses zero-shot translation. Most methods focus on producing language-agnostic encoder outputs (Pham et al., 2019). Wei et al. (2021) introduce multilingual contrastive learning, while Yang et al. (2021) adopt auxiliary target language prediction. To enable the input tokens to be positioned without constraints, Liu et al. (2021) eliminate the residual connections within a middle layer of the encoder. Yang et al. (2022) and Gu and Feng (2022) employ optimal transport to improve contextual cross-alignments, in contrast to our method which performs soft, non-contextual alignment between subwords in the continuously-updating embedding space. Other methods extend the training data using monolingual data (Al-Shedivat and Parikh, 2019), pretrain the decoder (Gu et al., 2019), or use random online backtranslation (Zhang et al., 2020). Lin et al. (2021) and Reid and Artetxe (2022) use dictionary-based alignments to produce pseudo-cross-lingual sentences. Other approaches that enhance token-level representations include multiple subword segmentations (Wu et al., 2020; Kambhatla et al., 2022a), enciphered source text (Kambhatla et al., 2022b) and stroke sequence modelling (Wang et al., 2022). While all these techniques rely on a multilingual training paradigm for machine translation, they either rely on external data or use explicit augmentations. We do not
use any external data or explicit alignments, and our model can be trained end-to-end like a regular multilingual model.

Subword      Nearest Neighbors (Baseline)                              Nearest Neighbors (Ours)
_great       _gesproken _schaffen ppy ită _prosper _senior             ? _wonderful _large _tremendous _big _great
_înceapă     _popolare _condotto _mişcă _bekijken _crească _creeze     _gepubliceerd _încep _începem _început _începe muovono
_America     tate _erzählen _stemmen dine _facultate _chestiune        _USA _Asia _Africa _American _America ć
_play        _lavori eranno _tenuto _bekijken - möglichkeiten          ? play _playing _Play _played _play
_football    _pesci bon _surf _betrachten _Hintergrund möglichkeiten   _weather _baseball ball _montagna _biodiversità _football
ing          ificazione izăm amento tung erende ende                   ling ting ung ž ingen ing
_fish        _petrec schen _Sachen _feed _chestii möglichkeiten        ? fisch _pesce _pesca _Fisch _fish

Table 4: Approximate nearest neighbors for a sample of subwords, computed with (right) and without (left) ANN training.

6 Conclusion

We described a novel approach to harness nearest neighbors at the token level and learn nearest-neighbour informed word embeddings for every word in a source language for many-to-many multilingual translation. Our experiments show that this simple yet effective approach results in consistently better zero-shot translations across multiple multilingual datasets. Additionally, our model produces translations in the right target language with high accuracy. Our analysis shows that our model learns to organize subwords into semantically-related neighborhoods, and reduces the separation between monolingual subspaces in the embedding space.

Limitations

While our method is effective in zero-shot settings, we find that it has limited implications in supervised settings. This is because improving zero-shot translation presents a tug-of-war between language-agnostic and language-specific representations, each of which has a distinct effect on the model. Another major downside is reduced training speed relative to the baseline many-to-many model. We note that this is an artifact of the agreement loss (KL divergence), which entails two forward passes for each update. Finally, in the present work, we compute k-NNs for every source word in a sentence. Although this has yielded strong results, we would like to explore a more explainable setting where k-NNs can be applied to specific source words. We leave such explorations to future work.

Acknowledgements

NSERC RGPIN-2018-06437 and RGPAS-2018-522574 and a Department of National Defence (DND) and NSERC grant DGDND-2018-00025 to the third author, and by an NSERC award CGSD3-547773-2020 to the second author.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1184–1197, Minneapolis, Minnesota. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 261–268.

Guanhua Chen, Shuming Ma, Yun Chen, Dongdong Zhang, Jia Pan, Wenping Wang, and Furu Wei. 2022. Towards making the most of cross-lingual transfer for zero-shot neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 142–157, Dublin, Ireland. Association for Computational Linguistics.
Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, and Yue Zhang. 2020. Bilingual dictionary based neural machine translation without using parallel sentences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1570–1579, Online. Association for Computational Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2022. Beyond English-centric multilingual machine translation. J. Mach. Learn. Res., 22(1).

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1258–1268, Florence, Italy. Association for Computational Linguistics.

Shuhao Gu and Yang Feng. 2022. Improving zero-shot multilingual translation with universal representations and cross-mappings. In Proceedings of the EMNLP 2022 Long Findings.

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pages 3887–3896. PMLR.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2017. Effective strategies in zero-shot neural machine translation. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 105–112, Tokyo, Japan. International Workshop on Spoken Language Translation.

Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, and Weihua Luo. 2020. Cross-lingual pre-training based transfer for zero-shot neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 115–122.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022a. Auxiliary subword segmentations as related languages for low resource multilingual translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 131–140, Ghent, Belgium. European Association for Machine Translation.

Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022b. CipherDAug: Ciphertext based data augmentation for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 201–218, Dublin, Ireland. Association for Computational Linguistics.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Nearest neighbor machine translation. In International Conference on Learning Representations.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.

Yusen Lin, Jiayong Lin, Shuaicheng Zhang, and Haoying Dai. 2021. Bilingual dictionary-based language model pretraining for neural machine translation.

Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, and Xian Li. 2021. Improving zero-shot translation by disentangling positional information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1259–1273, Online. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021a. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of ACL 2021.

Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021b. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 244–258, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alex Waibel. 2019. Improving zero-shot translation with language-independent constraints. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 13–23.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Alessandro Raganato, Raúl Vázquez, Mathias Creutz, and Jörg Tiedemann. 2021. An empirical investigation of word alignment supervision for zero-shot multilingual neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8449–8456, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Machel Reid and Mikel Artetxe. 2022. PARADISE: Exploiting parallel data for multilingual sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 800–810, Seattle, United States. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Weizhi Wang, Zhirui Zhang, Yichao Du, Boxing Chen, Jun Xie, and Weihua Luo. 2021. Rethinking zero-shot neural machine translation: From a perspective of latent variables. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4321–4327, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2018. Multilingual neural machine translation with soft decoupled encoding. In International Conference on Learning Representations.

Zhijun Wang, Xuebo Liu, and Min Zhang. 2022. Breaking the representation bottleneck of Chinese characters: Neural machine translation with stroke sequence modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6473–6484, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Xiangpeng Wei, Rongxiang Weng, Yue Hu, Luxi Xing, Heng Yu, and Weihua Luo. 2021. On learning universal representations across languages. In International Conference on Learning Representations.

Lijun Wu, Shufang Xie, Yingce Xia, Yang Fan, Jian-Huang Lai, Tao Qin, and Tieyan Liu. 2020. Sequence generation with mixed representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10388–10398. PMLR.

Shijie Wu, Benjamin Van Durme, and Mark Dredze. 2022. Zero-shot cross-lingual transfer is under-specified optimization. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 236–248, Dublin, Ireland. Association for Computational Linguistics.

Yilin Yang, Akiko Eriguchi, Alexandre Muzio, Prasad Tadepalli, Stefan Lee, and Hany Hassan. 2021. Improving multilingual translation by representation and gradient regularization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7266–7279, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zhe Yang, Qingkai Fang, and Yang Feng. 2022. Low-resource neural machine translation with cross-modal alignment. arXiv preprint.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
A Reproducibility Details

A.1 Data

IWSLT17 (Cettolo et al., 2012) is an English-centric dataset (https://wit3.fbk.eu/2017-01) totalling 1.8M parallel sentences. It has 8 supervised directions to and from German, Italian, Dutch and Romanian, each with about 220,000 parallel sentences, and 12 zero-shot directions. We use the official validation and test sets.

Ted59 (Qi et al., 2018) is a massively multilingual English-centric dataset (github.com/neulab/word-embeddings-for-nmt) with 116 translation directions totalling 10.8M parallel sentences. The imbalanced data (from 0.25M to just 2000 parallel samples for some language pairs) makes it ideal to study the effects of our method. Following (Aharoni et al., 2019; Raganato et al., 2021) we evaluate on 16 supervised pairs (Azerbaijani, Belarusian, Galician, Slovak, Arabic, German, Hebrew, and Italian to and from English) and 4 zero-shot pairs (Arabic ↔ French, Ukrainian ↔ Russian). Note that of these languages, Azerbaijani, Belarusian, Galician, and Slovak are low resource, with only 5.9k, 4.5k, 10k and 61.5k parallel samples to/from English.

All settings and baselines use sentencepiece for subword tokenization using byte-pair encodings (BPEs; Sennrich et al. 2016) with 32000 merge operations.

the IWSLT17 model and 2.5M parameters to the TED59 model. However, note that the total trainable parameters are still much lower than those of the baselines; this is because our models have shared embedding layers.

We use the Adam optimizer with inverse-square-root learning rate scheduling and 6k warmup steps, lr = 0.0007 and dropout of 0.3 (IWSLT17), or 10k warmup steps, lr = 0.005 and dropout of 0.2 (TED59). The batch size is 4096 tokens for each of four A100 GPUs.

We use ScaNN (Guo et al., 2020) for efficient ANN search with k = 3. To increase training speed, we cache each subword's ANNs for 400 iterations before recomputing them. We only (periodically) cache subword IDs: the embedding EMB_µ(·) is always computed directly from W_emb. The value of λ is set to 0.5 (Equation 1). We follow Kambhatla et al. (2022b) to set the values of α1 and α2 to 1, and β to 5 (Equation 5).

Evaluation. For evaluation, all translations are generated with beam size 5. We report case-sensitive BLEU scores (Papineni et al., 2002) using sacreBLEU (Post, 2018). We report detokenized BLEU for IWSLT17 and tokenized BLEU for TED59 for fair comparison with prior work (Aharoni et al., 2019; Raganato et al., 2021).
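For reference, the kind of sacreBLEU scoring described above can be reproduced with the snippet below; the exact tokenizer settings used by the authors are not stated in the paper, so the choices here are assumptions.

```python
# Minimal example of corpus-level sacreBLEU scoring (illustrative assumptions only).
from sacrebleu.metrics import BLEU

hyps = ["the cat sat on the mat", "he goes home"]
refs = [["the cat sat on the mat", "he went home"]]   # one reference stream

detok_bleu = BLEU()                  # default '13a' tokenization for detokenized text
tok_bleu = BLEU(tokenize="none")     # score already-tokenized text as-is

print(detok_bleu.corpus_score(hyps, refs).score)
print(tok_bleu.corpus_score(hyps, refs).score)
```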
Subword Nearest Neighbors (Baseline) Nearest Neighbors (Ours)
_Fisch _findet œ _chestii _Netz fisch möglichkeiten erei fisch _pesca _fish _Fisch ž
schaft hood erung ungen gaat _gehabt schaft würdig lichkeit ship nisse äglich schaft
the tje ped own asta by _solamente tech ther th by the
?
_the isce izăm _erzählen ”& _gehabt oara _your _their _our _the ć ž
_Music mat _cartoon hood _connessione zia _şcoala _musica _music ž _Music dine ć
_picior _sfârşit _plaatje _mesaj _teren _avion _gehabt _corpul _brat, _pagină _picior ž
?
ern eien iere eren erung _tenuto _gehabt uren ungen ert eren stern ern
_înceapă _popolare _condotto _mişcă _bekijken _crească _creeze _gepubliceerd _încep _începem _început _începe muovono
_democrat, ia analisi _înt, elege _popolare izăm _şcoala _deshalb _terorism muovono _democratic dine _biodiversità ć
_pure rische _giovane _appena _tare \u0e22 _avesse _semplicemente _unique _tragic _complete _sole _pure
_genomic _finanzia  _popolare _răspândi _genomen möglichkeiten _electronic _genome _robotic ž _genetic _genomic
_Abbiamo _perciò _gehabt _spunem _condotto izăm _avesse abbiamo mmo iamo _Abbiamo _abbiamo ć
izări amento isieren ierung izzazione _răspândi izare izare ităţi aţie izări muovono nelli
_negative _altele azioni iere _bune _enormous oase _illegal _alternative _evil _positive _negativ _negative
_take _solamente _gemacht _spinge _accompagna _preso _tenuto _takes _taken _taking _took ć _take
_muziek _percorso _besef _onderwijs _erzählen _vreugde oara _music muovono _Musik _musica _muziek ć
_Karte _Bibliothek _lavori strategie _chestii _cifre kaart _Weise _Sprache _carta _montagna kjes _Karte
_funct, iona _mişcă _munci matig _realiza _funct, ie _funct, iona _funcţionează _funct, ionează _funziona _funcţiona _funct, iona ć
_naţional _popolare iere _bază _condotto _esenţial _politic juist _rural äglich _National _naţional _national
_America tate _erzählen _stemmen dine _facultate _chestiune _USA _Asia _Africa _American _America ć
Table 5: Approximate nearest-neighbors for a sample of subwords, computed with (right) and without (left) ANN training.
JHU IWSLT 2023 Multilingual Speech Translation System Description
Henry Li Xinyuan1∗ Neha Verma1∗ Bismarck Bamfo Odoom1 Ujvala Pradeep1
Matthew Wiesner2 Sanjeev Khudanpur1,2
1 Center for Language and Speech Processing, and
2 Human Language Technology Center of Excellence,
Johns Hopkins University
{xli257, nverma7, bodoom1, upradee1, wiesner, khudanpur}@jhu.edu
robustness of translation systems. In light of this, we scraped talks and papers from the proceedings and workshops of ACL 2021.

3.1 Data Collection

About 65% of the papers accepted at ACL 2021 have video presentations recorded and uploaded on the ACL website. We scraped 1847 papers and 1193 talks from the proceedings and workshops. The formats of the papers and talks are PDF and MP4, respectively. We extract the text from the papers using pypdf (https://github.com/py-pdf/pypdf). The talks are split into 30-second chunks, converted into FLAC format, and resampled to 16kHz. This amounts to about 155 hours of speech and about 200K lines of text. We plan to release the data under a CC BY 4.0 license (https://github.com/IWSLT-23/60_60_data/tree/main/acl_data), the same as the license for the ACL talks.

3.2 Data Filtering

To make the corpora (including ACL papers before 2022) useful, we first denoised the data and made it similar to ASR text outputs. A comprehensive list of the filters we applied to the data includes:

• Removing any information past the References section.
• Removing links ("https..").
• Reforming broken words, since the text was in a two-column format.
• Removing any information before the Abstract section.
• Removing any characters that are not alphanumeric or punctuation.
• Removing any lines that start with or that contain too many numbers (to account for tables with data).
• Removing any lines with fewer than 10 characters (number obtained by averaging the minimum character length of each sentence in the dev data).
• Removing any lines longer than 297 characters (number obtained through a similar process as above).
• Reformatting the data such that it has one sentence per line.

These constraints were applied in order to mimic the text normalization of the dev data, so that the scraped ACL data could be incorporated into our model's source language side.

4 Systems

In this section, we separately describe our unconstrained and constrained submissions. Since we built cascaded models, we describe the automatic speech recognition (ASR) and machine translation (MT) components of each system.

4.1 Unconstrained Subtrack

4.1.1 Automatic Speech Recognition

An important characteristic of ACL presentations is the wide array of accents represented, which reflects the diverse backgrounds of NLP researchers. Accent-robust speech recognition continues to present a challenge to the community (Tadimeti et al., 2022; Riviere et al., 2021; Radford et al., 2022).

One model that demonstrated a degree of robustness to accented speech is Whisper (Radford et al., 2022), an ASR model trained on 680,000 hours of web-crawled data. Its performance on the accented splits of VoxPopuli (Wang et al., 2021), while significantly worse than on non-accented English, was comparable (without an external language model) to methods designed for accent robustness (with a strong language model) (Riviere et al., 2021). This robustness to accented speech, as well as its overall strong performance on English ASR, makes it well suited for the accent-diverse ACL presentations.

The domain specificity and technical terms of ACL presentations may still prove difficult for a strong ASR model like Whisper. We therefore condition the decoder towards key technical vocabulary and named entities by prompting Whisper with the corresponding abstracts when decoding each presentation.

Additionally, we test the effect of using the pre-segmented audio files (with oracle segmentation provided by the IWSLT 60-60 challenge organizers) versus using longer speech segments for Whisper decoding. We find that decoding the full talk at once results in a lower WER than decoding segment-by-segment. For Whisper-large, the best performing model, this difference is 0.6 WER. Longer-form inputs more closely match the training segments of Whisper, which were in 30-second segments (Radford et al., 2022).
4.1.2 Audio Segmentation et al., 2019). We use the 1.2B parameter version of
Since we found that decoding using unsegmented M2M100 in our experiments.
audio outperformed decoding using the predefined 4.1.4 Domain-Specific Data
segments, we segment our ASR text output in order
Using the 2021 ACL data described in Section 3,
to perform sentence-level machine translation. We
we attempted to perform sequence knowledge dis-
choose to perform sentence-level machine trans-
tillation (SeqKD) (Kim and Rush, 2016). Because
lation rather than incorporating more document
we only had additional source-side monolingual
context because our final systems make use of
data, SeqKD could give us pseudo-target labels in
many large pre-trained multilingual models that
order to retrain our best model on these outputs.
are trained at a sentence level rather than a docu-
Although NLLB-200-3.3B is our best model for
ment level.
many of our language pairs, we fine-tune NLLB-
Because we require sentence-level segments
200-1.3B instead due to computational constraints.
from our ASR outputs, we use the state-of-the-
While benchmarking these models, however, there
art ersatz neural sentence segmenter. ersatz has
is only a marginal improvement in using the larger
been shown to be more robust to technical terms in-
model over the smaller (average +0.6 chrF). For en-
cluding acronyms and irregular punctuation, which
ja, however, we continue to use mBART50-1toN.
is particularly helpful in the ACL domain (Wicks
Despite the large amount of in-domain source
and Post, 2021).
language data we made available, we did not see
4.1.3 Machine Translation much benefit from it ourselves, specifically for data
We test several pre-trained MT systems on our data. augmentation via SeqKD. We speculate that the
Specifically, we test NLLB-200 (NLLB Team et al., data may be too noisy in spite of filtering, and that
2022), mBART50 (Tang et al., 2020), and M2M100 its best use may be as source context during infer-
(Fan et al., 2021). All 10 of our target languages ence, rather than for training data augmentation.
are supported by these models.
4.2 Constrained Subtrack
The original NLLB-200 model is a 54 billion pa-
rameter Mixture-of-Experts model that translates 4.2.1 Automatic Speech Recognition
to and from 200 languages. It is trained on a We leveraged the pre-trained wav2vec 2.0 model
large amount of mined parallel, back-translated, (Baevski et al., 2020) for the constrained ST task.
and monolingual data. We use the 3.3B parame- Wav2vec 2.0 was trained in a self-supervised fash-
ter version of NLLB-200, which is a dense Trans- ion and requires fine-tuning on an annotated cor-
former model that is trained via online distillation pus in order to be used for the ASR task, with the
of the original model, but still supports all of the domain-similarity between the choice of the fine-
original 200 languages. tuning corpus and the evaluation data being crucial
mBART50 is the second iteration of the multi- for ASR performance. The most commonly used
lingual BART model, which is a dense transformer wav2vec 2.0 model is fine-tuned with a CTC objec-
architecture trained on multilingual text using a tive on Librispeech, a corpus made of audiobooks
denoising task. The authors of mBART50 also re- that is considered to have a considerable domain
lease a checkpoint of mBART50 that is fine-tuned mismatch compared to the ACL 60-60 data. Since
on the one-to-many translation task, which we will the development split of the ACL 60-60 data alone
refer to as mBART50-1toN. In this case, English is insufficient for wav2vec 2.0 fine-tuning, we in-
is the source, and all 50 covered languages are the stead performed a two-stage fine tuning with TED-
targets. LIUM 3 (Hernandez et al., 2018) being used in the
Finally, M2M100 is another transformer-based first stage and the ACL 60-60 development data
model that is trained directly on the MT task. It used in the second.
translates to and from 100 languages, and is a previ- Our approach to tackling the content domain mis-
ous iteration of the initiative that produced NLLB- match between the training data and ACL presen-
200. However, we still test both models because tations is to perform ASR decoding with the help
sometimes adding additional language pairs to a of an content-domain matching language model.
model can lead to the reduced performance of some What it means in practice is that we rescore the per-
language pairs (Aharoni et al., 2019; Arivazhagan frame output trellis with a content-domain match-
ing language model, which in turn was created by
304
interpolating a general language model (trained 5.1 ASR Experiments
from all the available English corpora in the con- 5.1.1 Prompting Whisper
strained challenge) and a domain-specific language
In the unconstrained setting, we evaluate Whisper
model (trained with transcripts from the ACL 60-
on both the segmented and unsegmented audio files.
60 development data). In order to bias our model
We simulate LM biasing by using the “prompt”
towards named entities mentioned in each specific
interface provided by Whisper.
presentation, we train a separate language model
for each presentation by re-interpolating the above- 5.1.2 Decoding with an Interpolated
mentioned language model with one trained with Language Model
the corresponding paper abstract. In the constrained setting, we build a domain-
4.2.2 Machine Translation adapted language model as follows: first we com-
bine transcripts from a number of ASR corpora that
In the constrained setting, we use mBART50-1toN
are available in the constrained challenge, namely
and M2M100 as our base models. We addition-
Librispeech, VoxPopuli, Common Voice (Ardila
ally test fine-tuning these models on MuST-C data,
et al., 2020), and TED-LIUM 3, to train a flexi-
which we hypothesized to be closely related to the
ble 6-gram general bpe-level language model for
ACL talk data, domain-wise (Di Gangi et al., 2019).
English. We proceed to interpolate the general
This data is comprised of professionally translated
English language model with one trained on the
English TED talks, which matches the presentation
development split transcripts from the ACL 60-60
domain as well as some of the technical nature of
challenge, allowing the model to gain exposure
the ACL talks, although to a lesser degree.
to technical terms within the NLP field. Finally,
We fine-tune both mBART and M2M100 using
during decoding, we further interpolate the previ-
the MuST-C transcripts and translations available
ously obtained language model with a low-order
in all 10 language pairs. We use data from both v1.2
language model trained from the paper abstract cor-
(v1.0 is contained in v1.2) and v2.0 depending on
responding to the current presentation, biasing our
language pair availability. A summary of this data
model towards technical terms and named entities
is provided in Table 1. For mBART, we additionally
that are likely to appear in the presentation.
test multilingual fine-tuning where we fine-tune on
We used KenLM (Heafield, 2011) to train and
all the language pairs simultaneously, rather than
integrate our language models. The interpolation
fine-tuning on a single language pair bitext (Tang
weights for each step were estimated using a leave-
et al., 2020).
one-out strategy on the development split, minimis-
ing the perplexity on the held-out transcript and
lang. pair MuST-C release # lines
averaging the interpolation weights.
en-ar v1.2 212085
5.1.3 Decoding with a Language Model
en-de v1.0 229703
Trained on Additional ACL Anthology
en-fa v1.2 181772
data
en-fr v1.0 275085
en-ja v2.0 328639 We use the text scraped from the proceedings and
en-nl v1.0 248328 workshops of ACL 2021 to train a 6-gram domain-
en-pt v1.0 206155 matching language model for decoding. Without
en-ru v1.0 265477 interpolation or additional data, this gives a WER
en-tr v1.2 236338 of 18.9 and a technical term recall of 0.47 using
en-zh v1.2 184801 Wav2Vec2-TED-LIUM 3 as the acoustic model.
We observe that using data from a similar domain
improves performance even though the data are
Table 1: Dataset statistics and source of MuST-C bitext
across the 10 task language pairs. relatively noisy.
5.1.4 Evaluation
5 Experimental Setup We compare ASR performance, as measured by
Word Error Rate (WER), across the different sys-
In this section, we provide technical details of our tems that we built. Specifically, we compute WER
experiments and our evaluation practices. on depunctuated lowercase transcripts. Since we
305
Acoustic Model Language Model WER Tech. Term Recall
Whisper-medium.en - 8.1 0.861
Whisper-medium.en abstract prompting 8.7 0.865
Whisper-large - 6.8 0.854
Whisper-large abstract prompting 6.9 0.852
Whisper-large abstract and conclusion prompting 6.7 0.863
Whisper-large abstract, conclusion and intro prompting 6.6 0.851
Whisper-large abstract, conclusion, intro & author name prompting 6.4 0.854
Wav2Vec2-960h librispeech librispeech-4gram 25.1 0.306
Wav2Vec2-960h librispeech interpolated LM 24.3 0.370
Wav2Vec2-960h librispeech inter. LM + dev transcripts 24.1 0.382
Wav2Vec2-960h librispeech inter. LM + dev + abstract 23.7 0.392
Wav2Vec2-960h librispeech inter. LM + dev + abstract + ACL anthology 20.7 0.462
HUBERT-960h librispeech librispeech-4gram 22.0 0.390
HUBERT-960h librispeech interpolated LM 21.7 0.386
HUBERT-960h librispeech inter. LM + dev transcripts 20.4 0.421
HUBERT-960h librispeech inter. LM + dev + abstract 20.4 0.498
HUBERT-960h librispeech inter. LM + dev + abstract + ACL anthology 18.5 0.473
Wav2Vec2-TED-LIUM 3 librispeech-4gram 20.9 0.383
Wav2Vec2-TED-LIUM 3 interpolated LM 19.5 0.422
Wav2Vec2-TED-LIUM 3 inter. LM + dev transcripts 18.9 0.436
Wav2Vec2-TED-LIUM 3 inter. LM + dev + abstract 14.2 0.626
Wav2Vec2-TED-LIUM 3 inter. LM + dev + abstract + ACL anthology 16.7 0.505
Wav2Vec2-TED-LIUM 3 ACL anthology only 18.9 0.470
Table 2: ASR results. WER is measured against depunctuated, all lower-case reference text.
either perform ASR on unsegmented talks (uncon- enizers provided by sacrebleu (ja-mecab and zh,
strainted), or on the SHAS-segmented audio (con- respectively).
strained), we use mwerSegmenter to align our out- For evaluating translations of ASR outputs, ei-
puts to the gold transcripts (Matusov et al., 2005). ther segmented using ersatz or pre-segmented us-
Because we are interested in the effect of using ing the provided SHAS-segmented wav files, we
domain-specific text to improve ASR on techni- use the mwerSegmenter to resegment the transla-
cal terms, we compute the recall of NLP-specific tions based on the references. For all languages ex-
technical words in our output. We obtain these cept Japanese and Chinese, we use detokenized text
technical terms by asking domain experts to flag as input to resegmentation. However, for Japanese
all technical terms in the development set reference and Chinese, we first use whitespace tokenization
transcript. as input to mwerSegmenter, and then detokenize
for scoring, which is retokenized according to the
5.2 MT Experiments sacrebleu package.
5.2.1 MuST-C fine-tuning
For bilingual fine-tuning on mBART50 and 6 Results
M2M100, we train for 40K updates, and use loss 6.1 ASR Results
to select the best checkpoint. For multilingual fine-
For the Whisper-based systems, we focus on the ef-
tuning on mBART50-1toN, we train for 100K up-
fects of prompting; for the constrained systems, we
dates, and use temperature sampling of the mixed
contrast different families of pre-trained ASR mod-
datset using T = 1.5. We use loss to select the
els fine-tuned on different ASR corpora; finally, we
best checkpoint. For all experiments, we use an
assess the efficacy of incorporating an in-domain
effective batch size of 2048 tokens.
language model during decoding. The full list of
5.2.2 Evaluation results is shown in Table 2.
For all experiments, we report BLEU and chrF Contrary to what we expected, prompting Whis-
scores as reported by sacrebleu (Post, 2018). For per with the corresponding paper abstracts not only
Japanese and Chinese, we use the appropriate tok- had little impact on the ASR WER, but also failed
306
mBART50-1toN M2M100 NLLB-200
language pair BLEU chrF BLEU chrF BLEU chrF
en-ar 22.6 52.9 16.2 46.3 37.6 65.4
en-de 37.4 66.0 39.7 66.8 42.9 69.6
en-fa 17.2 49.6 20.4 49.5 27.4 57.3
en-fr 46.4 70.4 54.5 74.6 55.9 76.2
en-ja 37.5 45.9 35.2 43.8 25.7 36.3
en-nl 41.0 69.0 50.9 75.3 51.5 76.1
en-pt 44.3 69.7 57.6 77.4 61.6 79.0
en-ru 22.2 52.0 24.3 54.3 27.4 57.2
en-tr 15.5 50.7 22.3 56.5 28.6 62.8
en-zh 43.8 38.8 45.7 40.7 42.2 38.5
Table 3: Unconstrained MT results on the development set using oracle transcripts as input. Both chrF and BLEU
scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh are computed using
the ja-mecab and zh tokenizers in sacrebleu, respectively. We bold our best chrF scores as it is the main metric of
the task.
Table 4: Constrained MT results on the development set using oracle transcripts as input. Both chrF and BLEU
scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh are computed using
the ja-mecab and zh tokenizers in sacrebleu, respectively. We bold our best chrF scores as it is the main metric of
the task.
to improve the recall of technical terms of the ASR domain language model (from Librispeech-4gram
system. Further increasing the length and relevance to Interpolated LM) resulted in WER improve-
of the prompts provided to whisper, such as adding ments while not necessarily helping technical term
the conclusion and part of the introduction section recall; by contrast, while LMs that better fit the
of each paper corresponding to the ACL presenta- domain may not necessarily help WER, they bring
tion in question, had marginal impact on both of the substantial gains in technical term recall.
above-mentioned metrics. A more detailed look at The language model that best fits our domain,
the mechanism and behaviour of Whisper prompt- namely the model that interpolates the LMs trained
ing could help to understand this observation. from every ASR corpus in addition to the develop-
On the constrained side, the incorporation of the ment transcripts, from the current paper abstract,
interpolated LM during ASR decoding had a sig- and from the crawled ACL anthology, provided
nificant impact on the performance of our ASR substantial improvement on both WER and tech-
systems, regardless of the upstream acoustic model. nical term recall for the weaker acoustic models
As expected, increasing the quality of the out-of- (Wav2Vec2 fine-tuned on Librispeech) but not on
307
Constrained Unconstrained
language MT system BLEU chrF MT system BLEU chrF
en-ar mBART50-1toN+MuST-C 15.3 45.6 NLLB-200-3.3B 33.7 62.5
en-de M2M100 24.3 55.2 NLLB-200-3.3B 39.6 67.8
en-fa mBART50-1toN+MuST-C 14.8 42.0 NLLB-200-3.3B 24.5 54.3
en-fr M2M100 33.3 61.9 NLLB-200-3.3B 49.3 72.5
en-ja mBART50-1toN 21.9 29.9 mBART50-1toN 34.8 43.1
en-nl M2M100 30.6 62.5 NLLB-200-3.3B 45.7 72.4
en-pt M2M100 34.9 63.4 NLLB-200-3.3B 54.7 75.6
en-ru M2M100 15.0 45.1 NLLB-200-3.3B 24.8 54.4
en-tr M2M100 11.9 43.5 NLLB-200-3.3B 24.7 58.8
en-zh M2M100 32.2 26.6 M2M100 37.7 33.5
Table 5: Final speech translation results for both our constrained and unconstrained systems on the development set.
Both chrF and BLEU scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh
are computed using the ja-mecab and zh tokenizers in sacrebleu, respectively. We used output from our strongest
ASR system, Whisper-large with abstract prompting, as the input to our translation system.
the stronger acoustic models. scripts is -5.7 chrF. In the constrained case, this
value is -12.8 chrF. The small reduction in the un-
6.2 MT results constrained system indicates that our cascaded ap-
We detail the results of testing pre-trained MT mod- proach of two strong components is a viable option
els as described in Section 4 on the oracle tran- for ST in this setting. However, our constrained
scripts in Table 3. This table reflects experiments system could likely benefit from techniques that
we performed for the unconstrained setting. We help reduce the error propagation from ASR, like
find that for almost all language pairs, NLLB-200- mixing ASR outputs with gold source sentences
3.3B has the best performance, except for en-ja during MT training, or joint training of ASR and
and en-zh, which perform best with mBART and MT components.
M2M100, respectively.
We summarize our fine-tuning results in Table 7 Conclusion
4. This table reflects experiments we performed We present a constrained and unconstrained system
for the constrained setting. We find that in gen- for the IWSLT 2023 Multilingual speech transla-
eral, the additional data can provide a boost over tion task. We address some of the major challenges
mBART50-1toN, but not for M2M100. Addition- of this dataset with our design choices: ASR ro-
ally, we find that despite positive results in Tang bust to speaker accents, adaptation to match the
et al. (2020), multilingual fine-tuning does not out- domain specificity, and ASR prompting to incorpo-
perform bilingual fine-tuning in this setting. For a rate context in this academic talk-level translation
majority of pairs, M2M100 without fine-tuning is task. We additionally release a supplemental ACL
the best system, but for en-ar and en-fa, mBART50- audio and text corpus to encourage further work in
1toN with fine-tuning is the best system, and simi- high quality speech translation of ACL content.
lar to the unconstrained system, mBART50-1toN
without fine-tuning is the best system for en-ja.
References
6.3 ST Results
Milind Agarwal, Sweta Agrawal, Antonios Anasta-
Final results for both our constrained and uncon- sopoulos, Ondřej Bojar, Claudia Borg, Marine
strained systems are summarized in Table 5. We Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda
translate the transcripts from our best ASR systems Chen, William Chen, Khalid Choukri, Alexandra
using the best language-pair specific MT systems. Chronopoulou, Anna Currey, Thierry Declerck, Qian-
qian Dong, Yannick Estève, Kevin Duh, Marcello
In the unconstrained case, the average reduction in Federico, Souhir Gahbiche, Barry Haddow, Benjamin
chrF from using ASR outputs versus oracle tran- Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Ja-
308
vorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kenneth Heafield. 2011. KenLM: Faster and smaller
Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, language model queries. In Proceedings of the Sixth
Evgeny Matusov, Paul McNamee, John P. McCrae, Workshop on Statistical Machine Translation, pages
Kenton Murray, Maria Nadejde, Satoshi Nakamura, 187–197, Edinburgh, Scotland. Association for Com-
Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, putational Linguistics.
Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino,
Lonneke van der Plas, Peter Polák, Elijah Rippeth, François Hernandez, Vincent Nguyen, Sahar Ghannay,
Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Se- Natalia A. Tomashenko, and Yannick Estève. 2018.
bastian Stüker, Katsuhito Sudoh, Yun Tang, Brian TED-LIUM 3: twice as much data and corpus repar-
Thompson, Kevin Tran, Marco Turchi, Alex Waibel, tition for experiments on speaker adaptation. CoRR,
Mingxuan Wang, Shinji Watanabe, and Rodolfo Ze- abs/1805.04699.
vallos. 2023. Findings of the IWSLT 2023 Evaluation
Campaign. In Proceedings of the 20th International Yoon Kim and Alexander M. Rush. 2016. Sequence-
Conference on Spoken Language Translation (IWSLT level knowledge distillation. In Proceedings of the
2023). Association for Computational Linguistics. 2016 Conference on Empirical Methods in Natu-
ral Language Processing, pages 1317–1327, Austin,
Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Texas. Association for Computational Linguistics.
Massively multilingual neural machine translation.
In Proceedings of the 2019 Conference of the North Evgeny Matusov, Gregor Leusch, Oliver Bender, and
American Chapter of the Association for Computa- Hermann Ney. 2005. Evaluating machine translation
tional Linguistics: Human Language Technologies, output with automatic sentence segmentation. In Pro-
Volume 1 (Long and Short Papers), pages 3874–3884, ceedings of the Second International Workshop on
Minneapolis, Minnesota. Association for Computa- Spoken Language Translation, Pittsburgh, Pennsylva-
tional Linguistics. nia, USA.
Rosana Ardila, Megan Branson, Kelly Davis, Michael NLLB Team, Marta R. Costa-jussà, James Cross, Onur
Kohler, Josh Meyer, Michael Henretty, Reuben Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef-
Morais, Lindsay Saunders, Francis Tyers, and Gre- fernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
gor Weber. 2020. Common voice: A massively- Jean Maillard, Anna Sun, Skyler Wang, Guillaume
multilingual speech corpus. In Proceedings of the Wenzek, Al Youngblood, Bapi Akula, Loic Bar-
Twelfth Language Resources and Evaluation Confer- rault, Gabriel Mejia-Gonzalez, Prangthip Hansanti,
ence, pages 4218–4222, Marseille, France. European John Hoffman, Semarley Jarrett, Kaushik Ram
Language Resources Association. Sadagopan, Dirk Rowe, Shannon Spruit, Chau
Tran, Pierre Andrews, Necip Fazil Ayan, Shruti
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Bhosale, Sergey Edunov, Angela Fan, Cynthia
Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Gao, Vedanuj Goswami, Francisco Guzmán, Philipp
Mia Xu Chen, Yuan Cao, George Foster, Colin Koehn, Alexandre Mourachko, Christophe Ropers,
Cherry, et al. 2019. Massively multilingual neural Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
machine translation in the wild: Findings and chal- 2022. No language left behind: Scaling human-
lenges. arXiv preprint arXiv:1907.05019. centered machine translation.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Matt Post. 2018. A call for clarity in reporting BLEU
and Michael Auli. 2020. wav2vec 2.0: A framework scores. In Proceedings of the Third Conference on
for self-supervised learning of speech representations. Machine Translation: Research Papers, pages 186–
In Advances in Neural Information Processing Sys- 191, Brussels, Belgium. Association for Computa-
tems, volume 33, pages 12449–12460. Curran Asso- tional Linguistics.
ciates, Inc.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, man, Christine McLeavey, and Ilya Sutskever. 2022.
Matteo Negri, and Marco Turchi. 2019. MuST-C: a Robust speech recognition via large-scale weak su-
Multilingual Speech Translation Corpus. In Proceed- pervision. arXiv preprint arXiv:2212.04356.
ings of the 2019 Conference of the North American
Chapter of the Association for Computational Lin- Morgane Riviere, Jade Copet, and Gabriel Synnaeve.
guistics: Human Language Technologies, Volume 1 2021. Asr4real: An extended benchmark for speech
(Long and Short Papers), pages 2012–2017, Min- models. arXiv preprint arXiv:2110.08583.
neapolis, Minnesota. Association for Computational
Elizabeth Salesky, Kareem Darwish, Mohamed Al-
Linguistics.
Badrashiny, Mona Diab, and Jan Niehues. 2023.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Evaluating Multilingual Speech Translation Under
Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Realistic Conditions with Resegmentation and Ter-
Baines, Onur Celebi, Guillaume Wenzek, Vishrav minology. In Proceedings of the 20th International
Chaudhary, et al. 2021. Beyond english-centric multi- Conference on Spoken Language Translation (IWSLT
lingual machine translation. The Journal of Machine 2023). Association for Computational Linguistics.
Learning Research, 22(1):4839–4886.
309
Divya Tadimeti, Kallirroi Georgila, and David Traum.
2022. Evaluation of off-the-shelf speech recognizers
on different accents in a dialogue domain. In Pro-
ceedings of the Thirteenth Language Resources and
Evaluation Conference, pages 6001–6008, Marseille,
France. European Language Resources Association.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. arXiv
preprint arXiv:2008.00401.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonol-
losa, and Marta R. Costa-jussà. 2022. SHAS: Ap-
proaching optimal Segmentation for End-to-End
Speech Translation. In Proc. Interspeech 2022, pages
106–110.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Rachel Wicks and Matt Post. 2021. A unified approach
to sentence segmentation of punctuated text in many
languages. In Proceedings of the 59th Annual Meet-
ing of the Association for Computational Linguistics
and the 11th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers),
pages 3995–4007, Online. Association for Computa-
tional Linguistics.
310
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023
Speech-to-Speech Translation Task
Kun Song1 , Yi Lei1 , Peikun Chen1 , Yiqing Cao2 , Kun Wei1 , Yongmao Zhang1 ,
Lei Xie1∗ , Ning Jiang3 , Guoqing Zhao3
1
Audio, Speech and Language Processing Group (ASLP@NPU),
School of Computer Science, Northwestern Polytechnical University, China
2
Department of Computer Science and Technology, Nanjing University, China
3
MaShang Consumer Finance Co., Ltd, China
312
speech of various data distributions. To let the on Curriculum Learning (Bengio et al., 2009), we
ASR model generalize better to the multi-source adopt a three-stage fine-tuning strategy to mitigate
input, we adopt a model fusion strategy. Specifi- such a mismatch.
cally, we train the Conformer and E-branchformer
• Fine-tuning using the MT data: First, we
models introduced in Section 2.1 using the com-
use all the MT data to fine-tune the pre-trained
bination of the original and the augmented data.
model to improve the accuracy of the model
Each testing utterance is then transcribed by these
in the En2Zh translation task.
different models, resulting in multiple outputs. Fi-
nally, ROVER (Fiscus, 1997) is adopted to align • Fine-tuning using the MT data in ASR tran-
and vote with equal weights on the multiple outputs, scription format: Second, we convert the
resulting in the final ASR output. English text in the MT data into the ASR
2.4 ASR Output Post-processing transcription format. Then, we fine-tune the
Given that the spontaneous speech in the test set MT model using the converted data, which is
contains frequent filler words such as "Uh" and closer to the actual text than the ASR recog-
"you know", it is necessary to address their impact nition output. This approach can enhance the
on subsequent MT accuracy and TTS systems that stability of the fine-tuning process, minimize
rely on the ASR output. To mitigate this issue, the impact of ASR recognition issues on the
we use a simple rule-based post-processing step translation model, and improve the model’s
to detect and eliminate these expressions from the ability to learn punctuation, thereby enhanc-
ASR output. By doing so, we improve the accuracy ing its robustness.
of the downstream modules. • Fine-tuning using the ASR outputs: Third,
3 Machine Translation we leverage GigaSpeech (Chen et al., 2021)
For the MT module, we first use a pre-trained lan- to address the mismatch problem between the
guage model as a basis for initialization and then ASR outputs and the MT data. Specifically,
employ various methods to further enhance transla- we use the ASR module to transcribe the Gi-
tion accuracy. gaSpeech training set and replace the corre-
3.1 Pre-trained Language Model sponding transcriptions in GigaST (Ye et al.,
As pre-trained language models are considered 2022) with the ASR transcriptions for transla-
part of the training data in the offline track and tion model fine-tuning. This enables the MT
can be used in the S2ST track, we use the pre- model to adapt to ASR errors.
trained mBART50 model for initializing our MT
module. mBART50 (Liu et al., 2020) is a multi- 3.3 Back Translation
lingual BART (Lewis et al., 2020) model with 12 Following (Akhbardeh et al., 2021), we adopt the
layers of encoder and decoder, which we believe back translation method to enhance the data and
will provide a solid basis for improving translation improve the robustness and generalization of the
accuracy. model. First, we train a Zh2En MT model to trans-
3.2 Three-stage Fine-tuning based on late Chinese to English, using the same method
Curriculum Learning employed for the En2Zh MT module. Next, we
We perform fine-tuning on the pre-trained model to generate the corresponding English translations for
match the English-to-Chinese (En2Zh) translation the Chinese text of the translation data. Finally, we
task. There are substantial differences between combine the back translation parallel corpus pairs
the ASR outputs and the texts of MT data. First, with the real parallel pairs and train the MT model.
ASR prediction results inevitably contain errors. 3.4 Cross-validation
Second, ASR outputs are normalized text without We use 5-fold cross-validation (Ojala and Garriga,
punctuation. Therefore, directly fine-tuning the 2010) to improve the robustness of translation and
pre-trained model with the MT data will cause a reduce over-fitting. Firstly, we randomly divide the
mismatch problem with the ASR output during data into five equal parts and train five models on
inference. On the other hand, fine-tuning the model different datasets by using one of them as the vali-
with the ASR outputs will cause difficulty in model dation set each time and combining the remaining
coverage because of the difference between the four as the training set. After that, we integrate the
ASR outputs and the MT data. Therefore, based predicted probability distributions from these five
313
Speech
24kHz Speech
BN VISinger 2
Decoder
Audio Super-resolution
Conformer Decoder
16kHz Speech Speaker Embedding Posterior Encoder
MT Output Text BN
models to obtain the final predicted probability dis- BN features contain the duration and prosody
tribution for the next word during token generation information, which eliminates the need for text
for predicting the translation results. transcripts and prosody modeling. Instead, the
BN-to-speech stage focuses on time-invariant
4 Text-to-speech
information modeling, such as speaker timbre.
4.1 Overview As the goal of this work is to conduct zero-shot
Figure 1 (a) shows the pipeline of the text-to-speech English-to-Chinese speech translation, we concen-
module in the proposed S2ST system. The TTS trate on the method to transfer the unseen speaker
module is built on a BN-based two-stage architec- timbre of the source English speech to the synthe-
ture, which consists of a text-to-BN and a BN-to- sized Chinese speech through voice cloning (Chen
speech procedure. The text-to-BN stage tends to et al., 2019). To capture new speaker timbre dur-
generate BN features from the Chinese text trans- ing inference, the TTS module requires to model
lated by the MT module. The BN-to-speech stage abundant various speakers during training, which
produces 16KHz Chinese speech from the BN fea- relies on large-scale high-quality TTS data. Un-
ture, conditioning on the speaker embedding of fortunately, we are limited in the high-quality TTS
source speech. Given the translated Chinese speech data we can use in this task and must rely on ad-
which preserves the speaker timbre in the source ditional data such as ASR to model the speaker
English speech, an audio super-resolution model is timbre. However, this data is not suitable for TTS
further leveraged to convert the synthesized speech model training because the labels are inconsistent
from 16KHz to 24KHz for higher speech fidelity. with TTS, and the prosody of the speakers is not as
Building on the two-stage framework good as high-quality TTS data.
AdaVITS (Song et al., 2022a), we employ Furthermore, we incorporate ASR data into the
bottleneck (BN) features as the intermediate BN-to-speech training procedure by re-sampling
representations in the two-stage TTS module. BN all the training speech to 16kHz, which can not
features, extracted from a multi-condition trained reach high-quality audio. Therefore, we utilize
noise-robust ASR system, mainly represent the audio super-resolution techniques to upsample the
speaker-independent linguistic content. So BN can synthesized 16KHz audio and convert it into higher
effectively disentangle the speaker timbre and the sampling rate audio.
linguistic content information. In the text-to-BN 4.2 Text-to-BN
stage, high-quality TTS data is adopted in the
Our text-to-BN stage network in TTS is based on
training phase to model the speaker-independent
DelightfulTTS (Liu et al., 2021), which employs a
BN features with prosody information. In the
Conformer-based encoder, decoder, and a variance
BN-to-speech stage, both high-quality TTS data
adapter for modeling duration and prosody. The
and low-quality ASR data should be involved
model extends phoneme-level linguistic features to
during training to sufficiently model the speech of
frame-level to guarantee the clarity and naturalness
various speaker identities. Extracted from speech,
of speech in our system.
314
4.3 BN-to-speech 5.1.1 ASR Data
We build the BN-to-speech model based on For the English ASR module in our proposed sys-
VITS (Kim et al., 2021), which is a mainstream tem, we use GigaSpeech, LibriSpeech, TED-LIUM
end-to-end TTS model. VITS generates speech v2&v3 as training data. For the ASR system used to
waveforms directly from the input textual informa- extract BN features in TTS, we use text-to-speech
tion, rather than a conventional pipeline of using data in AISHELL-3 and Chinese speech in GigaS2S,
the combination of an acoustic model and a neural along with the corresponding Chinese text in Gi-
vocoder. gaST, as the training set. Since the test set’s MT
The network of the BN-to-speech stage consists output text is a mix of Chinese and English, includ-
of a BN encoder, posterior encoder, decoder, flow, ing names of people and places, the TTS module
and speaker encoder. The monotonic alignment needs to support both languages. Therefore, we
search (MAS) from the original VITS is removed also add the aforementioned English data to the
since BN features contain the duration information. training set.
For achieving zero-shot voice cloning, an ECAPA- 5.1.2 MT Data
TDNN (Desplanques et al., 2020) speaker encoder We use the text-parallel data including News Com-
is pre-trained to provide the speaker embedding mentary and OpenSubtitles2018 as MT training set.
as the condition of the synthesized speech. To Moreover, we also add the Chinese texts in GigaST
avoid periodic signal prediction errors in the orig- and the English texts in GigaSpeech corresponding
inal HiFiGAN-based (Kong et al., 2020) decoder to the Chinese texts in GigaST to the training set.
in VITS, which induces sound quality degradation, 5.1.3 TTS Data
we follow VISinger2 (Zhang et al., 2022) to adopt a We use AISHELL-3 as training data in Text-to-BN
decoder with the sine excitation signals. Since The and audio super-resolution. For the pre-trained
VISinger2 decoder requires pitch information as speaker encoder, we adopt LibriSpeech, which con-
input, we utilize a pitch predictor with a multi-layer tains 1166 speakers, as the training data.For the BN-
Conv1D that predicts the speaker-dependent pitch to-speech model, in addition to using AISHELL-3
from BN and speaker embedding. With the desired which has 218 speakers, we also use LibriSpeech
speaker embedding and corresponding BN features, to meet the data amount and speaker number re-
the BN-to-speech module produces Chinese speech quirements of zero-shot TTS.
in the target timbre. 5.2 Data Pre-processing
4.4 Audio Super-resolution 5.2.1 ASR Data
Following (Liu et al., 2021), we use an upsam- To prepare the ASR data, we pre-process all tran-
pling network based vocoder to achieve audio scripts to remove audio-related tags. Next, we map
super-resolution (16kHz→24kHz). During train- the text to the corresponding byte-pair encoding
ing, the 16KHz mel-spectrogram is used as the (BPE) unit and count the number of BPE units in
condition to predict the 24KHz audio in the au- the ASR dictionary, which totals 5,000 units. For
dio super-resolution model. Specifically, we adopt audio processing, we use a frame shift of 10ms and
the AISHELL-3 (Shi et al., 2021) dataset, com- a frame length of 25ms and normalize all audio to
posing the paired 16KHz and 24KHz speech data 16KHz.
for model training. During inference, the high- 5.2.2 MT Data
quality 24kHz speech is produced for the mel- For the MT data, we use the same tokenizer as
spectrogram of the 16KHz speech generated by the mBART50 to perform sub-word segmentation for
BN-to-speech model. Here DSPGAN (Song et al., English and Chinese texts and to organize them
2022b) is adopted as our audio super-resolution into a format for neural network training. By doing
model, which is a universal vocoder that ensures so, we can maximize the benefits of initializing
robustness and good sound quality without periodic our translation model with mBART50 pre-trained
signal errors. model parameters. The mBART tokenizer men-
5 Data Preparation tioned above is a Unigram tokenizer. A Unigram
model is a type of language model that consid-
5.1 Datasets
Following the constraint of data usage, the training ers each token to be independent of the tokens be-
dataset for the S2ST system is illustrated in Table 1. fore it. What’s more, the tokenizer has a total of
5
https://github.com/SpeechTranslation/ 250,054 word segmentations, supports word seg-
GigaS2S mentation processing for English, Chinese, and
315
Table 1: Datasets used in our proposed system.
other languages, and uses special tokens like <s>, responding En-Zh texts. It is worth noting that the
</s>, and <unk>. development data for evaluations has been removed
5.2.3 TTS Data from the training dataset.
For AISHELL-3, we downsample it to 16KHz and 6 Experiments
24KHz respectively as the TTS modeling target
and the audio super-resolution modeling target. All 6.1 Experimental Setup
other data is down-sampled to 16KHz. All data All the models in our system are trained on 8 A100
in TTS adopts 12.5ms frame shift and 50ms frame GPUs and optimized with Adam (Kingma and Ba,
length. 2015).
Speech Enhancement. Given the presence of ASR Module. All ASR models are implemented
substantial background noise in the test set, the dis- in ESPnet6 . Both Conformer and E-Branchformer
criminative power of speaker embeddings is signif- models employ an encoder with 17 layers and a
icantly reduced, thereby impeding the performance feature dimension of 512, with 8 heads in the self-
of the TTS module. Furthermore, the ASR data in- attention mechanism and an intermediate hidden
corporated during the training of the BN-to-speech dimension of 2048 for the FFN. In addition, we
model is also subject to background noise. There- employ a 6-layer Transformer decoder with the
fore, we employ a single-channel wiener filtering same feature hidden dimension as the encoder. The
method (Lim and Oppenheim, 1979) to remove E-Branchformer model uses a cgMLP with an in-
such noise from these data. Please note that we termediate hidden dimension of 3072. The total
do not perform speech enhancement on the test set number of parameters for the Conformer and E-
in the ASR module, because there is a mismatch Branchformer model in Section 2.1 is 147.8M and
between the denoised audio and which is used in 148.9M respectively. We train the models with
ASR training, and denoising will reduce the speech batch size 32 sentences per GPU for 40 epochs,
recognition accuracy. and set the learning rate to 0.0015, the warm-up
step to 25K.
5.2.4 Evaluation Data
For data augmentation, we conduct speed per-
For all evaluations, we use the English-Chinese turbation, pitch shifting, and audio codec on the
(En-Zh) development data divided by the organizer original recordings. Spectrum augmentation and
from GigaSpeech, GigaST and GigaS2S, including
5,715 parallel En-Zh audio segments, and their cor- 6
https://github.com/espnet/espnet
316
noise augmentation are used for on-the-fly model AISHELL-3.
training. Proposed system & Ablation Study. We fur-
MT Module. All MT models are implemented ther conduct ablation studies to evaluate each com-
in HuggingFace7 . Using MT data, we fine-tune the ponent in the proposed system. Specifically, the
mBART-50 large model, which has 611M param- ablation studies are designed to verify the effec-
eters, with a batch size of 32 sentences per GPU tiveness of model fusion and data augmentation
for 20 epochs. The learning rate is set to 3e-5 and in ASR, three-stage fine-tuning, back translation,
warmed up for the first 10% of updates and linearly cross-verification in MT, two-stage training with
decayed for the following updates. For fine-tuning BN, pre-trained speaker embedding, and audio
using the MT data in ASR transcription format and super-resolution in TTS.
the ASR outputs, we also fine-tune the model with
6.3 Results & Analysis
batch size 32 sentences per GPU for 5 epochs and
set the learning rate to 3e-5, which is warmed up We conduct experiments on the effectiveness of
for the first 5% of updates and linearly decayed for each sub-module and the performance of our pro-
the following updates. posed cascaded S2ST system.
TTS Module. We complete our system based 6.3.1 ASR Module
on VITS official code8 . The text-to-BN follows We calculate the word error rate (WER) of each
the configuration of DelightfulTTS and has about ASR module to evaluate the English speech recog-
64M parameters. To extract the duration required nition accuracy. As shown in Table 2, the WER
for text-to-BN, we train a Kaldi9 model using of the proposed system has a significant drop com-
AISHELL-3. The ASR system used for extract- pared with the baseline, which indicates that the
ing BN is the Chinese-English ASR model men- proposed system greatly improves the recognition
tioned in Section 5.1.1. For BN-to-speech, we use accuracy. Moreover, the results of the ablation
a 6-layer FFT as the BN encoder and follow the study demonstrate the effectiveness of both model
other configuration in VIsinger2 with about 45M fusion and data augmentation in improving speech
parameters in total. The pitch predictor has 4 lay- recognition accuracy.
ers of Conv1D with 256 channels. Pitch is ex-
Table 2: The WER results of each ASR module.
tracted by Visinger2 decoder and DSPGAN from
Harvest (Morise, 2017) with Stonemask. To pre-
Model WER (%)
dict pitch in DSPGAN, we use the method de-
scribed in Section 4.3. Up-sampling factors in Baseline 13.53
DSPGAN is set as [5, 5, 4, 3] and other config- Proposed system 10.25
uration of DSPGAN-mm is preserved for audio w/o model fusion 11.95
super-resolution. The DSPGAN model has about w/o data augmentation 12.40
9M parameters in total. We train all the above mod-
els with a batch size of 64 sentences per GPU for
6.3.2 MT Module
1M steps and set the learning rate to 2e-4. For the
We evaluate our MT module in terms of the BLEU
pre-trained speaker encoder, we follow the model
score, which measures the n-gram overlap between
configuration and training setup of ECAPA-TDNN
the predicted output and the reference sentence.
(C=1024) with 14.7M parameters.
Table 3: The BLEU results of each MT module.
6.2 Evaluation Models
Baseline. To evaluate the effectiveness of the pro- Model BLEU
posed cascaded S2ST system, we adopt the orig-
Baseline 28.1
inal cascaded S2ST system as a baseline, includ-
Proposed system 33.4
ing an E-Branchformer ASR model, a mBART50
w/o three-stage fine-tuning 28.7
MT model fine-tuned using the MT data, and an
end-to-end TTS model based on VITS trained with w/o back translation 30.8
w/o cross-validation 31.0
7
https://github.com/huggingface/
transformers
8
https://github.com/jaywalnut310/vits As shown in Table 4, the proposed system with
9
https://github.com/kaldi-asr/kaldi three-stage fine-tuning achieves a significantly bet-
317
Table 4: Experimental results of TTS in terms of MOS and WER. BN means using two-stage training with BN and
pre-trained spkr. embed. means using pre-trained speaker embedding.
Model Clarity in CER (%) Naturalness (MOS) Sound Quality (MOS) Speaker Similarity (MOS)
Baseline 7.14 3.38±0.05 3.81±0.04 2.12±0.06
Proposed system 6.12 3.70±0.06 3.86±0.06 3.72±0.06
w/o BN 7.12 3.40±0.04 3.81±0.05 3.10±0.07
w/o Pre-trained spkr. embd. - - 4.05±0.05 2.22±0.06
w/o Audio super-resolution - - 3.64±0.04 -
Recording 4.53 4.01±0.04 3.89±0.03 4.35±0.05
ter BLEU score than the baseline, demonstrating an intermediate representation in our experimental
the effectiveness of curriculum learning in our sce- scenario.
nario. Furthermore, by incorporating back trans-
6.3.4 System Evaluation
lation and cross-validation, the translation perfor-
mance can be further improved. Finally, we calculate the ASR-BLEU score for the
baseline and the proposed system to evaluate the
6.3.3 TTS Module speech-to-speech translation performance. Specif-
ically, we use the ASR system to transcribe the
We calculate the character error rate (CER) to eval- Chinese speech generated by TTS, and then com-
uate the clarity of speech for each TTS module. pute the BLEU scores of the ASR-decoded text
The ASR system used for calculating CER is the with respect to the reference English translations.
Chinese-English ASR model mentioned in Sec- The ASR system for transcribing Chinese speech
tion 5.1.1. Additionally, we conduct mean opinion is the same as that in Section 6.2.3.
score (MOS) tests with ten listeners rating each
sample on a scale of 1 (worst) to 5 (best) to evaluate Table 5: The ASR-BLEU results of each system.
naturalness, sound quality, and speaker similarity.
In the ablation study without pre-trained speaker Model ASR-BLEU
embedding, speaker ID is to control the speaker Baseline 27.5
timbre of the synthesized speech. To eliminate the Proposed system 32.2
influence of ASR and MT results on TTS evalua-
tion, we use the Chinese text in the evaluation data As shown in Table 5, our proposed system
and its corresponding English source speech as the achieves a higher ASR-BLEU score than the base-
reference of speaker timbre as the test set for TTS line, which indicates that our proposed system has
evaluation. good speech-to-speech translation accuracy.
As shown in Table 3, our proposed system has
achieved significant improvement in naturalness, 7 Conclusion
sound quality, speaker similarity, and clarity of This paper describes the NPU-MSXF speech-to-
speech compared with the baseline. Interestingly, speech translation system, which we develop for
the system without pre-trained speaker embedding the IWSLT 2023 speech-to-speech translation task.
has better sound quality than both the proposed sys- Our system is built as a cascaded system that in-
tem and recording. We conjecture the reason is that cludes ASR, MT, and TTS modules. To ensure
the pre-trained speaker embedding greatly influ- good performance with multi-source data, we im-
ences the sound quality in the zero-shot TTS setup. proved each module using various techniques such
Therefore, the quality of the synthesized 24KHz as model fusion and data augmentation in the
audio is superior to the 16KHz recording, which ASR, three-stage fine-tuning, back translation, and
can be demonstrated by the 3.64 MOS score of cross-validation in the MT, and two-stage training,
the system without audio super-resolution. Mean- pre-trained speaker embedding, and audio super-
while, the speaker similarity MOS score is very low resolution in the TTS. Through extensive experi-
due to the lack of generalization ability to unseen ments, we demonstrate that our system achieves
speakers. Without using the BN-based two-stage high translation accuracy, naturalness, sound qual-
model, the system decreases performance on all ity, and speaker similarity with multi-source input.
indicators, which shows the effectiveness of BN as
318
References 21st Annual Conference of the International Speech
Communication Association, Virtual Event, Shang-
Farhad Akhbardeh, Arkady Arkhangorodsky, Mag- hai, China, 25-29 October 2020, pages 3830–3834.
dalena Biesialska, Ondrej Bojar, Rajen Chatter- ISCA.
jee, Vishrav Chaudhary, Marta R. Costa-jussà,
Cristina España-Bonet, Angela Fan, Christian Fe- Jonathan G Fiscus. 1997. A post-processing system
dermann, Markus Freitag, Yvette Graham, Ro- to yield reduced word error rates: Recognizer out-
man Grundkiewicz, Barry Haddow, Leonie Harter, put voting error reduction (ROVER). In 1997 IEEE
Kenneth Heafield, Christopher Homan, Matthias Workshop on Automatic Speech Recognition and Un-
Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, derstanding Proceedings, pages 347–354. IEEE.
Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp
Koehn, Nicholas Lourie, Christof Monz, Makoto Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki
Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang,
Nakazawa, Matteo Negri, Santanu Pal, Allahsera Au- Zhengdong Zhang, Yonghui Wu, and Ruoming Pang.
guste Tapo, Marco Turchi, Valentin Vydrin, and Mar- 2020. Conformer: Convolution-augmented trans-
cos Zampieri. 2021. Findings of the 2021 confer- former for speech recognition. In Interspeech 2020,
ence on machine translation (WMT21). In Proceed- 21st Annual Conference of the International Speech
ings of the Sixth Conference on Machine Translation, Communication Association, Virtual Event, Shang-
WMT@EMNLP 2021, Online Event, November 10- hai, China, 25-29 October 2020, pages 5036–5040.
11, 2021, pages 1–88. Association for Computational ISCA.
Linguistics.
François Hernandez, Vincent Nguyen, Sahar Ghannay,
Rosana Ardila, Megan Branson, Kelly Davis, Michael Natalia A. Tomashenko, and Yannick Estève. 2018.
Kohler, Josh Meyer, Michael Henretty, Reuben TED-LIUM 3: Twice as much data and corpus repar-
Morais, Lindsay Saunders, Francis M. Tyers, and tition for experiments on speaker adaptation. In
Gregor Weber. 2020. Common voice: A massively- Speech and Computer - 20th International Confer-
multilingual speech corpus. In Proceedings of The ence, SPECOM 2018, Leipzig, Germany, September
12th Language Resources and Evaluation Confer- 18-22, 2018, Proceedings, volume 11096 of Lecture
ence, LREC 2020, Marseille, France, May 11-16, Notes in Computer Science, pages 198–208. Springer.
2020, pages 4218–4222. European Language Re-
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey,
sources Association.
Melvin Johnson, Zhifeng Chen, and Yonghui Wu.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, 2019. Direct speech-to-speech translation with
and Jason Weston. 2009. Curriculum learning. In a sequence-to-sequence model. In Interspeech
Proceedings of the 26th Annual International Con- 2019, 20th Annual Conference of the International
ference on Machine Learning, ICML 2009, Montreal, Speech Communication Association, pages 1123–
Quebec, Canada, June 14-18, 2009, volume 382 of 1127. ISCA.
ACM International Conference Proceeding Series, Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021.
pages 41–48. ACM. Conditional variational autoencoder with adversar-
ial learning for end-to-end text-to-speech. In Pro-
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu
ceedings of the 38th International Conference on
Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel
Machine Learning, ICML 2021, 18-24 July 2021, Vir-
Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, San-
tual Event, volume 139 of Proceedings of Machine
jeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao,
Learning Research, pages 5530–5540. PMLR.
Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang,
Zhao You, and Zhiyong Yan. 2021. Gigaspeech: An Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan,
evolving, multi-domain ASR corpus with 10, 000 Prashant Sridhar, Kyu J Han, and Shinji Watanabe.
hours of transcribed audio. In Interspeech 2021, 2023. E-branchformer: Branchformer with enhanced
22nd Annual Conference of the International Speech merging for speech recognition. In 2022 IEEE Spo-
Communication Association, Brno, Czechia, 30 Au- ken Language Technology Workshop (SLT), pages
gust - 3 September 2021, pages 3670–3674. ISCA. 84–91. IEEE.
Yutian Chen, Yannis M. Assael, Brendan Shillingford, Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
Low-Resource Formality Controlled NMT Using Pre-trained LM
has been tailored to individual languages and has labeled large amounts of data using word lists or morphological analyzers.

3 Approach

3.1 Overview

The task of formality-controlled generation can be viewed as a seq2seq machine translation task. More formally, given an input sequence x, we design a model that does the following:

ŷ = arg max_{y∈Y} p(y | x, l_s, l_t, f; θ)   (1)

where x is the input sequence, l_s is the source language, l_t is the target language, f is the formality, and ŷ is the formality-controlled translation.

We propose a single model that produces an output given input x and formality setting f. Despite being part of the unconstrained task, our proposed approach does not mine or develop any formality-annotated data for training and just uses a pre-trained checkpoint of mBART.

3.2 Design

We looked at previous works incorporating contrasting styles (Rippeth et al., 2022; Schioppa et al., 2021b) as motivation for our approach. For controlling styles, the aforementioned works use an additive intervention approach, which entails adding a single style intervention vector V to the pre-trained encoder output Z. The same vector V is added to all the tokens of the encoder outputs, thereby changing the encoder outputs uniformly.

We modify the above approach to allow for more flexibility while learning. Instead of a single intervention vector V, we propose a unique vector V_i for every token i in the input space. In short, we repurpose an Embedding layer as a style-intervening layer between the encoder and the decoder. This design resulted from our original question: will allowing more flexibility in the encoder enable it to identify which tokens require stylization, thus making it more interpretable? The hypothesis that originated from this question was: by giving each token its own intervention vector V_i, the model will learn each intervention vector V_i differently based on whether the token at that time step has a contrasting translation that depends on the formality setting. In short, we let the model learn a different V_i for each token. If true, this will provide some interpretability on which tokens the model recognizes as carrying a formality marker and translates differently in the formal and informal settings. This approach is visualized in Figure 2. Since our approach uses an embedding layer for style intervention, we call it 'style embedding intervention'.

We learn the style embedding layer only in the formal setting and use a zero vector in the informal setting. In other words, the style embedding intervention is performed only in the formal setting, and encoder outputs are not perturbed in the informal setting. We do not use separate Embedding layers to learn each formality style, simply because it would be difficult to switch between layers during batched training. In (Schioppa et al., 2021b), the combination of a style vector and a zero vector for contrasting styles was sufficient to learn the style.
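The following minimal PyTorch sketch illustrates the idea; it is a simplified illustration rather than the exact implementation, and the module and argument names (StyleEmbeddingIntervention, formal) are chosen here for clarity only.

import torch
import torch.nn as nn

class StyleEmbeddingIntervention(nn.Module):
    """Per-token additive style intervention between the encoder and decoder."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # One learnable intervention vector V_i per vocabulary entry.
        self.style_emb = nn.Embedding(vocab_size, d_model)
        nn.init.zeros_(self.style_emb.weight)  # start from "no intervention"

    def forward(self, encoder_out: torch.Tensor, src_tokens: torch.Tensor, formal: bool) -> torch.Tensor:
        # encoder_out: (batch, src_len, d_model); src_tokens: (batch, src_len)
        if not formal:
            # Informal setting: the zero vector, i.e. encoder outputs stay unperturbed.
            return encoder_out
        return encoder_out + self.style_emb(src_tokens)

In this sketch the formal flag is set per batch, so no layer switching is required during batched training; at inference time the same flag selects the requested formality.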
4 Experimental Apparatus

4.1 Dataset

The IWSLT formality shared task provided a formality-annotated dataset (Nadejde et al., 2022). This dataset comprises source segments paired with two contrastive reference translations, one for each formality level (informal and formal), for two language pairs (EN-KO, EN-VI) in the supervised setting and two language pairs (EN-PT, EN-RU) in the zero-shot setting. The data statistics can be seen in Table 1. We use a random split of 0.2 to construct the validation dataset during model development.

Figure 2: Approach

4.2 Training Setup

For all our modeling experiments, we use mbart-large-50-one-to-many-mmt, a fine-tuned checkpoint of mBART-large-50 (Liu et al., 2020). This model, introduced by (Tang et al., 2020), is a fine-tuned mBART model which can translate English to 49 languages, including the languages we are interested in: KO, VI, PT, and RU.

For our baseline, we perform zero-shot inference on the mBART model for the four language pairs. The results are shown in Tables 3-6.

Based on the findings of (Nakkiran et al., 2019) and (Galke and Scherp, 2022), we fixed our loss function to cross entropy with logits and our optimizer to AdamW (Loshchilov and Hutter, 2017). We use the default learning rate of 10^-3, a standard weight decay of 10^-2, and set β1, β2, and ϵ to 0.9, 0.998, and 10^-8, respectively.

To effectively train the transformer-based mBART model, we used a learning rate scheduler - a linear schedule with warm-up, as introduced by (Vaswani et al., 2017). This creates a schedule with a learning rate that decreases linearly from the initial learning rate to 0 after a warm-up period. The warm-up period is set to 10% of the total training steps, during which the learning rate increases linearly from 0 to the initial learning rate set in the optimizer. All other hyper-parameters are left at their defaults.

We trained our models using one NVIDIA A100 GPU with 80GB memory. To fit our model on this GPU we used a batch size of 16 and a max sequence length of 128. We trained for 15 epochs with an early stopping callback set at 3.

We implemented all the models in PyTorch (Paszke et al., 2019), leveraging the Hugging Face (Wolf et al., 2019) transformers and evaluate libraries.

4.3 Evaluation

To assess the performance of the models, we use four metrics covering the two main underlying tasks - translation quality and formality control. For evaluating translation quality, we use the following two metrics:

• Bilingual Evaluation Understudy (BLEU) score: BLEU (Papineni et al., 2002) calculates the similarity between a machine translation output and a reference translation using n-gram precision. We use the SacreBLEU 2.0 (Post, 2018) implementation for reporting our scores.

• Cross-lingual Optimized Metric for Evaluation of Translation (COMET) score: COMET (Rei et al., 2020) calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings. We use the COMET wmt22-comet-da (Rei et al., 2022) model for reporting our scores.

For evaluating formality control, we use the following two metrics:

• Matched-Accuracy (M-Acc): A reference-based corpus-level automatic metric that leverages phrase-level formality markers from the references to classify a system-generated translation as either formal or informal. This metric was provided by the IWSLT Formality shared task organizers.

• Reference-free Matched-Accuracy (RF-M-Acc): A reference-free variant of M-Acc that uses a multilingual formality classifier, based on xlm-roberta-base and fine-tuned on human-written formal and informal text, to label a system-generated hypothesis as formal or informal. This metric was provided by the IWSLT Formality shared task organizers.

In addition to this, we evaluate our generic translation quality on FLORES-200 (Goyal et al., 2022) for all language pairs under the supervised and zero-shot settings. We use the devtest set of FLORES-200 and compute the BLEU and COMET scores.
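For concreteness, the BLEU part of this evaluation can be reproduced with a few lines of Python using the sacrebleu package; this is only a sketch, and the variable contents are placeholders.

import sacrebleu

hypotheses = ["..."]      # system translations, one string per test segment
references = [["..."]]    # one list of reference strings, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)

# COMET scores are computed analogously with the wmt22-comet-da checkpoint,
# and M-Acc / RF-M-Acc with the scorers released by the shared task organizers.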
Language pair   Training data points   Testing data points
EN-KO           400                    600
EN-VI           400                    600
EN-PT           0                      600
EN-RU           0                      600

                                Formal                 Informal
                                BLEU    Matched Acc    BLEU    Matched Acc
Rippeth et al., 2022            38.3    98.4           38.3    82.7
Style embedding intervention    38      99.2           37.4    98
5.1 Style embedding layer analysis

In this section, we analyze the style embedding layer and compare the analysis with the original hypothesis: giving each token its own intervention vector V_i, the model will learn each vector differently based on whether the token at that time step has a contrasting translation that is dependent on the formality setting. Due to the unique nature of our training setup - learning a zero vector in the informal setting - for our hypothesis testing we compare the encoder vectors with and without the style embedding intervention. For this purpose, we use the dot product similarity. At each time step, we compute the dot product similarity between the encoder output before style intervention and the output after style intervention. This is equivalent to comparing the encoder outputs in the formal and informal settings.
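A minimal sketch of this per-time-step comparison follows; it is one plausible reading of the dot product similarity used above (length-normalized), and the function and variable names are illustrative only.

import torch.nn.functional as F

def timestep_similarity(enc_plain, enc_intervened):
    """Per-time-step similarity between encoder states before and after
    the style embedding intervention; both tensors are (src_len, d_model)."""
    return F.cosine_similarity(enc_plain, enc_intervened, dim=-1)

# Values close to 1 at position i mean the intervention vector V_i barely
# changed that token's representation.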
Figure 3: Similarity scores for hypothesis analysis.

As seen from the token representation similarity scores, the model does not seem to learn new information in tokens that have a contrasting setting-dependent translation - the tokens' similarity scores are very near 1. Instead, it uses the </s> token's representation to store the style 'signal', by creating a style vector that makes the </s> representation ∼11% different between formality settings.

Another interesting observation is the extremely slight dissimilarity produced at the beginning of the sentence, i.e., the 'en_xx' token. Did the model learn the same style information in the ∼1% of information space in the 'en_xx' token compared to the ∼11% of information space in the '</s>' token?
Models       EN-VI                                EN-KO
             BLEU   COMET   %M-Acc  %C-F          BLEU   COMET   %M-Acc  %C-F
Baseline 1   26.7   0.3629  96      0.95          4.9    0.2110  78      0.99
Baseline 2   26.1   0.829   3       0.006         3.9    0.8445  66.7    0.979
Model 1      44.8   0.8467  99      0.989         22.2   0.8246  74.1    0.9815
Model 2      44.2   0.8702  98.6    0.9782        22.5   0.831   82.9    0.9765
Model 3      44.6   0.874   99      0.9849        23.3   0.836   85.7    0.9832
Model 4      44.3   0.8462  99.2    0.9849        23.2   0.8287  75.3    0.9815

Baseline 1: UMD-baseline
Baseline 2: Zero-Shot mBart
Model 1: single vector intervention with train-dev split of 0.1
Model 2: style embedding intervention
Model 3: bos style intervention - Primary Submission
Model 4: single vector intervention with train-dev split of 0.2

Table 3: Results on the official test split in the formal supervised setting for language pairs EN-VI and EN-KO.

Table 4: Results on the official test split in the formal unsupervised setting for language pairs EN-PT and EN-RU.
To answer this question, we added another modification to our approach: we masked out the intervention vectors for all tokens except the 'en_xx' token. For naming purposes, we call this approach 'bos style intervention'.

6 Official Results

Along with the approach from Rippeth et al., 2022 taken as a baseline and an adapted version of it, we submit the results of our approach and of the 'bos style intervention' approach. We analyse the performance of our models under the supervised setting and the zero-shot setting. We also generate results on the FLORES-200 test split.

6.1 Supervised Setting

We trained our models multi-lingually on EN-VI and EN-KO for the supervised setting. In the formal setting, we obtain a BLEU score of 44.6 for EN-VI and 23.3 for EN-KO on the official test split. In the informal setting, we obtain a BLEU score of 43.5 for EN-VI and 22.8 for EN-KO. Tables 3 and 5 have detailed results of all our models. Our primary model - 'bos style intervention' - outperforms the UMD baseline significantly for both languages, with around a 20 BLEU increase and more than double the COMET score. This supports our hypothesis that the model can learn the formality style in the small ∼1% information space at the beginning of the sentence, in the 'en_xx' token. Moreover, we obtain higher scores on the metrics %M-Acc and %C-F, which measure the degree of formality/informality induced.

Qualitative analysis of the translations, especially for KO, revealed that code-switching was a major issue. For example, some translations have
Models       EN-VI                                EN-KO
             BLEU   COMET   %M-Acc  %C-F          BLEU   COMET   %M-Acc  %C-F
Baseline 1   25.3   0.3452  96      0.9816        4.9    0.1697  97.6    0.995
Baseline 2   31.9   0.8352  97      0.9933        3.2    0.8311  33.3    0.020
Model 1      43.3   0.8238  98.7    0.9949        22.1   0.8115  96.3    0.889
Model 2      43.6   0.8514  98.9    0.9949        23.0   0.8256  98.3    0.9514
Model 3      43.5   0.8504  98.9    1             22.8   0.8257  98.3    0.9581
Model 4      42.5   0.8232  98.3    0.9765        22.6   0.8162  96.4    0.9028

Baseline 1: UMD-baseline
Baseline 2: Zero-Shot mBart
Model 1: single vector intervention with train-dev split of 0.1
Model 2: style embedding intervention
Model 3: bos style intervention - Primary Submission
Model 4: single vector intervention with train-dev split of 0.2

Table 5: Results on the official test split in the informal supervised setting for language pairs EN-VI and EN-KO.

Table 6: Results on the official test split in the informal unsupervised setting for language pairs EN-PT and EN-RU.

Table 7: Results on Flores-200 test split for language pairs EN-VI & EN-KO in supervised setting and for language pairs EN-PT & EN-RU in unsupervised setting.
Weston Feely, Eva Hasler, and Adrià de Gispert. 2019. Controlling Japanese honorifics in English-to-Japanese neural machine translation. In Proceedings of the 6th Workshop on Asian Translation, pages 45-53.

Lukas Galke and Ansgar Scherp. 2022. Bag-of-words vs. graph vs. sequence in text classification: Questioning the necessity of text-graphs and the surprising strength of a wide MLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4038-4051, Dublin, Ireland. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522-538.

Eduard Hovy. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689-719.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.

Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality. arXiv preprint arXiv:2205.04022.

Maria Nadejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 616-632, Seattle, United States. Association for Computational Linguistics.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2019. Deep double descent: Where bigger models and more data hurt. CoRR, abs/1912.02292.

Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814-2819.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578-585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685-2702, Online. Association for Computational Linguistics.

Elijah Rippeth, Sweta Agrawal, and Marine Carpuat. 2022. Controlling translation formality using pre-trained multilingual language models. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 327-340, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021a. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676-6696.

Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021b. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676-6696, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35-40.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. CoRR,
abs/2008.00401.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need.
Aditi Viswanathan, Varden Wang, and Antonina
Kononova. 2020. Controlling formality and style
of machine translation output using automl. In Infor-
mation Management and Big Data: 6th International
Conference, SIMBig 2019, Lima, Peru, August 21–23,
2019, Proceedings 6, pages 306–313. Springer.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,
and Jamie Brew. 2019. Huggingface’s transformers:
State-of-the-art natural language processing. CoRR,
abs/1910.03771.
NAIST Simultaneous Speech Translation System for IWSLT 2023
Ryo Fukuda† Yuta Nishikawa† Yasumasa Kano† Yuka Ko†
Tomoya Yanagita† Kosuke Doi† Mana Makinae†
Sakriani Sakti‡† Katsuhito Sudoh† Satoshi Nakamura†
† Nara Institute of Science and Technology, Japan
‡ Japan Advanced Institute of Science and Technology, Japan
[email protected]
3 System Setup

3.1 Data

We used MuST-C v2.0 (Di Gangi et al., 2019) and CoVoST-2 (Wang et al., 2020) for all language pairs: English-to-German (En-De), English-to-Japanese (En-Ja), and English-to-Chinese (En-Zh). We also used MuST-C v1.0, Europarl-ST (Iranzo-Sánchez et al., 2020), and TED-LIUM (Rousseau et al., 2012) for English-to-German. We included the development and test portions of CoVoST-2 and Europarl-ST in our training data. The overall statistics for these corpora are shown in Table 1. For evaluation, we used the tst-COMMON portion of MuST-C v2.0. All the text data in the corpora were tokenized using the multilingual SentencePiece tokenizer with a vocabulary of 250,000 subwords distributed with the mBART50 pre-trained model.

3.2 Data Filtering

We conducted data filtering on the prefix translation pairs obtained through the Bilingual Prefix Alignment, following our IWSLT 2022 system (Fukuda et al., 2022). We compared three cut-off ratios of the number of samples in the input speech to the number of tokens in the output: 4,800, 4,000, and 3,200. Table 2 shows the percentage of data that was removed following the application of filters. We also applied the same filtering to the development data.

3.3 Simultaneous Speech-to-Text System

We developed an end-to-end speech-to-text model initialized with two pre-trained models for its speech encoder and text decoder. The speech encoder was initialized with HuBERT-Large, which consists of a feature extractor trained on 60K hours of unlabeled speech data, Libri-Light (Kahn et al., 2020), and Transformer encoder layers. The feature extractor has seven convolutional layers with a kernel size of (10, 3, 3, 3, 3, 2, 2), a stride of (5, 2, 2, 2, 2, 2, 2), and 512 channels. The number of Transformer encoder layers is 24. The text decoder was initialized with the decoder of mBART50 (Tang et al., 2020). The decoder consists of twelve Transformer layers, and the embedding layer and linear projection weights are shared, with a size of 250,000. The size of each Transformer and feed-forward layer is 1,024 and 4,096, respectively, the number of attention heads is 16, the activation function is ReLU, and layer normalization is applied before the attention operations. The encoder and decoder are also connected via Inter-connection (2.1) and a length adapter (Tsiamas et al., 2022). The length adapter is a 3-layer convolutional network with 1,024 channels, a stride of 2, and a Gated Linear Unit (GLU) activation function.

Speech input is given as waveforms with a 16-kHz sampling rate, normalized to zero mean and unit variance. During training, each source audio was augmented (Kharitonov et al., 2020) before normalization, with a probability of 0.8. We trained multilingual models on all the data listed in Table 1 with a maximum source length of 400,000 frames and a target length of 1,024 tokens. We applied gradient accumulation and data-parallel computation to achieve a batch size of approximately 32 million tokens. We used Adam with β1 = 0.99, β2 = 0.98, and a base learning rate of 2.5 × 10−4. The learning rate was controlled by a tri-stage scheduler with phases of 0.15, 0.15, and 0.70 for warm-up, hold, and decay, respectively, while the initial and final learning rates had a scale of 0.01 compared to the base. We used sentence averaging and gradient clipping of 20. We applied a dropout probability of 0.1 and used time masking for 10-length spans with a probability of 0.2, and channel masking for 20-length spans with a probability of 0.1, on the encoder feature extractor's output.
The loss was the cross-entropy loss with label smoothing with 20% probability mass.

The offline SimulST model was fine-tuned, and then checkpoint averaging was performed. In the checkpoint averaging, the model checkpoints were saved every 1,000 training steps, and the averaged parameter values among the five best models in terms of the loss on the development data were taken for the final model.
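A simple sketch of such checkpoint averaging follows; it assumes each file stores a plain state_dict (fairseq checkpoints nest the parameters one level deeper), and the function name is ours.

import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints (e.g. the five with the
    best development loss) into a single state_dict."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# model.load_state_dict(average_checkpoints(best_five_checkpoint_paths))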
Subsequently, one epoch of fine-tuning was performed on the prefix alignment pairs from the MuST-C v2 training data only. We reduced the learning rate to 2.5 × 10−5 during this fine-tuning on translation pairs obtained with Bilingual Prefix Alignment.

As a SimulST policy, the local agreement with n = 2 (LA-2) was used. The chunk size was varied from 200 ms to 1000 ms to adjust the quality-latency trade-off. A beam search of beam size five was used to generate hypotheses for input chunks.
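The local agreement policy itself is easy to sketch: after each new chunk, the growing input is re-translated, and only the prefix on which the last two (n = 2) hypotheses agree is committed. The following is illustrative Python rather than the actual decoding loop, and beam_search stands in for the chunk-level decoding call.

def agreed_prefix(prev_hyp, curr_hyp, num_committed):
    """Return the newly committed tokens under local agreement with n = 2."""
    agreed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        agreed.append(a)
    # Emit only the part of the agreed prefix that has not been output yet.
    return agreed[num_committed:]

# committed = []
# for each new input chunk:
#     curr_hyp = beam_search(model, audio_prefix, beam=5)
#     committed += agreed_prefix(prev_hyp, curr_hyp, len(committed))
#     prev_hyp = curr_hyp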
LSTM in Tacotron2 and the attention mechanism to the forward attention with the transit agent (Zhang et al., 2018) for incremental processing. Guided Attention Loss (Tachibana et al., 2018) was used as an additional loss function. The input size of Tacotron2 is 89, and the optimizer was Adam with a learning rate of 1e-3 and the hyperparameters β1 = 0.9, β2 = 0.999, and ϵ = 1e−6. The batch size was 32 in the number of sentences. Experimental conditions for Parallel WaveGAN are the same as in the original paper, except for the parameters related to acoustic features and speech.

The pronunciation estimation used the wait-3 policy. The incremental TTS has a couple of look-ahead parameters, indicating the length to control the quality-latency trade-off. We tune these parameters to keep the quality of the synthesized speech within the latency threshold requirement (2.5 seconds).
Figure 2: BLEU and AL results of the offline model and the models fine-tuned with prefix alignment on En-De. The parentheses indicate the max ratio of prefix pair filtering. Circled dots indicate our submitted SimulS2T system.

Figure 3: BLEU and AL results of the offline model and the models fine-tuned with prefix alignment on En-Ja.
Table 4: BLEU scores for models without and with checkpoint averaging for simple and Inter-connection, evaluated on MuST-C v2 tst-COMMON.
In the multilingual model, the weights required for each language pair are different because the weights of the weighted sum in Inter-connection are shared. In the case of En-Zh, there was a larger difference in the weights than in En-De and En-Ja, and sharing the weights decreases the performance.

4.4 Computation-aware Latency

We also evaluated models with computation-aware Average Lagging (AL_CA). AL_CA is a variant of AL that adds the actual elapsed time elapsed_i to the delay d_i of the i-th target token y_i:

d_i = Σ_{k=1}^{j} (T_k + elapsed_i)   (1)

where T_k is the duration of the k-th input speech segment and j is the position of the input segment already read when generating y_i. The elapsed time elapsed_i is measured as the time from the start of the translation to the output of target token y_i.
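Read literally, equation (1) can be computed per target token as follows; this is only a sketch, the names are ours, and elapsed_ms is the wall-clock time from the start of translation to the emission of the token.

def computation_aware_delay(segment_durations_ms, num_segments_read, elapsed_ms):
    """d_i of equation (1): sum over the j segments read so far of (T_k + elapsed_i)."""
    return sum(segment_durations_ms[k] + elapsed_ms for k in range(num_segments_read))

# AL_CA then plugs these computation-aware delays d_i into the usual
# Average Lagging computation in place of the non-computation-aware delays.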
The evaluation was conducted using an NVIDIA GeForce RTX 2080 Ti. Figure 5 shows the result. Unlike with the non-computation-aware latency metrics, the fixed-size segmentation worked better than the local agreement in the quality-latency trade-off. The local agreement often discards the latter part of the prefix translation due to disagreement with the next prefix translation, while such a trackback does not happen in the fixed segmentation scenario. Therefore, the local agreement needs to predict more tokens every time, which increases the decoding time. This result suggests another trade-off between quality improvement with a sophisticated segmentation strategy and latency reduction with a fixed strategy.

ASR_BLEU   StartOffset   EndOffset   ATD
9.873      2495.01       4134.752    3278.809

Table 5: Results of the submitted SimulS2S system on the MuST-C v2 tst-COMMON.

4.5 Submitted SimulS2S System

Table 5 shows the scores of the SimulS2S system. Compared to the BLEU results of the SimulS2T systems with similar chunk size settings, the SimulS2S system resulted in much worse ASR_BLEU, by nearly five points, due to the quality of the synthesized speech and possible ASR errors. Figure 6 shows the quality-latency trade-offs of SimulS2S, with ASR_BLEU stagnating around 10.5 points. In addition, the output of the submitted SimulS2S system had a character error rate of 28.3% relative to the output of the SimulS2T system with the same chunk size. These results indicate that there is significant room for improvement in both the TTS and ASR.

5 Conclusions

In this paper, we described our SimulST systems for the IWSLT 2023 Simultaneous Speech Translation task. Experimental results demonstrated the effectiveness of Inter-connection and Bilingual Prefix Alignment. The speech-to-speech system is still challenging but showed promising performance with a simple cascade of speech-to-text SimulST and incremental TTS.
(a) BLEU and AL in En-De. (b) BLEU and AL in En-Ja. (c) BLEU and AL in En-Zh.
(d) BLEU and AL_CA in En-De. (e) BLEU and AL_CA in En-Ja. (f) BLEU and AL_CA in En-Zh.

Figure 5: Comparison of the local agreement with n = 2 and fixed-size segmentation policies.
Figure 6: WHISPER_ASR_BLEU and ATD results of the SimulS2S systems on En-Ja. The numbers above the marks indicate chunk size. Circled dots indicate our submitted system.

Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Number JP21H05054.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98-157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Colin Cherry and George Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation. arXiv preprint arXiv:1906.00048.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012-2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Ryo Fukuda, Yuka Ko, Yasumasa Kano, Kosuke Doi, Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Nakamura. 2022. NAIST simultaneous speech-to-text translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 286-292, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229-8233.

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. 2020. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669-7673. https://github.com/facebookresearch/libri-light.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Simultaneous neural machine translation with prefix alignment. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 22-31, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Average token delay: A latency metric for simultaneous translation. In Proc. Interspeech 2023. To appear.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2020. Data augmenting contrastive learning of speech representations in the time domain. arXiv preprint arXiv:2007.00991.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620-3624.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025-3036, Florence, Italy. Association for Computational Linguistics.

Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, and Liang Huang. 2020a. Incremental text-to-speech synthesis with prefix-to-prefix framework. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3886-3896, Online. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020b. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144-150, Online. Association for Computational Linguistics.

Kikuo Maekawa. 2008. Balanced Corpus of Contemporary Written Japanese. In Proceedings of the 6th Workshop on Asian Language Resources.

Yuta Nishikawa and Satoshi Nakamura. 2023. Inter-connection: Effective connection between pre-trained encoder and decoder for speech translation. In Proc. Interspeech 2023. To appear.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12-17, Online. Association for Computational Linguistics.
Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. 2021. Layer-wise analysis of a self-supervised speech representation model. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 914-921.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277-285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Anthony Rousseau, Paul Deléglise, and Y. Estève.
2012. Ted-lium: an automatic speech recognition
dedicated corpus. In International Conference on
Language Resources and Evaluation.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike
Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan,
Rif A. Saurous, Yannis Agiomvrgiannakis, and
Yonghui Wu. 2018. Natural tts synthesis by con-
ditioning wavenet on mel spectrogram predictions.
In 2018 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pages
4779–4783.
Ryosuke Sonobe, Shinnosuke Takamichi, and Hiroshi
Saruwatari. 2017. Jsut corpus: free large-scale
japanese speech corpus for end-to-end speech syn-
thesis. arXiv preprint arXiv:1711.00354.
Hideyuki Tachibana, Katsuya Uenoyama, and Shun-
suke Aihara. 2018. Efficiently trainable text-to-
speech system based on deep convolutional net-
works with guided attention. In 2018 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 4784–4788.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning.
Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano,
José Fonollosa, and Marta R. Costa-jussà. 2022.
Pretrained speech encoders and efficient fine-tuning
methods for speech translation: UPC at IWSLT
2022. In Proceedings of the 19th International Con-
ference on Spoken Language Translation (IWSLT
2022), pages 265–276, Dublin, Ireland (in-person
and online). Association for Computational Linguis-
tics.
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu.
2020. CoVoST: A diverse multilingual speech-to-
text translation corpus. In Proceedings of the 12th
Language Resources and Evaluation Conference, pages 4197-4203, Marseille, France. European Language Resources Association.

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199-6203.

Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. 2018. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4789-4793.
A Appendix
Tables 6, 7, and 8 show the results for all chunk
size settings for the En-De, En-Ja, and En-Zh
models used in the submitted system, respectively.
chunk size BLEU LAAL AL AP DAL ATD
300 24.217 947.509 495.162 0.732 1465.822 814.368
400 26.657 1189.696 829.689 0.753 1738.568 1180.684
500 27.986 1416.459 1071.682 0.774 1992.596 1375.404
600 28.739 1618.746 1318.715 0.791 2232.175 1367.612
700 29.298 1797.061 1515.356 0.811 2432.087 1608.334
800 29.809 1956.321 1714.173 0.826 2617.073 1720.705
820 29.78 2011.518 1772.404 0.827 2672.554 1765.76
840 29.792 2022.322 1790.452 0.832 2680.218 1741.386
860 29.746 2054.923 1825.194 0.834 2726.204 1740.656
900 29.805 2115.625 1895.961 0.841 2783.033 1711.2
950 29.975 2172.927 1964.329 0.846 2856.738 1893.749
1000 30.234 2255.583 2057.579 0.852 2938.408 1884.775
Table 6: Results of the Offline+PA (None) model on the MuST-C v2 tst-COMMON En-De.
Table 7: Results of the Offline+PA (4000) model on the MuST-C v2 tst-COMMON En-Ja.
Table 8: Results of the Offline+PA (None) model on the MuST-C v2 tst-COMMON En-Zh.
Language Model Based Target Token Importance Rescaling for
Simultaneous Neural Machine Translation
ing requirements of latency versus translation quality. In this paper, we use an auxiliary target-side language model to augment the training of the decoder model. Under this notion of target adaptive training, generating rare or difficult tokens is rewarded, which improves the translation quality while reducing latency. The predictions made by a language model in the decoder are combined with the traditional cross entropy loss, which frees up the focus on the source side context. Our experimental results over multiple language pairs show that compared to previous state-of-the-art methods in simultaneous translation, we can use an augmented target side
Figure 2 (panels (a) Vi→En, (b) De→En, (c) En→De): BLEU vs. Average Lagging (AL) for Offline, Wait-k, Efficient Wait-k, MMA+TC (ours), MMA, and Wait-Info.
icy g_i, AL is:

AL = (1/τ) Σ_{i=1}^{τ} ( g_i − (i−1) / (|y|/|x|) )   (11)

where τ = argmax_i (g_i = |x|), and |x| and |y| are the source sentence and target sentence lengths respectively.
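Equation (11) is straightforward to compute from the read counts g_i; the following is a small sketch with illustrative names.

def average_lagging(g, src_len, tgt_len):
    """Average Lagging; g[i] is the number of source tokens read before
    writing target token i+1."""
    gamma = tgt_len / src_len
    # tau: first target position whose read count reaches the full source.
    tau = next(i + 1 for i, gi in enumerate(g) if gi >= src_len)
    return sum(g[i] - i / gamma for i in range(tau)) / tau

# Example: a wait-3 policy on a 6-token source / 6-token target gives
# g = [3, 4, 5, 6, 6, 6] and average_lagging(g, 6, 6) == 3.0.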
Figure 3: Performance of several methods on the En→Vi dataset in the low latency (AL<5) window.

Gaussian Multihead Attention (GMA; Zhang and Feng (2022a)) predicts the aligned source position for a target token and rescales attention with a Gaussian distribution centred at this position.

ITST (Zhang and Feng, 2022b) finds the optimal information transport between source and target.

Adaptive Wait-k (Zheng et al., 2020) dynamically chooses an optimal k in the wait-k policy at every step.

MoE Wait-k (Zhang and Feng, 2021b) uses attention heads as experts trained with different k with the wait-k policy.

MMA+TC (ours) is the proposed MMA model with the target context aware adaptive training objective. We use an auxiliary target-side LM decoder of the same configuration as the MT decoder. Note that the LM is only used during training and discarded at test time. We do not use extra data.

The implementation of our method is based on fairseq (Ott et al., 2019). Following MMA, we use the transformer (Vaswani et al., 2017) with 6 encoder and decoder layers and 4 monotonic attention heads for the IWSLT datasets En↔Vi and De↔En. All baselines are trained with the same configurations and are trained with 16k tokens. Our auxiliary language model follows the decoder settings in the model.

5 Results

Figure 2 shows the comparison of BLEU vs. latency (in terms of Average Lagging) of our method against previous methods on the IWSLT'15 Vi→En and IWSLT'14 En↔De directions. For Vi→En, we observe a significant improvement in the BLEU scores at the same latencies, compared to the baselines. We also reach the offline translation quality in low AL on this dataset. In the En→De and De→En directions too, there is a boost in the translation quality, more noticeably for lower latencies. The plots show that our method boosts translation quality in the earlier latencies and the effect of reweighing is more pronounced in these regions, where the source context is more limited. In higher latency regions, when the source information window increases, the other baselines start to reach our BLEU score in the English-German directions.

In Figure 3, we compare against several state-of-the-art methods on En→Vi. Our method gets better translation quality compared to all others in the low-latency zone, matching the offline score at 3.86 AL. We show the BLEU vs. AL plot in a low latency range to compare performance in the more challenging area of this task, the low latency points.

6 Analysis

6.1 Token-level vs. Sentence-Level Weight

Ablation Study The two hyperparameters in our method are the Sentence-Level Weight and the Token-Level Weight, which determine the sentence- and token-level effect of rescaling with the LM. In Fig. 5
we report the BLEU scores with different hyperparameter settings on Vi-En. (AL values across the table are similar as the experiments are done with the same λ.) We set the values of these hyperparameters to 0.2 in all our experiments.

Token Order (Descending)   Avg. Freq.   MMA (%)   MMA+TC (%)   Ref (%)
[0, 10%)                   1385         85.56     87.63        87.21
[10, 30%)                  56           6.89      6.48         6.34
[30, 50%)                  20           2.19      1.75         1.95
[50, 70%)                  11           1.30      0.70         0.86
[70, 100%]                 6            0.95      0.26         0.31

Table 1: Avg. frequency on the training set and the proportion of tokens of different frequencies in the test set and the translations generated by the baseline and our model.

POS     Ref    MMA (%)   +TC (%)   MSE (↓)
ADJ     1497   82.1      83.5      0.18 | 0.16
ADV     1323   83.5      87.6      0.20 | 0.12
INTJ    74     98.6      94.6      0.01 | 0.04
NOUN    4187   90.5      93.4      0.09 | 0.06
PROPN   1315   99.4      99.4      - | -
VERB    3226   94.0      95.7      0.06 | 0.04

Table 2: Our method generates more content words than the baseline MMA. Columns 2 and 3 show the percentage of the reference content words recovered in MMA and MMA+TC (in blue) respectively. The last column shows the normalized mean squared error (MSE) of the recovered content words wrt the reference. Lower MSE values are better.

Figure 4: F-measure between model outputs and reference tokens for the low-frequency words, bucketed by frequency of the reference token.

Content word occurrences. Zhang et al. (2022a) show that focusing on the right content words in the target is crucial to getting the necessary target information in a simultaneous translation setting. Following Moradi et al. (2019), we inspect the content words generated by our model using spaCy to get POS tags over the translations. As evident from Table 2, our model recovers more content words in the translations wrt the reference.

6.3 Effect on Translation Length

Following the rationale of Lakew et al. (2019) in
(a) MMA+TC (ours)   (b) MMA (baseline)

Figure 6: Attention heatmap comparison on the Vi → En direction. The Read-Write policy is drawn with red and green arrows respectively. The pink column at the start denotes the source tokens read to produce the target token on the left (darker implies more source words read, and white denotes 0 reads between consecutive target tokens).
Figure 7: Sufficiency as a function of target length. All models produce translation with an AL of 4.

tence are:

Src: Chúng tôi còn vt mu đây . Nó còn khá m .
MMA: RRR W RRR WWW RRR WW RRR W RR WWWWW
Ours: RRR W RRR WW RR W R WW RRR W R W R WWWW

In this example, MMA reads more than required for a write in certain places. It shows that at a similar lag, our model gets a higher probability of a WRITE action, compared to MMA, after having read the same number of source words.

Sufficiency of the READ actions. Zhang and Feng (2022c) introduce a metric of sufficiency A^Suf in Read/Write paths. The ground-truth aligned source position of the j-th target word is denoted by a_j, and the number of source words read when writing the j-th target word is denoted by r_j:

A^Suf = (1/|y|) Σ_{j=1}^{|y|} 1[a_j ≤ r_j]   (12)
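Equation (12) amounts to a single pass over the read/write path; the following sketch uses illustrative names.

def sufficiency(aligned_positions, reads):
    """A^Suf: fraction of target words whose aligned source position had
    already been read when the word was written (a_j <= r_j)."""
    hits = sum(1 for a_j, r_j in zip(aligned_positions, reads) if a_j <= r_j)
    return hits / len(reads)

# aligned_positions[j] = a_j from a word aligner such as Eflomal;
# reads[j] = r_j, the number of source words read before writing word j.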
We compare our method against MMA and Wait-Info at AL=4 with the sufficiency metric. Using equation (12) across sentences of varying lengths, we evaluate the read-write paths of each model against reference alignments from Eflomal (Östling and Tiedemann, 2016).^4 In Figure 7, we can see a clearly increasing and higher score on sufficiency as compared to the baselines - Wait-Info and MMA. This signifies that our target-context augmented training helps the model read sufficient source tokens required for producing a translation, while maintaining the same latency as others, showing that the model learns and correctly gauges the information it requires to translate a target token, and makes READ actions accordingly.

^4 We use the Eflomal library to get alignment priors from the IWSLT'15 Vi-En train set, and use them to generate alignments for the test set. https://github.com/robertostling/eflomal
Figure 8: Top: Length difference compared to ref. Bottom: Sentence BLEU bucketed by target length (shown in bars), and the ratio of aligned READ actions for each bucket (IoU scores, Eqn. 13) shown with lines.

Ratio of Aligned READ actions. We compare MMA and our Read-Write policy against the reference source-target alignments by computing the overlap between the hard alignments and the translation path for all output translations:

IoU_{a,r} = Σ_{i=1} intersection(a_i, r_i) / union(a_i, r_i)   (13)

where a_i is the reference alignment matrix for the i-th sentence, made by setting all aligned source positions to 1, and r_i is the upper triangular matrix set to 1 using reads from the policy.^5

^5 We choose this metric to show the extent to which the policy follows the source-target alignments. In an ideal setting, IoU = 1.

The IoU scores for our policy and for MMA are shown in Figure 8 (bottom) with varying sentence lengths. Our policy shows a stronger adherence to the source-target monotonic alignment path.
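A compact numpy sketch of equation (13) over binary alignment and read matrices follows; the names are ours, and dividing the result by the number of sentences gives a per-sentence average.

import numpy as np

def aligned_read_iou(alignment_matrices, read_matrices):
    """Equation (13): summed intersection-over-union between the reference
    alignment matrix a_i and the policy's read matrix r_i of each sentence."""
    total = 0.0
    for a, r in zip(alignment_matrices, read_matrices):
        a, r = a.astype(bool), r.astype(bool)
        total += np.logical_and(a, r).sum() / np.logical_or(a, r).sum()
    return total

# a_i: 1 where a source position is aligned to the target word; r_i: upper
# triangular 0/1 matrix marking source positions read before each write.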
7 Related Work

Simultaneous Translation. Fixed policy methods (Ma et al., 2019; Elbayad et al., 2020) follow the fixed rule of waiting for the first k source tokens before generating a target token, and alternate thereafter. Adaptive Wait-k (Zheng et al., 2020) dynamically chooses the best k at every step. Han et al. (2020) applied meta learning in wait-k. Zhang and Feng (2021b) use each attention head as an expert of the wait-k policy, whereas Zhang and Feng (2021a) introduce a character-level wait-k policy. But fixed policy methods aren't feasible for complex inputs and cannot adapt to them. Full-sentence MT has also been leveraged to augment the policy with future information (Zhang et al., 2020; Alinejad et al., 2021). But using such oracle or gold (Zheng et al., 2019; Arthur et al., 2021) READ/WRITE actions does not optimize the policy with translation quality. Alinejad et al. (2018) propose providing future information on the source side using prediction. Grissom II et al. (2014) predict unseen verbs and use reinforcement learning to learn when to trust these predictions and when to wait for more input. In contrast, we leverage target-side context to strengthen the simultaneous translations.

Zhang and Feng (2022c) train two models in either language direction and make their policies converge. Wilken et al. (2020) propose external ground-truth alignments to train the policy. Papi et al. (2023) use cross attention scores to guide the policy. Infinite-lookback (Arivazhagan et al., 2019) and chunkwise (Chiu* and Raffel*, 2018) attention propose to use a soft monotonic attention over previous encoder states. We use a variant of the policy proposed by Ma et al. (2020) that adapts monotonic attention to the multihead architecture of the Transformer. GMA (Zhang and Feng, 2022a) predicts the aligned source position of the current target token and rescales attention based on it. But these methods treat all words equally during training, whereas our method improves upon MMA via adaptive training.

Some recent work explores capturing and quantifying information from the source tokens and using it to model READ/WRITE actions (Zhang et al., 2022a; Zhang and Feng, 2022b). But these works do not use the target context in their information. Unlike their quantization method, we present a simple scoring by using an auxiliary target-side LM.

Adaptive Training for MT. Target adaptive objectives have been explored by (Lin et al., 2017), which uses the probability of a class to scale but actually only scales down high-frequency classes, and (Jiang et al., 2019), which directly uses normalized frequency counts but has high variance. (Gu et al., 2020) use a chi-square and an exponential distribution function with frequency. However, these use only static word frequency. BMI (Xu et al., 2021) attempts to capture mutual information between each source and target token. CBMI (Zhang et al., 2022b) incorporates target context as well, in
ods are not directly transferable to the streaming for lending us the GPU resources. The research
nature of our task. was partially supported by the Natural Sciences and
Engineering Research Council of Canada grants
8 Conclusion NSERC RGPIN-2018-06437 and RGPAS-2018-
We have presented a simple technique for rescaling 522574 and a Department of National Defence
target-token importance in simultaneous transla- (DND) and NSERC grant DGDND-2018-00025
tion using an information theoretic approach and to the third author.
an adaptive training paradigm. We differentiate the
importance of various target tokens by their depen-
References
dence on the source sentence. To guide our simul-
taneous translation model, we incorporate a target- Ashkan Alinejad, Hassan S. Shavarani, and Anoop
Sarkar. 2021. Translation-based supervision for pol-
side language model that provides an additional sig- icy generation in simultaneous neural machine trans-
nal indicating the importance of each target token lation. In Proceedings of the 2021 Conference on
or sentence under the condition of the previous tar- Empirical Methods in Natural Language Processing,
get context. Our model shows strong performance pages 1734–1744, Online and Punta Cana, Domini-
can Republic. Association for Computational Lin-
on several datasets and outperforms several state-of-
guistics.
the-art techniques in the low latency range (AL<5).
Further analysis shows that our technique is bet- Ashkan Alinejad, Maryam Siahbani, and Anoop Sarkar.
ter able to translate long sentences and those with 2018. Prediction improves simultaneous neural ma-
chine translation. In Proceedings of the 2018 Con-
rare words. We also showed that the translation ference on Empirical Methods in Natural Language
path (read/write action sequence) has a stronger Processing, pages 3022–3027, Brussels, Belgium.
correlation to the source-target alignment. Association for Computational Linguistics.
Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation?

Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2020. Efficient Wait-k Models for Simultaneous Machine Translation. In Proc. Interspeech 2020, pages 1461–1465.

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, Doha, Qatar. Association for Computational Linguistics.

Shuhao Gu, Jinchao Zhang, Fandong Meng, Yang Feng, Wanying Xie, Jie Zhou, and Dong Yu. 2020. Token-level adaptive training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1035–1046, Online. Association for Computational Linguistics.

Hou Jeung Han, Mohd Abbas Zaidi, Sathish Reddy Indurthi, Nikhil Kumar Lakumarapu, Beomseok Lee, and Sangha Kim. 2020. End-to-end simultaneous translation system for IWSLT2020 using modality agnostic meta-learning. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 62–68, Online. Association for Computational Linguistics.

Sathish Reddy Indurthi, Mohd Abbas Zaidi, Beomseok Lee, Nikhil Kumar Lakumarapu, and Sangha Kim. 2022. Infusing future information into monotonic attention through language models.

Shaojie Jiang, Pengjie Ren, Christof Monz, and Maarten de Rijke. 2019. Improving neural response diversity with frequency-aware cross-entropy loss. New York, NY, USA. Association for Computing Machinery.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020. Monotonic multihead attention. In International Conference on Learning Representations.

Pooya Moradi, Nishant Kambhatla, and Anoop Sarkar. 2019. Interrogating the explanatory power of attention in neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 221–230, Hong Kong. Association for Computational Linguistics.

Robert Östling and Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Sara Papi, Marco Turchi, and Matteo Negri. 2023. AlignAtt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, USA. Association for Computational Linguistics.
[...] pages 511–516, Online. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2021a. ICT's system for AutoSimTrans 2021: Robust char-level simultaneous translation. In Proceedings of the Second Workshop on Automatic Simultaneous Translation, pages 1–11, Online. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2021b. Universal simultaneous machine translation with mixture-of-experts wait-k policy. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7306–7317, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022a. Gaussian multi-head attention for simultaneous machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3019–3030, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022b. Information-transport-based policy for simultaneous translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 992–1013, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022c. Modeling dual read/write paths for simultaneous machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2461–2477, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022d. Reducing position bias in simultaneous machine translation with length-aware framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6775–6788, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang, Yang Feng, and Liangyou Li. 2020. Future-guided incremental transformer for simultaneous translation. In AAAI Conference on Artificial Intelligence.

Shaolei Zhang, Shoutao Guo, and Yang Feng. 2022a. Wait-info policy: Balancing source and target at information level for simultaneous machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2249–2263, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Songming Zhang, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jian Liu, and Jie Zhou. 2022b. Conditional bilingual mutual information based adaptive training for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2377–2389, Dublin, Ireland. Association for Computational Linguistics.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020. Simultaneous translation policies: From fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2847–2853, Online. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354, Hong Kong, China. Association for Computational Linguistics.
A Hyperparameters

Hyperparameter (IWSLT'15 En↔Vi / IWSLT'14 De↔En):
encoder layers: 6
encoder attention heads: 4
encoder embed dim: 512
encoder ffn embed dim: 1024
decoder layers: 6
decoder attention heads: 4
decoder embed dim: 512
decoder ffn embed dim: 1024
dropout: 0.3
optimizer: adam
adam-β: (0.9, 0.98)
clip-norm: 0
lr: 5e-4
lr scheduler: inverse sqrt
warmup-updates: 4000
warmup-init-lr: 1e-7
weight decay: 0.0001
label-smoothing: 0.1
max tokens: 16000
B Detailed Results
IWSLT15 En-Vi, Transformer-Small

Full-sentence MT:
  AP 1.00   AL 22.08   DAL 22.08   BLEU 28.91

MMA (λ, AP, AL, DAL, BLEU):
  0.4    0.58   2.68   3.46   27.73
  0.3    0.59   2.98   3.81   27.90
  0.2    0.63   3.57   4.44   28.47
  0.1    0.67   4.63   5.65   28.42
  0.04   0.70   5.44   6.57   28.33
  0.02   0.76   7.09   8.29   28.28

Wait-K (k, AP, AL, DAL, BLEU):
  1   0.63   3.03   3.54    25.21
  3   0.71   4.80   5.42    27.65
  5   0.78   6.46   7.06    28.34
  7   0.83   8.21   8.79    28.60
  9   0.88   9.92   10.51   28.69

Efficient Wait-K (k, AP, AL, DAL, BLEU):
  1   0.63   3.06   3.61    26.23
  3   0.71   4.66   5.20    28.21
  5   0.78   6.38   6.94    28.56
  7   1.96   8.13   8.69    28.62
  9   0.87   9.80   10.34   28.52

Wait-Info (K, AP, AL, DAL, BLEU):
  1   0.67   3.76   4.33   28.37
  2   0.69   4.10   4.71   28.45
  3   0.71   4.60   5.28   28.54
  4   0.74   5.28   5.97   28.59
  5   0.77   6.01   6.71   28.70
  6   0.80   6.80   7.51   28.78
  7   0.82   7.61   8.33   28.80
  8   0.84   8.39   9.11   28.82

MMA+TC (λ, AP, AL, DAL, BLEU):
  0.55   0.66   3.1    5.12    28.6
  0.5    0.67   3.60   5.78    28.81
  0.3    0.68   3.86   6.12    28.9
  0.2    0.71   4.58   7.22    28.74
  0.1    0.74   5.34   8.18    28.65
  0.01   0.89   9.89   14.37   28.67
IWSLT15 Vi-En, Transformer-Small

Full-sentence MT (Offline):
  AP 1.00   AL 27.56   DAL 27.56   BLEU 26.11

MMA (λ, AP, AL, DAL, BLEU):
  0.4    0.63   3.60    6.96    25.36
  0.3    0.64   3.95    7.59    24.75
  0.2    0.67   4.54    9.09    25.33
  0.1    0.75   7.14    11.60   25.84
  0.05   0.77   7.61    15.70   25.31
  0.01   0.88   13.63   23.95   26.11

Wait-K (k, AP, AL, DAL, BLEU):
  1    0.42   -2.89   1.62    7.57
  3    0.53   -0.18   3.24    14.66
  5    0.61   1.49    5.08    17.44
  7    0.67   3.28    7.05    19.02
  9    0.76   6.75    8.96    22.39
  11   0.80   7.91    10.71   23.28
  13   0.84   10.37   12.36   24.80

Wait-Info (K, AP, AL, DAL, BLEU):
  4    0.62   2.58    5.06    22.45
  5    0.67   4.08    6.27    23.75
  6    0.72   5.61    7.72    25.19
  7    0.76   7.01    9.19    25.45
  8    0.79   8.26    10.66   25.86
  9    0.82   9.37    11.98   25.93
  10   0.84   10.56   13.30   26.13

MMA+TC (λ, AP, AL, DAL, BLEU):
  0.4    0.63   3.51    5.902    26.38
  0.3    0.65   4.01    6.558    26.04
  0.2    0.67   4.62    7.527    26.32
  0.1    0.71   5.67    9.212    26.63
  0.05   0.76   7.23    10.579   26.52
  0.04   0.77   7.55    11.76    26.85
  0.01   0.89   13.31   18.627   26.67
IWSLT15 De-En, Transformer-Small

Full-sentence MT (Offline):
  AP 1.00   AL 22.97   DAL 22.97   BLEU 33.64

MMA (λ, AP, AL, DAL, BLEU):
  0.4   0.67   3.91   6.36   30.8
  0.3   0.69   4.27   6.84   31.12
  0.2   0.72   4.97   7.82   31.34
  0.1   0.77   6.08   9.47   31.95

Wait-Info (K, AP, AL, DAL, BLEU):
  1   0.57    1.32   2.53   26.26
  2   0.59    1.97   3.17   27.39
  3   0.64    3.08   4.35   29.01
  4   0.69    4.27   5.61   30.36
  5   0.739   5.30   6.84   30.92
  6   0.77    6.26   8.03   31.45
  7   0.80    7.17   9.09   31.82
  8   0.82    8.06   9.94   32.05

Wait-K (k, AL, BLEU):
  3   1.8   26
  5   4     28.6
  7   6     29.7
  9   8     31.5

Efficient Wait-K (k, AL, BLEU):
  3   2   26.4
  5   4   27
  7   6   30
  9   8   31.7

MMA+TC (λ, AP, AL, DAL, BLEU):
  0.5   0.66   3.68   5.92   30.97
  0.4   0.68   4.06   6.51   31.33
  0.3   0.70   4.49   7.12   31.69
  0.2   0.73   5.06   7.93   32.2
  0.1   0.77   6.10   9.54   32.22
IWSLT15 En-De, Transformer-Small

Full-sentence MT (Offline):
  AP 1.00   AL 22.21   DAL 22.21   BLEU 27.46

MMA (λ, AP, AL, DAL, BLEU):
  0.5    0.69   4.32   6.42    26.03
  0.4    0.71   4.70   6.95    26.20
  0.3    0.72   4.97   7.28    26.30
  0.2    0.74   5.44   7.96    26.19
  0.1    0.79   6.86   9.72    26.77
  0.05   0.84   8.25   11.42   26.91

Wait-Info (K, AP, AL, DAL, BLEU):
  1   0.61   2.62   3.09    21.75
  2   0.63   3.15   3.89    22.42
  3   0.68   4.24   5.30    24.48
  4   0.73   5.36   6.77    25.60
  5   0.77   6.38   8.09    26.18
  6   0.80   7.23   9.18    26.35
  7   0.83   8.23   10.35   26.61
  8   0.86   9.25   11.46   26.74

Wait-K (k, AL, BLEU):
  3   3.41   22.00
  5   5.00   25.21
  7   6.83   26.32
  9   8.72   26.61

Efficient Wait-K (k, AL, BLEU):
  3   3.51   23.01
  5   5.27   24.80
  7   7.03   25.93
  9   8.81   26.11

MMA+TC (λ, AP, AL, DAL, BLEU):
  0.6    0.68   4.04   6.07    26.03
  0.5    0.69   4.19   6.25    26.19
  0.4    0.69   4.38   6.52    26.43
  0.3    0.71   4.87   7.14    26.56
  0.2    0.74   5.51   8.09    26.71
  0.1    0.79   6.74   9.80    26.76
  0.06   0.82   7.75   10.94   27.01
Kyoto Speech-to-Speech Translation System for IWSLT 2023
Zhengdong Yang1 Shuichiro Shimizu1 Zhou Wangjin1 Sheng Li2 Chenhui Chu1
Kyoto University1 National Institute of Information and Communications Technology2
{zd-yang, sshimizu, chu}@nlp.ist.i.kyoto-u.ac.jp
[email protected]
[email protected]
(2020). For the text-to-speech synthesis model, we took a cascade approach of an acoustic model and a vocoder. We used FastSpeech 2 (Ren et al., 2021) as the acoustic model and HiFi-GAN (Kong et al., 2020) as the vocoder.

2 System Description

The speech-to-speech translation system is a combination of speech-to-text translation and text-to-speech synthesis.

2.1 Speech-to-Text Translation

We adopt the end-to-end speech-to-text translation architecture. The speech-to-text translation model is based on the dual-decoder Transformer (Le et al., 2020).
As shown in Figure 1, the model is a Transformer-based model comprising two decoders, one for speech-to-text translation (ST) and the other for automatic speech recognition (ASR). The tasks of ASR and ST can be defined as follows:

• For ASR, the input sequence s = [s1, ..., sTs] is a sequence of speech features. The out- [...]

The training objective is a weighted sum of cross-entropy losses for both tasks:

L_asr-st = α L_asr + (1 − α) L_st    (2)

Different decoders can exchange information with each other with the interactive attention mechanism, which refers to replacing attention sub-layers in the standard Transformer decoder with interactive attention sub-layers (Liu et al., 2020). In our models, the replaced sub-layers are the encoder-decoder attention sub-layers.
As illustrated in the lower part of Figure 1, an interactive attention sub-layer consists of one main attention sub-layer and a cross-attention sub-layer. The main attention sub-layer is the same as the replaced attention sub-layer. The cross-attention sub-layer receives the query Q from the same decoder A and receives the key K and value V from another decoder B. We adopt the parallel variant of dual-decoder Transformers, where K and V are hidden states from the same layer in decoder B.
The final output is obtained by merging the output of the primary attention sub-layer H_main with
the output of the cross-attention sub-layer H_cross. We adopt linear interpolation as the merging function. Therefore the output representations of the interactive attention sub-layers are

H_dual = H_main + λ H_cross    (3)

where λ is a learnable parameter.
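To make this merge concrete, the following PyTorch-style sketch (our own illustration with hypothetical module and variable names, not the authors' code) shows a decoder sub-layer that combines the main attention output with a cross-attention output from the partner decoder using a learnable scalar λ, as in Equation 3:

```python
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Sketch of an interactive attention sub-layer (Eq. 3): H_dual = H_main + lambda * H_cross."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Main attention: the usual encoder-decoder attention of this decoder.
        self.main_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross attention: queries from this decoder, keys/values from the partner decoder.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable interpolation weight lambda.
        self.lmbda = nn.Parameter(torch.zeros(1))

    def forward(self, x, encoder_out, partner_decoder_states):
        h_main, _ = self.main_attn(x, encoder_out, encoder_out)
        h_cross, _ = self.cross_attn(x, partner_decoder_states, partner_decoder_states)
        return h_main + self.lmbda * h_cross

# Toy usage: the ST decoder attends to the ASR decoder states of the same layer (parallel variant).
layer = InteractiveAttention(d_model=256, n_heads=4)
x = torch.randn(2, 7, 256)      # ST decoder hidden states
enc = torch.randn(2, 50, 256)   # speech encoder output
asr = torch.randn(2, 9, 256)    # ASR decoder hidden states
out = layer(x, enc, asr)        # shape (2, 7, 256)
```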
Dataset   Sentence Embedding Model Used for Filtering   Total Length (Hours)
MuST-C    None                                          600.2
GigaST    None                                          9873.2
GigaST    LASER                                         919.1
GigaST    Sentence Transformers                         601.1

Table 1: The size of the datasets and the filtered versions used for training the ST system.
3 Experiments
3.1.2 Training and Decoding

English sentences were normalized and tokenized using the Moses tokenizer (Koehn et al., 2007), and punctuation was stripped. Chinese sentences were tokenized using jieba.3 English and Chinese tokens were further split into subwords using the BPE method (Sennrich et al., 2016) with a joint vocabulary of 16,000 subwords.
We used Kaldi (Ravanelli et al., 2019) to extract 83-dimensional features normalized by the mean and standard deviation computed on the training set. We removed utterances with more than 6,000 frames or more than 400 characters and used speed perturbation (Inaguma et al., 2020) with factors of 0.9, 1.0, and 1.1 for data augmentation.
Our implementation was based on the ESPnet-ST toolkit (Inaguma et al., 2020). We used the same architecture for all the ST models, with a 12-layer encoder and 8-layer decoders. The coefficient α in the loss function (Equation 2) was set to 0.3 in all the experiments. We used the Adam optimizer (Kingma and Ba, 2015) and the Noam learning rate schedule (Vaswani et al., 2017) with 25,000 warm-up steps and a maximum learning rate of 2.5e-3. We used a batch size of 48 per GPU and trained models on a single machine with 4 Tesla V100 GPUs. The models were trained for 25 epochs. We kept checkpoints after each epoch and averaged the five best models on the development set based on prediction accuracy. For decoding, the beam size was set to 5 for ST and 1 for ASR.

3 https://github.com/fxsjy/jieba
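The data filtering that the results in Section 3.1.3 compare boils down to scoring each GigaST pair by the cosine similarity of its source and target sentence embeddings and keeping only the most similar pairs (Figure 2 marks the 90th percentile). A minimal sketch of such a filter is given below; embed_src and embed_tgt are hypothetical stand-ins for whichever multilingual embedding model is used (LASER or Sentence Transformers), and keeping only the pairs above the 90th percentile is our assumption about the cut-off.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two matrices of sentence embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def filter_by_similarity(pairs, embed_src, embed_tgt, percentile=90.0):
    """Keep sentence pairs whose embedding similarity is above the given percentile."""
    src_vecs = np.asarray(embed_src([s for s, _ in pairs]))
    tgt_vecs = np.asarray(embed_tgt([t for _, t in pairs]))
    sims = cosine(src_vecs, tgt_vecs)
    threshold = np.percentile(sims, percentile)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```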
Figure 2: Histograms of cosine similarity between source and target sentence embeddings based on LASER and Sentence Transformers. The red line marks the 90th percentile.

3.1.3 Results

We conducted experiments to investigate the impact of using different datasets for training the system. The results are presented in Table 2. Additionally, we evaluated the performance of the system when using different sentence embedding models for data filtering. Our findings reveal that LASER produces better results compared to Sentence Transformers. Notably, after filtering the data using LASER, the total number of hours of audio is higher compared to that obtained using Sentence Transformers. Given this observation, it might be more appropriate to perform filtering based on the length of the audio rather than the number of utterances.
Our experiments also revealed that training the model with GigaST alone yielded better results compared to using only the MuST-C dataset.
Training Data                      BLEU
MuST-C                             9.71
GigaST (LASER)                     13.96
GigaST (Sentence Transformers)     11.57
MuST-C → GigaST (LASER)            13.52
GigaST (LASER) → MuST-C            13.30

Table 2: Experimental results on training with different datasets. "→" indicates training with the dataset on the left and using the best checkpoint to initiate training with the dataset on the right.

Furthermore, we evaluated an approach in which we trained the model with one dataset and used the best checkpoint to initiate training with the other dataset. However, we observed that this approach did not yield any improvement compared to training the model with GigaST alone.
Based on these findings, for our submission we adopted the translation generated by the ST system trained solely on GigaST filtered based on LASER.

3.2 Text-to-Speech Synthesis

We used pretrained models provided by Zhang et al. (2022a) trained on the AISHELL-3 dataset (Shi et al., 2021). The PaddleSpeech toolkit provides several models trained with the AISHELL-3 dataset, including FastSpeech 2 and HiFi-GAN. We used the best-performing model combination in terms of MOS reported in Zhang et al. (2022a). For other configurations, such as grapheme-to-phoneme conversion, we followed Zhang et al. (2022a).
The generated audio files have one channel, a sample width of 16 bits, and a frame rate of 24,000 Hz. Because the predictions of speech-to-text translation sometimes contained English words that were preprocessed to empty strings by the grapheme-to-phoneme conversion, some audio files (less than 1% of the test set) could not be generated.
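As a rough illustration of this cascade (not the PaddleSpeech API; g2p, acoustic_model, and vocoder below are hypothetical placeholders for the pretrained grapheme-to-phoneme, FastSpeech 2, and HiFi-GAN components), the synthesis step could look like the following sketch:

```python
import numpy as np
import soundfile as sf

def synthesize(text: str, g2p, acoustic_model, vocoder, out_path: str) -> bool:
    """Cascade TTS sketch: text -> phonemes -> mel-spectrogram -> waveform (24 kHz, 16-bit, mono)."""
    phonemes = g2p(text)                   # grapheme-to-phoneme conversion
    if not phonemes:                       # e.g., English words mapped to empty strings
        return False                       # no audio can be generated for this sentence
    mel = acoustic_model(phonemes)         # FastSpeech 2-style acoustic model -> mel frames
    wav = vocoder(mel)                     # HiFi-GAN-style vocoder -> waveform samples
    sf.write(out_path, np.asarray(wav, dtype=np.float32), samplerate=24000, subtype="PCM_16")
    return True
```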
4 Conclusion

In this paper, we described our system, which is a combination of speech-to-text translation and text-to-speech synthesis. For speech-to-text translation, we trained the dual-decoder Transformer model with the GigaST dataset filtered based on the similarity of multilingual sentence embeddings. For the text-to-speech synthesis model, we took a cascade approach of an acoustic model and a vocoder and used a combination of FastSpeech 2 and HiFi-GAN. In the future, we will try to perform multi-level pre-training based on transforming SpeechUT (Zhang et al., 2022b) with phonemes as units. We will also try to use an Encodec-based speech synthesis method similar to VALL-E X (Zhang et al., 2023) to increase the accurate representation of emotions and vocal patterns.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi, and Shinji Watanabe. 2020. ESPnet-ST: All-in-one speech translation toolkit. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, pages 302–311. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. The Association for Computational Linguistics.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3520–3533, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2020. Synchronous speech recognition and speech-to-text translation with interactive decoding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, pages 8417–8424. AAAI Press.

Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. 2019. The PyTorch-Kaldi speech recognition toolkit. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 6465–6469. IEEE.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers. The Association for Computer Linguistics.

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2021. AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. In Proc. Interspeech 2021, pages 2756–2760.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008.

Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. GigaST: A 10,000-hour Pseudo Speech Translation Corpus.

Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, and Liang Huang. 2022a. PaddleSpeech: An easy-to-use all-in-one speech toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, pages 114–123, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, and Furu Wei. 2022b. SpeechUT: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. arXiv preprint arXiv:2210.03730.
Ziqiang Zhang, Long Zhou, Chengyi Wang,
Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen,
Yanqing Liu, Huaming Wang, Jinyu Li, et al.
2023. Speak foreign languages with your own
voice: Cross-lingual neural codec language mod-
eling. arXiv preprint arXiv:2303.03926.
Tagged End-to-End Simultaneous Speech Translation Training
using Simultaneous Interpretation Data
Figure 1 (example):
Source: And (1)I'm (2)not here to (3)say that (4)men are to (5)blame for the (6)crisis and what (7)happened in my (8)country.
SI Target: (4)男性の、(5)せいだけでは(2)ありません、私どもの(8)国の、金融(6)崩壊の、(5)責任は、 (roughly: "It is (2)not only (4)men's (5)fault; the (5)responsibility for the financial (6)collapse of our (8)country, ...")
have been some attempts for the development of SI corpora (Toyama et al., 2004; Shimizu et al., 2013; Doi et al., 2021). However, the amount of such SI corpora is still very limited compared to offline translations. We tackle this problem by using a larger-scale offline translation corpus. This condition can be seen as domain adaptation from resource-rich offline translation to resource-poor simultaneous translation. In a typical domain adaptation scenario, an out-of-domain model is fine-tuned using in-domain data (Luong and Manning, 2015; Sennrich et al., 2016), but it tends to overfit to the small in-domain data (Chu et al., 2017). As another adaptation approach, tag-based NMT works to control the politeness of translations (Sennrich et al., 2016) and to enable zero-shot multilingual NMT (Johnson et al., 2017). This tag-based approach has been extended to multi-domain fine-tuning (Kobus et al., 2017) and mixed fine-tuning (Chu et al., 2017). These studies fine-tune NMT models using mixed data of in-domain and out-of-domain corpora. Tagged Back-Translation (Caswell et al., 2019) is an application of the tag-based approach to the well-known back-translation-based data augmentation. It distinguishes source language sentences from parallel corpora and those obtained from back-translation to handle possible back-translation noise in the training of an NMT model. Our work is motivated by these tag-based methods and tackles the scarcity of SI data.

3 Differences between Offline Translation and Simultaneous Interpretation

There is a large style difference between SI and offline translation. Figure 1 shows an example of an offline translation and an SI transcript in Japanese for a given English source sentence. The solid lines in the figure represent word correspondences. In this figure, we can find:

• Most English content words are translated into Japanese in the offline translation, while some are missing in the SI transcript.

• The SI tries to translate the former half of the input earlier than the latter half with some unnaturalness, while the offline translation keeps naturalness in Japanese with long-distance reordering from the input English.

These points suggest important differences between offline translation and SI; SI focuses on the simultaneity of the interpretation to deliver the contents as early as possible and to maintain the interpreter's working memory. The word order difference between English and Japanese poses a serious difficulty in SI, as mentioned in the literature (Mizuno, 2017). Thus, it is important to use SI data to train a SimulST model to improve its simultaneity.

4 Proposed Method

Although training a SimulST model using SI data is necessary, we suffer from data scarcity in practice. We propose a method to use a relatively large offline translation corpus to mitigate the SI data scarcity for training a SimulMT model. Following the tag-based NMT studies, we put a style tag at the beginning of the target string in training and force the model to predict a specified tag at the first step in inference. In this work, we use two tags: <si> for SI and <off> for offline translation.
Suppose we have an SI transcript: 私は、買った。ペンを、 for an English input: I bought a
           Offline                  SI
        #segm.    #En words    #segm.   #En words
train   328,639   5,714,360    65,008   1,120,245
dev     1,369     23,059       165      2,804
test    2,841     46,144       511      8,104

Table 1: Data sizes of offline data and SI data in the number of aligned segments.

pen. as a training example. We put the SI-style tag at the beginning of the SI transcript as follows:

<si>私は、買った。ペンを、

This string is tokenized into subwords1:

_< si > 私 は 、 買っ た 。 ペ ン を 、

Here, we assume we have a pre-trained sequence-to-sequence model such as mBART (Liu et al., 2020b; Tang et al., 2021) as the basis of the SimulST model, as described in the next section. The aforementioned style tags may not be included in the subword vocabulary of the pre-trained model and are tokenized further like "_< si >", but this works in practice.

1 "_" is the meta-character representing white spaces in an original string by SentencePiece (Kudo and Richardson, 2018), and " " represents a white space in a tokenized string.
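A minimal sketch of this tagging scheme is shown below (our own illustration with hypothetical helper names, not the authors' code): training targets are prefixed with <si> or <off>, and at inference the chosen tag is forced as the first decoded token so the model produces output in the requested style.

```python
SI_TAG, OFF_TAG = "<si>", "<off>"

def tag_target(target: str, style: str) -> str:
    """Prepend the style tag to a target-side training string."""
    return (SI_TAG if style == "si" else OFF_TAG) + target

def build_mixed_examples(offline_pairs, si_pairs):
    """Construct mixed fine-tuning data from offline and SI corpora."""
    examples = [(src, tag_target(tgt, "off")) for src, tgt in offline_pairs]
    examples += [(src, tag_target(tgt, "si")) for src, tgt in si_pairs]
    return examples

def forced_prefix_ids(tokenizer, style: str):
    """Subword ids used to seed the decoder so the first generated token is the desired tag.

    `tokenizer` is a hypothetical subword tokenizer; the tag may split into several subwords."""
    tag = SI_TAG if style == "si" else OFF_TAG
    return tokenizer.encode(tag, add_special_tokens=False)

examples = build_mixed_examples([("I bought a pen.", "私はペンを買った。")],
                                [("I bought a pen.", "私は、買った。ペンを、")])
print(examples[1][1])  # "<si>私は、買った。ペンを、"
```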
5 Experimental Setup

5.1 Dataset

We used MuST-C (Di Gangi et al., 2019) v2 English-Japanese data as our offline speech translation corpus. We also prepared development and test sets from our in-house Japanese SI recordings of TED Talks that are not included in the training sets above. As for the SI data for training, we used NAIST-SIC-Aligned (Zhao et al., 2023). This SI data is constructed by applying heuristic sentence alignment to extract parallel sentence pairs from the latest version of NAIST-SIC2 (Doi et al., 2021). From NAIST-SIC-Aligned, we selected INTRA, AUTO-DEV and AUTO-TEST as train, dev and test data, respectively. For all the SI sets, we aligned the English text segments with the corresponding audio tracks in MuST-C using the English forced-aligner Gentle3. Here, we excluded segments not aligned with the source speech from the aligned dataset. Table 1 shows the sizes of the offline and SI data.

2 https://dsc-nlp.naist.jp/data/NAIST-SIC/2022
3 https://github.com/lowerquality/gentle

5.2 Simultaneous Speech Translation

We used our SimulST implementation based on fairseq (Ott et al., 2019). It followed the system architecture of the best-scored system in the IWSLT 2022 evaluation campaign (Polák et al., 2022), which used an offline ST model in online simultaneous decoding based on Local Agreement (LA) (Liu et al., 2020a)4.

4 We also tried wait-k (Ma et al., 2019), but LA worked better than wait-k in our pilot test.
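Local Agreement itself is easy to state: after each new speech chunk, the offline model re-translates everything heard so far, and only the portion on which the last two hypotheses agree is committed as output. The sketch below is our own paraphrase of the idea in Liu et al. (2020a), with a hypothetical translate() function standing in for the offline ST model:

```python
def common_prefix(a, b):
    """Longest common prefix of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def local_agreement_decode(speech_chunks, translate):
    """Commit only the tokens on which two consecutive chunk-level hypotheses agree."""
    committed, prev_hyp, hyp, audio = [], None, [], []
    for chunk in speech_chunks:
        audio.extend(chunk)
        # Re-translate the full prefix heard so far; translate() is assumed to support
        # constraining the decoder to start from the already committed tokens.
        hyp = translate(audio, forced_prefix=committed)
        if prev_hyp is not None:
            agreed = common_prefix(prev_hyp, hyp)
            if len(agreed) > len(committed):
                committed = agreed
        prev_hyp = hyp
    return committed + hyp[len(committed):]  # flush the rest once the input has finished
```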
5.2.1 Offline ST Model

We built the initial offline ST model by connecting two pre-trained models. Firstly, we used HuBERT Large as the encoder, which consists of a feature extractor trained on the 60k hours of unlabeled speech data of Libri-Light (Kahn et al., 2020) and a transformer encoder. The feature extractor is a 7-layer convolutional network with kernel sizes of (10,3,3,3,3,2,2), strides of (5,2,2,2,2,2,2), and 512 channels, while the transformer encoder consists of 24 layers. Next, we used the decoder portion of mBART50, an encoder-decoder model pre-trained on 50 language pairs, as the decoder. The decoder consists of 12 transformer decoder layers, and the embedding layer and linear projection weights are shared, with a vocabulary size of 250,000. The dimension of each layer of the transformer encoder and decoder is 1024, the dimension of the feed-forward network is 4096, the number of attention heads is 16, the activation function is ReLU, and the normalization method is pre-layer normalization (Baevski and Auli, 2019). These two models are connected by an Inter-connection (Nishikawa and Nakamura, 2023), which weights each transformer layer of the encoder and integrates the output tensors of each layer in a weighted sum, and a length adapter (Tsiamas et al., 2022). The length adapter is a 3-layer convolutional network with 1024 channels, a stride of 2, and the GELU activation function.
The inputs are waveforms with a 16-kHz sampling rate that are normalized to zero mean and unit variance. During training, each source audio is augmented (Kharitonov et al., 2020) with a probability of 0.8. We train the model on MuST-C (Di Gangi et al., 2019), CoVoST-2 (Wang et al., 2020), Europarl-ST (Iranzo-Sánchez et al., 2020), and TED-LIUM (Rousseau et al., 2012). We use gradient accumulation and data parallelism to achieve a batch size of approximately 32 million
tokens. We use Adam with β1 = 0.99, β2 = 0.98, and a base learning rate of 2.5 × 10^-4. The learning rate is controlled by a tri-stage scheduler with phases of 0.15, 0.15, and 0.70 for warm-up, hold, and decay, respectively, while the initial and final learning rates are scaled by 0.01 relative to the base. We use sentence averaging and gradient clipping of 20. We apply a dropout of 0.1 before every non-frozen layer and use time masking for 10-length spans with a probability of 0.2 and channel masking for 20-length spans with a probability of 0.1 on the encoder feature extractor's output. The loss is the cross-entropy loss with label smoothing of 0.2. We call this trained model the base model.
The base model was fine-tuned using the offline training and development sets (Table 1). During fine-tuning, we set the learning rate to 2.5 × 10^-5, saved models every 1,000 updates, and adopted checkpoint averaging over the five best checkpoints according to the loss on the development set. We call this fine-tuned model the base+O model. For the base and base+O models, we use those of the NAIST IWSLT 2023 simultaneous speech-to-speech system for the Simultaneous Speech Translation task (Fukuda et al., 2023). We further fine-tune the base+O model using the SI data in the same manner to derive the base+O+S model. Here, following Tsiamas et al. (2022), to avoid overfitting the small SI data, the parameters of the following components were kept fixed: the feature extractor and feedforward layers of the encoder, and the embedding, self-attention, and feedforward layers of the decoder.

5.2.2 Fine-tuning using Prefix Alignment

For further fine-tuning toward SimulST, we extracted prefix-to-prefix translation pairs from the available training sets using Prefix Alignment (PA) (Kano et al., 2022). PA uses an offline translation model to find prefix-to-prefix translation pairs that can be obtained as intermediate translation results of that model. Finally, we fine-tuned the base+O model using the prefix pairs.

5.2.3 Compared Methods

We compared the following conditions on the final fine-tuning data:

Offline FT: Fine-tuned using the prefix pairs from the offline data (baseline in offline).

SI FT: Fine-tuned using the prefix pairs from the SI data (baseline in SI).

Mixed FT: Fine-tuned using prefix pairs from both the offline and SI data (baseline in mixed).

Mixed FT + Style: Fine-tuned using prefix pairs from both the offline and SI data with the style tags (proposed method).

Mixed FT + Style + Up: The SI portions were upsampled in Mixed FT + Style to balance the data size between the offline and SI data (proposed method).

Here, the prefix pairs from the offline data were obtained using the base+O model, and those from the SI data were obtained using the base+O+S model. The hyperparameter settings for the fine-tuning were the same as those for the base+O model.

(BLEURT)                SI      Offline
Offline FT              0.386   0.518
SI FT                   0.359   0.347
Mixed FT                0.393   0.483
Mixed FT + Style        0.445   0.522
Mixed FT + Style + Up   0.443   0.516

Table 2: BLEURT in full-sentence offline ST on SI and offline test sets.

(BLEU)                  SI      Offline
Offline FT              7.8     16.0
SI FT                   10.9    6.3
Mixed FT                9.4     13.3
Mixed FT + Style        10.3    15.4
Mixed FT + Style + Up   12.2    14.2

Table 3: BLEU in full-sentence offline ST on SI and offline test sets.

5.3 Evaluation Metrics

We evaluated the SimulST systems using SimulEval5 (Ma et al., 2020a). The unit length of speech segments was set to {200, 400, 600, 800, 1,000} milliseconds6. For the SimulST systems, translation quality was evaluated in BLEURT (Sellam et al., 2020) and BLEU (Papineni et al., 2002)7.

5 https://github.com/facebookresearch/SimulEval
6 We also evaluated SI FT on the SI test set with 120 and 160 ms speech segments to investigate its performance in low latency ranges.
7 BLEU was calculated using SacreBLEU (Post, 2018).
[Figure 2: SimulST latency (ATD) vs. quality on the SI test set; panels (a) BLEURT and (b) BLEU for Offline FT, SI FT, Mixed FT, Mixed FT + Style, and Mixed FT + Style + Up. Original caption lost in extraction.]

[Figure 3: SimulST latency (ATD) vs. quality on the offline test set; panels (a) BLEURT and (b) BLEU for the same systems. Original caption lost in extraction.]
The latency in SimulST was evaluated in Average Token Delay (ATD) (Kano et al., 2023) as implemented in SimulEval. Even though Average Lagging (AL) (Ma et al., 2019) is the most popular latency metric, it sometimes resulted in negative values, as noted by Kano et al. (2023). Thus, we present the results using ATD and include the AL results in Appendix A.

6 Results

6.1 Offline Translation Results

Tables 2 and 3 show the offline translation results in BLEURT and BLEU for the SI and offline test sets. These results show that our proposed Mixed FT + Style and Mixed FT + Style + Up surpassed the baselines in BLEURT on the SI test set. On the offline test set (MuST-C tst-COMMON), the performance of the proposed models was almost the same as Offline FT. This suggests that our proposed method leads to outputs semantically closer to SI references than the baseline. On the contrary, the SI FT baseline surpassed Mixed FT + Style in BLEU. The results also show that the upsampling improved BLEU on the SI test set in the offline translation condition.

6.2 Simultaneous Translation Results

Figure 2 shows SimulST results in BLEURT and BLEU for the SI test set. In Figure 2a, the proposed method with the style tags showed clearly better BLEURT results than the baselines. The upsampling did not bring clear differences, in line with the findings on the offline translation results shown in Table 2. In contrast, Figure 2b shows that SI FT worked best in almost all latency ranges, while the proposed method outperformed the other two baselines (Offline and Mixed).
Figure 3 shows SimulST results for the offline test set. They reflect the difference in reference translations between the SI and offline test sets. The Offline FT baseline worked well in BLEURT and outperformed the proposed method in BLEU. The other baselines resulted in worse BLEURT and BLEU scores than the proposed method.
[Figure 4: BERTScore precision, recall, and F1 against ATD on the SI test set for Offline FT, SI FT, Mixed FT, Mixed FT + Style, and Mixed FT + Style + Up. Original caption lost in extraction.]

[Figure 5: Length ratio of translation outputs to references against ATD on the SI test set. Original caption lost in extraction.]

7 Discussions

[...] the best in BERTScore recall, and the recall curves look similar to the BLEURT curves shown in Figure 2a. On the other hand, the SI FT baseline worked the best in BERTScore precision, and the precision curves look very similar to the BLEU curves shown in Figure 2b. We conducted further analyses below to investigate the mixed results in different quality metrics.

7.2 Length Differences

First, we focus on the length differences between translation outputs and references. Figure 5 shows the length ratios of the translation results and their references. The proposed method resulted in longer outputs than the baselines, and the SI FT baseline preferred shorter outputs than the others and the references. From the viewpoint of the precision of the translation results, outputs longer than their references are unfavorable. Figure 6 shows the histogram of length differences between SI FT and Mixed FT + Style. They show different distributions; this suggests that SI FT suffered from under-translation and the proposed method suffered from over-translation.

Figure 6: The length differences between hypotheses and references in SI FT and Mixed FT + Style (speech segment size is 600 ms) on the SI test set.

Table 4 shows translation examples by SI FT and Mixed FT + Style. Here, SI FT generates very short outputs compared with Mixed FT + Style; its BLEU is not always good due to the brevity penalty, but SI FT would have an advantage in BERTScore precision.
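For reference, the brevity penalty of BLEU (Papineni et al., 2002) that penalizes such short hypotheses is BP = 1 if c > r and BP = exp(1 − r/c) if c ≤ r, where c is the candidate length and r the reference length; a hypothesis much shorter than its reference is therefore strongly penalized even when every token it does produce is correct.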
7.3 Non-speech Sound Events and Repetitions

Next, we investigated the over-translation suggested by the analyses above.
Source:
  TEMPT was one of the foremost graffiti artists in the 80s.
  There's no hospital that can say "No."
  Anybody who's paralyzed now has access to actually draw or communicate using only their eyes.
SI FT (Baseline):
  テンプトは、グラフィティアーティストの (TEMPT was, graffiti artists')
  病院は、(a hospital)
  麻痺した人達は、 (paralyzed people)
Mixed FT + Style (Proposed):
  テンプトは、グラフィティアーティストの一人です。(TEMPT is one of graffiti artists.)
  病院では「いいえ」は言えません。(In a hospital, we cannot say "No.")
  麻痺した人なら誰でも、絵を描いたり、会話をすることができます。(Anybody who is paralyzed can draw a picture and have a talk.)
SI reference:
  八十年代の素晴らしいグラフィックアーティストでした。((He) was a great graphic artist in the 80s.)
  病院も、ノーとは言えない。(There's no hospital that can say "No.")
  麻痺してる人達は、これを全員使うことが出来るようになっています。(Everybody who is paralyzed can use this.)
Offline reference:
  80年代を代表するグラフィティ・アーティストでした
  病院もダメと言えません
  全身麻痺の人誰もが目だけで絵を描いたりコミュニケーションできます

Table 4: Example sentences in SI FT and Mixed FT + Style (speech segment size: 600 ms) on the SI test set.
We observed serious repetitions by the proposed method, such as (拍手) (拍手) ..., which means (Applause). Such non-speech sound events (applause and laughter) appear many times in TED Talks, but they are not translated by interpreters and are excluded from the SI data. Based on this observation, we tried to eliminate typical repetitions with the following two heuristics (sketched in code below) and re-ran the evaluation:

• Removing tokens if they are surrounded by "()" or "<>" (if tokens include parts of "(拍手)" such as "拍手)" or "(", they were also excluded).

• Stopping output generation when any 3-gram has appeared at least 3 times in the steps before reaching the end of the sentence.
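A compact sketch of these two heuristics follows (our own rendering; the bracket handling and the exact 3-gram bookkeeping are assumptions about details the paper leaves open):

```python
import re

BRACKETED = re.compile(r"[（(<][^）)>]*[）)>]")            # complete (…) or <…> spans, e.g. (拍手)
STRAY_BRACKET = re.compile(r"[（()）<>]\S*|\S*[（()）<>]")  # leftover fragments like 拍手) or (

def remove_sound_events(text: str) -> str:
    """Heuristic 1: delete bracketed non-speech events and their fragments."""
    text = BRACKETED.sub("", text)
    text = STRAY_BRACKET.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def truncate_on_repeated_trigram(tokens, max_repeats=3):
    """Heuristic 2: stop the output once any 3-gram has occurred max_repeats times."""
    counts, out = {}, []
    for tok in tokens:
        out.append(tok)
        if len(out) >= 3:
            tri = tuple(out[-3:])
            counts[tri] = counts.get(tri, 0) + 1
            if counts[tri] >= max_repeats:
                return out[:len(out) - 3]  # drop the repeated tail
    return out

print(remove_sound_events("(拍手) (拍手) (拍手) 、 これ は 例 です"))             # "、 これ は 例 です"
print(truncate_on_repeated_trigram("a b c a b c a b c d".split()))              # ['a', 'b', 'c', 'a', 'b', 'c']
```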
We applied this repetition removal to the results of Mixed FT + Style and SI FT; they are labeled as Mixed FT + Style + Rmrep and SI FT + Rmrep, respectively. Figure 7 shows BLEU and length-ratio results before and after the repetition removal. BLEU increased consistently for the proposed method, while almost no changes were observed for the SI FT baseline except for one sample at ATD=200. This suggests the existence of many repetitions in the translation results of the proposed method. We also investigated BLEURT and BERTScore, as shown in Figure 8. The repetition removal made almost no changes in BLEURT, probably due to the semantics-oriented evaluation strategy of BLEURT. BERTScore precision and F1 of the proposed method increased in the middle latency ranges, while they decreased almost consistently for the SI FT baseline. These findings suggest an over-translation problem with the proposed method, but it made little impact on semantics-oriented automatic evaluation results.

8 Conclusion

In this paper, we proposed an effective method to train a SimulST model using mixed data of SI- and offline-style translations with style tags that tell the model to generate outputs in either style, motivated by the tag-based approach to domain adaptation. Experimental results on English-to-Japanese SimulST demonstrated the advantage of the proposed method in BLEURT and BERTScore recall despite the inferior performance in BLEU and BERTScore precision due to over-translations and repetitions. Future work includes an extension to other language pairs and further verification via human evaluation.

9 Limitation

The scores reported on the SI test set were lower than those on the offline test set. Reporting results on other SI data would help to confirm the effectiveness of our method. To our knowledge, this is the first work to use SI data as speech translation data, and there is no SI data for language pairs other than English-Japanese in which the source speech and target text are aligned.

Acknowledgement

Part of this work was supported by JSPS KAKENHI Grant Number JP21H05054 and JST SPRING Grant Number JPMJSP2140.
Figure 7: Results with repetition removal (Rmrep) in BLEU and length ratio against ATD on the SI test set; panels (a) BLEU and (b) Length ratio.

[Figure 8: Results with repetition removal (Rmrep) against ATD on the SI test set; panels (a) BLEURT, (b) BERTScore-F1, (c) BERTScore-Precision, and (d) BERTScore-Recall. Original caption lost in extraction.]

References

Alexei Baevski and Michael Auli. 2019. Adaptive Input Representations for Neural Language Modeling. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy. Association for Computational Linguistics.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Kosuke Doi, Katsuhito Sudoh, and Satoshi Nakamura. 2021. Large-scale English-Japanese simultaneous interpretation corpus: Construction and analyses with sentence-aligned data. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 226–235, Bangkok, Thailand (online). Association for Computational Linguistics.

Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21:209–252.

Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Yuka Ko, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Katsuhito Sudoh, Sakriani Sakti, and Satoshi Nakamura. 2023. NAIST Simultaneous Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). To appear.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain. Association for Computational Linguistics.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. 2020. Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. https://github.com/facebookresearch/libri-light.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Simultaneous neural machine translation with prefix alignment. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 22–31, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Average Token Delay: A Latency Metric for Simultaneous Translation. In Proceedings of Interspeech 2023. To appear.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2020. Data Augmenting Contrastive Learning of Speech Representations in the Time Domain. arXiv preprint arXiv:2007.00991.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378, Varna, Bulgaria. INCOMA Ltd.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020a. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Minh-Thang Luong and Christopher Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 76–79, Da Nang, Vietnam.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China. Association for Computational Linguistics.

Akira Mizuno. 2017. Simultaneous interpreting and cognitive constraints. Bull. Coll. Lit, 58:1–28.

Yuta Nishikawa and Satoshi Nakamura. 2023. Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation. In Proceedings of Interspeech 2023. To appear.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 551–556.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 193–199. IEEE.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3787–3796, Online. Association for Computational Linguistics.

Anthony Rousseau, Paul Deléglise, and Y. Estève. 2012. TED-LIUM: an Automatic Speech Recognition dedicated corpus. In International Conference on Language Resources and Evaluation.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics.

Hiroaki Shimizu, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Constructing a speech translation system using simultaneous interpretation data. In Proceedings of IWSLT.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466.

Hitomi Toyama, Shigeki Matsubara, Koichiro Ryu, Nobuo Kawaguchi, and Yasuyoshi Inagaki. 2004. CIAIR Simultaneous Interpretation Corpus. In Proceedings of Oriental COCOSDA.

Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano, José Fonollosa, and Marta R. Costa-jussà. 2022. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 265–276, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020. CoVoST: A diverse multilingual speech-to-text translation corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages
372
4197–4203, Marseille, France. European Language
Resources Association.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
Weinberger, and Yoav Artzi. 2020. BERTScore:
Evaluating Text Generation with BERT. In 8th Inter-
national Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net.
Jinming Zhao, Yuka Ko, Ryo Fukuda, Katsuhito Su-
doh, Satoshi Nakamura, et al. 2023. NAIST-SIC-
Aligned: Automatically-Aligned English-Japanese
Simultaneous Interpretation Corpus. arXiv preprint
arXiv:2304.11766.
373
A Evaluation Results in AL.
Figure 9 shows the main results in BLEURT and BLEU on the SI test set as a function of AL. Figure 10 shows the main results in BLEURT and BLEU on the offline test set as a function of AL. These trends are almost the same as the trends of the main results in Figures 2 and 3.
374
[Figure 9: SimulST latency (AL) – quality results on the SI test set. Panels: (a) BLEURT and (b) BLEU vs. AL, for Offline FT, SI FT, Mixed FT, Mixed FT + Style, and Mixed FT + Style + Up.]
[Figure 10 panels: (a) BLEURT and (b) BLEU vs. AL, for the same five systems.]
Figure 10: SimulST latency (AL) – quality results on offline test set.
375
The HW-TSC’s Simultaneous Speech-to-Text Translation system for
IWSLT 2023 evaluation
Jiaxin GUO, Daimeng Wei, Zhanglin Wu, Zongyao Li, Zhiqiang Rao, Minghan Wang,
Hengchao Shang, Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Lizhi Lei, Hao Yang
[Figure 1 (framework overview): for each incoming chunk sequence (chunk1; chunk1+chunk2; chunk1+chunk2+chunk3; ...), the ASR module produces n-best outputs asr_output_{i,1..3}, the MT module translates them into mt_output_{i,1..3}, and a stable prefix is selected and committed as output_i.]
end-to-end model, both of which can be (hybrid) in nature. While cascaded systems currently offer the highest quality in offline speech translation, end-to-end speech translation provides a better trade-off between quality and latency (Guo et al., 2022; Wang et al., 2022a,b).

End-to-end speech translation systems incorporate various techniques to enable simultaneous translation. For example, (Ma et al., 2019) implements a wait-k model and utilizes meta-learning to address data scarcity, while (Zhang et al., 2022b) employs a wait-info model that incorporates information entropy from both the original text and the translation into the model. Additionally, (Liu et al., 2020) utilizes a unidirectional encoder with monotonic cross-attention to constrain dependence on future context.

In addition, some research has focused on detecting stable hypotheses. For instance, (Liu et al., 2020) proposed the Hold-n strategy, which identifies the best hypothesis in the beam and removes the last n tokens from it. Similarly, (Liu et al., 2020) introduced the LA-n strategy, which identifies the matching prefixes of two consecutive chunks. Additionally, like the LA-n strategy, (Nguyen et al., 2021) developed the SP-n strategy, which identifies the longest common prefix among all items in the beam of a chunk. Our work directly addresses this issue.

3 Methods

Figure 1 illustrates our framework.

3.1 ASR

In our cascade system, we have incorporated the U2 (Wu et al., 2021) as the ASR module. This framework has the flexibility to be implemented on standard Transformer or Conformer architectures and can perform both streaming and non-streaming ASR. One of the major advantages of U2 over other offline autoregressive ASR models is its ability to support streaming through dynamic chunk training and decoding with a CTC decoder on top of the encoder. Additionally, U2 includes a standard autoregressive attention decoder and can be jointly trained with the CTC decoder to improve training stability. The dynamic chunk training method involves applying a causal mask with varying chunk sizes at the self-attention layer within the encoder. This allows the hidden representation to condition on some look-ahead contexts within the chunk, similar to the self-attention of an autoregressive decoder.

U2 offers four different decoding strategies: "ctc_greedy_search", "ctc_beam_search", "attention_decoding", and "attention_rescoring". The CTC decoder, with argmax decoding, guarantees that the tokens decoded in previous chunks are unaltered, leading to a smooth streaming experience. The attention decoder generates output token by token and also has the ability to re-score CTC-generated texts using prefix beam search in the event of multiple candidate proposals.

Building on our findings from last year, we have discovered that U2 offers stability and robustness in predicting audio without real utterances. This improvement is due to the model's training strategy, specifically the use of dynamic chunk training. In our current work, we have further improved the performance of the model by breaking the chunk-based attention approach and employing the "attention_rescoring" decoding strategy.
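The chunk-based causal masking behind dynamic chunk training can be sketched as follows. This is an illustrative example only (not the actual U2 implementation): frames may attend to all previous chunks and to every frame inside their own chunk, but not to future chunks.

```python
# Illustrative sketch of a chunk-based attention mask for dynamic chunk training.
# Not the actual U2 code; chunk_size would be re-sampled per batch during training.
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Returns a (seq_len, seq_len) boolean mask; True = attention allowed."""
    frame_idx = torch.arange(seq_len)
    chunk_idx = frame_idx // chunk_size          # chunk id of every frame
    # query frame i may attend to key frame j iff chunk(j) <= chunk(i)
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

# e.g. chunk_size = torch.randint(1, max_chunk, (1,)).item() per training batch
mask = chunk_attention_mask(seq_len=8, chunk_size=3)
```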
377
3.2 MT

Our cascade system includes the Transformer (Vaswani et al., 2017) as the MT module, which has become a prevalent method for machine translation (Guo et al., 2021) in recent years. The Transformer has achieved impressive results, even with a primitive architecture that requires minimal modification. To improve the offline MT model performance, we utilize multiple training strategies (Wei et al., 2021).

Multilingual Translation (Johnson et al., 2017) has proposed a simple solution for translating multiple languages using a single neural machine translation model with no need to alter the model architecture. The proposed technique involves inserting an artificial token at the start of the input sentence to specify the target language. Furthermore, all languages use the same vocabulary, eliminating the need to add additional parameters. In this study, En-De/ZH/JA data was combined and jointly trained, demonstrating that a multilingual model can significantly enhance translation performance.

Data diversification Data diversification (Nguyen et al., 2020) is an effective strategy to improve the performance of NMT. This technique involves utilizing predictions from multiple forward and backward models and then combining the results with raw data to train the final NMT model. Unlike other methods such as knowledge distillation and dual learning, data diversification does not require additional monolingual data and can be used with any type of NMT model. Additionally, this strategy is more efficient and exhibits a strong correlation with model integration.

Forward translation Forward translation (Wu et al., 2019) refers to using monolingual data in the source language to generate synthetic data through beam search decoding. This synthetic data is then added to the training data in order to increase its size. While forward translation alone may not yield optimal results, when combined with a back translation strategy, it can enhance performance more effectively than back translation alone. In this work, we use only the forward model to create synthetic data and add the data to the original parallel corpus.

We hypothesize that there are domain-like distinctions between ASR-generated results and actual text. To further improve the performance, we use the generation from a well-trained ASR model to replace source-side text in the training corpus data. This fine-tuning approach enables us to achieve further improvements in the MT model.

3.3 Onlinization

Incremental Decoding Translation tasks may require reordering or additional information that is not apparent until the end of the source utterance, depending on the language pair. In offline settings, processing the entire utterance at once produces the highest-quality results. However, this approach also leads to significant latency in online mode. One possible solution to reduce latency is to divide the source utterance into smaller parts and translate each one separately.

To perform incremental inference, we divide the input utterance into chunks of a fixed size and decode each chunk as it arrives. Once a chunk has been selected, its predictions are then committed to and no longer modified to avoid visual distractions from constantly changing hypotheses. The decoding of the next chunk is dependent on the predictions that have been committed to. In practice, decoding for new chunks can proceed from a previously buffered decoder state or begin after forced decoding with the tokens that have been committed to. In either case, the source-target attention can span all available chunks, as opposed to only the current chunk.

Stable Hypothesis Detection Our approach is based on prior research in (Polák et al., 2022), and we have implemented stable hypothesis detection to minimize the potential for errors resulting from incomplete input. Their methods, such as LA-n (Liu et al., 2020) and SP-n (Nguyen et al., 2021), are designed for use in end-to-end systems that search for a shared prefix among the hypotheses generated from different chunk inputs. In contrast, our approach operates within a cascaded system that processes the same chunk input.

We can denote the MT and ASR generating functions as $G$ and $F$ respectively. Let $F_{i,n}^{c}$ represent
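The incremental decoding scheme described in §3.3 can be sketched as follows. This is a simplified illustration only; the `translate` function below is a hypothetical stand-in for the actual prefix-forced MT decoding.

```python
# Sketch of incremental chunk decoding with committed prefixes (illustrative only).
from typing import Callable, List

def incremental_decode(
    chunks: List[str],
    translate: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Decode source chunks one by one; tokens committed for earlier chunks are
    force-decoded and never revised."""
    committed: List[str] = []   # target tokens already shown to the user
    source: List[str] = []      # source chunks received so far
    for chunk in chunks:
        source.append(chunk)
        # `translate` is assumed to force-decode `committed` and then continue;
        # its source-target attention may span all chunks received so far.
        hypothesis = translate(source, committed)
        committed = hypothesis  # commit the newly produced tokens
    return committed
```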
380
Translation (IWSLT 2023). Association for Computational Linguistics.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. Must-c: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2012–2017. Association for Computational Linguistics.

Jiaxin Guo, Yinglu Li, Minghan Wang, Xiaosong Qiao, Yuxia Wang, Hengchao Shang, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The hw-tsc's speech to speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 293–297. Association for Computational Linguistics.

Jiaxin Guo, Minghan Wang, Daimeng Wei, Hengchao Shang, Yuxia Wang, Zongyao Li, Zhengzhe Yu, Zhanglin Wu, Yimeng Chen, Chang Su, Min Zhang, Lizhi Lei, Shimin Tao, and Hao Yang. 2021. Self-distillation mixup training for non-autoregressive neural machine translation. CoRR, abs/2112.11640.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguistics, 5:339–351.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 3620–3624. ISCA.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 3025–3036. Association for Computational Linguistics.

Thai-Son Nguyen, Sebastian Stüker, and Alex Waibel. 2021. Super-human performance in online low-latency recognition of conversational speech. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 1762–1766. ISCA.

Xuan-Phi Nguyen, Shafiq R. Joty, Kui Wu, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. CoRR, abs/1904.01038.

Peter Polák, Ngoc-Quan Pham, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondrej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 277–285. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022a. The hw-tsc's simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 247–254. Association for Computational Linguistics.

Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022b. The hw-tsc's offline speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 239–246. Association for Computational Linguistics.

Daimeng Wei, Zongyao Li, Zhanglin Wu, Zhengzhe Yu, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Minghan Wang, Lizhi Lei, Min Zhang, Hao Yang, and Ying Qin. 2021. Hw-tsc's participation in the WMT 2021 news translation shared task. In Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, pages 225–231. Association for Computational Linguistics.
381
Di Wu, Binbin Zhang, Chao Yang, Zhendong
Peng, Wenjing Xia, Xiaoyu Chen, and Xin Lei.
2021. U2++: unified two-pass bidirectional end-
to-end model for speech recognition. CoRR,
abs/2106.05642.
Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang
Lai, and Tie-Yan Liu. 2019. Exploiting monolin-
gual data at scale for neural machine translation. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Natural
Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, November 3-7, 2019, pages 4205–
4215. Association for Computational Linguistics.
382
The HW-TSC’s Simultaneous Speech-to-Speech Translation system for
IWSLT 2023 evaluation
Hengchao Shang, Zhiqiang Rao, Zongyao Li, Jiaxin GUO, Zhanglin Wu, Minghan Wang,
Daimeng Wei, Shaojun Li, Zhengzhe Yu, Xiaoyu Chen, Lizhi Lei, Hao Yang
[Framework diagram: for each incoming chunk sequence (chunk1; chunk1+chunk2; chunk1+chunk2+chunk3; ...), the ASR module produces n-best outputs asr_output_{i,1..3}, the MT module translates them into mt_output_{i,1..3}, a stable text prefix txt_output_i is selected, synthesized into speech wav_output_i, and committed as output_i.]
3.2 Stable Hypothesis Detection

Our approach is based on prior research in (Polák et al., 2022), and we have implemented stable hypothesis detection to minimize the potential for errors resulting from incomplete input. In previous research, some methods focused on detecting stable hypotheses using strategies such as the Hold-n strategy proposed by (Liu et al., 2020), which identifies the best hypothesis in the beam and removes the last n tokens from it. Similarly, (Liu et al., 2020) introduced the LA-n strategy, which identifies the matching prefixes of two consecutive chunks. In addition, (Nguyen et al., 2021) developed the SP-n strategy, which identifies the longest common prefix among all items in the beam of a chunk.

However, these methods were designed for end-to-end systems that search for a shared prefix among the hypotheses generated from different chunk inputs. Our approach, on the other hand, operates within a cascaded system that processes the same chunk input. As such, we have adapted these strategies to better fit our context, resulting in a more effective approach for stable hypothesis detection. By using our approach, we are able to achieve higher accuracy and stability in our system, thereby improving its overall performance.

We can denote the MT and ASR generating functions as $G$ and $F$ respectively. Let $F_{i,n}^{c}$ represent the $i$-th output generated by the ASR function for a $c$-chunk input with a beam size of $n$. Then the final common prefix for the $c$-chunk input can be expressed as $\mathrm{prefix}_{c}$, which is determined as follows:

$\mathrm{prefix}_{c} = \mathrm{LCP}\big(G(F_{1,n}^{c}), \ldots, G(F_{n,n}^{c})\big)$   (1)

where $\mathrm{LCP}(\cdot)$ is the longest common prefix of the arguments.

3.3 Deblanking

Our team conducted a manual evaluation of the audio output generated by TTS and identified two issues.

Unknown Filtering In the Chinese and Japanese language directions, we initially remove tokens that are not included in the vocabulary, such as infrequent punctuation marks and words. For Chinese in particular, we must convert Arabic numerals into textual numerals.

Context-Aware Pause Detection When analyzing the waveform generated by TTS, we evaluate whether or not the original text indicates a pause. If the text does not indicate a pause, we eliminate the final prolonged silence that produces the waveform. Additionally, to ensure speech coherence, we've reserved at least 160 frames of blank audio.

4 Experiments

4.1 Dataset

To train the ASR module, we utilized four datasets: LibriSpeech V12, MuST-C V2 (Gangi et al., 2019), TEDLIUM V3, and CoVoST V2. LibriSpeech consists of audio book recordings with case-insensitive text lacking punctuation. MuST-C, a multilingual dataset recorded from TED talks, was used solely for the English data in the ASR task. TEDLIUM is a large-scale speech recognition dataset containing TED talk audio recordings along with text transcriptions. CoVoST is also a multilingual speech translation dataset based on Common Voice, with open-domain content. Unlike LibriSpeech, both MuST-C and CoVoST have case-sensitive text and punctuation.

To train the MT model, we collected all available parallel corpora from the official websites and selected data that was similar to the MuST-C domain. We first trained a multilingual MT baseline model on all data from three language directions. Then, we incrementally trained the baseline model based on data from each language direction.

4.2 Model

ASR We extract 80-dimensional Mel-Filter bank features from audio files to create the ASR training corpus. For tokenization of ASR texts, we utilize Sentencepiece with a learned vocabulary of up to
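A minimal sketch of the common-prefix computation in Equation (1) above, assuming the MT outputs for the n ASR beam hypotheses are given as token lists (illustrative only, not the production code):

```python
from typing import List

def longest_common_prefix(hypotheses: List[List[str]]) -> List[str]:
    """LCP(.) over token sequences: the prefix shared by all arguments."""
    prefix: List[str] = []
    for tokens in zip(*hypotheses):
        if all(tok == tokens[0] for tok in tokens):
            prefix.append(tokens[0])
        else:
            break
    return prefix

# prefix_c = LCP(G(F_{1,n}^c), ..., G(F_{n,n}^c)):
# mt_outputs[i] stands for G(F_{i,n}^c), the translation of the i-th ASR hypothesis.
mt_outputs = [["wir", "haben", "heute"],
              ["wir", "haben", "gestern"],
              ["wir", "haben", "heute", "viel"]]
stable_prefix = longest_common_prefix(mt_outputs)   # ["wir", "haben"]
```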
385
Model            Language Pair   BLEU/Whisper_ASR_BLEU   StartOffset   EndOffset   ATD
Our S2T System   EN-DE           33.54                   –             –           –
Our S2T System   EN-JA           17.89                   –             –           –
Our S2T System   EN-ZH           27.23                   –             –           –
Our System       EN-DE           10.45                   1.04          2.73        1.97
Our System       EN-JA           14.53                   1.59          2.96        2.76
Our System       EN-ZH           20.19                   1.77          2.98        2.93
387
Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang
Lai, and Tie-Yan Liu. 2019. Exploiting monolin-
gual data at scale for neural machine translation. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Natural
Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, November 3-7, 2019, pages 4205–
4215. Association for Computational Linguistics.
Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song,
Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fup-
ing Pan, and Jianwei Niu. 2022. Wenet 2.0: More
productive end-to-end speech recognition toolkit. In
Interspeech 2022, 23rd Annual Conference of the
International Speech Communication Association,
Incheon, Korea, 18-22 September 2022, pages 1661–
1665. ISCA.
388
Towards Efficient Simultaneous Speech Translation:
CUNI-KIT System for Simultaneous Track at IWSLT 2023
Peter Polák1 and Danni Liu2 and Ngoc-Quan Pham2
390
Another way to utilize the CTC is joint decoding (Watanabe et al., 2017; Deng et al., 2022). In the joint decoding setup, the model has two decoders: the non-autoregressive CTC (usually a single linear layer after the encoder) and the attentional autoregressive decoder. The joint decoding is typically guided by the attentional decoder, while the CTC output is used for re-scoring. Since the CTC predicts hard alignment, the rescoring is not straightforward. To this end, Watanabe et al. (2017) proposed to use the CTC prefix probability (Graves, 2008) defined as a cumulative probability of all label sequences that have the current hypothesis $h$ as their prefix:

$p_{ctc}(h, \ldots) = \sum_{\nu \in \mathcal{V}^{+}} p_{ctc}(h \oplus \nu \mid X),$   (1)

where $\mathcal{V}$ is the output vocabulary (including the <eos> symbol), $\oplus$ is string concatenation, and $X$ is the input speech. To calculate this probability effectively, Watanabe et al. (2017) introduce variables $\gamma_{t}^{(b)}(h)$ and $\gamma_{t}^{(n)}(h)$ that represent forward probabilities of $h$ at time $t$, where the superscript denotes whether the CTC paths end with a blank or non-blank CTC symbol. If the hypothesis $h$ is a complete hypothesis (i.e., ends with the <eos> token), then the CTC probability of $h = g \oplus$ <eos> is:

$p_{ctc}(h \mid X) = \gamma_{T}^{(b)}(g) + \gamma_{T}^{(n)}(g),$   (2)

where $T$ is the final time stamp. If $h = g \oplus c$ is not final, i.e., $c \neq$ <eos>, then the probability is:

$p_{ctc}(h \mid X) = \sum_{t=1}^{T} \Phi_{t}(g) \cdot p(z_{t} = c \mid X),$   (3)

where

$\Phi_{t}(g) = \gamma_{t-1}^{(b)}(g) + \begin{cases} 0 & \text{if } \mathrm{last}(g) = c \\ \gamma_{t-1}^{(n)}(g) & \text{otherwise.} \end{cases}$

2.3 CTC Online Policy

Based on the definition of $p_{ctc}(h \mid X)$ in Equations (2) and (3), we can define the odds of $g$ being at the end of context $T$:

$\mathrm{Odds}_{end}(g) = \dfrac{p_{ctc}(g \oplus \text{<eos>} \mid X)}{\sum_{c \in \mathcal{V} \setminus \{\text{<eos>}\}} p_{ctc}(g \oplus c \mid X)}.$   (4)

The disadvantage of this definition is that $p_{ctc}(\ldots \mid X)$ must be computed for every vocabulary entry separately and one evaluation costs $O(T)$, i.e., $O(|\mathcal{V}| \cdot T)$ in total. Contemporary ST systems use vocabularies in orders of thousands of items, making this definition prohibitively expensive. Since the CTC is used together with the label-synchronous decoder, we can approximate the denominator with a single vocabulary entry $c_{att}$ predicted by the attentional decoder $p_{att}$:

$\mathrm{Odds}_{end}(g) \approx \dfrac{p_{ctc}(g \oplus \text{<eos>} \mid X)}{p_{ctc}(g \oplus c_{att} \mid X)},$   (5)

where $c_{att} = \operatorname{argmax}_{c \in \mathcal{V} \setminus \{\text{<eos>}\}} p_{att}(g \oplus c \mid X)$. Now the evaluation of $\mathrm{Odds}_{end}(g)$ is $O(T)$. If we consider that the baseline model already uses CTC rescoring, then evaluating $\mathrm{Odds}_{end}(g)$ amounts to a constant number of extra operations to evaluate $p_{ctc}(g \oplus \text{<eos>} \mid X)$.

Finally, to control the latency of the online decoding, we compare the logarithm of $\mathrm{Odds}_{end}(g)$ with a tunable constant $C_{end}$. If $\log \mathrm{Odds}_{end}(g) > C_{end}$, we stop the beam search and discard the last token from $g$. We found values of $C_{end}$ between -2 and 2 to work well across all models and language pairs.

3 Experiments and Results

3.1 Models

Our offline multilingual ST models are based on attentional encoder-decoder architecture. Specifically, the encoder is based on WavLM (Chen et al., 2022), and the decoder is based on multilingual BART (Lewis et al., 2019) or mBART for short. The model is implemented in the NMTGMinor library.² For details on the offline model see KIT submission to IWSLT 2023 Multilingual track (Liu et al., 2023).

² https://github.com/quanpn90/NMTGMinor

The small simultaneous speech translation models for English-to-German and English-to-Chinese language pairs follow the blockwise streaming Transformer architecture (Tsunoo et al., 2021) implemented in ESPnet-ST-v2 (Yan et al., 2023). Specifically, the encoder is a blockwise Conformer (Gulati et al., 2020) with a block size of 40 and look-ahead of 16, with 18 layers, and a hidden dimension of 256. The decoder is a 6-layer Transformer decoder (Vaswani et al., 2017). To improve the training speed, we initialize the encoder with
391
weights pretrained on the ASR task. Further, we employ ST CTC (Deng et al., 2022; Yan et al., 2022) after the encoder with weight 0.3 during the training. During the decoding, we use 0.3 for English to German, and 0.4 for English to Chinese. We preprocess the audio with 80-dimensional filter banks. As output vocabulary, we use unigram models (Kudo, 2018) of size 4000 for English to German, and 8000 for English to Chinese.

3.2 Evaluation

In all our experiments with the offline models, we use beam search of size 8 except for the CTC policy experiments where we use greedy search. For experiments with the blockwise models, we use the beam search of 6. For experiments with the improved blockwise beam search, we follow Polák et al. (2023) and remove the repetition detection in the underlying offline models, while we keep the repetition detection on for all experiments with the blockwise models.

For evaluation, we use the SimulEval (Ma et al., 2020) toolkit and the tst-COMMON test set of MuST-C (Cattoni et al., 2021). To estimate translation quality, we report detokenized case-sensitive BLEU (Post, 2018), and for latency, we report average lagging (Ma et al., 2019). To realistically assess the inference speed, we run all our experiments on a computer with an Intel i7-10700 CPU and an NVIDIA GeForce GTX 1080 with 8 GB of graphics memory.

3.3 Incremental Blockwise Beam Search with Controllable Quality-Latency Tradeoff

In Table 1, we compare the performance of the onlinized version of the baseline blockwise beam search (BWBS) with the improved blockwise beam search (IBWBS; Polák et al., 2023). As we can see in the table, the improved beam search achieves higher or equal BLEU scores than the baseline beam search across all language pairs. We can observe the highest improvement in English-to-German (1.1 BLEU), while we see an advantage of 0.1 BLEU for English-to-Japanese, and no improvement in English-to-Chinese.

In Table 1, we also report the real-time factor (RTF), and the computation-aware average lagging (AL_CA). Interestingly, we observe a higher computational footprint of the IBWBS compared to the baseline beam search by 13, 28, and 17 % on En→{De, Ja, Zh}, resp., when measured with RTF. This might be due to the fact that we recompute the decoder states after each source increment. Since the IBWBS sometimes waits for more source chunks to output more tokens, the unnecessary decoder state recomputations might increase the computational complexity.

Lang    Decoding   AL↓    AL_CA↓   RTF↓   BLEU↑
En-De   BWBS       1922   3121     0.46   30.6
En-De   IBWBS      1977   3277     0.52   31.7
En-Ja   BWBS       1992   3076     0.50   15.5
En-Ja   IBWBS      1935   3264     0.64   15.6
En-Zh   BWBS       1948   2855     0.41   26.5
En-Zh   IBWBS      1945   3031     0.48   26.5

Table 1: Incremental SST with the original BWBS and IBWBS. Better scores in bold.

3.4 CTC Online Policy

In Figure 1, we compare the improved blockwise beam search (IBWBS) with the proposed CTC policy using the blockwise streaming models. The tradeoff curves for English-to-German (see Figure 1a) and English-to-Chinese (see Figure 1b) show that the proposed CTC policy improves the quality (up to 1.1 BLEU for En→De, and 0.8 BLEU for En→Zh), while it is able to achieve the same latencies.

3.5 CTC Online Policy for Large Offline Models

We were also interested in whether the CTC policy can be applied to large offline models. Unfortunately, due to limited resources, we were not able to train a large offline model with the CTC output. Hence, we decided to utilize the CTC outputs of the online blockwise models and used them to guide the large offline model. Since the models have very different vocabularies,³ we decided to execute the CTC policy after a whole word is generated by the offline model (rather than after every sub-word token). For the very same reason, we do not use CTC for rescoring.

³ The blockwise models have a vocabulary size of 4000 for En→De and 8000 for En→Zh, and the offline model has 250k.

We report the results in Table 2. Unlike in the blockwise models (see Section 3.4), the CTC policy does not improve the quality in En→De, and has a slightly worse quality (by 0.7 BLEU) in En→Zh. This is most probably due to the delayed CTC-attention synchronization that is not present for the blockwise models (as both decoders there share the
392
[Figure 1 panels: (a) English to German, (b) English to Chinese; BLEU↑ vs. AL↓ (ms) for the CTC policy and IBWBS.]
Figure 1: Comparison of the improved blockwise beam search (IBWBS) and the proposed CTC policy using
blockwise streaming models.
same vocabulary and the models compute the CTC policy after each token rather than word). However, we still observe a significant reduction in computational latency, namely by 45 and 34 % relative RTF for En→De and En→Zh, respectively.

Lang    Decoding   AL↓    AL_CA↓   RTF↓   BLEU↑
En-De   BWBS       1922   3121     0.46   30.6
En-De   IBWBS      1977   3277     0.52   31.7
En-De   CTC        1946   2518     0.21   30.6
En-Zh   BWBS       1948   2855     0.41   26.5
En-Zh   IBWBS      1945   3031     0.48   26.5
En-Zh   CTC        1981   2515     0.28   25.8

Table 2: Comparison of onlinization of the large offline model using chunking with the local agreement policy (LA-2) and with the proposed CTC policy.

4 Submission

In this section, we summarize our submission to the Simultaneous track at IWSLT 2023. In total, we submit 10 systems for all three language pairs.

4.1 Onlinized Offline Models

Following our last year's submission, we onlinize two large offline models (our models for IWSLT 2022 Offline ST track and IWSLT 2023 Multilingual track). This year, however, we utilize the improved blockwise beam search to yield higher BLEU scores. We submit systems for all language pairs based on the last year's model, and our new model. We summarize the submitted models and their performance in Table 3. As we can observe in Table 3, the 2023 model appears to perform worse. However, we learned during the writing of this paper that there was some overlap between the training and test data for the 2022 model⁴, making the BLEU scores for the 2022 model unreliable.

⁴ (Zhang and Ao, 2022) found an overlap between ST-TED training corpus and tst-COMMON set of MuST-C dataset.

Lang    Model   AL↓    AL_CA↓   BLEU↑
En-De   2022    1991   3138     31.8
En-De   2023    1955   3072     31.4
En-Ja   2022    1906   3000     15.5
En-Ja   2023    1982   3489     15.3
En-Zh   2022    1984   3289     26.8
En-Zh   2023    1987   3508     26.6

Table 3: Submitted onlinized large offline models.

We also submit the system based on the large model onlinized using the CTC policy. The systems are summarized in Table 4. Unfortunately, we were not aware of the training and test data overlap during the evaluation period, so we decided to use our 2022 model also this year.

Lang    Model   AL↓    AL_CA↓   BLEU↑
En-De   2022    1959   2721     31.4
En-Zh   2022    1990   2466     26.3

Table 4: Submitted large offline models onlinized using the proposed CTC policy.

4.2 Blockwise Online Models

Finally, we submit small blockwise models. Their advantage is that they are able to run on a CPU faster than real time (more than 5× faster). We report their performance in Table 5.

Lang    AL↓    AL_CA↓   RTF↓   BLEU↑
En-De   1986   2425     0.19   25.4
En-Zh   1999   2386     0.19   23.8

Table 5: Submitted small blockwise models using the proposed CTC online policy.
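As a compact illustration of the CTC online policy used by these systems (the stopping rule of Equations (4)–(5) in Section 2.3), the following is a sketch assuming per-candidate log-probabilities from the CTC prefix scorer and the attentional decoder are already available; it is not the authors' implementation.

```python
# Sketch of the CTC online stopping rule (Eqs. 4-5): approximate Odds_end(g)
# with the single best attentional candidate c_att and stop once the log-odds
# exceed the tunable threshold C_end. Illustrative only.
from typing import Dict

def should_stop(
    logp_ctc_eos: float,                # log p_ctc(g ⊕ <eos> | X)
    logp_att_next: Dict[str, float],    # log p_att(g ⊕ c | X) for candidate tokens c
    logp_ctc_next: Dict[str, float],    # log p_ctc(g ⊕ c | X) for the same candidates
    c_end: float = 0.0,
) -> bool:
    # c_att: most likely next token according to the attentional decoder, excluding <eos>
    c_att = max((c for c in logp_att_next if c != "<eos>"), key=logp_att_next.get)
    log_odds_end = logp_ctc_eos - logp_ctc_next[c_att]   # Eq. (5) in log space
    return log_odds_end > c_end

# Example: stop and discard the last token of g if the odds favour end-of-context.
stop = should_stop(-0.2, {"Haus": -0.7, "<eos>": -1.2}, {"Haus": -0.9})
```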
393
5 Conclusion and Future Work

In this paper, we present the CUNI-KIT submission to the Simultaneous track at IWSLT 2023. We experimented with the latest decoding methods and proposed a novel CTC online policy. We experimentally showed that the proposed CTC online policy significantly improves the translation quality of the blockwise streaming models. Additionally, the proposed CTC policy significantly lowers the computational footprint of the onlinized large offline models. Unaware of a data overlap issue in 2022, we eventually chose to use our last year's models in the official evaluation also this year.

Acknowledgments

This work has received support from the project "Grant Schemes at CU" (reg. no. CZ.02.2.69/0.0/0.0/19_073/0016935), the grant 19-26934X (NEUREM3) of the Czech Science Foundation, and by Charles University, project GA UK No 244523.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.

Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, and Hung-yi Lee. 2021. Investigating the reordering capability in CTC-based non-autoregressive end-to-end speech translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1068–1077, Online. Association for Computational Linguistics.

Keqi Deng, Shinji Watanabe, Jiatong Shi, and Siddhant Arora. 2022. Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation. In Proc. Interspeech 2022, pages 1746–1750.

Linhao Dong, Cheng Yi, Jianzong Wang, Shiyu Zhou, Shuang Xu, Xueli Jia, and Bo Xu. 2020. A comparison of label-synchronous and frame-synchronous end-to-end models for speech recognition. arXiv preprint arXiv:2005.10113.

Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.

Awni Hannun. 2020. The label bias problem.
394
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021, Brussels, Belgium. Association for Computational Linguistics.

Danni Liu, Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2023. KIT submission to multilingual track at IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.

Mathias Müller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation. arXiv preprint arXiv:1911.03109.

Thai-Son Nguyen, Sebastian Stüker, and Alex Waibel. 2020. Super-human performance in online low-latency recognition of conversational speech. arXiv preprint arXiv:2010.03449.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Peter Polák, Brian Yan, Shinji Watanabe, Alexander Waibel, and Ondrej Bojar. 2023. Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. In Proc. Interspeech 2023.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.

Emiru Tsunoo, Yosuke Kashiwagi, and Shinji Watanabe. 2021. Streaming transformer asr with blockwise synchronous beam search. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 22–29. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi. 2017. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306.

Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, and Shinji Watanabe. 2022. Ctc alignments improve autoregressive translation. arXiv preprint arXiv:2210.05200.

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, et al. 2023. Espnet-st-v2: Multipurpose spoken language translation toolkit. arXiv preprint arXiv:2304.04596.
395
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech
translation system for IWSLT 2022 offline shared
task. In Proceedings of the 19th International Con-
ference on Spoken Language Translation (IWSLT
2022), pages 158–168, Dublin, Ireland (in-person
and online). Association for Computational Linguis-
tics.
396
Speech Translation with Foundation Models and Optimal Transport:
UPC at IWSLT23
400
the original ones, based on text similarity measures, using TF-IDF features from the translations. More concretely, for each talk id, we compute the similarity matrix of its original translations and the new candidates from SegAugment, find the most similar original example for each new candidate, and add it to the filtered data only if its similarity score is below 0.8. We apply this approach also between the different SegAugment versions (m, l, xl).

4 Experiments

Here we describe the experiments we carried out in this work. The implementation details are available in §A.1.

IWSLT '22 System For the IWSLT 2022 offline task, our submission employed a HuBERT encoder (Hsu et al., 2021a) and an mBART50 (En-Xx) decoder, which were efficiently fine-tuned to ST with the LNA strategy (Li et al., 2021) and parallel adapters (He et al., 2022), using datasets such as MuST-C v2, Europarl-ST and CoVoST. The architecture included three 1D convolutional layers between the encoder and decoder, resulting in a subsampling of the encoder representation by a factor of 8. The final ensemble also comprised models utilizing Knowledge Distillation and a wav2vec 2.0 encoder (Tsiamas et al., 2022b).

Baseline Our baseline has four main differences compared to our last year's best system. We did an initial exploratory analysis of various encoders (§A.3), including different versions of wav2vec 2.0, and HuBERT. Upon observing no significant differences, we opted to utilize wav2vec 2.0 fine-tuned with pseudo-labels (Xu et al., 2021b), a more prevalent choice within the research community. Despite the strong performance demonstrated by efficient fine-tuning with LNA and parallel adapters, we chose to switch to standard ST fine-tuning in order to optimize performance. Moreover, we employ a semantic encoder initialized from the MT model. Lastly, we also pre-train the foundation models, wav2vec 2.0 with CTC on the ASR data of MuST-C, and mBART50 on the parallel text of MuST-C. It is important to note that only MuST-C data was utilized for the baseline.

Siamese Pre-training Instead of pre-training the speech encoder with CTC only, we follow the Siamese pre-training method (§2.2), with the encoder architecture described in §2.1, to align the encoder representations with the MT model's representation space. The system, instead of using three layers of 1D convolutions, now incorporates also CTC-based compression, a large adapter, and finally a single layer of 1D convolutions. Following the Siamese pre-training on MuST-C's ASR data, we jointly fine-tune the model and the MT decoder on the MuST-C ST data. Similar to the baseline, the MT model is also fine-tuned on the parallel text of MuST-C beforehand.

More Data We extend the previously described process by incorporating additional data. Initially, we fine-tune mBART50 using all the MT data (Table 6). Subsequently, we perform the Siamese pre-training and ST fine-tuning employing all the available speech data (Table 1). By incorporating a larger dataset, we aim to enhance the system's generalization capabilities and overall performance.

Data Augmentation We employ two data augmentation techniques to increase the performance of our system during ST fine-tuning (§3.2), while no modifications are made to the Siamese pre-training. First, we investigate the use of SegAugment (Tsiamas et al., 2022a), which we apply to MuST-C v3. Secondly, we generate synthetic data from Common Voice (Ardila et al., 2020), by leveraging the fine-tuned mBART50 (§A.2).

KD We use knowledge distillation with the fine-tuned mBART50 as the teacher (§A.2). The loss for training the ST model is the average of the standard cross entropy and the Kullback-Leibler (KL) divergence between the MT and ST output probability distributions. We utilize all available ST data in this experiment, including both real and synthetic data.

5 Audio Segmentation

To segment the audio of the IWSLT test sets, we use SHAS (Tsiamas et al., 2022c). The tst2023 test set, unlike previous years, contains another two domains apart from TED talks, which are ACL presentations and Press conferences. We tune the parameters of SHAS separately for each domain, but since no development set is available for the press conferences, we decided to treat it as the ACL domain. For fine-tuning the segmentation parameters, we used the ST model that was trained with synthetic data from CommonVoice and SegAugment and initialized from Siamese pre-training (Table 2, 2d). We evaluate the performance of the
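The KD objective described above (cross entropy averaged with the KL divergence between the MT teacher and ST student distributions) can be sketched in PyTorch as follows, under the assumption that teacher and student share the same target vocabulary; this is an illustrative sketch, not the exact training recipe.

```python
# Sketch of the described KD loss: 0.5 * (cross entropy + KL(teacher || student)).
import torch
import torch.nn.functional as F

def st_kd_loss(st_logits: torch.Tensor,   # (batch, tgt_len, vocab) student/ST logits
               mt_logits: torch.Tensor,   # (batch, tgt_len, vocab) teacher/MT logits
               targets: torch.Tensor,     # (batch, tgt_len) reference token ids
               pad_id: int = 1) -> torch.Tensor:
    # Standard label cross entropy, ignoring padding positions.
    ce = F.cross_entropy(st_logits.transpose(1, 2), targets, ignore_index=pad_id)
    # KL divergence between the (frozen) teacher and the student output distributions.
    kl = F.kl_div(
        F.log_softmax(st_logits, dim=-1),           # student log-distribution
        F.log_softmax(mt_logits.detach(), dim=-1),  # teacher log-distribution
        log_target=True,
        reduction="batchmean",
    )
    return 0.5 * (ce + kl)
```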
401
Figure 4: BLEU scores on IWSLT.ACLdev2023 for
different combinations of min and max segment length
parameters of SHAS.
Table 2: BLEU scores for En-De MuST-C and IWSLT sets. In bold are the best scores by single models, and in
underlined bold are the best scores overall.
Ensembling multiple models provided small increases in all sets. We believe that there is very little variation in our best models (2b-2e), since they are initialized from the same Siamese pre-training (2b), thus resulting in ineffective ensembles. In general, and in terms of single models, we improve our results from last year by 1.6 BLEU in tst2019 and 2.1 BLEU in tst2020, while the difference is larger in terms of single models.

7 Conclusions

We described the submission of the UPC Machine Translation group for the IWSLT 2023 Offline ST task. Our system leverages ASR and MT foundation models and a Siamese pretraining step to maximize the transfer learning from MT. We show that Siamese pretraining can bring significant improvements to our ST models, while fine-tuning with KD can also be helpful. We furthermore show that synthetic data are crucial at improving performance in the IWSLT test sets. In future work, we plan to investigate the zero-shot capabilities of optimal transport in the context of foundation models.

8 Submission Results

In Tables 3, 4 and 5, we present the official submission results for IWSLT 2023 with our best system, which is the Ensemble 3c of Table 2. Systems are evaluated on the three test sets (TED, ACL, Sub) with three metrics; BLEU (Papineni et al., 2002), chrF (Popović, 2017), and COMET (Rei et al., 2020). The TED test set also has two available references.

Metric       BLEU                 chrF          COMET
Reference    1      2      both   1      2      1        2
System 3c    25.5   29.8   36.6   0.56   0.58   0.7985   0.8098

Table 3: Official Results for the TED test set 2023.

Metric       BLEU   chrF   COMET
System 3c    32.1   0.6    0.7473

Table 4: Official Results for the ACL test set 2023.

Metric       BLEU   chrF   COMET
System 3c    15.6   0.47   0.3746

Table 5: Official Results for the Sub test set 2023.

Acknowledgements

The work done by Ioannis Tsiamas and Gerard I. Gállego was supported by the ADAVOICE project, PID2019-107579RB-I00 / AEI / 10.13039/501100011033
403
References for Speech Recognition. In Proc. Interspeech 2021,
pages 2426–2430.
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, On-
drej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Mattia A. Di Gangi, Matteo Negri, Viet Nhat Nguyen,
Durrani, Marcello Federico, Christian Federmann, Amirhossein Tebbifakhr, and Marco Turchi. 2019.
Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Data Augmentation for End-to-End Speech Trans-
Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Eliz- lation: FBK@IWSLT ’19. In Proceedings of the
abeth Salesky, Xing Shi, Sebastian Stüker, Marco 16th International Workshop on Spoken Language
Turchi, Alexander H. Waibel, and Changhan Wang. Translation, Hong Kong. Publisher: Zenodo.
2020. FINDINGS OF THE IWSLT 2020 EVAL-
UATION CAMPAIGN. In Proceedings of the 17th Charlie Frogner, Chiyuan Zhang, Hossein Mobahi,
International Conference on Spoken Language Trans- Mauricio Araya-Polo, and Tomaso Poggio. 2015.
lation, IWSLT 2020, Online, July 9 - 10, 2020, pages Learning with a wasserstein loss. In Proceedings
1–34. Association for Computational Linguistics. of the 28th International Conference on Neural In-
formation Processing Systems - Volume 2, NIPS’15,
R. Ardila, M. Branson, K. Davis, M. Henretty, page 2053–2061, Cambridge, MA, USA. MIT Press.
M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M.
Tyers, and G. Weber. 2020. Common voice: A Marco Gaido, Mauro Cettolo, Matteo Negri, and Marco
massively-multilingual speech corpus. In Proceed- Turchi. 2021. CTC-based compression for direct
ings of the 12th Conference on Language Resources speech translation. In Proceedings of the 16th Con-
and Evaluation (LREC 2020), pages 4211–4215. ference of the European Chapter of the Association
for Computational Linguistics: Main Volume, pages
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, 690–696, Online. Association for Computational Lin-
and Michael Auli. 2020. wav2vec 2.0: A framework guistics.
for self-supervised learning of speech representations.
In Advances in Neural Information Processing Sys- Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and
tems, volume 33, pages 12449–12460. Curran Asso- Marco Turchi. 2020. End-to-end speech-translation
ciates, Inc. with knowledge distillation: FBK@IWSLT2020. In
Proceedings of the 17th International Conference on
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Spoken Language Translation, pages 80–88, Online.
Lopez, and Sharon Goldwater. 2019. Pre-training Association for Computational Linguistics.
on high-resource speech recognition improves low-
resource speech-to-text translation. In Proceedings Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe
of the 2019 Conference of the North American Chap- Fiameni, Matteo Negri, and Marco Turchi. 2022.
ter of the Association for Computational Linguistics: Efficient yet competitive speech translation:
Human Language Technologies, pages 58–68, Min- FBK@IWSLT2022. In Proceedings of the 19th
neapolis, Minnesota. Association for Computational International Conference on Spoken Language
Linguistics. Translation (IWSLT 2022), pages 177–189, Dublin,
Ireland (in-person and online). Association for
Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Computational Linguistics.
Karakanta, Alberto Martinelli, Matteo Negri, and
Marco Turchi. 2021. Cascade versus direct speech Gerard I. Gállego, Ioannis Tsiamas, Carlos Escolano,
translation: Do the differences still make a differ- José A. R. Fonollosa, and Marta R. Costa-jussà. 2021.
ence? In Proceedings of the 59th Annual Meet- End-to-end speech translation with pre-trained mod-
ing of the Association for Computational Linguistics els and adapters: UPC at IWSLT 2021. In Proceed-
and the 11th International Joint Conference on Natu- ings of the 18th International Conference on Spoken
ral Language Processing (Volume 1: Long Papers), Language Translation (IWSLT 2021), pages 110–119,
pages 2873–2887, Online. Association for Computa- Bangkok, Thailand (online). Association for Compu-
tional Linguistics. tational Linguistics.
Alexandre Berard, Laurent Besacier, Ali Can Ko- Mattia A. Di Gangi, Matteo Negri, and Marco Turchi.
cabiyikoglu, and Olivier Pietquin. 2018. End-to-End Automatic Speech Translation of Audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6224–6228, Calgary, AB. IEEE.
Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition.
2019. Adapting Transformer to End-to-End Spoken Language Translation. In Proc. Interspeech 2019, pages 1133–1137.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.
Chi Han, Mingxuan Wang, Heng Ji, and Lei Li. 2021. Learning shared semantic space for speech-to-text translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2214–2225, Online. Association for Computational Linguistics.
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.
Dan Hendrycks and Kevin Gimpel. 2020. Gaussian error linear units (GELUs).
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021a. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. 2021b. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. In Proc. Interspeech 2021, pages 721–725.
Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, and Shinji Watanabe. 2021. ESPnet-ST IWSLT 2021 offline speech translation system. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 100–109, Bangkok, Thailand (online). Association for Computational Linguistics.
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates.
J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. 2020. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. https://github.com/facebookresearch/libri-light.
Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2021. Data augmenting contrastive learning of speech representations in the time domain. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 215–222.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.
Paul Knopp and Richard Sinkhorn. 1967. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, and Didier Schwab. 2023. Pre-training for speech translation: CTC meets optimal transport.
Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. Multilingual speech translation from efficient finetuning of pretrained models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 827–838.
Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-End Speech Translation with Knowledge Distillation. In Proc. Interspeech 2019, pages 1128–1132.
J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 evaluation campaign. In Proceedings of the 16th International Workshop on Spoken Language Translation.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Gabriel Peyré and Marco Cuturi. 2019. Computational optimal transport: With applications to data science.
Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2022. Effective combination of pretrained models - KIT@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 190–197, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade. In Proceedings of the 16th International Workshop on Spoken Language Translation, Hong Kong. Publisher: Zenodo.
Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Tomasz Potapczyk and Pawel Przybysz. 2020. SRPOL's System for the IWSLT 2020 End-to-End Speech Translation Task. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 89–94, Online. Association for Computational Linguistics.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7409–7421, Online. Association for Computational Linguistics.
Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Ioannis Tsiamas, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022a. SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations.
Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano, José Fonollosa, and Marta R. Costa-jussà. 2022b. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 265–276, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022c. SHAS: Approaching optimal segmentation for end-to-end speech translation.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, pages 33–39, Suzhou, China. Association for Computational Linguistics.
Changhan Wang, Anne Wu, and Juan Pino. 2020b. CoVoST 2: A massively multilingual speech-to-text translation corpus. arXiv preprint arXiv:2007.10310.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org.
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2619–2630, Online. Association for Computational Linguistics.
Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. 2021b. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3030–3034.
Biao Zhang, Ivan Titov, Barry Haddow, and Rico Sennrich. 2020. Adaptive feature selection for end-to-end speech translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2533–2544, Online. Association for Computational Linguistics.
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech translation system for IWSLT 2022 offline shared task. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 158–168, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, et al. 2022. SpeechLM: Enhanced speech pre-training with unpaired textual data. arXiv preprint arXiv:2209.15329.
Jinming Zhao, Hao Yang, Gholamreza Haffari, and Ehsan Shareghi. 2022. M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. In Proc. Interspeech 2022, pages 111–115.

A Appendix

A.1 Implementation Details

This section presents the implementation details of our proposed model architecture.

As an ASR model, we are using wav2vec 2.0 (https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt), which is composed of a 7-layer convolutional feature extractor and a 24-layer Transformer encoder. It is pretrained with 60k hours of non-transcribed speech from Libri-Light (Kahn et al., 2020), and fine-tuned for ASR with 960 hours of labeled data from Librispeech (Panayotov et al., 2015). The wav2vec 2.0 version we use was also fine-tuned with pseudo-labels (Xu et al., 2021b).

As an MT model, we are using mBART50 (Tang et al., 2020), which is already fine-tuned on En-Xx multilingual machine translation (https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.1n.tar.gz). We further pretrain it for two reasons. Firstly, we are only interested in the En-De direction, and thus we would like a more specialized model on that direction. Secondly, due to the 2nd step of encoder matching, we would like the text encoder to have a very good representation of our data. For MT fine-tuning, we use the original parameters of mBART50 (Tang et al., 2020), and the datasets listed in Table 6.

The acoustic encoder has 24 Transformer layers, while the semantic encoder and the decoder have 12 layers each. All layers have an embedding dimensionality of 1024, a feed-forward dimensionality of 4098, GELU activations (Hendrycks and Gimpel, 2020), 16 attention heads, and pre-layer normalization (Xiong et al., 2020). The vocabulary for the CTC has a size of 32 characters, while the one for the ST model has a size of 250,000.

The model takes waveforms with a 16kHz sampling rate as input, which are normalized to zero mean and unit variance. The models are trained using the data presented in Table 1, with a maximum source length of 400,000 and a target length of 1024 tokens. Gradient accumulation and data parallelism are employed to achieve an effective batch size of approximately 32 million tokens.

For the Siamese pre-training we use Adam (Kingma and Ba, 2014) with a base learning rate of 2 · 10^-4, a warm-up of 1,000 steps and an inverse square root scheduler. We follow a reduced regularization approach, as compared to the original configuration of wav2vec 2.0 and mBART50, which we found to work the best in our preliminary experiments. Thus, we use 0.1 activation dropout in the acoustic encoder, as well as time masking with probability of 0.2 and channel masking with probability of 0.1. For the context encoder, we use 0.1 dropout and 0.1 attention dropout. All other dropouts are inactive. All the weights in the loss function were set to 1.0 (Eq. 1). We train until the L_OT2 term of the loss does not improve for 5,000 steps, and then average the 10 best checkpoints according to the same loss term.

For ST fine-tuning, we use Adam with a base learning rate of 5 · 10^-5, fixed for the 20% of the training before decaying to 5 · 10^-7 for the rest. In the semantic encoder, we apply a dropout of 0.1 and an attention dropout of 0.1, while for the decoder we use a dropout of 0.3 and an attention dropout of 0.1. Neither dropout nor masking is applied in the frozen acoustic encoder. The loss is the cross-entropy with label smoothing of 0.2.

For the experiments incorporating Knowledge Distillation (KD) during ST fine-tuning, the loss is calculated as a weighted sum of the standard cross-entropy (no label smoothing) and the KL divergence between the teacher and student distributions, controlled by a hyperparameter λ, set to 0.5. The teacher distribution for each step is obtained offline using the fine-tuned mBART50, where we keep the top-8 indices, and both the teacher and student distributions are additionally modified with temperature T = 1.3 (Gaido et al., 2020).
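To make this objective concrete, the following is a minimal PyTorch-style sketch of a KD loss combining cross-entropy with a temperature-scaled KL term, using the λ and T values quoted above; the tensor layout, the padding index, and the way teacher logits are provided are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def kd_st_loss(student_logits, teacher_logits, targets,
               lam=0.5, temperature=1.3, pad_id=1):
    """Weighted sum of cross-entropy and teacher-student KL divergence.

    student_logits: (batch, tgt_len, vocab)
    teacher_logits: (batch, tgt_len, vocab); in the setup described above these
                    would come from the fine-tuned MT teacher, restricted to its
                    top-8 indices (masking of the remaining entries is omitted here).
    targets:        (batch, tgt_len) gold token ids
    """
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), targets, ignore_index=pad_id
    )
    # Temperature-scaled distributions for the KL term.
    t_logprob = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_logprob, t_logprob, log_target=True, reduction="batchmean")
    return lam * kl + (1.0 - lam) * ce
```

With λ = 0.5 the two terms are weighted equally, so the exact placement of λ on the KL or the cross-entropy term does not change this sketch.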
After ST fine-tuning, we pick the 10 best checkpoints according to the BLEU (Papineni et al., 2002) computed with sacreBLEU (Post, 2018) on the development set of MuST-C and average them. For generation, we use a beam search of 5. All models are implemented in FAIRSEQ (Ott et al., 2019), and experiments were run on a cluster of 8 NVIDIA GeForce RTX 3090. Our code is available at a public repository (https://github.com/mt-upc/iwslt-2023).

A.2 MT fine-tuning

For the MT fine-tuning, we use the parallel text of the ST datasets, as well as Europarl v10 En-De (Koehn, 2005) (Table 6). We perform text normalization and remove pairs with extremely short text segments (fewer than 4 characters) or extreme source-to-target length ratio (less than 0.5 or larger than 2).

                Original   Filtered
ST datasets
  MuST-C v3          270        235
  Europarl-ST         33         26
  CoVoST 2           231        203
MT datasets
  Europarl v10     1,829      1,566
Total              2,363      2,030

Table 6: Filtered training data (thousands of sentences) for MT fine-tuning stage.

                MuST-C v2   MuST-C v3   Europarl-ST   CoVoST2
Off-the-shelf
  mBART50            31.4        30.9          35.0      33.6
Fine-tuned
  MuST-C v2          35.3        34.4          34.6      35.3
  All (§3.1)         34.9        34.2          40.3      39.9

Table 7: BLEU scores on MT test sets.

A.3 Preliminary experiments

Before starting the primary experiments for the IWSLT evaluation campaign, we conducted an array of preliminary tests, building on top of previous years' submissions (Gállego et al., 2021; Tsiamas et al., 2022b). These explorations were intended to examine the impact of system configuration variations on the performance metrics on the MuST-C v2 dev set, such as BLEU (Papineni et al., 2002), chrF2 (Popović, 2017), and COMET (Rei et al., 2020). To ensure the robustness of our findings, we estimated statistical significance using the bootstrap resampling method (Koehn, 2004).

In our initial experiment, we examined the impact of various fine-tuning strategies used in our last years' participations, specifically LNA (Li et al., 2021) and LNA-Adapters (Tsiamas et al., 2022b), in comparison to full fine-tuning. The goal was to verify whether these approaches inadvertently hurt the system's performance. As demonstrated in Table 8, these strategies indeed had a detrimental effect, leading to reductions of 1.9 BLEU points when applied to both the encoder and the decoder. Consequently, we opted to adopt a conventional full fine-tuning strategy for subsequent experiments.

Following this, we conducted a comparative analysis of various speech encoders, including different variations of wav2vec 2.0 (Baevski et al., 2020; Xu et al., 2021b; Hsu et al., 2021b; Conneau et al., 2021), HuBERT (Hsu et al., 2021a), and SpeechLM (Zhang et al., 2022) (Table 9). Our baseline was the wav2vec 2.0 fine-tuned with pseudo-labels (Xu et al., 2021b), and intriguingly, most encoders exhibited a comparable level of performance. A marginal decrease was observed with the wav2vec 2.0 pretrained on a large pool of datasets (LV-60 + CV + SWBD + FSH) (Hsu et al., 2021b), and the multilingual version of wav2vec 2.0, XLSR (Conneau et al., 2021). The SpeechLM results were noticeably below expectations, leading us to suspect a bug in our implementation.

Upon noting that the hyperparameters were optimized for a specific speech encoder, we hypothesized that a reduction in the learning rate might boost HuBERT's performance. However, as demonstrated in Table 11, the performance was adversely affected, prompting us to retain the original wav2vec 2.0 as the primary speech encoder due to the lack of substantial improvements offered by other alternatives.

Our focus then shifted towards examining the influence of varying regularization and data augmentation strategies on system performance (Table 10). We explored a range, from our traditionally used setup (base), to the one employed in the original foundation model fine-tuning, and a reduced version. Implementing the original regularization within the speech encoder, as opposed to the base variant, significantly boosted performance.
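The significance markers (∗) reported in the tables that follow come from the bootstrap resampling test (Koehn, 2004) mentioned above. As a rough illustration only (not the authors' evaluation code), a paired bootstrap over sentence-level scores can be sketched as:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resamples in which system A beats system B.

    scores_a, scores_b: per-sentence scores of the two systems on the same
    test set (higher is better). Corpus-level metrics such as BLEU would
    require re-aggregating sufficient statistics per resample instead of
    summing sentence scores, which is simplified away here.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```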
Encoder       Decoder       BLEU    chrF2    COMET
-             -             29.0    54.7     0.8001
LNA           -             28.0*   54.1*    0.7949*
-             LNA           27.9*   54.0*    0.7882*
LNA           LNA           27.1*   53.2*    0.7800*
LNA-Adapt     -             28.2*   54.3*    0.7960*
-             LNA-Adapt     27.6*   53.6*    0.7889*
LNA-Adapt     LNA-Adapt     27.1*   53.5*    0.7847*

Table 8: Fine-tuning strategies applied to the encoder and/or decoder, compared to full fine-tuning (1st row), on the MuST-C v2 dev set (en-de). * indicates significance w.r.t. the baseline (1st row).

Learning Rate    BLEU    chrF2    COMET
5 · 10^-4        30.3    56.1     0.8099
2 · 10^-4        30.3    56.0     0.8069
1 · 10^-4        30.2    55.9     0.8085
5 · 10^-5        29.5*   55.3*    0.8047

Table 11: Learning rate exploration on the MuST-C v2 dev set (en-de). * indicates significance w.r.t. the baseline (1st row).

Table 9: Speech encoders exploration with MuST-C v2 dev set (en-de). * indicates significance w.r.t. baseline (1st row). † uses LNA-Adapters (Tsiamas et al., 2022b). ‡ indicates a possible bug in our implementation.

Table 10: Variations of the regularization and data augmentation strategies, with MuST-C v2 dev set (en-de). * indicates significance w.r.t. baseline (1st row).
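For reference, the LNA strategy compared in Table 8 (Li et al., 2021) fine-tunes only LayerNorm and attention parameters while freezing everything else. A minimal sketch of such selective unfreezing is shown below; the parameter-name patterns are assumptions that depend on the specific model implementation, not the authors' exact code.

```python
import torch.nn as nn

def apply_lna_finetuning(module: nn.Module,
                         trainable_patterns=("layer_norm", "self_attn", "encoder_attn")):
    """Freeze all parameters except those whose names match the given patterns
    (LayerNorm and attention blocks, in the spirit of LNA fine-tuning)."""
    for name, param in module.named_parameters():
        param.requires_grad = any(p in name for p in trainable_patterns)
```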
The Xiaomi AI Lab's Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task

Wuwei Huang1∗†, Mengge Liu2∗‡, Xiang Li1, Yanzhi Tian2‡, Fengyu Yang1, Wen Zhang1, Yuhang Guo2, Jinsong Su3, Jian Luan1, Bin Wang1
1 Xiaomi AI Lab, Beijing, China
2 Beijing Institute of Technology, Beijing, China
3 Xiamen University, Xiamen, Fujian, China
{huangwuwei,lixiang21,yangfengyu1,zhangwen17,luanjian,wangbin11}@xiaomi.com
{liumengge,tianyanzhi,guoyuhang}@bit.edu.cn [email protected]
in the training set similar to the rules used in (Guo et al., 2022), following these steps:

• A series of hand-crafted rules are adopted to filter out noisy sentences from the training set. In particular, we discard sentences that contain less than 50% linguistic words. For Chinese sentences, Chinese characters are considered linguistic words; for English sentences, words containing only alphabet characters are considered linguistic words;
• We utilize the fast_align open-source tool (https://github.com/clab/fast_align) to exclude sentence pairs with a score lower than −8. We also apply the language identification (LangID) tool (https://github.com/saffsd/langid.py) to filter out sentence pairs that are neither in Chinese nor English;
• Duplicate sentence pairs are discarded, and any pairs with a length ratio greater than 3.0 or sentences with a length exceeding 200 are also filtered out.

To filter out noise data in the ST training set, we apply the following steps:

• Pairs that have an audio duration exceeding 60 seconds or a text length exceeding 200 tokens are excluded;
• We calculate the ratio of the number of speech frames to tokens in each sample, and remove samples whose ratio exceeds three times the average ratio.

2.2.2 Data Augmentation

To effectively train an end-to-end speech translation model, it is impractical to rely solely on hand-annotated training data, due to the scarcity of hand-annotated data. To mitigate this issue, we utilize a well-trained MT model to translate the transcriptions from ASR data and synthesize a large amount of pseudo-data, which has been widely used in the previous years' competitions (Ding and Tao, 2021; Zhang and Ao, 2022; Zhang et al., 2022b; Li et al., 2022; Zhu et al., 2022).

We initially gather all available English-Chinese bilingual parallel sentence pairs from ST and MT tasks, as listed in Table 1. We then filter the data using the method mentioned in Section 2.2.1, generating 9M sentence pairs. These 9M sentence pairs are used to fine-tune the pre-trained one-to-many mBART50 model for 30 epochs. We further fine-tune mBART50 for another 30 epochs using MuST-C datasets to improve the domain adaptability of the model. The results are shown in Table 2.

Models                                  BLEU
mBART50 (one-to-many)                   25.81
  + domain fine-tuning on 9M corpus     28.41
  + domain fine-tuning on MuST-C        29.50

Table 2: The BLEU scores of MT models obtained by fine-tuning one-to-many mBART50 model using various bilingual datasets on the tst-COMMON test set.

In the Librispeech and TED-LIUM datasets, English sentences do not have punctuation or case information. We fine-tune the mBART50 model to add punctuation and restore case information to English sentences. Furthermore, samples already included in the CoVoST corpus are removed from the CommonVoice dataset. The transcriptions of the ASR data are then translated using the best fine-tuned mBART50 model and filtered using the same rules as the ST data in Section 2.2.1, resulting in a total of 1.6 million synthesized speech-to-text translation pairs.

Finally, for constrained data, we combine the hand-annotated ST corpus with the synthesized ST corpus to produce the final training corpus for the Offline-ST and Simul-ST models, yielding a total of 2.9 million speech-to-text translation pairs. In the case of unconstrained training on the offline track, we augment our training corpus with the GigaST corpus, resulting in 9 million speech-to-text translation pairs.

2.3 Cascaded S2ST Corpus

In the En⇒Zh speech-to-speech translation track, we leverage all available constrained data from the offline speech translation track as well as the GigaST corpus (https://st-benchmark.github.io/resources/GigaST.html) to train our offline speech translation model. This model is then followed by a TTS model that is trained on the AISHELL-3 and GigaS2S datasets.

2.4 Speech Segmentation

Since the speech in the evaluation set is not pre-segmented, we apply SHAS (Tsiamas et al., 2022) to segment the full speech into shorter segments. However, we observe two issues. Firstly, some segments have incomplete final words, which could negatively impact the performance of the ST model. To alleviate this problem, we add a few extra frames
at the end of each segment to ensure that the final word is fully pronounced. Secondly, the speaking rate varies among different speakers or types of speeches, resulting in different amounts of words being spoken within a given time period. Excessive words in a speech segment may result in missing translations. We choose different hyperparameters for different speakers or different types of speeches.

Figure 1: The architecture of our end-to-end offline speech translation model consists of three components: speech encoder, text encoder, and text decoder. The speech encoder is composed of a CNN feature extractor and a 24-layer Transformer encoder with a CNN positional encoder. Both the text encoder and the text decoder are 12-layer standard Transformer structures. Note that the speech encoder is initialized with the pre-trained HuBERT model, and both the text encoder and text decoder are initialized with the pre-trained mBART model.

3 Methods

We build our Offline-ST system in an end-to-end manner (End-to-End Offline-ST) based on the HuBERT and mBART pre-trained models. Our simultaneous speech translation system (End-to-End Simul-ST) utilizes the same model architecture as the Offline-ST system and adopts wait-k and ITST strategies. The cascaded S2ST system involves an end-to-end speech-to-text translation model followed by a TTS model.

3.1 End-to-End Offline-ST System

The speech translation corpus typically consists of triples (x, z, y) that contain speech, transcription, and translation data, where x = (x_1, ..., x_{|x|}) represents a sequence of acoustic features, while z = (z_1, ..., z_{|z|}) and y = (y_1, ..., y_{|y|}) denote the corresponding transcription in the source language and translation in the target language, respectively. Our end-to-end Offline-ST system is based on an encoder-decoder architecture from the pre-trained HuBERT and mBART models. Figure 1 illustrates the architecture of our model, which consists of a speech encoder, a text encoder, and a text decoder. More specifically, the speech encoder is composed of a feature extractor based on convolutional neural networks (CNN), named CNN feature extractor, and a 24-layer Transformer encoder. The CNN feature extractor is used to extract speech features from the waveform, with 7 layers each containing 512 channels and kernel widths of [10, 3, 3, 3, 3, 2, 2] and strides of [5, 2, 2, 2, 2, 2, 2]. The Transformer encoder is derived from the standard Transformer (Vaswani et al., 2017) encoder, except for using CNN as the position encoder. The text encoder is a 12-layer standard Transformer encoder, and the text decoder is a 12-layer standard Transformer decoder. The training objective of our speech translation model can be formulated as:

L(x, y; \theta_e, \theta_d) = -\sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x; \theta_e, \theta_d)    (1)

where \theta_e and \theta_d represent the parameters of the encoder and the decoder, respectively.
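The objective in Eq. 1 is the standard token-level negative log-likelihood under teacher forcing. A minimal PyTorch-style rendering is shown below for illustration; the tensor layout and the padding index are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def st_objective(logits, targets, pad_id=1):
    """Sum of -log p(y_t | y_<t, x) over target positions, as in Eq. 1.
    logits: (batch, tgt_len, vocab); targets: (batch, tgt_len)."""
    return F.cross_entropy(
        logits.transpose(1, 2), targets,
        ignore_index=pad_id, reduction="sum",
    )
```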
3.2 Cascaded S2ST System

In the cascaded S2ST system, we reuse the offline speech translation model discussed in Section 3.1 as the ST model. For the TTS model, we first train a base TTS model and vocoder using the AISHELL-3 dataset with the Tacotron2 (Shen et al., 2018) open-source framework. The final TTS model is obtained by fine-tuning the base model on the GigaS2S dataset.

3.3 End-to-End Simul-ST System

In order to take full advantage of the powerful capabilities of large pre-trained models, we develop an end-to-end Simul-ST system based on the HuBERT and mBART models. Furthermore, we employ two strategies, namely wait-k and ITST.

3.3.1 Wait-k

Ma et al. (2020b) adapts methods originally proposed for simultaneous machine translation to develop an end-to-end Simul-ST system. To achieve this, they employ the wait-k (Ma et al., 2019) strategy and a fixed pre-decision module. Under this approach, the system first reads k speech segments, each of which contains a fixed number (q, a hyperparameter in the pre-decision module) of speech frames. When k speech segments have been read, the decoder generates one token in the target language. Similarly, we also apply the wait-k strategy in the decoding process of our end-to-end offline-ST system, as it strikes a good balance between translation quality and latency without requiring any streaming strategy during training (Papi et al., 2022; Polák et al., 2022). During inference, once a speech segment is accepted, the decoder takes the following action:

Action = \begin{cases} \text{continue to read}, & |x| - |y| < k \\ \text{output } y_t, & |x| - |y| \ge k \end{cases}    (2)

where y_t denotes the t-th token of the target language, while |x| and |y| refer to the number of source speech segments and target tokens, respectively.

3.3.2 ITST

The Information-Transport-based Simultaneous Translation (ITST) architecture has achieved state-of-the-art performance in end-to-end simultaneous speech translation. To implement this strategy, we initialize the corresponding parameters by using the pre-trained HuBERT and mBART models, and randomly initialize additional parameters for computing the information transport matrix. We then optimize the quality and latency objectives using the ITST criterion, varying the δ value to control the latency in streaming inference.

Our end-to-end speech translation system is built based on the ITST architecture, equipped with a wait-k streaming decoding strategy, and finally evaluated using the SimulEval (Ma et al., 2020a) toolkit. To ensure accurate translations, we enforce a constraint that the model should not produce the final translation until it has fully processed the speech in the source language.

3.4 Self-Training

Self-training is a simple semi-supervised learning method that involves using unlabeled data to augment labeled data (Pino et al., 2020; Sun et al., 2021; Wang et al., 2021; Popuri et al., 2022). To leverage the large-scale unlabeled audio introduced in Section 2.1, we employ self-training in our approach. In particular, we first train the end-to-end speech translation model on both manually annotated data and augmentation data, as described in Section 2. Next, we use the model to generate Chinese translation text, which we merge with the original training data and unlabeled audio. We then continue training the end-to-end speech translation model on this merged dataset.

3.5 Contrastive Learning

The objective of contrastive learning (Chen et al., 2020; Gao et al., 2021; Ye et al., 2022; Zhang et al., 2023) is to learn an encoder that produces similar representations for similar instances, while producing dissimilar representations for dissimilar instances, as measured by their cosine similarity. In our approach, we assume that the same utterance, regardless of whether it is in speech or text modality, will have similar hidden representations. Therefore, we aim to minimize the cosine distance between the hidden representations of the two modalities for the same utterance, while increasing the cosine distance between the hidden representations of different utterances. Specifically, we minimize the cosine distance between the speech encoder output and the corresponding word embedding for the same utterance, while maximizing the distance between the representations of different utterances. The training objective is as follows:

L_{CTR} = -\sum_{t=1}^{N} \log \frac{\exp(\mathrm{sim}(u, v)/T)}{\sum_{x_j \in X} \exp(\mathrm{sim}(u, v(x_j))/T)}    (3)

where u is the average state of the speech encoder output along the sequence length, v is the average word embedding, and T is the temperature hyperparameter.
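As an illustration of Eq. 3, the contrastive term can be computed over a batch by treating each matching speech/text pair as the positive and the remaining text embeddings in the batch as negatives. The sketch below is our own simplified rendering (the batch-level choice of negatives and the temperature value are assumptions, not the authors' implementation).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_repr, text_repr, temperature=0.1):
    """speech_repr, text_repr: (batch, dim) mean-pooled speech encoder states
    and mean-pooled word embeddings. The i-th pair is the positive; other
    text embeddings in the batch serve as negatives (placeholder temperature)."""
    u = F.normalize(speech_repr, dim=-1)
    v = F.normalize(text_repr, dim=-1)
    sim = u @ v.t() / temperature            # cosine similarities scaled by 1/T
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(sim, labels)      # -log softmax of the positive pairs
```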
More specifically, L_CTR quantifies the negative logarithm of the probability that the similarity between u and v is greater than the similarity between u and other candidate word embeddings v(x_j). The probabilities are normalized using a softmax function over all candidate embeddings.

In addition to contrastive learning, we also conduct multitask learning using labeled ASR and MT training data, which results in the final optimization objective:

L = L_{ST} + L_{ASR} + L_{MT} + L_{CTR}    (4)

where L_{ST}, L_{ASR}, L_{MT}, and L_{CTR} denote the losses for speech-to-text translation, ASR, MT, and contrastive learning, respectively.

4 Experiments

4.1 Experiment Settings

The fairseq toolkit (https://github.com/pytorch/fairseq) is used to train our speech-to-text models. During training, the models take the original waveform sampled at 16kHz as the input. The Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 5e-5 is used to train the models. Each model is trained for 200k steps, and we save the model every 2.5k steps using an early stopping mechanism. In detail, if the BLEU score on the development set does not improve for 10 consecutive checkpoints, the training will be terminated. During the fine-tuning stage, we set the maximum number of updates to 50k and the learning rate to 2e-5. Our TTS model is implemented using the Tacotron2 toolkit (https://github.com/NVIDIA/tacotron2).

4.2 Evaluation

As the official automatic evaluation criterion, the BLEU score (Papineni et al., 2002) is used to evaluate the translation quality of all our systems. For the Simul-ST system, we employ the average lag (AL) (Ma et al., 2019, 2020b) metric to measure the translation latency, which is a standard metric for simultaneous speech translation. The SimulEval open-source toolkit (https://github.com/facebookresearch/SimulEval) is utilized to calculate both the BLEU and AL metrics for the Simul-ST system. All BLEU scores are calculated with the SacreBLEU (Post, 2018) toolkit (https://github.com/mjpost/sacrebleu) at the character level.

#    Models                                BLEU
0    wav2vec2.0 (small)                    23.84
1    HuBERT + mBART50 (one-to-many)        27.74
2      + fine-tuning on MuST-C             27.90
3      + Self-Training                     27.69
4      + Contrastive Learning              28.11
5        + fine-tuning on MuST-C           27.94
6    data2vec + mBART50 (one-to-many)      27.66
7      + fine-tuning on MuST-C             27.59
8    Ensemble (2, 5)                       27.79
9    Ensemble (2, 7)                       27.61
10   Ensemble (2, 5, 7)                    27.94

Table 3: The BLEU scores of ST models on the tst-COMMON test set.

4.3 Main Results

Offline En⇒Zh Speech Translation

We evaluate our offline-ST models on the tst-COMMON test set by reporting the BLEU score in accordance with the official evaluation criteria. To establish a baseline for comparison, we use the widely-used standard wav2vec2.0 model for speech translation tasks. Table 3 shows the comparison results among all models. Our end-to-end models exhibit a significant improvement of approximately 4 BLEU points over the wav2vec2.0 baseline, which demonstrates the effectiveness of our methods. Additionally, we also conduct experiments using the data2vec (Baevski et al., 2022) pre-trained model and obtain comparable results on the tst-COMMON test set.

By analyzing our experimental results, we observe that domain fine-tuning does not significantly improve the performance of the model. Nevertheless, we believe domain fine-tuning will be beneficial for the final human evaluation on the TED test set (https://www.ted.com/). Our final submission is an ensemble of the models listed in rows 2, 5, and 7 of Table 3.

It is worth mentioning that we encounter some challenges when training our model. When the HuBERT model is used to initialize our model, instabilities are observed during training, with sudden gradient explosions leading to training collapse. After careful analysis, we determine that the problem is that the gradients of the CNN layers are relatively large during the entire training process. We address this issue by scaling down the gradients of the CNN layers.
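The gradient-scaling fix described above can be sketched as follows; the module path and the scaling factor are placeholders chosen for illustration, not the exact values used by the authors.

```python
def scale_cnn_gradients(model, scale=0.1):
    """Call after loss.backward() and before optimizer.step() to shrink the
    gradients of the CNN feature extractor, whose gradients were observed
    to be disproportionately large. `speech_encoder.feature_extractor` is a
    hypothetical attribute path."""
    for param in model.speech_encoder.feature_extractor.parameters():
        if param.grad is not None:
            param.grad.mul_(scale)
```

In fairseq-based wav2vec 2.0/HuBERT implementations, a comparable effect can typically be obtained through the feature_grad_mult configuration option.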
#    Models                  BLEU
1    Offline-ST              30.10
2    Offline-ST + GigaST     31.56
3    Ensemble (1, 2)         31.81

Table 4: BLEU scores of our ST models on the development set of the S2ST track in IWSLT 2023. Offline-ST is trained on all manually annotated data and the augmented data described in Section 2.2.2. In addition to the data used by the offline-ST model, the Offline-ST + GigaST model incorporates additional GigaST data.

#    Models                  ASR-BLEU
1    Offline-ST              28.88
2    Offline-ST + GigaST     30.10
3    Ensemble (1, 2)         30.18

Table 5: ASR-BLEU scores of our ST models on the development set of the S2ST track in IWSLT 2023. The models are identical to those presented in Table 4.

#    Strategies   Models           BLEU    AL
1    Wait-k       HuBERT+mBART     25.99   1980
2    Wait-k       + ST & CL        26.59   1966
3    ITST         HuBERT+mBART     26.25   1906

Table 6: The evaluation results of Simul-ST models on tst-COMMON. ST and CL denote self-training and contrastive learning for the Offline-ST model.

6000ms, the model performs a WRITE action to predict the next target token.

We evaluate the wait-k strategy using models 1 and 4 in Table 3, and train the ITST model with the same configuration as model 1 in Table 3. The results of the Simul-ST models are presented in Table 6. Although ITST shows better performance than wait-k in the same setting, the wait-k strategy combined with self-training and contrastive learning can achieve better results. Therefore, we finally submit the system corresponding to the second row in Table 6.
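For reference, the wait-k read/write rule of Eq. 2 that underlies the submitted simultaneous systems can be written as a simple policy; the sketch below is illustrative only and omits the fixed pre-decision module and the full-source constraint described in Section 3.3.

```python
def wait_k_action(num_read_segments, num_written_tokens, k):
    """Decide whether to READ another fixed-size speech segment or to WRITE
    the next target token, following Eq. 2."""
    if num_read_segments - num_written_tokens < k:
        return "READ"
    return "WRITE"
```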
References

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proc. of IWSLT.
Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proc. of IWSLT.
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. Findings of the IWSLT 2020 evaluation campaign. In Proc. of IWSLT.
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. data2vec: A general framework for self-supervised learning in speech, vision and language. In Proc. of ICML.
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NIPS.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proc. of ICML.
Liang Ding and Dacheng Tao. 2021. The USYD-JD speech translation system for IWSLT2021. In Proc. of IWSLT.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proc. of EMNLP.
Bao Guo, Mengge Liu, Wen Zhang, Hexuan Chen, Chang Mu, Xiang Li, Jianwei Cui, Bin Wang, and Yuhang Guo. 2022. The Xiaomi text-to-text simultaneous speech translation system for IWSLT 2022. In Proc. of IWSLT.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. In Proc. of TALSP.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, and Wei-Ning Hsu. 2022. Direct speech-to-speech translation with discrete units. In Proc. of ACL.
Yinglu Li, Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC's offline speech translation system for IWSLT 2022 evaluation. In Proc. of IWSLT.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. In Proc. of TACL.
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proc. of ACL.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proc. of EMNLP.
Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proc. of AACL/IJCNLP.
Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Does simultaneous speech translation need simultaneous models? In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 141–153. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL.
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation. In Proc. of Interspeech.
Peter Polák, Ngoc-Quan Pham, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondrej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 277–285. Association for Computational Linguistics.
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. 2022. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In Proc. of Interspeech.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proc. of WMT.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In Proc. of ICASSP.
Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proc. of ACL.
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, and Wei-Qiang Zhang. 2023. Improving speech translation by cross-modal multi-grained contrastive learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Shaolei Zhang and Yang Feng. 2022. Information-transport-based policy for simultaneous translation. In Proc. of EMNLP.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng, Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and Lirong Dai. 2022b. The USTC-NELSLIP offline speech translation systems for IWSLT 2022. In Proc. of IWSLT.
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech translation system for IWSLT 2022 offline shared task. In Proc. of IWSLT.
Qinpei Zhu, Renshou Wu, Guangfeng Liu, Xinyu Zhu, Xingyu Chen, Yang Zhou, Qingliang Miao, Rui Wang, and Kai Yu. 2022. The AISP-SJTU simultaneous translation system for IWSLT 2022. In Proc. of IWSLT.
Improving Formality-Sensitive Machine Translation using Data-Centric Approaches and Prompt Engineering

Seungjun Lee1, Hyeonseok Moon1, Chanjun Park1,2, Heuiseok Lim1∗
1 Korea University, South Korea
2 Upstage, South Korea
{dzzy6505, glee889, bcj1210, limhseok}@korea.ac.kr
[email protected]
model, which has been specifically designed for Vietnamese language tasks. Notably, EnViT5 models outperformed existing multilingual models such as mBART and M2M-100 while maintaining a significantly smaller parameter size, making them scalable and promising for both academic and industry applications (Ngo et al., 2022).

EnViT5 was pre-trained with the CC100 Dataset (Wenzek et al., 2020), which comprises monolingual data for over 100 languages. Subsequently, EnViT5 was fine-tuned on the MTet (Ngo et al., 2022) and PhoMT (Doan et al., 2021) datasets. MTet is a multi-domain EN-VI machine translation dataset encompassing a diverse range of domains, including educational videos, software user interfaces, COVID-related news articles, religious texts, subtitles, Wikipedia, and TED Talks (Reimers and Gurevych, 2020). Ultimately, when combined with PhoMT and IWSLT'15 (Cettolo et al., 2015), the final MTet dataset expands the training set size to 6 million examples, covering previously neglected areas such as law and biomedical data.

ples are sourced from the target language's training set and include both informal and formal levels. ChatGPT is then tasked with translating the input text into either an informal or formal target language, depending on the specified prompt. For the input text, we use English source sentences from the IWSLT 22 Formality Track's other language pairs. After filtering the translated examples using a formality classifier, we fine-tuned the respective PLMs for EN-KO and EN-VI by incorporating synthetic examples into the training sets for each language pair. To verify the effectiveness of data augmentation through prompt engineering, we conduct experiments comparing the results with and without the augmented data.

Language    Train    Test
EN-KO         400     600
EN-VI         400     600
EN-PT           0     600
EN-RU           0     600

Table 2: Data statistics in train and test sets of Formality Dataset.
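The classifier-based filtering step mentioned above could look roughly like the sketch below; the checkpoint name, label names, and threshold are placeholders, since the paper does not specify the classifier it used.

```python
from transformers import pipeline

# Hypothetical multilingual formality classifier; the model name and its
# label set are placeholders, not the classifier used in the paper.
classifier = pipeline("text-classification", model="some-org/formality-classifier")

def keep_example(translation, desired_label, min_score=0.9):
    """Keep a ChatGPT-generated translation only if the classifier agrees
    with the requested formality level with high confidence."""
    pred = classifier(translation)[0]
    return pred["label"] == desired_label and pred["score"] >= min_score
```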
with novel language pair and formality combinations.

3 Experiment Settings

3.1 Dataset Details

The IWSLT shared task provides the Formality Dataset, which contains English source segments, each accompanied by two contrasting reference translations representing informal and formal formality levels. This is available for two language pairs, EN-{KO, VI}, in the supervised setting and two additional language pairs, EN-{PT, RU}, in the zero-shot setting. The statistics for the train and test sets of the dataset are shown in Table 2.

For training and testing purposes, we randomly sampled 50 pairs of examples across each domain from the train set of the Formality Dataset, and set them aside as validation sets (TASK DEV) for each supervised language. The remaining samples were utilized for training (TASK TRAIN).

Additionally, we utilized external datasets in conjunction with the data provided in the shared task. For EN-KO, we employed a parallel corpus comprising Formal/Informal, Social Science, Technology Science, and News domains from AI Hub for the pretraining of the PLM. For EN-VI, we utilized EnViT5, which was fine-tuned using the MTet (Ngo et al., 2022) and PhoMT (Doan et al., 2021) datasets.

In our research, we leverage ChatGPT for the augmentation of the EN-KO and EN-VI data and the generation of synthetic examples for fine-tuning on EN-PT and EN-RU. This was done by using the source data from all available English-other language pairs (EN-XX) in the IWSLT'22 Formality Track (Anastasopoulos et al., 2022). To secure the quality and uniqueness of our training set, we implemented a preprocessing step that excludes duplicate sentences. Furthermore, to determine the optimal hyperparameters, we conducted a case study utilizing TASK DEV (details can be found in Section 4.3). The hyperparameters that led to the highest Matched-Accuracy (M-Acc) were selected for use. For all language pairs, we utilized a temperature of 0.9; specifically, we implemented 4-shot learning for EN-KO and 2-shot learning for EN-VI. For EN-PT and EN-RU, we proceeded with a zero-shot setting. More detailed information regarding the datasets and the preprocessing steps is presented in Table 3.

Language       Size    Source
EN-KO          6M      AI Hub (Formal/Informal + Tech/Sci + Social/Sci + News)
EN-VI          6.2M    MTet (Ngo et al., 2022) + PhoMT (Doan et al., 2021)
EN-{PT, RU}    1.6K    EN source from IWSLT'22 (Anastasopoulos et al., 2022)

Table 3: Additional external datasets used for the formality track in various language pairs.

3.2 Training Details

In the training details for the EN-KO language pair, we applied a morpheme-aware tokenization method to the translation dataset. To achieve this, we followed the training methods proposed by Park et al. (2020) and Gowda and May (2020), using MeCab-ko and Unigram to construct a vocabulary of 48K tokens. We then pre-trained the Transformer model (Vaswani et al., 2017). We used the fairseq library with 12 encoder and 12 decoder layers, each having 16 attention heads. Both encoder and decoder had an embedding dimension of 1024 and a feed-forward network (FFN) dimension of 4096. During pre-training, we trained for 20 epochs with a learning rate of 5e-4 and 4000 warmup updates. For fine-tuning, we trained for 200 epochs using a learning rate of 4e-5 and 100 warmup updates. We fine-tuned using the TASK TRAIN for all language pairs.

For the EN-{VI, PT, RU} pairs, we fine-tuned using the huggingface library. For EN-VI, we used VietAI/envit5-translation as the PLM. Fine-tuning was performed for 200 epochs with a learning rate of 4e-5, 200 warmup steps, and a batch size of 64. For the EN-{PT, RU} pairs, we used facebook/mbart-large-50 and trained for 200 epochs with a learning rate of 3e-5, 100 warmup steps, and a batch size of 16. All models were trained using four RTX A6000 GPUs. Detailed hyperparameters and training information can be found in Appendix B.

3.3 Evaluation Details

In our experimental setting, we used the official test set from the Formality Dataset (IWSLT'23) to evaluate our translation model's performance. The evaluation was conducted across two dimensions: overall translation quality and formality control. To assess the overall translation quality, we employed BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) (eamt22-cometinho-da) as automatic evaluation metrics. We use the 13A tokenizer to report SacreBLEU (Post, 2018) scores for all languages.
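For instance, corpus-level BLEU with the 13a tokenizer can be computed with the sacreBLEU Python API as in the usage sketch below (placeholder data, not scores from the paper):

```python
import sacrebleu

hypotheses = ["이것은 예시 번역입니다."]           # system outputs (placeholder)
references = [["이것은 예시 참조 번역입니다."]]     # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
print(bleu.score)
```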
For formality control, we utilized Matched-Accuracy (M-Acc), a reference-based corpus-level metric that leverages phrase-level formality markers from the references to classify system-generated hypotheses as formal or informal. The corpus-level score is the percentage of system outputs that match the desired formality level.

Additionally, we used a reference-free variant of M-Acc (C-F) (https://github.com/amazon-science/contrastive-controlled-mt/tree/main/IWSLT2023), which relies on a multilingual formality classifier to label system-generated hypotheses as formal or informal, with the corpus-level score representing the percentage of system outputs matching the desired formality level.

3.4 Prompt Design

We conducted experiments using ChatGPT with the GPT-4 engine with langchain (https://python.langchain.com/). For the EN-KO and EN-VI language pairs, we used a supervised setting, while for the EN-PT and EN-RU pairs, we employed a zero-shot setting. In the supervised setting, we extracted arbitrary n-shot samples using the TASK TRAIN. We designed prompts by leveraging langchain's prompt guide and prompt examples from Hendy et al. (2023). Detailed examples and explanations of the prompts can be found in Appendix A.

                     EN-KO                               EN-VI
METHOD               BLEU    COMET   %M-ACC   %C-F      BLEU    COMET   %M-ACC   %C-F
Official Baseline    4.91    0.211   78.3     98.6      26.71   0.363   96.0     99.7

Table 4: Results on the test set of Formality Dataset for formal and informal supervised settings, obtained via our language specialized data-centric approach.

                     EN-PT                               EN-RU
METHOD               BLEU    COMET   %M-ACC   %C-F      BLEU    COMET   %M-ACC   %C-F
Official Baseline    27.29   0.448   96.3     97.7      21.96   0.349   96.2     92.0

Table 5: Results on the test set of Formality Dataset for formal and informal zero-shot settings, achieved through our approach of synthetic data generation via prompt engineering.

4 Result & Findings

4.1 Results for Supervised Setting

Table 4 presents our experimental results in the supervised setting. As demonstrated by our results, our model, trained with the high-quality human-annotated Formality Dataset, exhibited outstanding performance. In particular, with respect to the C-F metric, our model shows almost perfect formality control performance (100% accuracy) for most of the tasks, except for the EN-KO informal task. Additionally, our model shows superior performance for the conventional NMT metrics (i.e. BLEU, COMET), outperforming ChatGPT with a 21.50 BLEU score for the EN-KO informal task. The EN-VI pair also exhibits high NMT metric
scores, M-Acc, and C-F scores compared to the baseline. These results suggest that our language-specific data-centric approach is effective.

Figure 1: BLEU and M-Acc scores for ChatGPT based on the supervised setting, evaluated on TASK DEV (panels: EN-KO at temperature 0.5 and EN-VI at temperature 0.9, formal and informal, for 1- to 32-shot prompts).

Through our experiments, we observed a significant degradation in the quality for the supervised settings EN-{KO, VI}. This phenomenon can be attributed to the limitations of synthetic data produced by ChatGPT. While the data generated through ChatGPT exhibits considerable quality, it was not up to par with the sentences derived from our data-centric approach. We found that the integration of ChatGPT-augmented data inadvertently introduced noise into the system, leading to a decrease in overall performance. Despite the exceptional capabilities of ChatGPT, it appears that in this context, the quality of data augmented by conventional NMT methods is still superior. This observation further emphasizes the critical role of data quality over quantity in supervised learning environments, and highlights the potential benefits of more sophisticated prompting techniques that consider formality control, such as stylistic or sentence endings, for improving overall performance.

4.2 Results for Zero-shot Setting

The experimental results for the zero-shot setting are shown in Table 5. As can be seen from the experimental results, our model significantly outperforms the official baseline on all tasks except the EN-PT informal task. Notably, our model demonstrates consistently higher performance in terms of the C-F metric compared to ChatGPT, achieving 100% M-Acc and C-F in the majority of tasks.

Exceptionally, for the EN-PT informal task, the performance of our model is markedly subpar, and ChatGPT even fails to exceed the official baseline. We find this result highly noteworthy, as it suggests that ChatGPT may generate semantically accurate and plausible data, while the formality can hardly be controlled, especially for the EN-PT language pair. In our experiments, we utilized the same prompt for both the EN-PT and EN-RU language pairs, differing only in language specification. The disparity in results between these two language pairs suggests that specialized techniques for controlling formality are required for each language pair. This issue can be partially attributed to a data bias in ChatGPT, indicating a potential training data bias concerning formality.

4.3 Case Study

Impact of In-context Shots  In this section, we examine the changes in performance based on the number of few-shot samples used for in-context learning, particularly when employing prompt engineering for translation.
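To make the setup concrete, an n-shot formality-controlled prompt of the kind varied in this case study might be assembled roughly as follows; this is a hypothetical sketch, and the exact prompt wording used by the authors is the one given in Appendix A of their paper, not reproduced here.

```python
import random

def build_prompt(source, shots, formality="formal", n_shots=4, seed=0):
    """shots: list of (english, target_translation) pairs at the desired
    formality level; `source` is the English sentence to translate."""
    rng = random.Random(seed)
    sampled = rng.sample(shots, n_shots)
    lines = [f"Translate the following English sentence into a {formality} register."]
    for en, tgt in sampled:
        lines.append(f"English: {en}\nTranslation: {tgt}")
    lines.append(f"English: {source}\nTranslation:")
    return "\n\n".join(lines)
```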
Formal Informal Formal Informal
35.00 30.00
28.00
30.00
BLEU
BLEU
26.00
25.00
24.00
20.00 22.00
0.2 0.5 0.7 0.9 0.2 0.5 0.7 0.9
100
100
80
95
M-Acc
M-Acc
60
90
40 85
20 80
0.2 0.5 0.7 0.9 0.2 0.5 0.7 0.9
Previous research suggests that increasing the number of shots beyond 10 does not significantly impact translation performance when using large language models (Zhang et al., 2023). However, we argue that applying the same perspective to formality control tasks proves challenging. This complexity arises because formality introduces a unique element required for these tasks. Additionally, previous research did not consider unintended consequences arising from this factor.

In pursuit of this, we conducted experiments where the number of shots was incrementally increased from 1 to 32, in powers of 2, using TASK DEV. The aim was to verify the differences in performance resulting from these changes. This process involved translating data via ChatGPT with an increasing number of shots and then evaluating the resulting translation data for its appropriateness. The experimental results are depicted in Figure 1. For this particular experiment, we selected the temperature (from the options of 0.2, 0.5, 0.7, 0.9) that demonstrated the highest performance and evaluated the changes in performance based on the number of shots.

As observed in our experimental results, increasing the number of shots for in-context learning led to an improvement in the general translation performance metric, BLEU. However, for the M-Acc and C-F scores, we found that the best performance was achieved with a smaller number of shots. This suggests that the nature of formality as a feature makes the "formality control" task distinct from conventional NMT, and it may be challenging to directly apply perspectives from conventional NMT to this task. We propose two hypotheses based on these results: (i) there exists a trade-off between translation performance and formality control as the number of shots increases, and (ii) increasing the number of shots while applying random sample selection may have caused confusion in performing formality control. We leave the analysis and validation of these hypotheses for future work.
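To make the shot-sweep setup concrete, the sketch below assembles one prompt per shot count following the template in Figure 3 (Appendix A.1). It is a minimal illustration only, under our own assumptions: the helper names, the random shot selection, and the example target language are not taken from the actual experimental code.

import random

def build_prompt(shot_sources, formality, target_language, source_sentence):
    # Assemble the few-shot prompt following the template in Figure 3 (Appendix A.1).
    shots = "\n".join(shot_sources)
    return (
        "####\n"
        f"{shots}\n"
        "####\n"
        f"Translate this into only {formality} {target_language}: {source_sentence}"
    )

def shot_sweep(shot_pool, source_sentence, formality="Formal", target_language="Portuguese"):
    # Build one prompt per shot count (1 to 32, in powers of 2), with random shot
    # selection as described in Section 4.3; the target language is only an example.
    prompts = {}
    for n in (1, 2, 4, 8, 16, 32):
        shots = random.sample(shot_pool, k=min(n, len(shot_pool)))
        prompts[n] = build_prompt(shots, formality, target_language, source_sentence)
    return prompts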
426
Impact of Temperature Temperature is an important parameter that makes ChatGPT generate varied responses to human queries (Peng et al., 2023). Basically, higher temperatures lead to higher linguistic variety, while lower ones generate more grammatically correct and deterministic text (Ippolito et al., 2019). Previous work suggested that, for machine translation, a diverse generation may impede translation quality with a high degree of certainty (i.e., high temperature) (Peng et al., 2023). In this sense, we experiment with different temperature settings and find the optimal temperature for the formality control data augmentation. In our experiments, we select the most appropriate one among the shot-candidates (1, 2, 4, 8, 16, 32) for each language pair.

Experimental results reveal that varying temperature can lead to significant performance fluctuations. It is particularly noteworthy that the performance disparity due to temperature changes is exceptionally high for the informal tasks. For formal tasks, the impact of temperature is relatively minor, with the variation in BLEU score being at most 0.95 (EN-RU). However, for informal tasks, the performance shift can reach up to 4.82 points (EN-RU) as temperature changes. Additionally, we find that in the informal task, the performance variation depending on temperature shows a distinct trend for each language pair. This is evident from the fact that a moderate temperature (0.7) yielded the highest BLEU performance in the EN-PT informal task, while a similarly moderate temperature (0.5) resulted in the lowest performance. Our findings suggest that handling ChatGPT in the informal task necessitates more elaborate control compared to dealing with formal data.

5 Background

centric approaches in NMT, aiming to improve translation quality and overcome the limitations of low-resource languages.

6 Conclusion

In this paper, we presented the KU x UpStage team's submission for four languages, employing two main strategies: 1) a language-specific data-driven approach, and 2) synthetic data generation using large-scale language models and empirical prompt engineering. While our data-driven approach excelled, particularly in EN-KO and EN-VI, the quality of synthetic data generation was called into question. In light of this feedback, we propose to enhance the quality of synthetic data by integrating Quality Estimation (QE) techniques as an additional filter in the generation process. This step aims to further refine our synthetic examples, potentially improving the overall system performance. We also plan to explore the use of translation models with larger parameters and conduct a thorough analysis through more shot examples and linguistically-grounded data augmentation techniques. Finally, we aim to extend our understanding of factors influencing FSMT performance, such as the impact of formal register versus grammatical formality in training data and a detailed examination of zero-shot transfer.
428
Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality. arXiv preprint arXiv:2205.04022.

Chinh Ngo, Trieu H Trinh, Long Phan, Hieu Tran, Tai Dang, Hieu Nguyen, Minh Nguyen, and Minh-Thang Luong. 2022. MTet: Multi-domain translation for English and Vietnamese. arXiv preprint arXiv:2210.05610.

Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.

OpenAI. 2023. GPT-4 technical report.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Chanjun Park, Sugyeong Eo, Hyeonseok Moon, and Heui-Seok Lim. 2021. Should we find another model?: Improving neural machine translation performance with one-piece tokenization method without model modification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, pages 97–104.

Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon, and Heuiseok Lim. 2022. Empirical analysis of parallel corpora and in-depth analysis using LIWC. Applied Sciences, 12(11):5545.

Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. 2020. An empirical study of tokenization strategies for various Korean NLP tasks. arXiv preprint arXiv:2010.02534.

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. arXiv preprint arXiv:2303.13780.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813.

Elijah Rippeth, Sweta Agrawal, and Marine Carpuat. 2022. Controlling translation formality using pre-trained multilingual language models. arXiv preprint arXiv:2205.06644.

Elizabeth Salesky, Marcello Federico, and Marta Costa-jussà, editors. 2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). Association for Computational Linguistics, Dublin, Ireland (in-person and online).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research, 69:343–418.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.
429
A Prompt Template
A.1 Supervised Setting
####
[shot 1 source]
[shot 2 source]
[shot n source]
####
Translate this into only [1. Informal | 2. Formal] [target language]: [input]
Figure 3: Prompt template for supervised setting based on Hendy et al. (2023). We utilize n randomly selected
shots from the English training set of other language pairs in the IWSLT 23 Formality Track as input for our
model, with few-shot examples derived from the target language’s training set.
[shot n source]
Translate this into only [1. Informal | 2. Formal] [target language]: [input]
Figure 4: Prompt template for zero-shot setting, following the recommended instruction and format for the default
sentence-level translation task in OpenAI playground6 . This consistency enables us to maximize the benefits of the
instruction finetuning protocol. We use n random shots from the training set.
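For completeness, the following is a minimal sketch of a single zero-shot request, assuming access through the pre-1.0 OpenAI Python client; the model name and decoding defaults shown here are illustrative assumptions rather than the exact configuration used for the submission:

import openai  # pre-1.0 OpenAI Python client

def translate_zero_shot(source_sentence, formality, target_language,
                        temperature=0.7, model="gpt-3.5-turbo"):
    # Single zero-shot request using the instruction format shown in Figure 4.
    # The model name is an assumption; the paper only states that ChatGPT was queried.
    prompt = f"Translate this into only {formality} {target_language}: {source_sentence}"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # swept over 0.2, 0.5, 0.7, 0.9 in Section 4.3
    )
    return response["choices"][0]["message"]["content"]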
430
B Experimental Setup
B.1 EN-KO
In the experimental setup for the EN-KO language pair, we employed a Transformer architecture with
shared decoder input-output embeddings. The model’s parameters included 1024-dimensional embeddings
for both encoder and decoder, 16 attention heads for each, and 12 layers for both encoder and decoder.
We used the Adam optimizer with beta values (0.9, 0.98) and a learning rate of 5e-4 scheduled by an
inverse square root scheduler with a 4000-step warm-up. To prevent overfitting, we applied a dropout rate
of 0.3 and weight decay of 0.0001. Our translation task utilized a label-smoothed cross-entropy criterion
with a label smoothing factor of 0.1. The training process was performed with a maximum token limit
of 4096 per batch and an update frequency of 4. Model performance was evaluated using BLEU scores
with a beam size of 1 and detokenization using the Moses tokenizer. The training process was executed
for a maximum of 20 epochs with a log interval of 200 and without epoch checkpoints, while sharing all
embeddings.
Parameters for pre-training:
fairseq-train \
--fp16 \
--fp16-init-scale 4096 \
--arch transformer --share-decoder-input-output-embed \
--encoder-embed-dim 1024 --decoder-embed-dim 1024 \
--encoder-attention-heads 16 --decoder-attention-heads 16 \
--encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
--encoder-normalize-before --decoder-normalize-before \
--encoder-layers 12 --decoder-layers 12 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--task translation \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--update-freq 4 \
--eval-bleu \
--eval-bleu-args '{"beam": 1, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--log-interval 200 \
--max-epoch 20 \
--skip-invalid-size-inputs-valid-test \
--no-epoch-checkpoints \
--share-all-embeddings
B.2 EN-VI
We fine-tuned our model using the Hugging Face library and the code available at their repository7 . The
fine-tuning was performed with a learning rate of 4e-5, Adam optimizer with beta1 and beta2 values set to
0.9 and 0.98, respectively, and a weight decay of 0.0001. We also used mixed precision training (fp16) to
accelerate the process. The learning rate scheduler was set to inverse square root with a warm-up of 200
steps. The training was conducted for 200 epochs with a maximum gradient norm of 0.0, label smoothing
factor of 0.1, and a batch size of 64 for both training and evaluation. The model was saved and evaluated
at the end of each epoch, and the logging was performed after each training step.
7
https://github.com/huggingface/transformers/tree/main/examples/pytorch/
translation
431
Parameters for fine-tuning:
python train_mt_trainer.py \
--fp16 \
--model_name_or_path VietAI/envit5-translation \
--do_train \
--do_eval \
--do_predict \
--source_lang en \
--target_lang vi \
--source_prefix "translate English to Vietnamese: " \
--learning_rate 4e-5 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--max_grad_norm 0.0 \
--num_train_epochs 200 \
--lr_scheduler_type inverse_sqrt \
--warmup_steps 200 \
--weight_decay 0.0001 \
--label_smoothing_factor 0.1 \
--save_strategy epoch \
--logging_steps 1 \
--evaluation_strategy epoch \
--per_device_train_batch_size=64 \
--per_device_eval_batch_size=64
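After fine-tuning, the resulting EN-VI checkpoint can be used for inference in a few lines. The sketch below is illustrative only (the generation parameters are assumptions), and in practice the checkpoint directory written by train_mt_trainer.py would be loaded in place of the base VietAI/envit5-translation model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def translate_en_vi(sentences, model_name_or_path="VietAI/envit5-translation"):
    # Load the (fine-tuned) seq2seq checkpoint and translate a batch of English sentences.
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
    # Use the same source prefix that was passed to train_mt_trainer.py during fine-tuning.
    inputs = tokenizer(
        ["translate English to Vietnamese: " + s for s in sentences],
        return_tensors="pt", padding=True, truncation=True,
    )
    outputs = model.generate(**inputs, num_beams=5, max_length=256)  # beam size is an assumption
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)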
python train_mt_trainer.py \
--fp16 \
--model_name_or_path facebook/mbart-large-50 \
--do_train \
--do_eval \
--do_predict \
--source_lang en_XX \
--target_lang pt_XX \
--learning_rate 3e-5 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--max_grad_norm 0.0 \
--num_train_epochs 200 \
--lr_scheduler_type inverse_sqrt \
--warmup_steps 100 \
--weight_decay 0.0001 \
--label_smoothing_factor 0.1 \
--save_strategy epoch \
--logging_steps 1 \
--evaluation_strategy epoch \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16
432
UM-DFKI Maltese Speech Translation
Aiden Williams* Kurt Abela* Rishu Kumar⋄ Martin Bär*
434
Table 1: XLSR Wav2Vec 2.0 performance on low-resource settings when evaluated using WER. Assamese (AS), Tagalog (TL), Swahili (SW), and Georgian (KA) are the languages presented.

Language             AS     TL     SW     KA
Annotated Data (h)   55     76     30     46
XLSR-10              44.9   37.3   35.5   -
XLSR-53              44.1   33.2   36.5   31.1
XLS-R (0.3B)         42.9   33.2   24.3   28.0
XLS-R (1B)           40.4   30.6   21.2   25.1
XLS-R (2B)           39.0   29.3   21.0   24.3

2.3 mBART For Maltese to English Translation

According to (Liu et al., 2020), using mBART-25 as the pre-trained model has been shown to improve translations over a randomly initialized baseline in low/medium-resource languages. mBART-25 is a transformer model trained on the BART (Lewis et al., 2019) objective. It is trained on 25 different languages. mBART-25 was later extended to include 25 more languages and was called mBART-50 (Tang et al., 2020). However, neither model included Maltese; in fact, translation experiments on Maltese are very limited. In our experiments, in Section 3.2, we checked whether these performance gains extend to the Maltese language, and this claim appears to hold.

3 Methodology

For this task, we decided to use a cascade system where the ASR and MT components were trained separately but evaluated jointly. In this section, a detailed description of both components is given. First, the training data is described, followed by the pre-processing steps applied to said data. Next, the models are introduced, and lastly, the training procedure is outlined.

3.1 Automatic Speech Recognition

The ASR component in this submission continues the previous work done in (Williams, 2022), and so the same annotated dataset consisting of 50 hours of Maltese speech is used for this task. We opted not to use the data released for this task for two reasons. The first was the additional annotation work required, mainly segmentation, which we had issues completing in a timely manner. Secondly, this submission includes models fine-tuned with non-Maltese data. Making use of the dataset in (Williams, 2022) as a base has made comparisons with previous experiments possible.

As described in Table 2, the Maltese speech corpus is made up of several segments from two main Maltese speech corpora, MASRI (Hernandez Mena et al., 2020) and CommonVoice (CV) (Ardila et al., 2020), and an annotated set from publicly available parliamentary sittings. Previous research in ASR for Maltese has used English speech data with varying degrees of success (Mena et al., 2021). However, when applied in fine-tuning an XLS-R model, the effect was detrimental. To further observe the effect non-Maltese data would have on the translation task, we used three other subsets from the CommonVoice speech corpus, selecting 50 hours of validated speech each from the Italian, French and Arabic sets.

Individually, these speech corpora each amount to 50 hours, from which four models are trained: one with just the Maltese data and the other three trained on an extra language combined with the Maltese set. A fifth model is also trained with all the data included. Further combinations were not tried due to time concerns.

Table 2: Each corpus is listed along with its total length, sample count and average sample length.

Dataset        Length (h, m)   Samples   Average Length (s)
HEADSET        6, 40           4979      4.81
MEP            1, 20           656       7.11
Tube           13, 20          8954      5.34
MERLIN         19, 4           9720      6.14
Parlament      2, 30           1672      5.35
CV Validated   4, 57           3790      12.68
CV Other       5, 4            3833      4.71
CV French      50              -         -
CV Italian     50              -         -
CV Arabic      50              -         -
Validation     2, 32           1912      4.89
Test MASRI     1               668       5.39
Test CV        0, 54           670       4.74

The XLS-R model comes in three pre-trained variants: the small model with 300 million parameters, the medium model with a billion parameters and the large model with two billion parameters. Size on disk scales accordingly, with the small model being roughly 1GB and the large model roughly 8GB.
435
Table 3: ASR models and the data used for fine-tuning.

Model     Corpora used
MT Only   All Maltese corpora
MT+All    All corpora presented
MT+AR     All Maltese corpora + Arabic subset
MT+FR     All Maltese corpora + French subset
MT+IT     All Maltese corpora + Italian subset

All three of them have been pre-trained on roughly 500 thousand hours of unlabelled, multilingual speech. Previous research (Williams, 2022) has shown that both the small and large models fare well when fine-tuned for the downstream Maltese ASR task. With this in mind, the small 300M XLS-R variant was chosen for this task. The main reason was its smaller size: a larger batch size could be used, which expedited the fine-tuning process, while the performance loss was expected to be minimal.

This submission follows the same training procedure as outlined in (Williams, 2022), conducted using the Huggingface Trainer object with the following hyper-parameters. Each model is trained for 30 epochs, using the AdamW optimizer with a starting learning rate of 3e-4. To stabilise the training process, the first 500 training steps were used as warm-up steps. Gradient accumulation was also used to effectively quadruple the batch size. The batch size was dependent on the training set used, where, due to some differences in sample lengths, different batch sizes had to be used. We fine-tune 5 XLS-R 300m models as presented in Table 3.

3.2 Machine Translation

The dataset used to train the machine translation systems comes from publicly available sources. The original data sources include datasets from Arab-Acquis (Habash et al., 2017), the European Vaccination Portal1, the Publications Office of the EU on the medical domain2, the European Medicines Agency3, the COVID-19 ANTIBIOTIC dataset4, the COVID-19 EC-EUROPA dataset5, the COVID-19 EU press corner V2 dataset6, the COVID-19 EUROPARL v2 dataset7, the Digital Corpus of the European Parliament (Hajlaoui et al., 2014), the DGT-Acquis (Steinberger et al., 2014), ELRC8, the Tatoeba corpus9, OPUS (Tiedemann, 2012), EUIPO - Trade mark Guidelines10, the Malta Government Gazette11, MaCoCu (Bañón et al., 2022), as well as data extracted from the Laws of Malta12.

The different datasets were compiled into a single one. The total number of parallel sentences amounts to 3,671,287. The development and test sets were kept exactly the same as in the OPUS dataset (Tiedemann, 2012), which amount to 2,000 sentences each, and the rest of the data was placed in the training set, which amounts to 3,667,287 parallel sentences.

Before training the system, the data has to be further pre-processed. Firstly, a BPE tokenizer is trained on the training set only. The MosesDecoder13 package is used to pre-process the dataset, by normalising punctuation and by training a truecase model on the training set and applying it to the whole dataset. In the case of Maltese data, a tokenizer specifically designed for Maltese was used because the regular English tokenizer does not tokenize everything correctly. For this, the tokenizer from MLRS14 was used, which utilises regular expressions to tokenize linguistic expressions that are specific to Maltese, such as certain prefixes and articles. The dataset is then encoded using the previously trained BPE encoder.

The machine translation model is built and trained using Fairseq (Ott et al., 2019). Fairseq is a library that allows for easy implementation of a machine translation system through CLI commands, meaning minimal code is needed to create a fully working machine translation system.

1 https://bit.ly/3dLbGX9
2 https://bit.ly/3R2G5OH
3 https://bit.ly/3QWIjPM
4 https://bit.ly/3pBCg7u
5 https://bit.ly/3AcjIzR
6 https://bit.ly/3wmCyTD
7 https://bit.ly/3wl3brZ
8 https://www.lr-coordination.eu/node/2
9 https://bit.ly/3cejoIU
10 https://bit.ly/3AB01Tr
11 https://bit.ly/3QDXm1a
12 https://legislation.mt/
13 https://www.statmt.org/moses/
14 https://mlrs.research.um.edu.mt/
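As an illustration of the fine-tuning setup described in Section 3.1, the sketch below builds a Huggingface Trainer for the 300M XLS-R checkpoint with the stated hyper-parameters; the checkpoint name, the per-device batch size, and the dataset/collator arguments are assumptions, since the exact training script is not reproduced in this paper:

from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

def build_asr_trainer(train_dataset, eval_dataset, data_collator, vocab_size):
    # 300M XLS-R variant; the checkpoint name is an assumption.
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-xls-r-300m", vocab_size=vocab_size
    )
    args = TrainingArguments(
        output_dir="xlsr-300m-maltese-asr",
        num_train_epochs=30,              # 30 epochs, as in Section 3.1
        learning_rate=3e-4,               # starting learning rate
        warmup_steps=500,                 # first 500 steps used as warm-up
        per_device_train_batch_size=8,    # batch size depended on the training set used
        gradient_accumulation_steps=4,    # effectively quadruples the batch size
        fp16=True,
    )
    # Trainer uses AdamW by default, matching the optimizer named in the paper.
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )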
436
For this system, a pre-trained mBART-50 model (Tang et al., 2020) was used and fine-tuned on our data. An mBART-25 (Liu et al., 2020) model, as well as a randomly initialised baseline Transformer model, were also experimented with; however, after training a system using a subset of the dataset, it was apparent that the mBART-50 model outperforms them both. Due to limited resource constraints, only one MT model was trained on the full dataset.

The maximum number of steps was set to 1,000,000, and validation was performed every 10,000 steps with a patience value of 10. This means that if the BLEU score on the validation set does not improve after ten validation steps, the model stops training. After multiple experiments using a smaller subset of the dataset, it was seen that increasing max-tokens tended to result in higher overall performance. However, due to resource constraints, the maximum number of tokens per batch was set to 1024. The learning rate is set to 1e-3, but the initial learning rate is smaller, at 1e-7, and an inverse square root learning rate scheduler linearly increases the rate over the first 10,000 steps. For inference, a beam size of five is used to generate predictions.

The total number of updates using mBART-50 was 990,000, with an early stop since the validation score did not improve in the last 10 validation epochs. This amounts to exactly three full epochs on the whole training set.

3.3 Completed Pipeline

To create a speech-to-text translation system, a Huggingface pipeline is set up to accept an audio file that is passed to the ASR system. The test set provided for this task is a single file of over one hour. Due to its size, the file needs to be segmented for inference and evaluation. The XLS-R model automatically returns a timestamp for each output word. These timestamps are used to create segments that align with the segments file provided with the test set.

This means that the ASR component returns a list of text strings, where each segment is an item in the list. Each string is passed to the MT system. Before passing through the MT component, the resultant strings are pre-processed. The aforementioned MosesDecoder package is used to transform the strings using the same rules that have been applied to the MT training data. This means that the strings have their punctuation normalised, then are truecased and finally tokenized. The processed strings are then passed to the mBART model to be inferred and to the BPE model to encode the inputs. The beam size is set to five. The resulting tokens are then detokenized and saved.

4 Evaluation and Results

Table 4 contains the official results for our submission for the Maltese → English spoken language translation track. While we observed better scores during training and validation, our models struggled with the official test set. In this section, we note a few observations and a qualitative analysis of the results to highlight the errors.

The test set proved to be difficult for both the ASR and MT systems to get right, due to the type of language used as well as the speed of the speech in general. Table 5 shows the reference transcription of the beginning of the file, accompanied by the MT Only and MT+All ASR transcriptions and, lastly, the machine translation of the mt-50 model. The monolingually fine-tuned MT Only model was our primary submission from the five submitted ASR models, with a BLEU score of 0.6.

The mt-50 output is relatively similar to the reference sentence, except for a few minor errors, including the misspelling of the name "Mark". However, this should still be a good sentence to input into the machine translation system, in stark contrast to the MT+All system outputs.

The main issue here is that this system does not output Maltese characters and completely omits them, which presents an issue for the downstream translation task since the meaning of the word is lost in these cases.

Machine translation also had similar issues. The training set contained data coming from legal texts, so the data is very formal, making it very difficult to evaluate since the input text is very informal and unlike the legal text data seen.

Table 4: Official results for our models for the Maltese → English SLT task.

Submission Name   BLEU Score
MT Only           0.6
MT+All            0.7
MT+AR             0.4
MT+FR             0.3
MT+IT             0.4
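A compact sketch of the cascade in Section 3.3 is given below. It uses sacremoses (a Python port of the MosesDecoder scripts) purely for illustration, and the preprocess and translate arguments stand in for the truecasing/BPE step and the fine-tuned mBART-50 model, which are not reproduced here:

from sacremoses import MosesDetokenizer, MosesPunctNormalizer

def cascade_translate(asr_segments, preprocess, translate):
    # asr_segments: list of transcribed segment strings returned by the ASR component.
    # preprocess:   hypothetical callable for truecasing, tokenization and BPE encoding.
    # translate:    hypothetical callable wrapping the fine-tuned mBART-50 model (beam size 5),
    #               returning translated tokens.
    normalizer = MosesPunctNormalizer()
    detok = MosesDetokenizer(lang="en")
    translations = []
    for segment in asr_segments:
        text = normalizer.normalize(segment)  # normalise punctuation, as for the MT training data
        tokens = translate(preprocess(text))
        translations.append(detok.detokenize(tokens))
    return translations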
437
Table 5: Reference transcription sample from the IWSLT 2023 test set along with the MT Only and MT+All automatic transcriptions and the machine translation of the MT Only output.

Reference: merħba' għal- podcast ieħor din id- darba ma bniedem kemxejn polemikuż mhux għax jien għandi wisq xi ngħid però Mark Camilleri huwa il- mexxejj kemxejn kontroversjali tal- kunsill nazzjonali tal- ktieb

MT Only: merba' l- pot kast ieħor din id- darba ma bniedem kemxejn polemikuż mhux għax jien għandi wisq xi ngħid però mar Camilleri huwa il- mexxejj kemxejn kontroversjali tal- kunsill nazzjonali tal- ktieb

MT+All: meba l Pold cast ieor din id- darba ma bniedem kemmxejn polemiku mhux gax jien Gandi wisq xi ngid per mar kamileri huwai - mexxejk emxejh kontroversjali tal- kunsill nazzjonali tal- ktieb

Translation (MT Only): four of the other potential this time does not work very slightly at all , but not at all , the same time , it is the slightly cross- sectoral leader of the national when the book is also of humane

Unfortunately, most of this is unrelated to what was actually said. Looking into the translations deeper, one can see the reasoning behind certain translations. For example, the dataset does not contain a lot of conversational data, so general greetings like "merħba" may not be present. This case is represented by the translation of the token "merba", which was translated to "four". Here the token "merba" (welcome) was mistaken for "erba" (four). Other mistakes include phonetically plausible but grammatically incorrect output, such as the transcription of "podcast" as "pot kast". Certain expressions like "din id-darba" were correctly translated to "this time"; however, rarer words such as "polemikuż" and "kontroversjali", both of which have the same meaning as "controversial", seemed to not appear in the translation.

Continuing the trend observed in (Williams, 2022), the use of additional languages when fine-tuning an XLS-R model proved to be detrimental towards the final output. As observed in Section 4, some models trained with additional data lost the ability to transcribe Maltese-specific alphabetic characters. So far, the character-to-sound pairing was always made with the source language in mind. For example, the French 'Ç' is transformed into the 'C' character, which itself is only present in the Maltese alphabet when English words are loaned and used directly. It is important to note that code-switching to English is very common in Maltese speech. Future work should explore these character-to-sound pairs.

5 Conclusion and Future Work

This paper showcased the results of a speech-to-text translation system in the direction of Maltese to English. A cascade system is chosen, where ASR and MT models are pipelined together.

The automatic speech recognition system chosen is based on XLS-R and is fine-tuned on data from different languages. The best-performing model was the XLS-R 300M model fine-tuned on 50 hours of Maltese speech. The machine translation system chosen is based on mBART-50, and it was fine-tuned on parallel Maltese - English data. Aside from fine-tuning, no modifications were made to the pre-trained models.

For future work, we have various potential avenues for improvement. For machine translation, since mBART-50 was not pre-trained on Maltese data, extending the vocabulary to include Maltese-specific tokens would improve the representation and potentially the downstream performance as well. Moreover, our approach solely relied on parallel data and did not investigate techniques which leverage monolingual data, such as back-translation. Monolingual corpora, such as Korpus Malti v4 (Micallef et al., 2022), not only provide significantly more data but also have more diversity in terms of domains. Apart from this, it might be beneficial to perform more quality checks on the parallel dataset since some portions of the publicly available datasets are automatically crawled and, in some cases, contain noise.

Regarding ASR improvement, other systems, such as Whisper and, most recently, Meta's Massively Multilingual Speech (MMS) project, should be tried and evaluated.
438
The research made in multilingual fine-tuning needs to be more focused. One idea we can explore is the transliteration of foreign alphabetic characters into Maltese characters, e.g. 'h' in English would be transliterated as 'ħ'. It is also the case that no language model is used to correct the ASR output mistakes; this is currently our next milestone.
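As a purely illustrative sketch of this transliteration idea (the actual character mapping would need to be designed carefully, and the entries below are only examples), foreign characters could be mapped onto their closest Maltese counterparts as a post-processing step:

# Hypothetical transliteration table; the entries shown are examples only.
FOREIGN_TO_MALTESE = {
    "h": "ħ",   # e.g. English 'h' -> Maltese 'ħ', as suggested above
    "ç": "ċ",   # assumed mapping for illustration
    "Ç": "Ċ",
}

def transliterate(text: str) -> str:
    # Replace foreign alphabetic characters with their Maltese counterparts.
    return "".join(FOREIGN_TO_MALTESE.get(ch, ch) for ch in text)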
Acknowledgements

We acknowledge the LT-Bridge Project (GA 952194). Rishu Kumar was supported financially by the EMLCT15 programme during this entire work.

15 https://mundus-web.coli.uni-saarland.de/

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 Evaluation Campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. CoRR, abs/2111.09296.

Parnia Bahar, Patrick Wilken, Mattia A. Di Gangi, and Evgeny Matusov. 2021. Without Further Ado: Direct and Simultaneous Speech Translation by AppTek in 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 52–63, Bangkok, Thailand (online). Association for Computational Linguistics.

Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.

Pavel Denisov, Manuel Mager, and Ngoc Thang Vu. 2021. IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 175–181, Bangkok, Thailand (online). Association for Computational Linguistics.

Liang Ding and Dacheng Tao. 2021. The USYD-JD Speech Translation System for IWSLT2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 182–191, Bangkok, Thailand (online). Association for Computational Linguistics.

Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath. 2014. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED. In Fourth International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU-2014), pages 16–23. International Speech Communication Association (ISCA).
439
Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A parallel corpus for evaluating machine translation between Arabic and European languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235–241.

Najeh Hajlaoui, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Daniel Varga. 2014. DCEP - Digital Corpus of the European Parliament. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online. Association for Computational Linguistics.

Carlos Daniel Hernandez Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, and Ian Padovani. 2020. MASRI-HEADSET: A Maltese corpus for speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6381–6388, Marseille, France. European Language Resources Association.

Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. 2023. Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications, 82(3):3713–3744.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinglu Li, Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC's Offline Speech Translation System for IWSLT 2022 Evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 239–246, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.

Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. arXiv preprint arXiv:2006.07264.

Carlos Daniel Hernandez Mena, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, and Albert Gatt. 2021. Data augmentation for speech recognition in Maltese: A low-resource perspective. CoRR, abs/2111.07793.

Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, and Claudia Borg. 2022. Pre-training data quality and quantity for a low-resource language: New corpus and BERT models for Maltese. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 90–101, Hybrid. Association for Computational Linguistics.

Tuan Nam Nguyen, Thai Son Nguyen, Christian Huber, Ngoc-Quan Pham, Thanh-Le Ha, Felix Schneider, and Sebastian Stüker. 2021. KIT's IWSLT 2021 Offline Speech Translation System. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 125–130, Bangkok, Thailand (online). Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech 2020, pages 2757–2761.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.

Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation, 48(4):679–707.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Jörgen Valk and Tanel Alumäe. 2021. Voxlingua107: A dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 652–658.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
440
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Aiden Williams. 2022. The applicability of Wav2Vec
2.0 for low-resource Maltese ASR. B.S. thesis, Uni-
versity of Malta.
Marcely Zanon Boito, John Ortega, Hugo Riguidel, An-
toine Laurent, Loïc Barrault, Fethi Bougares, Firas
Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gah-
biche, and Yannick Estève. 2022. ON-TRAC Con-
sortium Systems for the IWSLT 2022 Dialect and
Low-resource Speech Translation Tasks. In Proceed-
ings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 308–318,
Dublin, Ireland (in-person and online). Association
for Computational Linguistics.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and
Lirong Dai. 2022. The USTC-NELSLIP Offline
Speech Translation Systems for IWSLT 2022. In
Proceedings of the 19th International Conference on
Spoken Language Translation (IWSLT 2022), pages
198–207, Dublin, Ireland (in-person and online). As-
sociation for Computational Linguistics.
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans Speech
Translation System for IWSLT 2022 Offline Shared
Task. In Proceedings of the 19th International Con-
ference on Spoken Language Translation (IWSLT
2022), pages 158–168, Dublin, Ireland (in-person
and online). Association for Computational Linguis-
tics.
441
NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023
442
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 442–448
July 13-14, 2023 c 2023 Association for Computational Linguistics
Table 1: Statistics of different datasets used for training our models in a constrained regime.

Model        Segments (millions)   Time (hours)
ASR          2.7                   4800
NMT En→De    11                    −
NMT En→Zh    7.5                   −
NMT En→Ja    21                    −
TTS          0.37                  611

Table 2: Statistics of the TED talks dataset.

Model                Segments (thousands)   Time (hours)
En audio → En text   370                    611
En audio → De text   280                    459
En audio → Zh text   350                    580
En audio → Ja text   321                    528

this dataset and its subsets with available translations to De/Zh/Ja as TED talks. See Table 2 for the detailed statistics of this dataset.

ASR For training our ASR model, we used LibriSpeech (Panayotov et al., 2015), Mozilla Common Voice v11.0 (Ardila et al., 2019), TED-LIUM v3 (Hernandez et al., 2018), VoxPopuli v2 (Wang et al., 2021), all available speech-to-English data from the Must-C v1-v3 (Cattoni et al., 2021) En-De/Zh/Ja datasets, ST-TED (Jan et al., 2018), and Europarl-ST (Iranzo-Sánchez et al., 2020).

We converted all audio data to mono-channel 16kHz wav format. Of all the datasets allowed under the constrained submission, LibriSpeech and TED-LIUM v3 were the only datasets that provided transcripts with neither punctuation nor capitalization (P&C). For LibriSpeech, we managed to restore P&C from the dataset metadata available at their website2. For TED-LIUM v3, we applied a P&C restoration model trained on the English portion of the allowed bitext. Finally, we discarded all samples shorter than 0.2s or longer than 22s and all samples with transcripts present in the evaluation dataset. As a result, our training dataset contained 2.7M audio segments with a total duration of 4.8k hours.

MT For training our NMT models, we used all available bitext allowed for the IWSLT 2023 constrained submission. After training, we additionally fine-tuned our models on bitexts from TED talks for each language.

We applied langid and bicleaner filtering following Subramanian et al. (2021) and discarded all sentences longer than 128 tokens and sentences with a length ratio between source and target exceeding 3. We also applied Moses tokenization for En/De, jieba tokenization for Zh, and ja-mecab tokenization for Ja.

TTS For training our TTS model, we used TED talks with English transcripts. The combination of Must-C v1-v3 and ST-TED contained 3696 speakers; however, some of them were not unique. Capitalizing on the huge overlap with TED-LIUM v3 and the speaker names from there, we managed to attribute several talks to a single speaker, reducing the number of unique speakers to 3361. We also removed capitalization from English transcripts in TED talks.

ST For training our end-to-end ST models, we used the combination of 1) ASR data with the ground truth transcripts replaced by synthetic translations; 2) NMT data with TTS-generated English audios on the source side (Table 1).

3 System

In this section, we describe the essential components of our end-to-end submission.

ASR We trained a 17-layer large Conformer-transducer (Gulati et al., 2020) with a FastConformer (Rekesh et al., 2023) encoder and RNN-T loss and decoder (Graves, 2012). The prediction network consisted of a single layer of LSTM (Hochreiter and Schmidhuber, 1997), and the joint network is an MLP. All the hidden sizes in the decoder were set to 640. Unigram SentencePiece (Kudo and Richardson, 2018) with 1024 tokens was used for tokenization.

The ASR models were trained for 45 epochs, starting with a checkpoint pre-trained on LibriSpeech. We used the AdamW (Loshchilov and Hutter, 2017) optimizer and Noam Annealing (Vaswani et al., 2017) with 10K warmup steps and a maximum learning rate of 1.15. Weight decay of 0.001 on all parameters was used for regularization.

2 https://www.openslr.org/12
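The data filters described above can be summarized in two small predicates; this is an illustrative sketch of the stated rules, not the actual preprocessing code used for the submission:

def keep_asr_sample(duration_sec, transcript, eval_transcripts):
    # ASR filter: drop clips shorter than 0.2s or longer than 22s, and any clip
    # whose transcript also appears in the evaluation data.
    return 0.2 <= duration_sec <= 22.0 and transcript not in eval_transcripts

def keep_bitext_pair(src_tokens, tgt_tokens, max_len=128, max_ratio=3.0):
    # Bitext filter: drop sentences longer than 128 tokens or with a source/target
    # length ratio above 3 (langid and bicleaner filtering are not shown here).
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)
    if n_src == 0 or n_tgt == 0 or max(n_src, n_tgt) > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio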
443
The effective batch size was set to 1200, and we could fit larger batch sizes via batch splitting for the RNN-T loss. Time-Adaptive SpecAugment (Park et al., 2020) with 2 frequency masks (F = 27) and 10 time masks (T = 5%) was used as the augmentation scheme. We also used dropout of 0.1 for both the attention scores and intermediate activations.

NMT We trained our NMT models (Transformer, 12 × 6 layers, d_model = 1024, d_inner = 4096, n_heads = 16) with the Adam optimizer (Kingma and Ba, 2014) and inverse square root annealing (Vaswani et al., 2017) with 7.5K warmup steps and a maximum learning rate of 10^-3. The models were trained for a maximum of 75K steps with a dropout of 0.1 on intermediate activations and label smoothing with α = 0.1. Our En→De models used a joint BPE vocabulary of 16384 tokens, and En→Zh/Ja used separate vocabularies with the same number of tokens per language.

After training, we did checkpoint averaging and fine-tuned all our base NMT models on TED talks for 3 epochs with an initial learning rate of 2×10^-5, inverse square root annealing, and a warmup of 10% of the steps. Finally, we ensembled 2 models trained with different initializations for each language direction.

TTS Our TTS model was a multi-speaker FastPitch (Łańcucki, 2021) text-to-mel-spectrogram generator. Training a vocoder was not necessary for our setup, as the parameters of the spectrograms matched those of the ST models, following the approach described in (Bataev et al., 2023). TTS-generated spectrograms were fed directly into the FastConformer encoder when training the ST model. Our TTS model was trained for 200 epochs on TED talks with restored speakers from TED-LIUM v3 (Hernandez et al., 2018).

Table 3: Word error rate (WER) of the English ASR model evaluated on TED talks from Must-C v2 and past test sets from IWSLT. All predictions and ground-truth transcripts were normalized for WER computation.

Model         tst-COM De   tst-COM Zh/Ja   IWSLT.tst 2018   IWSLT.tst 2019   IWSLT.tst 2020
norm          5.9          5.8             9.8              5.6              8.0
punct         5.7          5.4             9.4              4.9              7.0
punct+capit   5.7          5.5             9.5              5.7              8.5

2048, n_heads = 8). We used the vocabulary of 16384 YouTokenToMe3 byte-pair-encodings, trained jointly for En→De and separately for En→Zh/Ja. All models were trained for 30k steps with an ASR-initialized encoder and a randomly initialized decoder.

To speed up training and improve GPU utilization, we bucketed our ASR and NMT datasets by sequence length so that each batch contained a similar number of tokens. On each iteration, we pick one batch from ASR and one batch from NMT, which resulted in an approximately 3:2 ratio between segments from ASR and NMT for En→De. TTS mel spectrograms were generated on-the-fly for a randomly selected speaker for each sample.

After pretraining on the ASR task, we fused BatchNorm in FastConformer layers as proposed in (Bataev et al., 2023) to avoid a mismatch between statistics for natural and generated mel spectrograms. The batch normalization layer was replaced with a trainable projection initialized from the original parameters. We observed meaningful improvements when using such an approach compared to retaining the original batch normalization.
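The BatchNorm fusion mentioned above can be sketched as folding the frozen statistics of a trained BatchNorm layer into an equivalent, trainable per-channel affine projection. This is only an illustration of the idea from Bataev et al. (2023), assuming the projection is applied over the channel (last) dimension; the actual NeMo implementation may differ:

import torch
from torch import nn

def batchnorm_to_projection(bn: nn.BatchNorm1d) -> nn.Linear:
    # Fold a trained BatchNorm1d into a trainable linear projection that, at
    # initialization, reproduces the BatchNorm output exactly (inference mode):
    #   y = gamma * (x - mean) / sqrt(var + eps) + beta
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    bias = bn.bias - bn.running_mean * scale
    proj = nn.Linear(bn.num_features, bn.num_features, bias=True)
    with torch.no_grad():
        proj.weight.copy_(torch.diag(scale))  # per-channel scaling as a diagonal weight matrix
        proj.bias.copy_(bias)
    return proj  # expects inputs with channels in the last dimension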
4 Experiments
444
Table 4: En→De BLEU scores calculated on IWSLT test sets from different years by using automatic re-
segmentation of the hypothesis based on the reference translation by mwerSegmenter implemented in
SLTev (Ansari et al., 2021). Avg ∆ computes the improvement over the cascade baseline averaged over 7 test sets.
Model description 2010 2013 2014 2015 2018 2019 2020 Avg
Text-to-text NMT models
Transformer 12 × 6 constrained 32.9 36.7 32.7 34.2 30.5 29.4 33.0 32.8
+ checkpoint averaging 33.1 37.4 32.8 35.1 30.3 29.8 33.5 33.1
+ TED talks fine-tuning 34.5 39.1 34.1 35.3 30.8 30.3 33.8 34.0
+ x2 ensembling 35.2 40.2 34.9 36.0 32.5 31.6 35.4 35.1
NeMo IWSLT’22 NMT model 35.7 41.2 36.2 38.1 34.7 31.7 35.0 36.1
End-to-end ST models
Conformer (17) + Transformer (6 × 6) 29.8 33.8 30.2 27.1 26.2 26.8 29.1 29.0
+ better WebRTC VAD parameters 31.2 35.4 31.8 28.6 27.3 27.6 29.7 30.2
+ SHAS segmentation 32.1 36.1 32.6 29.0 28.4 27.9 30.9 31.0
NeMo IWSLT 2023 constrained 31.0 34.9 30.7 28.6 27.4 27.7 30.3 29.5
NeMo IWSLT 2022 (end-to-end) 24.5 30.0 25.2 25.3 24.9 24.1 26.2 25.7
NeMo IWSLT 2022 (cascade) 26.6 32.2 26.8 28.3 28.1 27.3 29.7 28.4
KIT IWSLT 2022 − − − 27.9 − 27.6 30.0 −
USTC-NELSLIP IWSLT 2022 − − − − 29.9 28.2 30.6 −
YiTrans IWSLT 2022 − − − − − 31.6 34.1 −
coder, we did not notice a significant difference in the corresponding BLEU scores.

ST En→De Table 4 shows the performance of our baseline En→De system and its ablations on 7 different IWSLT test sets over the years. All ablation experiments used last year's constrained setup, which included more NMT data from WMT, to be comparable with last year's submissions. The systems we submit were retrained on the allowed data to comply with the constrained restrictions.

We improve the average BLEU score by 5.3 over our last year's end-to-end submission. We believe that such a gain is attributed to several factors, most importantly switching to synthetic transcripts, including TTS-generated data, and a better segmentation model. On some of the evaluation datasets, we approached the BLEU scores of the top contestants from last year.

Retraining our model in accordance with this year's constrained setup resulted in an average degradation of 1.5 BLEU. Most of this performance drop was attributed to worse NMT models trained on a limited amount of data which did not include the large bitexts from WMT.

ST En→Zh/Ja To train English-Chinese and English-Japanese ST systems, we followed a similar recipe to the English-German system. Specifically, we re-trained the NMT components and used them to generate synthetic translations of audio segments. With other auxiliary models intact, we replaced the bitexts used for TTS augmentations and trained En→Zh (Table 5) and En→Ja (Table 6) ST end-to-end models in a constrained setup.

The only difference in our submission was that the English-Chinese model used punct+capit ASR, while the English-Japanese model used norm ASR. This choice was based on a slightly higher (less than 0.5) BLEU score on the Must-C v2 dev dataset.

4.2 Discarded alternatives

When designing our submission, we explored a number of alternatives that did not lead to a clear improvement in preliminary experiments and, thus, were not included in the final submission.

ASR We tried to replace BatchNorm with LayerNorm in the FastConformer backbone to mitigate the statistics mismatch between natural and TTS-generated mel-spectrograms. The resulting model
445
Table 5: En→Zh BLEU scores calculated on Must-C dev and tst-COMMON with official segmentation.

Table 6: En→Ja BLEU scores calculated on Must-C dev and tst-COMMON with official segmentation.

required more epochs to converge and resulted in slightly higher WER.

NMT We experimented with larger models of up to 12 × 8 layers, larger vocabularies of up to 32k tokens, and label smoothing of up to 0.2, but did not notice any improvements to BLEU scores. We also saw diminishing returns when using more than 2 models in the ensemble. Thus, we decided to stick to the ensemble of two 12 × 6 models with a 16k vocab to speed up synthetic data generation.

TTS While debugging the code, we noticed that the TTS model generating mel-spectrograms used the same single speaker and had dropout enabled. Surprisingly, this did not lead to performance degradation. We hypothesize that this was caused by using a well-converged pre-trained ASR encoder, which was not altered significantly by the low-quality signal. We also experimented with improving the generated spectrograms with a GAN enhancer following Bataev et al. (2023), which led to similar results at the cost of significant computation overhead.

Segmentation We experimented with the voice activity detection implemented in the WebRTC4 toolkit; however, the BLEU scores on the IWSLT test sets were lower even after extensive hyperparameter search.

ST Given the effectiveness of ensembling in last year's competition, we evaluated the performance of an ensemble of up to 3 models with different ASR encoder initializations. Unlike NMT, we did not observe any improvement over using the best model from the ensemble.

4 https://github.com/wiseman/py-webrtcvad

We experimented with using RNN-T instead of the Transformer decoder. Despite its remarkable performance in ASR, RNN-T converged much slower and underperformed our Transformer decoder by more than 2 BLEU in our ST model.

5 Conclusion

We present the NVIDIA NeMo group's offline speech translation systems for the En→De, En→Zh, and En→Ja IWSLT 2023 tasks.

Our primary end-to-end models, which translate English speech directly into German, Chinese, and Japanese texts, consist of a FastConformer encoder and a Transformer decoder. To alleviate the problem of direct ST data scarcity, we capitalized on a number of auxiliary ASR, TTS, and NMT models, and their ability to generate high-quality audio and translations. The resulting models achieve competitive performance without using any amount of direct ST data.

Although we participated in the constrained scenario, our pipeline can be easily scaled to arbitrarily large amounts of ASR and NMT data.

Acknowledgments

The authors would like to thank Somshubra Majumdar for many useful discussions over the course of this project and Nithin Koluguri for help with training ASR models.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda
446
Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTEV: Comprehensive evaluation of spoken language translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79, Online. Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator. ArXiv, abs/2302.14036.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, pages 5036–5040.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In International Conference on Speech and Computer, pages 198–208. Springer.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerda, Javier Jorge, Nahuel Roselló, Adria Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233. IEEE.

Niehues Jan, Roldano Cattoni, Stüker Sebastian, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 evaluation campaign. In Proceedings of IWSLT, pages 2–6.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al. 2019. NeMo: A toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Adrian Łańcucki. 2021. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of ICASSP, pages 5206–5210. IEEE.

Daniel S Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V Le, and Yonghui Wu. 2020. SpecAugment on large scale datasets. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6879–6883. IEEE.

Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Juang, Oleksii Hrinchuk, Ankur Kumar, and Boris Ginsburg. 2023. Fast Conformer with linearly scalable attention for efficient speech recognition. arXiv preprint arXiv:2305.05084.
Sandeep Subramanian, Oleksii Hrinchuk, Virginia
Adams, and Oleksii Kuchaiev. 2021. Nvidia nemo
neural machine translation systems for english-
german and english-russian news and biomedical
tasks at wmt21. arXiv preprint arXiv:2111.08634.
Ioannis Tsiamas, Gerard I Gállego, José AR Fonollosa,
and Marta R Costa-jussà. 2022. Shas: Approaching
optimal segmentation for end-to-end speech transla-
tion. arXiv preprint arXiv:2202.04774.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Proceedings of NeurIPS, pages 5998–
6008.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. Voxpop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. arXiv preprint arXiv:2101.00390.
SRI-B’s systems for IWSLT 2023 Dialectal and Low-resource track:
Marathi-Hindi Speech Translation
0.1 during ST training. We pre-train on the ASR data for 6000 steps and then train on the ST data for 2250 steps. After ST training, we average the last 10 checkpoints to create the final model. We used a beam size of 10 for decoding.
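The following is a minimal sketch of the checkpoint-averaging step mentioned above (averaging the last 10 checkpoints); the file names and state-dict layout are hypothetical, and toolkits such as fairseq ship their own averaging scripts.

# Minimal sketch of checkpoint averaging; paths and the {"model": state_dict}
# layout are illustrative assumptions.
import torch

def average_checkpoints(paths):
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # assumed layout
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Usage (hypothetical file names):
# averaged = average_checkpoints([f"checkpoint_{i}.pt" for i in range(2241, 2251)])
# torch.save({"model": averaged}, "checkpoint_avg_last10.pt")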
4.1 Constrained condition

For the constrained condition, we are only permitted to use the data provided by the organizers. For the constrained models, wherever pre-training is involved, we only utilize the 3 constrained datasets from Table 2. For this condition, we train the following models:

• The Transformer model trained with only the train split from the ST data.

• The Conformer model trained with only the train split from the ST data.

the Transformer ones as can be gleaned from Table 3, we chose to use only Conformer models for the unconstrained condition. We train the following models for the unconstrained condition:

• The Conformer model encoder pre-trained with constrained and unconstrained ASR data mentioned in Table 2 and then trained with only the train split from the ST data. This served as our unconstrained contrastive model for the final submission.

• The Conformer model encoder pre-trained with constrained and unconstrained ASR data mentioned in Table 2 and then trained with both the train and the dev splits from the ST data. This served as our unconstrained primary model for the final submission.
Table 3: Results for all our trained models on dev & test splits. Here all indicates that both constrained and
unconstrained datasets were used for ASR pretraining.
Finally, since the dev and test splits come from a similar distribution, including the dev split in speech translation training boosted our BLEU scores on the test split by 5.5 and 2.6 points in the constrained and unconstrained conditions, respectively. Utilizing the dev split for speech translation training also narrowed the performance gap between the unconstrained and constrained models on the test split.

6 Conclusion

In this paper we present our approaches to the IWSLT 2023 Evaluation Campaign Dialectal and Low-resource track: Marathi-Hindi Speech Translation, which secured the first and second places in the constrained and unconstrained conditions respectively. We start off with a simple end-to-end approach with Transformers and then apply a gamut of ideas, like replacing the encoder blocks with Conformers and encoder pre-training, to drastically improve our dev BLEU score from 1.02 to 20.22. Through our results, we also quantitatively demonstrate how much of an impact each of our ideas brings, and we sincerely hope that some of these ideas might be useful for researchers and practitioners alike working on low-resource speech translation problems.

References

Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyothi, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC), pages 2819–2826.

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.
Arun Baby, Anju Leela Thomas, N. L. Nishanthi, and TTS Consortium. 2016. Resources for Indian languages. In CBBLR – Community-Based Building of Language Resources, pages 37–43, Brno, Czech Republic. Tribun EU.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019. A comparative study on end-to-end speech to text translation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 792–799. IEEE.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? arXiv preprint arXiv:2106.01045.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Francisco Casacuberta, Marcello Federico, Hermann Ney, and Enrique Vidal. 2008. Recent efforts in spoken language translation. IEEE Signal Processing Magazine, 25(3):80–88.

William Chan, Navdeep Jaitley, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, and Knot Pipatsrisawat. 2020. Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6494–6503, Marseille, France. European Language Resources Association.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Gaurav Kumar, Matt Post, Daniel Povey, and Sanjeev Khudanpur. 2014. Some insights from translating conversational telephone speech. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3231–3235.

Hang Le, Florentin Barbier, Ha Nguyen, Natalia Tomashenko, Salima Mdhaffar, Souhir Gahbiche, Bougares Fethi, Benjamin Lecouteux, Didier Schwab, and Yannick Estève. 2021. On-trac' systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks. In International Conference on Spoken Language Translation (IWSLT).

H. Ney. 1999. Speech translation: coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 1, pages 517–520 vol.1.

Jan Niehues, Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 evaluation campaign. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.

Kishore Prahallad, E Naresh Kumar, Venkatesh Keri, S Rajendran, and Alan W Black. 2012. The IIIT-H Indic speech databases. In Thirteenth Annual Conference of the International Speech Communication Association.

Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, Gautam Mantena, Bhargav Pulugundla, Peri Bhaskararao, Hema A Murthy, Simon King, Vasilis Karaiskos, and Alan W Black. 2013. The Blizzard Challenge 2013 – Indian language task. In Blizzard Challenge Workshop, volume 2013.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. arXiv
preprint arXiv:2008.00401.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. Advances in neural information processing
systems, 30.
Ron J Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui
Wu, and Zhifeng Chen. 2017. Sequence-to-sequence
models can directly translate foreign speech. arXiv
preprint arXiv:1703.08581.
BIT’s System for Multilingual Track
Zhipeng Wang, Beijing Institute of Technology, [email protected]
Yuhang Guo∗, Beijing Institute of Technology, [email protected]
Shuoying Chen, Beijing Institute of Technology, [email protected]
data for the relevant languages from MUST-C v1.0, MUST-C v1.2, and MUST-C v2.0, merged them, and preprocessed them to obtain our training dataset. We used the Fairseq (Ott et al., 2019) toolkit to conduct our experiments, and after training was completed, we scored the translation quality using the sacreBLEU metric. Our model achieved our expected results on the 10 target languages.

2 Data Preparation

As shown in Table 1, we collected training data for the relevant languages from the MUST-C corpus and provide their statistics. It can be seen that there are significant differences between languages: the number of source-language words and target-language words differs across language pairs. For example, the number of source-language words in the Arabic corpus is greater than the number of target-language words, while the number of source-language words in the Farsi corpus is less than the number of target-language words. This indicates that the difficulty of the length conversion required of the model varies to some extent across languages.

Since our task is one-to-many multilingual speech translation, the input received by the model is always English speech, which enables us to perform the same preprocessing operation on all data. The original speech is in wav format, and most of it is long audio, so we need to segment it and extract features before inputting it into the model. We segment the speech data based on the start time and duration of each segment given in MUST-C. The preprocessing stage includes extracting MFCC features, training the SentencePiece (Kudo and Richardson, 2018) model, generating a vocabulary, and finally generating a training set. The processed MFCC feature dimension is 80, and SpecAugment is applied for data augmentation. The relevant SpecAugment configurations used in the experiment are shown in Table 2:

Table 2: Parameter settings for SpecAugment

Parameters     Values
freq_mask_F    27
freq_mask_N    2
time_mask_N    2
time_mask_T    100
time_mask_p    1.0
time_wrap_W    0

The SpecAugment method uses three different data augmentation methods: time warping, frequency masking, and time masking. Time warping selects an area from the time dimension for warping. Frequency masking selects an area from the frequency dimension for masking; in our experimental configuration, the length of the masked part is 27, which is the parameter freq_mask_F, and the parameter freq_mask_N refers to the number of masked areas. Time masking selects an area from the time dimension for masking; the parameter time_mask_T we set is 100, and the number of masked areas is 2. SpecAugment increases the diversity of the training data, making the trained model more robust.
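As an illustration of the masking just described, the sketch below applies SpecAugment-style frequency and time masks to an 80-dimensional feature matrix with the Table 2 values; it is a simplified stand-in for the toolkit implementation, and time warping is omitted since time_wrap_W is 0 in our configuration.

# Minimal sketch of SpecAugment-style masking on a (frames, 80) feature matrix
# using the Table 2 values; a simplified illustration, not the Fairseq code.
import numpy as np

def spec_augment(feats, F=27, n_freq_masks=2, T=100, n_time_masks=2, p=1.0, rng=None):
    """feats: (num_frames, num_mel) array; returns a masked copy."""
    rng = rng or np.random.default_rng()
    feats = feats.copy()
    num_frames, num_mel = feats.shape
    for _ in range(n_freq_masks):                 # frequency masking
        f = rng.integers(0, F + 1)
        f0 = rng.integers(0, max(1, num_mel - f))
        feats[:, f0:f0 + f] = 0.0
    max_t = min(T, int(p * num_frames))           # time masking, capped by p
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, num_frames - t))
        feats[t0:t0 + t, :] = 0.0
    return feats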
3 Method

3.1 Speech Recognition

We use a speech recognition task to pre-train the encoder parameters. After experimental verification, pre-training the parameters with speech recognition is much better than not using pre-training. Because we need to initialize the parameters of the speech translation model with the encoder of the speech recognition model, we use the same structure to train the speech recognition model. Although extracting MFCC features from the original audio can reduce the sequence length, the processed MFCC features still have a long time dimension and require further downsampling. In speech translation related work, a common practice is to use CNN or Shrink modules (Liu et al., 2020) to compress feature sequences. We use convolutional neural networks to downsample the extracted MFCC feature sequence: the input MFCC features are first passed through a two-layer convolutional neural network that extracts shallow features and downsamples them, and are then input into the Transformer model to complete the speech recognition task. The model structure is shown in Figure 1. The reason why the Transformer has strong modeling ability is its self-attention mechanism; the multi-head attention calculation in the Transformer is shown in Figure 2. Different linear projections of the input produce Q, K, and V, and the matrix of outputs is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \qquad (1)$$

their original dimensions, the calculation in the feed forward module is as follows:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \qquad (2)$$

Positional Encoding. The Transformer uses position encoding to indicate the relative position between tokens, and the calculation method is as follows:

$$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \qquad (3)$$

$$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \qquad (4)$$

After extracting shallow features from speech using convolutional neural networks, the Transformer combines the extracted information. Convolutional neural networks are good at extracting local features, while Transformers have a stronger ability to model global features. This structure enables the model to perform well in several speech processing tasks.
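For concreteness, the following small sketch computes the scaled dot-product attention of Equation (1) and the sinusoidal positional encodings of Equations (3)-(4); the shapes and d_model (assumed even) are illustrative only.

# Numerical sketch of Eq. (1) and Eqs. (3)-(4); not the Fairseq implementation.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

def positional_encoding(max_len, d_model):
    """d_model is assumed even here."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions, Eq. (3)
    pe[:, 1::2] = np.cos(angle)   # odd dimensions, Eq. (4)
    return pe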
        ar      de      fa      fr      ja      nl      pt      ru      tr      zh
WER     16.01   10.64   11.65   10.74   8.79    10.43   10.76   10.71   11.10   8.80

        ar      de      fa      fr      ja      nl      pt      ru      tr      zh
BLEU    12.35   23.30   12.15   32.59   12.93   27.46   28.57   14.66   11.33   22.07
After training the model on the MUST-C training set, we used its tst-COMMON test set to verify the model's effectiveness. The experimental results are shown in Table 4.

From Table 4, it can be seen that our system can complete translations in these 10 target languages, and the BLEU score exceeds 20 for 5 of them. Although the same model is used for all translation directions, the difficulty of translation varies among languages. As shown in the table, the BLEU scores of ar, fa, ja, ru, and tr are lower compared to the other languages, even though they use a similar amount of data. On the one hand, there are significant differences in grammar rules between these target languages and the source language, making it more difficult for the model to complete the language conversion; on the other hand, the differences between target languages make it difficult to share information between them consistently.

In current work on multilingual speech translation, many methods modify the model architecture and optimization methods, while our system uses a simple convolutional neural network combined with the Transformer structure to achieve a relatively good effect. Compared to those complex systems that modify models, our system has the following advantages: on the one hand, our system's training method is relatively simple and requires fewer model parameters; on the other hand, this simple structure can still effectively complete multilingual speech translation tasks. Our system can be applied to devices with strict memory requirements and can achieve relatively satisfactory results with a small number of parameters.

5 Conclusion

This paper introduces our system submitted to the IWSLT 2023 multilingual speech translation track. We used convolutional neural networks combined with Transformer models to complete the task of translating English speech into text in 10 target languages. Our system is characterized by its simplicity and efficiency, effectively modeling local and global features in speech and completing modal and language transformations within the model. Our system achieved satisfactory results on the test sets of the 10 languages in the MUST-C corpus.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Mattia A Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Yuchen Liu, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2020. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920.
Matesub: the Translated Subtitling Tool at the IWSLT2023 Subtitling task
Simone G. Perone
Translated srl
via Indonesia 23
00144 Rome - Italy
[email protected]
nique. The segmenter, implemented as proposed in (Karakanta et al., 2020; Papi et al., 2022), inserts into an unsegmented input text - either in the source or in the target language - markers of segment boundaries. It is trained on pairs of unsegmented-segmented text, where segment boundaries are marked by means of two special symbols: <eob> to mark the end of block (caption or subtitle), and <eol> to mark the end of line. Figure 3 shows an example of a sentence after inserting the markers from the corresponding fragment of the SRT file.

164
00:08:57,020 --> 00:08:58,476
I wanted to challenge the idea

165
00:08:58,500 --> 00:09:02,060
that design is but a tool
to create function and beauty.

I wanted to challenge the idea <eob> that design is but a tool <eol> to create function and beauty. <eob>

Figure 3: Subtitle file (top) and the full sentence annotated with the subtitle breaks (bottom). Figure taken from (Karakanta et al., 2020).

The neural machine translation engine performs the translation of the text from the source language (English, in the IWSLT 2023 context) into the corresponding text in the target language (here German and Spanish). Other processing modules are in charge of (i) generating captions/subtitles in SRT format (starting from transcripts, word timestamps, translations and segmentations), and (ii) merging the SRTs of captions and subtitles into a single JSON file. The main processing steps are:

1. Segmentation of the transcription on the basis of acoustic cues (audio blocks)

2. Segmentation of audio blocks into caption blocks (and lines) by means of the source language segmenter

3. Automatic translation of each caption block into the target language(s) (subtitle blocks)

4. Segmentation of subtitle blocks into lines by means of the target language segmenter

5. Timing projection from the CTM to the caption/subtitle blocks

6. Packaging of SRT and JSON files.

Note that the translation of each block in step 3 is done without looking at the context, i.e. at the surrounding blocks. On the one hand, this worsens the quality of the translation a little; on the other, it facilitates the satisfaction of the reading speed requirement through the n-best mechanism, sketched in the next section.

1.1.3 Machine translation

Neural machine translation is provided by ModernMT (https://www.modernmt.com/) (Bertoldi et al., 2021) through a REST API connection. ModernMT implements the Transformer (Vaswani et al., 2017) architecture; generic big models (about 200M parameters each), trained on both public and proprietary data, cover hundreds of languages (https://www.modernmt.com/api/#languages) in any direction, through a seamless integration of the pivot-based approach, where the pivot language is English. Matesub requests ModernMT to provide the 16 best translations of
each block (step 3 mentioned in the previous section); among them, the hypothesis with the highest probability and whose length permits satisfying the reading speed constraint (given the duration of the block) is selected. If no such hypothesis exists, the shortest is chosen.
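A minimal sketch of this selection rule is given below; the (text, log-probability) tuples and the 21 characters-per-second threshold are assumptions for illustration, not the exact interface of the production system.

# Sketch of the n-best selection under the reading-speed constraint: keep the
# most probable hypothesis whose characters-per-second fits the block duration,
# otherwise fall back to the shortest one. Field names and threshold assumed.
def pick_hypothesis(nbest, block_duration_sec, max_cps=21.0):
    """nbest: list of (text, log_prob) pairs for one caption block."""
    compliant = [(text, lp) for text, lp in nbest
                 if len(text) / block_duration_sec <= max_cps]
    if compliant:
        return max(compliant, key=lambda h: h[1])[0]   # most probable compliant
    return min(nbest, key=lambda h: len(h[0]))[0]      # otherwise the shortest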
1.2 The editor

Matesub provides a WYSIWYG editor, which allows the user to review and correct the subtitles automatically generated and synced in the chosen target language by the back-end subtitling pipeline. Figure 4 shows a screenshot of the Matesub editor.

The editor permits the user to easily fix both translation and segmentation errors, thanks to its rich catalogue of functions and user-friendliness. Once the editing is over, subtitles can be embedded in the video or exported in production-ready SRT files or any other supported subtitle format.

2 Submission and Results

Translated participated in the Subtitling shared task at IWSLT 2023 with the back-end subtitling pipeline of Matesub. No adaptation of the general-purpose pipeline was carried out; therefore, the quality of the subtitles generated for the audio-visual documents proposed in the shared task is that typically expected from the in-production system before the post-editing stage. Since the neural models of Matesub (ASR, text segmenter and MT) were trained on more resources than those allowed for the constrained condition, we labelled our submission as unconstrained; it was also our only submission, and as such it is the primary run.

Table 2 shows the scores of our test set subtitles as computed by the organizers (Agarwal et al., 2023). They are in line with those we obtained on the dev sets.

Without knowing the results of the other submissions, it is hard to judge the results obtained. However, some considerations can be made:

• As expected, from the pure speech translation perspective, the TED domain is the easiest one by far

• Surprisingly, at least when German is the target language, the EPTV domain is as challenging as ITV and PELOTON, which we expected to be the most difficult ones

• Assuming that BLEURT and ChrF are more reliable than BLEU and TER (according to (Kocmi et al., 2021), for example), it seems that the quality of TED and of Spanish EPTV subtitles is high, while subtitles of ITV, PELOTON and German EPTV documents would need major post-editing

• Since SubER is based on TER and Sigma on BLEU, their values match the scores of those metrics rather than BLEURT, ChrF and the subtitle compliance as measured by CPS/CPL/LPB, possibly affecting the final ranking of Matesub

• The compliance of subtitles is language independent

• Despite the fact that Matesub does not implement any hard rule, relying only on machine learning methods, CPL and LPB are (almost) perfect

• The reading speed (CPS) is under the max threshold of 21 characters per second in about 85% of subtitles; more in detail, the average is about 18.5 and only in 5% of cases does it exceed 30 characters per second, values that we consider satisfactory.

Acknowledgements

Matesub received funding from the European Institute of Innovation and Technology (EIT), a body of the European Union, through the MateDub (2020) and MateDub++ (2021) projects. Within them, the Matesub subtitling chain was developed in collaboration with FBK's MT research unit. We especially thank Mauro Cettolo for his invaluable contribution to the success of this product and the support he gave us in the participation in the Subtitling track at IWSLT 2023.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Khalid Choukri, Alexandra Chronopoulou, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Benjamin Hsu, John Judge, Tom Ko, Rishu Kumar, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Matteo Negri, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Elijah Rippeth, Elizabeth Salesky, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Mingxuan Wang, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proc. of IWSLT, Toronto, Canada.
                   Subtitle quality     Translation quality                      Subtitle compliance
en-   domain       SubER↓   Sigma↑      BLEU↑   ChrF↑   TER↓    BLEURT↑          CPS↑    CPL↑    LPB↑
-de   EPTV         87.04    57.73       12.08   43.59   85.53   .4705            88.59   99.20   100.00
-de   TED          67.70    62.01       20.37   50.05   65.55   .5500            90.55   98.61   100.00
-de   ITV          73.11    67.04       14.92   37.13   71.27   .4501            80.21   99.47   100.00
-de   PELOTON      79.72    68.27       10.06   34.46   78.25   .4264            89.17   99.29   100.00
-de   ALL          75.41    65.22       14.81   39.50   73.60   .4591            84.97   99.25   100.00
-es   EPTV         74.47    59.59       21.06   54.11   72.08   .5728            90.15   99.44   100.00
-es   TED          45.94    66.85       40.36   65.72   43.81   .7047            92.62   99.48   100.00
-es   ITV          71.25    71.06       18.50   41.07   69.57   .4592            81.93   99.51   100.00
-es   PELOTON      74.87    70.99       15.96   41.86   73.88   .4666            88.27   99.60   100.00
-es   ALL          68.11    68.37       22.34   47.38   66.66   .5059            86.07   99.52   100.00
Nicola Bertoldi, Davide Caroselli, M. Amin Farajian, Marcello Federico, Matteo Negri, Marco Trombetti, and Marco Turchi. 2021. Translation system and method. US Patent 11036940.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020. MuST-Cinema: a speech-to-subtitles corpus. In Proc. of LREC, pages 3727–3734, Marseille, France.

matic subtitling with automatically segmented ST corpora. In Proc. of AACL-IJCNLP, pages 480–487.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS, pages 5998–6008.
Augmentation Invariant Discrete Representation for
Generative Spoken Language Modeling
Itai Gat♢, Felix Kreuk♢, Tu Anh Nguyen♢, Ann Lee♢, Jade Copet♢, Gabriel Synnaeve♢, Emmanuel Dupoux♠♢, Yossi Adi♡♢
♢ FAIR Team, Meta AI Research
♠ ENS, INRIA, INSERM, UPEC, PSL Research University
♡ The Hebrew University of Jerusalem
Figure 1: Generative Spoken Language Modeling is composed of three components: (i) Speech-to-unit, (ii) Unit
language model, and (iii) Unit-to-speech. Pre-trained ASR and language models are used for evaluation.
ations, namely time-stretch, pitch-shift, additive-noise, and reverberation. Our premise is that while these variations modify the signal, its underlying content remains the same, especially under the unit repetition removal process. Therefore, a robust representation should be affected by such variations to a minimal extent.

As a first step, we propose a set of metrics for evaluating the model's robustness. Then, we point to the lack of robustness of these models with respect to the aforementioned variations. Next, we design a simple and effective method for learning an augmentation-invariant discrete representation on top of any speech SSL model. We demonstrate how such a method greatly improves robustness. Then, we empirically show that performance improves on several tasks for various SSL models. Specifically, we evaluate the newly proposed speech encoders on zero-shot evaluation tasks covering encoding and modeling, i.e., ABX, sWUGGY, and sBLIMP (Nguyen et al., 2020), together with a high-level downstream task in the form of speech-to-speech translation.

2 Background

The general Generative Spoken Language Modeling (GSLM) pipeline is comprised of three main modules: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech, where each of these modules is trained separately. Speech resynthesis can be achieved while ignoring the language model and directly feeding the quantized units into the unit-to-speech module (Polyak et al., 2021) (see Figure 1 for a visual description). In the following paragraphs, we give detailed background for each of the three components mentioned above, including the standard evaluation methods.

Speech-to-unit module encodes the raw speech signal into a discrete representation. The common approach is first to encode the speech into a continuous representation and then quantize the representation to achieve a sequence of discrete units (Lakhotia et al., 2021; Polyak et al., 2021; Popuri et al., 2022; Lee et al., 2021; Kharitonov et al., 2021a; Kreuk et al., 2021; Kharitonov et al., 2022; Nguyen et al., 2022; Borsos et al., 2022; Tjandra et al., 2019, 2020).

Formally, denote the domain of audio samples by $\mathcal{X} \subset \mathbb{R}$. The representation for a raw signal is therefore a sequence of samples $x = (x_1, \dots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$.

Consider an encoder network $f$ that gets as input the speech utterance and outputs a sequence of spectral representations sampled at a low frequency as follows: $f(x) = (v_1, \dots, v_{T'})$. Note that we do not assume anything about the structure of the encoder network $f$. Lakhotia et al. (2021) evaluated several speech encoders, namely Mel-spectrogram, Contrastive Predictive Coding (Oord et al., 2018, CPC), wav2vec2 (Baevski et al., 2020), and HuBERT (Hsu et al., 2021).

Since the representations learned by such models are usually continuous, a k-means algorithm is applied over the models' outputs to generate discrete units, denoted as $z = (z_1, \dots, z_{T'})$. Each element $z_i$ in $z$ is a positive integer, $z_i \in \{1, \dots, K\}$ for $1 \le i \le T'$, where $K$ is the number of discrete units. We denote the quantization model with $E$.

Unit Language Model is trained on the extracted discrete units, $z$. Such a language model learns a probability distribution over the learned unit sequences, which enables direct modeling of speech data without textual supervision.

The language model can be used to generate speech conditionally or unconditionally, replicating what toddlers achieve before learning to read. Moreover, such a modeling framework allows for capturing and modeling prosodic
features (Kharitonov et al., 2021a), as well as speaker identity (Borsos et al., 2022), or even natural dialogues (Nguyen et al., 2022). This is in contrast to using textual features, as they do not encode such information.

Unit-to-speech module converts the speech discrete units to a raw waveform. Lakhotia et al. (2021) used a Tacotron2.0-based model (Shen et al., 2018) followed by a WaveGlow (Prenger et al., 2019) vocoder. Later, Polyak et al. (2021) proposed a unit-based vocoder based on the HiFi-GAN architecture to convert units to speech directly. Such a paradigm seems to provide high-quality generations with better efficiency, as it uses only one model rather than two. Kreuk et al. (2021) and Lee et al. (2021) additionally improved the unit-based vocoder to include emotional tokens for speech emotion conversion tasks, and duration modeling for direct speech-to-speech translation.

Zero-shot Evaluation. Evaluating such a complex pipeline comprised of several components is a challenging task. Lakhotia et al. (2021) proposed a set of zero-shot evaluation tasks aimed at each of the modules. Overall, the proposed tasks can be divided into four main groups: (i) acoustic encoding using ABX and bitrate, (ii) language encoding using sWUGGY and sBLIMP (Nguyen et al., 2020; Lakhotia et al., 2021), (iii) resynthesis using Phoneme/Word Error Rate; and (iv) speech generation using VERT (Lakhotia et al., 2021) and Meaningfulness Mean Opinion Score.

3 Robustness of Speech-to-Unit Models

The first step toward developing an effective spoken language model is to develop a robust representation. The focus of a robust representation should be on the spoken information rather than unrelated signals, such as prosodic features in the form of duration and F0, background noise, or reverberations. In the following section, we propose a metric for quantifying the degree to which augmentations change the resulting encoding.

3.1 Unit Edit Distance

A spoken language model is built on top of a discrete representation of a continuous encoder. We examine the robustness of the discrete space to augmentations that do not change the spoken content. Therefore, we are interested in a sequential distance metric between two discrete representations. It is essential to note that augmentations can alter the spatial dimension of the signal. For example, stretching a signal results in more frames, yielding a longer representation sequence. A similar phenomenon happens when convolving with different room impulse responses to simulate reverberation. Hence, the metric should be able to measure the distance between two sequences of different lengths. Ideally, it will consider the number of deletions, insertions, and substitutions that occur due to augmenting the input data. For this purpose, we find the Levenshtein distance a good fit (Levenshtein, 1966). The Levenshtein distance measures the minimum changes one should make to modify one sequence into another. It has two essential properties: the first is that the score is non-negative, and when the sequences are equal, the metric equals zero. The second property is that the maximum value it can take equals the length of the longer of the two sequences. We provide a detailed explanation of the Levenshtein distance in the Appendix material.

We aggregate the distance values over the evaluation set while considering the sequence length. This is desirable since we want to normalize scores for sequences of different lengths, and the Levenshtein distance's maximum value is the original sequence's length. Another essential property of a spatial metric concerns repetitions. Consider time stretch as an example: it changes the number of input frames, but one would expect the deduplicated quantized signal to be the same as before the augmentation. Hypothetically, one could maximize the score by stretching the signal infinitely. To eliminate such dependencies, we compute the score on a deduplicated quantized representation. Formally, our final metric is:

Definition 3.1 (Unit Edit Distance). Given a continuous encoder $f : \mathbb{R}^T \to \mathbb{R}^{T'}$, a quantizer $E : \mathbb{R}^{T'} \to \{1, \dots, K\}^{T'}$, and an input augmentation $g : \mathbb{R}^T \to \mathbb{R}^{\hat{T}}$, the deduplicated unit edit distance $\mathrm{UED}_D(E, f, g)$ on the evaluation set $D$ is:

$$\mathrm{UED}_D(E, f, g) = \sum_{x \in D} \frac{1}{T'_x}\,\mathrm{LEV}\big((E \circ f)(x),\,(E \circ f \circ g)(x)\big), \qquad (1)$$

where $T'_x$ is the number of frames of a sample $x$.

Ideally, a perfect spoken language quantizer obtains a zero distance after deduplication. Next, we study state-of-the-art spoken language representations using our proposed metric in different settings.
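The following sketch illustrates Definition 3.1 on pre-computed unit sequences: consecutive repeated units are removed, the Levenshtein distance is computed, and the result is normalized by the number of frames of the clean sample. The exact normalization and deduplication order follow our reading of the definition and may differ from the reference implementation.

# Sketch of the unit edit distance (Definition 3.1) over discrete unit sequences.
def dedup(units):
    """Remove consecutive repetitions, e.g. [5,5,3,3,3,7] -> [5,3,7]."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def levenshtein(a, b):
    """Standard dynamic-programming edit distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def unit_edit_distance(clean_units_list, aug_units_list):
    """Each argument: list of per-utterance unit sequences (lists of ints)."""
    total = 0.0
    for clean, aug in zip(clean_units_list, aug_units_list):
        total += levenshtein(dedup(clean), dedup(aug)) / max(1, len(clean))
    return total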
[Figure 2 panels: (a) Time stretch, (b) Pitch shift, (c) Reverberation, (d) Noise; x-axis: number of units K in {50, 100, 200, 500}; y-axis: UED.]

Figure 2: UED scores for various augmentations and number of clusters. We note that the UED is relatively high (the distance is normalized). We also note that the UED monotonically increases with the number of units used. We multiply the scores by a hundred.
Figure 3: Illustration of our method: We forward a clean signal through an encoder followed by a pre-trained
quantizer (k-means). Next, we forward an augmented signal through the same encoder, followed by a new quantizer
(green). The CTC loss between the deduplicated output of the clean signal and the output of the augmented signal is
used to learn the parameters of the new quantizer. In the iterative approach, post the convergence of the learned
quantizer E0 , we freeze it and learn a new quantizer E1 that distills information from E0 .
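The sketch below illustrates one possible reading of the training step in Figure 3, using PyTorch's CTC loss between the deduplicated clean units (frozen encoder plus pre-trained k-means) and the output of a new learnable quantizer applied to the augmented signal; the blank index, module shapes, and the kmeans_assign helper are assumptions, not the authors' exact implementation.

# Hedged sketch of the CTC-based quantizer training step from Figure 3.
import torch
import torch.nn as nn

K = 100                                   # number of discrete units (assumed)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
new_quantizer = nn.Linear(768, K + 1)     # frozen-encoder features -> K units + blank

def training_step(clean_feats, aug_feats, kmeans_assign, optimizer):
    """clean_feats/aug_feats: (T, 768) and (T_aug, 768) frozen-encoder outputs;
    kmeans_assign maps (T, 768) features to a LongTensor of unit ids (assumed)."""
    with torch.no_grad():
        clean_units = kmeans_assign(clean_feats) + 1       # shift past the blank id
        target = torch.unique_consecutive(clean_units)     # de-duplication
    log_probs = new_quantizer(aug_feats).log_softmax(-1)   # (T_aug, K+1)
    loss = ctc(log_probs.unsqueeze(1),                     # (T_aug, 1, K+1)
               target.unsqueeze(0),                        # (1, S)
               torch.tensor([log_probs.size(0)]),
               torch.tensor([target.numel()]))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Usage note: optimizer would be built over new_quantizer.parameters() only,
# keeping the encoder and the pre-trained k-means frozen.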
Table 1: Unit edit distance study: Using our metric, we assess the robustness of various quantization methods on
top of a HuBERT representation. This study uses four different augmentations: time stretching, pitch shifting,
reverberation, and noise injection. The non-iterative (Section 4.1) and iterative (Section 4.2) methods significantly
and consistently improve the robustness of k-means. Pseudo-labeling accounts for most of the improvement. By
applying our method iteratively, we can improve it further. For readability, we multiply the scores by a hundred.
the converged E1. We repeat this process K times. This process needs more careful training. We note that it is essential to replace the quantizers only post-convergence.

5 Experiments

In the following, we assess the efficacy of our method using state-of-the-art self-supervised representations and popular discriminative and generative evaluation tasks. It is important to note that a single metric cannot tell the whole story. For example, similarly to perplexity, all representations can be assigned to the same cluster, which achieves a perfect unit edit distance but a poor representation. We first examine our proposed method using the unit edit distance along with other discriminative and generative performance metrics. Then, we show that our method improves downstream tasks.

In Section 5.1 we use our proposed metric from Section 3 to analyze the robustness of our method. In Section 5.2 we study the discriminative capabilities of our method using the ABX test (Schatz et al., 2013). Then, we evaluate our methods using generative zero-shot evaluation tasks such as sWUGGY and sBLIMP (Nguyen et al., 2020; Lakhotia et al., 2021). Finally, we demonstrate the effect of using our invariant quantizer's units in speech-to-speech translation.

Experimental Setup. We study our method using the base versions of HuBERT, wav2vec2, and WavLM. For readability, we report results for HuBERT in the main paper. The results for wav2vec2 and WavLM are in Appendix C. To match the current k-means training set, we use Librispeech-100h to learn our quantizer (Panayotov et al., 2015). We analyze our metric using the 'clean' and 'other' development sets from Librispeech. A detailed setup is provided in Appendix B.

5.1 Analysis

In Section 3, we presented an evaluation metric that assesses the robustness of a quantized speech representation to augmentations. The metric is insensitive to changes in the length of the signal. Using it, we investigated the current state-of-the-art representations. In the following, we study our invariant quantization method.

Table 1 presents the unit edit distance metric using our robustness method with and without the iterative approach. Compared with the k-means method currently in use, our non-iterative method consistently outperforms it by a large margin (relative improvement of at least 30%). We also note that different augmentations affect the representation differently. Our iterative method provides a slight but consistent improvement over the non-iterative method. It is noticeable that the UED increases (i.e., performance worsens) with the number of units used.

5.2 Zero-shot Evaluation

We evaluate the proposed method using the standard GSLM setup, i.e., ABX, sWUGGY, sBLIMP. The ABX task examines the discriminative phonetic abilities of the representation. Versteegh et al.
# units  Method            ABX (clean)↓      ABX (other)↓      sWUGGY↑  sBLIMP↑
                           Within  Across    Within  Across
50       k-means           7.52    8.90      9.84    13.5      66.12    54.91
50       Ours              6.76    7.72      9.03    11.78     67.59    55.76
50       Ours (Iterative)  6.63    7.55      9.53    12.14     67.42    57.04
100      k-means           6.37    7.72      8.4     12.29     67.70    56.16
100      Ours              5.50    6.21      7.24    10.11     67.79    57.01
100      Ours (Iterative)  5.39    6.22      7.46    10.20     68.20    56.99
200      k-means           5.99    7.14      8.23    11.51     66.51    54.64
200      Ours              5.29    6.01      7.22    9.78      70.51    56.19
200      Ours (Iterative)  5.19    6.00      7.18    9.70      70.68    56.26
500      k-means           5.98    6.98      7.89    11.43     66.92    55.97
500      Ours              5.16    6.03      7.06    9.76      70.13    55.19
500      Ours (Iterative)  4.96    5.73      6.93    9.63      69.33    56.93

Table 2: Zero-shot discriminative and generative evaluation tasks: We evaluate the ABX score on the 'clean' and 'other' development sets from Librispeech. Our method improves the scores in all setups.
(2015) show that the ABX result is a good proxy for signal content (i.e., Phoneme Error Rate). The input to the ABX is a pair of words with a phoneme modification and a reference word containing the same phoneme as one of the pair's words. Next, the ABX measures the distance of the test phoneme representation to both the correct and incorrect representations. Finally, the distance between the test and the correct representation is expected to be lower than the distance to the incorrect representation. The ABX task is conducted in two setups: 'within' and 'across.' 'Within' is evaluated on input data from the same speaker, while 'across' is evaluated on input data from different speakers.

Table 2 shows the ABX results for both Librispeech 'clean' and 'other'. In our experiments, we found that the ABX score consistently and significantly improved in all the setups we tested. In this case, the iterative approach improves more than the non-iterative one, but the improvement is inconsistent: for a small number of units on the 'other' split, the non-iterative ABX score is lower than the iterative model's score. Note that the 'other' split is challenging, as it is characterized by recordings that contain background noise and various accents.

The spot-the-word task (sWUGGY) requires detecting the real word from a pair of short utterances such as 'brick' vs. 'blick.' The detection is done by comparing the probabilities given by a language model for each word. This allows comparing representations by training language models on top of them. Differently, the acceptability judgment test (sBLIMP) requires detecting the syntactically correct sentence from a pair of sentences, one of which is syntactically correct and the other is wrong. The detection is based on the perplexity of the language model. As presented in Table 2, our method enables improvement in all the investigated setups for both the spot-the-word and acceptability judgment tests. This is especially noticeable for a larger number of units. For instance, when considering 200 or 500 units, the absolute improvement of the sWUGGY score is 4.17 and 3.21, respectively.

5.3 Speech-to-speech Translation

Lastly, we evaluate the proposed method considering the speech-to-speech translation task. To better assess the effectiveness of the proposed augmentation-invariant discrete representation we follow the same setup as in Lee et al. (2022) while changing the discrete speech representation only. Lee et al. (2022) propose a textless speech-to-speech translation method by forwarding a source speech signal and predicting its target's discrete representation. The authors use a k-means model trained on top of a multilingual HuBERT (mHuBERT) for speech representation. Additionally, the authors show that solving an auxiliary task enhances performance. We investigate the impact of using our augmentation-invariant quantizer as an alternative to the k-means used by Lee et al. (2022). Differently, we use HuBERT (instead of mHuBERT). Besides that, we follow the same setup in terms of model, computation resources, and data. To evaluate the quality of the translation, the sentence BLEU score (SacreBLEU) (Post, 2018) was used.

Table 3 presents the results for the Spanish-English and French-English setups on the Europarl-ST development and test sets (Iranzo-Sánchez et al., 2020). It also shows the original result from Lee et al. (2022).
       # units  Method     S-E    F-E
Dev    500      Invariant  17.3   16.4
       1000     k-means    15.4   16.0
       1000     Invariant  18.2   17.5
Test   500      Invariant  14.4   15.75
       1000     k-means    13.1   15.4
       1000     Invariant  15.9   17.1

Table 3: Speech-to-Speech Translation results: We report BLEU scores for the proposed method (Invariant) and compare it against the k-means used in Lee et al. (2022). We report both development and test set results for Spanish(S)-English(E) and French(F)-English(E).

The proposed method improves over Lee et al. (2022) under all the evaluated setups. Note, these results are especially interesting as the proposed method was trained on significantly less data (ours was trained on 1k hours while Lee et al. (2022) was trained on 100k hours).

6 Related work

This work investigates the robustness of self-supervised representations for language modeling. This is related to the advancements in speech self-supervised learning, their robustness, and modern generative spoken language modeling. In the following, we review all three areas.

Self-supervised Learning. The field of deep learning research has significantly benefited from self-supervised learning. Commonly, it involves encoding the input data and performing a task that enforces the representation to learn contextual embeddings. Speech self-supervised learning can be divided into two lines of research.

The first is discriminative: Oord et al. (2018) introduced Contrastive Predictive Coding (CPC), which trains a convolutional encoder and a predictor for future embeddings of the encoder using a contrastive loss. On top of it, Kharitonov et al. (2021b) propose to use time-domain augmentations to improve the CPC model further. wav2vec (Schneider et al., 2019) suggests using a contrastive loss that requires distinguishing between true and false future audio samples. Later, wav2vec2 (Baevski et al., 2020) learns quantized units using Gumbel softmax and predicts masked spans of the latent speech representation. HuBERT (Hsu et al., 2021) employs a frame-based masked prediction task: first, it quantizes input frames and then predicts masked frames.

The second line of work is generative. An early generative self-supervised work is Autoregressive Predictive Coding (Chung et al., 2019), which predicts the spectrum of a future frame. Later, Liu et al. (2020) introduced Mockingjay, which learns its representation by predicting non-causal context. TERA (Liu et al., 2021) alters time, frequency, and magnitude; it is then required to reconstruct the acoustic frames from their altered versions.

Robustness. A desired property of a spoken language representation is robustness to augmentations that do not change the spoken information. The spoken information should not differ significantly when male and female speakers say the same content. There is an interesting trade-off between training a robust representation and the quality of the input data. It is possible, for example, to use the same speaker for all data points in the training set. The model would not be able to learn any speaker bias, but this constraint prevents scaling.

Recently, the robustness of self-supervised speech representations has gained attention from the community. WavLM (Chen et al., 2022) proposes adopting the well-known HuBERT model (Hsu et al., 2021) and training it with an additional denoising process. The authors apply a noising process to the training data and then predict the clean units from it. ContentVec (Qian et al., 2022) is focused on the disentanglement of a speaker from self-supervised speech representation. The authors propose to use three disentanglement components. First, the student network is disentangled through two transformations. Then the representations are forwarded through a speaker condition component. Finally, voice-converted input data points are used to generate teacher labels.

7 Conclusions

In this work, we first propose a metric for evaluating the robustness of self-supervised speech representations applied to spoken language modeling tasks. Equipped with the aforementioned metric, we point out the lack of robustness in current state-of-the-art speech encoders with respect to simple signal variations that do not alter the spoken information. We then propose a simple and effective method to learn an augmentation-invariant discrete representation that boosts the robustness of the current approaches and demonstrate it on three state-of-the-art self-supervised speech representation models. We empirically show the efficacy of the proposed approach when considering encoding methods together with a textless speech-to-speech translation.
Broader Impact

As for broader impacts, this work is the first (to the best of our knowledge) which analyzes self-supervised speech representation models considering basic signal variations. We hope that, with the provided analysis and evaluation, researchers working on spoken language modeling and self-supervised speech representation learning will consider reporting the proposed metric setup along with the evaluation of downstream tasks.

Limitations

The proposed method has several limitations that should be taken into consideration when employing it. First, the method relies on an existing model, e.g., k-means, which creates a dependency between the performance of the initial and the robust models. Second, the flow is not trained end-to-end, which can also limit its performance, as end-to-end training allows improvement of the robustness of the whole representation. Lastly, to fully assess the effectiveness of the method, multiple metrics need to be examined. This can be a limitation, as interpreting the results from multiple metrics may not be straightforward. However, it gives a more complete picture of the model's performance.

References

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143.

Shlomo E Chazan, Lior Wolf, Eliya Nachmani, and Yossi Adi. 2021. Single channel voice separation for unknown number of speakers under reverberant and noisy settings. In ICASSP.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing.

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. 2019. An unsupervised autoregressive model for speech representation learning. In INTERSPEECH.

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICLR.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP.

Thorsten Karrer, Eric Lee, and Jan O Borchers. 2006. Phavorit: A phase vocoder for real-time interactive time-stretching. In ICMC.

Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, et al. 2022. textless-lib: a library for textless spoken language processing. arXiv preprint arXiv:2202.07359.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2021b. Data augmenting contrastive learning of speech representations in the time domain. In SLT.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, and Yossi Adi. 2021. Textless speech emotion conversion using decomposed and discrete representations. arXiv preprint arXiv:2111.07402.

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. On Generative Spoken Language Modeling from Raw Audio. TACL.
473
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019.
Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Waveglow: A flow-based generative network for
Tang, Juan Pino, et al. 2021. Direct speech-to- speech synthesis. In ICASSP.
speech translation with discrete units. arXiv preprint
arXiv:2107.05604. Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni,
Cheng-I Lai, David Cox, Mark Hasegawa-Johnson,
Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, and Shiyu Chang. 2022. Contentvec: An improved
Holger Schwenk, Peng-Jen Chen, Changhan Wang, self-supervised speech representation by disentan-
Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, gling speakers. In ICML.
and Wei-Ning Hsu. 2022. Textless speech-to-speech Chandan KA Reddy, Vishak Gopal, Ross Cutler,
translation on real data. In NAACL. Ebrahim Beyrami, Roger Cheng, Harishchandra
Dubey, Sergiy Matusevych, Robert Aichner, Ashkan
Vladimir Levenshtein. 1966. Binary codes capable of Aazami, Sebastian Braun, et al. 2020. The in-
correcting deletions, insertions, and reversals. In terspeech 2020 deep noise suppression challenge:
Soviet physics doklady. Datasets, subjective testing framework, and challenge
results. arXiv preprint arXiv:2005.13981.
Andy T Liu, Shang-Wen Li, and Hung-yi Lee. 2021.
Tera: Self-supervised learning of transformer encoder Thomas Schatz, Vijayaditya Peddinti, Francis Bach,
representation for speech. IEEE/ACM Transactions Aren Jansen, Hynek Hermansky, and Emmanuel
on Audio, Speech, and Language Processing. Dupoux. 2013. Evaluating speech features with
the minimal-pair abx task: Analysis of the classical
Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun mfc/plp pipeline. In INTERSPEECH.
Hsu, and Hung-yi Lee. 2020. Mockingjay: Unsu-
pervised speech representation learning with deep Robin Scheibler, Eric Bezzam, and Ivan Dokmanić.
bidirectional transformer encoders. In ICASSP. 2018. Pyroomacoustics: A python package for audio
room simulation and array processing algorithms. In
Tu Anh Nguyen, Maureen de Seyssel, Patricia ICASSP.
Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Steffen Schneider, Alexei Baevski, Ronan Collobert,
Baevski, Ewan Dunbar, and Emmanuel Dupoux. and Michael Auli. 2019. wav2vec: Unsupervised pre-
2020. The zero resource speech benchmark 2021: training for speech recognition. In INTERSPEECH.
Metrics and baselines for unsupervised spoken lan-
guage modeling. In NeurIPS – Self-Supervised Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike
Learning for Speech and Audio Processing Work- Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
shop. Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan,
et al. 2018. Natural tts synthesis by conditioning
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi wavenet on mel spectrogram predictions. In ICASSP.
Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello,
Robin Algayres, Benoit Sagot, Abdelrahman Mo- Joachim Thiemann, Nobutaka Ito, and Emmanuel Vin-
hamed, et al. 2022. Generative spoken dialogue lan- cent. 2013. Demand: a collection of multi-channel
guage modeling. arXiv preprint arXiv:2203.16502. recordings of acoustic noise in diverse environments.
In Proc. Meetings Acoust.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura.
Representation learning with contrastive predictive 2020. Transformer vq-vae for unsupervised unit dis-
coding. arXiv preprint arXiv:1807.03748. covery and speech synthesis: Zerospeech 2020 chal-
lenge. In Interspeech.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and San-
jeev Khudanpur. 2015. Librispeech: an asr corpus Andros Tjandra, Berrak Sisman, Mingyang Zhang,
based on public domain audio books. In ICASSP. Sakriani Sakti, Haizhou Li, and Satoshi Nakamura.
2019. Vqvae unsupervised unit discovery and multi-
Adam Polyak, Yossi Adi, Jade Copet, Eugene scale code2spec inverter for zerospeech challenge
Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Ab- 2019. In Interspeech.
delrahman Mohamed, and Emmanuel Dupoux.
2021. Speech resynthesis from discrete disentan- Maarten Versteegh, Roland Thiolliere, Thomas Schatz,
gled self-supervised representations. arXiv preprint Xuan Nga Cao, Xavier Anguera, Aren Jansen, and
arXiv:2104.00355. Emmanuel Dupoux. 2015. The zero resource speech
challenge 2015. In Sixteenth annual conference of
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan the international speech communication association.
Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang,
Lee. 2022. Enhanced direct speech-to-speech transla-
Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin,
tion using self-supervised pre-training and data aug-
Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-
mentation. arXiv preprint arXiv:2204.02967.
Ting Lin, et al. 2021. Superb: Speech processing
universal performance benchmark. arXiv preprint
Matt Post. 2018. A call for clarity in reporting BLEU
arXiv:2105.01051.
scores. In EMNLP.
474
A Levenshtein Distance in Section C, we present additional results. We
report results on two additional state-of-the-art self-
Throughout the paper, we use a version of the Lev-
supervised speech representations. We show that
enshtein distance. In this section, we detail the
our method is indeed effective for those representa-
Levenshtein distance between two sequences. Let
tions as well as shown in the main paper.
x ∈ {1, .., K}Tx and y ∈ {1, .., K}Ty be two dis-
crete vectors, not necessary in the same size. Let C Additional Results
us also denote the operator tail(x) to return a copy
of the vector x without its first element. Then, In the following, we provide additional results on
the Levenshtein distance is defined recursively by the state-of-arts representations “wav2vec2” and
Lev(x, y) = “WavLM” (Baevski et al., 2020; Chen et al., 2022).
Tables 4 and 5 present the UED scores for both
|x|, if |y| = 0 the wav2vec2 and WavLM models. Using our
method, we observe robustness improvements for
if |x| = 0
|y|, both of the models. However, it is notable that the
Lev(tail(x), y)
WavLM model is more robust than the wav2vec2
1 + min Lev(x, tail(y)) , otherwise model. It is reasonable since the WavLM trained
Lev(tail(x), tail(y)) to be a more robust model using noisy training
samples.
where |x|, |y| are the lengths of the vectors x and y Tables 6 and 7 present the discriminative and
respectively. Note, in our implementation, we use generative metrics for both wav2vec2 and WavLM.
deduplicated sequences. We observe a consistent improvement using our
robust quantizer as in the robustness metrics. How-
B Extended Experimental Setup
ever, for the WavLM, the improvements are some-
Models. We study our method using the base ver- times marginal (except for k = 50 where k-means
sions of HuBERT, wav2vec2, and WavLM. Similar outperforms our method). The WavLM model is
to prior work, for HuBERT and WavLM, we use trained with a HuBERT architecture, with more
the ninth and sixth layers for wav2vec2. For read- data and noisy samples. Interestingly, while pre-
ability, we report results for HuBERT in the main senting better performance on various downstream
paper. The results for wav2vec2 and WavLM are tasks than HuBERT, their ABX, sWUGGY, and
presented in Appendix C. In our quantizer learning sBLIMP scores are lower.
process, we use a learning rate of 0.0001, a batch
size of 32, and Adam optimizer (Kingma and Ba,
2014). Our quantizer is composed of three fully
connected layers with LeakyReLU activation be-
tween them. The dimensions of those layers are
determined by the division floor of the difference
between the upstream dimension to the number
of units. We train our quantizer using a single
NVIDIA V100 GPU.
Datasets. To match the current k-means popular
training set, we use the Librispeech-100h to learn
our quantizer (Panayotov et al., 2015). We analyze
our metric using the ‘clean’ and ‘other’ develop-
ment sets from Librispeech. The augmentations
in all setups include time stretch, pitch shift, rever-
beration, and noise injection (exact parameters are
detailed in Section 3.2.1). For the sWUGGY and
sBLIMP evaluations, we use the ‘big’ transformer
language model from Lakhotia et al. (2021).
This appendix begins with a detailed explana-
tion on the Levenshtein distance (Section A). Then,
475
Augmentation
# units Method
Time Pitch shift Reverberation Noise
k-means 50.81±0.41 58.66±1.16 43.71±0.77 32.17±0.61
50 Ours 38.74±0.45 42.33±0.97 33.69±0.73 25.36±0.49
Ours (Iterative) 36.68±0.39 40.29±1.04 33.28±0.74 23.99±0.51
k-means 55.30±0.61 65.23±0.91 48.41±0.72 33.97±0.46
100 Ours 42.32±0.46 47.07±0.88 36.83±0.71 27.15±0.75
Ours (Iterative) 40.43±0.57 45.73±0.90 36.34±0.77 26.22±0.59
k-means 59.85±0.39 70.80±1.31 53.13±0.67 36.64±0.62
200 Ours 46.84±0.42 51.60±1.21 40.54±0.66 32.61±0.67
Ours (Iterative) 44.90±0.35 49.59±1.25 40.58±0.62 29.49 ±0.57
k-means 66.12±0.48 77.01±0.98 59.69±1.01 37.22±0.65
500 Ours 51.65±0.49 55.40±1.03 45.85±0.93 33.17±0.62
Ours (Iterative) 50.50±0.53 57.12±1.02 44.67±0.98 31.92±0.69
Augmentation
# units Method
Time Pitch shift Reverberation Noise
k-means 47.66±0.49 52.93±1.02 33.45±0.62 28.46±0.61
50 Ours 39.12±0.43 44.25±1.06 31.58±0.62 25.32±0.67
Ours (Iterative) 36.79±0.46 40.16±1.05 25.73±0.64 25.01±0.66
k-means 52.61±0.51 58.44±0.72 36.27±0.45 29.44±0.64
100 Ours 43.55±0.53 49.03±0.75 30.54±0.44 25.93±0.67
Ours (Iterative) 42.11±0.50 46.08±0.74 28.88±0.47 25.47±0.59
k-means 58.50±0.42 64.75±1.02 41.05±0.54 30.93±0.62
200 Ours 49.57±0.41 53.48±1.09 34.29±0.53 26.66±0.65
Ours (Iterative) 47.82±0.46 52.47±1.01 32.88±0.55 26.09 ±0.62
k-means 64.25±0.67 70.55±0.75 45.63±0.83 33.17±0.71
500 Ours 55.41±0.64 59.79±0.87 42.85±0.78 28.46±0.79
Ours (Iterative) 52.92±0.69 57.840±0.81 40.46±0.81 27.09±0.72
476
ABX (clean) ↓ ABX (other)↓
# units Method sWUGGY ↑ sBLIMP ↑
Within Across Within Across
k-means 12.03 15.31 13.61 19.07 49.76 53.92
50 Ours 11.18 13.82 13.34 18.39 - -
Ours (Iterative) 10.35 12.75 12.64 17.29 49.65 55.29
k-means 11.27 13.99 13.06 17.11 51.63 53.87
100 Ours 9.86 11.81 11.44 16.63 -
Ours (Iterative) 9.24 11.30 11.37 16.14 51.90 54.95
k-means 11.13 14.42 12.37 18.02 51.29 54.99
200 Ours 10.19 12.41 11.85 17.52 - -
Ours (Iterative) 9.00 11.11 11.49 16.53 51.99 55.67
k-means 12.06 15.61 13.77 19.94 52.21 54.32
500 Ours 10.76 13.83 13.52 19.60 - -
Ours (Iterative) 10.16 12.42 12.56 18.24 52.93 55.17
477
DePA: Improving Non-autoregressive Machine Translation with
Dependency-Aware Decoder
Jiaao Zhan1 , Qian Chen, Boxing Chen, Wen Wang, Yu Bai1 , Yang Gao1∗
1
School of Computer Science and Technology,
Beijing Institute of Technology, Beijing, China
[email protected]
{lukechan1231,chenboxing,wwang.969803}@gmail.com
{yubai,gyang}@bit.edu.cn
Table 1: Case studies of our proposed FBD approach on the highly competitive fully NAT model GLAT (Qian et al., 2021) for
alleviating three types of multi-modality errors on the IWSLT16 DE-EN validation set. Repetitive tokens are in red. Source words
that are not semantically translated are in bold and underlined. Wrong lexical choices (for polysemous words) and redundant
words are in blue. F-NAT denotes only modeling forward dependencies while FB-NAT denotes modeling both forward and
backward dependencies, the same as the models in Table 5. Case studies of our proposed IT approach are in Appendix.
The NAT model modeling only forward depen- tion space, resulting in differences from the true
dency (F-NAT) incorrectly translates “woher” into target-side distribution. Our proposed IT ensures
“how” and outputs “How do I come from?”; whereas that the decoder input is in the exact target repre-
the model modeling both forward and backward sentation space hence enables the model to better
dependency (FB-NAT) translates it correctly into capture target dependencies.
“Where do I come from?”. Therefore, instead of Our contributions can be summarized as follows:
dependency reduction, we propose a novel and gen- (1) We propose a novel and general Dependency-
eral Dependency-Aware Decoder (DePA), which Aware Decoder (DePA) for fully NAT models. For
enhances the learning capacity of fully NAT mod- DePA, we propose a novel approach FBD for learn-
els and enables them to learn complete and complex ing both forward and backward dependencies in
forward and backward target dependencies in order NAT decoder, through which the target dependen-
to alleviate the multi-modality issue. cies can be better modeled. To the best of our
Firstly, we enhance the NAT decoder to learn knowledge, our work is the first to successfully
complete target dependencies by exploring decoder model both forward and backward target-side
self-attention. We believe that previous works (Guo dependencies explicitly for fully NAT models.
et al., 2020a) incorporating only forward depen- We also propose a novel decoder input transfor-
dency modeled by AT models into NAT models are mation approach (IT). IT could ease target-side
inadequate to address multi-modality. Therefore, dependency modeling and enhance the effective-
we propose an effective forward-backward depen- ness of FBD. DePA is model-agnostic and can
dency modeling approach, denoted by FBD, as be applied to any fully NAT models. (2) Exten-
an auto-aggressive forward-backward pre-training sive experiments on WMT and IWSLT benchmarks
phase before NAT training, using curriculum learn- demonstrate that our DePA consistently improves
ing. The FBD approach implements triangular the representative vanilla NAT model (Gu et al.,
attention masks and takes different decoder inputs 2018), the highly competitive fully NAT model
and targets in a unified framework to train the GLAT (Qian et al., 2021) and the current SOTA
model to attend to previous or future tokens and of fully NAT models, CTC w/ DSLP & Mixed
learn both forward or backward dependencies. Training (denoted by CTC-DSLP-MT) (Huang
et al., 2021) (DSLP denotes Deep Supervision and
Secondly, we enhance target dependency model-
Layer-wise Prediction), by up to +0.85 BLEU on
ing within the NAT decoder from the perspective
the SOTA CTC-DSLP-MT, +1.88 BLEU on GLAT,
of the decoder input. Most prior NAT models (Gu
and +2.2 BLEU on vanilla NAT, while reserving
et al., 2018; Wang et al., 2019; Wei et al., 2019)
inference latency as other fully NAT models, about
use a copy of the source text embedding as the
15× speed-up over AT models. Experiments show
decoder input, which is independent from the tar-
that DePA achieves greater BLEU gains with less
get representation space and hence makes target
speed-up loss than DSLP when applied to various
dependency modeling difficult. We transform the
fully NAT models.
initial decoder input from the source language rep-
resentation space to the target language representa- 2 Related Work
tion space through a novel attentive transformation
process, denoted by IT. Previous works on trans- Forward and Backward Dependencies Prior
forming the decoder input cannot guarantee that works explore bidirectional decoding to improve
the decoder input is in the exact target representa- modeling of both forward and backward depen-
479
dencies in phrase-based statistical MT (Finch and ate multiple possible translations. In contrast, our
Sumita, 2009) and RNN-based MT (Zhang et al., DePA utilizes forward-backward pre-training and
2018). For NAT, Guo et al. (2020a) and Wei et al. a novel attentive transformation of decoder input
(2019) use forward auto-regressive models to guide to enhance target dependency modeling. Under
NAT training. Liu et al. (2020) introduces an in- same settings and with KD, DA-Transformer per-
termediate semi-autoregressive translation task to forms only comparably to CTC-DSLP-MT; how-
smooth the shift from AT training to NAT train- ever, performance of DA-Transformer benefits no-
ing. However, backward dependencies are rarely tably from Transformer-big for KD while CTC-
investigated in NAT. DSLP-MT uses Transformer-base for KD. DDRS
w/ NMLA (Shao and Feng, 2022) benefits greatly
Decoder Input of Fully NAT Models The de- from using diverse KD references while CTC-
coder input of AT models consists of previously DSLP-MT uses only a single KD reference. Hence,
generated tokens. However, selecting appropriate CTC-DSLP-MT is still the current SOTA for
decoder input for fully NAT models could be chal- fully NAT models on WMT benchmarks.
lenging. Most prior NAT models (Gu et al., 2018;
Wang et al., 2019; Wei et al., 2019) use uniform Non-autoregressive Models Besides fully NAT
copy (Gu et al., 2018) or soft copy (Wei et al., models, iterative NAT models are proposed such
2019) of the source text embedding as the decoder as iterative refinement of target sentences (Lee
input, which is independent of the target repre- et al., 2018), masking and repredicting words with
sentation space hence hinders target dependency low probabilities (Ghazvininejad et al., 2019), edit-
modeling. Methods such as GLAT (Qian et al., based methods to iteratively modify decoder out-
2021) and (Guo et al., 2020a,b) attempt to make put (Stern et al., 2019; Gu et al., 2019), and parallel
the NAT decoder input similar to the target rep- refinement of every token (Kasai et al., 2020). It-
resentation space by substituting certain positions erative NAT models improve translation accuracy
in the decoder input with the corresponding target at the cost of slower speed. Non-autoregressive
embedding. However, this creates a mismatch be- models are practically important due to high effi-
tween training and inference. Guo et al. (2019) uses ciency. Other than MT, they are applied to various
phrase-table lookup and linear mapping to make tasks such as image captioning (Gao et al., 2019),
the decoder input closer to the target embedding, automatic speech recognition (Chen et al., 2019),
but this method still causes difference between the and text-to-speech synthesis (Oord et al., 2018).
decoder input and the real target-side distribution.
3 Methodology
Fully NAT Models To address multi-modality
for fully NAT models, various approaches are pro- 3.1 Problem Formulation
posed. Gu et al. (2018) uses knowledge distillation NMT can be formulated as a sequence-to-sequence
(KD) (Kim and Rush, 2016) to reduce dataset com- generation problem. Given a sequence X =
plexity. Libovickỳ and Helcl (2018) and Saharia {x1 , ..., xN } in the source language, a sequence
et al. (2020) use connectionist temporal classifica- Y = {y1 , ..., yT } in the target language is gener-
tion (CTC) (Graves et al., 2006) for latent align- ated following the conditional probability P (Y |X).
ment. Sun et al. (2019) utilizes CRFs to model NAT models are proposed to speed up generation
target positional contexts. Kaiser et al. (2018), by decoding all the target tokens in parallel, using
Ma et al. (2019) and Shu et al. (2020) incorpo- conditional independent factorization as:
rate latent variables to guide generation, similar
to VAEs (Kingma and Welling, 2013). Guo et al. T
Y
PN A (Y |X) = PL (T |x1:N ) · P (yt |x1:N ; θ) (1)
(2020c) initializes NAT decoders with pretrained t=1
language models. Huang et al. (2021) proposes
CTC with Deep Supervision and Layer-wise Pre- where the target sequence length T is modeled by
diction and Mixed Training (CTC-DSLP-MT), set- the conditional distribution PL , and dependence
ting new SOTA for fully NAT models on WMT on previous target tokens is removed. Compared
benchmarks. DA-Transformer (Huang et al., 2022) to AT models, NAT models speed up inference
represents hidden states in a directed acyclic graph significantly at the expense of translation quality,
to capture dependencies between tokens and gener- because the conditional independence assumption
480
ing phase uses features of each word to predict the
word itself. We make the following hypotheses:
(1) Considering the nature of languages, learning
forward dependency in Phase 1 is easier for the
model for language generation. (2) Modeling back-
ward dependency relies on learned forward depen-
dency knowledge, hence it should be in the second
phase. In fact, we observe the interesting find-
ing that the best curriculum remains forward-
Figure 1: The proposed forward-backward dependency mod- backward-forward-NAT (FBF-NAT) for both
eling (FBD) with triangular attention masks in a unified frame-
work. The red dashed lines indicate the attention masks. We left-branching and right-branching languages,
use different colors to highlight the difference of inputs and proving our hypotheses. We speculate that NAT
targets in each phase. training may benefit from another forward depen-
dency modeling in Phase 3 because the order of
left-to-right is more consistent with characteristics
in Eq.1 enables parallel processing but lacks ex-
of natural languages, hence adding the second for-
plicit modeling of dependency between target to-
ward dependency modeling after FB (i.e., FBF)
kens. To enhance target dependency modeling, we
smooths the transition to the final NAT training.
propose two innovations as incorporating both for-
Detailed discussions are in Section 4.3.
ward and backward dependency modeling into the
training process (Section 3.2) and transforming the 3.3 Decoder Input Transformation (IT) for
decoder input into the target representation space Target Dependency Modeling
(Section 3.3).
Given the initial decoder input z as a copy of source
3.2 Target Dependency Modeling with text embedding, we propose to directly select rele-
Curriculum Learning (FBD) vant representations from target embedding to form
a new decoder input z ′ (Figure 2). z is used as
Prior work (Guo et al., 2020a) utilizes forward de- the query and the selection is implemented as a
pendency in AT models to initialize model parame- learnable attention module. The learnable parame-
ters for NAT. However, as discussed in Section 1, ters bridge the gap between training and inference
for fully NAT models, only modeling forward de- while the selection guarantees consistency between
pendency is inadequate for addressing the multi- the decoder input matrix and the target represen-
modality problem (Finch and Sumita, 2009; Zhang tation space (i.e., the output embedding matrix of
et al., 2018) (the Row for F-NAT in Table 1). Our the decoder). This way, the decoder input is in the
innovations include incorporating both forward and exact target-side embedding space and more con-
backward dependency modeling into NAT models, ducive to modeling target dependencies for NAT
via triangular attention masks in a unified frame- models than previous approaches using source text
work through curriculum learning (Figure 1), and embedding or transformed decoder input.
investigating efficacy of different curricula. In Fig-
ure 1, the NAT decoder phase denotes standard Decoder Input Transformation To transform z
NAT training of any NAT decoder Dec. The For- into the target representation space, we apply atten-
ward Dependency and Backward Dependency tion mechanism between z and the output embed-
phases serve pre-training for NAT training, learning ding matrix Emb ∈ Rd×v , where d and v denote
left-to-right and right-to-left dependencies to ini- sizes of hidden states and the target vocabulary.
tialize NAT models with better dependencies. For- Since NAT models usually have embedding matrix
ward Dependency and Backward Dependency train- Emb including both source and target vocabular-
ing phases apply the same upper triangle attention ies, first, we conduct a filtering process to remove
mask on Dec. We use KD data from AT models for source vocabulary (mostly not used by the decoder)
each phase but the inputs and the targets are differ- from the decoder output embedding matrix (the
ent. The Forward Dependency training phase uses linear layer before decoder softmax). We build a
y1 to predict y2 and so on. The Backward Depen- dictionary that contains only target-side tokens in
dency training phase reverses the target sequence the training set. We then use this dictionary to filter
and uses y2 to predict y1 and so on. The NAT Train- Emb and obtain the new output embedding matrix
481
Figure 2: The proposed Decoder Input Transformation (IT) from z to z ′ , where z ∈ RT ×d denotes the initial decoder input
copied from the source text embedding xemb , T and d denote the length of the target text y and the size of hidden states,
respectively. Emb ∈ Rd×v denotes the output embedding matrix of the decoder (the target representation space), where v
denotes the size of the target vocabulary.
′
of the decoder Emb′ ∈ Rd×v , where v ′ denotes Since we can manually set v ∗ as a relatively small
size of the filtered vocabulary. This filtering pro- number (e.g., 1000, 2000), the computational cost
cess guarantees that Emb′ is strictly from the target of the attention mechanism can be greatly reduced.
representation space. The attention process starts We hypothesize that target-side embedding com-
with a linear transformation: pression may also alleviate over-fitting on small
datasets and confirm this hypothesis in Section 4.3.
z l = Wq · z (2)
482
Baselines and Training We implement the base- single NVIDIA V100 GPU, then compute the aver-
line models based on their released codebases. age time per sentence. We report Speed-up based
We implement the representative vanilla NAT (Gu on the inference latency of Transformer-base AT
et al., 2018; Qian et al., 2021; Huang et al., (teacher) and fully NAT models.
2021)4 , the highly competitive fully NAT model
GLAT (Qian et al., 2021)5 , and current fully 4.2 Main Results
NAT SOTA CTC w/ DSLP & Mixed Training Table 2 shows the main results on the WMT bench-
(CTC-DSLP-MT) (Huang et al., 2021)6 and ap- marks. For EN↔RO, we report the mean of BLEU
ply our methods to them. Following Qian et al. from 3 runs with different random seeds for Row
(2021), we use base-Transformer (dmodel =512, 12-13, all with quite small standard deviations
nhead =8, nlayer =6) for WMT datasets and small- (≤ 0.16) 7 . We apply our proposed DePA, which
Transformer (dmodel =256, nhead =4, nlayer =5) for includes IT and FBD, to vanilla NAT, GLAT, and
IWSLT and SP EN-JA datasets. We use the same the current fully NAT SOTA CTC-DSLP-MT, on
training setup for training the three models, Vanilla WMT, IWSLT, and EN-JA benchmarks. We use
NAT , GLAT, and CTC-DSLP-MT as in their orig- the same hyperparameters and random seeds to
inal papers cited above. We train models with fairly compare two models. It is crucial to point
batches of 64K tokens for WMT datasets, and 8K out that accuracies of vanilla NAT, GLAT, and
tokens for IWSLT and SP EN-JA datasets, using CTC-DSLP-MT models have plateaued out af-
NVIDIA V100 GPUs. For GLAT, we use Adam ter 300K training steps on WMT datasets hence
optimizer (Kingma and Ba, 2015) with β = (0.9, original papers of these three models set max
0.999) and set dropout rate to 0.1. For Vanilla training steps to 300K. We verify this observation
NAT and CTC-DSLP-MT, we use Adam optimizer in our own experiments as we also see no gains
(Kingma and Ba, 2015) with β = (0.9, 0.98). For on these models after 300K training steps on the
WMT datasets, the learning rate warms up to 5e-4 WMT datasets. Hence, although our DePA trains
in 4K steps and gradually decays according to in- 300K × 4 = 1200K steps on WMT datasets due
verse square root schedule (Vaswani et al., 2017). to FBF pre-training as in Section 4.3, all compar-
As for IWSLT and SP EN-JA datasets, we adopt isons between baselines w/ DePA and w/o DePA
linear annealing (from 3e-4 to 1e-5 ) as in Lee et al. are fair comparisons. Table 2 shows that DePA
(2018). We choose the model with the best perfor- consistently improves the translation accuracy for
mance on the validation set as the final model and both vanilla NAT and GLAT on each benchmark,
evaluate the final model on the test sets. For experi- achieving mean=+1.37 and max=+1.88 BLEU
ments using our method FBD (Section 3.2), we use gain on GLAT and mean=+2.34 and max=+2.46
the FBF-NAT configuration (as in Section 4.3) BLEU gain on vanilla NAT. DePA also improves
and train the same number of steps at each phase the SOTA CTC-DSLP-MT by mean=+0.42 and
(including NAT training phase), with 300K steps max=+0.49 BLEU gain on the WMT test sets (Ta-
for each phase for WMT datasets and 100K steps ble 2), +0.85 BLEU gain on the IWSLT16 DE-EN
for each phase for IWSLT datasets and SP EN-JA. validation set and +1.43 BLEU gain on the EN-JA
IT by default is without Target-side Embedding test set (Table 3). All gains from DePA on vanilla
Compression (Section 3.3). NAT, GLAT, and CTC-DSLP-MT are statistically
Evaluation To evaluate the translation accuracy, significant (p < 0.05) based on a paired bootstrap
we use SacreBLEU (Post, 2018) for all experi- resampling test conducted using 1K resampling
ments and ChrF (Popovic, 2015) (also using the trials and the SacreBLEU tool.
SacreBLEU tool) additionally for ablation study Table 2 also shows that on each benchmark,
on IWSLT benchmark. To evaluate the inference the average improvement from DePA on three
latency, following Gu and Kong (2021), we mea- models (vanilla NAT, GLAT, and CTC-DSLP-
sure the wall-clock time for translating the entire MT) is within [0.90,1.56] (Row15), always larger
WMT14 EN-DE test set with batch_size=1 on a than the average improvement from w/DSLP on
4 7
https://github.com/facebookresearch/fairseq/ WMT14 EN↔DE is much larger than WMT16 EN↔RO.
tree/main/examples/nonautoregressive_ Since standard deviations of BLEU from multiple runs with
translation different random seeds on WMT14 EN↔DE are very small,
5
https://github.com/FLC777/GLAT ≤ 0.08 (Huang et al., 2022), following prior works, we report
6
https://github.com/chenyangh/DSLP single-run BLEU on WMT14 EN↔DE to save energy.
483
WMT’14 WMT’16
Row# Models Speed-up ↑ EN-DE DE-EN EN-RO RO-EN
1 Transformer-base (teacher) 1.0× 27.48 31.39 33.70 34.05
2 + KD 2.5× 27.34 30.95 33.52 34.01
3 Vanilla NAT 15.6× 20.36 24.81 28.47 29.43
4 w/ DSLP∗ 14.8× 22.72 25.83 30.48 31.46
5 w/ DePA (Ours) 15.4× 23.15 26.59 30.78 31.89
6 GLAT 15.3× 25.21 29.84 31.19 32.04
7 w/ DSLP∗ 14.9× 25.69 29.90 32.36 33.06
8 w/ DePA (Ours) 15.1× 26.43 30.42 33.07 33.82
10 CTC∗ 15.5× 25.72 29.89 32.89 33.79
11 w/ DSLP∗ 14.8× 26.85 31.16 33.85 34.24
12 w/ DSLP & Mixed Training 14.8× 27.02 31.61 33.99 34.42
13 w/ DSLP & Mixed Training & w/ DePA (Ours) 14.7× 27.51 31.96 34.48 34.77
14 Average improvement from DSLP - 1.32 0.78 1.38 1.17
15 Average improvement from DePA (Ours) - 1.50 0.90 1.56 1.53
Table 2: BLEU and Speed-up from our DePA and existing methods on WMT benchmark test sets. Speed-up is measured on
WMT14 EN-DE test set. BLEUs without rescoring are reported, with the best BLEU scores in bold for each group. ∗ denotes
the results are copied from previous work (Huang et al., 2021), other results are obtained by our implementation. Average
improvements of DSLP are re-calculated using our results, which are slightly different from Table 1 in (Huang et al., 2021).
them, [0.78,1.38] (Row14). DePA brings consis- (only +0.07 BLEU gain) on IWSLT16 DE-EN,
tent improvement over SOTA CTC-DSLP-MT on whereas GLAT w/FBD brings +1.19 BLEU/+1.2
all benchmarks (Table 2 Row13-over-Row12, Ta- ChrF gains over GLAT (400K steps).
ble 3), hence we expect DePA to also improve DA- Table 4 shows our IT outperforms Linear Map-
Transformer (Huang et al., 2022) and DDRS w/ ping (Guo et al., 2019) by +2.31 BLEU gain on
NMLA (Shao and Feng, 2022) and will verify this IWSLT14 DE-EN test set. IT has the same num-
w/ and w/o KD in future work. Applying DePA to ber of extra parameters as Linear Mapping. Hence,
fully NAT models retains the inference speed-up the large gain proves that improvements from IT
advantages of fully NAT models. Applying DePA are not just from additional layers. The number
to vanilla NAT, GLAT, and SOTA CTC-DSLP- of extra parameters of IT, as from Wq in Eq.2,
MT obtain 15.4×, 15.1×, and 14.7× speed-up is quite small: 512*512=262144 for Transformer-
over the autoregressive Transformer-base (teacher) base on WMT datasets and 256*256=65536 for
(Row1). Overall Table 2 shows that DePA achieves Transformer-small on IWSLT datasets. The large
greater BLEU gains with less speed-up loss than BLEU gain +3.18 from applying IT to vanilla NAT
DSLP on all baselines. These results demonstrate proves vanilla transformer decoder cannot achieve
superiority of DePA over DSLP on improving other similar transformation effectiveness as IT. Table 3
fully NAT models. shows that for language pairs with different lev-
els of source-target vocabulary sharing, such as
4.3 Analysis WMT EN-DE and DE-EN, IWSLT DE-EN, EN-
Ablation Study We analyze the respective effi- RO, and EN-JA, our IT method can achieve con-
cacy of IT and FBD in DePA on the IWSLT16 DE- sistent improvements over GLAT and CTC-DSLP-
EN validation and the WMT and SP EN-JA test sets. MT. Applying IT consistently improves GLAT and
Table 3 shows that FBD and IT improve GLAT by CTC-DSLP-MT although these gains are smaller
+1.26 BLEU/+1.5 ChrF and +0.34 BLEU/+1.0 than gain on vanilla NAT. This is because decoder
ChrF on IWSLT16 DE-EN validation set, respec- input of vanilla NAT only replicates source em-
tively. Considering that GLAT w/FBD has more bedding, whereas GLAT and CTC-DSLP-MT al-
training steps than GLAT, we also train GLAT ready transform decoder input by replacing se-
(400K steps) which has the same training steps lected positions in decoder input with target em-
as GLAT w/FBD for fair comparison. Similar to bedding, hence reducing improvements of IT. Still,
findings on WMT datasets, we observe plateaus gains from w/IT+FBD over w/FBD confirms our
of accuracy on IWSLT and EN-JA datasets from hypothesis that IT can enhance effectiveness of
more training steps than the original 100K. Just FBD. On GLAT, IT+FBD yields +1.4 BLEU/+2.7
training more steps hardly improves the baseline ChrF gains on IWSLT16 DE-EN and +1.43 BLEU
484
IWSLT16 WMT’14 WMT’16
Models DE-EN EN-DE DE-EN EN-RO RO-EN
BLEU ChrF BLEU BLEU
CTC-DSLP-MT 31.04 56.7 27.02 31.61 34.17 34.60
CTC-DSLP-MT w/ IT 31.29 57.1 27.21 31.78 34.32 34.71
CTC-DSLP-MT w/ FBD 31.73 57.5 27.44 31.90 34.60 34.92
CTC-DSLP-MT w/ IT+FBD 31.89 57.8 27.51 31.96 34.68 34.98
IWSLT16 EN-JA
Models DE-EN
BLEU ChrF BLEU
GLAT 29.61 51.8 27.67
GLAT (400K step) 29.68 52.1 –
GLAT w/ IT 29.95 52.8 27.95
GLAT w/ FBD 30.87 53.3 28.87
GLAT w/ IT+FBD 31.01 54.5 29.10
Table 3: Effect of IT and FBD and IT+FBD (i.e., DePA) on the IWSLT16 DE-EN validation set, the WMT and SP EN-JA
test sets. We report mean of BLEU/ChrF from 3 runs with different random seeds. BLEU gains from DePA on SOTA
CTC-DSLP-MT on each set, [0.85, 0.49, 0.51], are larger than std (≤ 0.17).
Table 5: BLEU from different dependency modeling curricula on GLAT. Best results for each set are in bold. NAT denotes
GLAT baseline. F and B denote forward dependency and backward dependency phase respectively (Figure 1). For example,
F-NAT denotes forward dependency training then NAT training.
Compressed
w/o IT 1000 1200 1400 1600 1800 2000
Dimension
BLEU 29.61 29.45 29.56 29.77 29.85 30.39 29.14
486
6 Limitations Junlong Gao, Xi Meng, Shiqi Wang, Xia Li, Shan-
she Wang, Siwei Ma, and Wen Gao. 2019. Masked
Apart from all the advantages that our work non-autoregressive image captioning. arXiv preprint
achieves, some limitations still exist. Firstly, in arXiv:1906.00717.
this work, we investigate the efficacy of apply- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and
ing our proposed DePA approach on the represen- Luke Zettlemoyer. 2019. Mask-predict: Parallel
tative vanilla NAT, the highly competitive fully decoding of conditional masked language models.
NAT model GLAT and current SOTA CTC-DSLP- arXiv preprint arXiv:1904.09324.
MT for fully NAT models, but we have yet to ap-
Alex Graves, Santiago Fernández, Faustino J. Gomez,
ply DePA to iterative NAT models, such as Im- and Jürgen Schmidhuber. 2006. Connectionist tem-
puter (Saharia et al., 2020), CMLM (Ghazvinine- poral classification: labelling unsegmented sequence
jad et al., 2019), and Levenshtein Transformer (Gu data with recurrent neural networks. In Machine
et al., 2019). Hence, the effectiveness of DePA Learning, Proceedings of the Twenty-Third Interna-
tional Conference (ICML 2006), Pittsburgh, Pennsyl-
on iterative NAT models still needs to be veri- vania, USA, June 25-29, 2006, volume 148 of ACM
fied. Secondly, we have not yet incorporated re- International Conference Proceeding Series, pages
ranking approaches such as Noisy Parallel Decod- 369–376. ACM.
ing (NPD) (Gu et al., 2018) into DePA. Thirdly, our
proposed method FBD requires multiple additional Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK
Li, and Richard Socher. 2018. Non-autoregressive
training phases before NAT training, resulting in neural machine translation. In International Confer-
longer training time and using more GPU resources. ence on Learning Representations.
Reducing the computational cost of FBD training
is one future work that will be beneficial for energy Jiatao Gu and Xiang Kong. 2021. Fully non-
autoregressive neural machine translation: Tricks of
saving. Last but not least, NAT models have limita- the trade. In Findings of the Association for Com-
tions on handling long text. They suffer from worse putational Linguistics: ACL-IJCNLP 2021, pages
translation quality when translating relatively long 120–133, Online. Association for Computational Lin-
text. We plan to investigate all these topics in future guistics.
work.
Jiatao Gu, Changhan Wang, and Jake Zhao.
2019. Levenshtein transformer. arXiv preprint
arXiv:1905.11006.
References
Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and
Yu Bao, Hao Zhou, Shujian Huang, Dongqi Wang, Li- Tie-Yan Liu. 2019. Non-autoregressive neural ma-
hua Qian, Xinyu Dai, Jiajun Chen, and Lei Li. 2022. chine translation with enhanced decoder input. In
GLAT: glancing at latent variables for parallel text Proceedings of the AAAI Conference on Artificial
generation. In Proceedings of the 60th Annual Meet- Intelligence, volume 33, pages 3723–3730.
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland, Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen,
May 22-27, 2022, pages 8398–8409. Association for and Tie-Yan Liu. 2020a. Fine-tuning by curriculum
Computational Linguistics. learning for non-autoregressive neural machine trans-
lation. In Proceedings of the AAAI Conference on
Nanxin Chen, Shinji Watanabe, Jesús Villalba, and Na- Artificial Intelligence, volume 34, pages 7839–7846.
jim Dehak. 2019. Listen and fill in the missing letters:
Non-autoregressive transformer for speech recogni- Junliang Guo, Linli Xu, and Enhong Chen. 2020b.
tion. arXiv preprint arXiv:1911.04908. Jointly masked sequence-to-sequence model for non-
autoregressive neural machine translation. In Pro-
Katsuki Chousa, Katsuhito Sudoh, and Satoshi Naka- ceedings of the 58th Annual Meeting of the Associa-
mura. 2019. Simultaneous neural machine trans- tion for Computational Linguistics, pages 376–385.
lation using connectionist temporal classification.
arXiv preprint arXiv:1911.11933. Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei,
Boxing Chen, and Enhong Chen. 2020c. Incor-
Andrew M. Finch and Eiichiro Sumita. 2009. Bidirec- porating bert into parallel sequence decoding with
tional phrase-based statistical machine translation. In adapters. In NeurIPS.
Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing, EMNLP Chenyang Huang, Hao Zhou, Osmar R Zaïane, Lili
2009, 6-7 August 2009, Singapore, A meeting of SIG- Mou, and Lei Li. 2021. Non-autoregressive transla-
DAT, a Special Interest Group of the ACL, pages tion with layer-wise prediction and deep supervision.
1124–1132. ACL. arXiv preprint arXiv:2110.07515.
487
Fei Huang, Hao Zhou, Yang Liu, Hang Li, and Min- Maja Popovic. 2015. chrf: character n-gram f-score
lie Huang. 2022. Directed acyclic transformer for automatic MT evaluation. In Proceedings of the
for non-autoregressive machine translation. CoRR, Tenth Workshop on Statistical Machine Translation,
abs/2205.07459. WMT@EMNLP 2015, 17-18 September 2015, Lis-
bon, Portugal, pages 392–395. The Association for
Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Computer Linguistics.
Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam
Shazeer. 2018. Fast decoding in sequence mod- Matt Post. 2018. A call for clarity in reporting BLEU
els using discrete latent variables. In International scores. In Proceedings of the Third Conference on
Conference on Machine Learning, pages 2390–2399. Machine Translation: Research Papers, WMT 2018,
PMLR. Belgium, Brussels, October 31 - November 1, 2018,
pages 186–191. Association for Computational Lin-
Jungo Kasai, James Cross, Marjan Ghazvininejad, and guistics.
Jiatao Gu. 2020. Non-autoregressive machine trans-
lation with disentangled context transformer. In In- Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin
ternational Conference on Machine Learning, pages Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2021.
5144–5155. PMLR. Glancing transformer for non-autoregressive neural
machine translation. In Proceedings of the 59th An-
Yoon Kim and Alexander M Rush. 2016. Sequence- nual Meeting of the Association for Computational
level knowledge distillation. In Proceedings of the Linguistics and the 11th International Joint Confer-
2016 Conference on Empirical Methods in Natural ence on Natural Language Processing (Volume 1:
Language Processing, pages 1317–1327. Long Papers), pages 1993–2003, Online. Association
for Computational Linguistics.
Diederik P Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In ICLR (Poster). Qiu Ran, Yankai Lin, Peng Li, and Jie Zhou. 2020.
Learning to recover from multi-modality errors for
Diederik P Kingma and Max Welling. 2013. Auto-
non-autoregressive neural machine translation. In
encoding variational bayes. arXiv preprint
Proceedings of the 58th Annual Meeting of the Asso-
arXiv:1312.6114.
ciation for Computational Linguistics, pages 3059–
Jason Lee, Elman Mansimov, and Kyunghyun Cho. 3069.
2018. Deterministic non-autoregressive neural se-
quence modeling by iterative refinement. In Proceed- Chitwan Saharia, William Chan, Saurabh Saxena, and
ings of the 2018 Conference on Empirical Methods Mohammad Norouzi. 2020. Non-autoregressive ma-
in Natural Language Processing, pages 1173–1182. chine translation with latent alignments. In Proceed-
ings of the 2020 Conference on Empirical Methods
Jindřich Libovickỳ and Jindřich Helcl. 2018. End-to- in Natural Language Processing (EMNLP), pages
end non-autoregressive neural machine translation 1098–1108.
with connectionist temporal classification. In Pro-
ceedings of the 2018 Conference on Empirical Meth- Chenze Shao and Yang Feng. 2022. Non-monotonic la-
ods in Natural Language Processing, pages 3016– tent alignments for ctc-based non-autoregressive ma-
3021. chine translation. arXiv preprint arXiv:2210.03953.
Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Raphael Shu, Jason Lee, Hideki Nakayama, and
Zhou Zhao, and Tie-Yan Liu. 2020. Task-level cur- Kyunghyun Cho. 2020. Latent-variable non-
riculum learning for non-autoregressive neural ma- autoregressive neural machine translation with deter-
chine translation. In Proceedings of the Twenty-Ninth ministic inference using a delta posterior. In Proceed-
International Joint Conference on Artificial Intelli- ings of the AAAI Conference on Artificial Intelligence,
gence, IJCAI 2020, pages 3861–3867. ijcai.org. volume 34, pages 8846–8853.
Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neu- Mitchell Stern, William Chan, Jamie Kiros, and Jakob
big, and Eduard Hovy. 2019. Flowseq: Non- Uszkoreit. 2019. Insertion transformer: Flexible se-
autoregressive conditional sequence generation with quence generation via insertion operations. In In-
generative flow. In Proceedings of the 2019 Confer- ternational Conference on Machine Learning, pages
ence on Empirical Methods in Natural Language Pro- 5976–5985. PMLR.
cessing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin,
pages 4282–4292. and Zhi-Hong Deng. 2019. Fast structured decoding
for sequence models. In NeurIPS.
Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Si-
monyan, Oriol Vinyals, Koray Kavukcuoglu, George Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Driessche, Edward Lockhart, Luis Cobo, Florian Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Stimberg, et al. 2018. Parallel wavenet: Fast high- Kaiser, and Illia Polosukhin. 2017. Attention is all
fidelity speech synthesis. In International conference you need. In Advances in neural information pro-
on machine learning, pages 3918–3926. PMLR. cessing systems, pages 5998–6008.
488
Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang
Zhai, and Tie-Yan Liu. 2019. Non-autoregressive
machine translation with auxiliary regularization. In
Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 5377–5384.
Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang
Lin, and Xu Sun. 2019. Imitation learning for non-
autoregressive neural machine translation. In Pro-
ceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1304–
1312.
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Ron-
grong Ji, and Hongji Wang. 2018. Asynchronous
bidirectional decoding for neural machine translation.
In Proceedings of the Thirty-Second AAAI Confer-
ence on Artificial Intelligence, (AAAI-18), the 30th in-
novative Applications of Artificial Intelligence (IAAI-
18), and the 8th AAAI Symposium on Educational
Advances in Artificial Intelligence (EAAI-18), New
Orleans, Louisiana, USA, February 2-7, 2018, pages
5698–5705. AAAI Press.
489
Case #1 Case #2
obwohl sie erwischt wurden , wurden sie schließlich
Source das ist ein Bauplan für Länder wie China und den Iran .
freigelassen aufgrund immensen internationalen Drucks .
even though they were caught , they were eventually
Target Reference this is a blueprint for countries like China and Iran .
released after heavy international pressure .
although they were caught , they were released released
Vanilla NAT this is a blueprint plan for countries like China and and Iran .
because because of huge drug .
although they were caught , they were finally released
Vanilla NAT w/ IT this is a blueprint for countries like China and Iran .
because huge international pressure .
although they were caught , they finally were released
GLAT this is a blueprint plan for countries like China and Iran .
because of of international printing .
although they were caught , they were finally
GLAT w/ IT this is a blueprint for countries like China and Iran .
released after huge international pressure .
Table 7: Case studies of our method IT on the IWSLT16 DE-EN validation set by comparing the translations from
the two baseline models Vanilla NAT and GLAT and from them after applying IT (models in bold). Repetitive tokens
are in red. Source words that are not semantically translated are marked in bold and underlined (under-translation).
Wrong lexical choice (incorrect translations caused by polysemy) and redundant words are in blue.
490
On the Copying Problem of Unsupervised NMT:
A Training Schedule with a Language Discriminator Loss
Yihong Liu*⋄ , Alexandra Chronopoulou*⋄ , Hinrich Schütze*⋄ , and Alexander Fraser*⋄
*
Center for Information and Language Processing, LMU Munich
⋄
Munich Center for Machine Learning (MCML)
{yihong, achron, fraser}@cis.lmu.de
491
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 491–502
July 13-14, 2023 c 2023 Association for Computational Linguistics
Figure 2: The losses (left ordinate) and copying ratios
(right ordinate) of Multi30K English-French pair over
epochs. The normal_dae_loss (resp. normal_bt_loss)
Figure 1: A view of the UNMT architecture. The and normal_copying_ratio are DAE loss (resp. BT
weights of the final fully connected layer (block F) are loss) and copying ratio from the vanilla UNMT. The
tied with the weight of the embedding layer ( block E). ld_dae_loss (resp. ld_bt_loss) and ld_copying_ratio are
DAE loss (resp. BT loss) and the copying ratio from the
UNMT incorporated with the language discriminator.
UNMT model often specifically deals with two
languages, therefore only two translation directions
are considered. Although adding language tags generation of the subsequent tokens. In contrast
(Wu et al., 2021) is effective in addressing the to this setting, using separate word look-up tables
copying problem in multilingual NMT, it is not or separate decoders for involved languages can
a standard process in UNMT. This is because address the problem (Lample et al., 2018; Liu
a language embedding is often added to each et al., 2022). However, such a setting can be
token embedding (Conneau and Lample, 2019; harmful for learning cross-lingual knowledge and
Song et al., 2019; Liu et al., 2022). Language largely increase the number of parameters. In this
embeddings have similar functions to language view, it is desired to keep the structure simple (no
tags: providing information about the language of language-specific architecture) while preventing
each token. Unfortunately, language embeddings the model from decoding in a copying way.
turn out to be not very effective in addressing the
copying problem, especially for low-resource or Objective perspective. Typically, a UNMT
distant language pairs (Kim et al., 2020). Thus, in model is trained by denoising autoencoding (DAE)
this work, we explore why the copying problem (Vincent et al., 2008) and online back-translation
occurs and how we can alleviate it in UNMT. We (BT) (Sennrich et al., 2016) objectives. In DAE
analyze the problem from two perspectives: objective, even though the model is trained to
denoise on two languages simultaneously, there
Architecture perspective. In UNMT, the weight is no guarantee that the model can transfer the
of the final fully connected layer (for obtaining the cross-lingual information that might improve
logits of each word in the vocabulary) is often tied translation between the two languages. In fact,
to the weight of a cross-lingual embedding layer, as shown in Figure 1. That is, the representations of tokens from the two languages are shared in the same space. Although this setting is arguably a better starting point for most modern NMT models, it unfortunately also allows the model to generate a token in an unexpected language at any time step. Furthermore, because of the autoregressive decoder, errors can easily accumulate, as the tokens initially generated by the model highly influence the generation of the subsequent tokens.

Song et al. (2019) empirically find that a pretrained encoder-decoder model with the DAE objective can even perform worse than the model without it, because DAE encourages the model to perform copying. In comparison with DAE, BT is arguably more important, as it tries to directly optimize the translation. However, we find that BT can also "fail" during training. That is, the model can take a shortcut, i.e., copy the input sentence as the intermediate translation and then copy it again for the reconstruction. By taking such a shortcut, the loss of BT can quickly decrease while the copying ratio (Liu et al., 2021), a metric that measures the percentage of generated tokens that are copied from the input, keeps increasing and reaches a high-value plateau, as shown in Figure 2. This indicates that, because there are no constraints on the intermediate translation, the model can always choose the easiest shortcut for BT, which finally corrupts the model's translation capability.

2.2 A Language Discriminator Loss

To avoid such unexpected copying behavior in BT, our intuition suggests that forcing the intermediate generation to be in the correct language would be helpful. Instead of forcing all tokens, we could simply force the first token to be in the correct language, because the first generated token will influence the generation of all the subsequent tokens. Next, the problem is how to force the first generated token to be in the desired target language. An equivalent question is: how can we force the output vector of the decoder at the first time step to be closer to the embedding of a token in the target language? The answer might be trivial. We could use a trained language discriminator (LD), which is a classifier, to classify the first-time-step output vectors of the decoder and then backpropagate the gradients to the main model (encoder and decoder). In this way, the model knows which intermediate language it should generate for the first-time-step token, thereby preventing the copying behavior.

For training the LD, we could use the first-time-step outputs of the decoder in the DAE steps. The LD is trained to predict the language of the first-time-step outputs by minimizing the cross-entropy loss:

    L_LD = E_{x ~ D_l} [ -log p(l | LD(O_l)) ]    (1)

where LD is the language discriminator, O_l are the first-time-step outputs generated by Dec(Enc(x, l), l), and l denotes the language (either src or tgt). Notably, L_LD only backpropagates to the language discriminator in the DAE step. In this way, the discriminator learns to distinguish representations from different languages.

In the BT process, the language discriminator is fixed and the L_LD loss is only used to update the main model, so that it learns to differentiate representations from different languages. Taking src-tgt-src BT as an example, the loss is as follows (cf. line 14 of Algorithm 1):

    L = L_BT + λ_LD * L_LD(O_tgt, tgt)    (2)

where O_tgt are the first-time-step outputs generated in the src-to-tgt step, i.e., Dec(Enc(x, src), tgt). The language discriminator does not have to be used for the next step in BT, i.e., tgt-to-src translation, because there are already ground-truth src-language sentences as supervision. All we need to do is to make sure the intermediate translation is in the correct language. We use a weight λ_LD to control the contribution of the LD loss to the final loss that is used to update the parameters of the main model. Note that the larger the weight, the more the model focuses on the task of distinguishing representations from different languages.

This training schedule is similar to the adversarial loss (Goodfellow et al., 2014) used by Lample et al. (2018), where a discriminator is trained to make the outputs of the encoder language-agnostic, aiming to improve the cross-linguality of a shared encoder. Our aim, however, is different: we want to enable the decoder to generate distinguishable outputs which correctly correspond to the language that the model is expected to generate in the BT process. Algorithm 1 presents the training schedule in detail.

Algorithm 1: Training Schedule
Input: pretrained encoder Enc and decoder Dec, language discriminator LD, source and target monolingual data D_src, D_tgt, maximum finetuning steps T, and coefficient λ_LD;
Output: finetuned encoder Enc and decoder Dec;
 1  t ← 0;
 2  while not converged or t < T do
 3      // for src language do DAE and BT:
 4      B_src ← sample batch from D_src;
 5      // DAE step (below)
 6      B̃_src, O_src ← generate reconstructions and first-time-step outputs from Dec(Enc(noise(B_src), src), src);
 7      detach O_src from the compute graph;
 8      θ_Enc, θ_Dec ← arg min L_DAE(B_src, B̃_src);
 9      θ_LD ← arg min L_LD(O_src, src);
10      // BT step (below)
11      freeze θ_LD;
12      B̃_tgt, O_tgt ← generate tgt-language translations and first-time-step outputs from Dec(Enc(B_src, src), tgt);
13      B̃_src ← generate src-language back-translations from Dec(Enc(B̃_tgt, tgt), src);
14      θ_Enc, θ_Dec ← arg min L_BT(B_src, B̃_src) + λ_LD L_LD(O_tgt, tgt);
15      // for tgt language do the same as above
16      t ← t + 1;
17  end
18  return Enc and Dec;
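To make the schedule concrete, the following PyTorch-style sketch shows how the language discriminator loss of Eq. (1) can be computed and how it enters the DAE and BT updates of Algorithm 1. It is an illustrative sketch rather than the authors' implementation: the tensors dae_loss, bt_loss and dec_outputs, the two optimizers, and the 0/1 language ids are assumed to be provided by the surrounding training loop.

import torch
import torch.nn as nn
import torch.nn.functional as F

def ld_loss(discriminator: nn.Module, first_step_outputs: torch.Tensor,
            lang_id: int) -> torch.Tensor:
    """Cross-entropy of Eq. (1): predict the language of the first-time-step
    decoder outputs. first_step_outputs: (batch, d_model); lang_id: 0=src, 1=tgt."""
    logits = discriminator(first_step_outputs)
    labels = torch.full((first_step_outputs.size(0),), lang_id,
                        dtype=torch.long, device=first_step_outputs.device)
    return F.cross_entropy(logits, labels)

def dae_update(dae_loss, dec_outputs, discriminator, opt_model, opt_ld):
    """DAE step (Algorithm 1, lines 6-9): update the main model with the
    reconstruction loss and train the LD on *detached* first-step outputs."""
    opt_model.zero_grad()
    dae_loss.backward()
    opt_model.step()
    first_step = dec_outputs[:, 0, :].detach()   # line 7: cut the gradient path
    opt_ld.zero_grad()
    ld_loss(discriminator, first_step, lang_id=0).backward()  # line 9, src side
    opt_ld.step()

def bt_update(bt_loss, dec_outputs_src2tgt, discriminator, opt_model, lambda_ld):
    """BT step (Algorithm 1, lines 11-14): the LD is frozen; its loss on the
    intermediate src-to-tgt outputs is added with weight lambda_ld (Eq. 2)."""
    for p in discriminator.parameters():
        p.requires_grad_(False)                  # line 11: freeze the LD
    first_step = dec_outputs_src2tgt[:, 0, :]    # kept in the graph this time
    total = bt_loss + lambda_ld * ld_loss(discriminator, first_step, lang_id=1)
    opt_model.zero_grad()
    total.backward()                             # gradients flow into Enc/Dec only
    opt_model.step()
    for p in discriminator.parameters():
        p.requires_grad_(True)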
Figure 3: Visualizations of the first-time-step output vectors of the decoder in UNMT trained with different weights for the proposed language discriminator loss. The outputs are originally 1024-dimensional; principal component analysis (PCA) is used to project them into a 2-dimensional subspace for ease of visualization. src2src (resp. tgt2tgt) denotes the output in the English-to-English (resp. German-to-German) autoencoding task. src2tgt (resp. tgt2src) denotes the output in the English-to-German (resp. German-to-English) translation task. The sentences used for the visualizations are the same sentences or their corresponding parallel translations.
Example 1
Source input:      a man in an orange hat starring at something.
Reference output:  ein mann mit einem orangefarbenen hut, der etwas anstarrt.
  λ_LD = 0:     a man in an orange hat staring at something.
  λ_LD = 0.01:  ein mann in an orange hat starring at something.
  λ_LD = 0.1:   ein mann in an orange hat gerade etwas bei etwas.
  λ_LD = 1:     ein mann in einem orangefarbenen hut spielt bei etwas.
  λ_LD = 10:    ein mann in einem orangefarbenen hut spielt bei etwas.
  λ_LD = 100:   eine frau in einem orangefarbenen hut spielt bei etwas.

Example 2
Source input:      a boston terrier is running on lush green grass in front of a white fence.
Reference output:  ein boston terrier läuft über saftig-grünes gras vor einem weißen zaun.
  λ_LD = 0:     a boston dog is running on leafy grass in front of a white fence.
  λ_LD = 0.01:  ein boston terrier läuft auf einem gepflasterten grünen grass in front of a white fence.
  λ_LD = 0.1:   ein boston terrier läuft auf einem grünen rasen vor einem weißen zaun.
  λ_LD = 1:     ein boston terrier läuft auf einem grünen rasen vor einem weißen zaun.
  λ_LD = 10:    ein boston terrier läuft auf einem grünen gras vor einem weißen zaun.
  λ_LD = 100:   eine boston terrier läuft auf grünen gras vor einem weißen zaun.

Table 1: Examples of translations from the model trained on the Multi30K dataset (En-De pair) with different weights λ_LD for the language discriminator loss. We do not use beam search to generate these translations.
λ_LD    En→De          De→En          En→Fr          Fr→En
0       0.22 (87%)     0.19 (84%)     0.14 (89%)     0.10 (83%)
0.01    15.78 (42%)    22.04 (24%)    24.73 (24%)    22.15 (25%)
0.1     25.91 (14%)    28.46 (15%)    39.72 (6%)     37.50 (7%)
1       27.96 (12%)    30.05 (12%)    42.74 (5%)     39.02 (6%)
10      24.35 (14%)    25.60 (13%)    41.26 (5%)     37.61 (6%)
100     20.66 (12%)    26.74 (10%)    30.65 (5%)     32.10 (7%)

Table 2: BLEU scores and copying ratios (in parentheses) of models trained with different weights λ_LD on the Multi30K dataset. When the weight λ_LD = 0, the model degenerates to the vanilla UNMT model.

BLEU scores decrease while copying ratios remain at the same level as the weight increases beyond 1. This indicates that the model over-emphasizes distinguishing the outputs when the weights are large. Therefore, moderate weights, e.g., 1, might be optimal if we want to alleviate the copying problem while achieving good translation performance.

When λ_LD = 0, poor BLEU scores are obtained because of the copying problem. We see that all copying ratios in Table 2 are very high: more than 80% for all directions. Example translations from the En-De translation model in Table 1 show that when λ_LD = 0, the MT system simply copies the input sentences. It is very clear that as the weight increases, the model becomes less likely to copy words from the source input into the output translation. However, when the weight is too large, e.g., λ_LD = 100, the translation model makes obvious mistakes. For example, "man" in English is wrongly translated to "frau" (meaning woman) in German, and "a" is wrongly translated into "eine", although "boston terrier" is a masculine rather than a feminine noun. Moderate weights, e.g., λ_LD = 1, achieve the best performance while producing fewer errors.

To figure out how the LD loss influences the representations, i.e., the first-time-step output vectors generated by the decoder, we visualize these vectors in 2D using principal component analysis (PCA), as shown in Figure 3. The visualization verifies the relationship between the outputs and the occurrence of the copying problem. src2tgt and tgt2tgt first-time-step outputs should be close to each other in the subspace, as they are both used to directly generate target-language sentences. However, in Fig. 3 (a), when λ_LD = 0, src2tgt and src2src are located together while tgt2src and tgt2tgt are together. In contrast, when the LD loss is imposed, e.g., λ_LD = 1 (Fig. 3 (d)), the outputs are distributed as we expect: src2tgt and tgt2tgt are located together, and tgt2src and src2src are together.
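A projection like the one in Figure 3 can be produced with standard tooling. The snippet below is a self-contained sketch using scikit-learn and matplotlib, with randomly generated vectors standing in for the real 1024-dimensional first-time-step outputs; the group names mirror the src2src/src2tgt/tgt2src/tgt2tgt labels of the figure.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# outputs: dict mapping a group name to an array of first-time-step decoder
# outputs with shape (num_sentences, 1024). Random data keeps the sketch runnable.
rng = np.random.default_rng(0)
outputs = {name: rng.normal(size=(50, 1024))
           for name in ["src2src", "src2tgt", "tgt2src", "tgt2tgt"]}

# Fit PCA on all vectors jointly, then project each group into the same 2D subspace.
all_vecs = np.concatenate(list(outputs.values()), axis=0)
pca = PCA(n_components=2).fit(all_vecs)

for name, vecs in outputs.items():
    xy = pca.transform(vecs)
    plt.scatter(xy[:, 0], xy[:, 1], label=name, s=10)
plt.legend()
plt.title("First-time-step decoder outputs (PCA projection)")
plt.savefig("first_step_outputs_pca.png", dpi=150)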
Models         En→De   De→En   En→Fr   Fr→En   En→Ru   Ru→En   En→Zh   Zh→En
XLM baseline   20.51   25.99   22.87   25.88   14.10   16.92   6.36    4.28
XLM (+ LD)     20.40   25.85   21.22   26.92   13.49   16.12   6.80    4.69

Table 3: BLEU scores of the XLM baseline and the same model enhanced with the LD loss on high-resource language pairs. The scores of the baseline are obtained by reproducing the published code (Conneau and Lample, 2019).
Models      En-De   En-Fr   En-Ru   En-Zh   En-Kk   En-Gu
baseline    18%     23%     11%     29%     57%     68%
(+ LD)      19%     25%     11%     24%     42%     52%
∆           +1%     +2%     -0%     -5%     -15%    -14%

Table 4: The copying ratio for each language pair for the XLM baseline and the LD model. The average of the ratios of the two directions for each language pair is reported. The translations used to compute the ratios are the same as the translations used for BLEU in Table 3 and Table 5.

Models                En→Kk   Kk→En   En→Gu   Gu→En
XLM baseline (512)    0.80    2.00    0.60    0.60
XLM baseline (1024)   1.80    1.59    2.12    0.54
XLM (+ LD)            2.03    1.70    3.55    0.64

Table 5: BLEU scores of the XLM baseline and the same model enhanced with the LD loss on low-resource language pairs. The scores of baseline (512) are copied from Kim et al. (2020). As in the setting for the high-resource languages, we reproduced XLM with 1024-dim embeddings to obtain the scores for baseline (1024).
Table 6: Examples of translations from Kazakh to English by the XLM baseline (1024) and XLM (+LD) from Table 5. The examples show that XLM (+LD) suffers less from the copying problem, but it can generate incorrect tokens that do not match the semantics of the input sentence.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, and Zhaopeng Tu. 2021. On the copying behaviors of pre-training for neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4265–4275, Online.

Yihong Liu, Haris Jabbar, and Hinrich Schuetze. 2022. Flow-adapter architecture for unsupervised machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1253–1266, Dublin, Ireland.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Kelly Marchisio, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? In Proceedings of the Fifth Conference on Machine Translation, pages 571–583, Online.

Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97, pages 5926–5936.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online.

Liwei Wu, Shanbo Cheng, Mingxuan Wang, and Lei Li. 2021. Language tags matter for zero-shot neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3001–3007, Online.

Yilin Yang, Akiko Eriguchi, Alexandre Muzio, Prasad Tadepalli, Stefan Lee, and Hany Hassan. 2021. Improving multilingual translation by representation and gradient regularization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7266–7279, Online and Punta Cana, Dominican Republic.

Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020. CSP: Code-switching pre-training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2624–2636, Online.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online.

A Appendix

A.1 Scores of Other Metrics

In addition to BLEU scores, we also compute scores with other metrics: chrF (Popović, 2015) in Table 9 and Table 7, COMET (Rei et al., 2020) in Table 10 and Table 8, and confidence intervals of BLEU scores (Koehn, 2004) in Table 11, Table 12 and Table 13. The translations used for computing these scores are the same as the translations used to compute the BLEU scores in Table 3 and Table 5.

To quantify the copying problem, we use the copying ratio proposed by Liu et al. (2021), which is defined as follows:

    Ratio = ( Σ_{i=1}^{I} count(copying tokens) ) / ( Σ_{i=1}^{I} count(tokens) )    (3)

where I denotes the total number of sentences in the test set, copying tokens are those tokens in the translation that are directly copied from the source language, and the denominator is the total number of tokens in the generated translations. This metric directly reflects the degree of the copying behavior of the translation model: the higher the copying ratio, the more the model tends to copy instead of translating (a minimal sketch of this computation is given at the end of this subsection). We report the average of the copying ratios of the two translation directions for each language pair in Table 4.

We can see that the copying problem of the XLM baseline models is very obvious for the low-resource language pairs, i.e., En-Kk and En-Gu. When the language discriminator loss is introduced, the copying ratios decrease by more than 10%. We also notice that XLM (+LD) has a less obvious copying problem than the baseline for the En-Zh pair, a distant language pair. For the other language pairs, the copying problem is not that severe, and therefore introducing the language discriminator loss does not change the ratios much.
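The sketch below shows one simple way to compute this ratio over a test set. Treating a generated token as "copied" whenever it also occurs in the corresponding source sentence is our simplifying assumption here; the exact matching criterion of Liu et al. (2021) may differ.

from typing import List

def copying_ratio(sources: List[str], hypotheses: List[str]) -> float:
    """Corpus-level copying ratio of Eq. (3): fraction of generated tokens
    that also appear in the corresponding source sentence."""
    copied, total = 0, 0
    for src, hyp in zip(sources, hypotheses):
        src_tokens = set(src.split())
        hyp_tokens = hyp.split()
        copied += sum(1 for tok in hyp_tokens if tok in src_tokens)
        total += len(hyp_tokens)
    return copied / total if total > 0 else 0.0

# A fully copied "translation" yields a ratio of 1.0.
print(copying_ratio(["a man in an orange hat"], ["a man in an orange hat"]))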
A.2 Model Details

In Section 3.2, we use the pretrained XLM models from HuggingFace [6] (Wolf et al., 2020) (xlm-mlm-enfr-1024, xlm-mlm-ende-1024) to initialize a shared encoder and randomly initialize a shared decoder. A single embedding layer (containing the words/subwords of both the source and target languages) from the pretrained encoder is used. The weight of the final fully connected layer is tied with the embedding layer. The parameters of the encoder are fixed except for this embedding layer, which is also used by the decoder. The embedding size is 1024 and the hidden size of the decoder is 512. The decoder has 8 heads and 3 layers. We follow the denoising autoencoding hyperparameter settings used by Lample et al. (2018) and the training schedule of Liu et al. (2022), i.e., first fine-tuning the models with only the DAE loss and the LD loss for the language discriminator for the first 2 epochs, then fine-tuning the models with all losses (including BT) for the rest of the epochs. We set the batch size to 32 and use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0001. We stop the training when the model does not improve the BLEU scores on the validation set for 5 epochs. We do not use beam search to generate translations for Multi30K.

In Section 3.3, we pretrain our own cross-lingual language models for each language pair based on the XLM code base [7] (Conneau and Lample, 2019). The encoder and decoder are then both initialized with the same cross-lingual pretrained model. The recommended hyperparameters for the model architecture are used, i.e., 1024 for the embedding size, 4096 for the hidden size, and 8 heads and 6 layers for the transformer blocks. We follow the recommended pretraining as well as UNMT fine-tuning hyperparameters from XLM. We only change the hyperparameter tokens_per_batch to 250 to adapt to small- or moderate-memory GPUs. We generate the translations using beam search of size 5. These translations are used to compute the scores in all the WMT-related experiments.

For the language discriminator, we simply use a feed-forward neural network (FFNN). The language discriminator has two hidden layers, and each layer has the same dimension as the embedding, i.e., 1024, for both Multi30K and the WMT-related experiments. The output dimension is two, which corresponds to the number of language domains we want to classify into, as we have two languages involved in the training for each model.

[6] https://github.com/huggingface
[7] https://github.com/facebookresearch/XLM
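A PyTorch sketch of such a discriminator, following the stated configuration (two hidden layers of the embedding dimension and a 2-way output), is shown below. The choice of ReLU activations is our assumption, since the activation function is not specified in the paper.

import torch
import torch.nn as nn

class LanguageDiscriminator(nn.Module):
    """Feed-forward classifier over first-time-step decoder outputs:
    two hidden layers of size emb_dim and a 2-way output (src vs. tgt)."""

    def __init__(self, emb_dim: int = 1024, num_langs: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, emb_dim),
            nn.ReLU(),                      # activation is an assumption
            nn.Linear(emb_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, num_langs),  # logits over {src, tgt}
        )

    def forward(self, first_step_outputs: torch.Tensor) -> torch.Tensor:
        # first_step_outputs: (batch, emb_dim) -> (batch, num_langs)
        return self.net(first_step_outputs)

# Quick shape check with dummy data.
ld = LanguageDiscriminator()
print(ld(torch.randn(4, 1024)).shape)  # torch.Size([4, 2])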
Models         En→Kk   Kk→En   En→Gu   Gu→En
XLM baseline   8.85    7.61    7.95    4.76
XLM (+ LD)     11.78   10.09   11.71   7.12
Models         En→De   De→En   En→Fr   Fr→En   En→Ru   Ru→En   En→Zh   Zh→En
XLM baseline   45.09   48.20   44.99   49.93   34.75   38.56   16.11   19.08
XLM (+ LD)     44.42   48.20   42.94   50.50   34.39   36.56   16.74   20.45

Table 9: chrF scores (Popović, 2015) of the XLM UNMT baseline as well as the XLM model with the language discriminator on high-resource language pairs (the translations used are the same as those used in Table 3 for BLEU scores).
Table 10: COMET scores (Rei et al., 2020) of the XLM UNMT baseline as well as the XLM model with the language discriminator on high-resource language pairs (the translations used are the same as those used in Table 3 for BLEU scores). We use the wmt20-comet-da model to evaluate the translations.
Table 11: 95% confidence intervals for the BLEU scores of the XLM UNMT baseline as well as the XLM model with the language discriminator on the En-De and En-Fr pairs (the translations used are the same as those used in Table 3 for BLEU scores). Differences between bold results are statistically significant at p = 0.05. For the statistical test, we use paired bootstrap resampling (Koehn, 2004).
Table 12: 95% confidence intervals for the BLEU scores of the XLM UNMT baseline as well as the XLM model with the language discriminator on the En-Ru and En-Zh pairs (the translations used are the same as those used in Table 3 for BLEU scores). Differences between bold results are statistically significant at p = 0.05. For the statistical test, we use paired bootstrap resampling (Koehn, 2004).
Table 13: 95% confidence intervals for the BLEU scores of the XLM UNMT baseline as well as the XLM model with the language discriminator on the En-Kk and En-Gu pairs (the translations used are the same as those used in Table 5 for BLEU scores). Differences between bold results are statistically significant at p = 0.05. For the statistical test, we use paired bootstrap resampling (Koehn, 2004).
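The significance tests behind Tables 11-13 follow paired bootstrap resampling (Koehn, 2004). The snippet below is a generic re-implementation of that procedure on top of sacrebleu's corpus_bleu, not the script used for the paper; the number of resamples and the random seed are arbitrary choices.

import random
from typing import List, Tuple
import sacrebleu

def paired_bootstrap(refs: List[str], sys_a: List[str], sys_b: List[str],
                     n_samples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Paired bootstrap resampling (Koehn, 2004): resample test sentences with
    replacement and count how often each system wins on corpus BLEU.
    Returns the fraction of samples won by system A and by system B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = wins_b = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = [refs[i] for i in idx]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        bleu_a = sacrebleu.corpus_bleu(a, [r]).score
        bleu_b = sacrebleu.corpus_bleu(b, [r]).score
        if bleu_a > bleu_b:
            wins_a += 1
        elif bleu_b > bleu_a:
            wins_b += 1
    return wins_a / n_samples, wins_b / n_samples

# If one system wins in at least 95% of the resamples, the difference is
# significant at p = 0.05; per-sample BLEU scores can likewise be collected
# to read off a 95% confidence interval from their 2.5th/97.5th percentiles.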
Author Index
Abela, Kurt, 433 Currey, Anna, 1
Adi, Yossi, 465
Agrawal, Saurabh, 449 Dabre, Raj, 169
Agrawal, Sweta, 1 Dai, Lirong, 102, 194
Al-Badrashiny, Mohamed, 62 Darwish, Kareem, 62
Anastasopoulos, Antonios, 1, 269 Declerck, Thierry, 1
Anderson, Tim, 130 DeMarco, Andrea, 433
Anh Dinh, Tu, 113 Deng, Pan, 102
Anh Nguyen, Tu, 465 Di Gangi, Mattia, 251
Arora, Siddhant, 235 Diab, Mona, 62
Doi, Kosuke, 330
Bahar, Parnia, 251 Dong, Qianqian, 1
Bai, Yu, 478 Du, Yichao, 79
Bakhturina, Evelina, 442 Duan, Richeng, 202
Bamfo Odoom, Bismarck, 302 Duh, Kevin, 1, 130
Basmatkar, Pranjali, 321 Dupoux, Emmanuel, 465
Bataev, Vladimir, 442
Beneš, Karel, 227 E. Ortega, John, 1, 261
Bentivogli, Luisa, 1 Elleuch, Haroun, 219
Berard, Alexandre, 144 Estève, Yannick, 1, 219
Billinghurst, Hannah, 433
Binh Nguyen, Thai, 113 Federico, Marcello, 1
Bojar, Ondřej, 1, 169, 389 Fonollosa, Jose, 397
Borg, Claudia, 1, 433 Fraser, Alexander, 491
Born, Logan, 291 Fukuda, Ryo, 330, 363
Bougares, Fethi, 219
Bär, Martin, 433 Gahbiche, Souhir, 1, 219
Gaido, Marco, 159
Calapodescu, Ioan, 144 Ganesan, Ashwinkumar, 241
Cao, Yiqing, 311 Gao, Yang, 478
Carpuat, Marine, 1 Gat, Itai, 465
Cattoni, Roldano, 1 Ginsburg, Boris, 442
Cettolo, Mauro, 1 Gow-Smith, Edward, 144
Chen, Boxing, 478 GUO, Jiaxin, 138, 277, 376, 383
Chen, Enhong, 79 Guo, Jiaxin, 180
Chen, Hao, 211 Guo, Yuhang, 411, 455
Chen, Mingda, 1 Gwinnup, Jeremy, 130
Chen, Peikun, 311
Chen, Qian, 478 Haddow, Barry, 1
Chen, Shihao, 102 Han, Yuchen, 211
Chen, Shuoying, 455 Hansen, Eric, 130
Chen, William, 1, 235, 261 Hrinchuk, Oleksii, 442
Chen, Xiaoyu, 180, 187, 376, 383 Hsu, Benjamin, 1
Choukri, Khalid, 1 Huang, Wuwei, 411
Chronopoulou, Alexandra, 1, 491 Hubert, Rebekka, 89
Chu, Chenhui, 357 Hussein, Amir, 283
Copet, Jade, 465 Huzaifah, Muhammad, 202
Cui, Jianwei, 194
I. Gállego, Gerard, 397 Mbuya, Jonathan, 269
Inaguma, Hirofumi, 1 McNamee, Paul, 1
Iranzo-Sánchez, Javier, 251 Mdhaffar, Salima, 219
Micallef, Kurt, 433
Jain, Aditi, 341 Min Tan, Kye, 202
Javorský, Dávid, 1 Minghan, Wang, 187
Jiang, Ning, 311 Mon Htut, Phu, 1
Jiang, Yanfei, 180 Moon, Hyeonseok, 420
JiaXin, Guo, 187 Mozib Samin, Ahnaf, 433
Judge, John, 1 Mullov, Carlos, 113
Murray, Kenton, 1
Kambhatla, Nishant, 291, 341
Kano, Yasumasa, 1, 330, 363 Nadejde, Maria, 1
Kesiraju, Santosh, 227 Nakamura, Satoshi, 1, 330, 363
Khudanpur, Sanjeev, 283, 302 Nam Nguyen, Tuan, 113
Khurana, Sameer, 219 Negri, Matteo, 1, 159
Ko, Tom, 1 Nguyen, Ha, 1, 219
Ko, Yuka, 330, 363 Niehues, Jan, 1, 62, 113, 389
Koneru, Sai, 113 Nishikawa, Yuta, 330, 363
Kr. Ojha, Atul, 1 Niu, Xing, 1
Kreuk, Felix, 465
Kumar, Rishu, 1, 433 Ore, Brian, 130
ShaoJun, Li, 187
Shi, Jiatong, 1, 235 Xiao, Cihan, 283
Shimizu, Shuichiro, 357 Xiao, Tong, 211
Sokolov, Artem, 89 Xie, Lei, 311
Song, Kun, 311 Xie, Yuhao, 138, 180, 376
Sperber, Matthias, 1 Xie, Zhihang, 123
Stüker, Sebastian, 1 Xinyuan, Henry Li, 302
Su, Jinsong, 411 Xu, Chen, 211
Sudoh, Katsuhito, 1, 330, 363 Xu, Luzhen, 194
Synnaeve, Gabriel, 465 Xu, Tong, 79
Xue, Ran, 241
Tang, Yun, 1
Tao, Shimin, 277 Yan, Brian, 235
Thebaud, Thomas, 283 Yanagita, Tomoya, 330
Thiol, Antoine, 219 Yang, Fengyu, 411
Thompson, Brian, 1 Yang, Hao, 138, 180, 187, 277, 376, 383
Tian, Jinchuan, 79 Yang, Jinlong, 138, 180
Tian, Yanzhi, 411 Yang, Zhengdong, 357
Tikhonov, Maksim, 227 Yavuz Ugan, Enes, 113
Tran, Kevin, 1 Ye, Zhongyi, 194
Tsiamas, Ioannis, 397 Yu, Jianwei, 79
Tu, Zhaopeng, 79 YU, Zhengzhe, 180, 187, 383
Turchi, Marco, 1 Yu, Zhengzhe, 138, 376
Tüske, Zoltán, 251 YuHao, Xie, 187