Efficient Neural Speech Synthesis For Low-Resource Languages Through Multilingual Modeling
Marcel de Korte, Jaebok Kim, Esther Klabbers
ReadSpeaker
Huis ter Heide, the Netherlands
{marcel.korte,jaebok.kim,esther.judd}@readspeaker.com
Abstract

Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements for a new voice, this approach is usually not viable for many low-resource languages for which abundant multi-speaker data is not available. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to that of monolingual multi-speaker models, and saw that target language naturalness was affected by the strategy used to add foreign language data.

Index Terms: neural TTS, sequence-to-sequence models, multilingual synthesis, multi-speaker models, data reduction

1. Introduction

Over the past few years, developments in sequence-to-sequence (S2S) neural text-to-speech (TTS) research have led to synthetic speech that sounds almost indistinguishable from human speech (e.g. [1, 2, 3]). However, large amounts of high-quality recordings from a professional voice talent are typically required to train models of such quality, which can make them prohibitively expensive to produce. To counter this issue, investigating how S2S models can exploit multi-speaker data has become a popular topic of research [4, 5, 6]. A study by [7], for example, showed that multi-speaker models can perform as well as or even better than single-speaker models when large amounts of target speaker data are not available, and that single-speaker models only perform better when substantial amounts of data are used. Their research also showed that the amount of data necessary for an additional speaker can be as little as 1250 or 2500 sentences without significantly reducing naturalness. With regard to parametric synthesis, [8] investigated the effect of several multi-speaker modeling strategies for class-imbalanced data. They found that for limited amounts of speech, multi-speaker modeling and oversampling could improve speech naturalness compared to single-speaker models, while undersampling was found to generally have a harmful effect. They also showed that ensemble methods can further improve naturalness, but this strategy comes with a considerable computational cost that is usually not feasible for S2S modeling.

Although the above research shows that multi-speaker modeling can be an effective strategy to reduce data requirements, it is not a suitable solution for many languages for which large quantities of high-quality multi-speaker data are not available. Multilingual multi-speaker synthesis aims to address this issue by training a multilingual model on the data of multiple languages. Among the first to propose a neural approach to multilingual modeling was [9]. Instead of modeling languages separately, they modeled language variation through cluster adaptive training, where a mean tower as well as language basis towers were trained. They found that multilingual modeling did not harm naturalness for high-resource languages, while low-resource languages benefited from it. Another study by [10] scaled up the number of unseen low-resource languages to twelve, and similarly found that multilingual models tend to outperform single-speaker models.

More recently, multilingual modeling has also been adopted in S2S architectures [11, 12, 13, 14, 15, 16], though mostly for the purposes of code-mixing and cross-lingual synthesis. Language information was typically represented either with a language embedding [12, 15] or with a separate encoder for each language [11], while [13] applied both approaches to code-mixing and accent conversion. With regard to multilingual modeling, [12] showed that multilingual models can attain a naturalness and speaker similarity comparable to that of a single-speaker model for high-resource target languages, while [16] obtained promising results with a cross-lingual transfer learning approach.

While research into S2S multilingual modeling is clearly vibrant, there appears to be little systematic research into how S2S multilingual models could be used to increase speech naturalness for low-resource languages. To fill this void, this paper investigated to what extent results found in S2S monolingual multi-speaker modeling are transferable to multilingual multi-speaker modeling, and whether it is possible to attain higher naturalness on low-resource languages with multilingual models than with single-speaker models. Because multilingual modeling can benefit from the inclusion of large amounts of non-target language data, we also experimented with several data addition strategies and evaluated to what extent these strategies are effective in improving naturalness for low-resource languages. As this research primarily addresses the viability of different approaches for low-resource languages, our focus is not so much on maximizing naturalness as on gaining a better understanding of how different strategies work and how they would potentially scale up with larger amounts of data.

The rest of this paper is organized as follows. In Section 2, we describe the architecture used to conduct our experiments. In Section 3, we describe the experimental design and give details about training and evaluation. In Section 4, we provide the experimental results. Finally, in Section 5, we discuss conclusions and directions for future research.
2. System architecture

2.1. S2S Acoustic model

The architecture used in this paper for acoustic modeling is based on VoiceLoop [17]. This architecture is appealing for several reasons: it is relatively small, which makes it more suitable for training with smaller amounts of data; the model takes relatively little time to train; and it disentangles speaker information well for seen speakers [18]. To make the architecture suitable for multilingual modeling and to increase its naturalness and robustness, we made several changes. First, we incorporated a separate encoder for each language to disentangle language information, similar to [11]. We empirically found that representing language information this way was more effective than using a language embedding. This language encoder converts phonemes from a language-dependent phone set into 256-dimensional embeddings. Second, we added a 3-layer convolutional prenet N_pr in the style of [1] to better model phonetic context. Third, we added a two-layer LSTM recurrency N_r with 512 nodes to the decoder to better retain long-term information. The model was trained to produce 80-dimensional mel-spectrogram features in a way similar to [1]. The resulting architecture is visualized in Figure 1.

[Figure 1: The modified S2S acoustic model; recovered diagram labels include the attention network (N_a) and the update network (N_u).]
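To make the modifications concrete, the following is a minimal PyTorch-style sketch of the components named in Section 2.1: per-language encoders over language-dependent phone sets, a 3-layer convolutional prenet, and a two-layer LSTM recurrency in the decoder. Only the dimensions stated in the text (256-dimensional embeddings, 512 LSTM nodes, 80-dimensional mel targets) come from the paper; module names, kernel sizes, phone-set sizes, and language codes are illustrative assumptions, and the surrounding VoiceLoop buffer and attention machinery is omitted.

# Hedged sketch (PyTorch) of the per-language encoder, prenet, and decoder LSTM.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """One encoder per language: maps a language-dependent phone set to 256-dim embeddings."""
    def __init__(self, phone_set_size, emb_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(phone_set_size, emb_dim)

    def forward(self, phoneme_ids):           # (batch, time)
        return self.embedding(phoneme_ids)    # (batch, time, 256)

class ConvPrenet(nn.Module):
    """3-layer convolutional prenet (N_pr) to model phonetic context; kernel size is assumed."""
    def __init__(self, dim=256, kernel_size=5):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3)])

    def forward(self, x):                      # (batch, time, 256)
        return self.layers(x.transpose(1, 2)).transpose(1, 2)

class DecoderRecurrence(nn.Module):
    """Two-layer LSTM recurrency (N_r) with 512 nodes, projecting to 80-dim mel frames."""
    def __init__(self, in_dim=256, hidden=512, n_mels=80):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.mel_proj = nn.Linear(hidden, n_mels)

    def forward(self, context):                # (batch, time, 256) attention context
        out, _ = self.lstm(context)
        return self.mel_proj(out)              # (batch, time, 80)

# One encoder per language, as described in the text; language codes are placeholders.
encoders = nn.ModuleDict({lang: LanguageEncoder(phone_set_size=60)
                          for lang in ["nl", "fr", "de"]})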
It was suggested by [19] that the model might become less robust if the variation in the class weights becomes too large. To counter this effect, we applied a square root operation to the weights and found that this led to better naturalness compared to both the unbalanced and the balanced weights. The weights were then normalized to correct for the square root operation, where j is the index that iterates over the number of classes:

    nα_i = α_i × c / (Σ_j^N c_j × α_j)    (2)
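As a worked illustration of this weighting scheme, the snippet below applies the square root to a set of class weights and renormalizes them as in Eq. (2). The per-class counts c_j and the initial weights are made-up example values, the choice of balanced starting weights is illustrative, and reading the constant c in the numerator as the total sample count is our assumption rather than a detail stated in the text.

# Hedged sketch of the class-weight computation: square-root compression of the
# raw weights, followed by the normalization of Eq. (2).
import numpy as np

counts = np.array([20000, 3000, 800])            # c_j: samples per class (e.g. per speaker/language)
raw_weights = counts.sum() / counts              # balanced weights, inversely proportional to counts

alpha = np.sqrt(raw_weights)                     # square root to limit the variation in the weights
c = counts.sum()                                 # assumed meaning of c in Eq. (2)
norm_alpha = alpha * c / np.sum(counts * alpha)  # nα_i = α_i × c / Σ_j (c_j × α_j)

print(norm_alpha)                                # weighted sample total now matches the unweighted total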
3. Experimental setup

In this paper, we aimed to answer the following research questions:

1. To what extent does adding data from non-target language speakers increase the naturalness for various amounts of data from a low-resource language?
2. How does replacing monolingual multi-speaker models with multilingual multi-speaker models affect speech naturalness?
3. In what way can additional non-target language data best be added to improve the naturalness of low-resource target language speech?

Two listening experiments were designed to answer these research questions.