Developing Concatenative Based Text To Speech


Internet of Things and Cloud Computing

2020; 8(2): 24-30


http://www.sciencepublishinggroup.com/j/iotcc
doi: 10.11648/j.iotcc.20200802.12
ISSN: 2376-7715 (Print); ISSN: 2376-7731 (Online)

Developing Concatenative Based Text to Speech Synthesizer for Tigrigna Language
Mezgebe Araya Keletay 1, 2, *, Hussien Seid Worku 2
1 Department of Computer Science, School of Computing and Informatics, Mizan-Tepi University, Tepi, Ethiopia
2 Department of Computer Science, College of Engineering and Technology, Arba-Minch Institute of Technology, Arba-Minch, Ethiopia

Email address:
* Corresponding author

To cite this article:


Mezgebe Araya Keletay, Hussien Seid Worku. Developing Concatenative Based Text to Speech Synthesizer for Tigrigna Language. Internet
of Things and Cloud Computing. Vol. 8, No. 2, 2020, pp. 24-30. doi: 10.11648/j.iotcc.20200802.12

Received: February 24, 2020; Accepted: March 10, 2020; Published: October 27, 2020

Abstract: A Text-To-Speech (TTS) synthesizer is a computer-based system that reads any text and converts it into speech
that resembles a native speaker of the language as closely as possible. This work describes the first Text-to-Speech (TTS)
system for the Tigrigna language, built on a speech synthesis architecture in MATLAB. The system is based on
concatenative synthesis and applies the linear predictive coding (LPC) technique. The performance of the system is measured,
and the quality of the synthesized speech is assessed in terms of intelligibility and naturalness. The synthesizer is evaluated
at two levels: the word level and the sentence level. At the word level, the evaluation uses the online NeoSpeech tool; most of
the words are recognizable, and the overall performance of the system is found to be 78%. At the sentence level, intelligibility
and naturalness are measured on the Mean Opinion Score (MOS) scale and are found to be 3.28 and 3.27, respectively.
These values are encouraging and show that diphone speech units are good candidates for developing a fully functional
speech synthesizer. However, there are areas that can be improved: a text analyzer that handles the zonal dialects of the
language and a prosody generator both need further investigation.
Keywords: Concatenative Approach, Speech Synthesis, Tigrigna Syllables, Text-to-Speech

1. Introduction

Language is a fundamental part of everyday human life. Whether we use speech, sign language, emotion, or a coding system that conveys meaning through touch, we use language to express our thoughts, intentions, reactions, and experiences [1]. A text-to-speech (TTS) synthesizer transforms linguistic information stored as data or text into speech. Nowadays it is most widely used in audio reading devices for visually impaired people. TTS is one of the major applications of natural language processing (NLP). The NLP module of a general TTS synthesizer consists of a pre-processor, text analyzer, contextual analyzer [2], syntactic-prosodic parser, letter-to-sound module, and prosody generator. Synthesized speech can be created by concatenating pieces of recorded speech stored in a database. Speech is often based on the concatenation of natural speech; that is, units [3] taken from natural speech are put together to form a word or sentence.

TTS synthesis systems have a wide range of applications in everyday life, and a text-to-speech synthesizer is used to vocalize processed content [4]. In the last decade, researchers have done a great deal of work on TTS synthesis in various languages and with different synthesis techniques, such as unit selection, formant, hidden Markov model, and articulatory synthesis [4]. To make computer systems more interactive and helpful to users, especially physically and visually impaired people and people who cannot read, [5] TTS synthesis systems are in great demand for the Ethiopian languages.

Research in the area of speech synthesis has been driven by the growing importance of many new applications. These include information retrieval services over the telephone, such as banking services; public announcements at places like train stations; and reading out manuscripts to gatherings [7].

Speech synthesis has also found applications in tools for reading emails, faxes, and web pages over the telephone and in voice output for automatic translation systems. Special equipment for the physically challenged, [8] such as word processors with reading-out capability, book-reading aids for the visually challenged, and speaking aids for the vocally challenged, also uses speech synthesis.

The growing popularity of speech-enabled computer interfaces demands high-quality speech output, particularly for telephone applications. The perceived quality of standard general-purpose text-to-speech (TTS) systems is not good enough, [16] which forces application developers to use pre-recorded prompts, drastically reducing the flexibility of text generation. Recent improvements in limited-domain synthesis have been in the context of concatenative synthesis, with a focus on methods for combining whole phrases and words with sub-word units for infrequent or new words. Little or no attention has been paid to natural prosody generation, on the assumption that it is accounted for in the phrase-size units. However, [9] as the complexity of the domain increases, there is more room for prosodic variability that must be accounted for to achieve natural speech.

1.1. The Tigrigna Language

Tigrigna, often written as Tigrinya (ትግርኛ), is a language spoken in East African countries such as Eritrea and Ethiopia. It is one of the two official languages of Eritrea. It is also a working language of the Tigray region of Ethiopia. According to the 2015 census conducted by the Central Statistical Agency of Ethiopia (CSA), the Tigray Region has a population of 6.3 million, of whom around 4.3 million are native Tigrigna speakers, and according to Ethnologue there are 2.4 million Tigrigna speakers in Eritrea [10].

The script of Tigrigna is phonetic in nature. It has 35 consonants and 7 vowels [6]. The orthographic representation of the language is organized into orders. Each of the 35 consonants has seven orders (derivatives). Out of the 35 consonants, four are diphthongs. Of the seven orders, six are consonant-vowel (CV) combinations, while the remaining one is the consonant itself. The way Tigrigna orthographic characters are written is very similar to the way they are spoken; in other words, Tigrigna is a phonetic language. The mapping between the written form and the spoken form is one to one, except for the epenthetic vowel. Characters representing the same consonant followed by different vowels are similar in shape [6].

Table 1. Syllable structure of the Tigrigna characters “ሀ” and “ለ”.
1st  2nd  3rd  4th  5th  6th  7th
ሀ    ሁ    ሂ    ሃ    ሄ    ህ    ሆ
He   Hu   Hi   ha   Hie  h    ho
ለ    ሉ    ሊ    ላ    ሌ    ል    ሎ
Le   Lu   Li   La   Lie  L    Lo

1.2. Characteristics of Tigrigna Language

Like other Semitic languages, Tigrigna has its own characteristic phonetic, phonological, and morphological properties. It has a set of speech sounds that is not found in other languages. For example, the following sounds are not found in English and Amharic: [ʕ] (ዐ), [ḥ] (ሐ), [k'] (ኸ), [ʔ] (አ), and [x'] (ቐ) [10]. Tigrigna also has its own inventory of speech sounds. Some Fidels (letters) have the same pronunciation but different symbols; these Fidels can be used interchangeably without changing the meaning. They are “ጸ” and “ፀ”, “ሰ” and “ሠ”, and “ሀ” and “ኀ”. For example, the word “hair” can be written as “ጸጉሪ” or “ፀጉሪ”; the word “weed” can be written as “ፀሃየ”, “ፀኃየ”, “ጸሃየ”, or “ጸኃየ”; the word “hunter” can be written as “ሃደነ” or “ኃደነ”; and the word “troop” can be written as “ሰራዊት” or “ሠራዊት”. In each case the variants mean the same, although they are written differently and produce different orthographic forms.

1.3. Consonant Phonemes

There are thirty-five consonant phonemes in Tigrigna. The consonants are generally classified as stops, fricatives, nasals, liquids, and semi-vowels. Unlike many of the modern Ethiopian Semitic languages, Tigrigna has preserved the two pharyngeal consonants, which were apparently part of the ancient Ge'ez language and which, along with [x'] (“ቐ”), a velar or uvular ejective stop, make it easy to distinguish spoken Tigrinya from related languages such as Amharic. The fricative sounds [x] (“ኸ”), [xʷ] (“ዀ”), [x'] (“ቐ”), and [xʷ'] (“ቘ”) occur as allophones [6].

Table 2. Tigrigna Syllabic Structure [6].
Kxa    x     ኸ ኹ ኺ ኻ ኼ ኽ ኾ
Kxwa   xw    ዀ ዂ ዃ ዄ ዅ
Qa     k’    ቀ ቁ ቂ ቃ ቄ ቅ ቆ
Qha    b’    ቐ ቑ ቒ ቓ ቔ ቕ ቖ
Qhwa   bw’   ቘ ቚ ቛ ቜ ቝ
Qwa    kw’   ቈ ቊ ቋ ቌ ቍ

1.4. Vowel Phonemes

Vowels are always voiced sounds, produced with the vocal cords in vibration [1]. Most languages have five vowels /a, e, i, o, u/, but Tigrigna has seven. These are አ, ኡ, ኢ, ኣ, ኤ, እ, and ኦ. All are voiced and oral sounds. These vowels are found in every letter; that is, each letter in Tigrigna is not a single sound but a combination of two sounds, one consonant and one vowel. Depending on the position of the lips, the Tigrigna vowels (አ, ኡ, ኢ, ኣ, ኤ, እ, and ኦ) [1] are broadly categorized into rounded (ኡ and ኦ) and unrounded (አ, ኢ, ኣ, ኤ, and እ).

1.5. Gemination

Gemination (ጥብቀት), or consonant lengthening, is not normally indicated in the Ge'ez script. It is a longer duration of identical segments [17], that is, of adjacent consonants or vowels that are the same. In Tigrigna, a sequence of vowels is not permissible.
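As an illustration of the consonant-plus-order structure shown in Table 1 (Section 1.1), the following minimal Python sketch splits a Fidel into its base character and its order. It is not part of the system described in this paper, which was implemented in MATLAB. It assumes the layout of the Unicode Ethiopic block, where the seven orders of a plain consonant occupy consecutive code points starting at a multiple of eight from U+1200; the function name `split_fidel` and the upper bound used for the range check are choices made only for this sketch, and the labialized rows of Table 2 do not follow this simple pattern.

```python
ETHIOPIC_START = 0x1200  # first code point of the Unicode Ethiopic block
ETHIOPIC_END = 0x135A    # upper bound chosen for this sketch (assumption)
ORDER_NAMES = ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th"]


def split_fidel(ch):
    """Return (base_character, order_name) for one plain Ethiopic syllable.

    The base character is the 1st-order form of the same consonant row.
    """
    cp = ord(ch)
    if not ETHIOPIC_START <= cp <= ETHIOPIC_END:
        raise ValueError("not an Ethiopic syllable: %r" % ch)
    offset = (cp - ETHIOPIC_START) % 8  # position inside the row of eight
    if offset > 6:
        # The eighth slot of a row is not one of the seven orders.
        raise ValueError("not one of the seven orders: %r" % ch)
    return chr(cp - offset), ORDER_NAMES[offset]


if __name__ == "__main__":
    for fidel in "ሁልሎ":
        print(fidel, split_fidel(fidel))
```

For the characters of Table 1, `split_fidel("ሁ")` returns `("ሀ", "2nd")` and `split_fidel("ሎ")` returns `("ለ", "7th")`. Note that nothing in the character code reflects gemination, which is consistent with the script not marking it.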

Consonant gemination may bring meaning differences in words. Compare “ዘዋሪ” /zawara/ “he got roaming” with “ዘዋሪ” /zawwara/ “he drove”, and “ሓሊፉ” /halifu/ “he passed” with “ሓሊፉ” /hallifu/ “he excelled”. In each pair, a geminated or ungeminated medial consonant brings about the difference in meaning.

2. Literature Review

Speech synthesis is the process of converting written text into speech. The technology can convert arbitrary text into audible speech, with the goal of providing textual information to people via voice messages [11]. The quality of a speech synthesizer depends on the TTS architecture adopted to produce intelligible and natural sounds.

Figure 1. Text to Speech System [17].

2.1. The Natural Language Processing (NLP) Component

Natural Language Processing, or text-to-phoneme (T2P) conversion, aims to produce a phonetic transcription of the text together with the desired prosodic features [9]. It concerns how computational methods can aid the understanding of human language and focuses on developing systems that allow computers to communicate with people in the language they use in everyday life. Its components are text analysis, automatic phonetization, and prosody generation [1].

A number of factors affect natural language processing and the final output of digital signal processing. The factors that affected this research include environmental effects during recording, microphone quality, sampling frequency, echo, and noise.

2.2. The Digital Signal Processing (DSP) Component

The digital signal processing unit transforms the symbolic information that it receives from the NLP component into audible and intelligible speech. The operations involved in the DSP component are the computer analogue of dynamically controlling the articulatory muscles and the vibratory frequency of the vocal folds so that the output signal matches the input requirements [1].

2.3. Speech Synthesis Techniques

Synthesized speech can be produced by several different techniques that aim at natural, human-like sound. The main speech synthesis techniques are discussed below.

2.3.1. Articulatory Synthesis
Articulatory synthesis tries to model the human speech production system (especially the vocal tract and articulators such as the lips, tongue, and jaw) and the articulatory processes directly. However, [12] it is also the most difficult method to implement because of the lack of knowledge about the complex human articulation organs.

2.3.2. Formant Synthesis
Formant synthesis is based on rules that describe the resonant frequencies of the vocal tract. The formant method uses the source-filter model of speech production, in which speech is modeled by the parameters of the filter model. Rule-based formant synthesis can produce good-quality speech, but it sounds unnatural, [5] since it is difficult to estimate the vocal tract model and source parameters.

2.3.3. Unit Selection Synthesis
In unit selection based concatenative speech synthesis, units are chosen using a join cost, also known as the concatenation cost, which measures how well two units can be joined together [13].

2.3.4. Concatenative Synthesis
Concatenative systems can synthesize high-quality and more natural-sounding speech. However, to synthesize speech with various voice characteristics, such as speaker individualities, speaking styles, and emotions, a large speech corpus and a large amount of memory are required, because stored basic speech units (such as syllables and diphones) are concatenated to form word sequences using a pronunciation dictionary [13]. Concatenative synthesis concatenates pre-recorded segments to generate natural speech. It produces intelligible and natural synthetic speech, usually close to the real voice of a person [13]. However, concatenative synthesizers are limited to only one speaker and one voice. The mismatch between the natural variation in speech signals and the automated techniques used to segment the waveforms can be audible in the output [14].

3. Methodology

Research methodology is the process used to collect information and data for the purpose of making decisions about the research topic. It may include literature research, interviews, surveys, and other research techniques.

3.1. Research Strategy

The research strategy for this work is applied research rather than new basic research. Numerous studies already exist on the role of TTS in different local and international languages. They synthesize natural language automatically in order to reduce the challenges of day-to-day activities, especially for visually impaired people, although the technology is usable by sighted people as well.

3.2. Research Approach

There are different approaches to develop a text to speech
synthesizer, some of which are discussed in Section 2, but this research used a concatenative approach to synthesize Tigrigna speech. In this approach, the Tigrigna diphones (half phones), known as “Fidels”, are recorded. The pre-recorded Tigrigna sounds are then concatenated to obtain Tigrigna words, phrases, and sentences. Systems built with the concatenative approach can synthesize high-quality and more natural-sounding speech; the output was listened to by native speakers of the Tigrigna language.

3.3. Data Collection Method and Tools

Direct observation and a review of articles were applied in this research to identify the full set of characters that represent the language (the “Fidels”) and the tools used to develop and test the TTS synthesizer, respectively. The tools used were PRAAT, to record and analyze the Fidels of the Tigrigna language; MATLAB, to implement the Tigrigna TTS synthesizer; and NeoSpeech, to test the performance of the TTS synthesizer.

3.4. Data Analysis

Data analysis here is a content analysis of the data gathered from interviews and direct observation. The gathered information was analyzed using PRAAT. The Fidels of the Tigrigna language were collected from spiritual texts written in the Ge'ez script, which is known as “Abugida”, and the collected characters were recorded and analyzed using PRAAT. Natural language samples were collected from different Tigrigna articles, journals, and newspapers and broken down into phones, words, phrases, and sentences to evaluate the performance of the TTS synthesizer.

3.5. Research Method

The research method provides an orientation that influences the results, procedures, and validation of the research work. A Tigrigna corpus was prepared for the TTS synthesizer by recording the Tigrigna diphones as WAV files using PRAAT. The recorded WAV files were then converted to text (.txt) files using MATLAB. Subsequently, each text file was read automatically in MATLAB, and linear predictive coding (LPC) was applied to estimate the error signals in order to obtain natural sound. The performance of the TTS synthesizer was then checked in two ways. First, the NeoSpeech tool was used to test the naturalness and intelligibility of sample words. Second, the mean opinion score (MOS) was used to test sample sentences with 20 invited native speakers of the language. Overall, the synthesizer using diphones achieved 78% accuracy, and the overall intelligibility and naturalness of the system, as rated by twenty listeners on ten Tigrigna sentences, were found to be 3.27 and 3.28, respectively.
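The two evaluation measures named in Section 3.5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation script: word-level performance is taken to be the share of correctly pronounced words among the words played, and the MOS is taken to be the plain average of all 1 to 5 ratings over listeners and sentences. The function names and the ratings in the usage example are hypothetical placeholders, not data from this study.

```python
from statistics import mean


def word_accuracy(n_correct, n_played):
    """Percentage of played words that were judged correctly pronounced."""
    if n_played <= 0:
        raise ValueError("n_played must be positive")
    return 100.0 * n_correct / n_played


def mean_opinion_score(ratings):
    """Average of all ratings.

    ratings: one list per listener, each holding that listener's
    1-5 scores for the test sentences.
    """
    flat = [score for listener in ratings for score in listener]
    if not flat:
        raise ValueError("no ratings given")
    if any(not 1 <= score <= 5 for score in flat):
        raise ValueError("MOS ratings must lie between 1 and 5")
    return mean(flat)


if __name__ == "__main__":
    # 78 of 100 words corresponds to the reported 78% word-level result.
    print(word_accuracy(78, 100))
    # Placeholder ratings from two imaginary listeners on two sentences.
    print(mean_opinion_score([[3, 4], [3, 3]]))
```

With twenty listeners and ten sentences, `ratings` would hold twenty lists of ten scores each, and the function would be called once for intelligibility and once for naturalness.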

Figure 2. GUI Text-to-Speech Synthesizer for Tigrigna.

3.6. Sample Selection

Sampling was used to develop the sample for the research under discussion. According to this method, the items that belong to the sample were selected for implementing the TTS synthesizer, evaluating the performance of the TTS synthesizer, and testing the TTS
synthesizer. In this research work, 35×35 diphones were recorded to develop the TTS model for the Tigrigna language. Additionally, 100 Tigrigna words and 10 different sentences were used to test the TTS model, and twenty (20) native speakers participated in checking the performance of the synthesizer; 12 of them were men and the remaining 8 were women.

4. Design an Automatic Model Text to Speech Synthesizer for Tigrigna

This section demonstrates how the text-to-speech synthesizer model is designed and implemented, and how input texts are matched against its database. Algorithms modify the pitch and duration of the speech to produce synthesized speech by concatenating diphone segments.

4.1. Linear Predictive Coding

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model [15]. LPC has several advantages:
a) LPC provides a better approximation of the coefficient spectrum;
b) LPC gives a shorter and more efficient calculation time for signal parameters; and
c) LPC captures important characteristics of the input signals.

\hat{s}[n] = \sum_{k=1}^{P} a_k \, s[n-k]    (1)

where \hat{s}[n] is the predicted sample, a_k are the predictor coefficients, and P is the number of past samples of s[n] which we wish to examine.
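Equation (1) can be made concrete with a short Python sketch. The paper's system was implemented in MATLAB and does not state how the coefficients were estimated, so the method below, the autocorrelation method solved with the Levinson-Durbin recursion, is an assumption chosen because it is a common textbook way to obtain LPC coefficients. The function names are hypothetical. The residual returned by `prediction_error` is the error signal s[n] - \hat{s}[n] that Section 3.5 refers to.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i - lag] for i in range(lag, n))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Estimate a_1..a_P of Equation (1) with the Levinson-Durbin recursion."""
    r = autocorrelation(signal, order)
    if r[0] == 0:
        raise ValueError("signal has zero energy")
    a = [0.0] * (order + 1)  # a[0] is unused so that a[k] matches a_k
    err = r[0]               # prediction error energy of the current model
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def prediction_error(signal, coeffs):
    """Residual e[n] = s[n] - sum_k a_k s[n-k] for n >= P."""
    p = len(coeffs)
    return [
        signal[n] - sum(coeffs[k] * signal[n - 1 - k] for k in range(p))
        for n in range(p, len(signal))
    ]


if __name__ == "__main__":
    # Toy signal in which each sample is 0.9 times the previous one.
    toy = [0.9 ** n for n in range(200)]
    coeffs = lpc_coefficients(toy, 2)
    print(coeffs)
    print(max(abs(e) for e in prediction_error(toy, coeffs)))
```

On the toy signal, the first coefficient comes out very close to 0.9 and the second very close to 0, so the residual is nearly zero. For recorded speech, the signal would be a short windowed frame, and P is typically chosen in relation to the sampling rate.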

Figure 3. The original signal of the word "arba".

The algorithm used to read files from the database in the concatenative approach is as follows:
1) For each text file from one to N
2) Load the text file from the database
3) Concatenate the files from one to N
4) Read the text
5) End for

ALGORITHM 1 (steps to read a file):
STEP1: Create a database of various wave files.
STEP2: Create a text file (.txt).
STEP3: Open the .txt file in MATLAB.
STEP4: Read the opened file.
STEP5: For every character read, play the corresponding wave (.wav) file.

4.2. Proposed Architecture of Text to Speech Synthesizer for Tigrigna

Basically, three main modules are used to build the TTS synthesizer for Tigrigna: the Natural Language Processing module, the Digital Signal Processing module,
and the Database modules.

Figure 4. TTS Architecture for Tigrigna.
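The character-by-character lookup of Algorithm 1 (Section 4.1), which ties the text input to the database module, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' MATLAB implementation: the database is modelled as a dictionary from each character to a list of 16-bit samples instead of a folder of .wav files, whitespace is rendered as an optional pause, and the 16 kHz sampling rate is a placeholder. The names `synthesize` and `to_wav_bytes` are hypothetical.

```python
import io
import struct
import wave


def synthesize(text, unit_db, pause=None):
    """Concatenate the stored samples of every character in `text`.

    unit_db: dict mapping a character to its list of integer samples.
    pause:   optional samples inserted for whitespace (assumption).
    """
    samples = []
    for ch in text:
        if ch.isspace():
            samples.extend(pause or [])
            continue
        if ch not in unit_db:
            raise KeyError("no recorded unit for %r" % ch)
        samples.extend(unit_db[ch])
    return samples


def to_wav_bytes(samples, sample_rate=16000):
    """Pack 16-bit mono samples into an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # two bytes per sample (16-bit PCM)
        out.setframerate(sample_rate)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


if __name__ == "__main__":
    # Tiny made-up "recordings"; real units would hold thousands of samples.
    toy_db = {"ሀ": [100, 200], "ለ": [300]}
    audio = synthesize("ሀ ለ", toy_db, pause=[0, 0])
    print(audio)
    print(len(to_wav_bytes(audio)))
```

Plain concatenation of this kind leaves discontinuities at unit boundaries, which is one reason the paper applies signal processing, such as LPC, on top of the raw units.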

4.3. Experimental Results and Discussions

The first experiment assesses the performance of the system at the word level. The test consists of 100 Tigrigna words selected with the help of native speakers of the language. The naturalness and intelligibility of the selected words are evaluated using a software tool called NeoSpeech. The researcher gives the selected words to the tool and listens to the naturalness and intelligibility of the sound played by the tool online.

The overall performance of the system is measured as the total number of correctly pronounced words over the total number of words played. Calculated in this way, the overall performance of the system is found to be 78%.

The second experiment evaluates the intelligibility and naturalness of the synthesizer. In this research, the Mean Opinion Score (MOS) technique is used to evaluate the synthesized speech because it is the most widely used and simplest method for evaluating speech quality [6].

The overall intelligibility of the system, as rated by twenty listeners on ten Tigrigna sentences, is found to be 3.27, which means the synthesizer is ‘good’ as per the scale of the MOS test. The overall naturalness of the synthesizer is found to be 3.28, which also approaches ‘good’ on the MOS scale.

5. Conclusion and Recommendation

5.1. Conclusion

A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud when it is directly introduced into the computer by an operator. The system comprises Text Analysis, which converts raw text to pronounceable words; Phonetic Analysis, which converts text in orthographic form to phonemes; a stage in which certain properties of the speech signal are processed; Diphone Database Creation, which provides the diphone speech units to be concatenated and uttered; and Diphone Concatenation, where the speech is generated.

Based on the evaluation, the system registers on average 78% performance, a MOS of 3.28 for intelligibility, and a MOS of 3.27 for naturalness. The result looks encouraging, and further improvement of intelligibility and naturalness depends on further work in different contexts. In this research, we prepared the diphone inventory in consultation with domain experts. However, as shown in the literature, well-studied diphone units produce better quality sound.

5.2. Recommendation

Based on the findings of the study, we recommend the following to improve the quality of the system and to enhance the quality of the synthesized speech.

In this study we did not consider prosody, word stress, intonation, or the zonal dialects of the language, all of which are challenging in designing speech synthesis.

Speech emotion is another area for development. Emotion types such as normal, happy, angry, sad, fear, and grief change both the speech output and the waveform generation. Therefore, there is much work that could be carried out in this area alone. However, future work on other emotions may not produce the same results found in this work. This would be due to a number of reasons: more complex emotions are less understood and, as a
consequence, speech correlates for complex emotions are much harder to identify.

Acknowledgements

The corresponding author would like to thank the Department of Computer Science at Arba-Minch University for its support and valuable advice from the beginning to the completion of the paper, and the native speakers of the language who gave their time for the recordings. Next, I would like to thank my advisor, Dr. Hussien Seid, for his tireless support, patience, guidance, and encouragement.

References

[1] M. S. Siyoum, "Syllable-Based Text-to-Speech Synthesis (TTS) for Amharic," Addis Ababa University, June 2012.
[2] R. K. Kaveri Kamble, "Translation of Text to Speech Conversion for Hindi Language," 2012.
[3] S. A. S. S. P. P. Mrs. Mangal Joshi, "Text to Speech Synthesis for Hindi Language using Festival Framework," International Research Journal of Engineering and Technology (IRJET), vol. 06, no. 04, p. 630, Apr 2019.
[4] Dr. Samuel Manoharan, "A Smart Image Processing Algorithm for Text Recognition, Information Extraction and Vocalization for the Visually Challenged," Journal of Innovative Image Processing (JIIP), vol. 01, pp. 31-38, 2019.
[5] R. J. R. G. D. Ramteke, "Text-To-Speech Synthesis of Marathi Numerals," vol. 3, no. 7, July 2015.
[6] A. Kiflu, "Unit Selection Based Text-to-Speech Synthesizer for Tigrinya Language," vol. 1, December 2012.
[7] A. T. Ei Phyu Phyu Soe, "Text-to-Speech Synthesis for Myanmar Language," International Journal of Scientific & Engineering Research, vol. 4, no. 6, p. 1509, June 2013.
[8] J. M. Varghese, "Design of Gujarati Text-to-Speech System," vol. 02, no. 05, May 2015.
[9] B. Sudhakar, "Development of Concatenative Syllable based Text to Speech Synthesis System for Tamil," vol. 91, April 2014.
[10] Y. Fisseha, "Development of Stemming Algorithm for Tigrigna Text," June 2011.
[11] N. P. P. S. S. a. S. A. Ayushi Trivedi, "Speech to text and text to speech recognition systems-A review," IOSR Journal of Computer Engineering (IOSR-JCE), vol. 20, no. 2, p. 40, May-April 2018.
[12] S. D. D. E. Kodhai, "Textaloud Assistant App Development for Multilanguage," International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 8, no. 7s, May 2019.
[13] M. R. B.,. C. N. M. Suhas R. Mache, "Review on Text-To-Speech Synthesizer," International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, no. 8, p. 56, August 2015.
[14] P. G. K. D. Pawan S. Nadig, "Survey on text-to-speech Kannada using Neural Networks," International Journal of Advance Research, Ideas and Innovations in Technology, vol. 5, no. 6, p. 128, 2019.
[15] G. D. R. R. J. R. Sunil S. Nimbhore, "Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer," IOSR Journal of Computer Engineering (IOSR-JCE), vol. 17, no. 1, pp. 34-43, Feb. 2015.
[16] Y. B. Ilyes Rebai, "Text-to-speech synthesis system with Arabic diacritic recognition system," Multimedia InfoRmation System and Advanced Computing Laboratory, 17 April 2015.
[17] A. T. Zegeye, "A Generalized Approach to Amharic Text-to-Speech (TTS) Synthesis System," Addis Ababa University, July 2010.
