2

I use Ubuntu 12.04.

I want to make extensive use of the text-to-speech capabilities of Linux to create audio files from text.

I've tried Festival, but finding good voices and installing them is overly complex, so I use it with its default voices.

I also tried Pico2Wave.

Festival's text-to-speech is totally robotic and unnatural, and it's not suitable for long-term listening. It has a "whirring" sound in the background, though you can still hear the words crisply. Again, the speech itself is robotic and of poor quality.

Festival sample here

Pico2Wave is very natural and comparable to Apple's text-to-speech in terms of diction and human-like speech, but the quality of the sound itself is awful. It sounds as if it was recorded in a very empty room with a lot of echo. It sounds "stuffy", muddy and tubby, with too much bass, so much that it makes the speakers rattle, and it is sometimes very difficult to understand unless you are using earphones. The sound is not crisp at all. I also suspect the sound "clips", but I'm no audio expert.

Pico2Wave sample here

My question is:

How can I improve the sound quality of the generated audio file? I'm no audio expert, so I don't know what I have to fiddle with (gain? bass? noise reduction? to what extent? etc.). Note that I'm not asking for recommended tools, but for an explanation of what exactly is wrong with that audio and which qualities I should adjust in my audio editing/improving app of choice.

NOTE: The sample text is the first paragraph of "The Last of the Mohicans":

It was a feature peculiar to the colonial wars of North America, that the toils and dangers of the wilderness were to be encountered before the adverse hosts could meet. A wide and apparently an impervious boundary of forests severed the possessions of the hostile provinces of France and England. The hardy colonist, and the trained European who fought at his side, frequently expended months in struggling against the rapids of the streams, or in effecting the rugged passes of the mountains, in quest of an opportunity to exhibit their courage in a more martial conflict. But, emulating the patience and self-denial of the practiced native warriors, they learned to overcome every difficulty; and it would seem that, in time, there was no recess of the woods so dark, nor any secret place so lovely, that it might claim exemption from the inroads of those who had pledged their blood to satiate their vengeance, or to uphold the cold and selfish policy of the distant monarchs of Europe.

2 Answers

7

I just ran into the same issue, and at the moment I've ended up with something like

pico2wave -l "$LANGUAGE" -w "$WAV" "$*" && play -qV0 "$WAV" treble 24 gain -l 6

which sounds much more "crisp".
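Since the question is about creating audio files rather than playing them, the same effect chain can be given to `sox` instead of `play`. The following is only a sketch: `brighten_wav` is a hypothetical helper name and the file names are placeholders. In SoX, `treble 24` applies a 24 dB high-shelf boost, and `gain -l 6` adds 6 dB of gain through a simple limiter so the boosted signal is less likely to clip.

```shell
#!/bin/sh
# Sketch: apply the same SoX effects as the play command above,
# but write the result to a file instead of the sound card.
# brighten_wav is a hypothetical helper name, not part of SoX.
brighten_wav() {
    in=$1
    out=$2
    # treble 24  : boost the high frequencies by 24 dB (shelving filter)
    # gain -l 6  : add 6 dB of gain, using SoX's simple limiter
    sox "$in" "$out" treble 24 gain -l 6
}

# Example usage (placeholder file names):
# pico2wave -l en-US -w raw.wav "Some text to speak"
# brighten_wav raw.wav crisp.wav
```

The amounts (24 and 6) are simply the values from the command above; they may need adjusting by ear for other voices or speakers.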

1
  • Hi, where did you find documentation for pico2wave? Thanks. EDIT: ah, I see those are options for play (which is from the sox package).
    – alchemy
    Commented Sep 27, 2022 at 2:02
2

Looking at the waveform in Audacity, the peak level is very high. While the waveform doesn't look clipped, it is probably causing clipping on playback, and it sounds nasty when played with VLC. Using Audacity's 'Amplify' effect, I set the new peak amplitude to -3.0 dB and exported back to WAV; the result plays back nice and clean in VLC. No doubt this could be done on the command line or in a script using SoX or similar.
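As a rough command-line equivalent of the Amplify step, SoX's `gain -n` effect normalizes a file so that its peak sits at a given level. This is a minimal sketch, not something tested against the Pico2Wave sample: `normalize_peak` is a hypothetical helper name, the file names are placeholders, and the -3 dB default simply mirrors the Audacity setting above.

```shell
#!/bin/sh
# Sketch: lower (or raise) a WAV file so its peak sits at a target level.
# normalize_peak is a hypothetical helper name, not part of SoX.
normalize_peak() {
    in=$1
    out=$2
    peak_db=${3:--3}   # target peak in dB relative to full scale; defaults to -3
    # "gain -n LEVEL" normalizes the audio so that its peak is at LEVEL dB.
    sox "$in" "$out" gain -n "$peak_db"
}

# Example usage (placeholder file names):
# normalize_peak pico.wav pico_clean.wav
# normalize_peak pico.wav pico_quieter.wav -6
```

Because this only rescales the whole file, it leaves the tonal balance unchanged; it addresses the overly hot peak level, not the muddiness.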
