Convention Paper
Presented at the 111th Convention
2001 September 21–24
New York, NY, USA
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration
by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request
and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org.
All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the
Journal of the Audio Engineering Society.
___________________________________
Spectral Approach to the Modeling of the Singing Voice
Jordi Bonada, Alex Loscos, Pedro Cano, Xavier Serra
Audiovisual Institute, Pompeu Fabra University
Barcelona, Spain
{jordi.bonada, alex.loscos, pedro.cano, xavier.serra}@iua.upf.es
http://www.iua.upf.es/mtg
Hideki Kenmochi
Advanced System Development Center, YAMAHA Corporation
Hamamatsu, Japan
[email protected]
ABSTRACT
In this paper we present two different approaches to the modeling of the singing voice. Each approach has been designed to meet the specific requirements of one of two applications: an automatic voice impersonator for karaoke systems and a singing voice synthesizer.
1. INTRODUCTION
Singing voice synthesis has been an active research field for almost
fifty years [Cook, 1996]. Most of the systems developed so far do not provide enough quality, or do not meet the practical requirements, to have found real-world commercial applications. Nevertheless, one of the main issues in singing voice synthesis is to offer not only quality but also flexible and musically meaningful control over the vocal sound. In that sense, we may think of applications where impossible singing voices can be synthesized or where existing voices can be enhanced.
In a broad sense, and according to whether the focus is put on the
system or its output, synthesis models used in singing voice
synthesis can be classified into two main groups: spectral models
and physical models. Spectral models are based on perceptual criteria, focusing on the sound output as it is perceived by the listener, whereas physical models focus on the mechanisms that produce the sound.
[Figure: block diagram of the SMS analysis. The sound is windowed (Blackman-Harris 92 dB) and analyzed with an FFT; spectral peaks are detected and, guided by the pitch estimation, organized into sinusoidal tracks by the peak continuation step; the resulting sine magnitudes, frequencies and phases feed a spectral sine generator, from which the residual spectrum is obtained.]
s(t) = \sum_{r=1}^{R} A_r(t) \cos[\theta_r(t)] + e(t) \qquad (1)
where A_r(t) and \theta_r(t) are the instantaneous amplitude and phase of the r-th sinusoid, respectively, and e(t) is the time-varying noise component.
This estimation of the sinusoidal component is generally done by
first computing the STFT of the sound, then detecting the spectral
peaks (and measuring the magnitude, frequency and phase of each
one), and organizing them as time-varying sinusoidal tracks. By
using the fundamental frequency information in the peak
continuation algorithm, we can identify the harmonic partials.
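As an illustration of this analysis chain (not the authors' implementation), the following sketch assumes a mono signal in a NumPy array and a known fundamental frequency, uses SciPy's STFT, and keeps the spectral peaks that lie close to the harmonic frequencies:

```python
import numpy as np
from scipy.signal import stft

def harmonic_peaks(x, fs, f0, n_harm=40, nfft=2048, hop=256):
    """Simplified SMS-style analysis: per-frame harmonic peak frequencies and magnitudes."""
    freqs, _, X = stft(x, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    tracks = []
    for frame in np.abs(X).T:                       # one magnitude spectrum per frame
        # candidate spectral peaks = local maxima of the magnitude spectrum
        peaks = np.where((frame[1:-1] > frame[:-2]) & (frame[1:-1] > frame[2:]))[0] + 1
        frame_peaks = []
        for h in range(1, n_harm + 1):              # peak continuation guided by f0
            target = h * f0
            if target >= fs / 2 or peaks.size == 0:
                break
            k = peaks[np.argmin(np.abs(freqs[peaks] - target))]
            if abs(freqs[k] - target) < 0.5 * f0:   # accept only peaks near the harmonic
                frame_peaks.append((freqs[k], frame[k]))
        tracks.append(frame_peaks)
    return tracks
```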
The sinusoidal plus residual model assumes that the sinusoids are
stable partials of the sound with a slowly changing amplitude and
frequency. With this restriction, we are able to add major
constraints to the detection of sinusoids in the spectrum and omit
the detection of the phase of each peak. The instantaneous phase
that appears in the equation is taken to be the integral of the
instantaneous frequency \omega_r(t), and therefore satisfies
\theta_r(t) = \int_0^t \omega_r(\tau)\, d\tau \qquad (2)

The residual e(t) is modeled as the convolution of a white noise signal u(\tau) with the impulse response h(t,\tau) of a time-varying filter,

e(t) = \int_0^t h(t,\tau)\, u(\tau)\, d\tau \qquad (3)

[Figure: block diagram of the SMS frame synthesis. The sine frequencies, magnitudes and phases drive a spectral sine generation stage and the residual spectral data drives a spectral residual generation stage; the resulting magnitude and phase spectra are converted from polar to rectangular form into a complex spectrum, multiplied by the synthesis window and passed through an IFFT to produce the output sound.]
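For illustration only, equations (1) and (2) can be rendered directly in the time domain by accumulating the instantaneous frequency of each partial; this is a minimal sketch assuming per-frame amplitudes and frequencies from the analysis above (the SMS synthesis shown in the figure works in the frequency domain with an IFFT and overlap-add):

```python
import numpy as np

def additive_resynthesis(amps, freqs, fs, hop):
    """amps, freqs: arrays of shape (n_frames, n_partials). Returns the sinusoidal part of eq. (1)."""
    n_frames, n_partials = amps.shape
    n_samples = n_frames * hop
    frame_pos = np.arange(n_frames) * hop           # sample position of each frame
    t = np.arange(n_samples)
    out = np.zeros(n_samples)
    for r in range(n_partials):
        a = np.interp(t, frame_pos, amps[:, r])     # sample-rate amplitude A_r(t)
        f = np.interp(t, frame_pos, freqs[:, r])    # sample-rate frequency in Hz
        theta = 2.0 * np.pi * np.cumsum(f) / fs     # eq. (2): phase as integral of frequency
        out += a * np.cos(theta)                    # eq. (1), without the residual e(t)
    return out
```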
[Figure: block diagram of the voice impersonator. The user's voice is analyzed with SMS and parameterized for recognition (observations: mel cepstrum, delta mel cepstrum, energy, delta energy, voicedness, delta pitch); a segmentation, alignment and recognition module, based on a general phonetic dictionary (HMM), matches each user frame to a target frame supplied by the target decoder; the pitch, sine spectral shape, amplitude, residual and phase alignment morphs combine the user and target SMS frames according to the morph parameters, and the morphed frame spectrum is generated and converted to sound by the SMS synthesis (IFFT and overlap).]
Once we have all the required inputs set, we can start processing the user's voice. The first module of the running system includes the real-time analysis and the recognition/alignment steps. Each analysis frame, with the appropriate parameterization, is associated with the phoneme of a specific moment of the song and thus with a target frame. Once a user frame is matched to a target frame, we morph them by interpolating data from both frames and synthesize the output sound. Only voiced phonemes are morphed, and the user has control over which parameters are interpolated and by how much. The frames belonging to unvoiced phonemes are left untouched, so the user's unvoiced consonants always appear in the output.
Therefore, depending on the phoneme the user is singing, a unit
from the target is selected and then each frame from the user is
morphed with a different frame from the target, advancing
sequentially in time as illustrated in figure 4. The user has the
choice to interpolate the different parameters extracted at the
analysis stage, such as amplitude, fundamental frequency, spectral
shape, residual signal, etc. In general, the amplitude will not be interpolated; it is always taken from the user. The unvoiced phonemes will not be morphed either, so they also remain the user's. This gives the user the feeling of being in control of the synthesis.
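A schematic sketch of this per-frame interpolation is shown below; the frame fields and morph-factor names are hypothetical, chosen only to illustrate how user and target parameters could be mixed:

```python
def morph_frame(user, target, k):
    """Interpolate selected SMS frame parameters between a user and a target frame.
    user, target: dicts with 'f0', 'amp_db', 'sshape' and 'residual_env' (floats or NumPy arrays);
    k: dict of morph factors in [0, 1], where 0 keeps the user's value."""
    out = dict(user)                                 # start from the user's frame
    out['f0'] = (1 - k['f0']) * user['f0'] + k['f0'] * target['f0']
    out['sshape'] = (1 - k['sshape']) * user['sshape'] + k['sshape'] * target['sshape']
    out['residual_env'] = ((1 - k['residual']) * user['residual_env']
                           + k['residual'] * target['residual_env'])
    # the amplitude is normally kept from the user, so it is only mixed when asked for
    if k.get('amp', 0.0) > 0.0:
        out['amp_db'] = (1 - k['amp']) * user['amp_db'] + k['amp'] * target['amp_db']
    return out
```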
[Figure: morphing of a voiced phoneme. The target unit is divided into attack, steady and release segments; the attack and release are morphed normally against the user frames, while the steady state is morphed in loop mode.]
Once all the chosen parameters have been interpolated for a given
frame, they are added back to the basic SMS synthesis frame.
Synthesis is done with the standard synthesis procedures of SMS.
Fig 6 Concatenation of silences and aspirations in the FSN
A_{S_{total}} = 20 \log_{10} \sum_{i=1}^{I} a_i \qquad (4)

where a_i is the linear amplitude of the i-th harmonic and I is the total number of harmonics found in the current frame.
The amplitude of the residual component is calculated as the sum of the absolute values of the residual of the current frame, expressed in dB. This amplitude can also be computed by adding the frequency samples of the corresponding magnitude spectrum,

A_{R_{total}} = 20 \log_{10} \sum_{n=0}^{M-1} \left| x_R(n) \right| = 20 \log_{10} \sum_{k=0}^{N-1} \left| X_R(k) \right| \qquad (5)

where x_R(n) is the residual sound, M is the size of the frame, X_R(k) is the spectrum of the residual sound, and N is the size of the magnitude spectrum.
The fundamental frequency is defined as the frequency that best explains the harmonics of the current frame. It can be computed by taking the weighted average of all the normalized harmonic frequencies,

F_0 = \frac{\sum_{i=1}^{I} \frac{f_i}{i}\, a_i}{\sum_{i=1}^{I} a_i} \qquad (6)

where f_i is the frequency of the i-th harmonic.
SShape = \{ (f_1, a_1), (f_2, a_2), \ldots, (f_I, a_I) \} \qquad (7)
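A minimal sketch of these frame attributes (equations 4 to 7), assuming NumPy arrays with the harmonic frequencies and linear amplitudes of the current frame and its time-domain residual:

```python
import numpy as np

def frame_attributes(harm_freqs, harm_amps, residual):
    """Compute the frame attributes of eqs. (4)-(7) from one analysis frame."""
    as_total = 20 * np.log10(np.sum(harm_amps))                 # eq. (4): sinusoidal amplitude (dB)
    ar_total = 20 * np.log10(np.sum(np.abs(residual)))          # eq. (5): residual amplitude (dB)
    order = np.arange(1, len(harm_freqs) + 1)                   # harmonic numbers 1..I
    f0 = np.sum((harm_freqs / order) * harm_amps) / np.sum(harm_amps)   # eq. (6)
    sshape = list(zip(harm_freqs, harm_amps))                   # eq. (7): spectral shape points
    return as_total, ar_total, f0, sshape
```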
This set of points that defines the spectral shape envelope is joined with third-order spline interpolation instead of linear interpolation. Spline interpolation gives a better approximation of the resonant character of the spectrum of the singing voice.
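With SciPy, for example, the envelope could be interpolated as follows (a sketch, not the authors' code; amplitudes are assumed to be in dB):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def spectral_shape_envelope(harm_freqs, harm_amps_db, eval_freqs):
    """Third-order spline through the (frequency, amplitude) points of eq. (7)."""
    spline = CubicSpline(harm_freqs, harm_amps_db)
    # evaluate the envelope at arbitrary frequencies, clamped to the defined range
    f = np.clip(eval_freqs, harm_freqs[0], harm_freqs[-1])
    return spline(f)
```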
[Figure: morphing between two spectral shapes. Each spectral shape (magnitude versus frequency) is divided into zones delimited by the frequencies f1 ... f5 of one shape and f1' ... f5' of the other.]
However, our future plans do not focus on any of the propositions mentioned above. We believe the solution lies in a more flexible system in which we would work with singer models rather than singer performances. The idea is to record the target singer singing all possible phonetics at different registers, with different intensities, in different emotional contexts, and so on, in this way sampling the space of all possible phonetic and musical contexts. The analysis of these recordings would be the basis of a singer model from which we could later synthesize, out of the score and the lyrics of a song, a possible performance of the modeled singer. That is what brought us to the singing synthesizer application.
\{ e_1, \ldots, e_q, \ldots, e_N \} = \max_k \left[ X(qM + k) \right] \qquad (8)
3.4 Discussion
[Figure: overview of the singing voice synthesizer: expression module, singer database and EpR voice model (gain and time controls).]
Fig 10 Frequency domain implementation of the EpR model
[Figure: the EpR voice model. For voiced phonation, an excitation template is transformed (pulse generation, subtract offset, windowing, fractional delay, FFT, transposition) according to the pitch and gain controls to give the voiced harmonic and voiced residual excitations; unvoiced phonation uses an unvoiced excitation. Each source excitation is passed through the EpR filter (source filter and vocal tract filter) and the outputs are summed into the singing voice spectrum.]
Fig 11 The EpR voice model
has traces of the original pitch. Otherwise, in the case of an
unvoiced phonation, we apply a filter that just changes the tilt
curve and the gain of the STFT of an original recording.
[Figure: derivation of the voiced residual excitation from the SMS residual spectrum (amplitude and phase versus frequency).]
Unvoiced excitation
The excitation in the unvoiced parts is left unmodeled; the original recording of the singer's performance is used directly.
The EpR filter can be decomposed into two cascaded filters. The first of them models the differentiated glottal pulse frequency response, and the second models the vocal tract (resonance filter).

The EpR source filter
The EpR source is modeled as a frequency domain curve plus one source resonance, applied to the flat frequency domain input excitation described in the previous section. This source curve is defined by a gain and an exponential decay as follows:

H_{SS}(f) = Gain + SlopeDepth \left( e^{Slope \cdot f} - 1 \right) \qquad (9)

The source resonance follows the magnitude response of a second-order resonator,

H(z) = \frac{A}{1 - B z^{-1} - C z^{-2}} \qquad (10)

with

C = -e^{-2\pi \frac{Bw}{f_s}}, \qquad B = 2 \cos(2\pi \cdot 0.5)\, e^{-\pi \frac{Bw}{f_s}}, \qquad A = 1 - B - C

where f_s is the sampling rate and Bw the resonance bandwidth. The resonance, centered at frequency F, is evaluated in the frequency domain as

R(f) = Amp \cdot \frac{\left| H\!\left( e^{\, j 2\pi \left( 0.5 + \frac{f - F}{f_s} \right)} \right) \right|}{\left| H\!\left( e^{\, j 2\pi \cdot 0.5} \right) \right|} \qquad (11)

The amplitude parameter (Amp) is relative to the source curve (a value of 1 means the resonance maximum is just over the source curve).

The EpR vocal tract filter
The vocal tract is modeled by a vector of resonances plus a differential spectral shape envelope. It can be understood as an approximation to the vocal tract filter. These filter resonances are modeled in the same way as the source resonance (see eq. 11), and the lower frequency resonances are roughly equivalent to the vocal tract formants.

[Figure: the ideal EpR(f) magnitude curve: the source curve decays exponentially from Gain towards Gain - SlopeDepth, and the source resonance and filter resonances (with amplitude Amp relative to the source curve) are superimposed on it.]

The EpR filters for the voiced harmonic and residual excitations are basically the same; they differ only in the gain and slope depth parameters. This approximation was obtained after comparing the harmonic and residual spectral shapes of several SMS analyses of singer recordings. Figure 17 shows these differences.
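As a rough sketch of how such a curve could be evaluated, using the equations as reconstructed above (all parameter names and values are illustrative, not the authors' implementation):

```python
import numpy as np

def resonance_db(f, F, Bw, amp_db, fs):
    """One EpR resonance in dB, relative to the source curve (eqs. 10-11 as given above)."""
    C = -np.exp(-2 * np.pi * Bw / fs)
    B = 2 * np.cos(2 * np.pi * 0.5) * np.exp(-np.pi * Bw / fs)
    A = 1 - B - C

    def H(w):                                        # resonator response H(e^{jw})
        z = np.exp(1j * w)
        return A / (1 - B / z - C / z ** 2)

    w = 2 * np.pi * (0.5 + (f - F) / fs)             # shift so that f = F lands on the peak
    return amp_db + 20 * np.log10(np.abs(H(w)) / np.abs(H(np.pi)))

def epr_filter_db(f, gain_db, slope, slope_depth_db, resonances, fs):
    """EpR magnitude curve in dB over the frequencies f: source curve (eq. 9) plus resonances,
    combined by taking their maximum over the source curve as described later in the text."""
    f = np.asarray(f, dtype=float)
    source = gain_db + slope_depth_db * (np.exp(slope * f) - 1.0)    # eq. (9), slope < 0
    bumps = [resonance_db(f, F, Bw, a_db, fs) for (F, Bw, a_db) in resonances]
    return source + np.maximum.reduce([np.zeros_like(f)] + bumps)
```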
[Figure: frequency domain implementation of the EpR model. The amplitude of the voiced excitation spectrum is converted from linear to dB; the source gain and slope, the source resonance, the vocal tract resonances (F1, F2, F3, ...) and the differential spectral shape are added in dB; the result is converted back to linear amplitude and combined with the phase to give the voice spectrum.]
Fig 19 Frequency domain implementation of the EpR model
Fig 18 The phase alignment is approximated as a linear
segment, with a phase shift for each resonance
The EpR filter implementation
The EpR filter is implemented in the frequency domain. The input is the spectrum that results from the voiced harmonic excitation or from the voiced residual excitation. Both inputs are supposed to be approximately flat spectra, so we just need to add the EpR resonances, the source curve and the differential spectral shape to the amplitude spectrum. In the case of the voiced harmonic excitation we also need to add the EpR phase alignment to the phase spectrum.
For each frequency bin we have to compute the value of the EpR filter. This implies a considerable computational cost, because we have to calculate the value of all the resonances. However, we can optimize this process by assuming that the value of the sum of all the resonances is equal to the maximum amplitude (in dB) of all the filter and excitation resonances (over the source curve). We can do even better by using only the two resonances neighboring each frequency bin. This is not a low-quality approximation of the original method, because the differential spectral shape envelope takes care of all the differences between the model and the real spectrum.
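A sketch of this per-bin evaluation with the two-neighbor simplification (the data layout and names are hypothetical):

```python
import numpy as np

def apply_epr(mag_db, phase, bin_freqs, source_db, res_db, res_freqs, dss_db, phase_align=None):
    """Add the EpR filter to a roughly flat excitation spectrum, bin by bin.
    res_db: array of shape (n_resonances, n_bins), each resonance in dB relative to the source
    curve; res_freqs: center frequency of each resonance; dss_db: differential spectral shape."""
    res_freqs = np.asarray(res_freqs)
    right = np.searchsorted(res_freqs, bin_freqs)             # index of the right-hand neighbor
    left = np.clip(right - 1, 0, len(res_freqs) - 1)
    right = np.clip(right, 0, len(res_freqs) - 1)
    cols = np.arange(len(bin_freqs))
    bump = np.maximum(res_db[left, cols], res_db[right, cols])    # only the two neighbors
    out_mag = mag_db + source_db + np.maximum(bump, 0.0) + dss_db
    out_phase = phase if phase_align is None else phase + phase_align
    return out_mag, out_phase
```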
If we want to avoid the time domain voiced excitation, especially
because of the computational cost of the fractional delay and the
FFT, we can change it to be directly generated in the frequency
domain. From the pitch and gain input we can generate a train of
deltas in frequency domain (sinusoids) that will be convolved with
the transform of the synthesis window and then synthesized with
the standard frame-based SMS synthesis, using the IFFT and overlap-add method. However, the voice quality may suffer some degradation due to the fact that the sinusoids are assumed to have constant amplitude and frequency along the frame duration.