State-Of-The-Art in Fundamental Frequency Tracking: Stéphane Rossignol, Peter Desain and Henkjan Honing
Proceedings of
Workshop on Current Research Directions in Computer Music, 244-254. Barcelona: UPF.
Abstract
Pitch-tracking has been an important topic of research in speech and music research. Several methods have
been proposed to obtain reliable f0 trajectories from harmonic signals. The paper will review these. Some
issues that are left are: how to evaluate and improve the quality and reliability of the pitch-tracking, and how
to realize this in an automated method that can be used reliably and systematically on large data sets. In order
to address these issues, we will focus on an approach that takes advantage of the availability of knowledge in
trying to obtain more reliable and precise f0 trajectories from monophonic and harmonic audio fragments.
Two methods are compared that obtain reliable and precise f0 trajectories from monophonic audio fragments.
These trajectories can be used for the analysis and modeling of vibrato (frequency modulation) in music
performance. The pitch extraction methods take advantage of the fact that the score, the timing (the
performers synchronized with a piano accompaniment), the instrument and sometimes even the fingering are
known.
[...] harmonic sines) sounds, it is necessary to isolate each harmonic by band-pass filtering.
In our case, the score of the music is known and can be used as a guide. The pitch trackers described
in this article are therefore referred to as knowledge-based pitch trackers.
A pitch tracker using knowledge is described in Scheirer (1995). One of the goals of the work
presented in that article is to solve the problem of the transcription of polyphonic sounds. It is a
score-aided transcription system. A comb-filter strategy, that is to say a non-local strategy, is
used. In that article, Scheirer says: “It seems on the surface that using the score to aid
transcription is “cheating”, or worse, useless - what good is it to build a system which extracts
information you already know?”. In our case, as the amplitude of the frequency modulation is assumed
to be large, the score does not follow the frequency trajectory. Score-based pitch tracking is thus
very useful for our specific problem.
1.2 Current Research
Pitch-tracking has been an important topic of research in speech and music research. Several methods
have been proposed to obtain reliable f0 trajectories from harmonic signals. The paper will review
these. Some issues that are left are: how to evaluate and improve the quality and reliability of the
pitch-tracking, and how to realize this in an automated method that can be used reliably and
systematically on large data sets.
To address these issues, we will focus on an approach that takes advantage of the availability of
knowledge in trying to obtain more reliable and precise f0 trajectories from monophonic and harmonic
audio fragments. It is a hard problem, especially, for instance, when sympathetic resonance of open
strings in string instruments interferes with some harmonics of the main sound, or when transitions
are so fast that tracks of different harmonics are connected. We will show that knowledge about the
instrument and the music played can be used to improve the results of the presented methods.
These methods are developed in the context of a larger project on the analysis and modeling of
vibrato in music performance (Desain and Honing 1996; Timmers and Desain 2000). In order to model the
vibrato during notes and in note transitions, accurate f0 trajectories are needed. For this, a large
systematic set of music performances was collected (see section 1.3). The setup of the data
collection provides two kinds of knowledge. Firstly, “score” information is used, such as pitch
information and the predicted onset times computed from the known tempo (the latter makes it
different from a score, hence the inverted commas). f0s is used, for instance, to fit the length of
the frames used for the band-pass filtering and for the fn extraction (see Figures 2 and 4).
Secondly, during the data fusion stage, knowledge about the instrument, such as its spectral
characteristics, can be used, since a frequency trajectory is sometimes too noisy to be used, for
example because of a missing harmonic (e.g., in wind instruments) or sympathetic resonance (e.g., in
string instruments).
We will examine here two alternative pitch extraction methods. Both consist of three stages. In the
first stage, for both methods, the audio signal is fed through a band-pass filter bank. For each of
the first N harmonics a time-varying band-pass filter is used, which adjusts its length and central
frequency according to the frequency information in the score, f0s. Information from the instrument
is used to adjust the bandwidth to the pitch and to the speed of transitions. Thus, each harmonic is
isolated, and N new sound signals are obtained. The two following stages are not the same for the two
methods.
Considering the first method, in the second stage the frequency and energy trajectories are computed
for each harmonic (peak tracking), using the signals obtained in the previous stage. In the final
stage the fi and amplitude trajectories obtained are merged to provide the optimal f0 trajectory.
Considering the other method, in the second stage, portions of the spectrum, centered on the
frequency given by the score, are merged. In the third stage, the peak tracking is performed. During
the data fusion stage, for both methods, instrument information is used to decide on the correct
interpretation in situations where a higher harmonic is known to be a louder or more reliable source
of f0 information than the fundamental itself, or where the tracks of certain harmonics of certain
fundamental frequencies are known to be distorted by sympathetic resonance. For the second method,
automatic techniques to detect the bad tracks have been implemented.
Next, we will describe the dataset that was used in the analyses, followed by the two pitch
extraction methods (sections 2 and 3), completed by an evaluation and discussion of the results
obtained (sections 4 and 5).
1.3 Data set of music performances
The dataset used in this paper consists of a large and systematically collected set of performances
of a single fragment of music by a variety of instruments. The fragment consists of the first twenty
notes of “The Swan” by C. Saint-Saëns, performed along with a MIDI-controlled grand piano. This was
used to control for the desired tempo, and as such allows for studying, for example, how vibrato is
adapted to note duration. Seven instruments (cello,
Published as: Rossignol, S., Desain, P., and Honing, H. (2001) State-of-the-art in fundamental frequency tracking. Proceedings of
Workshop on Current Research Directions in Computer Music, 244-254. Barcelona: UPF.
[...] sounds are cut into frames, this method is considered a “global strategy”. But, as the length
of the frames changes with f0s and as such provides us with an optimal size, knowledge allows us to
improve the results.
For the AS method, the f0s knowledge is used to determine the length of the frames, which is equal to
M fe/f0s samples (with M ∈ [3, 10]). Because of the Hilbert filtering, we say that this method is
“global”. But to compute the “instantaneous frequency” only two complex samples are needed.
Considering the TK method, the instantaneous frequency is estimated as:
$$F = \arccos\left(1 - \frac{P[x(n) - x(n-1)]}{2\,P[x(n)]}\right) \qquad (1)$$
where:
$$P[y(m)] = y^2(m) - y(m-1)\,y(m+1) \qquad (2)$$
is the Teager-Kaiser operator, and where x are the sound samples.
A similar formula is available to estimate the instantaneous amplitude:
$$A = \sqrt{\frac{P[x(n)]}{1 - \cos^2(F)}} \qquad (3)$$
Knowledge is not used here.
As only four consecutive sound samples are needed to obtain an estimate of the frequency and of the
amplitude, the TK method is considered a “local strategy”. It is assumed that “the amplitude and the
frequency do not vary too fast (time rate of change of value) or too greatly (range of value) in time
compared to the carrier frequency” (Maragos and Kaiser 1993). These two conditions are related to the
vibrato: the first one to its frequency fv (or vibrato rate), and the second one to its amplitude Av
(or vibrato extent). The transitions also have to be relatively smooth.
[...]
$$f_0 = \frac{\sum_{i=1}^{N} A_i \, f_i / i}{\sum_{i=1}^{N} A_i} \qquad (4)$$
where N is the number of harmonics taken into account, fi is the frequency found for the ith
harmonic, and Ai is the amplitude of the ith harmonic.
It can be noticed that for this first method, no information is coming from the box “Instrument” (see
Figure 2).
A more refined method can be used:
[Equation (5)]
where Wi are the weights (information coming from the box “Instrument”), and where the parameters si
describe the fact that for some instruments (e.g. string instruments) the harmonics are slightly
shifted in frequency. At the moment, the weights Wi are predefined. Some automatic methods have been
studied. They are based on the results of a rating experiment in which listeners compared original
and re-synthesized sound signals (see sections 3.3 and 4.3).
3 Pitch-tracker B (fusion before peak detection)
3.1 Architecture
For the alternative pitch-tracker the analysis is also composed of three stages. The first stage is
the same for both pitch-trackers. Secondly (section 3.2), portions of the spectrum are extracted.
Thirdly, these portions are merged, and the peak tracking is performed. The weights Wi described in
section 2.4 can be taken into account. They weight the amplitude of the extracted portions of the
spectrum. On the other hand, a technique to automatically detect the bad notes has been implemented.
Thus, the data fusion stage is completed by the automatic detection of bad tracks (section 3.3).
3.2 Spectra extraction (phase 2)
In the second stage transposed portions of the spectrum are combined. These portions correspond
respectively to these frequency bands:
$$[f_0 - \Delta f,\; f_0 + \Delta f] \qquad (6)$$
for the first harmonic, and
$$[2 f_0 - 2\Delta f,\; 2 f_0 + 2\Delta f] \qquad (7)$$
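The merging of transposed spectrum portions described in section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the function name `fused_band_f0`, the Hann window, the zero-padded FFT size, the linear interpolation onto a common grid, and the generalization of the bands (6) and (7) to [i f0 − i ∆f, i f0 + i ∆f] for harmonic i. The weights play the role of the Wi, and the peak is picked only after the merge.

```python
import numpy as np

def fused_band_f0(frame, fs, f0_score, delta_f, n_harmonics=4, weights=None,
                  n_fft=1 << 16, n_grid=201):
    """Merge the transposed harmonic bands of one frame, then pick the peak.

    Hypothetical helper: harmonic i is read in [i*f0 - i*df, i*f0 + i*df]
    (assumed generalization of equations 6 and 7) and mapped back to the
    f0 axis by dividing its frequency axis by i.
    """
    window = np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(frame * window, n_fft))  # zero-padded spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Common f0 axis on which all transposed bands are accumulated.
    grid = np.linspace(f0_score - delta_f, f0_score + delta_f, n_grid)
    if weights is None:
        weights = np.ones(n_harmonics)
    merged = np.zeros(n_grid)
    for i in range(1, n_harmonics + 1):
        # Sampling the spectrum at i * grid transposes band i onto the f0 axis.
        merged += weights[i - 1] * np.interp(i * grid, freqs, magnitude)
    return grid[np.argmax(merged)], merged

# Usage: a 50 ms frame with four harmonics of 443 Hz, analyzed around a score value of 440 Hz.
fs = 44100
t = np.arange(int(0.05 * fs)) / fs
frame = sum(np.cos(2 * np.pi * i * 443.0 * t) / i for i in range(1, 5))
f0_hat, merged = fused_band_f0(frame, fs, f0_score=440.0, delta_f=20.0)
```

Because the bands are summed before the maximum is searched, a single disturbed harmonic shifts the estimate less than it would if each harmonic were tracked separately and the tracks merged afterwards.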
[...] considered harmonic. So, the width of the main lobe decreases with the number of the harmonic.
3.3 Data fusion with automatic detection of bad tracks (phase 3)
In the pitch-tracker discussed above, weights that had to be explicitly provided were used to improve
the quality of the data fusion. In this pitch tracker we incorporate an automatic method to rate the
quality of [...] position of this maximum for two successive frames. When something disrupts the
partial tracker, this rate is expected to be large.
The third one, M3, is the correlation between the spectrum around its maximum and its theoretical
shape if the analyzed signal were a pure sine with constant amplitude and frequency.
The final measure for the detection of bad tracks is thus:
$$M_i = a_1 M_1 + a_2 (1 - M_2) + a_3 M_3 \qquad (8)$$
The parameters a1, a2, a3 and δf have to be optimized.
This variable M is used instead of the Wi. Therefore, a value is obtained for each frame. This is not
the case when the weights Wi are considered, which are defined note by note.
4. Results
Firstly, the performance of the three harmonic trackers is discussed. Secondly, the two whole pitch
tracker systems are compared. Thirdly, the performance of the technique to automatically detect the
bad tracks is analyzed. And fourthly, the performance of the whole system, using the second
pitch-tracker, is shown.
[...] trackers
4.1.1 Introduction
[...] methods do not behave in the same way at all. We showed these differences considering a
simulated signal.
Three tests on a simulated signal have been performed. The parameters for this signal are:
fundamental frequency of the first note f0(a) = 440 Hz, fundamental frequency of the second note
f0(b) = 493 Hz, transition moment ta = 0.71 s, transition speed tr = 0.003, magnitude of the vibrato
Av = 30 Hz, frequency of the vibrato fv = 5 Hz and phase of the vibrato ϕv = [...]
[Equation (10)]
where t is the time (in seconds), c = (f0(b) − f0(a))/2, and Φ(a) is the phase at t = 0.
It can be noticed that, for these tests, the disruptive parameters 1 and 2 concern the time rate of
change of value and the range of value (see section 2.3). The length of the frames is constant. It
has been chosen equal to 7 ms, which is close to 3 fe/f0s for the smallest fundamental frequency,
f0(a).
4.1.2 Behavior on sine signal with a transition and vibrato
Figure 5 shows the f0 trajectories obtained for the whole sound, and Figure 6 the f0 trajectories
during the transition. In both cases, four f0 trajectories are plotted: the ideal f0 trajectory and
the f0 trajectories obtained using the FFT method, the AS method and the TK method. It can be seen
that the three harmonic trackers can follow the variation of the frequency well. However, the TK
method shows some artefacts during the transition.
[Figure: f0 trajectories; x-axis: time (s), y-axis: f0 (Hz)]
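To make the behavior of the TK method on a vibrato signal concrete, the following Python sketch applies the Teager-Kaiser estimates of section 2.3 (instantaneous frequency and amplitude) to a synthetic sine. It is a sketch under stated assumptions, not the authors' code: the function names, the 44.1 kHz sampling rate, and the test tone are illustrative. The tone has vibrato only, without the note transition, because the transition law is not reproduced here. The amplitude estimate uses the square root that the sinusoidal case requires, since P[A cos(Ωn)] = A² sin²(Ω).

```python
import numpy as np

def teager_kaiser(y):
    """Teager-Kaiser operator P[y(m)] = y(m)^2 - y(m-1) * y(m+1), interior samples only."""
    return y[1:-1] ** 2 - y[:-2] * y[2:]

def tk_estimate(x, fs):
    """Instantaneous frequency (Hz) and amplitude for samples n = 2 .. len(x) - 2.

    Each estimate uses four consecutive samples x(n-2) .. x(n+1).
    No guard is included for P[x(n)] = 0 (e.g. digital silence).
    """
    diff = x[1:] - x[:-1]            # diff[k] = x(k+1) - x(k)
    p_x = teager_kaiser(x)[1:]       # P[x(n)] aligned with n = 2 .. N-2
    p_diff = teager_kaiser(diff)     # P[x(n) - x(n-1)] for n = 2 .. N-2
    cos_f = np.clip(1.0 - p_diff / (2.0 * p_x), -1.0, 1.0)
    freq_hz = np.arccos(cos_f) * fs / (2.0 * np.pi)   # rad/sample converted to Hz
    amp = np.sqrt(np.abs(p_x) / (1.0 - cos_f ** 2))
    return freq_hz, amp

# Usage: 440 Hz sine with a 5 Hz vibrato of 30 Hz extent (values taken from the test signal).
fs = 44100                                             # assumed sampling rate
t = np.arange(int(0.5 * fs)) / fs
f_true = 440.0 + 30.0 * np.sin(2 * np.pi * 5.0 * t)
phase = 2 * np.pi * 440.0 * t - (30.0 / 5.0) * np.cos(2 * np.pi * 5.0 * t)
x = np.cos(phase)
freq_hz, amp = tk_estimate(x, fs)
mean_error_hz = np.mean(np.abs(freq_hz - f_true[2:-1]))
```

On this clean, slowly modulated tone the two conditions quoted from Maragos and Kaiser hold, so the estimate stays close to the true trajectory; a fast transition or added noise violates them, which is where the local strategy is expected to show artefacts.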
[Figures: f0 trajectories (ideal, FFT, AS and TK methods); x-axis: time (s), y-axis: fundamental frequency (Hz)]
In Figure 7 are shown the results obtained when the length of the frames is fixed to 25 ms. This
value is [...] length.
$$s = \cos(\Phi_1 + \Phi^{(1)}) + \sum_{i=2}^{4} a_i \cos(i\,\Phi_1 + \Phi^{(i)}) \qquad (11)$$
[...] disturbed.
4.1.4 Behavior on noisy signals
In this case, the simulated signal is equal to:
$$s = \cos(\Phi_1 + \Phi^{(1)}) + b \qquad (12)$$
where b is a normal noise, with mean equal to 0 and standard deviation equal to σ. The results are
shown in Figure 9.
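The harmonic and noisy test signals of this section can be generated with a few lines of Python. This is only a sketch: the function names, the sampling rate, the harmonic amplitudes a_i, the zero phase offsets and the noise level σ = 0.1 are illustrative assumptions, and the phase law Φ1 used here contains the vibrato only (no note transition), which is a simplification.

```python
import numpy as np

def vibrato_phase(t, f0, vib_extent, vib_rate):
    """Phase (rad) whose instantaneous frequency is f0 + vib_extent * sin(2*pi*vib_rate*t)."""
    return 2 * np.pi * f0 * t - (vib_extent / vib_rate) * np.cos(2 * np.pi * vib_rate * t)

def harmonic_signal(phase1, amps, offsets):
    """Fundamental plus harmonics i = 2, 3, ...; amps holds a_2, a_3, ...,
    offsets holds the phase offsets of harmonics 1, 2, 3, ..."""
    s = np.cos(phase1 + offsets[0])
    for i, a_i in enumerate(amps, start=2):
        s = s + a_i * np.cos(i * phase1 + offsets[i - 1])
    return s

def noisy_signal(phase1, offset, sigma, rng):
    """Single sine plus zero-mean normal noise of standard deviation sigma."""
    return np.cos(phase1 + offset) + rng.normal(0.0, sigma, size=phase1.shape)

# Usage with assumed values for a_i, the offsets and sigma.
fs = 44100
t = np.arange(fs) / fs
phase1 = vibrato_phase(t, f0=440.0, vib_extent=30.0, vib_rate=5.0)
s_harm = harmonic_signal(phase1, amps=(0.5, 0.25, 0.125), offsets=(0.0, 0.0, 0.0, 0.0))
s_noisy = noisy_signal(phase1, offset=0.0, sigma=0.1, rng=np.random.default_rng(0))
```

Building every harmonic from the same phase law keeps the partials exactly harmonic, so any deviation seen by a tracker comes from the tracker itself or from the added noise.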
[Figure: f0 trajectories of the first harmonic obtained with the FFT, AS and TK methods; x-axis: time (s), y-axis: fundamental frequency (Hz)]
Figure 10: zoom of the f0 trajectory (true trajectory, trajectory obtained with the pitch-tracker A,
trajectory when the band-pass filtering is not performed); x-axis: time (s), y-axis: frequency (Hz)
Figure 12. Top panel: f0 trajectories obtained with the three [...] harmonic); x-axis: time (s),
y-axis: frequency (Hz). Bottom panel: spectra of a frame of the original signal [21.9 s, 21.92 s]
and of the corresponding band-pass filtered signal; x-axis: frequency (Hz), y-axis: amplitude.
For the disruptive partial, the frequency is 3f0 + 150 Hz; the amplitude is 1.5/15 (notice that the
amplitude of the disruptive partial is higher than the amplitude of the third harmonic), and there is
a vibrato: fv = 4.9 Hz, Av = 29.4 Hz (it is different from the vibrato present on the harmonics).
Figure 13 shows the fundamental frequency trajectories for pitch-trackers A and B.
Figure 13: f0 trajectories obtained with the two pitch trackers on a simulated signal; x-axis: time
(s), y-axis: frequency (Hz)
Here, clearly, the trajectory obtained using the alternative fusion method (i.e. pitch tracker B) is
closer to the true trajectory than the one obtained using the first fusion method (i.e. pitch
tracker A).
Figure 14: The spectra of the four harmonics in the simulated signal, and the combined spectra
(X-axis: ≈ frequency (not in Hz); Y-axis: linear amplitude)
The second pitch-tracker is more robust to mistakes on Wi than the first one. Figure 14 illustrates
that, when one harmonic is noisier than the others, taking the average spectrum is indeed a more
reliable method.
The correlation between the mean within the subjects and the results obtained with the automatic
method is 0.5.
4.4 Example of the full system
Finally, we show an example of the full system in operation, using the preferred method, pitch
tracker B. For this we return to the example presented in the introduction (see Figure 1). In
Figure 17, the spectrogram, the score information and the obtained frequency trajectories are
plotted. In Figure 17, only the first harmonic is shown. In Figure 1, the first four harmonics are
shown. The dotted score lines indicate that the corresponding harmonic is not taken into account. The
corresponding portions of the frequency trajectories are plotted in white. The amplitude trajectory
information is also taken into account. When the amplitude is too small, the corresponding portions
of the frequency trajectories are not shown (see for instance the end of each harmonic, after 24
seconds).
As an example, for the long note at MIDI pitch 57, we can see in Figure 1 that there is a resonance
at the beginning of this note. So, the harmonic tracker fails for this part of the signal, as can be
seen. But, after the data fusion stage, the f0 trajectory shown in Figure 17 is obtained. If we
compare this trajectory to the frequency trajectory obtained for the first harmonic (Figure 1), the
results have clearly been improved.
[...] trajectories. The obtained f0 trajectories can be used to analyse vibrato and portamento.
While our primary motivation for developing this knowledge-based method is to obtain precise f0
information from the experimental data set, the idea of using knowledge in f0 tracking can be useful
for other computer music systems as well, for instance when f0 needs to be tracked in a live
situation where score and timing information is available. The methods described in this paper can in
principle be used for an efficient f0 tracker that considers only those parts of the audio signal of
the singer or instrumentalist to be followed that are relevant for f0 tracking.
A measure of the voicing coefficient is also obtained. It allows us to detect, for instance, silences
and noisy states (noise component predominant over harmonic component), but also to check the quality
of our processing.
An interesting extension of our pitch-trackers would be to use them to analyze polyphonic sounds. For
example, when two harmonics, coming from two different instruments or voices, are too close, the
corresponding trajectories (see Figure 2) or spectrum (see Figure 4) would not be used during the
fusion stage.
6 Acknowledgements
This research is supported by the MOSART (Music Orchestration Systems in Algorithmic Research and
Technology) European project, and by the Netherlands Organization for Scientific Research (NWO).
References
Boashash, Boualem, “Estimating and interpreting the instantaneous frequency of a signal”,
Proceedings of the IEEE, Volume 80, no. 4, pp. 539-568, April 1992.
Desain, P. and Honing, H., “Modeling Continuous Aspects of Music Performance: Vibrato and
Portamento”, Proceedings of the International Music Perception and Cognition Conference,
B. Pennycook & E. Costa-Giomi, CD-ROM, 1996.
Hermes, D., “Pitch analysis”, in M. Cooke and S. Beet, eds., Visual Representation of Speech
Signals, New York, John Wiley and Sons, 1992.
Hess, Wolfgang, “Pitch determination of speech signals”, Springer-Verlag, 1983.
Lane, J., “Pitch detection using a tunable IIR filter”, Computer Music Journal, Volume 14, no. 3,
pp. 46-59.
Maragos, Petros and Kaiser, James F., “Energy separation in signal modulations with application to
speech analysis”, IEEE Transactions on Signal Processing, Volume 41, no. 10, pp. 3024-3050,
October 1993.
Timmers, Renee and Desain, Peter, “Vibrato: the questions and answers from musicians and science”,
Proceedings of the International Conference on Music Perception and Cognition, Keele University,
Department of Psychology, CD-ROM, 2000.
Vakman, David, “On the AS, the TK energy algorithm and other methods for defining amplitude and
frequency”, IEEE Transactions on Signal Processing, Volume 44, no. 4, pp. 791-797, April 1996.
Wang, Avery Li-Chun, “Instantaneous and frequency-warped signal processing techniques for auditory
source separation”, Ph.D. thesis, Stanford University, August 1994.