State-Of-The-Art in Fundamental Frequency Tracking: Stéphane Rossignol, Peter Desain and Henkjan Honing


Published as: Rossignol, S., Desain, P., and Honing, H. (2001) State-of-the-art in fundamental frequency tracking.
Proceedings of Workshop on Current Research Directions in Computer Music, 244-254. Barcelona: UPF.

State-of-the-art in fundamental frequency tracking


Stéphane Rossignol, Peter Desain and Henkjan Honing
Music, Mind, Machine Group, NICI, University of Nijmegen, The Netherlands, http://www.nici.kun.nl/mmm

{S.Rossignol, desain, honing}@nici.kun.nl

Abstract

Pitch tracking has been an important topic in speech and music research. Several methods have
been proposed to obtain reliable f0 trajectories from harmonic signals. The paper will review these. Some
issues that remain are: how to evaluate and improve the quality and reliability of the pitch tracking, and how
to realize this in an automated method that can be used reliably and systematically on large data sets. In order
to address these issues, we will focus on an approach that takes advantage of the availability of knowledge in
trying to obtain more reliable and precise f0 trajectories from monophonic and harmonic audio fragments.
Two methods are compared that obtain reliable and precise f0 trajectories from monophonic audio fragments.
These trajectories can be used for the analysis and modeling of vibrato (frequency modulation) in music
performance. The pitch extraction methods take advantage of the fact that the score, the timing (the
performers synchronized with a piano accompaniment), the instrument and sometimes even the fingering are
known.

1. Introduction

1.1 Fundamental frequency extraction

Robust systems that retrieve pitch information from musical performances are still hard to design. A
very large number of methods have been developed (see for instance Hess 1983). We can classify pitch
trackers into five general categories: autocorrelation, adaptive filter, time domain, frequency domain
and models of the human ear (see Roads 1996).

Consider first the autocorrelation algorithms (Moorer 1975). These methods are most efficient at mid
to low frequencies. In musical applications, the pitch range is broader.

Considering the adaptive filter methods (see Lane 1990), one pitch detector is based on the analysis of
the difference between the filter output and the filter input. This difference must be close to zero.
The band-pass filter center frequency is controlled by this difference. Another adaptive filter is
based on the optimum comb method (Lane 1990). The goal is to minimize the output signal.

Considering the time domain methods, one type of pitch detector is based on the analysis of the
zero-crossing points (Moorer 1975, Hermes 1992). Preprocessing by filters has to be performed in order
to solve the problem of the low-amplitude zero-crossings caused by high-frequency components.

A few pitch detectors exist in the frequency domain. Most of them are based on the analysis of the FFT
spectrum, or of the cepstrum (Schafer and Rabiner 1970).

The methods based on auditory models combine frequency and temporal methods.

Most of the efforts have taken place in the frequency domain (see Brown 1992). In this paper we present
methods working in the temporal and in the frequency domains (the FFT in the frequency domain; the
Analytic Signal and the Teager-Kaiser methods in the temporal domain). It is shown that the frequency
domain method is more efficient.

In order to obtain precise frequency trajectories, we must use local strategies, that is to say
relatively short frame lengths. However, using the FFT spectrum, the frame length has to be at least
three times the period of the signal we want to detect. For a sine with a frequency of 440 Hz the
frame length must be around 7 ms, that is, 75 samples if the sampling rate is 11 kHz. Some
alternatives to the FFT have been proposed in the literature. One of them is based on the Analytic
Signal (Hess, 1983; Boashash, 1992; Wang, 1994). Another one is based on the Teager-Kaiser energy
algorithm (Maragos, 1993; Vakman, 1996). For the first one only two samples are needed to estimate the
instantaneous frequency and the instantaneous amplitude of a signal. For the second one, four samples
are needed. But, for both of them, the signal is assumed to be a pure sine whose frequency and
amplitude vary slowly in time. As the musical sounds in use are composite (i.e. composed of a sum of

harmonic sines) sounds, it is necessary to isolate each harmonic by band-pass filtering.

In our case, the score of the music is known and can be used as a guide. The pitch trackers described
in this article are therefore referred to as knowledge-based pitch trackers.

A pitch tracker using knowledge is described in Scheirer (1995). One of the goals of the work presented
in that article is to solve the problem of the transcription of polyphonic sounds. It is a score-aided
transcription system. A comb-filter strategy, that is to say a non-local strategy, is used. In that
article, Scheirer says: “It seems on the surface that using the score to aid transcription is
“cheating”, or worse, useless - what good is it to build a system which extracts information you
already know?”. In our case, as the amplitude of the frequency modulation is assumed to be large, the
score does not follow the frequency trajectory. Score-based pitch tracking is therefore very useful
for solving our specific problem.

1.2 Current Research

Pitch tracking has been an important topic in speech and music research. Several methods have been
proposed to obtain reliable f0 trajectories from harmonic signals. The paper will review these. Some
issues that remain are: how to evaluate and improve the quality and reliability of the pitch tracking,
and how to realize this in an automated method that can be used reliably and systematically on large
data sets.

To address these issues, we will focus on an approach that takes advantage of the availability of
knowledge in trying to obtain more reliable and precise f0 trajectories from monophonic and harmonic
audio fragments. It is a hard problem, especially, for instance, when sympathetic resonance of open
strings in string instruments interferes with some harmonics of the main sound, or when transitions
are so fast that tracks of different harmonics are connected. We will show that knowledge about the
instrument and the music played can be used to improve the results of the presented methods.

These methods are developed in the context of a larger project on the analysis and modeling of vibrato
in music performance (Desain and Honing 1996; Timmers and Desain 2000). In order to model the vibrato
during notes and in note transitions, accurate f0 trajectories are needed. For this a large systematic
set of music performances was collected (see section 1.3). The setup of the data collection provides
two kinds of knowledge. Firstly, “score” information is used, such as pitch information and the onset
times predicted from the known tempo (the latter makes it different from a score, hence the inverted
commas). The score frequency f0s is used, for instance, to fit the length of the frames used for the
band-pass filtering and for the fn extraction (see Figures 2 and 4). Secondly, during the data fusion
stage, knowledge about the instrument, such as its spectral characteristics, can be used. This matters
because a frequency trajectory is sometimes too noisy to be used, for example because of a missing
harmonic (e.g., in wind instruments) or sympathetic resonance (e.g., in string instruments).

We will examine here two alternative pitch extraction methods. Both are made up of three stages. In
the first stage, for both methods, the audio signal is fed through a band-pass filter bank. For each
of the first N harmonics a time-varying band-pass filter is used which adjusts its length and central
frequency according to the frequency information in the score, f0s. Information from the instrument is
used to adjust the bandwidth to the pitch and to the speed of transitions. Thus, each harmonic is
isolated, and N new sound signals are obtained. The two following stages are not the same for the two
methods.

Considering the first method, in the second stage the frequency and energy trajectories are computed
for each harmonic (peak tracking), using the signals obtained in the previous stage. In the final
stage the fi and amplitude trajectories obtained are merged to provide the optimal f0 trajectory.
Considering the other method, in the second stage, portions of the spectrum, centered on the frequency
given by the score, are merged. In the third stage, the peak tracking is performed. During the data
fusion stage, for both methods, instrument information is used to decide on the correct interpretation
in situations where a higher harmonic is known to be a louder or more reliable source of f0
information than the fundamental itself, or where the tracks of certain harmonics of certain
fundamental frequencies are known to be distorted by sympathetic resonance. For the second method
automatic techniques to detect the bad tracks have been implemented.

Next, we will describe the dataset that was used in the analyses, followed by the two pitch extraction
methods (sections 2 and 3), completed by an evaluation and discussion of the results obtained
(sections 4 and 5).

1.3 Data set of music performances

The dataset used in this paper consists of a large and systematically collected set of performances of
a single fragment of music by a variety of instruments. The fragment consists of the first twenty
notes of “The Swan” by C. Saint-Saëns, performed along with a MIDI-controlled grand piano. The piano
was used to control for the desired tempo, and as such allows for studying, for example, how vibrato
is adapted to note duration. Seven instruments (cello,

oboe, tenor, theremin, violin, soprano, and shakuhachi) played the melody in ten different tempos
(54.5, 55.8, 57.1, 58.5, 60.0, 61.5, 63.2, 64.9, 66.7 and 68.8 beats per minute). Each performance was
repeated six times to be able to check for consistency in performance. All this results in 420
recordings of which the f0 trajectories had to be obtained. See Desain, Honing, Aarts and Timmers
(2000) for more details.

An example is given in Figure 1. The spectrogram, the score information in use (melody contours in
straight lines) and the obtained frequency trajectories are shown.

Figure 1: spectrogram, selection of bands using score information (melody contour in straight lines)
and frequency trajectories obtained therein, for the cello (54.5 bpm)

2. Pitch-tracker A (fusion after peak detection)

2.1 Architecture

The analysis of f0 from audio signals is composed of three stages. Firstly, the original audio signal
is band-pass filtered. Thus, each harmonic is isolated (section 2.2), and N new sounds are obtained.
Secondly, the frequency and the energy trajectories are computed for each harmonic, using the signals
obtained at the previous stage of the analysis (section 2.3). Three methods to obtain these
trajectories have been tested, with FFT as the preferred method. Thirdly, the fn and An trajectories
are mixed in order to provide the optimal f0 trajectory (section 2.4).

Figure 2. Architecture of pitch-tracker with fusion after peak detection

2.2 Filtering (phase 1)

In the first phase the appropriate harmonic needs to be selected. This is the input for the fn
extraction phase. After this time-varying band-pass filtering, the amplitude of the harmonic we want
to keep must be higher than the amplitude of all the other harmonics. Furthermore, the isolated
harmonics are used for checking the quality and appropriate selection controlled by the score
information (see section 4.3).

2.3 fn extraction (phase 2)

Three harmonic trackers have been tested. The input signal considered for each of them is the sound
obtained after the band-pass filtering. The results obtained with each of them for a simulated signal
and for a true sound signal are shown in section 4. It is shown there why the last two methods have
been rejected. The first method is based on the FFT spectrum (FFT method); the second one is based on
the Analytic Signal (AS method; for more complete theoretical developments, see Hess 1983, Boashash
1992, and Wang 1994); and the third one is based on the Teager-Kaiser energy algorithm (TK method; see
Maragos 1993 and Vakman 1996). These three algorithms are shown in Figure 3.

The FFT method is a “frequency” domain strategy; the AS and TK methods are “temporal” domain
strategies.

The results obtained with each of them for simulated signals are shown and compared in section 4.1.

Figure 3: The three alternative harmonic trackers
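
As an illustration of why the AS method needs only two complex samples, the sketch below estimates the
instantaneous frequency from the phase step between two consecutive analytic samples, and the
instantaneous amplitude from the modulus. This phase-difference estimator is a common choice and is an
assumption here, not necessarily the exact formula used by the authors. The analytic signal is built
directly as a complex exponential for the test (in practice it would come from Hilbert filtering), and
the function name and the 11 kHz sampling rate are illustrative only.

```python
import cmath
import math


def as_frequency_amplitude(z_prev, z_curr, fs):
    """Estimate frequency (Hz) and amplitude from two consecutive analytic samples.

    The frequency is the phase advance per sample, converted to Hz.
    This estimator is an assumption of the sketch, not taken from the paper.
    """
    phase_step = cmath.phase(z_curr * z_prev.conjugate())  # radians per sample
    return phase_step * fs / (2 * math.pi), abs(z_curr)


# Hypothetical test signal: analytic version of a 440 Hz sine of amplitude 0.8.
fs = 11000.0
f_true, a_true = 440.0, 0.8
z = [a_true * cmath.exp(1j * 2 * math.pi * f_true * n / fs) for n in range(3)]

freq, amp = as_frequency_amplitude(z[0], z[1], fs)
```

On a stationary sine the two-sample estimate is exact up to rounding; the difficulty discussed in
section 4 arises when the filtered signal is not a single dominant sine.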



For the FFT method, the f0s knowledge is used to determine the length of the frames, which is equal to
M fe/f0s samples (with M ∈ [3, 10], fe being the sampling rate). As the analysed sounds are cut into
frames, this method is considered a “global strategy”. But, as the length of the frames changes with
f0s and as such provides us with an optimal size, knowledge allows us to improve the results.

For the AS method, the f0s knowledge is used to determine the length of the frames, which is equal to
M fe/f0s samples (with M ∈ [3, 10]). Due to the Hilbert filtering, we say that this method is
“global”. But to compute the “instantaneous frequency” only two complex samples are needed.

Considering the TK method, the instantaneous frequency is estimated as:

$$F = \arccos\left(1 - \frac{P[x(n) - x(n-1)]}{2\,P[x(n)]}\right) \qquad (1)$$

where:

$$P[y(m)] = y^2(m) - y(m-1)\,y(m+1) \qquad (2)$$

is the Teager-Kaiser operator, and where x are the sound samples. A similar formula is available to
estimate the instantaneous amplitude:

$$A = \sqrt{\frac{P[x(n)]}{1 - \cos^2(F)}} \qquad (3)$$

Knowledge is not used here.

As only four consecutive sound samples are needed to obtain an estimate of the frequency and of the
amplitude, the TK method is considered a “local strategy”. It is assumed that “the amplitude and the
frequency do not vary too fast (time rate of change of value) or too greatly (range of value) in time
compared to the carrier frequency” (Maragos and Kaiser 1993). These two conditions are related to the
vibrato: the first one to its frequency fv (or vibrato rate), and the second one to its amplitude Av
(or vibrato extent). The transitions also have to be relatively smooth.

2.4 Data fusion (phase 3)

The definition used is:

$$f_0 = \frac{1}{\sum_{i=1}^{N} A_i} \sum_{i=1}^{N} \frac{A_i f_i}{i} \qquad (4)$$

where N is the number of harmonics taken into account, fi is the frequency found for the ith harmonic,
and Ai is the amplitude of the ith harmonic.

It can be noticed that for this first method, no information is coming from the box “Instrument” (see
Figure 2).

A more refined method can be used:

$$f_0 = \frac{1}{\sum_{i=1}^{N} W_i A_i} \sum_{i=1}^{N} \frac{W_i A_i f_i}{s_i\, i} \qquad (5)$$

where Wi are the weights (information coming from the box “Instrument”), and where the parameters si
describe the fact that for some instruments (e.g. string instruments) the harmonics are slightly
shifted in frequency. At the moment, the weights Wi are predefined. Some automatic methods have been
studied. They are based on the results of a rating experiment in which listeners compared original and
re-synthesized sound signals (see sections 3.3 and 4.3).

3. Pitch-tracker B (fusion before peak detection)

3.1 Architecture

For the alternative pitch-tracker the analysis is also composed of three stages. The first stage is
the same for both pitch-trackers. Secondly (section 3.2), portions of spectrum are extracted. Thirdly,
these portions are merged, and the peak tracking is performed. The weights Wi described in section 2.4
can be taken into account. They weight the amplitude of the extracted portions of spectrum. On the
other hand, a technique to automatically detect the bad notes has been implemented. Thus, the data
fusion stage is completed by the automatic detection of bad tracks (section 3.3).

Figure 4: The whole system when the fusion is performed before the peak detection.

3.2 Spectra extraction (phase 2)

In the second stage transposed portions of the spectrum are combined. These portions correspond
respectively to these frequency bands:

$$[\,f_0 - \Delta f,\; f_0 + \Delta f\,] \qquad (6)$$

for the first harmonic, and

$$[\,2f_0 - 2\Delta f,\; 2f_0 + 2\Delta f\,] \qquad (7)$$

for the second harmonic, etc. The bounds of each band correspond to the information given in the
score. The bandwidth increases with the number of the

considered harmonic. So, the width of the main lobe decreases with the number of the harmonic.

3.3 Data fusion with automatic detection of bad tracks (phase 3)

In the pitch-tracker discussed above weights were used (that had to be explicitly provided) to improve
the quality of the data fusion. In this pitch tracker we incorporate an automatic method to rate the
quality of the fn’s.

We analyze here the extracted portions of the spectra, frame by frame. We inspect three measures.

The first one, M1, is the ratio between the portion of energy around the maximum of the spectrum
($[\,f_{max} - \delta f,\; f_{max} + \delta f\,]$) and its whole energy. This portion is expected to be
large when the analysed signal is a pure sine and when the score information is relevant.

The second one, M2, is the rate of change in the position of this maximum for two successive frames.
When something disrupts the partial tracker, this rate is expected to be large.

The third one, M3, is the correlation between the spectrum around its maximum and its theoretical
shape if the analyzed signal were a pure sine with constant amplitude and frequency.

The final measure of bad tracks detection is thus:

$$M = a_1 M_1 + a_2 (1 - M_2) + a_3 M_3 \qquad (8)$$

The parameters a1, a2, a3 and δf have to be optimized. This variable M is used instead of the Wi.
Therefore, a value is obtained for each frame. This is not the case when the weights Wi are
considered, which are defined note by note.

4. Results

Firstly, the performance of the three harmonic trackers is discussed. Secondly, the two whole pitch
tracker systems are compared. Thirdly, the performance of the technique to automatically detect the
bad tracks is analyzed. And fourthly, the performance of the whole system, using the second
pitch-tracker, is shown.

4.1 Performance of the three harmonic trackers

4.1.1 Introduction

Four characteristics of the signal complicate the harmonic tracking. The first one is the vibrato
(frequency and amplitude); the second one is transitions; the third one is neighboring harmonics; and
the last one is the additive noise. The three methods do not behave in the same way at all. We show
these differences using a simulated signal.

Three tests on a simulated signal have been performed. The parameters for this signal are: fundamental
frequency of the first note f0(a) = 440 Hz, fundamental frequency of the second note f0(b) = 493 Hz,
transition moment ta = 0.71 s, transition speed tr = 0.003, magnitude of the vibrato Av = 30 Hz,
frequency of the vibrato fv = 5 Hz and phase of the vibrato Φv = 1.6 radian.

The transition is modeled as a hyperbolic tangent. Thus, the model used for the fundamental frequency
trajectory (without vibrato) is:

$$f_0(t) = f_0^{(a)} + \frac{f_0^{(b)} - f_0^{(a)}}{2}\left[\tanh\left(\frac{t - t_a}{t_r}\right) + 1\right] \qquad (9)$$

So, finally, the signal model in use is $s = \cos(\Phi_1 + \Phi^{(a)})$ with:

$$\Phi_1 = 2\pi\left\{ f_0^{(a)} t + c\,t + \frac{A_v}{f_v}\sin(2\pi f_v t + \Phi_v) + c\,t_r\left[\log_e\cosh\left(\frac{t - t_a}{t_r}\right) - \log_e\cosh\left(-\frac{t_a}{t_r}\right)\right]\right\} \qquad (10)$$

where t is the time (in seconds), $c = (f_0^{(b)} - f_0^{(a)})/2$, and $\Phi^{(a)}$ is the phase at
t = 0.

It can be noticed that, for these tests, the disruptive parameters 1 and 2 concern the time rate of
change of value and the range of value (see section 2.3). The length of the frames is constant. It has
been chosen equal to 7 ms, which is close to three periods (3 fe/f0s samples) for the smallest
fundamental frequency, f0(a).

4.1.2 Behavior on sine signal with a transition and vibrato

Figure 5 shows the f0 trajectories obtained for the whole sound. Figure 6 shows the f0 trajectories
during the transition. In both cases, four f0 trajectories are plotted: the ideal f0 trajectory, and
the f0 trajectories obtained using the FFT method, the AS method and the TK method. It can be seen
that the three harmonic trackers can follow the variation of the frequency well. However, the TK
method shows some artefacts during the transition.
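
The test trajectory of equation (9) and the 7 ms frame length quoted in section 4.1.1 can be reproduced
with a few lines. This is a minimal sketch under stated assumptions: the function names are
illustrative, the vibrato is added to the frequency as Av cos(2π fv t + Φv) (one plausible reading of
"magnitude of the vibrato"), and the 11 kHz sampling rate is borrowed from the example in section 1.1
since the rate used for the simulation is not stated.

```python
import math


def f0_model(t, f0_a=440.0, f0_b=493.0, t_a=0.71, t_r=0.003):
    """Equation (9): hyperbolic-tangent transition from f0_a to f0_b centred on t_a."""
    return f0_a + 0.5 * (f0_b - f0_a) * (math.tanh((t - t_a) / t_r) + 1.0)


def f0_with_vibrato(t, a_v=30.0, f_v=5.0, phi_v=1.6):
    """Transition plus sinusoidal vibrato (the cosine form is an assumption)."""
    return f0_model(t) + a_v * math.cos(2 * math.pi * f_v * t + phi_v)


def frame_length(f0_score, fs, m=3):
    """Frame length M * fe / f0s in samples, and its duration M / f0s in seconds."""
    return round(m * fs / f0_score), m / f0_score


start, middle, end = f0_model(0.0), f0_model(0.71), f0_model(1.0)
samples, seconds = frame_length(440.0, 11000.0)
```

Well before the transition the model sits on f0(a), at t = ta it is halfway between the two notes, and
well after it sits on f0(b). With M = 3 the frame for 440 Hz lasts just under 7 ms.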

Figure 5: Results obtained with the three methods (frame length 7 ms). [Plot: f0 (Hz) against time
(s); curves FFT, AS and TK.]

Figure 6: Close up of the transition shown in Figure 5. [Plot: fundamental frequency (Hz) against time
(s), 0.69 s to 0.72 s; curves FFT, AS and TK.]

Figure 7 shows the results obtained when the length of the frames is fixed to 25 ms. This is the value
commonly used by pitch trackers which do not use knowledge (see Brown and Puckette 1993). It can be
seen that, in the transition, the FFT method is less efficient when using a larger frame length.

Figure 7: Results obtained with the three methods (frame length 25 ms). [Plot: fundamental frequency
(Hz) against time (s), 0.69 s to 0.72 s; curves ideal, FFT, AS and TK.]

4.1.3 Behavior with non-pure sine signals

In this case, the simulated signal is equal to:

$$s = \cos(\Phi_1 + \Phi^{(1)}) + \sum_{i=2}^{4} a_i \cos(i\,\Phi_1 + \Phi^{(i)}) \qquad (11)$$

It means that the higher harmonics are not completely removed. Their amplitudes are indicated by the
parameters ai. The results are shown in Figure 8.

Figure 8: Behavior with non-pure sine signals. a2=a3=a4=0.001 (top), a2=a3=a4=0.003 (bottom). [Plots:
fundamental frequency (Hz) against time (s), 0.1 s to 0.3 s; curves FFT, AS and TK.]

It can be seen that when the other harmonics are not removed well, the behavior of the AS and TK
methods is disturbed.

4.1.4 Behavior on noisy signals

In this case, the simulated signal is equal to:

$$s = \cos(\Phi_1 + \Phi^{(1)}) + b \qquad (12)$$

where b is a normal noise, with mean equal to 0 and standard deviation equal to σ. The results are
shown in Figure 9.
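
The sensitivity of the TK method to additive noise can be checked numerically. The sketch below
implements equations (1) to (3) sample by sample and applies them to a stationary 440 Hz cosine, with
and without the normal noise of equation (12). It is an illustration under assumptions, not the
authors' implementation: the clipping of the arccos argument, the skipping of samples where the
operator is not positive, the 11 kHz sampling rate, the fixed random seed and all names are choices
made for this example.

```python
import math
import random


def tk_operator(y, m):
    """Teager-Kaiser operator P[y(m)] = y(m)^2 - y(m-1) * y(m+1) (equation 2)."""
    return y[m] ** 2 - y[m - 1] * y[m + 1]


def tk_track(x, fs):
    """Frequency (Hz) and amplitude estimates from equations (1) and (3)."""
    # d[n] = x(n) - x(n-1); d[0] is a placeholder that is never read.
    d = [0.0] + [x[n] - x[n - 1] for n in range(1, len(x))]
    freqs, amps = [], []
    for n in range(2, len(x) - 1):  # each estimate uses x(n-2) .. x(n+1): four samples
        p_x = tk_operator(x, n)
        if p_x <= 0.0:  # guard (assumption): noise can make the operator non-positive
            continue
        cos_f = 1.0 - tk_operator(d, n) / (2.0 * p_x)
        cos_f = max(-1.0, min(1.0, cos_f))  # keep arccos defined on noisy data
        if abs(cos_f) == 1.0:
            continue
        freqs.append(math.acos(cos_f) * fs / (2.0 * math.pi))
        amps.append(math.sqrt(p_x / (1.0 - cos_f ** 2)))
    return freqs, amps


fs, f_true = 11000.0, 440.0
clean = [math.cos(2 * math.pi * f_true * n / fs) for n in range(2000)]
rng = random.Random(0)
noisy = [s + rng.gauss(0.0, 3e-4) for s in clean]  # equation (12), sigma = 3e-4


def mean_abs_error(freqs):
    return sum(abs(f - f_true) for f in freqs) / len(freqs)


clean_freqs, clean_amps = tk_track(clean, fs)
err_clean = mean_abs_error(clean_freqs)
err_noisy = mean_abs_error(tk_track(noisy, fs)[0])
```

On the clean sine the estimates are exact up to rounding, while even this small noise level produces a
clearly larger frequency error, because the operator works on differences of neighbouring samples.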

Figure 9: Behavior on noisy signals. σ = 1e-5, σ = 3e-4. [Plots: fundamental frequency (Hz) against
time (s), 0.1 s to 0.3 s; curves FFT, AS and TK.]

Where there is noise, the AS and TK methods do not perform well.

4.1.5 Behavior of the FFT method

The goal is to show that the band-pass filtering is also necessary for the FFT based method.

Note that the speed of a given transition increases with the number of the harmonic. For instance, let
us consider two consecutive notes whose fundamental frequencies are respectively 440 Hz and 554.36 Hz,
and which are connected by a 50 millisecond transition. For the first harmonic, during these 50
milliseconds, the jump in frequency is about 114 Hz; for the fourth harmonic, it is 457 Hz. Figures 6
and 7 show that adapting the length of the frames to the frequency allows the FFT based method to
follow the frequency efficiently during the transitions.

These results are shown in Figures 10 and 11. The signal used is a simulated one. The model is
described in section 4.1.1. The amplitude of each harmonic is 1. The sound lasts 1 second. And we
have: ta = 0.5, tr = 0.012, Φv = 0.9 rad, f0(a) = 440 Hz and f0(b) = 554.36 Hz.

Figure 10: zoom of the f0 trajectory (true trajectory, trajectory obtained with the pitch-tracker A,
trajectory when the band-pass filtering is not performed). [Plot: frequency (Hz) against time (s);
curves Ideal, Filtering and No filtering.]

Figure 11: Relative difference between the true f0 trajectory and the trajectory obtained with the
pitch-tracker A and the trajectory when the band-pass filtering is not performed. [Plot: relative
difference (scale x 10^-3) against time (s); curves Filtering and No filtering.]

4.1.6 Behavior of the three harmonic trackers on true sound signal

The top panel of Figure 12 shows the f0 trajectories obtained for the first harmonic of the last note
of the cello. As expected, the trajectory obtained with the AS method is noisier than the result of
the FFT method, and the trajectory of the TK method even more so. This is due to the fact that the
analysed signal is not a pure sine (see the spectra shown in the bottom panel of Figure 12). After the
band-pass filtering, the amplitudes of the higher harmonics are respectively [7.0 8.6e-3 7.2e-3
1.2e-2]. The AS and TK methods need signals composed of a very dominant sine.
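
The arithmetic behind the transition example of section 4.1.5 is simple: the ith harmonic moves i
times as far as the fundamental in the same time. A small sketch (the function name is illustrative,
and the mean slope in Hz/s is an added convenience, not a quantity reported in the text):

```python
def harmonic_jumps(f0_from, f0_to, duration_s, n_harmonics=4):
    """Frequency jump (Hz) and mean slope (Hz/s) of each harmonic over a transition."""
    result = []
    for i in range(1, n_harmonics + 1):
        jump = i * (f0_to - f0_from)
        result.append((i, jump, jump / duration_s))
    return result


# Values from the text: 440 Hz to 554.36 Hz in 50 ms.
jumps = harmonic_jumps(440.0, 554.36, 0.050)
```

The first entry gives the jump of about 114 Hz for the first harmonic and the last entry the jump of
about 457 Hz for the fourth, which is why higher harmonics benefit most from short, score-adapted
frames.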
Figure 12. Top panel: f0 trajectories obtained with the three harmonic trackers. Cello (54.5 bpm, last
note, first harmonic). Bottom panel: spectra of a frame of the original signal [21.9s 21.92s] and of
the corresponding band-pass filtered signal. [Plots: top, frequency (Hz) against time (s), curves TK,
FFT and AS; bottom, amplitude against frequency (Hz), curves After filtering and Before filtering.]

4.1.7 Discussion

A very efficient band-pass filtering stage is absolutely necessary for the AS and TK methods. For
these two methods, the signal given to the harmonic trackers must be a pure sine with slowly varying
amplitude and frequency. The FFT method seems to be the best, as the use of knowledge allows us to
improve its performance. We therefore decided to use this method in our system.

4.2 Comparison of the two pitch trackers

4.2.1 Simulated signal

Here we give some results obtained for the evaluation of the two pitch tracker methods. The difference
between these two methods concerns mainly the data fusion stage.

We use a simulated signal, composed of a harmonic component and of a disruptive component.

For the harmonic component, the fundamental frequency is constant (it means that there is only one
note): f0 = 440 Hz; there is a vibrato: fv = 5 Hz, Av = 20 Hz; and the amplitude of each harmonic is
1/15.

The disruptive component is composed of an additional partial, which is close in frequency to the
third harmonic.

For the disruptive partial, the frequency is 3f0 + 150 Hz; the amplitude is 1.5/15 (notice that the
amplitude of the disruptive partial is higher than the amplitude of the third harmonic) and there is a
vibrato: fv = 4.9 Hz, Av = 29.4 Hz (it is different from the vibrato present on the harmonics).

Figure 13 shows the fundamental frequency trajectories for pitch-trackers A and B.

Here, clearly, the trajectory obtained using the alternative fusion method (i.e. pitch tracker B) is
closer to the true trajectory than the one obtained using the first fusion method (i.e. pitch tracker
A).

Figure 13: f0 trajectories obtained with the two pitch trackers on a simulated signal. [Plot:
frequency (Hz) against time (s); curves Score, Ideal, Pitch-Tracker A and Pitch-Tracker B.]

Next, we look at the method of combining spectra used in pitch-tracker B on the simulated signal. This
is illustrated in Figure 14. In this figure, the spectra of each harmonic of the simulated signal are
shown (indicated by S1, S2, S3 and S4).

It can be seen that the maximum of the third spectrum (labeled S3) does not occur in the same place
(≈ 30) as for the three other spectra (≈ 15).

There is also a curve labeled mean. It is the result of the fusion of the four previous spectra. It
can be seen that the position of the maximum of this mean spectrum (drawn in red) is around 15. So, it
is well positioned.

Figure 14: The spectra of the four harmonics in the simulated signal; and the combined spectra
(X-axis: ≈ frequency (not in Hz); Y-axis: linear amplitude)

The second pitch-tracker is more robust to mistakes on Wi than the first one. Figure 14 illustrates
that indeed, in the presence of a noisier harmonic, taking the average spectrum is a more reliable
method.
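
The difference between the two fusion orders can be illustrated on toy data shaped like Figure 14. In
the sketch below each transposed spectrum portion is a list of bins; three portions peak at bin 15 and
the third one also contains a stronger disruptive lobe near bin 29. Everything here is an assumption
made for illustration: the Gaussian lobe shape, the heights, the bin grid and the function names are
invented, and because the portions are already transposed the division by the harmonic number in
equation (4) is implicit.

```python
import math


def bump(center, height, n_bins=31, width=2.0):
    """Toy spectral lobe: a Gaussian of a given height centred on a bin."""
    return [height * math.exp(-((k - center) ** 2) / (2 * width ** 2))
            for k in range(n_bins)]


def add(a, b):
    """Bin-wise sum of two portions."""
    return [x + y for x, y in zip(a, b)]


def peak(spectrum):
    """Return (bin, amplitude) of the largest value."""
    k = max(range(len(spectrum)), key=lambda i: spectrum[i])
    return k, spectrum[k]


def fusion_after_peak(portions):
    """Tracker A style: one peak per harmonic, then the amplitude-weighted mean of (4)."""
    peaks = [peak(p) for p in portions]
    total = sum(a for _, a in peaks)
    return sum(k * a for k, a in peaks) / total


def fusion_before_peak(portions):
    """Tracker B style: average the transposed portions, then pick a single peak."""
    mean = [sum(col) / len(portions) for col in zip(*portions)]
    return peak(mean)[0]


portions = [
    bump(15, 0.13),                       # S1
    bump(15, 0.07),                       # S2
    add(bump(15, 0.10), bump(29, 0.19)),  # S3 with a louder disruptive partial
    bump(15, 0.02),                       # S4
]
estimate_a = fusion_after_peak(portions)
estimate_b = fusion_before_peak(portions)
```

With fusion after peak detection the wrong peak picked for S3 pulls the weighted mean away from bin 15,
whereas averaging the portions first leaves the common lobe dominant, so the single peak stays at bin
15. Setting the weight of the third harmonic to zero, as in equation (5), would repair the first
method, but only if the bad track is known in advance.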

4.2.2 Instrumental sound

We will now demonstrate the workings of the two pitch trackers using a realistic example: a cello note (MIDI pitch 57) for which a string resonance disrupts the first harmonic (see Figure 1, between 6 and 10 seconds). For this sound, we obtain the results shown in Figure 15. Pitch tracker B is more robust.

Figure 16 shows the f0 trajectories obtained with the two pitch trackers when the weights Wi are taken into account (W1=0). The results are very similar.

Figure 15: f0 trajectories obtained for the beginning of a note of cello, for which there is a disruptive sympathetic resonance; the weights are not taken into account. (Plotted curves: Score, Pitch-Tracker A, Pitch-Tracker B; vertical axis from 195 to 220, horizontal axis from 6.4 to 7.4.)

Figure 16: f0 trajectories obtained for the beginning of a note of cello, for which there is a disruptive sympathetic resonance; the weights are taken into account. (Plotted curves: Score, Pitch-Tracker A, Pitch-Tracker B; vertical axis from 195 to 220, horizontal axis from 6.4 to 7.4.)

4.3 Evaluation of the automatic detection method

We conducted an experiment to gain better insight into the quality and relevance of the processing by having participants listen to the filtered sounds and the resynthesized sounds obtained with pitch tracker A. The results of this experiment were used to improve and validate the automatic detection method used in pitch tracker B. For this we used the data set described in section 1.3, taking a single performance of each instrument at tempo 60 BPM. For each note in the selected fragment, participants judged the filtered signal (i.e. the first four harmonics) and the signal resynthesized with the resulting fn trajectories. First the original signal was presented, followed by four pairs of filtered and synthesized harmonics; for each pair, participants judged the similarity between the filtered and synthesized signals, and the consistency of the synthesized signal.

The goal of the similarity rating is to check the quality of the harmonic tracker. The filtered and the synthesized signals have to be similar (and they have to be harmonic). If they are different, it means that something went wrong in extracting the frequency trajectory (for example, because the signal is noisy, because of a resonance, etc.).

The goal of the consistency rating is to check whether we can use the frequency trajectory during the data fusion stage. We cannot use the frequency trajectory if the note is not consistent, that is to say if more than one note is perceived, for instance because of a jump in frequency when the tracker selects between two competing peaks.

Participants rated similarity on a three-point scale, with 0 indicating that the filtered and synthesized signals are different and 2 indicating that they are similar. They rated consistency on a two-point scale, with 0 indicating that the signal is not consistent, and 1 that it was perceived as one note. The ratings were combined into a final measure r = (r1/2)·r2, with r1 the similarity rating and r2 the consistency rating.

Seven subjects participated in this experiment. The mean was computed over these seven subjects and compared to the results obtained with the automatic method described in section 3.2.

For the cello, the correlations between the mean over the subjects and each subject are: [0.87 0.86 0.75 0.79 0.85 0.66 0.71]. The mean of these correlations is 0.78.
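
The rating analysis described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used for the paper: the rating values and the automatic scores are hypothetical, the function names are ours, and we assume a Pearson correlation, which the text does not specify. Only the combination rule r = (r1/2)·r2 and the seven reported cello correlations come from the text.

```python
from math import sqrt
from statistics import mean


def combined_rating(r1, r2):
    """Combine similarity r1 (0, 1 or 2) and consistency r2 (0 or 1) into r."""
    return (r1 / 2) * r2


def pearson(x, y):
    """Pearson correlation (assumed; the text only says 'correlation')."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings: one row per subject, one column per note.
similarity = [[2, 1, 0, 2, 2],
              [2, 2, 0, 1, 2],
              [1, 1, 0, 2, 2]]
consistency = [[1, 1, 0, 1, 1],
               [1, 1, 1, 1, 1],
               [1, 0, 0, 1, 1]]

# Combined measure r for every subject and note.
r = [[combined_rating(s, c) for s, c in zip(s_row, c_row)]
     for s_row, c_row in zip(similarity, consistency)]

# Mean rating per note over the subjects.
mean_r = [mean(column) for column in zip(*r)]

# Agreement of each subject with the mean over the subjects.
per_subject_corr = [pearson(mean_r, row) for row in r]

# Hypothetical per-note scores from an automatic detection method.
automatic = [0.9, 0.4, 0.1, 0.7, 1.0]
auto_corr = pearson(mean_r, automatic)

# The seven per-subject correlations reported for the cello.
reported_cello = [0.87, 0.86, 0.75, 0.79, 0.85, 0.66, 0.71]
reported_mean = round(mean(reported_cello), 2)

print(per_subject_corr, auto_corr, reported_mean)
```

With this rule, a note judged as not consistent (r2 = 0) always receives r = 0, whatever its similarity rating, which matches the requirement that inconsistent trajectories cannot be used.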
The correlation between the mean over the subjects and the results obtained with the automatic method is 0.5.

4.4 Example of the full system

Finally, we show an example of the full system in operation, using the preferred method, pitch tracker B. For this we return to the example presented in the introduction (see Figure 1). In Figure 17, the spectrogram, the score information and the obtained frequency trajectories are plotted. In Figure 17, only the first harmonic is shown. In Figure 1, the first four harmonics are shown. The dotted score lines indicate that the corresponding harmonic is not taken into account. The corresponding portions of the frequency trajectories are plotted in white. The amplitude trajectory information is also taken into account. When the amplitude is too small, the corresponding portions of the frequency trajectories are not shown (see for instance the end of each harmonic, after 24 seconds).

As an example, for the long note at MIDI pitch 57, we can see in Figure 1 that there is a resonance at the beginning of this note. The harmonic tracker therefore fails for this part of the signal, as can be seen. After the data fusion stage, however, the f0 trajectory shown in Figure 17 is obtained. If we compare this trajectory to the frequency trajectory obtained for the first harmonic (Figure 1), the results are clearly improved.

Figure 17: f0 trajectory obtained for the cello (54.5 bpm); fusion before peak detection

5 Conclusion and prospects

In this paper, two efficient f0 trackers, which use knowledge, have been presented and compared. Our future goal will be to provide models of the vibrato useful for music synthesis and composition. The first step of the analysis was to obtain “good” f0 trajectories. The obtained f0 trajectories can be used to analyse vibrato and portamento.

While our primary motivation for developing this knowledge-based method is to obtain precise f0 information from the experimental data set, the idea of using knowledge in f0 tracking can be useful for other computer music systems as well, for instance when f0 needs to be tracked in a live situation where score and timing information is available. The methods described in this paper can in principle be used for an efficient f0 tracker that considers only those parts of the audio signal of the singer or instrumentalist to be followed that are relevant for f0 tracking.

A measure of the voicing coefficient is also obtained. It allows us to detect, for instance, silences and noisy parts (where the noise component predominates over the harmonic component), but also to check the quality of our processing.

An interesting extension of our pitch trackers would be to use them to analyze polyphonic sounds. For example, when two harmonics, coming from two different instruments or voices, are too close, the corresponding trajectories (see Figure 2) or spectrum (see Figure 4) would not be used during the fusion stage.

6 Acknowledgements

This research is supported by the MOSART (Music Orchestration Systems in Algorithmic Research and Technology) European project, and by the Netherlands Organization for Scientific Research (NWO).

References

Boashash, Boualem, “Estimating and interpreting the instantaneous frequency of a signal”, Proceedings of the IEEE, Volume 80, no. 4, pp. 539-568, 1992, April

Brown, Judith C. and Puckette, Miller S., “A high resolution fundamental frequency determination based on phase changes of the Fourier transform”, Journal of the Acoustical Society of America, Volume 94, no. 2, pp. 662-667, 1993

Brown, Judith C., “Musical fundamental frequency tracking using a pattern recognition method”, Journal of the Acoustical Society of America, Volume 92, no. 3, pp. 1394-1402, 1992

Desain, P., Honing, H., Aarts, R. and Timmers, R., “Rhythmic Aspects of Vibrato”, in P. Desain and W. L. Windsor, eds., Rhythm Perception and Production, Swets & Zeitlinger, 2000, pp. 203-216

Desain, P. and Honing, H., “Modeling Continuous Aspects of Music Performance: Vibrato and Portamento”, Proceedings of the International Music Perception and Cognition Conference, B. Pennycook & E. Costa-Giomi, CD-ROM, 1996

Hermes, D., “Pitch analysis”, in M. Cooke and S. Beet, eds., Visual Representation of Speech Signals, New York, John Wiley and Sons, 1992

Hess, Wolfgang, “Pitch determination of speech signals”, Springer-Verlag, 1983

Lane, J., “Pitch detection using a tunable IIR filter”, Computer Music Journal, Volume 14, no. 3, pp. 46-59

Maragos, Petros and Kaiser, James K., “Energy separation in signal modulations with application to speech analysis”, IEEE Transactions on Signal Processing, Volume 41, no. 10, pp. 3024-3050, 1993, October

Moorer, J. A., “On the segmentation and analysis of continuous musical sound”, Ph.D. thesis, Stanford University, Department of Music, 1975

Prame, Eric, “Measurements of the vibrato rate of ten singers”, Journal of the Acoustical Society of America, pp. 1979-1984, 1994, October

Prame, Eric, “Vibrato extent and intonation in professional Western lyric singing”, Journal of the Acoustical Society of America, pp. 616-621, 1997, July

Roads, Curtis, “The computer music tutorial”, The MIT Press, Cambridge, Massachusetts, London, England, 1996

Rossignol, Stéphane, Desain, Peter and Honing, Henkjan, “Refined knowledge-based f0 tracking: Comparing three frequency extraction methods”, International Computer Music Conference, 2001, September

Schafer, R., and Rabiner, L., “System for automatic formant analysis of voiced speech”, Journal of the Acoustical Society of America, Volume 47, no. 2, pp. 634-644, 1970

Scheirer, Eric, “Using musical knowledge to extract expressive performance information from audio recordings”, IJCAI - Workshop on Computational Auditory Scene Analysis, 1995, August

Timmers, Renee and Desain, Peter, “Vibrato: the questions and answers from musicians and science”, Proceedings of the International Conference on Music Perception and Cognition, Keele University, Department of Psychology, CD-ROM, 2000

Vakman, David, “On the AS, the TK energy algorithm and other methods for defining amplitude and frequency”, IEEE Transactions on Signal Processing, Volume 44, no. 4, pp. 791-797, 1996, April

Wang, Avery Li-Chun, “Instantaneous and frequency-warped signal processing techniques for auditory source separation”, Ph.D. thesis, Stanford University, 1994, August