Kyoto University
Title: Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression
Author(s): Yoshii, K.; Goto, M.; Okuno, H. G.
Citation: IEEE Transactions on Audio, Speech, and Language Processing (2007), 15(1): 333-345
Issue Date: 2007-01
URL: http://hdl.handle.net/2433/50283
Right: (c) 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Type: Journal Article
Textversion: publisher

Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With Harmonic Structure Suppression
Kazuyoshi Yoshii, Student Member, IEEE, Masataka Goto, and Hiroshi G. Okuno, Senior Member, IEEE

Abstract—This paper describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals in polyphonic audio signals of popular songs. Our system is based on a template-matching method that uses power spectrograms of drum sounds as templates. This method calculates the distance between a template and each spectrogram segment extracted from a song spectrogram, using Goto's distance measure originally designed to detect the onsets in drums-only signals. However, there are two main problems. The first problem is that appropriate templates are unknown for each song. The second problem is that it is more difficult to detect drum-sound onsets in sound mixtures including various sounds other than drum sounds. To solve these problems, we propose template-adaptation and harmonic-structure-suppression methods. First of all, an initial template of each drum sound, called a seed template, is prepared. The former method adapts it to actual drum-sound spectrograms appearing in the song spectrogram. To make our system robust to the overlapping of harmonic sounds with drum sounds, the latter method suppresses harmonic components in the song spectrogram before the adaptation and matching. Experimental results with 70 popular songs showed that our template-adaptation and harmonic-structure-suppression methods improved the recognition accuracy and achieved 83%, 58%, and 46% in detecting onsets of the bass drum, snare drum, and hi-hat cymbals, respectively.

Index Terms—Drum sound recognition, harmonic structure suppression, polyphonic audio signal, spectrogram template, template adaptation, template matching.

I. INTRODUCTION

THE importance of music content analysis for musical audio signals has been increasing in the field of music information retrieval (MIR). MIR aims at retrieving musical pieces by executing a query about not only text information such as artist names and music titles but also musical contents such as rhythms and melodies. Although the amount of digitally recorded music available over the Internet is rapidly increasing, there are only a few ways of using text information to efficiently find our desired musical pieces in a huge music database. Music content analysis enables MIR systems to automatically understand the contents of musical pieces and to deal with them even if they do not have metadata about the artists and titles.

As the first step toward achieving content-based MIR systems, we focus on detecting onset times of individual musical instruments. In this paper, we call this process recognition, which means simultaneous processing of both onset detection and identification of each sound. Although the onset time information of each musical instrument is low-level musical content, the recognition results can be used as a basis for higher-level music content analysis concerning the rhythm, melody, and chord, such as beat tracking, melody detection, and chord change detection.

In this paper, we propose a system for recognizing drum sounds in polyphonic audio signals sampled from commercial compact-disc (CD) recordings of popular music. We allow various music styles for popular music, such as rock, dance, house, hip-hop, eurobeat, soul, R&B, and folk. Our system detects onset times of three drum instruments—bass drum, snare drum, and hi-hat cymbals—while identifying them. For a large class of popular music with drum sounds, these three instruments play important roles as the rhythmic backbone of music. We believe that accurate onset detection of drum sounds is useful for describing temporal musical contents such as rhythm, tempo, beat, and measure. Previous studies [1]-[4] on describing those temporal contents, however, have focused on the periodicity of time-frame-based acoustic features, and have not tried to detect accurate onset times of drum sounds. Previous studies [5], [6] on genre classification did not consider onset times of drum sounds, although such onset times could be used for improving classification performance by identifying drum patterns unique to musical genres. Some recent studies [7], [8] reported the use of drum patterns for genre classification, although Ellis et al. [7] dealt only with MIDI signals. The results of our system are useful for such genre classification with higher-level content analysis of real-world audio signals.

The rest of this paper is organized as follows. In Section II, we describe the current state of drum sound recognition techniques. In Section III, we examine the problems and solutions in recognizing drum sounds contained in commercial CD recordings. Sections IV and V describe the proposed solutions: the template-adaptation and template-matching methods, respectively. Section VI describes a harmonic-structure-suppression method to improve the performance of our system. Section VII shows experimental results of evaluating these methods. Finally, Section VIII summarizes this paper.

Manuscript received February 1, 2005; revised December 19, 2005. This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Grant-in-Aid for Scientific Research (A) 15200015, and in part by the COE Program of MEXT, Japan. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Michael Davies.

K. Yoshii and H. G. Okuno are with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan (e-mail: [email protected]; [email protected]).

M. Goto is with the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8568, Japan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TASL.2006.876754

II. ART OF DRUM SOUND RECOGNITION

We start by describing the current state of the art of drum sound recognition and the related work motivating our approach.

A. Current State

Although there are many studies on onset detection or identification of drum sounds, only a few of them have dealt with drum sound recognition for polyphonic audio signals such as commercial CD recordings. The drum sound recognition method by Goto and Muraoka [9] was the earliest work that could deal with drum-sound mixtures of solo performances with MIDI rock-drums. Herrera et al. [10] compared conventional feature-based classifiers in experiments on identifying monophonic drum sounds. To recognize drum sounds in drums-only audio signals, various modeling methods such as N-grams [11], probabilistic models [12], and SVMs [13] have been used. By using a noise-space-projection method, Gillet and Richard [14] tried to recognize drum sounds in polyphonic audio signals. These studies, however, cannot fully deal with both the variation of drum-sound features and their distortion caused by the overlapping of other sounds.

The detection of bass and snare drum sounds in polyphonic CD recordings was mentioned in Goto's study on beat tracking [15]. Since it roughly detected them to estimate a hierarchical beat structure, accurate drum detection was not investigated. Gouyon et al. [16] proposed a method that classifies mixed sounds extracted from polyphonic audio signals into the two categories of bass and snare drums. As the first step of the classification, they proposed a percussive onset detection method. It was based on a unique idea of template adaptation that can deal with drum-sound variations across musical pieces. Zils et al. [17] tried the extraction and resynthesis of drum tracks from commercial CD recordings by extending Gouyon's method, and showed promising results.

To recognize drum sounds in audio signals of drum tracks, sound source separation methods have received attention. They make various assumptions in decomposing a single music spectrogram into multiple spectrograms of musical instruments: independent subspace analysis (ISA) [18], [19] assumes the statistical independence of sources, non-negative matrix factorization (NMF) [20] assumes their non-negativity, and sparse coding combined with NMF [21] assumes their non-negativity and sparseness. Further developments were made by FitzGerald et al. [22], [23]. They proposed prior subspace analysis (PSA) [22], which assumes prior frequency characteristics of drum sounds, and applied it to recognize drum sounds in the presence of harmonic sounds [23]. For the same purpose, Dittmar and Uhle [24] adopted non-negative independent component analysis (ICA), which considers the non-negativity of sources. In these studies, the recognition results depend not only on the separation quality but also on the reliability of estimating the number of sources and classifying them. However, the estimation and classification methods are not robust enough for recognizing drum sounds in audio signals containing various sounds that vary over time and frequency.

Klapuri [25] reported a method of detecting the onsets of all sounds in polyphonic audio signals. Herrera et al. [26] used Klapuri's algorithm to estimate the amount of percussive onsets. However, drum sound identification was not evaluated. To identify drum sounds extracted from polyphonic audio signals, Sandvold et al. [27] proposed a method that adapts feature models to those of the drum sounds used in each musical piece, but they used correct instrument labels for the adaptation.

B. Related Work

We explain two related methods in detail.

1) Drum Sound Recognition for Solo Drum Performances: Goto and Muraoka [9] reported a template-matching method for recognizing drum sounds contained in musical audio signals of popular-music solo drum performances by a MIDI tone generator. Their method was designed in the time-frequency domain. First, a fixed-time-length power spectrogram of each drum to be recognized is prepared as a spectrogram template. There were nine templates corresponding to nine drum instruments (bass and snare drums, toms, and cymbals) in a drum set. Next, onset times are detected by comparing each template with the power spectrogram of the input audio signal, assuming that the input signal is a polyphonic sound mixture of those templates. For the template-matching stage, they proposed a distance measure (we call this "Goto's distance measure" in this paper) which is robust to the spectral overlapping of the drum sound corresponding to the target template with other drum sounds.

Although their method achieved high recognition accuracy, it has the limitation that the power spectrogram of each drum used in the input audio signal must be registered with the recognition system. In addition, it has difficulty recognizing drum sounds included in polyphonic music because it does not assume the spectral overlapping of harmonic sounds.

2) Drum Sound Resynthesis From CD Recordings: Zils et al. [17] reported a template-adaptation method for recognizing bass and snare drum sounds in polyphonic audio signals sampled from popular-music CD recordings. Their method is defined in the time domain. First, a fixed-time-length signal of each drum is prepared as a waveform template, which is different from an actual drum signal used in a target musical piece. Next, by calculating the correlation between each template and the musical audio signal, onset times at which the correlation is large are detected. Finally, a drum sound is created (i.e., the signal template is updated) by averaging the fixed-time-length signals starting from those detected onset times. These operations are repeated until the template converges.

Although their time-domain analysis seems to be promising, it has limitations in dealing with drum sounds overlapped by other musical instrument sounds.

III. DRUM SOUND RECOGNITION PROBLEM FOR POLYPHONIC AUDIO SIGNALS

First, we define the task of our drum sound recognition system. Next, we describe the problems and solutions in recognizing drum sounds in polyphonic audio signals.

A. Target

The purpose of our research is to detect the onset times of three kinds of drum instruments in a drum set: bass drum, snare drum, and hi-hat cymbals. Our system takes polyphonic musical audio

signals as input, which are sampled from popular-music CD recordings and contain sounds of vocal parts and various musical instruments (e.g., piano, trumpet, and guitar) as well as drum sounds. Drum sounds are performed by real drum sets (e.g., popular/rock drums) or electronic instruments (e.g., MIDI tone generators). Assuming that the main target is popular rock-style music, we focus on the basic playing style of drum performances using normal sticks, and do not deal with special playing styles (e.g., head-mute and brush).

B. Problems

In this paper, we develop a template-based recognition system that defines a template as a fixed-time-length power spectrogram of each drum: bass drum, snare drum, or hi-hat cymbals. There are the following two problems, considering the discussion in Section II-B.

1) Individual Difference Problem: Acoustic features of drum sounds vary among musical pieces, and the appropriate templates for recognizing drum sounds in each piece are usually unknown in advance.

2) Mixed Sound Problem: It is difficult to accurately detect drum sounds included in polyphonic audio signals because acoustic features are distorted by the overlapping of other musical instrument sounds.

C. Approach

We propose an advanced template-adaptation method to solve the individual difference problem described in Section III-B. After performing the template adaptation, we detect the onset times of drum sounds using an advanced template-matching method. In addition, in order to solve the mixed sound problem, we propose a harmonic-structure-suppression method that improves the robustness of our adaptation and matching methods. Fig. 1 shows an overview of our proposed drum sound recognition system.

Fig. 1. Overview of the drum sound recognition system: a drum-sound spectrogram template (input) is adapted to actual drum-sound spectrograms appearing in the song spectrogram (input) in which the harmonic structure is suppressed. The adapted template is compared with the song spectrogram to detect onsets (output).

1) Template Adaptation: The purpose of this adaptation is to obtain a spectrogram template that is adapted to its corresponding drum sound used in the polyphonic audio signal of a target musical piece. Before the adaptation, we prepare individual spectral templates (which we call seed templates) for the bass drum, snare drum, and hi-hat cymbals; three templates in total. To adapt the seed templates to the actual drum sounds, we extended Zils' method to the time-frequency domain.

2) Template Matching: The purpose is to detect all the onset times of drum sounds in the polyphonic audio signal of the target piece, even if other musical instrument sounds overlap the drum sounds. By using Goto's distance measure, which considers the spectral overlapping, we compare the adapted template with the spectrogram of the audio signal. We present an improved spectral weighting algorithm based on Goto's algorithm for use in calculating the matching distance.

3) Harmonic Structure Suppression: The purpose is to suppress harmonic components of other instrument sounds in the audio signal when recognizing sounds of the bass and snare drums. In the recognition of hi-hat cymbal sounds, this processing is not performed, under the assumption that harmonic components are weak enough in the high-frequency band.

We use two different distance measures in the template adaptation and matching stages. In the adaptation stage, it is desirable to detect only semi-pure drum sounds that have little overlap with other sounds. Those drum sounds tend to result in a good adapted template that includes few spectral components of other sounds. Because it is not necessary to detect all the onset times of a target drum instrument, the distance measure used in this stage does not care about the spectral overlapping of other sounds. In the matching stage, on the other hand, we use Goto's distance measure because it is necessary to exhaustively detect all the onset times even if the target drum sounds are overlapped by other sounds.

The recognition of bass drum, snare drum, and hi-hat cymbal sounds is performed separately. In the following sections, the term "drum" means one of these three drum instruments.

IV. TEMPLATE ADAPTATION

A drum sound template is a power spectrogram in the time-frequency domain. Our template-adaptation method uses a single initial template, called a "seed template," for each kind of drum instrument. To recognize the sounds of the bass drum, snare drum, and hi-hat cymbals, for example, we require just three seed templates, each of which is individually adapted by the method.

Our method is based on an iterative adaptation algorithm. An overview of the method is shown in Fig. 2. First, the Onset-Candidate-Detection stage roughly detects onset candidates in the input audio signal of a musical piece. Starting from each onset candidate, a spectrogram segment whose time-length is fixed is extracted from the power spectrogram of the input audio signal. Then, by using the seed template and all the spectrogram segments, the iterative algorithm successively applies two stages—Segment Selection and Template Updating—to obtain the adapted template; a minimal code sketch of this loop follows the two stages below.

1) The Segment-Selection stage estimates the reliability that each spectrogram segment includes the drum sound spectrogram. The spectrogram segments with high reliabilities are then selected: this selection is based on a fixed ratio to the number of all the spectrogram segments.

2) The Template-Updating stage then reconstructs an updated template by estimating the power that is defined, at each frame and each frequency, as the median power among the selected spectrogram segments. The template is thus adapted to the current piece and used for the next adaptive iteration.
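The loop below is a minimal NumPy sketch of this two-stage iteration, not the authors' implementation: the `distance` argument stands for the stage-appropriate measure defined in Section IV-C, a fixed iteration count stands in for the paper's convergence behavior, and all names are ours.

```python
import numpy as np

def euclidean_distance(template, segment):
    # Placeholder for the stage-appropriate measure of Section IV-C
    # (smoothed in the first iteration, plain Euclidean afterwards).
    return float(np.sqrt(np.sum((template - segment) ** 2)))

def adapt_template(seed, segments, ratio=0.1, n_iter=5,
                   distance=euclidean_distance):
    """Iterative template adaptation (Section IV), as a rough sketch.

    seed:     seed template, array of shape (frames, bins)
    segments: list of spectrogram segments, each shaped like `seed`
    ratio:    fraction of segments kept per iteration (0.1 in the paper)
    """
    template = seed
    n_select = max(1, int(ratio * len(segments)))
    for _ in range(n_iter):
        # Segment Selection: reliability (6) is 1/distance, so the most
        # reliable segments are those with the smallest distances.
        dists = np.array([distance(template, seg) for seg in segments])
        chosen = np.argsort(dists)[:n_select]
        # Template Updating: median power at each frame and frequency bin.
        template = np.median(np.stack([segments[i] for i in chosen]), axis=0)
    return template
```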

Fig. 2. Overview of the template-adaptation method: each template is represented as a fixed-time-length power spectrogram in the time-frequency domain. This method adapts a single seed template corresponding to each drum instrument to the actual drum sounds appearing in a target musical piece. The method is based on an iterative adaptation algorithm, which successively applies two stages—Segment Selection and Template Updating—to obtain the adapted template.

A. Onset Candidate Detection

To reduce the computational cost of the template matching, the Onset-Candidate-Detection stage detects possible onset times of drum sounds as candidates: the template matching is performed only at these onset candidates. For the purpose of detecting onset times, Klapuri's method [25] is often used, but we adopted a simple peak-picking method [9] to detect onset candidate times. The reason is that it is important to minimize the detection failure (miss) of actual drum-sound onsets; a high recall rate is preferred even if there are many false alarms. Note that each detected onset candidate does not necessarily correspond to an actual drum-sound onset. The template-matching method judges whether each onset candidate is an actual drum-sound onset.

The time at which the power takes a local maximum value is detected as an onset candidate. Let $P(t,f)$ denote the power at frame $t$ and frequency bin $f$, and $D(t,f)$ be its time differential. At every frame (441 points, i.e., every 10 ms), $P(t,f)$ is calculated by applying the short-time Fourier transform (STFT) with a Hanning window (4096 points) to the signal sampled at 44.1 kHz. In this paper, we use the log scale [dB] as the power unit. The onset candidate times are then detected as follows:

1) If the power is rising, i.e., $P(t-1,f) < P(t,f)$ is satisfied for three consecutive frames, $D(t,f)$ is defined as

$$D(t,f) = P(t,f) - P(t-1,f) \qquad (1)$$

Otherwise, $D(t,f) = 0$.

2) At every frame $t$, the weighted summation $\Psi(t)$ of $D(t,f)$ is calculated by

$$\Psi(t) = \sum_{f} F(f)\, D(t,f) \qquad (2)$$

where $F(f) \in \{F_{\mathrm{BD}}(f), F_{\mathrm{SD}}(f), F_{\mathrm{HH}}(f)\}$ is a lowpass or highpass filter function, as shown in Fig. 3. We assume that it represents the typical frequency characteristics of bass drum sounds (BD), snare drum sounds (SD), or hi-hat cymbal sounds (HH).

Fig. 3. Lowpass filter functions $F_{\mathrm{BD}}$ and $F_{\mathrm{SD}}$, which represent the typical frequency characteristics of bass and snare drum sounds, and highpass filter function $F_{\mathrm{HH}}$, which represents that of hi-hat cymbal sounds.

3) Each onset candidate time is given by the time found by peak-picking in $\Psi(t)$. $\Psi(t)$ is smoothed by Savitzky and Golay's smoothing method [28] before its peak times are calculated.
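As an illustration of Section IV-A, the following sketch computes a rough version of (1) and (2) with SciPy's Savitzky-Golay filter and peak picker; the three-frame rising condition is our reconstruction of (1), and the smoothing window is an assumed value.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def onset_candidates(P, F, smooth_window=5):
    """Rough sketch of Section IV-A.

    P: log-power spectrogram in dB, shape (frames, bins), 10-ms frame hop
    F: filter function over frequency bins (F_BD, F_SD, or F_HH)
    """
    diff = np.diff(P, axis=0, prepend=P[:1])      # frame-to-frame increase
    rising = diff > 0
    sustained = rising.copy()                     # rising over 3 frames
    sustained[2:] &= rising[1:-1] & rising[:-2]
    D = np.where(sustained, diff, 0.0)            # eq. (1), reconstructed
    psi = D @ F                                   # eq. (2)
    psi = savgol_filter(psi, smooth_window, 2)    # Savitzky-Golay [28]
    peaks, _ = find_peaks(psi)                    # local maxima of psi
    return peaks                                  # candidate frame indices
```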

B. Preparing Seed Templates and Spectrogram Segments

1) Seed Template Construction: Seed template $T_s$ (the subscript $s$ means seed) is an average power spectrogram prepared for each drum type to be recognized. The time-length (in frames) of seed template $T_s$ is fixed. $T_s$ is represented as a time-frequency matrix whose element is denoted as $T_s(t,f)$ ($t$ in frames, $f$ in frequency bins).

To create seed template $T_s$, it is necessary to prepare multiple drum sounds, each of which contains a solo tone of the drum sound. We used drum-sound samples taken from "RWC Music Database: Musical Instrument Sound" (RWC-MDB-I-2001). They were performed in a normal style on six different real drum sets. By applying the onset candidate detection method, an onset time in each sample is detected. Starting from each time, a power spectrogram whose size is the same as that of the seed template is calculated by executing the STFT. Therefore, multiple power spectrograms of monophonic drum sounds are obtained, each of which is denoted as $S_i$ ($i = 1, \dots, K$), where $K$ is the number of the extracted power spectrograms (the number of the prepared drum sounds).

Because there are timbre variations of drum sounds, we used multiple drum-sound spectrograms in constructing seed template $T_s$. Therefore, in this paper, seed template $T_s$ is calculated by collecting the maximum power of the power spectrograms at each frame and each frequency bin:

$$T_s(t,f) = \max_{i} S_i(t,f) \qquad (3)$$

In the iterative adaptation algorithm, let $T_k$ denote the template being adapted after the $k$th iteration. Because $T_s$ is the first template, $T_0$ is set to $T_s$. We also obtain the power spectrogram $T_k^F$ weighted by filter function $F$:

$$T_k^F(t,f) = F(f)\, T_k(t,f) \qquad (4)$$
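A compact sketch of (3) and (4), assuming the onset-aligned sample spectrograms are already stacked into one array; the function name is ours.

```python
import numpy as np

def build_seed_template(samples, F):
    """Eqs. (3)-(4): samples is an array (K, frames, bins) of onset-aligned
    monophonic drum-sound spectrograms; F is the filter function over bins."""
    T_seed = samples.max(axis=0)       # (3): bin-wise maximum over samples
    T_seed_F = T_seed * F[None, :]     # (4): weighting by the filter
    return T_seed, T_seed_F
```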

2) Spectrogram Segment Extraction: The $i$th spectrogram segment $X_i$ is a power spectrogram computed via the STFT starting from an onset candidate time [ms] in the audio signal of a target musical piece, where $i = 1, \dots, N$ and $N$ is the number of the onset candidates. The size of each spectrogram segment is the same as that of seed template $T_s$, and thus it is also represented as a time-frequency matrix. We also obtain the power spectrogram $X_i^F$ weighted by filter function $F$:

$$X_i^F(t,f) = F(f)\, X_i(t,f) \qquad (5)$$

Fig. 4. Spectral smoothing at a lower time-frequency resolution in the Segment-Selection stage in bass and snare drum sound recognition: this inhibits the undesirable increase of the distance between the seed template and a spectrogram segment which includes a drum sound spectrogram.
C. Segment Selection

The reliability $r_i$ that spectrogram segment $X_i$ includes the spectral components of the target drum sound is estimated, and then spectrogram segments are selected in descending order with respect to the reliabilities $r_1, \dots, r_N$. The ratio of the number of the selected segments to the number of all the spectrogram segments (the number of the onset candidates: $N$) is fixed. In this paper, the ratio is empirically set to 0.1 (i.e., the number of the selected segments is $0.1N$).

We define the reliability as the reciprocal of the distance between template $T_k$ and spectrogram segment $X_i$:

$$r_i = \frac{1}{D(T_k, X_i)} \qquad (6)$$

The distance measure used in calculating $D(T_k, X_i)$ is required to satisfy the following: if the reliability that spectrogram segment $X_i$ includes the drum sound spectrogram becomes large, the distance becomes small. We describe the individual distance measurement for each drum sound recognition.
1) In Recognition of Bass and Snare Drum Sounds: In the first adaptive iteration, typical spectral distance measures (e.g., the Euclidean distance measure) cannot be applied to calculate the distance because those measures inappropriately make the distance large even if spectrogram segment $X_i$ includes the target drum sound spectrogram. In general, the power spectrogram of bass or snare drum sounds has salient spectral peaks that depend on the kind of drum instrument. Because seed template $T_s$ has never been adapted, the spectral peak positions of $T_s$ are different from those of the target drum sound spectrogram, which makes the distance large. On the other hand, if spectral peaks of other musical instruments in a spectrogram segment happen to overlap the salient peaks of seed template $T_s$, the distance becomes small, which results in selecting inappropriate spectrogram segments.

To solve this problem, we perform spectral smoothing at a lower time-frequency resolution for seed template $T_s$ and each spectrogram segment $X_i$. In this paper, the time resolution is 2 [frames] and the frequency resolution is 5 [bins] in the spectral smoothing, as shown in Fig. 4. This processing allows for differences in the spectral peak positions between seed template $T_s$ and each spectrogram segment $X_i$ and inhibits the undesirable increase of the distance when a spectrogram segment includes the drum sound spectrogram.

Let $\bar{T}_s^F$ and $\bar{X}_i^F$ denote the smoothed seed template and a smoothed spectrogram segment. $\bar{T}_s^F(t,f)$ in a time-frequency range is calculated by

$$\bar{T}_s^F(t,f) = \frac{1}{2 \times 5} \sum_{(t',f') \in \Omega(t,f)} T_s^F(t',f') \qquad (7)$$

where $\Omega(t,f)$ is the rectangular sector containing $(t,f)$. $\bar{X}_i^F(t,f)$ is calculated in the same way. This operation means the averaging and reallocation of the power, as shown in Fig. 4. First, the time-frequency domain is separated into rectangular sectors. The size of each sector is 2 [frames] $\times$ 5 [bins]. Next, the average power in each sector is calculated and then reallocated to each bin in that sector.

The spectral distance between seed template $T_s$ and spectrogram segment $X_i$ in the first iteration is defined as

$$D(T_s, X_i) = \sqrt{\sum_{t,f} \bigl( \bar{T}_s^F(t,f) - \bar{X}_i^F(t,f) \bigr)^2} \qquad (8)$$

After the first iteration, we can use the Euclidean distance measure without the spectral smoothing because the spectral peak positions of template $T_k$ have been adapted to those of the drum sound used in the audio signal. The spectral distance between template $T_k$ and spectrogram segment $X_i$ in the $k$th adaptive iteration is defined as

$$D(T_k, X_i) = \sqrt{\sum_{t,f} \bigl( T_k^F(t,f) - X_i^F(t,f) \bigr)^2} \qquad (9)$$

To focus on the precise characteristic peak positions of the drum sound used in the musical performance, we do not use the spectral smoothing in (9). Because those positions are useful for selecting appropriate spectrogram segments, it is desirable that (9) reflects the differences of the spectral peak positions between the template and a spectrogram segment in the distance.

2) In Recognition of Hi-Hat Cymbal Sounds: The spectral distance in any adaptive iteration is always calculated after the spectral smoothing of template $T_k$ and spectrogram segment $X_i$. In this paper, the time resolution is 2 [frames] and the frequency resolution is 20 [bins] in this spectral smoothing. A smoothed template $\bar{T}_k^F$ and a smoothed spectrogram segment $\bar{X}_i^F$ are obtained in the same way as the smoothed spectrograms of bass and snare drum sounds. Using these spectrograms, the spectral distance between template $T_k$ and spectrogram segment $X_i$ is defined as

$$D(T_k, X_i) = \sqrt{\sum_{t,f} \bigl( \bar{T}_k^F(t,f) - \bar{X}_i^F(t,f) \bigr)^2} \qquad (10)$$

In general, the power spectrogram of hi-hat cymbal sounds does not seem to have salient spectral peaks such as those of bass and snare drum sounds. We think it is more appropriate to focus on the shape of the spectral envelope than on the fine spectral structure. To ignore the large variation of the local spectral components in a small time-frequency range and to extract the spectral envelope, the spectral smoothing is necessary.
Fig. 5. Updating the template by collecting the median power at each frame and each frequency bin among the selected spectrogram segments: harmonic components are suppressed in the updated template.

D. Template Updating

An updated template is constructed by collecting the median power at each frame and each frequency bin among all the selected spectrogram segments. The updated template is used as the template in the next adaptive iteration. We describe the updating algorithms for the template of each drum sound.

1) In Recognition of Bass and Snare Drum Sounds: The updated template $T_{k+1}^F$, which is weighted by filter function $F$, is obtained by

$$T_{k+1}^F(t,f) = \operatorname*{median}_{j=1,\dots,J}\, X_{s_j}^F(t,f) \qquad (11)$$

where $X_{s_1}, \dots, X_{s_J}$ are the spectrogram segments selected in the Segment-Selection stage. $J$ is the number of the selected spectrogram segments, which is $0.1N$ in this paper.

We pick out the median power at each frame and each frequency bin because we can suppress spectral components that do not belong to the target drum sound spectrogram (Fig. 5). A spectral structure of the target drum sound spectrogram (e.g., salient spectral peaks) can be expected to appear with the same spectral shape in most selected spectrogram segments. On the other hand, spectral components of other musical instrument sounds appear at different frequencies among the spectrogram segments. In other words, the local power at the same frame and the same frequency in many spectrogram segments is exposed as the power of the pure drum sound spectrogram. By picking out the median of the local power, unnecessary spectral components of other musical instrument sounds become outliers and are not picked out. We can thus obtain a template which is close to the solo drum sound spectrogram even if various instrument sounds are included in the musical audio signal.

2) In Recognition of Hi-Hat Cymbal Sounds: The updated and smoothed template $\bar{T}_{k+1}^F$, which is weighted by filter function $F$, is obtained by

$$\bar{T}_{k+1}^F(t,f) = \operatorname*{median}_{j=1,\dots,J}\, \bar{X}_{s_j}^F(t,f) \qquad (12)$$

If the spectrogram segments are not smoothed, a stable median power cannot be obtained because the local power in the spectrogram of hi-hat cymbal sounds varies among onsets. By smoothing the spectrogram segments, the median power is determined as a stable value because the shape of the spectral envelope obtained by the spectral smoothing is stable in the spectrogram of hi-hat cymbal sounds.
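A toy check of this outlier argument, with made-up numbers: each synthetic segment carries the same drum spectrum plus one strong harmonic component at a randomly chosen bin, and the bin-wise median of (11) stays close to the drum spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
drum = np.full((10, 8), 40.0)                  # stable drum spectrum (dB)
segments = []
for _ in range(15):
    seg = drum + rng.normal(0.0, 1.0, drum.shape)
    seg[:, rng.integers(0, 8)] += 25.0         # a harmonic sound lands on a
    segments.append(seg)                       # different bin per segment

updated = np.median(np.stack(segments), axis=0)    # eq. (11)
print(np.abs(updated - drum).max())                # small: outliers ignored
```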
Fig. 6. Overview of the template-matching method: each spectrogram segment is compared with the adapted template by using Goto's distance measure to detect actual onset times. This distance measure can appropriately determine whether the adapted template is included in a spectrogram segment even if there are other simultaneous sounds.

V. TEMPLATE MATCHING

To find actual onset times, this method judges whether the drum sound actually occurs at each onset candidate time, as shown in Fig. 6. This alternative determination is difficult because various other sounds often overlap the drum sounds. If we use a general distance measure, the distance between the adapted template and a spectrogram segment including the target drum sound spectrogram becomes large when there are many other sounds performed simultaneously with the drum sound. In other words, the overlapping of the other instrument sounds makes the distance large even if the target drum sound spectrogram is included in a spectrogram segment.

Fig. 7. Power adjustment of spectrogram segments: if a spectrogram segment includes the drum sound spectrogram, the power adjustment value is large (top). Otherwise, the power adjustment value is small (bottom).

To solve this problem, we adopt the distance measure proposed by Goto et al. [9]. Because Goto's distance measure focuses on whether the adapted template is included in a spectrogram segment, it can calculate an appropriate distance even if the drum sound is overlapped by other musical instrument sounds. We present an improved method for selecting characteristic frequencies. In addition, we propose a thresholding method that automatically determines appropriate thresholds for each musical piece.

An overview of our method is shown in Fig. 6. First, the Weight-Function-Preparation stage generates a weight function which represents the spectral saliency of each spectral component in the adapted template. This function is used for selecting characteristic frequency bins in the template. Next, the Power-Adjustment stage calculates the power difference between the template and each spectrogram segment by focusing on the local power difference at each characteristic frequency bin (Fig. 7). If the power difference is larger than a threshold, it judges that the drum sound spectrogram does not appear in that segment, and does not execute the subsequent processing. Otherwise, the power of that segment is adjusted to compensate for the power difference. Finally, the Distance-Calculation stage calculates the distance between the adapted template and each adjusted spectrogram segment. If the distance is smaller than a threshold, it judges that the drum sound spectrogram is included.

In this section, we describe the template-matching algorithm for bass and snare drum sound recognition. In hi-hat cymbal sound recognition, the adapted template is obtained as a smoothed spectrogram. Therefore, the template-matching algorithm for hi-hat cymbal sound recognition is obtained by replacing each spectrogram with its smoothed version in each expression (e.g., $T_a$ with $\bar{T}_a$ and $X_i$ with $\bar{X}_i$).

A. Weight Function Preparation

A weight function $w(t,f)$ represents the spectral saliency at each frame $t$ and frequency bin $f$ in the adapted template $T_a$. The weight function is defined as

$$w(t,f) = T_a^F(t,f) \qquad (13)$$

where $T_a^F$ represents the adapted template $T_a$ weighted by filter function $F$.

Fig. 8. Examples of adapted templates of the bass drum (left), snare drum (center), and hi-hat cymbals (right): these spectrograms show that characteristic frequency bins differ among the three drum instruments.

B. Power Adjustment of Spectrogram Segments

The power of each spectrogram segment is adjusted to match that of the adapted template by assuming that the drum sound spectrogram is included in that spectrogram segment. This adjustment is necessary to correctly determine that the adapted template is included in a spectrogram segment even if the power of the drum sound spectrogram included in that spectrogram segment is smaller than that of the template. On the other hand, if the drum sound spectrogram is not actually included in a spectrogram segment, the power difference is expected to be large. Therefore, if the power difference is larger than a threshold, we determine that the drum sound spectrogram is not included in that spectrogram segment.

To calculate the power difference between each spectrogram segment $X_i$ and template $T_a$, we focus on the local power differences at the spectral characteristic frequency bins of $T_a$ in the time-frequency domain. The algorithm of the power adjustment is described as follows.

1) Selecting Characteristic Frequency Bins in Adapted Template: Let $f_{t,1}, \dots, f_{t,N_c}$ be the characteristic frequency bins in the adapted template, where $N_c$ is the number of characteristic frequency bins at each frame; $N_c$ is fixed empirically for each of the three drum types. Fig. 8 shows the differences of characteristic frequency bins among the three drum instruments. $f_{t,j}$ is determined at each frame $t$: it is selected as a frequency bin where $w(t,f)$ is the $j$th largest among the bins $f$ which satisfy the following conditions:

$$w(t,f) > w(t,f-1) \qquad (14)$$
$$w(t,f) > w(t,f+1) \qquad (15)$$
$$w(t,f) > \alpha \max_{f'} w(t,f') \qquad (16)$$

where $\alpha$ is a constant, which is set to 0.5 in this paper. These three conditions (14), (15), and (16) mean that $w(t,f)$ should be peaked along the frequency direction.

2) Calculating Power Difference: The local power difference $\Delta(t,j)$ at frame $t$ and characteristic frequency bin $f_{t,j}$ is calculated as

$$\Delta(t,j) = X_i^F(t, f_{t,j}) - T_a^F(t, f_{t,j}) \qquad (17)$$

The local-time power difference $\Delta(t)$ at frame $t$ is determined as the first quartile of $\Delta(t,1), \dots, \Delta(t,N_c)$:

$$\Delta(t) = \operatorname*{first\text{-}quartile}_{j}\, \Delta(t,j) \qquad (18)$$
$$j_t = \operatorname*{arg\,first\text{-}quartile}_{j}\, \Delta(t,j) \qquad (19)$$

where $j_t$ is the index $j$ at which $\Delta(t,j)$ is the first quartile. If the number of frames where $\Delta(t) > \theta_1$ is satisfied is larger than a threshold $N_1$, we determine that the template is not included in that spectrogram segment, where $\theta_1$ is a threshold automatically determined in Section V-D and $N_1$ is set to 5 [frames] in this paper.

We pick out not the minimum but the first quartile among the power differences $\Delta(t,1), \dots, \Delta(t,N_c)$ because the latter value is more robust to outliers included in them. The power difference at a characteristic frequency bin may become large when harmonic components of other musical instrument sounds accidentally exist at that frequency. Picking out the first quartile ignores such accidental large power differences and extracts the essential power difference derived from whether the template is included in a spectrogram segment or not.
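In code, (17)-(19) reduce to an indexed difference and a 25th-percentile reduction; the array layout below is an assumption of ours, not the paper's data structure.

```python
import numpy as np

def local_time_power_difference(X_F, T_F, char_bins):
    """Eqs. (17)-(19): power difference at the characteristic bins of each
    frame, summarized per frame by the first quartile.

    char_bins: integer array (frames, Nc) holding the bins f_{t,j}
    """
    t = np.arange(X_F.shape[0])[:, None]
    delta = X_F[t, char_bins] - T_F[t, char_bins]   # eq. (17)
    return np.percentile(delta, 25, axis=1)         # eq. (18): first quartile
```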
3) Adjusting Power of Spectrogram Segments: The total power difference $\Delta_i$ is calculated by integrating the local-time power differences $\Delta(t)$ which satisfy $\Delta(t) \le \theta_1$, weighted by the weight function:

$$\Delta_i = \frac{\sum_{t:\, \Delta(t) \le \theta_1} w(t, f_{t,j_t})\, \Delta(t)}{\sum_{t:\, \Delta(t) \le \theta_1} w(t, f_{t,j_t})} \qquad (20)$$

If $\Delta_i > \theta_2$ is satisfied, we are able to determine that the template is not included in that spectrogram segment, where $\theta_2$ is a threshold automatically determined in Section V-D.
Let $\hat{X}_i$ denote the adjusted spectrogram segment after the power adjustment, obtained by

$$\hat{X}_i^F(t,f) = X_i^F(t,f) - \Delta_i \qquad (21)$$

C. Distance Calculation

To calculate the distance between the adapted template $T_a$ and an adjusted spectrogram segment $\hat{X}_i$, we adopt Goto's distance measure [9]. It is useful for judging whether the adapted template is included in each spectrogram segment or not (the answer is "yes" or "no"). Goto's distance measure does not make the distance large even if the spectral components of the target drum sound are overlapped with those of other sounds. If $\hat{X}_i^F(t,f)$ is larger than $T_a^F(t,f)$, Goto's distance measure regards $\hat{X}_i^F(t,f)$ as a mixture of spectral components not only of the drum sound but also of other musical instrument sounds. In other words, when we identify that $\hat{X}_i^F(t,f)$ includes $T_a^F(t,f)$, the local distance at frame $t$ and frequency bin $f$ is minimized. Therefore, the local distance measure is defined as

$$d(t,f) = \begin{cases} T_a^F(t,f) - \hat{X}_i^F(t,f) + \eta & \text{if } T_a^F(t,f) - \hat{X}_i^F(t,f) + \eta > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$

where $d(t,f)$ is the local distance at frame $t$ and frequency bin $f$. The negative constant $\eta$ makes this distance measure robust to small variations of the local spectral components. If $\hat{X}_i^F(t,f)$ is larger than about $T_a^F(t,f) + \eta$, $d(t,f)$ becomes zero. In this paper, $\eta$ is a fixed negative value specified in [dB].

The total distance $D_i$ is calculated by integrating the local distance in the time-frequency domain, weighted by weight function $w$:

$$D_i = \sum_{t,f} w(t,f)\, d(t,f) \qquad (23)$$

To determine whether the targeted drum sound occurred at the time corresponding to spectrogram segment $X_i$, the distance $D_i$ is compared with a threshold $\theta_3$. If $D_i \le \theta_3$ is satisfied, we conclude that the targeted drum sound occurred. $\theta_3$ is also automatically determined in Section V-D.

D. Automatic Thresholding

To determine the thresholds such as $\theta_1$, $\theta_2$, and $\theta_3$ (12 in total) that are optimized for each musical piece, we use the threshold selection method proposed by Otsu [29]. It is better to dynamically change the thresholds to yield the best recognition results for each piece.

By using Otsu's method, we determine each optimized threshold, which classifies the corresponding set of values (e.g., the power differences or the distances over all onset candidates) into two classes: one class contains the values which are less than the threshold, and the other contains the rest of the values. We define the threshold as the one which maximizes the between-class variance (i.e., minimizes the within-class variance).

Finally, to balance the recall rate with the precision rate (these rates are defined in Section VII-A), we adjust the thresholds $\theta_2$ and $\theta_3$ determined by Otsu's method:

$$\theta_2 \leftarrow \gamma_2\, \theta_2, \qquad \theta_3 \leftarrow \gamma_3\, \theta_3 \qquad (24)$$

where $\gamma_2$ and $\gamma_3$ are empirically determined scaling (balancing) factors, which are described in Section VII-B.
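To make the matching decision concrete, here is a minimal sketch of the local distance (22), the weighted total distance (23), and an Otsu-style threshold over a set of values; the margin value and histogram resolution are illustrative assumptions, not the paper's settings.

```python
import numpy as np

ETA = -10.0  # negative margin in dB (illustrative value)

def goto_distance(template_f, segment_f, weight):
    """Eqs. (22)-(23) for one adjusted segment.

    template_f, segment_f: filter-weighted log-power spectrograms
    weight: spectral-saliency weight function w(t, f) of (13)
    """
    # Local distance is zero wherever the segment power reaches the
    # template power minus the margin |ETA| (template "included" there).
    local = np.maximum(template_f - segment_f + ETA, 0.0)   # eq. (22)
    return float(np.sum(weight * local))                    # eq. (23)

def otsu_threshold(values, n_bins=64):
    """Threshold maximizing the between-class variance (Otsu [29])."""
    hist, edges = np.histogram(values, bins=n_bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for k in range(1, n_bins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:k] * centers[:k]).sum() / w0
        m1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t

# Decision for one segment: an onset is reported when its distance clears
# the piece-level threshold learned from all candidates, optionally scaled
# by the balancing factor of (24):
#   distances = [goto_distance(T_a, X, w) for X in adjusted_segments]
#   onset_i = distances[i] <= gamma3 * otsu_threshold(np.array(distances))
```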

VI. HARMONIC STRUCTURE SUPPRESSION

Our proposed method of suppressing harmonic components improves the robustness of the template-adaptation and template-matching methods against the spectral overlapping of harmonic instrument sounds. Real-world CD recordings usually include many harmonic instrument sounds. If the combined power of various harmonic components is much larger than that of the drum sound spectrogram in a spectrogram segment, it is often difficult to correctly detect the drum sound. Therefore, the recognition accuracy is expected to be improved by suppressing those unnecessary harmonic components.

To suppress harmonic components in a musical audio signal, we sequentially perform three operations for each spectrogram segment: estimating the F0 of the harmonic structure, verifying the harmonic components, and suppressing the harmonic components. These operations are enabled in bass and snare drum sound recognition. In hi-hat cymbal sound recognition, the harmonic-structure-suppression method is not necessary because the most influential harmonic components are expected to be suppressed by the highpass filter function $F_{\mathrm{HH}}$.

A. F0 Estimation of Harmonic Structure

The F0 is estimated at each frame by using a comb-filter-like spectral analysis [30], which is effective in roughly estimating predominant harmonic structures in polyphonic audio signals. The basic idea is to evaluate the reliability $r(t,f)$ that the frequency $f$ is the F0, at each frame $t$ and each frequency $f$. The reliability $r(t,f)$ is defined as the summation of the local amplitude weighted by a comb filter:

$$r(t,f) = \sum_{g} c_f(g)\, A(t,g) \qquad (25)$$

where the frequency unit of $f$ and $g$ is [cent],(1) and each increment of $g$ is 100 [cent] in the summation. $A(t,g)$ is the local amplitude at frame $t$ and frequency $g$ [cent] in a spectrogram segment. $c_f(g)$ denotes a comb-filter-like function which passes only the harmonic components which form the harmonic structure of the F0 $f$:

$$c_f(g) = \sum_{h=1}^{N_h} \beta^{\,h-1}\, G\bigl(g;\, f + 1200\log_2 h,\; \sigma\bigr) \qquad (26), (27)$$

where $N_h$ is the number of harmonic components considered and $\beta$ is an amplitude attenuation factor. The spectral spreading of each harmonic component is represented by $G(g; \mu, \sigma)$, a Gaussian distribution where $\mu$ is the mean and $\sigma$ [cent] is the standard deviation; $N_h$, $\beta$, and $\sigma$ are fixed in this paper.

Frequencies of the F0 are determined by finding the frequencies $f$ that satisfy the following condition:

$$r(t,f) \ge \beta_0 \max_{f'} r(t,f') \qquad (28)$$

where $\beta_0$ is a constant, which is set to 0.7 in this paper. The F0 is searched from 2000 [cent] (51.9 [Hz]) to 7000 [cent] (932 [Hz]) by shifting every 100 [cent].

(1) Frequency $f_{\mathrm{Hz}}$ in hertz is converted to frequency $f_{\mathrm{cent}}$ in cents by $f_{\mathrm{cent}} = 1200 \log_2 \bigl( f_{\mathrm{Hz}} / (440 \times 2^{3/12-5}) \bigr)$.

B. Harmonic Component Verification

It is necessary to verify that each harmonic component estimated in Section VI-A is actually derived from only harmonic instrument sounds. Suppressing all the estimated harmonic components without this verification is not appropriate because a characteristic frequency of drum sounds may be erroneously estimated as a harmonic frequency if the power of the drum sounds is much larger than that of the harmonic instrument sounds. In another case, a characteristic frequency of drum sounds may be accidentally equal to a harmonic frequency. The verification of each harmonic component prevents characteristic spectral components of drum sounds from being suppressed.

We focus on the general fact that spectral peaks of harmonic components are much more peaked than the characteristic spectral peaks of drum sounds. First, the spectral kurtosis at frame $t$ in the neighborhood of the $h$th harmonic component of the F0 (a fixed range in cents around it in our implementation) is calculated. Second, we determine that the $h$th harmonic component of the F0 at frame $t$ is actually derived from only harmonic instrument sounds if the kurtosis is larger than a threshold, which is set to 2.0 in this paper (cf. the kurtosis of the Gaussian distribution is 3.0).

C. Harmonic Component Suppression

We suppress the harmonic components that are identified as being actually derived from only harmonic instrument sounds. An overview is shown in Fig. 9. First, we find the two frequencies of the local minimum power adjacent to the spectral peak corresponding to each harmonic component at $f + 1200\log_2 h$ [cent]. Second, we linearly interpolate the power between them along the frequency axis while preserving the original phase.

Fig. 9. Suppressing the $h$th harmonic component of the F0 $F(t)$ by linearly interpolating between the minimum power on both sides of the spectral peak.
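A sketch of the comb-filter reliability (25) with the selection rule (28), plus the peak interpolation of Section VI-C; the parameter values are illustrative, and the 100-cent amplitude grid layout is our assumption.

```python
import numpy as np

def comb_reliability(A_t, f0_cents, n_harm=8, beta=0.7, sigma=20.0,
                     grid=100.0):
    """Reliability r(t, f0) of eq. (25) for one frame.

    A_t: amplitudes sampled on a 100-cent grid (index k <-> k*grid cents)
    n_harm, beta, sigma: harmonic count, attenuation, Gaussian spread
    """
    g = np.arange(len(A_t)) * grid
    comb = np.zeros_like(g)
    for h in range(1, n_harm + 1):
        mu = f0_cents + 1200.0 * np.log2(h)      # h-th harmonic position
        comb += beta ** (h - 1) * np.exp(-0.5 * ((g - mu) / sigma) ** 2)
    return float(np.sum(comb * A_t))

def estimate_f0s(A_t, rel_ratio=0.7):
    """Eq. (28): keep candidates whose reliability is at least rel_ratio
    of the frame maximum; search 2000-7000 cents in 100-cent steps."""
    cand = np.arange(2000.0, 7001.0, 100.0)
    r = np.array([comb_reliability(A_t, f0) for f0 in cand])
    return cand[r >= rel_ratio * r.max()]

def suppress_peak(power, k_peak):
    """Section VI-C: replace a verified harmonic peak at bin k_peak by a
    linear interpolation between the adjacent local power minima."""
    lo = k_peak
    while lo > 0 and power[lo - 1] < power[lo]:
        lo -= 1
    hi = k_peak
    while hi < len(power) - 1 and power[hi + 1] < power[hi]:
        hi += 1
    power[lo:hi + 1] = np.linspace(power[lo], power[hi], hi - lo + 1)
    return power
```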

VII. EXPERIMENTS AND RESULTS

We performed experiments on recognizing the bass drums, snare drums, and hi-hat cymbals in polyphonic audio signals.

A. Experimental Conditions

We tested our methods on seventy songs sampled from the popular music database "RWC Music Database: Popular Music" (RWC-MDB-P-2001) developed by Goto et al. [31]. Those songs contain sounds of vocals and various instruments, as songs in commercial CDs do. Seed templates were created from solo tones included in "RWC Music Database: Musical Instrument Sound" (RWC-MDB-I-2001) [32]: a seed template of each drum is created from multiple sound files, each of which contains a solo tone of the drum sound played in a normal style. All original data were sampled at 44.1 kHz with 16 bits, stereo. We converted them to monaural recordings.

We evaluated the experimental results by the recall rate, precision rate, and f-measure:

$$\text{recall rate} = \frac{\text{number of correctly detected onsets}}{\text{number of actual onsets}}$$
$$\text{precision rate} = \frac{\text{number of correctly detected onsets}}{\text{number of detected onsets}}$$
$$\text{f-measure} = \frac{2 \times \text{recall rate} \times \text{precision rate}}{\text{recall rate} + \text{precision rate}}$$

To prepare the actual onset times (correct answers), we extracted onset times (note-on events) of the bass drums, snare drums, and hi-hat cymbals from the standard MIDI files of the seventy songs, which are distributed with the music database, and aligned them to the corresponding audio signals by hand. The number of actual onsets of each drum sound included in the seventy songs is shown in Table I (about 100 000 onsets in total). If the difference between a detected onset time and an actual onset time was less than 25 [ms], we judged that the detected onset time was correct.

TABLE I. NUMBER OF ACTUAL ONSETS IN 70 MUSICAL PIECES

TABLE II. SETTING OF COMPARATIVE EXPERIMENTS

B. Experimental Results

To evaluate our three proposed methods: the template-matching method (M-method), the template-adaptation method (A-method), and the harmonic-structure-suppression method (S-method), we performed comparative experiments by enabling each method one by one: we tested the three procedures shown in Table II, the M-procedure, the AM-procedure, and the SAM-procedure. The SAM-procedure was not tested for recognizing hi-hat cymbal sounds because the S-method is enabled only for recognizing bass or snare drum sounds. The M-procedure used a seed template instead of the adapted template for the template matching. The balancing factors $\gamma_2$ and $\gamma_3$ were determined for each experiment as shown in Table III.

For convenience, we evaluated the three procedures by dividing the 70 musical pieces into three groups: groups I, II, and III. First, the 70 pieces were sorted in descending order with respect to the f-measure by the fully-enabled procedure (i.e., the SAM-procedure in bass and snare drum sound recognition, and the AM-procedure in hi-hat cymbal sound recognition). Second, the first 20 pieces were put in group I, the next 25 pieces in group II, and the remaining 25 pieces in group III.

The average recall and precision rates of onset candidate detection were 88%/22% (bass drum sound recognition), 77%/18% (snare drum sound recognition), and 87%/36% (hi-hat cymbal sound recognition). This means that the chance rates of onset detection by a coin-toss decision were 29%, 25%, and 39%, respectively. Table III shows the experimental results obtained by each procedure. Table IV shows the recognition error reduction rates, which represent the f-measure improvement obtained by enabling the A-method added to the M-procedure, and that obtained by enabling the S-method added to the AM-procedure. Table V shows a complete list of musical pieces sorted in descending order with respect to the f-measure of each drum instrument recognition. Fig. 10 shows f-measure curves along the sorted musical pieces in recognizing each drum instrument.

C. Discussion

The experimental results show the effectiveness of our methods. In general, the fully-enabled SAM-procedures yielded the best performance in bass and snare drum sound recognition. In these cases, the average f-measures were 82.924% and 58.288%, respectively. In hi-hat cymbal sound recognition by the AM-procedure, the average f-measure was 46.249%. In total, the f-measure averaged over the three drum instruments was about 62%. In our observation, the effectiveness of the A-method and the S-method was almost independent of specific playing styles. If harmonic sounds which mainly distribute in a low frequency band (e.g., spectral components of the bass line) are more dominant, the suppression method tends to be more effective. We discuss the results in detail in the following sections.

1) Bass Drum Sound Recognition: The f-measure in bass drum sound recognition (82.92% in total) was the highest among the results of recognizing the three drum instruments. Table IV showed that both the A-method and the S-method were very effective, especially in group I. It also showed that the S-method in recognizing bass drum sounds was more effective than in snare drum sound recognition. The S-method could suppress the undesirable harmonic components of the bass line, which has large power in a low frequency band.

2) Snare Drum Sound Recognition: In group I, the f-measure was drastically improved from 65.33% to 87.63% by enabling both the A-method and the S-method. Table IV showed that the A-method in recognizing snare drum sounds was less effective than the S-method.

In group II, on the other hand, the A-method was more effective than the S-method. These results suggest that the template adaptation began to work correctly after suppressing harmonic components in some pieces. In other words, the A-method and the S-method helped each other in improving the f-measure, and thus it is important to use both methods.

In group III, however, the f-measure was slightly degraded by enabling the A-method because the template adaptation failed in some pieces. In these pieces, the seed template was erroneously adapted to harmonic components. The S-method was not effective enough to recover from such erroneous adaptation. These facts suggest that the acoustic features of snare drum sounds in these pieces are too different from those of the seed template. To overcome these problems, we plan to incorporate multiple templates for each drum instrument.

3) Hi-Hat Cymbal Sound Recognition: The f-measure in hi-hat cymbal sound recognition (46.25% in total) was the lowest among the experimental results of recognizing the three drum instruments. The performance without the A-method and the S-method indicates that this is the most difficult task in our experiments. Unfortunately, the A-method was not effective enough for hi-hat cymbals, although it reduced some errors as shown in Table IV. This is because there are three major playing styles for hi-hat cymbals, closed, open, and half-open, and they are used in a mixed way in an actual musical piece. Since our method used just a single template, the template could not cover all the spectral variations of those playing styles and was not appropriately adapted to those sounds in the piece even by the A-method.

We plan to incorporate multiple templates, as discussed above, to deal with this difficulty, while another problem of identifying the playing styles of hi-hat cymbals will still remain an open question.

TABLE III. DRUM SOUND RECOGNITION RATES
Note: The 70 musical pieces were sorted in descending order with respect to the f-measure by the fully-enabled procedure (i.e., the SAM-procedure in bass and snare drum sound recognition, and the AM-procedure in hi-hat cymbal sound recognition). The first 20 pieces were put in group I, the next 25 in group II, and the last 25 in group III.

TABLE IV. RECOGNITION ERROR REDUCTION RATES
Note: The definition of groups I, II, and III is described in Table III. This table shows the recognition error reduction rates, which represent the f-measure improvement obtained by enabling the A-method added to the M-procedure, and that obtained by enabling the S-method added to the AM-procedure.

TABLE V. LIST OF MUSICAL PIECES SORTED IN DESCENDING ORDER WITH RESPECT TO f-MEASURE

Fig. 10. (a), (b): f-measure curves by three procedures in (a) bass drum sound recognition and (b) snare drum sound recognition along musical pieces sorted in descending order with respect to the f-measure by the SAM-procedure. (c): f-measure curves by two procedures in hi-hat cymbal sound recognition along musical pieces sorted in descending order with respect to the f-measure by the AM-procedure.

VIII. CONCLUSION

In this paper, we have presented a drum sound recognition system that can detect onset times of drum sounds and identify them. Our system used template-adaptation and template-matching methods to individually detect the onset times of three drum instruments: the bass drum, snare drum, and hi-hat cymbals. Since a drum-sound spectrogram prepared as a seed template is different from the one used in a musical piece, our template-adaptation method adapts the template to the piece. By using the adapted template, our template-matching method then detects their onset times even if the drum sounds are overlapped by other musical instrument sounds. In addition, to improve the performance of the adaptation and matching, we proposed a harmonic-structure-suppression method that suppresses harmonic components of other musical instrument sounds by using comb-filter-like spectral analysis.

To evaluate our system, we performed reliable experiments [16] F. Gouyon, F. Pachet, and O. Delerue, “On the use of zero-crossing
with popular-music CD recordings, which are the largest rate for an application of classification of percussive sounds,” in Proc.
COST-G6 Conf. Digital Audio Effects (DAFX), 2000.
experiments for drum sounds as far as we know. The exper- [17] A. Zils, F. Pachet, O. Delerue, and F. Gouyon, “Automatic extraction
imental results showed that both of the template-adaptation of drum tracks from polyphonic music signals,” in Proc. Int. Conf. Web
and harmonic-structure-suppression methods improved the Delivering of Music (WEDELMUSIC), 2002, pp. 179–183.
[18] D. FitzGerald, E. Coyle, and B. Lawlor, “Sub-band independent sub-
f-measure of recognizing each drum. The average f-measures space analysis for drum transcription,” in Proc. Int. Conf. Digital Audio
were 82.924%, 58.288%, and 46.249% in recognizing bass Effects (DAFX), 2002, pp. 65–69.
drum sounds, snare drum sounds, and hi-hat cymbal sounds, [19] C. Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from
polyphonic music using independent subspace analysis,” in Proc. Int.
respectively. Our system, called AdaMast [33], in which the Symp. Independent Component Analysis and Blind Signal Separation
harmonic-structure-suppression method was disabled won the (ICA), 2003, pp. 843–848.
first prize of Audio Drum Detection Contest in MIREX2005. [20] J. Paulus and A. Klapuri, “Drum transcription with non-negative spec-
trogram factorisation,” in Proc. Eur. Signal Process. Conf. (EUSIPCO),
We expect that these results could be used as a benchmark. 2005.
In the future, we plan to use multiple seed templates for each kind of drum to improve the coverage of the timbre variation of drum sounds; a study on the timbre variation of drum sounds [34] seems helpful for this purpose. Improving the template-matching method is also necessary to deal with the spectral variation among onsets. In addition, we will apply our system to rhythm-related content description for building a content-based MIR system.
REFERENCES

[1] E. Scheirer, “Tempo and beat analysis of acoustic musical signals,” J. Acoust. Soc. Am., vol. 103, no. 1, pp. 588–601, Jan. 1998.
[2] J. Paulus and A. Klapuri, “Measuring the similarity of rhythmic patterns,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2002, pp. 150–156.
[3] F. Gouyon and P. Herrera, “Determination of the meter of musical audio signals: seeking recurrences in beat segment descriptors,” in Proc. Audio Eng. Soc. (AES), 114th Conv., 2003.
[4] E. Pampalk, S. Dixon, and G. Widmer, “Exploring music collections by browsing different views,” Comput. Music J., vol. 28, no. 2, pp. 49–62, Summer 2004.
[5] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, Jul. 2002.
[6] S. Dixon, E. Pampalk, and G. Widmer, “Classification of dance music by periodicity patterns,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2003, pp. 159–165.
[7] D. Ellis and J. Arroyo, “Eigenrhythms: Drum pattern basis sets for classification and generation,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2004, pp. 554–559.
[8] C. Uhle and C. Dittmar, “Drum pattern based genre classification of popular music,” in Proc. Int. Conf. Audio Eng. Soc. (AES), 2004.
[9] M. Goto and Y. Muraoka, “A sound source separation system for percussion instruments,” IEICE Trans. D-II, vol. J77-D-II, no. 5, pp. 901–911, May 1994.
[10] P. Herrera, A. Yeterian, and F. Gouyon, “Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques,” in Proc. Int. Conf. Music and Artificial Intelligence (ICMAI), LNAI 2445, 2002, pp. 69–80.
[11] J. Paulus and A. Klapuri, “Conventional and periodic N-grams in the transcription of drum sequences,” in Proc. Int. Conf. Multimedia and Expo (ICME), 2003, pp. 737–740.
[12] ——, “Model-based event labeling in the transcription of percussive audio signals,” in Proc. Int. Conf. Digital Audio Effects (DAFX), 2003, pp. 73–77.
[13] O. Gillet and G. Richard, “Automatic transcription of drum loops,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, pp. 269–272.
[14] ——, “Drum track transcription of polyphonic music using noise subspace projection,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2005.
[15] M. Goto, “An audio-based real-time beat tracking system for music with or without drum-sounds,” J. New Music Res., vol. 30, no. 2, pp. 159–171, Jun. 2001.
[16] F. Gouyon, F. Pachet, and O. Delerue, “On the use of zero-crossing rate for an application of classification of percussive sounds,” in Proc. COST-G6 Conf. Digital Audio Effects (DAFX), 2000.
[17] A. Zils, F. Pachet, O. Delerue, and F. Gouyon, “Automatic extraction of drum tracks from polyphonic music signals,” in Proc. Int. Conf. Web Delivering of Music (WEDELMUSIC), 2002, pp. 179–183.
[18] D. FitzGerald, E. Coyle, and B. Lawlor, “Sub-band independent subspace analysis for drum transcription,” in Proc. Int. Conf. Digital Audio Effects (DAFX), 2002, pp. 65–69.
[19] C. Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from polyphonic music using independent subspace analysis,” in Proc. Int. Symp. Independent Component Analysis and Blind Signal Separation (ICA), 2003, pp. 843–848.
[20] J. Paulus and A. Klapuri, “Drum transcription with non-negative spectrogram factorisation,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2005.
[21] T. Virtanen, “Sound source separation using sparse coding with temporal continuity objective,” in Proc. Int. Computer Music Conf. (ICMC), 2003, pp. 231–234.
[22] D. FitzGerald, B. Lawlor, and E. Coyle, “Prior subspace analysis for drum transcription,” in Proc. Audio Eng. Soc. (AES), 114th Conv., 2003.
[23] ——, “Drum transcription in the presence of pitched instruments using prior subspace analysis,” in Proc. Irish Signals Syst. Conf. (ISSC), 2003, pp. 202–206.
[24] C. Dittmar and C. Uhle, “Further steps towards drum transcription of polyphonic music,” in Proc. Audio Eng. Soc. (AES), 116th Conv., 2004.
[25] A. Klapuri, “Sound onset detection by applying psychoacoustic knowledge,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1999, pp. 3089–3092.
[26] P. Herrera, V. Sandvold, and F. Gouyon, “Percussion-related semantic descriptors of music audio files,” in Proc. Int. Conf. Audio Eng. Soc. (AES), 2004.
[27] V. Sandvold, F. Gouyon, and P. Herrera, “Percussion classification in polyphonic audio recordings using localized sound models,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2004, pp. 537–540.
[28] A. Savitzky and M. J. E. Golay, “Smoothing and differentiation of data by simplified least squares procedures,” Anal. Chem., vol. 36, no. 8, pp. 1627–1639, Jul. 1964.
[29] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[30] M. Goto, K. Itou, and S. Hayamizu, “A real-time filled pause detection system for spontaneous speech recognition,” in Proc. Eurospeech, 1999, pp. 227–230.
[31] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: popular, classical, and jazz music databases,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2002, pp. 287–288.
[32] ——, “RWC music database: music genre database and musical instrument sound database,” in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2003, pp. 229–230.
[33] K. Yoshii, M. Goto, and H. Okuno, “AdaMast: a drum sound recognizer based on adaptation and matching of spectrogram templates,” in Proc. Music Information Retrieval Evaluation eXchange (MIREX), 2005.
[34] E. Pampalk, P. Hlavac, and P. Herrera, “Hierarchical organization and visualization of drum sample libraries,” in Proc. Int. Conf. Digital Audio Effects (DAFX), 2004, pp. 378–383.

Kazuyoshi Yoshii (S’05) received the B.S. and M.S. degrees from Kyoto University, Kyoto, Japan, in 2003 and 2005, respectively. He is currently pursuing the Ph.D. degree in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University.
His research interests include music scene analysis and human-machine interaction.
Mr. Yoshii is a member of the Information Processing Society of Japan (IPSJ) and the Institute of Electronics, Information, and Communication Engineers (IEICE). He is supported by the JSPS Research Fellowships for Young Scientists (DC1). He has received several awards, including the FIT2004 Paper Award and the Best in Class Award of MIREX2005.
Masataka Goto received the Doctor of Engineering degree in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 1998.
He then joined the Electrotechnical Laboratory (ETL; reorganized as the National Institute of Advanced Industrial Science and Technology (AIST) in 2001), where he has been a Senior Research Scientist since 2005. He served concurrently as a Researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST), from 2000 to 2003, and as an Associate Professor in the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, since 2005. His research interests include music information processing and spoken language processing.
Dr. Goto is a member of the Information Processing Society of Japan (IPSJ), Acoustical Society of Japan (ASJ), Japanese Society for Music Perception and Cognition (JSMPC), Institute of Electronics, Information, and Communication Engineers (IEICE), and International Speech Communication Association (ISCA). He has received 18 awards, including the IPSJ Best Paper Award and IPSJ Yamashita SIG Research Awards (special interest group on music and computer, and spoken language processing) from the IPSJ, the Awaya Prize for Outstanding Presentation and the Award for Outstanding Poster Presentation from the ASJ, the Award for Best Presentation from the JSMPC, the Best Paper Award for Young Researchers from the Kansai-Section Joint Convention of Institutes of Electrical Engineering, the WISS 2000 Best Paper Award and Best Presentation Award, and the Interaction 2003 Best Paper Award.

Hiroshi G. Okuno (SM’06) received the B.A. and Ph.D. degrees from the University of Tokyo, Tokyo, Japan, in 1972 and 1996, respectively.
He worked for Nippon Telegraph and Telephone, the Kitano Symbiotic Systems Project, and Tokyo University of Science. He is currently a Professor in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto, Japan. He was a Visiting Scholar at Stanford University, Stanford, CA, and a Visiting Associate Professor at the University of Tokyo. He has done research in programming languages, parallel processing, and reasoning mechanisms in AI, and is currently engaged in computational auditory scene analysis, music scene analysis, and robot audition. He edited (with D. Rosenthal) Computational Auditory Scene Analysis (Princeton, NJ: Lawrence Erlbaum, 1998) and (with T. Yuasa) Advanced Lisp Technology (London, U.K.: Taylor & Francis, 2002).
Dr. Okuno has received various awards, including the 1990 Best Paper Award of JSAI, the Best Paper Awards of IEA/AIE-2001 and 2005, and the IEEE/RSJ Nakamura Award for IROS-2001 Best Paper Nomination Finalist. He was also awarded the 2003 Funai Information Science Achievement Award. He is a member of the IPSJ, JSAI, JSSST, JSCS, RSJ, ACM, AAAI, ASA, and ISCA.