IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333
Abstract—This paper describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals in polyphonic audio signals of popular songs. Our system is based on a template-matching method that uses power spectrograms of drum sounds as templates. This method calculates the distance between a template and each spectrogram segment extracted from a song spectrogram, using Goto's distance measure originally designed to detect the onsets in drums-only signals. However, there are two main problems. The first problem is that appropriate templates are unknown for each song. The second problem is that it is more difficult to detect drum-sound onsets in sound mixtures including various sounds other than drum sounds. To solve these problems, we propose template-adaptation and harmonic-structure-suppression methods. First of all, an initial template of each drum sound, called a seed template, is prepared. The former method adapts it to actual drum-sound spectrograms appearing in the song spectrogram. To make our system robust to the overlapping of harmonic sounds with drum sounds, the latter method suppresses harmonic components in the song spectrogram before the adaptation and matching. Experimental results with 70 popular songs showed that our template-adaptation and harmonic-structure-suppression methods improved the recognition accuracy and achieved 83%, 58%, and 46% in detecting onsets of the bass drum, snare drum, and hi-hat cymbals, respectively.

Index Terms—Drum sound recognition, harmonic structure suppression, polyphonic audio signal, spectrogram template, template adaptation, template matching.

Manuscript received February 1, 2005; revised December 19, 2005. This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Grant-in-Aid for Scientific Research (A) 15200015 and by the COE Program of MEXT, Japan. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Michael Davies.

K. Yoshii and H. G. Okuno are with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan (e-mail: [email protected]; [email protected]).

M. Goto is with the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8568, Japan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TASL.2006.876754

I. INTRODUCTION

THE importance of music content analysis for musical audio signals has been increasing in the field of music information retrieval (MIR). MIR aims at retrieving musical pieces by executing a query about not only text information such as artist names and music titles but also musical contents such as rhythms and melodies. Although the amount of digitally recorded music available over the Internet is rapidly increasing, there are only a few ways of using text information to efficiently find our desired musical pieces in a huge music database. Music content analysis enables MIR systems to automatically understand the contents of musical pieces and to deal with them even if they do not have metadata about the artists and titles.

As the first step toward achieving content-based MIR systems in the future, we focus on detecting onset times of individual musical instruments. In this paper, we call this process recognition, which means simultaneous processing of both onset detection and identification of each sound. Although the onset time information of each musical instrument is low-level musical content, the recognition results can be used as a basis for higher-level music content analysis concerning the rhythm, melody, and chord, such as beat tracking, melody detection, and chord change detection.

In this paper, we propose a system for recognizing drum sounds in polyphonic audio signals sampled from commercial compact-disc (CD) recordings of popular music. We allow various music styles for popular music, such as rock, dance, house, hip-hop, eurobeat, soul, R&B, and folk. Our system detects onset times of three drum instruments—bass drum, snare drum, and hi-hat cymbals—while identifying them. For a large class of popular music with drum sounds, these three instruments play important roles as the rhythmic backbone of the music. We believe that accurate onset detection of drum sounds is useful for describing temporal musical contents such as rhythm, tempo, beat, and measure. Previous studies [1]–[4] on describing those temporal contents, however, have focused on the periodicity of time-frame-based acoustic features, and have not tried to detect accurate onset times of drum sounds. Previous studies [5], [6] on genre classification did not consider onset times of drum sounds, while such onset times could be used for improving classification performance by identifying drum patterns unique to musical genres. Some recent studies [7], [8] reported the use of drum patterns for genre classification, although Ellis et al. [7] dealt with only MIDI signals. The results of our system are useful for such genre classification with higher-level content analysis of real-world audio signals.

The rest of this paper is organized as follows. In Section II, we describe the current state of drum sound recognition techniques. In Section III, we examine the problems and solutions of recognizing drum sounds contained in commercial CD recordings. Sections IV and V describe the proposed solutions: template-adaptation and template-matching methods, respectively. Section VI describes a harmonic-structure-suppression method to improve the performance of our system. Section VII shows experimental results of evaluating these methods. Finally, Section VIII summarizes this paper.
1558-7916/$20.00 © 2006 IEEE
II. ART OF DRUM SOUND RECOGNITION

We begin by describing the current state of the art of drum sound recognition and the related work motivating our approach.

A. Current State

Although there are many studies on onset detection or identification of drum sounds, only a few of them have dealt with drum sound recognition for polyphonic audio signals such as commercial CD recordings. The drum sound recognition method by Goto and Muraoka [9] was the earliest work that could deal with drum-sound mixtures of solo performances with MIDI rock-drums. Herrera et al. [10] compared conventional feature-based classifiers in experiments of identifying monophonic drum sounds. To recognize drum sounds in drums-only audio signals, various modeling methods such as N-grams [11], probabilistic models [12], and SVM [13] have been used. By using a noise-space-projection method, Gillet and Richard [14] tried to recognize drum sounds in polyphonic audio signals. These studies, however, cannot fully deal with both the variation of drum-sound features and their distortion caused by the overlapping of other sounds.

The detection of bass and snare drum sounds in polyphonic CD recordings was mentioned in Goto's study on beat tracking [15]. Since it roughly detected them to estimate a hierarchical beat structure, accurate drum detection was not investigated. Gouyon et al. [16] proposed a method that classifies mixed sounds extracted from polyphonic audio signals into the two categories of bass and snare drums. As the former step of the classification, they proposed a percussive onset detection method. It was based on a unique idea of template adaptation that can deal with drum-sound variations according to musical pieces. Zils et al. [17] tried the extraction and resynthesis of drum tracks from commercial CD recordings by extending Gouyon's method, and showed promising results.

To recognize drum sounds in audio signals of drum tracks, sound source separation methods have received much attention. They make various assumptions in decomposing a single music spectrogram into multiple spectrograms of musical instruments: independent subspace analysis (ISA) [18], [19] assumes the statistical independence of sources, non-negative matrix factorization (NMF) [20] assumes their non-negativity, and sparse coding combined with NMF [21] assumes their non-negativity and sparseness. Further developments were made by FitzGerald et al. [22], [23]. They proposed PSA (Prior Subspace Analysis) [22], which assumes prior frequency characteristics of drum sounds, and applied it to recognize drum sounds in the presence of harmonic sounds [23]. For the same purpose, Dittmar and Uhle [24] adopted non-negative independent component analysis (ICA), which considers the non-negativity of sources. In these studies, the recognition results depend not only on the separation quality but also on the reliability of estimating the number of sources and classifying them. However, the estimation and classification methods are not robust enough for recognizing drum sounds in audio signals containing various time-frequency-varying sounds.

Klapuri [25] reported a method of detecting onsets of all sounds in polyphonic audio signals. Herrera et al. [26] used Klapuri's algorithm to estimate the amount of percussive onsets. However, drum sound identification was not evaluated. To identify drum sounds extracted from polyphonic audio signals, Sandvold et al. [27] proposed a method that adapts feature models to those of drum sounds used in each musical piece, but they used correct instrument labels for the adaptation.

B. Related Work

We explain two related methods in detail.

1) Drum Sound Recognition for Solo Drum Performances: Goto and Muraoka [9] reported a template-matching method for recognizing drum sounds contained in musical audio signals of popular-music solo drum performances by a MIDI tone generator. Their method was designed in the time-frequency domain. First, a fixed-time-length power spectrogram of each drum to be recognized is prepared as a spectrogram template. There were nine templates corresponding to nine drum instruments (bass and snare drums, toms, and cymbals) in a drum set. Next, onset times are detected by comparing the template with the power spectrogram of the input audio signal, assuming that the input signal is a polyphonic sound mixture of those templates. In the template-matching stage, they proposed a distance measure (we call this "Goto's distance measure" in this paper), which is robust to the spectral overlapping of a drum sound corresponding to the target template with other drum sounds.

Although their method achieved high recognition accuracy, it has the limitation that the power spectrogram of each drum used in the input audio signal must be registered with the recognition system. In addition, it has difficulty recognizing drum sounds included in polyphonic music because it does not assume the spectral overlapping of harmonic sounds.

2) Drum Sound Resynthesis From CD Recordings: Zils et al. [17] reported a template-adaptation method for recognizing bass and snare drum sounds in polyphonic audio signals sampled from popular-music CD recordings. Their method is defined in the time domain. First, a fixed-time-length signal of each drum is prepared as a waveform template, which is different from an actual drum signal used in a target musical piece. Next, by calculating the correlation between each template and the musical audio signal, onset times at which the correlation is large are detected. Finally, a drum sound is created (i.e., the signal template is updated) by averaging fixed-time-length signals starting from those detected onset times. These operations are repeated until the template converges.

Although their time-domain analysis seems to be promising, it has limitations in dealing with overlapping drum sounds in the presence of other musical instrument sounds.

III. DRUM SOUND RECOGNITION PROBLEM FOR POLYPHONIC AUDIO SIGNALS

First, we define the task of our drum sound recognition system. Next, we describe the problems and solutions in recognizing drum sounds in polyphonic audio signals.

A. Target

The purpose of our research is to detect onset times of three kinds of drum instruments in a drum set: bass drum, snare drum, and hi-hat cymbals. Our system takes polyphonic musical audio
D. Template Updating

An updated template is constructed by collecting the median power at each frame and each frequency bin among all the selected spectrogram segments. The updated template is used as the template in the next adaptive iteration. We describe the updating algorithms for the template of each drum sound.

1) In Recognition of Bass and Snare Drum Sounds: The updated template which is weighted by filter function is obtained by

(11)

where is when is the first quartile. If the number of frames where is satisfied is larger than a threshold , we determine that the template is not included in that spectrogram segment, where is a threshold automatically determined in Section V-D and is set to 5 [frames] in this paper.

We pick out not the minimum but the first quartile among the power differences because the latter value is more robust to outliers included in them. The power difference at a characteristic frequency bin may become large when harmonic components of other musical instrument sounds accidentally exist at that frequency. Picking out the first quartile ignores such accidental large power differences and extracts the essential power difference derived from whether or not the template is included in a spectrogram segment.

3) Adjusting Power of Spectrogram Segments: The total power difference is calculated by integrating the local-time power difference which satisfies , weighted by weight function

(20)

If is satisfied, we are able to determine that the template is not included in that spectrogram segment, where is a threshold automatically determined in Section V-D.

Let denote an adjusted spectrogram segment after the power adjustment, obtained by

(21)

C. Distance Calculation

To calculate the distance between the adapted template and an adjusted spectrogram segment , we adopt Goto's distance measure [9]. It is useful for judging whether or not the adapted template is included in each spectrogram segment (the answer is "yes" or "no"). Goto's distance measure does not make the distance large even if the spectral components of the target drum sound are overlapped with those of other sounds. If is larger than , Goto's distance measure regards as a mixture of spectral components not only of the drum sound but also of other musical instrument sounds. In other words, when we identify that includes , the local distance at frame and frequency bin is minimized. Therefore, the local distance measure is defined as

(22)
otherwise

where is the local distance at frame and frequency bin . The negative constant makes this distance measure robust to small variations of local spectral components. If is larger than about , becomes zero. In this paper, [dB], [dB].

The total distance is calculated by integrating the local distance in the time-frequency domain, weighted by weight function

(23)

To determine whether the targeted drum sound occurred at a time corresponding to the spectrogram segment , the distance is compared with a threshold . If is satisfied, we conclude that the targeted drum sound occurred. is also automatically determined in Section V-D.

D. Automatic Thresholding

To determine 12 thresholds ( , and ) that are optimized for each musical piece, we use the threshold selection method proposed by Otsu [29]. It is better to dynamically change the thresholds to yield the best recognition results for each piece.

By using Otsu's method, we determine each optimized threshold ( , or ), which classifies a set of values ( , or ) into two classes: one class contains the values which are less than the threshold, and the other contains the rest of the values. We define a threshold which maximizes the between-class variance (i.e., minimizes the within-class variance).

Finally, to balance the recall rate with the precision rate (these rates are defined in Section VII-A), we adjust the thresholds and which are determined by Otsu's method:

(24)

where and are empirically determined scaling (balancing) factors, which are described in Section VII-B.

VI. HARMONIC STRUCTURE SUPPRESSION

Our proposed method of suppressing harmonic components improves the robustness of the template-adaptation and template-matching methods to the spectral overlapping of harmonic instrument sounds. Real-world CD recordings usually include many harmonic instrument sounds. If the combined power of various harmonic components is much larger than that of the drum sound spectrogram in a spectrogram segment, it is often difficult to correctly detect the drum sound. Therefore, the recognition accuracy is expected to be improved by suppressing those unnecessary harmonic components.

To suppress harmonic components in a musical audio signal, we sequentially perform three operations for each spectrogram segment: estimating the F0 of the harmonic structure, verifying the harmonic components, and suppressing the harmonic components. These operations are enabled in bass and snare drum sound recognition. In hi-hat cymbal sound recognition, the harmonic-structure-suppression method is not necessary because most influential harmonic components are expected to be suppressed by the highpass filter function .
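As a concrete illustration of the matching and thresholding steps of Section V, the sketch below implements a Goto-style local distance (zero wherever the segment power covers the template power, so overlapping sounds do not inflate the distance) together with Otsu's threshold selection. The tolerance `eps`, the bin count, and all array shapes are illustrative assumptions; the paper's actual dB constants are not reproduced here.

```python
import numpy as np

def local_distance(template_db, segment_db, eps=-10.0):
    """Goto-style local distance (powers in dB). Where the segment
    power exceeds the template power by more than `eps` (a negative
    tolerance; illustrative value), the template is considered
    covered and the distance is zero; otherwise the shortfall
    template - segment is used as the distance."""
    diff = segment_db - template_db
    return np.where(diff > eps, 0.0, template_db - segment_db)

def total_distance(template_db, segment_db, weight, eps=-10.0):
    """Integrate the local distance over time and frequency,
    weighted by a weight function."""
    return float(np.sum(weight * local_distance(template_db, segment_db, eps)))

def otsu_threshold(values, n_bins=64):
    """Otsu's method: pick the threshold that maximizes the
    between-class variance of the value distribution."""
    hist, edges = np.histogram(values, bins=n_bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for k in range(1, n_bins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t
```

An onset would then be reported whenever the total distance falls below the Otsu-derived (and subsequently rescaled) threshold, since a small distance means the template is judged to be included in the segment.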
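To make the verification and suppression operations concrete, here is a minimal Python sketch: a spectral-kurtosis test (the paper's threshold of 2.0, versus 3.0 for a Gaussian shape) followed by linear interpolation between the local power minima adjacent to a harmonic peak. The window width and the one-dimensional test signal are illustrative assumptions; a real implementation would operate on the complex spectrogram and preserve the original phase.

```python
import numpy as np

def spectral_kurtosis(power, lo, hi):
    """Pearson kurtosis of the power values in bins [lo, hi); a
    Gaussian shape gives about 3.0, so sharp harmonic peaks score
    high while broad drum components score low."""
    x = power[lo:hi]
    s = x.std()
    if s == 0.0:
        return 0.0
    return float((((x - x.mean()) / s) ** 4).mean())

def suppress_peak(power, peak, threshold=2.0, half_window=8):
    """Verify a harmonic peak by its kurtosis and, if it passes,
    flatten it by linear interpolation between the adjacent local
    power minima (half_window is an illustrative choice)."""
    n = len(power)
    lo, hi = max(0, peak - half_window), min(n, peak + half_window + 1)
    if spectral_kurtosis(power, lo, hi) <= threshold:
        return power                       # looks drum-like: keep it
    # walk outwards to the nearest local minimum on each side
    left = peak
    while left > 0 and power[left - 1] < power[left]:
        left -= 1
    right = peak
    while right < n - 1 and power[right + 1] < power[right]:
        right += 1
    out = power.copy()
    out[left:right + 1] = np.linspace(power[left], power[right],
                                      right - left + 1)
    return out
```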
(25)

where the frequency unit of and is [cent],¹ and each increment of is 100 [cent] in the summation. is the local amplitude at frame and frequency [cent] in a spectrogram segment . denotes a comb-filter-like function which passes only the harmonic components that form the harmonic structure of the F0:

(26)

(27)

where is the number of harmonic components considered and is an amplitude attenuation factor. The spectral spreading of each harmonic component is represented by . is a Gaussian distribution, where is the mean and is the standard deviation. In this paper, , , [cent].

Frequencies of the F0 are determined by finding frequencies that satisfy the following condition:

(28)

where is a constant, which is set to 0.7 in this paper. The F0 is searched from 2000 [cent] (51.9 [Hz]) to 7000 [cent] (932 [Hz]) by shifting every 100 [cent].

B. Harmonic Component Verification

It is necessary to verify that each harmonic component estimated in Section VI-A is actually derived from only harmonic instrument sounds. Suppressing all the estimated harmonic components without this verification is not appropriate because a characteristic frequency of drum sounds may be erroneously estimated as a harmonic frequency if the power of the drum sounds is much larger than that of the harmonic instrument sounds. In another case, a characteristic frequency of drum sounds may be accidentally equal to a harmonic frequency. The verification of each harmonic component prevents characteristic spectral components of drum sounds from being suppressed.

We focus on the general fact that spectral peaks of harmonic components are much more peaked than characteristic spectral peaks of drum sounds. First, the spectral kurtosis at frame in the neighborhood of a th harmonic component of the F0 (from cent to cent in our implementation) is calculated. Second, we determine that the th harmonic component of the F0 at frame is actually derived from only harmonic instrument sounds if is larger than a threshold, which is set to 2.0 in this paper (cf. the kurtosis of the Gaussian distribution is 3.0).

C. Harmonic Component Suppression

We suppress the harmonic components that are identified as being actually derived from only harmonic instrument sounds. An overview is shown in Fig. 9. First, we find the two frequencies of the local minimum power adjacent to the spectral peak corresponding to each harmonic component at cent. Second, we linearly interpolate the power between them along the frequency axis while preserving the original phase.

VII. EXPERIMENTS AND RESULTS

We performed experiments of recognizing the bass drums, snare drums, and hi-hat cymbals in polyphonic audio signals.

A. Experimental Conditions

We tested our methods on seventy songs sampled from the popular music database "RWC Music Database: Popular Music" (RWC-MDB-P-2001) developed by Goto et al. [31]. Those songs contain sounds of vocals and various instruments, as songs in commercial CDs do. Seed templates were created from solo tones included in "RWC Music Database: Musical Instrument Sound" (RWC-MDB-I-2001) [32]: a seed template of each drum is created from multiple sound files, each of which contains a sole tone of the drum sound played in a normal style. All original data were sampled at 44.1 kHz with 16 bits, stereo. We converted them to monaural recordings.

We evaluated the experimental results by the recall rate, precision rate, and f-measure:

recall rate = (number of correctly detected onsets) / (number of actual onsets),

precision rate = (number of correctly detected onsets) / (number of detected onsets),

f-measure = (2 × recall rate × precision rate) / (recall rate + precision rate).
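The three rates above can be computed as follows; the onset-matching tolerance `tol` is an assumed parameter, since this excerpt does not state how detected onsets were matched against the ground-truth onsets.

```python
def evaluate_onsets(detected, actual, tol=0.05):
    """Recall rate, precision rate, and f-measure for onset detection.
    A detected onset within `tol` seconds of a not-yet-matched actual
    onset counts as correctly detected (tol is an assumed value)."""
    actual = sorted(actual)
    used = [False] * len(actual)
    correct = 0
    for d in sorted(detected):
        for i, a in enumerate(actual):
            if not used[i] and abs(d - a) <= tol:
                used[i] = True
                correct += 1
                break
    recall = correct / len(actual) if actual else 0.0
    precision = correct / len(detected) if detected else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision > 0 else 0.0)
    return recall, precision, f
```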
To prepare actual onset times (correct answers), we extracted onset times (note-on events) of the bass drums, snare drums,

¹Frequency f_Hz in hertz is converted to frequency f_cent in cents: f_cent = 1200 log2(f_Hz / (440 × 2^(3/12 − 5))).
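The footnote's cent conversion can be written directly; the reference frequency 440 × 2^(3/12 − 5) ≈ 16.35 Hz maps 2000 cent to roughly 51.9 Hz and 7000 cent to roughly 932 Hz, matching the F0 search range quoted above.

```python
import math

# 0-cent reference frequency from the footnote: 440 * 2^(3/12 - 5) ≈ 16.35 Hz
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)

def hz_to_cent(f_hz: float) -> float:
    """Convert a frequency in hertz to cents."""
    return 1200.0 * math.log2(f_hz / REF_HZ)

def cent_to_hz(f_cent: float) -> float:
    """Convert a frequency in cents back to hertz."""
    return REF_HZ * 2.0 ** (f_cent / 1200.0)
```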
TABLE I: NUMBER OF ACTUAL ONSETS IN 70 MUSICAL PIECES.

TABLE III: DRUM SOUND RECOGNITION RATES. Note: the 70 musical pieces were sorted in descending order with respect to the f-measure obtained by the fully-enabled procedure (i.e., the SAM-procedure in bass and snare drum sound recognition, and the AM-procedure in hi-hat cymbal sound recognition). The first 20 pieces were put in group I, the next 25 in group II, and the last 25 in group III.

TABLE IV: RECOGNITION ERROR REDUCTION RATES. Note: the definition of groups I, II, and III is described in Table III. This table shows the recognition error reduction rates, which represent the f-measure improvement obtained by enabling the A-method in addition to the M-procedure, and that obtained by enabling the S-method in addition to the AM-procedure.

TABLE V: LIST OF MUSICAL PIECES SORTED IN DESCENDING ORDER WITH RESPECT TO f-MEASURE.

Fig. 10. (a), (b): f-measure curves by three procedures in (a) bass drum sound recognition and (b) snare drum sound recognition along musical pieces sorted in descending order with respect to the f-measure by the SAM-procedure. (c): f-measure curves by two procedures in hi-hat cymbal sound recognition along musical pieces sorted in descending order with respect to the f-measure by the AM-procedure.

C. Discussion
as discussed above to deal with this difficulty, while another problem, identifying the playing styles of hi-hat cymbals, will still remain an open question.

VIII. CONCLUSION

In this paper, we have presented a drum sound recognition system that can detect onset times of drum sounds and identify them. Our system used template-adaptation and template-matching methods to individually detect the onset times of three drum instruments: the bass drum, snare drum, and hi-hat cymbals. Since a drum-sound spectrogram prepared as a seed template is different from the one used in a musical piece, our template-adaptation method adapts the template to the piece. By using the adapted template, our template-matching method then detects their onset times even if the drum sounds are overlapped by other musical instrument sounds. In addition, to improve the performance of the adaptation and matching, we proposed a harmonic-structure-suppression method that suppresses harmonic components of other musical instrument sounds by using comb-filter-like spectral analysis.
To evaluate our system, we performed reliable experiments with popular-music CD recordings, which are the largest experiments for drum sounds as far as we know. The experimental results showed that both the template-adaptation and harmonic-structure-suppression methods improved the f-measure of recognizing each drum. The average f-measures were 82.924%, 58.288%, and 46.249% in recognizing bass drum sounds, snare drum sounds, and hi-hat cymbal sounds, respectively. Our system, called AdaMast [33], in which the harmonic-structure-suppression method was disabled, won the first prize of the Audio Drum Detection Contest in MIREX 2005. We expect that these results could be used as a benchmark.

In the future, we plan to use multiple seed templates for each kind of drum to improve the coverage of the timbre variation of drum sounds. A study on the timbre variation of drum sounds [34] seems to be helpful. Improvement of the template-matching method is also necessary to deal with the spectral variation among onsets. In addition, we will apply our system to rhythm-related content description for building a content-based MIR system.

REFERENCES

[1] E. Scheirer, "Tempo and beat analysis of acoustic musical signals," J. Acoust. Soc. Am., vol. 103, no. 1, pp. 588–601, Jan. 1998.
[2] J. Paulus and A. Klapuri, "Measuring the similarity of rhythmic patterns," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2002, pp. 150–156.
[3] F. Gouyon and P. Herrera, "Determination of the meter of musical audio signals: seeking recurrences in beat segment descriptors," in Proc. Audio Eng. Soc. (AES) 114th Conv., 2003.
[4] E. Pampalk, S. Dixon, and G. Widmer, "Exploring music collections by browsing different views," Comput. Music J., vol. 28, no. 2, pp. 49–62, Summer 2004.
[5] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, Jul. 2002.
[6] S. Dixon, E. Pampalk, and G. Widmer, "Classification of dance music by periodicity patterns," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2003, pp. 159–165.
[7] D. Ellis and J. Arroyo, "Eigenrhythms: Drum pattern basis sets for classification and generation," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2004, pp. 554–559.
[8] C. Uhle and C. Dittmar, "Drum pattern based genre classification of popular music," in Proc. Int. Conf. Audio Eng. Soc. (AES), 2004.
[9] M. Goto and Y. Muraoka, "A sound source separation system for percussion instruments," IEICE Trans. D-II, vol. J77-D-II, no. 5, pp. 901–911, May 1994.
[10] P. Herrera, A. Yeterian, and F. Gouyon, "Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques," in Proc. Int. Conf. Music and Artificial Intelligence (ICMAI), LNAI 2445, 2002, pp. 69–80.
[11] J. Paulus and A. Klapuri, "Conventional and periodic N-grams in the transcription of drum sequences," in Proc. Int. Conf. Multimedia and Expo (ICME), 2003, pp. 737–740.
[12] ——, "Model-based event labeling in the transcription of percussive audio signals," in Proc. Int. Conf. Digital Audio Effects (DAFX), 2003, pp. 73–77.
[13] O. Gillet and G. Richard, "Automatic transcription of drum loops," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, pp. 269–272.
[14] ——, "Drum track transcription of polyphonic music using noise subspace projection," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2005.
[15] M. Goto, "An audio-based real-time beat tracking system for music with or without drum-sounds," J. New Music Res., vol. 30, no. 2, pp. 159–171, Jun. 2001.
[16] F. Gouyon, F. Pachet, and O. Delerue, "On the use of zero-crossing rate for an application of classification of percussive sounds," in Proc. COST-G6 Conf. Digital Audio Effects (DAFX), 2000.
[17] A. Zils, F. Pachet, O. Delerue, and F. Gouyon, "Automatic extraction of drum tracks from polyphonic music signals," in Proc. Int. Conf. Web Delivering of Music (WEDELMUSIC), 2002, pp. 179–183.
[18] D. FitzGerald, E. Coyle, and B. Lawlor, "Sub-band independent subspace analysis for drum transcription," in Proc. Int. Conf. Digital Audio Effects (DAFX), 2002, pp. 65–69.
[19] C. Uhle, C. Dittmar, and T. Sporer, "Extraction of drum tracks from polyphonic music using independent subspace analysis," in Proc. Int. Symp. Independent Component Analysis and Blind Signal Separation (ICA), 2003, pp. 843–848.
[20] J. Paulus and A. Klapuri, "Drum transcription with non-negative spectrogram factorisation," in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2005.
[21] T. Virtanen, "Sound source separation using sparse coding with temporal continuity objective," in Proc. Int. Computer Music Conf. (ICMC), 2003, pp. 231–234.
[22] D. FitzGerald, B. Lawlor, and E. Coyle, "Prior subspace analysis for drum transcription," in Proc. Audio Eng. Soc. (AES) 114th Conv., 2003.
[23] ——, "Drum transcription in the presence of pitched instruments using prior subspace analysis," in Proc. Irish Signals Syst. Conf. (ISSC), 2003, pp. 202–206.
[24] C. Dittmar and C. Uhle, "Further steps towards drum transcription of polyphonic music," in Proc. Audio Eng. Soc. (AES) 116th Conv., 2004.
[25] A. Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1999, pp. 3089–3092.
[26] P. Herrera, V. Sandvold, and F. Gouyon, "Percussion-related semantic descriptors of music audio files," in Proc. Int. Conf. Audio Eng. Soc. (AES), 2004.
[27] V. Sandvold, F. Gouyon, and P. Herrera, "Percussion classification in polyphonic audio recordings using localized sound models," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2004, pp. 537–540.
[28] A. Savitzky and M. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, Jul. 1964.
[29] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[30] M. Goto, K. Itou, and S. Hayamizu, "A real-time filled pause detection system for spontaneous speech recognition," in Proc. Eurospeech, 1999, pp. 227–230.
[31] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: popular, classical, and jazz music databases," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2002, pp. 287–288.
[32] ——, "RWC music database: music genre database and musical instrument sound database," in Proc. Int. Conf. Music Information Retrieval (ISMIR), 2003, pp. 229–230.
[33] K. Yoshii, M. Goto, and H. Okuno, "AdaMast: a drum sound recognizer based on adaptation and matching of spectrogram templates," in Proc. Music Information Retrieval Evaluation eXchange (MIREX), 2005.
[34] E. Pampalk, P. Hlavac, and P. Herrera, "Hierarchical organization and visualization of drum sample libraries," in Proc. Int. Conf. Digital Audio Effects (DAFX), 2004, pp. 378–383.

Kazuyoshi Yoshii (S'05) received the B.S. and M.S. degrees from Kyoto University, Kyoto, Japan, in 2003 and 2005, respectively. He is currently pursuing the Ph.D. degree in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. His research interests include music scene analysis and human-machine interaction.

Mr. Yoshii is a member of the Information Processing Society of Japan (IPSJ) and the Institute of Electronics, Information, and Communication Engineers (IEICE). He is supported by the JSPS Research Fellowships for Young Scientists (DC1). He has received several awards, including the FIT2004 Paper Award and the Best in Class Award of MIREX 2005.
Masataka Goto received the Doctor of Engineering degree in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 1998. He then joined the Electrotechnical Laboratory (ETL; reorganized as the National Institute of Advanced Industrial Science and Technology (AIST) in 2001), where he has been a Senior Research Scientist since 2005. He served concurrently as a Researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST), from 2000 to 2003, and as an Associate Professor in the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, since 2005. His research interests include music information processing and spoken language processing.

Dr. Goto is a member of the Information Processing Society of Japan (IPSJ), Acoustical Society of Japan (ASJ), Japanese Society for Music Perception and Cognition (JSMPC), Institute of Electronics, Information, and Communication Engineers (IEICE), and International Speech Communication Association (ISCA). He has received 18 awards, including the IPSJ Best Paper Award and IPSJ Yamashita SIG Research Awards (special interest groups on music and computer, and spoken language processing) from the IPSJ, the Awaya Prize for Outstanding Presentation and the Award for Outstanding Poster Presentation from the ASJ, the Award for Best Presentation from the JSMPC, the Best Paper Award for Young Researchers from the Kansai-Section Joint Convention of Institutes of Electrical Engineering, the WISS 2000 Best Paper Award and Best Presentation Award, and the Interaction 2003 Best Paper Award.

Hiroshi G. Okuno (SM'06) received the B.A. and Ph.D. degrees from the University of Tokyo, Tokyo, Japan, in 1972 and 1996, respectively. He worked for Nippon Telegraph and Telephone, the Kitano Symbiotic Systems Project, and Tokyo University of Science. He is currently a Professor in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto, Japan. He was a Visiting Scholar at Stanford University, Stanford, CA, and a Visiting Associate Professor at the University of Tokyo. He has done research in programming languages, parallel processing, and reasoning mechanisms in AI, and is currently engaged in computational auditory scene analysis, music scene analysis, and robot audition. He edited (with D. Rosenthal) Computational Auditory Scene Analysis (Princeton, NJ: Lawrence Erlbaum, 1998) and (with T. Yuasa) Advanced Lisp Technology (London, U.K.: Taylor & Francis, 2002).

Dr. Okuno has received various awards, including the 1990 Best Paper Award of JSAI, the Best Paper Awards of IEA/AIE-2001 and 2005, and the IEEE/RSJ Nakamura Award for IROS-2001 Best Paper Nomination Finalist. He was also awarded the 2003 Funai Information Science Achievement Award. He is a member of the IPSJ, JSAI, JSSST, JSCS, RSJ, ACM, AAAI, ASA, and ISCA.