A Comparative Study On Content-Based Music Genre Classification

ABSTRACT
Content-based music genre classification is a fundamental component of music information retrieval systems and has been gaining importance and enjoying a growing amount of attention with the emergence of digital music on the Internet. Currently little work has been done on automatic music genre classification, and, in addition, the reported classification accuracies are relatively low. This paper proposes a new feature extraction method for music genre classification, DWCHs (Daubechies Wavelet Coefficient Histograms). DWCHs capture the local and global information of music signals simultaneously by computing histograms of their Daubechies wavelet coefficients. The effectiveness of this new feature and of previously studied features is compared using various machine learning classification algorithms, including Support Vector Machines and Linear Discriminant Analysis. It is demonstrated that the use of DWCHs significantly improves the accuracy of music genre classification.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing, Information Search and Retrieval; I.2 [Artificial Intelligence]: Learning; I.5 [Pattern Recognition]: Applications; J.5 [Arts and Humanities]: Music

General Terms
Algorithms, Measurement, Performance, Experimentation, Verification

Keywords
Music genre classification, wavelet coefficient histograms, feature extraction, multi-class classification

1. INTRODUCTION
Music is not only for entertainment and pleasure; it has been used for a wide range of purposes because of its social and physiological effects. At the beginning of the 21st century the world is facing an ever-increasing growth of on-line music information, empowered by the permeation of the Internet into daily life. Efficient and accurate automatic music information processing (accessing and retrieval, in particular) will be an extremely important issue, and it has been enjoying a growing amount of attention. Music can be classified based on its style, and the styles have a hierarchical structure. A currently popular topic in automatic music information retrieval is the problem of organizing, categorizing, and describing music contents on the web. Such endeavors can be found in on-line music databases such as mp3.com and Napster. One important aspect of the genre structures in these on-line databases is that the genre is specified by human experts as well as amateurs (such as users), and that labeling process is time-consuming and expensive. Currently music genre classification is done mainly by hand, because giving a precise definition of a music genre is extremely difficult and, in addition, many music sounds sit on boundaries between genres. These difficulties are due to the fact that music is an art that evolves, where performers and composers have been influenced by music in other genres. However, it has been observed that audio signals (digital or analog) of music belonging to the same genre share certain characteristics, because they are composed of similar types of instruments and have similar rhythmic patterns and similar pitch distributions [7]. This suggests the feasibility of automatic musical genre classification.
By automatic musical genre classification we mean here the strictest form of the problem, i.e., the classification of music signals into a single unique class based on computational analysis of music feature representations. Automatic music genre classification is a fundamental component of music information retrieval systems. We divide the process of genre categorization in music into two steps: feature extraction and multi-class classification. In the feature extraction step, we extract from the music signals information representing the music. The features extracted should be comprehensive (representing the music very well), compact (requiring a small amount of storage), and effective (not requiring much computation for extraction). To meet the first requirement, the design has to be made so that both the low-level and the high-level information of the music is included. In the second step, we build a mechanism (an algorithm and/or a mathematical model) for identifying the labels from the representation of the music sounds with respect to their features.
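To make this two-step decomposition concrete, the following minimal sketch (Python) wires an arbitrary feature extractor to an off-the-shelf multi-class classifier. The placeholder statistics and the scikit-learn SVM used here are illustrative assumptions, not the specific features or implementations evaluated later in the paper.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def extract_features(signal):
    # Step 1: map a raw audio signal to a fixed-length feature vector.
    # These placeholder statistics stand in for the feature sets of Sections 3 and 4.
    zero_cross = (np.abs(np.diff(np.sign(signal))) > 0).mean()
    return np.array([signal.mean(), signal.std(), zero_cross])

def classify_genres(signals, labels):
    # Step 2: multi-class classification of the extracted feature vectors.
    X = np.vstack([extract_features(s) for s in signals])
    y = np.asarray(labels)
    clf = SVC(kernel="linear")                # multi-class SVM (one-vs-one internally)
    return cross_val_score(clf, X, y, cv=10)  # ten-fold cross-validation accuracies

Later sections substitute the timbral, rhythmic, pitch, and DWCH feature sets for extract_features and compare several classifiers within this same kind of harness.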
There has been a considerable amount of work on extracting features for speech recognition and music-speech discrimination, but much less work has been reported on the development of descriptive features specifically for music signals. To the best of our knowledge, currently the most influential approach to direct modeling of music signals for automatic genre classification is due to Tzanetakis and Cook [31], where timbral texture, rhythm, and pitch content features are explicitly developed. The accuracy of classification based on these features, however, is only 61% on their ten-genre sound dataset. This raises the question of whether there are different features that are more useful for music classification and whether the use of statistical or machine learning techniques (e.g., discriminant analysis and support vector machines) can improve the accuracy. The main goal of this paper is to address these two issues.
In this paper, we propose a new feature extraction method based on wavelet coefficient histograms, which we call DWCHs. DWCHs represent both local and global information by computing histograms of Daubechies wavelet coefficients at different frequency subbands with different resolutions. Using DWCHs along with advanced machine learning techniques, the accuracy of music genre classification has been significantly improved. On the dataset of [31], the classification accuracy has been increased to almost 80%. On genre-specific classification, i.e., distinguishing one genre from the rest, the accuracy can be as high as 98%. We also examined which of the features proposed in [31] are the most effective. It turns out that the timbral features combined with the Mel-frequency cepstral coefficients achieve a high accuracy with each of the multi-class classification algorithms we tested. The rest of the paper is organized as follows: Section 2 reviews related work on automatic genre classification. Section 3 gives a brief overview of the previously proposed features. Section 4 presents the wavelet coefficient histogram features. Section 5 describes the machine learning techniques used for our experiments. Section 6 presents and discusses our experimental results. Finally, Section 7 provides our conclusions.

2. RELATED WORK
Many different features can be used for music classification, e.g., reference features including title and composer, content-based acoustic features including tonality, pitch, and beat, symbolic features extracted from the scores, and text-based features extracted from the song lyrics. In this paper we are interested in content-based features. (Our future work will address the issue of mixing different types of features.)
The content-based acoustic features are classified into timbral texture features, rhythmic content features, and pitch content features [31]. Timbral features mostly originate from traditional speech recognition techniques. They are usually calculated for every short-time frame of sound based on the Short Time Fourier Transform (STFT) [24]. Typical timbral features include Spectral Centroid, Spectral Rolloff, Spectral Flux, Energy, Zero Crossings, Linear Prediction Coefficients, and Mel-Frequency Cepstral Coefficients (MFCCs) (see [24] for more detail). Among these timbral features, MFCCs have been dominantly used in speech recognition. Logan [18] examines MFCCs for music modeling and music/speech discrimination. Rhythmic content features contain information about the regularity of the rhythm, the beat, and the tempo. Tempo and beat tracking from acoustic musical signals has been explored in [13, 15, 26]. Foote and Uchihashi [10] use the beat spectrum to represent rhythm. Pitch content features deal with the frequency information of the music bands and are obtained using various pitch detection techniques.
Much less work has been reported on music genre classification. Tzanetakis and Cook [31] propose a comprehensive set of features for direct modeling of music signals and explore the use of those features for musical genre classification using K-Nearest Neighbors and Gaussian Mixture Models. Lambrou et al. [14] use statistical features in the temporal domain as well as three different wavelet transform domains to classify music into rock, piano, and jazz. Deshpande et al. [5] use Gaussian Mixtures, Support Vector Machines, and Nearest Neighbors to classify music into rock, piano, and jazz based on timbral features. Pye [23] investigates the use of Gaussian Mixture Modeling (GMM) and Tree-Based Vector Quantization for music genre classification. Soltau et al. [28] propose an approach for representing the temporal structure of the input signal. They show that this new set of abstract features can be learned via artificial neural networks and can be used for music genre identification.
The problem of discriminating music from speech has been investigated by Saunders [25] and by Scheirer and Slaney [27]. Zhang and Kuo [34] propose a heuristic rule-based system to segment and classify audio signals from movies or TV programs. In [33] audio contents are divided into instrument sounds, speech sounds, and environment sounds using automatically extracted features. Foote [9] constructs a learning tree vector quantizer using twelve MFCCs plus energy as audio features for retrieval. Li and Khokhar [16] propose nearest feature line methods for content-based audio classification and retrieval.

3. THE CONTENT-BASED FEATURES
3.1 Timbral Textural Features
Timbral textural features are used to differentiate mixtures of sounds that possibly have the same or similar rhythmic and pitch content. The use of these features originates from speech recognition. To extract timbral features, the sound signals are first divided into frames that are statistically stationary, usually by applying a windowing function at fixed intervals. The window function, typically a Hamming window, removes edge effects. Timbral textural features are then computed for each frame, and the statistical values (such as the mean and the variance) of those features are calculated. A short code sketch of these frame-level computations is given after the feature list below.

• Mel-Frequency Cepstral Coefficients (MFCCs) are designed to capture short-term spectral-based features. After taking the logarithm of the amplitude spectrum based on the STFT for each frame, the frequency bins are grouped and smoothed according to Mel-frequency scaling, which is designed to agree with perception. MFCCs are generated by decorrelating the Mel-spectral vectors using the discrete cosine transform.
• Spectral Centroid is the centroid of the magnitude spectrum of the STFT. It is a measure of spectral brightness.

• Spectral Rolloff is the frequency below which 85% of the magnitude distribution is concentrated. It measures the spectral shape.

• Spectral Flux is the squared difference between the normalized magnitudes of successive spectral distributions. It measures the amount of local spectral change.

• Zero Crossings is the number of time-domain zero crossings of the signal. It measures the noisiness of the signal.

• Low Energy is the percentage of frames that have energy less than the average energy over the whole signal. It measures the amplitude distribution of the signal.
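As an illustration, the sketch below computes the frame-level descriptors listed above from the STFT and aggregates them by mean and variance. It is a minimal sketch: the frame length, hop size, and other constants are assumptions, not the settings of [31].

import numpy as np

def frame_signal(x, frame=1024, hop=512):
    # Slice the signal into overlapping frames and apply a Hamming window.
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame)

def timbral_features(x, sr=22050, frame=1024, hop=512, rolloff_pct=0.85):
    frames = frame_signal(x, frame, hop)
    mag = np.abs(np.fft.rfft(frames, axis=1))             # magnitude spectrum per frame
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    cum = np.cumsum(mag, axis=1)
    rolloff = freqs[np.argmax(cum >= rolloff_pct * cum[:, -1:], axis=1)]
    norm = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    flux = np.r_[0.0, (np.diff(norm, axis=0) ** 2).sum(axis=1)]
    zerocross = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    energy = (frames ** 2).sum(axis=1)
    low_energy = (energy < energy.mean()).mean()          # fraction of low-energy frames
    per_frame = np.vstack([centroid, rolloff, flux, zerocross])
    # Aggregate each per-frame feature by its mean and variance, plus low energy.
    return np.concatenate([per_frame.mean(axis=1), per_frame.var(axis=1), [low_energy]])

MFCCs themselves are usually taken from an existing implementation (for example, librosa.feature.mfcc) rather than re-derived; as with the features above, only their per-frame means and variances enter the final feature vector.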
3.2 Rhythmic Content Features
Rhythmic content features characterize the movement of music signals over time and contain such information as the regularity of the rhythm, the beat, the tempo, and the time signature. The feature set for representing rhythm structure is based on detecting the most salient periodicities of the signal and is usually extracted from a beat histogram. To construct the beat histogram, the time-domain amplitude envelope of each band is first extracted by decomposing the music signal into a number of octave frequency bands. Then the envelopes of the bands are summed together, followed by the computation of the autocorrelation of the resulting sum envelope. The dominant peaks of the autocorrelation function, corresponding to the various periodicities of the signal's envelope, are accumulated over the whole sound file into a beat histogram in which each bin corresponds to the peak lag. The rhythmic content features are then extracted from the beat histogram; they generally include the relative amplitudes of the first and second histogram peaks, the ratio of the amplitude of the second peak to that of the first peak, the periods of the first and second peaks, and the overall sum of the histogram.
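A simplified rendering of this construction is sketched below. The Butterworth octave filterbank, the rectified-and-smoothed envelopes, and the number of retained autocorrelation peaks are assumptions made for illustration rather than an exact re-implementation of [31].

import numpy as np
from scipy.signal import butter, lfilter

def beat_histogram(x, sr=22050, n_bands=6, f0=100.0, hop=128, max_lag_sec=1.5):
    # Octave-spaced band edges: [f0, 2*f0), [2*f0, 4*f0), ...
    edges = f0 * 2.0 ** np.arange(n_bands + 1)
    env_sum = None
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = min(hi, 0.99 * sr / 2)
        b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        band = lfilter(b, a, x)
        env = np.abs(band)                                    # full-wave rectified envelope
        env = lfilter(np.ones(hop) / hop, [1.0], env)[::hop]  # smooth and downsample
        env_sum = env if env_sum is None else env_sum + env
    env_sum = env_sum - env_sum.mean()
    # Autocorrelation of the summed envelope, ignoring very short lags (above 240 BPM).
    max_lag = int(max_lag_sec * sr / hop)
    min_lag = int(0.25 * sr / hop)
    ac = np.correlate(env_sum, env_sum, mode="full")[len(env_sum) - 1:][:max_lag]
    ac[:min_lag] = 0.0
    hist = np.zeros(max_lag)
    peaks = np.argsort(ac)[-10:]        # accumulate the dominant periodicities
    hist[peaks] = ac[peaks]
    return hist                         # bins indexed by peak lag (in envelope samples)

The summary statistics listed above (and the analogous pitch statistics of Section 3.3) can then be read off such a histogram; a small helper for that purpose follows Section 3.3.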
3.3 Pitch Content Features
The pitch content features describe the melody and harmony information of music signals and are extracted based on various pitch detection techniques. Basically, the dominant peaks of the autocorrelation function, calculated via the summation of the envelopes of each frequency band obtained by decomposing the signal, are accumulated into pitch histograms, and the pitch content features are then extracted from the pitch histograms. The pitch content features typically include the amplitudes and periods of the maximum peaks in the histogram, the pitch intervals between the two most prominent peaks, and the overall sums of the histograms.
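Because both the beat histogram of Section 3.2 and the pitch histogram described here are summarized by the same kind of peak statistics, a single small helper suffices. The exact set and ordering of statistics below follows the lists above but is otherwise an assumption for illustration.

import numpy as np

def histogram_features(hist):
    # Peak statistics shared by beat and pitch histograms (bins are lags or pitch values).
    order = np.argsort(hist)[::-1]
    p1, p2 = int(order[0]), int(order[1])   # two most prominent peaks (bin indices)
    total = hist.sum() + 1e-12
    return np.array([
        hist[p1] / total,                   # relative amplitude of the first peak
        hist[p2] / total,                   # relative amplitude of the second peak
        hist[p2] / (hist[p1] + 1e-12),      # ratio of the second peak to the first
        float(p1), float(p2),               # periods (bin positions) of the two peaks
        float(abs(p1 - p2)),                # interval between the two most prominent peaks
        float(hist.sum()),                  # overall sum of the histogram
    ])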
4. DWCHS OF MUSIC SIGNALS
It is not difficult to see that the traditional features described in Section 3 more or less capture incomplete information about music signals. Timbral textural features are standard features used in speech recognition and are calculated for every short-time frame of sound, while rhythmic and pitch content features are computed over the whole file. In other words, timbral features capture the statistics of local information of music signals from a global perspective but are not sufficient to represent the global information of the music. Moreover, as indicated by our experiments in Section 6, the rhythm and pitch content features do not seem to capture enough information for classification purposes. In this section, we propose a new feature extraction technique for music genre classification, namely DWCHs, based on wavelet coefficient histograms, to capture local and global information of music signals simultaneously.

4.1 Wavelet Basics
The wavelet transform is a synthesis of ideas that emerged over many years from different fields such as mathematics and image and signal processing, and it has been widely used in information retrieval and data mining [19, 20]. A survey of wavelet applications in data mining can be found in [17]. Generally speaking, the wavelet transform, providing good time and frequency resolution, is a tool that divides up data, functions, or operators into different frequency components and then studies each component with a resolution matched to its scale [3]. Straightforwardly, a wavelet coefficient histogram is the histogram of the (rounded) wavelet coefficients obtained by convolving a wavelet filter with an input music signal (details on histograms and wavelet filters/analysis can be found in [29] and [3], respectively).
Several favorable properties of wavelets, such as compact support, vanishing moments, and decorrelated coefficients, make them useful tools for signal representation and transformation. Generally speaking, wavelets are designed to give good time resolution at high frequencies and good frequency resolution at low frequencies. Compact support guarantees the localization of the wavelet; the vanishing moment property allows the wavelet to focus on the most important information and to discard noisy signals; and the decorrelation property enables the wavelet to reduce temporal correlation, so that the correlation of the wavelet coefficients is much smaller than that of the corresponding temporal process [8]. Hence, after the wavelet transform, a complex signal in the time domain can be reduced to a much simpler process in the wavelet domain. By computing the histograms of the wavelet coefficients, we can then obtain a good estimate of the probability distribution over time. A good probability estimate in turn leads to a good feature representation.
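As a concrete illustration of a wavelet coefficient histogram, the sketch below decomposes a signal with a Daubechies filter from the PyWavelets package and histograms the rounded coefficients of one subband. The choice of db8, the decomposition depth, and the bin count are assumptions made for illustration.

import numpy as np
import pywt

def subband_histogram(x, wavelet="db8", level=7, subband=1, bins=64):
    # Discrete wavelet decomposition: coeffs[0] is the approximation subband,
    # coeffs[1:] are detail subbands ordered from coarse to fine.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    hist, edges = np.histogram(np.round(coeffs[subband]), bins=bins, density=True)
    return hist, edges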
4.2 DWCHs
A sound file is a kind of oscillating waveform in the time domain and can be considered as a two-dimensional entity of amplitude over time, of the form M(t) = D(A, t), where A is the amplitude and generally ranges over [−1, 1]. The distinguishing characteristics are contained in the amplitude variation; in consequence, identifying the amplitude variation is essential for music categorization.
On one hand, the histogram technique is an efficient means of distribution estimation. However, the raw signal in the time domain is not a good representation, particularly for content-based categorization, since the most distinguishing characteristics are hidden in the frequency domain. On the other hand, the sound frequency spectrum is generally divided into octaves, each with a unique quality. An octave is the interval between any two frequencies with a ratio of 2 to 1, a logarithmic relation in frequency. The wavelet decomposition scheme matches
the models of sound octave division for perceptual scales and provides good time and frequency resolution [16]. In other words, the decomposition of an audio signal using wavelets produces a set of subband signals at different frequencies corresponding to different characteristics. This motivates the use of wavelet histogram techniques for feature extraction. The wavelet coefficients are distributed in various frequency bands at different resolutions.

[Figure: wavelet coefficient histograms of example signals from six genres: Blues, Classical, Country, Disco, Hiphop, and Jazz.]
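A plausible end-to-end sketch of DWCH-style feature extraction is given below: the signal is decomposed into wavelet subbands, each subband's coefficient histogram is reduced to a few summary numbers (here its first three moments plus the subband energy), and the results are concatenated. The specific wavelet (db8), the per-subband statistics, and the subband selection are assumptions for illustration; Section 6.2 states only that four features were computed for each of seven subbands and that a 35-dimensional vector over selected subbands, together with traditional timbral features, was ultimately used.

import numpy as np
import pywt

def dwch_features(x, wavelet="db8", level=7, bins=64):
    # Daubechies wavelet coefficient histogram features (illustrative construction).
    coeffs = pywt.wavedec(x, wavelet, level=level)
    feats = []
    for c in coeffs[1:]:                                  # one detail subband per level
        hist, edges = np.histogram(np.round(c), bins=bins)
        p = hist / max(hist.sum(), 1)                     # histogram as a probability estimate
        centers = 0.5 * (edges[:-1] + edges[1:])
        mean = (p * centers).sum()
        var = (p * (centers - mean) ** 2).sum()
        skew = (p * (centers - mean) ** 3).sum() / (var ** 1.5 + 1e-12)
        energy = float(np.mean(c ** 2))                   # average subband energy
        feats.extend([mean, var, skew, energy])           # four numbers per subband
    return np.asarray(feats)

# The full DWCHs vector of Section 6.2 would concatenate such subband statistics
# (for selected subbands) with the timbral statistics of Section 3.1.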
to codewords). Then l classifiers are trained to predict each bit of the string. For new instances, the predicted class is the one whose codeword is the closest (in Hamming distance) to the codeword produced by the classifiers. One-versus-the-rest and pairwise comparison can be regarded as two special cases of ECOC with specific coding schemes. Multi-class objective functions aim to directly modify the objective function of binary SVMs in such a way that it simultaneously allows the computation of a multi-class classifier. In practice, the choice of the reduction method from multi-class to binary is problem-dependent and not a trivial task, and it is fair to say that there is probably no method that generally outperforms the others [1].
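For reference, the reductions discussed here correspond to readily available wrappers; the scikit-learn estimators in this sketch are stand-ins for the LIBSVM- and MPSVM-based implementations used in Section 6, and the code size chosen for the error-correcting output code is arbitrary.

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier
from sklearn.svm import LinearSVC

base = LinearSVC()                                  # binary large-margin classifier
pairwise = OneVsOneClassifier(base)                 # one binary problem per pair of genres
one_vs_rest = OneVsRestClassifier(base)             # one binary problem per genre
ecoc = OutputCodeClassifier(base, code_size=2, random_state=0)
# ECOC assigns each genre a binary codeword, trains one classifier per bit, and predicts
# the genre whose codeword is nearest in Hamming distance to the classifiers' outputs.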
We conducted experiments with the following additional multi-class classification approaches (see [21] for more information about the methods):

[Figure: wavelet coefficient histograms for ten Blues excerpts (Blues No.1–No.10) and eight Classical excerpts (Cla. No.1–No.8).]

algorithm is then used to estimate the parameters for each Gaussian component and the mixture weight.

• Linear Discriminant Analysis (LDA): In the statistical pattern recognition literature, discriminant analysis approaches are known to learn discriminative feature transformations very well. The approach has been successfully used in many classification tasks [11]. The basic idea of LDA is to find a linear transformation that best discriminates among classes and to perform classification in the transformed space based on some metric such as Euclidean distance. Fisher discriminant analysis finds the discriminative feature transform as the eigenvectors of the matrix T = Σ̂w⁻¹ Σ̂b, where Σ̂w is the intra-class covariance matrix and Σ̂b is the inter-class covariance matrix. The matrix T captures both the compactness of each class and the separation between classes. So, the eigenvectors corresponding to the largest eigenvalues of T are expected to constitute a discriminative feature transform, as sketched in the code below.
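The Fisher transform described above reduces to a generalized eigenproblem. The sketch below is a minimal implementation; the small ridge added to the intra-class scatter matrix is an assumption to keep the problem well conditioned.

import numpy as np
from scipy.linalg import eigh

def lda_transform(X, y, n_components):
    # Fisher discriminant transform: leading eigenvectors of Sw^{-1} Sb.
    y = np.asarray(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # intra-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)  # inter-class scatter
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))      # generalized eigenproblem
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]]               # columns = discriminative directions

# New feature vectors are projected as X @ W and classified in the transformed
# space, for example by Euclidean distance to the class means.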
6. EXPERIMENTAL RESULTS
files), Classical (164 files), Fusion (136 files), Jazz (251 files), and Rock (96 files). (A list of music sources for Dataset B is available at http://www.cs.rochester.edu/u/ogihara/music/SIGIR-list.xls.) For both datasets, the sound files are converted to 22050 Hz, 16-bit, mono audio files.

6.2 Experiment Setup
We used MARSYAS, a public software framework for computer audition applications, to extract the features proposed in [31]: the Mel-frequency cepstral coefficients (denoted MFCC), the timbral texture features excluding MFCCs (denoted FFT), the rhythmic content features (denoted Beat), and the pitch content features (denoted Pitch) (see Section 3). (MARSYAS can be downloaded from http://www.cs.princeton.edu/~gtzan/marsyas.html.) The MFCC feature vector consists of the mean and variance of each of the first five MFCC coefficients over the frames; the FFT feature vector consists of the mean and variance of the spectral centroid, rolloff, flux, zero crossings, and low energy; the Beat feature vector consists of six features from the rhythm histogram; and the Pitch feature vector consists of five features from the pitch histograms. More information on the feature extraction can be found in [30]. Our original DWCH feature set contains four features for each of seven frequency subbands along with nineteen traditional timbral features. However, we found that not all the frequency subbands are informative, so we use only selected subbands, resulting in a feature vector of length 35.
For classification, we use three different reduction methods to extend SVMs to the multi-class case: pairwise, one-versus-the-rest, and multi-class objective functions. For the one-versus-the-rest and pairwise methods, our SVM implementation was based on LIBSVM [2], a library for support vector classification and regression. For multi-class objective functions, our implementation was based on MPSVM [12]. For experiments involving SVMs, we tested linear, polynomial, and radial basis function kernels and report the best results in the tables. For Gaussian Mixture Models, we used three Gaussian mixtures to model each music genre. For K-Nearest Neighbors, we set k = 5.
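The experimental harness can be approximated with open-source counterparts of the tools listed above. The sketch below is only an approximation under those assumptions: scikit-learn estimators substitute for LIBSVM, MPSVM, and the MARSYAS feature extractor, so its numbers would not be expected to reproduce Tables 1-4.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

class GMMClassifier(BaseEstimator, ClassifierMixin):
    # One Gaussian mixture per genre; predicts the class with the highest likelihood.
    def __init__(self, n_components=3):
        self.n_components = n_components

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_ = {c: GaussianMixture(n_components=self.n_components,
                                           random_state=0).fit(X[y == c])
                        for c in self.classes_}
        return self

    def predict(self, X):
        scores = np.column_stack([self.models_[c].score_samples(X) for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]

def compare_classifiers(X, y):
    models = {
        "SVM (pairwise)": SVC(kernel="linear"),            # one-vs-one reduction by default
        "GMM (3 components)": GMMClassifier(n_components=3),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    }
    return {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}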
6.3 Results and Analysis
Table 1 shows the accuracy of the various classification algorithms on Dataset A. The bottom four rows show how the classifiers performed on each single set of features proposed in [31]. The experiments verify that each of the traditional features contains useful yet incomplete information characterizing music signals. The classification accuracy on any single feature set is significantly better than random guessing (the accuracy of random guessing on Dataset A is 10%). The performance with either FFT or MFCC was significantly higher than that with Beat or Pitch for each of the methods tested. This naturally raises the question of whether FFT and MFCC are each more suitable than Beat or Pitch for music genre classification. We combined the four sets of features in every possible way to examine the accuracy. The accuracy with only Beat and Pitch is significantly smaller than the accuracy with any combination that includes either FFT or MFCC. Indeed, the accuracy with only FFT and MFCC is almost the same as that with all four feature sets for all methods. This seems to answer our question positively.
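The exhaustive combination of feature sets reported in Table 1 amounts to a loop over the non-empty subsets of the four feature blocks. The sketch below shows that bookkeeping; the feature blocks are assumed to be precomputed column groups, and evaluate(X, y) is any routine returning a cross-validated accuracy.

from itertools import combinations
import numpy as np

def combination_accuracies(blocks, y, evaluate):
    # blocks: e.g. {"Beat": X_beat, "FFT": X_fft, "MFCC": X_mfcc, "Pitch": X_pitch},
    # each an (n_clips, n_features) array.
    results = {}
    names = sorted(blocks)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            X = np.hstack([blocks[n] for n in combo])   # concatenate the chosen blocks
            results["+".join(combo)] = evaluate(X, y)
    return results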
On the other hand, the use of DWCHs further improved the accuracy of all methods. In particular, there is a significant jump in accuracy when Support Vector Machines are used with either the pairwise or the one-versus-all approach. The accuracy of the one-versus-the-rest SVM is 78.5% on average in the ten-fold cross validation. For some of the cross validation tests the accuracy went beyond 80%. This is a remarkable improvement over Tzanetakis and Cook's 61%. Perrot and Gjerdigen [22] report a human study in which college students were trained to learn a music company's genre classification on a ten-genre data collection, and about 70% accuracy was achieved. Although these results are not directly comparable due to the different dataset collections, this clearly implies that automatic content-based genre classification could possibly achieve accuracy similar to human performance. In fact, the performance of our best method seems to go well beyond that.
There are papers reporting better accuracy of automatic music genre recognition on smaller datasets. Pye [23] reports 90% on a set of 175 songs covering six genres (Blues, Easy Listening, Classical, Opera, Dance, and Indie Rock). Soltau et al. [28] report 80% accuracy on four classes (Rock, Pop, Techno, and Classical). For the sake of comparison, we show in Table 2 the performance of the best classifier (DWCHs with SVM) in one-versus-all tests on each of the ten music genres in Dataset A. The performance of these classifiers is extremely good. Also, in Table 3 we show the performance of multi-class classification for distinguishing among smaller numbers of classes. The accuracy gradually decreases as the number of classes increases.
Table 4 presents the results on our own dataset. This dataset was generated with little control, by blindly taking the 30 seconds after the introductory 30 seconds of each piece, and it covers many different albums, so the performance was anticipated to be lower than that for Dataset A. Also, there is the genre of Ambient, which covers music bridging Classical and Jazz. The difficulty in classifying such borderline cases is compensated for by the reduction in the number of classes. The overall performance was only 4 to 5% lower than that for Dataset A.
We observe that SVMs are always the best classifiers for content-based music genre classification. However, the choice of the reduction method from multi-class to binary seems to be problem-dependent, and there is no clear overall winner; it is fair to say that there is probably no reduction method that generally outperforms the others. Feature extraction is crucial for music genre classification, and the choice of features is more important than the choice of classifiers: the variation in classification accuracy across different classification techniques is much smaller than that across different feature extraction techniques.
Features Methods
SVM1 SVM2 MPSVM GMM LDA KNN
DWCHs 74.9(4.97) 78.5(4.07) 68.3(4.34) 63.5(4.72) 71.3(6.10) 62.1(4.54)
Beat+FFT+MFCC+Pitch 70.8(5.39) 71.9(5.09) 66.2(5.23) 61.4(3.87) 69.4(6.93) 61.3(4.85)
Beat+FFT+MFCC 71.2(4.98) 72.1(4.68) 64.6(4.16) 60.8(3.25) 70.2(6.61) 62.3(4.03)
Beat+FFT+Pitch 65.1(4.27) 67.2(3.79) 56.0(4.67) 53.3(3.82) 61.1(6.53) 51.8(2.94)
Beat+MFCC+Pitch 64.3(4.24) 63.7(4.27) 57.8(3.82) 50.4(2.22) 61.7(5.23) 54.0(3.30)
FFT+MFCC+Pitch 70.9(6.22) 72.2(3.90) 64.9(5.06) 59.6(3.22) 69.9(6.76) 61.0(5.40)
Beat+FFT 61.7(5.12) 62.6(4.83) 50.8(5.16) 48.3(3.82) 56.0(6.73) 48.8(5.07)
Beat+MFCC 60.4(3.19) 60.2(4.84) 53.5(4.45) 47.7(2.24) 59.6(4.03) 50.5(4.53)
Beat+Pitch 42.7(5.37) 41.1(4.68) 35.6(4.27) 34.0(2.69) 36.9(4.38) 35.7(3.59)
FFT+MFCC 70.5(5.98) 71.8(4.83) 63.6(4.71) 59.1(3.20) 66.8(6.77) 61.2(7.12)
FFT+Pitch 64.0(5.16) 68.2(3.79) 55.1(5.82) 53.7(3.15) 60.0(6.68) 53.8(4.73)
MFCC+Pitch 60.6(4.54) 64.4(4.37) 53.3(2.95) 48.2(2.71) 59.4(4.50) 54.7(3.50)
Beat 26.5(3.30) 21.5(2.71) 22.1(3.04) 22.1(1.91) 24.9(2.99) 22.8(5.12)
FFT 61.2(6.74) 61.8(3.39) 50.6(5.76) 47.9(4.91) 56.5(6.90) 52.6(3.81)
MFCC 58.4(3.31) 58.1(4.72) 49.4(2.27) 46.4(3.09) 55.5(3.57) 53.7(4.11)
Pitch 36.6(2.95) 33.6(3.23) 29.9(3.76) 25.8(3.02) 30.7(2.79) 33.3(3.20)
Table 1: Classification accuracy of the learning methods tested on Dataset A using various combinations of
features. The accuracy values are calculated via ten-fold cross validation. The numbers within parentheses
are standard deviations. SVM1 and SVM2 respectively denote the pairwise SVM and the one-versus-the-rest
SVM.
Table 2: Genre-specific accuracy of SVM1 on DWCHs. The results are calculated via ten-fold cross validation; each entry is of the form accuracy (standard deviation).
Classes Methods
SVM1 SVM2 MPSVM GMM LDA KNN
1&2 98.00(3.50) 98.00(2.58) 99.00(2.11) 98.00(3.22) 99.00(2.11) 97.5(2.64)
1, 2 & 3 92.33(5.46) 92.67(4.92) 93.33(3.51) 91.33(3.91) 94.00(4.10) 87.00(5.54)
1 through 4 90.5(4.53) 90.00(4.25) 89.75(3.99) 85.25(5.20) 89.25(3.92) 83.75(5.92)
1 through 5 88.00(3.89) 86.80(4.54) 83.40(5.42) 81.2(4.92) 86.2(5.03) 78.00(5.89)
1 through 6 84.83(4.81) 86.67(5.27) 81.0(6.05) 73.83(5.78) 82.83(6.37) 73.5(6.01)
1 through 7 83.86(4.26) 84.43(3.53) 78.85(3.67) 74.29(6.90) 81.00(5.87) 73.29(5.88)
1 through 8 81.5(4.56) 83.00(3.64) 75.13(4.84) 72.38(6.22) 79.13(6.07) 69.38(5.47)
1 through 9 78.11(4.83) 79.78(2.76) 70.55(4.30) 68.22(7.26) 74.47(6.22) 65.56(4.66)
Table 3: Accuracy on various subsets of Dataset A using DWCHs. The class numbers correspond to those
of Table 2. The accuracy values are calculated via ten-fold cross validation. The numbers in the parentheses
are the standard deviations.
Features Methods
SVM1 SVM2 MPSVM GMM LDA KNN
DWCHs 71.48(6.84) 74.21(4.65) 67.16(5.60) 64.77(6.15) 65.74(6.03) 61.84(4.88)
Beat+FFT+MFCC+Pitch 68.65(3.90) 69.19(4.32) 65.21(3.63) 63.08(5.89) 66.00(5.57) 60.59(5.43)
FFT+MFCC 66.67(4.40) 70.63(4.13) 64.29(4.54) 61.24(6.29) 65.35(4.86) 60.78(4.30)
Beat 43.37(3.88) 44.52(4.14) 41.01(4.46) 37.95(5.21) 40.87(4.50) 41.27(2.96)
FFT 61.65(5.57) 62.19(5.26) 54.76(2.94) 50.80(4.89) 57.94(5.11) 57.42(5.64)
MFCC 60.45(5.12) 67.46(3.57) 57.42(4.67) 53.43(5.64) 59.26(4.77) 59.93(3.49)
Pitch 37.56(4.63) 39.37(3.88) 36.49(5.12) 29.62(5.89) 37.82(4.67) 38.89(5.04)
Table 4: Classification accuracy of the learning methods tested on Dataset B using various combinations of
features, calculated via ten-fold cross validation. The numbers within parentheses are standard deviations.
7. CONCLUSIONS AND FUTURE WORK
In this paper we proposed DWCHs, a new feature extraction method for music genre classification. DWCHs represent music signals by computing histograms of Daubechies wavelet coefficients at various frequency bands, and they significantly improve the classification accuracy. We also provided a comparative study of various feature extraction and classification methods and investigated the performance of various classification methods on different feature sets. To the best of our knowledge, this is the first study of its kind in music genre classification.

Acknowledgments
The authors thank George Tzanetakis for useful discussions and for kindly sharing his data with us. This work is supported in part by NSF grants EIA-0080124, DUE-9980943, and EIA-0205061, and in part by NIH grants RO1-AG18231 (5-25589) and P30-AG18254.

8. REFERENCES
[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Proc. 17th International Conf. on Machine Learning, pages 9–16. Morgan Kaufmann, San Francisco, CA, 2000.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] I. Daubechies. Ten Lectures on Wavelets. SIAM, Philadelphia, 1992.
[4] A. David and S. Panchanathan. Wavelet-histogram method for face recognition. Journal of Electronic Imaging, 9(2):217–225, 2000.
[5] H. Deshpande, R. Singh, and U. Nam. Classification of music signals in the visual domain. In Proceedings of the COST-G6 Conference on Digital Audio Effects, 2001.
[6] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
[7] W. J. Dowling and D. L. Harwood. Music Cognition. Academic Press, Inc., 1986.
[8] P. Flandrin. Wavelet analysis and synthesis of fractional Brownian motion. IEEE Transactions on Information Theory, 38(2):910–917, 1992.
[9] J. Foote. Content-based retrieval of music and audio. In Multimedia Storage and Archiving Systems II, Proceedings of SPIE, pages 138–147, 1997.
[10] J. Foote and S. Uchihashi. The beat spectrum: a new approach to rhythm analysis. In IEEE International Conference on Multimedia & Expo 2001, 2001.
[11] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd edition, 1990.
[12] G. Fung and O. L. Mangasarian. Multicategory proximal support vector machine classifiers. Technical Report 01-06, University of Wisconsin at Madison, 2001.
[13] M. Goto and Y. Muraoka. A beat tracking system for acoustic signals of music. In ACM Multimedia, pages 365–372, 1994.
[14] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney. Classification of audio signals using statistical features on time and wavelet transform domains. In Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), volume 6, pages 3621–3624, 1998.
[15] J. Laroche. Estimating tempo, swing and beat locations in audio recordings. In Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA01), 2001.
[16] G. Li and A. A. Khokhar. Content-based indexing and retrieval of audio data using wavelets. In IEEE International Conference on Multimedia and Expo (II), pages 885–888, 2000.
[17] T. Li, Q. Li, S. Zhu, and M. Ogihara. A survey on wavelet applications in data mining. SIGKDD Explorations, 4(2):49–68, 2003.
[18] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[19] M. K. Mandal, T. Aboulnasr, and S. Panchanathan. Fast wavelet histogram techniques for image indexing. Computer Vision and Image Understanding, 75(1–2):99–110, 1999.
[20] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD Conference, pages 448–459, 1998.
[21] T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997.
[22] D. Perrot and R. R. Gjerdigen. Scanning the dial: an exploration of factors in the identification of musical style. In Proceedings of the 1999 Society for Music Perception and Cognition, page 88, 1999.
[23] D. Pye. Content-based methods for managing electronic music. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech and Signal Processing, 2000.
[24] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice-Hall, NJ, 1993.
[25] J. Saunders. Real-time discrimination of broadcast speech/music. In Proc. ICASSP 96, pages 993–996, 1996.
[26] E. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1), 1998.
[27] E. Scheirer and M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In Proc. ICASSP '97, pages 1331–1334, Munich, Germany, 1997.
[28] H. Soltau, T. Schultz, and M. Westphal. Recognition of music types. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.
[29] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991.
[30] G. Tzanetakis and P. Cook. MARSYAS: A framework for audio analysis. Organized Sound, 4(3):169–175, 2000.
[31] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), July 2002.
[32] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[33] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search and retrieval of audio. IEEE Multimedia, 3(2):27–36, 1996.
[34] T. Zhang and C.-C. J. Kuo. Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 3(4), 2001.