

Music Genre Classification using Machine Learning Techniques

Hareesh Bahuleyan
University of Waterloo, ON, Canada
[email protected]

Abstract

Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this classification task are identified. The experiments are conducted on the Audio Set dataset and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.[1]

[1] The code has been open-sourced and is available at https://github.com/HareeshBahuleyan/music-genre-classification

1 Introduction

With the growth of online music databases and easy access to music content, people find it increasingly hard to manage the songs that they listen to. One way to categorize and organize songs is based on the genre, which is identified by some characteristics of the music such as rhythmic structure, harmonic content and instrumentation (Tzanetakis and Cook, 2002). Being able to automatically classify and provide tags to the music present in a user's library, based on genre, would be beneficial for audio streaming services such as Spotify and iTunes. This study explores the application of machine learning (ML) algorithms to identify and classify the genre of a given audio file. The first model described in this paper uses convolutional neural networks (Krizhevsky et al., 2012), trained end-to-end on the MEL spectrogram of the audio signal. In the second part of the study, we extract features both in the time domain and the frequency domain of the audio signal. These features are then fed to conventional machine learning models, namely Logistic Regression, Random Forests (Breiman, 2001), Gradient Boosting (Friedman, 2001) and Support Vector Machines, which are trained to classify the given audio file. The models are evaluated on the Audio Set dataset (Gemmeke et al., 2017). We compare the proposed models and also study the relative importance of different features.

The rest of this paper is organized as follows. Section 2 describes the existing methods in the literature for the task of music genre classification. Section 3 is an overview of the dataset used in this study and how it was obtained. The proposed models and the implementation details are discussed in Section 4. The results are reported in Section 5.2, followed by the conclusions from this study in Section 6.

2 Literature Review

Music genre classification has been a widely studied area of research since the early days of the Internet. Tzanetakis and Cook (2002) addressed this problem with supervised machine learning approaches such as Gaussian Mixture models and k-nearest neighbour classifiers. They introduced three sets of features for this task, categorized as timbral structure, rhythmic content and pitch content. Hidden Markov Models (HMMs), which have been extensively used for speech recognition tasks, have also been explored for music genre classification (Scaringella and Zoia, 2005; Soltau et al., 1998). Support vector machines (SVMs) with different distance metrics are studied and compared in Mandel and Ellis (2005) for classifying genre.
In Lidy and Rauber (2005), the authors discuss the contribution of psycho-acoustic features for recognizing music genre, especially the importance of the STFT taken on the Bark Scale (Zwicker and Fastl, 1999). Mel-frequency cepstral coefficients (MFCCs), spectral contrast and spectral roll-off were some of the features used by Tzanetakis and Cook (2002). A combination of visual and acoustic features is used to train SVM and AdaBoost classifiers in Nanni et al. (2016).

With the recent success of deep neural networks, a number of studies apply these techniques to speech and other forms of audio data (Abdel-Hamid et al., 2014; Gemmeke et al., 2017). Representing audio in the time domain for input to neural networks is not very straightforward because of the high sampling rate of audio signals. However, it has been addressed in Van Den Oord et al. (2016) for audio generation tasks. A common alternative representation is the spectrogram of a signal, which captures both time and frequency information. Spectrograms can be considered as images and used to train convolutional neural networks (CNNs) (Wyse, 2017). A CNN was developed to predict the music genre using the raw MFCC matrix as input in Li et al. (2010). In Lidy and Schindler (2016), a constant Q-transform (CQT) spectrogram was provided as input to the CNN to achieve the same task.

This work aims to provide a comparative study between 1) the deep learning based models which only require the spectrogram as input and 2) the traditional machine learning classifiers that need to be trained with hand-crafted features. We also investigate the relative importance of different features.

3 Dataset

In this work, we make use of Audio Set, which is a large-scale human-annotated database of sounds (Gemmeke et al., 2017). The dataset was created by extracting 10-second sound clips from a total of 2.1 million YouTube videos. The audio files have been annotated on the basis of an ontology which covers 527 classes of sounds including musical instruments, speech, vehicle sounds, animal sounds and so on.[2] This study requires only the audio files that belong to the music category, specifically having one of the seven genre tags shown in Table 1. The number of audio clips in each category has also been tabulated.

Table 1: Number of instances in each genre class

      Genre            Count
  1   Pop Music         8100
  2   Rock Music        7990
  3   Hip Hop Music     6958
  4   Techno            6885
  5   Rhythm Blues      4247
  6   Vocal             3363
  7   Reggae Music      2997
      Total            40540

The raw audio clips of these sounds have not been provided in the Audio Set data release. However, the data provides the YouTube ID of the corresponding videos, along with the start and end times. Hence, the first task is to retrieve these audio files. For the purpose of audio retrieval from YouTube, the following steps were carried out (a short script sketch follows the list):

1. A command line program called youtube-dl (Gonzalez, 2006) was utilized to download the video in the mp4 format.

2. The mp4 files are converted into the desired wav format using an audio converter named ffmpeg (Tomar, 2006), a command line tool.

Each wav file is about 880 KB in size, which means that the total data used in this study is approximately 34 GB.

[2] https://research.google.com/audioset/ontology/index.html
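As an illustration of this retrieval step, a minimal sketch driving the two command line tools from Python is shown below; the exact flags, file naming and clip-trimming options are assumptions rather than the authors' exact commands.

import subprocess

def fetch_clip(youtube_id, start_sec, out_wav, duration=10, sr=22050):
    """Download a YouTube video with youtube-dl and cut a mono wav clip with ffmpeg.

    Illustrative sketch: flags and file naming are assumptions, not the
    exact commands used in this study.
    """
    mp4_path = f"{youtube_id}.mp4"
    url = f"https://www.youtube.com/watch?v={youtube_id}"

    # Step 1: download the video in mp4 format
    subprocess.run(["youtube-dl", "-f", "mp4", "-o", mp4_path, url], check=True)

    # Step 2: extract the labelled 10-second segment and convert it to wav
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start_sec),     # start time from the Audio Set annotation
        "-t", str(duration),       # clip length (10 s in Audio Set)
        "-i", mp4_path,
        "-ac", "1",                # mono
        "-ar", str(sr),            # sampling rate used later for analysis
        out_wav,
    ], check=True)

# Example (hypothetical video ID): fetch_clip("dQw4w9WgXcQ", 30, "clip.wav")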
Figure 1: Sample spectrograms for 1 audio signal from each music genre

Figure 2: Convolutional neural network architecture (Image Source: Hvass Tensorflow Tutorials)

4 Methodology

This section provides the details of the data pre-processing steps, followed by the description of the two proposed approaches to this classification problem.

4.1 Data Pre-processing

In order to improve the Signal-to-Noise Ratio (SNR) of the signal, a pre-emphasis filter, given by Equation 1, is applied to the original audio signal:

y(t) = x(t) − α ∗ x(t − 1)    (1)

where x(t) refers to the original signal, y(t) refers to the filtered signal and α is set to 0.97. Such a pre-emphasis filter is useful to boost amplitudes at high frequencies (Kim and Stern, 2012).
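As a concrete illustration of Equation 1, the filter can be applied to a loaded waveform in a few lines; the sampling rate of 22050 Hz and α = 0.97 follow the values used in this study, while the file name is illustrative.

import numpy as np
import librosa

# Load a 10-second clip at the sampling rate used in this study
x, sr = librosa.load("clip.wav", sr=22050)

# Equation 1: y(t) = x(t) - alpha * x(t-1), with alpha = 0.97
alpha = 0.97
y = np.append(x[0], x[1:] - alpha * x[:-1])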
4.2 Deep Neural Networks

Using deep learning, we can achieve the task of music genre classification without the need for hand-crafted features. Convolutional neural networks (CNNs) have been widely used for the task of image classification (Krizhevsky et al., 2012). The 3-channel (RGB) matrix representation of an image is fed into a CNN, which is trained to predict the image class. In this study, the sound wave can be represented as a spectrogram, which in turn can be treated as an image (Nanni et al., 2016; Lidy and Schindler, 2016). The task of the CNN is to use the spectrogram to predict the genre label (one of seven classes).

4.2.1 Spectrogram Generation

A spectrogram is a 2D representation of a signal, having time on the x-axis and frequency on the y-axis. A colormap is used to quantify the magnitude of a given frequency within a given time window. In this study, each audio signal was converted into a MEL spectrogram (having MEL frequency bins on the y-axis). The parameters used to generate the power spectrogram using the STFT are listed below (a code sketch of this step follows the list):

• Sampling rate (sr) = 22050

• Frame/Window size (n_fft) = 2048

• Time advance between frames (hop_size) = 512 (resulting in 75% overlap)

• Window Function: Hann Window

• Frequency Scale: MEL

• Number of MEL bins: 96

• Highest Frequency (f_max) = sr/2
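A minimal sketch of this spectrogram generation step with librosa, using the parameter values listed above; the decibel scaling applied before treating the result as an image is an assumption about the exact pipeline.

import numpy as np
import librosa

def mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=96):
    """Compute a MEL-scaled power spectrogram with the parameters listed above."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=n_fft,            # frame/window size
        hop_length=hop_length,  # 75% overlap between frames
        window="hann",          # Hann window function
        n_mels=n_mels,          # number of MEL bins
        fmax=sr / 2,            # highest frequency
        power=2.0,              # power spectrogram
    )
    # Log scaling is commonly applied before the spectrogram is treated as an image
    return librosa.power_to_db(S, ref=np.max)

S_db = mel_spectrogram("clip.wav")   # shape: (96, number_of_frames)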
4.2.2 Convolutional Neural Networks

From Figure 1, one can see that there exist some characteristic patterns in the spectrograms of the audio signals belonging to different classes. Hence, spectrograms can be considered as 'images' and provided as input to a CNN, which has shown good performance on image classification tasks. Each block in a CNN consists of the following operations[3]:

• Convolution: This step involves sliding a matrix filter (say of size 3x3) over the input image, which is of dimension image_width x image_height. The filter is first placed on the image matrix and then we compute an element-wise multiplication between the filter and the overlapping portion of the image, followed by a summation to give a feature value. We use many such filters, the values of which are 'learned' during the training of the neural network via backpropagation.

• Pooling: This is a way to reduce the dimension of the feature map obtained from the convolution step, formally known as down sampling. For example, by max pooling with a 2x2 window size, we only retain the element with the maximum value among the 4 elements of the feature map that are covered in this window. We keep moving this window across the feature map with a pre-defined stride.

• Non-linear Activation: The convolution operation is linear, and in order to make the neural network more powerful, we need to introduce some non-linearity. For this purpose, we can apply an activation function such as ReLU[4] on each element of the feature map.

[3] https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[4] https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

In this study, a CNN architecture known as VGG-16, which was the top performing model in the ImageNet Challenge 2014 (classification + localization task), was used (Simonyan and Zisserman, 2014). The model consists of 5 convolutional blocks (conv base), followed by a set of densely connected layers, which outputs the probability that a given image belongs to each of the possible classes.

For the task of music genre classification using spectrograms, we download the model architecture with pre-trained weights, and extract the conv base. The output of the conv base is then sent to a new feed-forward neural network which in turn predicts the genre of the music, as depicted in Figure 2.

There are two possible settings while implementing the pre-trained model:

1. Transfer learning: The weights in the conv base are kept fixed but the weights in the feed-forward network (represented by the yellow box in Figure 2) are allowed to be tuned to predict the correct genre label.

2. Fine tuning: In this setting, we start with the pre-trained weights of VGG-16, but allow all the model weights to be tuned during the training process.

The final layer of the neural network outputs the class probabilities (using the softmax activation function) for each of the seven possible class labels. Next, the cross-entropy loss is computed as follows:

L = − Σ_{c=1}^{M} y_{o,c} ∗ log(p_{o,c})    (2)

where M is the number of classes; y_{o,c} is a binary indicator whose value is 1 if observation o belongs to class c and 0 otherwise; p_{o,c} is the model's predicted probability that observation o belongs to class c. This loss is used to backpropagate the error, compute the gradients and thereby update the weights of the network. This iterative process continues until the loss converges to a minimum value.
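The following Keras-style sketch illustrates the transfer learning setup described above, with the 216 x 216 input size and 512-unit hidden layer anticipating the implementation details given in Section 4.2.3; the FINE_TUNE switch and the remaining choices are illustrative rather than the exact training code.

from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 7
FINE_TUNE = False   # False: transfer learning (frozen conv base); True: fine tuning

# Pre-trained convolutional base (ImageNet weights), without the dense classifier
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(216, 216, 3))
conv_base.trainable = FINE_TUNE

# New feed-forward network on top of the conv base
model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])

# Cross-entropy loss (Equation 2) with the ADAM optimizer
model.compile(optimizer=optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])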
4.2.3 Implementation Details

The spectrogram images have a dimension of 216 x 216. For the feed-forward network connected to the conv base, a 512-unit hidden layer is implemented.
Figure 3: Learning curves ((a) Accuracy, (b) Loss) - used for model selection; epoch 4 has the minimum validation loss and highest validation accuracy

Over-fitting is a common issue in neural networks. In order to prevent this, two strategies are adopted:

1. L2-Regularization (Ng, 2004): The term (1/2) λ Σ_i w_i^2 is added to the loss function of the neural network, where w refers to the weights of the network. This method is used to penalize excessively high weights. We would like the weights to be diffused across all model parameters, and not just concentrated among a few parameters. Also, intuitively, smaller weights correspond to a less complex model, thereby avoiding over-fitting. λ is set to a value of 0.001 in this study.

2. Dropout (Srivastava et al., 2014): This is a regularization mechanism in which we shut off some of the neurons (set their weights to zero) randomly during training. In each iteration, we thereby use a different combination of neurons to predict the final output. This makes the model generalize without any heavy dependence on a subset of the neurons. A dropout rate of 0.3 is used, which means that a given weight is set to zero during an iteration with a probability of 0.3.

The dataset is randomly split into train (90%), validation (5%) and test (5%) sets. The same split is used for all experiments to ensure a fair comparison of the proposed models.

The neural networks are implemented in Python using Tensorflow[5]; an NVIDIA Titan X GPU was utilized for faster processing. All models were trained for 10 epochs with a batch size of 32 with the ADAM optimizer (Kingma and Ba, 2014). One epoch refers to one iteration over the entire training dataset.

[5] http://tensorflow.org/

Figure 3 shows the learning curves - the loss (which is being optimized) keeps decreasing as the training progresses. Although the training accuracy keeps increasing, the validation accuracy first increases and, after a certain number of epochs, starts to decrease. This shows the model's tendency to overfit on the training data. The model that is selected for evaluation purposes is the one that has the highest accuracy and lowest loss on the validation set (epoch 4 in Figure 3).

4.2.4 Baseline Feed-forward Neural Network

To assess the performance improvement that can be achieved by the CNNs, we also train a baseline feed-forward neural network that takes as input the same spectrogram image. The image, which is a 2-dimensional matrix of pixel values, is unwrapped or flattened into a 1-dimensional vector. Using this vector, a simple 2-layer neural network is trained to predict the genre of the audio signal. The first hidden layer consists of 512 units and the second layer has 32 units, followed by the output layer. The activation function used is ReLU and the same regularization techniques described in Section 4.2.3 are adopted.
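A minimal sketch of this baseline, assuming the same 216 x 216 spectrogram input and the regularization settings of Section 4.2.3 (the placement of the dropout layers is an assumption):

from tensorflow.keras import layers, models, regularizers

baseline = models.Sequential([
    layers.Flatten(input_shape=(216, 216, 3)),                 # unroll pixel values
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),   # lambda = 0.001
    layers.Dropout(0.3),                                       # dropout rate of 0.3
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),                     # seven genre classes
])
baseline.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])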
4.3 Manually Extracted Features

In this section, we describe the second category of proposed models, namely the ones that require hand-crafted features to be fed into a machine learning classifier. Features can be broadly classified as time domain and frequency domain features. The feature extraction was done using librosa[6], a Python library.

[6] https://librosa.github.io/
4.3.1 Time Domain Features

These are features which were extracted from the raw audio signal.

1. Central moments: This consists of the mean, standard deviation, skewness and kurtosis of the amplitude of the signal.

2. Zero Crossing Rate (ZCR): A zero-crossing point refers to one where the signal changes sign from positive to negative (Gouyon et al., 2000). The entire 10-second signal is divided into smaller frames, and the number of zero-crossings present in each frame is determined. The frame length is chosen to be 2048 points with a hop size of 512 points. Note that these frame parameters have been used consistently across all features discussed in this section. Finally, the average and standard deviation of the ZCR across all frames are chosen as representative features.

3. Root Mean Square Energy (RMSE): The energy in a signal is calculated as:

E = Σ_{n=1}^{N} |x(n)|^2    (3)

Further, the root mean square value can be computed as:

RMS = sqrt( (1/N) Σ_{n=1}^{N} |x(n)|^2 )    (4)

The RMSE is calculated frame by frame and then we take the average and standard deviation across all frames.

4. Tempo: In general terms, tempo refers to how fast or slow a piece of music is; it is expressed in terms of Beats Per Minute (BPM). Intuitively, different kinds of music would have different tempos. Since the tempo of the audio piece can vary with time, we aggregate it by computing the mean across several frames. The functionality in librosa first computes a tempogram following Grosche et al. (2010) and then estimates a single value for the tempo.

4.3.2 Frequency Domain Features

The audio signal can be transformed into the frequency domain by using the Fourier Transform. We then extract the following features.

1. Mel-Frequency Cepstral Coefficients (MFCC): Introduced in the early 1990s by Davis and Mermelstein, MFCCs have been very useful features for tasks such as speech recognition (Davis and Mermelstein, 1990). First, the Short-Time Fourier Transform (STFT) of the signal is taken with n_fft=2048, hop_size=512 and a Hann window. Next, we compute the power spectrum and then apply the triangular MEL filter bank, which mimics the human perception of sound. This is followed by taking the discrete cosine transform of the logarithm of all filterbank energies, thereby obtaining the MFCCs. The parameter n_mels, which corresponds to the number of filter banks, was set to 20 in this study.

2. Chroma Features: This is a vector which corresponds to the total energy of the signal in each of the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) (Ellis, 2007). The chroma vectors are then aggregated across the frames to obtain a representative mean and standard deviation.

3. Spectral Centroid: For each frame, this corresponds to the frequency around which most of the energy is centered (Tjoa, 2017). It is a magnitude-weighted frequency calculated as:

f_c = ( Σ_k S(k) f(k) ) / ( Σ_k f_k )    (5)

where S(k) is the spectral magnitude of frequency bin k and f(k) is the frequency corresponding to bin k.

4. Spectral Band-width: The p-th order spectral band-width corresponds to the p-th order moment about the spectral centroid (Tjoa, 2017) and is calculated as:

[ Σ_k (S(k) f(k) − f_c)^p ]^(1/p)    (6)

For example, p = 2 is analogous to a weighted standard deviation.

5. Spectral Contrast: Each frame is divided into a pre-specified number of frequency bands. Within each frequency band, the spectral contrast is calculated as the difference between the maximum and minimum magnitudes (Jiang et al., 2002).

6. Spectral Roll-off: This feature corresponds to the value of frequency below which 85% (this threshold can be defined by the user) of the total energy in the spectrum lies (Tjoa, 2017).

For each of the spectral features described above, the mean and standard deviation of the values taken across frames are considered as the representative final features that are fed to the model.

The features described in this section are used to train the machine learning algorithms (refer to Section 4.4). The features that contribute the most to achieving a good classification performance will be identified and reported.
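A condensed sketch of how the features of Sections 4.3.1 and 4.3.2 could be extracted with librosa and aggregated by their mean and standard deviation across frames is shown below; the choice of 20 MFCC coefficients, the feature ordering and the aggregation code are illustrative, and a librosa version exposing these functions is assumed (in recent releases RMSE is available as librosa.feature.rms).

import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def extract_features(path, sr=22050, n_fft=2048, hop_length=512):
    """Frame-level features aggregated to a fixed-length vector (mean and std)."""
    y, sr = librosa.load(path, sr=sr)
    frame_kw = dict(n_fft=n_fft, hop_length=hop_length)

    feats = {
        # Time domain
        "zcr": librosa.feature.zero_crossing_rate(y, frame_length=n_fft,
                                                  hop_length=hop_length),
        "rmse": librosa.feature.rms(y=y, frame_length=n_fft,
                                    hop_length=hop_length),
        # Frequency domain
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, **frame_kw),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr, **frame_kw),
        "centroid": librosa.feature.spectral_centroid(y=y, sr=sr, **frame_kw),
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr, **frame_kw),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr, **frame_kw),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr,
                                                    roll_percent=0.85, **frame_kw),
    }

    # Mean and standard deviation of each frame-level feature across frames
    vector = [np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in feats.values()]

    # Central moments of the raw amplitude and the aggregated tempo estimate
    tempo = librosa.beat.tempo(y=y, sr=sr, hop_length=hop_length)
    vector.append(np.array([y.mean(), y.std(), skew(y), kurtosis(y), tempo[0]]))

    return np.hstack(vector)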
4.4 Classifiers

This section provides a brief overview of the four machine learning classifiers adopted in this study (a short instantiation sketch follows the list).

1. Logistic Regression (LR): This linear classifier is generally used for binary classification tasks. For this multi-class classification task, the LR is implemented as a one-vs-rest method. That is, 7 separate binary classifiers are trained. During test time, the class with the highest probability from among the 7 classifiers is chosen as the predicted class.

2. Random Forest (RF): Random Forest is an ensemble learner that combines the predictions from a pre-specified number of decision trees. It works on the integration of two main principles: 1) each decision tree is trained with only a subset of the training samples, which is known as bootstrap aggregation (or bagging) (Breiman, 1996); 2) each decision tree is required to make its prediction using only a random subset of the features (Amit and Geman, 1997). The final predicted class of the RF is determined based on the majority vote from the individual trees.

3. Gradient Boosting (XGB): Boosting is another ensemble classifier that is obtained by combining a number of weak learners (such as decision trees). However, unlike RFs, boosting algorithms are trained in a sequential manner using forward stagewise additive modelling (Hastie et al., 2001). During the early iterations, the decision trees learnt are fairly simple. As training progresses, the classifier becomes more powerful because it is made to focus on the instances where the previous learners made errors. At the end of training, the final prediction is a weighted linear combination of the outputs from the individual learners. XGB refers to eXtreme Gradient Boosting, which is an implementation of boosting that supports training the model in a fast and parallelized manner.

4. Support Vector Machines (SVM): SVMs transform the original input data into a high-dimensional space using a kernel trick (Cortes and Vapnik, 1995). The transformed data can be linearly separated using a hyperplane. The optimal hyperplane maximizes the margin. In this study, a radial basis function (RBF) kernel is used to train the SVM because such a kernel is required to address this non-linear problem. Similar to the logistic regression setting discussed above, the SVM is also implemented as a one-vs-rest classification task.
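A sketch of how the four classifiers could be instantiated on such a feature matrix with scikit-learn and the XGBoost package; the hyperparameters are library defaults rather than the tuned values used in this study.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

classifiers = {
    # One-vs-rest logistic regression over the 7 genre classes
    "LR": LogisticRegression(multi_class="ovr", max_iter=1000),
    # Bagged decision trees with random feature subsets at each split
    "RF": RandomForestClassifier(n_estimators=100),
    # Gradient boosting trained stage-wise; xgboost parallelizes tree construction
    "XGB": XGBClassifier(n_estimators=100),
    # RBF-kernel SVM; probability=True enables class probabilities for ensembling
    "SVM": SVC(kernel="rbf", probability=True, decision_function_shape="ovr"),
}

# X_train, y_train: feature matrix and integer genre labels (assumed prepared)
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)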
5 Evaluation

5.1 Metrics

In order to evaluate the performance of the models described in Section 4, the following metrics will be used (a short computation sketch follows the list).

• Accuracy: Refers to the percentage of correctly classified test samples.

• F-score: Based on the confusion matrix, it is possible to calculate the precision and recall. The F-score[7] is then computed as the harmonic mean of precision and recall.

• AUC: This evaluation criterion, known as the area under the receiver operating characteristic (ROC) curve, is a common way to judge the performance of a multi-class classification system. The ROC is a graph between the true positive rate and the false positive rate. A baseline model which randomly predicts each class label with equal probability would have an AUC of 0.5, and hence the system being designed is expected to have an AUC higher than 0.5.

[7] https://en.wikipedia.org/wiki/F1_score
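These metrics can be computed, for example, with scikit-learn as sketched below; treating the F-score as a macro average over the seven classes and the AUC as a one-vs-rest average over classes are assumptions about the exact variants used.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """y_pred: predicted labels; y_prob: per-class predicted probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f_score": f1_score(y_true, y_pred, average="macro"),
        # Multi-class AUC from the ROC curve, one-vs-rest, averaged over classes
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }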
Table 2: Comparison of performance of the models on the test set

  Model                               Accuracy   F-score   AUC
  Spectrogram-based models
  VGG-16 CNN Transfer Learning          0.63       0.61    0.891
  VGG-16 CNN Fine Tuning                0.64       0.61    0.889
  Feed-forward NN baseline              0.43       0.33    0.759
  Feature Engineering based models
  Logistic Regression (LR)              0.53       0.47    0.822
  Random Forest (RF)                    0.54       0.48    0.840
  Support Vector Machines (SVM)         0.57       0.52    0.856
  Extreme Gradient Boosting (XGB)       0.59       0.55    0.865
  Ensemble Classifiers
  VGG-16 CNN + XGB                      0.65       0.62    0.894

5.2 Results and Discussion

In this section, the different modelling approaches discussed in Section 4 are evaluated based on the metrics described in Section 5.1. The values are reported in Table 2.

The best performance in terms of all metrics is observed for the convolutional neural network model based on VGG-16 that uses only the spectrogram to predict the music genre. It was expected that the fine tuning setting, which additionally allows the convolutional base to be trainable, would enhance the CNN model when compared to the transfer learning setting. However, as shown in Table 2, the experimental results show that there is no significant difference between transfer learning and fine tuning. The baseline feed-forward neural network that uses the unrolled pixel values from the spectrogram performs poorly on the test set. This shows that CNNs can significantly improve the scores on such an image classification task. Among the models that use manually crafted features, the one with the lowest performance is the logistic regression model. This is expected since logistic regression is a linear classifier. SVMs outperform random forests in terms of accuracy. However, the XGB version of the gradient boosting algorithm performs the best among the feature-engineered methods.

5.2.1 Most Important Features

In this section, we investigate which features contribute the most during prediction in this classification task. To carry out this experiment, we chose the XGB model, based on the results discussed in the previous section. We rank the top 20 most useful features based on a scoring metric (Figure 4). The metric is calculated as the number of times a given feature is used as a decision node among the individual decision trees that form the gradient boosting predictor.

As can be observed from Figure 4, Mel-Frequency Cepstral Coefficients (MFCCs) appear the most among the important features. Previous studies have reported MFCCs to improve the performance of speech recognition systems (Ittichaichareon et al., 2012). Our experiments show that MFCCs contribute significantly to this task of music genre classification. The mean and standard deviation of the spectral contrasts at different frequency bands are also important features. The music tempo, calculated in terms of beats per minute, also appears in the top 20 useful features.

Next, we study how much performance, in terms of AUC and accuracy, can be obtained by using only the top N features while training the model. From Table 3 it can be seen that with only the top 10 features, the model performance is surprisingly good. In comparison to the full model, which has 97 features, the model with the top 30 features has only a marginally lower performance (2 points on the AUC metric and 4 points on the accuracy metric).
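In the XGBoost package, this split-count score corresponds to the 'weight' importance type; a sketch of how such a ranking could be produced is shown below, where the fitted model is assumed to be available from Section 4.4.

import pandas as pd
from xgboost import XGBClassifier

def top_features(model: XGBClassifier, n: int = 20) -> pd.Series:
    """Rank features by how often they are used as a decision node ('weight')."""
    scores = model.get_booster().get_score(importance_type="weight")
    return pd.Series(scores).sort_values(ascending=False).head(n)

# Example (model assumed to be the fitted XGB classifier from Section 4.4):
# print(top_features(xgb_model))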
Figure 4: Relative importance of features in the XGBoost model; the top 20 most contributing features
are displayed

Table 3: Ablation study: Comparing XGB performance keeping only the top N features

    N    AUC    Accuracy
   10   0.803     0.47
   20   0.837     0.52
   30   0.845     0.55
   97   0.865     0.59

The final experiment in this section is a comparison of the time domain and frequency domain features listed in Section 4.3. Two XGB models were trained - one with only time domain features and the other with only frequency domain features. Table 4 compares the results in terms of AUC and accuracy. This experiment further confirms that frequency domain features are definitely better than time domain features when it comes to modelling audio for machine learning tasks.

Table 4: Comparison of Time Domain features and Frequency Domain features

  Model                    AUC    Accuracy
  Time Domain only        0.731     0.40
  Frequency Domain only   0.857     0.57
  Both                    0.865     0.59

5.2.2 Confusion Matrix

A confusion matrix is a tabular representation which enables us to further understand the strengths and weaknesses of our model. Element a_ij in the matrix refers to the number of test instances of class i that the model predicted as class j. Diagonal elements a_ii correspond to the correct predictions. Figure 5 compares the confusion matrices of the best performing CNN model and of XGB, the best model among the feature-engineered classifiers. Both models seem to be good at predicting the class 'Rock'. However, many instances of class 'Hip Hop' are often confused with class 'Pop' and vice versa. Such behaviour is expected when the genres of music are very close; some songs may fall into multiple genres, so much so that it may be difficult even for humans to recognize the exact genre.
Figure 5: Confusion Matrices of the best performing models: (a) VGG-16 CNN Transfer Learning, (b) Extreme Gradient Boosting, (c) Ensemble Model

5.2.3 Ensemble Classifier

Ensembling is a commonly adopted practice in machine learning, wherein the results from different classifiers are combined. This is done either by majority voting or by averaging scores/probabilities. Such an ensembling scheme, which combines the prediction powers of different classifiers, makes the overall system more robust. In our case, each classifier outputs a prediction probability for each of the class labels. Hence, averaging the predicted probabilities from the different classifiers is a straightforward way to do ensemble learning.

The methodologies described in Sections 4.2 and 4.4 use very different sources of input, the spectrograms and the hand-crafted features respectively. Hence, it makes sense to combine the models via ensembling. In this study, the best CNN model, namely VGG-16 Transfer Learning, is ensembled with XGBoost, the best feature-engineered model, by averaging the predicted probabilities. As shown in Table 2, this ensembling is beneficial and is observed to outperform all individual classifiers. The ROC curve for the ensemble model is above that of VGG-16 Fine Tuning and XGBoost, as illustrated in Figure 6.

Figure 6: ROC Curves for the best performing models and their ensemble
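A sketch of this probability-averaging ensemble is shown below; the two probability matrices are assumed to be aligned over the same test instances and class ordering.

import numpy as np

def ensemble_predict(prob_cnn: np.ndarray, prob_xgb: np.ndarray) -> np.ndarray:
    """Average the per-class probabilities of the two models and pick the argmax.

    prob_cnn, prob_xgb: arrays of shape (num_test_samples, 7) with predicted
    class probabilities from VGG-16 transfer learning and XGBoost respectively.
    """
    prob_ensemble = (prob_cnn + prob_xgb) / 2.0
    return prob_ensemble.argmax(axis=1)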
6 Conclusion

In this work, the task of music genre classification is studied using the Audio Set data. We propose two different approaches to solving this problem. The first involves generating a spectrogram of the audio signal and treating it as an image. A CNN-based image classifier, namely VGG-16, is trained on these images to predict the music genre solely based on this spectrogram. The second approach consists of extracting time domain and frequency domain features from the audio signals, followed by training traditional machine learning classifiers based on these features. XGBoost was determined to be the best feature-based classifier; the most important features were also reported. The CNN-based deep learning models were shown to outperform the feature-engineered models. We also show that ensembling the CNN and XGBoost models proved to be beneficial. It is to be noted that the dataset used in this study consists of audio clips from YouTube videos, which are in general very noisy. Future studies can identify ways to pre-process this noisy data before feeding it into a machine learning model, in order to achieve better performance.

References

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10):1533–1545.

Yali Amit and Donald Geman. 1997. Shape quantization and recognition with randomized trees. Neural Computation 9(7):1545–1588.

Leo Breiman. 1996. Bagging predictors. Machine Learning 24(2):123–140.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5–32.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297.

Steven B Davis and Paul Mermelstein. 1990. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in Speech Recognition, Elsevier, pages 65–74.

Dan Ellis. 2007. Chroma feature analysis and synthesis. Resources of Laboratory for the Recognition and Organization of Speech and Audio (LabROSA).

Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, pages 776–780.

Ricardo Garcia Gonzalez. 2006. Youtube-dl: download videos from youtube.com.

Fabien Gouyon, François Pachet, Olivier Delerue, et al. 2000. On the use of zero-crossing rate for an application of classification of percussive sounds. In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy.

Peter Grosche, Meinard Müller, and Frank Kurth. 2010. Cyclic tempogram - a mid-level tempo representation for music signals. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, pages 5522–5525.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning.

Chadawan Ittichaichareon, Siwat Suksri, and Thaweesak Yingthawornsuk. 2012. Speech recognition using MFCC. In International Conference on Computer Graphics, Simulation and Modeling (ICGSM 2012), pages 28–29.

Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao, and Lian-Hong Cai. 2002. Music type classification by spectral contrast feature. In Multimedia and Expo, 2002. ICME'02. Proceedings. 2002 IEEE International Conference on. IEEE, volume 1, pages 113–116.

Chanwoo Kim and Richard M Stern. 2012. Power-normalized cepstral coefficients (PNCC) for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, pages 4101–4104.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Tom LH Li, Antoni B Chan, and A Chun. 2010. Automatic musical pattern feature extraction using convolutional neural network. In Proc. Int. Conf. Data Mining and Applications.

Thomas Lidy and Andreas Rauber. 2005. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In ISMIR, pages 34–41.

Thomas Lidy and Alexander Schindler. 2016. Parallel convolutional neural networks for music genre and mood classification. MIREX 2016.

Michael I Mandel and Dan Ellis. 2005. Song-level features and support vector machines for music classification. In ISMIR, volume 2005, pages 594–599.

Loris Nanni, Yandre MG Costa, Alessandra Lumini, Moo Young Kim, and Seung Ryul Baek. 2016. Combining visual and acoustic features for music genre classification. Expert Systems with Applications 45:108–117.

Andrew Y Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, page 78.

Nicolas Scaringella and Giorgio Zoia. 2005. On the modeling of time information for automatic genre recognition systems in audio signals. In ISMIR, pages 666–671.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Hagen Soltau, Tanja Schultz, Martin Westphal, and Alex Waibel. 1998. Recognition of music types. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. IEEE, volume 2, pages 1137–1140.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.

Steve Tjoa. 2017. Music information retrieval. https://musicinformationretrieval.com/spectral_features.html. Accessed: 2018-02-20.

Suramya Tomar. 2006. Converting video formats with ffmpeg. Linux Journal 2006(146):10.

George Tzanetakis and Perry Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5):293–302.

Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Lonce Wyse. 2017. Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559.

E Zwicker and H Fastl. 1999. Psychoacoustics: Facts and Models.
