Music Genre Classification Using Machine Learning Techniques
April 2018
Hareesh Bahuleyan
University of Waterloo, ON, Canada
[email protected]
Figure 2: Convolutional neural network architecture (Image Source: Hvass Tensorflow Tutorials)
4.1 Data Pre-processing
In order to improve the Signal-to-Noise Ratio (SNR) of the signal, a pre-emphasis filter, given by Equation 1, is applied to the original audio signal:

y(t) = x(t) − α ∗ x(t − 1)    (1)

where x(t) refers to the original signal, y(t) refers to the filtered signal, and α is set to 0.97. Such a pre-emphasis filter is useful to boost amplitudes at high frequencies (Kim and Stern, 2012).
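A minimal NumPy sketch of this filter is given below; the function name and the handling of the first sample are our own illustrative choices, not details from the original implementation.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    # The first sample has no predecessor, so it is kept unchanged.
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Example: x is a 1-D array of audio samples loaded elsewhere (e.g. via librosa).
# y = pre_emphasis(x)
```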
4.2 Deep Neural Networks
Using deep learning, we can achieve the task of music genre classification without the need for hand-crafted features. Convolutional neural networks (CNNs) have been widely used for the task of image classification (Krizhevsky et al., 2012). The 3-channel (RGB) matrix representation of an image is fed into a CNN, which is trained to predict the image class. In this study, the sound wave can be represented as a spectrogram, which in turn can be treated as an image (Nanni et al., 2016; Lidy and Schindler, 2016). The task of the CNN is to use the spectrogram to predict the genre label (one of seven classes).

4.2.1 Spectrogram Generation
A spectrogram is a 2D representation of a signal, having time on the x-axis and frequency on the y-axis. A colormap is used to quantify the magnitude of a given frequency within a given time window. In this study, each audio signal was converted into a MEL spectrogram (having MEL frequency bins on the y-axis). The parameters used to generate the power spectrogram using STFT are listed below (a generation sketch follows the list):

• Sampling rate (sr) = 22050
• Frame/Window size (n_fft) = 2048
• Time advance between frames (hop_size) = 512 (resulting in 75% overlap)
• Window function: Hann window
• Frequency scale: MEL
• Number of MEL bins: 96
• Highest frequency (f_max) = sr/2
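As an illustration, such a spectrogram could be generated with the librosa library using the parameters above; the sketch below is our own and the audio path is a placeholder, not the authors' actual pipeline.

```python
import librosa
import numpy as np

sr = 22050
# Load the audio clip at the chosen sampling rate (the path is a placeholder).
y, sr = librosa.load("audio_clip.wav", sr=sr)

# Power spectrogram on the MEL scale, using the STFT parameters listed above.
S = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,       # frame/window size
    hop_length=512,   # 75% overlap between frames
    window="hann",    # Hann window
    n_mels=96,        # number of MEL bins
    fmax=sr / 2,      # highest frequency
    power=2.0,        # power spectrogram
)

# Convert to decibels before rendering the spectrogram as an image.
S_db = librosa.power_to_db(S, ref=np.max)
```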
4.2.2 Convolutional Neural Networks
From Figure 1, one can see that there exist characteristic patterns in the spectrograms of the audio signals belonging to different classes. Hence, spectrograms can be considered as 'images' and provided as input to a CNN, which has shown good performance on image classification tasks. Each block in a CNN consists of the following operations (see https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/; a toy sketch of all three follows the list):

• Convolution: This step involves sliding a matrix filter (say of size 3x3) over the input image, which is of dimension image_width x image_height. The filter is first placed on the image matrix; we then compute an element-wise multiplication between the filter and the overlapping portion of the image, followed by a summation, to give a feature value. We use many such filters, the values of which are 'learned' during the training of the neural network via backpropagation.

• Pooling: This is a way to reduce the dimension of the feature map obtained from the convolution step, formally known as down-sampling. For example, with max pooling over a 2x2 window, we retain only the element with the maximum value among the 4 elements of the feature map covered by this window. We keep moving this window across the feature map with a pre-defined stride.

• Non-linear Activation: The convolution operation is linear, and in order to make the neural network more powerful, we need to introduce some non-linearity. For this purpose, we can apply an activation function such as ReLU (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) to each element of the feature map.
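To make these three operations concrete, the following toy NumPy sketch applies a single 3x3 convolution, a ReLU activation, and 2x2 max pooling to a random 6x6 'image'; it is purely illustrative and not how a CNN library actually implements them.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: slide the kernel, multiply element-wise and sum."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linear activation."""
    return np.maximum(0, x)

def max_pool(feature_map, size=2):
    """Max pooling with a square window and stride equal to the window size."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    return feature_map[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.rand(6, 6)    # toy 6x6 'image'
kernel = np.random.rand(3, 3)   # one learnable 3x3 filter
feature_map = relu(conv2d(image, kernel))   # 6x6 -> 4x4
pooled = max_pool(feature_map)              # 4x4 -> 2x2
```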
In this study, a CNN architecture known as VGG-16, which was the top-performing model in the ImageNet Challenge 2014 (classification + localization task), was used (Simonyan and Zisserman, 2014). The model consists of 5 convolutional blocks (the conv base), followed by a set of densely connected layers, which outputs the probability that a given image belongs to each of the possible classes.

For the task of music genre classification using spectrograms, we download the model architecture with pre-trained weights and extract the conv base. The output of the conv base is then sent to a new feed-forward neural network, which in turn predicts the genre of the music, as depicted in Figure 2.

There are two possible settings while implementing the pre-trained model (a loading sketch for both is given after the list):

1. Transfer learning: The weights in the conv base are kept fixed, but the weights in the feed-forward network (represented by the yellow box in Figure 2) are allowed to be tuned to predict the correct genre label.

2. Fine tuning: In this setting, we start with the pre-trained weights of VGG-16, but allow all the model weights to be tuned during the training process.
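A sketch of how the pre-trained conv base could be loaded and switched between the two settings, assuming the Keras API shipped with TensorFlow (the paper states only that TensorFlow was used, so the exact calls below are our assumption):

```python
from tensorflow.keras.applications import VGG16

# Load the five convolutional blocks of VGG-16 with ImageNet weights,
# dropping the original densely connected classifier (include_top=False).
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(216, 216, 3))

# Setting 1 -- transfer learning: keep the conv base fixed.
conv_base.trainable = False

# Setting 2 -- fine tuning: allow all weights to be updated during training.
# conv_base.trainable = True
```

Only one of the two trainability settings is active at a time; everything else in the pipeline stays the same.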
The final layer of the neural network outputs the class probabilities (using the softmax activation function) for each of the seven possible class labels. Next, the cross-entropy loss is computed as follows:

L = − Σ_{c=1}^{M} y_{o,c} ∗ log(p_{o,c})    (2)

where M is the number of classes; y_{o,c} is a binary indicator whose value is 1 if observation o belongs to class c and 0 otherwise; and p_{o,c} is the model's predicted probability that observation o belongs to class c. This loss is used to backpropagate the error, compute the gradients, and thereby update the weights of the network. This iterative process continues until the loss converges to a minimum value.
4.2.3 Implementation Details
The spectrogram images have a dimension of 216 x 216. For the feed-forward network connected to the conv base, a 512-unit hidden layer is implemented. Over-fitting is a common issue in neural networks. In order to prevent this, two strategies are adopted (a sketch with both applied follows the list):

1. L2-Regularization (Ng, 2004): The term (1/2) λ Σ_i w_i^2 is added to the loss function of the neural network, where w refers to the weights in the neural network. This method is used to penalize excessively high weights. We would like the weights to be diffused across all model parameters, and not concentrated among just a few of them. Intuitively, smaller weights also correspond to a less complex model, thereby avoiding over-fitting. λ is set to a value of 0.001 in this study.

2. Dropout (Srivastava et al., 2014): This is a regularization mechanism in which we shut off some of the neurons (set their outputs to zero) randomly during training. In each iteration, we thereby use a different combination of neurons to predict the final output. This makes the model generalize without any heavy dependence on a subset of the neurons. A dropout rate of 0.3 is used, which means that a given unit is dropped during an iteration with a probability of 0.3.
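Continuing the same assumption of a Keras implementation, the classifier head with both strategies applied might look as follows; the exact placement of the dropout layer is our illustrative choice:

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(216, 216, 3))
conv_base.trainable = False  # transfer-learning setting

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),  # L2 penalty, lambda = 0.001
    layers.Dropout(0.3),                                      # drop units with probability 0.3
    layers.Dense(7, activation="softmax"),                    # one probability per genre
])
```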
The dataset is randomly split into train (90%), validation (5%) and test (5%) sets. The same split is used for all experiments to ensure a fair comparison of the proposed models.

The neural networks are implemented in Python using TensorFlow (http://tensorflow.org/); an NVIDIA Titan X GPU was utilized for faster processing. All models were trained for 10 epochs with a batch size of 32 using the ADAM optimizer (Kingma and Ba, 2014). One epoch refers to one iteration over the entire training dataset.

Figure 3: Learning curves ((a) accuracy, (b) loss), used for model selection; epoch 4 has the minimum validation loss and the highest validation accuracy.

Figure 3 shows the learning curves: the loss (which is being optimized) keeps decreasing as training progresses. Although the training accuracy keeps increasing, the validation accuracy first increases and, after a certain number of epochs, starts to decrease. This shows the model's tendency to overfit on the training data. The model selected for evaluation purposes is the one with the highest accuracy and lowest loss on the validation set (epoch 4 in Figure 3).
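Under the same Keras assumption, this training setup could be written as follows; the checkpoint callback is one possible way to keep the best-epoch model and is our addition, and the spectrogram arrays are assumed to have been prepared elsewhere:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# `model` is the VGG-16 based model sketched above; `x_train`, `y_train`,
# `x_val`, `y_val` are assumed to hold the 90/5/5 split of spectrograms
# and one-hot genre labels.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # Equation 2
              metrics=["accuracy"])

# Keep only the weights from the epoch with the lowest validation loss
# (epoch 4 in Figure 3).
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=10, batch_size=32,
                    callbacks=[checkpoint])
```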
4.2.4 Baseline Feed-forward Neural Network
To assess the performance improvement that can be achieved by the CNNs, we also train a baseline feed-forward neural network that takes the same spectrogram image as input. The image, which is a 2-dimensional matrix of pixel values, is unwrapped (flattened) into a 1-dimensional vector. Using this vector, a simple 2-layer neural network is trained to predict the genre of the audio signal. The first hidden layer consists of 512 units and the second layer has 32 units, followed by the output layer. The activation function used is ReLU, and the same regularization techniques described in Section 4.2.3 are adopted.
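A sketch of this baseline under the same Keras assumption; applying the L2 penalty and dropout to both hidden layers is our illustrative choice:

```python
from tensorflow.keras import layers, models, regularizers

baseline = models.Sequential([
    layers.Flatten(input_shape=(216, 216)),   # unwrap the spectrogram into a 1-D vector
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(7, activation="softmax"),
])
```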
4.3 Manually Extracted Features
In this section, we describe the second category of proposed models, namely the ones that require hand-crafted features to be fed into a machine learning classifier. Features can be broadly classified into time domain and frequency domain features. The feature extraction was done using librosa, a Python library.

4.3.1 Time Domain Features
These are features which are extracted directly from the raw audio signal.

4.3.2 Frequency Domain Features
The audio signal can be transformed into the frequency domain by using the Fourier Transform. We then extract the following features.
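The complete feature list is not reproduced in this excerpt; as an illustration, a few representative time and frequency domain features could be extracted with librosa as sketched below (the audio path is a placeholder, and the per-clip mean/standard deviation aggregation mirrors the statistics discussed in Section 5.2.1):

```python
import librosa
import numpy as np

y, sr = librosa.load("audio_clip.wav", sr=22050)   # placeholder path

# Time domain features: zero crossing rate per frame and tempo in beats per minute.
zcr = librosa.feature.zero_crossing_rate(y)
tempo = librosa.beat.tempo(y=y, sr=sr)

# Frequency domain features, computed on the STFT of the signal.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

# Aggregate frame-level features into a fixed-length vector per clip.
feature_vector = np.hstack([
    zcr.mean(axis=1), zcr.std(axis=1),
    mfcc.mean(axis=1), mfcc.std(axis=1),
    contrast.mean(axis=1), contrast.std(axis=1),
    tempo,
])
```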
true positive rate and the false positive rate. A baseline model which randomly predicts each class label with equal probability would have an AUC of 0.5, and hence the system being designed is expected to have an AUC higher than 0.5.
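For reference, a macro-averaged one-vs-rest AUC for the seven-class problem can be computed as sketched below; scikit-learn and the toy data are our assumptions, since the paper does not name the evaluation library:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-ins: 14 test clips, two per genre, with random predicted probabilities.
y_true = np.repeat(np.arange(7), 2)
y_prob = rng.dirichlet(np.ones(7), size=14)

# Macro-averaged one-vs-rest AUC over the seven genres;
# a classifier guessing at random scores close to 0.5.
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(round(auc, 3))
```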
5.2 Results and Discussion
In this section, the different modelling approaches discussed in Section 4 are evaluated based on the metrics described in Section 5.1. The values have been reported in Table 2.
The best performance in terms of all metrics is observed for the convolutional neural network model based on VGG-16 that uses only the spectrogram to predict the music genre. It was expected that the fine-tuning setting, which additionally allows the convolutional base to be trainable, would enhance the CNN model when compared to the transfer learning setting. However, as shown in Table 2, the experimental results show that there is no significant difference between transfer learning and fine-tuning. The baseline feed-forward neural network that uses the unrolled pixel values from the spectrogram performs poorly on the test set. This shows that CNNs can significantly improve the scores on such an image classification task.

Among the models that use manually crafted features, the one with the lowest performance is the logistic regression model. This is expected, since logistic regression is a linear classifier. SVMs outperform random forests in terms of accuracy. However, the XGB version of the gradient boosting algorithm performs the best among the feature-engineered methods.

5.2.1 Most Important Features
In this section, we investigate which features contribute the most to prediction in this classification task. To carry out this experiment, we chose the XGB model, based on the results discussed in the previous section. We rank the top 20 most useful features based on a scoring metric (Figure 4). The metric is calculated as the number of times a given feature is used as a decision node among the individual decision trees that form the gradient boosting predictor.

As can be observed from Figure 4, Mel-Frequency Cepstral Coefficients (MFCCs) appear most often among the important features. Previous studies have reported MFCCs to improve the performance of speech recognition systems (Ittichaichareon et al., 2012). Our experiments show that MFCCs contribute significantly to this task of music genre classification. The mean and standard deviation of the spectral contrasts at different frequency bands are also important features. The music tempo, calculated in terms of beats per minute, also appears among the top 20 useful features.

Next, we study how much performance, in terms of AUC and accuracy, can be obtained by using just the top N features while training the model. From Table 3 it can be seen that with only the top 10 features, the model performance is surprisingly good. In comparison to the full model, which has 97 features, the model with the top 30 features has only a marginally lower performance (2 points on the AUC metric and 4 points on the accuracy metric).
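A sketch of how this "times used as a decision node" ranking could be reproduced with the xgboost package; the model settings and the toy 97-dimensional feature matrix are placeholders, not the authors' configuration:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for the 97-dimensional hand-crafted feature matrix and genre labels.
X_train = rng.normal(size=(200, 97))
y_train = rng.integers(0, 7, size=200)

xgb = XGBClassifier(n_estimators=100, objective="multi:softprob")
xgb.fit(X_train, y_train)

# 'weight' counts how often a feature is used as a decision node across all trees,
# which is the ranking metric behind Figure 4.
scores = xgb.get_booster().get_score(importance_type="weight")
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
top_20 = [name for name, _ in ranked[:20]]   # e.g. ['f12', 'f3', ...]
```

Retraining the classifier on only the columns named in `top_20` (or the top 10/30) gives the reduced models compared in Table 3.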
Figure 4: Relative importance of features in the XGBoost model; the top 20 most contributing features are displayed.

Table 4: Comparison of Time Domain features and Frequency Domain features.
different classifiers are combined. This is done either by majority voting or by averaging scores/probabilities. Such an ensembling scheme, which combines the prediction powers of different classifiers, makes the overall system more robust. In our case, each classifier outputs a prediction probability for each of the class labels. Hence, averaging the predicted probabilities from the different classifiers is a straightforward way to do ensemble learning.

The methodologies described in Sections 4.2 and 4.4 use very different sources of input, the spectrograms and the hand-crafted features respectively. Hence, it makes sense to combine the models via ensembling. In this study, the best CNN model, namely VGG-16 Transfer Learning, is ensembled with XGBoost, the best feature-engineered model, by averaging the predicted probabilities. As shown in Table 2, this ensembling is beneficial and is observed to outperform all of the individual classifiers. The ROC curve for the ensemble model is above those of VGG-16 Fine Tuning and XGBoost, as illustrated in Figure 6.
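The averaging step itself is straightforward; a minimal sketch with placeholder probability arrays:

```python
import numpy as np

# Predicted class probabilities from the two best models on the same test clips
# (placeholders; in the paper these come from VGG-16 Transfer Learning and XGBoost).
cnn_probs = np.array([[0.70, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02]])
xgb_probs = np.array([[0.40, 0.30, 0.10, 0.10, 0.05, 0.03, 0.02]])

# Average the probabilities and pick the genre with the highest combined score.
ensemble_probs = (cnn_probs + xgb_probs) / 2.0
predicted_genre = ensemble_probs.argmax(axis=1)
```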
6 Conclusion
In this work, the task of music genre classification is studied using the Audioset data. We pro-
References

Yali Amit and Donald Geman. 1997. Shape quantization and recognition with randomized trees. Neural Computation 9(7):1545–1588.

Leo Breiman. 1996. Bagging predictors. Machine Learning 24(2):123–140.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5–32.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297.

Chanwoo Kim and Richard M. Stern. 2012. Power-normalized cepstral coefficients (PNCC) for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, pages 4101–4104.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Tom LH Li, Antoni B Chan, and A Chun. 2010. Automatic musical pattern feature extraction using convolutional neural network. In Proc. Int. Conf. Data Mining and Applications.

Thomas Lidy and Andreas Rauber. 2005. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In ISMIR, pages 34–41.

Thomas Lidy and Alexander Schindler. 2016. Parallel convolutional neural networks for music genre and mood classification. MIREX 2016.

Michael I. Mandel and Dan Ellis. 2005. Song-level features and support vector machines for music classification. In ISMIR, volume 2005, pages 594–599.

Loris Nanni, Yandre MG Costa, Alessandra Lumini, Moo Young Kim, and Seung Ryul Baek. 2016. Combining visual and acoustic features for music genre classification. Expert Systems with Applications 45:108–117.

Andrew Y. Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, page 78.

Nicolas Scaringella and Giorgio Zoia. 2005. On the modeling of time information for automatic genre recognition systems in audio signals. In ISMIR, pages 666–671.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Hagen Soltau, Tanja Schultz, Martin Westphal, and Alex Waibel. 1998. Recognition of music types. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. IEEE, volume 2, pages 1137–1140.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.

Steve Tjoa. 2017. Music information retrieval. https://musicinformationretrieval.com/spectral_features.html. Accessed: 2018-02-20.

Suramya Tomar. 2006. Converting video formats with ffmpeg. Linux Journal 2006(146):10.

George Tzanetakis and Perry Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5):293–302.

Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Lonce Wyse. 2017. Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559.

E. Zwicker and H. Fastl. 1999. Psychoacoustics: Facts and Models.