Music Information Retrieval

George Tzanetakis
Associate Professor
Canada Tier II Research Chair in Computer Analysis of Audio and Music
Department of Computer Science
University of Victoria

January 5, 2015
List of Figures

1.1 Conceptual dimensions of MIR
3.1 Sampling and quantization
3.2 Simple sinusoids
3.3 Cosine and sine projections of a vector on the unit circle (XXX Figure needs to be redone)
4.1 Score Examples
4.2 Rhythm notation
5.1 Feature extraction showing how frequency and time summarization with a texture window can be used to extract a feature vector characterizing timbral texture
5.2 The time evolution of audio features is important in characterizing musical content. The time evolution of the spectral centroid for two different 30-second excerpts of music is shown in (a). The result of applying a moving mean and standard deviation calculation over a texture window of approximately 1 second is shown in (b) and (c).
5.3 Self Similarity Matrix using RMS contours
5.4 The top panel depicts the time domain representation of a fragment of a polyphonic jazz recording, below which is displayed its corresponding spectrogram. The bottom panel plots both the onset detection function SF(n) (gray line), as well as its filtered version (black line). The automatically identified onsets are represented as vertical dotted lines.
5.5 Onset Strength Signal
5.6 Beat Histograms of HipHop/Jazz and Bossa Nova
5.7 Beat Histogram Calculation
7.1 Similarity Matrix between energy contours and alignment path using Dynamic Time Warping
14.1 Similarity Matrix between energy contours and alignment path using Dynamic Time Warping

List of Tables

8.1 2010 MIREX Music Similarity and Retrieval Results
9.1 Audio-based classification tasks for music signals (MIREX 2009)
14.1 2009 MIREX Audio Cover Song Detection - Mixed Collection
14.2 2009 MIREX Audio Cover Song Detection - Mazurkas
19.1 Software for Audio Feature Extraction
Contents

1 Introduction
  1.1 Context and History
  1.2 Music Information Retrieval
  1.3 Dimensions of MIR research
    1.3.1 Specificity
    1.3.2 Data sources and information
    1.3.3 Stages
  1.4 History
    1.4.1 Pre-history
    1.4.2 History 2000-2005
    1.4.3 History 2006-today
    1.4.4 Current status
  1.5 Existing tutorials and literature surveys
  1.6 Structure of this book
    1.6.1 Intended audience
    1.6.2 Terminology
  1.7 Reproducibility
  1.8 Companion website

2 Tasks
  2.1 Similarity Retrieval, Playlisting and Recommendation
  2.2 Classification and Clustering
    2.2.1 Genres and Styles
    2.2.2 Artist, Group, Composer, Performer and Album Identification
    2.2.3 Mood/Emotion Detection
    2.2.4 Instrument Recognition, Detection
    2.2.5 Tag annotation
    2.2.6 Other
  2.3 Rhythm, Pitch and Music Transcription
    2.3.1 Rhythm
    2.3.2 Melody
    2.3.3 Chords and Key
  2.4 Music Transcription and Source Separation
  2.5 Query-by-humming
  2.6 Symbolic Processing
  2.7 Segmentation, Structure and Alignment
  2.8 Watermarking, fingerprinting and cover song detection
  2.9 Connecting MIR to other disciplines
  2.10 Other topics

I Fundamentals

3 Audio representations
  3.1 Sound production and perception
  3.2 Sampling, quantization and the time-domain
  3.3 Sinusoids and frequency
    3.3.1 Sinusoids and phasors
    3.3.2 The physics of a simple vibrating system
    3.3.3 Linear Time Invariant Systems
    3.3.4 Complex Numbers
    3.3.5 Phasors
    3.3.6 Time-Frequency Representations
    3.3.7 Frequency Domain Representations
    3.3.8 Discrete Fourier Transform
    3.3.9 Sampling and Quantization
    3.3.10 Discrete Fourier Transform
    3.3.11 Linearity, propagation, resonances
    3.3.12 The Short-Time Fourier Transform
    3.3.13 Wavelets
  3.4 Perception-informed Representations
    3.4.1 Auditory Filterbanks
    3.4.2 Perceptual Audio Compression
  3.5 Source-based Representations
  3.6 Further Reading

4 Music Representations
  4.1 Common music notation
    4.1.1 Notating rhythm
    4.1.2 Notating pitch
    4.1.3
  4.2 MIDI
  4.3 Notation formats
  4.4 Music Theory
  4.5 Graphical Score Representations
  4.6 MIDI
  4.7 MusicXML

5 Feature Extraction
  5.1 Monophonic pitch estimation
    5.1.1 Terminology
    5.1.2 Psychoacoustics
    5.1.3 Musical Pitch
    5.1.4 Time-Domain Pitch Estimation
    5.1.5 Frequency-Domain Pitch Estimation
    5.1.6 Perceptual Pitch Estimation
  5.2 Timbral Texture Features
    5.2.1 Spectral Features
    5.2.2 Mel-Frequency Cepstral Coefficients
    5.2.3 Other Timbral Features
    5.2.4 Temporal Summarization
    5.2.5 Song-level modeling
  5.3 Rhythm Features
    5.3.1 Onset Strength Signal
    5.3.2 Tempo Induction and Beat Tracking
    5.3.3 Rhythm representations
  5.4 Pitch/Harmony Features
  5.5 Other Audio Features
  5.6 Bag of codewords
  5.7 Spectral Features
  5.8 Mel-Frequency Cepstral Coefficients
  5.9 Pitch and Chroma
  5.10 Beat, Tempo and Rhythm
  5.11 Modeling temporal evolution and dynamics
  5.12 Pattern representations
  5.13 Stereo and other production features

6 Context Feature Extraction
  6.1 Extracting Context Information About Music

7 Analysis
  7.1 Feature Matrix, Train Set, Test Set
  7.2 Similarity and distance metrics
  7.3 Classification
    7.3.1 Evaluation
    7.3.2 Generative Approaches
    7.3.3 Discriminative
    7.3.4 Ensembles
    7.3.5 Variants
  7.4 Clustering
  7.5 Dimensionality Reduction
    7.5.1 Principal Component Analysis
    7.5.2 Self-Organizing Maps
  7.6 Modeling Time Evolution
    7.6.1 Dynamic Time Warping
    7.6.2 Hidden Markov Models
    7.6.3 Kalman and Particle Filtering
  7.7 Further Reading

II Tasks

8 Similarity Retrieval
  8.0.1 Evaluation of similarity retrieval

9 Classification and Tagging
  9.1 Introduction
  9.2 Genre Classification
    9.2.1 Formulation
    9.2.2 Evaluation
    9.2.3 Criticisms
  9.3 Symbolic genre classification
  9.4 Emotion/Mood
  9.5 Instrument
  9.6 Other
  9.7 Symbolic
  9.8 Tagging
  9.9 Further Reading

10 Structure
  10.1 Segmentation
  10.2 Alignment
  10.3 Structure Analysis

11 Transcription
  11.1 Monophonic Pitch Tracking
  11.2 Transcription
  11.3 Chord Detection

12 Symbolic Music Information Retrieval

13 Queries
  13.1 Query by example
  13.2 Query by humming

14 Fingerprinting and Cover Song Detection
  14.1 Fingerprinting
  14.2 Cover song detection

15 Other topics
  15.1 Optical Music Recognition
  15.2 Digital Libraries and Metadata
  15.3 Computational Musicology
    15.3.1 Notated Music
    15.3.2 Computational Ethnomusicology
    15.3.3 MIR in live music performance
  15.4 Further Reading

III Systems and Applications

16 Interaction
  16.1 Interaction

17 Evaluation
  17.1 Bibliography

18 User Studies

19 Tools
  19.1 Datasets
  19.2 Digital Music Libraries

A Statistics and Probability

B Project ideas
  B.1 Project Organization and Deliverables
  B.2 Previous Projects
  B.3 Suggested Projects
  B.4 Projects that evolved into publications

C Commercial activity 2011

D Historic Time Line
Preface
Music Information Retrieval (MIR) is a field that has been rapidly evolving since
2000. It encompasses a wide variety of ideas, algorithms, tools, and systems that
have been proposed to handle the increasingly large and varied amounts of musical data available digitally. Researchers in this emerging field come from many
different backgrounds. These include Computer Science, Electrical Engineering,
Library and Information Science, Music, and Psychology. They all share the
common vision of designing algorithms and building tools that can help us organize, understand and search large collections of music in digital form. However,
they differ in the way they approach the problem as each discipline has its own
terminology, existing knowledge, and value system. The resulting complexity,
characteristic of interdisciplinary research, is one of the main challenges facing
students and researchers entering this new field. At the same time, it is exactly
this interdisciplinary complexity that makes MIR such a fascinating topic.
The goal of this book is to provide a comprehensive overview of past and
current research in Music Information Retrieval. A considerable challenge has
been to make the book accessible, and hopefully interesting, to readers coming
from the wide variety of backgrounds present in MIR research. Intuition and
high-level understanding are stressed while at the same time providing sufficient
technical information to support implementation of the described algorithms. The
fundamental ideas and concepts that are needed in order to understand the published literature are also introduced. A thorough bibliography of the relevant
literature is provided and can be used to explore in more detail the topics covered
in this book. Throughout, my underlying intention has been to provide
enough support material to make any paper published in the proceedings
of the conference of the International Society for Music Information Retrieval
(ISMIR) understandable to its readers.
I have been lucky to be part of the growing MIR community right from its
inception. It is a great group of people, full of passion for their work and always
open to discussing their ideas with researchers in other disciplines. If you ever
want to see a musicologist explaining Schenkerian analysis to an engineer or an
engineer explaining Fourier Transforms to a librarian then try attending ISMIR,
the International Conference of the Society for Music Information Retrieval. I
hope this book motivates you to do so.
Acknowledgements
Music Information Retrieval is a research topic to which I have devoted most of my
adult life. My thinking about it has been influenced by many people that I have
had the fortune to interact with. As is the case in most such acknowledgement
sections, there are probably more names than most readers care about. At the same
time, there is a good chance that I have forgotten some, to whom I
apologize.
First of all I would like to thank my wife Tiffany and my two sons Panos and
Nikos for not letting my work consume me. I am typing these sentences on an
airplane seat with my one-year-old son Nikos sleeping on my lap and my five-year-old
son Panos sleeping next to me. Like most kids they showed music information
retrieval abilities before they walked or talked. I always smile when I remember
Panos bobbing his head in perfect time with the uneven rhythms of Cretan folk
music, or Nikos pointing at the iPod until the music started so that he could swirl
to his favorite reggae music. Both of these seemingly simple tasks (following a
rhythm and recognizing a particular genre of music) are still beyond the reach
of computer systems today, and my sons, as well as most children their age,
can perform them effortlessly. Since 2000 I, along with many other researchers, have
explored how to computationally solve such problems utilizing techniques from
digital signal processing, machine learning, and computer science. It is humbling
to realize how much further we still need to go.
I would also like to thank my parents Anna and Panos who kindled my love
of music and science respectively from an early age. I also would like to thank
my uncle Andreas who introduced me to the magic of computer programming
using Turbo Pascal (version 3) when I was a high school student. Studying music
has always been part of my life and I owe a lot to my many music teachers
over the years. I would like to single out Theodoros Kerkezos, my saxophone
teacher, who through his teaching rigor turned a self-taught saxophone player
with many bad playing habits into a decent player with solid technique. I have
also been fortunate to play with many incredible musicians and I am constantly
amazed by what an incredible experience playing music is. I would like to thank,
in somewhat chronological order: Yorgos Pervolarakis, Alexis Drakos Ktistakis,
Dimitris Zaimakis, Dimitris Xatzakis, Sami Amiris, Tina Pearson and Paul Walde
with whom I have shared some of my favorite music making.
I have also been very fortunate to work with and learn from amazing academic mentors. As an undergraduate student, my professors Giorgos Tziritas and Apostolos
Traganitis at the University of Crete exposed me to research in image and signal
processing, and supported my efforts to work on audio and music processing.
My PhD supervisor Perry Cook provided all the support someone could wish for,
when my research involving computer analysis of audio representations of music
signals was considered a rather exotic computer science endeavor. His boundless
creativity, enthusiasm, and love of music in most (but not all) forms have always
been a source of inspiration. I am also very grateful that he never objected to me
taking many music courses at Princeton and learning from amazing music teachers
such as Steven Mackey and Kofi Agawu.
I also learned a lot from working as a postdoctoral fellow with Roger Dannenberg at Carnegie Mellon University. Roger was doing music information retrieval
long before the term was coined. I am completely in awe of the early work he was
able to do when processing sound using computers was so much more challenging
than today because of the hardware limitations of the time. When I joined Google
Research as visiting faculty during a six month sabbatical leave in 2011, I thought
I had a pretty decent grasp of digital signal processing. Dick Lyon showed me how
much more I needed to learn, especially about computer models of the human
auditory system. I also learned a lot from Ken Steiglitz both as a graduate student
at Princeton University and later on through his book “A Digital Signal Processing
Primer”. I consider this book the best DSP book for beginners that I have read. It
also strongly influenced the way I approach explaining DSP, especially for readers
with no previous exposure to it both in the classes that I teach as well as in this
book.
Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals) is a
free software framework for audio processing with specific emphasis on music
information retrieval that I designed and for which I have been the main developer since
2000. It has provided me, my students, and collaborators from around the world with
a set of tools and algorithms to explore music information retrieval. The core
Marsyas developers have been amazing friends and collaborators throughout the
years. I would like to particularly acknowledge, in very rough chronological
order: Ari Lazier, Luis Gustavo Martins, Luis F. Texeira, Aaron Hechmer, Neil
Burroughs, Carlos Castillio, Mathieu Lagrange, Emiru Tsunoo, Alexander Lerch,
Jakob Leben and, last but by no means least, Graham Percival, all of whom made
significant contributions to the development of Marsyas.
I have also learned a lot from the large number of undergraduate and
graduate students I have worked with over the course of my academic career. I would like to single out Ajay Kapur and Adam Tindale who through their
PhD work made me realize the potential of utilizing music information retrieval
techniques in the context of live music performance. My current PhD students
Steven Ness and Shawn Trial helped expand the use of automatic audio analysis to
bioacoustic signals and robotic music instruments respectively. Graham Percival
has been a steady collaborator and good friend through his Masters under my
supervision, his PhD at the University of Glasgow, and a six-month post-doc working with me. I have also had the great fortune to work with several postdoctoral
fellows and international visiting students. Mathieu Lagrange and Luis Gustavo
Martins have been amazing friends and collaborators. I also had a great time
working with and learning from Alexander Lerch, Jayme Barbedo, Tiago Tavares,
Helene Papadopoulos, Dan Godlovich, Isabel Barbancho, and Lorenzo Tardon.
Chapter 1

Introduction
I know that the twelve notes in each octave and the variety of rhythm offer me
opportunities that all of human genius will never exhaust.
Igor Stravinsky
In the first decade of the 21st century there has been a dramatic shift in how
music is produced, distributed and consumed. This is a digital age where computers are pervasive in every aspect of our lives. Music creation, distribution,
and listening have gradually become to a large extent digital. For thousands of
years music only existed while it was being created at a particular time and place.
Suddenly not only can it be captured through recordings, but enormous amounts
of it are easily accessible through electronic devices. A combination of advances
in digital storage capacity, audio compression, and increased network bandwidth
have made this possible. Portable music players, computers, and smart phones
have facilitated the creation of personal music collections consisting of thousands
of music tracks. Millions of music tracks are accessible through streaming over
the Internet in digital music stores and personalized radio stations. It is very
likely that in the near future all of recorded music will be available and accessible
digitally.
The central challenge that research in the emerging field of music information retrieval (MIR) tries to address is how to organize and analyze these vast
amounts of music and music-related information available digitally. In addition,
it also involves the development of effective tools for listeners, musicians and
musicologists that are informed by the results of this computer supported analysis
and organization. MIR is an inherently interdisciplinary field combining ideas
from many fields including Computer Science, Electrical Engineering, Library
and Information Science, Music, and Psychology. The field has a history of
approximately 15 years so it is a relatively new research area. Enabling anyone
with access to a computer and the Internet to listen to essentially most of recorded
music in human history is a remarkable technological achievement that would
probably be considered impossible even as recently as 20 years ago. The research
area of Music Information Retrieval (MIR) gradually emerged during this time
period in order to address the challenge of effectively accessing and interacting
with these vast digital collections of music and associated information such as
meta-data, reviews, blogs, rankings, and usage/download patterns.
This book attempts to provide a comprehensive overview of MIR that is relatively self-contained in terms of providing the necessary interdisciplinary background required to understand the majority of work in the field. The guiding
principle when writing this book was that there should be enough background
material in it, to support understanding any paper that has been published in
the main conference of the field: the International Conference of the Society for
Music Information Retrieval (ISMIR). In this chapter the main concepts and ideas
behind MIR as well as how the field evolved historically are introduced.
1.1 Context and History
Music is one of the most fascinating activities that we engage in as a species, as well
as one of the most strange and intriguing. Every culture throughout history has
been involved with the creation of organized collections of air pressure waves that
somehow profoundly affect us. To produce these waves we have invented and
perfected a large number of sound producing objects using a variety of materials
including wood, metal, guts, hair, and skin. With the assistance of these artifacts,
that we call musical instruments, musicians use a variety of ways (hitting, scraping,
bowing, blowing) to produce sound. Music instruments are played using every
conceivable part of our bodies and controlling them is one of the most complex
and sophisticated forms of interaction between a human and an artifact. For most
of us listening to music is part of our daily lives, and we find it perfectly natural
that musicians spend large amounts of their time learning their craft, or that we
frequently pay money to hear music being performed. Somehow these organized
structures of air pressure waves are important enough to warrant our time and
resources and as far as we can tell this has been a recurring pattern throughout human history and across all human cultures around the world. It is rather perplexing
how much time and energy we are willing to put into music making and listening
and there is vigorous debate about what makes music so important. What is clear
is that music was, is, and probably will be an essential part of our lives.
Technology has always played a vital role in music creation, distribution, and
listening. Throughout history the making of musical instruments has followed
the technological trends of the day. For example instrument making has taken
advantage of progressively better tools for wood and bone carving, the discovery
of new metal alloys, and materials imported from far away places. In addition,
more conceptual technological developments such as the use of music notation
to preserve and transmit music, or the invention of recording technology have
also had profound implications on how music has been created and distributed.
In this section, a brief overview of the history of how music and music-related
information has been transmitted and preserved across time and space is provided.
Unlike painting or sculpture, music is ephemeral and exists only during the
course of a performance and as a memory in the minds of the listeners. It is
important to remember that this has been the case throughout most of human
history. Only very recently with the invention of recording technology could
music be “preserved” across time and space. The modern term “live” music
would have applied to all music before recording was invented. Music is a process, not an
artifact. This fleeting nature made music transmission across space and time
much more difficult than other arts. For example, although we know a lot about
what instruments and to some extent what music scales were used in antiquity,
we have very vague ideas of how the music actually sounded. In contrast, we can
experience ancient sculptures or paintings in a very similar way to how they were
perceived when they were created. For most of human history the only means of
transmission and preservation of musical information was direct contact between
a performer and a listener during a live performance. The first major technological
development that changed this state of affairs was the gradual invention of musical
notation. Although several forms of music notation exist for our discussion we
will focus on the evolution of Western common music notation which today is the
most widely used music notation around the world.
Several music cultures around the world developed symbolic notation systems
although musicians today are mostly familiar with Western common music notation. The origins of western music notation lie in the use of written symbols
as mnemonic aids assisting singers in melodically rendering a particular religious
text. Fundamentally, the idea behind music notation systems is to encode
symbolically information about how a particular piece of music can be performed.
In the same way that a recipe is a set of instructions for making a particular dish, a
notated piece of music is a set of instructions that musicians must follow to render
a particular piece of music. Gradually music notation became more precise and
able to capture more information about what is being performed. However, it can
only capture some aspects of a music performance, which leads to considerable
diversity in how musicians interpret a particular notated piece of music. This limited expressiveness stems from notation's discrete symbolic nature and the
restricted number of symbols one can place on a piece of paper and realistically read. Although initially simply a mnemonic aid, music notation evolved and
was used as a way to preserve and transmit musical information across space and
time. It provided an alternative way of encountering new music to attending a live
performance. For example, a composer like J. S. Bach could hear music composed
in Italy by simply reading a piece of paper and playing it on a keyboard rather
than having to physically be present when and where it was performed. Thanks
to written music scores any keyboard player today can still perform the pieces
J. S. Bach composed more than 300 years ago by simply reading the
scores he wrote. One interesting side note, and a recurring theme in the history
of technology and of music technology in particular, is that frequently inventions are used in ways
that their original creators never envisioned. Although originally conceived more
as a mnemonic aid for existing music, music notation has profoundly affected the
way music was created, distributed and consumed. For example, the advent of
music notation to a large extent was the catalyst that enabled the creation of a
new type of specialized worker which today we know as a composer. Freelance
composers supported themselves by selling scores rather than requiring a rich
patron to support them, and Mozart is a well-known example of a composer who
experienced the transition from having a single patron to freelancing.
The invention of recording technology caused even more significant changes
in the way music was created, distributed and consumed [92, ?]. The history
of recording and the effects it had on music is extremely fascinating. In many
ways it is also quite informative about many of the debates and discussions about
the role of digital music technology today. Recording enabled repeatability of
music listening at a much more precise level than music notation ever could and
decoupled the process of listening from the process of music making.
Given the enormous impact of recording technology on music, it is hard to
imagine that there was a time when it was considered a distraction from the “real”
application of recording speech for business purposes, by none other than the
distinguished Thomas Edison. Despite his attempts, music recordings became
extremely popular and jukebox sales exploded. For the first time in history music
could be captured, replicated and distributed across time and space and recording
devices quickly spread across the world. The ensuing cross-pollination of diverse
musical traditions triggered the generation of new music styles that essentially
form the foundation of many of the popular music genres today.
Initially the intention of recordings was to capture as accurately as possible
the sound of live music performances. At the same time, limitations of early
recording technology such as the difficulty of capturing dynamics and the short
duration of recordings forced musicians to modify and adapt how they played
music. As technology progressed with the development of inventions such as
electric microphones and magnetic tapes, recordings progressively became more
and more a different art form. Using effects such as overdubbing, splicing, and
reverb, recordings became essentially virtual creations impossible to recreate in
live music performance.
Early gramophones allowed their owners to make their own recordings but
soon after, with the standardization and mass production of recording discs, the
playback devices were limited to playing, not recording. Recordings became
something to be collected, prized and owned. They also created a different way
of listening that could be conducted in isolation. It is hard for us to imagine how
strange this isolated listening to music seemed in the early days of recording. My
favorite such story, recounted by Katz [?], is about an early British enthusiast who
constructed elaborate dollhouses as visual backdrops for listening to recordings of
his favorite operas.
Music broadcasting through radio was another important music technology
milestone of the 20th century. It shifted the business model from selling physical
recordings to essentially free access to music supported by advertising. It also had
an important accessibility effect, making music listening effectively free. Swing
music, the only jazz genre to achieve mainstream popularity, was to a large extent
promoted through radio and made accessible during and after the Great Depression when purchasing individual recordings was prohibitive for many listeners.
Analog cassettes reintroduced the lost ability that early gramophones had to
perform home recordings. The creation of mix tapes from radio and LP sources
was a pervasive and intense activity during the high-school days of the author in
the 1980s. One can trace the origin of the concept of creating a playlist to mix
tapes. Another way that cassettes had an impact was the introduction of portable
music-playing devices, starting with the famous Sony Walkman (also one of my
prized possessions during high school).
These two historical developments in music technology, musical notation
and audio recording, are particularly relevant to the field of MIR, which is the
focus of this book. They are directly reflected in the two major types of digital
representations that have been used for MIR: symbolic and audio-based. These
digital representations were originally engineered to simply store digitally the
corresponding analog (for example vinyl) and physical (for example typeset paper) information but eventually were utilized to develop completely new ways of
interacting with music using computers. It is also fascinating how the old debates
contrasting the individual model of purchasing recordings with the broadcasting
model of radio are being replayed in the digital domain with digital music stores
that sell individual tracks, such as iTunes, and subscription services that provide
unlimited access to streaming tracks such as Spotify.
1.2 Music Information Retrieval
As we discussed in the previous section, the ideas behind music information and
retrieval are not new, although the term itself is. In this book we focus on the
modern use of the term music information retrieval which will be abbreviated to
MIR for the remainder of the book. MIR deals with the analysis and retrieval of
music in digital form. It reflects the tremendous recent growth of music-related
data digitally available and the consequent need to search within it to retrieve
music and musical information efficiently and effectively.
Arguably the birth of Music Information Retrieval (MIR) as a separate research area happened in Plymouth, Massachusetts in 2000 at the first conference
(then symposium) on Music Information Retrieval (ISMIR). Although work in
MIR had appeared before that time in various venues, it wasn’t identified as such
and the results were scattered in different communities and publication venues.
During that symposium a group of computer scientists, electrical engineers, information scientists, musicologists, psychologists, and librarians met for the first
time in a common space, exchanged ideas and found out what types of MIR
research were happening at the time. MIR research until then was published
in completely separate and different conferences and communities. The seed
funding for the symposium, and for some preparatory workshops that preceded it, was
provided by the National Science Foundation (NSF) through a grant program
targeting Digital Libraries (more details can be found at http://www.ismir.net/texts/Byrd02.html, accessed January 2014). The original organizers
Don Byrd, J. Stephen Downie and Tim Crawford had backgrounds connected to
symbolic processing and digital libraries but fortunately the symposium attracted
some attendees that were working with audio. The resulting cross-fertilization
was instrumental in establishing MIR as a research area with a strong sense of
interdisciplinarity.
When MIR practitioners convey what Music Information Retrieval is to a more
general audience, they frequently use two metaphors that succinctly capture how
the majority, but not all, of MIR research could potentially be used. The first
metaphor is the so-called “grand celestial jukebox in the sky” and refers to the
potential availability of all recorded music to any computer user. The goal of MIR
research is to develop better ways of browsing, choosing, listening and interacting
with this vast universe of music and associated information. To a large extent
online music stores, such as the iTunes store, are well on their way to providing
access to large collections of music from around the world. A complementary
metaphor could be termed the “absolute personalized radio”, in which the listening
habits and preferences of a particular user are analyzed or directly specified and
computer algorithms provide a personalized sequence of music pieces that is
customized to the particular user. As with the traditional jukebox and radio, the
emphasis in the first metaphor is on choice by the user, versus choice by some
expert entity in the second metaphor. It is important to note
that there is also MIR research that does not fit well in these two metaphors.
Examples include work on symbolic melodic pattern segmentation and work on
computational ethnomusicology.
1.3 Dimensions of MIR research
There are multiple ways that MIR research can be organized into thematic areas. Three particular “dimensions” of organization that I have found
useful when teaching and thinking about the topic are: 1) specificity, which refers
to the semantic focus of the task; 2) data sources, which refers to the various
representations (symbolic, meta-data, context, lyrics) utilized to achieve a particular task; and 3) stages, which refers to the common processing stages in an
MIR system, such as representation, analysis, and interaction. In the next three
subsections, these different conceptual dimensions are described in more detail.
Figure 1.1: Conceptual dimensions of MIR
1.3.1 Specificity
One can view many MIR tasks and systems through a retrieval paradigm in which
a query is somehow specified by the user and a set of hopefully relevant results
to that query is returned by the computer. This paradigm is very familiar from
text search engines such as Google. MIR queries can take many forms such as an
actual audio recording, a set of keywords describing the music, or a recording of
a user singing a melody. Specificity refers to the range of retrieved results that are
considered relevant for a particular query.
The most specific type of task is audio fingerprinting in which the query is a
recorded (possibly noisy or compressed) audio snippet of a particular music track
and the result returned is the exact audio recording from which that snippet was
obtained. A different recording of the same piece of music even by the same group
is not considered relevant. Cover song detection expands the specificity so that the
returned results also include different versions of the same piece of music. These
can differ in terms of instruments played, style, or artist, but as long as they have the
same underlying melody and harmonic structure they are considered the same.
Further expanding the circle of specificity, we have artist identification, in which
any recording by the same artist is considered relevant to a particular query. Genre
or style classification is quite broad and any returned recording that is of the same
genre or style is considered relevant. In emotion recognition, returned results
should all have a particular emotion or evoke a particular mood, possibly spanning
multiple genres and styles. The widest circle of specificity is exemplified
by simple shuffling, in which any music recording in a database is considered
relevant to a particular query. Although shuffling is trivial, it is a common way of
interacting with large audio collections.
1.3.2 Data sources and information
Another way to conceptually organize MIR tasks and systems is through the
data sources and information they utilize. Obviously an important source of
information is the music itself or what is frequently termed the musical content.
This can be the digital audio recording or some symbolic representation of the
music, which in most cases can be viewed as a type of musical score. In addition,
there is a lot of further information that can be utilized for retrieval and analysis
purposes that is not part of the actual music. This information that is about the
music rather than the actual music is termed the context. Frequently it consists
of text for example lyrics, blogs, web pages, and social media posts. Other
forms of context information can also be used, such as ratings, download patterns
in peer-to-peer networks, and graphs of social network connections. Context
information tends to be symbolic in nature. Music information retrieval systems
can combine multiple sources of information. For example, a genre classification
system might rely on both audio feature extraction from the recording, and text
feature extraction from the lyrics.
1.3.3 Stages

Another way of organizing MIR algorithms and systems is more procedural,
through the stages that are involved in most such systems. Parallels to these stages
can be found in how humans perceive, understand, and interact with music.
• Representation - Hearing
Audio signals are stored (in their basic uncompressed form) as a time series
of numbers corresponding to the amplitude of the signal over time. Although this representation is adequate for transmission and reproduction of
arbitrary waveforms, it is not particularly useful for analyzing and understanding audio signals. The way we perceive and understand audio signals
as humans is based on and constrained by our auditory system. It is well
known that the early stages of the human auditory system (HAS), to a first
approximation, decompose incoming sound waves into different frequency
bands. In a similar fashion, in MIR, time-frequency analysis techniques are
frequently used for representing audio signals. The representation stage of
the MIR pipeline refers to any algorithms and systems that take as input
simple time-domain audio signals and create more compact, information-bearing
representations (see the spectrogram sketch after this list). The most relevant academic research area to this
stage is Digital Signal Processing (DSP).
• Analysis - Understanding
Once a good representation is extracted from the audio signal various types
of automatic analysis can be performed. These include similarity retrieval,
classification, clustering, segmentation, and thumbnailing. These higher-level types of analysis typically involve aspects of memory and learning
both for humans and machines. Therefore Machine Learning algorithms
and techniques are important for this stage.
• Interaction - Acting
When the signal is represented and analyzed by the system, the user must
be presented with the necessary information and act according to it. The
algorithms and systems of this stage are influenced by ideas from Human
Computer Interaction and deal with how information about audio signals
and collections can be presented to the user and what types of controls for
handling this information are provided.
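As a minimal illustration of the representation stage, the following sketch (a generic Python/NumPy example written for this discussion, not code from Marsyas or from any specific MIR system) computes a magnitude spectrogram, a basic time-frequency representation, from a time-domain signal:

import numpy as np

def magnitude_spectrogram(x, frame_size=1024, hop_size=512):
    # Split the time-domain signal into overlapping frames, apply a Hann window
    # to each frame, and keep the magnitude of its Discrete Fourier Transform.
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(x) - frame_size + 1, hop_size):
        frame = x[start:start + frame_size] * window
        # rfft returns the non-negative frequency bins of the DFT of a real signal.
        frames.append(np.abs(np.fft.rfft(frame)))
    # Rows correspond to analysis frames (time), columns to frequency bins.
    return np.array(frames)

if __name__ == "__main__":
    sr = 22050  # assumed sampling rate in Hz
    t = np.arange(0, 1.0, 1.0 / sr)
    x = np.sin(2 * np.pi * 440 * t)  # a toy 440 Hz sinusoid standing in for a recording
    print(magnitude_spectrogram(x).shape)  # (number of frames, frame_size // 2 + 1)

Representations of this kind are the starting point for the audio features described later in the book; perceptually informed variants replace the linear frequency bins with auditory filterbanks.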
1.4 History
In this section, the history of MIR as a field is traced chronologically mostly
through certain milestones.
1.4.1 Pre-history
The first activities in MIR were initiated through work in the digital libraries
community [5].

1.4.2 History 2000-2005
1.4.3 History 2006-today

1.4.4 Current status

1.5 Existing tutorials and literature surveys
Since MIR is a new, emerging interdisciplinary area, there are not a lot of comprehensive
published overviews of MIR research. However, there are some good tutorials as
well as overview papers for specific sub-areas. There are also published books that
are edited collections of chapters written by different authors, usually focusing
on specific aspects of Music Information Retrieval. A well written overview,
although somewhat dated, focusing on information retrieval and digital library
aspects peculiar to music as well as typologies of users and their information
needs was written by Nicola Orio in 2006.
1.6 Structure of this book

Even though MIR is a relatively young research area, the variety of topics, disciplines and concepts that have been explored makes it a hard topic to cover comprehensively. It also makes a linear exposition harder. One of the issues I struggled
with while writing this book was whether to narrow the focus to topics with which I was
more familiar (for example, only covering MIR for audio signals) or attempt to
provide a more complete coverage and inevitably write about topics such as symbolic MIR and optical music recognition (OMR) with which I was less familiar.
In the end I decided to attempt to cover all topics, as I think they are all important
and wanted the book to reflect the diversity of research in this field. I did my
best to familiarize myself with the published literature in topics removed from my
expertise and received generous advice and help from friends who are familiar
with them to help me write the corresponding chapters.
The organization and chapter structure of the book was another issue I struggled with. Organizing a particular body of knowledge is always difficult and this is
made even more so in an interdisciplinary field like MIR. An obvious organization
would be historical with different topics introduced chronologically based on the
order they appeared in the published literature. Another alternative would be
an organization based on MIR tasks and applications. However, there are
several concepts and algorithms that are fundamental to and shared by most MIR
applications. After several attempts I arrived at the following organization, which,
although not perfect, was the best I could do. The book is divided into three large
parts. The first, titled Fundamentals, introduces the main concepts and algorithms
that have been utilized to perform various MIR tasks. It can be viewed as a crash
course in important fundamental topics such as digital signal processing, machine
learning and music theory through the lens of MIR research. Readers with more
experience in a particular topic can easily skip the corresponding chapter and
move to the next part of the book. The second part of the book, titled Tasks,
describes various MIR tasks that have been proposed in the literature based on the
background knowledge described in the Fundamentals part. The third part, titled
Systems and Applications, provides a more implementation oriented exposition of
more complete MIR systems that combine MIR tasks as well as information about
tools and data sets. Appendices provide supplementary background material as
well as some more specific optional topics.
1.6.1 Intended audience
The intended audience of this book is pretty broad, encompassing any person interested in music information retrieval research. The minimal set of math requirements
is basic high school math with some probability/statistics, linear algebra,
and very basic calculus. Basic familiarity with computer programming
is also assumed. Finally, although not strictly required, basic familiarity with
music theory and notation is also helpful. I have tried to make the exposition
as self-contained as possible, and the chapters in the Fundamentals section
attempt to build the necessary foundation for good understanding of published
MIR literature.
This book has evolved from a set of course notes for CSC475 Music Retrieval
Systems, a Computer Science course on MIR taught by the author at the University of Victoria. The students are a combination of 4th year undergraduate
students and graduate students mostly in Computer Science but sometimes also in
other fields.
1.6.2 Terminology
• pitch, intensity, timbre: basic features of musical sound
• informatics, retrieval
• piece, track, song
• database, collection, library
• modern popular music, western art music
• MIDI
• monophonic, polyphonic
• symbolic vs. audio
• x(t) vs. x[t]: lower case for time-domain signals and upper case for frequency-domain representations
1.7 Reproducibility

1.8 Companion website
Chapter 2

Tasks
Like any research area, Music Information Retrieval (MIR) is characterized by the
types of problems researchers are working on. In the process of growing as a
field, more and hopefully better algorithms for solving each task are constantly
designed, described and evaluated in the literature. In addition, occasionally
new tasks are proposed. The majority of these tasks can easily be described
to anyone with a basic understanding of music. Moreover they are tasks that
most humans perform effortlessly. A three-year-old child can easily perform
music information related tasks such as recognizing a song, clapping along with
music, or listening to the sung words in a complex mixture of instrumental sounds
and voice signals. At the same time, the techniques used to perform such tasks
automatically are far from simple. It is humbling to realize that a full arsenal
of sophisticated signal processing and machine learning algorithms is required
to perform these seemingly simple music information tasks using computers. At
the same time computers, unlike humans, can process arbitrarily large amounts of
data and that way open up new possibilities in how listeners interact with large
digital collections of music and associated information. For example, a computer
can analyze a digital music collection with thousands of songs to find pieces of
music that share a particular melody or retrieve several music tracks that match
an automatically calculated tempo. These are tasks beyond the ability of most
humans when applied to large music collections. Ultimately MIR research should
be all about combining the unique capabilities of humans and machines in the
context of music information and interaction.
In this chapter, a brief overview of the main MIR tasks that have been proposed
in the literature and covered in this book is provided. The goal is to describe
how each task/topic is formulated in terms of what it tries to achieve rather than
explaining how these tasks are solved or how algorithms for solving them are
evaluated. Moreover, the types of input and output required for each task are
discussed. In most tasks there is typically a seminal paper or papers that had a
lot of impact in terms of formulating the task and influencing subsequent work.
In many cases these seminal papers are preceded by earlier attempts that did not
gain traction. For each task described in this chapter I attempt, to the best of my
knowledge, to provide pointers to the earliest relevant publication, the seminal
publication and some representative examples. Additional citations are provided
in the individual chapters in the rest of the book.
This chapter can also be viewed as an attempt to organize the growing and
somewhat scattered literature in MIR into meaningful categories and serve as a
quick reference guide. The rest of the book is devoted to explaining how we can
design algorithms and build systems to solve these tasks. Solving these tasks can
sometimes directly result in useful systems, but frequently they form components
of larger, more complex systems. For example, beat tracking can
be considered an MIR task, but it can also be part of a larger computer-assisted DJ
system that performs similarity retrieval and beat alignment. Monophonic pitch
extraction is another task that can be part of a larger query-by-humming system.
Roughly speaking I consider as a task any type of MIR problem for which there is
existing published work and there is a clear definition of what the expected input
and desired output should be. The order of presentation of tasks in this chapter is
relatively subjective and the descriptions are roughly proportional to the amount
of published literature in each particular task.
2.1 Similarity Retrieval, Playlisting and Recommendation
Similarity retrieval (or query-by-example) is one of the most fundamental MIR
tasks. It is also one of the first tasks that were explored in the literature. It
was originally inspired by ideas from text information retrieval and this early
influence is reflected in the naming of the field. Today most people with computers
use search engines on a daily basis and are familiar with the basic idea of text
information retrieval. The user submits a query consisting of some words to the
search engine and the search engine returns a ranked list of web pages sorted by
how relevant they are to the query.
Similarity retrieval can be viewed as an analogous process where instead of the
user querying the system by providing text, the query consists of an actual piece
of music. The system then responds by returning a list of music pieces ranked by
their similarity to the query. Typically the input to the system consists of the query
music piece (using either a symbolic or audio representation) as well as additional
metadata information such as the name of the song, artist, year of release etc. Each
returned item typically also contains the same types of meta-data. In addition to
the audio content and meta-data other types of user generated information can also
be considered such as ranking, purchase history, social relations and tags.
Similarity retrieval can also be viewed as a basic form of playlist generation in
which the returned results form a playlist that is “seeded” by the query. However
more complex scenarios of playlist generation can be envisioned. For example a
start and end seed might be specified or additional constraints such as approximate
duration or minimum tempo variation can be specified. Another variation is based
on what collection/database is used for retrieval. The term playlisting is more
commonly used to describe the scenario where the returned results come from the
personal collection of the user, while the term recommendation is more commonly
used in the case where the returned results are from a store containing a large
universe of music. The purpose of the recommendation process is to entice the
user to purchase more music pieces and expand their collection. Although these
three terms (similarity retrieval, music recommendation, automatic playlisting)
have somewhat different connotations the underlying methodology for solving
them is similar so for the most part we will use them interchangeably. Another
related term that is sometimes used is personalized radio, in which the idea is to play music tailored to the preferences of the user.
One can distinguish three basic approaches to computing music similarity.
Content-based similarity is performed by analyzing the actual content to extract
the necessary information. Metadata approaches exploit sources of information
that are external to the actual content such as relationships between artists, styles,
tags or even richer sources of information such as web reviews and lyrics. Usage-based approaches track how users listen to and purchase music and utilize this information for calculating similarity. Examples include collaborative filtering, in which
the commonalities between purchasing histories of different users are exploited,
tracking peer-to-peer downloads or radio play of music pieces to evaluate their
“hotness” and utilizing user generated rankings and tags.
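To make the content-based approach concrete, here is a minimal Python sketch (assuming the numpy library) that ranks a collection by cosine similarity between feature vectors. The feature arrays are hypothetical placeholders standing in for features that would actually be extracted from the audio, as discussed in later chapters.

import numpy as np

def rank_by_similarity(query_vec, collection_feats):
    # Normalize to unit length so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    C = collection_feats / np.linalg.norm(collection_feats, axis=1, keepdims=True)
    similarities = C @ q
    return np.argsort(similarities)[::-1]     # track indices, most similar first

# Hypothetical collection: 1000 tracks, each described by a 40-dimensional vector
features = np.random.rand(1000, 40)
ranking = rank_by_similarity(features[0], features)
print(ranking[:10])                           # the ten closest tracks to track 0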
There are trade-offs involved in all three of these approaches and most likely the ideal system would be one that combines all of them intelligently. Usage-based approaches suffer from what has been termed the “cold-start” problem, in which new music pieces for which there is no usage information cannot be recommended. Metadata approaches suffer from the fact that metadata information is frequently noisy or inaccurate and can sometimes require significant semi-manual effort to clean up. Finally, content-based methods are not yet mature enough to extract high-level information about the music.
Evaluation of similarity retrieval is difficult as similarity is a subjective notion and therefore requires large-scale user studies in order to be properly evaluated. In most published cases similarity retrieval systems are evaluated through some related task, such as how many of the returned music tracks belong to the same genre as the query.
2.2 Classification and Clustering
Classification refers to the process of assigning one or more textual labels in
order to characterize a piece of music. Humans have an innate drive to group
and categorize things including music pieces. In classification tasks the goal is
given a piece of music to perform this grouping and categorization automatically.
Typically this is achieved by automatically analyzing a collection of music that
has been manually annotated with the corresponding classification information.
The analysis results are used to “train” models (computer algorithms) that given
the analysis results (referred to as audio features) for a new unlabelled music
track are able to “predict” the classification label with reasonable accuracy. This is
referred to as “supervised learning” in the field of machine learning/data mining.
At the opposite end of the spectrum is “unsupervised learning” or “clustering” in
which the objects of interest (music pieces in our case) are automatically grouped
together into “clusters” such that pieces that are similar fall in the same cluster.
There are several interesting variants and extensions of the classic problems of classification and clustering. In semi-supervised learning the learning algorithm utilizes both labeled data, as in standard classification, and unlabeled data, as in clustering. The canonical output of a classification algorithm is a single label from a finite known set of classification labels. In multi-label classification each object to be classified is associated with multiple labels, both when training and predicting.
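As a minimal sketch of the supervised formulation (assuming the scikit-learn library, with random placeholder arrays standing in for extracted audio features and their manually assigned labels):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical annotated collection: 500 tracks, 40 features each, 4 classes
X_train = np.random.rand(500, 40)
y_train = np.random.randint(0, 4, size=500)

model = RandomForestClassifier(n_estimators=100)
print(cross_val_score(model, X_train, y_train, cv=5))   # estimated accuracy

model.fit(X_train, y_train)                   # "train" on the labeled examples
X_new = np.random.rand(1, 40)                 # features of an unlabeled track
print(model.predict(X_new))                   # "predict" its label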
Several such groupings have historically been used as means of organizing music, and computer algorithms that attempt to perform the corresponding categorization automatically have been developed. Although the most
common input to MIR classification and clustering systems is audio signals there
has also been work that utilizes symbolic representations as well as metadata (such
as tags and lyrics) and context information (peer-to-peer downloads, purchase
history). In the following subsections, several MIR tasks related to classification
and clustering are described.
2.2.1 Genres and Styles
Genres or styles are words used to describe groupings of musical pieces that share
some characteristics. In the majority of cases these characteristics are related
to the musical content but not always (for example in Christian Rap). They
are used to physically organize content in record stores and virtually organize
music in digital stores. In addition they can be used to convey music preferences
and taste, as in the stereotypical response to the question “What music do you like?” which typically goes something like “I like all different kinds of music except X”, where X can be classical, country, heavy metal or whatever other genre the particular person is not fond of. Even though there is clear shared meaning among different individuals when discussing genres, their exact definition and boundaries are fuzzy and subjective. Even though they are sometimes criticized as being driven by the music industry, their formation can be an informal and fascinating process.1
Top level genres like “classical” or “rock” tend to encompass a wide variety
of music pieces with the extreme case of “world music” which has enormous
diversity and almost no descriptive capability. On the other hand, more specialized genres (sometimes referred to as sub-genres) such as Neurofunk or Speed Metal are
meaningful to smaller groups of people and in many cases used to differentiate the
insiders from the outsiders. Automatic genre classification was one of the earliest
classification problems tackled in the music information retrieval literature and is
still going strong. The easy availability of ground truth (especially for top-level
genres) that can be used for evaluation, and the direct mapping of this task to well
established supervised machine learning techniques are the two main reasons for
1 http://www.guardian.co.uk/music/2011/aug/25/origins-of-music-genres-hip-hop (accessed September 2011)
the popularity of automatic genre classification. Genre classification can be cast
either as the more common single-label classification problem (in which one out of
a set of N genres is selected for unknown tracks) or as a multi-label classification
in which a track can belong to more than one category/genre.
[66]
2.2.2 Artist, Group, Composer, Performer and Album Identification
Another obvious grouping that can be used for classification is identifying the
artist or group that performs a particular music track. In the case of rock and
popular music frequently the performing artists are also the composers and the
tracks are typically associated with a group (like the Beatles or Led Zeppelin). On
the other hand in classical music there is typically a clearer distinction between
composer and performer. The advantage of artist identification compared to genre
is that the task is much more narrow and in some ways well-defined. At the
same time it has been criticized as identifying the production/recording approach
used, rather than the actual musical content (especially in the case of songs from
the same album) and also for being a somewhat artificial task, as in most practical scenarios this information is readily available from metadata.
2.2.3 Mood/Emotion Detection
Music can evoke a wide variety of emotional responses, which is one of the reasons it is heavily utilized in movies. Even though culture and education can significantly affect the emotional response of listeners to music, there has been work attempting to automatically detect mood and emotion in music. As this is information that listeners frequently use to discuss music and that is not readily available in most cases, there has been considerable interest in performing it automatically. Even though it is occasionally cast as a single-label classification problem, it is more appropriately considered a multi-label problem or, in some cases, a regression in a continuous “emotion” space.
2.2.4 Instrument Recognition and Detection
Monophonic instrument recognition refers to automatically predicting the name/type
of a recorded sample of a musical instrument. In the easiest configuration isolated
notes of the particular instrument are provided. A more complex scenario is when
larger phrases are used, and of course the most difficult problem is identifying
the instruments present in a mixture of musical sounds (polyphonic instrument
recognition or instrumentation detection). Instrument classification techniques
are typically applied to databases of recorded samples.
2.2.5 Tag Annotation
The term “tag” refers to any keyword associated with an article, image, video, or
piece of music on the web. In the past few years there has been a gradual shift
from manual annotation into fixed hierarchical taxonomies to collaborative social
tagging where any user can annotate multimedia objects with tags (so called folksonomies) without conforming to a fixed hierarchy and vocabulary. For example,
Last.fm is a collaborative social tagging network which collects roughly 2 million
tags (such as “saxophone”, “mellow”, “jazz”, “happy”) per month [?] and uses
that information to recommend music to its users. Another source of tags are
“games with a purpose” [?] where people contribute tags as a by-product of doing
a task that they are naturally motivated to perform, such as playing casual web
games. For example TagATune [?] is a game in which two players are asked to
describe a given music clip to each other using tags, and then guess whether the
music clips given to them are the same or different.
Tags can help organize, browse, and retrieve items within large multimedia
collections. As evidenced by social sharing websites including Flickr, Picasa,
Last.fm, and YouTube, tags are an important component of what has been termed
as “Web 2.0”. The focus of MIR research is systems that automatically predict
tags (sometimes called autotags) by analyzing multimedia content without requiring any user annotation. Such systems typically utilize signal processing and
supervised machine learning techniques to “train” autotaggers based on analyzing
a corpus of manually tagged multimedia objects.
There has been considerable interest in automatic tag annotation in multimedia research. Automatic tags can help provide information about items that have not been tagged yet or are poorly tagged. This avoids the so-called “cold-start” problem [?] in which an item cannot be retrieved until it has been tagged.
Addressing this problem is particularly important for the discovery of new items
such as recently released music pieces in a social music recommendation system.
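Because a track can carry several tags at once, autotagging is naturally cast as multi-label classification. The sketch below (assuming scikit-learn, with hypothetical placeholder features and tag annotations) trains one binary classifier per tag; it only illustrates the formulation and is not any particular published autotagger.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X = np.random.rand(300, 40)                   # placeholder audio features
tags = [["jazz", "mellow"], ["rock"], ["saxophone", "jazz"]] * 100

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)             # one binary column per tag

autotagger = OneVsRestClassifier(LogisticRegression(max_iter=1000))
autotagger.fit(X, Y)

probs = autotagger.predict_proba(np.random.rand(1, 40))[0]
predicted = [t for t, p in zip(binarizer.classes_, probs) if p > 0.5]
print(predicted)                              # predicted tags for a new track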
2.2.6 Other
Audio classification techniques have also been applied to a variety of other areas
in music information retrieval and more generally audio signal processing. For
example they can be used to detect which parts of a song contain vocals (singing
identification) or detect the gender of a singer. They can also be applied to the
classification/tagging of sound effects for movies and games (such as door knocks,
gun shots, etc). Another class of audio signals to which similar analysis and
classification techniques can be applied are bioacoustic signals, which are the sounds
animals use for communication. There is also work on applying classification
techniques on symbolic representations of music.
2.3 Rhythm, Pitch and Music Transcription
Music is perceived and represented at multiple levels of abstraction. At one
extreme, an audio recording appears to capture every detailed nuance of a music performance for posterity. However, even in this case this is an illusion. Many
details of a performance are lost such as the visual aspects of the performers and
musician communication or the intricacies of the spatial reflections of the sound.
At the same time, a recording does capture more concrete, precise information
than any other representation. At the other extreme a global genre categorization
of a music piece such as saying this is a piece of reggae or classical music reduces
a large group of different pieces of music that share similar characteristics to a single word. In between these two extremes there is a large number of intermediate
levels of abstraction that can be used to represent the underlying musical content.
One way of thinking about these abstractions is as different representations
that are invariant to transformations of the underlying music. For example most
listeners, with or without formal musical training, can recognize the melody of
Twinkle, Twinkle Little Star (or some melody they are familiar with) independently of the instrument it is played on, how fast it is played, or what the starting pitch is. Somehow the mental representation of the melody is not affected
by these rather drastic changes to the underlying audio signal. In music theory a
common way of describing music is based on rhythm and pitch information. Furthermore, pitch information can be abstracted as melody and harmony. A western
common music score is a comprehensive set of notation conventions that represent
what pitches should be played, when they should be played and for how long,
when a particular piece of music is performed. In the following subsections we
describe some of the MIR tasks that can be formulated as automatically extracting
a variety of representations at different levels of musical abstraction.
2.3.1 Rhythm
Rhythm refers to the hierarchical repetitive structure of music over time. Although
not all music is structured rhythmically, a lot of music in the world is and has
patterns of sounds that repeat periodically. These repeating patterns are typically
formed on top of an underlying conceptual sequence of semi-regular pulses that are
grouped hierarchically.
Automatic rhythm analysis tasks attempt to extract different kinds of rhythmic
information from audio signals. Although there is some variability in terminology
there are certain tasks that are sufficiently well-defined and for which several
algorithms have been proposed.
Tempo induction refers to the process of extracting an estimate of the tempo
(the frequency of the main metrical level, i.e., the frequency at which most humans would
tap their foot when listening to the music) for a track or excerpt. Tempo induction
is mostly meaningful for popular music with a relatively steady beat that stays
the same throughout the audio that is analyzed. Beat tracking refers to the more
involved process of identifying the time locations of the individual beats and is
still applicable when there are significant tempo changes over the course of the
track being analyzed.
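A common way to approach tempo induction is to look for periodicity in an onset strength signal, for example through autocorrelation. The Python sketch below illustrates the idea under simplifying assumptions (a precomputed onset strength envelope at a known frame rate and a 60-180 BPM search range); it is a toy illustration rather than a complete tempo induction algorithm.

import numpy as np

def estimate_tempo(onset_strength, frame_rate, bpm_range=(60, 180)):
    # Autocorrelation of the (mean-removed) onset strength envelope
    x = onset_strength - onset_strength.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lags = np.arange(1, len(acf))
    bpms = 60.0 * frame_rate / lags
    valid = (bpms >= bpm_range[0]) & (bpms <= bpm_range[1])
    best_lag = lags[valid][np.argmax(acf[1:][valid])]
    return 60.0 * frame_rate / best_lag

# Hypothetical onset strength signal with a pulse every 0.5 s (i.e. 120 BPM)
frame_rate = 100.0
env = np.zeros(1000)
env[::50] = 1.0
print(estimate_tempo(env, frame_rate))        # prints 120.0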
In addition to the two basic tasks of tempo induction and beat tracking there
are additional rhythm-related tasks that have been investigated to a smaller extent.
Meter estimation refers to the process of automatically detecting the grouping
of beat locations into larger units called measures such as 4/4 (four beats per
measure) or 3/4 (3 beats per measure). A related task is the automatic extraction
of rhythmic patterns (repeating sequences of drum sounds) typically found in
modern popular music with drums. Finally drum transcription deals with the
extraction of a “score” notating which drum sounds are played and when in a piece
of music. Early work focused on music containing only percussion [?] but more
recent work considers arbitrary polyphonic music. There is also a long history
of rhythmic analysis work related to these tasks performed on symbolic representations (mostly MIDI) in many cases predating the birth of MIR as a research
field.
2.3.2 Melody
Melody is another aspect of music that is abstract and invariant to many transformations. Several tasks related to melody have been proposed in MIR. The result
of monophonic pitch extraction is a time-varying estimate of the fundamental frequency of a musical instrument or human voice. The output is frequently referred
to as a pitch contour. Monophonic pitch transcription refers to the process of
converting the continuous-valued pitch contour to discrete notes with a specified
start time and duration. Figure ?? illustrates these concepts visually. When the music is
polyphonic (more than one sound source/music instrument is present) then the related problem is termed predominant melody extraction. In many styles of music
and especially modern pop and rock there is a clear leading melodic line typically
performed by a singer. Electric guitars, saxophones, trumpets also frequently
perform leading melodic lines in polyphonic music. Similarly to monophonic
melody extraction in some cases a continuous fundamental frequency contour is
extracted and in other cases a discrete set of notes with associated onsets and
offsets is extracted.
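As a simple illustration of monophonic pitch extraction, the following Python sketch estimates the fundamental frequency of a single frame of audio using the autocorrelation method. Real pitch trackers add many refinements (voicing decisions, octave-error handling, smoothing of the contour over time); this is only a minimal example that assumes a clean periodic input frame.

import numpy as np

def autocorr_pitch(frame, sr, fmin=80.0, fmax=1000.0):
    # Estimate the fundamental frequency (in Hz) of one audio frame
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                  # shortest period considered
    lag_max = int(sr / fmin)                  # longest period considered
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return sr / best_lag

sr = 44100
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440.0 * t)         # a 440 Hz test tone
print(autocorr_pitch(frame, sr))              # close to 440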
A very important and not at all straightforward task is melodic similarity. Once these melodic representations have been computed, the task is to compare them with each other so that melodies that would be perceived by humans as associated with the same piece of music, or as similar, will be automatically
detected as such. The reason this task is not so simple is that there is an enormous
amount of variations that can be applied to a melody while it still maintains
its identity. These variations range from the straightforward transformations of
pitch transposition and changing tempo to complicated aspects such as rhythmic
elaboration and the introduction of new pitches.
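One simple way to obtain part of the required invariance is to represent a melody by its sequence of pitch intervals (differences between consecutive MIDI note numbers), which is unaffected by transposition, and then compare two such sequences with an edit distance. The Python sketch below illustrates the idea; it ignores rhythm and the many other factors that practical melodic similarity systems must handle.

def intervals(midi_notes):
    # Transposition-invariant representation: consecutive pitch differences
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

def edit_distance(a, b):
    # Classic dynamic-programming edit distance between two sequences
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(a)][len(b)]

melody = [60, 60, 67, 67, 69, 69, 67]          # opening of Twinkle, Twinkle
transposed = [n + 2 for n in melody]           # same melody two semitones higher
print(edit_distance(intervals(melody), intervals(transposed)))   # 0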
[8]
XXX Make figure of melodic variations XXX
2.3.3 Chords and Key

Chord Detection, Key Detection/tonality
2.4 Music Transcription and Source Separation
Ultimately music is an organized collection of individual sound sources. There
are several MIR tasks that can be performed by characterizing the music signal globally and statistically, without characterizing each individual sound source separately. At the same time it is clear that as listeners we are able to focus on
individual sound sources and extract information from them. In this section we
review different approaches to extracting information for individual sound sources
that have been explored in the literature.
aft
Source separation refers to the process of taking as input a polyphonic mixture
of individual sound sources and producing as output the individual audio signals
corresponding to each individual sound source. The techniques used in this task
have their origins in speech separation research which frequently is performed
using multiple microphones for input. Blind sound separation refers ... .
blind source separation, computational auditory scene analysis, music understanding

2.5 Query-by-humming
In query by humming (QBH) the input to the system is a recording of a user
singing or humming the melody of a song. The output is an audio recording
that corresponds to the hummed melody selected from a large database of songs.
Earlier systems constrained the user input, for example by requiring the user to
sing using fixed syllables such as La, La, La or even just the rhythm (query by tapping), with the matching being performed against small music collections. Current systems
have removed these restrictions accepting as input normal singing and performing
the matching against larger music collections. QBH is particularly well suited to smartphones. Some related tasks that have also been explored are query by beat boxing
in which the input is a vocal rendering of a particular rhythmic pattern and the
output is either a song or a drum loop that corresponds to the vocal rhythmic
pattern.
[?]
2.6 Symbolic Processing
representations [37]
searching for motifs/themes, polyphonic search
2.7 Segmentation, Structure and Alignment

[?] [32]
thumbnailing, summarizing

2.8 Watermarking, fingerprinting and cover song detection

2.9 Connecting MIR to other disciplines
The most common hypothetical scenario for MIR systems is the organization and
analysis of large music collections of either popular or classical music for the
average listener. In addition there has been work in connecting MIR to other
disciplines such as digital libraries [5, 26] and musicology [13].
2.10 Other topics
Computational musicology and ethnomusicology, performance analysis such as
extracting vibrato from monophonic instruments [9], optical music recognition
[?], digital libraries, standards (musicXML), assistive technologies [36], symbolic
musicology, hit song science
Part I

Fundamentals
Chapter 3

Audio representations

Is it not strange that sheep’s guts should hale souls out of men’s bodies?
Shakespeare
The most basic and faithful way to preserve and reproduce a specific music
performance is as an audio recording. Audio recordings in analog media such
as vinyl or magnetic tape degrade over time eventually becoming unusable. The
ability to digitize sound means that extremely accurate reproductions of sound
and music can be retained without any loss of information stored as patterns
of bits. Such digital representations can theoretically be transferred indefinitely
across different physical media without any loss of information. It is a remarkable
technological achievement that all sounds that we can hear can be stored as an
extremely long string of ones and zeroes. The automatic analysis of music stored
as a digital audio signal requires a sophisticated process of distilling information
from an extremely large amount of data. For example a three minute song stored
as uncompressed digital audio is represented digitally by a sequence of almost
16 million numbers (3 (minutes) * 60 (seconds) * 2 (stereo channels) * 44100
(sampling rate)) or 16 * 16 million = 256 million bits. As an example of extreme
distilling of information in the MIR task of tempo induction these 16 million
numbers need to somehow be converted to a single numerical estimate of the
tempo of the piece.
In this chapter we describe how music and audio signals in general can be
represented digitally for storage and reproduction as well as describe audio representations that can be used as the basis for extracting interesting information from
these signals. We begin by describing the process of sound generation and our
amazing ability to extract all sorts of interesting information from what we hear.
In order for a computer to analyze sound, it must be represented digitally. The
most basic digital audio representation is a sequence of quantized pulses in time
corresponding to the discretized displacement of air pressure that occurred when
the music was recorded. Humans (and many other organisms) make sense of their
auditory environment by identifying periodic sounds with specific frequencies. At
a very fundamental level music consists of sounds (some of which are periodic)
that start and stop at different moments in time. Therefore representations of
sound that have a separate notion of time and frequency are commonly used as
the first step in audio feature extraction and are the main topic of this chapter. We
start by introducing the basic idea of a time-frequency representation. The basic
concepts of sampling and quantization that are required for storing sound digitally
are then introduced. This is followed by a discussion of sinusoidal signals which
are in many ways fundamental to understanding how to model sound mathematically. The discrete Fourier transform (DFT) is one of the most fundamental and
widely used tools in music information retrieval and more generally digital signal
processing. Therefore, its description and interpretation forms a large part of this
chapter. Some of the limitations of the DFT are then discussed in order to motivate
the description of other audio representations such as wavelets and perceptually
informed filterbanks that conclude this chapter.
3.1 Sound production and perception
Sound and therefore music are created by moving objects whose motion results
in changes in the surrounding air pressure. The changes in air pressure propagate through space and if they reach a flexible object like our eardrums or the
membrane of a microphone they cause that flexible object to move in a similar
way that the original sound source did. This movement happens after a short time
delay due to the time it takes for the “signal” to propagate through the medium
(typically air). Therefore we can think of sound as either time-varying air-pressure
waves or time-varying displacement of some membrane like the surface of a drum
or the diaphragm of a microphone. Audio recordings capture and preserve this
time-varying displacement information and enable reproduction of the sound by recreating a similar time-varying displacement pattern by means of a loudspeaker
with an electrically controlled vibrating membrane. The necessary technology
only became available in the past century and caused a major paradigm shift
in how music has been created, distributed and consumed [?, ?]. Depending
on the characteristics of the capturing and reproduction system a high-fidelity
reproduction of the original sound is possible.
Extracting basic information about music from audio recordings is relatively
easy for humans but extremely hard for automatic systems. Most non-musically
trained humans are able, just by listening to an audio recording, to determine a variety of interesting pieces of information about it such as the tempo of the piece,
whether there is a singing voice, whether the singer is male or female, and what is
the broad style/genre (for example classical or pop) of the piece. Understanding
the words of the lyrics, despite all the other interfering sounds from the music
instruments playing at the same time, is also impressive and beyond the capabilities of any automatic system today. Building automatic systems to perform these
seemingly easy tasks turns out to be quite challenging and involves some of the
state-of-the-art algorithms in digital signal processing and machine learning. The
goal of this chapter is to provide an informal, and hopefully friendly, introduction
to the underlying digital signal processing concepts and mathematical tools of
how sound is represented and analyzed digitally.
The auditory system of humans and animals has evolved to interpret and
understand our environment. Auditory Scene Analysis is the process by which the
auditory system builds mental descriptions of complex auditory environments by
analyzing sounds. A wonderful book about this topic has been written by retired
McGill psychologist Albert Bregman [14] and several attempts to computationally
model the process have been made [?]. Fundamentally the main task of auditory
perception is to determine the identity and nature of the various sound sources
contributing to a complex mixture of sounds. One important information cue
is that when nearly identical patterns of vibration are repeated many times they
are very likely to originate from the same sound source. This is true both for
macroscopic time scales (a giraffe walking, a human paddling) and microscopic
time scales (a bird call, the voice of a friend). In the microscopic time scales this
perceptual cue becomes so strong that rather than perceiving individual repeating
sounds as separate entities we fuse them into one coherent sound source giving
rise to the phenomenon of pitch perception. Sounds such as those produced by
most musical instruments and the human voice are perceived by our brains as
coherent sound producing objects with specific characteristics and identity rather
than many isolated repeating vibrations in sound pressure. This is by no means
a trivial process as researchers who try to model this ability to analyze complex
auditory scenes computationally have discovered [?].
In order to analyze music stored as a recorded audio signal we need to devise
representations that roughly correspond to how we perceive sound through the
auditory system. At a fundamental level such audio representations will help
determine when things happen (time) and how fast they repeat (frequency). Therefore the foundation of any audio analysis algorithm is a representation that is
organized around the “dimensions” of time and frequency.
3.2 Sampling, quantization and the time-domain
The changes in displacement over time that correspond to sound waves are continuous. Originally recordings stored these changes on an analog medium (for
example by having a needle engrave a groove) preserving their continuous nature.
In order for computers to process audio signals these continuous signals have
to be converted into sequences of numbers that are evenly spaced in time and
are discrete. There are two steps in that process: sampling and quantization.
Sampling corresponds to measuring the continuous values of the signal at discrete
instances of time. For example one can “sample” the temperature during the
course of a day by recording the value of the temperature every hour using a
thermometer. We would then have 24 measurements per day. When sampling
sound waves using a computer typically 44100 “measurements” are taken every
second. These continuous measurements need to be turned into a discrete representation suitable for digital storage and transmission. This process is called
quantization and it is typical to use 16 bits (i.e., 2^16 = 65536 possible levels)
to represent each measurement. The sampling process is characterized by the
sampling rate measured in Hertz (for example 44100 Hz) and the bit depth (the
number of bits used to represent each measurement for example 16 bit).
The process of sampling and quantization is far from trivial but explaining it is
beyond the scope of this book. A fascinating result is that the original continuous
signal can be perfectly reconstructed from a corresponding sequence of digital
samples provided that the sampling rate is high enough for the signal of interest. In
modern computers the process of sampling and quantization and the inverse process of generating a continuous signal is performed with hardware circuits called
analog-to-digital converters (ADC) and digital-to-analog converters (DAC). For
our purpose the audio signal will be represented as a very long sequence of floating
point numbers (between −1 and 1) that we can notate as follows:
x[n],   n = 0, . . . , N − 1        (3.1)

Figure 3.1: Sampling and quantization
Figure 3.1 shows a continuous sine wave signal in blue and the corresponding
sampled values as vertical lines. The height of each line has to be encoded
digitally so a discrete set of quantization steps needs to be utilized to represent
each value.
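The following Python sketch makes the two steps concrete: it samples a continuous 440 Hz sine wave at 44100 Hz and then quantizes each sample to 16 bits. The particular signal and duration are arbitrary choices for illustration.

import numpy as np

sr = 44100                                    # sampling rate in Hz
duration = 0.01                               # 10 milliseconds
t = np.arange(int(sr * duration)) / sr        # discrete sampling instants

# Sampling: evaluate the continuous signal at the discrete instants
x = 0.8 * np.sin(2 * np.pi * 440.0 * t)       # values between -1 and 1

# Quantization: map each sample to one of 2^16 levels (16-bit depth)
x_quantized = np.round(x * 32767).astype(np.int16)

# Back to floating point numbers between -1 and 1 for further processing
x_float = x_quantized / 32767.0
print(x.shape, x_quantized.dtype)             # (441,) int16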
3.3 Sinusoids and frequency
Sinusoidal signals are one of the most fundamental abstractions required for understanding how to represent sound digitally as a long series of numbers as well as
a way to model mathematically the concept of frequency. In addition they are also
fundamental in understanding the mathematical concepts and notation needed for
analyzing and manipulating audio signals so we start our exposition by discussing
sinusoids.
Most readers of this book probably have a rough intuitive understanding of
a spectrum as a way of measuring/showing the amount of different frequencies
present in a signal. For example we know what the effect of using a graphic equalizer to boost low frequencies is. At the same time if someone is not familiar
with digital signal processing then equations such as the definition of the Fourier
transform:
X(f) = ∫_{−∞}^{+∞} x(t) e^{−j2πft} dt        (3.2)
are intimidating and hard to interpret. Frequently even students that have taken courses on the topic end up manipulating the symbols mechanically without a deeper understanding of the underlying concepts. In the following subsections we will attempt to explain the fundamental ideas underlying spectral analysis and the notation used to express them.

3.3.1 Sinusoids and phasors
Sinusoidal signals are functions of a particular shape that have some rather remarkable properties. For example one of the most fundamental ideas in digital
signal processing is that any sound we hear can be represented as a summation
of elementary sinusoidal signals. We motivate sinusoidal signals through several
different complementary ways of viewing them:
• As solutions to the differential and partial differential equations describing
the physics of simple vibrating sound generating systems and as good approximations of more complex vibrating systems.
• As the only family of signals that pass in some way unchanged through
linear time-invariant systems (LTI). More specifically passing a sine wave
of a particular frequency through a LTI system will result in a sine wave of
the same frequency, possibly scaled and shifted in phase, i.e., the basic “shape”
of the signal remains the same. Many of the systems involved in the production, propagation and sensing of sound can be modeled with reasonable
accuracy as linear time-invariant systems and understanding their response
to sine waves can be very informative about what these systems do.
• As phasors (rotating vectors) they form an expressive notation that helps
intuition and understanding of many fundamental digital signal processing
concepts.
• As the basis functions of the Fourier Transform which can be used to represent any periodic signal as a summation of elementary sinusoidal signals.
The term sinusoid is used to describe a family of elementary signals that have
a particular shape/pattern of repetition. The sine wave x(t) = sin(ωt) and the
cosine wave x(t) = cos(ωt) are particular examples of sinusoids that can be
described by the more general equation:
x(t) = sin(ωt + φ)        (3.3)
where ω is the frequency and φ is the phase. There is an infinite number of continuous periodic signals that belong to the sinusoid family but essentially they all have
the same shape. Each particular member is characterized by three numbers: the
amplitude which encodes the maximum and minimum value of the oscillation, the
frequency which encodes the period of repetition, and the phase which encodes
the initial position at time t = 0. Figure 3.2 shows two sinusoid signals over a
time interval of 3 seconds. The first signal (S1) has an amplitude of 7 in arbitrary
units, a frequency of 2 Hz, and a phase equal to 0. The second signal (S2) has an amplitude of 4, a frequency of 3 Hz, and a phase equal to π/4. The two sinusoids are also shown superimposed (S1,S2) and their summation is also plotted (S1+S2).

Figure 3.2: Simple sinusoids
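The two signals of Figure 3.2 are easy to generate and plot, for example with the short Python sketch below; the amplitudes, frequencies and phases are simply the values quoted above.

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 3, 3000)                      # 3 seconds of densely sampled time
s1 = 7 * np.sin(2 * np.pi * 2 * t)               # amplitude 7, 2 Hz, phase 0
s2 = 4 * np.sin(2 * np.pi * 3 * t + np.pi / 4)   # amplitude 4, 3 Hz, phase pi/4

for signal, label in [(s1, "S1"), (s2, "S2"), (s1 + s2, "S1+S2")]:
    plt.plot(t, signal, label=label)
plt.legend()
plt.xlabel("time (seconds)")
plt.show()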
3.3.2 The physics of a simple vibrating system
Music consists of a mixture of sounds, many of which are periodic and are produced by musical instruments. Sound is generated by vibrations, so a good place to start is one of the simplest systems in physics that can vibrate. Let's consider striking a metal rod such as the tine of a tuning fork, or hitting a ruler that is clamped with a hand on a table. The ruler deforms when it is struck, then a force briefly restores it to its original shape. However it still has inertia and it overshoots the resting position by deforming in the opposite direction. This oscillatory pattern repeats, causing the surrounding air molecules to move, and the resulting air pressure waves reach our ears as sound. The vibration at any time is caused by the balance between two factors: the force that tends to restore the shape to equilibrium, and the inertia that tends to make it overshoot. This type
of motion is called simple harmonic motion and the equation describing it can be
derived from simple physics.
At any time t during the vibration Newton’s second law of motion applies:
the restoring force produces an acceleration proportional to that force. The
restoring force caused by the deformation tries to restore the ruler to its original
position therefore acts in a direction opposite to the displacement x and for small
amounts of deformation its relation to the displacement can be considered linear.
This can be written as follows:
F = ma = −kx        (3.4)
Because acceleration is the second derivative with respect to the time variable
t we can rewrite this equation as:
d²x/dt² = −(k/m)x        (3.5)
where m is the mass, k is the stiffness, x is the displacement from the resting
position and t is time. Now we need to figure out a particular function or family
of functions that satisfy the equation above. From calculus we know that the
derivatives and second derivatives of the sine and cosine functions of t are also
sinusoids:
d/dt sin(ωt) = ω cos(ωt),    d²/dt² sin(ωt) = −ω² sin(ωt)        (3.6)

d/dt cos(ωt) = −ω sin(ωt),    d²/dt² cos(ωt) = −ω² cos(ωt)        (3.7)
where ω is the frequency of oscillation. The only difference between the functions
sin(ωt) and cos(ωt) is that they differ by a delay/time shift but essentially they
have the same shape. The term sinusoid will be used when it is not important
to distinguish between sine and cosine. As can be seen both of these functions
of t satisfy equation 3.5 that characterizes simple harmonic motion. It is also
straightforward to show that the equation is satisfied by any sinusoid with an
arbitrary phase φ of the following form:
x(t) = sin(ωt + φ)        (3.8)
Therefore sinusoidal signals arise as solutions to equations that characterize simple (or simplified) sources of sound vibration such as a tuning fork.
3.3.3 Linear Time Invariant Systems
A general system takes as input one or more input signals and produces as output one or more output signals. We will focus on single input/single output
(SISO) systems as they can be used for many operations on audio signals. In
addition more complicated MIMO (multiple-input, multiple-output) systems can
be assembled by connecting elementary SISO systems together in processing
networks. A very important class of SISO systems are the Linear Time-Invariant
(LTI) systems because they can be used to model many physical processes and
the LTI property makes them easier to handle mathematically. Let’s use the
following notation: y(t) is the system output, S is the system and x(t) is the
system input. Linearity means that one can calculate the output of the system
to the sum of two input signals by summing the system outputs for each input
signal individually. Formally, if y1(t) = S{x1(t)} and y2(t) = S{x2(t)} then S{x1(t) + x2(t)} = ysum(t) = y1(t) + y2(t). Time invariance (or shift invariance in
discrete systems) means that the output of the system given an input shifted in time
is the same as the output of the system of the original unshifted input signal shifted
by the same amount. Formally, if y(t) = S[x(t)] then S[x(t − t0)] = y(t − t0),
where t0 is the amount of shifting in time. As we will see shortly, we can frequently
express complicated signals of interest as linear combinations of more elementary
signals. When processing such complicated signals through an LTI system we can
analyze its response simply by looking at how it responds to the elementary signals
and then combining these individual responses.
Now comes the interesting part which is what happens when a sinusoidal
signal is used as input to a LTI system. It turns out that no matter what LTI system
is used, the output will not only be a sinusoidal signal but will also have the same frequency as the original input sinusoid. By extension, if the input to an LTI
system is a linear combination of sinusoidal components of certain frequencies,
no new frequencies will be introduced in the output. Therefore we can completely
characterize a LTI system by how it responds to elementary sinusoidal signals.
Several processes in sound generation and perception can be modelled by LTI
systems. Examples include the resonant cavities of musical instruments such as
the body of a guitar, parts of the human hearing process such as the effect of
the outer ear, and acoustic spaces such as a concert hall. So far we have motivated
sinusoidal signals from physics of simple vibrations and from LTI systems. Now
we look into the mathematical notation and concepts that we will need.
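This property is easy to verify numerically. The Python sketch below passes a sampled sinusoid through a simple LTI system (a five-point moving average filter, chosen arbitrarily for illustration) and confirms that the output is still a sinusoid of the same frequency, only scaled in amplitude and shifted in phase.

import numpy as np

sr = 1000                                     # samples per second
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 50 * t)                # 50 Hz input sinusoid

h = np.ones(5) / 5                            # impulse response of a moving average
y = np.convolve(x, h, mode="same")            # output of the LTI system

# The dominant frequency of the output is the same as that of the input
freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
print(freqs[np.argmax(np.abs(np.fft.rfft(x)))])   # 50.0
print(freqs[np.argmax(np.abs(np.fft.rfft(y)))])   # 50.0 (scaled, same frequency)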
3.3.4 Complex Numbers
A good intuitive understanding of sinusoidal signals and of the mathematical notation conventions used to represent and manipulate them is important in order
to understand how time-frequency representations work. The notation conventions typically used are based on complex numbers which can intimidate readers
without a background in digital signal processing and mathematics. Thinking of
sinusoidal signals as vectors rotating at a constant speed on the plane, rather than
as single-valued signals that go up and down will help us demystify the complex
number notation as well as explain interesting properties geometrically
rather than algebraically.
Figure 3.3: Cosine and sine projections of a vector on the unit circle (XXX Figure needs to be redone)
Figure 3.3 shows how the projection of a vector on the unit circle with angle
θ onto the x-axis is the cosine of the angle and the projection onto the y-axis is
the sine of the angle. If that vector rotates at constant speed and we plot the x and
y axis projections over time the resulting single-valued waveforms are a cosine
wave and a sine wave respectively. So we can think of any sine wave as a rotating
clock hand or the spoke of a bicycle wheel.
Complex numbers provide an elegant notation system for manipulating rotating vectors. Any vector with coordinates x and y is notated as the complex number
x + jy. The x part is called the real part and the part with the weird j factor is called the imaginary part. Typically j is defined as √−1. This somewhat strange
looking definition makes much more sense if we think of j as meaning rotate by π/2
(counter-clockwise). Starting from the real number 1 (with 0 imaginary part) two
successive rotations bring us to the real number −1, hence j² = −1. This should
make it clear that there is nothing imaginary or complex about j if it is considered
as a rotation.
In terms of addition complex numbers behave exactly as vectors. The x-coordinate of the sum of two vectors is the sum of the x-coordinates and the y-coordinate of the sum of two vectors is the sum of the y-coordinates. Similarly
the real part of the sum is the sum of the real parts and the imaginary part of the
sum is the sum of the imaginary parts. The expressive power of complex numbers
shows up when they are multiplied. We define multiplication using standard
algebra and taking care to replace j² with −1 whenever we need to. Suppose
we want to multiply the complex number a + jb with the complex number c + jd.
The result is the following complex number:
(a + jb) ∗ (c + jd) = ac + jbc + jad + j²bd = (ac − bd) + j(ad + bc)        (3.9)
Notice that the use of j is simply a convenient notation so that we can manipulate
expressions of complex numbers using our familiar addition and multiplication
skills for real numbers. When writing computer code j is not needed and we
could write the definition directly as:
S.re = A.re * B.re - A.im * B.im   // real part: ac - bd
S.im = A.re * B.im + B.re * A.im   // imaginary part: ad + bc
In order to understand the effect of complex number multiplication we need
to think of complex numbers as vectors in polar form. The length of the complex
number c = a + jb is called the magnitude, is written as |c|, and is the Euclidean
distance from the origin interpreting a and b as the x and y coordinates:
R = |c| = √(a² + b²)        (3.10)
The angle with the real axis is often called its argument and is given by:
θ = ARG(c) = arctan(b/a)        (3.11)
Based on these definitions a complex number can be notated as R∠θ. You can visualize this as scaling the unit vector on the real axis by R and then rotating the result counter-clockwise by angle θ in order to get to a particular point on the 2D plane. As examples, the complex number 1 + j0 is 1∠0 and the complex number 0 + j1 is 1∠π/2. We can easily go back to cartesian coordinates based on
the relations shown in Figure 3.3.
a = R cos θ,    b = R sin θ        (3.12)
When complex numbers are represented in polar form their multiplication has a clear geometric interpretation. The magnitudes are multiplied and the angles are added, i.e., the product of R1∠θ1 and R2∠θ2 is R1R2∠(θ1 + θ2). Visually, multiplication by a complex number is simply a rotation and scaling operation. I hope that after reading this section you can see that complex numbers are not as complex if viewed from the right angle.
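A few lines of Python (whose built-in complex type writes j as 1j) can be used to check the rotate-and-scale interpretation; the particular magnitudes and angles below are arbitrary.

import cmath

c1 = 2 * cmath.exp(1j * cmath.pi / 6)         # magnitude 2, angle 30 degrees
c2 = 3 * cmath.exp(1j * cmath.pi / 3)         # magnitude 3, angle 60 degrees
product = c1 * c2

print(abs(product))                           # 6.0 (magnitudes multiply)
print(cmath.phase(product))                   # about pi/2 = 1.5707... (angles add)
print(1 * 1j)                                 # multiplying by j rotates 1 onto j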
3.3.5 Phasors

Let's return to our representation of sinusoids as rotating vectors with constant
speed and try to use what we have learned about complex numbers for expressing
them mathematically. A complex sinusoid can be simply notated as:
cos(ωt) + j sin(ωt)        (3.13)
where ω is the frequency in radians per second. Every time the time t changes, the rotating vector moves around the circle. For example, when the change is 2π/ω the argument ωt changes by 2π radians and the sinusoidal signal goes through one cycle.
Another way to view this process is that the rotating vector that represents a
sinusoid is just a single fixed complex number raised to progressively higher and
higher powers as time goes by. As we have seen complex number multiplication
can be interpreted as a rotation so raising a complex number to an integer power
corresponds to moving the phasor around the circle in discrete steps. By making
the power continuous and a function of time we can model a continuous rotation.
More specifically we can notate our rotating phasor as a function of continuous
angle (assuming unit magnitude):
E(θ) = C^θ = cos θ + j sin θ = 1∠θ        (3.14)

where

θ = ωt        (3.15)
We can easily calculate the derivative of this complex function E(θ) with respect
to θ:
dE(θ)/dθ = −sin θ + j cos θ = jE(θ)        (3.16)
The function whose derivative equals j times itself is the exponential, so E(θ) = e^{jθ}, which with θ = ωt gives:

e^{jωt} = cos(ωt) + j sin(ωt)        (3.17)
Hopefully after this exposition the complex number notation used in the definition of various frequency transforms will not look as intimidating. Another
possible viewpoint is to simply consider the use of exponentials as a notation
convention that allows us to express complex number multiplication using the
regular rules we expect from exponents. Using this convention the multiplication
of two complex numbers c1 and c2 can be written as:
c1 · c2 = R1 e^{jθ1} R2 e^{jθ2} = R1R2 e^{j(θ1+θ2)}        (3.18)
We end this subsection by returning to the geometric view of phasors as vectors rotating at constant speed and using it to illustrate intuitive explanations of
some important properties of sinusoidal signals.
XXX geometric phasor descriptions of negative frequency and cancelling of
the imaginary part, adding sinusoids of the same frequency but with different
amplitude and phase, aliasing and associated figures XXX
3.3.6 Time-Frequency Representations
A large number of organisms including humans are capable of both producing and
detecting sound (variations in air pressure). The two most important questions
a listener needs to answer regarding any sound are when it happened and what
produced it. One important cue about the origin of a sound is the rate of the
periodic variation in air pressure. For example it is unlikely (actually physically impossible) that a loud, low rumbling sound will come from a tiny bird. Therefore
it is hardly surprising that most animal auditory systems are structured so that at
the most basic level they can differentiate between sounds happening at different
times and having different rates of vibration (frequencies). Without going into
details in most mammalian auditory systems sounds with different frequencies
excite different groups of neurons (inner hair cells) attached to a membrane (the
basilar membrane) that performs a type of frequency analysis; a given place along
the basilar membrane responds more strongly to a sinusoidal vibration of a particular frequency. At the same time a given place along the basilar membrane will
also react to other frequencies to a lesser degree and further mechanisms related to
timing are used to better resolve these frequencies. Furthermore various structural
features of the auditory cortex (the part of the brain that deals with understanding
sound) are “tonotopically” organized. When excited by a sine wave of increasing
frequency the excitation on the cortex moves continuously along a curved path.
So there is something fundamental about frequency in the response of the brain to
sound, right up to the highest level, the auditory cortex.
3.3.7 Frequency Domain Representations
A frequent and elegant approach to representing audio signals is to express them
as weighted mixtures of elementary building blocks (basis functions) which can
be thought of as simple prototypical signals. Then the representation essentially
consists of the weights of the mixture. Audio signal representations computed
using the same set of basis functions can be then analyzed and compared simply
by considering the weights of the mixture. A common strategy in digital signal processing is to take advantage of superposition. In superposition the signal being processed is decomposed into simple components, each
component is processed separately and the results are reunited. Linear systems
are typically required for superposition to work. As we have already discussed in
order to extract information from music audio signals the audio representation
must somehow encode time and frequency. By analyzing the audio signal in
small chunks the representation becomes dependent on time and by utilizing basis
functions that correspond to different frequencies the weights of the mixture encode
frequency.
There are four commonly used variants of frequency domain representations:
the Fourier series, the Discrete Fourier Transform (DFT), the z-transform, and the
classical Fourier transform. The intuition behind all of them is similar. The main
idea is to express the signals of interest as weighted mixtures of elementary signals; as you probably expect these will be some form of sinusoidal signals, i.e., phasors.
Armed with our knowledge of sinusoidal signals and how they can be viewed as
phasors and notated using complex numbers we are ready to understand some of
the most fundamental concepts in digital signal processing.
Coordinate systems
The weights of a mixture of elementary signals can be viewed as coordinates in a
signal space spanned by these elementary signals in an analogous way to how the
spatial coordinates of a point in space can be viewed as the weights that are needed
to combine the unit-length basis vectors corresponding to each axis. For example
a vector ν in three-dimensional space can be written as:
ν = νx x + νy y + νz z        (3.19)
which can be interpreted as the fact that any vector in the three-dimensional space
can be represented as a weighted combination of three parts: a vector in the x-direction of length νx, a vector in the y-direction of length νy and a vector in the z-direction of length νz. The three unit length vectors in the coordinate directions
x, y, z are called a basis for the three dimensional space and the numbers νx , νy , νz
are called the projections of vector ν onto the respective basis elements. We can
denote the projection of a vector ν onto another vector u by ν < ν, u > in which
case:
νx = hν, xi
νy = hν, yi
νz = hν, zi
The projection hν, υi is also called the inner product of ν and υ and can be
defined as the sum of products of like coordinates (when the vector has coordinate
values that are complex numbers the coordinates of the second vector need to be
conjugated):
⟨ν, υ⟩ = νx υx* + νy υy* + νz υz*        (3.20)
or more generally:
⟨ν, υ⟩ = Σi νi υi*        (3.21)
The projection/inner product follows a distributive law and is symmetric which
means that:
⟨ν + υ, ω⟩ = ⟨ν, ω⟩ + ⟨υ, ω⟩        (3.22)

⟨ν, υ⟩ = ⟨υ, ν⟩        (3.23)
Also notice that the projection of a vector onto itself (or equivalently the inner
product of a vector with itself) is the square of the length of the vector (notice
how the complex conjugate in the definition of the inner product is required to
make this true when the coordinates are complex numbers):
⟨ν, ν⟩ = |νx|² + |νy|² + |νz|²        (3.24)
The three coordinate basis vectors x, y, z are orthogonal to each other which
means that their projections onto each other are zero.
⟨x, y⟩ = 0,    ⟨x, z⟩ = 0,    ⟨y, z⟩ = 0
When the basis vectors are mutually orthogonal like this the basis is called an
orthogonal basis. So basically in order to create an orthogonal coordinate system
we need two essential components: a projection operator, and an orthogonal
basis. We are now ready to use these concepts in order to better understand
frequency domain representations which we will define by specifying appropriate
projection operators and orthogonal basis vectors which in this case will be some
form of phasors.
Frequency Domain Representations as coordinate systems
Suppose we want to represent periodic signals that are functions of a continuous
time variable t defined in the range 0 ≤ t ≤ T (think of the signal as repeating
outside this range). Even though it might seem strange we can think of every
possible value of t in the range 0 to T as a coordinate. In this case instead of
having a finite number of dimensions we have an infinite number of them. We can
then generalize the definition of the projection operator (or inner product) to use
a continuous integral:
⟨f, g⟩ = (1/T) ∫_0^T f(t) g*(t) dt        (3.25)
The basis elements should also be periodic signals of a continuous time variable t
and as you probably have guessed they are the phasors that have period T:
e^{jkt2π/T},    k = . . . , −1, 0, 1, 2, . . .        (3.26)
Notice that negative frequencies are included in order to represent real functions of t. Geometrically the phasor for k = 1 will complete one circle in time
T , the phasor for k = 2 will complete two circles, etc. The negative frequencies
correspond to phasors moving at the same speed but clockwise instead of counterclockwise. It is also straightforward to show that this basis is orthogonal. It is
important to notice that using this representation the coordinate system has been
radically transformed from one that is continuously indexed (time) to one that
is discretely indexed (the phasor basis indexed by integers). Although there are
some mathematical restrictions it is remarkable that this works.
Based on these two definitions of the projection operator and basis functions we
can now express any periodic function in terms of the basis:
f(t) = Σ_{k=−∞}^{∞} Ck e^{jkt2π/T}        (3.27)
The complex coefficient Ck can be thought of as the coordinate in the “direction” of the phasor e^{jkt2π/T} or the amount of the frequency k2π/T. Equation 3.27 is called the Fourier Series of f(t) and the values Ck are called the spectrum of the periodic signal f(t). To find the Fourier coefficients Ck we can simply
project f (t) on the kth basis element (having a bad memory I always rely on the
definition of the inner product with the complex conjugate to determine the sign
of the exponent in frequency transforms):
\[
C_k = \langle f(t), e^{j k 2\pi t / T} \rangle = \frac{1}{T} \int_0^T f(t)\, e^{-j k 2\pi t / T}\, dt \tag{3.28}
\]
Notice that when f(t) is real (as is the case for audio signals) there is a very simple relationship between the coefficients for negative and positive frequencies:

\[
C_k = C_{-k}^{*} \tag{3.29}
\]
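To make this concrete, the following Python/NumPy sketch (an addition to the text; the square-wave test signal and the sampling grid are arbitrary choices) numerically approximates the projection of equation 3.28 for a real periodic signal and checks the conjugate symmetry of equation 3.29:

import numpy as np

T = 1.0                                      # period of the signal
t = np.linspace(0.0, T, 1000, endpoint=False)
f = np.sign(np.sin(2 * np.pi * t / T))       # a real periodic signal (square wave)

def fourier_coefficient(f, t, T, k):
    # Numerical approximation of C_k = (1/T) * integral of f(t) exp(-j k 2 pi t / T) dt.
    return np.mean(f * np.exp(-1j * 2 * np.pi * k * t / T))

C = {k: fourier_coefficient(f, t, T, k) for k in range(-5, 6)}

# For a real signal the coefficients satisfy C_k = conj(C_{-k}) (equation 3.29).
for k in range(1, 6):
    assert np.isclose(C[k], np.conj(C[-k]))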
3.3.8 Discrete Fourier Transform
The Fourier Series representation transforms periodic signals in a continuous time domain to an infinite discrete frequency domain. Therefore it is not directly usable when processing digital signals. The transform that is needed is called the Discrete Fourier Transform (DFT). It transforms a finite discrete time domain signal (or an array of numbers in computer parlance) to a finite discrete frequency domain signal (also an array of numbers). The DFT is arguably the most widely used transform in digital signal processing and, by extension, in MIR. In part this is because of a very fast algorithm for computing it called the Fast Fourier Transform (FFT). Even though the technique was first discovered by Gauss in 1805, it became widely used in its current form through the work of Cooley and Tukey. In the published literature the terms DFT and FFT are frequently used interchangeably, but it is good to keep in mind that the DFT refers to the transform whereas the FFT refers to a fast algorithm for computing it. Similarly to how the Fourier Series was presented, we define the DFT by specifying a projection operator and a set of basis functions. As probably expected, the projection operator is the standard inner product between vectors of complex numbers:
\[
\langle x, y \rangle = \sum_t x_t\, y_t^* \tag{3.30}
\]
Due to aliasing, the frequencies of discrete-time signals are equivalent modulo the sampling frequency.
Arguably the most basic elementary basis function is the sampled sinusoid
which forms the basis of the Discrete Fourier Transform and the Short Time
Fourier Transform (STFT). The STFT forms the foundation of most proposed
algorithms for the analysis of audio signals and music. So our description of
audio representations begins with the sound of a sinusoid.
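The connection between the projection operator of equation 3.30 and the DFT can be made concrete with a short Python/NumPy sketch (an addition to the text; the test signal is arbitrary) that computes the transform by projecting the signal onto sampled phasors and compares the result with the FFT:

import numpy as np

def dft(x):
    # Each DFT coefficient is the inner product of the signal with a
    # sampled phasor exp(j 2 pi k n / N), following equation 3.30.
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.cos(2 * np.pi * 3 * np.arange(64) / 64)   # three cycles over 64 samples
X = dft(x)

# The FFT is simply a fast algorithm for computing the same transform.
assert np.allclose(X, np.fft.fft(x))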
3.3.9 Sampling and Quantization

3.3.10 Discrete Fourier Transform

3.3.11 Linearity, propagation, resonances

3.3.12 The Short-Time Fourier Transform

3.3.13 Wavelets
Music-humans-language-mental models-abstraction-better time modeling scales
MPEG-7 standard
“feature” is a distinctive characteristic of the data which signifies something
to somebody.
3.4 Perception-informed Representations

3.4.1 Auditory Filterbanks

3.4.2 Perceptual Audio Compression

3.5 Source-based Representations

LPC

3.6 Further Reading
The theory behind the audio representations used in music information retrieval forms the field of Digital Signal Processing (DSP). There are many excellent DSP books. The following list consists of a few of my own personal preferences rather than any attempt at a comprehensive catalog. The DSP Primer by Ken Steiglitz [?] is my favorite beginner book as it stresses intuition and has a strong emphasis on music. The classic book by Oppenheim and Schafer, Discrete-Time Signal Processing [?], is the essential reference for anyone needing a deeper understanding of signal processing. Other useful texts include Digital Signal Processing: A Practical Guide for Engineers and Scientists by Steven Smith and Understanding Digital Signal Processing by Richard G. Lyons.

Problems

1. Show that a sinusoidal signal with any phase angle satisfies Equation 3.5.

2. Show that the basis functions for the Fourier Series are orthogonal using the appropriate definition of an inner product.

3. Plot the DFT magnitude response in dB (using a DFT size of 2048) of a sine wave whose frequency is exactly equal to the center frequency of DFT bin 600. On the same plot overlay the DFT magnitude response in dB of a sine wave with a frequency corresponding to bin 600.5 (in a sense falling between the cracks of two neighboring frequency bins). Finally, on the same plot overlay the DFT magnitude response in dB of the second sine wave (bin 600.5) windowed by a Hanning window of size 2048. Write 2-3 brief sentences describing and explaining the three plots and the effect of windowing on the magnitude response for these inputs.
Chapter 4

Music Representations
Music is an ephemeral activity and for most of human history it only existed while
it was created. The only way it was represented was in the mental memories of
musicians. Although such mental representations can actually be quite abstract
and effective in terms of retrieval, their workings are a mystery and not easily
understood through introspection. For example, a musician might easily recognize
a tune from a badly rendered melodic fragment sung by an audience member or
be able to play from memory a highly complex piece of music. However in most
cases it is impossible or very hard to describe how these tasks are accomplished.
The use of external symbolic representations to communicate and store information was a major technological advancement in human history. Several writing
systems were developed codifying different types of information the most well
known being logistical/accounting information, concepts/words (hieroglyphics)
and phonemes/sounds (alphabet). Gradually symbolic representations for codifying different aspects of music were developed with western common music
notation being probably the most well known and used example today. Musicians
spend many hours learning and studying how to codify music in scores and how
to perform music written as a score. Even though a full music education is
not required to work on music information retrieval a basic understanding of
how scores codify music is essential in comprehending several tasks that have
been investigated in MIR. This chapter is an overview of the basic music theory,
concepts and abstractions involved when representing music symbolically. For
readers with a formal music background this material is probably familiar and
should be skipped as needed.
Music throughout most of human history was transmitted orally and only existed ephemerally at the time and place it was performed. Language and poetry were similarly orally transmitted until the development of writing, which allowed linguistic information to be preserved over time and space using physical objects such as clay tablets. Cuneiform script is one of the earliest forms of writing and it is perhaps not surprising that one of the earliest examples of music notation is in cuneiform dated to 2000 BC [?].
Over time several cultures developed a variety of music notation systems. To
some extent all these notation systems can be considered as mnemonic aids for
performers differing in the type of information they encode. For example many
notation systems are designed for specific instruments such as guitar tablature
notation which encodes information about what string to play and when. The
structured symbolic representation used to notate music is called a musical score.
The symbols on the score correspond to various types of acoustic events and
encode information about the gestures required to produce them.
Music notation gradually evolved from a simple mnemonic aid, and transformed the way music was produced, consumed and distributed. Through scores
music could be transmitted across space and time. Without notation we would not
have been able to listen to the music written by J.S. Bach today and J. S. Bach
would not have been able to study Italian composers during his time. In fact,
the profession of a composer only became possible with music notation. Finally,
musical scores were and still are instrumental in the field of musicology and music
theory which, among other things, study the process of music composition and the
rules (and exceptions) that govern it.
Music scores can be viewed as a set of instructions to produce a particular musical output. Viewed that way they are similar to computer programs.
This analogy is not as far fetched as it might seem at first glance. Common
music notation has conventions for what are essentially looping constructs and
goto statements. Early mechanical calculators were influenced by the automated
Jacquard looms, which in turn were influenced by player piano mechanisms. So
with some speculation one can consider music notation a proto programming
language making an interesting prelude to the much later development of music
information retrieval techniques.
The “dimensions” of time and frequency are fundamental in understanding
music from cultures from around the world and throughout history. Pitch refers to
the subjective perception of frequency by humans. It is also important to keep in
mind that a periodic sound will have a clear pitch only if it has sufficient duration.
Much of our knowledge of pitch perception comes from psychophysics the psychology discipline that investigates the relationship between physical stimuli and
their subjective perception. For example an important concept in psychophysics
is the idea of “just noticeable difference” (JND), which is the smallest difference in physical stimuli that subjects can reliably detect. In a typical experimental setup for
pitch perception the subject is played two sine waves of different frequencies and
is asked to say whether they are the same or not. Based on similar experiments it
can be shown that the total number of perceptible pitch steps in the range of human
hearing is approximately 1400 (several factors including individual differences,
intensity, duration and timbre can affect this number).
Categorical perception occurs when the continuous stimuli that reach our sense
organs is sorted out by our brains into distinct categories whose members come to
resemble more one another than they resemble members of the other categories.
The canonical example of categorical perception is how we perceive colors. In
physical terms colors differ only in their wavelength which gets shorter across the
spectrum of visible colors. However we perceive discrete categories such as red,
orange and green. Similar categorical perception can be observed in speech and
more importantly for our purposes in the perception of pitch.
Many music cultures utilize some form of division of the pitch continuum in
discrete categories. Some of these divisions are more formalized while others are
more subjective and vary among culture, instrument and performers. An interval
between two sounds is their spacing in pitch. Most humans have categorical
perception of pitch intervals. The identity of a melody (sequence of notes) like
“Twinkle, Twinkle Little Star” is based on a sequence of pitch intervals and is, to
a large extent, not affected by the frequency of the starting note. The most fundamental interval is the “octave” or a doubling in frequency which is recognized
by most music traditions. Tuning systems are formalized theories that describe
how categorical pitch intervals are formed. All tuning systems are affected by the
physics of sound production and perception but they are also affected by music
traditions and tuning peculiarities of the instruments used in those traditions. Pitch
intervals are perceived as ratios of frequencies so that an octave above 440 Hz
(the frequency used today for tuning in most western classical music) would be
880 Hz and an octave above 880 Hz would be 1760 Hz. Most tuning systems can
be viewed as different ways of subdividing the octave (the interval corresponding
to a frequency ratio of two). Tuning systems and their relations with timbre and
acoustics are a fascinating topic however for the purposes of this book we will only
Figure 4.1: Score Examples
consider two well known variants: Pythagorean tuning and Equal Temperament.
More details about tuning systems can be found in Sethares [?].
In this chapter, the basic conventions and concepts used in Western common
music notation are introduced as it is the most widely used notation system today.
The chapter also serves as a compact introduction to basic music theory for readers
who do not have a formal music background. When reading MIR literature it
is common to encounter terms from music theory and have figures with music
score examples so some basic knowledge of music theory is useful. Music theory
and notation are complex topics that can take many years to master and in this
chapter we barely scratch the surface of the subject. I hope that this exposition
motivates you to get some formal music training. This overview of music theory
and notation is followed by a presentation of digital formats for symbolic music
representation.
4.1
Common music notation
Figure 4.1 shows two examples of hand written scores. The score on the left is the
beginning of the well-known prelude in C by J. S. Bach and the one on the right
is an excerpt of the Firebird Suite by Igor Stravinsky. Even though the underlying
music is very different and the manuscripts were written more than 200 years
apart, the notation symbols and conventions are very similar.
4.1.1 Notating rhythm

Figure 4.2: Rhythm notation
Rhythm is encoded using combinations of symbols (circles, dots and stems) that
indicate relative duration in terms of a theoretical regular underlying pulse. In
many music styles there are regular groupings of rhythmic units called measures
and any particular rhythmic event including a regular pulse can be expressed as
a subdivision of the measure. For example a lot of music has 4 subdivisions in
each measure and each one is called a quarter note. A whole note lasts as much
as 4 quarter notes. A finer subdivision is in eighth notes which, as you might guess, last half the duration of a quarter note. Figure 4.2 shows a couple of examples of rhythmic symbols with numbers under them indicating the number of eighth notes
they correspond to. Notice the use of stems, circle filling, and dots as duration
modifiers. The encoding of rhythmic durations is relative and the same score can
be performed faster or slower depending on the speed of the underlying regular
pulse. This can be specified by tempo indicators such as the 85 in Figure 4.2 which
indicates that there should be 85 quarter notes (beats) in each minute. When the
tempo is specified then the relative durations indicated by the symbols can be
converted to absolute durations.
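As a small illustration of this conversion (a sketch added for this draft, assuming the tempo is given in quarter-note beats per minute as in the 85 BPM marking of Figure 4.2):

# Absolute duration of a rhythmic value, where the tempo is expressed
# in quarter-note beats per minute.
def note_duration_seconds(tempo_bpm, quarter_notes=1.0):
    seconds_per_quarter = 60.0 / tempo_bpm
    return quarter_notes * seconds_per_quarter

tempo = 85                                  # quarter notes (beats) per minute
print(note_duration_seconds(tempo, 4.0))    # whole note, about 2.82 seconds
print(note_duration_seconds(tempo, 1.0))    # quarter note, about 0.71 seconds
print(note_duration_seconds(tempo, 0.5))    # eighth note, about 0.35 seconds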
define beat,
tatum, tactus
4.1.2 Notating pitch

4.1.3

4.2 MIDI

4.3 Notation formats

MusicXML, Guido, Lilypond

4.4 Music Theory

intervals, scales, chords, cadences

4.5 Graphical Score Representations

4.6 MIDI

4.7 MusicXML

[37] [87] [?] [49] [62] [18] [19]
Chapter 5

Feature Extraction
Audio feature extraction forms the foundation for any type of music data mining.
It is the process of distilling the huge amounts of raw audio data into much
more compact representations that capture higher level information about the
underlying musical content. As such it is much more specific to music than the
data mining processes that typically follow it.
A common way of grouping audio features (or descriptors as they are sometimes called) is based on the type of information they are attempting to capture
[98]. On an abstract level one can identify different high level aspects of a music
recording. The hierarchical organization in time is referred to as rhythm and the hierarchical organization in frequency or pitch is referred to as harmony. Timbre
is the quality that distinguishes sounds of the same pitch and loudness generated
by different methods of sound production. We will use the term timbral texture to
refer to the more general quality of the mixture of sounds present in music that is
independent of rhythm and harmony. For example the same piece of notated music
played at the same tempo by a string quartet and a saxophone quartet would have
different timbral texture characteristics in each configuration.
In this chapter various audio feature extraction methods that attempt to capture
these three basic aspects of musical information are reviewed. We begin by
examining monophonic fundamental frequency (or pitch) estimation as it is one of
the most basic audio features one can calculate, has interesting applications, and
can help motivate the discussion of other features. Some additional audio features
that cover other aspects of musical information are also briefly described. The
chapter ends with a short description of audio genre classification as a canonical
case study of how audio feature extraction can be used as well as some references
to freely available software that can be used for audio feature extraction. Audio
feature extraction is a big topic and it would be impossible to cover it fully in
this chapter. Although our coverage is not complete we believe we describe
most of the common audio feature extraction approaches and the bibliography
is representative of the “state of the art” in this area in 2013.
5.1 Monophonic pitch estimation

5.1.1 Terminology
As we saw in Chapter 12 music notation systems typically encode information
about discrete pitch (notes on a piano) and timing. When reading literature in
music information retrieval and associated research areas the term pitch is used in
different ways which can result in some confusion.
In this book I have tried to be more precise by using the following terms:
Perceptual Pitch: is a perceived quality of sound that can be ordered from “low”
to “high”.
Musical Pitch: refers to a discrete finite set of perceived pitches that are played
on musical instruments
Measured Pitch: is a calculated quantity of a sound using an algorithm that tries
to match the perceived pitch.
Fundamental frequency: is the physical frequency corresponding to the inverse of the period of a periodic signal; for many musical sounds it coincides with the perceived pitch.
The pitch extraction methods described in this chapter are designed for monophonic audio signals. Monophonic pieces of music are ones in which a single sound source (instrument or voice) is playing and only one pitch is heard at any particular time instance. Polyphonic refers to pieces of music in which multiple notes of the same instrument are heard simultaneously, as in a piece of piano music, or pieces with multiple instruments playing, such as a symphony or a pop ballad. Polyphonic pitch extraction and additional aspects of monophonic pitch extraction are covered in Chapter ??.
5.1.2 Psychoacoustics
A lot of what we know about perceptual pitch comes from the field of Psychoacoustics, which is the scientific study of sound perception. The origins of
psychoacoustics can be traced all the way back to ancient Greece and Pythagoras
of Samos. Pythagoras noticed that melodies consisted of intervals (sequences of
two pitches) and was able to establish a connection between the intervals used
by musicians and physical measurable quantities. More specifically he used an
instrument he designed called the monochord which consisted of a single string
with a movable bridge. He experimentally established that the intervals musicians
used corresponded to dividing the string into integer ratios using the movable
bridge.
Modern psychoacousticians also try to probe the connection between physical measurements and perception of course using more sophisticated tools. For
example, frequently we are interested in the limits of perception such as what
is the highest (or lowest) frequency one can hear. This can be established by
playing sine waves of increasing frequency until the listener can not hear them
anymore. Young listeners can hear frequencies up to approximately 20000 Hz with
the upper limit getting lower with age. Today it is relatively straightforward to
do simple psychoacoustic experiments using a computer and headphones (some
ideas can be found in the problems at the end of this chapter). Similar limits can
be established with loudness. Another important concept in psychoacoustics is the Just Noticeable Difference (JND), which refers to the minimum amount a physical quantity needs to change in order for the change to be perceived. For example, if I play a sine wave of a particular frequency followed by another one of a slightly higher frequency, how much higher does the second one need to be in order for the change to be perceived?
Through simple psychoacoustic experiments one can establish two important
findings:
• We are able to perceive the frequencies of sine waves as ordered from low
to high
• More complex periodic sounds such as those produced by musical instruments are also perceived as ordered from low to high and can be matched
with corresponding sine waves in terms of their perceived pitch
These findings motivate a very common experimental paradigm for studying
pitch perception in which the listener hears a complex sound and then adjusts the
frequency of a sine wave generator until the sine wave and the complex sound are perceived as having the same pitch. The frequency of the adjusted sine wave is defined as the perceptual or perceived pitch of the sound. For a large number of musical sounds the perceived pitch corresponds to the fundamental frequency of the sound, i.e., the lowest frequency with significant energy (or peak) present in the sound when performing Fourier analysis.
5.1.3 Musical Pitch

5.1.4 Time-Domain Pitch Estimation

5.1.5 Frequency-Domain Pitch Estimation

5.1.6 Perceptual Pitch Estimation

5.2 Timbral Texture Features
Features representing timbral information have been the most widely used audio features and the ones that have so far provided the best results when used in isolation. Another factor in their popularity is their long history in the area of speech recognition. There are many variations in timbral feature extraction but most proposed systems follow a common general process. First some form of time-frequency analysis such as the STFT is performed, followed by summarization steps that result in a feature vector of significantly smaller dimensionality. A similar approach can be used with other time-frequency representations.
5.2.1
Spectral Features
Spectral features are directly computed on the magnitude spectrum and attempt
to summarize information about the general characteristics of the distribution
of energy across frequencies. They have been motivated by research in timbre
perception of isolated notes of instruments [41]. The spectral centroid is defined
as the center of gravity of the spectrum and is correlated with the perceived
“brightness” of the sound. It is defined as:
\[
C_n = \frac{\sum_{k=0}^{N-1} k\, |X_n[k]|}{\sum_{k=0}^{N-1} |X_n[k]|} \tag{5.1}
\]

where n is the frame number to be analyzed, k is the frequency bin number and |X_n[k]| is the corresponding magnitude spectrum.
The spectral rolloff is defined as the frequency R_n below which 85% of the energy distribution of the magnitude spectrum is concentrated:

\[
\sum_{k=0}^{R_n - 1} |X_n[k]| = 0.85 \sum_{k=0}^{N-1} |X_n[k]| \tag{5.2}
\]
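A minimal Python/NumPy sketch of these two spectral features (an addition to the text; it assumes a single magnitude spectrum frame is already available and follows equations 5.1 and 5.2 directly):

import numpy as np

def spectral_centroid(mag):
    # Center of gravity of the magnitude spectrum (equation 5.1).
    k = np.arange(len(mag))
    return np.sum(k * mag) / np.sum(mag)

def spectral_rolloff(mag, fraction=0.85):
    # Bin below which `fraction` of the magnitude spectrum is concentrated
    # (equation 5.2).
    cumulative = np.cumsum(mag)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]))

frame = np.abs(np.fft.rfft(np.random.randn(2048) * np.hanning(2048)))
print(spectral_centroid(frame), spectral_rolloff(frame))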
5.2.2 Mel-Frequency Cepstral Coefficients
The Mel-Frequency Cepstral Coefficients (MFCC) [22] are the most common representation used in automatic speech recognition systems and have been frequently
used for music modeling. Their computation consists of 3 stages: 1) Mel-scale
filterbank 2) Log energy computation 3) Discrete Cosine Transform.
A computationally inexpensive method of computing a filterbank is to perform the filtering by grouping STFT bins using triangular windows with specific center frequencies and bandwidths. The result is a single energy value per STFT frame corresponding to the output of each subband. The most common implementation of MFCC is calculated using 13 linearly spaced filters separated by 133.33 Hz between their center frequencies, followed by 27 log-spaced filters (separated by a factor of 1.0711703 in frequency), resulting in 40 filterbank values for each STFT frame.
The next step consists of computing the logarithm of the magnitude of each of
the filterbank outputs. This can be viewed as a simple step of dynamic compression, making feature extraction less sensitive to variations in dynamics.
The final step consists of reducing the dimensionality of the 40 filterbank
outputs by performing a Discrete Cosine Transform (DCT) which is similar to the
Discrete Fourier Transform but uses only real numbers. It expresses a finite set of
values in terms of a sum of cosine functions oscillating at difference frequencies.
The DCT is used frequently in compression applications because for typical signals of interest it tends to concentrate most of the signal information in a few of
the lower frequency components and therefore higher frequency components can
be discarded for compression purposes. In the “classic” MFCC implementation
the lower 13 coefficients are retained after transforming the 40 logarithmically
compressed Mel filterbank outputs.
[70]
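The three stages can be sketched in Python as follows (an addition to the text; a generic mel-spaced triangular filterbank is used instead of the specific 13 linear plus 27 log-spaced layout described above, so this should be read as an illustration rather than a reference implementation):

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with mel-spaced center frequencies applied to STFT bins.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=40, n_coeffs=13):
    # 1) magnitude spectrum and mel filterbank energies
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    energies = mel_filterbank(n_filters, len(frame), sr) @ spectrum
    # 2) logarithmic (dynamic range) compression
    log_energies = np.log(energies + 1e-10)
    # 3) Discrete Cosine Transform, keeping the lower coefficients
    return dct(log_energies, norm='ortho')[:n_coeffs]

sr = 22050
frame = np.random.randn(512)            # stand-in for a windowed audio frame
print(mfcc_frame(frame, sr).shape)      # (13,)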
5.2.3
Other Timbral Features
There have been many other features proposed to describe short-term timbral
information. In this subsection we briefly mention them and provide citations to
where the details of their calculation can be found. Time domain zero-crossings
can be used to measure how noisy the signal is and also somewhat correlate with
high frequency content [98, 10]. Spectral bandwidth [10, 30, 71] and octave
based spectral contrast [53, 71] are other features that attempt to summarize the
magnitude spectrum. The spectral flatness measure and spectral crest factor [1]
are low level descriptors proposed in the context of the MPEG-7 standard [15].
[69]
Daubechies Wavelet Coefficient Histogram is a technique for audio feature
extraction based on the Discrete Wavelet Transform (DWT) [65]. It is applied to 3 seconds of audio using the db8 Daubechies wavelet filter [21] with seven levels
of decomposition. After the decomposition, the histograms of the coefficients in
each band are calculated. Finally each histogram is characterized by its mean,
variance and skewness as well as the subband energy defined as the mean of the
absolute coefficient values. The result is 7 (subbands) * 4 (quantities per subband)
= 28 features.
One of the major innovations that enabled the explosive growth of music
represented digitally is perceptual audio compression [80]. Audio compression is
frequently used to reduce the high data requirements of audio and music. Audio
data does not compress well using generic data compression algorithms so specialized algorithms have been developed. The majority of audio coding algorithms
are lossy meaning that the original data can not be reconstructed exactly from the
compressed signal. They are frequently based on some form of time-frequency
representation and they achieve compression by allocating different numbers of bits to different parts of the signal. One of the key innovations in audio coding is the use of psychoacoustics (i.e., the scientific study of sound perception by humans)
to guide this process so that any artifacts introduced by the compression are not
perceptible.
There has been some interest in computing audio features directly in the compressed domain as part of the decompression process. Essentially this takes advantage of the fact that a type of time-frequency analysis has already been performed
for encoding purposes and it is not necessary to repeat it after decompression. The
type of features used are very similar to the ones we have described except for the
fact that they are computed directly on the compressed data. Early works mainly
focused on the MPEG audio compression standard and the extraction of timbral
texture features [97, 85]. More recently the use of newer sparse overcomplete
represenations as the basis for audio feature extraction has been explored [28]
covering timbral texture, rhythmic content and pitch content.
5.2.4 Temporal Summarization
The features that have been described so far characterize the content of a very
short segment of music audio (typically around 20-40 milliseconds) and can be
viewed as different ways of summarizing frequency information. Frequently a
second step of temporal summarization is performed to characterize the evolution
of the feature vectors over a longer time frame of around 0.5 to 1 seconds. We
can define a “texture” window TM [n] of size M corresponding to analysis frame
n as the sequence of the previous M − 1 computed feature vectors including the
feature vector corresponding to n.
\[
T_M[n] = \{ F[n-M+1], \ldots, F[n] \} \tag{5.3}
\]
Temporal integration is performed by summarizing the information contained
in the texture window to a single feature vector. In other words the sequence
of M past feature vectors is “collapsed” to a single feature vector. At one
extreme, the texture window can be advanced by one analysis frame at a time in
which case the resulting sequence of temporally summarized feature vectors has
the same sampling rate as the original sequence of feature vectors [98]. At the
other extreme temporal integration can be performed across the entire length of
the audio clip of interest resulting in a single feature vector representing the clip
(sometimes such features are termed song-level [73] or aggregate features [10]).
Figure 5.1 shows schematically a typical feature extraction process starting with
a time-frequency representation based on the STFT, followed by the calculation
of MFCC (frequency) summarization and ending with summarization of the resulting feature vector sequence over the texture window.
Figure 5.1: Feature extraction showing how frequency and time summarization
with a texture window can be used to extract a feature vector characterizing
timbral texture
There is no consistent terminology describing temporal summarization. Terms
that have been used include: dynamic features [84, 74] , aggregate features [10],
temporal statistics [30], temporal feature integration [76], fluctuation patterns
[86], and modulation features [61]. More recently the term pooling has also been
used [?].
Statistical moments such as the mean, standard deviation and the covariance
matrix can be used to summarize the sequence of feature vectors over the texture
window into a single vector. For example a common approach is to compute
the means and variances (or standard deviations) of each feature over the texture
window [98]. Figure 5.2 shows the original trajectory of spectral centroids over
a texture window as well as the trajectory of the running mean and standard
deviation of the spectral centroid for two pieces of music.
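A sketch of this running summarization in Python/NumPy (an addition to the text; it assumes a one-dimensional feature trajectory, such as a spectral centroid sequence, is already available and that M analysis frames correspond to roughly one second):

import numpy as np

def texture_summarize(features, M=40):
    # Running mean and standard deviation over a texture window consisting of
    # the current and the previous M - 1 feature values.
    means, stds = [], []
    for n in range(len(features)):
        window = features[max(0, n - M + 1):n + 1]
        means.append(np.mean(window))
        stds.append(np.std(window))
    return np.array(means), np.array(stds)

centroids = np.random.rand(500)              # stand-in for a centroid trajectory
mean_c, std_c = texture_summarize(centroids, M=40)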
Another alternative is to use the upper triangular part of the covariance matrix
as the feature vector characterizing the texture window [73]. Such statistics capture the variation of feature vectors within a texture window but do not directly
represent temporal information.
Another possibility is to utilize techniques from multivariate time-series modeling to characterize the feature vector sequence that better preserve temporal
information. For example multivariate autoregressive models have been used to
model temporal feature evolution [76]. The temporal evolution of the feature
vectors can also be characterized by performing frequency analysis on their trajectories over time and analyzing their periodicity characteristics. Such modulation features show how the feature vectors change over time and are typically calculated at rates that correspond to rhythmic events. Any method of time-frequency analysis can be used to calculate modulation features but a common
choice is the Short-Time Fourier Transform (STFT).
As a representative example of calculating modulation frequencies we briefly describe the approach proposed by McKinney and Breebaart [74]. A standard MFCC computation is performed resulting in a sequence of 64 feature vectors, each with 13 dimensions, characterizing a texture window of 743 milliseconds of audio. A power spectrum of size 64 using a DFT is calculated for each of the 13 coefficients/dimensions, resulting in 13 power spectra which are then summarized in 4 frequency bands (0 Hz, 1-2 Hz, 3-15 Hz, 20-43 Hz) that roughly correspond to musical beat rates, speech syllabic rates and modulations contributing to perceptual roughness. So the final representation for the 13x64 matrix of feature values of the texture window is 4x13 = 52 dimensions.
A final consideration is the size of the texture windows and the amount of overlap between them. Although the most common approach is to use fixed
Figure 5.2: The time evolution of audio features is important in characterizing musical content. The time evolution of the spectral centroid for two different 30-second excerpts of music (Beatles, Debussy) is shown in (a). The result of applying a moving mean and standard deviation calculation over a texture window of approximately 1 second is shown in (b) and (c).
window and hop sizes there are alternatives. Aligning texture windows to note
events can provide more consistent timbral information [103]. In many music
analysis applications such as cover song detection it is desired to obtain an audio
feature representation that is to some extent tempo-invariant. One way to achieve
this is using so-called beat-synchronous features which as their name implies are
calculated using estimated beat locations as boundaries [29]. Although more
commonly used with features that describe pitch content they can also be used
for timbral texture modeling.
5.2.5
Song-level modeling
Frequently in music data mining the primary unit of consideration is an entire
track or excerpt of a piece of music. The simplest and most direct type of representation is a single feature vector that represents the entire piece of music under consideration. This is typically accomplished by temporal summarization techniques
such as the ones described in the previous section applied to the entire sequence
of feature vectors. In some cases two stages of summarization are performed: one at the texture window level and one across the song [98]. The effectiveness of different parameter choices and temporal summarization methods has also been explored [10].
In other music data mining problems the entire sequence of feature vectors
is retained. These problems typically deal with the internal structure within a
music piece rather than the relations among a collection of pieces. For example
in audio segmentation [31, 95] algorithms the locations in time where the musical
content changes significantly (such as the transition from a singing part to an electric
guitar solo) are detected. Audio structure analysis goes a step further and detects
repeating sections of a song and their form such as the well known AABA form
[20]. Sequence representations are also used in cover song detection [29].
A representation that is frequently utilized in structure analysis is the self-similarity matrix. This matrix is computed by calculating pair-wise similarities between feature vectors v_i and v_j derived from audio analysis frames i and j:

\[
s(i, j) = \mathrm{sim}(v_i, v_j) \tag{5.4}
\]

where sim is a function that returns a scalar value corresponding to some notion of similarity (or, symmetrically, distance) between the two feature vectors. Music is generally self-similar, with regular structure and repetitions that can be revealed through the self-similarity matrix.
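A self-similarity matrix can be computed in a few lines of Python/NumPy (a sketch added to the text; cosine similarity is used here as one reasonable choice for the sim function):

import numpy as np

def self_similarity_matrix(features):
    # features: array of shape (num_frames, num_dimensions).
    # Cosine similarity between every pair of feature vectors (equation 5.4).
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-10
    normalized = features / norms
    return normalized @ normalized.T

frames = np.random.rand(200, 13)        # stand-in for a sequence of feature vectors
S = self_similarity_matrix(frames)      # S[i, j] = sim(v_i, v_j)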
Figure 5.3 shows an example of a self-similarity matrix calculated for a piece
of HipHop/Jazz fusion using simple energy contours (shown to the left and bottom
of the Matrix). The regular structure at the beat and measure level as well as some
sectioning can be observed.
The final variation in song-level representations of audio features is to model each music piece as a distribution of feature vectors. In this case a parametric distribution form (such as a Gaussian Mixture Model [3]) is assumed and its parameters are estimated from the sequence of feature vectors. Music data mining tasks such as classification are performed by considering distance functions between distributions typically modeled as mixture models, such as the KL-divergence or the Earth Mover's Distance [73, 52].
5.3
Rhythm Features
Automatically extracting information related to rhythm is also an important component of audio MIR systems and has been an active area of research for over 20
Figure 5.3: Self Similarity Matrix using RMS contours
years. A number of different subproblems within this area have been identified
and explored. The most basic approach is finding the average tempo of the entire
recording which can be defined as the frequency with which a human would tap
their foot while listening to the same piece of music. The more difficult task of
beat tracking consists of estimating time-varying tempo (frequency) as well as the
locations in time of each beat (phase). Rhythmic information is hierarchical in
nature and tempo is only one level of the hierarchy. Other levels frequently used
and estimated by audio MIR systems are tatum (defined as the shortest commonly
occurring time interval), beat or tactus (the preferred human tapping tempo), and
bar or measure. For some MIR applications such as automatic classification it is
possible to use a representation that provides a salience value for every possible tempo, e.g., the beat histograms described in [98]. Rhythm analysis approaches can be characterized in different ways. The first and most important distinction is by the type of input: most of the earlier beat tracking algorithms used a symbolic representation while audio signals have been used more recently. Symbolic algorithms can still be utilized with audio signals provided an intermediate transcription step is performed, typically audio onset detection. Another major distinction between the algorithms is the broad approach used, which includes rule-based, autocorrelative, oscillating filters, histogramming, multiple agent, and
probabilistic. A good overview of these approaches can be found in Chapter 4 of
Klapuri and Davy [43].
Figure 5.4: The top panel depicts the time domain representation of a fragment of a polyphonic jazz recording, below which is displayed its corresponding spectrogram. The bottom panel plots both the onset detection function SF(n) (gray line), as well as its filtered version (black line). The automatically identified onsets are represented as vertical dotted lines.
5.3.1
Onset Strength Signal
Frequently, the first step in rhythm feature extraction is the calculation of the
onset strength signal. The goal is to calculate a signal that has high values at the
onsets of musical events. By analyzing the onset strength signal to detect common
recurring periodicities it is possible to perform tempo induction, beat tracking as
well as other more sophisticated forms of rhythm analysis. Other names used in
the literature include onset strength function and novelty curve.
The onset detection algorithm described is based on a recent tutorial article
[24], where a number of onset detection algorithms were reviewed and compared
on two datasets. Dixon concluded that the use of a spectral flux detection function
for onset detection resulted in the best performance versus complexity ratio.
The spectral flux as an onset detection function is defined as:

\[
SF(n) = \sum_{k=0}^{N/2} H(|X(n, k)| - |X(n-1, k)|) \tag{5.5}
\]

where H(x) = (x + |x|)/2 is the half-wave rectifier function, X(n, k) represents the k-th frequency bin of the n-th frame of the power magnitude (in dB) of the Short-Time Fourier Transform, and N is the corresponding Hamming window size.
The bottom panel of Figure 5.4 plots the values over time of the onset detection function SF(n) for a jazz excerpt.
The onsets are subsequently detected from the spectral flux values by a causal peak picking algorithm that attempts to find local maxima as follows. A peak at time t = nH/f_s (where H is the hop size and f_s the sampling rate) is selected as an onset if it satisfies the following conditions:

\[
SF(n) \geq SF(k) \quad \forall k : n - w \leq k \leq n + w \tag{5.6}
\]
\[
SF(n) > \frac{\sum_{k=n-mw}^{n+w} SF(k)}{mw + w + 1} \times thres + \delta \tag{5.7}
\]

where w = 6 is the size of the window used to find a local maximum, m = 4 is a multiplier so that the mean is calculated over a larger range before the peak, thres = 2.0 is a threshold relative to the local mean that a peak must reach in order to be sufficiently prominent to be selected as an onset, and δ = 10^{-20} is a residual value to avoid false detections on silent regions of the signal. All these parameter values were derived from preliminary experiments using a collection of music signals with varying onset characteristics.
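The following Python/NumPy sketch (an addition to the text; it uses linear magnitude spectra rather than power magnitudes in dB and a simplified form of the conditions above) computes a spectral flux onset detection function and picks onset candidates from it:

import numpy as np

def spectral_flux(x, n_fft=1024, hop=512):
    # Half-wave rectified frame-to-frame increase of the magnitude spectrum
    # (equation 5.5).
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    mags = np.array([np.abs(np.fft.rfft(f)) for f in frames])
    diff = np.diff(mags, axis=0)
    return np.sum(np.maximum(diff, 0.0), axis=1)

def pick_onsets(sf, w=6, m=4, thres=2.0, delta=1e-20):
    # A frame is an onset if it is a local maximum (equation 5.6) and exceeds a
    # scaled local mean computed over a longer window before it (equation 5.7).
    onsets = []
    for n in range(m * w, len(sf) - w):
        if sf[n] >= np.max(sf[n - w:n + w + 1]):
            local_mean = np.mean(sf[n - m * w:n + w + 1])
            if sf[n] > local_mean * thres + delta:
                onsets.append(n)
    return onsets

x = np.random.randn(5 * 22050)          # stand-in for five seconds of audio
onsets = pick_onsets(spectral_flux(x))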
As a way to reduce the false detection rate, the onset detection function SF(n) is smoothed (see bottom panel of Figure 5.4) using a Butterworth filter defined as:

\[
H(z) = \frac{0.1173 + 0.2347 z^{-1} + 0.1174 z^{-2}}{1 - 0.8252 z^{-1} + 0.2946 z^{-2}} \tag{5.8}
\]
In order to avoid phase distortion (which would shift the detected onset time
away from the SF(n) peak) the input data is filtered in both the forward and reverse
directions. The result has precisely zero-phase distortion, the magnitude is the
square of the filter’s magnitude response, and the filter order is double the order
of the filter specified in the equation above.
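Such zero-phase smoothing is a standard operation; the sketch below (an addition to the text, applying the coefficients of equation 5.8 with scipy.signal.filtfilt) illustrates it:

import numpy as np
from scipy.signal import filtfilt

# Coefficients of the Butterworth low-pass filter of equation 5.8.
b = [0.1173, 0.2347, 0.1174]
a = [1.0, -0.8252, 0.2946]

sf = np.random.rand(1000)               # stand-in for an onset detection function
sf_smooth = filtfilt(b, a, sf)          # forward and reverse (zero-phase) filtering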
Figure 5.5: Onset Strength Signal
5.3.2
Tempo Induction and Beat Tracking
Many pieces of music are structured in time on top of an underlying semi-regular
sequence of pulses frequently accentuated by percussion instruments especially
in modern popular music. The basic tempo of a piece of music is the rate of
musical beats/pulses in time, sometimes also called the “foot-tapping” rate for
obvious reasons. Tempo induction refers to the process of estimating the tempo
of an audio recording. Beat tracking is the additional related problem of locating
the positions in time of the associated beats.
In this section we describe a typical method for tempo induction as a representative example and provide pointers to additional literature on the topic. The
first step is the calculation of the onset strength signal as described above. Figure
5.5 shows an example of an onset strength signal for a piece of HipHop/Jazz
fusion showing periodicities at the beat and measure level. The autocorrelation
of the onset strength signal will exhibit peaks at the lags corresponding to the
periodicities of the signal. The autocorrelation values can be warped to form a beat histogram which is indexed by tempo in beats-per-minute (BPM) and has values proportional to the sum of autocorrelation values that map to the same tempo
bin. Typically either the highest or the second highest peak of the beat histogram
corresponds to the tempo and can be selected with peak picking heuristics. Figure
5.6 shows two example beat histograms from 30 second clips of HipHop Jazz
Figure 5.6: Beat Histograms of HipHop/Jazz and Bossa Nova
(left) and Bossa Nova (right). As can be seen in both histograms the prominent periodicities or candidate tempos are clearly visible. Once the tempo of the piece is identified, the beat locations can be calculated by locally fitting tempo hypotheses to regularly spaced peaks of the onset strength signal. There has been a systematic evaluation of different tempo induction methods [40] in the context of the Music Information Retrieval Evaluation Exchange [25].
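A simplified sketch of this tempo induction chain in Python/NumPy (an addition to the text; it assumes the onset strength signal and its frame rate are given, and it omits the subband decomposition and the peak-picking heuristics):

import numpy as np

def beat_histogram(onset_strength, frame_rate, min_bpm=40, max_bpm=200):
    # Autocorrelation of the onset strength signal; peaks appear at lags
    # corresponding to the dominant periodicities.
    oss = onset_strength - np.mean(onset_strength)
    acf = np.correlate(oss, oss, mode='full')[len(oss) - 1:]
    histogram = np.zeros(max_bpm + 1)
    for lag in range(1, len(acf)):
        bpm = 60.0 * frame_rate / lag            # warp the lag to a tempo in BPM
        if min_bpm <= bpm <= max_bpm:
            histogram[int(round(bpm))] += acf[lag]
    return histogram

oss = np.random.rand(2000)                       # stand-in onset strength signal
bh = beat_histogram(oss, frame_rate=86.0)        # 86 frames per second assumed
print(np.argmax(bh), "BPM is the strongest tempo candidate")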
Frequently a subband decomposition is performed so that periodicities at different frequency ranges can be identified. For example the bass drum sound will mostly affect low frequencies whereas a snare hit will affect all frequencies. Figure 5.7 shows a schematic diagram of the Beat Histogram calculation method described originally in [98]. A Discrete Wavelet Transform (DWT) filterbank is used as the front-end, followed by multiple channel envelope extraction and periodicity detection.
5.3.3
Rhythm representations
The beat histogram described in the previous section can be viewed as a song-level
representation of rhythm. In addition to the tempo and related periodicities the
total energy and/or height of peaks represent the amount of self-similarity that the
music has. This quantity has been termed beat strength and has been shown to be
a perceptually salient characteristic of rhythm. For example a piece of rap music
with a tempo of 80 BPM will have more beat strength than a piece of country music
at the same tempo. The spread of energy around each peak indicates the amount
of tempo variations and the ratios between the tempos of the prominent peaks give
Figure 5.7: Beat Histogram Calculation (flow diagram: Discrete Wavelet Transform octave frequency bands; envelope extraction via full wave rectification, low pass filtering, downsampling and mean removal; autocorrelation; multiple peak picking; beat histogram).
hints about the time signature of the piece. A typical approach is to further reduce
the dimensionality of the Beat Histogram by extracting characteristic features such
as the location of the highest peak and its corresponding height [98]. A thorough
investigation of various features for characterizing rhythm can be found in Gouyon
[39].
An alternative method of computing a very similar representation to the Beat
Histogram is based on the self-similarity matrix and termed the Beat Spectrum
[33]. Another approach models the rhythm characteristics of patterns as a sequence of audio features over time [82]. A dynamic time warping algorithm can
be used to align the time axis of the two sequences and allow their comparison.
Another more recent approach is to identify rhythmic patterns that are characteristic of a particular genre automatically [93] and then use their occurrence histogram
as a feature vector.
One interesting question is whether rhythm representations should be tempo
invariant or variant. To some extent the answer depends on the task. For example
if one is trying to automatically classify speed metal (a genre of rock music)
pieces then absolute tempo is a pretty good feature to include. On the other hand
classifying something as a Waltz has more to do with the ratios of periodicities
rather than absolute tempo. Representations that are to some degree tempo invariant have also been explored. Dynamic periodicity warping is a dynamic programming technique used to align average periodicity spectra obtained from the onset strength signal [47]. Tempo invariance is achieved through the alignment process. The Fast Mellin Transform is a transform that is invariant to scale. It
has been used to provide a theoretically tempo invariant (under the assumption
that tempo is scaled uniformly throughout the piece) representation by taking the
FMT of the autocorrelation of the onset strength function [48]. An exponential
grouping of the lags of the autocorrelation function of the onset strength signal
can also be used for a tempo invariant representation [51]. Beat Histograms
can also be used as the basis for a tempo invariant representation by using a
logarithmically-spaced lag-axis [42]. The algorithm requires the estimation of
a reference point. Experiments comparing four periodicity representations in the
spectral or temporal domains using the autocorrelation and the Discrete Fourier Transform (DFT) of the onset strength signal have also been conducted [83].
5.4 Pitch/Harmony Features

In other cases, for example in cover song identification or automatic chord detection, it is desired to have a representation that is related to the pitch content of
the music rather than the specifics of the instruments and voices that are playing.
Conceivably a fully automatically transcribed music score could be used for this
purpose. Unfortunately current automatic transcription technology is not robust
enough to be used reliably.
Instead the most common pitch content representation is the Pitch and Pitch
Class Profile (PCP) (other alternative names used in literature are pitch histograms
and chroma vectors). The pitch profile measures the occurrence of specific discrete musical pitches in a music segment and the pitch class profile considers all
octaves equivalent essentially folding the pitch profile into 12 pitch classes. The
pitch profile and pitch class profile are strongly related to the underlying harmony
of the music piece. For example, a music segment in C major is expected to have
many occurrences of the discrete pitch classes C, E, and G that form the C major
chord. These representations are used for a variety of tasks including automatic key
identification [90, 63], chord detection [90, 63], cover song detection [29, 89, 67],
and polyphonic audio-score alignment [78].
There are two major approaches to the computation of the PCP. The first
approach directly calculates the PCP by appropriately mapping and folding all the
magnitude bins of a Fast Fourier Transform. The terms chroma and chromagram
are used to describe this process [7]. Each FFT bin is mapped to its closest note,
according to:
\[
f(p) = 440 \cdot 2^{(p - 69)/12} \tag{5.9}
\]
where p is the note number. This is equivalent to segmenting the magnitude spectrum into note regions. The average magnitude within each region is calculated
resulting in a pitch histogram representation. The pitch histogram is then folded so
that all octaves map to the same pitch class resulting into a vector of size 12. The
FFT-based calculation of chroma has the advantage that it is efficient to compute
and has consistent behavior throughout the song. However it is affected by non-pitched sound sources such as drums, and the harmonics of pitched sound sources are mapped to pitches, which reduces the potential precision of the representation.
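A minimal Python/NumPy sketch of this FFT-based chroma computation (an addition to the text; it accumulates rather than averages the magnitude within each note region and uses the inverse of equation 5.9 to map bins to note numbers):

import numpy as np

def chroma(frame, sr):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    pcp = np.zeros(12)
    for f, mag in zip(freqs[1:], spectrum[1:]):          # skip the DC bin
        # Nearest note number from the inverse of equation 5.9,
        # folded into one of the 12 pitch classes.
        note = int(round(69 + 12 * np.log2(f / 440.0)))
        pcp[note % 12] += mag
    return pcp / pcp.max() if pcp.max() > 0 else pcp

sr = 22050
t = np.arange(4096) / sr
frame = np.sin(2 * np.pi * 440.0 * t)       # an A4 sine wave
print(np.argmax(chroma(frame, sr)))         # pitch class 9 corresponds to A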
An alternative is to utilize multiple pitch detection to calculate the pitch and
pitch class profiles. If the multiple pitch detection was perfect then this approach
would eliminate the problems of chroma calculation however in practice pitch
detection especially in polyphonic audio is not particularly robust. The errors
tend to average out but there is no consensus about whether the pitch-based or
FFT-based approach is better. Multiple pitch detection typically operates in an
incremental fashion. Initially the dominant pitch is detected and the frequencies
corresponding to the fundamental and the associated harmonics are removed for
example by using comb filtering. Then the process is repeated to detect the second
most dominant pitch.
Pitch detection has been a topic of active research for a long time, mainly due to its importance in speech processing. In this chapter we briefly outline two common approaches. The first approach is to utilize the time-domain autocorrelation of the signal [12]:

\[
R(m) = \frac{1}{N} \sum_{n=0}^{N-1-m} x[n]\, x[n+m], \qquad 0 \leq m < M \tag{5.10}
\]
An alternative used in the YIN pitch extraction method [23] is based on the difference function:

\[
d_t(\tau) = \sum_{n=0}^{N-1} (x[n] - x[n+\tau])^2 \tag{5.11}
\]
The dips in the difference function correspond to periodicities. In order to
reduce the occurrence of subharmonic errors, YIN employs a cumulative mean
function which de-emphasizes higher period dips in the difference function.
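Both approaches can be sketched in a few lines of Python/NumPy (an addition to the text; the restricted lag search range is an assumption and a plain minimum of the difference function is used in place of YIN's cumulative mean normalization):

import numpy as np

def pitch_autocorrelation(x, sr, fmin=80.0, fmax=1000.0):
    # Periodicity from the lag of the highest autocorrelation value within a
    # plausible pitch range (equation 5.10).
    N = len(x)
    lags = np.arange(int(sr / fmax), int(sr / fmin))
    r = np.array([np.sum(x[:N - m] * x[m:]) / N for m in lags])
    return sr / lags[np.argmax(r)]

def pitch_difference_function(x, sr, fmin=80.0, fmax=1000.0):
    # Periodicity from the deepest dip of the difference function (equation 5.11).
    N = len(x)
    lags = np.arange(int(sr / fmax), int(sr / fmin))
    d = np.array([np.sum((x[:N - tau] - x[tau:]) ** 2) for tau in lags])
    return sr / lags[np.argmin(d)]

sr = 22050
t = np.arange(2048) / sr
x = np.sin(2 * np.pi * 220.0 * t)            # a 220 Hz sine wave
print(pitch_autocorrelation(x, sr), pitch_difference_function(x, sr))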
Independently of their method of calculation, PCPs can be calculated using a fixed size analysis window. This will occasionally result in inaccuracies when a window contains information from two chords. The use of beat synchronous features [29] can help improve the results since, by considering windows that are aligned with beat locations, it is more likely that the pitch content information remains stable within an analysis window.
5.5 Other Audio Features
In addition to timbral texture, rhythm and pitch content there are other facets of
musical content that have been used as the basis for audio feature extraction. Another important source of information is the instrumentation of audio recordings
i.e., what instruments are playing at any particular time of an audio recording. For example it is more likely that a saxophone will be part of a jazz recording than a recording of country music, although of course there are exceptions. The classification of musical instruments in a polyphonic context has been explored
[88] but so far has not been evaluated in the context of other music data mining
tasks such as classification.
So far all the features described characterize the mixture of sound sources
that comprise the musical signal. Another possibility is to characterize individually each sound source. Feature extraction techniques for certain types of sound
sources have been proposed but they are not widely used, partly due to the difficulty of separating and/or characterizing individual sound sources in a complex mixture
of sounds such as music. An example is automatic singer identification. The
most basic approach is to use features developed for voice/speech identification
directly on the mixed audio signal [57]. More sophisticated approaches first
attempt to identify the parts of the signal containing vocals and in some cases
even attempt to separately characterize the singing voice and reduce the effect
of accompaniment [34]. Another type of sound source that has been explored for
audio feature extraction is the bass line [58, 94].
In many modern pop and rock recordings each instrument is recorded separately and the final mix is created by a recording producer/engineer(s) who among
other transformations add effects such as reverb or filtering and spatialize individual tracks using stereo panning cues. For example the amount of stereo
panning and placement of sources remains constant in older recordings that tried
to reproduce live performances compared to more recent recordings that would
be almost impossible to realize in a live setting. Stereo panning features have
recently been used for audio classification [99, 102].
5.6 Bag of codewords

5.7 Spectral Features

Spectral Centroid and spectral spread: These features are computed on the power spectrum of the sound. They are respectively the mean and standard deviation of the frequencies in the power spectrum weighted by the power index.

5.8 Mel-Frequency Cepstral Coefficients

5.9 Pitch and Chroma

5.10 Beat, Tempo and Rhythm

5.11 Modeling temporal evolution and dynamics

5.12 Pattern representations

5.13 Stereo and other production features
Chapter 6

Context Feature Extraction

6.1 Extracting Context Information About Music
In addition to information extracted by analyzing the audio content of music there
is a wealth of information that can be extracted by analyzing information on the
web as well as patterns of downloads/listening. We use the general term musical
context to describe this type of information. In some cases such as song lyrics the
desired information is explicitly available somewhere on the web and the challenge is to appropriately filter out irrelevant information from the corresponding
web-pages. Text-based search engines such as Google and Bing can be leveraged
for the initial retrieval that can then be followed by some post-processing based
on heuristics that are specific to the music domain. Other types of information
are not as straightforward and can require more sophisticated mechanisms such
as term weighting used in text retrieval systems or natural language processing
techniques such as entity detection. Such techniques are covered in detail in the
literature as they are part of modern day search engines. As an illustrative example
we will consider the problem of detecting the country of origin of a particular
artist. As a first attempt one can query a search engine with various pairs of artist name and country and simply count the number of pages returned. The country with the highest number of pages returned is selected as the country of origin. A more
sophisticated approach is to analyze the retrieved web pages using term weighting.
More specifically consider country c as a term. The document frequency DF (c, a)
is defined as the total number of web pages retrieved for artist a in which the
country term c appears at least once. The term frequency T F (c, a) is defined as
the total number of occurrences of the country term c in all pages retrieved for
artist a. The basic idea of term frequency-inverse document frequency weighting
(TF-IDF) is to “penalize” terms that appear in many documents (in our case the
documents retrieved for all artists) and increase the weight of terms that occur
frequently in the set of web pages retrieved for a specific artists. There are several
TF-IDF weighting schemes. For example a logarithmic formulation is:
TFIDF(c, a) = ln(1 + TF(c, a)) × ln(1 + n / DF(c))    (6.1)
where DF(c) is the document frequency of a particular country c over the documents returned for all artists, and n is the total number of retrieved documents. Using the above equation the weight of every country c can be calculated for a particular artist query a. The country with the highest weight is then selected as the predicted country of origin.
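The following Python sketch illustrates the procedure under the simplifying assumption that the retrieved web pages are already available as plain-text strings; the predict_country function, the toy pages, and the candidate country list are all hypothetical and only serve to show how the TF and DF counts feed into Equation (6.1).

    import math

    def predict_country(artist_pages, all_artist_pages, countries):
        """Pick the country with the highest TF-IDF weight for one artist.

        artist_pages     : page texts retrieved for the artist of interest
        all_artist_pages : page texts retrieved for all artists (defines DF)
        countries        : candidate country terms
        """
        n = len(all_artist_pages)
        weights = {}
        for c in countries:
            tf = sum(page.lower().count(c.lower()) for page in artist_pages)
            df = sum(1 for page in all_artist_pages if c.lower() in page.lower())
            weights[c] = 0.0 if df == 0 else math.log(1 + tf) * math.log(1 + n / df)
        return max(weights, key=weights.get)

    # Hypothetical toy data: two retrieved "pages" for the query artist.
    pages = ["Born in Reykjavik, Iceland, the singer ...",
             "... toured Iceland and Norway extensively ..."]
    corpus = pages + ["A Norway based label released ...", "Sweden tour dates ..."]
    print(predict_country(pages, corpus, ["Iceland", "Norway", "Sweden"]))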
Chapter 7

Analysis

Essentially, all models are wrong but some are useful.
George Box
In the previous chapters we have described how music in digital audio formats
can be represented using time-frequency representations as well as how audio
features at different time scales can be extracted based on these representations.
In this chapter we review the techniques used to analyze these audio features and
extract musical content information from them. Most of these techniques originate
in the related fields of machine learning, pattern recognition and data mining.
Going further back most of them are based on ideas from statistics and probability
theory. Computers are particularly suited to solving tasks that are well-defined and
for which there is a clear sequence of steps that guarantees finding a solution. For
example sorting a list of numbers or calculating compounded interest rates are
the type of tasks that computers excel at and perform daily. A solution to such a
problem can be easily verified and is fixed for a particular set of input data.
However, there are many problems that we are interested in solving for which
there is no known sequence of steps that allows us to solve them and for which
there is no perfect well defined solution. A simple example would be predicting
the weather in a particular location based on measurements of humidity, pressure,
and temperature from various weather stations in the surrounding area. If someone
is given this problem it is unclear what procedure should be followed and although
an algorithm that would correctly predict the weather with 100% accuracy would
be ideal, any algorithm that performs significantly better than random guessing
would still be useful. The traditional approach to solving this problem, before the
widespread use of computers and advances in machine learning and data mining,
would be to plot the measurements over several days, try to fit some simple
mathematical functions, and try to come up with some prediction rules expressed
as an algorithm that can be executed on a computer. Such an approach requires significant effort and, as the number of measurements to be considered increases, it becomes very difficult if not impossible to utilize all the available information.
One key observation is that for many such problems, although there is no well-defined optimal solution and no clear procedure for obtaining it, it is possible to compare two different algorithms and decide which one performs better, i.e., there is a way to evaluate the effectiveness of an algorithm. For example, in the weather prediction example we can execute the algorithms over historical data for which the weather that should be predicted is known and evaluate how well each algorithm under consideration performs on this data.
As defined by Mitchell [?], a computer program is said to learn if its performance on a particular task T, as measured by some performance metric P, improves with experience E. For example, a computer chess program “learns” to play chess (T) from the experience of playing games with human players (E), as measured by the number of games it wins (P), if the number of games it wins increases as it plays more games.
A common way this improvement with experience is achieved is by considering more data. Returning to the weather prediction example, in a machine learning
approach the acquired measurements as well as the associated “ground truth”
would be the input to the algorithm. The algorithm would somehow model the
association between the measurements and the desired output without requiring
any decision to be made by the programmer and once “trained” could be applied
to new data. The ability of the algorithm to predict the weather would improve
based on the availability of historical data. Using a machine learning/data mining
approach large amounts of data can be utilized and no explicit programming of
how the problem is solved needs to be done. This process might seem a little
bit strange and abstract at this point but hopefully will become clearer through
concrete examples in the following sections. The techniques described in this
chapter are used in several MIR tasks described in the second part of this book.
7.1 Feature Matrix, Train Set, Test Set
In many analysis and mining tasks the data of interest is represented as vectors
of numbers. By having each object of interest represented as a finite vector of numbers (features) the domain-specific information is encapsulated in the features and generic machine learning algorithms can be developed. For example, the same classification algorithm with different features as input can be used to perform face recognition (with features based on image analysis), music genre classification
(with features based on audio analysis), or document categorization (with features
based on text analysis).
A feature matrix is a 2D matrix where each row (also known as instance or
sample) is a feature vector consisting of numbers (also known as attributes or
features) characterizing a particular object of interest. Formally it can be notated
as:
X = {x_i ∈ R^d, i = 1, . . . , n}    (7.1)
where i is the instance index, n is the number of instances, and d is the dimensionality of the feature vector.
In supervised learning the feature matrix is associated with a vector of ground
truth classification labels, with one label y_i (typically represented as an integer) for each instance.
The training set consists of the feature matrix and associated labels:
T = (X, y) = {(x_i, y_i)}    (7.2)

7.2 Similarity and distance metrics
Euclidean, Mahalanobis, Earth Mover, Kullback-Leibler
7.3 Classification
Classification is one of the most common problems in pattern recognition and
machine learning. The basic idea is straightforward to describe. The goal of a
classification system is to “classify” objects by assigning to them a unique label
from a finite, discrete set of labels (also known as classes or categories). In
classic supervised learning, a “training” set of feature vectors representing different
objects is provided together with associated “ground truth” class labels. After the
classifier is “trained” it can then be used to “predict” the class label of objects
that it has not encountered before using the feature vectors corresponding to these
objects.
A classifier is an algorithm that operates in two phases. In the first phase
called training, it takes as input the training set T and analyzes it in order to
compute what is called a model. The model is some kind of representation,
typically a vector of parameters, that characterizes the particular classification
problem for a given type of classifier. In the second phase, called prediction or
testing, the trained model is used to predict (estimate) the classification labels for
a new feature matrix called the testing set.
7.3.1 Evaluation
Before describing different classification algorithms we discuss how their effectiveness can be evaluated. Having a good understanding of evaluation is even more
important than understanding how the different algorithms work, as in most cases the
algorithm can be simply treated as a black box.
The goal of a classifier is to be able to classify objects it has not encountered
before. Therefore in order to get a better estimate of its performance on unknown
data it is necessary to use some of the instances labeled with ground truth for
testing purposes and not take them into account when training. The most common
such evaluation scheme is called K-fold cross-validation. In this scheme the set of
labeled instances is divided into K distinct subsets (folds) of approximately equal
sizes. Each fold is used for testing once with the K − 1 remaining folds used
for training. As an example if there are 100 feature vectors then each fold will
contain 10 feature vectors with each one of them being used one time for testing
and K −1 times for training. The most common evaluation metric is classification
accuracy which is defined as the percentage of testing feature vectors that were
classified correctly based on the ground truth labels. When classifying music
tracks a common post-processing technique that is applied is the so-called artist
filter which ensures that the feature vectors corresponding to tracks from the same
artist are not split between training and testing and are exclusively allocated only
to one of them. The rationale behind artist filtering is that feature vectors from
the same artist will tend to be artificially related or correlated due to similarities in
the recording process and instrumentation. Such feature vectors will be classified
more easily if included in both training and testing and may inflate the classification accuracy. Similar considerations apply to feature vectors from the same music track if each track is represented by more than one feature vector, in which case a similar track filter should be applied.
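A minimal sketch of artist-filtered evaluation is shown below, assuming instances are rows of a NumPy feature matrix and that each instance carries an artist identifier; the fold-assignment helper and the toy data are hypothetical, and in practice a library routine such as a grouped cross-validation splitter could be used instead.

    import numpy as np

    def artist_filtered_folds(artists, k):
        """Assign every feature vector to one of k folds so that all vectors
        from the same artist end up in the same fold (the "artist filter")."""
        unique = np.unique(artists)
        rng = np.random.default_rng(0)
        rng.shuffle(unique)
        artist_to_fold = {a: i % k for i, a in enumerate(unique)}
        return np.array([artist_to_fold[a] for a in artists])

    # Hypothetical data: 6 tracks by 3 artists, 2-dimensional features.
    X = np.random.default_rng(1).normal(size=(6, 2))
    y = np.array([0, 0, 1, 1, 0, 1])
    artists = np.array(["a1", "a1", "a2", "a2", "a3", "a3"])

    folds = artist_filtered_folds(artists, k=3)
    for f in range(3):
        test = folds == f
        train = ~test
        X_train, y_train = X[train], y[train]  # data a classifier would be trained on
        X_test, y_test = X[test], y[test]      # held-out data used for evaluation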
XXX Add explanations of bootstrapping, leave-one-out, percentage split XXX
stratification
The most common evaluation metric for automatic classification is accuracy, which is simply defined as the fraction of correctly classified instances in the testing data. It is typically expressed as a percentage. Additional insight can be provided by examining the confusion matrix, which is a matrix that shows the correct classifications on the diagonal and how the misclassifications are distributed among the other class labels.
XXX Figure with hypothetical example XXX Figure with confusion matrix
7.3.2 Generative Approaches

naive bayes, gmm

7.3.3 Discriminative
perceptron, svm, ann, decision tree, adaboost, KNN
7.3.4 Ensembles
adaboost
7.3.5 Variants
multi-label, hierarchical, co-clustering, online, multi-instance,
7.4 Clustering
k-means, spectral clustering, hierarchical
7.5 Dimensionality Reduction

7.5.1 Principal Component Analysis
Principal component analysis (PCA) converts a set of feature vectors with possibly
correlated attributes into a set of feature vectors where the attributes are linearly
uncorrelated. These new transformed attributes are called the principal components and when the method is used for dimensionality reduction their number is
less than the number of original attributes. Intuitively this transformation can be
understood as a projection of the original feature vectors to a new set of orthogonal
axes (the principal components) such that each succeeding axis explains the highest variance of the original data-set possible with the constraint that it is orthogonal
to the preceding component. In a typical application scenario where each song is
represented by a 70 dimensional feature vector, the application of PCA to these
feature vectors could be used to convert them to 3 dimensional feature vectors
which can be visualized as points in a 3D space. A common way of calculating
PCA is based on the covariance matrix, which is defined as:

C = (1/N) B × B*    (7.3)
where ∗ denotes the transpose operator and B is the matrix resulting from
subtracting the empirical mean of each dimension of the original data matrix
consisting of the N feature vectors. The eigenvectors and eigenvalues of this
covariance matrix are then computed and sorted in order of decreasing eigenvalue
and the first K where K is the desired number of reduced dimensions are selected
as the new basis vectors. The original data can then be projected into the new
space spanned by these basis vectors, i.e., the principal components. PCA is a standard operation in statistics and it is commonly available in software packages dealing with matrices.
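A minimal NumPy sketch of this computation is shown below, assuming that the feature vectors are stored as the rows of a matrix so that the covariance is obtained from the mean-subtracted matrix B; the hypothetical 70-dimensional random data only illustrates the reduction to three dimensions.

    import numpy as np

    def pca_reduce(X, k):
        """Project the rows of X (one feature vector per row) onto the first
        k principal components, computed from the covariance matrix."""
        B = X - X.mean(axis=0)                 # subtract the empirical mean
        C = (B.T @ B) / len(X)                 # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)   # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
        components = eigvecs[:, order[:k]]     # first k eigenvectors
        return B @ components                  # projected feature vectors

    # Illustrative usage: reduce hypothetical 70-dimensional song features to 3-D.
    X = np.random.default_rng(0).normal(size=(200, 70))
    X3 = pca_reduce(X, 3)                      # shape (200, 3)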
One of the potential issues with PCA for music visualization is that because
it tries to preserve as much as possible the distances between points from the
original feature space to the transformed feature space it might leave large areas of
the available space empty of points. This is particularly undesirable in interfaces
based on various types of touch surfaces where in the ideal case everywhere a user
might press should trigger some music. The self-organizing map (SOM) is an approach that attempts to perform both dimensionality reduction and clustering at the same time, mostly preserving topology but not distances. It can result in more
dense transformed feature spaces that are also discrete in nature in contrast to
PCA which produces a transformed continuous space that needs to be discretized.
7.5.2 Self-Organizing Maps
The traditional SOM consists of a 2D grid of neural nodes each containing an
n-dimensional vector, x(t) of data. The goal of learning in the SOM is to cause
different neighboring parts of the network to respond similarly to certain input
patterns. This is partly motivated by how visual, auditory and other sensory
information is handled in separate parts of the cerebral cortex in the human brain.
The network must be fed a large number of example vectors that represent, as
closely as possible, the kinds of vectors expected during mapping. The data
associated with each node is initialized to small random values before training.
During training, a series of n-dimensional vectors of sample data are added to the
map. The “winning” node of the map known as the best matching unit (BMU)
is found by computing the distance between the added training vector and each
of the nodes in the SOM. This distance is calculated according to some predefined distance metric which in our case is the standard Euclidean distance on
the normalized feature vectors.
Once the winning node has been defined, it and its surrounding nodes reorganize their vector data to more closely resemble the added training sample. The
training utilizes competitive learning. The weights of the BMU and neurons close
to it in the SOM lattice are adjusted towards the input vector. The magnitude of the
change decreases with time and with distance from the BMU. The time-varying
learning rate and neighborhood function allow the SOM to gradually converge and
form clusters at different granularities. Once a SOM has been trained, data may
be added to the map simply by locating the node whose data is most similar to
that of the presented sample, i.e. the winner. The reorganization phase is omitted
when the SOM is not in the training mode.
The update formula for a neuron with representative vector N(t) can be written
as follows:
N(t + 1) = N(t) + Θ(v, t) α(t) (x(t) − N(t))    (7.4)
where α(t) is a monotonically decreasing learning coefficient and x(t) is the
input vector. The neighborhood function Θ(v, t) depends on the lattice distance
between the BMU and neuron v.
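The following sketch shows one possible training step for a small SOM, assuming a Gaussian neighborhood function and a simple linear decay of both the learning rate and the neighborhood width; the grid size, decay schedule, and random data are illustrative choices rather than part of any standard formulation.

    import numpy as np

    def som_train_step(nodes, x, t, n_steps, lr0=0.5, sigma0=3.0):
        """One SOM update for nodes stored in a (rows, cols, d) array.
        Both the learning rate and the neighborhood radius decay with t."""
        rows, cols, d = nodes.shape
        # locate the best matching unit (BMU)
        dists = np.linalg.norm(nodes - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), (rows, cols))
        # time-varying learning rate and neighborhood width
        frac = 1.0 - t / n_steps
        alpha = lr0 * frac
        sigma = max(sigma0 * frac, 1e-3)
        # move the BMU and its lattice neighbors towards the input vector
        for i in range(rows):
            for j in range(cols):
                lattice_dist2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
                theta = np.exp(-lattice_dist2 / (2 * sigma ** 2))
                nodes[i, j] += theta * alpha * (x - nodes[i, j])
        return bmu

    # Illustrative usage: train a 6x6 map on random 10-dimensional vectors.
    rng = np.random.default_rng(0)
    nodes = rng.random((6, 6, 10))
    data = rng.random((500, 10))
    for t, x in enumerate(data):
        som_train_step(nodes, x, t, n_steps=len(data))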
PCA, LDA, Self-Organizing Maps, FastMap,
7.6 Modeling Time Evolution

7.6.1 Dynamic Time Warping
More formally the problem is: given two sequences of feature vectors with different lengths and timings, find the optimal way of “elastically” transforming the sequences so that they match each other. A common technique used to solve this problem, and also frequently employed in the literature for cover song detection, is Dynamic Time Warping (DTW), a specific variant of dynamic programming. Given two time series of feature vectors X = (x_1, x_2, . . . , x_M) and Y = (y_1, y_2, . . . , y_N) with x_i, y_j ∈ R^d, the DTW algorithm yields an optimal
solution in O(M N ) time where M and N are the lengths of the two sequences. It
requires a local distance measure that can be used to compare individual feature
vectors which should have small values when the vectors are similar and large
values when they are different:
d : R^d × R^d → R≥0    (7.5)
The algorithm starts by building the distance matrix C ∈ R^{M×N} representing all
the pairwise distances between the feature vectors of the two sequences. The goal
of the algorithm is to find the alignment or warping path which is a correspondence between the elements of the two sequences with the boundary constraint
that the first and last elements of the two sequences are assigned to each other.
Intuitively for matching sequences the alignment path will be roughly diagonal
and will run through the low-cost areas of the cost matrix. More formally the
alignment is a sequence of points (pi , pj ) ∈ [1 : M ] × [1 : N ] for which
the starting and ending points must be the first and last points of the aligned
sequences, the points are time-ordered and each step size is constrained to either
move horizontally, vertically or diagonally. The cost function of the alignment
path is the sum of all the pairwise distances associated with its points and the
alignment path that has the minimal cost is called the optimal alignment path
and is the output of the DTW.
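A compact sketch of the basic DTW recursion is given below, using Euclidean local distances and the three step types described above; it returns only the optimal alignment cost, and the path itself could be recovered by backtracking through the accumulated cost matrix D.

    import numpy as np

    def dtw_cost(X, Y):
        """Cost of the optimal DTW alignment between two sequences of feature
        vectors X (M x d) and Y (N x d), using Euclidean local distances and
        horizontal, vertical and diagonal steps."""
        M, N = len(X), len(Y)
        C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances
        D = np.full((M + 1, N + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j],      # vertical step
                                                D[i, j - 1],      # horizontal step
                                                D[i - 1, j - 1])  # diagonal step
        return D[M, N]

    # Illustrative usage: two short chroma-like sequences of different lengths.
    rng = np.random.default_rng(0)
    X = rng.random((120, 12))
    Y = rng.random((150, 12))
    print(dtw_cost(X, Y))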
Figure 7.1 shows two distance matrices that are calculated based on energy
contours of different orchestra music movements. The left matrix is between two
performances by different orchestras of the same piece. Even though the timing
and duration of each performance is different they exhibit a similar overall energy
envelope shown by the energy curves under the two axes. The optimal alignment
path computed by DTW is shown imposed over the distance matrix. In contrast
the matrix on the right shows the distance matrix between two unrelated orchestral movements where it is clear there is no alignment.

Figure 7.1: Similarity matrix between energy contours and alignment path using Dynamic Time Warping. (a) Good alignment. (b) Bad alignment.
7.6.2 Hidden Markov Models

7.6.3 Kalman and Particle Filtering

7.7 Further Reading
cite Mitchell, Duda and Hart, Theodoridis and Bishop.
Part II

Tasks
Chapter 8

Similarity Retrieval
Similarity retrieval (or query-by-example) is one of the most fundamental MIR
tasks. It is also one of the first tasks that were explored in the literature. It
was originally inspired by ideas from text information retrieval and this early
influence is reflected in the naming of the field. Today most people with computers
use search engines on a daily basis and are familiar with the basic idea of text
information retrieval. The user submits a query consisting of some words to the
search engine and the search engine returns a ranked list of web pages sorted by
how relevant they are to the query.
Similarity retrieval can be viewed as an analogous process where instead of the
user querying the system by providing text the query consists of an actual piece
of music. The system then responds by returning a list of music pieces ranked by
their similarity to the query. Typically the input to the system consists of the query
music piece (using either a symbolic or audio representation) as well as additional
metadata information such as the name of the song, artist, year of release etc. Each
returned item typically also contains the same types of meta-data. In addition to
the audio content and meta-data other types of user generated information can
also be considered such as ranking, purchase history, social relations and tags.
Similarity retrieval can also be viewed as a basic form of playlist generation in
which the returned results form a playlist that is “seeded” by the query. However
more complex scenarios of playlist generation can be envisioned. For example a start and an end seed might be specified, or additional constraints such as approximate duration or minimum tempo variation can be imposed. Another variation is based
on what collection/database is used for retrieval. The term playlisting is more
commonly used to describe the scenario where the returned results come from the
personal collection of the user, while the term recommendation is more commonly
used in the case where the returned results are from a store containing a large
universe of music. The purpose of the recommendation process is to entice the
user to purchase more music pieces and expand their collection. Although these
three terms (similarity retrieval, music recommendation, automatic playlisting)
have somewhat different connotations the underlying methodology for solving
them is similar so for the most part we will use them interchangeably. Another related term that is sometimes used is personalized radio, in which the idea is to play music that is tailored to the preferences of the user.
One can distinguish three basic approaches to computing music similarity.
Content-based similarity is performed by analyzing the actual content to extract
the necessary information. Metadata approaches exploit sources of information
that are external to the actual content such as relationships between artists, styles,
tags or even richer sources of information such as web reviews and lyrics. Usagebased approaches track how users listen and purchase music and utilize this information for calculating similarity. Examples include collaborative filtering in
which the commonalities between purchasing histories of different users are exploited, tracking peer-to-peer downloads or radio play of music pieces to evaluate
their “hotness” and utilizing user generated rankings and tags.
There are trade-offs involved in all these three approaches and most likely
the ideal system would be one that combines all of them intelligently. Usage-based approaches suffer from what has been termed the “cold-start” problem in which new music pieces for which there is no usage information cannot be recommended. Metadata approaches suffer from the fact that metadata information is
frequently noisy or inaccurate and can sometimes require significant semi-manual
effort to clean up. Finally content-based methods are not yet mature enough to
extract high-level information about the music.
From a data mining perspective similarity retrieval can be considered a ranking
problem. Given a query music track q and a collection of music tracks D the
goal of similarity retrieval is to return a ranked list of the music tracks in D
sorted by similarity so that the most similar objects are at the top of the list.
In most approaches this ranking is calculated by defining some similarity (or
distance) metric between pairs of music tracks. The most basic formulation is
to represent each music track as a single feature vector of fixed dimensionality
x = [x1 , x2 , . . . , xn ]T and use standard distance metrics such as L1 (Manhattan)
or L2 (Euclidean) or Mahalanobis on the resulting high dimensional space. This
feature vector is calculated using audio feature extraction techniques as described
in the following sections. Unless the distance metric is specifically designed to
handle features with different dynamic ranges the feature vectors are typically
normalized for example by scaling all of them so that their maximum value over
the dataset is 1 and their minimum value is 0. A more complex alternative is to
treat each music track as a distribution of feature vectors. This is accomplished by
assuming that the feature vectors are samples of an unknown underlying probability density function that needs to be estimated. By assuming a particular
parametric form for the pdf (for example a Gaussian Mixture Model) the music
track is then represented as a parameter vector θ that is estimated from the data.
This way the problem of finding the similarity between music tracks is transformed to the problem of somehow finding the difference between two probability
distributions. Several such measures of probability distance have been proposed
such as histogram intersection, symmetric Kullback-Leibler divergence and earth
mover’s distance. In general the computation of such probabilistic distances is more computationally intensive than that of geometric distances on feature vectors, and in many cases it requires numerical approximation as the distances cannot be obtained analytically.
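As a rough illustration, the sketch below compares two tracks both geometrically (Euclidean and Manhattan distances between mean feature vectors) and probabilistically, using the closed-form symmetric KL divergence between single diagonal-covariance Gaussians; a single Gaussian per track is a deliberate simplification of the Gaussian Mixture Model approach mentioned above, and the random frame-level features are hypothetical.

    import numpy as np

    def euclidean(x, y):
        return np.linalg.norm(x - y)

    def manhattan(x, y):
        return np.abs(x - y).sum()

    def kl_diag_gauss(mu1, var1, mu2, var2):
        """KL divergence between two diagonal-covariance Gaussians,
        one simple 'track as a distribution' model."""
        return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                            - 1.0 + np.log(var2 / var1))

    def symmetric_kl(mu1, var1, mu2, var2):
        return kl_diag_gauss(mu1, var1, mu2, var2) + kl_diag_gauss(mu2, var2, mu1, var1)

    # Hypothetical per-track summaries: mean and variance of frame-level features.
    rng = np.random.default_rng(0)
    f1, f2 = rng.random((500, 20)), rng.random((500, 20))
    d_euc = euclidean(f1.mean(axis=0), f2.mean(axis=0))
    d_kl = symmetric_kl(f1.mean(axis=0), f1.var(axis=0) + 1e-9,
                        f2.mean(axis=0), f2.var(axis=0) + 1e-9)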
An alternative to audio feature extraction is to consider similarity based on text
such as web pages, user tags, blogs, reviews, and song lyrics. The most common
model when dealing with text is the so called “bag of words” representation in
which each document is represented as an unordered set of its words without
taking into account any syntax or structure. Each word is assigned a weight that
indicates the importance of the word for some particular task. The document can then be represented as a feature vector consisting of the weights corresponding to all the words of interest. From a data mining perspective the resulting feature vector is no different than the ones extracted by audio feature extraction and can be handled using similar techniques. In the previous chapter we described a particular example of text-based feature extraction for the purpose of predicting the country of origin of an artist.
As an example of how a text-based approach can be used to calculate similarity
consider the problem of finding how similar two artists A and B are. Each artist
is characterized by a feature vector consisting of term weights for the terms they
have in common. The cosine similarity between the two feature vectors is defined
as the cosine of the angle between the vectors and has the property that it is not
affected by the magnitude of the vector (which would correspond to the absolute
number of times terms appear and could be influenced by the popularity of the
artist):
sim(A, B) = cos θ = ( Σ_t A(t) × B(t) ) / ( √(Σ_t A(t)^2) × √(Σ_t B(t)^2) )    (8.1)
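A direct transcription of Equation (8.1) in Python might look as follows; the term-weight vectors for the two artists are made up and simply assume a shared vocabulary of four terms.

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between two term-weight vectors (Equation 8.1)."""
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    # Hypothetical term weights for two artists over the same vocabulary of terms.
    artist_a = np.array([3.0, 0.0, 1.5, 2.0])
    artist_b = np.array([2.5, 1.0, 0.0, 2.0])
    print(cosine_similarity(artist_a, artist_b))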
Another approach to calculating similarity is to assume that the occurrence
of two music tracks or artists within the same context indicates some kind of
similarity. The context can be web pages (or page counts returned by a search
engine), playlists, purchase histories, and usage patterns in peer-to-peer (P2P)
networks. Collaborative filtering (CF) refers to a set of techniques that make
recommendations to users based on preference information from many users. The
most common variant is to assume that the purchase history of a particular user (or
to some extent equivalently their personal music collections) is characteristic of
their taste in music. As an example of how co-occurrence can be used to calculate
similarity, a search engine can be queried individually for documents that contain artist A and for documents that contain artist B, as well as for documents that contain both A and B. The
artist similarity between A and B can then be found by:
sim(A, B) = co(A, B) / min(co(A), co(B))    (8.2)
where co(X) is the number of pages returned for query X, or the number of co-occurrences of A and B in some context. A similar measure can be defined based on co-occurrences between tracks and artists in playlists and compilation albums, using conditional probabilities:
sim(A, B) = (1/2) × ( co(A, B)/co(A) + co(A, B)/co(B) )    (8.3)
Co-occurrences can also be defined in the context of peer-to-peer networks
by considering the number of users that have both artists A and B in their shared
collection. The popularity bias refers to the problem of popular artists appearing
more similar than they should be due to them occurring in many contexts. A
similarity measure can be designed to down weight the similarity between artists
if one of them is very popular and the other is not (the right-hand part of the
following equation):
sim(A, B) = ( C(A, B) / C(B) ) × ( 1 − |C(A) − C(B)| / C(Max) )    (8.4)
where C(Max) is the number of times the most popular artist appears in a context.
8.0.1 Evaluation of similarity retrieval
One of the challenges in content-based similarity retrieval is evaluation. In evaluating mining systems ideally one can obtain ground truth information that is
identical to the outcome of the mining algorithm. Unfortunately this is not the
case in similarity as it would require manually sorting large collections of music
in order of similarity to a large number of queries. Even for small collections and
number of queries collecting such data would be extremely time consuming and
practically impossible. Instead the more common approach is to only consider the
top K results for each query where K is a small number and have users annotate
each result as relevant or not relevant. Sometimes a numerical discrete score is
used instead of a binary relevance decision. Another possibility that has been
used is to assume that tracks by the same artist or same genre should be similar
and use such groupings to assign relevance values.
Evaluation metrics based on information retrieval can be used to evaluate the
retrieved results for a particular query. They assume that each of the returned
results has a binary annotation indicating whether or not it is relevant to the query.
The most common one is the F-measure which is a combination of the simpler
measures of Precision and Recall. Precision is the fraction of retrieved instances
that are relevant. Recall is the fraction of relevant instances that are retrieved. As
an example consider a music similarity search in which relevance is defined by
genre. If for a given query of Reggae music the system returns 20 songs and 10 of them are also Reggae, the precision for that query is 10/20 = 0.5. If there are a total of 30 Reggae songs in the collection searched, then the recall for that query is 10/30 = 0.33.
The F-measure is defined as the harmonic mean of Precision P and Recall R.
F = 2 × (P × R) / (P + R)    (8.5)
These three measures are based on the list of documents returned by the system
without taking into account the order they are returned. For similarity retrieval a
more accurate measure is the Average Precision which is calculated by computing
the precision and recall at every position in the ranked sequence of documents,
creating a precision-recall curve and computing the average. This is equivalent to
the following finite sum:
AP = ( Σ_{k=1}^{n} P(k) × rel(k) ) / (number of relevant documents)    (8.6)
where P(k) is the precision at list position k and rel(k) is an indicator function that
is 1 if the item at list position (or rank) k is a relevant document and 0 otherwise.
All of the measures described above are defined for a single query. They can
easily be extended to multiple queries by taking their average across the queries.
The most common way of evaluating similarity systems with binary relevance
ground-truth is the Mean Average Precision (MAP), which is defined as the mean
of the Average Precision across a set of queries.
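The sketch below computes precision, recall, F-measure, and average precision for toy inputs; the numbers reproduce the Reggae example above, and MAP would simply be the mean of the average precision values over a set of queries.

    def precision_recall_f(relevant_retrieved, retrieved, relevant_total):
        p = relevant_retrieved / retrieved
        r = relevant_retrieved / relevant_total
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f

    def average_precision(rel, n_relevant):
        """rel is a binary relevance list in ranked order (Equation 8.6)."""
        hits, total = 0, 0.0
        for k, r in enumerate(rel, start=1):
            if r:
                hits += 1
                total += hits / k          # precision at rank k
        return total / n_relevant

    # Toy example matching the Reggae query in the text: 20 results, 10 relevant,
    # 30 relevant tracks in the whole collection.
    print(precision_recall_f(10, 20, 30))          # (0.5, 0.333..., 0.4)
    print(average_precision([1, 0, 1, 1, 0], 3))   # AP for one hypothetical query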
The Music Information Retrieval Evaluation Exchange (MIREX) is an annual
evaluation benchmark in which different groups submit algorithms to solve various MIR tasks and their performance is evaluated using a variety of subjective and
objective metrics. Table 8.1 shows representative results of the music similarity and retrieval task from 2010. It is based on a dataset of 7000 30-second audio
clips drawn from 10 genres. The objective statistics are the precision at 5, 10, 20
and 50 retrieved items without counting entries by the same artist (artist filtering). The subjective statistics are based on human evaluation of approximately 120 randomly
selected queries and 5 results per query. Each result is graded with a fine score
(between 0 and 100 with 100 being most similar) and a broad score (0 not similar,
1 somewhat similar, 2 similar) and the results are averaged. As can be seen all
automatic music similarity systems perform significantly better than the random
baseline (RND). They differ in terms of the type of extracted features utilized,
the decision fusion strategy (such as simple concatenation of the different feature
sets or empirical combinations of distances from the individual feature sets), and
whether post-processing is applied to the resulting similarity matrix. There is
also a strong correlation between the subjective and objective measures although
it is not perfect (for example SSPK2 is better than PSS1 in terms of subjective measures but worse in terms of objective measures).
          FS    BS     P@5   P@10  P@20  P@50
RND       17    0.2    8     9     9     9
TLN3      47    0.97   48    47    45    42
TLN2      47    0.97   48    47    45    42
TLN1      46    0.94   47    45    43    40
BWL1      50    1.08   53    51    50    47
PS1       55    1.22   59    57    55    51
PSS1      55    1.21   62    60    58    55
SSPK2     57    1.24   59    58    56    53

Table 8.1: 2010 MIREX Music Similarity and Retrieval Results
Chapter 9

Classification and Tagging

9.1 Introduction
The combination of audio feature extraction followed by a stage of machine learning forms the basis of a variety of different MIR classification systems that have been proposed. Probably the most widely studied problem is automatic genre or style classification. More recently emotion and mood recognition have also been tackled. The task of identifying musical instruments either in monophonic or polyphonic recordings is something listeners, especially if they are musically trained, can perform with reasonable accuracy. Instrument recognition is another well explored MIR classification topic covered in this chapter.
9.2 Genre Classification
In the last 15 years musical genre classification of audio signals has been widely
studied and can be viewed as a canonical problem in which audio feature extraction followed by supervised learning has been used. Frequently new approaches
to audio feature extraction are evaluated in the context of musical genre classification. Examples of such new feature sets include sparse overcomplete representations [28] as well as bio-inspired joint acoustic and modulation frequency representations [81].
Audio classification has a long history in the area of speech recognition but it was only more recently applied to music, starting around 2000. Even though there
was some earlier work [60, 91] a good starting point for audio-based musical
genre classification is the system that was proposed in 2002 by Tzanetakis [98].
This work was influential as it was the first time that a relatively large, at least
for the time, audio collection was used (1000 clips each 30 seconds long evenly
distributed in 10 genres). Another important contribution was the use of features
related to pitch and rhythmic content that were specific to music. Probably most importantly, automatic genre classification was a problem that clearly combined digital signal processing and classic machine learning with an easy to explain evaluation methodology that did not require user involvement. In the next 15 years, a variety of automatic genre classification systems were proposed, frequently utilizing the infamous GTZAN dataset for evaluation. There have also been valid criticisms about overuse of this dataset as well as methodological issues with automatic genre classification.
There is a large body of work in musical genre classification (the term recognition is also frequently used). One of the most complete bibliographies on this topic can be found in the thorough survey paper on evaluation in musical genre recognition by Bob Sturm [], who cites almost 500 publications related to this topic and provides a lot of information about the associated history, different ways of evaluation, and the datasets that have been used. There have also been several surveys of musical genre recognition [].
9.2.1 Formulation
We begin our exposition by describing more formally the problem of automatic
genre classification in its most common and basic form.
Once the audio features are extracted they need to be used to “train” a classifier using supervised learning techniques. This is accomplished using all the labeled audio feature representations for a collection of training tracks. If the audio feature representation has been summarized over the entire track to a single high-dimensional feature vector then this corresponds to the “classic” formulation of
classification and any machine learning classifier can be used directly. Examples
of classifiers that have been used in the context of audio-based music classification
include: Gaussian Mixture Models [4, 98], Support Vector Machines [73, 64], and
AdaBoost [10].
Another alternative is to perform classification in smaller segments and then
aggregate the results using majority voting. A more complicated approach (frequently called bag-of-frames) consists of modeling each track using distribution estimation methods, for example a Gaussian Mixture Model trained using the EM algorithm. In this case each track is represented as a probability distribution
rather than a single high-dimensional point (feature vector). The distance between
the probability distributions can be estimated for example using KL-divergence
or approximations of it for example using the Monte-Carlo method depending
on the particular parametric form used for density estimation. By establishing a
distance metric between tracks it is possible to perform retrieval and classification
by simple techniques such as k-nearest neighbors [4].
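As a rough sketch of the last point, the following fragment classifies a track with a k-nearest-neighbor rule given any pairwise track distance function; the track_distance placeholder could be, for instance, the symmetric KL divergence between per-track Gaussian models, and the toy reference tracks and genre labels are hypothetical.

    import numpy as np
    from collections import Counter

    def knn_predict(query, references, labels, track_distance, k=5):
        """Classify a track by majority vote among its k nearest reference tracks."""
        dists = [track_distance(query, r) for r in references]
        nearest = np.argsort(dists)[:k]
        votes = Counter(labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Hypothetical track models: mean feature vectors, compared with Euclidean distance.
    rng = np.random.default_rng(0)
    references = [rng.normal(loc=g, size=20) for g in (0.0, 0.0, 1.0, 1.0, 1.0)]
    labels = ["jazz", "jazz", "rock", "rock", "rock"]
    query = rng.normal(loc=1.0, size=20)
    print(knn_predict(query, references, labels,
                      lambda a, b: float(np.linalg.norm(a - b)), k=3))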
9.2.2 Evaluation
Evaluation of classification is relatively straightforward and in most ways identical
to any classification task. The standard approach is to compare the predicted labels
with ground truth labels. Common metrics include classification accuracy as well
as retrieval metrics such as precision, recall, and f-measure. When retrieval metrics are used it is assumed that for a particular query the relevant documents are the tracks
with the same genre label. Cross-validation is a technique frequently used in
evaluating classification where the labeled data is split into training and testing
sets in different ways to ensure that the metrics are not influenced by the particular
choice of training and testing data. One detail that needs to be taken into account is
the so-called album effect in which classification accuracy improves when tracks
from the same album are included in both training and testing data. The cause is
recording production effects that are common between tracks in the same album.
The typical approach is to ensure that when performing cross-validation tracks
from the same album or artist only go to either the training or testing dataset.
Classification accuracy on the same dataset and using the same cross-validation
approach can be used for comparing the relative performance of different algorithms and design choices. Interpreting the classification accuracy in absolute
terms is trickier because of the subjective nature of genre labeling as has already
been discussed in the section on ground truth labeling. In the early years of
research in audio-based musical genre classification each research group utilized
different datasets, cross-validation approaches and metrics making it hard to draw
Task                                  Result
Genre Classification                  66.41
Genre Classification (Latin)          65.17
Audio Mood Classification             58.2
Artist Identification                 47.65
Classical Composer Identification     53.25
Audio Tag Classification              0.28

Table 9.1: Audio-based classification tasks for music signals (MIREX 2009)
any conclusions about the merits of different approaches. Sharing datasets is
harder due to copyright restrictions. The Music Information Retrieval Evaluation
Exchange (MIREX) [25] is an annual event where different MIR algorithms are
evaluated using a variety of metrics on different tasks including several audiobased classification tasks. The participants submit their algorithms and do not
have access to the data which addresses both the copy-right problem and the issue
of over-fitting to a particular dataset. Table 9.1 shows representative results of the best performing system in different audio classification tasks from MIREX
2009. With the exception of Audio Tag Classification all the results are percentages of classification accuracy. For audio tag classification the average f-measure
is used instead.
9.2.3 Criticisms
9.3 Symbolic genre classification
9.4 Emotion/Mood
9.5 Instrument
[45]

9.6 Other

music/speech

9.7 Symbolic

[55]

9.8 Tagging

9.9 Further Reading

[96]
Chapter 10

Structure

10.1 Segmentation

[32] [50]

10.2 Alignment

[?] [?]

10.3 Structure Analysis
Chapter 11

Transcription

11.1 Monophonic Pitch Tracking

[9], [56]

11.2 Transcription

[8]

11.3 Chord Detection

[6]
Chapter 12

Symbolic Music Information Retrieval
Chapter 13

Queries

13.1 Query by example

[75] [11] [79]

13.2 Query by humming
Chapter 14

Fingerprinting and Cover Song Detection

14.1 Fingerprinting

[2] [44]

14.2 Cover song detection
In addition to content-based similarity there are two related music mining problems. The goal of audio fingerprinting is to identify whether a music track is one
of the recordings in a set of reference tracks. The problem is trivial if the two
files are byte identical but can be considerably more challenging when various
types of distortion need to be taken into account. The most common distortion is
perceptual audio compression (such as mp3 files) which can result in significant
alterations to the signal spectrum. Although these alterations are not directly
perceptible by humans they make the task of computer identification harder. Another common application scenario is music matching/audio fingerprinting for
mobile applications in which the query signal is acquired through a low quality
microphone on a mobile phone and contains a significant amount of background noise and interference. At the same time the underlying signal is the same exact music recording, which can help find landmark features and representations that are invariant to these distortions. Cover song detection is the more subtle problem of finding versions of the same song, possibly performed by different artists, with different instrumentation, or at a different tempo. As the underlying signals are completely different it requires
the use of more high level representations such as chroma vectors that capture
information about the chords and the melody of the song without being affected
by timbral information. In addition it requires sophisticated sequence matching
approaches such as dynamic time warping (DTW) or Hidden Markov Models
(HMM) to deal with the potential variations in tempo. Although both of these
problems can be viewed as content-based similarity retrieval problems with an
appropriately defined notion of similarity they have some unique characteristics.
Unlike the more classic similarity retrieval in which we expect the returned results
to gradually become less similar, in audio fingerprinting and cover song detection
there is a sharper cutoff defining what is correct or not. In the ideal case copies
or cover versions of the same song should receive a very high similarity score
and everything else a very low similarity score. This specificity is the reason why
typically approaches that take into account the temporal evolution of features are
more common.
Audio fingerprinting is a mature field with several systems being actively used
in industry. As a representative example we describe a landmark-based audio
fingerprinting system based on the ideas used by Shazam which is a music matching service for mobile phones. In this scheme each audio track is represented
by the location in time and frequency of prominent peaks of the spectrogram.
Even though the actual amplitude of these peaks might vary due to noise and
audio compression their actual location in the time frequency plane is preserved
quite well in the presence of noise and distortion. The landmarks are combined
into pairs and each pair is characterized by three numbers f1 , f2 , t which are the
frequency of the first peak, the frequency of the second peak and the time between
them. Both reference tracks and the query track are converted into this landmark
representation. The triplets characterizing each pair are quantized with the basic
idea being that if the query and a reference track have a number of common landmarks with consistent timing they are a match. The main challenge in an industrial
strength implementation is deciding on the number of landmarks per second and
the thresholds used for matching. The lookup of the query landmarks into the
large pool of reference landmarks can be performed very efficiently using hashing
techniques to effectively create an inverted index which maps landmarks to the
files from which they originate.
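The toy sketch below illustrates the overall structure of such a scheme; it is not the actual Shazam algorithm. It keeps only one spectral peak per frame, hashes (f1, f2, dt) triplets into a Python dictionary acting as the inverted index, and scores candidates by simple vote counting, omitting the check for consistent time offsets described above.

    import numpy as np
    from collections import defaultdict

    def landmarks(spec, fan_out=3):
        """Pick the peak frequency bin of each spectrogram frame and pair it
        with the peaks of the next few frames, giving (f1, f2, dt) triplets."""
        peaks = np.argmax(spec, axis=0)               # one peak per frame
        marks = []
        for t1, f1 in enumerate(peaks):
            for dt in range(1, fan_out + 1):
                if t1 + dt < len(peaks):
                    marks.append((int(f1), int(peaks[t1 + dt]), dt, t1))
        return marks

    def build_index(reference_specs):
        """Inverted index mapping (f1, f2, dt) hashes to reference track ids."""
        index = defaultdict(list)
        for track_id, spec in enumerate(reference_specs):
            for f1, f2, dt, t1 in landmarks(spec):
                index[(f1, f2, dt)].append((track_id, t1))
        return index

    def match(query_spec, index):
        """Count hash hits per reference track and return the best match."""
        votes = defaultdict(int)
        for f1, f2, dt, _ in landmarks(query_spec):
            for track_id, _ in index.get((f1, f2, dt), []):
                votes[track_id] += 1
        return max(votes, key=votes.get) if votes else None

    # Illustrative usage with random "spectrograms" (freq bins x frames).
    rng = np.random.default_rng(0)
    refs = [rng.random((64, 200)) for _ in range(3)]
    index = build_index(refs)
    print(match(refs[1][:, 50:150], index))   # expected to recover track id 1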
To solve the audio cover song detection problem there are two issues that
need to be addressed. The first issue is to compute a representation of the audio
signal that is not affected significantly by the timbre of the instruments playing
but still captures information about the melody and harmony (the combination of
discrete pitches that are simultaneously sounding) of the song. The most common
representation used in music mining for this purpose are chroma vectors (or pitch
class profiles) which can be thought of as histograms showing the distribution
of energy among different discrete pitches. The second issue that needs to be
addressed is how to match two sequences of feature vectors (chroma vectors in
this case) that have different timing and length as there is no guarantee that a
cover song is played at the same tempo as the original and there might be multiple
sections with different timing each.
More formally the problem is: given two sequences of feature vectors with different lengths and timings, find the optimal way of “elastically” transforming the sequences so that they match each other. A common technique used to solve this problem, and also frequently employed in the literature for cover song detection, is Dynamic Time Warping (DTW), a specific variant of dynamic programming. Given two time series of feature vectors X = (x_1, x_2, . . . , x_M) and Y = (y_1, y_2, . . . , y_N) with x_i, y_j ∈ R^d, the DTW algorithm yields an optimal
solution in O(M N ) time where M and N are the lengths of the two sequences. It
requires a local distance measure that can be used to compare individual feature
vectors which should have small values when the vectors are similar and large
values when they are different:
d : R^d × R^d → R≥0    (14.1)
The algorithm starts by building the distance matrix C ∈ R^{M×N} representing all
the pairwise distances between the feature vectors of the two sequences. The goal
of the algorithm is to find the alignment or warping path which is a correspondence between the elements of the two sequences with the boundary constraint
that the first and last elements of the two sequences are assigned to each other.
Intuitively for matching sequences the alignment path will be roughly diagonal
and will run through the low-cost areas of the cost matrix. More formally the
alignment is a sequence of points (pi , pj ) ∈ [1 : M ] × [1 : N ] for which
the starting and ending points must be the first and last points of the aligned
sequences, the points are time-ordered and each step size is constrained to either
move horizontally, vertically or diagonally. The cost function of the alignment
path is the sum of all the pairwise distances associated with its points and the
alignment path that has the minimal cost is called the optimal alignment path
and is the output of the DTW.
Figure 14.1: Similarity matrix between energy contours and alignment path using Dynamic Time Warping. (a) Good alignment. (b) Bad alignment.
Figure 14.1 shows two distance matrices that are calculated based on energy
contours of different orchestra music movements. The left matrix is between two
performances by different orchestras of the same piece. Even though the timing
and duration of each performance is different they exhibit a similar overall energy
envelope shown by the energy curves under the two axes. The optimal alignment
path computed by DTW is shown imposed over the distance matrix. In contrast
the matrix on the right shows the distance matrix between two unrelated orchestral
movements where it is clear there is no alignment.
Cover song detection is performed by applying DTW between the query song and all the references and returning as a potential match the one with the
minimum total cost for the optimal alignment path. Typically the alignment cost
between covers of the same song will be significantly lower than the alignment
cost between two random songs. DTW is a relatively costly operation and therefore this approach does not scale to large number of songs. A common solution
for large scale matching is to apply an audio fingerprinting type of approach with
efficient matching to filter out a lot of irrelevant candidates and once a sufficient
small number of candidate reference tracks have been selected apply pair-wise
DTW between the query and all of them.
                                   RE     SZA    TA
Mean # of covers in top 10         6.20   7.35   1.96
Mean Average Precision             0.66   0.75   0.20
Mean Rank of first correct cover   2.28   6.15   29.90

Table 14.1: 2009 MIREX Audio Cover Song Detection - Mixed Collection

                                   RE     SZA    TA
Mean # of covers in top 10         8.83   9.58   5.27
Mean Average Precision             0.91   0.96   0.56
Mean Rank of first correct cover   1.68   1.61   5.49

Table 14.2: 2009 MIREX Audio Cover Song Detection - Mazurkas
Table 14.1 shows the results of the audio cover song detection task of MIREX
2009 on the so-called “mixed” collection, which consists of 1000 pieces that contain 11 “cover songs” each represented by 11 different versions. As can be seen the
performance is far from perfect but it is still impressive given the difficulty of the
problem. An interesting observation is that the objective evaluation measures are
not consistent. For example the RE algorithm performs slightly worse than the
SZA in terms of mean average precision but has a better mean rank for the first
correctly identified cover. Table 14.2 shows the results of the MIREX 2009
audio cover song detection task for the Mazurkas collection, which consists of 11 different performances/versions of 49 Chopin Mazurkas. As can be seen from the results this is an easier dataset in which to find covers, probably due to the smaller size
and more uniformity in timbre. The RE algorithm is based on the calculation of
different variants of chroma vectors utilizing multiple feature sets. In contrast to
the more common approach of scoring the references in a ranked list and setting
up a threshold for identifying covers it follows a classification approach in which
a pair is either classified as reference/cover or as reference/non-cover. The SZA
algorithm is based on harmonic pitch class profiles (HPCP) which are similar
126
CHAPTER 14. FINGERPRINTING AND COVER SONG DETECTION
Dr
aft
to chroma vectors but computed over a sparse harmonic representation of the
audio signal. The sequence of feature vectors of one song is transposed to the
main tonality of the other song in consideration. A state space representation of
embedding m and time delay z is used to represent the time series of HPCP with
a recurrence quantification measure used for calculating cover song similarity.
Chapter 15

Other topics
In this chapter we cover several MIR topics that are either emerging or have limited associated literature, or both.
15.1 Optical Music Recognition
[16] [72] [?]
15.2 Digital Libraries and Metadata
[5], [26] [?], [68] [27]
15.3 Computational Musicology
15.3.1 Notated Music
[17] [59]
15.3.2 Computational Ethnomusicology
[35] [13] [?] [?] [?] [?]
15.3.3 MIR in live music performance
15.4 Further Reading
Part III

Systems and Applications
Chapter 16

Interaction

[36] [?]

16.1 Interaction
Chapter 17

Evaluation

17.1 Bibliography
Chapter 18

User Studies

[?] [?] [?]
Chapter 19

Tools
Although frequently researchers implement their own audio feature extraction
algorithms there are several software collections that are freely available that
contain many of the methods described in this chapter. They have enabled researchers more interested in the data mining and machine learning aspects of
music analysis to build systems more easily. They differ in the programming
language/environment they are written, the computational efficiency of the extraction process, their ability to deal with batch processing of large collections,
their facilities for visualizing feature data, and their expressiveness/flexibility in
describing complex algorithms.
Table 19.1 summarizes information about some of the most commonly used
software resources as of the year 2010. The list is by no means exhaustive but
does provide reasonable coverage of what is available. Several of the figures in
this chapter were created using Marsyas and some using custom MATLAB code.
19.1 Datasets
[38]
Name               URL                           Programming Language
Auditory Toolbox   tinyurl.com/3yomxwl           MATLAB
CLAM               clam-project.org/             C++
D.Ellis Code       tinyurl.com/6cvtdz            MATLAB
HTK                htk.eng.cam.ac.uk/            C++
jAudio             tinyurl.com/3ah8ox9           Java
Marsyas            marsyas.info                  C++/Python
MA Toolbox         pampalk.at/ma/                MATLAB
MIR Toolbox        tinyurl.com/365oojm           MATLAB
Sphinx             cmusphinx.sourceforge.net/    C++
VAMP plugins       www.vamp-plugins.org/         C++

Table 19.1: Software for Audio Feature Extraction
19.2 Digital Music Libraries
VARIATIONS
Appendix A

Statistics and Probability

A brief overview of the necessary concepts.
Email Ashley to send me his PhD thesis
Appendix B

Project ideas
In this Appendix I provide some information about projects related to music
information retrieval based on the course that I have taught at the University
of Victoria. First I provide some information about how to plan and organize
a group project in MIR. A list of past projects done in the course is provided,
followed by some additional suggestions. These are appropriate for class projects
of small teams (2-3 students). I am really proud that some of these projects have
resulted in ISMIR publications. Another good source of project ideas are the
online proceedings of the ISMIR conference.
B.1 Project Organization and Deliverables
Music Information Retrieval is an interdisciplinary field so there is considerable
variation in the types of projects one can do. The projects described in this
appendix differ in the skills required, type of programming, availability of existing
published work, and many other factors. The exact details are usually refined over
the course of the term through better knowledge of MIR and interaction with the
instructor. The expectations for each project are also adjusted depending on the
number of students in the group.
Typically projects have the following components/stages:
1. Problem specification, data collection and ground truth annotation
2. Information extraction and analysis
3. Implementation and interaction
4. Evaluation
At least one of these components/stages should be non-trivial for each project. For example, if the project tackles a completely new task, then a good problem specification and associated data collection might be sufficient, followed by some baseline analysis using existing tools. Another possibility would be to take an existing task for which there is data and code and build a really nice and intuitive graphical user interface, in which case the non-trivial part would be the implementation and interaction stage. A thorough experimental comparison of existing techniques for a particular task would also be a valid project. What should be avoided is simply taking existing tools and data for a well-known task and just getting some results. Excellent projects, which would have a good chance of being accepted as ISMIR submissions, typically have more than one of these stages being novel and non-trivial. The exact details are worked out throughout the term based on the project deliverables.
The following deliverables are required for each project. All the written reports should conform to the template used for ISMIR submissions and follow the
structure of an academic paper. It is a good idea to read a few ISMIR papers
in order to familiarize yourself with both the template and the sections that are
typically used. Each successive report builds upon the previous one. There is also
a final presentation. In the MIR course at the University of Victoria, the project is
worth 60% of the final grade.
1. Design Specification (15%)
The design specification describes the proposed project, provides a timeline with specific objectives, and outlines the role of each team member.
In addition, it describes the tools, data sets, and associated literature that the
group is going to use for the project. This report should be 2-4 pages using
the ISMIR template. A proper bibliography of 15-20 references should also
be provided.
2. Progress Report (15%)
The progress report extends the design specification with specific information about what has been accomplished so far. Specific objectives achieved
are mentioned and deviations from the original plan are discussed. The
time line is also revised to reflect the deeper understanding of the project.
Additional literature can also be provided at this stage. This report should
be 1-2 pages using the ISMIR template in addition to the text of the design
specification.
3. Final Report (20%)
The final report describes the system developed, the evaluation methodology, and the role of each member. It also describes which of the objectives were achieved and what has been learned, and it provides ideas for future work. The target audience is a future student of the course who would like to work on a similar project. The final report should be 2-3 pages of text in addition to the text of the progress report. The final document, which will contain parts of the design specification, progress report, and final report, should be 6-10 pages using the ISMIR template. A bibliography of 20-30 references to literature relevant to the project should also be included.
4. Presentation (10%)
The final presentation is typically done with all of the class attending and is a 10-minute slide presentation summarizing what was done and what was learned. It can also include a short video or demo of the system developed. For groups or students who cannot attend the final presentation physically, a 10-minute video or audio recording plus slides can be used instead.
B.2 Previous Projects
The following projects have been completed by groups of students in past offerings of the Music Information Retrieval course at the University of Victoria. There is some variety in terms of difficulty and novelty, but that is intentional, as students can pick a project that is a realistic match with their time commitment to the course. The descriptions are taken directly, with minimal editing, from the abstracts of the final project reports written by the students.
• Physical Interfaces for Music Retrieval
This project explores the integration of tangible interfaces with a music information retrieval system. The framework, called IntelliTrance, uses sensor data via MIDI input to query and search a library of beats/sounds/loops. This application introduces a new level of control and functionality to the modern DJ and expert musician with a variety of flexible capabilities. IntelliTrance is written in Marsyas, a rapid prototyping and experimentation system that offers audio analysis and synthesis with specific emphasis on music signals and music information retrieval.
• World Music Classification
Classification of music into a general genre is a difficult task even for most humans; it is even more difficult to automate by a computer. Typically music would have attached to it meta-data that would include its genre classification, but as music databases become more prolific, this information is essentially missing. Automatic classification becomes essential for organizing music databases. In this project, we classify the genre of international music from a specific area and train a classifier to predict the genre of an unknown song given that it is from the same region. For this project we chose Turkish music as our region of interest.
• Audio Caricatures
We have set out to tackle three problems. The first is how to accurately detect pitch in acoustic songs in WAV format. The second is how to translate this information into a high-level form that can be read and used by music production programs. And finally we want to create a caricature of the original WAV file by excluding certain instruments. The result of our attempt at these problems has been Kofrasi, a Perl program that uses text files produced by the pitch detection program, which bridges the gap between analyzing the original WAV files and creating the final caricature.
• An Interactive Query-by-Humming System
The challenge of creating an effective and robust front-end for a query-by-humming system is a familiar topic in the field of Music Information Retrieval. This paper discusses the application we have created as a means
to explore this problem. It first looks at the issues involved in creating
an effective user interface and examines our own solution, specifically in
relation to how it reconciles ease of use with maximal user control.
• SIREN: A Sound Information Retrieval ENgine
Multimedia players and the Internet have significantly altered the way people listen to and store music libraries. Many people do not realize the value
of their personal collections because they are unable to organize them in
a clear and intuitive manner that can be efficiently searched. Clearly, a
better way of fetching audio files and creating rich playlists from personal
music collections needs to be developed utilizing MIR techniques. We have
developed SIREN, a multimedia player that has been optimized to perform
this task.
• MIXMASTER: Automatic Musical Playlist Generation
Mixmaster is a playlist creator. Start and end songs are specified, as well as a time length. A playlist is then created by finding songs that progressively approach the specified end song in similarity, while not deviating greatly from the original start, and that lasts as long as the user wanted.
• Musical Instrument Classification in Mixture of Sounds
This project attempts to evaluate the possibility of classifying musical instruments from an amplitude-panned stereo or multi-channel mix. First the signal from each individual instrument is isolated by the frequency-domain upmix algorithm proposed by Avendano and Jot. Then various features are extracted from the isolated instrument sound. The last step is to classify these features and try to determine what instrument(s) are present in the mixture of sound. The results show that the feature extraction and classification produce decent performance on single instruments, and as a whole, the classification system also produces reasonable results considering the distortion and loss of information introduced by the upmix.
• Ostitch: MIR applied to musical instruments
The paper discusses the use of MIR in computer music instruments. It proposes and implements a performance-time MIR-based instrument (Ostitch) that produces "audio mosaics" or "audio collages". Buffering, overlapping and stitching (audio concatenation) algorithms are discussed, and problems around these issues are evaluated in detail. Overlapping and mixing algorithms are proposed and implemented.
• Extracting Themes from Symbolic Music
This project investigates extracting the dominant theme from a MIDI file.
The problem was broken into two tasks: track extraction and theme extraction. Two algorithms were developed for both tasks and a combination of these algorithms was tested on various MIDI files.
• Singer Gender Identification
Automatic detection of whether a recording contains a male or female voice
is an interesting problem. Ideally a computer could be programmed to
recognize whether a singer is male or female in both monophonic and polyphonic audio samples. Artificial neural networks are a common technique
for classifying data. Given enough training data and sufficient time to
train, a neural network should be able to classify the gender of a voice
in a recording with fairly high accuracy. This report outlines our results
in building an artificial neural network that can determine the gender of a
voice in both monophonic and polyphonic recordings.
• Music information retrieval techniques for computer-assisted music mashup
production
A music mashup (also known as a bootleg, boot, blend, bastard pop, smashup,
and cut-up) is a style of music where two or more songs are mixed together
to create a new song. Traditionally, mashup artists use trial and error, record
keeping of tempos and keys of their song collection, and meticulous audio
editing to create mashups. For this research project I will investigate the use
of music information retrieval techniques to assist this production process
by automatically determining tempos and keys of songs, as well as determining
the potential of two songs to mix together by using these features in addition
to other features such as frequency distribution.
• Automatic Classification and Equalization of Music
For this project we designed a program for a stereo system which automatically classifies the current music being played and changes the equalization
to best match the genre of music.
• Location- and Temporally-Aware Intelligent Playlists
As location-aware mobile devices (PMPs, media-centric cell phones, etc.)
are becoming increasingly ubiquitous, developers are gaining location data
based on GPS, cellular tower triangulation, and Wi-Fi access point triangulation for use in a myriad of applications. I propose to create a proof-of-concept location-aware intelligent playlist for mobile devices (e.g. an Apple iPod Touch) that will generate a playlist based on media metadata (such as tags, BPM, track length, etc.) and profiling of the user's preferences based on absolute location as well as route of travel.
Route-of-travel-based information could be useful in determining a user's listening habits while in transit to common destinations ("What does the user like listening to on the way to UVic from home?"), but another major factor missing is the time of travel. Does the user like listening to something mellow on the way to UVic in the morning, but prefer something more energetic while heading to UVic in the afternoon? If the user is heading downtown in the afternoon to shop, do they listen to the same music as when heading downtown at 10pm to hit a club? By profiling the user by absolute location, path, destination, and time, more interesting and appropriate playlists may be created using a history of the user's preferences than with a shotgun approach of using one playlist for the entire day without context.
• Guitar Helper
A system is developed that detects which notes are being played on a guitar and creates a standard guitar tab. The system should work for both single notes and multiple notes/chords. An extension to the system would be to have the system perform accompaniment in the form of beats and/or chords. This will require beat detection and key detection.
• RandomTracks: Music for sport (Corey Sanford)
RandomTracks is a GUI that allows the user to select a genre of music, or several genres of music to be played amongst each other. These genres can be played for a set period of time (for example, perhaps 45 minutes of ambient music, or 3 minutes of punk rock). The specific problem this program will be built for is compiling a playlist for boxing sparring rounds. In gyms, songs of various genres (mostly metal/rock) are played, with the low points of a song sometimes coinciding with the more intense parts of training (the end of a 3-minute round). This program, RandomTracks, can choose to play the most "intense" 3 minutes from a song that fits the genre specified by the user. As an example, the user may want to specify: A: 3 minutes, Rock, increasing intensity. B: 1 minute, Ambient, random start position. Alternate A and B. (Could also randomize A and B.)
• Robot Drummer Clap Tracking (Neil MacMillan)
Build a clap sensor - this will offload the signal processing from the Arduino to another chip and eliminate the need to move around the robot’s
stepper motor wiring (the stepper is connected to the analog input pins).
Measure the software delays and physical delays between the stimulus and
the drummer’s strike at various velocities, to get a table of delays that can be
used to predict the next beat. Modify the robot’s firmware to accept the clap
sensor’s digital output and predictively play along with a simple, regular
beat pattern. Program the robot to play along with more sophisticated
patterns (e.g. variations in timing and velocity).
B.3 Suggested Projects
The following are some representative ideas for projects for this class. They are
listed in no particular order.
• Feature Extraction on Multi-Core Processors
Feature extraction forms the basis of many MIR algorithms. It is a computationally intensive process; however, it has low memory requirements and is very straightforward to parallelize. The goal of this project is to explore how modern multi-core processors can be utilized to provide more efficient feature extraction. Experiments to determine the best granularity will be conducted using the Marsyas software framework.
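As a rough sketch of the coarsest (per-file) level of granularity, the following code distributes a toy RMS feature computation over worker processes using Python's multiprocessing module; it stands in for, rather than reproduces, the Marsyas-based experiments mentioned above, and the collection path is a placeholder.

```python
import glob
from multiprocessing import Pool

import numpy as np
from scipy.io import wavfile

def extract(path):
    """Toy per-file feature: RMS energy. Each file is independent,
    so files can be handed to separate worker processes."""
    sr, samples = wavfile.read(path)
    samples = samples.astype(float)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)
    return path, float(np.sqrt(np.mean(samples ** 2)))

if __name__ == "__main__":
    files = glob.glob("collection/*.wav")   # placeholder collection
    with Pool() as pool:                    # one worker per core by default
        for path, rms in pool.map(extract, files):
            print(path, rms)
```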
• Genre classification on MIDI data
Genre classification has mostly been explored in the audio domain for some
time. More recently algorithms for genre classification based on statistics/features over symbolic data such as MIDI files have appeared in the
literature. The goal of this project would be to recreate some of the existing
algorithms and investigate alternatives.
• A framework for evaluating similarity retrieval for music
A variety of similarity-based retrieval algorithms have been proposed for
music in audio-format. The only way to reliably evaluate content-based
similarity retrieval is to conduct user studies. The goal of this project is
to build a framework (possibly web-based) that would allow different algorithms for audio similarity to be used and evaluated by users. The main
challenge would be to design the framework to be flexible with respect to the way the algorithms are evaluated, the similarity measure, the presentation mode, etc.
• Sensor-based MIR
One of the less explored areas in MIR is the interface of MIR systems to
the user. As more and more music is available in portable digital music
players of various forms and sizes we should envision how MIR can be used
on these devices. This project is going to explore how sensor technology
such as piezos, knobs, sliders can be used for browsing music collections,
specifying music queries (for example tapping a query or playing a melody),
and for annotation tasks such as onset detection and beat locations.
• Key finding in polyphonic audio
There has been some existing work on key finding on symbolic scores.
In addition, pitch-based representations such as Chroma vectors or Pitch
Histograms have been shown to be effective for alignment, structural analysis and classification. This project will explore the use of pitch-based
representations in order to identify the key in polyphonic audio recordings.
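One common baseline that such a project could start from is correlating an averaged chroma vector against the 24 rotated Krumhansl-Kessler key profiles, as in the following sketch; this is only a starting point and not a complete key-finding system.

```python
import numpy as np

# Krumhansl-Kessler key profiles (C major / C minor)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def estimate_key(chroma):
    """Correlate an averaged 12-bin chroma vector with all 24 rotated
    key profiles and return the best-matching key label."""
    mean_chroma = np.asarray(chroma, dtype=float).mean(axis=0)
    best, best_r = None, -np.inf
    for tonic in range(12):
        for profile, mode in ((MAJOR, 'major'), (MINOR, 'minor')):
            r = np.corrcoef(mean_chroma, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = NOTES[tonic] + ' ' + mode, r
    return best

# toy usage: chroma energy on C, E and G should correlate best with C major
frames = np.tile([1, 0, 0, 0, .8, 0, 0, .9, 0, 0, 0, 0], (50, 1))
print(estimate_key(frames + 0.05 * np.random.rand(50, 12)))
```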
• Query-by-humming front-end
The first stage in a QBH system is to convert a recording of a human singing,
humming or whistling into either a pitch contour or note sequence that
can then be used to search a database of musical pieces for a match. A
large variety of pitch detection algorithms have been proposed in literature.
This project will explore different pitch detection algorithms as well as note
segmentation strategies.
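As a simple starting point, an autocorrelation-based tracker such as the one sketched below can produce a rough pitch contour for a monophonic query; real front-ends would add voicing decisions, smoothing, and note segmentation, and the frame and range parameters here are arbitrary.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one frame by picking the
    autocorrelation peak inside the expected lag range (a deliberately
    simple stand-in for YIN-style detectors)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_contour(signal, sr, frame=2048, hop=512):
    """Frame-by-frame pitch contour for a monophonic hummed query."""
    return np.array([autocorr_pitch(signal[i:i + frame], sr)
                     for i in range(0, len(signal) - frame, hop)])

# toy usage: a 220 Hz sine should yield a contour near 220 Hz
sr = 22050
t = np.arange(sr) / sr
print(pitch_contour(np.sin(2 * np.pi * 220 * t), sr)[:5])
```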
• Query-by-humming back-end
Once either a pitch contour or a series of notes have been extracted they
can be converted to some representation that can then be used to search a
database of melodies for approximate matches. In this project some of the
major approaches that have been proposed for representing melodies and
searching melodic databases will be implemented.
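A minimal baseline for such a back-end is to represent melodies as interval sequences (for transposition invariance) and rank database entries by edit distance to the query, as sketched below; practical systems use richer representations and approximate matching.

```python
def interval_string(midi_notes):
    """Represent a melody by its sequence of pitch intervals so that
    queries in a different key can still match."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

def edit_distance(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[-1][-1]

# toy database search: the transposed query still matches melody A best
database = {'A': [60, 62, 64, 65, 67], 'B': [60, 60, 67, 67, 69]}
query = [65, 67, 69, 70, 72]   # melody A transposed up a fourth
scores = {name: edit_distance(interval_string(query), interval_string(notes))
          for name, notes in database.items()}
print(min(scores, key=scores.get))
```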
• ThemeFinder
In order to search for melodic fragments in polyphonic music it is necessary
to extract the most important ”themes” of a polyphonic recording. This can
be done by incorporating knowledge from voice leading, MIDI instrument
labels, amount of repetition, melodic shape and many other factors. The
goal of this project is to implement a theme finder using both techniques
described in the literature as well as exploring alternatives.
• Structural analysis based on similarity matrix
The similarity matrix is a visual representation that shows the internal structure of a piece of music (chorus-verse, measures, beats). By analyzing this
representation it is possible to reconstruct the structural form of a piece of
music such as AABA.
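One common way to operationalize this is to slide a checkerboard kernel along the main diagonal of the self-similarity matrix and treat peaks of the resulting novelty curve as candidate section boundaries. The sketch below illustrates the idea on synthetic features; grouping the detected boundaries into a form such as AABA would be the remaining part of the project.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix of a (frames x dims) feature sequence."""
    f = np.asarray(features, dtype=float)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    return f @ f.T

def novelty_curve(ssm, kernel_size=16):
    """Correlate a checkerboard kernel along the diagonal of the
    self-similarity matrix; peaks suggest section boundaries."""
    half = kernel_size // 2
    kernel = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]),
                     np.ones((half, half)))
    n = ssm.shape[0]
    curve = np.zeros(n)
    for i in range(half, n - half):
        patch = ssm[i - half:i + half, i - half:i + half]
        curve[i] = np.sum(patch * kernel)
    return curve

# toy usage: two contrasting "sections" produce a novelty peak at the join
bias_a = np.array([1.0] + [0.0] * 11)        # energy concentrated in bin 0
bias_b = np.array([0.0] * 11 + [1.0])        # energy concentrated in bin 11
features = np.vstack([np.random.rand(100, 12) + bias_a,
                      np.random.rand(100, 12) + bias_b])
ssm = self_similarity(features)
print(int(np.argmax(novelty_curve(ssm))))    # expected near frame 100
```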
• Drum pattern similarity retrieval
Drums are part of a large number of musical pieces. There are many software packages that provide a wide variety of drum loops/pattern that can
be used to create music. Typically these large drum loop collections can
only be browsed/searched based on filename. The aim of this project is to explore how the actual sound/structural similarity between drum patterns can be exploited for finding drum loops that are "similar". Accurate drum pattern classification/similarity can potentially lead to significant advances in audio MIR, as most recorded music today is characterized by drums and their patterns.
• Drum detection in polyphonic audio
Recently researchers have started looking at the problem of identifying
individual drum sounds in polyphonic music recordings such as hihat, bass
drum, etc. In this project, students will implement some of these new algorithms and explore variations and alternative approaches. A significant part of the project will consist of building tools for obtaining ground truth annotations as well as evaluating the developed algorithms.
• Content-based audio analysis using plugins
Many of the existing software music players such as WinAmp or iTunes provide an API for writing plugins. Although typically geared toward spectrum visualization, these plugins could potentially be used as a front-end for feature extraction, classification and similarity retrieval. This project will explore this possibility.
• Chord-detection in polyphonic audio
Even though polyphonic transcription of general audio is still far from being solved, a variety of pitch-based representations such as chroma vectors and pitch histograms have been proposed for audio. There is some limited research on using such representations, potentially with some additional knowledge (such as likely chord progressions), to perform chord detection in polyphonic audio signals. The goal of this project is to explore possibilities in that space. Jazz standards or Beatles tunes might be a good starting point for data.
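A simple baseline for this task is frame-wise matching of chroma vectors against binary major/minor triad templates, as sketched below; adding temporal smoothing or chord-progression knowledge, as suggested above, would be the interesting part of the project.

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Binary chroma templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for name, intervals in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[NOTES[root] + name] = t / np.linalg.norm(t)
    return templates

def detect_chords(chroma_frames):
    """Label each chroma frame with the template of highest correlation
    (no chord-progression smoothing applied)."""
    templates = chord_templates()
    labels = []
    for frame in np.asarray(chroma_frames, dtype=float):
        frame = frame / (np.linalg.norm(frame) + 1e-12)
        labels.append(max(templates, key=lambda k: templates[k] @ frame))
    return labels

# toy usage: energy on C, E and G should come out as Cmaj
frame = np.array([1, 0, 0, 0, .9, 0, 0, .8, 0, 0, 0, 0])
print(detect_chords([frame]))
```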
• Polyphonic alignment of audio and MIDI
A symbolic score, even in a "low-level" format such as MIDI, contains a wealth of useful information that is not directly available in the acoustic waveform (beats/measures/chords, etc.). On the other hand, most of the time we are interested in hearing actual music rather than bad-sounding MIDI files. In polyphonic audio alignment the idea is to compute features on both the audio and MIDI data and try to align the two sequences of features. This project will implement some of the existing approaches to this problem and explore alternatives and variations.
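The usual machinery for the alignment step is dynamic time warping over a frame-to-frame cost matrix. The sketch below shows a plain DTW implementation on synthetic chroma-like sequences and leaves the actual audio and MIDI feature extraction to the project.

```python
import numpy as np

def dtw_path(cost):
    """Dynamic time warping on a pairwise cost matrix; returns the
    accumulated cost matrix and the optimal alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        i, j = [(i - 1, j - 1), (i - 1, j), (i, j - 1)][step]
    path.append((0, 0))
    return acc[1:, 1:], path[::-1]

# toy usage: chroma-like sequences where the second is a slowed-down copy
audio = np.random.rand(40, 12)
midi = np.repeat(audio, 2, axis=0)          # each frame held twice as long
cost = np.linalg.norm(audio[:, None] - midi[None, :], axis=2)
_, path = dtw_path(cost)
print(path[:5])
```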
• Music Caricatures
Even though we are still a long way from full polyphonic transcription, music information retrieval systems are increasingly extracting more and more higher-level information about audio signals. The idea behind this project is to use this information to create musical "caricatures" of the original audio using MIDI. The only constraint is that the resulting "caricature" should somehow match, possibly in a funny way, the original music.
• Comparison of algorithms for audio-segmentation
Audio segmentation refers to the process of detecting when there is a change of audio "texture", such as the change from singing to instrumental background or the change from an orchestra to a guitar solo. A variety of algorithms have been proposed for audio segmentation. The goal of this project is to implement the main approaches and explore alternatives and variants.
• Music Information Retrieval using MPEG-7 low level descriptors
The MPEG-7 standard was recently proposed for standardizing some of the ways multimedia content is described. Part of it describes audio descriptors that can be used to characterize audio signals. There has been little evaluation of those descriptors compared to other feature front-ends proposed in the literature. The goal of this project is to implement the MPEG-7 audio descriptors and compare them with other features in a variety of tasks such as similarity retrieval, classification and segmentation.
• Instrumentation-based genre/style classification
The type of instruments used in a song can be a quite reliable indicator of a
particular musical genre. For example, the significant presence of saxophone probably implies a jazz tune. Even though these rules always have exceptions, they will still probably work in many cases. The goal of this project is
to explore the use of decision trees for automatically finding and using such
instrumentation-based rules. A significant part of the project will consist of
collecting instrumentation annotation data.
• Template-based detection of instrumentation
The goal of this project is to detect what (and maybe when) instruments are
present in an audio recording. The goal is not source separation or transcription but rather just a presence/absence indicator for particular instruments. For example, the output of such a system might be that from minute 1 to minute 2 there are saxophone, piano and drums playing, after which a singer joins the ensemble. In order to identify specific instruments, templates will be
learned from a large database of examples and then adapted to the particular
recording.
• Singing-voice detection
Detecting the segments of a piece of music where there is singing is the
first step in singer identification. This is a classic classification problem
which is made difficult by the large variety of singers and instrumental backgrounds. The goal of this project is to explore various proposed algorithms
and feature front-ends for this task. Specifically, the use of phase-vocoding
techniques for enhancing the prominent singing voice is a promising area
of exploration.
• Singer Identification
The singer's identity is a major part of the way popular music is characterized and identified. Most listeners that hear a piece they haven't heard before cannot identify the group until the singer starts singing. The goal of this
project is to explore existing approaches to singer identification and explore
variations and alternatives.
• Male/Female singer detection
Automatic male/female voice classification has been explored in the context
of the spoken voice. The goal of this project is to first explore male/female
singer detection in monophonic recordings of singing and then expand this
work to polyphonic recordings.
• Direct manipulation music browsing
Although MIR for historical reasons has been mostly focused on retrieval, a large part of music listening involves browsing and exploration. The goal of
this project is to explore various creative ways of browsing large collections
of music that are direct and provide constant audio feedback about the user
actions.
• Hyperbolic trees for music collection visualization
Hyperbolic trees are an impressive visualization technique for representing
trees/graphs of documents/images. The goal of this project is to explore the
potential of using this technique for visualizing large music collections. Of
specific interest is the possible adjustment of this technique to incorporate
content-based music similarity.
• Playlist summarization using similarity graphs
Similarity graphs are constructed by using content-based distances for edges
and nodes that correspond to musical pieces. The goal of this project is to
explore how this model could be used to generate summaries for music
playlists, i.e., a short-duration representation (3 seconds for each song in the playlist) that summarizes the playlist.
B.4 Projects that evolved into publications
Some of the best previous projects in the MIR course evolved into ISMIR publications (typically with some additional work done after the completion of the
course). The corresponding publications are a good starting point and hopefully
will inspire you to work hard on your projects with the eventual goal of an ISMIR
publication.
Drum transcription for audio signals can be performed based on onset detection and subband analysis [101]. There are some interesting alternative ways of specifying a music query beyond query-by-example and query-by-humming. One possibility is beat-boxing [?] and another is various interfaces for specifying rhythm [54]. Tabletop displays provide interesting possibilities for collaboration and intuitive music browsing [46]. Stereo panning information is frequently ignored in MIR, where stereo signals are typically converted to mono. However, stereo panning information is an important cue about the record production process and can be used for classifying recording production style [100]. Smart phones contain location and acceleration sensors that can be used to infer the context of user activities and create personalized music based on the occasion [77]. Audio analysis can be used as an empirical analytical tool to study how DJs select and order tracks in electronic dance music.
Appendix C

Commercial activity 2011

Commercial activity in MIR-related topics is in constant flux, so it is hard to provide comprehensive information about it. At the same time it is interesting to observe which algorithms and techniques described in this book are actively used in real-world applications. This appendix attempts to provide a snapshot of the state of commercial activity in 2011. The list of companies and systems described is by no means exhaustive but tries to be representative of existing activity.
Appendix D

Historic Time Line

In this Appendix I provide a list of historic milestones, both academic and commercial, in the emerging field of Music Information Retrieval. As is the case with any such attempt, the resulting list is to some extent idiosyncratic. My hope is that it provides a picture of how the field evolved over time and helps readers situate the techniques and developments described in this book in time.
Bibliography
[1] E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, and T. Kastner. Content-based identification of audio material using MPEG-7 low level description.
In Int. Conf. on Music Information Retrieval (ISMIR), 2001.
[2] E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, T. Kastner, and M. Cremer.
Content-based identification of audio material using mpeg-7 low level
description. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2001.
[3] J. J. Aucouturier and F. Pachet. Music similarity measures: What's the use? In Int. Conf. on Music Information Retrieval (ISMIR), 2002.
[4] J. J. Aucouturier and F. Pachet. Representing musical genre: A state of the
art. Journal of New Music Research, 32(1):1–2, 2003.
[5] D. Bainbridge. The role of music ir in the new zealand digital music
library project. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2000.
[6] J. Barthélemy. Figured bass and tonality recognition. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2001.
[7] M. A. Bartsch and G. H. Wakefield. To catch a chorus: using chroma-based representations for audio thumbnailing. In Workshop on Applications
of Signal Processing to Audio and Acoustics, pages 150–18, 2001.
[8] J. P. Bello, G. Monti, and M. Sandler. Techniques for automatic music
transcription. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2000.
[9] D. Bendor and M. Sandler. Time domain extraction of vibrato from
monophonic instruments. In Proceedings of the International Conference
on Music Information Retrieval (ISMIR), 2000.
[10] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kegl. Aggregate
features and AdaBoost for music classification. Machine Learning, 65(2–3):473–484, 2006.
[11] W. P. Birmingham, R. B. Dannenberg, G. H. Wakefield, M. A. Bartsch,
D. Mazzoni, C. Meek, M. Mellody, and W. Rand. Musart: Music retrieval
via aural queries. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2001.
[12] P. Boersma. Accurate short-term analysis of the fundamental frequency
and the harmonics-to-noise ratio of a sampled sound. Proceedings of the
Institute of Phonetic Sciences, 17:97–110, 1993.
[13] A. Bonardi. Ir for contemporary music: What the musicologist needs.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[14] A. Bregman. Auditory Scene Analysis. MIT Press, Cambridge, 1990.
[15] M. Casey. Sound classification and similarity tools. In B. Manjunath,
P. Salembier, and T. Sikora, editors, Introduction to MPEG-7: Multimedia
Content Description Language, pages 309–323. J. Wiley, 2002.
[16] G. S. Choudhury, T. DiLauro, M. Droettboom, I. Fujinaga, B. Harrington,
and K. MacMillan. Optical music recognition system within a large-scale
digitization project. In Proceedings of the International Conference on
Music Information Retrieval (ISMIR), 2000.
[17] M. Clausen, R. Engelbrecht, D. Meyer, and J. Schmitz. PROMS: A web-based tool for searching in polyphonic music. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2000.
[18] D. Cliff and H. Freeburn. Exploration of point-distribution models
for similarity-based classification and indexing of polyphonic music.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[19] M. Crochemore, C. S. Iliopoulos, Y. J. Pinzon, and W. Rytter. Finding
motifs with gaps. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2000.
[20] R. Dannenberg and N. Hu. Pattern discovery techniques for music audio.
Journal of New Music Research, June 2003:153–164, 2003.
[21] I. Daubechies. Orthonormal bases of compactly supported wavelets.
Communications on Pure and Applied Math, 41:909–996, 1988.
[22] S. Davis and P. Mermelstein. Experiments in syllable-based recognition of continuous speech. IEEE Transactions on Acoustics, Speech and Signal
Processing, 28:357–366, Aug. 1980.
[23] A. de Cheveigne and H. Kawahara. YIN, a fundamental frequency
estimator for speech and music. Journal of the Acoustical Society of America,
111(4):1917–1930, 2002.
[24] S. Dixon. Onset detection revisited. In Int. Conf. on Digital Audio Effects
(DAFx), 2006.
[25] J. S. Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval. Acoustical Science and
Technology, 29(4):247–255, 2008.
[26] J. W. Dunn. Beyond variations: Creating a digital music library.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[27] J. W. Dunn, M. W. Davidson, and E. J. Isaacson. Indiana university digital
music library project: An update. In Proceedings of the International
Conference on Music Information Retrieval (ISMIR), 2001.
[28] E. Ravelli, G. Richard, and L. Daudet. Audio signal representations for indexing
in the transform domain. IEEE Trans. on Audio, Speech, and Language
Processing, 18(3):434–446, 2010.
[29] D. Ellis and G. H. Poliner. Identifying cover songs with chroma features
and dynamic programming beat tracking. In Int. Conf. on Acoustics,
Speech, and Signal Processing (ICASSP), pages IV–1429–IV–1432, 2007.
[30] F. Mörchen, A. Ultsch, M. Thies, and I. Löhken. Modeling timbre distance with
temporal statistics from polyphonic music. IEEE Transactions on Audio,
Speech and Language Processing, 8(1):81–90, 2006.
[31] J. Foote. Visualizing music and audio using self-similarity.
Multimedia, pages 77–80, 1999.
In ACM
[32] J. Foote. Arthur: Retrieving orchestral music by long-term structure.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[33] J. Foote, M. Cooper, and U. Nam. Audio retrieval by rhythmic similarity.
In Int. Conf. on Music Information Retrieval (ISMIR), 2002.
[34] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno. A modeling
of singing voice robust to accompaniment sounds and its application to
singer identification and vocal-timbre similarity based music information
retrieval. IEEE Trans. on Audio, Speech and Language Processing,
18(3):638–648, 2010.
[35] G. Geekie.
Carnatic ragas as music information retrieval entities.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2002.
[36] A. Georgaki, S. Raptis, and S. Bakamidis. A music interface for visually
impaired people in the wedelmusic environment. design and architecture.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[37] M. Good. Representing music using xml. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2000.
[38] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database:
Popular, classical and jazz music databases. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2002.
[39] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer. Evaluating rhythmic
descriptors for musical genre classification. In AES 25th Int. Conf. on
Metadata for Audio, pages 196–204, 2002.
[40] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and
P. Cano. An experimental comparison of audio tempo induction algorithms.
IEEE Trans. on Audio, Speech and Language Processing, 14(5):1832–
1844, 2006.
[41] J. Grey. Multidimensional perceptual scaling of musical timbres. Journal
of the Acoustical Society of America, 61(5):1270–1277, 1977.
[42] M. Gruhne, C. Dittmar, and D. Gaertner. Improving rhythmic similarity
computation by beat histogram transformations. In Int. Conf. on Music
Information Retrieval, 2009.
[43] S. Hainsworth. Beat tracking and musical metre analysis. In A. Klapuri
and M. Davy, editors, Signal Processing Methods for Music Transcription,
pages 101–129. Springer, New York, 2006.
[44] J. Haitsma and T. Kalker. A highly robust audio fingerprinting system.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2002.
[45] P. Herrera-Boyer, X. Amatriain, E. Batlle, and X. Serra. Towards
instrument segmentation for music content description:a critical review of
instrument classification techniques. In Proceedings of the International
Conference on Music Information Retrieval (ISMIR), 2000.
[46] S. Hitchner, J. Murdoch, and G. Tzanetakis. Music browsing using a
tabletop display. In ISMIR, pages 175–176, 2007.
[47] A. Holzapfel and Y. Stylianou. Rhythmic similarity of music based on
dynamic periodicity warping. In IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), pages 2217–2220, 2008.
[48] A. Holzapfel and Y. Stylianou. A scale transform based method for
rhythmic similarity of music. In IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), pages 317–320, 2009.
[49] H. H. Hoos, K. Renz, and M. Görg. Guido/mir an experimental musical
information retrieval system based on guido music notation. In Proceedings
of the International Conference on Music Information Retrieval (ISMIR),
2001.
[50] Ö. Izmirli. Using a spectral flatness based feature for audio segmentation
and retrieval. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2000.
[51] J. Jensen, M. Christensen, and S. Jensen.
A tempo-insensitive
representation of rhythmic patterns. In European Signal Processing
Conference (EUSIPCO), 2009.
[52] J. H. Jensen, D. Ellis, M. Christensen, and S. H. Jensen. Evaluation of
distance measures between gaussian mixture models of mfccs. In Int. Conf.
on Music Information Retrieval (ISMIR), 2007.
[53] D. Jiang, L. Lu, H. Zhang, and J. Tao. Music type classification by spectral
contrast feature. In Int. Conf. on Multimedia and Expo (ICME), 2002.
[54] A. Kapur, R. I. McWalter, and G. Tzanetakis. New music interfaces for
rhythm-based retrieval. In ISMIR, pages 130–136. Citeseer, 2005.
[55] F. J. Kiernan. Score-based style recognition using artificial neural networks.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[56] Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe. Analysis of a contour-based
representation for melody. In Proceedings of the International Conference
on Music Information Retrieval (ISMIR), 2000.
[57] Y. E. Kim and B. Whitman. Singer identification in popular music
recordings using voice coding features. In Int. Conf. on Music Information
Retrieval (ISMIR), 2002.
[58] T. Kitahara, Y. Tsuchihashi, and H. Katayose. Music genre classification
and similarity calculation using bass-line features. In Multimedia, 2008.
ISM 2008. Tenth IEEE International Symposium on, 2008.
[59] A. Kornstädt. The jring system for computer-assisted musicological
analysis. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2001.
[60] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linnery.
Classification of audio signals using statistical features on time and wavelet
transform domains. In Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), 1998.
[61] C. Lee, J. L. Shih, K. Yu, and H. Lin. Automatic music genre classification
based on modulation spectral analysis of spectral and cepstral features.
IEEE Trans. on Multimedia, 11(4):670–682, 2009.
[62] J. S. . R. A. Lee, Jin Ha ; Downie. Representing traditional korean music
notation in xml. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2002.
[63] K. Lee and M. Slaney. Acoustic chord transcription and key extraction
from audio using key-dependent hmms trained on synthesized audio. IEEE
Trans. on Audio, Speech and Language Processing, 16(2):291–301, 2008.
[64] T. Li and M. Ogihara. Detecting emotion in music. In Int. Symposium on
Music Information Retrieval (ISMIR), pages 239–240, 2003.
[65] T. Li and M. Ogihara. Toward intelligent music information retrieval. IEEE
Trans. on Multimedia, 8(3):564–574, 2006.
[66] T. Li and G. Tzanetakis. Factors in automatic musical genre classification.
In Proc. Workshop on applications of signal processing to audio and
acoustics WASPAA, New Paltz, NY, 2003. IEEE.
[67] C. Liem and A. Hanjalic. Cover song retrieval: a comparative study of
system component choices. In Int. Conf. on Music Information Retrieval
(ISMIR), 2009.
[68] K. Lin and T. Bell. Integrating paper and digital music information systems.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[69] A. Lindsay and Y. E. Kim. Adventures in standardization, or, how
we learned to stop worrying and love mpeg-7. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2001.
[70] B. Logan. Mel frequency cepstral coefficients for music modeling.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[71] L. Lu, D. Lie, and H.-J. Zhang. Automatic mood detection and tracking
of music audio signals. IEEE Trans. on Speech and Audio Processing,
14(1):5–18, 2006.
[72] K. MacMillan, M. Droettboom, and I. Fujinaga.
Gamera: A
structured document recognition application development environment.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2001.
[73] M. Mandel and D. Ellis. Song-level features and svms for music
classification. In Int. Conf. on Music Information Retrieval (ISMIR), 2005.
[74] M. F. McKinney and J. Breebaart. Features for audio and music
classification. In Int. Conf. on Music Information Retrieval (ISMIR), 2003.
[75] C. Meek and W. P. Birmingham. Thematic extractor. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2001.
[76] A. Meng, P. Ahrendt, and J. Larsen. Temporal feature integration for music
genre classification. IEEE Transactions on Audio, Speech and Language
Processing, 15(5):1654–1664, 2007.
[77] S. Miller, P. Reimer, S. R. Ness, and G. Tzanetakis. Geoshuffle: Location-aware, content-based music browsing using self-organizing tag clouds. In
ISMIR, pages 237–242, 2010.
[78] N. Hu, R. B. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment
for music retrieval. In Workshop on Applications of Signal Processing to
Audio and Acoustics, 2003.
[79] T. Nishimura, H. Hashiguchi, J. Takita, J. X. Zhang, M. Goto, and R. Oka.
Music signal spotting retrieval by a humming query using start frame
feature dependent continuous dynamic programming. In Proceedings of the
International Conference on Music Information Retrieval (ISMIR), 2001.
[80] T. Painter and A. Spanias. Perceptual coding of digital audio. Proceedings
of the IEEE, 88(4):451–515, 2000.
[81] Y. Panagakis, C. Kotropoulos, and G. C. Arce. Non-negative multilinear
principal component analysis of auditory temporal modulations for music
genre classification. IEEE Trans. on Audio, Speech and Language
Processing, 18(3):576–588, 2010.
[82] J. Paulus and A. Klapuri. Measuring the similarity of rhythmic patterns. In
Int. Conf. on Music Information Retrieval (ISMIR), 2002.
[83] G. Peeters. Spectral and temporal periodicity representations of rhythm for
the automatic classification of music audio signal. IEEE Trans. on Audio,
Speech, and Language Processing, (to appear), 2010.
[84] G. Peeters, A. L. Burthe, and X. Rodet. Toward automatic music
audio summary generation from signal analysis. In Int. Conf. on Music
Information Retrieval (ISMIR), 2002.
[85] D. Pye. Content-based methods for management of digital music. In Int.
Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2000.
[86] A. Rauber, E. Pampalk, and D. Merkl. The SOM-enhanced JukeBox:
Organization and visualization of music collections based on perceptual
models. Journal of New Music Research, 32(2):193–210, 2003.
[87] P. Roland. Xml4mir: Extensible markup language for music information
retrieval. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), 2000.
[88] S. Essid, G. Richard, and B. David. Musical instrument recognition by pairwise
classification. IEEE Trans. on Audio, Speech and Language Processing,
14(4):1401–1412, 2006.
[89] J. Serrà and E. Gómez. Audio cover song identification based on tonal
sequence alignment. In IEEE International Conference on Acoustics,
Speech and Signal processing (ICASSP), pages 61–64, 2008.
[90] A. Sheh and D. P. W. Ellis. Chord segmentation and recognition using
em-trained hidden markov models. In Int. Conf. on Music Information
Retrieval (ISMIR), 2003.
[91] H. Soltau, T. Schultz, M. Westphal, and A. Waibel. Recognition of music
types. In Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP),
1998.
[92] P. Tschmuck. Creativity and Innovation in the Music Industry. Springer,
2012.
[93] E. Tsunoo. Audio genre classification using percussive pattern clustering
combined with timbral features. In Int. Conf. on Multimedia and Expo
(ICME), 2009.
[94] E. Tsunoo, N. Ono, and S. Sagayama. Musical bass-line pattern clustering
and its application to audio genre classification. In Int. Conf. on Music
Information Retrieval (ISMIR), 2009.
[95] G. Tzanetakis and P. Cook. Multi-feature audio segmentation for browsing
and annotation. In Workshop on Applications of Signal Processing to Audio
and Acoustics, 1999.
[96] G. Tzanetakis and P. Cook. Audio information retrieval (air) tools.
In Proceedings of the International Conference on Music Information
Retrieval (ISMIR), 2000.
[97] G. Tzanetakis and P. Cook. Sound analysis using mpeg compressed audio.
In Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2000.
[98] G. Tzanetakis and P. Cook. Musical genre classification of audio signals.
IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.
[99] G. Tzanetakis, R. Jones, and K. McNally. Stereo panning features for
classifying recording production style. In Int. Conf. on Music Information
Retrieval (ISMIR), 2007.
[100] G. Tzanetakis, R. Jones, and K. McNally. Stereo panning features for
classifying recording production style. In ISMIR, pages 441–444, 2007.
[101] G. Tzanetakis, A. Kapur, and R. I. Mcwalter. Subband-based drum
transcription for audio signals. In Multimedia Signal Processing, 2005
IEEE 7th Workshop on, pages 1–4. IEEE, 2005.
[102] G. Tzanetakis, L. Martins, K. McNally, and R. Jones. Stereo panning
information for music information retrieval tasks. Journal of the Audio
Engineering Society, 58(5):409–417, 2010.
[103] K. West and S. Cox. Finding an optimal segmentation for audio genre
classification. In Int. Conf. on Music Information Retrieval (ISMIR), 2005.