
BEBOPNET: DEEP NEURAL MODELS FOR PERSONALIZED JAZZ IMPROVISATIONS

Shunit Haviv Hakimi∗ Nadav Bhonker∗ Ran El-Yaniv


Computer Science Department
Technion – Israel Institute of Technology
[email protected], [email protected], [email protected]

ABSTRACT

A major bottleneck in the evaluation of music generation is that music appreciation is a highly subjective matter. When considering an average appreciation as an evaluation metric, user studies can be helpful. The challenge of generating personalized content, however, has been examined only rarely in the literature. In this paper, we address the generation of personalized music and propose a novel pipeline for music generation that learns and optimizes user-specific musical taste. We focus on the task of symbol-based, monophonic, harmony-constrained jazz improvisations. Our personalization pipeline begins with BebopNet, a music language model trained on a corpus of jazz improvisations by Bebop giants. BebopNet is able to generate improvisations based on any given chord progression.¹ We then assemble a personalized dataset, labeled by a specific user, and train a user-specific metric that reflects this user's unique musical taste. Finally, we employ a personalized variant of beam-search with BebopNet to optimize the generated jazz improvisations for that user. We present an extensive empirical study in which we apply this pipeline to extract individual models as implicitly defined by several human listeners. Our approach enables an objective examination of subjective personalized models whose performance is quantifiable. The results indicate that it is possible to model and optimize personal jazz preferences, and they offer a foundation for future research in personalized generation of art. We also briefly discuss opportunities, challenges, and questions that arise from our work, including issues related to creativity.

¹ Supplementary material and numerous MP3 demonstrations of jazz improvisations of jazz standards and pop songs generated by BebopNet are provided at https://shunithaviv.github.io/bebopnet.

© Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv, "BebopNet: Deep Neural Models for Personalized Jazz Improvisations", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

1. INTRODUCTION

Since the dawn of computers, researchers and artists have been interested in utilizing them to produce different forms of art, and notably to compose music [1]. The explosive growth of deep learning models over the past several years has expanded the possibilities for musical generation, leading to a line of work that pushed forward the state of the art [2–6]. Another recent trend is the development of consumer services such as Spotify, Deezer and Pandora, which aim to provide personalized streams of existing music content. Perhaps the crowning achievement of such personalized services would be for the content itself to be generated explicitly to match each individual user's taste. In this work we focus on the task of generating user-personalized, monophonic, symbolic jazz improvisations. To the best of our knowledge, this is the first work that aims at generating personalized jazz solos using deep learning techniques.

The common approach for generating music with neural networks is essentially the same as for language modeling. Given a context of existing symbols (e.g., characters, words, music notes), the network is trained to predict the next symbol. Thus, once the network learns the distribution of sequences from the training set, it can generate novel sequences by sampling from the network output and feeding the result back into itself. The products of such models are sometimes evaluated through user studies (crowd-sourcing). Such studies assess the quality of generated music by asking users their opinion and computing the mean opinion score (MOS). While these methods may measure the overall quality of the generated music, they tend to average out evaluators' personal preferences. Another, more quantitative but rigid approach for evaluating generated music is to compute a metric based on music theory principles. While such metrics can, in principle, be defined for classical music, they are less suitable for jazz improvisation, which does not adhere to such strict rules.

To generate personalized jazz improvisations, we propose a framework consisting of the following elements: (a) BebopNet: jazz model learning; (b) user preference elicitation; (c) user preference metric learning; and (d) optimized music generation via planning.

As many jazz teachers would recommend, the key to attaining great improvisation skills is studying and emulating great musicians. Following this advice, we train BebopNet, a harmony-conditioned jazz model that composes entire solos. We use a training dataset of hundreds of professionally transcribed jazz improvisations performed by saxophone giants such as Charlie Parker, Phil Woods and


[Figure 1: musical notation omitted; a two-line solo excerpt over chords such as Cm7, F7, B♭maj7 and B♭6.]

Figure 1. A short excerpt generated by BebopNet.

Cannonball Adderley (see details in Section 4.1.1). In this dataset, each solo is a monophonic note sequence given in symbolic form (MusicXML), accompanied by a synchronized harmony sequence. After training, BebopNet is capable of generating high-fidelity improvisation phrases (this is a subjective impression of the authors). Figure 1 presents a short excerpt generated by BebopNet.

Considering that different people have different musical tastes, our goal in this paper is to go beyond straightforward generation by this model and optimize the generation toward personalized preferences. For this purpose, we determine a user's preference by measuring the level of their satisfaction throughout the solos, using a digital variant of the continuous response digital interface (CRDI) [7]. This is accomplished by playing, for the user, computer-generated solos (from the jazz model) and recording their good/bad feedback in real time throughout each solo. Once we have gathered sufficient data about the user's preferences, consisting of two aligned sequences (for the solos and feedback), we train a user preference metric in the form of a recurrent regression model to predict this user's preferences. A key feature of our technique is that the resulting model can be evaluated objectively using hold-out user preference sequences (along with their corresponding solos). A big hurdle in accomplishing this step is that the signal elicited from the user is inevitably extremely noisy. To reduce this noise, we apply selective prediction techniques [8, 9] to distill cleaner predictions from the user's preference model. Thus, we allow this model to abstain whenever it is not sufficiently confident. The fact that it is possible to extract a human continuous response preference signal on musical phrases and use it to train (and test) a model with non-trivial predictive capabilities is interesting in itself (and new, to the best of our knowledge).

Equipped with a personalized user preference metric (via the trained model), in the last stage we employ a variant of beam-search [10] to generate optimized jazz solos from BebopNet. For each user, we apply the last three stages of this process, where the preference elicitation stage takes several hours of tagging per user. We applied the proposed pipeline on four users, all of whom are amateur jazz musicians. We present a numerical analysis of the results showing that a personalized metric can be trained and then used to optimize solo generation.

To summarize, our contributions include: (1) a useful monophonic neural model for general jazz improvisation within any desired harmonic context; (2) a viable methodology for eliciting and learning high-resolution human preferences for music; (3) a personalized optimization process for jazz solo generation; and (4) an objective evaluation method for subjective content, together with a plagiarism analysis of the generated improvisations.

2. RELATED WORK

Many different techniques for algorithmic musical composition have been used over the years. For example, some are grammar-based [11] or rule-based [1, 12], while others use Markov chains [13–15], evolutionary methods [16, 17] or neural networks [18–20]. For a comprehensive summary of this broad area, we refer the reader to [21]. Here we confine the discussion to closely related works that mainly concern jazz improvisation using deep learning techniques over symbolic data. In this narrower context, most works follow a generation-by-prediction paradigm, whereby a model trained to predict the next symbol is used to greedily generate sequences. The first work on blues improvisation [22] straightforwardly applied long short-term memory (LSTM) networks to a small training set. While their results may seem limited at a distance of nearly two decades², they were the first to demonstrate long-term structure captured by neural networks.

One approach to improving naïve greedy generation from a jazz model is to use a mixture of experts. For example, Franklin [23] trained an ensemble of neural networks, one specialized for each melody, and then selected from among them at generation time using reinforcement learning (RL) with a handcrafted reward function. Johnson et al. [24] generated improvisations by training a network consisting of two experts, each focusing on a different note representation. The experts were combined using the product-of-experts technique [25]³. Other remotely related non-jazz works have attempted to produce context-dependent melodies [2, 3, 5, 26–30].

A common method for collecting continuous measurements from human subjects listening to music is the continuous response digital interface (CRDI), first reported by [7]. CRDI has been successful in measuring a variety of signals from humans, such as emotional response [31], tone quality and intonation [32], beauty in a vocal performance [33], preference for music of other cultures [34] and appreciation of the aesthetics of jazz music [35]. Using CRDI, listeners are required to rate different elements of the music by adjusting a dial (which looks similar to a volume control dial on an amplifier).

² Listen to their generated pieces at www.iro.umontreal.ca/~eckdoug/blues/index.html.
³ Listen to the generated solos at www.cs.hmc.edu/~keller/jazz/improvisor/iccc2017/.

3. PROBLEM STATEMENT

We now state the problem in mathematical terms. We denote an input xt = (st, ct) consisting of a note st and its context ct. Each note st ∈ S, in turn, consists of a pitch and a duration at index t, and S represents a predefined set of pitch-duration combinations (i.e., notes). The context ct ∈ C represents the chord that is played with note st, where C is the set of all possible chords. The context may contain additional information, such as the offset of the note within a measure (see details in Section 4). Let D denote a training dataset consisting of M solos. Each


solo is a sequence Xτ = x1 · · · xτ ∈ (S × C)^τ of arbitrary length τ. In our work, these are the aforementioned jazz improvisations.

We define a context-dependent jazz model fθ (Eq. 1) as the estimator of the probability of a note st given the sequence of previous inputs Xt−1 and the current context ct, where θ are the parameters of the model. This is similar to a human jazz improviser who is informed of the chord over which his next note will be played.

    fθ(Xt−1, ct) = Pr(st | Xt−1, ct)    (1)

For any solo Xτ, we also consider an associated sequence of annotations Yτ = y1 · · · yτ ∈ Y^τ. An annotation yt ∈ Y represents the quality of the solo up to point t by some metric. In our case, yt may be a measure of preference as indicated by a user, or a score measuring harmonic compliance. Let D̃ denote a training dataset consisting of N solos. Each solo Xτ of arbitrary length τ is labeled with a sequence Yτ. Given D̃, we define a metric gφ (Eq. 2) to predict yτ given a sequence of inputs Xτ. gφ is the user-preference model and φ are the learned parameters.

    ŷτ = gφ(Xτ)    (2)

We denote by ψ a function that is used to sample notes from fθ to generate solos. In our case, this will be our beam-search variant. The objective here is to train viable models, fθ and gφ, and then to use ψ to sample solos from fθ while maximizing gφ.
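To make the roles of the three components concrete, the following is a minimal sketch of their interfaces as implied by Eqs. 1 and 2. All names, types and shapes here are illustrative stubs of our own, not the authors' code.

```python
# Illustrative interfaces only; names and type choices are our assumptions.
from typing import Callable, List, Tuple

Note = Tuple[int, int]       # (pitch index, duration index): an element of S
Chord = Tuple[int, ...]      # pitch classes of the chord: an element of C
Input = Tuple[Note, Chord]   # x_t = (s_t, c_t)

def f_theta(prefix: List[Input], next_chord: Chord) -> List[float]:
    """Jazz model (Eq. 1): distribution Pr(s_t | X_{t-1}, c_t) over S."""
    ...

def g_phi(solo: List[Input]) -> float:
    """User-preference metric (Eq. 2): predicted score y_hat for a solo."""
    ...

def psi(f: Callable, g: Callable, chords: List[Chord]) -> List[Input]:
    """Sampling procedure (here, a beam-search variant) that draws solos
    from f while steering toward high values of g."""
    ...
```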
4. METHODS

In this section we describe the methods used and the implementation details of our personalized generation pipeline.

4.1 BebopNet: Jazz Model Learning

In the first step of our pipeline, we use supervised learning to train BebopNet, a context-dependent jazz model fθ, from a given corpus of transcribed jazz solos.

4.1.1 Dataset and Music Representation

Our corpus D consists of 284 professionally transcribed solos of (mostly) Bebop saxophone players of the mid-20th century: Charlie Parker, Sonny Stitt, Cannonball Adderley, Dexter Gordon, Sonny Rollins, Stan Getz, Phil Woods and Gene Ammons. We consider only solos that are in 4/4 metre and include chords in their transcription. The solos are provided in MusicXML format. As opposed to MIDI, this format allows the inclusion of chord symbols.⁴ We represent notes using a representation method inspired by sheet music (see Figure 2).

[Figure 2: graphic omitted; it shows one measure and its vector representation, with rows for pitch (62, 128, 74, 71, 72), duration, offset, and a 12-row multi-hot chord block.]

Figure 2. An example of a measure in music notation and its vector representation. Integers are converted to one-hot representations.

Pitch The pitch is encoded as a one-hot vector of size 129. Indices 0–127 match the pitch range of the MIDI standard.⁵ Index 128 corresponds to the rest symbol.

Duration The duration of each note is encoded using a one-hot vector covering all the durations that exist in the dataset. Durations smaller than 1/24 are removed.

Offset The offset of the note within the measure is quantized to 48 "ticks" per (four-beat) measure. This corresponds to a duration of 1/12 of a beat. This is similar to the learned positional encoding used in translation [37].

Chord The chord is represented by a four-hot vector of size 12, covering the 12 possible pitch classes that may appear in a chord. As is common in jazz music, unless otherwise noted, we assume that chords are played in their 7th form. Thus, the chord pitches are usually the 1st, 3rd, 5th, and 7th degrees above the root of the chord. This representation allows the flexibility of representing rare chords such as sixth, diminished and augmented chords.

⁴ The solos were purchased from SaxSolos.com [36]; we are thus unable to publish them. Nevertheless, in the supplementary material we provide a complete list of the solos used for training, which are available from the above vendor.
⁵ The notes appearing in the corpus all belong to a much smaller range; however, the MIDI range standard was maintained for simplicity.
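As a concrete illustration of this encoding, here is a small sketch under our own assumptions; note that the actual model feeds learned embeddings of these indices rather than the raw one-hot vectors (see Section 4.1.2).

```python
# Sketch of the note encoding described above: one-hot pitch (129),
# one-hot duration, quantized offset (48 ticks/measure), four-hot chord.
import numpy as np

N_PITCH = 129            # MIDI 0-127 plus index 128 for a rest
TICKS_PER_MEASURE = 48   # 1/12-beat resolution in 4/4

def encode_note(pitch, dur_idx, offset_beats, chord_pitch_classes, n_durations):
    """Build the flat input vector for one note; argument names are illustrative."""
    pitch_vec = np.zeros(N_PITCH); pitch_vec[pitch] = 1.0
    dur_vec = np.zeros(n_durations); dur_vec[dur_idx] = 1.0
    offset_vec = np.zeros(TICKS_PER_MEASURE)
    offset_vec[int(round(offset_beats * 12)) % TICKS_PER_MEASURE] = 1.0
    chord_vec = np.zeros(12)                # four-hot over pitch classes
    for pc in chord_pitch_classes:          # e.g., C7 -> {0, 4, 7, 10}
        chord_vec[pc % 12] = 1.0
    return np.concatenate([pitch_vec, dur_vec, offset_vec, chord_vec])

# Example: a D5 quarter note on beat 3 of a C7 measure (hypothetical values).
x = encode_note(pitch=74, dur_idx=2, offset_beats=2.0,
                chord_pitch_classes=[0, 4, 7, 10], n_durations=16)
```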


[Figure 3: architecture diagram omitted; pitch, duration and chord embeddings feed a shared LSTM, whose output feeds a pitch decoder and a duration decoder, with decoder weights shared with the embeddings.]

Figure 3. The BebopNet architecture for next-note prediction. Each note is represented by concatenating the embeddings of the pitch (red bar), the duration (purple bar) and the four pitches comprising the current chord (green bars). The output of the LSTM is passed to two heads (orange bars), one the size of the pitch embedding (top) and the other the size of the duration embedding (bottom).

4.1.2 Network Architecture

BebopNet, like many language models, can be implemented using different architectures, such as recurrent neural networks (RNNs), convolutional networks (CNNs) [5, 26, 38] or attention-based models [39]. BebopNet contains a three-layer LSTM network [40]. Recent promising results with attention-based models enabled us to improve BebopNet by replacing the LSTM with Transformer-XL [41]. The architecture of the network used to estimate fθ is illustrated in Figure 3. The network's input xt includes the note st (pitch and duration) and the context ct (offset and chord). The pitch, duration and offset are each represented by learned embedding layers. The chord is encoded using the embeddings of the pitches comprising it. While notes at different octaves have different embeddings, the chord pitch embeddings are always taken from the octave in which most notes in the dataset reside. This embedded vector is passed to the LSTM network. The LSTM output is then passed to two heads. Each head consists of two fully-connected layers with a sigmoid activation in between. The output of the first layer is the same size as the embedding of the pitch (or duration), and the second output size is the number of possible pitches (or durations).

Following [42, 43], we tie the weights of the final fully-connected layers to those of the embeddings. Finally, the outputs of the two heads pass through a softmax layer and are trained to minimize the negative log-likelihood of the corpus. To enrich our dataset while encouraging harmonic context dependence, we augment the dataset by transposing to all 12 keys.
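The following is a minimal PyTorch sketch of this two-headed, weight-tied architecture. Layer sizes and names are our assumptions; the paper's actual hyperparameters are listed in its supplementary material.

```python
# Sketch of the two-headed LSTM with tied decoder/embedding weights.
import torch
import torch.nn as nn

class BebopNetSketch(nn.Module):
    def __init__(self, n_pitch=129, n_dur=16, d_pitch=64, d_dur=32,
                 d_offset=16, hidden=512):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitch, d_pitch)
        self.dur_emb = nn.Embedding(n_dur, d_dur)
        self.offset_emb = nn.Embedding(48, d_offset)
        d_in = d_pitch + d_dur + d_offset + 4 * d_pitch  # note + 4 chord pitches
        self.lstm = nn.LSTM(d_in, hidden, num_layers=3, batch_first=True)
        # Each head: hidden -> embedding size -> vocabulary size,
        # with a sigmoid between the two fully-connected layers.
        self.pitch_head = nn.Sequential(nn.Linear(hidden, d_pitch), nn.Sigmoid(),
                                        nn.Linear(d_pitch, n_pitch))
        self.dur_head = nn.Sequential(nn.Linear(hidden, d_dur), nn.Sigmoid(),
                                      nn.Linear(d_dur, n_dur))
        # Weight tying [42, 43]: final projections share the embedding matrices.
        self.pitch_head[2].weight = self.pitch_emb.weight
        self.dur_head[2].weight = self.dur_emb.weight

    def forward(self, pitch, dur, offset, chord):  # chord: (B, T, 4) pitch ids
        x = torch.cat([self.pitch_emb(pitch), self.dur_emb(dur),
                       self.offset_emb(offset),
                       self.pitch_emb(chord).flatten(-2)], dim=-1)
        h, _ = self.lstm(x)
        # Logits for the pitch and duration softmax/NLL training objectives.
        return self.pitch_head(h), self.dur_head(h)
```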
4.2 User Preference Elicitation

Using BebopNet, we created a dataset to be labeled by users, consisting of 124 improvisations. These solos were divided into three groups of roughly the same size: solos from the original corpus, solos generated by BebopNet over jazz standards present in the training set, and generated solos over jazz standards not present in the training set. The length of each solo is two choruses, i.e., twice the length of the melody. For each standard, we generated a backing track in MP3 format that includes a rhythm section and a harmonic instrument to play along with the improvisation, using Band-in-a-Box [44]. This dataset amounts to approximately five hours of played music.

We created a system inspired by CRDI that is entirely digital, replacing the analog dial with keyboard strokes that move a digital dial. A figure of our dial is presented in the supplementary material. While the original CRDI had a range of 255 values, our initial experiments found that quantizing the values to five levels was easier for users. We recorded the location of the dial at every time step and aligned it to the note being played at the same moment.
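One simple way to implement this alignment is sketched below. The data layout and the mapping from the five discrete dial levels to [−1, 1] (the label range used in Section 4.3) are our assumptions.

```python
# Sketch: align the time-stamped dial stream to note onsets, mapping
# the five dial levels {0..4} linearly onto [-1, 1].
import bisect

def align_dial_to_notes(dial_times, dial_levels, note_onsets):
    """dial_times: sorted timestamps of dial moves; dial_levels: values in {0..4}."""
    labels = []
    for onset in note_onsets:
        i = bisect.bisect_right(dial_times, onset) - 1  # last dial move <= onset
        level = dial_levels[max(i, 0)]                  # before first move: level 0
        labels.append(level / 2.0 - 1.0)                # {0..4} -> [-1, 1]
    return labels
```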
4.3 User Preference Metric Learning

In the user preference metric learning stage we again use supervised learning, now to train a metric function gφ. This function should predict user preference scores for any solo, given its harmonic context. During training, for each sequence Xτ we estimate yτ, corresponding to the label the user provided for the last note in the sequence. We choose the last label of the sequence, rather than the mode or the mean, because of delayed feedback: during the user elicitation step, we noticed that when a user decides to change the position of the dial, it is because he has just heard a sequence of notes that he considers to be more (or less) pleasing than those he heard previously. Thus, the label indicates the preference of the past sequence. The labels are linearly scaled to the range [−1, 1]. Since the dataset D̃ is small and unbalanced, we use stratified sampling over solos to divide it into training and validation sets. We then use bagging to create an ensemble of five models for the final estimate.

4.3.1 Network Architecture

We estimate the function gφ using transfer learning from BebopNet. The user preference model consists of the same layers as BebopNet without the final fully-connected layers. On top of these, we apply scaled dot-product attention [45] over the τ time steps, followed by fully-connected and tanh layers. The transferred layers are initialized with the weights θ of BebopNet, and the weights of the embedding layers are frozen during training.

4.3.2 Selective Prediction

To elevate the accuracy of gφ, we utilize selective prediction, whereby we ignore predictions whose confidence is too low. We use the prediction magnitude as a proxy for confidence. Given confidence threshold parameters β1 < 0 and β2 > 0, we define g′φ,β1,β2(X_t^i) in Eq. 3.

    g′φ,β1,β2(X_t^i) = { 0,           if β1 < gφ(X_t^i) < β2
                         gφ(X_t^i),   otherwise }    (3)

The parameters β1 and β2 control our coverage rate and are determined by minimizing the error (risk) on the risk-coverage plot along a predefined coverage contour. More details are given in Section 5.2.
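A sketch of this rejection rule, together with the risk-coverage computation that Section 5.2 uses to pick β1 and β2, might look as follows. The sign-based three-way error and all names are our simplification.

```python
# Sketch of Eq. 3 plus the risk-coverage sweep used in Section 5.2.
import numpy as np

def selective_score(scores, beta1, beta2):
    """Eq. 3: abstain (map to 0) inside the rejection band (beta1, beta2)."""
    scores = np.asarray(scores, dtype=float)
    kept = (scores <= beta1) | (scores >= beta2)
    return np.where(kept, scores, 0.0), kept

def risk_coverage(preds, labels, beta1, beta2):
    """Risk = 3-way sign disagreement on kept points; coverage = kept share."""
    _, kept = selective_score(preds, beta1, beta2)
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    if not kept.any():
        return 1.0, 0.0
    risk = float(np.mean(np.sign(preds[kept]) != np.sign(labels[kept])))
    return risk, float(kept.mean())

# Sweep a grid of (beta1, beta2) pairs and keep the pair with the lowest
# risk along (approximately) the 25% coverage contour.
```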
4.4 Optimized Music Generation

To optimize generations from fθ, we apply a variant of beam-search, ψ, whose objective scores are obtained from the non-rejected predictions of gφ. Pseudocode of the ψ procedure is presented in the supplementary material. We denote by Vb = [X_t^1, X_t^2, ..., X_t^b] a running batch (beam) of size (beam-width) b containing the most promising candidate sequences found so far by the algorithm. The sequences are all initialized with the starting input sequence.


        Adderley  Gordon  Getz  Parker  Rollins  Stitt  Woods  Ammons  BebopNet (heard)  BebopNet (unheard)
Chord   0.50      0.54    0.53  0.52    0.52     0.53   0.50   0.54    0.53              0.52
Scale   0.78      0.83    0.81  0.80    0.81     0.83   0.78   0.83    0.82              0.81

Table 1. Harmonic coherence: the average chord and scale matches computed for the artists in the dataset and for BebopNet. A higher number indicates a higher coherency level. BebopNet is measured separately for harmonic progressions heard and not heard in the training dataset.

In our case, this is the melody of the jazz standard. At every time step t, we produce a probability distribution over the next note of every sequence in Vb by passing the b sequences through the network, fθ(X_t^i, c_{t+1}^i). As opposed to typical applications of beam-search, rather than choosing the most probable notes from Pr(s_{t+1} | X_t^i, c_{t+1}^i), we sample them independently at random. We then calculate the scores of the extended candidates using the preference metric gφ. Every δ steps, we perform a beam update process: we choose the k highest-scoring sequences as calculated by gφ, and then duplicate these sequences b/k times to maintain a full beam of b sequences. Choosing different values of δ allows us to control a horizon parameter, which facilitates longer-term predictions when extending candidate sequences in the beam. The use of larger horizons may lead to sub-optimal optimization but increases variability.
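A compact sketch of this procedure follows; the model functions are stubs and all names and default values are illustrative.

```python
# Sketch of the beam-search variant psi: sample (rather than argmax) next
# notes, and every delta steps keep the top-k beams by the preference metric.
import random

def psi(f_theta, g_phi, start, chords, b=32, k=8, delta=16):
    """Generate one optimized solo over a chord progression (sketch)."""
    beams = [list(start) for _ in range(b)]      # V_b: b copies of the seed
    for t, chord in enumerate(chords, start=1):
        for seq in beams:                        # extend every candidate
            probs = f_theta(seq, chord)          # Pr(s_{t+1} | X_t, c_{t+1})
            notes = list(range(len(probs)))
            note = random.choices(notes, weights=probs, k=1)[0]  # sample
            seq.append((note, chord))
        if t % delta == 0:                       # beam update every delta steps
            beams.sort(key=g_phi, reverse=True)  # score with preference metric
            top = beams[:k]                      # keep k best, duplicate b/k times
            beams = [list(seq) for seq in top for _ in range(b // k)]
    return max(beams, key=g_phi)
```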

5. EXPERIMENTS

We start the experimental process by training BebopNet as described in Section 4. After training, we use BebopNet to generate multiple solos over different jazz standards.⁶ To verify that BebopNet can generalize to harmonic progressions of different musical genres, we also generate improvisations over pop songs (see supplementary material).

This section has two sub-sections. First, we evaluate BebopNet in terms of harmonic coherence (5.1). Next, we present an analysis of our personalization process (5.2). All experiments were performed on desktop computers with a single Titan X GPU. Hyperparameters are provided in the supplementary material.

⁶ To appreciate the diversity of BebopNet, listen to the seven solos generated for user-4 over the tune Recorda-Me in the supplementary material.

5.1 Harmonic Coherence

We begin by evaluating the extent to which BebopNet was able to capture the context of chords, which we term harmonic coherence. We define two harmonic coherence metrics, using either scale match or chord match. These metrics are defined as the percentage of time within a measure during which the notes match pitches of the scale or of the chord being played, respectively. We rely on a standard definition of matching scales to chords using the chord-scale system [46]. While most notes in a solo should be harmonically coherent, some non-coherent notes are often incorporated; common examples of their use are chromatic lines, approach notes and enclosures [47]. Therefore, as we do not expect a perfect harmonic match according to pure music-theory rules, we take as a baseline the average matching statistics of these quantities for each jazz artist in our dataset. The harmonic coherence statistics of BebopNet are computed over the dataset used for the preference metric learning (generated by BebopNet), which also includes chord progressions not heard during the jazz modeling stage. The baselines and results are reported in Table 1. It is evident that our model exhibits harmonic coherence in the 'ballpark' of the jazz artists, even on chord progressions not previously heard.

[Figure 4: plots omitted; panel (i) is a histogram of preference-model predictions and panel (ii) is a risk-coverage surface.]

Figure 4. (i) Predictions of the preference model on sequences from a validation set. Green: sequences labeled with a positive score (yτ > 0); yellow: neutral (yτ = 0); red: negative (yτ < 0). The blue vertical lines indicate the thresholds β1, β2 used for selective prediction. (ii) Risk-coverage plot for the predictions of the preference model. β1, β2 (green lines) are defined to be the thresholds that yield a minimum error on the contour of 25% coverage.
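A sketch of the chord-match computation is shown below; the scale match is analogous, using the chord's associated scale from the chord-scale system. The data layout and the treatment of rests are our assumptions.

```python
# Sketch: fraction of a measure's sounding duration whose pitch class
# belongs to the current chord.
def chord_match(measure, chord_pitch_classes):
    """measure: list of (midi_pitch, duration) pairs; pitch 128 marks a rest."""
    matched = total = 0.0
    for pitch, dur in measure:
        if pitch == 128:                 # rest symbol, not counted
            continue
        total += dur
        if pitch % 12 in chord_pitch_classes:
            matched += dur
    return matched / total if total else 0.0

# Example: one measure over C7 (pitch classes {0, 4, 7, 10}) -> 0.75.
print(chord_match([(60, 1.0), (62, 0.5), (64, 0.5)], {0, 4, 7, 10}))
```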


5.2 Analyzing Personalized Models

We applied the proposed pipeline to generate personalized models for each of the four users, all amateur jazz musicians. All users listened to the same training dataset of solos to create their personal metric (see Section 4). Each user provided continuous feedback for each solo using our CRDI variant. In this section, we describe our evaluation process for user-1. The evaluation results for the rest of the users are presented in the supplementary material.

We analyze the quality of our preference metric function gφ by plotting a histogram of the network's predictions on a validation set. Consider Figure 4i. We can crudely divide the histogram into three areas: the right-hand region corresponds to mostly positive sequences predicted with high accuracy; the center region corresponds to high confusion between positive and negative; and the left one, to mostly negative sequences predicted with some confusion. While the overall error of the preference model is high (0.4 MSE, where the regression domain is [−1, 1]), it is still useful, since we are interested in its predictions in the positive (green) spectrum for the forthcoming optimization stage. While trading off coverage, we increase prediction accuracy using selective prediction, allowing our classifier to abstain when it is not sufficiently confident. To this end, we ignore predictions whose magnitude lies between the two rejection thresholds (see Section 4.3.2). Based on preliminary observations, we fix the rejection thresholds to maintain 25% coverage over the validation set. In Figure 4ii we present a risk-coverage plot for user-1 (see the definition in [8]). The risk surface is computed by moving the two thresholds β1 and β2 across the histogram in Figure 4i; at each point, for the data not between the thresholds, we calculate the risk (the error of classification into three categories: positive, neutral and negative) and the coverage (the percentage of data maintained).

We increase the diversity of generated samples by taking the score's sign rather than the exact score predicted by the preference model gφ. Therefore, different positive samples are given an equal score. For user-1, the average score predicted by gφ for generated solos of BebopNet is 0.07. As we introduce beam-search and increase the beam width, performance increases up to an optimal point, after which it decreases (see supplementary material). User-1's scores peaked at 0.8 with b = 32, k = 8. Anecdotally, there was one solo that user-1 felt was exceptionally good; for that solo, the model predicted the perfect score of 1. This indicates that the use of beam-search is indeed beneficial for optimizing the preference metric.

6. PLAGIARISM ANALYSIS

One major concern is the extent to which BebopNet plagiarizes. In our calculations, two sequences that are identical up to transposition are considered the same. To quantify plagiarism in a solo with respect to a set of source solos, we measure the percentage of n-grams in that solo that also appear in any other solo in the source. These statistics are also applied to every artist in our dataset to form a baseline for the typical amount of copying exhibited by humans.

Another plagiarism measurement we define is the largest common sub-sequence. For each solo, we consider the solos of the other artists as the source set; we then average the results per artist. Also, for every artist, we compare every solo against the rest of his solos to measure self-plagiarism. For BebopNet, we quantify the plagiarism level with respect to the entire corpus. The average plagiarism level of BebopNet is 3.8. Interestingly, this value lies within the human plagiarism range found in the dataset. This indicates that BebopNet can be accused of plagiarism about as much as some of the famous jazz giants. We present the extended results in the supplementary material.
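A sketch of the n-gram measure follows. Representing melodies by their pitch-interval sequences, which makes the comparison transposition-invariant, is our implementation choice; the value of n is illustrative.

```python
# Sketch: share of a solo's n-grams that also occur in any source solo,
# compared up to transposition via pitch intervals.
def intervals(pitches):
    """Transposition-invariant view: consecutive pitch differences."""
    return tuple(b - a for a, b in zip(pitches, pitches[1:]))

def ngram_plagiarism(solo, sources, n=8):
    solo_iv = intervals(solo)
    grams = {solo_iv[i:i + n] for i in range(len(solo_iv) - n + 1)}
    src = set()
    for s in sources:
        iv = intervals(s)
        src.update(iv[i:i + n] for i in range(len(iv) - n + 1))
    return len(grams & src) / len(grams) if grams else 0.0
```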


7. CONCLUDING REMARKS

We presented a novel pipeline for generating personalized, harmony-constrained jazz improvisations by learning and optimizing a user-specific musical preference model. To distill the noisy human preference signals, we used a selective prediction approach. We introduced an objective evaluation method for subjective content and numerically analysed our proposed pipeline on four users.

Our work raises many questions and directions for future research. While our generated solos are locally coherent and often interesting/pleasing, they lack the qualities of professional jazz related to overall structure, such as motif development and variation. Preliminary models we trained on smaller datasets were substantially weaker. Can a much larger dataset generate a significantly better model? To acquire such a large corpus, it might be necessary to abandon the symbolic approach and rely on raw audio.

Our work emphasizes the need to develop effective methodologies and techniques to extract and distill the noisy human feedback that will be required for developing many personalized applications. Our proposed method raises many questions. To what extent does our metric express the specifics of one's musical taste? Can we extract precise properties from this metric? Additionally, our technique relies on a sufficiently large labeled sample provided by each user, which demands substantial effort on the user's part. We anticipate that the problem of eliciting user feedback will eventually be solved in a completely different manner, for example, by monitoring user satisfaction unobtrusively, e.g., using a camera, EEG, or even direct brain-computer connections.

The challenge of evaluating neural networks that generate art remains a central issue in this research field. An ideal jazz solo should be creative, interesting and meaningful. Nevertheless, when evaluating jazz solos, there are no mathematical definitions for these properties, as yet. Previous works have attempted to define and optimize creativity [48], but no one has yet delineated an explicit, objective definition. Some of the main properties of creative performance are innovation and the generation of patterns that reside out of the box, namely, the extrapolation of outlier patterns beyond the observed distribution. Present machine learning regimes, however, are mainly capable of handling interpolation tasks, not extrapolation. Is it at all possible to learn the patterns of outliers?

8. ACKNOWLEDGEMENTS

∗ Both authors contributed equally to this work.

9. REFERENCES

[1] L. A. Hiller Jr. and L. M. Isaacson, "Musical Composition with a High Speed Digital Computer," in Audio Engineering Society Convention 9. Audio Engineering Society, 1957.

[2] N. Jaques, S. Gu, R. E. Turner, and D. Eck, "Tuning Recurrent Neural Networks with Reinforcement Learning," in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, 2017. [Online]. Available: https://openreview.net/forum?id=Syyv2e-Kx

[3] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck, "Music Transformer: Generating Music with Long-Term Structure," arXiv preprint arXiv:1809.04281, 2018.

[4] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," CoRR, vol. abs/1810.12247, 2018. [Online]. Available: http://arxiv.org/abs/1810.12247

[5] K. Chen, W. Zhang, S. Dubnov, G. Xia, and W. Li, "The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation," in 2019 International Workshop on Multilayer Music Representation and Processing (MMRP). IEEE, 2019, pp. 77–84.

[6] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating Long Sequences with Sparse Transformers," arXiv preprint arXiv:1904.10509, 2019.

[7] C. R. Robinson, "Differentiated Modes of Choral Performance Evaluation Using Traditional Procedures and a Continuous Response Digital Interface Device," Ph.D. dissertation, Florida State University, 1988.

[8] Y. Geifman and R. El-Yaniv, "Selective Classification for Deep Neural Networks," in Advances in Neural Information Processing Systems, 2017, pp. 4878–4887.

[9] Y. Geifman and R. El-Yaniv, "SelectiveNet: A Deep Neural Network with an Integrated Reject Option," in International Conference on Machine Learning (ICML), 2019.

[10] P. Norvig, Paradigms of Artificial Intelligence Programming: Case Studies in Common LISP. Morgan Kaufmann, 1992.

[11] J. Gillick, K. Tang, and R. M. Keller, "Learning Jazz Grammars," Proceedings of the SMC, pp. 125–130, 2009.

[12] M. Löthe, "Knowledge Based Automatic Composition and Variation of Melodies for Minuets in Early Classical Style," in Annual Conference on Artificial Intelligence. Springer, 1999, pp. 159–170.

[13] F. Pachet, "The Continuator: Musical Interaction with Style," Journal of New Music Research, vol. 32, no. 3, pp. 333–341, 2003.

[14] R. Wooller and A. R. Brown, "Investigating Morphing Algorithms for Generative Music," in Third Iteration: Third International Conference on Generative Systems in the Electronic Arts, Melbourne, Australia, 2005.

[15] J. Sakellariou, F. Tria, V. Loreto, and F. Pachet, "Maximum Entropy Models Capture Melodic Styles," Scientific Reports, vol. 7, no. 1, p. 9172, 2017.

[16] P. Laine and M. Kuuskankare, "Genetic Algorithms in Musical Style Oriented Generation," in Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence. IEEE, 1994, pp. 858–862.

[17] G. Papadopoulos and G. Wiggins, "A Genetic Algorithm for the Generation of Jazz Melodies," Proceedings of STEP, vol. 98, 1998.

[18] P. Toiviainen, "Modeling the Target-Note Technique of Bebop-Style Jazz Improvisation: An Artificial Neural Network Approach," Music Perception: An Interdisciplinary Journal, vol. 12, no. 4, pp. 399–413, 1995.

[19] M. Nishijima and K. Watanabe, "Interactive Music Composer Based on Neural Networks," in Proceedings of the 1992 International Computer Music Conference, ICMC 1992, San Jose, California, USA, October 14-18, 1992. [Online]. Available: http://hdl.handle.net/2027/spo.bbp2372.1992.015

[20] J. Franklin, "Multi-Phase Learning for Jazz Improvisation and Interaction," in Proceedings of the Biennial Symposium on Arts and Technology, 2001.

[21] J. D. Fernández and F. Vico, "AI Methods in Algorithmic Composition: A Comprehensive Survey," Journal of Artificial Intelligence Research, vol. 48, pp. 513–582, 2013.

[22] D. Eck and J. Schmidhuber, "Learning the Long-Term Structure of the Blues," in International Conference on Artificial Neural Networks. Springer, 2002, pp. 284–289.

[23] J. A. Franklin, "Jazz Melody Generation Using Recurrent Networks and Reinforcement Learning," International Journal on Artificial Intelligence Tools, vol. 15, no. 4, pp. 623–650, 2006.

[24] D. D. Johnson, R. M. Keller, and N. Weintraut, "Learning to Create Jazz Melodies Using a Product of Experts," in Proceedings of the Eighth International Conference on Computational Creativity (ICCC'17), Atlanta, GA, 2017, p. 151.

[25] G. E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.

[26] L. Yang, S. Chou, and Y. Yang, "MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation," in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pp. 324–331. [Online]. Available: https://ismir2017.smcnus.org/wp-content/uploads/2017/10/226_Paper.pdf

[27] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=r1lYRjC9F7

[28] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: a Steerable Model for Bach Chorales Generation," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1362–1371. [Online]. Available: http://proceedings.mlr.press/v70/hadjeres17a.html

[29] H. H. Mao, T. Shin, and G. W. Cottrell, "DeepJ: Style-Specific Music Generation," in 12th IEEE International Conference on Semantic Computing, ICSC 2018, Laguna Hills, CA, USA, January 31 - February 2, 2018, pp. 377–382. [Online]. Available: https://doi.org/10.1109/ICSC.2018.00077

[30] G. Hadjeres and F. Nielsen, "Interactive Music Generation with Positional Constraints using Anticipation-RNNs," CoRR, vol. abs/1709.06404, 2017. [Online]. Available: http://arxiv.org/abs/1709.06404

[31] E. Schubert, "Continuous Measurement of Self-Report Emotional Response to Music," Music and Emotion: Theory and Research, pp. 394–414, 2001.

[32] C. K. Madsen and J. M. Geringer, "Comparison of Good Versus Bad Tone Quality/Intonation of Vocal and String Performances: Issues Concerning Measurement and Reliability of the Continuous Response Digital Interface," Bulletin of the Council for Research in Music Education, pp. 86–92, 1999.

[33] E. Himonides, "Mapping a Beautiful Voice: The Continuous Response Measurement Apparatus (CReMA)," Journal of Music, Technology & Education, vol. 4, no. 1, pp. 5–25, 2011.

[34] R. V. Brittin, "Listeners' Preference for Music of Other Cultures: Comparing Response Modes," Journal of Research in Music Education, vol. 44, no. 4, pp. 328–340, 1996.

[35] J. C. Coggiola, "The Effect of Conceptual Advancement in Jazz Music Selections and Jazz Experience on Musicians' Aesthetic Response," Journal of Research in Music Education, vol. 52, no. 1, pp. 29–42, 2004.

[36] "Sax solos," https://saxsolos.com/, accessed: 2019-05-16.

[37] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional Sequence to Sequence Learning," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1243–1252. [Online]. Available: http://proceedings.mlr.press/v70/gehring17a.html

[38] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, p. 125. [Online]. Available: http://www.isca-speech.org/archive/SSW_2016/abstracts/ssw9_DS-4_van_den_Oord.html

[39] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, "Character-Level Language Modeling with Deeper Self-Attention," CoRR, vol. abs/1808.04444, 2018. [Online]. Available: http://arxiv.org/abs/1808.04444

[40] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[41] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," arXiv preprint arXiv:1901.02860, 2019.

[42] H. Inan, K. Khosravi, and R. Socher, "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling," in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. [Online]. Available: https://openreview.net/forum?id=r1aPbsFle

[43] O. Press and L. Wolf, "Using the Output Embedding to Improve Language Models," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pp. 157–163. [Online]. Available: https://aclanthology.info/papers/E17-2025/e17-2025

[44] PG Music Inc., "Band-in-a-Box." [Online]. Available: https://www.pgmusic.com/

[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need

[46] M. Cooke, D. Horn, and J. Cross, The Cambridge Companion to Jazz. Cambridge University Press, 2002.

[47] J. Coker, "Elements of the Jazz Language for the Developing Improviser," Miami: CPP Belwin, 1991.

[48] J. Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010)," IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, 2010.
