BebopNet: Deep Neural Models for Personalized Jazz Improvisations

Shunit Haviv Hakimi, Nadav Bhonker, Ran El-Yaniv
ABSTRACT

A major bottleneck in the evaluation of music generation is that music appreciation is a highly subjective matter. When considering an average appreciation as an evaluation metric, user studies can be helpful. The challenge of generating personalized content, however, has been examined only rarely in the literature. In this paper, we address generation of personalized music and propose a novel pipeline for music generation that learns and optimizes user-specific musical taste. We focus on the task of symbol-based, monophonic, harmony-constrained jazz improvisations. Our personalization pipeline begins with BebopNet, a music language model trained on a corpus of jazz improvisations by Bebop giants. BebopNet is able to generate improvisations based on any given chord progression¹. We then assemble a personalized dataset, labeled by a specific user, and train a user-specific metric that reflects this user's unique musical taste. Finally, we employ a personalized variant of beam-search with BebopNet to optimize the generated jazz improvisations for that user. We present an extensive empirical study in which we apply this pipeline to extract individual models as implicitly defined by several human listeners. Our approach enables an objective examination of subjective personalized models whose performance is quantifiable. The results indicate that it is possible to model and optimize personal jazz preferences, and offer a foundation for future research in personalized generation of art. We also briefly discuss opportunities, challenges, and questions that arise from our work, including issues related to creativity.

¹ Supplementary material and numerous MP3 demonstrations of improvisations over jazz standards and pop songs generated by BebopNet are provided at https://shunithaviv.github.io/bebopnet.

© Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv, "BebopNet: Deep Neural Models for Personalized Jazz Improvisations", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

1. INTRODUCTION

Since the dawn of computers, researchers and artists have been interested in utilizing them for producing different forms of art, and notably for composing music [1]. The explosive growth of deep learning models over the past several years has expanded the possibilities for musical generation, leading to a line of work that pushed forward the state of the art [2–6]. Another recent trend is the development of consumer services such as Spotify, Deezer and Pandora, which aim to provide personalized streams of existing music content. Perhaps the crowning achievement of such personalized services would be for the content itself to be generated explicitly to match each individual user's taste. In this work we focus on the task of generating user-personalized, monophonic, symbolic jazz improvisations. To the best of our knowledge, this is the first work that aims at generating personalized jazz solos using deep learning techniques.

The common approach for generating music with neural networks is generally the same as for language modeling: given a context of existing symbols (e.g., characters, words, music notes), the network is trained to predict the next symbol. Thus, once the network learns the distribution of sequences from the training set, it can generate novel sequences by sampling from the network output and feeding the result back into itself. The products of such models are sometimes evaluated through user studies (crowd-sourcing). Such studies assess the quality of generated music by asking users their opinion and computing the mean opinion score (MOS). While these methods may measure the overall quality of the generated music, they tend to average out evaluators' personal preferences. Another, more quantitative but rigid, approach for evaluating generated music is to compute a metric based on music theory principles. While such metrics can, in principle, be defined for classical music, they are less suitable for jazz improvisation, which does not adhere to such strict rules.

To generate personalized jazz improvisations, we propose a framework consisting of the following elements: (a) BebopNet: jazz model learning; (b) user preference elicitation; (c) user preference metric learning; and (d) optimized music generation via planning.

As many jazz teachers would recommend, the key to attaining great improvisation skills is studying and emulating great musicians. Following this advice, we train BebopNet, a harmony-conditioned jazz model that composes entire solos. We use a training dataset of hundreds of professionally transcribed jazz improvisations performed by saxophone giants such as Charlie Parker, Phil Woods and Cannonball Adderley (see details in Section 4.1.1).
In this dataset, each solo is a monophonic note sequence synchronized over a chord progression. After training, BebopNet is capable of generating high-quality phrases (this is a subjective impression of the authors). Figure 1 presents a short excerpt generated by BebopNet.

[Figure 1. An excerpt of a jazz solo generated by BebopNet given a chord sequence; the music notation is not reproduced here.]

Considering that different people have different musical tastes, our goal in this paper is to go beyond straightforward generation by this model and optimize the generation toward personalized preferences. For this purpose, we determine a user's preference by measuring the level of satisfaction throughout their solos using a digital variant of the continuous response digital interface (CRDI) [7]. This is accomplished by playing, for the user, computer-generated solos (from the jazz model) and recording their good/bad feedback in real time throughout each solo. Once we have gathered sufficient feedback about the user's preferences, we train a user-specific preference metric and use it to steer generation.

2. RELATED WORK

Many different techniques for algorithmic musical composition have been explored, for example using Markov chains [13–15], evolutionary methods [16, 17] or neural networks. In this broad area, we refer the reader to [21]. Here we confine the discussion to closely related works that mainly operate over symbolic data. In this narrower context, most works follow a generation-by-prediction paradigm, whereby a model trained to predict the next symbol is used to greedily generate sequences. The first work on blues improvisation [22] straightforwardly applied long short-term memory (LSTM) networks on a small training set. While their results may seem limited at a distance of nearly two decades², they were the first to demonstrate long-term structure captured by neural networks.

One approach to improving naïve greedy generation from a jazz model is using a mixture of experts. For example, Franklin et al. [23] trained an ensemble of neural networks, one specialized for each melody, and then selected from among them at generation time.
[Figure 3. The BebopNet architecture for the next note prediction. Each note is represented by concatenating the embeddings of the pitch (red bar), the duration (purple bar) and the four pitches comprising the current chord (green bars). The output of the LSTM is passed to two heads (orange bars), one the size of the pitch embedding (top) and the other the size of the duration embedding (bottom). The pitch and duration decoders share weights with their respective embeddings.]
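To make the figure concrete, the following is a minimal PyTorch sketch consistent with the caption and with the weight tying described in the next paragraph. It is a hedged reconstruction, not the authors' implementation: vocabulary sizes, embedding and hidden dimensions, the number of LSTM layers, and the reuse of the pitch embedding for chord pitches are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class BebopNetSketch(nn.Module):
    """Figure 3 sketch: [pitch emb; duration emb; 4 chord-pitch embs]
    -> LSTM -> pitch head and duration head, tied to the embeddings."""

    def __init__(self, n_pitch=49, n_dur=32, d_pitch=64, d_dur=16, d_hid=512):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitch, d_pitch)
        self.dur_emb = nn.Embedding(n_dur, d_dur)
        # One input frame = pitch emb + duration emb + four chord-pitch embs
        # (assumed here to reuse the note-pitch embedding table).
        self.lstm = nn.LSTM(d_pitch + d_dur + 4 * d_pitch, d_hid,
                            num_layers=3, batch_first=True)
        # Heads project back to the embedding sizes so that the output logits
        # can reuse (tie) the embedding matrices, following [42, 43].
        self.to_pitch = nn.Linear(d_hid, d_pitch)
        self.to_dur = nn.Linear(d_hid, d_dur)

    def forward(self, pitch, dur, chord, state=None):
        # pitch, dur: (B, T) index tensors; chord: (B, T, 4) chord-pitch indices.
        chord_e = self.pitch_emb(chord).flatten(2)            # (B, T, 4*d_pitch)
        x = torch.cat([self.pitch_emb(pitch), self.dur_emb(dur), chord_e], dim=-1)
        h, state = self.lstm(x, state)
        # Tied output layers: logits = head(h) @ embedding^T.
        pitch_logits = self.to_pitch(h) @ self.pitch_emb.weight.t()
        dur_logits = self.to_dur(h) @ self.dur_emb.weight.t()
        return pitch_logits, dur_logits, state
```

Training such a model would minimize the sum of the two heads' cross-entropy losses, i.e., the negative log-likelihood of the corpus mentioned below, with the corpus transposed to all 12 keys for augmentation.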
Following [42, 43], we tie the weights of the final fully-connected layers to those of the embedding. Finally, the outputs of the two heads pass through a softmax layer and are trained to minimize the negative log-likelihood of the corpus. To enrich our dataset while encouraging harmonic context dependence, we augment our dataset by transposing to all 12 keys.

4.2 User Preference Elicitation

Using BebopNet, we created a dataset to be labeled by users, consisting of 124 improvisations. These solos were divided into three groups of roughly the same size: solos from the original corpus, solos generated by BebopNet over jazz standards present in the training set, and generated solos over jazz standards not present in the training set. The length of each solo is two choruses, or twice the length of the melody. For each standard, we generated a backing track in MP3 format that includes a rhythm section and a harmonic instrument to play along the improvisation using Band-in-a-Box [44]. This dataset amounts to approximately five hours of played music.

We created a system inspired by CRDI that is entirely digital, replacing the analog dial with strokes of a keyboard moving a digital dial. A figure of our dial is presented in the supplementary material. While the original CRDI had a range of 255 values, our initial experiments found that quantizing the values to five levels was easier for users. We recorded the location of the dial at every time step and aligned it to the note being played at the same moment.

4.3 User Preference Metric Learning

In the user preference metric learning stage we again use supervised learning to train a metric function gφ. This function should predict user preference scores for any solo, given its harmonic context. During training, for each sequence X_τ we estimate y_τ, corresponding to the label the user provided for the last note in the sequence. We choose the last label of the sequence, rather than the mode or mean, because of delayed feedback. During the user elicitation step, we noticed that when a user decides to change the position of the dial, it is because he has just heard a sequence of notes that he considers to be more (or less) pleasing than those he heard previously. Thus, the label indicates the preference of the past sequence. The labels are linearly scaled down to the range [−1, 1]. Since the data in D_e is small and unbalanced, we use stratified sampling over solos to divide the dataset into training and validation sets. We then use bagging to create an ensemble of five models for the final estimate.

4.3.1 Network Architecture

We estimate the function gφ using transfer learning from BebopNet. The user preference model consists of the same layers as BebopNet without the final fully-connected layers. Next, we apply scaled dot-product attention [45] over τ time steps, followed by fully-connected and tanh layers. The transferred layers are initialized using the weights θ of BebopNet. Furthermore, the weights of the embedding layers are frozen during training.

4.3.2 Selective Prediction

To elevate the accuracy of gφ, we utilize selective prediction, whereby we ignore predictions whose confidence is too low. We use the prediction magnitude as a proxy for confidence. Given confidence threshold parameters β1 < 0 and β2 > 0, we define g′_{φ,β1,β2}(X_t^i) as in Eq. (3):

$$
g'_{\varphi,\beta_1,\beta_2}(X_t^i) =
\begin{cases}
0 & \text{if } \beta_1 < g_\varphi(X_t^i) < \beta_2 \\
g_\varphi(X_t^i) & \text{otherwise}
\end{cases}
\tag{3}
$$

The parameters β1 and β2 change our coverage rate and are determined by minimizing error (risk) on the risk-coverage plot along a predefined coverage contour. More details are given in Section 5.2.

4.4 Optimized Music Generation

To optimize generations from fθ, we apply a variant of beam-search, ψ, whose objective scores are obtained from non-rejected predictions of gφ. Pseudocode of the ψ procedure is presented in the supplementary material. We denote by V_b = [X_t^1, X_t^2, ..., X_t^b] a running batch (beam) of size (beam-width) b containing the most promising candidate sequences found so far by the algorithm. The sequences are all initialized with the starting input sequence; in our case, this is the melody of the jazz standard. A code sketch of the procedure follows below; the next paragraphs spell out each step.
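To tie Sections 4.3.2 and 4.4 together, here is a hedged Python sketch of the ψ procedure: sampling-based extension of each candidate, with a top-k beam update every δ steps scored by the selective preference metric of Eq. (3). This is our paraphrase of the text, not the authors' pseudocode. Sequences are abstracted as 1-D tensors of note tokens; the interfaces of `model_step` (one step of fθ, returning next-note probabilities) and `g_phi` (the bagged preference ensemble, mapping a list of sequences to scores in [−1, 1]) and the values of δ, β1, β2 are placeholder assumptions. For brevity, candidates are scored only at update time.

```python
import torch

def selective_score(g_phi, seqs, beta1, beta2):
    """Eq. (3): reject (zero out) low-confidence preference predictions."""
    s = g_phi(seqs)                                  # (b,) scores in [-1, 1]
    return torch.where((s > beta1) & (s < beta2), torch.zeros_like(s), s)

def psi_beam_search(model_step, g_phi, start_seq, n_steps,
                    b=32, k=8, delta=16, beta1=-0.3, beta2=0.3):
    """Sampling-based beam search guided by the user preference metric."""
    beam = [start_seq.clone() for _ in range(b)]     # beams start from the melody
    for t in range(n_steps):
        # Extend every candidate by *sampling* the next note, not by argmax.
        for i in range(b):
            probs = model_step(beam[i])              # Pr(s_{t+1} | X_t^i, c_{t+1}^i)
            note = torch.multinomial(probs, 1)
            beam[i] = torch.cat([beam[i], note])
        # Every delta steps (the horizon): keep the k best, duplicate b/k times.
        if (t + 1) % delta == 0:
            scores = selective_score(g_phi, beam, beta1, beta2)
            best = scores.topk(k).indices
            beam = [beam[j].clone() for j in best for _ in range(b // k)]
    final_scores = selective_score(g_phi, beam, beta1, beta2)
    return beam[final_scores.argmax()]
```

A larger δ lets candidates develop for longer between prunings, trading optimization pressure for variability, as discussed in the paragraphs that follow.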
Name    Adderley  Gordon  Getz  Parker  Rollins  Stitt  Woods  Ammons  BebopNet (Heard)  BebopNet (Unheard)
Chord   0.50      0.54    0.53  0.52    0.52     0.53   0.50   0.54    0.53              0.52
Scale   0.78      0.83    0.81  0.80    0.81     0.83   0.78   0.83    0.82              0.81

Table 1. Harmonic coherence: the average chord and scale matches computed for artists in the dataset and for BebopNet. Higher values indicate greater harmonic coherence. BebopNet is measured separately for harmonic progressions heard and not heard in the training dataset.
At every time step t, we produce a probability distribution over the next note of every sequence in V_b by passing the b sequences through the network fθ(X_t^i, c_{t+1}^i). As opposed to typical applications of beam-search, rather than choosing the most probable notes from Pr(s_{t+1}|X_t^i, c_{t+1}^i), we independently and randomly sample them. We then calculate the score of the extended candidates using the preference metric, gφ. Every δ steps, we perform a beam update process: we choose the k highest scoring sequences calculated by gφ, and then duplicate these sequences b/k times to maintain a full beam of b sequences. Choosing different values of δ allows us to control a horizon parameter, which facilitates longer-term predictions when extending candidate sequences in the beam. The use of larger horizons may lead to sub-optimal optimization but increases variability.

5. EXPERIMENTS

5.1 Harmonic Coherence

We begin by evaluating the extent to which BebopNet was able to capture the context of chords, which we term harmonic coherence. We define two harmonic coherence metrics using either scale match or chord match. These metrics are defined as the percent of time within a measure where notes match pitches of the scale or the chord being played, respectively. We rely on a standard definition of matching scales to chords using the chord-scale system [46]. While most notes in a solo should be harmonically coherent, some non-coherent notes are often incorporated; common examples of their uses are chromatic lines, approach notes and enclosures [47]. Therefore, as we do not expect a perfect harmonic match according to pure music rules, we take as a baseline the average matching statistics of these quantities for each jazz artist in our dataset. The harmonic coherence statistics of BebopNet are computed over the dataset used for the preference metric learning (generated by BebopNet), which also includes chord progressions not heard during the jazz modeling stage. The baselines and results are reported in Table 1. It is evident that our model exhibits harmonic coherence in the 'ballpark' of the jazz artists, even on chord progressions not previously heard.

⁶ To appreciate the diversity of BebopNet, listen to seven solos generated for user-4 for the tune Recorda-Me in the supplementary material.
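To illustrate the two metrics defined above, here is a small self-contained Python sketch that scores one measure against the active chord and against its associated scale. It is a toy example under stated assumptions: notes are (pitch class, duration) pairs, and only two chord-to-pitch-set entries of the chord-scale system [46] are spelled out.

```python
# Duration-weighted chord/scale match for one measure (pitch classes: C=0 ... B=11).
CHORD_TONES = {"Bb6": {10, 2, 5, 7},                  # Bb D F G
               "F7":  {5, 9, 0, 3}}                   # F A C Eb
CHORD_SCALE = {"Bb6": {10, 0, 2, 3, 5, 7, 9},         # Bb major
               "F7":  {5, 7, 9, 10, 0, 2, 3}}         # F Mixolydian

def harmonic_coherence(notes, chord, table):
    """notes: list of (pitch_class, duration) pairs within one measure.
    Returns the fraction of time whose pitch class lies in table[chord]."""
    total = sum(d for _, d in notes)
    matched = sum(d for p, d in notes if p in table[chord])
    return matched / total if total else 0.0

# A measure over F7 containing one chromatic approach note (E, pitch class 4).
measure = [(5, 1.0), (9, 0.5), (4, 0.5), (0, 1.0), (10, 1.0)]
print(harmonic_coherence(measure, "F7", CHORD_TONES))  # chord match: 0.625
print(harmonic_coherence(measure, "F7", CHORD_SCALE))  # scale match: 0.875
```

Averaging such per-measure fractions over a solo yields the chord and scale rows of Table 1; as there, the scale match is naturally higher than the chord match.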
5.2 Analyzing Personalized Models

We applied the proposed pipeline to generate personalized models for each of the four users, all amateur jazz musicians. All users listened to the same training dataset of solos to create their personal metric (see Section 4). Each user provided continuous feedback for each solo using our CRDI variant. In this section, we describe our evaluation process for user-1. The evaluation results for the rest of the users are presented in the supplementary material.

We analyze the quality of our preference metric function gφ by plotting a histogram of the network's predictions applied on a validation set. Consider Figure 4i. We can crudely divide the histogram into three areas: the right-hand side region corresponds to mostly positive sequences predicted with high accuracy; the center region corresponds to high confusion between positive and negative; and the left one, to mostly negative sequences predicted with some confusion. While the overall error of the preference model is high (0.4 MSE where the regression domain is [−1, 1]), it is still useful since we are interested in its predictions in the positive (green) spectrum for the forthcoming optimization stage. While trading off coverage, we increase prediction accuracy using selective prediction by allowing our classifier to abstain when it is not sufficiently confident. To this end, we ignore predictions whose magnitude is between two rejection thresholds (see Section 4.3.2). Based on preliminary observations, we fix the rejection thresholds to maintain 25% coverage over the validation set. In Figure 4ii we present a risk-coverage plot for user-1 (see definition in [8]). The risk surface is computed by moving two thresholds β1 and β2 across the histogram in Figure 4i; at each point, for data not between the thresholds, we calculate the risk (error of classification into three categories: positive, neutral and negative) and the coverage (percent of data maintained).

[Figure 4. (i) Predictions of the preference model on sequences from a validation set. Green: sequences labeled with a positive score (y_τ > 0); yellow: neutral (y_τ = 0); red: negative (y_τ < 0). The blue vertical lines indicate the thresholds β1, β2 used for selective prediction. (ii) Risk-coverage plot for the predictions of the preference model. β1, β2 (green lines) are defined to be the thresholds that yield a minimum error on the contour of 25% coverage. The histogram and plot data are not reproduced here.]

We increase the diversity of generated samples by taking the score's sign rather than the exact score predicted by the preference model gφ; therefore, different positive samples are given equal score. For user-1, the average score predicted by gφ for generated solos of BebopNet is 0.07. As we introduce beam-search and increase the beam width, the performance increases up to an optimal point, from which it decreases (see supplementary material). User-1's scores peaked at 0.8 with b = 32, k = 8. Anecdotally, there was one solo that user-1 felt was exceptionally good; for that solo, the model predicted the perfect score of 1. This indicates that the use of beam-search is indeed beneficial for optimizing the preference metric.

6. PLAGIARISM ANALYSIS

One major concern is the extent to which BebopNet plagiarizes. In our calculations, two sequences that are identical up to transposition are considered the same. To quantify plagiarism in a solo with respect to a set of source solos, we measure the percentage of n-grams in that solo that also appear in any other solo in the source. These statistics are also applied to every artist in our dataset to form a baseline for the typical amount of copying exhibited by humans.

Another plagiarism measurement we define is the largest common sub-sequence. For each solo, we consider the solos of other artists as the source set; we then average the results per artist. Also, for every artist, we compare every solo against the rest of his solos to measure self-plagiarism. For BebopNet, we quantify the plagiarism level with respect to the entire corpus. The average plagiarism level of BebopNet is 3.8. Interestingly, this value lies within the human plagiarism range found in the dataset, which indicates that BebopNet can be accused of plagiarism as much as some of the famous jazz giants. We present the extended results in the supplementary material.

7. CONCLUDING REMARKS

We presented a novel pipeline for generating personalized harmony-constrained jazz improvisations by learning and optimizing a user-specific musical preference model. To distill the noisy human preference models, we used a selective prediction approach. We introduced an objective evaluation method for subjective content and numerically analyzed our proposed pipeline on four users.

Our work raises many questions and directions for future research. While our generated solos are locally coherent and often interesting and pleasing, they lack the qualities of professional jazz related to general structure, such as motif development and variations. Preliminary models we trained on smaller datasets were substantially weaker. Can a much larger dataset generate a significantly better model? To acquire such a large corpus, it might be necessary to abandon the symbolic approach and rely on raw audio.

Our work emphasizes the need to develop effective methodologies and techniques to extract and distill the noisy human feedback that will be required for developing many personalized applications. Our proposed method raises many questions. To what extent does our metric express the specifics of one's musical taste? Can we extract precise properties from this metric? Additionally, our technique relies on a sufficiently large labeled sample to be provided by each user, a substantial effort on the user's part. We anticipate that the problem of eliciting user feedback will be solved in a completely different manner, for example, by monitoring user satisfaction unobtrusively, e.g., using a camera, EEG, or even direct brain-computer connections.

The challenge of evaluating neural networks that generate art remains a central issue in this research field. An ideal jazz solo should be creative, interesting and meaningful. Nevertheless, when evaluating jazz solos, there are no mathematical definitions for these properties, as yet. Previous works attempted to define and optimize creativity [48], but no one has yet delineated an explicit objective definition. Some of the main properties of creative performance are innovation and the generation of patterns that reside out-of-the-box; namely, the extrapolation of outlier patterns beyond the observed distribution. Present machine learning regimes, however, are mainly capable of handling interpolation tasks rather than extrapolation. Is it at all possible to learn the patterns of outliers?
[25] G. E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.

[26] L. Yang, S. Chou, and Y. Yang, "MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation," in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, 2017, pp. 324–331. [Online]. Available: https://ismir2017.smcnus.org/wp-content/uploads/2017/10/226_Paper.pdf

[27] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=r1lYRjC9F7

[28] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: a Steerable Model for Bach Chorales Generation," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 1362–1371. [Online]. Available: http://proceedings.mlr.press/v70/hadjeres17a.html

[29] H. H. Mao, T. Shin, and G. W. Cottrell, "DeepJ: Style-Specific Music Generation," in 12th IEEE International Conference on Semantic Computing, ICSC 2018, Laguna Hills, CA, USA, January 31 - February 2, 2018, 2018, pp. 377–382. [Online]. Available: https://doi.org/10.1109/ICSC.2018.00077

[30] G. Hadjeres and F. Nielsen, "Interactive Music Generation with Positional Constraints using Anticipation-RNNs," CoRR, vol. abs/1709.06404, 2017. [Online]. Available: http://arxiv.org/abs/1709.06404

[31] E. Schubert, "Continuous Measurement of Self-Report Emotional Response to Music," Music and Emotion: Theory and Research, pp. 394–414, 2001.

[32] C. K. Madsen and J. M. Geringer, "Comparison of Good Versus Bad Tone Quality/Intonation of Vocal and String Performances: Issues Concerning Measurement and Reliability of the Continuous Response Digital Interface," Bulletin of the Council for Research in Music Education, pp. 86–92, 1999.

[33] E. Himonides, "Mapping a Beautiful Voice: The Continuous Response Measurement Apparatus (CReMA)," Journal of Music, Technology & Education, vol. 4, no. 1, pp. 5–25, 2011.

[34] R. V. Brittin, "Listeners' Preference for Music of Other Cultures: Comparing Response Modes," Journal of Research in Music Education, vol. 44, no. 4, pp. 328–340, 1996.

[35] J. C. Coggiola, "The Effect of Conceptual Advancement in Jazz Music Selections and Jazz Experience on Musicians' Aesthetic Response," Journal of Research in Music Education, vol. 52, no. 1, pp. 29–42, 2004.

[36] "Sax solos," https://saxsolos.com/, accessed: 2019-05-16.

[37] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional Sequence to Sequence Learning," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 1243–1252. [Online]. Available: http://proceedings.mlr.press/v70/gehring17a.html

[38] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, 2016, p. 125. [Online]. Available: http://www.isca-speech.org/archive/SSW_2016/abstracts/ssw9_DS-4_van_den_Oord.html

[39] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, "Character-Level Language Modeling with Deeper Self-Attention," CoRR, vol. abs/1808.04444, 2018. [Online]. Available: http://arxiv.org/abs/1808.04444

[40] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[41] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive Language Models beyond a Fixed-Length Context," arXiv preprint arXiv:1901.02860, 2019.

[42] H. Inan, K. Khosravi, and R. Socher, "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling," in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. [Online]. Available: https://openreview.net/forum?id=r1aPbsFle

[43] O. Press and L. Wolf, "Using the Output Embedding to Improve Language Models," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, 2017, pp. 157–163. [Online]. Available: https://aclanthology.info/papers/E17-2025/e17-2025

[44] PG Music Inc., "Band-in-a-Box." [Online]. Available: https://www.pgmusic.com/

[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998–6008.