MMM: Exploring Conditional Multi-Track Music Generation with the Transformer
Jeff Ens and Philippe Pasquier
1 Introduction
To the best of our knowledge, both inpainting and attribute control have not been integrated into a single model.
2 Related Work
There are two main ways in which musical material is represented: as a matrix
(i.e. a piano roll), or as a sequence of tokens. A piano roll is a boolean matrix
x ∈ {0, 1}^{T×P}, where T is the number of time-steps and P is the number
of pitches. Typically P = 128, allowing the piano roll to represent all possible
MIDI pitches; however, it is not uncommon to reduce the range of pitches
represented (Dong, Hsiao, Yang, & Yang, 2018). Multi-track musical material
can be represented using a boolean tensor x ∈ {0, 1}^{M×T×P}, where M is the
number of tracks. However, this type of representation is inherently inefficient,
as the number of inputs increases by T × P for each track that is added, and
accommodating small note lengths (e.g. 32nd-note triplets) substantially increases T.
Despite these drawbacks, this representation has been used in practice
(Boulanger-Lewandowski, Bengio, & Vincent, 2012; Dong et al., 2018;
Huang et al., 2017). The alternative approach is to represent musical material as
a sequence of tokens, where each token corresponds to a specific musical event or
piece of metadata. For example, the PerformanceRNN (Oore, Simon, Dieleman,
Eck, & Simonyan, 2018) and Music Transformer (Huang et al., 2019) use a token
based representation comprised of 128 distinct NOTE ON tokens, which are used
to indicate the onset of a particular pitch; 128 NOTE OFF tokens, which denote
the end of a particular pitch; and 100 TIME SHIFT tokens, which correspond to
different time-shifts ranging from 10ms to 1 second. Although this type of repre-
sentation can accommodate polyphony, it does not distinguish between different
tracks or instruments.
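To make the inefficiency argument concrete, the following sketch (with illustrative sizes that are not taken from any of the cited systems) shows how a multi-track piano-roll tensor grows as the time resolution needed for small note values increases.

```python
import numpy as np

# Illustrative sizes only: 4 tracks, 128 pitches, a 4-bar segment in 4/4.
M, P, bars, beats_per_bar = 4, 128, 4, 4

for subdivisions_per_beat in (4, 12, 24):  # 16ths, 32nd-note triplets, finer
    T = bars * beats_per_bar * subdivisions_per_beat
    x = np.zeros((M, T, P), dtype=bool)    # x in {0, 1}^{M x T x P}
    print(f"{subdivisions_per_beat:>2} subdivisions/beat -> {x.size:,} cells")
```

Each additional track adds another T × P boolean inputs, and refining the quantization multiplies T for every track at once.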
Our work is most similar to LakhNES (Donahue et al., 2019), MusicVAE
(Roberts, Engel, Raffel, Hawthorne, & Eck, 2018) and MuseNet (Payne, 2019),
which all employ token-based representations to model multi-track music.
LakhNES models Nintendo Entertainment System (NES) data, which is comprised
of 3 monophonic tracks and a drum track, using a transformer architecture.
MusicVAE is trained on bass, melody and drum trios extracted from the Lakh
MIDI Dataset (LMD) (Raffel, 2016), which allows generation to be conditioned
on a latent vector. MuseNet is trained on a superset of the LMD, and accommo-
dates 10 different track types ranging from piano to guitar. Note that MuseNet
supports both polyphonic and monophonic tracks.
However, in contrast to these methods, where the musical events from several
tracks are interleaved into a single time-ordered sequence, we concatenate single-
instrument tracks, which allows for greater flexibility in several areas. First of all,
we can decouple track information from NOTE ON and NOTE OFF tokens, allowing
the use of the same NOTE ON and NOTE OFF tokens in each track. This differs from
LakhNES, MusicVAE, and MuseNet, which use separate NOTE ON and NOTE OFF
tokens for each track, placing inherent limitations on the number of tracks that
can be represented. Even MuseNet, which is the largest of these networks, can
only accommodate 10 different track types.
3 Motivation
Although systems which generate high-quality music have been proposed in re-
cent years (Huang et al., 2019; Payne, 2019; Liang, Gotham, Johnson, & Shotton,
2017; Sturm & Ben-Tal, 2017), their usage in practical contexts is limited for
two different reasons. First of all, most models place restrictions on the nature
of the input. In most cases, there are limitations placed on the number and type
of tracks (Roberts et al., 2018; Payne, 2019). Secondly, the user is not afforded
fine-grained control over the generation process, which is critical for a system to
be useful in the context of computer-assisted composition. Even MusicVAE
(Roberts et al., 2018), which incorporates a latent model of musical space, al-
lowing for interpolation between examples, does not afford fine-grained control
of the individual tracks. For example, it is not possible to freeze the melody
and generate a new drum-part and bassline. Although one-shot generation of
musical material is impressive from a technical standpoint, it is of limited use
in a practical context, as the user may wish to create subtle variations on a fixed
piece of music.
In contrast to time-ordered sequences, where most of the important depen-
dencies, such as the most recently played notes, are in the recent history, non-
time-ordered sequences frequently feature important dependencies in the distant
history. For example, in our representation, simultaneously sounding notes in dif-
ferent tracks are spread far apart. The use of non-time-ordered representations is
directly motivated by the nature of the transformer attention mechanism (Vaswani
et al., 2017). In contrast to Recurrent Neural Networks (RNN), which sharply
distinguish nearby context (the most recent 50 tokens) from the distant history
(Khandelwal, He, Qi, & Jurafsky, 2018), attention-based architectures allow for
distant tokens to be directly attended to if they are relevant to the current
prediction. Consequently, we do not pay a significant penalty for training models
on non-time-ordered sequences, where important dependencies are predominantly
in the distant history, provided the necessary tokens are within the attention
window. This directly motivates the usage of non-time-ordered sequences, as
they facilitate rich conditional generation.

Fig. 1. The MultiTrack and BarFill representations are shown. The <bar> tokens
correspond to complete bars, and the <track> tokens correspond to complete tracks.
4 Proposed Representation
To provide a comprehensive overview of the proposed representation, we first
describe how a single bar of musical material is represented. Based on represen-
tations explored in previous studies (Oore et al., 2018; Huang et al., 2019), we
represent musical material using 128 NOTE ON tokens, 128 NOTE OFF tokens, and
48 TIME SHIFT tokens. Since musical events are quantized using 12 subdivisions
per beat, 48 TIME SHIFT tokens allow for the representation of any rhythmic
unit from thirty-second-note triplets (one subdivision) to a full 4-beat bar of silence. Each bar begins
with a BAR START token, and ends with a BAR END token. Tracks are simply a
sequence of bars delimited by TRACK START and TRACK END tokens. At the start
of each track, immediately following the TRACK START token, an INSTRUMENT to-
ken is used to specify the MIDI program which is to be used to play the notes
on this particular track. Since there are 128 possible MIDI programs, we have
128 distinct INSTRUMENT tokens. A DENSITY LEVEL token follows the INSTRUMENT
token, and indicates the note density of the current track. A piece is simply a
sequence of tracks; however, all tracks sound simultaneously rather than being
played one after the other. A piece begins with the PIECE START token. This
process of nesting bars within a track and tracks within a piece is illustrated in
Figure 1. Notably, we do not use a PIECE END token, as we can simply sample
until we reach the nth TRACK END token if we wish to generate n tracks. We refer
to this representation as the MultiTrack representation.
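As a concrete illustration of the MultiTrack representation, the sketch below tokenizes a toy piece. The underscore-separated token spellings, the value-carrying token format (e.g. NOTE_ON=60), and the intermediate data structures are assumptions made for illustration; they do not reproduce the released implementation.

```python
# Minimal sketch of MultiTrack tokenization: bars nested in tracks, tracks in a piece.
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: int   # MIDI pitch, 0-127
    start: int   # onset within the bar, in 12ths of a beat
    end: int     # offset within the bar, in 12ths of a beat

@dataclass
class Bar:
    notes: List[Note]

@dataclass
class Track:
    instrument: int   # MIDI program, 0-127
    density: int      # note-density bin, 0-9
    bars: List[Bar]

def tokenize_bar(bar: Bar) -> List[str]:
    """Emit BAR_START ... BAR_END, interleaving NOTE_ON/NOTE_OFF with TIME_SHIFTs."""
    events = []
    for n in bar.notes:
        events.append((n.start, 1, f"NOTE_ON={n.pitch}"))
        events.append((n.end, 0, f"NOTE_OFF={n.pitch}"))   # offs sort before ons at equal times
    events.sort()
    tokens, now = ["BAR_START"], 0
    for time, _, tok in events:
        if time > now:
            tokens.append(f"TIME_SHIFT={time - now}")       # 1..48 twelfths of a beat
            now = time
        tokens.append(tok)
    tokens.append("BAR_END")
    return tokens

def tokenize_piece(tracks: List[Track]) -> List[str]:
    tokens = ["PIECE_START"]
    for t in tracks:
        tokens += ["TRACK_START", f"INSTRUMENT={t.instrument}", f"DENSITY_LEVEL={t.density}"]
        for bar in t.bars:
            tokens += tokenize_bar(bar)
        tokens.append("TRACK_END")
    return tokens
```

Under these assumptions, a one-track piece whose single bar holds a quarter-note middle C tokenizes to PIECE_START, TRACK_START, INSTRUMENT=0, DENSITY_LEVEL=0, BAR_START, NOTE_ON=60, TIME_SHIFT=12, NOTE_OFF=60, BAR_END, TRACK_END.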
Using the MultiTrack representation, the model learns to condition the gen-
eration of each track on the tracks which precede it. At generation time, this
allows for a subset of the musical material to be fixed while generating addi-
tional tracks. However, while the MultiTrack representation offers control at the
track level, it does not allow for control at the bar level, except in cases where
the model is asked to complete the remaining bars of a track. Without some
changes, it is not possible to generate the second bar in a track conditioned on
the first, third, and fourth bars. In order to accommodate this scenario, we must
guarantee that the bars on which we want to condition precede the bars we wish
to predict, in the sequence of tokens that is passed to the model. To do this, we
remove all the bars which are to be predicted from the piece, and replace each
bar with a FILL PLACEHOLDER token. Then, at the end of the piece (i.e. imme-
diately after the last TRACK END token), we insert each bar, delimiting each bar
with FILL START and FILL END tokens instead of BAR START and BAR END tokens.
Note that these bars must appear in the same order as they appeared in the
original MultiTrack representation. We refer to this representation as the BarFill
representation. Note that the MultiTrack representation is simply a special case
of the BarFill representation, where no bars are selected for inpainting.
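Continuing the sketch above, converting a MultiTrack token sequence into the BarFill representation amounts to swapping the selected bar spans for FILL_PLACEHOLDER tokens and appending their contents, in their original order, after the final TRACK_END. The (track index, bar index) addressing used below is an assumption for illustration only.

```python
from typing import List, Tuple

def to_bar_fill(tokens: List[str], masked: List[Tuple[int, int]]) -> List[str]:
    """Rewrite a MultiTrack token list so that bars in `masked` become inpainting targets."""
    out, fills = [], []
    track_idx, bar_idx = -1, -1
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "TRACK_START":
            track_idx, bar_idx = track_idx + 1, -1
        if tok == "BAR_START":
            bar_idx += 1
            j = tokens.index("BAR_END", i)                  # matching BAR_END
            if (track_idx, bar_idx) in masked:
                out.append("FILL_PLACEHOLDER")
                # Keep the removed bar, re-delimited with FILL_START / FILL_END.
                fills += ["FILL_START"] + tokens[i + 1:j] + ["FILL_END"]
            else:
                out += tokens[i:j + 1]
            i = j + 1
            continue
        out.append(tok)
        i += 1
    return out + fills   # the filled-in bars follow the final TRACK_END
```

When `masked` is empty the output equals the input, mirroring the observation that MultiTrack is a special case of BarFill.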
5 Training
We use the Lakh MIDI Dataset (LMD) (Raffel, 2016), which is comprised of
176,581 MIDI files. In order to explain how we derive token sequences from MIDI
files, it is necessary to provide an overview of the MIDI protocol. There are three
formats for MIDI files. Type 0 MIDI files are comprised of a header-chunk and
a single track-chunk. Both Type 1 and 2 MIDI files contain a header-chunk and
multiple track-chunks, however, the tracks in a Type 1 MIDI file are played simul-
taneously, while tracks in a Type 2 MIDI file are played sequentially. Since only
0.03% of the LMD are Type 2 MIDI files, and the library we use for MIDI parsing
does not support this encoding, we simply ignore them. Within a track-chunk,
musical material is represented as a sequence of MIDI messages, each of which
specifies a channel and the time delta since the last message. In addition to note-on
and note-off messages, which specify the onset and end of notes, patch-change
messages specify changes in timbre by selecting one of 128 different instruments.
To formally define a track, consider a Type 1 MIDI file F = {t_1, ..., t_k} comprised
of k track-chunks, where each track-chunk t_i = {m^i_1, ..., m^i_{n_i}} is an ordered set of
n_i MIDI messages. Note that a Type 0 MIDI file is simply a special case where
k = 1. Let chan(x) (resp. inst(x)) be a function that returns the channel (resp.
instrument) on which the message x is played. Then, we can define a track as the
set of MIDI messages t_{i,c,k} = {m^k_ℓ : inst(m^k_ℓ) = i, chan(m^k_ℓ) = c, m^k_ℓ ∈ t_k, t_k ∈ F}
that are found in the k-th track-chunk and played on the c-th channel using
the i-th MIDI instrument. For example, given a MIDI file F = {t_1, t_2}, where
t_1 = {m^1_1}, t_2 = {m^2_1, m^2_2}, chan(m^1_1) = 0, inst(m^1_1) = 0, chan(m^2_1) = 3,
inst(m^2_1) = 0, chan(m^2_2) = 3, and inst(m^2_2) = 34, there would be three tracks
(t_{0,0,1}, t_{0,3,2}, t_{34,3,2}).
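A minimal sketch of this grouping is given below, using a simplified message structure; the field names, the default program of 0 before any patch-change, and the omission of MIDI-specific details such as the drum channel are simplifying assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Message:
    kind: str      # "note_on", "note_off", or "patch_change"
    channel: int
    value: int     # pitch for note messages, program number for patch changes
    delta: int     # time since the previous message

def group_tracks(track_chunks):
    """track_chunks: list of lists of Message, one list per track-chunk in the file."""
    tracks = defaultdict(list)          # (instrument, channel, chunk index) -> messages
    for k, chunk in enumerate(track_chunks):
        program = defaultdict(int)      # current instrument on each channel
        for msg in chunk:
            if msg.kind == "patch_change":
                program[msg.channel] = msg.value
            elif msg.kind in ("note_on", "note_off"):
                tracks[(program[msg.channel], msg.channel, k)].append(msg)
    return dict(tracks)
```

Applying this to the example above would recover the same three (instrument, channel, track-chunk) groupings, up to whether track-chunks are indexed from 0 or 1.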
For each of the 128 general MIDI instruments, we calculate the number of
note onsets for each bar in the dataset, and use the quantiles of the resulting
distributions to define distinct note-density bins for each MIDI instrument. Note
that using the same note-density bins for all instrument types would be problem-
atic, as note density varies significantly between instruments. We use 10 different
note-density bins, where the i-th bin is bounded by the 10i-th (lower) and 10(i + 1)-th
(upper) percentiles of the corresponding distribution. We train a GPT2 (Radford et al., 2019) model using the
HuggingFace Transformers library (Wolf et al., 2019) with 8 attention heads, 6
layers, an embedding size of 512, and an attention window of 2048. We train two
types of models: MMMBar, which is trained using the BarFill representation;
MMMTrack, which is trained using the MultiTrack representation. We train 4-
bar and 8-bar versions of MMMBar and MMMTrack. For 4-bar (resp. 8-bar)
models we provide the model with at most 12 (resp. 6) tracks. Each time we
select an n-bar segment, we randomly order the tracks so that the model learns
each possible conditioning relationship between different types of tracks. When training the
MMMBar models, we also select a random subset of bars for inpainting.
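The per-instrument density binning described above can be sketched as follows; the shape of the input (a mapping from MIDI program to the note-onset counts of every bar played with that program) and the tie-breaking at bin boundaries are assumptions for illustration.

```python
import numpy as np

def density_bin_edges(onsets_per_bar, n_bins=10):
    """Per-instrument quantile boundaries: bin i spans the 10i-th to 10(i+1)-th percentiles."""
    edges = {}
    for program, counts in onsets_per_bar.items():
        qs = np.linspace(0.0, 1.0, n_bins + 1)        # 0.0, 0.1, ..., 1.0
        edges[program] = np.quantile(counts, qs)
    return edges

def density_bin(edges, program, onsets):
    """Map a bar's note-onset count to a bin index in [0, n_bins - 1]."""
    interior = edges[program][1:-1]                    # drop the 0th and 100th percentiles
    return int(np.searchsorted(interior, onsets, side="right"))
```

Because the boundaries are computed per instrument, the same onset count can map to different density levels for different instruments, reflecting the observation that note density varies significantly between instruments.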
6 Using MMM
In order to illustrate the flexibility of MMM, we make available examples generated
by the system and an interactive demo (https://jeffreyjohnens.github.io/MMM/).
The demo was developed in Google Colab, making it accessible to all users with a compatible web browser. The
interface automatically selects the appropriate model, either MMMBar or MMM-
Track, based on the bars or tracks that are selected for generation. We briefly
outline the various ways that one can interact with MMM when generating
musical material.
1. Track Inpainting: Given a possibly empty set of tracks t = {t_1, ..., t_k}, we can
generate n additional tracks. When the set of tracks is empty, this is equiv-
alent to unconditioned generation. To do this, we condition the model with
the tokens representing k tracks and then sample until the nth TRACK END
token is reached.
2. Bar Inpainting: Given a set of tracks t = {t_1, ..., t_k} and a set of bars
b = {b_1, ..., b_n}, we can resample each bar in b. For this method, we condition
the model with the tokens representing all the tracks, replacing each b_i in b
with the FILL PLACEHOLDER. Then we sample until the nth FILL END token
is reached.
3. Attribute Control for Instruments: We can specify a set of MIDI instruments
for each generated track, from which the model will choose. Practically, this
is accomplished by masking the MIDI instruments we wish to avoid before
sampling the INSTRUMENT token at the start of a new track (see the sketch following this list).
4. Attribute Control for Note Density: We can specify the note density level
for each generated track.
5. Iterative Generation: The user can chain together various generation meth-
ods to iteratively compose a piece of music. Alternatively, generation meth-
ods can be chained automatically using a meta-algorithm. For example, given
a set of tracks t = {t_1, ..., t_k}, we can progressively resample each track t_i in
t by asking the model to generate (t_i | {t_j : t_j ∈ t, j ≠ i}) for each 1 ≤ i ≤ k.
This bears some similarity to Gibbs sampling. The resulting output should
be more similar to the input than simply generating a set of tracks from
scratch. Iterative generation also affords the user the opportunity to itera-
tively explore variations on generated material, or gradually refine a piece
by progressively resampling bars which are not to their liking.
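The instrument masking mentioned in item 3 can be sketched as follows; the HuggingFace-style model interface and the hypothetical instrument_ids mapping from MIDI program numbers to vocabulary ids are assumptions, not the released implementation.

```python
import torch

def sample_instrument(model, prefix_ids, allowed_programs, instrument_ids, temperature=1.0):
    """Sample an INSTRUMENT token restricted to the allowed MIDI programs."""
    with torch.no_grad():
        logits = model(torch.tensor([prefix_ids])).logits[0, -1]   # next-token logits
    mask = torch.full_like(logits, float("-inf"))
    mask[[instrument_ids[p] for p in allowed_programs]] = 0.0      # keep only allowed ids
    probs = torch.softmax(logits / temperature + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

The same masking approach can be applied to the DENSITY_LEVEL token to realize the note-density control of item 4.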
7 Conclusion
One current limitation is that the model only allows a fixed number of bars to
be generated. Although approximately 99.8% of 10-track 4-bar segments, 86.8%
of 10-track 8-bar segments and 38.8% of 10-track 16-bar segments in the LMD
can be represented using fewer than 2048 tokens, with some slight modifications
to the architecture and representation, it should be possible to incorporate ad-
ditional musical material. The transformer-XL architecture (Dai et al., 2019)
allows for extremely distant tokens to influence the current prediction via a hid-
den state, combining the strengths of the attention and recurrent mechanisms.
Using this type of model, the current n-bar window could be conditioned on
previous and future (if they are known) n-bar windows via the hidden state. Im-
plementing additional types of attribute-control is an interesting area for future
work. For example, conditioning generation on a particular genre or emotion
would offer increased control at generation time. However, we must note that
this type of control is available to a certain extent in the current model. Since
MMM offers conditional generation, the genre or emotion of the generated bars
or tracks should reflect the genre or emotion of the content they are conditioned
on. For example, if generation is conditioned on a jazz style drum track, gener-
ated tracks or bars should be consistent with this style. In addition, future work
will include a more rigorous evaluation of the system itself. We have introduced
a novel approach to representing musical material that offers increased control
over the generated output. This offers a new and exciting avenue for future work,
harnessing the strengths of the Transformer architecture to provide fine-grained
control for the user at generation time.
References
Roberts, A., Engel, J. H., Raffel, C., Hawthorne, C., & Eck, D. (2018). A
hierarchical latent vector model for learning long-term structure in music.
In Proceedings of the 35th international conference on machine learning
(pp. 4361–4370).
Sturm, B. L., & Ben-Tal, O. (2017). Taking the models back to music prac-
tice: Evaluating generative transcription models built using deep learning.
Journal of Creative Music Systems, 2(1).
Sturm, B. L., Ben-Tal, O., Monaghan, Ú., Collins, N., Herremans, D., Chew, E.,
. . . Pachet, F. (2019). Machine learning research that matters for music
creation: A case study. Journal of New Music Research, 48(1), 36–55.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
. . . Polosukhin, I. (2017). Attention is all you need. In Advances in neural
information processing systems (pp. 5998–6008).
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., . . . Brew,
J. (2019). Huggingface’s transformers: State-of-the-art natural language
processing. ArXiv, abs/1910.03771.