COSMIX: using MIR techniques for Sonic Interaction in Virtual Environments

Jonathan Bell
PRISM-CNRS
[email protected]

ABSTRACT

The present study proposes to explore a sound corpus in VR, in which audio data is sliced and analysed in FluCoMa in order to obtain relatively large collections of samples clustered by timbre similarity in a 3d room. The recent implementation of the multiplayer feature in PatchXR lets us envisage a wide variety of gesture-based control interfaces for querying those corpora, in which performers can interact remotely in order to simulate a chamber music situation.

1. INTRODUCTION
In 2009, Miller Puckette, the inventor of Max and Pure Data, stated: "Having seen a lot of networked performances of one sort or another, I find myself most excited by the potential of networked 'telepresence' as an aid to rehearsal, not performance." [?]. Meanwhile, the recent emergence of multiplayer capabilities in VR software such as PatchXR [?] urges us to find meaningful instrument designs for remote, collaborative musical interaction with a digital instrument in multiplayer VR (or the metaverse) [?]. In search of experiences that would relate to those found in traditional chamber music, the solution proposed here focuses on the exploration of a sound corpus projected onto a 3d space, which users can then navigate with their hand controllers (see Fig. ??): in an etude for piano and saxophone, for instance 1, one musician plays the blue buttons (the saxophone samples) and the other the yellow buttons (the piano samples).

Figure 1. A VR interface in which each button in the world corresponds to an item of the sound corpus. Machine learning helps bring closer sounds that share common spectral characteristics. https://www.youtube.com/playlist?list=PLc WX6wY4JtkNjWelOprx-mOwioStXbUa

COSMIX (Corpus-based Sound Manipulation for Interactive Instruments in eXtended reality) seeks to build upon FluCoMa's results and apply them to the realm of Virtual Reality Musical Instruments (VRMIs) by simulating acoustic instruments in telematic performances. Through its focus on Creative Music Information Retrieval (CMIR), FluCoMa set as its agenda to enable mid-level creative coding, thus finding a sought-after balance between three conflicting affordances: 1/ access to low-level audio feature extraction, 2/ an accessible learning curve, and 3/ customisable use. The typical use case of the toolbox automatically segments and analyses a sound corpus according to a custom array of audio descriptors in order to reveal similarities by projecting its data onto a two- or three-dimensional plot (or sound map). Initiated at IRCAM at the turn of the millennium by Diemo Schwarz in CataRT/FTM (today fully integrated in MuBu), the idea is now gaining in popularity (FluCoMa, AudioStellar, Mosaique, DICY2, Somax2) because of the potential it opens when combined with machine learning. Focusing on research questions characteristic of the well-established NIME (New Interfaces for Musical Expression) community, COSMIX consists in porting sound maps to Extended Reality (XR) technology within the rapidly evolving PatchXR creative coding environment, in order to build interactive instruments. By combining FluCoMa's and PatchXR's capacities in machine learning/listening and Human Computer Interaction (HCI) respectively, the resulting instruments will facilitate multi-user sonic exploration of large corpora through various immersive interfaces. Inherently corpus independent, the affordances of these VRMIs will be examined through three criteria: 1/ modes of excitation/interaction, 2/ primarily sonic, but also visual and haptic, feedback, and 3/ co-creativity in Networked Music Performance.

An important reason for using VR to explore a 3D dataset is that it allows users to interact with the data in a more natural and immersive way than a 2d plane does (Chapter 3 will show how the present study derives from the CataRT 2d interface project), using the experience both as a tool for performance and for data visualisation and analysis. Users can move around and explore the data from different angles, which helps them better understand the relationships between different data points and identify patterns; this becomes all the more evident as the number of points increases. The use of machine learning (dimensionality reduction in our case) renders a world in which the absolute coordinates of each point no longer map onto the descriptor space (the high sounds cannot be mapped to the y axis, for instance), but it offers compelling results for clustering information relating to the different playing styles of the instrument being analysed: as an example, in this extract 2 based on flute sounds, the opening shows a clear opposition between two types of gesture: 1/ (0'00) staccato notes and 2/ (0'05) legato, scale-like material. This contrast in timbral quality is made explicit by a movement of the avatar, which jumps from one cluster of buttons to another.

1 https://youtu.be/kIi7YdzP2Nw?t=89
2 https://youtu.be/777fqIIJCY4

Copyright: © 2023 Jonathan Bell et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
2. VIRTUAL REALITY

2.1 Musical Metaverse

Turchet conducted a thorough study of what the musical metaverse means today [?], although the realm is still in its infancy. Berthaut [?] reviewed 3D interaction techniques and examined how they can be used for musical control. A survey of the emerging field of networked music performances in VR was offered by Loveridge [?]. Atherton and Wang [18] provided an overview of recent musical works in VR, while Çamcı and Hamilton [?] identified research trends in the Musical XR field through a set of workshops focusing on Audio-first VR.

Amongst the plethora of tools available today, PatchXR has caught our attention primarily because of its resemblance to the Max/Pure Data environments (see Fig. ??).

2.2 PatchXR

PatchXR [?] is a tool for creating immersive virtual reality experiences in which users can design and build interactive worlds, games, and music experiences. At its core, Patch is a modular synthesis platform that allows users to create and connect different building blocks, or "patches", to create complex systems. These building blocks can include everything from 3D models and textures to physics simulations, lighting controls, and, most importantly, audio digital signal processing.

One of the key features of Patch is its ability to enable collaboration between users. Patches can be shared and remixed, allowing multiple users to work together on a single project or create something entirely new. In addition, Patch has a robust library of resources for users to draw from, including tutorials, documentation, and sample patches. The community around Patch is also very active, with regular competitions, events, and meet-ups happening around the world.

One of the most exciting aspects of Patch is its potential for use in music performance and composition. The modular design of the platform allows users to create complex audio and visual environments that can be controlled in real time, opening up new possibilities for live music and audiovisual performances. Patch has been used in a variety of contexts, from creating interactive installations and exhibits to developing VR training simulations and games. Its flexibility and modular design make it a powerful tool for anyone interested in exploring the creative possibilities of VR.

Figure 2. The implementation of the FM algorithm 1/ in Pure Data (left) 2/ in PatchXR (right).

3. CORPUS-BASED CONCATENATIVE SOUND SYNTHESIS (CBCS)

Corpus-Based Concatenative Sound Synthesis (CBCS) is a technique used in computer music that involves constructing a sound or music piece by concatenating (joining together) smaller units of sound, such as phonemes in speech synthesis or musical phrases in music synthesis. It is used in our case to model an improvising instrumental musician, by creating a database of recorded musical phrases or segments that can be combined and rearranged in real time to create a musical performance that sounds as if it were being improvised.

Now nearly 20 years old if one refers to the first CataRT publications [?], CBCS today enjoys increasing popularity. Various apps are based on similar principles (AudioStellar, AudioGuide, LjudMAP or XO). The democratisation of audio analysis and machine learning tools such as the FluCoMa package (for Max, SuperCollider and Pure Data) encourages computer music practitioners to engage in this field at the crux between music creation and data science/machine learning.

3.1 Timbre Space

In spite of promising advances in the domain of deep learning applied to sound synthesis [?] [?], CBCS tools may earn their popularity from a metaphor which leads back to the early days of computer music: the notion of timbre space, developed by Wessel [?] and Grey [?], according to which the multi-dimensional qualities of timbre may be better understood using spatial metaphors (e.g. the timbre of the English horn being closer to that of the bassoon than to that of the trumpet).

Pioneers of timbre perception studies such as Grey [?], J.-C. Risset and D. Wessel [?], or Stephen McAdams [?] [?] most often define timbre by underlining what it is not. Risset and Wessel, for instance, define it as follows: "It is the perceptual attribute that enables us to distinguish among orchestral instruments that are playing the same pitch and are equally loud." The co-variance of such parameters (pitch, loudness and timbre), however, leads Schwarz to distinguish the timbre space and CBCS notions: "Note that this concept is similar but not equivalent to that of the timbre space put forward by Wessel and Grey [7, 24], since timbre is defined as those characteristics that serve to distinguish one sound from another, that remain after removing differences in loudness and pitch. Our sound space explicitly includes those differences that are very important to musical expression." [?]
Figure 3. Multidimensional perceptual scaling of musical timbres (John M. Grey [?]). Sounds available at: https://muwiserver.synology.me/timbrespaces/grey.htm.

The workflow described in Chapter ?? gave in practice strong evidence of the inter-dependence between register, timbre and dynamics, particularly when the analysis runs over a single-instrument sound file (e.g. 30 minutes of solo flute) chopped into short samples. The system will then precisely be able to find similarity between instrumental passages played in the same register, at the same dynamic, and with the same playing technique (e.g. a flute playing fast trills mezzo forte, in the mid-low register, with air).

3.2 Corpus-Based Concatenative Synthesis - State of the art

A wide array of technologies today can be called corpus-based concatenative synthesis, in the sense that they allow, through segmentation and analysis, the exploration of large quantities of sound. Some of them are presented as "ready-made" solutions, such as the recent AudioStellar [?], or SCMIR 3 for SuperCollider. Hackbarth's AudioGuide [?] offers a slightly different focus because it uses the morphology/timeline of a soundfile to produce a concatenated output. Within the Max world, finally, two environments appear as highly customizable: IRCAM's MuBu [?] and the more recent EU-funded FluCoMa [?] project. CataRT is now fully integrated in MuBu, whose purpose encompasses multimodal audio analysis as well as machine learning for movement and gesture recognition [?]. This makes MuBu extremely general purpose, but also difficult to grasp. The data processing tools in MuBu are mostly exposed in the pipo plugin framework [?], which can compute, for instance, MFCC analysis on a given audio buffer 4 by embedding the pipo.mfcc plugin inside the mubu.process object.

FluCoMa also aims to be general purpose, but seems particularly suited to two popular specific tasks. With only limited knowledge of the framework or of the theory behind the algorithms it uses (such as dimensionality reduction, MFCC analysis, or neural network training), the framework allows the user: 1/ to segment, analyse and represent/play back a sound corpus; 2/ to train a neural network to control a synthesizer, in a manner reminiscent of Fiebrink's Wekinator [?]. Only the tools for segmentation, analysis, representation and playback (described in detail in Chapter ??) were used here, for they precisely fit the needs of corpus-based synthesis.

3 A demo is available at: https://youtu.be/jxo4StjV0Cg

4 MFCC stands for Mel-Frequency Cepstral Coefficients. It is a type of feature extraction method that is commonly used in speech and speaker recognition systems. MFCCs are used to represent the spectral characteristics of a sound in a compact form that is easier to analyze and process than the raw waveform. They are calculated by applying a series of transformations to the power spectrum of a sound signal, including a Mel-scale warping of the frequency axis, taking the logarithm of the power spectrum, and applying a discrete cosine transform (DCT) to the resulting coefficients. The resulting coefficients, which are called MFCCs, capture the spectral characteristics of the sound and are commonly used as features for training machine learning models for tasks such as speech recognition and speaker identification.

4. CONNECTING PATCHXR AND FLUCOMA

Porting an analysis made in Max/FluCoMa to PatchXR consists in generating a world in which each sonic fragment's 3d position follows the coordinates delivered by FluCoMa. The structure of a .patch file (a PatchXR world) follows the syntax of a .maxpat (Max) or .pd (Pure Data) file, in the sense that it first declares the objects used and then the connections between them. This simple structure made it possible to write a javascript routine that generates a template world, taking as input 1/ dictionaries (json files) with each segment's 3d coordinates and 2/ each segment's temporal position in the sound file, and producing as output a new .patch file (a world accessible in VR; see the general workflow in Fig. ??).

Figure 4. General workflow: from an input audio file to its .patch 3d representation in PatchXR.

After a few trials in which the x y z coordinates of a world directly represented audio descriptors such as loudness, pitch and centroid 5, I more systematically used MFCC analysis and dimensionality reduction, as will be shown in Section ??.

5 https://youtu.be/1LHcbYh2KCI?t=19
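As an illustration of the routine's overall shape, here is a minimal Python sketch of such a generator; the author's actual routine is written in javascript, and the input dictionary layout and the block/connection format emitted below are placeholders, not PatchXR's real .patch syntax.

```python
# Sketch only: turn an analysis dictionary (3d coordinates + temporal position
# per segment) into a "declare objects, then connections" world description.
import json

def generate_world(analysis_json, out_path="corpus_world.patch"):
    with open(analysis_json) as f:
        # assumed layout: {"0": {"x": .., "y": .., "z": .., "onset": .., "dur": ..}, ...}
        segments = json.load(f)

    blocks, connections = [], []
    for i, seg in segments.items():
        # one button per segment, placed at its (UMAP-derived) coordinates
        blocks.append({"id": f"button_{i}", "type": "button",
                       "pos": [seg["x"], seg["y"], seg["z"]]})
        # one sample player per segment, reading the right region of the sound file
        blocks.append({"id": f"player_{i}", "type": "sample_player",
                       "onset": seg["onset"], "dur": seg["dur"]})
        connections.append([f"button_{i}", f"player_{i}"])

    with open(out_path, "w") as f:       # objects first, then the connections
        json.dump({"blocks": blocks, "connections": connections}, f, indent=2)

generate_world("umap_coords.json")       # illustrative input file name
```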
Section ?? will present the different javascript programs and Max patches that were developed in order to diversify the ways in which the FluCoMa analysis is represented in PatchXR, and how the user can interact with it.

5. WORKFLOW - ANALYSIS IN FLUCOMA

My experiments have focussed almost exclusively on musical instrument corpora 6. The tools presented here can efficiently generate plausible virtuosic instrumental music, but recent uses found more satisfying results in slower, quieter, "Feldman-like" types of texture. Various limitations on the playback side (either in standalone VR, or on a Pure Data sampler for Raspberry Pi described in Section ??) initially imposed restrictions on the amount of data the system could handle (less than 5 minutes of AIFF in PatchXR) and on the number of slices a sample could be chunked into (256, because of the limitation of lists in Max, a limitation that has also since been surpassed). Both limitations were later overcome (use of the compressed ogg format in PatchXR, longer sound files since version 672, and an increased internal buffer size in fluid.buf2list in FluCoMa), thus allowing for far more convincing models.

6 For cello: https://youtu.be/L-MiKmsIzjM For various instruments: https://www.youtube.com/playlist?list=PLc WX6wY4JtnNqu4Lwe2YzEUq9S1IMvUk For flute: https://www.youtube.com/playlist?list=PLc WX6wY4JtlbjLuLHDZhlx78sTDm aqs
Using concatenative synthesis to model an improvising instrumental musician typically involves several steps (a compact offline sketch of the whole pipeline follows the list):

1. Segmentation of a large soundfile: dividing a large audio recording of the musician's performance into smaller units or segments.

2. Analysis: these segments are then organised in a database according to various descriptor data (MFCCs in our case).

3. Scaling/pre-processing: scaling is applied for better visualisation.

4. Dimension reduction: based on the MFCC descriptors, the dimensionality of the data is reduced in order to make it more manageable and easier to work with. This can be done using techniques such as principal component analysis (PCA), singular value decomposition (SVD), or Uniform Manifold Approximation and Projection (UMAP, preferred in our case).

5. Near-neighbours sequencing: once the segments have been organised and analysed, the algorithm selects and combines them in real time based on certain input parameters or rules, to create a simulated musical performance that sounds as if it were being improvised by the musician. We use here a near-neighbours algorithm, which selects segments that are similar in some way (e.g. in terms of pitch, loudness, or timbre - thanks to the similarities revealed by UMAP on MFCCs in our case) to the current segment being played.

We will now describe these steps in further detail.
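The actual pipeline runs inside Max with FluCoMa objects; purely as an offline point of comparison, the following Python sketch walks through the same five steps with generic stand-ins (librosa, umap-learn, scipy). The file name and all parameter values are illustrative assumptions, not the settings used in the study.

```python
# Offline sketch of the five-step CBCS pipeline (generic stand-ins for the
# FluCoMa slicing, MFCC, scaling, UMAP and kdtree objects).
import librosa
import numpy as np
import umap                      # pip install umap-learn
from scipy.spatial import cKDTree

y, sr = librosa.load("corpus.wav", sr=None, mono=True)   # illustrative file name

# 1. Segmentation: spectral-flux onsets give slice boundaries (in samples).
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples", backtrack=True)
bounds = np.concatenate([onsets, [len(y)]])
slices = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b - a > 1024]

# 2. Analysis: one MFCC vector per slice (mean over frames; cf. BufStats below).
feats = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                  for s in slices])

# 3. Scaling: normalise each coefficient to the 0..1 range of the VR world.
span = feats.max(axis=0) - feats.min(axis=0)
feats = (feats - feats.min(axis=0)) / (span + 1e-9)

# 4. Dimension reduction: UMAP down to 3-d coordinates for the PatchXR world.
coords = umap.UMAP(n_components=3).fit_transform(feats)

# 5. Near-neighbours sequencing: play whichever slices sit closest to a point.
tree = cKDTree(coords)
_, idx = tree.query(coords[0], k=4)      # the 4 slices nearest to slice 0
print("neighbours of slice 0:", idx)
```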
5.1 Slicing

Slicing a sound file musically allows various possible exploitations in the realm of CBCS. In MuBu, onset detection is done with pipo.onseg or pipo.gate. FluCoMa exposes five different onset detection algorithms:

1. fluid.ampslice: amplitude-based detrending slicer
2. fluid.ampgate: gate detection on a signal
3. fluid.onsetslice: spectral difference-based audio buffer slicer
4. fluid.noveltyslice: based on a self-similarity matrix (SSM)
5. fluid.transientslice: implements a de-clicking algorithm

Only onsetslice was extensively tested. The only tweaked parameters were a straightforward "threshold" as well as a "minslicelength" argument, determining the shortest slice allowed (or minimum duration of a slice) in hopSize units. This introduces a common limitation in CBCS: the system strongly biases the user towards choosing short samples, for better analysis results and more interactivity when controlling the database with a gesture follower. Aaron Einbond remarks, in his use of CataRT, how short samples most suited his intention: "Short samples containing rapid, dry attacks, such as close-miked key-clicks, were especially suitable for a convincing impression of motion of the single WFS source. The effect is that of a virtual instrument moving through the concert hall in tandem with changes in its timbral content, realizing Wessel's initial proposal." [?]

A related limitation of concatenative synthesis lies in the fact that short samples demonstrate the efficiency of the algorithm 7, but at the same time move away from the "plausible simulation" sought in the present study. A balance must therefore be found between the freedom imposed by large samples and the refined control one can obtain with short samples.

A direct concatenation of slices clicks in most cases at the edit point, which can be avoided through the use of ramps (see the sketch below). The second most noticeable glitch on concatenation concerns the interruption of low-register resonances, which even a large reverb fails to make sound plausible. A low threshold and a large "minslicelength" result in equidistant slices, all of identical duration, as the pipo.onseg object in MuBu would produce.

Because we listen to sound in time, this parameter, responsible for the duration of the samples, is of prime importance.

7 e.g. https://youtu.be/LD0ivjyuqMA?t=3032
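As noted above, butt-joined slices click at the edit point; the ramp idea reduces, in a minimal sketch (assuming linear fades of an arbitrary 10 ms), to:

```python
import numpy as np

def concatenate_with_ramps(slices, sr, ramp_ms=10):
    """Join slices end to end, applying a short linear fade-in/out ('ramp')
    to each one so that the edit points do not click."""
    n = max(1, int(sr * ramp_ms / 1000))
    out = []
    for s in slices:
        s = np.asarray(s, dtype=float).copy()
        r = min(n, len(s) // 2)
        if r > 0:
            s[:r] *= np.linspace(0.0, 1.0, r)            # fade in
            s[len(s) - r:] *= np.linspace(1.0, 0.0, r)   # fade out
        out.append(s)
    return np.concatenate(out) if out else np.zeros(0)
```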
5.2 MFCC on each slice - across one whole slice/segment

Multidimensional MFCC analysis: MFCC (Mel-Frequency Cepstral Coefficient) analysis is a technique used to extract features from audio signals that are relevant for speech and music recognition. It involves calculating a set of coefficients that represent the spectral envelope of the audio signal, or decomposing a sound signal into a set of frequency bands and representing the power spectrum of each band with a set of coefficients. The resulting MFCC coefficients capture important spectral characteristics of the sound signal (albeit hardly interpretable by the novice user), such as the frequency and magnitude of the spectral peaks. We will see that, combined with UMAP, this analysis is able to capture the spectral characteristics of the musician's playing style.
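A compact sketch of that computation with librosa (the paper's analysis itself runs inside Max with MuBu/FluCoMa; frame sizes and coefficient counts here are illustrative):

```python
import librosa
import numpy as np
import scipy.fftpack

def slice_mfcc(y, sr, n_mfcc=13, n_fft=1024, hop=512):
    """One n_mfcc-coefficient vector per analysis frame, then one vector per
    slice by averaging over frames (statistical summaries are the topic of 5.3)."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=n_fft, hop_length=hop)   # shape (n_mfcc, n_frames)
    return m.mean(axis=1)                                   # shape (n_mfcc,)

def mfcc_from_scratch(y, sr, n_mfcc=13):
    """The same chain spelled out: mel-scale warping, logarithm, then DCT."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr)        # mel-warped power spectrum
    log_mel = librosa.power_to_db(mel)                      # logarithm
    return scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:n_mfcc]  # DCT
```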

5.3 Statistical Analysis Over Each Slice

BufStats is used to calculate statistical measures on data stored in a buffer channel. BufStats calculates seven statistics on the data in the buffer channel: mean, standard deviation, skewness, kurtosis, and low, middle and high values. These statistics provide information about the central tendency of the data and about how it is distributed around that tendency. In addition to calculating statistics on the original buffer channel, BufStats can also calculate statistics on up to two derivatives of the original data, apply weights to the data using a weights buffer, and identify and remove outlier frames. These statistical measures can be useful for comparing different time-series data, even if they have different lengths, and may provide better distinction between data points when used in training or analysis. The output of BufStats is a buffer with the same number of channels as the original data, each channel containing the statistics for its corresponding data in the original buffer.
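BufStats itself is a FluCoMa object; the sketch below merely restates the seven per-channel statistics with numpy/scipy, taking "low, middle and high" as min, median and max, which is an assumption.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def buf_stats(frames):
    """Per-channel summary in the spirit of BufStats: `frames` has shape
    (n_frames, n_channels); returns 7 statistics per channel
    (mean, std, skewness, kurtosis, min, median, max)."""
    return np.stack([
        frames.mean(axis=0),
        frames.std(axis=0),
        skew(frames, axis=0),
        kurtosis(frames, axis=0),
        frames.min(axis=0),
        np.median(frames, axis=0),
        frames.max(axis=0),
    ])
```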

5.4 Normalization

The FluCoMa package proposes several scaling or preprocessing tools, amongst which normalization and standardization were used. Standardization and normalization are techniques used to transform variables so that they can be compared or combined in statistical analyses. Both techniques are used to make data more comparable, but they work in slightly different ways.

Standardization scales a variable to have a mean of 0 and a standard deviation of 1, while normalization scales a variable to have a minimum value of 0 and a maximum value of 1. Normalization scaling was found easier to use both in 2-D (in FluCoMa, the fluid.plotter object) and in the VR 3D world, in which the origin corresponds to a corner of the world. The fluid.normalize object features an "@max" attribute (1 by default), which then maps directly to the dimensions of the VR world.
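In numpy terms (the max_out factor mirrors the "@max" attribute mentioned above; the data is illustrative):

```python
import numpy as np

def standardize(x):                     # mean 0, standard deviation 1
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-9)

def normalize(x, max_out=1.0):          # minimum 0, maximum `max_out`
    span = x.max(axis=0) - x.min(axis=0)
    return max_out * (x - x.min(axis=0)) / (span + 1e-9)

# With max_out set to the side length of the room, normalised UMAP output maps
# directly onto world coordinates whose origin is a corner of the world.
coords01 = normalize(np.random.rand(256, 3), max_out=1.0)  # illustrative data
```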
5.5 Dimensionality Reduction - UMAP

Dimensionality reduction is a technique used in machine learning to reduce the number of features (dimensions) in a dataset. The goal of dimensionality reduction is to simplify the data without losing too much information. Various dimensionality reduction algorithms are presented in an early FluCoMa study [?], with, interestingly, no mention of UMAP, which was later favoured.

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique based on the principles of topological data analysis. It can be used to visualize high-dimensional data in a lower-dimensional space. When applied to sound data analysed with MFCCs (Mel-Frequency Cepstral Coefficients), UMAP reduces the dimensionality of the data and creates a visual representation of the sound in a 2- or 3-dimensional space. By applying UMAP to the MFCC coefficients of a sound signal, it is possible to create a visual representation of the sound that preserves the relationships between the different MFCC coefficients (see Fig. ??).

Figure 5. Dimensionality reduction of MFCCs helps reveal spectral similarities. UMAP outputs coordinates in 2d or 3d.

UMAP is therefore used in the first place for its clustering abilities, helping for classification purposes. It helps identify patterns or trends that may not be evident from the raw data. This can be useful for tasks such as exploring the structure of a sound dataset, identifying patterns or trends in the data, and comparing different sounds.

Most importantly, the non-linear dimensions proposed by UMAP (whether in 2d in Max or in 3 dimensions in PatchXR, and when compared to linear analyses in which, for instance, x, y and z correspond to pitch, loudness and centroid) gave far more "intelligent" clustering than more conventional, parameter-consistent types of representation.
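A minimal sketch of that reduction with the umap-learn package (the input file name and the parameter values are illustrative; inside Max the same step is performed by FluCoMa):

```python
import numpy as np
import umap                                   # pip install umap-learn

mfcc_stats = np.load("mfcc_stats.npy")        # illustrative: one row per slice

# Non-linear reduction of the MFCC statistics to 3 dimensions; the resulting
# coordinates are what the world generator writes into the VR room.
reducer = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1)
coords3d = reducer.fit_transform(mfcc_stats)  # shape (n_slices, 3)
```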
5.6 Neighbourhood queries

The neighbourhood retrieval function is based, in FluCoMa, on k-d trees and the kNN algorithm. In MuBu, the mubu.knn object serves similar tasks. The ml.kdtree object in the ml.star library [?] gives comparable results.

K-d trees (short for "k-dimensional trees") and k-nearest neighbours (k-NN) are two algorithms that are related to each other but serve different purposes: a k-d tree is a data structure used to store and efficiently query a set of points in a k-dimensional space, while k-NN is a machine learning algorithm used for classification or regression. Both are often used in applications such as pattern recognition, image classification, and data mining.
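Outside Max, the same query can be sketched with scipy (fluid.kdtree and mubu.knn are the in-patch equivalents); the controller coordinates echo those of Fig. 7, and the point set here is random, illustrative data.

```python
import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(256, 3)          # illustrative 3-d positions of the slices
tree = cKDTree(coords)                   # built once, queried many times

hand = np.array([0.01, 0.06, 0.13])      # controller position (cf. Fig. 7)
dist, idx = tree.query(hand, k=8)        # the 8 nearest slices and their distances
```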
6. WORKFLOW IN PATCHXR

I have most often used FluCoMa and PatchXR to generate monophonic instruments (one performer plays one instrument at a time), most typically in experiences where the players face one another 8.
In the case of "button worlds" such as this one or those described in Section ??, there is no need for nearest-neighbour retrieval, since the performer clicks exactly on the data point and (mediated by his avatar) reproduces what kNN would do with an automated instrument: he will privilege in his choices the samples he can reach at hand, rather than constantly jumping large distances between items (see Fig. ??).

In the worlds developed in Sections ?? and ??, on the other hand, data points are not explicitly represented and some near-neighbour strategies need to be implemented. PatchXR exposes a wide range of blocks (a block corresponds to an object in Max or Pure Data) making it simple to access gesture data such as:

• the position/distance between hands/controllers and a reference;
• the rotation angles (x y z) of both hands' controllers;
• 2-d touchscreen-like controllers, where the user moves the xy position of a selector across a plane by manually grabbing it;
• 2-d laser-like controllers, where the user moves the xy position of a selector remotely, as if using a laser pointer towards a remote screen or board;
• 2-d pads, which give access to the velocity at which the pad is hit;
• 3-d slider or theremin-like controllers, where the user moves the xyz position of a selector by manually grabbing it;
• a block called "interaction box", similar to the 3-d slider, with the main difference that the user does not grab the selector but instead moves in and out of the interactive zone;
• 1-d sliders, knobs, buttons...

One of the current challenges consists in diversifying the ways in which the corpus is queried. One-to-one mappings of UMAP results such as those of Section ?? use buttons facing each other, in order to prompt the players to face each other 9. Playing alone while controlling many instruments at the same time (the one-man orchestra) encourages a higher-level type of control over automata, i.e. implementing the simple ability to concatenate automatically: play the next sample as soon as the previous one has stopped (see Section ??).

8 https://youtu.be/WhuqOOuzzBw
9 https://youtu.be/

6.1 One to one mapping between data points and buttons

The first javascript routine developed was designed to simply map a data point in the sonic space to a button in the virtual 3d space. By iterating over an array, the routine generates a world in which each button's coordinates are dictated by the FluCoMa UMAP analysis described in Section ??. Albeit simpler than the method that will be exposed later, this simple one-to-one mapping has several advantages, most importantly the haptic and visual feedback the performer gets when hitting each button.
6.2 Max/pd dependence

The second stage investigated the possibility of rendering the sounds on an array of Raspberry Pi computers [?]. While this method shows advantages in terms of patching (because of the convenience of using Max and Pure Data), the main drawback is that patches designed in this way cannot be accessed by the PatchXR community. Documentation, similarly, is harder to record, since the sound is produced outside of PatchXR. The haptic and visual feedback is very different here, in the sense that the user primarily controls the region of the space to be played, and when to start and stop playing (whether his hand touches the interface or not). The way he plays is less rhythmical than with the button interface, where the player "hits" 10 each sample (Section ??). Here, on the contrary, the automaton simply keeps playing as long as he touches the interface 11.

FluCoMa exposes fluid.kdtree, which is able to find the k nearest neighbours once it is given as input the coordinates of each data point (see Fig. ??). This method proved more suitable for controlling automata in which the player selects a region of a 2d plane together with the number of neighbours he wants the automaton to improvise with.

Figure 6. The fluid.kdtree object is used here to retrieve the 8 nearest neighbours of a point in a 2-d space.

The most satisfying results were achieved by sending messages to each Raspberry Pi independently, according to its specific (static) IP address, with the simple syntax of a 2-integer list corresponding to 1/ which buffer to look up and 2/ which slice in this buffer to play, each Pi/speaker thus being able to play each sound in a Pure Data patch, in which each slice looks up an array with the corresponding slice points.

10 The opening of this video aptly conveys how the energy transfers from the player's gesture: https://www.youtube.com/watch?v=glFdzbAJVRU
11 https://youtu.be/vtob96F9cQw
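A sketch of that messaging layer in Python with python-osc; the OSC address pattern, port and IP addresses are illustrative assumptions, not the actual protocol used by the patch.

```python
# Send the 2-integer list described above to each Raspberry Pi at its static IP.
from pythonosc.udp_client import SimpleUDPClient

PIS = ["192.168.1.11", "192.168.1.12", "192.168.1.13"]   # one speaker each (assumed)
PORT = 9000                                              # assumed port

clients = [SimpleUDPClient(ip, PORT) for ip in PIS]

def play(pi_index, buffer_index, slice_index):
    """1/ which buffer to look up, 2/ which slice of that buffer to play."""
    clients[pi_index].send_message("/play", [buffer_index, slice_index])

play(0, 2, 17)   # ask the first Pi to play slice 17 of buffer 2
```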
6.3 Nearest neighbour in PatchXR

The currently most frequently used program is a javascript routine that generates a world (a .patch file) in which the x y z coordinates of each data point are stored in a "knob board" block (the long rectangle in Fig. ??). To measure the distance in 3D using Thales' theorem (or so-called distance formula), we need to find the distance between two points in three-dimensional space, i.e. find the diameter of the sphere that passes through both points:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}

Fig. ?? shows the corresponding implementation in PatchXR. This fragment of visual program (called an "abstraction" or "subpatch" in Max, and a "group" in PatchXR) is then used in a more complex patch which iterates (100 times per second, for instance) over an array (the rectangular "knob board" object in Fig. ??), so as to output the index (the value 234 in the figure) of the point situated closest to the controller.

Figure 7. The coordinates of the player's controller (here 0.01, 0.06, 0.13) yield as a result index 234 as nearest neighbour (read from left to right).

Figure 8. The implementation of Thales' theorem in PatchXR (read from right to left).
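In scalar terms, the loop that the PatchXR group performs (around 100 times per second) reduces to the following sketch; the point set is illustrative, not the patch itself.

```python
import numpy as np

def nearest_index(points, hand):
    """Compute the Euclidean distance from the controller to every stored
    point and return the index of the closest one."""
    d = np.sqrt(((points - hand) ** 2).sum(axis=1))
    return int(d.argmin())

points = np.random.rand(512, 3)                               # x y z of each data point
print(nearest_index(points, np.array([0.01, 0.06, 0.13])))    # e.g. 234 in Fig. 7
```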
7. NETWORKED MUSIC PERFORMANCE

The concept of online virtual collaboration has gained significant attention, notably through Meta's promotion of it in 2021. The application of this concept within the realm of "tele-improvisations" [?], most commonly referred to as Networked Music Performance (NMP), holds the potential to overcome what was hitherto viewed as an intrinsic limitation of the field, both from a practitioner's and from an audience's point of view.

NMPs have indeed stimulated considerable research and experimentation whilst facing resistance at the same time. In his article "Not Being There", Miller Puckette argues: "Having seen a lot of networked performances of one sort or another, I find myself most excited by the potential of networked 'telepresence' as an aid to rehearsal, not performance." [?]. In "Embodiment and Disembodiment in NMP" [?], Georg Hajdu identifies an issue with a lack of readability from the audience's perspective, who cannot perceive the gesture-to-sound relationship as would be the case in a normal concert situation: "These performances take machine–performer–spectator interactions into consideration, which, to a great deal, rely on embodied cognition and the sense of causality [...]. Classical cause-and-effect relationships (which also permeate the 'genuine' musical sign of the index) are replaced by plausibility, that is the amount to which performers and spectators are capable of 'buying' the outcome of a performance by building mental maps of the interaction." Performances of laptop orchestras, along with various other experiments using technology collaboratively, whether in local or distributed settings, have reported similar concerns, most commonly expressing a lack of embodiment in the performance. Although still in its infancy, a first live performance staged/composed by the author in Paris, with participants distributed across Europe, showed promising potential for tackling these issues, most importantly through its attempt at dramatising the use of avatars.

• JIM 23: https://youtu.be/npyfwqN02qE

A lot remains to be improved in order to give the audience an experience aesthetically comparable to that of the concert hall. Carefully orchestrated movements of cameras around the avatars, and a faithful translation of the headset experience of spatial audio with immersive visuals, need further exploration, but this would go beyond the scope of the present article.
8. CONCLUSIONS

We have proposed a workflow for corpus-based concatenative synthesis (CBCS) in multiplayer VR (or the metaverse), arguing that machine learning tools for data visualisation offer revealing and exploitable information about the timbral quality of the material being analysed. In a wider sense, the present approach can be understood as reflexive practice on new media, according to which the notion of database may be considered an art form [?].

With the discussed tools for "machine listening" (FluCoMa, MuBu) helping to build intelligent instruments from relatively small amounts of data, the duration of the samples appears crucial in CBCS. A balance must be found between 1/ short-duration sample analyses, which are easier to process and categorise, and 2/ long samples, which sound more natural in the context of instrument-based simulations.

Acknowledgments

I am grateful for the support of UCA/CTEL, whose artist residency research program has allowed these experiments to be held, and for the support of PRISM-CNRS.

9. REFERENCES

[1] Andersson, "Immersive audio programming in a virtual reality sandbox," Journal of the Audio Engineering Society, March 2019.

[2] S. Serafin, R. Nordahl, C. Erkut, M. Geronazzo, F. Avanzini, and A. de Götzen, "Sonic interaction in virtual environments," in 2015 IEEE 2nd VR Workshop on Sonic Interactions for Virtual Environments (SIVE), 2015, pp. 1–2.

[3] L. Turchet, "Musical Metaverse: vision, opportunities, and challenges," Personal and Ubiquitous Computing, 01 2023.

[4] F. Berthaut, "3D interaction techniques for musical expression," Journal of New Music Research, vol. 49, no. 1, pp. 60–72, 2020.

[5] B. Loveridge, "Networked music performance in virtual reality: current perspectives," Journal of Network Music and Arts, vol. 2, no. 1, p. 2, 2020.

[6] A. Çamcı and R. Hamilton, "Audio-first VR: new perspectives on musical experiences in virtual environments," Journal of New Music Research, vol. 49, no. 1, pp. 1–7, 2020.

[7] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton, "Real-Time Corpus-Based Concatenative Synthesis with CataRT," in 9th International Conference on Digital Audio Effects (DAFx), Montreal, Canada, Sep. 2006, pp. 279–282. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01161358

[8] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, Deep Learning Techniques for Music Generation – A Survey, Aug. 2019. [Online]. Available: https://hal.sorbonne-universite.fr/hal-01660772

[9] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Generative timbre spaces with variational audio synthesis," CoRR, vol. abs/1805.08501, 2018. [Online]. Available: http://arxiv.org/abs/1805.08501

[10] D. L. Wessel, "Timbre Space as a Musical Control Structure," Computer Music Journal, vol. 3, no. 2, pp. 45–52, 1979. [Online]. Available: http://www.jstor.org/stable/3680283

[11] K. Fitz, M. Burk, and M. McKinney, "Multidimensional perceptual scaling of musical timbre by hearing-impaired listeners," The Journal of the Acoustical Society of America, vol. 125, p. 2633, 05 2009.

[12] J.-C. Risset and D. Wessel, "Exploration of timbre by analysis and synthesis," Psychology of Music, pp. 113–169, 1999.

[13] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and J. Krimphoff, "Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes," Psychological Research, vol. 58, pp. 177–92, 02 1995.

[14] A. Caclin, S. McAdams, B. Smith, and S. Winsberg, "Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones," The Journal of the Acoustical Society of America, vol. 118, pp. 471–82, 08 2005.

[15] D. Schwarz, "The Sound Space as Musical Instrument: Playing Corpus-Based Concatenative Synthesis," in New Interfaces for Musical Expression (NIME), Ann Arbor, United States, May 2012, pp. 250–253. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01161442

[16] L. Garber, T. Ciccola, and J. C. Amusategui, "AudioStellar, an open source corpus-based musical instrument for latent sound structure discovery and sonic experimentation," 12 2020.

[17] B. Hackbarth, N. Schnell, P. Esling, and D. Schwarz, "Composing Morphology: Concatenative Synthesis as an Intuitive Medium for Prescribing Sound in Time," Contemporary Music Review, vol. 32, no. 1, pp. 49–59, 2013. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01577895

[18] N. Schnell, A. Roebel, D. Schwarz, G. Peeters, and R. Borghesi, "MuBu and Friends - Assembling Tools for Content Based Real-Time Interactive Audio Processing in Max/MSP," Proceedings of the International Computer Music Conference (ICMC 2009), 01 2009.

[19] P. A. Tremblay, G. Roma, and O. Green, "Enabling Programmatic Data Mining as Musicking: The Fluid Corpus Manipulation Toolkit," Computer Music Journal, vol. 45, no. 2, pp. 9–23, 06 2021. [Online]. Available: https://doi.org/10.1162/comj_a_00600

[20] F. Bevilacqua and R. Müller, "A Gesture Follower for Performing Arts," 05 2005.

[21] N. Schnell, D. Schwarz, J. Larralde, and R. Borghesi, "PiPo, a Plugin Interface for Afferent Data Stream Processing Operators," in International Society for Music Information Retrieval Conference, 2017.

[22] R. Fiebrink and P. Cook, "The Wekinator: A System for Real-time, Interactive Machine Learning in Music," Proceedings of the Eleventh International Society for Music Information Retrieval Conference (ISMIR 2010), 01 2010.
[23] A. Einbond and D. Schwarz, “Spatializing Tim-
bre With Corpus-Based Concatenative Synthesis,” 06
2010.
[24] G. Roma, O. Green, and P. A. Tremblay, “Adaptive
Mapping of Sound Collections for Data-driven Musical
Interfaces,” in New Interfaces for Musical Expression,
2019.
[25] B. D. Smith and G. E. Garnett, “Unsupervised Play:
Machine Learning Toolkit for Max,” in New Interfaces
for Musical Expression, 2012.
[26]  PrÉ  : connected polyphonic immersion. Zenodo,
Jul. 2022. [Online]. Available: https://doi.org/10.5281/
zenodo.6806324
[27] R. Mills, “Tele-Improvisation: Intercultural Interaction
in the Online Global Music Jam Session,” in Springer
Series on Cultural Computing, 2019. [Online].
Available: https://api.semanticscholar.org/CorpusID:
57428481
[28] M. Puckette, “Not Being There,” Contemporary
Music Review, vol. 28, no. 4-5, pp. 409–412,
2009. [Online]. Available: https://doi.org/10.1080/
07494460903422354
[29] G. Hajdu, “Embodiment and disembodiment in
networked music performance,” 2017. [Online].
Available: https://api.semanticscholar.org/CorpusID:
149523160
[30] L. Manovich, “Database as Symbolic Form,” Conver-
gence: The International Journal of Research into New
Media Technologies, vol. 5, pp. 80 – 99, 1999.
