Papers by Dorien Herremans
NAACL, 2024
The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music not only with general text captions, but with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features, which are then appended to the existing descriptions in text format. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and that the controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.
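As a rough sketch of the guidance idea, the snippet below fuses a text embedding with embeddings of predicted music features (chords, beats, tempo, key) to condition one denoising step; the class name, dimensions, and the stand-in MLP are all illustrative, not Mustango's actual UNet.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of MuNet-style guidance: fuse a text embedding with
# predicted music-feature embeddings and use the result to condition one
# reverse-diffusion denoising step. All names/dimensions are made up.
class GuidedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, text_dim=128, feat_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(text_dim + feat_dim, latent_dim)
        self.denoise = nn.Sequential(  # stand-in for the actual UNet
            nn.Linear(latent_dim * 2, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, z_t, text_emb, music_feat_emb):
        cond = self.cond_proj(torch.cat([text_emb, music_feat_emb], dim=-1))
        return self.denoise(torch.cat([z_t, cond], dim=-1))  # predicted noise

model = GuidedDenoiser()
z_t = torch.randn(1, 64)  # noisy latent at step t
eps = model(z_t, torch.randn(1, 128), torch.randn(1, 32))
```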
Many music generation systems based on neural networks are fully autonomous and do not offer control over the generation process. In this research, we present a music generation system that is controllable in terms of tonal tension. We incorporate two tonal tension measures based on the Spiral Array Tension theory into a variational autoencoder model. This allows us to control the direction of the tonal tension throughout the generated piece, as well as the overall level of tonal tension. Given a seed musical fragment, stemming either from user input or from directly sampling the latent space, the model can generate variations of this seed fragment with altered tonal tension. The altered music still resembles the seed music rhythmically, but the pitches of the notes are changed to match the desired tonal tension as conditioned by the user.
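A minimal sketch of how such conditioning could look, assuming a VAE whose decoder receives a user-specified tension value at each time step; the architecture and dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

# Sketch only: a decoder conditioned on a per-step tension value, so the
# user can raise or lower the tension curve of the generated variation.
class TensionConditionedDecoder(nn.Module):
    def __init__(self, latent_dim=32, hidden=128, vocab=130):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, z, tension_curve):
        # z: (batch, latent_dim); tension_curve: (batch, steps)
        steps = tension_curve.size(1)
        z_seq = z.unsqueeze(1).expand(-1, steps, -1)
        x = torch.cat([z_seq, tension_curve.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-step pitch logits

decoder = TensionConditionedDecoder()
logits = decoder(torch.randn(2, 32), torch.linspace(0, 1, 16).repeat(2, 1))
```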
Record companies invest billions of dollars in new talent around the globe each year. Gaining insight into what actually makes a hit song would provide tremendous benefits for the music industry. In this research, we tackle this question by focusing on the dance hit song classification problem. A database of dance hit songs from 1985 until 2013 is built, including basic musical features as well as more advanced features that capture a temporal aspect. A number of different classifiers are used to build and test dance hit prediction models. The resulting best model performs well when predicting whether a song is a "top 10" dance hit versus a lower listed position.
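The setup can be illustrated with a few lines of scikit-learn; the random feature matrix below merely stands in for the paper's musical and temporal features, and logistic regression is just one classifier one might try.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Illustrative sketch: X would hold basic and temporal audio features per
# song, y whether it reached the top 10. The random data is a placeholder.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))        # stand-in feature matrix
y = rng.integers(0, 2, size=500)      # 1 = "top 10" hit, 0 = lower listed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```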
High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) of human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature...
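One common way to realise the "sliding fader" idea is to regularise a latent dimension so that its ordering follows a low-level attribute; the pairwise penalty below is a sketch in that spirit, not necessarily the paper's exact loss.

```python
import torch

# Sketch: tie one latent dimension to a quantifiable low-level attribute
# (e.g. note density) so it can later be slid up or down. A pairwise
# ranking penalty encourages the latent ordering to match the attribute.
def fader_reg_loss(z, attr, dim=0):
    # z: (batch, latent_dim); attr: (batch,) low-level attribute values
    dz = z[:, dim].unsqueeze(0) - z[:, dim].unsqueeze(1)  # latent distances
    da = attr.unsqueeze(0) - attr.unsqueeze(1)            # attribute distances
    return torch.mean((torch.tanh(dz) - torch.sign(da)) ** 2)

z = torch.randn(8, 32, requires_grad=True)
attr = torch.rand(8)
loss = fader_reg_loss(z, attr)
loss.backward()
```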
We propose a flexible framework that deals with both singer conversion and singers' vocal technique conversion. The proposed model is trained on non-parallel corpora, accommodates many-to-many conversion, and leverages recent advances in variational autoencoders. It employs separate encoders to learn disentangled latent representations of singer identity and vocal technique, with a joint decoder for reconstruction. Conversion is carried out by simple vector arithmetic in the learned latent spaces. Both a quantitative analysis and a visualization of the converted spectrograms show that our model is able to disentangle singer identity and vocal technique and successfully perform conversion of these attributes. To the best of our knowledge, this is the first work to jointly tackle conversion of singer identity and vocal technique based on a deep learning approach.
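Since conversion is carried out by vector arithmetic in the learned latent spaces, the core operation can be sketched in a few lines; the decoder here is a dummy placeholder.

```python
import torch

# Sketch of conversion by latent vector arithmetic, assuming separate
# encoders have produced disentangled singer and technique embeddings.
def convert(decode, singer_z, tech_src, tech_tgt, alpha=1.0):
    # alpha = 1 replaces the technique entirely; 0 < alpha < 1 interpolates
    return decode(singer_z, tech_src + alpha * (tech_tgt - tech_src))

# dummy decoder standing in for the model's joint decoder
decode = lambda s, t: torch.cat([s, t], dim=-1)
out = convert(decode, torch.randn(1, 16), torch.randn(1, 16), torch.randn(1, 16))
```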
2017 International Conference on Orange Technologies (ICOT)
Tension is a complex multidimensional concept that is not easily quantified. This research proposes three methods for quantifying aspects of tonal tension based on the spiral array, a model for tonality. The cloud diameter measures the dispersion of clusters of notes in tonal space; the cloud momentum measures the movement of pitch sets in the spiral array; finally, tensile strain measures the distance between the local and global tonal context. The three methods are implemented in a system that displays the results as tension ribbons over the music score to allow for ease of interpretation. All three methods are extensively tested on data ranging from small snippets to phrases with the Tristan chord and larger sections from Beethoven and Schubert piano sonatas. They are further compared to results from an existing empirical experiment.
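The three measures translate directly into distances on points in the three-dimensional tonal space; the sketch below assumes the spiral-array positions of the notes have already been computed.

```python
import numpy as np

# Sketch of the three tension measures on points in a 3-D tonal space.
# `cloud` holds spiral-array positions of the notes in one time window.
def cloud_diameter(cloud):
    # largest pairwise distance within the cloud of notes
    d = np.linalg.norm(cloud[:, None, :] - cloud[None, :, :], axis=-1)
    return d.max()

def cloud_momentum(cloud_prev, cloud_next):
    # distance moved by the centre of the cloud between consecutive windows
    return np.linalg.norm(cloud_next.mean(0) - cloud_prev.mean(0))

def tensile_strain(cloud, global_key_pos):
    # distance between the local centre and the global tonal context
    return np.linalg.norm(cloud.mean(0) - global_key_pos)

w1, w2 = np.random.rand(4, 3), np.random.rand(4, 3)
print(cloud_diameter(w1), cloud_momentum(w1, w2), tensile_strain(w1, np.zeros(3)))
```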
Computer-aided composition is a research area that focuses on using computers to assist composers. In its most extreme form, the computer generates the entire musical piece. This can be done by realizing that composing music can—at least partially—be regarded as a combinatorial optimization problem, in which one or more melodies are "optimized" to fit the "rules" of their specific musical style. In a previous paper, a variable neighbourhood search (VNS) algorithm was developed that could generate first and fifth species counterpoint fragments based on an objective function that was manually coded from music theory [5]. In this research, machine learning is used to automatically generate the objective function for a VNS that generates first species counterpoint. When composing or generating counterpoint fragments, it is essential to consider both vertical (harmonic) and horizontal (melodic) aspects. These two dimensions should be linked instead of treated separately. Furthermore,...
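For readers unfamiliar with VNS, the following simplified skeleton (it omits the local-search phase of a full VNS) shows how a fragment could be iteratively improved against any objective function, hand-coded or learned; the toy objective is purely illustrative.

```python
import random

# Simplified variable neighbourhood search skeleton: try moves from
# increasingly large neighbourhoods, keeping any improvement. The real
# algorithm also performs a local search after each shake.
def vns(initial, objective, neighbourhoods, max_iter=1000):
    best = initial
    for _ in range(max_iter):
        for shake in neighbourhoods:
            candidate = shake(best)
            if objective(candidate) < objective(best):
                best = candidate
                break  # restart from the first neighbourhood
    return best

def change_one(melody):
    # toy neighbourhood: nudge one random pitch by a semitone
    new = melody.copy()
    i = random.randrange(len(new))
    new[i] += random.choice([-1, 1])
    return new

melody = [62, 65, 59, 67]  # MIDI pitches
print(vns(melody, lambda m: abs(sum(m) / len(m) - 60), [change_one]))
```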
Sampling the extrema from statistical models of music with variable neighbourhood search
Sensors
In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these typically ignore both the correlation between the multiple modality inputs and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture—AttendAffectNet (AAN)—that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions (expressed in terms of valence and arousal). We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way on the features extracted from the different modalities...
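The Feature AAN idea of letting self-attention weigh the modality features can be sketched with a standard attention layer; the modality count, dimensions, and pooling below are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch: treat each modality's feature vector (visual, audio, text) as one
# token and let self-attention weigh them before regressing valence/arousal.
class FeatureAAN(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # valence, arousal

    def forward(self, modality_tokens):  # (batch, n_modalities, dim)
        attended, _ = self.attn(modality_tokens, modality_tokens, modality_tokens)
        return self.head(attended.mean(dim=1))

model = FeatureAAN()
va = model(torch.randn(8, 3, 256))  # 3 modalities -> (8, 2) valence/arousal
```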
2020 Joint 9th International Conference on Informatics, Electronics & Vision (ICIEV) and 2020 4th International Conference on Imaging, Vision & Pattern Recognition (icIVPR), 2020
Generating an image from a provided descriptive text is quite a challenging task, because of the difficulty of incorporating perceptual information (object shapes, colors, and their interactions) while maintaining high relevancy to the provided text. Current methods first generate an initial low-resolution image, which typically has irregular object shapes, colors, and interactions between objects. This initial image is then improved by conditioning on the text. However, these methods mainly address the problem of using the text representation efficiently in the refinement of the initially generated image, while the success of this refinement process depends heavily on the quality of the initially generated image, as pointed out in the Dynamic Memory Generative Adversarial Network (DM-GAN) paper. Hence, we propose a method to provide well-initialized images by incorporating perceptual understanding in the discriminator module. We improve the perceptual information at the first...
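One way to read "perceptual understanding in the discriminator" is a discriminator that also exposes intermediate feature maps for a feature-matching loss; the sketch below illustrates that general mechanism and is not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative discriminator that returns a realism score plus intermediate
# feature maps, so generated images can be matched to real ones perceptually.
class PerceptualDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.score = nn.Conv2d(64, 1, 4)

    def forward(self, img):
        f = self.features(img)
        return self.score(f).mean(dim=(1, 2, 3)), f  # realism score + features

disc = PerceptualDiscriminator()
real_s, real_f = disc(torch.randn(2, 3, 64, 64))
fake_s, fake_f = disc(torch.randn(2, 3, 64, 64))
fm_loss = nn.functional.l1_loss(fake_f, real_f.detach())  # feature matching
```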
2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS)
Shallow water environments create a challenging channel for communications. In this paper, we focus on the challenges posed by the frequency-selective signal distortion called the Doppler effect. We explore the design and performance of machine learning (ML) based demodulation methods: (1) Deep Belief Network-feed forward Neural Network (DBN-NN) and (2) Deep Belief Network-Convolutional Neural Network (DBN-CNN) in the physical layer of Shallow Water Acoustic Communication (SWAC). The proposed method comprises an ML-based feature extraction method and a classification technique. First, the feature extraction converts the received signals to feature images. Next, the classification model correlates the images to a corresponding binary representative. An analysis of the proposed ML-based demodulation shows that, despite the presence of instantaneous frequencies, the performance of the algorithm is invariant to within a small 2 dB error margin in terms of bit error rate (BER).
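The classification stage can be sketched as a small CNN over the feature images; the DBN feature extraction and the construction of the images are omitted, and the layer sizes are invented.

```python
import torch
import torch.nn as nn

# Sketch of the classification stage only: a small CNN maps a "feature
# image" derived from the received signal to one of the symbol classes
# (two classes here, standing in for binary modulation).
class DemodCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

logits = DemodCNN()(torch.randn(4, 1, 32, 32))  # 4 feature images -> bits
```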
ArXiv, 2017
We present a semantic vector space model for capturing complex polyphonic musical context. A word2vec model based on a skip-gram representation with negative sampling was used to model slices of music from a dataset of Beethoven's piano sonatas. A visualization of the reduced vector space using t-distributed stochastic neighbor embedding shows that the resulting embedded vector space captures tonal relationships, even without any explicit information about the musical content of the slices. Secondly, an excerpt of Beethoven's Moonlight Sonata was altered by replacing slices based on context similarity. The resulting music shows that a slice selected based on similar word2vec context also has a relatively short tonal distance from the original slice.
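With slices encoded as tokens, the modelling step maps onto a standard gensim skip-gram model with negative sampling; the toy corpus below merely stands in for the slices extracted from the sonatas.

```python
from gensim.models import Word2Vec

# Sketch: each piece becomes a "sentence" of slice tokens, and a skip-gram
# model (sg=1) with negative sampling embeds them. The corpus is a toy
# stand-in for the slices extracted from Beethoven's piano sonatas.
corpus = [["C:maj", "G:maj", "A:min", "F:maj"],
          ["A:min", "E:maj", "A:min", "F:maj"]]
model = Word2Vec(corpus, vector_size=32, window=2, sg=1, negative=5,
                 min_count=1, epochs=50)
print(model.wv.most_similar("C:maj", topn=2))  # tonally related slices
```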
Sensors (Basel, Switzerland), 2021
Intelligent systems are transforming the world, as well as our healthcare system. We propose a deep learning-based cough sound classification model that can distinguish between children with healthy coughs and children with pathological coughs caused by asthma, upper respiratory tract infection (URTI), or lower respiratory tract infection (LRTI). To train a deep neural network model, we collected a new dataset of cough sounds, labelled with a clinician's diagnosis. The chosen model is a bidirectional long short-term memory network (BiLSTM) based on Mel-Frequency Cepstral Coefficient (MFCC) features. When trained to classify two classes of coughs—healthy or pathological (in general or belonging to a specific respiratory pathology)—the resulting model reaches an accuracy exceeding 84% when classifying the cough against the label provided by the physician's diagnosis. To classify a subject's respiratory pathology condition, the results of multiple cough epochs per subject were combined. The resulting pr...
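A sketch of the feature-plus-model pipeline, using librosa for the MFCCs and an illustrative BiLSTM whose layer sizes are not the paper's configuration (the bundled trumpet clip merely stands in for a cough recording).

```python
import librosa
import torch
import torch.nn as nn

# Sketch: MFCC features from an audio clip, fed to a bidirectional LSTM
# with a binary (healthy vs. pathological) output head.
y, sr = librosa.load(librosa.ex("trumpet"))            # placeholder audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)

class CoughBiLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.head(h[:, -1])  # logits: healthy / pathological

logits = CoughBiLSTM()(torch.tensor(mfcc[None], dtype=torch.float32))
```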
Proceedings of the 12th International Audio Mostly Conference on Augmented and Participatory Sound and Music Experiences, 2017
Emotional response to music is often represented on a two-dimensional arousal-valence space without reference to score information that may provide critical cues to explain the observed data. To bridge this gap, we present IMMA-Emo, an integrated software system for visualising emotion data aligned with music audio and score, so as to provide an intuitive way to interactively visualise and analyse music emotion data. The visual interface also allows for the comparison of multiple emotion time series. The IMMA-Emo system builds on the online interactive Multi-modal Music Analysis (IMMA) system. Two examples demonstrating the capabilities of the IMMA-Emo system are drawn from an experiment set up to collect arousal-valence ratings based on participants' perceived emotions during a live performance. Direct observation of the corresponding score parts and aural input from the recording allow explanatory factors for the ratings, and for changes in the ratings, to be identified.
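The core visualisation, comparing several emotion time series over the music timeline, can be sketched with matplotlib; the random curves below stand in for participants' arousal ratings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: plot several arousal time series against the music timeline so
# they can be compared and aligned with score events.
t = np.linspace(0, 120, 240)  # seconds into the piece
for label in ("participant A", "participant B"):
    plt.plot(t, np.cumsum(np.random.randn(t.size)) * 0.05, label=label)
plt.xlabel("time (s)")
plt.ylabel("arousal")
plt.legend()
plt.title("Perceived arousal during a live performance")
plt.show()
```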
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Most state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction, and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art transcription accuracy; instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-nets: the first U-net transcribes the spectrogram into a posteriorgram, and a second U-net transforms the posteriorgram back into a spectrogram. A reconstruction loss is applied between the original spectrogram and the reconstructed spectrogram to constrain...
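The training objective described here combines a pitch loss on the posteriorgram with a spectrogram reconstruction loss; below is a sketch with stubbed-in networks standing in for the two U-nets.

```python
import torch
import torch.nn.functional as F

# Sketch of the objective: transcription loss against pitch labels only,
# plus a reconstruction loss between the input spectrogram and the
# spectrogram rebuilt from the posteriorgram. The U-nets are stubbed out.
def total_loss(spec, pitch_labels, transcriber, reconstructor):
    posteriorgram = transcriber(spec)            # U-net 1 (stub)
    spec_rec = reconstructor(posteriorgram)      # U-net 2 (stub)
    l_pitch = F.binary_cross_entropy(posteriorgram, pitch_labels)
    l_rec = F.mse_loss(spec_rec, spec)
    return l_pitch + l_rec

spec = torch.rand(1, 1, 128, 100)
labels = torch.rand(1, 1, 128, 100).round()
loss = total_loss(spec, labels, torch.sigmoid, lambda p: p)  # stub networks
```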
Information on liquid jet stream flow is crucial in many real-world applications. In a large number of cases, these flows fall directly onto free surfaces (e.g. pools), creating a splash with accompanying splashing sounds. The sound produced is supplied by energy interactions between the liquid jet stream and the passive free surface. In this investigation, we collect the sound of a water jet of varying flowrate falling into a pool of water, and use this sound to predict the flowrate and flowrate trajectory involved. Two approaches are employed: the first uses machine-learning models trained on audio features extracted from the collected sound to predict the flowrate (and subsequently the flowrate trajectory). In contrast, the second method directly uses acoustic parameters related to the spectral energy of the liquid-liquid interaction to estimate the flowrate trajectory. The actual flowrate, however, is determined directly using a gravimetric method: tracking the change in mass of the...
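The first, feature-based approach can be sketched as a regression problem; the random features below stand in for the extracted audio descriptors, and a random forest is just one plausible model choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sketch: regress the flowrate from audio features of the splashing sound.
# Random features stand in for descriptors (e.g. spectral energy statistics)
# extracted per audio frame; ground truth comes from the gravimetric method.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))               # per-frame audio feature vectors
flowrate = rng.uniform(0.1, 2.0, size=300)   # ground-truth flowrates

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, flowrate)
trajectory = reg.predict(X[:50])             # predicted flowrate over 50 frames
```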
Drupal is one of the best content management systems (CMS) around. In fact, it has just won (for the second time) the Best Overall 2008 Open Source CMS Award and the Best PHP Open Source CMS Award. For about eight years now, Drupal has been providing users with one of the best and most versatile frameworks around. In this book, I will guide the reader through the different modules needed to build a solid community site. I go even further, by looking at how to structure content and how to make a few €, $, ¥, ¢, £,... from your site. It is going to be an exploration of modules and their features.
The main strategy of this book is to use only out-of-the-box, user-contributed modules, so anybody can build a great site. This well-thought-out strategy offers the following advantages:
1. Easy to update.
2. Anyone can do it.
3. Expandable: at any time, you can safely install a module to add extra functionality.
Every function will be explained using the fully integrated case study Drupalfun.
Anyone can build a social networking site with Drupal after reading this book.
About the author
Dorien Herremans holds an MSc in Commercial Engineering (Management Information Systems) from the University of Antwerp, Belgium. She currently lives in the Swiss Alps, where she lectured in IT and 3D computer animation at Les Roches University of Applied Science and runs her own company in Switzerland. Among other things, she loves life and Drupal, and has set up a multitude of sites, using creative techniques to get the maximum out of this great framework.
http://www.dagstuhl.de/16092
February 28 – March 4, 2016, Dagstuhl Seminar 16092
Computational Music Structure Analysis
The first two models, an if-then ruleset and a decision tree, give comprehensible insights into the stylistic differences between Beethoven, Haydn, and Bach, which might be interesting for music theorists. The third model, a logistic regression model, outputs the probability that a fragment was composed by a certain composer. This model is integrated into the objective function of a variable neighborhood search algorithm that was previously developed by the authors. The original algorithm was able to generate counterpoint music, a type of polyphonic baroque music. The result is a system that can generate a continuous stream of counterpoint with composer-specific characteristics that sounds pleasing to the ear. This system is implemented as an Android app called FuX, which can be installed on any Android phone or tablet.
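Plugging the classifier into the search is conceptually simple: the VNS minimises one minus the model's probability that the fragment is by the target composer. A toy sketch (the features, data, and fitted model are stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of using the composer model inside a VNS objective: fragments the
# logistic regression deems more composer-like score lower (better).
rng = np.random.default_rng(2)
model = LogisticRegression().fit(rng.normal(size=(100, 8)),
                                 rng.integers(0, 2, size=100))

def composer_objective(fragment_features, target_class=1):
    p = model.predict_proba([fragment_features])[0][target_class]
    return 1.0 - p  # VNS minimises, so lower means more composer-like

print(composer_objective(rng.normal(size=8)))
```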
If time permits, I might briefly introduce some of my other work, for instance "Dance hit prediction" or "Automatic piano fingering".
In the first part of this talk, we'll look at generating music as an optimization problem. In particular, we'll focus on music that has long-term structure. A powerful variable neighbourhood search (VNS) algorithm was developed, which is able to generate a range of musical styles based on its objective function, whilst constraining the music to a structural template. In the first stage of the project, an objective function based on rules from music theory was used to generate counterpoint. In the next stage, a machine learning approach was combined with the VNS in order to generate structured music for the bagana, an Ethiopian lyre. The approach followed in this research allows us to combine the power of machine learning methods with optimization algorithms.
In the second part of this talk, we'll zoom in on the dance hit song prediction problem. With annual investments of several billion dollars worldwide, record companies can benefit tremendously from gaining insight into what actually makes a hit song. We'll describe how we built a database of dance hit songs from 1985 until 2013, which includes basic musical features as well as more advanced features that capture a temporal aspect. Different classifiers, such as SVM and logistic regression, are used to build and test dance hit prediction models. The resulting model has a good performance when predicting whether a song is a "top 10" dance hit versus a lower listed position.
In this research, a machine learning approach is combined with the VNS in order to generate structured music for the bagana, an Ethiopian lyre. Different ways are explored in which a Markov model can be used to construct quality metrics that represent how well a fragment fits the chosen style (e.g. music for bagana). Current research that aims to extend the objective function with models such as recursive neural networks is also briefly discussed. The approach followed in this research allows us to combine the power of machine learning methods with optimisation algorithms.
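A Markov-model quality metric of this kind can be sketched as the average log-probability of a fragment's transitions under a style-specific transition model; the probabilities below are toy values, not estimates from a bagana corpus.

```python
import math

# Sketch: score a fragment by its average log-probability under a
# first-order pitch transition model estimated from a corpus in the
# target style. Unseen transitions get a small smoothing probability.
transitions = {("C", "D"): 0.4, ("D", "E"): 0.5, ("E", "C"): 0.3}

def markov_quality(fragment, smoothing=1e-3):
    logp = sum(math.log(transitions.get(pair, smoothing))
               for pair in zip(fragment, fragment[1:]))
    return logp / (len(fragment) - 1)  # higher = better fit to the style

print(markov_quality(["C", "D", "E", "C"]))
```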
In the first part of our research, we use optimisation techniques to generate counterpoint pieces. Counterpoint is a formally defined style that can be evaluated using strict rules, which are defined in music theory. In the second part of the research, we break free from the restriction of a formally defined style and explore how styles can be "learned" from existing pieces. Combining machine learning with optimisation has the added advantage that we can impose structural constraints and thus enforce a theme and long-term coherence on a piece.