
LET IT BEE – TOWARDS NMF-INSPIRED AUDIO MOSAICING

ABSTRACT

A swarm of bees buzzing "Let it be" by the Beatles or the wind gently howling the romantic "Gute Nacht" by Schubert – these are examples of audio mosaics as we want to create them. Given a target and a source recording, the goal of audio mosaicing is to generate a mosaic recording that conveys musical aspects (like melody and rhythm) of the target, using sound components taken from the source. In this work, we propose a novel approach for automatically generating audio mosaics with the objective of preserving the source's timbre in the mosaic. Inspired by algorithms for non-negative matrix factorization (NMF), our idea is to use update rules to learn an activation matrix that, when multiplied with the spectrogram of the source recording, resembles the spectrogram of the target recording. However, when applying the original NMF procedure, the resulting mosaic does not adequately reflect the source's timbre. As our main technical contribution, we propose an extended set of update rules for the iterative learning procedure that supports the development of sparse diagonal structures in the activation matrix. We show how these structures better retain the source's timbral characteristics in the resulting mosaic.

Jonathan Driedger, Thomas Prätzlich, Meinard Müller
International Audio Laboratories Erlangen
{jonathan.driedger,thomas.praetzlich,meinard.mueller}@audiolabs-erlangen.de

© Jonathan Driedger, Thomas Prätzlich, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jonathan Driedger, Thomas Prätzlich, Meinard Müller. "Let it Bee – Towards NMF-inspired Audio Mosaicing", 16th International Society for Music Information Retrieval Conference, 2015.

1. INTRODUCTION

Using the sounds in a recording of buzzing bees to recreate a recording of the song "Let it be" by the Beatles is a typical example of an audio mosaic. In this example, the recording of the bees serves as the source, while the Beatles recording is called the target. Ultimately, one should be able to identify the target recording when listening to the mosaic, but at the same time perceive the timbre of the source sounds. The audio mosaic of "Let it be" with the bee recording could therefore give the impression of bees being musicians, buzzing the song's tune.

Audio mosaicing is an interesting audio effect which has found its way into both artistic work and academic research. Artists like John Oswald used thousands of manually selected source audio snippets to create new musical compositions,¹ and real-time audio mosaicing has been used by musicians as an instrument in live performances [4, 22]. Over the years, many different systems for audio mosaicing have been proposed [1, 3, 5, 11, 13, 17, 18]. The core idea of most automated systems is to split the source into short audio segments, which are suitably concatenated afterwards to match spectral and temporal characteristics of the target [19].

Figure 1. Schematic overview of our proposed audio mosaicing method. The sparse diagonal structures in the activation matrix are important in order to preserve the timbre of the source in the mosaic.

In this work, we propose a novel way to create audio mosaics. Our idea is to learn an activation matrix that, when multiplied with the spectrogram of the source recording, approximates the spectrogram of the target recording (see Figure 1). The source spectrogram hereby serves as a template matrix which is fixed throughout the learning process.
This way, as opposed to many previous automated mosaicing approaches, a frame of the target can be resynthesized as the superposition of several spectral frames of the source, thus allowing "polyphony" of the source sounds.

As a first contribution, we propose an audio mosaicing procedure which is inspired by well-known algorithms for non-negative matrix factorization (NMF) [14]. Keeping the template matrix fixed (the source's magnitude spectrogram), this basic procedure learns an activation matrix by iteratively applying a standard NMF update rule to a randomly initialized matrix. Experiments show that, in case the source recording offers an appropriate amount of different sounds, this procedure can closely approximate the spectrogram of the target recording. However, the source's timbre is often barely recognizable in the resulting mosaics. The reason is that the procedure recreates every target frame independently, thus destroying temporal characteristics of the source in the final audio mosaic. Furthermore, the method can superimpose an arbitrary number of spectral frames from the source to construct a good numerical approximation of a single target frame. A superposition of a large number of source sounds may, however, result in a timbre that is no longer similar to the actual timbre of the source. Therefore, an exact approximation of the target's spectrogram cannot be our procedure's sole goal.

As our main technical contribution, we therefore propose an extended set of update rules that supports the development of sparse diagonal structures in the activation matrix during the learning process (see the activation matrix in Figure 1). Rather than single frames, diagonal structures activate whole frame sequences in their original order. This preserves the source's temporal characteristics in the resulting mosaic. Furthermore, the extended set of update rules also limits the number of simultaneous activations, making the learned activation matrix sparse and reducing the problem of too many source sounds being audible simultaneously. This way, we trade some approximation quality for a better preservation of the source's timbre.

The idea of activating sequences of frames is inspired by methods like non-negative matrix factor deconvolution (NMFD) and related formulations [20, 21], where template sequences of frames from a dictionary are activated by single activation values. However, our approach is conceptually different. Instead of changing the NMF problem formulation, our approach stays in the standard NMF setting, supporting the activation of whole frame sequences directly in the activation matrix with additional update rules. Besides being computationally very efficient and easy to implement, this also has the advantage that we do not need to choose a maximal length of the sequences as in NMFD. Similarly, the sparseness constraint imposed by our procedure is not enforced by penalty terms in the problem formulation (as for example in [8, 10, 12, 23]), but also by additional update rules.

The remainder of this paper is structured as follows. In Section 2 we introduce the basic concept of using NMF-inspired update rules for the task of audio mosaicing. In Section 3 we present the extended set of update rules that supports the development of sparse diagonal structures in a learned activation matrix. The effects of these update rules on the audio mosaics are discussed and demonstrated in Section 4.

¹ Especially on his album Plexure [16].

2. BASIC NMF-INSPIRED AUDIO MOSAICING

Non-negative matrix factorization (NMF) has been applied very successfully in a large variety of music processing tasks and beyond. Given a non-negative matrix $V \in \mathbb{R}_{\geq 0}^{N \times M}$, the goal of NMF is to decompose this matrix into two factors $W \in \mathbb{R}_{\geq 0}^{N \times K}$ and $H \in \mathbb{R}_{\geq 0}^{K \times M}$, where $N, M, K \in \mathbb{N}$. The distance between the product $WH$ and the matrix $V$ is minimized with respect to some distance measure, for example the Kullback-Leibler divergence

$D(V \,\|\, WH) = \sum_{n,m} \Big( V_{nm} \log\frac{V_{nm}}{(WH)_{nm}} - V_{nm} + (WH)_{nm} \Big).$  (1)

In the context of music processing, the matrix $V$ is usually a magnitude spectrogram of a music recording, the matrix $W$ is interpreted as a set of spectral templates, and the matrix $H$ constitutes an activation matrix. Non-zero values in a row of $H$ activate the associated template in $W$ at the respective time instance. The two factors $W$ and $H$ are usually learned by iteratively applying multiplicative update rules to two suitably initialized matrices [14].

Fixing the template matrix $W$ to be the magnitude spectrogram of the source recording, the basic idea of our proposed audio mosaicing approach is to learn only the activation matrix $H$. More precisely, we proceed as follows. Given the target recording $x_\mathrm{tar}$ and the source recording $x_\mathrm{src}$, we first compute the complex-valued spectrograms $X_\mathrm{tar}$ and $X_\mathrm{src}$ by applying the short-time Fourier transform (STFT) to both recordings. Afterwards, we set $V := |X_\mathrm{tar}|$, $W := |X_\mathrm{src}|$, and randomly initialize $H^{(1)} \in (0, 1]^{K \times M}$. Fixing a number of iterations $L$, we then iteratively update $H$ with

$H_{km}^{(\ell+1)} = H_{km}^{(\ell)} \, \frac{\sum_n W_{nk}\, V_{nm}/(W H^{(\ell)})_{nm}}{\sum_n W_{nk}},$  (2)

for $k \in [1:K]$, $m \in [1:M]$, and the iteration index $\ell \in [1:L-1]$. Finally, we set $H := H^{(L)}$. The learned activation matrix $H$ is then multiplied with the complex-valued $X_\mathrm{src}$, yielding the complex-valued spectrogram of the audio mosaic $X_\mathrm{mos} := X_\mathrm{src} H$. To compute the audio mosaic $x_\mathrm{mos}$, we apply an "inverse" STFT to the spectrogram $X_\mathrm{mos}$ which also adjusts the phases such that artifacts from phase discontinuities are reduced [9].

Figure 2. Basic NMF-inspired audio mosaicing. (a): Magnitude spectrogram of "Let it be" $V$ (target). (b): Magnitude spectrogram of a recording of bees $W$ (source). (c): Activation matrix $H$. (d): The product $WH$ (mosaic).

Figure 2 shows this basic procedure applied to our running example. In Figure 2a we see an excerpt of the magnitude spectrogram of the song "Let it be". Our goal is to create an audio mosaic of this song, using the recording of buzzing bees, which can be seen in Figure 2b. To increase the range of different pitches occurring in our source, we used a pitch-shifting algorithm [6] to create differently pitched versions of the bee recording and concatenated them. Figure 2c shows an excerpt of the activation matrix $H$, derived by applying the basic procedure described above.

A first observation about $H$ is the predominance of horizontal activation structures. These patterns correspond to single spectral frames in the source which are activated repeatedly to mimic the stable spectral structures in the target. Although the resulting mosaic, shown in Figure 2d, closely resembles these spectral structures, one can hear a "stuttering" effect when listening to the reconstructed audio recording. This stuttering originates from the same frame of the source being repeated over and over again. In Section 3.1, we aim to prevent the learning process from activating the same frame in fast repetition with an additional update rule.

A second observation is that the matrix $H$ usually activates many source frames simultaneously. The learning process can thus closely approximate the spectral shapes of the target frames. However, in the context of audio mosaicing, this has several drawbacks. Since $H$ is multiplied with the complex spectrogram $X_\mathrm{src}$, phase cancellation artifacts may arise when superimposing many complex spectral frames. This way, especially low-pitched sounds tend to cancel each other out and are not audible in the final audio mosaic. Furthermore, since a sound's timbre is also closely related to the energy distribution in its frequency spectrum, adapting the spectral shapes may change the timbre of the source. An update rule which sets a limit on the maximal number of simultaneous activations is presented in Section 3.2.

A third problem connected with the activation matrix shown in Figure 2c is the loss of temporal characteristics of the source. The typical "buzzing sound" of the bees, which results from pitch modulations (see Figure 2b), is lost in the mosaic (see Figure 2d). This is the case since the spectral frames of the source are activated independently of their order in the source spectrogram. To preserve some temporal characteristics, the update rule presented in Section 3.3 supports the development of diagonal structures in the activation matrix.

Figure 3. (a): Activation matrix $H^{(\ell)}$. (b): Repetition restricted activation matrix $R^{(\ell)}$. The horizontal neighborhood is indicated in red. (c): Polyphony restricted activation matrix $P^{(\ell)}$. For each column, the highest value is indicated in red. (d): Continuity enhancing activation matrix $C^{(\ell)}$. The diagonal kernel is indicated in red.

3. LEARNING SPARSE DIAGONAL ACTIVATIONS

The core idea to overcome the issues of the basic NMF-inspired audio mosaicing procedure is to impose specific constraints on the learned activation matrices by adapting the iterative update process. As discussed in the previous section, we identified three main problems of the mosaics generated by the basic procedure, all related to properties of the derived activation matrices. First, horizontal activation patterns cause stuttering artifacts in the mosaics.
Second, too many simultaneous activations lead to phase cancellations and overfitting of the spectral shapes. Third, the source's temporal characteristics are destroyed by activating source frames independently of each other. We therefore introduce additional update rules to approach these issues, see also Figure 3.

3.1 Avoiding repeated activations

To avoid activating the same spectral frame of the source in subsequent time instances, the idea is to only keep the highest activations in a horizontal neighborhood of the matrix $H$, suppressing the remaining values. However, we do not want to interfere too much with the actual learning process in the first few update iterations. The amount of suppression applied to the smaller values is therefore dependent on the iteration index $\ell$. Given the activation matrix $H^{(\ell)}$, the size of a horizontal neighborhood $r$, and the number of iterations $L$, we compute a repetition restricted activation matrix $R^{(\ell)}$ by

$R_{km}^{(\ell)} = \begin{cases} H_{km}^{(\ell)} & \text{if } H_{km}^{(\ell)} = \mu_{km}^{r,(\ell)}, \\ H_{km}^{(\ell)} \big(1 - \frac{\ell+1}{L}\big) & \text{otherwise}, \end{cases}$  (3)

with $\ell \in [1:L-1]$ and $\mu_{km}^{r,(\ell)}$ being the maximum value of $H^{(\ell)}$ in a horizontal neighborhood

$\mu_{km}^{r,(\ell)} = \max\big(H_{k(m-r)}^{(\ell)}, \ldots, H_{k(m+r)}^{(\ell)}\big).$  (4)

Note that the suppression of smaller values becomes strict in the last update iteration for $\ell = L - 1$. Intuitively, the parameter $r$ defines the minimal horizontal distance (and therefore the minimal time interval) between two activations of the same source frame. Figure 3b shows the repetition restricted activation matrix $R^{(\ell)}$ derived from the toy example activation matrix shown in Figure 3a, using $r = 2$, $\ell = 8$, and $L = 10$. As opposed to $H^{(\ell)}$, there are no two dominant values next to each other in $R^{(\ell)}$.

3.2 Restricting the number of simultaneous activations

Next, we address the problem of too many simultaneous activations. Setting a limit $p \in \mathbb{N}$ on the number of activations in one column of the activation matrix, we compute a polyphony restricted activation matrix $P^{(\ell)}$ in a similar manner as $R^{(\ell)}$ by

$P_{km}^{(\ell)} = \begin{cases} R_{km}^{(\ell)} & \text{if } k \in \Omega_m^{p,(\ell)}, \\ R_{km}^{(\ell)} \big(1 - \frac{\ell+1}{L}\big) & \text{otherwise}, \end{cases}$  (5)

where $\Omega_m^{p,(\ell)}$ contains the indices of the $p$ highest values in the $m$th column of $R^{(\ell)}$. The parameter $p$ can be directly interpreted as the desired degree of polyphony in the mosaic. For example, setting $p = 1$ results in a mosaic where the source sounds are not heavily superimposed but mainly concatenated to mimic the most dominant features of the target. In Figure 3c, we see the polyphony restricted activation matrix $P^{(\ell)}$ derived from $R^{(\ell)}$, using $p = 1$. One can see that in $P^{(\ell)}$ there is (at most) one single dominant value left in every column.

3.3 Supporting time-continuous activations

To support the development of diagonal structures that activate successive frames of the source, we now compute a continuity enhancing activation matrix $C^{(\ell)}$. The idea here is to convolve the matrix $P^{(\ell)}$ with a diagonal kernel. Choosing $c \in \mathbb{N}$, which defines the length of the kernel, we compute

$C_{km}^{(\ell)} = \sum_{i=-c}^{c} P_{(k+i)(m+i)}^{(\ell)}.$  (6)

Intuitively, the length $2c + 1$ of the kernel defines the minimal number of source frames that we would like to successively activate. Figure 3d shows the matrix $C^{(\ell)}$ for our toy example, computed with $c = 2$. Note that in $C^{(\ell)}$ the number of simultaneous dominant activations may locally exceed the limit which was imposed in the computation of the polyphony restricted activation matrix $P^{(\ell)}$. In practice, this is however not a problem and even desirable since this way, the diagonal structures can overlap with each other to some degree. Therefore, the corresponding audio segments of the source are overlapped in the final mosaic as well, leading to smooth transitions between them.

3.4 Adapting the activations to fit the target

Finally, we perform the standard NMF update step to let the mosaic adapt to the target again. Similarly to Equation (2), we compute the activation matrix for the next iteration by

$H_{km}^{(\ell+1)} = C_{km}^{(\ell)} \, \frac{\sum_n W_{nk}\, V_{nm}/(W C^{(\ell)})_{nm}}{\sum_n W_{nk}}.$  (7)

In summary, a single update step of the activation matrix $H$ is computed by applying Equations (3), (5), (6), and (7) sequentially. Note that in one update iteration, the three intermediate update rules (3), (5), and (6) are insensitive to the target and therefore may increase the distance measure of Equation (1). However, as already discussed in Section 1, we are not interested in minimizing this measure, but trade some approximation accuracy for a better preservation of the source's timbre. In practice, our procedure usually yields an activation matrix that, when multiplied with the source spectrogram, approximates the target spectrogram to a sufficient degree, while preserving the source's timbre in the mosaic much better than the basic procedure described in Section 2.

Figure 4. The activation matrix $H$ for the mosaic of "Let it be" with a recording of bees, in different states of the iterative learning procedure. (a): $H^{(1)}$. (b): $H^{(3)}$. (c): $H^{(6)}$. (d): $H^{(10)}$. The repetition restricting neighborhood is indicated in red.

Figure 4 shows an excerpt of the activation matrix $H$ of our running example "Let it be" for several iteration indices $\ell$. Here, we set the repetition restriction parameter to $r = 3$, the limit of simultaneous activations to $p = 10$, the kernel parameter to $c = 3$ (resulting in a diagonal kernel of length 7), and the number of update iterations to $L = 10$. Figure 4a shows the random initialization of the activation matrix $H^{(1)}$.
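A single iteration of the extended procedure, applying Equations (3)/(4), (5), (6), and (7) in sequence, can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own naming (`nmf_update`, `extended_update`), not the authors' implementation; the basic procedure of Section 2 corresponds to iterating `nmf_update` alone.

```python
import numpy as np

def nmf_update(A, V, W, eps=1e-12):
    """Multiplicative Kullback-Leibler update of Equations (2)/(7):
    scale the activations A by (W^T (V / (W A))) / (W^T 1)."""
    return A * (W.T @ (V / (W @ A + eps))) / (W.sum(axis=0)[:, None] + eps)

def extended_update(H, V, W, ell, L, r=3, p=10, c=3):
    """One iteration of the extended procedure: Equations (3)/(4), (5), (6),
    then (7). `ell` is the 1-indexed iteration counter, ell in [1 : L-1]."""
    K, M = H.shape
    decay = 1.0 - (ell + 1) / L          # suppression factor, 0 when ell = L-1

    # Equations (3)/(4): keep only the maximum of each horizontal
    # neighborhood of size 2r + 1, shrink the remaining entries.
    padded = np.pad(H, ((0, 0), (r, r)), constant_values=-np.inf)
    mu = np.max([padded[:, i:i + M] for i in range(2 * r + 1)], axis=0)
    R = np.where(H == mu, H, H * decay)

    # Equation (5): keep the p largest entries of each column, shrink the rest.
    order = np.argsort(R, axis=0)        # column-wise, ascending
    P = R * decay
    rows = order[-min(p, K):, :]         # indices (Omega_m) of the top-p values
    cols = np.broadcast_to(np.arange(M), rows.shape)
    P[rows, cols] = R[rows, cols]

    # Equation (6): convolve with a diagonal kernel of length 2c + 1
    # (zero padding outside the matrix borders).
    C = np.zeros_like(P)
    for i in range(-c, c + 1):
        ka, kb = max(0, -i), K - max(0, i)
        ma, mb = max(0, -i), M - max(0, i)
        C[ka:kb, ma:mb] += P[ka + i:kb + i, ma + i:mb + i]

    # Equation (7): let the activations adapt to the target again.
    return nmf_update(C, V, W)
```

After $L - 1$ such iterations, the learned $H$ would be multiplied with the complex-valued source spectrogram, $X_\mathrm{mos} := X_\mathrm{src} H$, as described in Section 2.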
After two iterations, one can already notice diagonal patterns in $H^{(3)}$, see Figure 4b. Figure 4c shows the activations after another three update iterations. The diagonal patterns in $H^{(6)}$ are even more prominent and one can observe that separate diagonal structures start to emerge, leaving regions of lower values in between them. In Figure 4d, the activation matrix $H^{(10)}$ is shown. In this final activation matrix, four clear diagonal structures have emerged. The remaining activations are outside the visible range. Looking at the two upper diagonals, one can see that although they seem to be rather close together, they obey the repetition restricting horizontal neighborhood indicated in red. Furthermore, it is noteworthy that the length of the diagonals greatly exceeds the length of the diagonal kernel. For example, while we used a diagonal kernel of length 7, the lowest diagonal has a length of 25 non-zero activations, corresponding to an audio segment in the source of roughly one second. This means that the procedure uses a whole one-second patch of source audio material to recreate the target between second 17 and 18.

Figure 5. The effect of diagonal activation patterns. (a): Spectrogram of the target recording "Let it be". (b): Spectrogram of the source recording of buzzing bees. (c): Activation matrix $H$ derived with the basic approach. (d): Activation matrix $H$ derived with the extended set of update rules. (e): Spectrogram of the audio mosaic resulting from the basic approach. (f): Spectrogram of the audio mosaic resulting from the extended procedure.

4. EXPERIMENTS AND EXAMPLES

In this section, we both visually and acoustically demonstrate the effectiveness of our proposed method. As discussed in previous sections, the main drawbacks of the basic audio mosaicing approach described in Section 2 were both the loss of temporal characteristics and spectral shapes of the source sounds in the resulting audio mosaics. The idea was to approach these problems by supporting the development of sparse diagonal structures in the activation matrix with an extended set of update rules. In the following, we exemplify how these structures can preserve the source's desired characteristics in the audio mosaic.

Figure 6. Comparison of spectral shapes. (a): A single spectral frame of the target recording ("Let it be"). Harmonics are indicated by red circles. (b): The spectral frame of the mosaic computed with the basic procedure at the same temporal position. Harmonics which are present in both the original frame as well as in the mosaic are indicated by red circles. (c): The spectral frame of the mosaic computed by using the extended set of update rules.

4.1 Preserving temporal characteristics of the source

In Figure 5, we once again revert to our running example. Here, spectrogram excerpts of the target recording "Let it be" as well as the source recording of buzzing bees are shown in Figures 5a and 5b, respectively. The spectrogram of the target recording exhibits sounds with very stable pitches, resulting from the solo piano at the beginning of the song. In contrast, the buzzing of the bees leads to rather strong amplitude modulations that are characteristic for the sound. Figure 5c shows an excerpt of the activation matrix $H$ as derived by the basic NMF-inspired audio mosaicing procedure.
In this excerpt of $H$, only two different spectral frames of the source are activated repeatedly by the procedure to mimic the stable pitch of the piano sound. The resulting spectrogram of the audio mosaic, shown in Figure 5e, approximates the target's spectrogram quite precisely. However, the characteristic pitch modulations of the buzzing bee sound are lost almost completely. Looking at Figure 5d, one can see the activation matrix $H$ derived by our proposed procedure based on the extended set of update rules. The diagonal patterns shown activate segments of the source that have a duration of roughly half a second. As can be seen by comparing the regions marked in red in the source (Figure 5b) and the mosaic spectrogram (Figure 5f), the temporal structures of these segments are preserved in the mosaic. While the mosaic computed with the extended set of update rules exhibits a lot of pitch modulations, which reflect the preserved timbre of the buzzing bee sound, the tonal content as well as rhythmic structures of the target are still maintained. For example, the two strong partials of the target recording at around 270 Hz and 300 Hz in Figure 5a are also visible in the audio mosaic in Figure 5f, only this time pitch modulated. Similarly, the onset in the target at second 2.6 is present in the mosaic as well.

Target | Description
LetItBe | An excerpt of the song "Let it be" by the Beatles (piano & singing).
GuteNacht | An excerpt of "Gute Nacht" by Franz Schubert which is part of the romantic Winterreise song cycle, taken from [15].
FunkJazz | An excerpt from a jazz piece performed by the band "Music Delta" (saxophone, synthesizer, bass, and drums), taken from [2].
Stepdad | Excerpt from the song "My leather, my fur, my nails" by the pop band Stepdad (synthesizers, drums, and singing).
Freischütz | Excerpt from the opera "Der Freischütz" by Carl Maria von Weber (full orchestra, applause at the end).
Vermont | An excerpt of the song "Vermont" by the band "The Districts" (singing, guitar, bass, and drums), taken from [2].

Source | Description
Bees | Recording of a buzzing swarm of bees.
Wind | Recording of howling wind.
Whales | Recording of whale songs and whale sounds.
Chainsaw | Recording of a chainsaw's sawing and engine sounds.
AirRaid | Recording of an air raid siren.
RaceCars | Recording of engine sounds of starting race cars.

Table 1. List of target and source recordings used in our experiments.

4.2 Preserving spectral shapes of the source

In Figure 6, we investigate typical spectral shapes of the target as well as the mosaic for our running example. Figure 6a shows the spectral frame of the target's spectrogram at second 4.6 as a frequency-magnitude plot. One can see the harmonic structure with several clear partials in this frame, resulting from the piano sound in the target. The corresponding spectral frame of the mosaic computed by the basic procedure, shown in Figure 6b, exhibits a very similar spectral structure. Most of the harmonics visible in the target are also present in this frame (indicated by the red circles) and even the relations between peak heights are often preserved. In contrast, the spectral frame of the mosaic computed with the extended set of update rules only roughly corresponds to the spectral shape of the target frame, see Figure 6c. However, some of the dominant peaks in the target frame are still present in the mosaic, leading to a sound that captures only the dominant tonal characteristics of the target. The noisy timbre of the buzzing bees, visible by the increased noise level in the frame, is therefore preserved.

4.3 Audio examples

In order to also give an auditory demonstration of our method, we set up an accompanying website for this paper at [7]. On this website, one finds the target recordings as well as source recordings listed in Table 1.
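The time-domain signals for all of these examples are obtained from the complex mosaic spectrograms by the iterative STFT inversion of [9]. Such a phase-estimation loop can be sketched with a plain NumPy overlap-add STFT pair; the helper names (`stft_mat`, `istft_mat`, `griffin_lim`) and the Hann window with 50% overlap are our own illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def stft_mat(x, win, hop):
    """Frame-wise STFT as a (frequency x time) matrix."""
    n = len(win)
    frames = 1 + (len(x) - n) // hop
    return np.stack([np.fft.rfft(win * x[i * hop:i * hop + n])
                     for i in range(frames)], axis=1)

def istft_mat(X, win, hop):
    """Least-squares inverse STFT via weighted overlap-add."""
    n, m = len(win), X.shape[1]
    x = np.zeros((m - 1) * hop + n)
    norm = np.zeros_like(x)
    for i in range(m):
        x[i * hop:i * hop + n] += win * np.fft.irfft(X[:, i], n=n)
        norm[i * hop:i * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, win, hop, n_iter=20, seed=0):
    """Iterative phase estimation in the spirit of Griffin and Lim [9]:
    alternate between the given magnitudes and the phases of a signal
    that is consistent with the current spectrogram estimate."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phases
    for _ in range(n_iter):
        x = istft_mat(mag * phase, win, hop)
        phase = np.exp(1j * np.angle(stft_mat(x, win, hop)))
    return istft_mat(mag * phase, win, hop)
```

In this sketch, the loop would be started from the magnitudes (or phases) of the mosaic spectrogram rather than from random phases when inverting an actual mosaic.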
To ensure that each source recording offers an adequate pitch range, we computed several pitch-shifted versions of it (using a pitch-shifting algorithm from [6]) and concatenated them. For each pair of target and source, we then generated an audio mosaic using both the basic mosaicing procedure described in Section 2 and the procedure based on the extended set of update rules proposed in Section 3. For these experiments, we used music recordings sampled at 22050 Hz, an STFT frame length of 2048 samples, and a hop size of 1024 samples to compute the spectrograms. In order to derive the activation matrices for both procedures, we performed L = 20 iterations of the respective update steps. For the extended set of update rules, we set the repetition restriction parameter to r = 3, the limit of simultaneous activations to p = 10, and the kernel parameter to c = 3. To reconstruct time-domain signals from the derived complex-valued mosaic spectrograms, we finally performed 20 iterations of the STFT inversion procedure proposed in [9].

5. CONCLUSION AND FUTURE WORK

In this work, we presented a novel approach for automatically generating an audio mosaic of a target recording using the sounds from a source recording. The core idea of this NMF-inspired procedure was to learn an activation matrix that, when multiplied with the spectrogram of the source recording, yields the spectrogram of the mosaic recording. As our main technical contribution, we proposed an extended set of update rules that supports the development of sparse diagonal structures in the activation matrix during the learning process. Our experiments showed that these diagonal activation structures correspond to the activation of whole sequences of spectral frames and help to preserve timbral characteristics of the source in the mosaic. In future work, we want to investigate if our proposed procedure can also be applied in scenarios beyond audio mosaicing.
One possibility is to examine whether supporting the development of diagonal structures in the activation matrix can also be beneficial when learning not only the activation matrix but also the template matrix. Such an NMF procedure could be applied for learning and identifying repeating patterns in feature sequences, similar to [24], where techniques based on NMFD were used for this task. In this context, we hope that our approach may yield a simpler implementation as well as more flexibility, since the maximal length of sequences does not need to be fixed.

Acknowledgments: This work has been supported by the German Research Foundation (DFG MU 2686/6-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institut für Integrierte Schaltungen. Furthermore, we would like to thank Colin Raffel and the other organizers of the HAMR Hack Day at ISMIR 2014, where the core ideas of the presented work were born.

6. REFERENCES

[1] G. Bernardes. Composing Music by Selection: Content-Based Algorithmic-Assisted Audio Composition. PhD thesis, Faculty of Engineering, University of Porto, 2014.

[2] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 155–160, Taipei, Taiwan, October 2014.

[3] G. Coleman, E. Maestre, and J. Bonada. Augmenting sound mosaicing with descriptor-driven transformation. In Proc. of the International Conference on Digital Audio Effects (DAFx), Graz, Austria, 2010.

[4] J. M. Comajuncosas, A. Barrachina, J. O'Connell, and E. Guaus. Nuvolet: 3D gesture-driven collaborative audio mosaicing. In Proc. of the International Conference on New Interfaces for Musical Expression, pages 252–255, Oslo, Norway, 2011.

[5] E. Costello, V. Lazzarini, and J. Timoney.
A streaming audio mosaicing vocoder implementation. In Proc. of the 16th International Conference on Digital Audio Effects (DAFx), Maynooth, Ireland, September 2013.

[6] J. Driedger and M. Müller. TSM Toolbox: MATLAB implementations of time-scale modification algorithms. In Proc. of the International Conference on Digital Audio Effects (DAFx), pages 249–256, Erlangen, Germany, 2014.

[7] J. Driedger, T. Prätzlich, and M. Müller. Accompanying website: Let it bee – towards NMF-inspired audio mosaicing. http://www.audiolabs-erlangen.de/resources/MIR/2015-ISMIR-LetItBee/.

[8] J. Eggert and E. Körner. Sparse coding and NMF. In Proc. of the IEEE International Joint Conference on Neural Networks, volume 4, pages 2529–2533, July 2004.

[9] D. W. Griffin and J. S. Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):236–243, 1984.

[10] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

[11] J. Janer and M. de Boer. Extending voice-driven synthesis to audio mosaicing. In Proc. of the 5th Sound and Music Computing Conference, Berlin, Germany, July 2008.

[12] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proc. of the IEEE International Conference on Data Mining (ICDM), pages 353–362, Pisa, Italy, 2008.

[13] R. Kobayashi. Sound clustering synthesis using spectral data. In Proc. of the International Computer Music Conference (ICMC), Singapore, 2003.

[14] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Proc. of the Neural Information Processing Systems (NIPS), pages 556–562, Denver, USA, 2000.

[15] M. Müller, V. Konz, W. Bogler, and V. Arifi-Müller. Saarland music data (SMD). In Proc. of the International Society for Music Information Retrieval Conference (ISMIR): Late Breaking session, 2011.

[16] J. Oswald. Plexure. CD, 1993.
http://www.allmusic.com/album/plexure-mw0000621108.

[17] N. Schnell, M. A. S. Cifuentes, and J.-P. Lambert. First steps in relaxed real-time typo-morphological audio analysis/synthesis. In Proc. of the Sound and Music Computing Conference, Barcelona, Spain, 2010.

[18] D. Schwarz. A system for data-driven concatenative sound synthesis. In Proc. of the International Conference on Digital Audio Effects (DAFx), Verona, Italy, July 2000.

[19] D. Schwarz. Concatenative sound synthesis: The early years. Journal of New Music Research, 35(1), March 2006.

[20] P. Smaragdis. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In Independent Component Analysis and Blind Signal Separation, volume 3195 of Lecture Notes in Computer Science, pages 494–499. Springer Berlin Heidelberg, 2004.

[21] P. Smaragdis, B. Raj, and M. Shashanka. Sparse and shift-invariant feature extraction from non-negative data. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2069–2072, Las Vegas, Nevada, USA, 2008.

[22] P. A. Tremblay and D. Schwarz. Surfing the waves: Live audio mosaicing of an electric bass performance as a corpus browsing interface. In Proc. of the International Conference on New Interfaces for Musical Expression, pages 447–450, Sydney, Australia, September 2010.

[23] T. Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066–1074, 2007.

[24] R. J. Weiss and J. P. Bello. Unsupervised discovery of temporal structure in music. IEEE Journal of Selected Topics in Signal Processing, 5:1240–1251, 2011.