Journal on Multimodal User Interfaces manuscript No.
(will be inserted by the editor)
From expressive gesture to sound
The development of an embodied mapping trajectory inside a musical interface
Pieter-Jan Maes · Marc Leman · Micheline Lesaffre · Michiel Demey · Dirk
Moelants
Received: date / Accepted: date
Abstract This paper contributes to the development of a
multimodal, musical tool that extends the natural action
range of the human body to communicate expressiveness
into the virtual music domain. The core of this musical tool
consists of a low-cost, highly functional computational model, developed on the Max/MSP platform, that (1) captures real-time movement of the human body into a 3D coordinate system on the basis of the orientation output of any type of inertial sensor system that is OSC-compatible, (2) extracts low-level movement features that specify the amount of contraction/expansion as a measure of how a subject uses the surrounding space, (3) recognizes these movement features as expressive gestures, and (4) creates a mapping trajectory between these expressive gestures and the sound synthesis process of adding harmonically related voices to an originally monophonic voice. The concern for a user-oriented and intuitive mapping strategy was thereby of central importance. This was achieved by conducting an empirical experiment based on theoretical concepts from the embodied music cognition paradigm. Based on empirical evidence, this paper proposes a mapping trajectory that facilitates the interaction between a musician and his instrument, the artistic collaboration between (multimedia) artists, and the communication of expressiveness in a social, musical context.

Keywords multimodal interface · mapping · inertial sensing technique · usability testing

P.-J. Maes · M. Leman · M. Lesaffre · M. Demey · D. Moelants
IPEM - Dept. of Musicology, Ghent University, Blandijnberg 2, B-9000 Ghent
Tel.: +32 (0)9 264 4126
Fax: +32 (0)9 264 4143
E-mail: [email protected]

M. Leman
E-mail: [email protected]

M. Lesaffre
E-mail: [email protected]

M. Demey
E-mail: [email protected]

D. Moelants
E-mail: [email protected]
1 Introduction
1.1 Theoretical framework
Playing music requires the control of a multimodal interface, namely, the music instrument, that mediates the transformation of bio-mechanical energy to sound energy, using
feedback loops based on different sensing channels, such as
auditory, visual, haptic and tactile channels. In recent decades, much attention has been devoted to the development of electronic multimodal interfaces for music [4, 6, 9]. The basic problem of these interfaces is that the mediation between the different modalities (basically from movement to sound) has an arbitrary component, because the energies of the modalities are transformed into electronic signals. This is in contrast with traditional instruments, where energetic modalities are mechanically mediated and where the user gets a natural feeling of
the causality of the multimodal interface.
So far, research has focused on theoretical reflections on the possible connection between movement and sound [41, 5, 1, 13, 23], and especially on the development of all kinds of electronic multimodal interfacing technologies, including models and practical designs for the mapping of performed action to sound synthesis parameters [35, 21, 2, 38, 20]. However, relatively little attention has been devoted to the idea of an empirical solution to the mapping problem.
Such a solution would be based on experiments that probe
the natural tendencies for multimodal mappings. The theoretical grounding for such an approach is found in the embodied music cognition paradigm, in which the importance of
gesture and corporeal imitation as a basis of understanding
musical expressiveness is stressed [27, 28]. In this view, multimodal interfaces for music are approached in terms of mediation technologies that connect with action-relevant cues
(i.e. affordances). Musical intentionality and expressiveness can then be attributed through a mirroring process that relates these cues to the subject's own action-oriented ontology.
The present paper aims at identifying mapping strategies
for multimodal music interfaces, using these concepts of
embodiment, action-perception coupling and action-oriented ontology as a starting point. The goal is to better understand bodily movement in relation to sound from the viewpoint of the peripersonal space, that is, the space immediately surrounding a person’s body which can be reached by
the limbs [40, 19]. This study will focus on the relation between the dynamically changing pattern of the upper limbs
in terms of contraction and expansion and the communication of musical expressiveness. The choice for this particular movement feature is made in line with findings of previous research indicating that: (I) music intuitively stimulates
movement in the peripersonal space [34, 17, 18, 12], (II) the
human body is an important channel of affective communication [15, 16, 11, 3, 37], (III) the upper body features are
most significant in conveying emotion and expressiveness
[24], (IV) the movement size and openness can be related
to the emotional intensity of the musical sound production
[14, 7], and (V) an open body position in contrast to a closed
body position reinforces the communicator's intent to persuade [31, 30]. It is assumed that a better understanding of
the connection between this spatio-kinetic movement feature and expressive features in relation to multimodal interfaces may provide a cue to the solution of the mapping
problem in a number of application contexts.
1.2 Methodological framework

The methodology of this paper follows the layered, conceptual framework proposed by Camurri [6, 8, 9, 27]. This model starts with modelling movement on a purely physical level, followed by the extraction of features that are subsequently mapped into gestural trajectories and linked with high-level structures and concepts. This framework makes it possible to establish connections between the sensory, the gestural and the semantic level.

The first part of this paper focuses on the development of a low-cost, highly functional Max/MSP algorithm that enables measurement and modelling of human movement on the basis of the orientation output of inertial sensing systems. The platform is innovative in that it supports basically every sensor that outputs orientation data. The second part of this paper integrates this modelling platform into an empirical-experimental framework. An experiment is conducted in order to study the natural, corporeal resonance behaviour of subjects in response to a specific musical feature integrated in four different pre-recorded sound stimuli. The musical feature under investigation is termed the one-to-many alternation. It concerns the musical process of adding, and subsequently removing, extra harmonically related voices to an originally monophonic voice. As a result, this study aims to demonstrate the effect of the contrast between a solo voice and multiple harmonic voices on bodily movement behaviour. After purely physical measurements of human movement, low-level movement features are extracted. Based on the effort/shape theory of Laban [26], a particular interest lies in spatio-kinetic movement features that measure the amount of contraction/extension of the upper limbs in the peripersonal space or kinesphere of subjects. This feature will be defined by taking into account the distance between (1) the elbows relative to each other and (2) the wrists relative to each other. Then, the extracted movement features are investigated in relation to the auditory feature and mapped into gesture trajectories that describe the one-to-many alternation. A next layer defines how these specific, spatio-kinetic gestural trajectories can be associated with an expressive content, forming expressive gestures [25, 4, 8]. This will be realised by integrating verbal descriptions related to emotion and expression into a model for semantic description of music [29]. The gestural trajectory may then be connected with parameters that control the synthesis of, in principle, every kind of energetic modality [27]. However, in this paper we limit these possibilities to the proposal of a system that enables the recognition and extraction of expressive gestures that are subsequently mapped to control a sound process that corresponds with the one-to-many alternation.

2 Technical Setup - Motion Sensing and 3D position determination
Inertial sensors were used to measure the movement of the
upper body of the user. The sensors enable a mobile setup,
which is often useful in an artistic context. There is no problem of visual occlusion of marker points or lighting conditions, although the large mass and shape of the sensors can
be problematic. Inertial sensors do not provide an absolute
3D position, but it is possible to determine the relative 3D
position of the joints of the upper body with respect to a
fixed point on the body, using only the orientation output of
the sensors together with fixed lengths of the different body
parts. In what follows the motion sensors are described in
more detail, together with the software used and the different steps in the algorithm that determines the relative 3D
position of the upper body.
In this study, five commercial inertial sensors are used
from Xsens (MTx XBus Kit). They are positioned on the
chest, the upper arms and the lower arms as shown in figure 1 (above). Flexible straps with Velcro are used to attach the sensors to the body. The sensors are daisy chained
with wires and connected to a battery powered central unit
(XBus) that is attached to the subjects hip with a belt. From
this central unit the data is transmitted wirelessly over a
Bluetooth connection. This setup enables a sampling rate of
25Hz of the quaternion information of the five motion sensors. This data is collected on a MacBook Pro laptop running a standalone program that collects the data from the
Bluetooth port and converts this into the OSC protocol. The
sensor data is then sent to a Max/MSP program that calculates the relative 3D position of the four joints (the elbows
and wrists) of the upper body. A visualization, generated in
Jitter, is shown in figure 1 (below).
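For readers who want to tap into this data stream outside of Max/MSP, the following minimal Python sketch shows how such OSC messages could be received with the python-osc package. The address pattern /xsens/quaternion, the port number and the message layout (a sensor id followed by four quaternion components) are illustrative assumptions; the actual address scheme is defined by the converter program mentioned in the acknowledgements.

    # A minimal sketch (not the authors' Max/MSP patch) of an OSC receiver
    # for quaternion data, using the python-osc package. The OSC address
    # "/xsens/quaternion" and the message layout are assumptions.
    from pythonosc.dispatcher import Dispatcher
    from pythonosc.osc_server import BlockingOSCUDPServer

    quaternions = {}  # latest (w, x, y, z) per sensor id

    def on_quaternion(address, sensor_id, w, x, y, z):
        # Store the most recent orientation sample for this sensor.
        quaternions[int(sensor_id)] = (w, x, y, z)

    dispatcher = Dispatcher()
    dispatcher.map("/xsens/quaternion", on_quaternion)

    # Listen on the local UDP port that the converter program sends to.
    server = BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher)
    server.serve_forever()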
Fig. 1 Above: person equipped with five inertial MTx sensors from Xsens on the upper body. Below: a stick-figure representation of the upper body generated in Jitter (Cycling '74). The axis labels X, Y and Z indicate the local coordinate system.

To calibrate the system, the subject has to stand with horizontally stretched arms, so that a T-shape is formed by the torso and the arms. In this posture all sensors are oriented in the same way, which allows a calibration of the system. The 3D coordinate system attached to the body of the subject is called the local system. Its vertical direction is defined as the Y-axis; the horizontal direction is the X-axis, pointing to the left from the viewpoint of the user and coinciding with the arms when the T-pose is assumed; and the Z-axis is defined as pointing forward in the horizontal plane. As such, a right-handed coordinate system is defined. The origin of this system is located on the torso, in the middle of the line connecting the shoulders. The orientation of this local coordinate system is fixed by the orientation of the motion sensor attached to the chest.

The position of the shoulders is then fixed on the X-axis at a distance of 17 cm from the origin, resulting in coordinates (−0.17, 0.0, 0.0) and (0.17, 0.0, 0.0), where the units are expressed in meters. The lengths of the upper arms and lower arms are fixed to 26 cm and 27 cm respectively. To obtain the position of the elbows, one has to calculate the relative orientation of the upper arm with respect to that of the chest. This is obtained through the Hamiltonian product of the quaternion of the chest (qc) and that of the upper arm (qu), where the first quaternion is conjugated:

q = qc* · qu    (1)
The resulting quaternion is converted into three Euler angles, namely the pitch (rotation around the Y-axis), roll (rotation around the X-axis) and yaw (rotation around the Z-axis).
When taking the biomechanical constraints of the shoulder
joint into account, one can see that there are only two rotations that influence the position of the elbow. These rotations
are the flexion/extension (rotation in the horizontal plane
around the vertical direction) and abduction/adduction (rotation in the vertical plane around the horizontal direction).
Using the Euler angles, the length of the upper arm and the position of the shoulders, the 3D position of the elbow in the local coordinate system can then be calculated.
Once the positions of the elbows are calculated, the position of the wrists can be obtained in the same way as described above, using the relative difference in orientation between the upper arm and the lower arm. For the case of the
elbow joint there is only the flexion/extension rotation that
has an effect on the position of the wrist. The result of this rotation is a vector w describing the position of the wrist
in reference to the elbow. A crucial step is the transformation
of this vector w from the frame oriented in the elbow to the
local coordinate system with the orientation of the sensor on
the chest. This transformation is accomplished through the
use of the following formula:
w′ = q · w · q*    (2)
where q is the relative orientation of the upper arm qu with
respect to the chest qc as defined in equation 1 and w′ is the
resulting 3D position of the wrist in the orientation of the local coordinate system. To make the implementation of formula 2 clearer, one can write out the quaternions explicitly as follows:

w′ = (qw, iqx, jqy, kqz) · (0, iwx, jwy, kwz) · (qw, −iqx, −jqy, −kqz).    (3)

By making use of the Hamiltonian product and the calculus of quaternions, one obtains the following result:

w′x = wx (qx qx + qw qw − qy qy − qz qz) + wy (2 qx qy − 2 qw qz) + wz (2 qx qz + 2 qw qy)    (4)

w′y = wx (2 qw qz + 2 qx qy) + wy (qw qw − qx qx + qy qy − qz qz) + wz (−2 qw qx + 2 qy qz)    (5)

w′z = wx (−2 qw qy + 2 qx qz) + wy (2 qw qx + 2 qy qz) + wz (qw qw − qx qx − qy qy + qz qz).    (6)
When the resulting coordinates of vector w′ are added to the coordinates of the elbow, the coordinates of the wrist in
the local frame are obtained. From this point on the 3D coordinates of the joints of the upper body are fully determined
and can be used both in the visualization of an avatar and in
the measurement of the expressive movement of the subject.
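The computation described in equations 1-6 can be condensed into a few lines of code. The following Python/NumPy sketch (not the authors' Max/MSP implementation) uses an equivalent vector form: each segment's rest-pose vector, which lies along the X-axis in the calibration T-pose, is rotated by that segment's orientation relative to the chest sensor, and the resulting offsets are chained from shoulder to elbow to wrist. The variable names and the (w, x, y, z) quaternion ordering are assumptions.

    import numpy as np

    def q_conj(q):
        # Conjugate of quaternion q = (w, x, y, z).
        w, x, y, z = q
        return np.array([w, -x, -y, -z])

    def q_mult(a, b):
        # Hamiltonian product a · b of two quaternions.
        aw, ax, ay, az = a
        bw, bx, by, bz = b
        return np.array([aw*bw - ax*bx - ay*by - az*bz,
                         aw*bx + ax*bw + ay*bz - az*by,
                         aw*by - ax*bz + ay*bw + az*bx,
                         aw*bz + ax*by - ay*bx + az*bw])

    def rotate(q, v):
        # w' = q · (0, v) · q*  (cf. equations 2-6).
        p = np.concatenate(([0.0], v))
        return q_mult(q_mult(q, p), q_conj(q))[1:]

    def left_arm_positions(q_chest, q_upper, q_lower):
        # Relative 3D positions of the left elbow and wrist in the local frame.
        shoulder = np.array([0.17, 0.0, 0.0])      # fixed on the X-axis (left)
        upper_rest = np.array([0.26, 0.0, 0.0])    # upper arm along X in T-pose
        lower_rest = np.array([0.27, 0.0, 0.0])    # lower arm along X in T-pose
        q_u = q_mult(q_conj(q_chest), q_upper)     # eq. 1: upper arm rel. chest
        q_l = q_mult(q_conj(q_chest), q_lower)     # lower arm rel. chest
        elbow = shoulder + rotate(q_u, upper_rest)
        wrist = elbow + rotate(q_l, lower_rest)
        return elbow, wrist

For the right arm, the shoulder position and rest vectors point along the negative X-axis; the joint distances used in section 4 then follow directly from these positions.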
3 Set-up of the empirical-experimental framework
As stated in section 1, the experiment proposed in this paper starts from the embodied music cognition theory, which holds that musical involvement relies on motor imitation [27]. That is why this experiment investigates the motor resonance behaviour of subjects in response to the musical process called the one-to-many alternation (see section 1). The hypothesis of the experiment is that there exists a coupling between the auditory information received from the pre-recorded sound stimuli and the motor behaviour as a spontaneous reaction to this sensory input (cf. sensorimotor coupling). Moreover, it is assumed that (1) the spatial characteristics of this motor reaction, in terms of contraction and expansion of the upper limbs in peripersonal space, show commonalities over the different subjects participating in the experiment, and (2) the sensorimotor coupling arouses a similar expressive percept among
the different subjects. If this is indeed the case, knowledge
about this sensorimotor coupling could provide grounds for
the development of a mapping trajectory facilitating an intuitive, expressive interaction between man, machine and social/artistic environment.
3.1 Subjects
Twenty-five subjects, aged 21-27 (mean: 23.2), 5 male and 20 female, participated in the experiment. Twenty of them reported having had at least seven years of musical education; four of the non-musicians and two of the musicians had between 1 and 10 years of dance education.
3.2 Auditory stimuli

The experiment presented in this study investigates the effect
bodily movement. Therefore, four sound stimuli were made
and recorded in advance of the actual experiment, emphasizing this specific musical feature. They all consist of one single musical pattern repeated over time (see figure 2), characterised by the appearance and subsequent disappearance of extra harmonic voices on an originally single voice (the musical feature termed the one-to-many alternation). The extra voices are the harmonic third and fifth. The starting tone is always A4 (440 Hz). As a result, a single voice gradually builds up to a triad and subsequently fades back to a single sounding voice. This effect of harmonization, applied in advance of the experiment, was realised differently for the different stimuli. In the first two stimuli, discrete quarter notes produced by the human voice were used as input to a harmonizer (i.e. MuVoice) that electronically synthesized the extra voices by real-time pitch shifting. The volume of the two extra voices was controlled and recorded manually with the help of a USB controller (Evolution UC-16), such that it resulted in the patterns
defined in figure 2. In the last two stimuli, piano sounds were
used. There, the harmonization of the discrete quarter notes
was applied directly by a pianist by adding respectively the
third and fifth. In order to find out if shared strategies of
movement can be found using different stimulus material,
pitch direction, mode, timbre, rhythm and tempo were varied. In order to avoid confusion with the effect of a rising
or falling pitch on the movements, in two of the stimuli the
triad was added above the starting tone, while in the other
two it was added below the starting tone. Similarly, a possible effect of the mode was dealt with by using a major triad
in the first two stimuli and a minor triad in the other two.
The main characteristics of the four series are summarized in table 1, and a transcription in musical notation is given in figure 2. These show the variation in rhythm and tempo, and the use of two different timbres: a recording of a soprano voice singing A4 in the first two stimuli, and the A4 played on a piano in the last two. The patterns given in figure 2 were repeated between 5 and 14 times, so the total length of one series was always between 45 and 60 seconds.

Table 1 Characteristics of the four musical stimuli.

Stimuli  Timbre  Mode   Direction  Cycle length  Tempo  Repeats
A        Voice   Major  Below      12            90     5
B        Voice   Major  Below      4             90     10
C        Piano   Minor  Above      4             90     10
D        Piano   Minor  Above      4             120    14
Fig. 2 Representation of the auditory stimuli by means of note patterns.
3.3 Procedure
The subjects were equipped with the five inertial sensors (Xsens MTx, XBus Kit) as explained in section 2 (see figure 1). They were instructed to move along with the variation they noticed in the pre-recorded sound stimuli, using their arms and upper body. No details about the stimuli were given, nor any precise specifications of the way to move, in order to ensure that the results arose from a natural and intuitive interaction between the subjects and the auditory stimuli.
Before starting the actual experiment, the participants
performed a test trial. The researchers created and recorded
the test stimulus in advance according to the same principle described in section 3.2. However, a sine wave was used
instead of the human voice or piano sound. This test trial
allowed the subjects to get acquainted with the task and the
feeling of the sensors attached to their body. It also allowed the researchers to check the technical setup and to fix any problems that arose.
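To give a concrete impression of the one-to-many alternation in the stimuli, the following sketch synthesizes a sine-tone version in the spirit of the test stimulus: a root at 440 Hz that is always present, with a third and a fifth whose amplitudes are ramped in and out over one cycle. The envelope shapes and the just-intonation ratios (5/4 and 3/2) are illustrative assumptions, not a reconstruction of the actual recordings.

    import numpy as np

    sr = 44100                      # sample rate (Hz)
    cycle = 8.0                     # duration of one one-to-many cycle (s)
    t = np.arange(int(sr * cycle)) / sr
    phase = t / cycle

    # Triangular envelopes: the third swells in first, the fifth follows,
    # after which both fade out again (one voice -> triad -> one voice).
    env_third = np.clip(2.0 - np.abs(phase - 0.5) * 4.0, 0.0, 1.0)
    env_fifth = np.clip(1.0 - np.abs(phase - 0.5) * 4.0, 0.0, 1.0)

    root = np.sin(2 * np.pi * 440.0 * t)               # A4, always present
    third = env_third * np.sin(2 * np.pi * 440.0 * 5 / 4 * t)
    fifth = env_fifth * np.sin(2 * np.pi * 440.0 * 3 / 2 * t)

    stimulus = (root + third + fifth) / 3.0             # normalized mix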
After the test trial, the subjects performed the same task
while listening to stimulus A. Immediately after the measurement, they were asked to give a verbal, subjective description of how they experienced the evolution in the auditory stimuli using affective adjectives that they could choose
spontaneously at will. This procedure was repeated for
stimuli B, C and D.
4 Results of the experiment

4.1 Analysis of the one-to-many alternation in the auditory stimuli

The intent of this study is to quantify the relation between the musical structure inherent in the four pre-recorded sound stimuli and the bodily response of subjects in terms of the contraction/expansion pattern of the upper limbs in peripersonal space. The quantification of this relation requires a comparison between the two features on a low-level, physical level. Therefore, the musical feature under investigation (i.e. the one-to-many alternation) needs to be extracted from the complex acoustic signal representing each pre-recorded sound stimulus. The process to obtain this feature in physical format is explained in more detail in this section. It demands several operations whereby the complexity of the original acoustic signal is strongly reduced, yet without losing essential information. First, a pitch-tracking algorithm is executed on the acoustic signal of the four pre-recorded sound stimuli in order to obtain the frequencies inherent in the sound as a function of time. By selecting (i.e. filtering out) the fundamental frequencies related to the three different voices (tonic, third, fifth), it could be observed how the presence (i.e. amplitude) of the multiple voices evolves over time, and as such the effect of harmonization termed the one-to-many alternation could be established. Second, a continuous contour is created that gives an indication of the evolving harmonization. This was done by adding the amplitudes of the third and the fifth and normalizing the result from 0 to 1. The minimum value of the signal means that only the root voice is present, the middle value means that root and third are present, and the maximum value means that root, third and fifth are present. The continuous contour is obtained by applying an FFT-based filter to the data in which only the lower phasors are retained. The result (figure 3) is a low-level, physical signal specifying the musical feature, providing a means to quantify the relation between sound and movement.

Fig. 3 Visualisation of the one-to-many alternation in auditory stimulus A. The signal indicated in blue represents the filtered output. The signal indicated in red represents the FFT-smoothed and scaled representation of the blue signal.
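As an indication of how this smoothing step could be implemented, the sketch below takes the summed and normalized amplitude contour as a NumPy array, zeroes all but the lowest FFT bins and scales the inverse transform to the range 0-1. The cutoff of 10 bins is an illustrative assumption.

    import numpy as np

    def one_to_many_contour(amps, keep_bins=10):
        # Smooth the summed third+fifth amplitude signal by keeping only
        # the lowest FFT bins (phasors), then rescale to the range 0-1.
        spectrum = np.fft.rfft(amps)
        spectrum[keep_bins:] = 0.0          # discard the higher phasors
        smooth = np.fft.irfft(spectrum, n=len(amps))
        return (smooth - smooth.min()) / (smooth.max() - smooth.min())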
4.2 Movement feature extraction

The movement feature of interest is related to the amount
of contraction/expansion of the upper body, which is here
expressed by the changing distances between (1) the elbows
relative to each other, and (2) the wrists relative to each
other. An increase in distance means an expansion of the
used peripersonal space and vice versa.
The amount of contraction/expansion is calculated as the Euclidean distance between positions, according to equation 7, where d is the distance and ∆E is the difference in position between the left (E1) and right (E2) elbow in (x, y, z):

d(t) = √(∆Ex(t)² + ∆Ey(t)² + ∆Ez(t)²)    (7)
The signal is then scaled between 0 (i.e. minimum distance between the elbows) and 1 (i.e. maximum distance between the elbows) and smoothed with a Savitzky-Golay FIR
smoothing filter of polynomial order 3 and with a frame size
of 151. The same operations are executed to obtain the contraction/expansion index of the wrists.
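In code, equation 7 together with the scaling and smoothing steps could look as follows (a Python/SciPy sketch; the frame size of 151 samples and the polynomial order 3 are the values given above):

    import numpy as np
    from scipy.signal import savgol_filter

    def expansion_index(left, right, window=151, polyorder=3):
        # Contraction/expansion index from two (n_samples, 3) position
        # arrays, e.g. the left and right elbow trajectories (eq. 7).
        d = np.linalg.norm(left - right, axis=1)     # Euclidean distance
        d = (d - d.min()) / (d.max() - d.min())      # scale to 0..1
        return savgol_filter(d, window, polyorder)   # Savitzky-Golay FIR

The same function applies unchanged to the wrist positions.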
4.3 Cross-correlation between movement feature and
sensory feature
The comparison between (1) extracted movement features
(distance between elbows, or wrists), and (2) auditory stimulus (regularly repeated alternations between tonic, tonic +
third, and tonic + third + fifth) is based on a cross-correlation
analysis, from which the highest correlation coefficient r
is selected within a given time lag interval. A time lag interval is allowed in order to take into account the anticipation/retardation behaviour of the subjects. However, the
value of the time lag is limited to a quarter of a period that
characterizes an alternation between tonic, tonic + third, and
tonic + third + fifth and back. This is done in order not to
cancel anti-phase correlation patterns.
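A sketch of this lag-bounded selection of the highest correlation coefficient, assuming that the movement feature and the musical feature have been resampled to a common rate:

    import numpy as np

    def max_lagged_r(movement, feature, max_lag):
        # Highest Pearson r between two equally long 1-D signals over
        # lags in [-max_lag, max_lag]; max_lag corresponds to a quarter
        # of the alternation period, expressed in samples.
        best = -1.0
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = movement[lag:], feature[:len(feature) - lag]
            else:
                a, b = movement[:lag], feature[-lag:]
            r = np.corrcoef(a, b)[0, 1]
            best = max(best, r)
        return best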
Every subject (N=25) performed the motor-attuning task
on each of the four auditory stimuli. From the captured movement data of each performance, features were extracted
and cross-correlated with the signal that describes the structural musical feature in each stimulus. In this way, for each
of the auditory stimuli, 25 correlation coefficients were obtained, one for the performance of each subject. These
data can be structured in a 25-by-4 data matrix wherein each
column bundles the 25 correlation coefficients of each auditory stimulus. Once this data structure was obtained a statistical distribution was fit to the 25 correlation coefficients
of each of the four columns. These four fitted distributions,
represented by probability distribution functions (PDF), can
be seen in figure 4 (above). The horizontal X-axis locates all the possible values of the correlation coefficient expressing the correlation between movement and musical feature, while the vertical Y-axis describes how likely these values are to occur (i.e. the probability) for each auditory stimulus. The
same operations were executed for the movement feature
that defines the distance between the two wrists (figure 4,
below). Table 2 gives an overview of the means and medians of the eight different distributions.
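The fitted probability distribution functions can be approximated, for instance, with a Gaussian kernel density estimate; the sketch below fits one PDF per stimulus column and reads off the most likely r value (placeholder data stand in for the measured coefficients):

    import numpy as np
    from scipy.stats import gaussian_kde

    # r_matrix: 25-by-4 array of correlation coefficients
    # (subjects x stimuli); placeholder data for illustration.
    r_matrix = np.random.uniform(-1, 1, size=(25, 4))

    grid = np.linspace(-1, 1, 200)
    for j, label in enumerate("ABCD"):
        pdf = gaussian_kde(r_matrix[:, j])     # fit one PDF per stimulus
        peak_r = grid[np.argmax(pdf(grid))]    # most likely r value
        print(label, round(peak_r, 2))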
Fig. 4 Above: Scaled PDF expressing the probability distribution for
each auditory stimulus of the correlation coefficients defining the similarity between movement feature (i.e. the distance between the elbows)
and musical feature (the one-to-many alternation). Below: the same for
the movement feature defining the distance between the two wrists.
Table 2 The means and medians of the 25 correlation coefficients r of each of the four data vectors that express the similarity between the varying distance of the elbows/wrists and the one-to-many alternation in the four musical stimuli.

Stimuli          A     B     C     D     All
N                25    25    25    25    100
Elbow rmean      0.65  0.45  0.60  0.54  0.56
Elbow rmedian    0.76  0.46  0.71  0.64  0.65
Wrist rmean      0.53  0.25  0.47  0.36  0.40
Wrist rmedian    0.52  0.24  0.48  0.28  0.41
Analysis of the correlations shows that the coordination
between the movements of the wrists and the music is significantly smaller than the coordination of the elbows with the
music. Analysis of variance shows a significant difference for stimuli B (F(1, 48) = 5.66, p < 0.05) and D (F(1, 48) = 8.62, p < 0.01), while there is a similar tendency for the two other stimuli, with p-values of 0.10 and 0.06 for A and C respectively. A significant difference between the four stimuli is found for both the wrists (F(3, 96) = 5.48, p < 0.01) and the elbows (F(3, 96) = 2.87, p < 0.05). Post-hoc tests show
that in both cases the difference is caused by a poorer coordination in stimulus B.
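These analyses of variance compare, per stimulus, the 25 elbow coefficients with the 25 wrist coefficients (two groups of 25 observations yield the F(1, 48) statistics). A sketch with SciPy, using placeholder values:

    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(0)
    elbow_r = rng.normal(0.45, 0.2, 25)   # placeholder values, stimulus B
    wrist_r = rng.normal(0.25, 0.2, 25)

    # Elbow vs. wrist coordination for one stimulus: 2 groups of 25
    # observations give an F statistic with (1, 48) degrees of freedom.
    F, p = f_oneway(elbow_r, wrist_r)
    print(F, p)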
These observations suggest, in general, that the elbows as
well as the wrists (although to a lesser degree) attune to
the auditory stimuli. This means that, when extra harmonic
voices are added to a monophonic voice there is a general
tendency to make broader gestures and vice versa. This indicates that a gestural trajectory is established as a response to
the one-to-many alternation, expressed as a pattern of contraction/expansion of the upper limbs in peripersonal space.
However, this does not seem to be the case for every performance.
From the lower ends of the different probability curves, it
can be observed that some performances do not establish
a clear sensorimotor coupling. Therefore, the results of the
performances were related to the music and dance background of the subjects in order to see if this influenced the
performance results. For each of the 25 subjects, a mean was
calculated of the eight correlation coefficients defining how
well the distances between both the elbows and both the
wrists correlated with the one-to-many alternation in each
of the four sound stimuli. The distribution of these 25 mean
correlation coefficients in relation to the years of musical and dance education that the subjects had received revealed no
linear relationship (see figure 5). The relation was measured
by calculating the correlation coefficient between the variable specifying the correlation between movement and music and the two variables years of music education and years
of dance education (respectively r = 0.30 and r = 0.09).
Fig. 5 Distribution of the mean correlation coefficients (horizontal axis) in relation to the artistic (i.e. music and dance) background of subjects, expressed as the number of years of music and dance education (vertical axis).
An alternative analysis is presented in section 5, focusing on the subjective descriptions that subjects gave after each performance, in order to find out whether these could be related to the performance results.
5 Quality of experience
The subjective descriptions of the participants' experiences were given after each measurement, which gives a total of 100 (= 25 subjects × 4 stimuli) description moments, and a total of 204 descriptors. As these descriptions were given on a free and spontaneous basis, they were first categorized into four categories, namely expressivity, richness, intensity and structure. These description types correspond to a model for
semantic description of music used by Lesaffre [29] in an
experiment that focused on unveiling relationships between
musical structure and musical expressiveness. The category
expressivity (11%) relates to interest, affect and judgment
and in our test it was expressed with adjectives such as irritancy, cheerfulness and difficulty. Richness (17%) consists
of tone qualities such as fullness, wealth and spatiality. Intensity (9.3%) was expressed in terms of exuberance, intensity, force and impressiveness. Structure (63%) had a focus
on aspects of harmony (e.g. from unison to many-voiced,
triadic) and movement (e.g. larger, wider, repetitive). The
large number of structure-related adjectives can be explained by the interview setup, which required participants to give adjectives expressing their perception of the evolution within the musical samples and their overall experience during the performance. Starting from this categorical approach,
a quality of experience value was assigned to each of the 100
description sets, using a score from 0 to 3. This was done by two musicologists who worked independently of each other. Their assessments resulted in two spreadsheets
that were qualitatively compared with each other. In 93% of the cases, the two assessments agreed. In the other cases, disagreement occurred due to the specific assessment of the terms greater and louder. After discussion, and based on Lesaffre's [29] model for semantic description of music, these were put in the intensity category. The obtained scores lead to four quality-of-experience
groups in the following way: (1) the lowest score of 0 (5 description sets) groups the description sets that provide no, minimal or vague sketches of the sensory stimuli; (2) the score of 1 (17 description sets) has been given to the group that holds to one single descriptor category. This group also represents subjects who, among other things, said that they focused on the pulse (rhythm) of the sound stimuli, had negative experiences such as irritancy and oppression, or found the task difficult; (3) the group with a score of 2 (49 description sets) corresponds with rather well described experiences, but with descriptions that account for only one or two of the four description categories; (4) the group with the highest score of 3 (29 description sets) consists of description sets that include multiple descriptors spread over the four categories, although no description set covered all four categories. The distribution of correlation coefficients (representing the correlation between movement
and auditory stimulus) was then re-calculated for each of the
four quality of experience groups. Once all the values of the
correlation coefficients are obtained, a fitted probability distribution (PDF) is created for each group (see figure 6).
Fig. 6 Scaled probability distributions (PDF) that define the degree of sensorimotor coupling (expressed by correlation coefficient values that determine the correlation between the expansion/contraction pattern of the elbows and the one-to-many alternation in the music) as a function of the quality of experience. Four groups are defined based on different qualities of experience in terms of intensity, richness, expressivity and structure.

With mean r values of 0.37 and 0.27 respectively, groups 1 and 2 score worse than groups 3 and 4, which have mean r values of 0.65 and 0.67 respectively. These results suggest that when subjects are capable of giving a clear description of their experience in terms of expressive features (like richness and intensity) and structural features, they are more likely to give proof of a sensorimotor coupling (i.e. motor resonance behaviour).

The mapping trajectory that will be proposed further in this paper is based on the results of the performances whose subjective descriptors were categorized in groups 3 and 4 (see figure 6). In contrast with the performances that have a subjective descriptor in groups 1 and 2 (i.e. 22 performances), the performances of groups 3 and 4 (i.e. 78 performances) gave evidence of a strong sensorimotor coupling, accompanied by clear descriptions of subjective percepts in terms of coherent expressive and structural features. As a result, only the 78 performances classified in groups 3 and 4 are considered relevant and meaningful for the development of the mapping trajectory.

Figure 7 shows how the remaining 78 performances are distributed according to the degree of correlation between (1) the expansion/contraction pattern of the upper limbs in peripersonal space and (2) the one-to-many alternation in the pre-recorded auditory stimuli. The statistical distribution of the 78 correlation coefficients concerned with the amount of expansion/contraction at the level of the elbows is characterized by a mean r of 0.66 (median r = 0.71). With a mean r of 0.47 (median r = 0.45), the correlation between the expansion/contraction pattern of the wrists and the auditory stimuli exists but is less convincing. So it seems that motor imitation occurs especially at the level of the elbows.

Fig. 7 Distribution and PDF of the correlation coefficients (N=78) expressing the correlation between the varying distance between the elbows (above) and wrists (below) and the one-to-many alternation in the music.
6 Analysis of the directionality of the expansion in
peripersonal space
The previous section indicated that an attuning is established
between movement and sound defined by a contraction/expansion pattern of the upper limbs. However, this expansion
can be the result of different movements of the upper limbs.
So further analysis of position-related aspects needs to be
conducted in order to specify how this spatio-kinetic movement trajectory is defined with respect to the 3D peripersonal space of the subjects. This analysis needs to incorporate the trajectory of the elbows and wrists along the different axes of the 3D coordinate system during each performance.
The trajectory along the axes is approached from the perspective of the subjects’ peripersonal spaces. Therefore, the
trajectory along the X-axis is scaled from 0 to 1 enabling
an optimal comparison with the signal that specifies the
distance between the elbows. The 0-value corresponds with the
0-value position on the horizontal X-axis, the 1-value corresponds with the maximum distance of the elbows and wrists
to the body in both directions along the horizontal X-axis.
The trajectory along the Z-axis is defined according to the
same concept but in the horizontal forward/backward direction. The trajectory along the vertical Y-axis has a 0-value at
the lowest possible position of the elbow and wrist and a 1-value at the highest position. It must be noted that a 1-value corresponds to a different length depending on whether it concerns the elbow or the wrist.
For each performance (N=78) correlations were made
between (1) the trajectories of both the elbows and wrists
along the three different axes and (2) the varying distance
between the elbows. In this way, it is possible to estimate
the spatio-kinetic trajectory that determines the motor resonance behaviour of subjects in response to the one-to-many
alternation in the auditory stimuli. After these correlations
were executed, a matrix was created consisting of 12 columns of 78 correlation coefficients each. Departing from the
statistical distribution of each column, means, medians and
percentiles (25th and 5th) were calculated (table 3). Moreover, for each statistical distribution a fitted probability distribution, represented by a PDF, was obtained (figure 8). These statistical calculations give an indication of
how the data values are spread over the interval from the
smallest value to the largest value of the correlation coefficient. The results indicate that there is not much difference between the left and right sides of the upper body. Because of that, in what follows the distinction between the two sides will be disregarded and mean values will be
used.
A first global observation of the means, medians, percentile values and PDF plot suggests that there are high r
values concerning the trajectories of elbows and wrists along
the vertical Y-axis. For the elbows, 95% of the r values (i.e.
74 out of the 78 values in total) go beyond 0.56. The probability density function indicates that the r value that is most
likely to occur amounts to 0.93. For the wrists, the r value
that seems most likely to occur amounts to 0.90. 95% of
the r values are higher than 0.33 while 75% of the r values
(i.e. 59 out of the 78 values in total) exceed 0.69. These observations suggest that the motor response behaviour determined by an expansion of the elbows is largely dependent
on a displacement of the elbows and wrists in the upward
direction. Further analysis is executed in order to investigate whether the upward movement of the elbows is due to a
horizontal movement in the forward/backward direction (i.e.
Z-axis) or in the sideward direction (i.e. X-axis). In order
to realize this, it was investigated for all the performances
whether an increase in distance between the elbows corresponded also to a similar, increasing displacement of the elbows and wrists along the X-axis. For the left elbow the con-
traction/expansion range in the horizontal, sideward direction is defined by x-coordinate values that fall in between 0
(i.e. maximum contraction) and 0.43 (i.e. maximum expansion). For the right elbow, the contraction/expansion range
is defined by x-coordinates falling in between 0 and -0.43.
For the left wrist, the x-values fall in between 0 and 0.7, for
the right wrist, in between 0 and -0.7. In order to simplify
the interpretation of the subsequent correlation analysis, the
x-values for the right elbow and wrist were multiplied by -1
in order to relate increasing x-values to an expansion of the
elbows and wrists. Moreover, the different ranges were normalized between 0 and 1. Four different correlation analyses
were then applied between the two variables defining (1) the
scaled x-values specifying the displacement of both the elbows and wrists and (2) the varying distance between both
the elbows. For the elbows, 95% of the r values exceed 0.37,
while 75% of the r values go beyond 0.62. The PDF fit of
the distribution indicates that a value of 0.84 is most likely
to occur. As for the wrists, the value that is most likely to occur is 0.43. These results suggest that the tendency towards spatial verticality is related to an expansion
of the upper limbs in the outward direction along the X-axis.
A similar correlation analysis was performed between the
variables specifying the displacement along the Z-axis and
the distance between the elbows. This analysis indicated an
absence of correlation between the expansion of the elbows
and the displacement of both the elbows and wrists along
the Z-axis.
Table 3 Means, medians and percentile values of the 12 distributions of 78 correlation coefficients r that define the correlation between (1) the expansion of the elbows and (2) the displacement of the upper limbs along the three axes in peripersonal space. The upper rows contain the correlations of the expansion of the elbows with the trajectories of the left and right elbow along the X, Y and Z-axes; the lower rows contain those of the wrists.

                  X left  X right  Y left  Y right  Z left  Z right
N                 78      78       78      78       78      78
Elbow rmean       0.78    0.68     0.90    0.85     0.09    0.40
Elbow rmedian     0.78    0.73     0.91    0.92     0.11    0.39
Elbow rprctl(25)  0.70    0.53     0.87    0.83     -0.09   0.18
Elbow rprctl(5)   0.50    0.24     0.68    0.43     -0.40   -0.10
Wrist rmean       0.50    0.30     0.79    0.74     0.16    0.25
Wrist rmedian     0.55    0.28     0.84    0.85     0.13    0.24
Wrist rprctl(25)  0.32    0.14     0.71    0.66     -0.07   0.05
Wrist rprctl(5)   -0.01   -0.13    0.43    0.22     -0.32   -0.14

Fig. 8 Scaled fitted probability distributions (PDF) that define the correlation between (1) the expansion of the elbows and (2) the displacement of the elbows (above) and wrists (below) along the three axes in peripersonal space.
7 Discussion
In line with previous findings [34, 17, 18, 12], our study indicates that subjects have an intuitive tendency to associate musical structures with physical space and bodily motion. The analysis of the extracted movement features shows that the corporeal resonance behaviour in response to the one-to-many alternation in the auditory stimuli is characterised by a contraction/expansion of the upper limbs in the subjects' peripersonal space. This seems to be the case especially for the elbows. A partial explanation of this result could be that the wrists have rather a tendency to follow the regular pulse of the musical stimuli. A first brief, qualitative observation of the data seems to confirm this explanation, although further research is needed.

Moreover, empirical evidence shows that the expansion of the upper limbs is characterized by an upward (Y-axis) and, although to a lesser degree, sideward (X-axis) directionality. Eitan and Granot [17], who investigated the relationship between musical structural descriptions and synesthetic/kinesthetic descriptions, found that changes in pitch height likewise correspond to changes in movement characterised by a vertical spatial directionality. This raises the question whether it is the addition of extra voices that stimulates expansion, or rather the change in pitch. The experimental set-up anticipated this question: because the voices were added in the octave below the root voice in stimuli A and B, there could be no question of a rising pitch in these stimuli. Nevertheless, there is a clear tendency towards spatial verticality.

Furthermore, the analysis suggested that there is no linear relationship between the artistic (i.e. music and dance) background of subjects and their motor-attuning performances on the pre-recorded stimuli. An alternative analysis, however, confirms that personality and contextual factors, such as nervousness and uneasiness, are important influences on the performance of subjects. These two observations feed the hypothesis that the observed corporeal resonance behaviour (when a person is feeling comfortable) is really a spontaneous one, rooted in the action-oriented ontology of subjects, independent of their artistic background. However, to confirm this hypothesis more adequately, additional research needs to be conducted with more, and more divergent, subjects, preferably in more ecologically valid situations.
The results of this study can be taken as grounds for the establishment of a mapping trajectory that connects the analysed gestural trajectory of the upper limbs in peripersonal space, defined by the contraction/expansion index, to a module that controls the synthesis of extra voices on a monophonic auditory input (e.g. the MuVoice module). In order to realize this, the computational model proposed in this paper must be adjusted to recognize the gestural trajectories of users equipped with five inertial sensors and connected with the system. Once a pattern of contraction/expansion of the upper limbs is recognized, the computational model must activate a sound-synthesizing module that adds voices to the monophonic input. We propose a mapping trajectory that (1) gradually adds a third according to the expansion of the elbows in the upward (Y-axis) and sideward (X-axis) direction and (2) gradually adds an extra fifth on top of the root and third according to the expansion of the wrists in the upward (Y-axis) and sideward (X-axis) direction. However, this proposed strategy must be validated in the future in ecologically valid environments and situations.
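As a sketch of this proposal, the expansion indices could be turned directly into gain values for the two extra voices; in the actual system these gains would drive a harmonizer module such as MuVoice rather than a print statement, and the relative weighting of the Y- and X-components, as well as the gating of the fifth by the third, are illustrative assumptions:

    import numpy as np

    def voice_gains(elbow_xy, wrist_xy, w_y=0.7, w_x=0.3):
        # Map expansion indices (each scaled 0..1) to gains for the extra
        # voices: elbows control the third, wrists control the fifth.
        # elbow_xy / wrist_xy are (x_index, y_index) pairs; the weights
        # are illustrative assumptions, not measured values.
        ex, ey = elbow_xy
        wx, wy = wrist_xy
        gain_third = np.clip(w_y * ey + w_x * ex, 0.0, 1.0)
        gain_fifth = np.clip(w_y * wy + w_x * wx, 0.0, 1.0)
        # The fifth is only added on top of root and third.
        gain_fifth = min(gain_fifth, gain_third)
        return gain_third, gain_fifth

    print(voice_gains((0.8, 0.9), (0.4, 0.6)))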
The innovative aspect of this mapping strategy is that it
relies on an embodied, sensorimotor approach to music perception and expressiveness, putting in evidence the action-oriented ontology of users [22, 32, 10, 39]. According to this approach, perception is built upon structural couplings between sensory patterns and motor activity. The qualitative research presented in this study (see section 5) confirms this thesis by showing that only the subjects who gave evidence of a structural coupling between (1) the auditory pattern (i.e.
one-to-many alternation) inherent in the auditory stimuli and
(2) the corporeal resonance behaviour, could give a clear
and coherent description of how they perceived the musical stimuli. In contrast, the subjective descriptions of the
participants lacking the sensorimotor coupling were mostly
vague and incoherent. Moreover, the subjective descriptions
of the subjects that gave proof of a sensorimotor coupling
were very much in accordance with each other. They described the passage from one voice to multiple, harmonically related, simultaneously sounding voices in terms of more full, more intense, more forceful, more exuberant, etc. As
a result, a connection could be established between musical structural feature, movement feature and the perceptual
experience in terms of richness, intensity and expressivity
(table 4). This observed relationship is in line with previous
research putting in evidence the relation between (1) the size
and openness of bodily movement and the emotional intensity of the musical sound [14, 7] and (2) the contrast between
an open and closed body position and the intent of communicators to persuade [31, 30].
Table 4 Relationship between musical structural feature, movement feature and perceptual experience in terms of richness, intensity and expressivity.

Structural musical cue  Movement cue  Perception
adding voices           expansion     full, intense, force
disappearing voices     contraction   empty, delicate
By developing a mapping strategy that relies on sensorimotor integration and the coupling of action and perception, it seems possible to address descriptions on a purely subjective level (ideas, feelings, moods, etc.) by means of corporeal articulation and to translate them further into sound. It must be noted that the proposed sound process of one-to-many alternation is not related to one particular emotion or affect as such, but rather to the intensity or quality of an emotion in general [36, 8]. Although further research is required, the results presented in this paper suggest that an emotion, affect or dramatic content could be reinforced by the specific musical process of adding voices to an originally monophonic voice. By integrating a musical synthesis module that is able to perform this kind of musical process (MuVoice) inside the algorithm that models movement, a multimodal, digital interface is created that can be used by singers to enhance the natural capabilities of the voice to communicate emotions and feelings by means of the expressive qualities of the upper body, extended into the virtual musical domain. In this way, it also provides a tool that enhances music-driven psychosocial interaction in an artistic and social context. It enables users, multimedia artists and dancers to control and manipulate a music performance by means of expressive gestures. For the performer, it means that the performed actions are attuned with the intended actions communicated by the musical structural cues, constituting unambiguous expressive percepts. For the outside world, it means that the sensory input received from the stage (visual, auditory, ...) is perceived as attuned, intended actions that can be corporeally imitated and related to the action-oriented ontology, creating expressive content.
Notwithstanding the advantages of the proposed mapping strategy, further research needs to be conducted in order to integrate the proposed methodology into systems that support one-to-many, many-to-one or many-to-many mapping strategies [35, 21]. This will be accomplished in the future by conducting additional experiments investigating the relation between expressive gesture and sound [33].
Acknowledgements This work is funded by the EmcoMetecca project (Ghent University).
We want to thank Pieter Coussement for his contribution to the project
with the development of a Jitter-generated stick-figure representation.
We also want to thank Mark T. Marshall and Marcelo Wanderley at the
Input Devices and Music Interaction Laboratory at McGill University
(www.idmil.org) for their software that accesses the Xsens sensors and
converts the received data to the OSC protocol.
References
1. N. Bernardini. http://www.cost287.org/.
2. F. Bevilacqua, J. Ridenour, and D.J. Cuccia. 3D motion capture
data: motion analysis and mapping to music. In Proceedings of
the Workshop/Symposium on Sensing and Input for Media-centric
Systems, 2002.
3. N. Bianchi-Berthouze, P. Cairns, A. Cox, C. Jennett, and W.W.
Kim. On posture as a modality for expressing and recognizing
emotions. In Emotion and HCI workshop at BCS HCI London,
2006.
4. C. Cadoz and M.M. Wanderley. Gesture-music. Trends in Gestural Control of Music, pages 71–93, 2000.
5. A. Camurri, G. De Poli, A. Friberg, M. Leman, and G. Volpe. The
MEGA project: analysis and synthesis of multisensory expressive
gesture in performing art applications. Journal of New Music Research, 34(1):5–21, 2005.
6. A. Camurri, G. De Poli, M. Leman, and G. Volpe. A multilayered conceptual framework for expressive gesture applications.
In Proc. Intl MOSART Workshop, Barcelona, 2001.
7. A. Camurri, B. Mazzarino, M. Ricchetti, R. Timmers, and
G. Volpe. Multimodal analysis of expressive gesture in music and
dance performances. Lecture notes in computer science, pages
20–39, 2004.
8. A. Camurri, B. Mazzarino, M. Ricchetti, R. Timmers, and
G. Volpe. Multimodal analysis of expressive gesture in music and
dance performances. Lecture notes in computer science, pages
20–39, 2004.
9. A. Camurri, G. Volpe, G. De Poli, and M. Leman. Communicating
expressiveness and affect in multimodal interactive systems. IEEE
Multimedia, 12(1):43–53, 2005.
10. G. Colombetti and E. Thompson. The feeling body: Toward an enactive approach to emotion. Body in mind, mind in body: Developmental perspectives on embodiment and consciousness. Hillsdale,
NJ: Lawrence Erlbaum, 2007.
11. M. Coulson. Expressing emotion through body movement. Animating Expressive Characters for Social Interaction, page 71,
2008.
12. P. Craenen. Music from Some (no) where, Here and There: Reflections over the Space of Sounding Compositions. TIJDSCHRIFT
VOOR MUZIEKTHEORIE, 12(1):122, 2007.
13. S. Dahl. On the beat: Human movement and timing in the production and perception of music. PhD thesis, KTH School of
Computer Science and Communications, SE-100 44 Stockholm,
Sweden, 2005.
14. J.W. Davidson. What type of information is conveyed in the body
movements of solo musician performers. Journal of Human Movement Studies, 6:279–301, 1994.
15. P.R. De Silva and N. Bianchi-Berthouze. Modeling human affective postures: an information theoretic characterization of posture
features. Computer Animation and Virtual Worlds, 15, 2004.
16. P.R. De Silva, M. Osano, A. Marasinghe, and A.P. Madurapperuma. Towards recognizing emotion with affective dimensions
through body gestures. In Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, pages
269–274, 2006.
17. Z. Eitan and R.Y. Granot. Musical parameters and images of motion. In Proceedings of the Conference on Interdisciplinary Musicology (CIM04), Graz/Austria, pages 15–18, 2004.
18. Z. Eitan and R.Y. Granot. How music moves. Music perception,
23(3):221–248, 2006.
19. A. Farne, M.L. Dematte, and E. Ladavas. Neuropsychological
evidence of modular organization of the near peripersonal space.
Neurology, 65(11):1754–1758, 2005.
20. D. Fenza, L. Mion, S. Canazza, and A. Roda. Physical movement
and musical gestures: a multilevel mapping strategy. Proceedings
of Sound and Music Computing’05, 2005.
21. A. Hunt, M. Wanderley, and R. Kirk. Towards a model for instrumental mapping in expert musical interaction. In International
Computer Music Conference, pages 209–212, 2000.
22. S.L. Hurley. Consciousness in Action. Harvard University Press, 2002.
23. A.R. Jensenius. Action-Sound: Developing Methods and Tools to Study Music-Related Body Movement. PhD thesis, Department of Musicology, University of Oslo, 2007.
24. A. Kleinsmith, T. Fushimi, and N. Bianchi-Berthouze. An incremental and interactive affective posture recognition system. In
International Workshop on Adapting the Interaction Style to Affective Factors, Edinburgh, UK, 2005.
25. G. Kurtenbach and E.A. Hulteen. Gestures in Human-Computer
Communication. The Art of Human-Computer Interface Design,
pages 309–317, 1990.
26. R. Laban and F.C. Lawrence. Effort. Macdonald & Evans London,
1947.
27. M. Leman. Embodied music cognition and mediation technology. MIT Press, 2007.
28. M. Leman and A. Camurri. Understanding musical expressiveness using interactive multimedia platforms. Musicae Scientiae,
10(I):209, 2006.
29. M. Lesaffre, L.D. Voogdt, M. Leman, B.D. Baets, H.D. Meyer, and
J.P. Martens. How potential users of music search and retrieval
systems describe the semantic quality of music. Journal of the
American Society for Information Science and Technology, 59(5),
2008.
30. H. McGinley, R. LeFevre, and P. McGinley. The influence of a
communicator’s body position on opinion change in others. Journal of Personality and Social Psychology, 31(4):686–690, 1975.
31. A. Mehrabian and J.T. Friar. Encoding of Attitude by a Seated
Communicator via Posture and Position Cues. J Consult Clin Psychol, 1969.
32. A. Noë. Action in perception. MIT Press, 2004.
33. F. Ofli, Y. Demir, Y. Yemez, E. Erzin, A.M. Tekalp, K. Balcı,
İ. Kızoğlu, L. Akarun, C. Canton-Ferrer, J. Tilmanne, et al. An
audio-driven dancing avatar. Journal on Multimodal User Interfaces, 2(2):93–103, 2008.
34. B.H. Repp. Music as motion: A synopsis of Alexander Truslit’s
(1938) Gestaltung und Bewegung in der Musik. Psychology of
Music, 21(1):48, 1993.
35. J.B. Rovan, M.M. Wanderley, S. Dubnov, and P. Depalle. Instrumental gestural mapping strategies as expressivity determinants
in computer music performance. In Proceedings of Kansei-The
Technology of Emotion Workshop, pages 3–4, 1997.
36. K.R. Scherer. Why music does not produce basic emotions: pleading for a new approach to measuring the emotional effects of music. In Proc. Stockholm Music Acoustics Conference SMAC-03,
pages 25–28, 2003.
37. K. Schindler, L. Van Gool, and B. de Gelder. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 21(9):1238–1246, 2008.
38. L. Tarabella and G. Bertini. About the Role of Mapping in gesturecontrolled live computer music. Lecture notes in computer science, pages 217–224, 2004.
39. D. Taraborelli and M. Mossio. On the relation between the enactive and the sensorimotor approach to perception. Consciousness
and Cognition, 17(4):1343–1344, 2008.
40. R. von Laban and F.C. Lawrence. Effort. Macdonald & Evans,
1967.
41. T. Winkler. Making motion musical: Gesture mapping strategies
for interactive computer music. In Proceedings of the 1995 International Computer Music Conference, pages 261–264, 1995.