
Producing 3-D Audio

2019, Producing Music

PRODUCING 3-D AUDIO

JUSTIN PATERSON AND GARETH LLEWELLYN

Abstract

Arguably the most rapidly expanding market for audio production is that of 3-D audio – 360° spatial audio with representation of height. Such playback is becoming commonplace in cinemas, and multi-speaker home setups are following. As these mature, greater convenience and enhanced functionality will likely increase domestic uptake. In parallel, headphone-based 3D is experiencing an unprecedented rate of development. Headphones with spaced multiple drivers are emerging, accelerometers are being incorporated to facilitate head-tracking-driven audio panning, and binaural algorithms are steadily improving. The principal exponents of such playback are virtual and augmented reality. While gaming will lead this market, soon the applications will proliferate into many areas of daily life – from productivity to education, through social networking to music playback, and advertising will pervade these new media. All of these applications require the production of audio, and this chapter explores the associated implications for the music producer. 3D offers an exciting new paradigm for music, since the conventions associated with stereo and horizontal surround are increasingly outmoded by the greater options associated with perception of height. This chapter will consider relevant technologies and identify key production practices, underpinned by a framework of scientific research, to help define the emergent praxis of 3-D audio production. It will consider opportunities, applications and limitations, all of which combine to introduce and help define a new approach to music production that will increasingly pervade into the future.

Introduction

For the purposes of this chapter, 3-D audio can be defined as the reproduction of sound that is perceived as coming from all around us, including behind and above, or even below. This is of course the manner in which we perceive the real world in daily life. The recreation of such perception straddles many disciplines, from recording to psychoacoustics to reproduction itself, which Michael Gerzon (1973) referred to as periphony. While this chapter is centred on ‘production’ (a multifaceted term in itself), it is also necessary to contextualize this with a number of references to these various disciplines. Clearly, when working in 3D, the most radical departure from stereo or horizontal-surround production is that of source placement in the sound field, and so a perspective of this will be presented. Having said that, periphony brings other creative opportunities and hazards to the production process, and some of these will also be discussed. Further, the word ‘audio’ here generally refers to music, but might be taken in a broader context to include dialogue and non-musical sounds that might accompany video.

Beyond academic research, little has been published on periphonic production at the time of writing (fall of 2017), and so this chapter also has a responsibility to introduce the topic in order to serve as a primer for the numerous texts that will doubtless be published in the future, and also to draw the reader's attention to specific aspects of the periphonic arena so that they might seek out and read further specialist texts.

There are three main paths by which 3-D immersive audio can be constructed: channel based, Ambisonic and object based, and these will be discussed. These approaches can, of course, be combined in any given playback scenario.
At some point, each of these requires rendering to speakers or headphones for playback. Whilst the details of each system differ and a good understanding of their workings is important, there are a number of audio-production approaches which tend to be similar across the board, although the manner of ‘getting there’ may differ slightly.

The context of 3-D audio

Periphonic reproduction has been the subject of research for a number of decades. Over this time the degree of interest has ebbed and flowed, and various systems have been developed that facilitate so-called ‘immersive audio’. It is worth first considering some of the context and various applications that employ 3-D audio.

In acousmatic music, composers have styled their music to be played back through multi-channel (often periphonic) speaker arrays. In ‘live’ performance, a musician designated as the ‘diffuser’ might dynamically spatialize the performance in a given array, moving particular instruments from speaker to speaker, and many such arrays have been configured in universities and performance venues. The term ‘engulfment’ has been used in such circles to refer to representation of height.

In the past five or six years, companies such as Dolby and Auro-3D® have developed multi-speaker periphonic playback systems for cinema. Most recent Hollywood blockbusters have embraced this approach, and consequently 3-D audio creation is now commonplace in post-production. Special effects and atmospheres can exploit the ability to be placed anywhere in a 3-D sound field around the listener, and this can enhance synchresis and opens many possibilities for non-diegetic immersive audio. The listeners might perceive ambient sounds all around them, or experience dynamic effects such as aircraft flying overhead that can disappear into the distance behind. The associated music composition/production is also beginning to exploit such playback.

The ‘soundbar’ is an increasingly popular add-on for large-screen televisions; it attempts to extend the perceived width of the sound sources beyond that of the television screen and might work both horizontally and vertically through a process termed Wave-Field Synthesis – sometimes also bouncing sound off the ceiling in order to gain the perception of height.

Binaural audio can give the impression of front-back and up-down sound-source localization on headphones. It does this by incorporating natural phenomena such as the difference in time that a sound takes to reach each ear, and the physical shape of the pinnae, which together contribute to how the brain deduces the point of origin. However, because it is dependent on the listener's physiological shape, binaural performs with varying degrees of accuracy for a given individual. Binaural localization can be achieved either through special recording techniques or via artificial ‘synthesis’. In March 2015, the Audio Engineering Society (AES) published the AES69-2015 standard, which provides a framework to facilitate widespread adoption into creative content in handheld devices. This will continue to accelerate development and uptake of binaural technologies.

Virtual reality, augmented reality and mixed reality are technically different, but for convenience will now be referred to collectively by the common acronym ‘VR’. VR represents one of the biggest technological revolutions yet to be seen, and has a revenue stream forecast to reach $108 billion by 2021 (Digi-Capital, 2017). The attendant audio is generally binaural.
Localization is greatly enhanced by the head tracking that is a feature of Head-Mounted Displays (HMDs), whereby the position of a sound appears fixed relative to the natural movement of the head. Whereas VR is currently dominated by ‘early-adopter’ gaming applications (the gaming industry can be considered as residing ‘within’ VR for the purposes of this text), its impact will become more ubiquitous as it expands towards productivity, education and health. VR and its associated revenue have precipitated a step change in interest and momentum for 3-D audio.

360º video is also in the ascendant. In March 2015, YouTube launched a dedicated 360º video channel, Mobile VR (use of a smartphone with a low-budget HMD ‘holder’) is driving its popularity, and these videos typically require 3-D audio soundtracks.

Broadcasting is another industry that is increasingly experimenting with 3-D audio. Both the BBC and NHK in Japan have broadcast with a 22.2-channel system, and the BBC has increasingly created binaural content both with and without accompanying visuals, e.g. Dr. Who and Radio 3.

All of the above requires specialist equipment, and thus another sector that is rapidly expanding is that of 3-D audio tool developers, both hardware and software. Established international players such as Dolby and Pioneer are creating hardware, as are emergent start-ups such as Smyth. One interesting example is the OSSIC X headphones, which measure physiological characteristics of the listener's head and ears, feature spaced directional drivers to help create the sound field, and have inbuilt head tracking. Software is coming from a similar range of developers, from the international DTS:X protocol through to niche developers such as plugin manufacturer Blue Ripple Sound.

There is a great deal of academic research into the many areas associated with 3-D audio, spread across the humanities, science and medicine. These include music composition and performance, sonic art, human-computer interface design, audio post-production, cognition/perception, architectural space-modelling, recording techniques, software design, creative applications, assistive technologies and more. Publication is increasingly widespread: in 2016 the AES held its first dedicated 3-D audio conference, and The Art of Record Production held its first such event in late 2017.

Spatial placement

We must first consider some of the issues associated with 3-D audio ‘placement’ in reproduction. Our ability to identify a given audio source’s point of origin in space, with a meaningful resolution, requires the right kind of information to be delivered to the ears, so that the brain is able to reconstruct an ostensibly correct (or at least intended) placement of that audio. Creating a 3-D audio experience requires the presentation of usable spatial information to the ears so that this mental reconstruction can occur. The production of useful spatial information can only come via three mechanisms:

1] ‘Captured’ from the real world

The first is to try and capture some of the spatial cues that exist in real acoustic environments using two or more microphones, and reproduce these cues over speakers or headphones. A single monophonic microphone recording may contain a number of clues as to the environmental acoustics of the space and the sound source being captured, but some amount of ‘directionality’ is required if one wants to begin expressing spatial positioning from native audio recordings, and this requires two or more microphones.
Some of the inherent spatial information can then be captured and later reproduced from the timing and intensity differences between the capsules at the moment of capture.

2] Synthesized artificially

When it is not practical or desirable to capture real-world spatial information, one can model the acoustics that would be generated by a sound source in an acoustic environment synthetically. There are myriad ways of achieving this, including using spatial impulse responses, real-time room modelling and synthetic HRTF modelling, some of which will be discussed later.

3] As a function of the playback system

Panning a sound within a 3-D speaker array naturally generates spatial information at the moment of playback that is not inherent in the source audio itself. By nature of the physical relationships between the speakers, the room and the listener, 3-D positioning may be induced from power-panning alone. Good localization may thereby be achieved in a speaker array from these factors alone. This reproduction of spatial cues will potentially be partially or completely absent from headphone reproductions of the same material.

The dynamic between these three fundamental approaches is at the center of all 3-D audio production. The intended playback system (or systems) that any given 3-D audio production is expected to play back on will fundamentally influence the approach that is required. There are potentially major differences between the aesthetics and approach for producing a large-format theatrical electronic-music track and an acoustic orchestral piece of music for a 360º video. The real challenge is where and when it is appropriate to use different capture and synthesis techniques, and where and how transforms between formats should occur. So, to elaborate:

1] Spatial Capture

Specialized recording techniques have been designed in order to capture audio in three dimensions. One seminal approach that is currently regaining popularity is Ambisonics. This concept was developed in the 1970s by Gerzon and others, and an excellent collection of Ambisonic resources has been compiled by Richard Elen (1998). Ambisonics can function at various resolutions that determine the accuracy of the perceived source localization, and this determines the required number of channels. These resolutions are categorised by ‘order’; for instance, ‘first order’ is a four-channel system. A very important aspect is that these channels are encoded/decoded, meaning that the signals derived from (say, four) microphone capsules are not the same as those transmitted – they are encoded into a different four for transmission, and these are not the same as those passed to four speakers for reproduction – which must first be decoded (also known as rendered) from those transmitted. Alternatively, decoding could be into other formats such as 13.1, or two channels for binaural playback on headphones. This offers a great deal of flexibility for different capture, transmission and reproduction configurations. Importantly, the transmitted signals – which are called B-format – are easily manipulated through computationally efficient matrix operations, and such operations can for instance rotate the sound field.
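To make the encode/decode and rotation ideas concrete, the following is a minimal sketch (in Python with NumPy) of panning a mono signal to traditional first-order B-format and then rotating the resulting sound field about the vertical axis. The function names and the 1/√2 weighting on W reflect common convention, but details such as normalization and channel ordering vary between toolsets, so this should be read as an illustration rather than a reference implementation.

```python
import numpy as np

def encode_first_order(mono, azimuth_deg, elevation_deg):
    """Encode (pan) a mono signal to traditional first-order B-format: W, X, Y, Z."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono / np.sqrt(2.0)              # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)   # front/back
    y = mono * np.sin(az) * np.cos(el)   # left/right
    z = mono * np.sin(el)                # up/down
    return np.stack([w, x, y, z])

def rotate_yaw(b_format, yaw_deg):
    """Rotate the whole encoded sound field about the vertical axis (W and Z are unaffected)."""
    w, x, y, z = b_format
    c, s = np.cos(np.radians(yaw_deg)), np.sin(np.radians(yaw_deg))
    return np.stack([w, c * x - s * y, s * x + c * y, z])

# Example: a source encoded 30 degrees to the left, then the field counter-rotated by 30 degrees
source = np.random.randn(48000)                      # one second of stand-in audio at 48 kHz
b = encode_first_order(source, azimuth_deg=30.0, elevation_deg=0.0)
b_counter_rotated = rotate_yaw(b, yaw_deg=-30.0)     # the source now appears at the front
```

Decoding (rendering) to a given speaker layout or to binaural then operates on these four channels; head tracking, discussed next, amounts to feeding the measured yaw (and, in practice, pitch and roll) into just such a rotation immediately before the decode.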
One reason for the resurgence in popularity of Ambisonics is that such rotation is what typically happens in a VR HMD, where the sound field appears stationary independent of the movements of the head: an accelerometer/magnetometer/gyroscope combination generates such a rotation matrix to manipulate the B-format signals before decoding. Such dynamic panning greatly resolves the front-back confusion typical of binaural playback, and gives a much greater overall sense of realism. Such operations are typical of contemporary HMDs.

Higher-Order Ambisonics (HOA), sometimes called Scene-Based Audio (SBA) (Shivappa et al., 2016), incorporates additional channels to represent further ‘spherical harmonics’, and these give much better localization and a larger sweet spot as the order increases. The trade-off in this approach is the increased channel count: in the case of seventh order, 64 channels are needed to convey a single track of audio, and this puts considerable strain on any Digital Audio Workstation (DAW) as the track count increases during music production. It should be remembered that beyond Ambisonics, there are other ways to capture 3-D audio too, e.g. those of Michael Williams (2004).

2] Synthesized

Where it is not possible, desirable or practical to capture the spatial information that is present in a given recording situation, spatialization cues can be synthesized after the recording event.

Spatialization of monophonic audio cues can take two forms: firstly, simple panning through a periphonic system that reproduces HRTF effects, and secondly, the simulation of room acoustics that aims to localize and place the audio event realistically in 3-D space. With regard to the latter, in a real-time game engine this acoustic simulation may be largely procedural, and can include reflections based upon room geometry, the occlusion of sounds behind physical objects, sound radiation patterns and frequency roll-off over distance – all of which are dependent on the game-software code that is deployed and the processing limitations of the system. With linear playback materials, all these effects can be generated offline by engineers using similar tools, but with more direct access and control over the materials. The addition of reverberation to sounds has been proven to increase a sense of externalization when listening on headphones, at the expense of localization accuracy (e.g. Begault, 1992). Real-time game engines such as Unity and Unreal Engine already have such built-in audio physics, and these can be further enhanced through a number of proprietary plugins, code or middleware specifically designed for more realistic synthesis (e.g. Steam Audio, Oculus Spatializer and G’Audio).

Synthetic spatialization of music elements will frequently take the form of individual monophonic sounds being placed in 3D with a specific direction and distance from the listening position. This is particularly true for more ‘abstract’ electronic sounds, which do not have the same restrictions of placement and acoustic realism demanded by ‘acoustic music’, e.g. orchestral or acoustic folk. While this approach can be enough to generate the necessary ‘spread’ of sounds, some judicious use of spatial early reflections is often helpful in placing these sounds satisfactorily. Synthetic spatial early reflections are timed delays that try to emulate the timing and direction of the first reflections to reach the listening position before the reverberant field takes over.
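As an illustration of such synthetic spatial early reflections, the sketch below adds a handful of delayed, attenuated and directional copies of a dry mono source to a first-order Ambisonic bed, reusing the hypothetical encode_first_order() helper from the earlier sketch. The tap times, directions and gains are invented for illustration; in practice they would be derived from a room model or a measured impulse response.

```python
import numpy as np

# (delay ms, azimuth deg, elevation deg, gain): illustrative values only
EARLY_TAPS = [
    (11.0,  75.0, 10.0, 0.35),   # nearby side wall
    (17.0, -60.0,  5.0, 0.30),   # opposite wall
    (23.0, 150.0, 30.0, 0.25),   # rear corner
    (29.0,   0.0, 55.0, 0.20),   # ceiling
]

def add_early_reflections(dry, sample_rate, taps=EARLY_TAPS):
    """Return a first-order bed: the dry source at the front plus directional early-reflection taps."""
    bed = encode_first_order(dry, azimuth_deg=0.0, elevation_deg=0.0)
    for delay_ms, az, el, gain in taps:
        offset = int(sample_rate * delay_ms / 1000.0)
        delayed = np.concatenate([np.zeros(offset), dry])[: len(dry)]  # crude integer-sample delay
        bed += encode_first_order(gain * delayed, az, el)
    return bed
```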
If such early reflections are not captured with some form of multidirectional microphone array at the time of the recording, it can be useful to use synthetic versions to achieve good placement. It is possible to create such spatial reflections with combinations of multiple delay units, if the engineer has the processing and patience necessary to set things up appropriately. Martin and King (2015) give an in-depth description of just such a system in their 22.2 mix setup. There are a number of emergent bespoke systems that can also achieve this, as will be discussed. It is very important that any audio potentially destined for later synthetic spatialization must be recorded ‘clean, close and tight’ if believable results are to be achieved.

3] Playback

Sound is literally spatialized as soon as it is played back through a periphonic system, by definition of the physicality of the 3-D array, being positioned in the array by amplitude panning. The configuration of the speakers and the acoustic properties of the room all come into play as a holistic system to make sounds appear from different places around the listeners. Similarly, the configuration of the panning system also has an effect, which might follow one of a number of mathematical approaches, e.g. Ambisonic panning, Vector Base Amplitude Panning (VBAP) (Pulkki, 1997) or Distance Based Amplitude Panning (DBAP) (Lossius et al., 2009), together with the nature of the co-ordinates that are used to describe placement. When transforming audio between systems, these issues need to be taken into consideration. In a binaural version of the same mix, all of the effects of speaker placement, head morphology and room acoustics are typically absent or grossly simplified (although modelling is rapidly developing). Thus, if translation between headphone- and speaker-based mixes is required, the approach must be considered with care to either maximize ‘natural compatibility’ or allow for the various mix components to be optimally reassembled.

Objects and Channels

There are some fundamental differences between channel-based and object-based 3-D production. ‘Objects’ are aspects of audio, including both PCM-type files and associated parameters, that exist discretely, only to be reassembled at the point of playback, and this approach offers great flexibility around the modes and devices of reproduction (Pike et al., 2016). In 3-D work, a given audio file might be associated with a specific point in space and thus be linked to its spatial playback information. This approach is independent of the number of speakers or size of a given auditorium, and is employed in current 3-D cinema systems such as Dolby Atmos® as well as gaining increasing use in broadcast.

Channels (in this context not to be confused with the constituent audio streams of multichannel tracks as above) refer to the routing of an audio source to a discrete speaker output. This paradigm is familiar in stereo, and has also been traditionally used with horizontal surround systems. Such ‘native’ recording allows for the capture and reproduction of timing differences to be conveyed between mic capsules and speakers. These differences are crucial for convincing 3-D reproduction, and such an approach allows some of the real spatial information that is present in the recording location to be encoded and transmitted over speakers (or binaurally rendered to headphones) at a later stage. Conversely, one cannot record natively in any meaningful way for object-based workflows.
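To make the object paradigm concrete, the sketch below treats an ‘object’ as nothing more than a mono buffer plus positional metadata, which is only turned into speaker feeds at playback, for whatever layout happens to be present. The layout and the simple dot-product gain law here are stand-ins invented for illustration; a real renderer would use something like VBAP or an Ambisonic panner.

```python
import numpy as np

# Hypothetical playback layout: speaker name -> (azimuth deg, elevation deg)
LAYOUT = {"L": (30, 0), "R": (-30, 0), "Ls": (110, 0), "Rs": (-110, 0),
          "TpL": (45, 45), "TpR": (-45, 45)}

def unit_vector(az_deg, el_deg):
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(az) * np.cos(el), np.sin(az) * np.cos(el), np.sin(el)])

def render_object(samples, az_deg, el_deg, layout=LAYOUT):
    """Turn one object (mono audio plus direction metadata) into per-speaker feeds at playback time."""
    direction = unit_vector(az_deg, el_deg)
    raw = {name: max(float(np.dot(direction, unit_vector(*pos))), 0.0) for name, pos in layout.items()}
    norm = np.sqrt(sum(g * g for g in raw.values())) or 1.0   # rough constant-power normalization
    return {name: (g / norm) * samples for name, g in raw.items()}

# Example: the same object renders to however many speakers the end system declares
feeds = render_object(np.random.randn(48000), az_deg=120.0, el_deg=20.0)
```

The point is simply that the mix itself carries no speaker feeds at all until this step runs at the consumer end.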
The engineer is limited to reproducing notionally monophonic recordings in some kind of monophonic-object playback/rendering system. The objects can be placed in a periphonic system using amplitude panning, but a ‘believable’ placement requires (at least) convincing early reflections, since these play a major part in psychoacoustic source localization. Adequate tools for generating satisfactory reflections (and hence externalization) in object-based systems are still developing, since until recently the CPU cost of generating ‘scalable’ reflection/reverb systems often outstripped their commercial application, and the approach can adversely affect the balance and spectra of the mix in different rooms. De Sena et al. (2013, 2015) have made considerable progress in this regard with a computationally efficient reverb algorithm that uses a network of delay lines connected via scattering junctions, effectively emulating the myriad paths that sound takes as it propagates around a modelled room. Also, FB360 employs a system that models some key early reflections to enhance binaural placement, and also allows the subsequent application of ‘production reverbs’ in the processing chain. Oculus and Google are currently developing near-field binaural rendering that takes account of the perceptual issues caused by head shadowing (Betbeder, 2017), and also the modelling of volumetric sounds that emanate from areas larger than a more typical point source (Stirling, 2017; Google, 2017).

One current way of approaching this reflection/object problem is to allow reflections and reverb to be generated in (channel-based) beds, whilst the objects themselves are rendered as ‘dry’ objects at playback (perhaps with monophonic early reflections/reverb ‘baked in’). This may give a reasonable semblance of realistic positioning; however, the effectiveness of this approach is very dependent on the audio material, and there may be perceptual differences at different playback locations in the sound field. Equally, different kinds of audio material have varying degrees of dependence on the reproduction of these kinds of ‘realistic’ reflections. For example, in orchestral material both the recording environment itself and the positions of the instruments therein are crucial to the musicality of the piece, and so such music is heavily dependent on ‘authentic’ spatial reproduction. More ‘abstract’ materials (e.g. pop or electronic) may not be as dependent on realistic acoustic placement, and so may translate equally well in an object-based environment – since the clarity and depth of each sound’s position is inherently less fixed and thereby open to a less strict spatial interpretation.

There are also implications for using effects processing in channel- and object-based systems. Dynamic Range Compression (DRC), as usually used in standard stereo systems, starts to take on a more complex character when working spatially. Unlinked channel-based compressors create deformations of the spatial image, and there are few linked compressors that work beyond stereo or 5.1. Bryan Martin (2017) states that in 3D, DRC begins to sound ‘canned’ and tends towards making sound sources appear smaller rather than larger. Further, when working in pure stereo the artefacts of compression might tend to become desirable yet subtle components, but these same artefacts become increasingly apparent and unpleasant in large 3-D formats.
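To make the linking point concrete: a ‘linked’ compressor derives a single gain-reduction signal from all channels and applies it equally, so that level differences between speakers, and hence the spatial image, are not pulled about. The sketch below is a deliberately crude instantaneous version with no attack or release smoothing, purely to show the shared-detector principle.

```python
import numpy as np

def linked_compress(channels, threshold_db=-18.0, ratio=4.0):
    """channels: array of shape (n_channels, n_samples). One detector, one gain, every channel."""
    eps = 1e-12
    level_db = 20.0 * np.log10(np.max(np.abs(channels), axis=0) + eps)  # loudest channel, per sample
    overshoot = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -overshoot * (1.0 - 1.0 / ratio)                          # identical reduction everywhere
    return channels * (10.0 ** (gain_db / 20.0))
```

An unlinked version would compute level_db per channel, and the differing gains would then shift apparent source positions between speakers, which is the image deformation referred to above.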
It is not yet feasible to have a true object-based compressor (that is, one that will operate consistently at the playback stage), and such control of dynamic range as is possible can only really be applied to the original monophonic sounds before they enter the panning system. Having said that, the MPEG-H 3D audio-coding standard provides an enhanced concept for DRC that can adapt to different playback scenarios and listening conditions (Kuech et al., 2015). Given the increase in popularity of object-based approaches, it is likely that solutions will soon emerge that unify the object production chain. Aside from technical challenges, keeping within acceptable processing requirements can be an issue. Systems that offer hardware rendering such as PSVR or Dolby Atmos® tend to avoid such concerns. Indeed, for theatrical applications these issues are generally less problematic, since cinema sound tends not to rely on the character that compression effects bring to material, but for creative music production, the compromised nature of this familiar toolset tends to have greater implications. Indeed, these issues apply to any traditionally insert-type effects that might affect objects spectromorphologically. It is also worth noting that because Ambisonic audio is encoded throughout the production chain between the point of capture and reproduction, it is not generally possible to apply any kind of effects processing to it either, since such ‘distortion’ will conflict with the delicate spherical harmonics, especially at higher orders. There are a few Ambisonic plug-ins that offer EQ, delay and reverb, but these tend to be limited to lower-order operation.

As a side note, object-based production workflows are nothing new per se. At all times during a DAW-based mix, one is effectively working with objects. The fundamental difference is that during ‘normal’ mixing the rendering to speakers happens at the same moment as the mix proceeds, as opposed to an object-based mix that postpones that rendering to the moment of playback at the consumer end. An obvious implication, in addition to the above, is that the normal kinds of ‘buss’ processing which engineers are used to are no longer available when the mix is fragmented into its constituent objects, since the summing of the playback/production chain has been deferred to the moment of reproduction. This is why we currently see hybrid systems of objects and channels, and ‘pure’ object-only systems are still in evolution. Modern game engines also follow this logic – and in fact in many ways have been leading such workflow architecture for some time – blending stereo and 5.1 tracks with real object-based sounds to achieve pleasing results.

Capture considerations

When assessing the value of the channel/object parts of a given system, one must look at the material conditions of the recording in question. If one wants to capture something of the ‘acoustic’ and the spatial relationships between elements as they were in the room at the time of recording (and have real coherence between speakers at playback), then some form of multi-microphone/channel-based system currently tends to be preferable to the object approach. This state of affairs is likely to be the case only so long as synthetic acoustic modelling tools lag behind the ‘real thing’. As soon as the audio equivalent of photo-realistic visual effects is achievable, the flexibility advantages of an object approach may make it the dominant form.
It is worth considering the difference between working with spaced microphones and an Ambisonic microphone (e.g. a Calrec Soundfield). Of the former, Williams’ psychoacoustic research (2004) looks at the way in which multi-microphone systems create ‘coverage’ between any given pair of capsules. Williams notes that the Stereo Recording Angle (SRA), coverage and angular distortion are all dependent on a combination of the microphones’ polar patterns, their relative angles and their distance from each other, and this points to potential issues with Ambisonic microphones that are frequently overlooked (Williams, 2017). An Ambisonic microphone utilizes (theoretically) coincident capsules, and this means that it can only record intensity differences. Crucial timing information from the recording location is lost (and cannot be recovered). Further, the capsule types and angles imply large sections of SRA overlap in the horizontal plane once summed for playback, and this gives rise to angular distortion at regular intervals around the full circular field of capture (which becomes increasingly apparent outside of a very small sweet spot). The lack of timing information, coupled with the overlap of polar patterns, can lead to an unsatisfactory image that lacks the spaciousness and accuracy of a spaced-array recording (Williams, 1991).

Some of the issues of first-order Ambisonic microphones have been improved with the recent introduction of higher-order Ambisonic microphones. These greatly improve some of the angular distortion issues associated with first-order variants, but they can come at the cost of spectral-colouration issues such as the low-frequency roll-off associated with more directional cardioids, or potentially high-frequency phase artefacts from their large number of capsules. As is so often the case, there is a trade-off to be made between spectral and spatial quality on the one hand, and practicality and cost on the other, and the engineer must evaluate this for each project. New tools are emerging to support this, such as Hyunkook Lee’s (2017) Microphone Array Recording and Reproduction Simulator (MARRS) app, which predicts “the perceived positions of multiple sound sources for a given microphone configuration” and can also “automatically configure suitable microphone arrays for the user’s desired spatial scene in reproduction”.

In a VR workflow there is often an assumption that Ambisonic recording is ‘native’ recording. Whilst this may be true enough for some kinds of single-perspective 360º video recordings, it can be less so for ‘true’ VR that utilizes 6DOF, where the listener’s locative movement within a virtual space and its associated sound field requires a shift in sonic perspective. Spaced arrays and multi-microphone setups can be reproduced in VR if care is taken over their reproduction in the audio system of a game engine. Issues with phase tend to be dependent on the material being reproduced, the placement of the ‘array’ and the particular qualities of the binaural decoder. Recordings with low levels of correlation in each channel tend to work best. Game engines also allow for other more sophisticated ways of placing and spatializing monophonic sounds, and Ambisonics might be only one aspect of the greater sound mix, perhaps layered with binaurally synthesized elements and head-locked audio. Further, coverage from multiple microphones can be blended into ‘scenes’ using synthetic acoustic placement (see below).
Summing synthetically positioned spot microphones and ‘spatial’ Ambisonic recordings can achieve very satisfactory results, and becomes necessary to satisfy the requirements of balancing dry/close recordings with more distant or reverberant elements. This approach has been corroborated by Riaz et al. (2017) in a scientific setting, although the full results of that test are not published at the time of writing. Phase is likely to play an important part in the success of such ‘scene’ combinations, and so the engineer might evaluate this and adjust it for optimal timbre and placement.

At this time, a common problem with HOA is that of channel count, since the host DAW perhaps needs to accommodate say 36 channels (for fifth-order Ambisonics) for each ‘monophonic’ track of audio, and this quickly becomes demanding in terms of CPU load and streaming bandwidth. Mixed-order Ambisonics can offer a solution whereby different orders are combined, perhaps with first-order Ambisonics for sounds like beds that have no need of more precise localization, and HOA for key sounds in need of accurate placement. This approach might also be implemented by providing decreased resolution in the vertical plane, to which the ear is less sensitive; the ears sit on a horizontal plane and are therefore more attuned to it (Travis, 2009). However, in either case, care is needed with the type of decoder used:

“A ‘projection’ decoder will have matrix coefficients that do not change for the first four spherical-harmonic components for 1st, 2nd and 3rd order. So simple summation in the equivalent of the ‘mix-buss’ will work here, regardless of the layout. In contrast, a pseudo-inverse decoder won't behave this way and will have different matrix coefficients for the first four spherical-harmonic channels for each order. Thus the layout must be as symmetrical/even as possible, for this to give results similar to projection.” (Kearney, 2017)

When working in linear environments like cinema (as opposed to real-time, for instance in VR or gaming, where spatialization requirements might change dynamically), there is scope to render high-quality acoustic spatial elements. For instance, it is possible to ameliorate the above CPU and bandwidth issues by ‘baking in’ higher-resolution reverbs into higher-order Ambisonic beds. Another approach, taken by the PSVR system, is object based: each voice is sent with its associated azimuth and elevation information, via middleware, to bespoke audio libraries in the PSVR hardware for rendering along with mixed-order components. This gives excellent localization for the objects – which can also be moving, if tied to animated ‘emitters’ associated with visual elements of a game or application – but loses the ability to perform more typical channel-based production processing en route.
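Returning to the channel-count and mixed-order points above: the number of channels for order N is (N + 1) squared, and with an ACN-style ordering the lower-order components occupy the leading channels of a higher-order stream, so a first-order bed can be folded into a higher-order mix bus simply by zero-padding, relying on the projection-style decoding behaviour described in the quotation. A minimal sketch, with stand-in signals:

```python
import numpy as np

def n_channels(order):
    """Ambisonic channel count for a given order: (N + 1) squared (1st -> 4, 5th -> 36, 7th -> 64)."""
    return (order + 1) ** 2

def pad_to_order(signal, target_order):
    """Zero-pad a (channels, samples) Ambisonic signal up to a higher order (ACN-style ordering assumed)."""
    extra = n_channels(target_order) - signal.shape[0]
    return np.pad(signal, ((0, extra), (0, 0)))

# Example: sum a first-order ambience bed into a third-order mix bus
bed_first_order = np.random.randn(n_channels(1), 48000)   # stand-in first-order bed
lead_third_order = np.random.randn(n_channels(3), 48000)  # stand-in higher-order element
mix_bus = lead_third_order + pad_to_order(bed_first_order, target_order=3)
```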
Mixing for Binaural

The pinnae, head and torso modify sounds before they enter the ear canal (filtering via reflection effects) in unique ways that are individual to the listener and dependent on the morphology of that person. Whilst ITD (inter-aural time difference) and ILD (inter-aural level difference) are important for determining the position of sounds in the horizontal plane, spectral content, filtered largely by the pinnae, is mostly responsible for the perception of elevation, or sounds in the median plane (Wightman and Kistler, 1997), although this is slightly augmented by head and torso effects. Combining the above into filters whose response is given by such a Head-Related Transfer Function (HRTF) allows a sound to be processed by that filter in order that the brain perceives it as coming from a particular point in space. Such HRTFs need to be provided for every point in space that might be emulated, and the appropriate HRTF filtration applied to a source sound relative to a given localization placement. This is the mechanism of binaural reproduction in headphones and forms the basis for the binaural ‘synthesis’ previously mentioned. Although HRTFs are person-specific, there have been a number of one-size-fits-all ones developed, e.g. Furse’s (2015) ‘Amber’, which exploits statistical averages to gain the best (compromised) optimization for the largest number of people. A good introductory treatment of binaural theory is given by Rumsey (2001).

The implications of the above for recording and mixing in 3D are important. The first is that binaural 3-D mixing on (and for) headphones using a non-personalised, off-the-shelf HRTF will tend to lead to mixes which may or may not work as intended for others, depending on the relative differences between the morphologies of: 1) the mixer, 2) the HRTF model that was utilized, and 3) the end listener. Whilst some tools allow the engineer to select a preferred HRTF, and this will undoubtedly give a truer and more responsive experience to that individual, it is less likely to translate successfully to the largest proportion of end listeners. One option is to first mix to a favoured HRTF, then switch to one such as Amber and optimize the mix for that prior to final distribution. Clearly, such an approach is an idiosyncratic choice. Another strategy is to mix as much as possible over a 3-D speaker array and use a conversion transform at the end of the mix process to complete the 3-D processing to a headphone format. This reduces the error distance between the mixer’s morphology and the end user's, since the engineer will not attempt to correct the HRTF model to suit his or her own peculiarities. If this approach is adopted, then care must be taken to ensure that the conversion tools are high quality, since there can be large differences in the quality of encoders/decoders that convert between speakers and headphones, and vice versa. This approach can represent something of a leap of faith for the engineer though, who simply has to accept the algorithmically created binaural mix. There is some hope that more personalized HRTF models will ameliorate some of these margins of error in the medium term, but it remains an issue at the point of writing.

There are numerous sets of Ambisonic software tools available at the time of writing, many of which are freeware. Several also feature binaural decoders, and can therefore monitor and render for headphones. Many of these tools tend to be first order, although the excellent ambiX suite (Kronlachner, 2014) goes up to 7th order.
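Mechanically, the binaural ‘synthesis’ described above comes down to convolving a dry source with the pair of head-related impulse responses (HRIRs) for the target direction. The sketch below assumes such a set has already been loaded into a dictionary keyed by direction; that hrirs variable and the nearest-neighbour lookup are simplifications for illustration (real sets are typically distributed in the AES69/SOFA format mentioned earlier, and real renderers interpolate between measurements).

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_synth(mono, az_deg, el_deg, hrirs):
    """Render a mono source to two ears using the nearest measured HRIR pair.

    hrirs: dict mapping (azimuth_deg, elevation_deg) -> (left_ir, right_ir) NumPy arrays.
    """
    nearest = min(hrirs, key=lambda k: (k[0] - az_deg) ** 2 + (k[1] - el_deg) ** 2)
    left_ir, right_ir = hrirs[nearest]
    left = fftconvolve(mono, left_ir)[: len(mono)]
    right = fftconvolve(mono, right_ir)[: len(mono)]
    return np.stack([left, right])
```

Ambisonic-to-binaural decoders commonly do much the same thing for a set of virtual loudspeaker directions and sum the results, which is one reason the quality of the HRTF set and of the decoder matters so much to how well a speaker mix ‘collapses’ to headphones.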
A typical DAW workflow would be to insert an Ambisonic panner (an encoder) onto a monaural audio track; the panner would send its multichannel output to a buss, with the track’s main output muted. The panner will convert the track to (say) four channels for first order, which can now accommodate 3-D spatial information. A similar setup should be implemented on other tracks. The four-channel busses should be routed to another track of the same channel width, and a suitable binaural decoder should be inserted on this track. This will have four inputs and two outputs, the latter representing the headphone feed. The inserted panners can then be used to spatialize the various tracks, thus localizing monophonic sound. This basic setup can be extended by having a parallel destination track with an Ambisonic reverb inserted on it, then fed into the main destination track prior to the decoder. To work at higher orders, panners and decoders designed for this can replace those above, and the channel count increased accordingly for all tracks and busses.

People will perceive the output of such a system in different ways and to different degrees of the intended effect; this is a function of the HRTF issue, but also of variations in headphones, including their physical structure, and more. It is generally possible to create definite localization to the rear, although sometimes this is just perceived as ultra-wide stereo. Elevation tends to be less successful. However, interesting mixes can be created with good separation and width. Of course, the engineer cannot know exactly how others will perceive it.

As mentioned, much of the current interest in binaural Ambisonic audio is coming from VR HMDs, and crucially these feature head tracking. An Ambisonic sound field can be rotated through simple multiplication by a rotation matrix, and if head-tracking data is converted to such a matrix, then the sound field can be rotated in counter motion to the head to give the impression of sounds that are fixed in space and independent of the listener's head movement, as is of course the case in real life. Such head tracking is very effective at removing front-back confusions, and further greatly enhances perception of elevation, even if that is only through the listener being able to ‘look up’ at the audio and bring it into frontal auditory focus to affirm that it is up there. To monitor this in a DAW, a ‘rotator’ plugin can be inserted before the decoder, and head-tracking data routed to the rotator’s parameters. To replicate the effect for the end user, an HMD needs to be supplied with B-format Ambisonic multi-channel audio so that the rotation can be rendered ‘live’. At the time of writing, various head trackers are available that can be attached to normal headphones, bypassing the need for an HMD, although they typically have various compatibility issues, for instance being tied to a given manufacturer’s software. Hedrot (Baskind, n.d.) is a cheap DIY head tracker that is widely compatible via OSC data.

The nature of HRTF-based spatial modelling means that filtering (particularly of upper frequencies) is readily discernible in many listening directions. This becomes much more apparent with dynamic head tracking – manifesting itself as something of a sweep – and so care must be exercised regarding what elements of a mix are panned where.
If, for instance, a drum-kit stem is placed at the front, there will be noticeable attenuation of the cymbals as the head is turned, and this might be less preferable than the desired effect of spatialization. Head locking of such parts provides a solution. Sounds with more arbitrary timbre are more tolerant of such panning, since there is a less polarized ‘right’ sound, and so these might be better placed around the sound field. Good effect can be gained by fixing a key sound such as bass to the front, since the anchored low-frequency energy gives a strong impression of rotation regardless of what else is panned, but care must be exercised since rotating low frequencies can also induce nausea, and so judicious filtering – possibly splitting a given musical part into separately locatable tracks – might be necessary. Interesting effects can be readily created, for instance with high-pass-filtered reverb only apparent when looking up, which can give the impression of a high ceiling. Something that should be considered is that a unique musical journey can be created dependent on head position. The music might literally be presented differently dependent on the user’s aural ‘field of view’, analogous to visuals on a train journey where one might look in different directions each time the journey is taken. It is quite possible that head tracking will become a standard feature of many headphones, perhaps especially as Augmented Reality proliferates.

Some speaker considerations

It is commonly accepted that vertical perception is guided by spectral cues, whereby the higher the frequency, the higher the perceived height relative to a single source loudspeaker; in fact, for a center speaker (on the median plane), frequencies below 1 kHz tend to be perceived as coming from lower than the speaker. This is supported by several scientific studies and is called the pitch-height effect (Cabrera and Tilley, 2003). Lee (2015) developed an upmixing technique called Perceptual Band Allocation (PBA) that could enhance the 3-D spatial impression by band-splitting horizontal-surround-recorded ambience and routing the bands to the higher and lower loudspeaker layers of an array (the upper frequencies to the upper layer). The PBA system outperformed an eight-channel 3-D recording in listening tests, and a center frequency of 1 kHz was proposed for this to work. Lee (2016a, 2017a) also examined the degree of Vertical Image Spread (VIS) – the overall perceived height of the sound field – this time with phantom images. This was also done via PBA, and he experimented with the control of the VIS on a per-band basis by mapping multiple frequency bands to speaker layers. Lee found that the vertical locations of most bands tended to be higher than the height of the physical speaker layer, and that it was generally possible to control the VIS via different PBA schemes.

There is another psychoacoustic phenomenon related to the pitch-height effect: the ‘phantom-image elevation effect’ (De Boer, 1947), which also influences our perception of elevation. This is the observation that for perfectly coherent stereophonic phantom-center images on speakers (e.g. a vocal, panned centrally between a pair, but not routed directly to any center speaker), different frequencies perceptually ‘map’ themselves to different elevations in the median plane.
It is a common approach in channel-based mixing with a speaker array to employ a conventional stereo source routed to a pair of speakers, which of course might typically form a phantom image in the center. In this context, Lee (2016b, 2017b) again explored the relationship between frequency and psychoacoustic perception of height using a variety of loudspeaker-pair base angles, including the conventional (stereo-like) 60°, and 180° pairs that were orthogonal to the median plane – directly to either side of the head. For the 60° pair, broadband frequencies appeared to be elevated; for the 180° set-up, both the 500 Hz and 8 kHz octave bands and sounds with transient characteristics were the most prominently elevated, and the 1 kHz region was often perceived to be behind the listener for all base angles. Further, as the base angle increases from 0° to 240°, the perceived phantom image is increasingly elevated. Overall, elevation was negligible (along with general directionality) for frequencies below 100 Hz. Although the higher-frequency elevation aligns with Blauert’s (1969) ‘directional bands theory’, Lee hypothesizes that the sub-3 kHz effect is due to acoustic crosstalk from a pair of speakers being interpreted by the brain as reflections from the shoulders.

The application of such theory to mixing offers an enhanced understanding and extends the possibilities for spatial placement. When mixing to ‘single’ speakers, EQing to roll off above 1 kHz can extend the sound field to below that speaker, and in general, higher frequencies will be perceived as coming from progressively higher. Although the pitch-height theory pertains to the center speaker, this effect might momentarily extend to any speaker in the array, since as the listener rotates their head to face it, it comes into the effective median plane. Sounds panned to the height layer of a speaker array that contain a certain amount of higher frequencies may be perceived as emanating from even higher, offering an extension of the sound field, and high-pass filtering might offer a further exaggeration of this effect.

Phantom centers present interesting opportunities. When working with 60° stereo pairs of speakers that create a coherent phantom center, most frequencies will elevate, and these might be balanced against any signal actually routed to the center speaker to create a vertical panorama on the median plane. Mid-range content (possibly band-pass filtered, since the 250 Hz octave elevated less) panned equally to a 180° pair will again elevate the phantom image, and this could be implemented with either a main speaker layer or indeed the height layer to gain greater sonic elevation; the same applies for 8 kHz. Increasing elevation might be generated by automating the routing to progressively wider pairs of speakers, but this should be done in steps rather than as a continuous pan, to avoid comb filtering and transient smearing in the process. The 1 kHz octave band has a ‘center of gravity’ behind the listener, and thus panning and EQ can be arranged to exploit this. Conversely, it is useful to remember not to ‘fight’ this phenomenon.

Also, where bespoke 3-D reverbs are not available, one can follow the PBA ideas and achieve good results by band-splitting a reverb at, say, 1 kHz, and placing the high-passed components in the height layer. This expands the vertical perception of the space, and has fewer of the problems of using two reverbs to achieve the same result.
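A minimal sketch of that band-splitting idea follows, assuming a stand-in horizontal reverb return that is split at 1 kHz with complementary filters, the low band feeding the main layer and the high band the height layer. The crossover frequency and filter order are of course matters of taste and of the material.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def pba_split(signal, sample_rate, crossover_hz=1000.0, order=4):
    """Split a reverb/ambience signal: low band for the main layer, high band for the height layer."""
    low_sos = butter(order, crossover_hz, btype="lowpass", fs=sample_rate, output="sos")
    high_sos = butter(order, crossover_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfilt(low_sos, signal), sosfilt(high_sos, signal)

reverb_return = np.random.randn(48000)                 # stand-in for a horizontal reverb return
main_layer_feed, height_layer_feed = pba_split(reverb_return, sample_rate=48000)
```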
Thus, these theories present a basic premise for working with height in certain circumstances, although naturally any such spectral presentation is likely to be subservient to the greater tonality of the mix. It should be remembered that much of this applies principally to phantom-center images, and other components of a mix might easily compensate – or fight – spectrally and hence spatially. There is a copious body of literature on 3-D capture for the interested reader to explore; the archive of the Audio Engineering Society is a good place to start.

Mixing: general approaches

As one moves towards higher spatial resolution in 3D, the intrinsic quality of the original component recordings/parts of the piece becomes increasingly laid bare, allowing a greater focus on the individual elements of a mix. These now have more space to be heard clearly, just as might be desired by three-point panning in stereo. That aside, in stereo, when all sounds come from just a pair of speakers there is a lot of spectral and transient masking and blending that can greatly obfuscate many of the elements within that mix. This can of course be a very desirable aspect of balancing the competing elements of a complex music mix, and may in fact be the reason that ‘mixing’ might be required in the first place. As the speaker count (or binaural headphone resolution) increases, so do the opportunities to more precisely discern individual elements (assuming that the total number of audio elements in the mix remains constant and that they exploit the extra ‘space’). As this happens, the otherwise ‘hidden’ qualities of the underlying elements are increasingly exposed – for better or for worse. This is doubly so when utilizing monophonic sounds in an object-based system, where background noise and recorded reflections can compromise the ability to repurpose the sound in 3-D space. Bryan Martin (2017) attests to all this and also emphasizes that edits must be much tighter than are often accepted in dense stereo work. For these reasons, 3-D mixing invariably demands a number of new things of the mixer:

1] There needs to be a greater amount of sonic material to ‘fill’ the new-found space, and this can be approached in two ways: from a compositional/arrangement perspective, or from a pure mix perspective. Clearly, compositionally there is physically more space for the arrangement of elements in a 3-D space than there is in a stereo mix. Adding new layers of instrumentation (which need not represent harmonic extensions of the extant musical parts) is one approach to fill this space, although this approach overtly changes the character and arrangement of the original music. The same approach works more readily in film, where ambience can more easily be added to and augmented into height/surround layers based on the content of the original stereo ambience. With music, this approach can lead to complexity and adornment that may be undesirable, and it is an approach that only tends to work with abstract (non-acoustic) performances. Of course, many composers will relish the opportunity to embrace such a medium and will tailor their arrangements in order to exploit the new-found possibilities. As 3-D delivery proliferates, such approaches will likely become accepted and commonplace. Perhaps the more readily adopted approach is to use mix techniques to expand the sonic space of the original, in ways that do not compositionally change the piece in any radical manner.
There are a number of approaches to achieving this kind of ‘spread’ of the component elements. The first is to simply pan elements to their new positions in the 3-D sphere. The downside of this is that, in their own right, each direction’s audio can appear somewhat spartan in content, as the original busy stereo field is decomposed into individual elements – separated and exposed over the larger playback area. The second is to add width and size to those elements. If the placement of a sound renders it too small in the overall soundstage, when for instance it is emanating only from a single speaker, then the mix engineer might wish to spread it over more than one unit. This can have a number of effects. If it is (say) evenly panned between two speakers that subtend around 60º to the listener, then it will generate a new phantom center that will not necessarily sound any bigger than the original – just displaced – and biased panning will simply move the phantom’s origin. Further, if transient material is present, then smearing can occur at certain points in the listening area, and this of course will be a function of the relative distance to each of the two speaker units.

If sounds are shared over two or more speaker channels, it is always a good idea to ensure there is some kind of decorrelation, so that the sound is not exactly the same in all speakers – this at least mitigates the phantom-center problem, although not the transient smearing. Sound sources in the real world are essentially decorrelated through reflections and asymmetries of the listening environment. Decorrelation within a mix environment can be achieved through slightly offsetting EQ settings, delay times, or pitch. Such EQ may not be desirable tonally, and strategic decisions might need to be taken to balance such tensions. Decorrelation can also be achieved by use of reverb systems and impulse-response convolutions. Some commercial spatialization systems feature a parameter called ‘spread’ or some such that can automatically change the size of a source in the sound field, although their effect is not always as intuitively expected.

2] Time-based effects (at least delays and echoes) can be most effective if adapted to spatially modulate over the front-back and median planes. There are also some excellent ‘3D’ reverbs that have come onto the market that can achieve great results, but stereo or mono algorithms may be used in multiple channels, so long as the settings are adjusted in each to ensure a satisfactory level of decorrelation, and of course interesting effects can be generated by separating the ‘wet from the dry’ spatially, to varying degrees. However, delays and phase shifts can also be used to place sounds in ‘the sphere’ according to the precedence effect; there are situations where it is better not to rely on amplitude panning alone. The ‘ambience’ around a source might need to change accordingly – particularly for moving sources – and to achieve this without overtly muddying the mix is a question of balance, taste and technique. Smaller ‘room’ settings with short early-reflection times and shorter reverb tails will often still work well if too much reverb is problematic for the greater mix. Martin (2015) makes a detailed examination of the use of synthetically generated early reflections in a 22.2 channel-based system, and gives clear descriptions of the utilized timings.
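As a concrete illustration of the decorrelation approach mentioned above (offsetting delay times per speaker feed), the sketch below spreads one mono element across several speakers with small, unequal delays and gain trims. The values are arbitrary and kept to a few milliseconds so that the element reads as larger rather than as an audible echo; per-feed EQ or pitch offsets could be layered on in the same way.

```python
import numpy as np

def spread_with_delays(mono, sample_rate, delays_ms=(0.0, 0.9, 1.7, 2.6), gains=(1.0, 0.8, 0.8, 0.7)):
    """Feed one element to several speakers with slightly different delays/gains for decorrelation."""
    feeds = []
    for delay_ms, gain in zip(delays_ms, gains):
        offset = int(sample_rate * delay_ms / 1000.0)
        feeds.append(gain * np.concatenate([np.zeros(offset), mono])[: len(mono)])
    return np.stack(feeds)   # shape: (n_speakers, n_samples)
```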
There is some disagreement over how well the precedence effect works in the vertical, but it is a viable option for placing sounds, and the individual mixer will have to determine the impact of such use in their own mix. If translation to binaural might later be required, then the effectiveness of such panning mechanisms and indeed the reverbs must also be considered in that context. Reverbs in particular can cause problems when ‘collapsed’ from speakers into a binaural render.

3] DRC in HOA has spatial and spectral artefacts that can be most undesirable, and such a process falls into the category already mentioned that corrupts spherical harmonics, leading to a degradation of the spatialization. Also, if working with objects, it should be understood that they are by their nature hard to compress in any natural-sounding fashion – just as sounds in the real world are not naturally compressed. DRC and EQ in both 3-D speaker-based systems and 3-D binaural headphone mixes must be approached with more caution than in a stereo or even surround system. As the 3-D resolution increases, the artifice of heavy compression and EQ becomes increasingly revealed. Martin (2017) warns mixers who wish to embark on 3-D channel-based mixing to: “forget compression – it’s the opposite of what you want”. He also described the counter-intuitive experience of using stereophonic compression techniques in a 3-D mix environment: “Compression fails – because it makes it even smaller. You have this huge panorama to present, and compression just makes it even smaller – so it actually just fights you. People think compression is going to make it more ‘there’ but it makes it smaller, when you want it to be bigger. Compression is not your friend.”

Use of DRC as a ‘glue to hold it all together’ becomes increasingly problematic in 3D. Coupled with the increasing bias towards creating a realistic (or at least believable) sound field around the listener that is engendered when working in higher-resolution 3D, traditional ‘stereo buss’ techniques such as buss compression feel increasingly out of place – not to mention technically unachievable at present. Richard Furse’s Blue Ripple toolset (2017) has an interchange plugin which can transform an Ambisonic mix to a 20-channel A-format that allows the use of standard stereophonic-type effects processing, followed by a re-encode to HOA. Although it is preferable to remain in the Ambisonic realm when mixing ‘Ambisonically’, Furse’s approach allows for signal processing that is not otherwise possible with the currently available Ambisonic toolset. Perhaps the future of mixing in high-resolution 3D will move towards increasing reliance on the highest-quality reproduction of the musical elements themselves, within a dynamic and intelligent physical environment with metadata-linked ends of the processing pipeline.

Conclusion

Immersive audio indicates by its very name that some kind of presence in the performance is anticipated for the listener. This also implies that the overall effect of the sound elements should have a clarity and some kind of all-encompassing physical expression free from the limitations of a 60° wide front-facing stereo track.
It is not always obvious quite how detached much stereo music reproduction is from sounds in the physical world – perhaps most clearly exemplified by classical music – and this leads to issues when extending and translating stereo techniques to 3D, which invariably lean towards bringing sounds into a more open, physically encompassing expression in three dimensions. There are a number of approaches to capturing and creating 3-D audio, and each has its merits and limitations. While stereo and horizontal-surround music production is an extremely well-established art form with a number of conventions and aspects of accepted good practice, many of these are subverted by the increased complexity, both perceptual and technical, that nascent 3-D production entails. As was the case with 5.1, it is unlikely that multi-speaker setups will routinely make it into the domestic living room with appropriate configuration, but as wave-field synthesis, transaural binaural reproduction and distributed-mode loudspeakers evolve, so will home 3D. However, the VR industry is expanding rapidly and bringing head-tracked audio into mainstream expectation, and regular audio-only head-tracked headphones will soon follow in their droves. This will turn relatively banal binaural listening into something much more dynamic, and while the limitations of such systems will preclude their usage in much music, even small aspects of dynamic binaural reproduction will make it into a large proportion of mainstream popular-music mixes. Although binaural is at the mercy of HRTF matching at present, future solutions will overcome this and emancipate headphone listening further still.
3-D music production is evolving and will deserve increasing attention in order to develop the art form, and it is not possible to cover the topic comprehensively in a single chapter. Doubtless, much will be written in due course, and accordingly we can look forward to developing an increased understanding of the praxis. The creative possibilities are exciting – presenting one of the biggest single opportunities to advance music production in a long time. These possibilities await only innovative individuals with suitable tools to build a new sound world that will represent a step change in the creation and consumption of music.
Acknowledgements
The authors would like to thank the experts who graciously contributed interviews for this text: Richard Furse, Hyunkook Lee, Bryan Martin and Michael Williams. Special thanks go to Hyunkook Lee for the guidance around his own work.
References
Barrett N (2002) Spatio-musical composition strategies. Organised Sound 7(3): 313–323. doi: 10.1017/S1355771802003114.
Baskind A (n.d.) Hedrot. Available from: https://abaskind.github.io/hedrot/ (accessed 11 July 2017).
Begault DR (1992) Perceptual Effects of Synthetic Reverberation on Three-Dimensional Audio Systems. J. Audio Eng. Soc 40(11): 895–904. Available from: http://www.aes.org/e-lib/browse.cfm?elib=7027.
Betbeder L (2017) Near-field 3D Audio Explained. Available from: https://developer.oculus.com/blog/near-field-3d-audio-explained (accessed 11 October 2017).
Blauert J (1969) Sound localization in the median plane. Acta Acustica united with Acustica 22(4): 205–213.
Cabrera D and Tilley S (2003) Vertical Localization and Image Size Effects in Loudspeaker Reproduction. In: Audio Engineering Society Conference: 24th International Conference: Multichannel Audio, The New Reality. Available from: http://www.aes.org/e-lib/browse.cfm?elib=12269.
De Boer K (1947) A remarkable phenomenon with stereophonic sound reproduction. Philips Tech. Rev 9(8).
De Sena E, Hacıhabiboğlu H and Cvetković Z (2013) Analysis and Design of Multichannel Systems for Perceptual Sound Field Reconstruction. IEEE Transactions on Audio, Speech, and Language Processing 21(8): 1653–1665.
De Sena E, Hacıhabiboğlu H, Cvetković Z, et al. (2015) Efficient synthesis of room acoustics via scattering delay networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23(9): 1478–1492. Available from: http://dl.acm.org/citation.cfm?id=2824192.2824199 (accessed 3 October 2017).
Digi-Capital (2017) After mixed year, mobile AR to drive $108 billion VR/AR market by 2021. Available from: http://www.digi-capital.com/news/2017/01/after-mixed-year-mobile-ar-to-drive-108-billion-vrar-market-by-2021/ (accessed 16 July 2017).
Elen R (1998) The ambisonic motherlode. Available from: http://decoy.iki.fi/dsound/ambisonic/motherlode/index (accessed 30 August 2017).
Furse R (2015) Amber HRTF. Blue Ripple Sound. Available from: http://www.blueripplesound.com/hrtf-amber (accessed 7 November 2017).
Furse R (2017) 3.1 O3A B->A20 and O3A A20->B Converters. O3A Manipulators User Guide. Available from: http://www.blueripplesound.com/sites/default/files/O3AManipulators_UserGuide_v2.1.4.pdf.
Gerzon MA (1973) Periphony: With-Height Sound Reproduction. Journal of the Audio Engineering Society 21(1): 2–10. Available from: http://www.aes.org/e-lib/online/browse.cfm?elib=2012 (accessed 29 March 2017).
Gerzon MA (1985) Ambisonics in Multichannel Broadcasting and Video. Journal of the Audio Engineering Society 33(11): 859–871. Available from: http://www.aes.org/e-lib/browse.cfm?elib=4419 (accessed 29 March 2017).
Google (2017) Resonance Audio. Google Developers. Available from: https://developers.google.com/resonance-audio/ (accessed 7 November 2017).
Jot J-M, Larcher V and Pernaux J-M (1999) A Comparative Study of 3-D Audio Encoding and Rendering Techniques. In: Audio Engineering Society. Available from: http://www.aes.org/e-lib/browse.cfm?elib=8029 (accessed 29 March 2017).
Kearney G (2017) Ambisonic Question: E-mail.
Kronlachner M (2014) ambiX v0.2.7 – Ambisonic plug-in suite | matthiaskronlachner.com. Available from: http://www.matthiaskronlachner.com/?p=2015 (accessed 28 March 2017).
Kuech F, Kratschmer M, Neugebauer B, et al. (2015) Dynamic Range and Loudness Control in MPEG-H 3D Audio. In: Audio Engineering Society Convention 139. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18021.
Lee H (2015) 2D-to-3D Ambience Upmixing based on Perceptual Band Allocation. J. Audio Eng. Soc 63(10): 811–821. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18044.
Lee H (2016a) Perceptual Band Allocation (PBA) for the Rendering of Vertical Image Spread with a Vertical 2D Loudspeaker Array. J. Audio Eng. Soc 64(12): 1003–1013. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18534.
Lee H (2016b) Phantom Image Elevation Explained. In: Audio Engineering Society Convention 141. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18468.
Lee H (2017a) Interview with Hyunkook Lee. Interviewed by Gareth Llewellyn for Producing 3D Audio.
Lee H (2017b) Sound Source and Loudspeaker Base Angle Dependency of Phantom Image Elevation Effect. J. Audio Eng. Soc 65(9): 733–748. Available from: http://www.aes.org/e-lib/browse.cfm?elib=19203.
Lee H, Johnson D and Mironovs M (2017) An Interactive and Intelligent Tool for Microphone Array Design.
In: Audio Engineering Society Convention 143. Available from: http://www.aes.org/e-lib/browse.cfm?elib=19338.
Lossius T, Balthazar P and de la Hogue T (2009) DBAP – Distance Based Amplitude Panning. In: International Computer Music Conference (ICMC), Montreal.
Lynch H and Sazdov R (2011) An Investigation Into Compositional Techniques Utilized For The Three-Dimensional Spatialization Of Electroacoustic Music. In: Proceedings of the Electroacoustic Music Studies Conference, Sforzando!, New York, USA. Available from: http://www.ems-network.org/spip.php?article328 (accessed 10 February 2018).
Martin B (2017) Interview with Bryan Martin. Interviewed by Gareth Llewellyn for Producing 3D Audio.
Martin B and King R (2015) Three Dimensional Spatial Techniques in 22.2 Multichannel Surround Sound for Popular Music Mixing. In: Audio Engineering Society Convention 139. Available from: http://www.aes.org/e-lib/online/browse.cfm?elib=17988 (accessed 1 July 2016).
Martin B, King R, Leonard B, et al. (2015) Immersive Content in Three Dimensional Recording Techniques for Single Instruments in Popular Music. In: Audio Engineering Society Convention 138. Available from: http://www.aes.org/e-lib/online/browse.cfm?elib=17675 (accessed 1 July 2016).
Pike C, Taylor R, Parnell T, et al. (2016) Object-Based 3D Audio Production for Virtual Reality Using the Audio Definition Model. In: Audio Engineering Society. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18498 (accessed 21 March 2017).
Pulkki V (1997) Virtual Sound Source Positioning Using Vector Base Amplitude Panning. J. Audio Eng. Soc 45(6): 456–466. Available from: http://www.aes.org/e-lib/browse.cfm?elib=7853.
Riaz H, Stiles M, Armstrong C, et al. (2017) Multichannel Microphone Array Recording for Popular Music Production in Virtual Reality. In: Audio Engineering Society Convention 143. Available from: http://www.aes.org/e-lib/browse.cfm?elib=19333.
Rumsey F (2001) Spatial Audio. Oxford; Boston: Focal Press.
Shivappa S, Morrell M, Sen D, et al. (2016) Efficient, Compelling, and Immersive VR Audio Experience Using Scene Based Audio/Higher Order Ambisonics. In: Audio Engineering Society. Available from: http://www.aes.org/e-lib/browse.cfm?elib=18493 (accessed 22 May 2017).
Stirling P (2017) Volumetric Sounds. Available from: https://developer.oculus.com/blog/volumetric-sounds (accessed 11 October 2017).
Travis C (2009) A New Mixed-Order Scheme for Ambisonic Signals. In: Ambisonics Symposium 2009. Available from: http://ambisonics.iem.at/symposium2009/proceedings/ambisym09-travisnewmixedorder.pdf/@@download/file/AmbiSym09_Travis_NewMixedOrder.pdf (accessed 10 October 2017).
Wightman FL and Kistler DJ (1997) Monaural sound localization revisited. The Journal of the Acoustical Society of America 101(2): 1050–1063. Available from: https://doi.org/10.1121/1.418029.
Williams M (1991) Microphone Arrays for Natural Multiphony. In: Audio Engineering Society Convention 91. Available from: http://www.aes.org/e-lib/browse.cfm?elib=5559.
Williams M (2004) Microphone Arrays for Stereo and Multichannel Sound Recordings. Editrice Il Rostro.
Williams M (2017) Interview with Michael Williams. Interviewed by Gareth Llewellyn for Producing 3D Audio.
i Perhaps ironically in the context of this chapter, ‘stereo’ comes from the ancient Greek ‘stereos’, which means solid – with reference to three dimensionality, albeit with regard to the formation of words.
ii A body of research has formed around this, for example (Barrett, 2002) and (Lynch and Sazdov, 2011).
iii Synchresis is the psychological linking between what we might see and hear when such events occur simultaneously.
iv Media composers such as Joel Douek and Michael Price are notable for embracing 3-D approaches.
v In fact, conventional stereophony is a subsystem of Ambisonics (Gerzon, 1985).
vi In fact, conventional stereophony is a subsystem of Ambisonics (Gerzon, 1985).
vii Jot, Larcher and Pernaux (1999) provide a useful text on encoding/rendering.
viii Jot, Larcher and Pernaux (1999) provide a useful text on encoding/rendering.
ix Spherical polar-coordinate solutions to the acoustic wave equation.
x If the channel order = N (which defines the angular resolution of the spherical harmonics), the number of audio channels required for that order is given by (N+1)²; for example, third order requires 16 channels.
xi Although this also applies to any multi-speaker setup, be that stereo or horizontal planar surround.
xii Although this also applies to any multi-speaker setup, be that stereo or horizontal planar surround.
xiii e.g. in stereo: record an acoustic guitar with a single microphone and pan it to play back from a single speaker, or indeed centrally, when it is mapped to both speakers with equal amplitude etc.
xiv e.g. in stereo: record an acoustic guitar with a single microphone and pan it to play back from a single speaker, or indeed centrally, when it is mapped to both speakers with equal amplitude etc.
xv Williams used this term to describe how changes in the relative angles of microphone capsules might shift the perceived position of the instruments on the horizontal plane, once reproduced.
xvi 6DOF: Six Degrees of Freedom, which refers to movement in a 3-D space: front/back, left/right, up/down, plus pitch, roll and yaw of the head position. In other words, the user can navigate the space, and might expect a subsequent shift in sound field both whilst moving and looking around. In contrast, 360° video (often presented as ‘VR’) does not permit locative movement within its space.
xvii 6DOF: Six Degrees of Freedom, which refers to movement in a 3-D space: front/back, left/right, up/down, plus pitch, roll and yaw of the head position. In other words, the user can navigate the space, and might expect a subsequent shift in sound field both whilst moving and looking around. In contrast, 360° video (often presented as ‘VR’) does not permit locative movement within its space.
xviii A separate audio stem in the overall mix that does not respond to head tracking, used to provide elements such as narrative or beds.
xix A bed is an audio track that might be spatialized, but is not dependent on precise localization of any of its content. It might typically be a sonic foundation on which to place other more localized sources.
xx Also known as the mid-sagittal plane, this is the vertical plane that bisects the human body through the navel. As such, strictly speaking it can represent perception of sonic height, but only when looking directly ahead.
xxi Also known as the mid-sagittal plane, this is the vertical plane that bisects the human body through the navel. As such, strictly speaking it can represent perception of sonic height, but only when looking directly ahead.
xxii The HRTF actually exists in the frequency domain. The temporal equivalent is known as Head-Related Impulse Response (HRIR).
xxiii The HRTF actually exists in the frequency domain. The temporal equivalent is known as Head-Related Impulse Response (HRIR).
xxiv Such processing is performed by convolution, a CPU-intense mathematical operation commonly used in reverbs, whereby every sample in the source is multiplied by every sample in the filter and the results summed, in order to impose a spectral fingerprint on the source.
xxv Such processing is performed by convolution, a CPU-intense mathematical operation commonly used in reverbs, whereby every sample in the source is multiplied by every sample in the filter and the results summed, in order to impose a spectral fingerprint on the source.
xxvi Both fixed two-channel, and real-time multi-channel Ambisonic that might also respond to head tracking.
xxvii HRTFs are linked to in-ear or over-ear designs.
xxviii Typically: pitch – analogous to nodding; yaw – looking side to side; and roll – moving the ear towards the shoulder.
xxix Typically: pitch – analogous to nodding; yaw – looking side to side; and roll – moving the ear towards the shoulder.
xxx Other systems are also possible, which might spatialize individual tracks in middleware or a game engine.
xxxi Other systems are also possible, which might spatialize individual tracks in middleware or a game engine.
xxxii Head locking is where the sound is independent of movement, just as with normal headphone listening. A separate buss that bypasses the rotator is required for such parts.
xxxiii The process of converting one playback format to another with a greater number of channels.
xxxiv The process of converting one playback format to another with a greater number of channels.
xxxv In fact, it was shown that the cut-off frequency could be anywhere between 1 and 4 kHz for this effect to hold.
xxxvi In fact, it was shown that the cut-off frequency could be anywhere between 1 and 4 kHz for this effect to hold.
xxxvii To maintain something close to stereo theory, although this only holds when facing forward towards the speakers, and will break down for, say, a lateral placement.