
A tutorial on immersive three-dimensional sound technologies

2020, Acoustical Science and Technology


INVITED TUTORIAL

Craig T. Jin
School of Electrical and Information Engineering, University of Sydney, Building J03, Maze Crescent, Darlington, Sydney, NSW 2006, Australia
e-mail: [email protected]
[doi:10.1250/ast.41.16]

Abstract: There is renewed interest in virtual auditory perception and spatial audio arising from a technological drive toward enhanced perception via mixed-reality systems. Because the various technologies for three-dimensional (3D) sound are so numerous, this tutorial focuses on underlying principles. We consider the rendering of virtual auditory space via both loudspeakers and binaural headphones. We also consider the recording of sound fields and the simulation of virtual auditory space. Special attention is given to areas with the potential for further research and development. We highlight some of the more recent technologies and provide references so that participants can explore issues in more detail.

Keywords: 3D sound technology, Virtual auditory space, Spatial hearing
PACS number: 43.38.Md, 43.38.Vk, 43.10.Ln, 43.10.Sv

1. INTRODUCTION

There is renewed interest in three-dimensional (3D) sound technologies, driven in large part by the push towards enhanced perception via mixed-reality systems. A few commercial examples, as of 2018, are Microsoft's 3D Soundscape and HoloLens, Google's Resonance Audio, the Sony PlayStation headset with 3D sound, and the Oculus headset with its support for 3D sound spatialization. There is also increasing support and awareness for 3D sound among broadcasting companies. A few broadcasting examples are the BiLi project in France, the Binaural Project by the BBC in the United Kingdom, the Orpheus European project, and the NHK Super Hi-Vision theatre with its support for 3D sound using a 22.2 multichannel system. At the current time, the technologies for 3D sound are proliferating rapidly, so this tutorial focuses on fundamental principles.

To begin, we should clarify our definition of 3D sound. By 3D sound, we refer to an immersive experience in which the listener has a clear and extended perception of a 3D sound space: sound objects and sound events are clearly positioned relative to the listener and to each other in an ambient space that encompasses both the listener and the sound objects. It requires something more than a transient illusion or a fleeting perception of sound direction; it requires an extended and believable perceptual experience of auditory space that supports some version of reality.

This tutorial focuses on three primary technological areas: loudspeaker sound reproduction, binaural sound reproduction using earphones or headphones, and sound field recording. A related tutorial introduction is [1]. To a lesser extent we consider sound field simulation, that is, the use of engineering design to create a virtual sound environment. We assume familiarity with the fundamental processes underlying human 3D sound perception. In particular, we assume familiarity with interaural time difference cues, interaural level difference cues, monaural spectral cues and the necessity for head-tracking when rendering 3D sound over headphones. We also assume familiarity with the following mathematical and signal processing terms: head-related impulse response filters, binaural room impulse responses and head-related transfer functions.
2. LOUDSPEAKER SOUND REPRODUCTION

Four loudspeaker reproduction methods shall be considered: vector-base amplitude and intensity panning, transaural cross-talk cancellation, ambisonics, and wave-field synthesis. Loudspeaker reproduction methods for 3D sound can be classified according to the listening conditions. Vector-base amplitude panning can provide a relatively stable acoustic image across a moderate-sized listening area with some compromises in spatial fidelity. Wave-field synthesis can provide higher spatial fidelity, but requires a substantially denser loudspeaker array and is often limited in the frequency range for which high spatial fidelity can be achieved. The ambisonics method generally accommodates only a single listener and has a sweet spot the size of a listener's head. An advantage of the ambisonic method is its rendering flexibility: it can accommodate wave-field approximation at lower frequencies as well as panning at higher frequencies, and scene rotations are easily handled. Transaural cross-talk cancellation systems attempt to directly control the signal at each ear and are generally targeted at a single listener or a few listeners. These systems often come in the form of a single row of loudspeakers, often above or below a video screen.

With the exception of transaural cross-talk cancellation, loudspeaker reproduction methods naturally accommodate head movement. In other words, the loudspeaker array creates a sound field and, when the listener moves his/her head, the acoustic spatial impression changes accordingly. With transaural cross-talk cancellation, the loudspeaker array simulates a virtual headphone listening condition and head motion must be tracked using some form of head tracker.

2.1. Vector-Base Panning

We begin with vector-base amplitude panning [2] and vector-base intensity panning [3]. In three dimensions, we pan the amplitude of an acoustic source across the three nearest loudspeakers as shown in Fig. 1. The three loudspeaker positions are associated with the vectors x_u, x_v, x_w and the virtual source position is indicated by the vector x_S. The virtual source position, x_S, can be expressed as a linear combination of the three loudspeaker positions using matrix algebra as follows:

$$ \mathbf{x}_S = \begin{pmatrix} x_S \\ y_S \\ z_S \end{pmatrix} = \begin{pmatrix} x_u & x_v & x_w \\ y_u & y_v & y_w \\ z_u & z_v & z_w \end{pmatrix} \begin{pmatrix} \alpha_u(\theta_S) \\ \alpha_v(\theta_S) \\ \alpha_w(\theta_S) \end{pmatrix}, \qquad (1) $$

where the vector α(θ_S) contains the component weights for the virtual source position vector expressed in terms of the three loudspeaker position vectors. Note that we label the plane-wave direction associated with the virtual source by θ_S. It is straightforward to compute the vector α(θ_S) given the virtual source position:

$$ \boldsymbol{\alpha}(\theta_S) = \begin{pmatrix} x_u & x_v & x_w \\ y_u & y_v & y_w \\ z_u & z_v & z_w \end{pmatrix}^{-1} \mathbf{x}_S. \qquad (2) $$

[Fig. 1  A virtual source is panned between three loudspeakers. We define vectors from a listening point to the three loudspeakers as x_u, x_v, x_w and a vector to the virtual source as x_S.]

With vector-base amplitude panning (VBAP), we select the three loudspeakers in the loudspeaker array that are closest to the intended virtual source position. We then set the three VBAP loudspeaker gains, g_u, g_v, and g_w, according to α(θ_S); however, we normalize the loudspeaker gains so that the total energy is unity.
In other words, we set the loudspeaker gains as follows:

$$ g_u = \frac{\alpha_u(\theta_S)}{\sqrt{\alpha_u^2(\theta_S) + \alpha_v^2(\theta_S) + \alpha_w^2(\theta_S)}}, \quad g_v = \frac{\alpha_v(\theta_S)}{\sqrt{\alpha_u^2(\theta_S) + \alpha_v^2(\theta_S) + \alpha_w^2(\theta_S)}}, \quad g_w = \frac{\alpha_w(\theta_S)}{\sqrt{\alpha_u^2(\theta_S) + \alpha_v^2(\theta_S) + \alpha_w^2(\theta_S)}}. \qquad (3) $$

Setting the loudspeaker gains according to this equation results in the acoustic velocity vector [4] pointing towards the direction of the virtual source. The physical significance of the acoustic velocity vector pointing towards the virtual source is that the interaural time difference cue for the listener will be appropriate at low frequencies. We say that VBAP preserves the low-frequency interaural time difference cues for the virtual acoustic source. This leads naturally to the question of the interaural level difference cues: will they be veridical with VBAP panning? Unfortunately, the answer is no at the higher frequencies. In order to preserve the interaural level difference cues, we utilize vector-base intensity panning.
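To make the VBAP computation concrete, the following minimal Python/NumPy sketch implements Eqs. (1)-(3): it solves for the component weights of the source direction in terms of the loudspeaker triplet and then normalizes the gains to unit energy. The function name, the loudspeaker layout and the source direction are illustrative assumptions, not code from the paper.

```python
# Minimal VBAP gain computation (Eqs. (1)-(3)); illustrative sketch only.
import numpy as np

def vbap_gains(spkr_u, spkr_v, spkr_w, src):
    """Return energy-normalized VBAP gains for a source panned across
    three loudspeakers. All inputs are 3-vectors (unit direction vectors)."""
    # Columns of L are the loudspeaker direction vectors x_u, x_v, x_w.
    L = np.column_stack([spkr_u, spkr_v, spkr_w])
    # Component weights alpha solving x_S = L @ alpha  (Eq. (2)).
    alpha = np.linalg.solve(L, src)
    # A negative weight indicates the source lies outside this triplet;
    # in practice one would then search for another loudspeaker triplet.
    # Normalize so that g_u^2 + g_v^2 + g_w^2 = 1  (Eq. (3)).
    return alpha / np.linalg.norm(alpha)

# Example: three loudspeakers and a virtual source direction.
xu = np.array([1.0, 0.0, 0.0])
xv = np.array([0.0, 1.0, 0.0])
xw = np.array([0.0, 0.0, 1.0])
xs = np.array([0.6, 0.6, 0.5]) / np.linalg.norm([0.6, 0.6, 0.5])
print(vbap_gains(xu, xv, xw, xs))
```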
Vector-base intensity panning (VBIP) is based on the acoustic energy of the loudspeakers, often referred to simply as loudspeaker energy. The loudspeaker energy for the loudspeaker at position x_u is given by the square of the loudspeaker gain, g_u^2. If we weight the three loudspeaker directions by the loudspeaker energies, rather than the loudspeaker gains, we obtain what is often referred to as the energy vector r_E:

$$ \mathbf{r}_E = \frac{g_u^2 \mathbf{x}_u + g_v^2 \mathbf{x}_v + g_w^2 \mathbf{x}_w}{g_u^2 + g_v^2 + g_w^2}. \qquad (4) $$

Note that the energy vector is normalized by the total acoustic energy. In VBIP, the loudspeaker gains are calculated such that: 1) the vector r_E points toward the virtual source direction; 2) only the three loudspeakers that are most closely associated with the virtual source direction are employed; 3) the total acoustic energy is equal to one. In VBIP, we set the loudspeaker energies, e_u, e_v, and e_w, according to α(θ_S) as follows:

$$ [\, e_u(\theta_S),\ e_v(\theta_S),\ e_w(\theta_S) \,] = \frac{[\, \alpha_u,\ \alpha_v,\ \alpha_w \,]}{\alpha_u + \alpha_v + \alpha_w}. \qquad (5) $$

Correspondingly, we set the loudspeaker gains as:

$$ g_u = \sqrt{\frac{\alpha_u(\theta_S)}{\alpha_u(\theta_S) + \alpha_v(\theta_S) + \alpha_w(\theta_S)}}, \quad g_v = \sqrt{\frac{\alpha_v(\theta_S)}{\alpha_u(\theta_S) + \alpha_v(\theta_S) + \alpha_w(\theta_S)}}, \quad g_w = \sqrt{\frac{\alpha_w(\theta_S)}{\alpha_u(\theta_S) + \alpha_v(\theta_S) + \alpha_w(\theta_S)}}. \qquad (6) $$

One can easily verify that the total loudspeaker energy is unity and that:

$$ \mathbf{r}_E = \frac{\mathbf{x}_S}{\alpha_u + \alpha_v + \alpha_w}. \qquad (7) $$

The difference between the acoustic velocity vector and the acoustic energy vector, and their relation to the interaural time and level difference cues, has long been understood. Nonetheless, when amplitude panning a virtual source between multiple loudspeakers one has a conflict between these two direction vectors: they are not the same, so what should one do? In this case, the duplex theory partially comes to the rescue. The duplex theory originates from Lord Rayleigh and his experiments with tuning forks [5]; it indicates that the interaural time difference (ITD) cue influences perception at lower frequencies and that the interaural level difference (ILD) cue influences perception at higher frequencies. However, there is no clear agreement on the boundary frequency and it depends on how one explores this boundary; for an entertaining and informative review, refer to Hartmann's recent paper [6]. Most practitioners of vector-base panning split the usage of VBAP and VBIP gains at a boundary frequency between 500 Hz and 1,500 Hz.

The boundary frequency, however, is not the only concern. There is also the perceived width of the source, which is related to the norm of the energy vector, r_E [7]. In this regard, practitioners quantify the angular spread, σ(θ_S), of the virtual source using [8]:

$$ \sigma(\theta_S) = 2 \cos^{-1}\!\bigl( 2 \lVert \mathbf{r}_E(\theta_S) \rVert - 1 \bigr). \qquad (8) $$

With VBAP and VBIP, one does not have independent control of the angular spread of the virtual acoustic source. For example, with VBIP the loudspeaker gains are set by the acoustic energy vector and any variation in the individual loudspeaker energies will necessarily modify the perceived loudness of the virtual source. This implies that as the source moves in space its perceived width will vary. Efforts to compensate for the variation in acoustic width are described in [8], but one can only really widen the source width by utilizing a greater number of loudspeakers.
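The following sketch, under the same illustrative assumptions as the VBAP example above, computes the VBIP gains of Eq. (6) together with the energy vector of Eq. (4) and the angular spread of Eq. (8).

```python
# Minimal VBIP gains, energy vector and angular spread (Eqs. (4), (6), (8)).
import numpy as np

def vbip_gains(spkr_dirs, src):
    """spkr_dirs: 3x3 matrix whose columns are x_u, x_v, x_w; src: 3-vector."""
    alpha = np.linalg.solve(spkr_dirs, src)   # component weights, Eq. (2)
    alpha = np.maximum(alpha, 0.0)            # guard against a source outside the triplet
    return np.sqrt(alpha / alpha.sum())       # Eq. (6): g_i = sqrt(alpha_i / sum(alpha))

def energy_vector(gains, spkr_dirs):
    """Energy vector r_E of Eq. (4)."""
    e = gains**2
    return spkr_dirs @ e / e.sum()

def angular_spread(r_E):
    """Angular spread of Eq. (8), in radians."""
    return 2.0 * np.arccos(2.0 * np.linalg.norm(r_E) - 1.0)

spkrs = np.column_stack([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(float)
src = np.array([0.6, 0.6, 0.5]) / np.linalg.norm([0.6, 0.6, 0.5])
g = vbip_gains(spkrs, src)
rE = energy_vector(g, spkrs)
print(g, np.degrees(angular_spread(rE)))
```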
2.2. Higher-Order Ambisonics

Higher-Order Ambisonics (HOA) provides a more sophisticated viewpoint for rendering a sound scene using loudspeakers. Nonetheless, at its essence, it still involves panning a virtual sound source across a number of loudspeakers. With HOA, a circular or spherical array of loudspeakers is generally used and one considers panning across a much larger number of loudspeakers than two or three. This approach is particularly effective for ambient sounds, but less practical for direct sounds. Therefore, more recent approaches separate the direct and ambient components of the sound field [9-12].

A particular advantage of HOA rendering is its ability to work with sound field recordings and its ability to render ambient, reverberant environments. Before delving into the technical details, we attempt to characterize the spirit and intuition behind HOA rendering. Consider a sound field together with a listening viewpoint. Without going into the details of sound field recordings, it stands to reason that if one records the sound field with a spherical microphone array at the listening viewpoint, then any loudspeaker reproduction of that sound field should ideally result in the same spherical microphone array recording when the array is positioned at the listening viewpoint of the loudspeaker array. Indeed, this provides one empirical method to validate loudspeaker reproduction techniques. The spherical microphone array enables a specific spherical harmonic decomposition of the sound field, and this spherical harmonic decomposition is sometimes referred to as an ambisonic decomposition. Thus, the HOA rendering method attempts to identify the spherical harmonic decomposition of the virtual sound source or virtual sound scene and then reproduce this decomposition using the loudspeaker array.

2.2.1. Underlying theory

The HOA method of sound field reproduction is based on the idea that the sound field can be described as a sum of spherical harmonic components, and it can be considered as panning in the spherical harmonic domain. In the frequency domain, any sound field consisting of incoming sound waves can be expressed as a series of spherical harmonic functions [13]:

$$ p(r, \theta, \varphi; k) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} i^l\, j_l(kr)\, Y_l^m(\theta, \varphi)\, b_{lm}(k), \qquad (9) $$

where p(r, θ, φ; k) is the acoustic pressure corresponding to the wave number k at the point with spherical coordinates (r, θ, φ), j_l is the spherical Bessel function of degree l, and Y_l^m is the spherical harmonic function of order l and degree m. Equation (9), known as the spherical harmonic representation of the sound field, states that the sound field is fully described by the complex coefficients b_lm(k). Describing the sound field at any point in space requires an infinite number of b_lm coefficients, which is impossible in practice. Nevertheless, a good approximation of the acoustic pressure in the vicinity of the origin can be obtained by truncating the series to an order L that depends upon the wave number, k, and the radius, r, from the origin according to the formula [14]:

$$ L \geq \frac{e k r - 1}{2}, \qquad (10) $$

where e is the mathematical constant known as Euler's number. Hence, the higher the order, the larger the area, relative to the wavelength, over which the sound field is accurately described.

2.2.2. Basic HOA decoding

Using HOA to record a sound field consists in obtaining the b_lm components up to order L, which are often referred to as HOA signals. Conversely, using HOA to synthesise a sound field consists in controlling the loudspeakers so that they create a sound field having the desired order-L spherical harmonic representation. In the HOA method, it is usually assumed that the loudspeakers act on the sound field as would plane-wave sound sources. This is a reasonable approximation provided the distance from the loudspeakers to the center of the listening area is large enough with regard to the order L. More precise methods that consider the distance of both the loudspeakers and the virtual sources more accurately are described in [15-17].
With these more accurate methods, one can even enable near-field sources [15]. Returning to the simplifying assumption that the loudspeakers may be modelled as plane-wave sources, the relation between the loudspeaker signals and the spherical harmonic components of the sound field is a simple matrix-vector product:

$$ \mathbf{b} = \mathbf{Y}\mathbf{g}, \qquad (11) $$

where b denotes the vector of the spherical harmonic components b_lm up to order L, g is the vector of the loudspeaker complex gains, and Y is the matrix of the spherical harmonic function values at the loudspeaker angular positions:

$$ \mathbf{Y} = \begin{bmatrix} Y_0^0(\theta_1, \varphi_1) & Y_0^0(\theta_2, \varphi_2) & \cdots & Y_0^0(\theta_S, \varphi_S) \\ Y_1^{-1}(\theta_1, \varphi_1) & Y_1^{-1}(\theta_2, \varphi_2) & \cdots & Y_1^{-1}(\theta_S, \varphi_S) \\ \vdots & \vdots & \ddots & \vdots \\ Y_L^{L}(\theta_1, \varphi_1) & Y_L^{L}(\theta_2, \varphi_2) & \cdots & Y_L^{L}(\theta_S, \varphi_S) \end{bmatrix}, \qquad (12) $$

where S is the number of loudspeakers. The loudspeaker signals are calculated by decoding the desired (or recorded) spherical harmonic signals using a decoding matrix D:

$$ \mathbf{g} = \mathbf{D}\mathbf{b}', \qquad (13) $$

where b' denotes the vector of the desired spherical harmonic components. The sound field reconstruction error can be defined as the norm of the difference between the desired and reconstructed sound fields:

$$ E = \lVert \mathbf{b} - \mathbf{b}' \rVert = \lVert (\mathbf{Y}\mathbf{D} - \mathbf{I})\mathbf{b}' \rVert, \qquad (14) $$

where I denotes the identity matrix. Clearly, the sound field will be perfectly reconstructed if the decoding matrix perfectly inverts the matrix Y. This leads to the following results:

- A perfect sound field reconstruction is possible only if there are at least as many loudspeakers as spherical harmonics. In 3D, this condition translates into the following relation:

$$ S \geq (L + 1)^2. \qquad (15) $$

- The rank of Y must be greater than or equal to the number of spherical harmonics. This requires that the loudspeaker positions achieve a good angular sampling of the sphere, which means that the loudspeakers must be distributed as evenly as possible over the surface of the sphere.

An obvious solution to the HOA decoding problem is to use the Moore-Penrose pseudo-inverse of matrix Y:

$$ \mathbf{D} = \mathrm{pinv}(\mathbf{Y}). \qquad (16) $$

This solution is known as the basic HOA decoding and is a direct analogue of vector-base amplitude panning in the spherical harmonic domain. It results in the acoustic velocity vector pointing toward the virtual source and correct low-frequency interaural time difference cues.

2.2.3. Sweet spot size and alternate decodings

The size of the area where the sound field is accurately described by the order-L spherical harmonic representation decreases with frequency, as expressed in Eq. (10). A perfect sound field reconstruction requires the size of this area to be greater than or equal to the size of the listener's head. Assuming the listener's head radius is 10 cm, the maximum frequency at which the sound field can be reconstructed perfectly for a single listener is given by:

$$ f_{\max} = \frac{c\,(2L + 1)}{2\pi \times 0.1 \times e}. \qquad (17) $$

The corresponding frequency values are presented in Table 1 for orders up to 6.

Table 1  Maximum frequency for perfect reconstruction as a function of the HOA order.

    L           1      2      3      4      5      6
    fmax (Hz)   602    1,004  1,406  1,807  2,209  2,611

Clearly, the sound field will not be perfectly reconstructed for a single listener above a few kHz, unless a very large number of loudspeakers is used. Therefore, above the specified frequency values, it is preferable to use an alternative to the basic decoding procedure that improves the perceived quality of the sound field reconstruction.
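As a concrete illustration of basic decoding, the following sketch builds a first-order (L = 1) decoder for a hypothetical cube of eight loudspeakers using Eq. (16) and also evaluates the frequency limit of Eq. (17). The real spherical harmonics are written out explicitly so that no library convention is assumed; the layout and source direction are illustrative assumptions only.

```python
# Minimal basic HOA decoding (Eq. (16)) for a first-order system.
import numpy as np

def sh_order1(direction):
    """Real spherical harmonics up to order 1 (ACN order: (0,0), (1,-1), (1,0), (1,1))
    for a unit direction vector (x, y, z)."""
    x, y, z = direction
    a = np.sqrt(3 / (4 * np.pi))
    return np.array([1.0 / np.sqrt(4 * np.pi), a * y, a * z, a * x])

# A cube of 8 loudspeakers: S = 8 >= (L + 1)^2 = 4, satisfying Eq. (15).
spkr_dirs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
spkr_dirs /= np.linalg.norm(spkr_dirs, axis=1, keepdims=True)

# Columns of Y hold the spherical harmonics at each loudspeaker direction (Eq. (12)).
Y = np.column_stack([sh_order1(d) for d in spkr_dirs])
D = np.linalg.pinv(Y)                       # basic decoder, Eq. (16)

# Encode a plane wave from a source direction and decode to loudspeaker gains.
src = np.array([1.0, 0.0, 0.0])
b = sh_order1(src)                          # desired HOA components b'
g = D @ b                                   # loudspeaker gains, Eq. (13)

# Maximum frequency for accurate reconstruction over a 10 cm head, Eq. (17).
c, L = 343.0, 1
print(g.round(3), c * (2 * L + 1) / (2 * np.pi * 0.1 * np.e))
```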
The most common alternate decoding method is referred to as the max-r_E decoding and is an analogue of vector-base intensity panning in the spherical harmonic domain. The max-r_E decoding matrix is calculated by multiplying the basic decoding matrix with a weighting matrix, as follows:

$$ \mathbf{D}_{\max r_E} = \mathbf{D}\, \mathbf{W}_{\max r_E}, \qquad (18) $$

where W_maxrE is a diagonal matrix whose diagonal components depend only on the order. These weights are calculated so that, when playing back a wave incoming from a particular direction, the loudspeaker signal energy is concentrated in the corresponding direction. Mathematically, this maximizes the norm of the so-called energy vector, which can be expressed as:

$$ \mathbf{r}_E = \frac{\sum_{i=1}^{S} \lVert g_i \rVert^2\, \mathbf{x}_i}{\sum_{i=1}^{S} \lVert g_i \rVert^2}, \qquad (19) $$

where g_i is the gain of the i-th loudspeaker and x_i is the unit vector pointing towards this loudspeaker from the listening point. For a given maximum order L, the max-r_E weights are determined as follows. One first solves P_{L+1}(x) = 0, where P_{L+1}(·) is the Legendre polynomial of order L + 1. Let x* be the largest value of x which solves this equation. Then the max-r_E weights, w_l, for orders l ≤ L are computed recursively as follows:

$$ w_0 = 1, \qquad w_1 = x^{\ast}, \qquad w_{l+1} = \frac{(2l + 1)\, x^{\ast} w_l - l\, w_{l-1}}{l + 1}. \qquad (20) $$

A detailed description of the benefits of max-r_E decoding and the computation of the weighting matrix can be found in the PhD thesis by Jérôme Daniel [4].
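The weight recursion of Eq. (20) is straightforward to implement. The sketch below, assuming NumPy, finds the largest root of P_{L+1} numerically and then applies the recursion; the function name is illustrative.

```python
# Minimal max-rE weight computation following Eq. (20).
import numpy as np
from numpy.polynomial import legendre

def max_re_weights(L):
    """Return the per-order max-rE weights w_0, ..., w_L."""
    # Coefficients of P_{L+1} in the Legendre basis: (0, ..., 0, 1).
    coeffs = np.zeros(L + 2)
    coeffs[-1] = 1.0
    x_star = np.max(legendre.legroots(coeffs))   # largest root of P_{L+1}
    w = [1.0, x_star]
    for l in range(1, L):
        w.append(((2 * l + 1) * x_star * w[l] - l * w[l - 1]) / (l + 1))
    return np.array(w[:L + 1])

print(max_re_weights(3))   # weights for orders 0..3
```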
2.3. Wave-Field Synthesis

With wave-field synthesis, we consider both a reproduction region of space, Ω, and a control region of space, V, as shown in Fig. 2. We also refer to the loudspeaker signals as driving point functions. The question is how to calculate the driving point functions given the desired sound field. Traditional descriptions of wave-field synthesis (WFS) develop the theory via the Rayleigh integrals and the Kirchhoff-Helmholtz integral [17-21] and initially considered planar and linear loudspeaker arrays. The Rayleigh integral approach contains many approximations and will only be briefly touched upon here. Two readable accounts of WFS may be found in the short tutorial presented at the 2013 European Acoustics Association Winter School [18] and the book by Jens Ahrens [22]. The brief description given here follows the formulation provided in [22].

[Fig. 2  Two regions of space are identified: a reproduction region and a control region. Adapted from [23].]

The Rayleigh I integral describes the sound field, S(z), in a source-free half-space defined by a boundary plane ∂Ω. It states that the sound field is uniquely defined by the gradient of the sound field on the planar boundary in the normal direction pointing into the half-space:

$$ P(\mathbf{z}) = -\int_{\partial\Omega} 2 \left[ \frac{\partial}{\partial n} S(\mathbf{z}) \right]_{\mathbf{z}=\mathbf{z}_0} G(\mathbf{z} - \mathbf{z}_0)\, dA(\mathbf{z}_0), \qquad (21) $$

where G(z − z_0) is the transfer function for an acoustic point source, detailed below. Thus, the driving point function, D(z_0), is given by:

$$ D(\mathbf{z}_0) = -2 \left[ \frac{\partial}{\partial n} S(\mathbf{z}) \right]_{\mathbf{z}=\mathbf{z}_0}. \qquad (22) $$

It is here that the complexities of WFS begin, and we only provide a summary of these complexities in this tutorial introduction (refer to [22] for details). What generally occurs next is that many approximations are made. The planar array of loudspeakers accommodating the driving point function degenerates to a linear array that accommodates listening only in the horizontal plane at a height of zero. Because the integral involves integration with a complex exponential, Laplace's method of stationary phase is used to make a high-frequency approximation that asserts that the only important complex exponential terms contributing to the integral occur where the phase of the exponential term is stationary. This approximation leads to a driving point function that depends on the listening position of the listener, and so further approximations are made so that the driving point function is constant. In the end, the approximations lead to what is commonly referred to as a 2.5-dimensional solution in which the driving point function has the form:

$$ D_{2.5\mathrm{D}}(\mathbf{z}) = \sqrt{\frac{2\pi\, d_{\mathrm{ref}}}{ik}}\, \left. D(\mathbf{z}_0) \right|_{\mathrm{height}=0}, \qquad (23) $$

where d_ref is a reference distance chosen to optimize the playback and k is the wavenumber, described in more detail below.

From a computational viewpoint, it is easier and more practical to follow the simple source formulation as espoused by Fazi [23]. In the following mathematical considerations, we operate in the frequency domain and consider a single frequency only. For simplicity, we omit the frequency dependence of the variables. Using the simple source formulation, the pressure, p(z), at a location, z, within the reproduction region is given by:

$$ p(\mathbf{z}) = \int_{\partial\Omega} G(\mathbf{z}, \mathbf{y})\, \mu(\mathbf{y})\, dA(\mathbf{y}), \qquad (24) $$

where G(z, y) = e^{ik|z−y|}/(4π|z−y|) is the transfer function for an acoustic point source, k = 2πf/c is the wavenumber, f is the frequency, ρ is the air density, and c is the speed of sound. The transfer function for an acoustic point source is referred to as a Green's function in the literature. As well, the point source is referred to as a monopole loudspeaker in the wave-field synthesis literature. If one knows both the interior sound field, p^-(z), and the exterior sound field, p^+(z), as shown in Fig. 3, then the driving point function, μ(y), is determined as:

$$ \mu(\mathbf{y}) = \nabla_n p^{-}(\mathbf{y}) - \nabla_n p^{+}(\mathbf{y}), \qquad (25) $$

where ∇_n p(z) := ∇p(z) · n̂(z) is the gradient of the pressure in the direction of the outward normal.

[Fig. 3  The interior sound field, p^-(z), and exterior sound field, p^+(z), are identified. Adapted from [23].]

The complicated mathematical considerations above become significantly simpler when using a matrix formulation. This matrix formulation is similar to the Boundary Surface Control method developed by Ise [24]. Assume one has a set of loudspeaker locations, y_j, indexed by j, and a set of control points, z_i, indexed by i. One then forms the transfer function matrix, S, with elements S_{i,j} = G(z_i, y_j). Let the desired sound field at the control points be represented by the vector y and let the unknown loudspeaker driving point functions be represented by the vector μ. One then computes the loudspeaker driving point functions, μ, as a solution to the system of linear equations:

$$ \mathbf{S}\boldsymbol{\mu} = \mathbf{y}. \qquad (26) $$

A common technique to obtain a least-norm solution is to use Tikhonov regularization:

$$ \boldsymbol{\mu} = \left[ \mathbf{S}^{H}\mathbf{S} + \lambda \mathbf{I} \right]^{-1} \mathbf{S}^{H} \mathbf{y}. \qquad (27) $$

A common value for the regularization parameter is λ = 0.03; refer to [25] for a discussion of regularization. This is then repeated for each frequency. A complicated sound field can be simulated for a given configuration of loudspeakers and control zone using a sound field simulator such as MCRoomSim [26].

With WFS, there are a number of practical considerations. For example, when the number of loudspeakers is large, a sparse solution generally improves performance [27], i.e., solve:

$$ \hat{\boldsymbol{\mu}} = \underset{\boldsymbol{\mu}}{\arg\min} \left\{ \lVert \boldsymbol{\mu} \rVert_1 : \mathbf{S}\boldsymbol{\mu} = \mathbf{y} \right\}. \qquad (28) $$

With a large array of loudspeakers one should also compensate for the directivity pattern of the loudspeakers. In addition, one should also scale the size of the control zone with frequency.
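As an illustration of the matrix formulation of Eqs. (26) and (27), the following single-frequency sketch assembles a monopole transfer matrix for a hypothetical circular loudspeaker array and a ring of control points and then solves for the driving functions with Tikhonov regularization. The geometry, frequency and target field are illustrative assumptions, not a prescription from the paper.

```python
# Minimal matrix-formulation WFS at one frequency (Eqs. (26)-(27)).
import numpy as np

c = 343.0                      # speed of sound (m/s)
f = 1000.0                     # analysis frequency (Hz)
k = 2 * np.pi * f / c          # wavenumber

def green(z, y):
    """Point-source (monopole) transfer function G(z, y)."""
    r = np.linalg.norm(z - y)
    return np.exp(1j * k * r) / (4 * np.pi * r)

# A circular array of 32 loudspeakers of radius 2 m in the horizontal plane.
n_spk = 32
phi = 2 * np.pi * np.arange(n_spk) / n_spk
spkrs = np.column_stack([2.0 * np.cos(phi), 2.0 * np.sin(phi), np.zeros(n_spk)])

# Control points on a small circle (radius 0.2 m) around the listening position.
n_ctl = 24
psi = 2 * np.pi * np.arange(n_ctl) / n_ctl
ctl = np.column_stack([0.2 * np.cos(psi), 0.2 * np.sin(psi), np.zeros(n_ctl)])

# Desired field at the control points: a virtual point source behind the array.
src = np.array([4.0, 0.0, 0.0])
y = np.array([green(z, src) for z in ctl])

# Transfer matrix S_{i,j} = G(z_i, y_j): control point i, loudspeaker j.
S = np.array([[green(z, x) for x in spkrs] for z in ctl])

# Tikhonov-regularized (least-norm) driving functions, Eq. (27), lambda = 0.03.
lam = 0.03
mu = np.linalg.solve(S.conj().T @ S + lam * np.eye(n_spk), S.conj().T @ y)
print(np.abs(mu).max())
```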
One can use the rule of thumb that the approximate radius, R, of the control zone should scale as

$$ R \approx \frac{4\sqrt{N}}{e k}, $$

where N is the number of loudspeakers, k is the wavenumber, and e is Euler's number. For robustness, one should include some control points inside the control zone. These "inside" control points are sometimes referred to as CHIEF points in the boundary element method literature [28], where CHIEF refers to the combined Helmholtz integral equation formulation. The CHIEF points resolve ambiguities in the sound field solution.

2.4. Transaural Cross-talk Cancellation

We now consider transaural cross-talk cancellation systems. These systems are commonly implemented in the form of a row of loudspeakers, as shown in Fig. 4, and aim to control the signals at the two ears. A detailed description and further references are given in [29]. The mathematical description is similar to WFS except that there are only two control points: the signals at the two ears. As previously, we operate in the frequency domain and for simplicity omit the frequency dependence of variables. Let the loudspeaker driving point functions be given by μ and the desired binaural pressure signals at the two ears be specified by y = [y_L, y_R]. We again solve:

$$ \mathbf{S}\boldsymbol{\mu} = \mathbf{y}, \qquad (29) $$

where S is the transfer function matrix.

[Fig. 4  Each loudspeaker has a transfer function to the location of both the left and right ear. Adapted from [29].]

A least-norm solution for the M × 2 loudspeaker filter matrix, H, is given by:

$$ \mathbf{H} = \left[ \mathbf{S}^{H}\mathbf{S} + \lambda \mathbf{I} \right]^{-1} \mathbf{S}^{H}, \qquad (30) $$

so that μ = Hy. If one needs to compensate for slight movements of the listener, an adaptive solution for the loudspeaker filter matrix can be obtained from simple beamforming considerations [29]. Note that in all of the discussion above, we do not explicitly consider head-related impulse responses; we simply control the sound field at two points. One could, of course, amend the transfer function matrix to include a consideration of head-related impulse responses.
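A single-frequency sketch of the filter matrix of Eq. (30) is given below. In practice the plant matrix S would be measured, or would include head-related transfer functions as noted above; here it is filled with a free-field point-source model purely as an illustrative assumption.

```python
# Minimal cross-talk cancellation filter matrix (Eq. (30)) at one frequency.
import numpy as np

c, f = 343.0, 2000.0
k = 2 * np.pi * f / c

def green(z, y):
    r = np.linalg.norm(z - y)
    return np.exp(1j * k * r) / (4 * np.pi * r)

# Four loudspeakers in a row, 2 m in front of the listener.
spkrs = np.array([[x, 2.0, 0.0] for x in (-0.3, -0.1, 0.1, 0.3)])
ears = np.array([[-0.09, 0.0, 0.0], [0.09, 0.0, 0.0]])   # left, right

# Plant matrix: rows are ears, columns are loudspeakers.
S = np.array([[green(e, s) for s in spkrs] for e in ears])

# Least-norm filter matrix H (M x 2), Eq. (30), with a small regularization term.
beta = 1e-6
M = S.shape[1]
H = np.linalg.solve(S.conj().T @ S + beta * np.eye(M), S.conj().T)

# Loudspeaker drives that deliver the desired binaural signals y = [yL, yR].
y = np.array([1.0, 0.0])          # e.g., a signal intended for the left ear only
mu = H @ y
print(np.abs(S @ mu))             # approaches [1, 0] as beta -> 0
```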
3. SOUND FIELD RECORDING

Spherical microphone arrays (SMAs) have been the focus of considerable recent research [30-39] and are especially useful for recording panoramic sound scenes. Two recent books [40,41] provide a fairly comprehensive overview. By virtue of their spherical symmetry, SMAs provide a natural framework for analyzing sound fields in the spherical harmonic domain; see Fig. 5.

[Fig. 5  The dual-radius SMA prototype described in [42].]

We now derive the mathematical model used to describe the acoustic behavior of a rigid, baffled SMA. This description is taken from [42] and has been shortened in parts for this tutorial. We consider the case of an SMA consisting of N omnidirectional microphones located at various positions around a perfectly rigid sphere with radius R. As illustrated in Fig. 6, we define the position of the microphones by their spherical coordinates (r, θ, φ). For simplicity, the mathematical expressions are derived in the frequency domain as a function of the dimensionless frequency kR, where k denotes the wavenumber, k = 2πf/c, f denotes the frequency and c denotes the speed of sound. As well, for a given radial distance, r, we introduce the dimensionless radius, ρ, defined by ρ = r/R.

[Fig. 6  The notation used to describe the geometry of an SMA. From [42].]

Consider the n-th microphone of the SMA, whose spherical coordinates are (ρ_n R, θ_n, φ_n). In the case where the incident sound field consists of incoming waves, the acoustic pressure measured by this sensor is given by:

$$ p_n = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n)\, b_{l,m}, \qquad (31) $$

where:

- b_{l,m} is a complex coefficient depending only on the incident sound field, which we denote as the order-l and degree-m spherical harmonic component.
- Y_l^m denotes the order-l and degree-m real-valued spherical harmonic function:

$$ Y_l^m(\theta, \varphi) = \sqrt{\frac{2l+1}{4\pi}\, \frac{(l-|m|)!}{(l+|m|)!}}\; P_l^{|m|}(\sin\theta) \times \begin{cases} \cos(m\varphi) & \text{for } m \geq 0 \\ \sin(|m|\varphi) & \text{for } m < 0 \end{cases}, \qquad (32) $$

  where P_l^{|m|} is the order-l, degree-|m| associated Legendre polynomial. Note that the sin θ term arises from the spherical coordinate convention chosen in this paper (see Fig. 6).
- w_l(kR, ρ_n) is the "modal strength" of the order-l spherical harmonic modes at the microphone position, associated with an ideal, rigid spherical baffle, and is given by:

$$ w_l(kR, \rho_n) = i^l \left( j_l(\rho_n kR) - \frac{j_l'(kR)}{h_l^{(2)\prime}(kR)}\, h_l^{(2)}(\rho_n kR) \right), \qquad (33) $$

  where j_l and h_l^{(2)} denote the order-l spherical Bessel function and spherical Hankel function of the second kind, respectively.

We refer to Eq. (31) as a Bessel-weighted spherical harmonic expansion of the acoustic pressure. In the audio engineering literature, this equation is sometimes referred to as a spherical Fourier transform or a Higher Order Ambisonic (HOA) transform, and we use the term HOA synonymously with the expression "spherical harmonic."

According to Eq. (31), the exact value of the pressure is determined by the summation of an infinite number of terms. This sum must be truncated for the pressure to be estimated numerically. In practice, we advise selecting the truncation order, Λ, as the minimum order such that the relative error in the pressure measured by the SMA's farthest sensor is less than −80 dB (0.01%). Figure 7 shows the truncation order, Λ, as a function of kR for different values of ρ. Details of the calculations used to determine the truncation order are provided in [42].

[Fig. 7  The truncation order Λ shown as a function of kR for different values of ρ. From [42].]

Assuming that an appropriate truncation order, Λ, has been determined, the value of the acoustic pressure measured by the n-th sensor can be approximated as:

$$ p_n \approx \sum_{l=0}^{\Lambda} \sum_{m=-l}^{l} w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n)\, b_{l,m}. \qquad (34) $$

Equation (34) expresses the summation over a finite number of terms. It can therefore be rewritten as the following vector product:

$$ p_n = \mathbf{t}_{\Lambda,n}^{T}\, \mathbf{b}_{\Lambda}, \qquad (35) $$

where

$$ \mathbf{t}_{\Lambda,n} = [\, t_{0,0,n},\ t_{1,-1,n},\ t_{1,0,n},\ \ldots,\ t_{\Lambda,\Lambda,n} \,]^{T}, \quad t_{l,m,n} = w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n), \quad \mathbf{b}_{\Lambda} = [\, b_{0,0},\ b_{1,-1},\ b_{1,0},\ \ldots,\ b_{\Lambda,\Lambda} \,]^{T}. \qquad (36) $$

Similarly, the vector of the acoustic pressures received by the N microphones of the SMA can be expressed as:

$$ \mathbf{p} = \mathbf{T}_{\Lambda}\, \mathbf{b}_{\Lambda}, \qquad (37) $$

where T_Λ is the transfer matrix between the HOA components up to order Λ and the pressures received by the N microphones, given by:

$$ \mathbf{T}_{\Lambda} = [\, \mathbf{t}_{\Lambda,1},\ \mathbf{t}_{\Lambda,2},\ \ldots,\ \mathbf{t}_{\Lambda,N} \,]^{T}. \qquad (38) $$

3.1. Higher-Order Ambisonic Encoding

We refer to the process of retrieving the up-to-order-L HOA components from the microphone signals as order-L HOA encoding. In practice, this operation consists in convolving the microphone signals with a matrix of finite impulse response (FIR) encoding filters. In this section, we derive the mathematical expression for calculating the encoding filter frequency responses. The problem of finding the values of the HOA components from the microphone signals essentially consists in solving Eq. (37) for b_Λ.
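Before turning to the encoding filters, the following sketch illustrates how the modal strength of Eq. (33) and the transfer matrix of Eq. (38) can be assembled at a single frequency, restricted to order 1 for brevity. The microphone layout and the spherical harmonic normalization are illustrative assumptions, not the design of [42].

```python
# Minimal rigid-sphere modal strength (Eq. (33)) and transfer matrix (Eq. (38)).
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def sph_hankel2(l, x, derivative=False):
    """Spherical Hankel function of the second kind (and its derivative)."""
    return spherical_jn(l, x, derivative) - 1j * spherical_yn(l, x, derivative)

def modal_strength(l, kR, rho):
    """w_l(kR, rho) for a rigid sphere of radius R, sensor at radius rho * R."""
    return (1j**l) * (spherical_jn(l, rho * kR)
                      - spherical_jn(l, kR, derivative=True)
                      / sph_hankel2(l, kR, derivative=True)
                      * sph_hankel2(l, rho * kR))

def sh_order1(direction):
    """Real spherical harmonics up to order 1 for a unit vector (x, y, z)."""
    x, y, z = direction
    a = np.sqrt(3 / (4 * np.pi))
    return np.array([1 / np.sqrt(4 * np.pi), a * y, a * z, a * x])

# Example: 8 microphones on the rigid sphere (rho = 1), kR = 2, truncation order 1.
kR, rho = 2.0, 1.0
mic_dirs = np.random.default_rng(0).normal(size=(8, 3))
mic_dirs /= np.linalg.norm(mic_dirs, axis=1, keepdims=True)

w = np.array([modal_strength(l, kR, rho) for l in (0, 1, 1, 1)])   # per-channel weights
T = np.array([w * sh_order1(d) for d in mic_dirs])                  # Eq. (38): N x (L+1)^2
print(T.shape)                                                      # (8 microphones, 4 channels)
```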
In other words, the matrix of the order-L encoding filter frequency responses, E_L, must invert the matrix T_Λ as accurately as possible. Mathematically, this means that E_L must minimize the quantity ||E_L T_Λ − C||_F^2, where ||·||_F denotes the Frobenius norm and C is given by:

$$ \mathbf{C} = [\, \mathbf{I}_{(L+1)^2} \quad \mathbf{0}_{(L+1)^2 \times ((\Lambda+1)^2 - (L+1)^2)} \,], \qquad (39) $$

where I_n denotes the n × n identity matrix and 0_{m×n} denotes the m × n zero matrix. In practice we solve for E_L using Tikhonov regularization. In other words, we define E_L as:

$$ \mathbf{E}_L = \underset{\mathbf{A}}{\arg\min} \left( \lVert \mathbf{A}\mathbf{T}_{\Lambda} - \mathbf{C} \rVert_F^2 + \lambda^2 \lVert \mathbf{A} \rVert_F^2 \right), \qquad (40) $$

where λ is the regularization parameter setting the relative importance given to minimizing the norm of the matrix with respect to minimizing the error. The matrix E_L is unique and given by:

$$ \mathbf{E}_L = \mathbf{C}\, \mathbf{T}_{\Lambda}^{H} \left( \mathbf{T}_{\Lambda} \mathbf{T}_{\Lambda}^{H} + \lambda^2 \mathbf{I}_N \right)^{-1} = \mathbf{T}_{L}^{H} \left( \mathbf{T}_{\Lambda} \mathbf{T}_{\Lambda}^{H} + \lambda^2 \mathbf{I}_N \right)^{-1}, \qquad (41) $$

where T_L consists of the first (L+1)^2 columns of T_Λ. This solution for E_L incorporates two critical aspects. First, it applies regularization to prevent measurement noise from being amplified unreasonably at low kR values. Second, it optimally weights the microphone signals to minimize spatial aliasing by employing a matrix T_Λ of sufficiently high order.

With regard to the amplification of measurement noise, Fig. 8 shows the amplitude of the modal weights w_l as a function of kR for orders 0 to 4. At low kR values, the weights corresponding to HOA signals of order greater than one have very small amplitudes. As well, in the case where the microphones are not located on the rigid spherical surface (ρ > 1), the weights are equal to zero at certain kR values. Without regularization, the encoding filters would amplify the microphone signals greatly to acquire the HOA signals at these frequencies and would thus be very non-robust to measurement noise. The amount of measurement noise in the encoded signals is bounded by the regularization parameter as follows:

$$ \beta^2 \leq \frac{1}{\lambda^2}, \qquad (42) $$

where β is the ratio of the energy of the measurement noise in the HOA signals to the energy of the measurement noise in the microphone signals. The proof of this result is given in [42].

[Fig. 8  The amplitude of the modal weights, w_l(kR, ρ), shown as a function of kR for orders 0 to 4 when ρ = 1 (top) and ρ = 2 (bottom). From [42].]

Consider now the issue of spatial aliasing. At high frequencies, microphones can receive energy from undesired, greater-than-order-L HOA components of the incoming sound. This results in spatial aliasing, where the encoded HOA signals are polluted by these undesired HOA components. Figure 7 shows that the microphones with larger ρ receive more energy from the undesired, greater-than-order-L HOA components than those with smaller ρ. Therefore, at high frequencies it is desirable to apply greater weight to the signals from microphones located closer to the rigid spherical surface. In summary, the high-order information contained in T_Λ is essential for optimal calculation of the encoding filters.
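A minimal sketch of the regularized encoding matrix of Eq. (41) follows. It assumes a transfer matrix T laid out as in the previous sketch (microphones by HOA channels) and a hypothetical regularization value; in practice T should be built to an order Λ higher than the output order L, as discussed above.

```python
# Minimal regularized HOA encoding matrix (Eq. (41)) at one frequency.
import numpy as np

def encoding_matrix(T, n_out, lam=0.01):
    """Return E_L = C T^H (T T^H + lam^2 I)^(-1), keeping the first n_out
    HOA channels (n_out = (L + 1)^2). T has shape (N microphones, HOA channels)."""
    N = T.shape[0]
    E = T.conj().T @ np.linalg.inv(T @ T.conj().T + lam**2 * np.eye(N))
    return E[:n_out, :]      # row selection is equivalent to multiplying by C

# Usage with the transfer matrix T from the previous sketch (8 mics, order-1 channels):
# E1 = encoding_matrix(T, n_out=4, lam=0.01)
# hoa_signals = E1 @ mic_pressures     # encode order-1 HOA signals at this frequency
```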
4. BINAURAL VIRTUAL SPATIAL AUDIO

4.1. Head-tracking and Individualization

Binaural spatial audio is based on head-related impulse responses and/or binaural room impulse responses. The first and most important point to make is that immersive binaural virtual spatial audio requires head-tracking. Without head-tracking the virtual acoustic sources move with the head and destroy the perceptual illusion. It is the advent of mixed-reality devices with head-tracking that provides a significant impetus for immersive spatial audio in the consumer industry.

This raises the issue of just how important individualized head-related impulse responses are. In order to illustrate these issues we consider some recent data [43] from an experiment contrasting individualized HRIRs and generic HRIRs. This description is taken from [43]. There were four listening conditions: (1) binaural rendering with individualized HRIR filters and head-tracking; (2) binaural rendering with generic HRIR filters and head-tracking; (3) binaural rendering with individualized HRIR filters and no head-tracking; and (4) normal headphone listening without binaural spatial rendering. Twenty-three self-reported normally-hearing listeners participated in the listening test. Listeners were asked to listen to six sound excerpts:

- Mono: drums, Radiohead, "Weird Fishes/Arpeggi"
- Mono: guitar, Tarrega, "Capriccio Arabe"
- Stereo: pop, Radiohead, "Jigsaw Falling Into Place"
- Stereo: bossa nova, Stan Getz and João Gilberto, "Vivo Sonhando"
- 5.1 surround: rock, Pink Floyd, "Money"
- 5.1 surround: pop jazz, Norah Jones, "Come Away With Me"

Sounds were played to the listener using AKG 1000 open headphones and also a loudspeaker array consisting of 12 loudspeakers: 5 Tannoy System 15 loudspeakers forming a 5.1 arrangement and 7 additional Tannoy V6 loudspeakers forming a circular array spaced every 45 degrees. The loudspeaker playback provided a reference for the headphone listening. Because the headphones are open, the loudspeakers could be heard without distortion. Every listener had HRIRs recorded using a blocked-ear recording method [44] in an anechoic chamber using a semi-circular robotic arm (the methods were similar to those presented in [45]). A MUSHRA-like [46] test paradigm was used in which there was no hidden reference, but an anchor was included. The explicit reference was loudspeaker playback and the anchor was headphone presentation with no spatial audio rendering. Listeners participated in two different trials. In one trial listeners were asked to rate overall preference and in the other trial listeners were asked to rate the clarity of the frontal image. Head-tracking was implemented using a Polhemus G4 head-tracking device mounted on the headphones.

[Fig. 9  Results from the binaural listening test for the six sound stimuli. The average population scores for the four listening conditions are shown as bar plots. Legend: Indiv. + HT = individualized HRIRs with head-tracking; Generic + HT = generic HRIRs with head-tracking; Indiv. no HT = individualized HRIRs with no head-tracking; No 3D = no binaural spatial rendering. Figures taken from [43].]

Results of the listening test are shown in Fig. 9. As expected, head-tracking contributed significantly to the listeners' scores because it provides a consistent listening environment in which sound sources are robustly and consistently localized when the head moves. Interestingly, listeners also showed a small but consistent bias for individualized binaural rendering over generic binaural rendering. The added benefit of individualized binaural rendering is small compared to the benefit of head-tracking. Nevertheless, in listening conditions without a visual reference, there does seem to be a small benefit from individualization in binaural rendering. This suggests that individualized binaural rendering will play some role when visual stimuli are absent, for example, in augmented spatial hearing conditions using hearables.
4.2. Binaural Reverberation and Ambiance

4.2.1. Binaural rendering of HOA signals

Head-related impulse responses are generally recorded under anechoic, free-field conditions. This means that one must consider how reverberation and ambiance will be handled during binaural rendering. A perceptually reasonable (the author is unaware of precise psychophysical validation) encoding of reverberant and ambient sound fields is provided by order-3 HOA signals. Thus, we consider binaural rendering of HOA signals. Let the pressure, p, at the ears in the presence of N plane waves be given by p = Hs, where H contains the head-related transfer function information and s specifies the plane-wave source signals. If the plane-wave sources are encoded as HOA signals via the spherical harmonic matrix Y, then we require a decoding matrix, F, such that p̂ = F(Ys), where ideally p̂ = p. A possible mathematical solution can be found using the pseudo-inverse: F = H pinv(Y). Nevertheless, this does not work well for several reasons: the pseudo-inverse will smear the source signals across directions, the phase component of the matrix H is complicated, and this solution will result in strong spectral coloration. Therefore, it is recommended that one smoothly fix the phase of the head-related transfer functions above a certain frequency to a constant, obtaining H_mod. Using F = H_mod pinv(Y) provides a much improved binaural rendering for order-3 HOA signals, provided one has a dense recording of head-related transfer functions.

4.2.2. Manipulation of the covariance matrix

There are many methods to simulate room impulse responses and also many methods to decorrelate audio signals, and it is not uncommon to apply these for binaural rendering of spatial audio. The crux of the issue is that the covariance of the signals at the ears determines many binaural perceptual properties, so it is best to preserve or control the covariance structure of the binaural signals. These matters are well described in [47]. A brief description is provided here. Assume there is a decoding method that obtains binaural signals, p_Q, from a set of input signals, p_Q = Qx, where Q is some decoding matrix and x is a vector of input signals, and assume that one somehow knows the desired covariance matrix C_p = E[pp^H] for some true or desired p. In general, C_{p_Q} ≠ C_p, so what can one do? It turns out there is a method to obtain new signals p̂ = Mx + M_D D[Qx], where D[·] is a decorrelation operator and the matrices M and M_D are to be determined, such that C_p̂ = C_p. These computations are sufficiently involved that the reader is referred to [47] for the details. The important point is that there are methods to control the covariance of the signals at the two ears.
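As a sketch of the decoding described in Sect. 4.2.1, the following single-frequency-bin example forms F = H_mod pinv(Y) for first-order HOA. The HRTF values are random placeholders standing in for a dense measured set, and the phase modification is reduced to simply taking the magnitude; this is an illustrative assumption rather than the method of any particular library or of [43].

```python
# Minimal binaural decoding matrix F = H_mod pinv(Y) at one frequency bin.
import numpy as np

def sh_order1(direction):
    x, y, z = direction
    a = np.sqrt(3 / (4 * np.pi))
    return np.array([1 / np.sqrt(4 * np.pi), a * y, a * z, a * x])

# Hypothetical dense set of HRTF measurement directions (unit vectors) and the
# corresponding complex HRTF values at this frequency: shape (2 ears, N dirs).
rng = np.random.default_rng(1)
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
H = rng.normal(size=(2, 200)) + 1j * rng.normal(size=(2, 200))   # placeholder HRTFs

# Above the chosen frequency, keep only the HRTF magnitude (fixed phase).
H_mod = np.abs(H)

# Columns of Y are the spherical harmonic vectors at the plane-wave directions.
Y = np.column_stack([sh_order1(d) for d in dirs])      # shape (4, N dirs)

# Binaural decoding matrix for order-1 HOA signals: shape (2, 4).
F = H_mod @ np.linalg.pinv(Y)

# Usage: binaural = F @ hoa_signals, with hoa_signals of shape (4, n_samples)
# for this frequency bin.
print(F.shape)
```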
4.3. Improving Static Binaural Spatial Audio

One of the more interesting research directions for binaural spatial audio is to enable static binaural recordings to be rendered with head-tracking in an attempt to create an immersive experience. This would be relevant to the broadcasting industry, which likely has many historical, static binaural recordings. As well, this would be interesting for personal binaural recordings where listeners record a sound scene using earphones capable of making recordings, such as the Roland CS-10EM (available in 2018). This technique requires the extraction of acoustic sources and source directions from the static binaural recording and the creation of a new binaural rendering with head-tracking. This could likely be done using personal mobile devices. An interesting reference to work in this area is [48]. As well, refer to [49,50] for methods to decompose audio into direct and ambient sound and to determine the direction of acoustic sources.

REFERENCES

[1] W. Zhang, P. N. Samarasinghe, H. Chen and T. D. Abhayapala, "Surround by sound: A review of spatial audio recording and reproduction," Appl. Sci., 7, 1–19 (2017).
[2] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., 45, 456–466 (1997).
[3] J.-M. Pernaux, P. Boussard and J.-M. Jot, "Virtual sound source positioning and mixing in 5.1 implementation on the real-time system Genesis," Proc. Conf. Digital Audio Effects (DAFx-98), Barcelona, November, pp. 76–80 (1998).
[4] J. Daniel, Représentation de Champs Acoustiques, Application à la Transmission et à la Reproduction de Scènes Sonores Complexes dans un Contexte Multimédia, Ph.D. thesis, Université Paris 6 (2000).
[5] J. W. Strutt, "On our perception of sound direction," Philos. Mag., 13, 214–232 (1907).
[6] W. M. Hartmann, B. Rakerd and Z. D. Crawford, "Transaural experiments and a revised duplex theory for the localization of low-frequency tones," J. Acoust. Soc. Am., 139, 968–985 (2016).
[7] M. Frank, Phantom Sources using Multiple Loudspeakers in the Horizontal Plane, Ph.D. thesis, University of Music and Performing Arts, Graz (2013).
[8] N. Epain, C. T. Jin and F. Zotter, "Ambisonic decoding with constant angular spread," Acta Acust. united Ac., 100, 928–936 (2014).
[9] V. Pulkki, "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc., 55, 503–516 (2007).
[10] C. T. Jin, N. Epain and T. Noohi, "Sound field analysis using sparse recovery," in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis, Eds. (John Wiley & Sons, Hoboken, New York, 2018), Chap. 3.
[11] V. Pulkki, A. Politis, M.-V. Laitinen, J. Vilkamo and J. Ahonen, "First-order directional audio coding (DirAC)," in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis, Eds. (John Wiley & Sons, Hoboken, New York, 2018), Chap. 5.
[12] A. Politis and V. Pulkki, "Higher-order directional audio coding," in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis, Eds. (John Wiley & Sons, Hoboken, New York, 2018), Chap. 6.
[13] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustic Holography (Academic Press, London, 1999).
[14] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods for the Helmholtz Equation in Three Dimensions (Elsevier, Amsterdam, 2005).
[15] J. Daniel, "Spatial sound encoding including near field effect: Introducing distance coding filters and a viable, new ambisonic format," Proc. 23rd Int. Conf. Audio Eng. Soc.: Signal Processing in Audio Recording and Reproduction, May (2003).
[16] M. A. Poletti, "Three-dimensional surround sound systems based on spherical harmonics," J. Audio Eng. Soc., 53, 1004–1025 (2005).
[17] J. Ahrens and S. Spors, "An analytical approach to sound field reproduction using circular and spherical loudspeaker distributions," Acta Acust. united Ac., 94, 988–999 (2008).
[18] S. Spors and F. Zotter, "Spatial sound synthesis with loudspeakers," in Cutting Edge in Spatial Audio, F. Zotter, Ed. (EAA Documenta Acustica, 2013).
[19] A. J. Berkhout, "A holographic approach to acoustic control," J. Audio Eng. Soc., 36, 977–995 (1988).
[20] A. J. Berkhout, D. de Vries and P. Vogel, "Acoustic control by wave field synthesis," J. Acoust. Soc. Am., 93, 2764–2778 (1993).
[21] S. Spors, H. Teutsch and R. Rabenstein, "High-quality acoustic rendering with wave field synthesis," Proc. Vision, Modeling, and Visualization Conf. 2002, November (2002).
[22] J. Ahrens, Analytic Methods of Sound Field Synthesis (Springer-Verlag, Berlin, 2012).
[23] F. M. Fazi, Sound Field Reproduction, Ph.D. thesis, University of Southampton (2010).
[24] S. Enomoto, Y. Ikeda, S. Ise and S. Nakamura, "Three-dimensional sound field reproduction and recording systems based on boundary surface control principle," Proc. 14th Int. Conf. Auditory Display, Paris, June, pp. 1–8 (2008).
[25] S. J. Elliott, J. Cheer, J.-W. Choi and Y. Kim, "Robustness and regularization of personal audio systems," IEEE Trans. Audio Speech Lang. Process., 20, 2123–2133 (2012).
[26] A. Wabnitz, N. Epain, C. T. Jin and A. van Schaik, "Room acoustics simulation for multichannel microphone arrays," Proc. AES 127th Conv., Sydney, August, pp. 1–6 (2010).
[27] N. Epain, C. T. Jin and A. van Schaik, "The application of compressive sampling to the analysis and synthesis of spatial sound fields," Proc. AES 127th Conv., New York, October, pp. 1–8 (2008).
[28] P. M. Juhl, The Boundary Element Method for Sound Field Calculations, Ph.D. thesis, Technical University of Denmark (1993).
[29] M. F. Simón Gálvez, T. Takeuchi and F. M. Fazi, "Low-complexity, listener's position-adaptive binaural reproduction over a loudspeaker array," Acta Acust. united Ac., 103, 847–857 (2017).
[30] J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," Proc. ICASSP 2002, Vol. 2, pp. 1781–1784 (2002).
[31] T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," Proc. ICASSP 2002, Vol. 2, pp. 1949–1952 (2002).
[32] B. Rafaely, "Analysis and design of spherical microphone arrays," IEEE Trans. Speech Audio Process., 13, 135–143 (2005).
[33] D. B. Ward and T. D. Abhayapala, "Reproduction of a plane-wave sound field using an array of loudspeakers," IEEE Trans. Speech Audio Process., 9, 697–707 (2001).
[34] A. Wabnitz, N. Epain and C. T. Jin, "A frequency-domain algorithm to upscale ambisonic sound scenes," Proc. ICASSP 2012, Kyoto, Japan, March (2012).
[35] S. Bertet, J. Daniel, E. Parizet and O. Warusfel, "Investigation on localisation accuracy for first and higher order ambisonics reproduced sound sources," Acta Acust. united Ac., 99, 642–657 (2013).
[36] H. H. Chen and S. C. Chan, "Adaptive beamforming and DOA estimation using uniform concentric spherical arrays with frequency invariant characteristics," J. VLSI Signal Process., 46, 15–34 (2007).
[37] B. Rafaely, Y. Peled, M. Agmon, D. Khaykin and E. Fisher, "Spherical microphone array beamforming," in Speech Processing in Modern Communication: Challenges and Perspectives, I. Cohen, J. Benesty and S. Gannot, Eds. (Springer, Berlin, Heidelberg, 2010).
[38] H. Sun, H. Teutsch, E. Mabande and W. Kellermann, "Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays," Proc. ICASSP 2011, Prague, Czech Republic, May (2011).
[39] N. Epain and C. T. Jin, "Independent component analysis using spherical microphone arrays," Acta Acust. united Ac., 98, 91–102 (2012).
[40] B. Rafaely, Fundamentals of Spherical Array Processing (Springer-Verlag, Berlin, 2015).
[41] D. P. Jarrett, E. A. P. Habets and P. A. Naylor, Theory and Applications of Spherical Microphone Array Processing (Springer International Publishing, Cham, 2017).
[42] C. T. Jin, N. Epain and A. Parthy, "Design, optimization and evaluation of a dual-radius spherical microphone array," IEEE/ACM Trans. Audio Speech Lang. Process., 22, 193–204 (2014).
[43] C. T. Jin, R. Zolfaghari, X. Long, A. Sebastian, S. Hossain, Glaunès, A. Tew, M. Shahnawaz and A. Sarti, "Considerations regarding individualization of head-related transfer functions," Proc. ICASSP 2018, Calgary, Canada, April, pp. 6787–6791 (2018).
[44] H. Møller, "Fundamentals of binaural technology," Appl. Acoust., 36, 171–218 (1992).
[45] C. T. Jin, A. Corderoy, S. Carlile and A. van Schaik, "Contrasting monaural and interaural spectral cues for human sound localization," J. Acoust. Soc. Am., 115, 3124–3141 (2004).
[46] ITU-R BS.1534-1:2003, Method for the subjective assessment of intermediate quality level of coding systems, ITU-R (2003).
[47] J. Vilkamo and T. Bäckström, "Time-frequency processing: Methods and tools," in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis, Eds. (John Wiley & Sons, Hoboken, New York, 2018).
[48] S. Nagel and P. Jax, "Dynamic binaural cue adaptation," 2018 Int. Workshop on Acoustic Signal Enhancement, Tokyo, Japan, September (2018).
[49] C. Faller, "Multi-loudspeaker playback of stereo signals," J. Audio Eng. Soc., 54, 1051–1064 (2006).
[50] C. T. Jin and N. Epain, "Super-resolution sound field analysis," in Cutting Edge in Spatial Audio, F. Zotter, Ed. (EAA Documenta Acustica, 2013).

Craig Jin received a B.S. in Physics from Stanford in 1990, an M.S. in Applied Physics from Caltech, and a Ph.D. from the University of Sydney in 2001. He is currently an Associate Professor within the School of Electrical and Information Engineering at the University of Sydney and is the director of the Computing and Audio Research Laboratory.