Acoust. Sci. & Tech. 41, 1 (2020)
© 2020 The Acoustical Society of Japan
INVITED TUTORIAL
A tutorial on immersive three-dimensional sound technologies
Craig T. Jin
School of Electrical and Information Engineering, University of Sydney,
Building J03, Maze Crescent, Darlington, Sydney, NSW, 2006 Australia
Abstract: There is renewed interest in virtual auditory perception and spatial audio arising from
a technological drive toward enhanced perception via mixed-reality systems. Because the various
technologies for three-dimensional (3D) sound are so numerous, this tutorial focuses on underlying
principles. We consider the rendering of virtual auditory space via both loudspeakers and binaural
headphones. We also consider the recording of sound fields and the simulation of virtual auditory
space. Special attention is given to areas with the potential for further research and development. We
highlight some of the more recent technologies and provide references so that readers can explore
issues in more detail.
Keywords: 3D sound technology, Virtual auditory space, Spatial hearing
PACS number: 43.38.Md, 43.38.Vk, 43.10.Ln, 43.10.Sv
1. INTRODUCTION
There is renewed interest in three-dimensional (3D) sound technologies, driven in large part by the push toward enhanced perception via mixed-reality systems. A few commercial examples, as of 2018, are Microsoft's 3D Soundscape and the HoloLens, Google's Resonance Audio, the Sony PlayStation headset with 3D sound, and the Oculus headset with its support for 3D sound spatialization. There is also increasing support and awareness for 3D sound by broadcasting companies. A few broadcasting examples are the BiLi project in France, the Binaural Project by the BBC in the United Kingdom, the Orpheus European project, and the NHK Super Hi-Vision theatre support for 3D sound with a 22.2 multichannel system. At the current time, the technologies for 3D sound are proliferating rapidly, and so this tutorial will focus on
fundamental principles. To begin, we should clarify our
definition of 3D sound. By 3D sound, we refer to an
immersive experience in which the listener has a clear and
extended perception of a 3D sound space — that is, sound
objects and sound events clearly positioned relative to
the listener and to each other in some ambient space that
encompasses both the listener and the sound objects. It
involves something more than a transient
illusion or perception of sound direction; it requires an
extended and believable perceptual experience of auditory
space that supports some version of reality.
e-mail:
[email protected]
[doi:10.1250/ast.41.16]
This tutorial focuses on three primary technological
areas: loudspeaker sound reproduction, binaural sound
reproduction using earphones or headphones, and sound
field recording. A related tutorial introduction is [1]. To
a lesser extent we consider sound field simulation — the
art of using engineering design to create a virtual sound
environment. We will assume familiarity with the fundamental processes underlying human 3D sound perception.
In particular, we assume familiarity with interaural time
difference cues, interaural level difference cues, monaural
spectral cues and the necessity for head-tracking when
rendering 3D sound over headphones. We also assume
familiarity with the following mathematical or signal
processing terms: head-related impulse response filters,
binaural room impulse responses and head-related transfer
functions.
2. LOUDSPEAKER SOUND REPRODUCTION
Four loudspeaker reproduction methods shall be considered: vector-base amplitude and intensity panning,
transaural cross-talk cancellation, ambisonics, and wavefield synthesis. Loudspeaker reproduction methods for 3D
sound can be classified according to the listening conditions. Vector-base amplitude panning can provide a
relatively stable acoustic image across a moderate-sized
listening area with some compromises in spatial fidelity.
Wave-field synthesis can provide higher spatial fidelity,
but requires a substantially more dense loudspeaker array
and is often limited in the frequency range for which high
spatial fidelity can be achieved. The ambisonics method
generally accommodates only a single listener and has a
sweet-spot the size of a listener’s head. An advantage of
the ambisonic method is its rendering flexibility — it can
accommodate wave-field approximation at lower frequencies as well as panning at higher frequencies; scene
rotations are also easily handled. Transaural cross-talk
cancellation systems attempt to directly control the signal
at each ear and are generally targeted at a single or few
listeners. These systems often come in the form of a single
row of loudspeakers, often above or below a video screen.
With the exception of transaural cross-talk cancellation,
loudspeaker reproduction methods naturally accommodate head movement. In other words, the
loudspeaker array creates a sound field and when the
listener moves his/her head the acoustic spatial impression
changes accordingly. With transaural cross-talk cancellation, the loudspeaker array simulates a virtual headphone
listening condition and head motion must be tracked using
some form of head tracker.
2.1. Vector-Base Panning
We begin with vector-base amplitude panning [2] and
vector-base intensity panning [3]. In three dimensions, we
pan the amplitude of an acoustic source across the three
nearest loudspeakers as shown in Fig. 1. The three loudspeaker positions are associated with the vectors xu , xv , xw
and the virtual source position is indicated by the vector
xS . The virtual source position, xS , can be expressed as a
linear combination of the three loudspeaker positions using
matrix algebra as follows:
\[ \mathbf{x}_S = \begin{pmatrix} x_S \\ y_S \\ z_S \end{pmatrix} = \begin{pmatrix} x_u & x_v & x_w \\ y_u & y_v & y_w \\ z_u & z_v & z_w \end{pmatrix} \begin{pmatrix} \gamma_u(\theta_S) \\ \gamma_v(\theta_S) \\ \gamma_w(\theta_S) \end{pmatrix}, \qquad (1) \]
where the vector γ(θ_S) contains the component weights for the virtual source position vector expressed in terms of the three loudspeaker position vectors. Note that we label the plane-wave direction associated with the virtual source by θ_S. It is straightforward to compute the vector γ(θ_S) given the virtual source position:
\[ \boldsymbol{\gamma}(\theta_S) = \begin{pmatrix} x_u & x_v & x_w \\ y_u & y_v & y_w \\ z_u & z_v & z_w \end{pmatrix}^{-1} \mathbf{x}_S. \qquad (2) \]
With vector base amplitude panning (VBAP), we select the
three loudspeakers in the loudspeaker array that are closest
to the intended virtual source position. We then set the
three VBAP loudspeaker gains, g_u, g_v, and g_w, according to γ(θ_S). However, we normalize the loudspeaker gains so that the total energy is unity. In other words, we set the loudspeaker gains as follows:

\[ g_u = \frac{\gamma_u(\theta_S)}{\sqrt{\gamma_u^2(\theta_S) + \gamma_v^2(\theta_S) + \gamma_w^2(\theta_S)}}, \quad g_v = \frac{\gamma_v(\theta_S)}{\sqrt{\gamma_u^2(\theta_S) + \gamma_v^2(\theta_S) + \gamma_w^2(\theta_S)}}, \quad g_w = \frac{\gamma_w(\theta_S)}{\sqrt{\gamma_u^2(\theta_S) + \gamma_v^2(\theta_S) + \gamma_w^2(\theta_S)}}. \qquad (3) \]
Setting the loudspeaker gains according to this equation results in the acoustic velocity vector [4] pointing towards the direction of the virtual source. The
physical significance of the acoustic velocity vector
pointing towards the virtual source is that the interaural
time difference cue for the listener will be appropriate at
the low frequencies. We say that VBAP panning preserves
the low-frequency interaural time difference cues for the
virtual acoustic source. This leads naturally to the question
of the interaural level difference cues — will they be
veridical with VBAP panning? Unfortunately, the answer
is no at the higher frequencies.
In order to preserve the interaural level difference cues,
we utilize vector base intensity panning. Vector base
intensity panning (VBIP) is based on the acoustic energy of
the loudspeakers, often referred to simply as loudspeaker
energy. The loudspeaker energy for the loudspeaker at
position, xu , is given by the square of the loudspeaker gain:
g2u . If we weight the three loudspeaker directions by the
loudspeaker energies, rather than the loudspeaker gains, we
obtain what is often referred to as the energy vector rE :
\[ \mathbf{r}_E = \frac{g_u^2\,\mathbf{x}_u + g_v^2\,\mathbf{x}_v + g_w^2\,\mathbf{x}_w}{g_u^2 + g_v^2 + g_w^2}. \qquad (4) \]

Fig. 1 A virtual source is panned between three loudspeakers. We define vectors from a listening point to the three loudspeakers as x_u, x_v, x_w and a vector to the virtual source as x_S.
Note that the energy vector is normalized by the total
acoustic energy. In VBIP, the loudspeaker gains are
calculated such that: 1) the vector rE points toward the
virtual source direction; 2) only the three loudspeakers
that are the most closely associated with the virtual source
direction are employed; 3) the total acoustic energy is equal
to one. In VBIP, we set the loudspeaker energies, e_u, e_v, and e_w, according to γ(θ_S) as follows:

\[ \left[\, e_u(\theta_S),\; e_v(\theta_S),\; e_w(\theta_S) \,\right] = \frac{\left[\, \gamma_u,\; \gamma_v,\; \gamma_w \,\right]}{\gamma_u + \gamma_v + \gamma_w}. \qquad (5) \]

Correspondingly, we set the loudspeaker gains as:

\[ g_u = \sqrt{\frac{\gamma_u(\theta_S)}{\gamma_u(\theta_S) + \gamma_v(\theta_S) + \gamma_w(\theta_S)}}, \quad g_v = \sqrt{\frac{\gamma_v(\theta_S)}{\gamma_u(\theta_S) + \gamma_v(\theta_S) + \gamma_w(\theta_S)}}, \quad g_w = \sqrt{\frac{\gamma_w(\theta_S)}{\gamma_u(\theta_S) + \gamma_v(\theta_S) + \gamma_w(\theta_S)}}. \qquad (6) \]
One can easily verify that the total loudspeaker energy is
unity and that:
\[ \mathbf{r}_E = \frac{\mathbf{x}_S}{\gamma_u + \gamma_v + \gamma_w}. \qquad (7) \]
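To make the panning equations concrete, the following Python sketch computes VBAP and VBIP gains for a virtual source direction lying inside a loudspeaker triplet, following Eqs. (1)-(6). The function names and example directions are illustrative assumptions, and the source is taken to lie within the triplet so that the component weights are non-negative.

```python
import numpy as np

def panning_gains(x_u, x_v, x_w, x_s):
    """Return (VBAP gains, VBIP gains) for source direction x_s panned
    across loudspeaker unit vectors x_u, x_v, x_w (illustrative sketch)."""
    L = np.column_stack([x_u, x_v, x_w])   # loudspeaker position matrix, Eq. (1)
    gamma = np.linalg.solve(L, x_s)        # component weights gamma(theta_S), Eq. (2)
    # VBAP: normalize so that the total energy (sum of squared gains) is unity, Eq. (3)
    g_vbap = gamma / np.sqrt(np.sum(gamma**2))
    # VBIP: energies proportional to the weights; gains are their square roots, Eqs. (5)-(6)
    g_vbip = np.sqrt(gamma / np.sum(gamma))
    return g_vbap, g_vbip

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Example: a frontal source panned across three loudspeakers of a dome array
g_vbap, g_vbip = panning_gains(unit([1, 0, 0]), unit([1, 1, 0]), unit([1, 0, 1]),
                               unit([1, 0.2, 0.3]))
```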
The relationship between the acoustic velocity vector, the acoustic energy vector, and the interaural time and level difference cues has long been understood. Nonetheless, when amplitude panning a virtual source between
multiple loudspeakers one has a conflict between these two
direction vectors — they are not the same, so what should
one do? In this case, the duplex theory partially comes to
the rescue. The duplex theory originates from Lord
Rayleigh and his experiments with tuning forks [5] and
indicates that the interaural time difference (ITD) cue influences perception at lower frequencies and that the
interaural level difference (ILD) cue influences perception
at higher frequencies. However, there is no clear agreement on the boundary frequency and it depends on how
one explores this boundary; for an entertaining and
informative review, refer to Hartmann’s recent paper [6].
Most practitioners of vector base panning split the usage of
VBAP and VBIP gains at a boundary frequency between
500 Hz and 1,500 Hz. The boundary frequency, however,
is not the only concern. There is also the perceived width
of the source which is related to the norm of the energy
vector, r_E [7]. In this regard, practitioners quantify the angular spread, σ(θ_S), of the virtual source using [8]:

\[ \sigma(\theta_S) = 2\cos^{-1}\!\left( 2\,\|\mathbf{r}_E(\theta_S)\| - 1 \right). \qquad (8) \]
With VBAP and VBIP, one does not have independent
control of the angular spread of the virtual acoustic source.
For example, with VBIP the loudspeaker gains are set
by the acoustic energy vector and any variation in the
individual loudspeaker energies will necessarily modify
the perceived loudness of the virtual source. This implies
that as the source moves in space its perceived width will
vary. Efforts to compensate for the variation in acoustic width are described in [8], but one can only truly widen the source by utilizing a greater number of loudspeakers.
2.2. Higher-Order Ambisonics
Higher-Order Ambisonics (HOA) provides a more
sophisticated viewpoint for rendering a sound scene using
loudspeakers. Nonetheless, at its essence, it still involves
panning a virtual sound source across a number of
loudspeakers. With HOA, a circular or spherical array of
loudspeakers is generally used and one considers panning
across a much larger number of loudspeakers than just two or three. This approach is particularly effective
for ambient sounds, but not so practical for direct sounds.
Therefore, more recent approaches separate the direct and
ambient components of the sound field [9–12]. A particular
advantage of HOA rendering is its ability to work with
sound field recordings and its ability to render ambient,
reverberant environments. Before delving into the technical
details we attempt to characterize the spirit and intuition
behind HOA rendering. Consider a sound field together
with a listening viewpoint. Without going into the details
of sound field recordings, it stands to reason that if one
records the sound field with a spherical microphone array
at the listening viewpoint, then any loudspeaker reproduction of that sound field should ideally result in the same
spherical microphone array recording when the spherical
microphone array is positioned at the listening viewpoint
of the loudspeaker array. Indeed, this provides one
empirical method to validate loudspeaker reproduction
techniques. The spherical microphone array enables a
specific spherical harmonic decomposition of the sound
field and sometimes this spherical harmonic decomposition
is referred to as an ambisonic decomposition. Thus, the
HOA rendering method attempts to identify the spherical
harmonic decomposition of the virtual sound source or
virtual sound scene and then reproduce this spherical
harmonic decomposition using the loudspeaker array.
2.2.1. Underlying theory
The HOA method of sound field reproduction is based
on the idea that the sound field can be described as a sum of
spherical harmonic components and can be considered as
panning in the spherical harmonic domain. In the frequency
domain, any sound field consisting of incoming sound
waves can be expressed as a series of spherical harmonic
functions [13]:
\[ p(r, \theta, \varphi; k) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} i^{l} j_l(kr)\, Y_l^m(\theta, \varphi)\, b_{lm}(k), \qquad (9) \]
where p(r, θ, φ; k) is the acoustic pressure corresponding to the wave number k at the point with spherical coordinates (r, θ, φ), and where j_l is the spherical Bessel function of order l and Y_l^m is the spherical harmonic function of order l and degree m. Equation (9), known as
the spherical harmonic representation of the sound field,
states that the sound field is fully described by the complex
coefficients blm ðkÞ. Describing the sound field at any point
in space requires an infinite number of blm coefficients,
which is impossible in practice. Nevertheless, a good
approximation of the acoustic pressure in the vicinity of the
origin can be obtained by truncating the series to an order L
that depends upon the wave number, k, and radius, r, from
the origin according to the formula [14]:
\[ L \geq \frac{e k r - 1}{2}, \qquad (10) \]
where e is the mathematical constant known as Euler’s
number. Hence, the higher the order, the larger the area
where the sound field is accurately described, with regard
to the wavelength.
2.2.2. Basic HOA decoding
Using HOA to record a sound field consists in obtaining
the blm components up to order-L, which are often referred
to as HOA signals. Conversely, using HOA to synthesise
a sound field consists in controlling the loudspeakers so
that they create a sound field having the desired order-L
spherical harmonic representation. In the HOA method, it
is usually assumed that the loudspeakers act on the sound
field as would plane-wave sound sources. This is a
reasonable approximation assuming the distance from the
loudspeakers to the center of the listening area is large
enough with regard to the order L. More precise methods
that consider the distance of both the loudspeakers and the
virtual sources more accurately are described in [15–17].
With these more accurate methods, one can even enable
near-field sources [15]. Returning to the simplifying
assumption that the loudspeakers may be modelled as
plane-wave sources, the relation between the speaker
signals and the spherical harmonic components of the
sound field is a simple matrix-vector product:
\[ \mathbf{b} = Y \mathbf{g}, \qquad (11) \]
where b denotes the vector of the spherical harmonic
components b_lm up to order L, g is the vector of the speaker
complex gains, and Y is the matrix of the spherical
harmonic function values at the loudspeaker angular
positions:
\[ Y = \begin{pmatrix} Y_0^0(\theta_1, \varphi_1) & Y_0^0(\theta_2, \varphi_2) & \cdots & Y_0^0(\theta_S, \varphi_S) \\ Y_1^{-1}(\theta_1, \varphi_1) & Y_1^{-1}(\theta_2, \varphi_2) & \cdots & Y_1^{-1}(\theta_S, \varphi_S) \\ \vdots & \vdots & \ddots & \vdots \\ Y_L^L(\theta_1, \varphi_1) & Y_L^L(\theta_2, \varphi_2) & \cdots & Y_L^L(\theta_S, \varphi_S) \end{pmatrix}, \qquad (12) \]
where S is the number of speakers. The speaker signals are
calculated by decoding the desired (or recorded) spherical
harmonic signals using a decoding matrix D:
\[ \mathbf{g} = D \mathbf{b}', \qquad (13) \]
where b′ denotes the vector of the desired spherical harmonic components. The sound field reconstruction error can be defined as the norm of the difference between the desired and reconstructed sound field:
\[ E = \| \mathbf{b} - \mathbf{b}' \| = \| (YD - I)\, \mathbf{b}' \|, \qquad (14) \]
where I denotes the identity matrix.
Clearly, the sound field will be perfectly reconstructed
if the decoding matrix perfectly inverts the matrix Y. This
leads to the following results:
. A perfect sound field reconstruction is possible only if
there are at least as many loudspeakers as spherical
harmonics. In 3D, this condition translates into the
following relation:
\[ S \geq (L + 1)^2. \qquad (15) \]
. The rank of Y must be greater than or equal to the number
of spherical harmonics. This requires that the loudspeaker positions achieve a good angular sampling of
the sphere, which means that the loudspeakers must
be distributed as evenly as possible over the surface of
the sphere.
An obvious solution to the HOA decoding problem is to
use the Moore-Penrose pseudo inverse of matrix Y:
\[ D = \mathrm{pinv}(Y). \qquad (16) \]
This solution is known as the basic HOA decoding and is
a direct analogue of vector-base amplitude panning in the
spherical harmonic domain. It results in the acoustic
velocity vector pointing toward the virtual source and
correct low-frequency interaural time difference cues.
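As a small illustration of basic decoding, the sketch below builds the matrix Y of Eq. (12) for a first-order (L = 1) system using real spherical harmonics with N3D-style normalization and computes D = pinv(Y) as in Eq. (16). The loudspeaker layout, the normalization convention, and the function names are assumptions made for the example.

```python
import numpy as np

def real_sh_order1(directions):
    """Real spherical harmonics up to order 1 (N3D normalization) at unit
    direction vectors; rows follow the ordering Y00, Y1-1, Y10, Y11."""
    x, y, z = directions[:, 0], directions[:, 1], directions[:, 2]
    return np.vstack([
        np.full_like(x, 1.0 / np.sqrt(4 * np.pi)),   # Y_0^0
        np.sqrt(3.0 / (4 * np.pi)) * y,               # Y_1^{-1}
        np.sqrt(3.0 / (4 * np.pi)) * z,               # Y_1^{0}
        np.sqrt(3.0 / (4 * np.pi)) * x,               # Y_1^{1}
    ])

# Hypothetical layout: 6 loudspeakers on the vertices of an octahedron
speakers = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

Y = real_sh_order1(speakers)        # (L+1)^2 x S matrix, as in Eq. (12)
D = np.linalg.pinv(Y)               # basic decoding matrix, Eq. (16)

# Decoding: loudspeaker gains for a desired order-1 HOA vector b', Eq. (13)
b_prime = real_sh_order1(np.array([[0.0, 1.0, 0.0]]))[:, 0]  # plane wave from +y
g = D @ b_prime
```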
2.2.3. Sweet spot size and alternate decodings
The size of the area where the sound field is accurately
described by the order-L spherical harmonic representation
of the sound field decreases with frequency, as expressed in
Eq. (10). A perfect sound field reconstruction requires the
size of this area to be greater than or equal to the size of the
listener’s head. Assuming the listener’s head radius is
10 cm, the maximum frequency value at which the sound
field can be reconstructed perfectly for a single listener is
given by:
\[ f_{\max} = \frac{c}{2\pi}\,\frac{2L + 1}{0.1\, e}. \qquad (17) \]
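As a quick check, the values in Table 1 below can be reproduced with a short computation; the sketch assumes a speed of sound of 343 m/s and the 10 cm head radius used above.

```python
import numpy as np

c, r = 343.0, 0.1        # speed of sound (m/s) and assumed head radius (m)
for L in range(1, 7):
    f_max = c * (2 * L + 1) / (2 * np.pi * r * np.e)   # Eq. (17)
    print(f"L = {L}: f_max = {f_max:.0f} Hz")
```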
The corresponding frequency values are presented in Table 1 for up to order 6. Clearly, the sound field will not be perfectly reconstructed for a single listener above a few kHz, unless a very large number of loudspeakers is used.

Table 1 Maximum frequency for perfect reconstruction as a function of the HOA order.

L           1      2      3      4      5      6
fmax (Hz)   602    1,004  1,406  1,807  2,209  2,611

Therefore, above the specified frequency values, it
is preferable to use an alternative to the basic decoding
procedure that improves the perceived quality of the sound
field reconstruction. The most common alternate decoding
method is referred to as the maxrE decoding and is an
analogue of vector-base intensity panning in the spherical
harmonic domain. The maxrE decoding matrix is calculated by multiplying the basic decoding matrix with a
weighting matrix, as follows:
\[ D_{\mathrm{maxrE}} = D\, W_{\mathrm{maxrE}}, \qquad (18) \]
where W maxrE is a diagonal matrix whose diagonal
components depend only on the order. These weights are
calculated so that, when playing back a wave incoming
from a particular direction, the speaker signal energy is
concentrated in the corresponding direction. Mathematically, this maximizes the norm of the so-called energy
vector, which can be expressed as:
\[ \mathbf{r}_E = \frac{\sum_{i=1}^{S} \|g_i\|^2\, \mathbf{x}_i}{\sum_{i=1}^{S} \|g_i\|^2}, \qquad (19) \]
where gi is the gain of the ith loudspeaker and xi is the unit
vector pointing towards this loudspeaker from the listening
point. For a given maximum order L, the maxrE weights
are determined as follows. One first solves P_{L+1}(x) = 0, where P_{L+1}(·) is the Legendre polynomial of order L + 1. Let x* be the largest value of x which solves this equation. Then the maxrE weights, w_l, for order l ≤ L are solved recursively as follows:
\[ w_0 = 1, \quad w_1 = x^{*}, \quad w_{l+1} = \frac{(2l + 1)\, x^{*} w_l - l\, w_{l-1}}{l + 1}. \qquad (20) \]
A detailed description of the benefits of maxrE decoding
and the computation of the weighting matrix can be found
in the PhD thesis by Jerome Daniel [4].
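As an illustration, the max-rE weights of Eq. (20) can be computed by finding the largest root of the Legendre polynomial P_{L+1} with numpy's Legendre-series utilities and then applying the recursion; the sketch below is one straightforward realization.

```python
import numpy as np
from numpy.polynomial import legendre

def max_re_weights(L):
    """max-rE weights w_0 ... w_L following Eq. (20)."""
    # Largest root x* of the Legendre polynomial P_{L+1}
    coeffs = np.zeros(L + 2)
    coeffs[-1] = 1.0                       # Legendre series containing only the P_{L+1} term
    x_star = np.max(legendre.legroots(coeffs))
    w = [1.0, x_star]                      # w_0 = 1, w_1 = x*
    for l in range(1, L):
        w.append(((2 * l + 1) * x_star * w[l] - l * w[l - 1]) / (l + 1))
    return np.array(w[:L + 1])

print(max_re_weights(3))   # weights for a third-order system
```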
2.3. Wave-Field Synthesis
With wave-field synthesis, we consider both a reproduction region of space, Ω, and a control region of space,
V, as shown in Fig. 2. We also refer to the loudspeaker
signals as driving point functions. The question is how to
calculate the driving point functions given the desired
sound field. Traditional descriptions of wave-field synthesis (WFS) develop the theory via the Rayleigh integrals and the Kirchhoff-Helmholtz integral [17–21] and initially considered planar and linear loudspeaker arrays.

Fig. 2 Two regions of space are identified: a reproduction region and a control region. Adapted from [23].
The Rayleigh integral approach contains many approximations and will only be briefly touched upon here. Two
readable accounts of WFS may be found in the short
tutorial presented at the 2013 European Acoustics Association Winter School [18] and the book by Jens Ahrens [22].
The brief description given here follows the formulation
provided in [22]. The Rayleigh I integral describes the sound field, S(z), in a source-free half-space defined by a boundary plane ∂Ω. It states that the sound field is uniquely defined by the gradient of the sound field on the planar boundary in the normal direction pointing into the half-space:
\[ P(z) = \int_{\partial\Omega} \left( -2 \left.\frac{\partial S(z)}{\partial n}\right|_{z = z_0} \right) G(z - z_0)\, dA(z_0), \qquad (21) \]
where G(z − z0) is the transfer function for an acoustic point source, detailed below. Thus, the driving point function, D(z0), is given by:

\[ D(z_0) = -2 \left.\frac{\partial S(z)}{\partial n}\right|_{z = z_0}. \qquad (22) \]
It is here that the complexities of WFS begin and we only
provide a summary of these complexities (refer to [22] for
details) in this tutorial introduction. What generally occurs
next is that many approximations are made. The planar
array of loudspeakers accommodating the driving point
function degenerates to a linear array that accommodates
listening only in the horizontal plane at a height of zero.
Because the integral involves integration with a complex
exponential, Laplace’s method of stationary phase is used
to make a high-frequency approximation that asserts that
the only important complex exponential terms contributing
to the integral occur where the phase of the exponential
term is stationary. This approximation leads to a driving
point function that depends on the listening position of the
listener and so further approximations are made so that
the driving point function is constant. In the end, the
approximations lead to what is commonly referred to as
a 2.5 dimensional solution in which the driving point
function has the form:
\[ D_{2.5\mathrm{D}}(z) = \sqrt{\frac{2\pi\, d_{\mathrm{ref}}}{i k}}\; \left. D(z_0) \right|_{\mathrm{height}=0}, \qquad (23) \]
where dref is a reference distance chosen to optimize the
playback and k is the wavenumber, described in more
detail below.
From a computational viewpoint, it is easier and more
practical to follow a formulation of WFS based on the simple source formulation, as espoused by Fazi [23]. In the
following mathematical considerations, we operate in the
frequency domain and consider a single frequency only.
For simplicity, we omit the frequency dependence of the
variables. Using the simple source formulation, the
pressure, p(z), at a location, z, within the reproduction region is given by:

\[ p(z) = \int_{\partial\Omega} G(z, y)\, \sigma(y)\, dA(y), \qquad (24) \]

where G(z, y) = e^{-ik|z-y|}/(4π|z − y|) is the transfer function for an acoustic point source and k = 2πf/c is the wavenumber, f is frequency, ρ is the air density, and c is the speed of sound.
The transfer function for an acoustic point source is
referred to as a Green’s function in the literature. As well,
the point source is referred to as a monopole loudspeaker in
the wave-field synthesis literature. If one knows both the
interior sound field, p⁻(z), and exterior sound field, p⁺(z), as shown in Fig. 3, then the driving point function, σ(y), is determined as:

\[ \sigma(y) = \nabla_n p^{-}(y) - \nabla_n p^{+}(y), \qquad (25) \]

where ∇_n p(z) := ∇p(z) · n̂(z) is the gradient of the pressure in the direction of the outward normal.
The complicated mathematical considerations above
become significantly simpler when using a matrix formulation. This matrix formulation is similar to the Boundary
Surface Control method developed by Ise [24]. Assume
one has a set of loudspeaker locations indexed by i and a
set of control points indexed by j. One then forms the
transfer function matrix, S, where the element S_{i,j} = G(z_i, y_j).

Fig. 3 The interior sound field, p⁻(z), and exterior sound field, p⁺(z), are identified. Adapted from [23].

Let the desired sound field at the control points
be represented by the vector y and let the unknown
loudspeaker driving point functions be represented by the vector σ. One then computes the loudspeaker driving point functions, σ, as a solution to the system of linear equations:

\[ S \boldsymbol{\sigma} = \mathbf{y}. \qquad (26) \]
A common technique to obtain a least-norm solution is to use Tikhonov regularization:

\[ \boldsymbol{\sigma} = \left[ S^{H} S + \lambda I \right]^{-1} S^{H} \mathbf{y}. \qquad (27) \]

A common value for the regularization parameter is λ = 0.03; refer to [25] for a discussion of regularization.
This is then repeated for each frequency. A complicated
sound field can be simulated for a given configuration of
loudspeakers and control zone using a sound field simulator
such as MCRoomSim [26].
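As a concrete illustration of the matrix formulation, the sketch below builds the transfer matrix between a hypothetical circular loudspeaker array and a ring of control points and solves Eq. (26) with the Tikhonov regularization of Eq. (27). The geometry, frequency, target field, and regularization value are illustrative assumptions only.

```python
import numpy as np

c = 343.0
f = 1000.0
k = 2 * np.pi * f / c                      # wavenumber

def green(z, y):
    """Free-field point-source transfer function G(z, y) = e^{-ik|z-y|} / (4 pi |z-y|)."""
    d = np.linalg.norm(z - y)
    return np.exp(-1j * k * d) / (4 * np.pi * d)

# Hypothetical geometry: 32 loudspeakers on a 2 m circle, control points on a 0.2 m circle
phi_spk = np.linspace(0, 2 * np.pi, 32, endpoint=False)
speakers = 2.0 * np.stack([np.cos(phi_spk), np.sin(phi_spk)], axis=1)
phi_ctl = np.linspace(0, 2 * np.pi, 24, endpoint=False)
controls = 0.2 * np.stack([np.cos(phi_ctl), np.sin(phi_ctl)], axis=1)

# Transfer matrix: element [i, j] is the transfer from loudspeaker j to control point i
S = np.array([[green(zc, ys) for ys in speakers] for zc in controls])

# Desired field at the control points: a plane wave travelling along +x
y_des = np.exp(-1j * k * controls[:, 0])

# Tikhonov-regularized least-norm solution for the driving functions, Eq. (27)
lam = 0.03
sigma = np.linalg.solve(S.conj().T @ S + lam * np.eye(S.shape[1]), S.conj().T @ y_des)
```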
With WFS, there are a number of practical considerations. For example, when the number of loudspeakers is
large, a sparse solution generally improves performance
[27], i.e., solve:
\[ \hat{\boldsymbol{\sigma}} = \arg\min \left\{ \|\boldsymbol{\sigma}\|_1 : S \boldsymbol{\sigma} = \mathbf{y} \right\}. \qquad (28) \]
With a large array of loudspeakers one should also
compensate for the directivity pattern of the loudspeakers.
In addition, one should also scale the size of the control zone with frequency. One can use the rule of thumb that the approximate radius, R, of the control zone should scale as R ≈ 4√N/(ek), where N is the number of loudspeakers, k is the wavenumber, and e is Euler's number. For robustness, one should include
some control points inside the control zone. These ‘‘inside’’
control points are sometimes referred to as CHIEF-points
in the boundary element method literature [28], where
CHIEF refers to combined Helmholtz integral equation
formulation. The CHIEF points resolve ambiguities in the
sound field solution.
2.4. Transaural Cross-talk Cancellation
We now consider transaural cross-talk cancellation
systems. These systems are commonly implemented in the
form of a row of loudspeakers as shown in Fig. 4 and aim
to control the signals at the two ears. A detailed description
and further references are given in [29]. The mathematical
description is similar to WFS except there are only two
control points: the signals at the two ears. As previously,
we operate in the frequency domain and for simplicity omit
the frequency dependence of variables. Let the loudspeaker
driving point functions be given by σ and the desired binaural pressure signals at the two ears be specified by y = [y_L, y_R]. We again solve:

\[ S \boldsymbol{\sigma} = \mathbf{y}, \qquad (29) \]
Fig. 4 Each loudspeaker has a transfer function to the
location of both the left and right ear. Adapted from
[29].
where S is the transfer function matrix. A least-norm solution for the M × 2 loudspeaker filter matrix, H, is given by:

\[ H = \left[ S^{H} S + \lambda I \right]^{-1} S^{H}, \qquad (30) \]
so that ¼ Hy. If one needs to compensate for slight
movements of the listener, an adaptive solution for the
loudspeaker filter matrix can be obtained from simple
beamforming considerations [29]. Note that in all of the discussion above, we do not explicitly consider head-related impulse responses — we simply control the sound
field at two points. One could, of course, amend the transfer
function matrix to include a consideration of head-related
impulse responses.
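For illustration, the following sketch computes the filter matrix of Eq. (30) for a hypothetical two-loudspeaker, two-ear geometry at a single frequency, using free-field point-source transfer functions in place of measured responses; the regularization value is an assumption.

```python
import numpy as np

c = 343.0
k = 2 * np.pi * 1000.0 / c                  # wavenumber at 1 kHz

def g(z, y):
    """Free-field point-source transfer function between positions z and y."""
    d = np.linalg.norm(z - y)
    return np.exp(-1j * k * d) / (4 * np.pi * d)

ears = np.array([[-0.09, 0.0], [0.09, 0.0]])           # left/right ear positions (m)
speakers = np.array([[-0.3, 2.0], [0.3, 2.0]])          # two loudspeakers above a screen

# 2 x M transfer matrix from loudspeakers to ears
S = np.array([[g(e, s) for s in speakers] for e in ears])

# Least-norm, Tikhonov-regularized filter matrix H (M x 2), Eq. (30)
lam = 1e-3
H = np.linalg.solve(S.conj().T @ S + lam * np.eye(S.shape[1]), S.conj().T)

# Loudspeaker signals for desired binaural signals y = [y_L, y_R]
y = np.array([1.0, 0.0])                    # e.g., a signal intended for the left ear only
sigma = H @ y
```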
3. SOUND FIELD RECORDING
Spherical microphone arrays (SMAs) have been the
focus of considerable recent research [30–39] and are
especially useful for recording panoramic sound scenes.
Two recent books [40,41] provide a fairly comprehensive
overview. By virtue of their spherical symmetry, SMAs
provide a natural framework for analyzing sound fields in
the spherical harmonic domain, see Fig. 5.
We now derive the mathematical model used to
describe the acoustic behavior of a rigid, baffled SMA.
This description is taken from [42] and has been shortened
in parts for this tutorial. We consider the case of an SMA
consisting of N omnidirectional microphones located at
various positions around a perfectly rigid sphere with
radius R. As illustrated in Fig. 6, we define the position
of the microphones by their spherical coordinates (r, θ, φ).
For simplicity, the mathematical expressions are derived
in the frequency domain as a function of the dimensionless
frequency kR, where k denotes the wavenumber, k = 2πf/c, f denotes the frequency and c denotes the speed of sound. As well, for a given radial distance, r, we introduce the dimensionless radius, ρ, defined by ρ = r/R.

Fig. 5 The dual-radius SMA prototype described in [42].

Fig. 6 The notations used to describe the geometry of an SMA are illustrated. From [42].

Consider the n-th microphone of the SMA, whose spherical coordinates are (ρ_n R, θ_n, φ_n). In the case where the incident sound field consists of incoming waves, the acoustic pressure measured by this sensor is given by:
\[ p_n = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n)\, b_{l,m}, \qquad (31) \]
where
. blm is a complex coefficient depending only on the
incident sound field, which we denote as the order-l
and degree-m spherical harmonic component.
. Ylm denotes the order-l and degree-m real-valued
spherical harmonic function:
\[ Y_l^m(\theta, \varphi) = \sqrt{\frac{2l+1}{4\pi}\,\frac{(l-|m|)!}{(l+|m|)!}}\; P_l^{|m|}(\sin\theta) \times \begin{cases} \cos m\varphi & \text{for } m \geq 0 \\ \sin |m|\varphi & \text{for } m < 0 \end{cases}, \qquad (32) \]

where P_l^{|m|} is the order-l, degree-|m| associated Legendre polynomial. Note that the sin θ term arises from the spherical coordinate convention chosen in this paper (see Fig. 6).
. w_l(kR, ρ_n) is the 'modal strength', associated with an ideal, rigid spherical baffle, of the order-l spherical harmonic modes at the microphone position and is given by:

\[ w_l(kR, \rho_n) = i^{l} \left( j_l(\rho_n kR) - \frac{j_l'(kR)}{h_l^{(2)\prime}(kR)}\, h_l^{(2)}(\rho_n kR) \right), \qquad (33) \]

where j_l and h_l^{(2)} denote the order-l spherical Bessel function and spherical Hankel function of the second kind, respectively.
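For reference, Eq. (33) can be evaluated with scipy's spherical Bessel routines, forming the spherical Hankel function of the second kind as h_l^(2) = j_l − i y_l. The following is an illustrative sketch under those definitions; the example values of kR and ρ are arbitrary.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def sph_hankel2(l, x, derivative=False):
    """Spherical Hankel function of the second kind, h_l^{(2)} = j_l - i y_l."""
    return spherical_jn(l, x, derivative) - 1j * spherical_yn(l, x, derivative)

def modal_strength(l, kR, rho):
    """Modal strength w_l(kR, rho) for a rigid-sphere SMA, Eq. (33)."""
    return (1j ** l) * (spherical_jn(l, rho * kR)
                        - spherical_jn(l, kR, derivative=True)
                        / sph_hankel2(l, kR, derivative=True)
                        * sph_hankel2(l, rho * kR))

# Example: order-0 to order-3 modal strengths at kR = 2 for surface microphones (rho = 1)
w = [modal_strength(l, 2.0, 1.0) for l in range(4)]
```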
We refer to Eq. (31) as a Bessel-weighted spherical
harmonic expansion of the acoustic pressure. In the audio
engineering literature, this equation is sometimes referred
to as a spherical Fourier transform or a Higher Order
Ambisonic (HOA) transform and we use the term HOA
synonymously with the expression ‘spherical harmonic.’
According to Eq. (31), the exact value of the pressure
is determined by the summation of an infinite number of
terms. This sum must be truncated for the pressure to be
estimated numerically. In practice, we advise selecting
the truncation order, Λ, as the minimum order such that the relative error in the pressure measured by the SMA's farthest sensor is less than −80 dB (0.01%). Figure 7 shows the truncation order, Λ, as a function of kR for different values of ρ. Details for the calculations to
determine the truncation order are provided in [42].
Assuming that an appropriate truncation order, Λ, has been determined, the value of the acoustic pressure measured by the n-th sensor can be approximated as:

\[ p_n \approx \sum_{l=0}^{\Lambda} \sum_{m=-l}^{l} w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n)\, b_{l,m}. \qquad (34) \]
Equation (34) expresses the summation over a finite
number of terms. It can therefore be rewritten as the
following vector product:
\[ p_n = \mathbf{t}_{\Lambda,n}^{T}\, \mathbf{b}_{\Lambda}, \qquad (35) \]

where

\[ \mathbf{t}_{\Lambda,n} = \left[ t_{0,0,n},\, t_{1,-1,n},\, t_{1,0,n},\, \ldots,\, t_{\Lambda,\Lambda,n} \right]^{T}, \quad t_{l,m,n} = w_l(kR, \rho_n)\, Y_l^m(\theta_n, \varphi_n), \quad \mathbf{b}_{\Lambda} = \left[ b_{0,0},\, b_{1,-1},\, b_{1,0},\, \ldots,\, b_{\Lambda,\Lambda} \right]^{T}. \qquad (36) \]
Similarly, the vector of the acoustic pressures received by
the N microphones of the SMA can be expressed as:
\[ \mathbf{p} = T_{\Lambda}\, \mathbf{b}_{\Lambda}, \qquad (37) \]

where T_Λ is the transfer matrix between the HOA components up to order-Λ and the pressure received by the N microphones, given by:

\[ T_{\Lambda} = \left[ \mathbf{t}_{\Lambda,1},\, \mathbf{t}_{\Lambda,2},\, \ldots,\, \mathbf{t}_{\Lambda,N} \right]^{T}. \qquad (38) \]
3.1. Higher-Order Ambisonic Encoding
We refer to the process of retrieving the up-to-order-L
HOA components from the microphone signals as order-L
HOA encoding. In practice, this operation consists in
convolving the microphone signals with a matrix of Finite
Impulse Response (FIR) encoding filters. In this section,
we derive the mathematical expression for calculating the
encoding filter frequency responses.
The problem of finding the values of the HOA
components from the microphone signals essentially consists in solving Eq. (37) for b_Λ. In other words, the matrix of the order-L encoding filter frequency responses, E_L, must invert the matrix T_Λ as accurately as possible. Mathematically, this means that E_L must minimize the quantity ‖E_L T_Λ − C‖²_F, where ‖·‖_F denotes the Frobenius norm and C is given by:

\[ C = \left[ I_{(L+1)^2} \;\; 0_{(L+1)^2 \times \left( (\Lambda+1)^2 - (L+1)^2 \right)} \right], \qquad (39) \]

where I_n denotes the n × n identity matrix and 0_{m×n} denotes the m × n zero matrix.
In practice we solve for EL using Tikhonov regularization. In other words, we define EL as:
\[ E_L = \arg\min_{A} \left( \| A T_{\Lambda} - C \|_F^2 + \mu^2 \| A \|_F^2 \right), \qquad (40) \]

where μ is the regularization parameter setting the relative importance given to minimizing the norm of the matrix with respect to minimizing the error. Matrix E_L is unique and given by:
\[ E_L = C\, T_{\Lambda}^{H} \left( T_{\Lambda} T_{\Lambda}^{H} + \mu^2 I_N \right)^{-1}, \qquad (41) \]

where μ is the ratio of the energy of the measurement noise in the HOA signals to the energy of the measurement noise in the microphone signals. The proof of this result is given in [42].
This solution for E_L incorporates two critical aspects. First, it applies regularization to prevent measurement noise from being amplified unreasonably at low kR values. Second, it optimally weights the microphone signals to minimize spatial aliasing by employing a matrix T_Λ of sufficiently high order.

Fig. 8 The amplitude of the modal weights, w_l(kR, ρ), are shown as a function of kR for orders 0 to 4 when ρ = 1 (top) and ρ = 2 (bottom). From [42].

With regard to the amplification of measurement noise, Fig. 8 shows the amplitude of the modal weights w_l as a function of kR for orders 0 to 4. At low kR values, the weights corresponding to HOA signals of order greater than one have very small amplitudes. As well, in the case where the microphones are not located on the rigid spherical surface (ρ > 1), the weights are equal to zero at certain kR values. Without regularization, the encoding filters would amplify the microphone signals greatly to acquire the HOA signals at these frequencies and would thus be very non-robust to measurement noise. The amount of measurement noise in the encoded signals is bounded by the regularization parameter as follows:

\[ \frac{\text{(measurement-noise energy in the encoded HOA signals)}}{\text{(measurement-noise energy in the microphone signals)}} \leq \frac{1}{\mu^2}. \qquad (42) \]

Fig. 7 The truncation order Λ is shown as a function of kR for different values of ρ. From [42].

Consider now the issue of spatial aliasing. At high frequencies, microphones can receive energy from undesired, greater-than-order-L, HOA components of the incoming sound. This results in spatial aliasing, where the encoded HOA signals are polluted by these undesired HOA components. Figure 7 shows that the microphones with larger ρ receive more energy from the undesired, greater-than-order-L, HOA components than those with smaller ρ. Therefore, at high frequencies it is desirable to apply greater weight to the signals from microphones located closer to the rigid spherical surface. In summary, the high-order information contained in T_Λ is essential for optimal calculation of the encoding filters.
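Returning to Eq. (41), the encoding-filter computation reduces to a few lines of linear algebra once the transfer matrix is available. The sketch below assumes T_Λ is given as a complex matrix of shape N × (Λ+1)² at a single frequency and that a value for μ has been chosen; the matrix contents here are random placeholders purely for illustration.

```python
import numpy as np

def hoa_encoding_filters(T, L, mu):
    """Order-L HOA encoding filter matrix E_L at one frequency, Eq. (41).

    T  : complex transfer matrix of shape (N, (Lambda+1)**2), microphones x HOA components
    L  : desired encoding order (L <= Lambda)
    mu : regularization parameter (measurement-noise energy ratio)
    """
    N, n_cols = T.shape
    n_L = (L + 1) ** 2
    C = np.zeros((n_L, n_cols))
    C[:, :n_L] = np.eye(n_L)                      # selection matrix of Eq. (39)
    A = T @ T.conj().T + (mu ** 2) * np.eye(N)    # regularized Gram matrix
    return (C @ T.conj().T) @ np.linalg.inv(A)    # E_L = C T^H (T T^H + mu^2 I)^{-1}

# Usage with placeholder data: 32 microphones, Lambda = 6, encode up to order L = 3
rng = np.random.default_rng(0)
T_demo = rng.standard_normal((32, 49)) + 1j * rng.standard_normal((32, 49))
E3 = hoa_encoding_filters(T_demo, L=3, mu=0.1)
```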
4. BINAURAL VIRTUAL SPATIAL AUDIO
4.1. Head-tracking and Individualization
Binaural spatial audio is based on head-related impulse
responses and/or binaural room impulse responses. The
first and most important point to make is that immersive
binaural virtual spatial audio requires head-tracking. Without head-tracking the virtual acoustic sources move with
the head and destroy the perceptual illusion. It is the
advent of mixed-reality devices with head-tracking that
provides a significant impetus for immersive spatial audio in the consumer industry. This then raises the issue of just how important individualized head-related impulse responses are.
In order to illustrate these issues we consider some
recent data [43] for an experiment contrasting individualized HRIRs and generic HRIRs. This description is taken
from [43]. There were four listening conditions: (1)
binaural rendering with individualized HRIR filters and
head-tracking; (2) binaural rendering with generic HRIR
filters and head-tracking; (3) binaural rendering with
individualized HRIR filters and no head-tracking; and (4)
normal headphone listening without binaural spatial rendering. Twenty-three self-reported, normally-hearing listeners participated in the listening test. Listeners
were asked to listen to six sound excerpts:
. Mono: drums, Radiohead — Weird Fishes/Arpeggi
. Mono: guitar, Tarrega — Capriccio Arabe
. Stereo: Pop, Radiohead — Jigsaw Falling Into Place
. Stereo: Bossa-Nova, Stan Getz, João Gilberto — Vivo
Sonhando
. 5.1 Surround: Rock, Pink Floyd — Money
. 5.1 Surround: Pop Jazz, Norah Jones — Come Away
With Me
Sounds were played to the listener using the AKG 1000
open headphones and also a loudspeaker array consisting of 12 loudspeakers: 5 Tannoy System 15 loudspeakers forming a 5.1 arrangement and 7 additional Tannoy V6 loudspeakers forming a circular array spaced every 45 degrees.

Fig. 9 Results from the binaural listening test are shown for the six sound stimuli. The average population scores for the four listening conditions are shown using a bar plot. The legend labels are as follows: Indiv. + HT = individualized HRIRs with head-tracking; Generic + HT = generic HRIRs with head-tracking; Indiv. no HT = individualized HRIRs with no head-tracking; No 3D = no binaural spatial rendering. Figures taken from [43].

The loudspeaker playback provided a reference
for the headphone listening. Because the headphones are
open, the loudspeakers could be heard without distortion.
Every listener had HRIRs recorded using a blocked-ear
recording method [44] in an anechoic chamber using a
semi-circular robotic arm (methods were similar to those
presented here [45]). A MUSHRA-like [46] test paradigm
was used in which there was no hidden reference, but an
anchor was included. The explicit reference was loudspeaker playback and the anchor was headphone presentation with no spatial audio rendering. Listeners participated
in two different trials. In one trial listeners were asked to
rate overall preference and in another trial listeners were
asked to rate the clarity of the frontal image. Head-tracking
was implemented using a Polhemus G4 head-tracking
device mounted on the headphones.
Results of the listening test are shown in Fig. 9. As
expected, head-tracking contributed significantly to the
listeners’ scores because it provides a consistent listening
environment in which sound sources are robustly and
consistently localized when the head moves. Interestingly,
listeners also showed a small, but consistent bias for
individualized binaural rendering over generic binaural
rendering. The added benefit of individualized binaural
rendering is small compared to the benefit of head-tracking.
Nevertheless, in listening conditions without a visual
reference, there does seem to be a small benefit for
individualization in binaural rendering. This would suggest
that individualized binaural rendering will play some
role when visual stimuli are absent — for example, in
augmented spatial hearing conditions using hearables.
4.2. Binaural Reverberation and Ambiance
4.2.1. Binaural rendering of HOA signals
Head-related impulse responses are generally recorded
under anechoic, free-field conditions. This means that one
must consider the issue of how reverberation and ambiance
will be handled during binaural rendering. A perceptually
reasonable (the author is unaware of precise psychophysical validation) encoding of reverberant and ambient
sound fields is provided by order-3 HOA signals. Thus,
we consider binaural rendering of HOA signals. Let the
pressure, p, at the ears in the presence of N plane waves be given by p = Hs, where H contains the head-related transfer function information and s specifies the plane-wave sources. If the plane-wave sources are encoded as HOA signals, Y, then we require a decoding matrix, F, so that p̂ = FY, where ideally, p̂ = p. A possible mathematical solution can be found using the pseudo-inverse: F = H pinv(Y). Nevertheless, this does not work well for many reasons: the pseudo-inverse will smear the source signals across directions; as well, the phase component of the matrix H is complicated and this solution will result in strong spectral coloration. Therefore, it is recommended that one smoothly fix the phase of the head-related transfer functions above a certain frequency to a constant to obtain H_mod. Using F = H_mod pinv(Y) provides a much improved binaural rendering for order-3 HOA signals, provided one has a dense recording of head-related transfer functions.
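The phase-fixing step described above can be sketched as follows. The HRTF array shapes, the cutoff frequency, and the simple "freeze the phase above the cutoff" rule are illustrative assumptions standing in for the smooth phase fixing that the text recommends.

```python
import numpy as np

def fix_hrtf_phase(H, freqs, f_cutoff=1500.0):
    """Return H_mod: keep HRTF magnitudes, but above f_cutoff replace the phase
    with its value at the cutoff bin (a simple stand-in for smooth phase fixing).

    H     : complex array of shape (n_freqs, 2, n_directions)
    freqs : frequency axis in Hz
    """
    H_mod = H.copy()
    idx = np.searchsorted(freqs, f_cutoff)
    frozen_phase = np.angle(H[idx])                      # phase at the cutoff frequency
    mag = np.abs(H[idx:])
    H_mod[idx:] = mag * np.exp(1j * frozen_phase)        # broadcast over higher bins
    return H_mod

def binaural_hoa_decoder(H_mod, Y):
    """Binaural decoding matrix for HOA signals, F = H_mod pinv(Y), per frequency bin.

    Y : (n_harmonics, n_directions) spherical-harmonic matrix of the HRTF grid.
    """
    return H_mod @ np.linalg.pinv(Y)                     # shape (n_freqs, 2, n_harmonics)
```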
4.2.2. Manipulation of the covariance matrix
There are many methods to simulate room impulse
responses and also many methods to decorrelate audio
signals and it is not uncommon to apply these for binaural
rendering of spatial audio. The crux of the issue is that the
covariance of the signals at the ears determines many
binaural perceptual properties and so it is best to preserve
or control the covariance structure of the binaural signals.
These matters are well described in [47]. A brief
description is provided here. Assume there is a decoding
method to obtain binaural signals, p_Q, from a set of input signals: p_Q = Qx, where Q is some decoding matrix and x is a vector of input signals, and that somehow one knows the desired covariance matrix C_p = E[p p^H] for some true or desired p. In general, C_{p_Q} ≠ C_p, so what can one do? It turns out there is a method to obtain new signals p̂ = Mx + M_D D[Qx], where D[·] is a decorrelation operator and the matrices M and M_D are to be determined, such that C_p̂ = C_p. Unfortunately, these computations are sufficiently complicated that one is referred to [47] for the details.
The important point is that there are methods to control the
covariance of the signals at the two ears.
4.3. Improving Static Binaural Spatial Audio
One of the more interesting research directions for
binaural spatial audio is to enable static binaural recordings
to be rendered with head-tracking in an attempt to create
an immersive experience. This would be relevant to the
broadcasting industry which likely has many historical,
static binaural recordings. As well, this would be interesting for personal binaural recordings where listeners record
a sound scene using earphones capable of making recordings, such as the Roland CS-10EM (available in 2018).
This technique requires the extraction of acoustic sources
and source directions from the static binaural recording
and the creation of a new binaural rendering with head-tracking. This could likely be done using personal mobile
devices. An interesting reference to work in this area is
[48]. As well, refer to [49,50] for methods to decompose
audio into direct and ambient sound as well as determine
the direction of acoustic sources.
REFERENCES
[1] W. Zhang, P. N. Samarasinghe, H. Chen and T. D.
Abhayapala, ‘‘Surround by sound: A review of spatial audio
recording and reproduction,’’ Appl. Sci., 7, 1–19 (2017).
[2] V. Pulkki, ‘‘Virtual sound source positioning using vector
base amplitude panning,’’ J. Audio Eng. Soc., 45, 456–466
(1997).
[3] J.-M. Pernaux, P. Boussard and J.-M. Jot, ‘‘Virtual sound
source positioning and mixing in 5.1 implementation on the
real-time system genesis,’’ Proc. Conf. Digital Audio Effects
(DAFx-98), Barcelona, November, pp. 76–80 (1998).
[4] J. Daniel, Représentation de Champs Acoustiques, Application
à la Transmission et à la Reproduction de Scènes Sonores
Complexes dans un Contexte Multimédia, Ph.D. thesis, Université Paris 6 (2000).
[5] J. W. Strutt, ‘‘On our perception of sound direction,’’ Philos.
Mag., 13, 214–232 (1907).
[6] W. M. Hartmann, B. Rakerd and Z. D. Crawford, ‘‘Transaural
experiments and a revised duplex theory for the localization of
low-frequency tones,’’ J. Acoust. Soc. Am., 139, 968–985
(2016).
[7] M. Frank, Phantom Sources using Multiple Loudspeakers in
the Horizontal Plane, Ph.D. thesis, University of Music and
Performing Arts, Graz (2013).
[8] N. Epain, C. T. Jin and F. Zotter, ‘‘Ambisonic decoding with
constant angular spread,’’ Acta Acust. united Ac., 100, 928–936
(2014).
[9] V. Pulkki, ‘‘Spatial sound reproduction with directional audio
coding,’’ J. Audio Eng. Soc., 55, 503–516 (2007).
[10] C. T. Jin, N. Epain and T. Noohi, ‘‘Sound field analysis using
sparse recovery,’’ in Parametric Time-Frequency Domain
Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis,
Eds. (John Wiley & Sons, Hoboken, New York, 2018), Chap. 3.
[11] V. Pulkki, A. Politis, M.-V. Laitinen, J. Vilkamo and J.
Ahonen, ‘‘First-order directional audio coding (DirAC),’’ in
Parametric Time-Frequency Domain Spatial Audio, V. Pulkki,
S. Delikaris-Manias and A. Politis, Eds. (John Wiley & Sons,
Hoboken, New York, 2018), Chap. 5.
[12] A. Politis and V. Pulkki, ‘‘Higher-order directional audio
coding,’’ in Parametric Time-Frequency Domain Spatial
Audio, V. Pulkki, S. Delikaris-Manias and A. Politis, Eds.
(John Wiley & Sons, Hoboken, New York, 2018), Chap. 6.
[13] E. G. Williams, Fourier Acoustics: Sound Radiation and
Nearfield Acoustic Holography (Academic Press, London,
1999).
[14] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods
for the Helmholtz Equation in Three Dimensions (Elsevier,
Amsterdam, 2005).
[15] J. Daniel, ‘‘Spatial sound encoding including near field effect:
Introducing distance coding filters and a viable, new ambisonic
format,’’ Proc. 23rd Int. Conf. Audio Eng. Soc.: Signal
Processing in Audio Recording and Reproduction, May
(2003).
[16] M. A. Poletti, ‘‘Three-dimensional surround sound systems
based on spherical harmonics,’’ J. Audio Eng. Soc., 53, 1004–
1025 (2005).
[17] J. Ahrens and S. Spors, ‘‘An analytical approach to sound field
reproduction using circular and spherical loudspeaker distributions,’’ Acta Acust. united Ac., 94, 988–999 (2008).
[18] S. Spors and F. Zotter, ‘‘Spatial sound synthesis with loudspeakers,’’ in Cutting Edge in Spatial Audio, F. Zotter, Ed.
(EAA Documenta Acustica, 2013).
[19] A. J. Berkhout, ‘‘A holographic approach to acoustic control,’’
J. Audio. Eng. Soc., 36, 977–995 (1988).
[20] A. J. Berkhout, D. de Vries and P. Vogel, ‘‘Acoustic control by
wave field synthesis,’’ J. Acoust. Soc. Am., 93, 2764–2778
(1993).
[21] S. Spors, H. Teutsch and R. Rabenstein, ‘‘High-quality acoustic
rendering with wave field synthesis,’’ Proc. Vision, Modeling,
and Visualization Conf. 2002, November (2002).
[22] J. Ahrens, Analytic Methods of Sound Field Synthesis (Springer-Verlag, Berlin, 2012).
[23] F. M. Fazi, Sound Field Reproduction, Ph.D. thesis, University
of Southampton (2010).
[24] S. Enomoto, Y. Ikeda, S. Ise and S. Nakamura, ‘‘Threedimensional sound field reproduction and recording systems
based on boundary surface control principle,’’ Proc. 14th Int.
Conf. Auditory Display, Paris, June, pp. 1–8 (2008).
[25] S. J. Elliott, J. Cheer, J.-W. Choi and Y. Kim, ‘‘Robustness and
regularization of personal audio systems,’’ IEEE Trans. Audio
Speech Lang. Process., 20, 2123–2133 (2012).
[26] A. Wabnitz, N. Epain, C. T. Jin and A. van Schaik, ‘‘Room
acoustics simulation for multichannel microphone arrays,’’
Proc. AES 127th Conv., Sydney, August, pp. 1–6 (2010).
[27] N. Epain, C. T. Jin and A. van Schaik, ‘‘The application of
compressive sampling to the analysis and synthesis of spatial
sound fields,’’ Proc. AES 127th Conv., New York, October,
pp. 1–8 (2008).
[28] P. M. Juhl, The Boundary Element Method for Sound Field
Calculations, Ph.D. thesis, Technical University of Denmark
(1993).
[29] S. Gálvez, F. Marcos, T. Takeuchi and F. M. Fazi, ‘‘Lowcomplexity, listener’s position-adaptive binaural reproduction
over a loudspeaker array,’’ Acta Acust. united Ac., 103, 847–
857 (2017).
[30] J. Meyer and G. Elko, ‘‘A highly scalable spherical microphone
array based on an orthonormal decomposition of the soundfield,’’ Proc. ICASSP 2002, Vol. 2, pp. 1781–1784 (2002).
[31] T. D. Abhayapala and D. B. Ward, ‘‘Theory and design of high
order sound field microphones using spherical microphone
array,’’ Proc. ICASSP 2002, Vol. 2, pp. 1949–1952 (2002).
[32] B. Rafaely, ‘‘Analysis and design of spherical microphone
arrays,’’ IEEE Trans. Speech Audio Process., 13, 135–143
(2005).
[33] D. B. Ward and T. D. Abhayapala, ‘‘Reproduction of a planewave sound field using an array of loudspeakers,’’ IEEE Trans.
Speech Audio Process., 9, 697–707 (2001).
[34] A. Wabnitz, N. Epain and C. T. Jin, ‘‘A frequency-domain
algorithm to upscale ambisonic sound scenes,’’ Proc. ICASSP
2012, Kyoto, Japan, March (2012).
[35] S. Bertet, J. Daniel, E. Parizet and O. Warusfel, ‘‘Investigation
on localisation accuracy for first and higher order ambisonics
reproduced sound sources,’’ Acta Acust. united Ac., 99, 642–
657 (2013).
[36] H. H. Chen and S. C. Chan, ‘‘Adaptive beamforming and doa
estimation using uniform concentric spherical arrays with
frequency invariant characteristics,’’ J. VLSI Signal Process.,
No. 46, pp. 15–34 (2007).
[37] B. Rafaely, Y. Peled, M. Agmon, D. Khaykin and E. Fisher,
‘‘Spherical microphone array beamforming,’’ in Speech Processing in Modern Communication: Challenges and Perspectives, I. Cohen, J. Benesty and S. Gannot, Eds. (Springer,
Berlin, Heidelberg, 2010).
[38] H. Sun, H. Teutsch, E. Mabande and W. Kellermann, ‘‘Robust
localization of multiple sources in reverberant environments
using EB-ESPRIT with spherical microphone arrays,’’ Proc.
ICASSP 2011, Prague, Czech Republic, May (2011).
[39] N. Epain and C. T. Jin, ‘‘Independent component analysis
using spherical microphone arrays,’’ Acta Acust. united Ac., 98,
91–102 (2012).
[40] B. Rafaely, Fundamentals of Spherical Array Processing
(Springer-Verlag, Berlin, 2015).
[41] D. P. Jarrett, E. A. P. Habets and P. A. Naylor, Theory and
Applications of Spherical Microphone Array Processing
(Springer International Publishing, Cham, 2017).
[42] C. T. Jin, N. Epain and A. Parthy, ‘‘Design, optimization and
evaluation of a dual-radius spherical microphone array,’’ IEEE/
ACM Trans. Audio Speech Lang. Process., 22, 193–204
(2014).
[43] C. T. Jin, R. Zolfaghari, X. Long, A. Sebastian, S. Hossain,
Glaunés, A. Tew, M. Shahnawaz and A. Sarti, ‘‘Considerations
regarding individualization of head-related transfer functions,’’
Proc. ICASSP 2018, Calgary, Canada, April, pp. 6787–6791
(2018).
[44] H. Moller, ‘‘Fundamentals of binaural technology,’’ Appl.
Acoust., 36, 171–218 (1992).
[45] C. T. Jin, A. Corderoy, S. Carlile and A. van Schaik,
‘‘Contrasting monaural and interaural spectral cues for human
sound localization,’’ J. Acoust. Soc. Am., 115, 3124–3141
(2004).
[46] ITU-R BS.1534-1:2003, Method for the subjective assessment
of intermediate quality level of coding systems, ITU-R (2003).
[47] J. Vilkamo and T. Bäckström, ‘‘Time-frequency processing:
Methods and tools,’’ in Parametric Time-Frequency Domain
Spatial Audio, V. Pulkki, S. Delikaris-Manias and A. Politis,
Eds. (John Wiley & Sons, Hoboken, New York, 2018).
[48] S. Nagel and P. Jax, ‘‘Dynamic binaural cue adaptation,’’ 2018
Int. Workshop on Acoustic Signals Enhancement, Tokyo,
Japan, September (2018).
[49] C. Faller, ‘‘Multi-loudspeaker playback of stereo signals,’’
J. Audio Eng. Soc., 54, 1051–1064 (2006).
[50] C. T. Jin and N. Epain, ‘‘Super-resolution sound field analysis,’’ in Cutting Edge in Spatial Audio, F. Zotter, Ed. (EAA
Documenta Acustica, 2013).
Craig Jin received a B.S. in Physics from Stanford in 1990, an
M.S. in Applied Physics from Caltech, and a Ph.D. from the
University of Sydney in 2001. He is currently an Associate Professor
within the School of Electrical and Information Engineering at the
University of Sydney and is the director of the Computing and Audio
Research Laboratory.