Capturing 360° Audio Using an Equal Segment Microphone Array (ESMA)
H. Lee, "Capturing 360° Audio Using an Equal Segment Microphone Array (ESMA)," J. Audio Eng. Soc., vol. 67, no. 1/2, pp. 13–26 (2019 January/February). DOI: https://doi.org/10.17743/jaes.2018.0068
The SRA refers to the horizontal span of the sound field in front of the microphone array that will be reproduced in full width between two loudspeakers [6].

1.1 Williams Curves

Williams [5] recommends a microphone spacing of 24 cm for the quadraphonic cardioid ESMA. This is estimated based on the so-called "Williams curves" [8], which are a collection of curves that indicate possible combinations of microphone spacings and subtended angles to achieve specific SRAs. They are based on an ICTD and ICLD trade-off relationship derived from polynomial interpolations of the ICTD and ICLD values required for 10°, 20°, and 30° image shifts, which were obtained from a listening test in the standard 60° loudspeaker setup. Williams [8] claims that the SRA is virtually independent of the loudspeaker base angle, suggesting that the same ICTD and ICLD trade-off model obtained for the 60° loudspeaker setup can also be applied to the 90° setup. From this, he proposes that 24 cm between each microphone in the quadraphonic cardioid ESMA can produce the desired SRA of 90° for each stereophonic segment. Note that the ICTD and ICLD produced by a near-coincident microphone configuration vary slightly depending on the distance between the sound source and the microphone array, and so does the SRA of the array. However, it is not stated in [8] what source-array distance the Williams curves were based on.

1.2 Image Assistant

In contrast with the Williams curves, the psychoacoustic model used in the "Image Assistant" tool [9] assumes a linear trade-off between ICTD and ICLD within the 75% image shift region (e.g., 0° to 22.5° for the 60° loudspeaker setup). It also allows the user to choose a specific source-array distance for the SRA estimation. The amount of total image shift within this region is estimated by simply adding the image shifts that individually result from the ICTD and ICLD (13%/0.1 ms and 7.5%/dB, respectively), which is a method proposed by Theile [11]. Outside the linear region, where the image shift pattern tends to become logarithmic for both ICTD and ICLD, an approximate function is applied to derive a non-linear ICTD and ICLD trade-off relationship [12]. The tool suggests that, at a 2 m distance between the source and the center of the array, which was used in the experiment of the present study, 24 cm is the correct microphone spacing to produce the required SRA of 90°. The ICTD and ICLD shift factors used in the Image Assistant were obtained for the standard 60° loudspeaker setup. However, as in Williams's assumption that the SRA is conserved regardless of the loudspeaker base angle, Theile [13] also claims that the same ICTD and ICLD image shift factors can be used for an arbitrary loudspeaker base angle, which is here referred to as the constant relative shift theory. Based on this, the microphone spacing of 24 cm is assumed to be still valid for the loudspeaker base angle of 90° in the quadraphonic reproduction setup.

1.3 MARRS

The 30 cm and 50 cm spacings are based on SRA estimations using the present author's microphone array simulation tool "MARRS (Microphone Array Recording and Reproduction Simulator)" [10]. The psychoacoustic model used for MARRS relies on an ICTD and ICLD trade-off model derived from the region-adaptive ICTD and ICLD image shift factors for the 60° loudspeaker setup presented in Table 1; they were defined based on subjective localization test data obtained using natural sound sources [14].

Table 1. ICTD and ICLD shift factors for the 60° and 90° loudspeaker setups suggested by the MARRS psychoacoustic model [10].

Speaker base angle | Image shift region | ICTD shift factor | ICLD shift factor
60° | 0–66.7% | 13.3%/0.1 ms | 7.8%/dB
60° | 66.7–100% | 6.7%/0.1 ms | 3.9%/dB
90° | 0–66.7% | 8.86%/0.1 ms | 6%/dB
90° | 66.7–100% | 4.43%/0.1 ms | 3%/dB

If Theile's constant relative shift theory described above were applied here (i.e., using the data obtained for the 60° loudspeaker setup for the 90° setup), the correct spacing for each segment of the quadraphonic cardioid ESMA to achieve the 90° SRA at a 2 m source-array distance would be 30 cm. However, the author's previous research on amplitude panning [15] suggests that ICLD shift factors must vary depending on the loudspeaker base angle in order to achieve accurate phantom image localization; a larger base angle requires a larger ICLD for a given proportion of image shift. An informal listening test confirmed that this was also the case with the ICTD. Therefore, the MARRS model [10] scales the original ICTD and ICLD shift factors depending on the loudspeaker base angle. For example, for the 90° loudspeaker setup, the original ICLD shift factor is scaled by 0.77, which is the ratio of the interaural level difference (ILD) above 1 kHz produced at 30° (the loudspeaker azimuth in the original 60° setup, which serves as the reference) to that at 45° (the loudspeaker azimuth of the 90° setup). Similarly, the ICTD shift factor is multiplied by the ratio of the interaural time differences (ITDs) below 1 kHz between 30° and 45°, which is 0.67. This scaling process results in shift factors optimized for the 90° loudspeaker setup, which are presented in Table 1. Based on these, the correct spacing between adjacent microphones for the quadraphonic cardioid ESMA is estimated to be 50 cm. Note that this spacing is calculated for a source-array distance of 2 m. However, the difference for a larger distance in a practical recording situation is very small, e.g., a 50.4 cm spacing for a 5 m source distance for the cardioid ESMA. In addition, the size of a quadraphonic ESMA could be made smaller if microphones with higher directionality were used, e.g., 40 cm for supercardioids at a 2 m source distance. Readers who are interested in more details about the algorithm used in MARRS are referred to the open-access Matlab source code package (footnote 3). MARRS is also available as a free mobile app from the Apple and Google app stores.

Footnote 2: Preliminary results from this work were presented at the AES International Conference on Audio for Virtual and Augmented Reality in 2016 [7].
Footnote 3: https://github.com/APL-Huddersfield/MARRS

2 EXPERIMENTAL DESIGN

Two subjective experiments were carried out. Experiment 1 evaluated the localization accuracies of the four microphone arrays with different spacings in a quadraphonic loudspeaker reproduction. Experiment 2 repeated the same […]
Fig. 3. Examples of sound field rotation applied to stimuli created for sound sources at 0◦ and 45◦ ; each sound field rotation simulating
the equivalent head rotation.
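The base-angle scaling described in Sec. 1.3 can be illustrated with a short Python sketch. This is not the published MARRS code (see the open-access source package linked in footnote 3); the function names are invented for illustration, and the 0.67/0.77 cue ratios are those quoted in the text:

```python
# Sketch only: re-deriving Table 1's 90-degree shift factors from the
# 60-degree factors, using the interaural cue ratios given in Sec. 1.3.

ICTD_FACTORS_60 = [13.3, 6.7]  # %/0.1 ms, for the 0-66.7% and 66.7-100% regions
ICLD_FACTORS_60 = [7.8, 3.9]   # %/dB, for the same two regions
ITD_RATIO = 0.67               # ITD(30 deg)/ITD(45 deg) below 1 kHz
ILD_RATIO = 0.77               # ILD(30 deg)/ILD(45 deg) above 1 kHz

def scale_factors(factors_60, ratio):
    """Scale the 60-degree shift factors to a wider base angle by a cue ratio."""
    return [f * ratio for f in factors_60]

def image_shift_percent(ictd_ms, icld_db, ictd_factor, icld_factor):
    """Theile-style additive image shift (valid in the linear region only):
    the shifts produced individually by the ICTD and ICLD are summed."""
    return ictd_factor * (ictd_ms / 0.1) + icld_factor * icld_db

ictd_90 = scale_factors(ICTD_FACTORS_60, ITD_RATIO)  # ~[8.91, 4.49]
icld_90 = scale_factors(ICLD_FACTORS_60, ILD_RATIO)  # ~[6.01, 3.00]
```

Multiplying the 60° factors by these ratios reproduces the 90° rows of Table 1 to within rounding (8.91 ≈ 8.86 %/0.1 ms, 4.49 ≈ 4.43 %/0.1 ms, 6.01 ≈ 6 %/dB, 3.00 ≈ 3 %/dB).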
Table 2. Target image position for each sound field rotation for each source position.

Source position | Sound field rotation | Equivalent head rotation | Target image position
0° | 0° | 0° | 0°
0° | 45° | –45° | 45°
0° | 90° | –90° | 90°
0° | 135° | –135° | 135°
0° | 180° | –180° | 180°
0° | 225° | –225° | 225°
0° | 270° | –270° | 270°
0° | 315° | –315° | 315°

[…] loudspeaker positions, which were considered to have the highest localization accuracy among all stimuli.

The playback levels of all stimuli were calibrated to 70 dB LAeq at the listening position. Each trial in the test contained a single stimulus, and the subjects could listen to it repeatedly until they had judged its perceived position. All stimuli were presented in a randomized order. For the sound-field-rotated stimuli, one of the mirrored target image positions (e.g., 315° or 45°) was randomly selected for each listener for each microphone array condition. This was to minimize psychological order effects as well as to avoid the potential listening fatigue that might occur when the sound is presented only from the left- or right-hand side. Every subject judged each test condition twice in a randomized order.

Table 3. Summary of the results for phantom source localization in loudspeaker reproduction (Experiment 1): median perceived angles for each experimental condition. Conditions with a significant difference from the target position (Wilcoxon signed-rank test): * p < .05; ** p < .01. Conditions with a significant bimodal distribution (Hartigan's dip test): ∧ p < .05; ∧∧ p < .01.

Source angle (degree) | Mic spacing (cm) | Target azimuth after sound field rotation (degree): 0 | 45 | 90 | 135 | 180
0 | 50 | 0 | 41 | ∧ | 135 | 180
0 | 30 | 0 | 40 | 67* | 134 | 180
0 | 24 | 0 | 34 | 68 | 135 | 180
0 | 0 | 0 | 24* | 45* | 134 | 180
45 | 50 | 0 | 45 | 90 | 135 | 180
45 | 30 | 0 | 44* | 90 | 135 | ∧
45 | 24 | 0 | 39** | 90 | 135 | 180
45 | 0 | 0 | 30** | ∧∧ | 152 | ∧

2.4.2 Experiment 2

[…] localization responses were obtained for each test condition. They comprised staff researchers, postgraduate research students, and final-year undergraduate students of the Applied Psychoacoustics Lab at the University of Huddersfield, with their ages ranging from 21 to 38. All of them reported normal hearing and had extensive experience in conducting sound localization tasks in formal listening tests. All subjects completed the loudspeaker test (Experiment 1), and at least one week later they sat the binaural test (Experiment 2). They did not know the nature of the test stimuli until they had completed both experiments.
The listening test was conducted in the same room as Experiment 1. The test procedure was identical to that of Experiment 1, apart from the following. The headphones used for the test were Sennheiser HD650. To equalize them, their impulse responses were measured five times using the KU100 dummy head, with the headphones re-seated on the head each time. The averaged responses were then inverse-filtered using a regularization method by Kirkeby et al. [19]. Prior to the actual test, the subjects were presented with familiarization trials comprising the binaural recordings of the real sources for the eight loudspeaker positions. The loudness unit level of all binaural stimuli was calibrated at –18 LUFS, and the headphone playback level was determined by the present author to match the perceived loudness of the loudspeaker playback from Experiment 1 (70 dB LAeq). No head tracking was used for rendering different image positions in binaural reproduction; the sound field was rotated instead, as described in Sec. 2.2.3.

3 RESULTS

As mentioned earlier, the stimuli with the mirrored target image positions were randomly selected for each listener in the listening tests. For the purposes of the statistical analysis and data plotting, the perceived angles for the stimuli with target angles in the left-hand side of the circle were converted into the corresponding angles in the right-hand side (e.g., 315° to 45°, 270° to 90°). For the continuity of data in the analysis, any responses for the 0° target angle that were given in the left-hand side of the circle were converted into negative values (e.g., 355° to –5°), whereas those for the 180° target angle in the left side were unchanged.

Shapiro-Wilk and Levene's tests were first performed to examine the normality and variance of the data collected. The results suggested that the data were not suitable for parametric statistical testing. Therefore, the non-parametric Wilcoxon signed-rank test was conducted to examine whether there was a significant difference between the target and perceived image positions for each test condition, except for those that had a significant bimodal distribution. The significance of bimodality was examined using Hartigan's dip test [20].

3.1 Phantom Source Localization in Loudspeaker Reproduction

Fig. 4 shows the bubble plots of the data for the phantom source conditions (i.e., microphone array recordings) from Experiment 1. Table 3 presents the summary of the statistical analyses.

3.1.1 Sound Source at 0°

The results for the 0° source position are presented first. From the scatterplots in Fig. 4, it appears that all microphone spacings produced a relatively accurate localization when the target angle was 0°; there is no front-back confusion. For the 45° target angle (45° simulated head rotation), the 0 cm condition had a median perceived angle (MED) of 24°, which was significantly smaller than the target (p = 0.027), whereas the differences of the 50 cm, 30 cm, and 24 cm spacings from the target were not significant (p > 0.05). Looking at the 90° target angle (90° simulated head rotation), the responses for the 0° source appear to have wide spreads in general. The 50 cm spacing had a significant bimodal distribution (p = 0.022). The MEDs for the 30 cm and 24 cm spacings were considerably smaller than the target angle (67°–68°). The 0 cm spacing had the largest deviation from the target angle among all spacings (MED = 45°, p = 0.015). For both the 135° and 180° target angles, the MEDs for all spacings did not differ significantly from the target angles (p > 0.05). However, the responses for the 135° target angle tended to be widely spread between the front and rear regions.

3.1.2 Sound Source at 45°

For the 0° target angle (315° sound field rotation), all conditions had no significant difference between the perceived and target angles (p > 0.05). For the 45° target angle (no sound field rotation), the MED was closer to the target angle in the order of 50 cm (45°), 30 cm (44°), 24 cm (39°),
Fig. 4. Bubble plots of the data obtained from the loudspeaker localization test (Experiment 1). The diameter of each circle represents
the percentage of responses for each condition.
and 0 cm (30°). Apart from the 50 cm spacing, the MEDs were all found to deviate significantly from the target (p = 0.047 for 30 cm, p = 0.000 for 24 cm and 0 cm). For the 90° target angle, the 50 cm, 30 cm, and 24 cm spacings did not have a significant difference between the perceived and target angles (MED = 90°, p > 0.05), whereas the 0 cm spacing produced a significant bimodal distribution between around 45° and 135° (p = 0.002). Looking at the target angle of 135°, the MEDs for the 50 cm, 30 cm, and 24 cm spacings were the same as the target, whereas that for the 0 cm spacing (152°) was noticeably closer to the median plane, although this was not statistically significant (p > 0.05). For the 180° target angle, 50 cm and 24 cm were found to produce an accurate result (MED = 180°, p > 0.05), whereas the responses for 30 cm and 0 cm had a significant bimodality (p = 0.036 and 0.01, respectively).

Table 4. Summary of the results for phantom source localization in binaural reproduction (Experiment 2): median perceived angles for each experimental condition. Conditions with a significant difference from the target position (Wilcoxon signed-rank test): * p < .05; ** p < .01. Conditions with a significant bimodal distribution (Hartigan's dip test): ∧ p < .05; ∧∧ p < .01.

Source angle (degree) | Mic spacing (cm) | Target azimuth after sound field rotation (degree): 0 | 45 | 90 | 135 | 180
0 | 50 | ∧∧ | 42 | 100 | ∧ | 180
0 | 30 | ∧∧ | 35 | 62 | ∧ | 180*
0 | 24 | ∧∧ | 39 | ∧ | ∧∧ | 180
0 | 0 | ∧∧ | 39 | 69 | ∧∧ | 180
45 | 50 | ∧∧ | 47 | 90 | 135 | 180
45 | 30 | ∧∧ | 50* | 90* | 129** | ∧∧
45 | 24 | ∧ | 47 | 90 | ∧ | ∧
45 | 0 | ∧∧ | 27* | ∧∧ | ∧∧ | ∧∧
Fig. 5. Bubble plots of the data obtained from the binaural localization test (Experiment 2). The diameter of each circle represents the
percentage of responses for each condition.
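The left/right mirroring of responses described at the start of Sec. 3 can be sketched as a small helper; `fold_response` is a hypothetical name for illustration, not a function from the paper:

```python
def fold_response(perceived_deg, target_deg):
    """Map a perceived azimuth on the left-hand side of the circle to its
    right-hand mirror (e.g., 315 -> 45, 270 -> 90). For a 0-degree target,
    left-side responses become negative for continuity (355 -> -5); for a
    180-degree target, left-side responses are left unchanged."""
    a = perceived_deg % 360.0
    t = target_deg % 360.0
    if t == 0.0:
        return a - 360.0 if a > 180.0 else a
    if t == 180.0:
        return a
    return 360.0 - a if a > 180.0 else a
```

The folded angles can then be compared against the targets directly, matching the conversions quoted in the text.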
[…] angle, again the 50 cm spacing produced the most accurate result. The MEDs for 30 cm and 0 cm (62° and 69°, respectively) were considerably narrower than the target, while the responses for 24 cm were significantly bimodal (p < 0.05). All conditions for the target angle of 135° were found to have a significant bimodal distribution between around 45° and 135° (p < 0.05 for 50 cm and 30 cm, p < 0.01 for 24 cm and 0 cm). For the 180° target angle, only the 30 cm condition was found to be significantly different from the target (p < 0.05).

3.2.2 Sound Source at 45°

For the 45° source position, the responses for the target angle of 0° were found to be significantly bimodal regardless of the microphone spacing (i.e., front-to-back confusion). For the 45° target angle, the 50 cm and 24 cm spacings both produced an MED of 47°, which was not significantly different from the target (p > 0.05). However, the 30 cm and 0 cm spacings had significant differences between the target and perceived angles (MEDs = 50° and 27°, respectively, p < 0.05). The results for the 90° target angle show that the 50 cm, 30 cm, and 24 cm spacings all had median perceived angles of 90°, whereas the 0 cm condition had a significant bimodal distribution (p < 0.01) between around 45° and 135°. For the 135° target angle, 50 cm was the only spacing that produced an accurate result (MED = 135°, p > 0.05). The MED for 30 cm (129°) was significantly different from the target (p = 0.007), while 24 cm and 0 cm had a significant bimodal distribution (p = 0.04 for 24 cm and 0.000 for 0 cm). Last, for the target angle of 180°, the 50 cm spacing produced an accurate result (MED = 180°, p > 0.05), whereas the other spacings all had a significant bimodality.

3.3 Real Source Localization in Loudspeaker and Binaural Reproductions

Fig. 6 presents the responses given to the real source stimuli (i.e., single loudspeaker conditions) in both the loudspeaker and binaural experiments. Wilcoxon tests suggest that, for the loudspeaker results, there was no significant difference between the perceived and target angles for any stimulus (p > 0.05). For the binaural conditions, on the other hand, it was found that the responses for the 0° and 180° sources were significantly bimodal, exhibiting front-back confusion. Furthermore, the 45° source (MED = 52°) was found to be perceived at a significantly wider position than the target (p < 0.01).

4 DISCUSSIONS

This section discusses various aspects of the subjective results described above. Measurements of interaural time and level differences are provided to help explain the subjective results. Higher-order and 3D versions of the ESMA are also introduced.
Fig. 7. Differences of each ESMA stimulus from the real source in interaural time difference (ITD) and interaural level difference (ILD) for each experimental condition; averages of results obtained for 50 ms overlapping windows for each of the 42 ERB critical bands.
[…] with head tracking, such an issue may be resolved even if non-individualized HRTFs are used for the binaural rendering of the ESMA, which requires further investigation.

4.4 Analyses of Interaural Time and Level Differences

To gain further insights into potential reasons for the subjective results, the ITDs and ILDs of all of the binaural stimuli with off-center target angles (45°, 90°, and 135°) were estimated and compared. The 0° and 180° target angles were excluded, since at those angles there is no ITD and the ILD exists only at very high frequencies due to ear asymmetry. The binaural model used for the analyses is described as follows. Each binaural stimulus was first split into 42 frequency bands through a Gammatone "equivalent rectangular bandwidth (ERB)" filter bank [25] that mimics the critical bands of the inner ear. To emulate the breakdown of the phase-locking mechanism in the ear signals, half-wave rectification and first-order low-pass filtering at 1 kHz were applied to each band, as in [26, 27]. Time-varying ITDs and ILDs for each band were computed for 50%-overlapping 50 ms frames with a Hanning window. The ITD was defined as the lag of the maximum of the normalized interaural cross-correlation function (i.e., the lag ranging between –1 ms and 1 ms). The ILDs were computed as the energy ratio between the left and right signals. The ITDs obtained for all of the frames were averaged for each band, as were the ILDs. The results are presented in Fig. 7 as the ITD and ILD differences of each microphone array stimulus from the real source stimulus with the corresponding target angle (i.e., the single-source dummy head recordings). Therefore, the closer the difference is to the 0 reference, the more accurate the ITD or ILD produced by the microphone array.

Looking at the plots for the 45° source with a 0° rotation (45° target angle), the 50 cm spacing produced slightly greater ITDs than the dummy head reference across all bands, while it produced slightly lower ILDs consistently above about 200 Hz. It was shown in the subjective results that the 50 cm spacing produced a highly accurate localization for this test condition. Based on the literature [28, 29], this subjective result seems to be due to a trade-off between the effects of the ITDs and ILDs on localization. That is, a wider image position due to the ITD being greater than the reference and a narrower image position due to the ILD being smaller than the reference might have been spatially averaged. Especially between about 700 Hz and 4 kHz, which Griesinger [30] claims to be the most important frequency region for determining the perceived position of a broadband phantom image, the average ITD and ILD differences from the reference for this condition are 0.1 ms and –0.75 dB, respectively. This gives a ratio of 0.13 ms/dB, which lies within the range of ITD/ILD trading ratios (see footnote 4) found in the literature (i.e., 0.04–0.2 ms/dB [26]). This suggests that the degree of the positive image shift from the target position by the ITD cue and that of the negative shift by the ILD cue would have been similar, thus resulting in spatial averaging around the target position. On the other hand, for all the other spacing conditions for the 45° source with a 0° rotation, the "center of gravity" between the ITD and ILD images (as described in [29]) seems to be at a narrower position than the target. For example, for the 24 cm ESMA, the average ITD difference from the reference between 700 Hz and 4 kHz was only –0.02 ms, whereas the average ILD difference was –1.7 dB. This would have caused a considerable deviation from the target towards a narrower position, mainly due to the ILD cue. It is also interesting to observe that the 0 cm condition, which had the worst subjective result, showed the opposite trend to the 24 cm condition; the average ILD difference was only –0.15 dB, whereas the ITD difference was considerably large (–0.18 ms).

Footnote 4: The ITD/ILD trading ratio refers to the equivalence between interaural time and level differences measured in terms of the magnitude of perceived image shift [29].
Fig. 9. Examples of the vertical extension of the quadraphonic ESMA for 3D sound capture (namely, ESMA-3D): (a) four vertical
mid-side pairs of cardioid and fig-8 microphones; (b) four vertical coincident pairs of cardioid microphones.
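The tilt geometry quoted in the surrounding ESMA-3D discussion can be verified quickly: tilting the microphones down by arctan(1/√2) ≈ 35.3° points them along the space diagonals of a cube, and the angle between two such diagonals is arccos(1/3) ≈ 70.5°. A minimal check:

```python
import math

# Downward tilt that points each microphone along a cube's space diagonal.
tilt_deg = math.degrees(math.atan(1.0 / math.sqrt(2.0)))   # ~35.26 degrees

# Subtended angle between adjacent microphone axes, i.e., between the
# space diagonals (1, 1, -1) and (1, -1, -1): cos(theta) = 1/3.
subtended_deg = math.degrees(math.acos(1.0 / 3.0))         # ~70.53 degrees
```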
[…] narrower than 90°, thus requiring a slight increase in microphone spacing to maintain the 90° SRA for each segment. For example, if the microphones of a quadraphonic ESMA are tilted downwards at –35.3°, the subtended angle for each microphone pair from the base point becomes 70.5° (i.e., the angle between the diagonals of a cube). In this case, based on the MARRS model [10], the correct spacing between the main-layer microphones to produce the 90° SRA is 54 cm for cardioids and 48 cm for supercardioids.

[…]

(ii) With the sound field rotation of the quadraphonic ESMA, a sound source placed at a central position tends to produce a less stable localization than one at a position closer to the microphones' on-axis directions (e.g., ±45°);

(iii) The binaural rendering of the ESMA recording produces more bimodal response distributions (e.g., front-back confusion) than the loudspeaker reproduction; this may be resolved by allowing head rotations in head-tracked VR scenarios.
[3] E. Bates, M. Gorzel, L. Ferguson, H. O'Dwyer, and F. M. Boland, "Comparing Ambisonic Microphones: Part 1," presented at the 2016 AES International Conference on Sound Field Control (2016 Jul.), conference paper 6-3.
[4] M. Williams, "Microphone Arrays for Natural Multiphony," presented at the 91st Convention of the Audio Engineering Society (1991 Oct.), convention paper 3157.
[5] M. Williams, "Migration of 5.0 Multichannel Microphone Array Design to Higher Order MMAD (6.0, 7.0 & 8.0) With or Without the Inter-Format Compatibility Criteria," presented at the 124th Convention of the Audio Engineering Society (2008 May), convention paper 7480.
[6] M. Williams and G. Le Du, "Microphone Array Analysis for Multichannel Sound Recording," presented at the 107th Convention of the Audio Engineering Society (1999 Sep.), convention paper 4997.
[7] H. Lee, "Capturing and Rendering 360° VR Audio Using Cardioid Microphones," presented at the 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016 Sep.), conference paper 8-3.
[8] M. Williams, "Unified Theory of Microphone Systems for Stereophonic Sound Recording," presented at the 82nd Convention of the Audio Engineering Society (1987 Mar.), convention paper 2466.
[9] H. Wittek and G. Theile, "The Recording Angle—Based on Localization Curves," presented at the 112th Convention of the Audio Engineering Society (2002 May), convention paper 5568.
[10] H. Lee, D. Johnson, and M. Mironovs, "An Interactive and Intelligent Tool for Microphone Array Design," presented at the 143rd Convention of the Audio Engineering Society (2017 Oct.), e-Brief 390.
[11] G. Theile, "On the Performance of Two-Channel and Multichannel Stereophony," presented at the 88th Convention of the Audio Engineering Society (1990 Mar.), convention paper 2932.
[12] H. Wittek, Untersuchungen zur Richtungsabbildung mit L-C-R Hauptmikrofonen, Master's thesis, Institut für Rundfunktechnik (2000).
[13] G. Theile, "Natural 5.1 Music Recording Based on Psychoacoustic Principles," presented at the AES 19th International Conference: Surround Sound Techniques, Technology, and Perception (2001 Jun.), conference paper 1904.
[14] H. Lee and F. Rumsey, "Level and Time Panning of Phantom Images for Musical Sources," J. Audio Eng. Soc., vol. 61, pp. 753–767 (2013 Dec.).
[15] H. Lee, "Perceptually Motivated Amplitude Panning (PMAP) for Accurate Phantom Image Localization," presented at the 142nd Convention of the Audio Engineering Society (2017 May), convention paper 9770.
[16] A. Farina, "Advancements in Impulse Response Measurements by Sine Sweeps," presented at the 122nd Convention of the Audio Engineering Society (2007 May), convention paper 7121.
[17] V. Hansen and G. Munch, "Making Recordings for Simulation Tests in the Archimedes Project," J. Audio Eng. Soc., vol. 39, pp. 768–774 (1991 Oct.).
[18] G. Kearney and T. Doyle, "An HRTF Database for Virtual Loudspeaker Rendering," presented at the 139th Convention of the Audio Engineering Society (2015 Oct.), convention paper 9424.
[19] O. Kirkeby, P. A. Nelson, H. Hamada, and F. Orduña-Bustamante, "Fast Deconvolution of Multichannel Systems Using Regularization," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 189–195 (1998 Mar.). DOI: https://doi.org/10.1109/89.661479
[20] J. A. Hartigan and P. M. Hartigan, "The Dip Test of Unimodality," Ann. Stat., vol. 13, pp. 70–84 (1985). DOI: https://doi.org/10.1214/aos/1176346577
[21] E. Benjamin, R. Lee, and A. Heller, "Is My Decoder Ambisonic?" presented at the 125th Convention of the Audio Engineering Society (2008 Oct.), convention paper 7553.
[22] G. Theile and G. Plenge, "Localization of Lateral Phantom Images," J. Audio Eng. Soc., vol. 25, pp. 196–200 (1977 Apr.).
[23] G. Martin, W. Woszczyk, J. Corey, and R. Quesnel, "Sound Source Localization in a Five Channel Surround Sound Reproduction System," presented at the 107th Convention of the Audio Engineering Society (1999 Sep.), convention paper 4994.
[24] F. Wightman and D. Kistler, "Resolution of Front–Back Ambiguity in Spatial Hearing by Listener and Source Movement," J. Acoust. Soc. Am., vol. 105, no. 5, pp. 2841–2853 (1999 May). DOI: https://doi.org/10.1121/1.426899
[25] P. Søndergaard and P. Majdak, "The Auditory Modeling Toolbox," in J. Blauert (Ed.), The Technology of Binaural Listening (Springer, Berlin/Heidelberg, 2013). DOI: https://doi.org/10.1007/978-3-642-37762-4
[26] L. R. Bernstein and C. Trahiotis, "The Normalized Correlation: Accounting for Binaural Detection across Center Frequency," J. Acoust. Soc. Am., vol. 100, no. 5, pp. 3774–3784 (1996). DOI: https://doi.org/10.1121/1.417237
[27] V. Pulkki and M. Karjalainen, Communication Acoustics: An Introduction to Speech, Audio and Psychoacoustics (Wiley, 2015).
[28] R. H. Whitworth and L. A. Jeffress, "Time versus Intensity in the Localization of Tones," J. Acoust. Soc. Am., vol. 33, pp. 925–929 (1961). DOI: https://doi.org/10.1121/1.1908849
[29] J. Blauert, Spatial Hearing (MIT Press, Cambridge, MA, 1997).
[30] D. Griesinger, "Stereo and Surround Panning in Practice," presented at the 112th Convention of the Audio Engineering Society (2002 May), convention paper 5564.
[31] R. Wallis and H. Lee, "The Effect of Interchannel Time Difference on Localization in Vertical Stereophony," J. Audio Eng. Soc., vol. 63, pp. 767–776 (2015 Oct.). DOI: https://doi.org/10.17743/jaes.2015.0069
[32] J. L. Barbour, "Elevation Perception: Phantom Images in the Vertical Hemisphere," presented at the AES 24th International Conference: Multichannel Audio, The New Reality (2003 Jun.), conference paper 14.
[33] M. Mironovs and H. Lee, "The Influence of Source Spectrum and Loudspeaker Azimuth on Vertical Amplitude Panning," presented at the 142nd Convention of the Audio Engineering Society (2017 May), convention paper 9782.
[34] H. Lee and C. Gribben, "Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array," J. Audio Eng. Soc., vol. 62, pp. 870–884 (2014 Dec.). DOI: https://doi.org/10.17743/jaes.2014.0045
[35] P. Geluso, "Capturing Height: The Addition of Z Microphones to Stereo and Surround Microphone Arrays," presented at the 132nd Convention of the Audio Engineering Society (2012 Apr.), convention paper 8595.
[36] ITU-R, Recommendation ITU-R BS.2051-1: Advanced Sound System for Programme Production (2017).
[37] R. Wallis and H. Lee, "The Reduction of Vertical Interchannel Crosstalk: The Analysis of Localization Thresholds for Natural Sound Sources," Appl. Sci., vol. 7, p. 278 (2017). DOI: https://doi.org/10.3390/app7030278
THE AUTHOR
Hyunkook Lee
Hyunkook Lee is a Senior Lecturer (i.e., Associate Professor) for music technology courses at the University of Huddersfield, UK, where he founded and leads the Applied Psychoacoustics Laboratory (APL). He is also a sound engineer with 20 years of experience in surround recording, mixing, and live sound. Dr. Lee's recent research advanced understanding about the perceptual mechanisms of vertical stereophonic localization and image spread as well as the phantom image elevation effect. This helped develop new 3D microphone array techniques, vertical mixing/upmixing techniques, and a virtual 3D panning method. His ongoing research topics include 3D sound perception, capture and reproduction, virtual acoustics, and objective sound quality metrics. From 2006 to 2010, Hyunkook was a Senior Research Engineer in audio R&D at LG Electronics, South Korea, where he participated in the standardizations of MPEG audio codecs and developed spatial audio algorithms for mobile devices. He received his degree in music and sound recording (Tonmeister) from the University of Surrey, UK, in 2002 and obtained his Ph.D. in spatial audio psychoacoustics from the Institute of Sound Recording (IoSR) at the same university in 2006. Hyunkook has been an active member of the AES since 2001 and received the AES Fellowship award at the 145th Convention in 2018.