
PAPERS

H. Lee, “Capturing 360◦ Audio Using an Equal Segment Microphone Array (ESMA),”
J. Audio Eng. Soc., vol. 67, no. 1/2, pp. 13–26, (2019 January/February.).
DOI: https://doi.org/10.17743/jaes.2018.0068

Capturing 360◦ Audio Using an Equal Segment Microphone Array (ESMA)

HYUNKOOK LEE, AES Fellow
([email protected])

Applied Psychoacoustics Laboratory (APL), University of Huddersfield, Huddersfield, HD1 3DH, UK

The equal segment microphone array (ESMA) is a multichannel microphone technique


that attempts to capture a sound field in 360◦ without any overlap between the stereophonic
recording angle of each pair of adjacent microphones. This study investigated the optimal
microphone spacing for a quadraphonic ESMA using cardioid microphones. Recordings
of a speech source were made using the ESMAs with four different microphone spacings of
0 cm, 24 cm, 30 cm, and 50 cm, based on different psychoacoustic models for microphone array
design. Multichannel and binaural stimuli were created with the reproduced sound field rotated
in 45◦ intervals. Listening tests were conducted to examine the accuracy of phantom image
localization for each microphone spacing in both loudspeaker and binaural headphone repro-
ductions. The results generally indicated that the 50 cm spacing, which was derived from an
interchannel time and level trade-off model that is perceptually optimized for a 90◦ loudspeaker
base angle, produced more accurate localization results than the 24 cm and 30 cm ones, which
were based on conventional models derived from the standard 60◦ loudspeaker setup. The 0 cm
spacing produced the worst accuracy with the most frequent bimodal distributions of responses
between the front and back regions. Analyses of the interaural time and level differences of
the binaural stimuli supported the subjective results. In addition, two approaches for adding
the vertical dimension to the ESMA (ESMA-3D) were devised. Findings from this study are
considered to be useful for acoustic recording for virtual reality applications as well as for
multichannel surround sound.
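The reproduction approach summarized in the abstract — routing each of the four ESMA channels to a loudspeaker of a quadraphonic setup (or to its binaural equivalent via HRIR convolution), with sound field rotation in 45◦ steps realized by re-routing the channels — can be sketched as follows. This is a minimal illustration, not the code used in the study; the signal lengths, HRIR data, and function names are placeholder assumptions.

```python
import numpy as np

SPEAKER_AZIMUTHS = [45, 135, 225, 315]   # quadraphonic reproduction layout
ALL_AZIMUTHS = list(range(0, 360, 45))   # the eight positions used in the study

def rotate_channels(azimuths, rotation_deg):
    """Loudspeaker azimuth for each ESMA channel after rotating the
    reproduced sound field by rotation_deg (a multiple of 45 degrees)."""
    return [(az + rotation_deg) % 360 for az in azimuths]

def binauralize(esma_signals, hrirs, rotation_deg=0):
    """Convolve each channel with the HRIR pair of its (rotated)
    loudspeaker position and sum into a 2-channel binaural signal."""
    hrir_len = next(iter(hrirs.values())).shape[1]
    out = np.zeros((2, esma_signals.shape[1] + hrir_len - 1))
    for sig, az in zip(esma_signals, rotate_channels(SPEAKER_AZIMUTHS, rotation_deg)):
        for ear in (0, 1):
            out[ear] += np.convolve(sig, hrirs[az][ear])
    return out

# Placeholder data: four 0.1 s capsule signals and 256-tap HRIR pairs per azimuth
# (random noise standing in for measured signals and HRIRs).
rng = np.random.default_rng(0)
capsules = rng.standard_normal((4, 4800))
hrirs = {az: rng.standard_normal((2, 256)) for az in ALL_AZIMUTHS}

print(rotate_channels(SPEAKER_AZIMUTHS, 90))               # [135, 225, 315, 45]
print(binauralize(capsules, hrirs, rotation_deg=90).shape)  # (2, 5055)
```

With head tracking, `rotation_deg` would be updated continuously from the tracker rather than fixed per stimulus, which is the dynamic rendering scenario described in the Introduction.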

0 INTRODUCTION

Microphone array techniques for surround sound recording can be broadly classified into two groups: those that attempt to produce continuous phantom imaging around 360◦ in the horizontal plane and those that treat the front and rear channels separately (i.e., source imaging in the front and environmental imaging in the rear) [1]. In conventional surround sound productions for home cinema settings, the front and rear separation approach tends to be used more widely due to its flexibility to control the amount of ambience feeding the rear channels. However, with the recent development of virtual reality (VR) technologies that allow the user to view visual images in 360◦, the need for recording audio in 360◦ arises.

Currently, the most popular method for capturing 360◦ audio for VR is arguably first order Ambisonics (FOA). FOA microphone systems are typically compact in size, thus convenient for location recording, and offer a stable localization characteristic due to their coincident microphone arrangement [1]. Furthermore, the FOA allows one to flexibly rotate the initially captured sound field in post-production. However, it is known that the FOA has limitations in terms of perceived spaciousness and the size of the sweet spot in loudspeaker reproduction due to the high level of interchannel correlation [2]. Higher order Ambisonics (HOA) offers a higher spatial resolution than the FOA and therefore can overcome the limitations of the FOA to some extent, although it is more costly and requires a larger number of channels. An HOA recording can be made using a spherical microphone array (e.g., the mh Acoustics Eigenmike). A system that supports a higher order typically requires a larger number of microphones to be used on the sphere. A review of currently available Ambisonics microphone systems can be found in [3].

On the other hand, a near-coincident microphone array, which incorporates directional microphones that are spaced and angled outwards, can provide a greater balance between spaciousness and localizability than a pure coincident array. This is due to the fact that it relies on both interchannel time difference (ICTD) and interchannel level difference (ICLD) for phantom imaging [4]. The so-called “equal segment microphone arrays (ESMAs),” originally proposed by Williams [4, 5], are a group of multichannel near-coincident arrays that attempt to produce a continuous 360◦ imaging in surround reproduction. The ESMAs follow the “critical

J. Audio Eng. Soc., Vol. 67, No. 1/2, 2019 January/February 13



linking” concept [5], which assumes that a continuous 360◦ imaging can be achieved when the stereophonic recording angle (SRA)¹ of each stereophonic segment is connected without overlap. There are three requirements to configure and use an ESMA: (i) all two-channel stereophonic segments of the array must have an equal subtended angle between microphones, (ii) the subtended angles must be the same as the SRA for each segment, and (iii) the loudspeaker array for reproduction must have the same angular arrangement as the microphone array. For example, as illustrated in Fig. 1, a four-channel (quadraphonic) ESMA is configured to produce the SRA of 90◦ using four unidirectional microphones with the subtended angle of 90◦ for each stereophonic segment; each of the microphone signals is discretely routed to each loudspeaker in a quadraphonic setup. Although the ESMA was originally proposed as a recording technique for multichannel loudspeaker reproduction [4, 5], it is proposed here that the ESMA would also be suitable for binaural headphone reproduction with head tracking for 360◦ audio applications. This can be achieved by convolving the ESMA signals with head-related impulse responses (HRIRs) for the corresponding loudspeaker positions, which are dynamically updated according to the angle of head rotation.

Fig. 1. Top view of a quadraphonic equal segment microphone array (ESMA) using cardioid microphones. The microphone spacing (d) is determined to produce the stereophonic recording angle of 90◦.

The current study² aims to (i) determine the appropriate microphone spacing for a quadraphonic ESMA using cardioid microphones to achieve an SRA of 90◦ and (ii) examine the localization characteristics of the ESMA in loudspeaker and binaural headphone reproductions with sound field rotations. The spacing and subtended angle between microphones for a microphone array with a specific SRA are determined based on a psychoacoustic ICTD and ICLD trade-off relationship required for a full phantom image shift, as discussed in detail in Sec. 1. In the case of the ESMA, the subtended angle between microphones is predetermined according to the number of channels involved (e.g., 90◦ for four channels) as mentioned above, thus making the microphone spacing the sole factor that determines the SRA. For example, if a correct microphone spacing is applied to the quadraphonic ESMA, then a sound source located at ±45◦ will be localized at ±45◦ in a quadraphonic reproduction with a 90◦ base angle for each stereophonic segment. There exist several different ICTD and ICLD trade-off models for estimating the SRA [8–10], and it is of interest to this study to discover which model produces the most accurate result. Conventional models [8, 9] have been derived from experimental data obtained using the standard 60◦ loudspeaker setup. However, each stereophonic segment in the quadraphonic reproduction for the ESMA has a base angle of 90◦. Therefore, the validity of applying such models to the design of the ESMA is questioned here. From this, the present study evaluates the imaging accuracies of quadraphonic cardioid ESMAs with four different microphone spacings based on different models: (i) 24 cm, based on both the Williams curves [8] and Image Assistant [9] models, both of which are based on data obtained using the 60◦ loudspeaker setup; (ii) 30 cm, based on the Microphone Array Recording and Reproduction Simulator (MARRS) model [10], which is also originally derived from the 60◦ setup; (iii) 50 cm, based on the MARRS model that is perceptually optimized for the 90◦ setup; and (iv) 0 cm, as in the so-called “in-phase” decoding of FOA B-format signals [2], which is equivalent to using four cardioids arranged in the quadraphonic setup.

The rest of the paper is organized as follows. Sec. 1 discusses the psychoacoustic models used to calculate the different microphone spacings for the quadraphonic cardioid ESMAs tested. Sec. 2 describes the methods used for two listening experiments conducted in loudspeaker and binaural headphone reproductions. Results obtained from the experiments are statistically analyzed in Sec. 3, followed by discussions of the results in Sec. 4. Sec. 4 also analyzes interaural time and level difference cues in the binaural stimuli and discusses possible ways to extend the ESMA for three-dimensional sound recording. Finally, Sec. 5 concludes the paper.

¹ The SRA refers to the horizontal span of the sound field in front of the microphone array that will be reproduced in full width between two loudspeakers [6].
² Preliminary results from this work were presented at the AES International Conference on Audio for Virtual and Augmented Reality in 2016 [7].

1 PSYCHOACOUSTIC MODELS

This section describes three different ICTD and ICLD trade-off models that were used to derive the microphone spacings tested in this study.

1.1 Williams Curves

Williams [5] recommends the microphone spacing of 24 cm for the quadraphonic cardioid ESMA. This is estimated based on the so-called “Williams curves” [8], which are a collection of curves that indicate possible




combinations of microphone spacings and subtended angles to achieve specific SRAs. They are based on an ICTD and ICLD trade-off relationship derived from polynomial interpolations of the ICTD and ICLD values required for 10◦, 20◦, and 30◦ image shifts that were obtained from a listening test in the standard 60◦ loudspeaker setup. Williams [8] claims that the SRA is virtually independent of the loudspeaker base angle, suggesting that the same ICTD and ICLD trade-off model obtained for the 60◦ loudspeaker setup can also be applied to the 90◦ setup. From this, he proposes that 24 cm between each microphone in the quadraphonic cardioid ESMA can produce the desired SRA of 90◦ for each stereophonic segment. Note that the ICTD and ICLD produced by a near-coincident microphone configuration vary slightly depending on the distance between the sound source and the microphone array, and so does the SRA of the array. However, it is not stated in [8] what source-array distance the Williams curves were based on.

1.2 Image Assistant

In contrast with the Williams curves, the psychoacoustic model used in the “Image Assistant” tool [9] assumes a linear trade-off between ICTD and ICLD within the 75% image shift region (e.g., 0 to 22.5◦ for the 60◦ loudspeaker setup). It also allows the user to choose a specific source-array distance for the SRA estimation. The amount of total image shift within this region is estimated by simply adding the image shifts that individually result from ICTD and ICLD (13%/0.1 ms and 7.5%/dB, respectively), which is a method proposed by Theile [11]. Outside the linear region, where the image shift pattern tends to become logarithmic for both ICTD and ICLD, an approximate function is applied to derive a non-linear ICTD and ICLD trade-off relationship [12]. The tool suggests that at 2 m distance between the source and the center of the array, which was used in the experiment of the present study, 24 cm is the correct microphone spacing to produce the required SRA of 90◦. The ICTD and ICLD shift factors used in the Image Assistant were obtained for the standard 60◦ loudspeaker setup. However, as in Williams’ assumption that the SRA is conserved regardless of the loudspeaker base angle, Theile [13] also claims that the same ICTD and ICLD image shift factors can be used for an arbitrary loudspeaker base angle, which is here referred to as the constant relative shift theory. Based on this, the microphone spacing of 24 cm is assumed to be still valid for the loudspeaker base angle of 90◦ in the quadraphonic reproduction setup.

1.3 MARRS

The 30 cm and 50 cm spacings are based on SRA estimations using the present author’s microphone array simulation tool “MARRS (Microphone Array Recording and Reproduction Simulator)” [10]. The psychoacoustic model used for MARRS relies on an ICTD and ICLD trade-off model derived from region-adaptive ICTD and ICLD image shift factors for the 60◦ loudspeaker setup presented in Table 1; they were defined based on subjective localization test data obtained using natural sound sources [14].

Table 1. ICTD and ICLD shift factors for the 60◦ and 90◦ loudspeaker setups suggested by the MARRS psychoacoustic model [10].

Speaker base angle   Image shift region   ICTD shift factor   ICLD shift factor
60◦                  0–66.7%              13.3%/0.1 ms        7.8%/dB
60◦                  66.7–100%            6.7%/0.1 ms         3.9%/dB
90◦                  0–66.7%              8.86%/0.1 ms        6%/dB
90◦                  66.7–100%            4.43%/0.1 ms        3%/dB

If Theile’s constant relative shift theory described above is applied here (i.e., using the data obtained for the 60◦ loudspeaker setup for the 90◦ setup), the correct spacing for each segment of the quadraphonic cardioid ESMA to achieve the 90◦ SRA at a 2 m source-array distance is 30 cm. However, the author’s previous research on amplitude panning [15] suggests that ICLD shift factors must vary depending on the loudspeaker base angle in order to achieve accurate phantom image localization; a larger base angle requires a larger ICLD for a given proportion of image shift. An informal listening test confirmed that this was also the case with ICTD. Therefore, the MARRS model [10] scales the original ICTD and ICLD shift factors depending on the loudspeaker base angle. For example, for the 90◦ loudspeaker setup, the original ICLD shift factor is scaled by 0.77, which is the ratio of the interaural level difference (ILD) above 1 kHz produced at 30◦ (the loudspeaker azimuth in the original 60◦ setup, which serves as the reference) to that at 45◦ (the loudspeaker azimuth of the 90◦ setup). Similarly, the ICTD shift factor is multiplied by the ratio of the interaural time differences (ITDs) below 1 kHz between 30◦ and 45◦, which is 0.67. This scaling process results in the shift factors optimized for the 90◦ loudspeaker setup presented in Table 1. Based on these, the correct spacing between adjacent microphones for the quadraphonic cardioid ESMA is estimated to be 50 cm. Note that this spacing is calculated for the source-array distance of 2 m. However, the difference for a larger distance in a practical recording situation is very small, e.g., a 50.4 cm spacing for a 5 m source distance for the cardioid ESMA. In addition, the size of a quadraphonic ESMA could be made smaller if microphones with a higher directionality are used, e.g., 40 cm for supercardioids at a 2 m source distance. Readers who are interested in more details about the algorithm used in MARRS are referred to the open-access Matlab source code package³. MARRS is also available as a free mobile app from the Apple and Google app stores.

³ https://github.com/APL-Huddersfield/MARRS

2 EXPERIMENTAL DESIGN

Two subjective experiments were carried out. Experiment 1 evaluated the localization accuracies of the four microphone arrays with different spacings in a quadraphonic loudspeaker reproduction. Experiment 2 repeated the same




tests over headphones using binaurally synthesized stimuli of the ESMAs. Various degrees of head rotation were simulated by rotating the reproduced sound field by the corresponding degrees with the listeners kept facing forwards. This method was chosen over real head rotations since it allowed an efficient randomization and accurate implementation of the target angle condition for each trial. Furthermore, head-static listening with sound field rotation is a practical scenario, e.g., watching a 360◦ video on a monitor screen rather than using a head-mounted display. However, results from this study would require verification in a practical virtual reality scenario with head tracking in the future.

2.1 Physical Setup

The experiments were conducted in the ITU-R BS.1116-compliant listening room of the Applied Psychoacoustics Laboratory at the University of Huddersfield (6.2 x 5.6 x 3.8 m; RT = 0.25 s; NR = 12). The room was used for both stimuli creation and listening tests. Eight Genelec 8040A loudspeakers were arranged in a circle as shown in Fig. 2. The loudspeakers were positioned at the azimuth angles of 0◦, 45◦, 90◦, 135◦, 180◦, 225◦, 270◦, and 315◦ clockwise. The distance between the center of the circle (the listening position) and each loudspeaker was 2 m. In the listening tests, the loudspeaker setup was hidden from the listeners by using acoustically transparent curtains.

Fig. 2. Loudspeaker setup used for room impulse response measurements in Experiment 1. The circle represents the acoustic curtain used to hide the loudspeakers.

2.2 Stimuli Creation

2.2.1 Room Impulse Response Measurement

In order to create the test stimuli, four-channel room impulse responses (RIRs) were first acquired in the listening room for each of the four microphone arrays individually, using the exponential sine sweep method [16]. The microphones used for the ESMAs with 24 cm, 30 cm, and 50 cm spacings were Neumann KM184 cardioid microphones, which were pointing towards 45◦, 135◦, 225◦, and 315◦. In addition to the ESMAs, a Soundfield SPS422b FOA microphone system was used to capture B-format RIRs, which were decoded using the in-phase decoding method [2] as mentioned earlier. This produced four virtual cardioid microphones that were coincidentally arranged and pointing towards 45◦, 135◦, 225◦, and 315◦.

The sound sources used for the RIR measurements were the loudspeakers placed at 0◦ and 45◦. They were selected for the following reasons. First, the 45◦ position was to investigate whether the arrays could achieve the goal of the 90◦ SRA for each stereophonic segment. If the goal were indeed achieved, then the phantom image for the source should be localized at 45◦ in reproduction. The 0◦ position was selected for examining how accurately a centrally panned phantom image can be localized at the desired position for a given sound field rotation.

2.2.2 Stimuli for Experiment 1

For the loudspeaker listening test, four-channel stimuli for each source position were created by convolving the RIRs captured using the microphones with an anechoically recorded male speech signal taken from [17]. Prior to the convolution, all reflection components of the RIRs (i.e., beyond 2.5 ms after the direct sound) were removed using a half Hann window. This was to avoid excessive room reflections being heard when the stimuli were reproduced in the same room where the RIRs were captured. However, it should be acknowledged that in practical situations the recording and reproduction environments are usually different and their acoustic characteristics would interact.

Sound field rotations from 0◦ to 315◦ were applied to the original four-channel stimuli in 45◦ intervals. This was done by offsetting the azimuth of the loudspeaker for each of the four signals by 45◦ for every 45◦ rotation. For instance, as illustrated in Fig. 3(c) and (f), the signals of microphones 1, 2, 3, and 4 shown in Fig. 1 were presented from the loudspeakers at 45◦, 135◦, 225◦, and 315◦, respectively, for a 90◦ sound field rotation. In this case, the target perceived positions for the sound sources at 0◦ and 45◦ were 90◦ and 135◦, respectively. Table 2 presents the target image position for each sound field rotation and its equivalent head rotation for each source position.

In addition, eight real source stimuli were created by routing the speech signal to each of the eight loudspeakers individually. These served as reference conditions against which to compare the localization behaviors of the phantom source stimuli.

2.2.3 Stimuli for Experiment 2

For the binaural listening test, the same speech signal used in Experiment 1 was convolved with the RIRs captured using the microphone arrays. In contrast with the loudspeaker listening test, full RIRs including room reflections were used to auralize the listening room condition. The resulting signals were then convolved with anechoic HRIRs captured using a Neumann KU100 dummy head, which were taken from the “SADIE” database [18]. Head rotations were simulated by applying HRIRs corresponding



Fig. 3. Examples of sound field rotation applied to stimuli created for sound sources at 0◦ and 45◦ ; each sound field rotation simulating
the equivalent head rotation.

to the target position associated with each rotation angle. Additionally, reference binaural stimuli for a real source were created by recording the anechoic speech reproduced from each of the eight loudspeakers in the listening room using a Neumann KU100 dummy head placed at the listening position.

Table 2. Target image position for each sound field rotation for each source position.

Source position   Sound field rotation   Equivalent head rotation   Target image position
0◦                0◦                     0◦                         0◦
0◦                45◦                    –45◦                       45◦
0◦                90◦                    –90◦                       90◦
0◦                135◦                   –135◦                      135◦
0◦                180◦                   –180◦                      180◦
0◦                225◦                   –225◦                      225◦
0◦                270◦                   –270◦                      270◦
0◦                315◦                   –315◦                      315◦
45◦               0◦                     0◦                         45◦
45◦               45◦                    –45◦                       90◦
45◦               90◦                    –90◦                       135◦
45◦               135◦                   –135◦                      180◦
45◦               180◦                   –180◦                      225◦
45◦               225◦                   –225◦                      270◦
45◦               270◦                   –270◦                      315◦
45◦               315◦                   –315◦                      0◦

2.3 Subjects

Nine critical listeners participated in both experiments, in which they tested each stimulus condition twice in a randomized order for each experiment; a total of eighteen localization responses were obtained for each test condition. They comprised staff researchers, postgraduate research students, and final-year undergraduate students of the Applied Psychoacoustics Lab at the University of Huddersfield, with their ages ranging from 21 to 38. All of them reported normal hearing and had extensive experience in conducting sound localization tasks in formal listening tests. All subjects completed the loudspeaker test (Experiment 1), at least one week after which they sat the binaural test (Experiment 2). They did not know the nature of the test stimuli until they had completed both experiments.

2.4 Test Procedure

2.4.1 Experiment 1

The subject was seated at the center of the loudspeaker circle, and the chair was adjusted so that their ear height matched the height of the loudspeakers’ acoustic centers (1.35 m from the floor). The subjects were instructed to face the front and not to move their heads during the test, while eye movement was encouraged. A small headrest was placed at the back of the subject’s head to reduce movement, which was visually monitored by the experimenter during the test. The subject’s task was to mark down the apparent location of the perceived image for each stimulus on a horizontal circle provided on a graphical user interface (GUI) written using Max 7. The angular resolution of the response was 1◦. Small markers were indicated on the circle from 0◦ in 22.5◦ intervals. Markers with the same intervals were also placed on the acoustic curtain to help the subject correctly map the perceived image position onto the circle. Prior to the actual test, the subjects were given familiarization trials comprising the real source stimuli for the eight




loudspeaker positions, which were considered to have the highest localization accuracy among all stimuli.

The playback levels of all stimuli were calibrated to 70 dB LAeq at the listening position. Each trial in the test contained a single stimulus and the subjects could listen to it repeatedly until they judged its perceived position. All stimuli were presented in a randomized order. For the sound-field-rotated stimuli, one of the mirrored target image positions (e.g., 315◦ or 45◦) was randomly selected for each listener for each microphone array condition. This was to minimize psychological order effects as well as to avoid the potential listening fatigue that might occur when the sound is presented only from the left- or right-hand side. Every subject judged each test condition twice in a randomized order.

2.4.2 Experiment 2

The listening test was conducted in the same room as Experiment 1. The test procedure was identical to that of Experiment 1, apart from the following. The headphones used for the test were Sennheiser HD650. To equalize them, their impulse responses were measured five times using the KU100 dummy head, with the headphones re-seated on the head each time. The average responses were then inverse filtered using a regularization method by Kirkeby et al. [19]. Prior to the actual test the subjects were presented with familiarization trials comprising the binaural recordings of the real sources for the eight loudspeaker positions. The loudness unit level of all binaural stimuli was calibrated at –18 LUFS and the headphone playback level was determined by the present author to match the perceived loudness of the loudspeaker playback from Experiment 1 (70 dB LAeq). No head tracking was used for rendering different image positions in binaural reproduction; the sound field was rotated instead as described in Sec. 2.2.3.

3 RESULTS

As mentioned earlier, the stimuli with the mirrored target image positions were randomly selected for each listener in the listening tests. For the purposes of the statistical analysis and data plotting, the perceived angles for the stimuli with target angles in the left-hand side of the circle were converted into the corresponding angles in the right-hand side (e.g., 315◦ to 45◦, 270◦ to 90◦). For the continuity of the data in the analysis, any responses for the 0◦ target angle that were given in the left-hand side of the circle were converted into negative values (e.g., 355◦ to –5◦), whereas those for the 180◦ target angle in the left side were unchanged.

Shapiro-Wilk and Levene’s tests were first performed to examine the normality and variance of the data collected. The results suggested that the data were not suitable for parametric statistical testing. Therefore, the non-parametric Wilcoxon signed-rank test was conducted to examine if there was a significant difference between the target and perceived image positions for each test condition, except for those that had a significant bimodal distribution. The significance of bimodality was examined using Hartigan’s dip test [20].

3.1 Phantom Source Localization in Loudspeaker Reproduction

Fig. 4 shows the bubble plots of the data for the phantom source conditions (i.e., microphone array recordings) from Experiment 1. Table 3 presents the summary of the statistical analyses.

Table 3. Summary of the results for phantom source localization in loudspeaker reproduction (Experiment 1): median perceived angles for each experimental condition. Conditions with a significant difference from the target position (Wilcoxon signed rank test): * p < .05; ** p < .01. Conditions with a significant bimodal distribution (Hartigan’s dip test): ∧ p < .05; ∧∧ p < .01.

                             Target azimuth after sound field rotation (degree)
Source angle   Mic spacing
(degree)       (cm)          0      45     90     135    180

0◦             50            0      41     ∧      135    180
0◦             30            0      40     67*    134    180
0◦             24            0      34     68     135    180
0◦             0             0      24*    45*    134    180

45◦            50            0      45     90     135    180
45◦            30            0      44*    90     135    ∧
45◦            24            0      39**   90     135    180
45◦            0             0      30**   ∧∧     152    ∧

3.1.1 Sound Source at 0◦

The results for the 0◦ source position are first presented. From the scatterplots in Fig. 4, it appears that all microphone spacings produced a relatively accurate localization when the target angle was 0◦; there is no front-back confusion. For the 45◦ target angle (45◦ simulated head rotation), the 0 cm condition had a median perceived angle (MED) of 24◦, which was significantly smaller than the target (p = 0.027), whereas the differences of the 50 cm, 30 cm, and 24 cm spacings from the target were not significant (p > 0.05). Looking at the 90◦ target angle (90◦ simulated head rotation), the responses for the 0◦ source appear to have wide spreads in general. The 50 cm spacing had a significant bimodal distribution (p = 0.022). The MEDs for the 30 cm and 24 cm were considerably smaller than the target angle (67◦–68◦). The 0 cm spacing had the largest deviation from the target angle among all spacings (MED = 45◦, p = 0.015). For both the 135◦ and 180◦ target angles, the MEDs for all spacings did not have a significant difference from the target angles (p > 0.05). However, the responses for the 135◦ target angle tended to be widely spread between the front and rear regions.

3.1.2 Sound Source at 45◦

For the 0◦ target angle (315◦ sound field rotation), all conditions had no significant difference between the perceived and target angles (p > 0.05). For the 45◦ target angle (no sound field rotation), the MED was closer to the target angle in the order of 50 cm (45◦), 30 cm (44◦), 24 cm (39◦),




Fig. 4. Bubble plots of the data obtained from the loudspeaker localization test (Experiment 1). The diameter of each circle represents
the percentage of responses for each condition.

and 0 cm (30◦). Apart from the 50 cm spacing, the MEDs were all found to deviate significantly from the target (p = 0.047 for 30 cm, p = 0.000 for 24 cm and 0 cm). For the 90◦ target angle, the 50 cm, 30 cm, and 24 cm spacings did not have a significant difference between the perceived and target angles (MED = 90◦, p > 0.05), whereas the 0 cm produced a significant bimodal distribution between around 45◦ and 135◦ (p = 0.002). Looking at the target angle of 135◦, the MEDs for the 50 cm, 30 cm, and 24 cm were the same as the target, whereas that for the 0 cm (152◦) was noticeably closer to the median plane, although this difference was not statistically significant (p > 0.05). For the 180◦ target angle, 50 cm and 24 cm were found to produce an accurate result (MED = 180◦, p > 0.05), whereas responses for 30 cm and 0 cm had a significant bimodality (p = 0.036 and 0.01, respectively).

3.2 Phantom Source Localization in Binaural Reproduction

The scatter plots of the data obtained for the phantom source conditions from Experiment 2 are presented in Fig. 5. Table 4 summarizes the results from the statistical analyses. From Fig. 5, it is generally observed that the responses from the binaural test were more widely spread compared to those from the loudspeaker test (Fig. 4). The table also indicates that the binaural test had more conditions with a significant bimodal distribution.

Table 4. Summary of the results for phantom source localization in binaural reproduction (Experiment 2): median perceived angles for each experimental condition. Conditions with a significant difference from the target position (Wilcoxon signed rank test): * p < .05; ** p < .01. Conditions with a significant bimodal distribution (Hartigan’s dip test): ∧ p < .05; ∧∧ p < .01.

                             Target azimuth after sound field rotation (degree)
Source angle   Mic spacing
(degree)       (cm)          0      45     90     135    180

0◦             50            ∧∧     42     100    ∧      180
0◦             30            ∧∧     35     62     ∧      180*
0◦             24            ∧∧     39     ∧      ∧∧     180
0◦             0             –      39     69     ∧∧     180

45◦            50            ∧∧     47     90     135    180
45◦            30            ∧∧     50*    90*    129**  ∧
45◦            24            ∧      47     90     ∧      ∧
45◦            0             ∧∧     27*    ∧∧     ∧∧     ∧∧

3.2.1 Sound Source at 0◦

Looking at the results for the 0◦ source first, the responses for the 0◦ target were significantly bimodal for all of the spaced array conditions (p < 0.01). The responses were mainly given to either 0◦ or 180◦, exhibiting strong tendencies of front-to-back confusion. For the target angle of 45◦, none of the spacings produced a significant difference between the perceived and target angles, although 50 cm had an MED that was closest to the target. For the 90◦ target

J. Audio Eng. Soc., Vol. 67, No. 1/2, 2019 January/February 19


LEE PAPERS

Fig. 5. Bubble plots of the data obtained from the binaural localization test (Experiment 2). The diameter of each circle represents the
percentage of responses for each condition.

angle, again the 50 cm spacing produced the most accurate had a significant bimodal distribution (p = 0.04 for 24 cm
result. The MEDs for 30 cm and 0 cm (62◦ and 69◦ , respec- and 0.000 for 0 cm). Last, for the target angle was 180◦ , the
tively) were considerably narrower than the target, while 50 cm spacing produced an accurate result (MED = 180◦ ,
responses for 24 cm were significantly bimodal (p < 0.05). p > 0.05), whereas the other spacings all had a significant
All conditions for the target angle of 135◦ were found to bimodality.
have a significant bimodal distribution between around 45◦
and 135◦ (p < 0.05 for 50 cm and 30 cm, p < 0.01 for 24
3.3 Real Source Localization in Loudspeaker
cm and 0 cm). For the 180◦ target angle, only the 30 cm
and Binaural Reproductions
condition was found to be significantly different from the
target (p < 0.05). Fig. 6 presents the responses given to the real source
stimuli (i.e., single loudspeaker conditions) in both loud-
3.2.2 Sound Source at 45◦ speaker and binaural experiments. Wilcoxon tests suggest
that, for the loudspeaker results, there was no significant
For the 45◦ source position, the responses for the tar-
difference between the perceived and target angles for all
get angle of 0◦ were found to be significantly bimodal
stimuli (p > 0.05). For the binaural conditions, on the other
regardless of the microphone spacing (i.e., front-to-back
hand, it was found that the responses for the 0◦ and 180◦
confusion). For the 45◦ target angle, the 50 cm and 24 cm
sources were significantly bimodal, exhibiting front-back
spacings both produced the MED of 47◦ , which was not
confusion. Furthermore, the 45◦ source (MED = 52◦ ) was
significantly different from the target (p > 0.05). However,
found to be perceived at a significantly wider position than
the 30 cm and 0 cm had significant differences between
the target (p < 0.01).
the target and perceived angles (MEDs = 50◦ and 27◦ , re-
spectively, p < 0.05). The results for the 90◦ target angle
show that the 50 cm, 30 cm, and 24 cm all had the median 4 DISCUSSIONS
perceived angles of 90◦ , whereas the 0 cm condition had a
significant bimodal distribution (p < 0.01) between around This section discusses various aspects of the subjective
45◦ and 135◦ . For the 135◦ target angle, 50 cm was the only results described above. The measurements of interaural
spacing that produced an accurate result (MED = 135◦ , time and level differences are provided to explain the sub-
p > 0.05). The MED for 30 cm (129◦ ) was significantly jective results. A higher order and 3D versions of ESMA
different from the target (p = 0.007), while 24 cm and 0 cm are also introduced.

20 J. Audio Eng. Soc., Vol. 67, No. 1/2, 2019 January/February


PAPERS CAPTURING 360◦ AUDIO USING ESMA

Fig. 6. Bubble plots of the data obtained for single sources from the loudspeaker and binaural tests. The diameter of each circle represents the percentage of responses for each condition.

4.1 Microphone Spacing

In general, among all of the microphone spacings tested, 50 cm produced the best results in terms of phantom image localization accuracy. In the loudspeaker presentation, for all target angle conditions apart from 90°, the 50 cm spacing had no significant difference between the target and median perceived angles (MEDs), as evident in Table 2. This seems to validate the localization prediction model of the MARRS tool [10], which is optimized for the 90° loudspeaker base angle (Sec. 1.3). The 45° source angle with no sound field rotation was a particularly important test condition for examining whether the quadraphonic ESMA can achieve the goal of a 90° SRA, as discussed in the Introduction. The results indicate that the 24 cm and 30 cm spacings, which are based on conventional psychoacoustic models [6, 7], fail to achieve the goal; they produced significantly narrower MEDs than the target angle of 45°. In the binaural presentation, there were generally more bimodal distributions than in the loudspeaker test. However, 50 cm had the most conditions that were not significantly different from the target positions. The differences between the loudspeaker and binaural results are further discussed in Sec. 4.3.

The 0 cm spacing demonstrated the worst localization performance, having the largest number of conditions where the MED was significantly narrower than the target angle or the data distribution was significantly bimodal. For example, the MEDs for the stimuli with the target angle of 45° were only 30° and 27° in the loudspeaker and binaural presentations, respectively. However, it is worth noting that this should not be assumed to be the general localization performance of FOA. As mentioned in Sec. 2.2.1, the current study used the four virtual cardioid microphones derived from the in-phase decoding of B-format signals. This was for direct comparisons against the ESMAs with cardioid microphones. The polar pattern of the virtual microphone formed by the basic decoder is the supercardioid [21], which has a higher directionality than the cardioid. Therefore, it is expected that the phantom image would be localized closer to the target position of 45° if the basic decoder were used for the FOA recording. This is currently under investigation.

4.2 Source Angle

The responses for the 0° source tended to have larger data spread and more bimodal distributions than the 45° source, especially when sound field rotations were applied. This could be explained as follows. The ICTD and ICLD trade-off models tested were originally obtained from experiments using a loudspeaker pair that was symmetrically arranged in the front. With a sound field rotation, the signals for the 0° source would create a phantom image between loudspeakers that are asymmetrical to the direction where the head faces (e.g., Fig. 3(b) or 3(c)). Therefore, the original trade-off models would not be applied correctly. More notably, with the 90° rotation of the sound field for the 0° source (90° target angle), where the signals were presented dominantly from the loudspeakers at 45° and 135°, the responses were noticeably spread or bimodal between 45° and 135° in both loudspeaker and binaural conditions. The poor localization certainty of a lateral phantom image observed in the current study is in line with past results reported by Theile and Plenge [22] and Martin et al. [23].

From the above discussion, it might be suggested that, in 360° audio applications with sound field rotation or head-tracking, the localization accuracy and precision of a quadraphonic ESMA might be at their best with sources around the edges of the SRA (i.e., ±45°), and become poorer as the source azimuth becomes closer to ±90°.

4.3 Loudspeaker Reproduction vs. Binaural Reproduction

Overall, the loudspeaker and binaural presentations produced similar patterns of phantom image localization, but Wilcoxon tests performed between the loudspeaker and binaural test data suggest that there were a few conditions with significant differences. Notably, the 0° target angle condition had a significant bimodality in the binaural presentation for both the 0° and 45° source positions but not in the loudspeaker presentation. Furthermore, the 45° source condition without a sound field rotation (i.e., 45° target angle) produced responses spread between around 45° and 135° in the binaural reproduction (i.e., front-back confusion), whereas it was localized only in the front region in the loudspeaker reproduction. It is interesting that similar tendencies were also observed for the single sources at 0° and 45° (see Fig. 6). It may be suggested that the front-back confusion observed for the 0° and 45° target angle conditions was associated with the binaural synthesis using the non-personalized head-related transfer functions (HRTFs). However, as Wightman and Kistler [24] found, such confusions could happen even with personalized HRTFs when head movement is not allowed. The current experiment did not allow head movement while listening, which might explain the front-back confusion observed. From the above, it is considered that, in practical VR applications with head tracking, such an issue may be resolved even if non-individualized HRTFs are used for the binaural rendering of the ESMA, which requires further investigation.
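Sec. 4.1's point about the 0 cm condition can be made concrete: a virtual first-order microphone is simply a weighted sum of the B-format components. The sketch below is illustrative only and is not the decoder used in the paper itself; it assumes a horizontal plane wave and a unit-gain omnidirectional W component (SN3D-style normalization). A 0.5/0.5 weighting yields the cardioid pattern that the in-phase quadraphonic decode produces, while other weightings move towards figure-of-eight or omni.

```python
# Illustrative virtual-microphone extraction from horizontal first-order B-format.
# Assumes unit-gain omni W (SN3D-style); not the paper's actual decoder.
import numpy as np

def encode_bformat(source_az_deg):
    """Plane-wave B-format [W, X, Y] for a horizontal source at the given azimuth."""
    az = np.deg2rad(source_az_deg)
    return np.array([1.0, np.cos(az), np.sin(az)])

def virtual_mic_gain(bformat, steer_deg, p=0.5):
    """First-order virtual microphone steered to steer_deg.
    Pattern is p + (1 - p)*cos: p = 0.5 gives a cardioid (the pattern of the
    in-phase quad decode), p = 0 a figure-of-eight, p = 1 an omni."""
    st = np.deg2rad(steer_deg)
    w, x, y = bformat
    return p * w + (1.0 - p) * (np.cos(st) * x + np.sin(st) * y)

# a source at 45 deg picked up by four virtual cardioids at 45/135/225/315 deg
b = encode_bformat(45.0)
gains = {az: virtual_mic_gain(b, az) for az in (45.0, 135.0, 225.0, 315.0)}
```

The on-axis virtual cardioid (45°) picks the source up at full level, the two adjacent ones at half level, and the rear-facing one rejects it entirely; in the 0 cm condition only the spacing, not this channel behavior, differs from the spaced ESMAs.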
4.4 Analyses of Interaural Time and Level Differences

To gain further insights into potential reasons for the subjective results, the ITDs and ILDs of all of the binaural stimuli with off-center target angles (45°, 90°, and 135°) were estimated and compared. 0° and 180° were excluded since at those angles there is no ITD and the ILD exists only at very high frequencies due to ear asymmetry. The binaural model used for the analyses is described as follows. Each binaural stimulus was first split into 42 frequency bands through a Gammatone "equivalent rectangular band (ERB)" filter bank [25] that mimics the critical bands of the inner ear. To emulate the breakdown of the phase-locking mechanism in the ear signals, half-wave rectification and first-order low-pass filtering at 1 kHz were applied to each band, as in [26, 27]. Time-varying ITDs and ILDs for each band were computed for 50%-overlapping 50 ms frames with the Hanning window. The ITD was defined as the lag of the maximum of the normalized interaural cross-correlation function (i.e., lag ranging between –1 ms and 1 ms). The ILDs were computed as the energy ratio between the left and right signals. The ITDs obtained for all of the frames were averaged for each band; so were the ILDs. The results are presented in Fig. 7 as the ITD and ILD differences of each microphone array stimulus to the real source stimulus with the corresponding target angle (i.e., the single source dummy head recordings). Therefore, the closer the difference is to the 0 reference, the more accurate the ITD or ILD produced by the microphone array is.

Fig. 7. Differences of each ESMA stimulus to the real source in interaural time difference (ITD) and interaural level difference (ILD) for each experimental condition; average of results obtained for 50 ms overlapping windows for each of the 42 ERB critical bands.

Looking at the plots for the 45° source with a 0° rotation (45° target angle), the 50 cm spacing produced slightly greater ITDs than the dummy head reference across all bands, while it produced slightly lower ILDs consistently above about 200 Hz. It was shown in the subjective results that the 50 cm spacing produced a highly accurate localization for this test condition. Based on the literature [28, 29], this subjective result seems to be due to a trade-off between the effects of the ITDs and ILDs on localization. That is, a wider image position due to the ITD being greater than the reference and a narrower image position due to the ILD being smaller than the reference might have been spatially averaged. Especially between about 700 Hz and 4 kHz, which Griesinger [30] claims to be the most important frequency region for determining the perceived position of a broadband phantom image, the average ITD and ILD differences to the reference for this condition are 0.1 ms and –0.75 dB, respectively. This gives a ratio of 0.13 ms/dB, which lies within the range of ITD/ILD trading ratios⁴ found in the literature (i.e., 0.04–0.2 ms/dB [26]). This suggests that the degree of the positive image shift from the target position by the ITD cue and that of the negative shift by the ILD cue would have been similar, thus resulting in the spatial averaging around the target position. On the other hand, for all the other spacing conditions for the 45° source with a 0° rotation, the "center of gravity" between the ITD and ILD images (as described in [29]) seems to be at a narrower position than the target. For example, for the 24 cm ESMA, the average ITD difference to the reference between 700 Hz and 4 kHz was only –0.02 ms, whereas the average ILD difference was –1.7 dB. This would have caused a considerable deviation from the target towards a narrower position, mainly due to the ILD cue. It is also interesting to observe that the 0 cm condition, which had the worst subjective result, had the opposite trend to the 24 cm condition; the average ILD difference was only –0.15 dB, whereas the ITD difference was considerably large (–0.18 ms).

⁴ The ITD/ILD trading ratio refers to the equivalence between interaural time and level differences measured in terms of the magnitude of perceived image shift [29].
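The binaural analysis model of Sec. 4.4 can be sketched as follows. This is a simplified stand-in, not the paper's implementation: a one-ERB-wide Butterworth band-pass replaces the gammatone filter bank of [25], only a single band is shown, and the 48 kHz sample rate is an assumption; the half-wave rectification, 1 kHz first-order low-pass, 50 ms Hann frames with 50% overlap, and the ±1 ms normalized cross-correlation search follow the description in the text.

```python
# Simplified sketch of the per-band ITD/ILD estimation described in Sec. 4.4.
import numpy as np
from scipy.signal import butter, sosfiltfilt, lfilter

FS = 48000  # sample rate (Hz); an assumption for this sketch

def auditory_band(x, fc):
    """One-ERB-wide band-pass around fc (stand-in for a gammatone band), then
    half-wave rectification and a first-order 1 kHz low-pass to emulate the
    breakdown of phase locking."""
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # ERB width (Glasberg & Moore)
    sos = butter(2, [fc - erb / 2.0, fc + erb / 2.0],
                 btype="bandpass", fs=FS, output="sos")
    band = sosfiltfilt(sos, x)
    rect = np.maximum(band, 0.0)
    a1 = np.exp(-2.0 * np.pi * 1000.0 / FS)
    return lfilter([1.0 - a1], [1.0, -a1], rect)

def itd_ild(left, right, frame_s=0.05, max_lag_s=0.001):
    """Frame-averaged ITD (ms; positive = right channel lags) and ILD (dB, L re R)
    over 50%-overlapping Hann-windowed frames."""
    n, hop = int(frame_s * FS), int(frame_s * FS) // 2
    win = np.hanning(n)
    lags = np.arange(-int(max_lag_s * FS), int(max_lag_s * FS) + 1)
    itds, ilds = [], []
    for s in range(0, len(left) - n + 1, hop):
        l, r = left[s:s + n] * win, right[s:s + n] * win
        norm = np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12
        # normalized interaural cross-correlation, restricted to +/-1 ms lags
        xc = [np.dot(l[max(0, -k):n - max(0, k)], r[max(0, k):n - max(0, -k)]) / norm
              for k in lags]
        itds.append(1000.0 * lags[int(np.argmax(xc))] / FS)
        ilds.append(10.0 * np.log10((np.dot(l, l) + 1e-12) / (np.dot(r, r) + 1e-12)))
    return float(np.mean(itds)), float(np.mean(ilds))

# demo: right channel = left delayed by 0.25 ms (12 samples) and 6 dB quieter
rng = np.random.default_rng(0)
x = rng.standard_normal(FS // 2)
y = 0.5 * np.concatenate([np.zeros(12), x[:-12]])
itd_ms, ild_db = itd_ild(auditory_band(x, 500.0), auditory_band(y, 500.0))
```

Run per band and averaged across bands (and over 700 Hz–4 kHz for the trading-ratio comparison), the differences between each array's ITD/ILD and the dummy-head reference give the figures quoted in the text, e.g., 0.1 ms / 0.75 dB ≈ 0.13 ms/dB.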
A similar trend to the above is generally observed in the other source-rotation conditions.

4.5 Higher Resolution ESMA

The unstable phantom center image localization in sound field rotations, which was discussed in Sec. 4.2, could be improved if the SRA resolution is increased. For example, an octagonal (eight-channel) ESMA, which was originally proposed by Williams [5], is considered here. As illustrated in Fig. 8, the microphone array is configured with eight spaced cardioid microphones arranged in an octagon with the 45° subtended angle for each microphone pair. It requires an octagonal loudspeaker layout for reproduction. To achieve the "critical linking" for each stereophonic segment, the SRA for each pair of adjacent microphones should be made 45°, for which the microphone spacing d should be determined. As discussed earlier, different microphone spacings can be suggested depending on which psychoacoustic model for the ICLD and ICTD trade-off is used. If cardioid microphones are used, for example, the necessary spacing is 82 cm according to the Williams curves [8], whereas it is 55 cm based on the MARRS model [10]. This is because MARRS scales the ICTD and ICLD trade-off function adaptively depending on the loudspeaker base angle as described in Sec. 1.3, whereas the Williams curves apply the same model used for the 60° base angle. Further study is required to confirm the localization accuracies of various spacings for the octagonal ESMA.

Fig. 8. Octagonal cardioid ESMA. d = 82 cm according to Williams's ICTD-ICLD trade-off model [8]; 55 cm according to the MARRS model [10].

4.6 ESMA-3D

Two methods of adding the height dimension to the quadraphonic ESMA for 3D sound reproduction (namely, ESMA-3D) are proposed in this section. The underlying design concept for the ESMA-3D is to use horizontally spaced pairs of vertically coincident microphones. The rationale for the choice of the vertically coincident configuration is as follows. First, in terms of vertical source localization, Wallis and Lee [31] showed that a vertical ICTD is an unstable cue for vertical stereophonic panning due to the lack of the precedence effect in the vertical plane. On the other hand, a vertical ICLD was found to have some control over the perceived vertical image position, although its perceptual resolution and consistency were not high [32, 33]. Furthermore, Lee and Gribben [34] found that vertical spacing between the main and height microphones of a main microphone array had no significant effect on the perceived spatial impression. A vertically coincident design also has an advantage in 3D-to-2D downmixing in that there is no comb-filter effect when the lower and upper microphone signals are summed.

The first approach proposed here is to coincidentally arrange a vertically oriented figure-of-eight microphone with each of the main microphones of the ESMA. This is illustrated in Fig. 9(a). Each of the vertical coincident pairs is essentially a vertical mid-side pair. Therefore, it can be decoded into downward-facing and upward-facing virtual microphones, which are then routed to lower and upper loudspeakers in 3D sound reproduction, respectively, as described in [35]. When the microphone array is placed at the same height as the sound sources, the recommended loudspeaker arrangement is the so-called "cube" format, which is commonly used for the 3D reproduction of an FOA recording (e.g., quadraphonic loudspeaker layers at –35° and 35° elevations). This will allow sound sources placed at the microphone array height to be presented as vertical phantom center images between the two loudspeaker layers, while sounds arriving from vertical directions would be localized vertically due to the ICLD cue.

In the case of using the quadraphonic layer at the ear height augmented with another quadraphonic layer elevated at 30° to 45° [36], cardioid or supercardioid microphones facing directly upwards are recommended to capture the height information. Previous research suggests that, to avoid the perceived position of a source image being shifted upwards unintentionally in vertical stereophonic reproduction, the level of source sound captured by the height microphone needs to be at least 7–9 dB lower than that captured by the main microphone [37]. If the microphone array were raised at the same height as the sound source, with the main microphones being on-axis to the source, supercardioid microphones would be a better choice than cardioids for the height channels since they provide sufficient level attenuation for the source sound arriving from 90° (i.e., –10 dB). However, if the array were raised higher than the sound source, which is common in classical music recording, cardioid microphones would also be suitable for the height channels since their theoretical polar response is smaller than –10 dB beyond 110° off-axis. In this case, it would be desired that the main microphones are angled on-axis towards the sources to ensure optimal localization and tonal quality, while the height microphones are angled directly upwards (e.g., Fig. 9(b)). It should be noted, however, that this configuration makes the subtended angle between the main microphones of each stereophonic segment narrower than 90°, thus requiring a slight increase in microphone spacing to maintain the 90° SRA for each segment. For example, if the microphones of a quadraphonic ESMA are tilted downwards at –35.3°, the subtended angle for each microphone pair from the base point becomes 70.5° (i.e., the angle between the diagonals of a cube). In this case, based on the MARRS model [10], the correct spacing between the main layer microphones to produce the 90° SRA is 54 cm for cardioids and 48 cm for supercardioids.

Fig. 9. Examples of the vertical extension of the quadraphonic ESMA for 3D sound capture (namely, ESMA-3D): (a) four vertical mid-side pairs of cardioid and fig-8 microphones; (b) four vertical coincident pairs of cardioid microphones.

5 CONCLUSIONS

Listening experiments were conducted to evaluate the phantom image localization accuracies produced by different microphone spacings of the quadraphonic equal segment microphone array (ESMA) with cardioid microphones. The spacings of 24 cm, 30 cm, and 50 cm, which were based on different psychoacoustic models, as well as the 0 cm spacing for the in-phase decoding of first-order Ambisonics, were tested in both loudspeaker and binaural reproductions. The 50 cm spacing was based on an ICTD and ICLD trade-off model that is perceptually optimized for 90° loudspeaker reproduction, whereas the 30 cm and 24 cm spacings were based on conventional models using data obtained for the 60° loudspeaker setup. The test stimuli were recordings of an anechoic speech source located at 0° and 45° azimuth angles, made using the microphone arrays with the four different spacings as well as a dummy head. The listening tests measured the perceived positions of the phantom and real source images with the sound field rotated at 45° intervals, which was for simulating head rotation or scene rotation in virtual reality applications. Furthermore, the ITD and ILD produced in each phantom source condition were compared to those for the corresponding real source condition.

From the results and discussions presented in this paper, the following conclusions are drawn:

(i) The 50 cm microphone spacing generally produces more accurate and stable phantom imaging than the other spacings tested, achieving the stereophonic recording angle of 90° for each segment, which is the original design goal for an ESMA;
(ii) With the sound field rotation of the quadraphonic ESMA, a sound source placed at a central position tends to produce a less stable localization than that at a position closer to the microphones' on-axis directions (e.g., ±45°);
(iii) The binaural rendering of the ESMA recording produces more bimodal response distributions (e.g., front-back confusion) than the loudspeaker reproduction; this may be resolved by allowing head rotations in head-tracked VR scenarios.

Future work will examine the imaging accuracy of the ESMA in a practical recording environment with a finer resolution of source angles. Furthermore, the octagonal ESMA and ESMA-3D designs described in Secs. 4.5 and 4.6 will be evaluated. Investigations into the low-level spatial attributes of different 360° microphone arrays and their correlations with subjective preference and quality of experience in VR are currently underway. In addition, the influence of the acoustic characteristics of the recording venue on the perception of spatial attributes in 360° audio/visual recordings will be studied.

6 ACKNOWLEDGMENTS

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), UK, Grant Ref. EP/L019906/1. The author would like to thank the members of the Applied Psychoacoustics Lab (APL) at the University of Huddersfield who participated in the listening tests. He is also grateful to the editor and three anonymous reviewers of this paper for their constructive comments, and to Tom Robotham for the drawings of the microphone arrays used in this paper.

7 REFERENCES

[1] F. Rumsey, Spatial Audio (Focal Press, Oxford, 2001).
[2] E. Benjamin, R. Lee, and A. Heller, "Localization in Horizontal-Only Ambisonic System," presented at the 121st Convention of the Audio Engineering Society (2006 Oct.), convention paper 6967.
[3] E. Bates, M. Gorzel, L. Ferguson, H. O'Dwyer, and F. M. Boland, "Comparing Ambisonic Microphones: Part 1," presented at the 2016 AES International Conference on Sound Field Control (2016 Jul.), conference paper 6-3.
[4] M. Williams, "Microphone Arrays for Natural Multiphony," presented at the 91st Convention of the Audio Engineering Society (1991 Oct.), convention paper 3157.
[5] M. Williams, "Migration of 5.0 Multichannel Microphone Array Design to Higher Order MMAD (6.0, 7.0 & 8.0) With or Without the Inter-Format Compatibility Criteria," presented at the 124th Convention of the Audio Engineering Society (2008 May), convention paper 7480.
[6] M. Williams and G. Le Du, "Microphone Array Analysis for Multichannel Sound Recording," presented at the 107th Convention of the Audio Engineering Society (1999 Sep.), convention paper 4997.
[7] H. Lee, "Capturing and Rendering 360° VR Audio Using Cardioid Microphones," presented at the 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016 Sep.), conference paper 8-3.
[8] M. Williams, "Unified Theory of Microphone Systems for Stereophonic Sound Recording," presented at the 82nd Convention of the Audio Engineering Society (1987 Mar.), convention paper 2466.
[9] H. Wittek and G. Theile, "The Recording Angle—Based on Localization Curves," presented at the 112th Convention of the Audio Engineering Society (2002 May), convention paper 5568.
[10] H. Lee, D. Johnson, and M. Mironovs, "An Interactive and Intelligent Tool for Microphone Array Design," presented at the 143rd Convention of the Audio Engineering Society (2017 Oct.), e-Brief 390.
[11] G. Theile, "On the Performance of Two-Channel and Multichannel Stereophony," presented at the 88th Convention of the Audio Engineering Society (1990 Mar.), convention paper 2932.
[12] H. Wittek, Untersuchungen zur Richtungsabbildung mit L-C-R Hauptmikrofonen, Master's thesis, Institut für Rundfunktechnik (2000).
[13] G. Theile, "Natural 5.1 Music Recording Based on Psychoacoustic Principles," presented at the AES 19th International Conference: Surround Sound Techniques, Technology, and Perception (2001 Jun.), conference paper 1904.
[14] H. Lee and F. Rumsey, "Level and Time Panning of Phantom Images for Musical Sources," J. Audio Eng. Soc., vol. 61, pp. 753–767 (2013 Dec.).
[15] H. Lee, "Perceptually Motivated Amplitude Panning (PMAP) for Accurate Phantom Image Localization," presented at the 142nd Convention of the Audio Engineering Society (2017 May), convention paper 9770.
[16] A. Farina, "Advancements in Impulse Response Measurements by Sine Sweeps," presented at the 122nd Convention of the Audio Engineering Society (2007 May), convention paper 7121.
[17] V. Hansen and G. Munch, "Making Recordings for Simulation Tests in the Archimedes Project," J. Audio Eng. Soc., vol. 39, pp. 768–774 (1991 Oct.).
[18] G. Kearney and T. Doyle, "An HRTF Database for Virtual Loudspeaker Rendering," presented at the 139th Convention of the Audio Engineering Society (2015 Oct.), convention paper 9424.
[19] O. Kirkeby, P. A. Nelson, H. Hamada, and F. Orduña Bustamante, "Fast Deconvolution of Multichannel Systems Using Regularization," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 189–195 (1998 Mar.). DOI: https://doi.org/10.1109/89.661479
[20] J. A. Hartigan and P. M. Hartigan, "The Dip Test of Unimodality," Ann. Stat., vol. 13, pp. 70–84 (1985). DOI: https://doi.org/10.1214/aos/1176346577
[21] E. Benjamin, R. Lee, and A. Heller, "Is My Decoder Ambisonics?" presented at the 125th Convention of the Audio Engineering Society (2008 Oct.), convention paper 7553.
[22] G. Theile and G. Plenge, "Localization of Lateral Phantom Images," J. Audio Eng. Soc., vol. 25, pp. 196–200 (1977 Apr.).
[23] G. Martin, W. Woszczyk, J. Corey, and R. Quesnel, "Sound Source Localization in a Five Channel Surround Sound Reproduction System," presented at the 107th Convention of the Audio Engineering Society (1999 Sep.), convention paper 4994.
[24] F. Wightman and D. Kistler, "Resolution of Front–Back Ambiguity in Spatial Hearing by Listener and Source Movement," J. Acoust. Soc. Am., vol. 105, no. 5, pp. 2841–2853 (1999 May). DOI: https://doi.org/10.1121/1.426899
[25] P. Søndergaard and P. Majdak, "The Auditory Modeling Toolbox," in The Technology of Binaural Listening, edited by J. Blauert (Springer, Berlin, Heidelberg, 2013). DOI: https://doi.org/10.1007/978-3-642-37762-4
[26] L. R. Bernstein and C. Trahiotis, "The Normalized Correlation: Accounting for Binaural Detection across Center Frequency," J. Acoust. Soc. Am., vol. 100, no. 5, pp. 3774–3784 (1996). DOI: https://doi.org/10.1121/1.417237
[27] V. Pulkki and M. Karjalainen, Communication Acoustics: An Introduction to Speech, Audio and Psychoacoustics (Wiley, 2015).
[28] R. H. Whitworth and L. A. Jeffress, "Time versus Intensity in the Localization of Tones," J. Acoust. Soc. Am., vol. 33, pp. 925–929 (1961). DOI: https://doi.org/10.1121/1.1908849
[29] J. Blauert, Spatial Hearing (The MIT Press, Cambridge, 1997).
[30] D. Griesinger, "Stereo and Surround Panning in Practice," presented at the 112th Convention of the Audio Engineering Society (2002 May), convention paper 5564.
[31] R. Wallis and H. Lee, "The Effect of Interchannel Time Difference on Localization in Vertical Stereophony," J. Audio Eng. Soc., vol. 63, pp. 767–776 (2015 Oct.). DOI: https://doi.org/10.17743/jaes.2015.0069
[32] J. L. Barbour, "Elevation Perception: Phantom Images in the Vertical Hemisphere," presented at the AES 24th International Conference: Multichannel Audio, The New Reality (2003 Jun.), conference paper 14.
[33] M. Mironovs and H. Lee, "The Influence of Source Spectrum and Loudspeaker Azimuth on Vertical Amplitude Panning," presented at the 142nd Convention of the Audio Engineering Society (2017 May), convention paper 9782.
[34] H. Lee and C. Gribben, "Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array," J. Audio Eng. Soc., vol. 62, pp. 870–884 (2014 Dec.). DOI: https://doi.org/10.17743/jaes.2014.0045
[35] P. Geluso, "Capturing Height: The Addition of Z Microphones to Stereo and Surround Microphone Arrays," presented at the 132nd Convention of the Audio Engineering Society (2012 Apr.), convention paper 8595.
[36] ITU-R, Recommendation ITU-R BS.2051-1: Advanced Sound System for Programme Production (2017).
[37] R. Wallis and H. Lee, "The Reduction of Vertical Interchannel Crosstalk: The Analysis of Localization Thresholds for Natural Sound Sources," Appl. Sci., vol. 7, p. 278 (2017). DOI: https://doi.org/10.3390/app7030278
THE AUTHOR

Hyunkook Lee

Hyunkook Lee is a Senior Lecturer (i.e., Associate Professor) for music technology courses at the University of Huddersfield, UK, where he founded and leads the Applied Psychoacoustics Laboratory (APL). He is also a sound engineer with 20 years of experience in surround recording, mixing, and live sound. Dr. Lee's recent research advanced understanding about the perceptual mechanisms of vertical stereophonic localization and image spread as well as the phantom image elevation effect. This helped develop new 3D microphone array techniques, vertical mixing/upmixing techniques, and a virtual 3D panning method. His ongoing research topics include 3D sound perception, capture and reproduction, virtual acoustics, and objective sound quality metrics. From 2006 to 2010, Hyunkook was a Senior Research Engineer in audio R&D at LG Electronics, South Korea, where he participated in the standardizations of MPEG audio codecs and developed spatial audio algorithms for mobile devices. He received his degree in music and sound recording (Tonmeister) from the University of Surrey, UK, in 2002 and obtained his Ph.D. in spatial audio psychoacoustics from the Institute of Sound Recording (IoSR) at the same University in 2006. Hyunkook has been an active member of the AES since 2001 and received the AES Fellowship award at the 145th Convention in 2018.