JAES V61 1 2 PG17hirez
This paper presents a spatial encoding method for room impulse responses. The method is based on decomposing the spatial room impulse responses into a set of image-sources. The resulting image-sources can be used for room acoustics analysis and for multichannel convolution reverberation engines. The analysis method is applicable for any compact microphone array, and the reproduction can be realized with any of the current spatial reproduction methods. Listening test experiments with simulated impulse responses show that the proposed method produces an auralization indistinguishable from the reference in the best case.
[Flowchart: Measurement or simulation → Room Impulse Response → Microphone array → Microphone signals → Spatial analysis and encoding, e.g., SDM / SIRR / 1.OA / HOA → Encoded stream → Analysis and synthesis → Spatial decoding and reproduction, e.g., VBAP / 1.OA / HOA / WFS → Loudspeaker signals → Spatially Reproduced Acoustics.]

reflections. At each time moment t, the sound pressure at receiving location r_n has a scalar value, i.e., it is a scalar function h(r_n, x | t). The scalar value is the overall sum of the different sound pressure waves arriving at the same time at the receiver location. In the context of this paper the spatial room impulse response is measured with n = 1, . . ., N microphones, i.e., a microphone array.

The whole impulse response is altered by several acoustic phenomena. A majority of the acoustic events is attenuated according to the 1/r law and affected by air absorption. In addition, the frequency response of an event is altered by the absorption of the surfaces in the enclosure. Moreover, the directivities of the microphones and the sound source have an effect on the impulse response.

As time progresses, the number of acoustic events per time window increases. In room acoustics research and convolution reverberation engines, the impulse response is traditionally divided into three consecutive regions in time: the direct sound, the early reflections, and the late reverberation.

Fig. 2. The processing in the proposed spatial encoding method consists of localization and combining the omni-directional pressure signal with the estimated locations.
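To make the scalar pressure model h(r_n, x | t) concrete, the following toy sketch (illustrative Python/NumPy, not from the paper; the function name and the distances are invented for the example) builds an omni-directional impulse response as a sum of acoustic events attenuated by the 1/r law:

```python
import numpy as np

def toy_pressure_rir(distances, fs=48000, c=345.0, length=2000):
    """Toy omni-directional room impulse response: each acoustic event
    arrives after d/c seconds and is attenuated by the 1/r law.
    (Air and surface absorption are omitted for brevity.)"""
    h = np.zeros(length)
    for d in distances:
        n = int(round(fs * d / c))  # arrival time in samples
        if n < length:
            h[n] += 1.0 / d         # events arriving together sum up
    return h

# Direct sound at 3.45 m plus three reflections (invented distances):
h = toy_pressure_rir([3.45, 5.20, 5.20, 8.90])
```

Note that the two events at 5.20 m land in the same sample, so a single scalar value of h carries the sum of both wavefronts, exactly the situation the analysis below has to untangle.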
is interpolated with the exponential fit [21]. The TDOA estimates are denoted with

    τ̂_k = [τ̂^(k)_{1,2}, τ̂^(k)_{1,3}, . . . , τ̂^(k)_{N−1,N}]^T,

where N is the number of microphones, and the corresponding microphone position difference vectors are denoted with

    V = [r_1 − r_2, r_1 − r_3, . . . , r_{N−1} − r_N]^T.

The least squares solution for the slowness vector is then given as [22, p. 75]:

    m̂_k = V^+ τ̂_k,    (3)

where (·)^+ is the Moore-Penrose pseudo-inverse, and the direction of the arriving sound wave is given as n̂_k = −m̂_k/‖m̂_k‖. The distance to the image-source k is given directly by the time index and the speed of sound, d_k = c k Δt.

1.2.2 Step 2: Dividing the Omni-Directional Pressure Signal

The second step of the analysis selects one of the available omni-directional microphone signals as the pressure signal h_p. Ideally, the microphone for the pressure signal is located in the geometric center of the array. In this case, the analysis assigns each sample of the pressure impulse response h_p(t_k) a 3-D location x̂_k, which is the output from Step 1. The method has then encoded the spatial impulse response with four values per sample: the pressure value and the 3-D location of the sample.

In case the pressure microphone is not in the geometric center of the array, one has to predict the value of the pressure signal according to the image-source locations. This is done by first calculating the distance from the image-source location to the location of the pressure microphone r_p,

    d_k = ‖r_p − x_k‖,    (4)

and then assigning each image-source the pressure value h_p(f_s d_k/c). When using a plane wave propagation model, the distance is calculated as

    d_k = |n_k · (r_p − x_{k,0})|,    (5)

where n_k and x_{k,0} are the plane normal and a point on the plane, respectively.

Instead of predicting the pressure in the center of the array, one can predict the image-source locations at the location of the pressure signal. This is the easier choice because it does not require resampling of the signal. This paper applies neither of these approaches, since the pressure microphone is always located in the middle.

1.3 Limitations on the Performance and the Effect of the Window Size

Several aspects affect the accuracy of the analysis in SDM. When the noise level decreases and the number of microphones increases, the performance of the localization improves, as predicted by the Cramér-Rao lower bound (CRLB) (see, e.g., [18]). Other important factors are the time interval between the samples (Δt) and the size, or the dimensions, of the microphone array. The smaller these values are, the more spatial and temporal separation between individual acoustic events can be made. This improves the localization of individual acoustic events. Other methods require a larger aperture size to improve the approximations for low frequencies; in SDM this is not a requirement, since the lower frequencies can be estimated by elongating the window. However, this would also require that SDM processing is done for different frequency bands with different window sizes. This is further discussed in Section 4.1.

A limiting factor for the window size is the largest dimension of the microphone array. That is, the window size should be larger than the time it takes for a sound wave to travel through the array, i.e., L_t > 2d_max/c, where d_max is the maximum distance between any two microphones in the array. Theoretically, a large window size improves and worsens the localization performance at the same time. Namely, as the window size increases, the localization performance for a single acoustic event improves, as stated by the CRLB. However, the probability that more than one acoustic event is present in the analysis window also increases. This latter part is seen as a possible problem in the analysis and, therefore, it is recommended that the window size is selected such that it is just over 2d_max/c. In addition, if an acoustic event is assumed to be short, time-wise, increasing the window size would actually decrease the theoretical performance, since the energy of the noise in the time window increases relative to the energy of the signal, thus decreasing the signal-to-noise ratio.

The next part assesses the effect of the window size selection with a quantity called echo density. The echo density describes the average number of echoes in a room per time instant and is valid for any arbitrarily shaped enclosure [23, p. 92]. It is defined as

    dN_r/dt = 4π c³ t²/V,    (6)

where N_r is the number of reflections and V is the volume. Echo density is a useful tool for inspecting the effects of the window size selection on the number of acoustic events, i.e., image-sources, per time window. The time before which there are fewer than N_r reflections present in an analysis window of length Δt can be examined with

    τ_1 = √(N_r V / (Δt 4π c³)) ≈ 0.0014 √V.    (7)

The last approximation is obtained for less than one reflection, N_r → 1, with a window of Δt = 1 ms, and assuming that the speed of sound is constant, c = 345 m/s. For example, a window size of Δt = L_t = 1 ms produces the value τ_1 = 119 ms for a room with volume (30 × 20 × 12 = 7200 m³), which indicates that there is only one acoustic event present in the analysis window until 119 ms after the direct sound. Thus, the parameter τ_1 describes the average time after which there will be more than one reflection present in the analysis time window. The smaller the window size, the bigger the parameter τ_1 and the more accurate the localization of individual acoustic events.
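The least-squares localization of Eq. (3), the direction rule n̂_k = −m̂_k/‖m̂_k‖, and the distance d_k = c k Δt can be sketched in a few lines (an illustrative NumPy sketch, not the authors' implementation; the function name and the synthetic plane-wave test below are invented for the example):

```python
import numpy as np

def sdm_localize(mics, tdoa, k, fs=48000.0, c=345.0):
    """Estimate the image-source location for analysis window k.

    mics : (N, 3) array of microphone positions r_1 .. r_N [m]
    tdoa : TDOA estimates tau_(i,j) for all pairs i < j [s]
    """
    N = len(mics)
    # Microphone position difference vectors, V = [r_i - r_j] for i < j.
    V = np.array([mics[i] - mics[j]
                  for i in range(N) for j in range(i + 1, N)])
    # Least-squares slowness vector, m_k = V^+ tau_k (Eq. (3)).
    m = np.linalg.pinv(V) @ tdoa
    # Direction of the arriving sound wave, n_k = -m_k / ||m_k||.
    n = -m / np.linalg.norm(m)
    # Distance from the time index and the speed of sound, d_k = c k dt.
    d = c * k / fs
    return d * n  # estimated image-source location x_k
```

For a noiseless plane wave the TDOAs are exactly τ_{i,j} = m · (r_i − r_j), so the pseudo-inverse recovers the slowness vector and hence the arrival direction; with noisy TDOAs the same expression gives the least-squares fit.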
1.4 Rationale for SDM

The accurate localization of the first acoustic events with respect to time in the impulse responses, i.e., the direct sound and the first reflections, is possible as shown in [24] and [18], respectively. However, as time progresses the number of acoustic events per time window increases, and eventually more than one reflection arrives during the time window. In this case, a cross-correlation-based localization algorithm localizes the sound to the location of the reflection that is the strongest one in that time window. The strongest direction is selected because it shows as the strongest peak in the cross-correlation functions. An analogous example of this behavior with one localization algorithm is shown with speech sources in [25]. However, it is also possible that the estimated location is an intermediate point between the reflections within that analysis window. This is, for example, the case if the localization algorithm is based on the average direction of the sound intensity. Thus, the estimated location depends highly on the localization algorithm. The behavior of the localization algorithms in the case of several acoustic events should be further investigated, but here this is left for future research. In any case, SDM assumes in the spatial reproduction that the estimated location corresponds to the correct perceptual location. The assumption has been used previously, for example, in SIRR [10,26].

SDM produces the diffuse sound field naturally. Namely, in SDM each time step has a random direction in a diffuse sound field. The total directional distribution over the diffuse sound field, i.e., the late reverberation, is then uniform. Further evidence for this is provided in a recently published article that uses SDM for spatial analysis [27].

Since the first acoustic events are correctly localized from the spatial room impulse response in the SDM framework, and these events are known to have a very prominent effect on the perception of spatial sound [1,28], the resulting auralization should be credible. Moreover, the late part of the spatial room impulse response will be naturally presented as diffuse by SDM because multiple arriving reflections will produce random directions.

1.5 An Example of the Analysis with SDM

This section demonstrates the principles of SDM with an illustration of the analysis results for a spatial room impulse response. The spatial room impulse response is recorded from a simulation of a shoebox room of size (20 × 30 × 12) m³. Furthermore, the source was at [16.04, 8.06, 3.58] m and the receiver at [7.35, 7.92, 3.22] m. In addition, the applied window was a 1.33 ms Hanning window and the overlap between two consecutive windows was 99%. The speed of sound was set to c = 345 m/s, the sampling frequency to f_s = 48 kHz, the reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

Fig. 3, where the radius of each circle corresponds to the amplitude of the respective image-source, illustrates the results of the analysis. As can be seen in Fig. 3, the early part of the simulated spatial room impulse response (a) is very similar to the one analyzed by SDM (b).

[Fig. 3: two scatter plots over X- and Y-coordinates from −100 m to 100 m; (a) original image-source locations and amplitudes, (b) analyzed image-source locations and amplitudes.]

Fig. 3. An example of the locations and amplitudes of (a) simulated image-sources and (b) decomposed image-sources with SDM from a spatial room impulse response. The area of each filled circle illustrates the energy of that image-source. The image-sources with the highest energy are correctly analyzed.

2 LISTENING TEST EXPERIMENTS

This section describes the listening test setup, the listening room, the simulated room acoustic conditions, and the source signals. In addition, the listening test procedures and results are presented.

This paper uses Vector Base Amplitude Panning (VBAP) [8] as the spatial reproduction technique for the listening tests. Other reproduction methods could also be used, but VBAP is preferred here since it can be implemented for 3-D spatial sound with fewer loudspeakers than the other methods and since it provides good overall subjective quality. The listening tests compare the proposed method to SIRR [10,26], which can be considered the
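As a numerical check of the echo-density threshold of Eq. (7) for a room of this size (an illustrative Python sketch; the function name is invented here):

```python
import math

def tau_1(V, dt, c=345.0, N_r=1.0):
    """Time after the direct sound until, on average, more than N_r
    reflections fall within an analysis window of length dt (Eq. (7))."""
    return math.sqrt(N_r * V / (dt * 4.0 * math.pi * c ** 3))

# Large simulated room (30 x 20 x 12 = 7200 m^3) with a 1 ms window,
# the example from Sec. 1.3:
print(tau_1(V=7200.0, dt=1e-3))  # about 0.118 s, i.e., the quoted ~119 ms
```

The ≈ 0.0014 √V shorthand of Eq. (7) is just this expression with N_r = 1, dt = 1 ms, and c = 345 m/s, with the coefficient rounded.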
Table 1. Reverberation time (RT), sound pressure level (SPL), and noise level (NL) in the listening room. Sound pressure level is given with respect to the reference (Ref.) value at the 200 Hz-4 kHz frequency band. In the calibration, the SPL was 87 dB, which gives a signal-to-noise ratio of more than 45 dB for each octave band.

Octave band          RT [s]  SPL [dB]  NL [dB]
Ref. [200 Hz-4 kHz]  0.14     0.00     -
125 Hz               0.24     1.66     39.9
250 Hz               0.17     0.47     35.7
500 Hz               0.13     0.36     32.9
1 kHz                0.13     0.02     28.6
2 kHz                0.12     0.16     20.4
4 kHz                0.11    −1.03     18.9
8 kHz                0.10    −2.93     21.5

Table 2. The azimuth and elevation directions and distance of each individual loudspeaker (LPS) in the 14-channel loudspeaker reproduction setup in the listening room. 8, 4, and 2 loudspeakers are located approximately at the lateral plane, 45 degrees above the lateral plane, and −45 degrees below the lateral plane, respectively. The loudspeakers were localized using the method presented in [24].

LPS #  Azimuth [°]  Elevation [°]  Distance [m]
1        46.7          0.1           1.01
2        89.6         −1.5           1.02
3       134.2         −0.1           0.98
4       179.8         −0.2           0.95
5      −135.9          0.1           1.02
6       −91.6          0.2           0.94
7       −45.6          1.0           0.96
8        −0.3          1.6           0.98
9        45.9         43.5           1.30
10      135.1         40.5           1.33
11     −137.9         42.3           1.40
12      −45.4         46.4           1.33
13       24.0        −46.1           1.29
14      −19.6        −45.1           1.27

Table 3. Source and receiver positions, source signals, dimensions of the rooms, and sample naming used in the listening test. Speed of sound was set to c = 345 m/s, sampling frequency to f_s = 48 kHz, reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

Sample     Source position [m]     Receiver position [m]
(Signal)   x      y      z         x      y      z
Large room (30 × 20 × 12) m³
A (Sp.)    16.04   8.06  3.58      7.35   7.92  3.22
B (Tr.)    17.44  12.81  2.88      2.64  13.48  3.72
C (Ca.)    20.37  11.99  2.52      3.10  12.10  2.86
Small room (5 × 3 × 2.8) m³
D (Sp.)     3.44   0.80  1.53      1.02   0.64  1.40
E (Tr.)     3.87   1.45  1.65      0.76   1.39  1.33
F (Ca.)     3.78   0.85  1.81      1.24   0.97  2.07
Sp.: Speech, Tr.: Trombone, Ca.: Castanet

state-of-the-art spatial sound encoding method for spatial room impulse responses, at least for VBAP. SIRR also operates under the same assumption as SDM, that the binaural cues are produced correctly.

2.1 Listening Room Setup and Stimuli

Listening tests were conducted in an acoustically treated room with dimensions of (x × y × z: 3.0 × 5.1 × 3.8) m³. Table 1 shows the reverberation time, sound pressure level, and noise level in the listening room. The listening room fulfills the recommendations given by the ITU in [29], with the exceptions that the noise level fulfills the noise rating (NR) 30 requirement, whereas the recommendation is NR 15, and the listening distance is about 1.2 meters on average, whereas the recommendation is more than two meters.

The listening room includes a 3-D 14-channel loudspeaker setup, of which 12 loudspeakers are of type Genelec 8030A and two are of type Genelec 1029A. Table 2 gives the location of each loudspeaker with respect to the listening position at the origin (0,0,0) m. Each loudspeaker is calibrated so that they all produce an equal A-weighted sound pressure level with slow temporal averaging in the listening position for a band-pass filtered noise from 100 Hz to 5 kHz. Since the distance to the reference position is not the same for all loudspeakers, they are all delayed with digital signal processing so that each loudspeaker is at a virtual distance of 1.40 m.

The simulated impulse responses for the listening test were produced with the image-source method [11] in two modelled rectangular rooms. In the image-source method, throughout this paper, the reflection coefficient is set to 0.85, the speed of sound to 345 m/s, the sampling frequency to 48 kHz, and reflections up to the 45th order are simulated. In addition, Table 3 shows the room dimensions and the source and receiver positions used in the image-source method. Two shoebox rooms, a large and a small one, are simulated for the listening tests. The large and the small room have wide band reverberation times of 2.0 s and 0.4 s, respectively. In all the cases, the room impulse responses are truncated from −40 dB onwards according to the backward integrated Schroeder curve.

2.1.1 Reference and Anchor

The reference was generated with the image-source method. The location and amplitude of each image-source was transferred into a virtual source, which was panned with VBAP for the current loudspeaker setup [8]. Finally, to simulate a real room impulse response measurement situation, the anechoic impulse response of a Genelec 1029A was convolved with the impulse responses, and the impulse response was filtered with air absorption filters, implemented according to [30].

The anchor for the listening test was selected to be the same mono impulse response as in the reference, but instead of VBAP processing according to the directional information obtained from the image-source method, it was used directly in the front loudspeaker (# 8 in Table 2).

2.1.2 Spatial Encoding Methods

Similarly to the reference, the image-source method was used to generate spatial impulse responses for a virtual
[Fig. 4 block diagram. Recoverable labels: Microphone signals; Image source method; B-format conversion (W, X, Y, Z); SDM → x(t), y(t), z(t), h(t) → VBAP; SIRR → Non-diffuse stream → VBAP and Diffuse stream → Decorrelation; Air absorption filter; simulation parameters: reflection coefficient 0.85, speed of sound 345 m/s, reflection order 45.]

Fig. 4. Processing of the samples in the listening test experiments. The shaded areas highlight the different spatial encoding methods (from top to bottom: SDM, SIRR, and reference).
microphone array. The microphone array consists of seven microphones, of which six are on a sphere and one is in the geometric center of the array, as shown in Table 4. The central microphone is used as the microphone for the pressure signal in the spatial encoding methods.

Table 4. Origin-centered coordinates for the microphone arrays. Spacing d_spc is equal for each microphone pair on a single axis.

Microphone #  X [m]      Y [m]      Z [m]
1              d_spc/2    0          0
2             −d_spc/2    0          0
3              0          d_spc/2    0
4              0         −d_spc/2    0
5              0          0          d_spc/2
6              0          0         −d_spc/2
7              0          0          0

The proposed spatial encoding method, SDM, was compared to two versions of SIRR. The first version of SIRR, as well as SDM, was implemented with seven microphones, and the second version of SIRR was implemented with 13 microphones. Their naming is the following:

• SDM with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SDML7,
• SIRR with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SIRRL7, and
• SIRR with two microphone arrays with spacings d_spc = 100 mm and d_spc = 25 mm and one microphone in the geometric center is named SIRR13.

The microphone arrays were selected as such since SIRR processing can be implemented for them [10,26]. Namely, SIRR requires the three components of particle velocity, which can be calculated with the gradient microphone technique, and a pressure signal, which is the microphone in the geometric center.

SIRR13 analyzes the room impulse responses separately for the large and small spacing and combines them in the post-processing phase. The combination takes the analysis result for low frequencies, below 1 kHz, from the large spacer, and for high frequencies, above 1 kHz, from the smaller spacer. Before the addition, the analyzed signals for the small and large spacer are low-pass and high-pass filtered with a 10th-order Butterworth IIR filter, respectively. The motivation for such processing is that the present authors have used such an array in measurements of concert halls [31].

To compare the methods in the same conditions, all the analyses use a Hanning window of 1.33 ms (64 samples at 48 kHz). SIRR has an overlap of 50% between two consecutive windows, and SDM has an overlap of 99% (63 samples), as explained in Section 1.2 (Step 1). The window size was selected as 1.33 ms since it is the one used in the original SIRR paper [26]. It should be emphasized that for SDM the optimal window size is much smaller than the selected one. Especially for the smaller simulated room, it is expected that the lengthy time window causes problems in SDM, since the parameter τ_1 is 1.4 ms. However, since the goal is to compare these two techniques in the same conditions, the same window size is used for both. Moreover, a virtual microphone-based synthesis, originally developed for DirAC in [32], was noticed to provide a more natural sound for SIRR and was included in the processing.

The output of the SDM, i.e., the extracted image-sources, is directly panned with VBAP for the current loudspeaker setup. The output of the SIRR analysis is processed as described in [10] and [26] for VBAP reproduction. In addition, the diffuse part of the SIRR is implemented with the Hybrid Method described in [26]. The processing of the listening test samples for SDM, SIRR, and the reference case is illustrated in Fig. 4.

2.1.3 Source Signals and Test Samples

Approximately ten seconds of male speech, trombone, and castanets were selected as the source signals. Each sample was convolved separately with the corresponding 14-channel VBAP output for the reference, SIRR, or SDM. The test samples are named from A to F as indicated in Table 3.

2.2 Listening Test Procedure

The task in the listening test is to compare the "similarity" of the spatially encoded samples with the reference sample,
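For the array geometry of Table 4, the window-length condition of Sec. 1.3, L_t > 2 d_max/c, is easy to verify (an illustrative Python sketch; the function and variable names are invented here):

```python
import itertools
import math

def min_window_length(mics, c=345.0):
    """Smallest admissible analysis window, 2 * d_max / c (Sec. 1.3)."""
    d_max = max(math.dist(p, q) for p, q in itertools.combinations(mics, 2))
    return 2.0 * d_max / c

d_spc = 0.100  # 100 mm spacing (the SDML7 / SIRRL7 array)
mics = [(d_spc / 2, 0, 0), (-d_spc / 2, 0, 0),
        (0, d_spc / 2, 0), (0, -d_spc / 2, 0),
        (0, 0, d_spc / 2), (0, 0, -d_spc / 2),
        (0, 0, 0)]  # Table 4 geometry

print(min_window_length(mics) * 1e3)  # about 0.58 ms
```

Here d_max = d_spc = 0.1 m (two opposite microphones on the same axis), so the 1.33 ms Hanning window used in the experiments comfortably satisfies the condition.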
Fig. 5. Screen capture of the user interface (UI) used in the listening tests. The subjects can freely move, listen to, and rate the samples. Note that in the UI the alphabet stands for "method."

[Fig. 6 box plot: similarity (0 to 1) of samples A-F for Ref., SDML7, SIRRL7, SIRR13, and Anchor. Legend: A: Speech, large room; B: Trombone, large room; C: Castanet, large room; D: Trombone, small room; E: Speech, small room; F: Castanet, small room.]

Fig. 6. Listening test results: the thicker boxes with solid color illustrate the 25 and 75 percentiles, the thinner lines illustrate the most extreme data points, the circles outside the boxes illustrate outliers, and the dots the median.

Table 5. The attributes that the subjects used for assessing the similarity according to the interviews. Most of the attributes are translated from Finnish to English.

Timbral aspects: . . .
Spatial aspects: localization (×5), spatial impression (×4), the amount of reverberation (×4), distance (×3), spatial width, . . .
begin with. This requires additional research and is therefore left for future work. In addition, future work includes open source implementations of the SDM encoding for a general pressure microphone array, a B-format microphone, and the decoding implementations for wave field synthesis and higher order Ambisonics. The problem of increased clarity may not be present in the other reproduction approaches.

In this paper the room acoustic simulation used ideal specular reflections, which is an inherent property of the applied image-source room simulation method. SDM should also be tested with diffuse reflections. However, the generation of the reference case for diffuse reflections is problematic, since for the reference case the direction, time of arrival, and pressure value for each time instant are required. This information is available in beam-tracing or ray-tracing methods. Unfortunately, these methods neglect the temporal spreading of the reflections and consider that diffuse reflections only introduce spatial spreading for the reflected sound. Moreover, the room acoustic simulation methods that aim to solve the wave equation, e.g., the finite element method, the boundary element method, and finite differences in the time domain, may generate the correct pressure values, but they do not produce directional information. The only method that produces all the necessary information and takes into account the temporal and spatial spreading is presented in [35], but it only applies to low frequencies. The comparison against a reference case with diffuse reflections is therefore currently not possible.

SIRR was implemented with the parameters given in the original paper [26]. It should be emphasized that the advances made in Directional Audio Coding could possibly improve the quality of the SIRR. In informal listening, for example, the multi-rate implementation [36] was found to increase the overall quality of SIRR. Studies using SIRR with alternative processing approaches are currently not available in the literature.

5 CONCLUSIONS

This paper presented a spatial encoding method for spatial room impulse responses. The analysis of the method estimates the location in very small time windows at every discrete time sample, where the localization method depends on the applied microphone array and acoustic conditions. Each of the discrete time samples is therefore represented by an image-source. Thus, the analysis results in a set of image-sources. Then, depending on the spatial reproduction method, the samples are distributed to several reproduction channels to obtain individual impulse responses for all reproduction channels.

The main advantage of the method follows from the decomposition into image-sources. Namely, the method can be applied to any arbitrary microphone array, and the spatial reproduction method can be any of a variety of existing techniques. It should be emphasized that the method is not designed for a continuous signal, but for spatial room impulse responses, which can then be convolved with an anechoic signal. In this paper the applied microphone array was an open spherical microphone array with six microphones, with an additional seventh microphone in the geometric center of the array.

Listening test experiments showed that the presented method produces sound that is indistinguishable from a reference sound in the best case. Overall, the similarity of the sound samples encoded with the presented method was perceived to be closer to the reference than that of a state-of-the-art method in the same conditions.

6 ACKNOWLEDGMENT

This work was supported by the ERC under grant agreement no. [203636] and by the Academy of Finland under agreement nos. [218238 and 140786]. The authors are grateful to Prof. Lauri Savioja, Prof. Ville Pulkki, and Mr. Mikko-Ville Laitinen for discussions. The authors would also like to thank the anonymous reviewers for their valuable comments, which helped to improve the quality of this paper. Dr. Alex Southern is thanked for providing the implementation for the air absorption filter computations.

7 REFERENCES

[1] T. Lokki, H. Vertanen, A. Kuusinen, J. Pätynen, and S. Tervo, "Concert Hall Acoustics Assessment with Individually Elicited Attributes," J. Acoust. Soc. Am., vol. 130, pp. 835–849 (2011 Aug.).
[2] T. Lokki, J. Pätynen, S. Tervo, S. Siltanen, and L. Savioja, "Engaging Concert Hall Acoustics Is Made Up of Temporal Envelope Preserving Reflections," J. Acoust. Soc. Am., vol. 129, pp. EL223–EL228 (2011 Apr.).
[3] J. Daniel, R. Nicol, and S. Moreau, "Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging," presented at the 114th Convention of the Audio Engineering Society (2003 Mar.), convention paper 5788.
[4] A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic Control by Wave Field Synthesis," J. Acoust. Soc. Am., vol. 93, no. 5, pp. 2764–2778 (1993).
[5] M. M. Boone, E. N. G. Verheijen, and P. F. van Tol, "Spatial Sound-Field Reproduction by Wave-Field Synthesis," J. Audio Eng. Soc., vol. 43, pp. 1003–1012 (1995 Dec.).
[6] D. Hammershøi and H. Møller, "Binaural Technique—Basic Methods for Recording, Synthesis, and Reproduction," in Communication Acoustics, ch. 9, pp. 223–254 (Springer-Verlag, New York, NY, USA, 2005).
[7] D. Schönstein and B. F. G. Katz, "Variability in Perceptual Evaluation of HRTFs," J. Audio Eng. Soc., vol. 60, pp. 783–793 (2012 Oct.).
[8] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., vol. 45, pp. 456–466 (1997 Jun.).
[9] F. Zotter and M. Frank, "All-Round Ambisonic Panning and Decoding," J. Audio Eng. Soc., vol. 60, pp. 807–820 (2012 Oct.).
[10] J. Merimaa and V. Pulkki, "Spatial Impulse Response Rendering I: Analysis and Synthesis," J. Audio Eng. Soc., vol. 53, pp. 1115–1127 (2005 Dec.).
[11] J. B. Allen and D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950 (1979).
[12] U. P. Svensson, R. I. Fred, and J. Vanderkooy, "An Analytic Secondary Source Model of Edge Diffraction Impulse Responses," J. Acoust. Soc. Am., vol. 106, pp. 2331–2344 (1999).
[13] B.-I. Dalenbäck, M. Kleiner, and P. Svensson, "A Macroscopic View of Diffuse Reflection," J. Audio Eng. Soc., vol. 42, pp. 793–807 (1994 Oct.).
[14] J.-M. Jot, L. Cerveau, and O. Warusfel, "Analysis and Synthesis of Room Reverberation Based on a Statistical Time-Frequency Model," presented at the 103rd Convention of the Audio Engineering Society (1997 Sept.), convention paper 4629.
[15] C. Zhang, Z. Zhang, and D. Florêncio, "Maximum Likelihood Sound Source Localization for Multiple Directional Microphones," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 125–128 (2007).
[16] D. Levin, E. A. P. Habets, and S. Gannot, "Maximum Likelihood Estimation of Direction of Arrival Using an Acoustic Vector-Sensor," J. Acoust. Soc. Am., vol. 131, no. 2, pp. 1240–1248 (2012).
[17] S. Tervo, "Direction Estimation Based on Sound Intensity Vectors," European Signal Processing Conference, Glasgow, Scotland, Aug. 24–28, pp. 700–704 (2009).
[18] S. Tervo, J. Pätynen, and T. Lokki, "Acoustic Reflection Localization from Room Impulse Responses," Acta Acustica united with Acustica, vol. 98, pp. 418–440 (2012).
[19] A. Host-Madsen, "On the Existence of Efficient Estimators," IEEE Trans. Signal Processing, vol. 48, no. 11, pp. 3028–3031 (2000).
[20] C. Knapp and G. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. Acoust., Speech and Signal Proc., vol. 24, no. 4, pp. 320–327 (1976).
[21] L. Zhang and X. Wu, "On Cross Correlation Based Discrete Time Delay Estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 981–984 (2005).
[22] T. Pirinen, Confidence Scoring of Time Delay Based Direction of Arrival Estimates and a Generalization to Difference Quantities, Ph.D. thesis, Tampere University of Technology, 2009, publication 854.
[23] H. Kuttruff, Room Acoustics, 4th ed. (Spon Press, New York, NY, USA, 2000).
[24] S. Tervo, T. Lokki, and L. Savioja, "Maximum Likelihood Estimation of Loudspeaker Locations from Room Impulse Responses," J. Audio Eng. Soc., vol. 59, pp. 845–857 (2011 Nov.).
[25] A. Brutti, M. Omologo, and P. Svaizer, "Multiple Source Localization Based on Acoustic Map De-emphasis," EURASIP J. Audio, Speech, and Music Processing, 2010, paper 147495.
[26] V. Pulkki and J. Merimaa, "Spatial Impulse Response Rendering II: Reproduction of Diffuse Sound and Listening Tests," J. Audio Eng. Soc., vol. 54, pp. 3–20 (2006 Jan./Feb.).
[27] J. Pätynen, S. Tervo, and T. Lokki, "Analysis of Concert Hall Acoustics via Visualizations of Time-Frequency and Spatiotemporal Responses," J. Acoust. Soc. Am., vol. 133, no. 17 (2013 Jan.).
[28] J. S. Bradley, H. Sato, M. Picard, et al., "On the Importance of Early Reflections for Speech in Rooms," J. Acoust. Soc. Am., vol. 113, no. 6, pp. 3233–3244 (2003).
[29] International Telecommunication Union, ITU-R BS.1116-1: Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems, Geneva, 1997.
[30] H. E. Bass, H.-J. Bauer, and L. B. Evans, "Atmospheric Absorption of Sound: Analytical Expressions," J. Acoust. Soc. Am., vol. 52, no. 3B, pp. 821–825 (1972).
[31] T. Lokki, J. Pätynen, A. Kuusinen, and S. Tervo, "Disentangling Preference Ratings of Concert Hall Acoustics Using Subjective Sensory Profiles," J. Acoust. Soc. Am., vol. 132, pp. 3148–3161 (2012 Nov.).
[32] J. Vilkamo, T. Lokki, and V. Pulkki, "Directional Audio Coding: Virtual Microphone-Based Synthesis and Subjective Evaluation," J. Audio Eng. Soc., vol. 57, pp. 709–724 (2009 Sept.).
[33] B. B. Schultz, "Levene's Test for Relative Variation," Systematic Biology, vol. 34, no. 4, pp. 449–456 (1985).
[34] M. A. Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons," J. Am. Statistical Assoc., vol. 69, no. 347, pp. 730–737 (1974 Sept.).
[35] S. Siltanen, T. Lokki, S. Tervo, and L. Savioja, "Modeling Incoherent Reflections from Rough Room Surfaces with Image Sources," J. Acoust. Soc. Am., vol. 132, pp. 4604–4614 (2012 June).
[36] T. Pihlajamäki and V. Pulkki, "Low-Delay Directional Audio Coding for Real-Time Human-Computer Interaction," presented at the 130th Convention of the Audio Engineering Society (2011 May), convention paper 8413.
THE AUTHORS

Dr. Sakari Tervo is a post-doctoral researcher in the Department of Media Technology, Aalto University School of Science, from where he also received a D.Sc. degree in acoustics in January 2012. The topic of his research is objective room acoustic measures.

Previously he has been working in the Department of Signal Processing, Tampere University of Technology, from where he also graduated as an M.Sc. majoring in audio signal processing in 2006. He has visited the Digital Signal Processing Group of Philips Research, Eindhoven, The Netherlands, in 2007 and the Department of Electronics of the University of York, United Kingdom, in 2010.

Dr. Jukka Pätynen was born in 1981 in Espoo, Finland. He received M.Sc. and D.Sc. (Tech.) degrees from the Helsinki University of Technology, Finland, in 2007, and Aalto University, Finland, in 2011, respectively. He is currently working as a post-doctoral researcher in the Department of Media Technology, Aalto University. His research activities include room acoustics, musical acoustics, and signal processing.

Antti Kuusinen received an M.Sc. degree in March 2012. He is currently working as a doctoral candidate at the Department of Media Technology at Aalto University School of Science under the supervision of Professor Tapio Lokki. In his doctoral research he focuses on the perceptual characteristics of concert hall acoustics. This research also includes development of descriptive vocabulary for sound, music, and room acoustics as well as elaboration of listening test methodology and statistical data analysis.

Dr. Tapio Lokki was born in Helsinki, Finland, in 1971. He has studied acoustics, audio signal processing, and computer science at the Helsinki University of Technology (TKK) and received an M.Sc. degree in electrical engineering in 1997 and a D.Sc. (Tech.) degree in computer science and engineering in 2002.

At present Dr. Lokki is an Associate Professor (tenured) with the Department of Media Technology at Aalto University. Prof. Lokki leads the virtual acoustics team jointly with Prof. Lauri Savioja. The research aims to create novel objective and subjective ways to evaluate concert hall acoustics. In addition, the team develops physically-based room acoustics modeling methods to obtain authentic auralization. Furthermore, the team studies augmented reality audio. The team is funded by the Academy of Finland and by Prof. Lokki's Starting Grant from the European Research Council (ERC).