JAES V61 1 2 PG17hirez
This paper presents a spatial encoding method for room impulse responses. The method is based on decomposing the spatial room impulse responses into a set of image-sources. The resulting image-sources can be used for room acoustics analysis and for multichannel convolution reverberation engines. The analysis method is applicable for any compact microphone array, and the reproduction can be realized with any of the current spatial reproduction methods. Listening test experiments with simulated impulse responses show that the proposed method produces an auralization indistinguishable from the reference in the best case.
[Flowchart: Measurement or simulation → Room Impulse Response → Microphone array → Microphone signals → Spatial analysis and encoding, e.g., SDM / SIRR / 1.OA / HOA → Encoded stream → Analysis and synthesis → Spatial decoding and reproduction, e.g., VBAP / 1.OA / HOA / WFS → Loudspeaker signals → Spatially Reproduced Acoustics.]

reflections. At each time moment t, the sound pressure at receiving location r_n has a scalar value, i.e., it is a scalar function h(r_n, x | t). The scalar value is the overall sum of the different sound pressure waves arriving at the same time at the receiver location. In the context of this paper the spatial room impulse response is measured with n = 1, . . ., N microphones, i.e., a microphone array.

The whole impulse response is altered by several acoustic phenomena. A majority of the acoustic events is attenuated according to the 1/r law and affected by air absorption. In addition, the frequency response of an event is altered by the absorption of the surfaces in the enclosure. Moreover, the directivities of the microphones and the sound source have an effect on the impulse response.

As time progresses, the number of acoustic events per time window increases. In room acoustics research and convolution reverberation engines, the impulse response is traditionally divided into three consecutive regions in time: the direct sound, the early reflections, and the late reverberation.

Fig. 2. The processing in the proposed spatial encoding method consists of localization and combining the omni-directional pressure signal with the estimated locations.
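To make the scalar pressure model h(r_n, x | t) concrete, the following toy sketch (illustrative Python/NumPy, not from the paper; the function name and the distances are invented for the example) builds an omni-directional impulse response as a sum of acoustic events attenuated by the 1/r law:

```python
import numpy as np

def toy_pressure_rir(distances, fs=48000, c=345.0, length=2000):
    """Toy omni-directional room impulse response: each acoustic event
    arrives after d/c seconds and is attenuated by the 1/r law.
    (Air and surface absorption are omitted for brevity.)"""
    h = np.zeros(length)
    for d in distances:
        n = int(round(fs * d / c))  # arrival time in samples
        if n < length:
            h[n] += 1.0 / d         # events arriving together sum up
    return h

# Direct sound at 3.45 m plus three reflections (invented distances):
h = toy_pressure_rir([3.45, 5.20, 5.20, 8.90])
```

Note that the two events at 5.20 m land in the same sample, so a single scalar value of h carries the sum of both wavefronts, exactly the situation the analysis below has to untangle.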
is interpolated with the exponential fit [21]. The TDOA estimates are denoted with

    τ̂_k = [τ̂^(k)_{1,2}, τ̂^(k)_{1,3}, . . . , τ̂^(k)_{N−1,N}]^T,

where N is the number of microphones, and the corresponding microphone position difference vectors are denoted with

    V = [r_1 − r_2, r_1 − r_3, . . . , r_{N−1} − r_N]^T.

The least squares solution for the slowness vector is then given as [22, p. 75]:

    m̂_k = V^+ τ̂_k,    (3)

where (·)^+ is the Moore-Penrose pseudo-inverse, and the direction of the arriving sound wave is given as n̂_k = −m̂_k/‖m̂_k‖. The distance to the image-source k is given directly by the time index and the speed of sound, d_k = c k Δt.

1.2.2 Step 2: Dividing the Omni-Directional Pressure Signal

The second step of the analysis selects one of the available omni-directional microphone signals as the pressure signal h_p. Ideally, the microphone for the pressure signal is located in the geometric center of the array. In this case, the analysis assigns each sample of the pressure impulse response h_p(t_k) a 3-D location x̂_k, which is the output from Step 1. The method has then encoded the spatial impulse response with four values per sample: the pressure value and the 3-D location of the sample.

In case the pressure microphone is not in the geometric center of the array, one has to predict the value of the pressure signal according to the image-source locations. This is done by first calculating the distance from the image-source location to the location of the pressure microphone r_p,

    d_k = ‖r_p − x_k‖,    (4)

and then assigning each image-source the pressure value h_p(f_s d_k/c). When using a plane wave propagation model, the distance is calculated as

    d_k = |n_k · (r_p − x_{k,0})|,    (5)

where n_k and x_{k,0} are the plane normal and a point on the plane, respectively.

Instead of predicting the pressure in the center of the array, one can predict the image-source locations at the location of the pressure signal. This is the easier choice because it does not require resampling of the signal. This paper applies neither of these approaches, since the pressure microphone is always located in the middle.

1.3 Limitations on the Performance and the Effect of the Window Size

Several aspects affect the accuracy of the analysis in SDM. When the noise level decreases and the number of microphones increases, the performance of the localization improves, as predicted by the Cramér-Rao lower bound (CRLB) (see, e.g., [18]). Other important factors are the time interval between the samples (Δt) and the size, or the dimensions, of the microphone array. The smaller these values are, the more spatial and temporal separation between individual acoustic events can be made. This improves the localization of individual acoustic events. Other methods require a larger aperture size to improve the approximations for low frequencies; in SDM this is not a requirement, since the lower frequencies can be estimated by elongating the window. However, this would also require that SDM processing is done for different frequency bands with different window sizes. This is further discussed in Section 4.1.

A limiting factor for the window size is the largest dimension of the microphone array. That is, the window size should be larger than the time it takes for a sound wave to travel through the array, i.e., L_t > 2d_max/c, where d_max is the maximum distance between any two microphones in the array. Theoretically, a large window size improves and worsens the localization performance at the same time. Namely, as the window size increases, the localization performance for a single acoustic event improves, as stated by the CRLB. However, the probability that more than one acoustic event is present in the analysis window also increases. This latter part is seen as a possible problem in the analysis and, therefore, it is recommended that the window size is selected such that it is just over 2d_max/c. In addition, if an acoustic event is assumed to be short, time-wise, increasing the window size would actually decrease the theoretical performance, since the energy of the noise in the time window increases relative to the energy of the signal, thus decreasing the signal-to-noise ratio.

The next part assesses the effect of the window size selection with a quantity called echo density. The echo density describes the average number of echoes in a room per time instant and is valid for any arbitrarily shaped enclosure [23, p. 92]. It is defined as

    dN_r/dt = 4π c³ t²/V,    (6)

where N_r is the number of reflections and V is the volume. Echo density is a useful tool for inspecting the effects of the window size selection on the number of acoustic events, i.e., image-sources, per time window. The time before which there are fewer than N_r reflections present in an analysis window of length Δt can be examined with

    τ_1 = √(N_r V / (Δt 4π c³)) ≈ 0.0014 √V.    (7)

The last approximation is obtained for less than one reflection, N_r → 1, with a window of Δt = 1 ms, and assuming that the speed of sound is constant, c = 345 m/s. For example, a window size of Δt = L_t = 1 ms produces the value τ_1 = 119 ms for a room with volume (30 × 20 × 12 = 7200 m³), which indicates that there is only one acoustic event present in the analysis window until 119 ms after the direct sound. Thus, the parameter τ_1 describes the average time after which there will be more than one reflection present in the analysis time window. The smaller the window size, the bigger the parameter τ_1 and the more accurate the localization of individual acoustic events.
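The least-squares localization of Eq. (3), the direction rule n̂_k = −m̂_k/‖m̂_k‖, and the distance d_k = c k Δt can be sketched in a few lines (an illustrative NumPy sketch, not the authors' implementation; the function name and the synthetic plane-wave test below are invented for the example):

```python
import numpy as np

def sdm_localize(mics, tdoa, k, fs=48000.0, c=345.0):
    """Estimate the image-source location for analysis window k.

    mics : (N, 3) array of microphone positions r_1 .. r_N [m]
    tdoa : TDOA estimates tau_(i,j) for all pairs i < j [s]
    """
    N = len(mics)
    # Microphone position difference vectors, V = [r_i - r_j] for i < j.
    V = np.array([mics[i] - mics[j]
                  for i in range(N) for j in range(i + 1, N)])
    # Least-squares slowness vector, m_k = V^+ tau_k (Eq. (3)).
    m = np.linalg.pinv(V) @ tdoa
    # Direction of the arriving sound wave, n_k = -m_k / ||m_k||.
    n = -m / np.linalg.norm(m)
    # Distance from the time index and the speed of sound, d_k = c k dt.
    d = c * k / fs
    return d * n  # estimated image-source location x_k
```

For a noiseless plane wave the TDOAs are exactly τ_{i,j} = m · (r_i − r_j), so the pseudo-inverse recovers the slowness vector and hence the arrival direction; with noisy TDOAs the same expression gives the least-squares fit.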
1.4 Rationale for SDM

The accurate localization of the first acoustic events with respect to time in the impulse responses, i.e., the direct sound and the first reflections, is possible as shown in [24] and [18], respectively. However, as time progresses the number of acoustic events per time window increases, and eventually more than one reflection arrives during the time window. In this case, a cross-correlation-based localization algorithm localizes the sound to the location of the reflection that is the strongest one in that time window. The strongest direction is selected because it shows as the strongest peak in the cross-correlation functions. An analogous example of this behavior with one localization algorithm is shown with speech sources in [25]. However, it is also possible that the estimated location is an intermediate point between the reflections within that analysis window. This is, for example, the case if the localization algorithm is based on the average direction of the sound intensity. Thus, the estimated location depends highly on the localization algorithm. The behavior of the localization algorithms in the case of several acoustic events should be further investigated, but here this is left for future research. In any case, SDM assumes in the spatial reproduction that the estimated location corresponds to the correct perceptual location. The assumption has been used previously, for example, in SIRR [10,26].

SDM produces the diffuse sound field naturally. Namely, in SDM each time step has a random direction in a diffuse sound field. The total directional distribution over the diffuse sound field, i.e., the late reverberation, is then uniform. Further evidence for this is provided in a recently published article that uses SDM for spatial analysis [27].

Since the first acoustic events are correctly localized from the spatial room impulse response in the SDM framework, and these events are known to have a very prominent effect on the perception of spatial sound [1,28], the resulting auralization should be credible. Moreover, the late part of the spatial room impulse response will be naturally presented as diffuse by SDM because multiple arriving reflections will produce random directions.

1.5 An Example of the Analysis with SDM

This section demonstrates the principles of SDM with an illustration of the analysis results for a spatial room impulse response. The spatial room impulse response is recorded from a simulation of a shoebox room of size (20 × 30 × 12) m³. Furthermore, the source was at [16.04, 8.06, 3.58] m and the receiver at [7.35, 7.92, 3.22] m. In addition, the applied window was a 1.33 ms Hanning window and the overlap between two consecutive windows was 99%. The speed of sound was set to c = 345 m/s, the sampling frequency to f_s = 48 kHz, the reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

Fig. 3, where the radius of each circle corresponds to the amplitude of the respective image-source, illustrates the results of the analysis. As can be seen in Fig. 3, the early part of the simulated spatial room impulse response (a) is very similar to the one analyzed by SDM (b).

[Fig. 3: two scatter plots over X- and Y-coordinates from −100 m to 100 m; (a) original image-source locations and amplitudes, (b) analyzed image-source locations and amplitudes.]

Fig. 3. An example of the locations and amplitudes of (a) simulated image-sources and (b) decomposed image-sources with SDM from a spatial room impulse response. The area of each filled circle illustrates the energy of that image-source. The image-sources with the highest energy are correctly analyzed.

2 LISTENING TEST EXPERIMENTS

This section describes the listening test setup, the listening room, the simulated room acoustic conditions, and the source signals. In addition, the listening test procedures and results are presented.

This paper uses Vector Base Amplitude Panning (VBAP) [8] as the spatial reproduction technique for the listening tests. Other reproduction methods could also be used, but VBAP is preferred here since it can be implemented for 3-D spatial sound with fewer loudspeakers than the other methods and since it provides good overall subjective quality. The listening tests compare the proposed method to SIRR [10,26], which can be considered the
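As a numerical check of the echo-density threshold of Eq. (7) for a room of this size (an illustrative Python sketch; the function name is invented here):

```python
import math

def tau_1(V, dt, c=345.0, N_r=1.0):
    """Time after the direct sound until, on average, more than N_r
    reflections fall within an analysis window of length dt (Eq. (7))."""
    return math.sqrt(N_r * V / (dt * 4.0 * math.pi * c ** 3))

# Large simulated room (30 x 20 x 12 = 7200 m^3) with a 1 ms window,
# the example from Sec. 1.3:
print(tau_1(V=7200.0, dt=1e-3))  # about 0.118 s, i.e., the quoted ~119 ms
```

The ≈ 0.0014 √V shorthand of Eq. (7) is just this expression with N_r = 1, dt = 1 ms, and c = 345 m/s, with the coefficient rounded.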
Table 1. Reverberation time (RT), sound pressure level (SPL), and noise level (NL) in the listening room. Sound pressure level is given with respect to the reference (Ref.) value at the 200 Hz-4 kHz frequency band. In the calibration, the SPL was 87 dB, which gives a signal-to-noise ratio of more than 45 dB for each octave band.

Octave band          RT [s]  SPL [dB]  NL [dB]
Ref. [200 Hz-4 kHz]  0.14     0.00     -
125 Hz               0.24     1.66     39.9
250 Hz               0.17     0.47     35.7
500 Hz               0.13     0.36     32.9
1 kHz                0.13     0.02     28.6
2 kHz                0.12     0.16     20.4
4 kHz                0.11    −1.03     18.9
8 kHz                0.10    −2.93     21.5

Table 2. The azimuth and elevation directions and distance of each individual loudspeaker (LPS) in the 14-channel loudspeaker reproduction setup in the listening room. 8, 4, and 2 loudspeakers are located approximately at the lateral plane, 45 degrees above the lateral plane, and −45 degrees below the lateral plane, respectively. The loudspeakers were localized using the method presented in [24].

LPS #  Azimuth [°]  Elevation [°]  Distance [m]
1        46.7          0.1           1.01
2        89.6         −1.5           1.02
3       134.2         −0.1           0.98
4       179.8         −0.2           0.95
5      −135.9          0.1           1.02
6       −91.6          0.2           0.94
7       −45.6          1.0           0.96
8        −0.3          1.6           0.98
9        45.9         43.5           1.30
10      135.1         40.5           1.33
11     −137.9         42.3           1.40
12      −45.4         46.4           1.33
13       24.0        −46.1           1.29
14      −19.6        −45.1           1.27

Table 3. Source and receiver positions, source signals, dimensions of the rooms, and sample naming used in the listening test. Speed of sound was set to c = 345 m/s, sampling frequency to f_s = 48 kHz, reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

Sample     Source position [m]     Receiver position [m]
(Signal)   x      y      z         x      y      z
Large room (30 × 20 × 12) m³
A (Sp.)    16.04   8.06  3.58      7.35   7.92  3.22
B (Tr.)    17.44  12.81  2.88      2.64  13.48  3.72
C (Ca.)    20.37  11.99  2.52      3.10  12.10  2.86
Small room (5 × 3 × 2.8) m³
D (Sp.)     3.44   0.80  1.53      1.02   0.64  1.40
E (Tr.)     3.87   1.45  1.65      0.76   1.39  1.33
F (Ca.)     3.78   0.85  1.81      1.24   0.97  2.07
Sp.: Speech, Tr.: Trombone, Ca.: Castanet

state-of-the-art spatial sound encoding method for spatial room impulse responses, at least for VBAP. SIRR also operates under the same assumption as SDM, that the binaural cues are produced correctly.

2.1 Listening Room Setup and Stimuli

Listening tests were conducted in an acoustically treated room with dimensions of (x × y × z: 3.0 × 5.1 × 3.8) m³. Table 1 shows the reverberation time, sound pressure level, and noise level in the listening room. The listening room fulfills the recommendations given by the ITU in [29], with the exceptions that the noise level fulfills the noise rating (NR) 30 requirement, whereas the recommendation is NR 15, and the listening distance is about 1.2 meters on average, whereas the recommendation is more than two meters.

The listening room includes a 3-D 14-channel loudspeaker setup, of which 12 loudspeakers are of type Genelec 8030A and two are of type Genelec 1029A. Table 2 gives the location of each loudspeaker with respect to the listening position at the origin (0,0,0) m. Each loudspeaker is calibrated so that they all produce an equal A-weighted sound pressure level with slow temporal averaging in the listening position for a band-pass filtered noise from 100 Hz to 5 kHz. Since the distance to the reference position is not the same for all loudspeakers, they are all delayed with digital signal processing so that each loudspeaker is at a virtual distance of 1.40 m.

The simulated impulse responses for the listening test were produced with the image-source method [11] in two modelled rectangular rooms. In the image-source method, throughout this paper, the reflection coefficient is set to 0.85, the speed of sound to 345 m/s, the sampling frequency to 48 kHz, and reflections up to the 45th order are simulated. In addition, Table 3 shows the room dimensions and the source and receiver positions used in the image-source method. Two shoebox rooms, a large and a small one, are simulated for the listening tests. The large and the small room have wide band reverberation times of 2.0 s and 0.4 s, respectively. In all the cases, the room impulse responses are truncated from −40 dB onwards according to the backward integrated Schroeder curve.

2.1.1 Reference and Anchor

The reference was generated with the image-source method. The location and amplitude of each image-source was transferred into a virtual source, which was panned with VBAP for the current loudspeaker setup [8]. Finally, to simulate a real room impulse response measurement situation, the anechoic impulse response of a Genelec 1029A was convolved with the impulse responses, and the impulse response was filtered with air absorption filters, implemented according to [30].

The anchor for the listening test was selected to be the same mono impulse response as in the reference, but instead of VBAP processing according to the directional information obtained from the image-source method, it was used directly in the front loudspeaker (# 8 in Table 2).

2.1.2 Spatial Encoding Methods

Similarly to the reference, the image-source method was used to generate spatial impulse responses for a virtual
[Fig. 4 block diagram. Recoverable labels: Microphone signals; Image source method; B-format conversion (W, X, Y, Z); SDM → x(t), y(t), z(t), h(t) → VBAP; SIRR → Non-diffuse stream → VBAP and Diffuse stream → Decorrelation; Air absorption filter; simulation parameters: reflection coefficient 0.85, speed of sound 345 m/s, reflection order 45.]

Fig. 4. Processing of the samples in the listening test experiments. The shaded areas highlight the different spatial encoding methods (from top to bottom: SDM, SIRR, and reference).
microphone array. The microphone array consists of seven microphones, of which six are on a sphere and one is in the geometric center of the array, as shown in Table 4. The central microphone is used as the microphone for the pressure signal in the spatial encoding methods.

Table 4. Origin-centered coordinates for the microphone arrays. Spacing d_spc is equal for each microphone pair on a single axis.

Microphone #  X [m]      Y [m]      Z [m]
1              d_spc/2    0          0
2             −d_spc/2    0          0
3              0          d_spc/2    0
4              0         −d_spc/2    0
5              0          0          d_spc/2
6              0          0         −d_spc/2
7              0          0          0

The proposed spatial encoding method, SDM, was compared to two versions of SIRR. The first version of SIRR, as well as SDM, was implemented with seven microphones, and the second version of SIRR was implemented with 13 microphones. Their naming is the following:

• SDM with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SDML7,
• SIRR with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SIRRL7, and
• SIRR with two microphone arrays with spacings d_spc = 100 mm and d_spc = 25 mm and one microphone in the geometric center is named SIRR13.

The microphone arrays were selected as such since SIRR processing can be implemented for them [10,26]. Namely, SIRR requires the three components of particle velocity, which can be calculated with the gradient microphone technique, and a pressure signal, which is the microphone in the geometric center.

SIRR13 analyzes the room impulse responses separately for the large and small spacing and combines them in the post-processing phase. The combination takes the analysis result for low frequencies, below 1 kHz, from the large spacer, and for high frequencies, above 1 kHz, from the smaller spacer. Before the addition, the analyzed signals for the small and large spacer are low-pass and high-pass filtered with a 10th-order Butterworth IIR filter, respectively. The motivation for such processing is that the present authors have used such an array in measurements of concert halls [31].

To compare the methods in the same conditions, all the analyses use a Hanning window of 1.33 ms (64 samples at 48 kHz). SIRR has an overlap of 50% between two consecutive windows, and SDM has an overlap of 99% (63 samples), as explained in Section 1.2 (Step 1). The window size was selected as 1.33 ms since it is the one used in the original SIRR paper [26]. It should be emphasized that for SDM the optimal window size is much smaller than the selected one. Especially for the smaller simulated room, it is expected that the lengthy time window causes problems in SDM, since the parameter τ_1 is 1.4 ms. However, since the goal is to compare these two techniques in the same conditions, the same window size is used for both. Moreover, a virtual microphone-based synthesis, originally developed for DirAC in [32], was noticed to provide a more natural sound for SIRR and was included in the processing.

The output of the SDM, i.e., the extracted image-sources, is directly panned with VBAP for the current loudspeaker setup. The output of the SIRR analysis is processed as described in [10] and [26] for VBAP reproduction. In addition, the diffuse part of the SIRR is implemented with the Hybrid Method described in [26]. The processing of the listening test samples for SDM, SIRR, and the reference case is illustrated in Fig. 4.

2.1.3 Source Signals and Test Samples

Approximately ten seconds of male speech, trombone, and castanets were selected as the source signals. Each sample was convolved separately with the corresponding 14-channel VBAP output for the reference, SIRR, or SDM. The test samples are named from A to F as indicated in Table 3.

2.2 Listening Test Procedure

The task in the listening test is to compare the "similarity" of the spatially encoded samples with the reference sample,
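For the array geometry of Table 4, the window-length condition of Sec. 1.3, L_t > 2 d_max/c, is easy to verify (an illustrative Python sketch; the function and variable names are invented here):

```python
import itertools
import math

def min_window_length(mics, c=345.0):
    """Smallest admissible analysis window, 2 * d_max / c (Sec. 1.3)."""
    d_max = max(math.dist(p, q) for p, q in itertools.combinations(mics, 2))
    return 2.0 * d_max / c

d_spc = 0.100  # 100 mm spacing (the SDML7 / SIRRL7 array)
mics = [(d_spc / 2, 0, 0), (-d_spc / 2, 0, 0),
        (0, d_spc / 2, 0), (0, -d_spc / 2, 0),
        (0, 0, d_spc / 2), (0, 0, -d_spc / 2),
        (0, 0, 0)]  # Table 4 geometry

print(min_window_length(mics) * 1e3)  # about 0.58 ms
```

Here d_max = d_spc = 0.1 m (two opposite microphones on the same axis), so the 1.33 ms Hanning window used in the experiments comfortably satisfies the condition.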
Fig. 5. Screen capture of the user interface (UI) used in the listening tests. The subjects can freely move, listen to, and rate the samples. Note that in the UI the alphabet stands for "method."

[Fig. 6 box plot: similarity (0 to 1) of samples A-F for Ref., SDML7, SIRRL7, SIRR13, and Anchor. Legend: A: Speech, large room; B: Trombone, large room; C: Castanet, large room; D: Trombone, small room; E: Speech, small room; F: Castanet, small room.]

Fig. 6. Listening test results: the thicker boxes with solid color illustrate the 25 and 75 percentiles, the thinner lines illustrate the most extreme data points, the circles outside the boxes illustrate outliers, and the dots the median.

Table 5. The attributes that the subjects used for assessing the similarity according to the interviews. Most of the attributes are translated from Finnish to English.

Timbral aspects: . . .
Spatial aspects: localization (×5), spatial impression (×4), the amount of reverberation (×4), distance (×3), spatial width, . . .
begin with. This requires additional research and is therefore left for future work. In addition, future work includes open source implementations of the SDM encoding for a general pressure microphone array, a B-format microphone, and the decoding implementations for wave field synthesis and higher order Ambisonics. The problem of increased clarity may not be present in the other reproduction approaches.

In this paper the room acoustic simulation used ideal specular reflections, which is an inherent property of the applied image-source room simulation method. SDM should also be tested with diffuse reflections. However, the generation of the reference case for diffuse reflections is problematic, since for the reference case the direction, time of arrival, and pressure value for each time instant are required. This information is available in beam-tracing or ray-tracing methods. Unfortunately, these methods neglect the temporal spreading of the reflections and consider that diffuse reflections only introduce spatial spreading for the reflected sound. Moreover, the room acoustic simulation methods that aim to solve the wave equation, e.g., the finite element method, the boundary element method, and finite differences in the time domain, may generate the correct pressure values, but they do not produce directional information. The only method that produces all the necessary information and takes into account the temporal and spatial spreading is presented in [35], but it only applies to low frequencies. The comparison against a reference case with diffuse reflections is therefore currently not possible.

SIRR was implemented with the parameters given in the original paper [26]. It should be emphasized that the advances made in Directional Audio Coding could possibly improve the quality of the SIRR. In informal listening, for example, the multi-rate implementation [36] was found to increase the overall quality of SIRR. Studies using SIRR with alternative processing approaches are currently not available in the literature.

5 CONCLUSIONS

This paper presented a spatial encoding method for spatial room impulse responses. The analysis of the method estimates the location in very small time windows at every discrete time sample, where the localization method depends on the applied microphone array and acoustic conditions. Each of the discrete time samples is therefore represented by an image-source. Thus, the analysis results in a set of image-sources. Then, depending on the spatial reproduction method, the samples are distributed to several reproduction channels to obtain individual impulse responses for all reproduction channels.

The main advantage of the method follows from the decomposition into image-sources. Namely, the method can be applied to any arbitrary microphone array, and the spatial reproduction method can be any of a variety of existing techniques. It should be emphasized that the method is not designed for a continuous signal, but for spatial room impulse responses, which can then be convolved with an anechoic signal. In this paper the applied microphone array was an open spherical microphone array with six microphones, with an additional seventh microphone in the geometric center of the array.

Listening test experiments showed that the presented method produces sound that is indistinguishable from a reference sound in the best case. Overall, the similarity of the sound samples encoded with the presented method was perceived to be closer to the reference than that of a state-of-the-art method in the same conditions.

6 ACKNOWLEDGMENT

This work was supported by the ERC under grant agreement no. [203636] and by the Academy of Finland under agreement nos. [218238 and 140786]. The authors are grateful to Prof. Lauri Savioja, Prof. Ville Pulkki, and Mr. Mikko-Ville Laitinen for discussions. The authors would also like to thank the anonymous reviewers for their valuable comments, which helped to improve the quality of this paper. Dr. Alex Southern is thanked for providing the implementation for the air absorption filter computations.

7 REFERENCES

[1] T. Lokki, H. Vertanen, A. Kuusinen, J. Pätynen, and S. Tervo, "Concert Hall Acoustics Assessment with Individually Elicited Attributes," J. Acoust. Soc. Am., vol. 130, pp. 835–849 (2011 Aug.).
[2] T. Lokki, J. Pätynen, S. Tervo, S. Siltanen, and L. Savioja, "Engaging Concert Hall Acoustics Is Made Up of Temporal Envelope Preserving Reflections," J. Acoust. Soc. Am., vol. 129, pp. EL223–EL228 (2011 Apr.).
[3] J. Daniel, R. Nicol, and S. Moreau, "Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging," presented at the 114th Convention of the Audio Engineering Society (2003 Mar.), convention paper 5788.
[4] A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic Control by Wave Field Synthesis," J. Acoust. Soc. Am., vol. 93, no. 5, pp. 2764–2778 (1993).
[5] M. M. Boone, E. N. G. Verheijen, and P. F. van Tol, "Spatial Sound-Field Reproduction by Wave-Field Synthesis," J. Audio Eng. Soc., vol. 43, pp. 1003–1012 (1995 Dec.).
[6] D. Hammershøi and H. Møller, "Binaural Technique—Basic Methods for Recording, Synthesis, and Reproduction," in Communication Acoustics, ch. 9, pp. 223–254 (Springer-Verlag, New York, NY, USA, 2005).
[7] D. Schönstein and B. F. G. Katz, "Variability in Perceptual Evaluation of HRTFs," J. Audio Eng. Soc., vol. 60, pp. 783–793 (2012 Oct.).
[8] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., vol. 45, pp. 456–466 (1997 Jun.).
[9] F. Zotter and M. Frank, "All-Round Ambisonic Panning and Decoding," J. Audio Eng. Soc., vol. 60, pp. 807–820 (2012 Oct.).
[10] J. Merimaa and V. Pulkki, "Spatial Impulse Response Rendering I: Analysis and Synthesis," J. Audio Eng. Soc., vol. 53, pp. 1115–1127 (2005 Dec.).
[11] J. B. Allen and D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950 (1979).
[12] U. P. Svensson, R. I. Fred, and J. Vanderkooy, "An Analytic Secondary Source Model of Edge Diffraction Impulse Responses," J. Acoust. Soc. Am., vol. 106, pp. 2331–2344 (1999).
[13] B.-I. Dalenbäck, M. Kleiner, and P. Svensson, "A Macroscopic View of Diffuse Reflection," J. Audio Eng. Soc., vol. 42, pp. 793–807 (1994 Oct.).
[14] J.-M. Jot, L. Cerveau, and O. Warusfel, "Analysis and Synthesis of Room Reverberation Based on a Statistical Time-Frequency Model," presented at the 103rd Convention of the Audio Engineering Society (1997 Sept.), convention paper 4629.
[15] C. Zhang, Z. Zhang, and D. Florêncio, "Maximum Likelihood Sound Source Localization for Multiple Directional Microphones," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 125–128 (2007).
[16] D. Levin, E. A. P. Habets, and S. Gannot, "Maximum Likelihood Estimation of Direction of Arrival Using an Acoustic Vector-Sensor," J. Acoust. Soc. Am., vol. 131, no. 2, pp. 1240–1248 (2012).
[17] S. Tervo, "Direction Estimation Based on Sound Intensity Vectors," European Signal Processing Conference, Glasgow, Scotland, Aug. 24–28, pp. 700–704 (2009).
[18] S. Tervo, J. Pätynen, and T. Lokki, "Acoustic Reflection Localization from Room Impulse Responses," Acta Acustica united with Acustica, vol. 98, pp. 418–440 (2012).
[19] A. Host-Madsen, "On the Existence of Efficient Estimators," IEEE Trans. Signal Processing, vol. 48, no. 11, pp. 3028–3031 (2000).
[20] C. Knapp and G. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. Acoust., Speech and Signal Proc., vol. 24, no. 4, pp. 320–327 (1976).
[21] L. Zhang and X. Wu, "On Cross Correlation Based Discrete Time Delay Estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 981–984 (2005).
[22] T. Pirinen, Confidence Scoring of Time Delay Based Direction of Arrival Estimates and a Generalization to Difference Quantities, Ph.D. thesis, Tampere University of Technology, 2009, publication 854.
[23] H. Kuttruff, Room Acoustics, 4th ed. (Spon Press, New York, NY, USA, 2000).
[24] S. Tervo, T. Lokki, and L. Savioja, "Maximum Likelihood Estimation of Loudspeaker Locations from Room Impulse Responses," J. Audio Eng. Soc., vol. 59, pp. 845–857 (2011 Nov.).
[25] A. Brutti, M. Omologo, and P. Svaizer, "Multiple Source Localization Based on Acoustic Map De-emphasis," EURASIP J. Audio, Speech, and Music Processing, 2010, paper 147495.
[26] V. Pulkki and J. Merimaa, "Spatial Impulse Response Rendering II: Reproduction of Diffuse Sound and Listening Tests," J. Audio Eng. Soc., vol. 54, pp. 3–20 (2006 Jan./Feb.).
[27] J. Pätynen, S. Tervo, and T. Lokki, "Analysis of Concert Hall Acoustics via Visualizations of Time-Frequency and Spatiotemporal Responses," J. Acoust. Soc. Am., vol. 133, no. 17 (2013 Jan.).
[28] J. S. Bradley, H. Sato, M. Picard, et al., "On the Importance of Early Reflections for Speech in Rooms," J. Acoust. Soc. Am., vol. 113, no. 6, pp. 3233–3244 (2003).
[29] International Telecommunication Union, ITU-R BS.1116-1: Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems, Geneva, 1997.
[30] H. E. Bass, H.-J. Bauer, and L. B. Evans, "Atmospheric Absorption of Sound: Analytical Expressions," J. Acoust. Soc. Am., vol. 52, no. 3B, pp. 821–825 (1972).
[31] T. Lokki, J. Pätynen, A. Kuusinen, and S. Tervo, "Disentangling Preference Ratings of Concert Hall Acoustics Using Subjective Sensory Profiles," J. Acoust. Soc. Am., vol. 132, pp. 3148–3161 (2012 Nov.).
[32] J. Vilkamo, T. Lokki, and V. Pulkki, "Directional Audio Coding: Virtual Microphone-Based Synthesis and Subjective Evaluation," J. Audio Eng. Soc., vol. 57, pp. 709–724 (2009 Sept.).
[33] B. B. Schultz, "Levene's Test for Relative Variation," Systematic Biology, vol. 34, no. 4, pp. 449–456 (1985).
[34] M. A. Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons," J. Am. Statistical Assoc., vol. 69, no. 347, pp. 730–737 (1974 Sept.).
[35] S. Siltanen, T. Lokki, S. Tervo, and L. Savioja, "Modeling Incoherent Reflections from Rough Room Surfaces with Image Sources," J. Acoust. Soc. Am., vol. 132, pp. 4604–4614 (2012 June).
[36] T. Pihlajamäki and V. Pulkki, "Low-Delay Directional Audio Coding for Real-Time Human-Computer Interaction," presented at the 130th Convention of the Audio Engineering Society (2011 May), convention paper 8413.
THE AUTHORS

Dr. Sakari Tervo is a post-doctoral researcher in the Department of Media Technology, Aalto University School of Science, from where he also received a D.Sc. degree in acoustics in January 2012. The topic of his research is objective room acoustic measures.

Previously he has been working in the Department of Signal Processing, Tampere University of Technology, from where he also graduated as an M.Sc. majoring in audio signal processing in 2006. He has visited the Digital Signal Processing Group of Philips Research, Eindhoven, The Netherlands, in 2007 and the Department of Electronics of the University of York, United Kingdom, in 2010.

Dr. Jukka Pätynen was born in 1981 in Espoo, Finland. He received M.Sc. and D.Sc. (Tech.) degrees from the Helsinki University of Technology, Finland, in 2007, and Aalto University, Finland, in 2011, respectively. He is currently working as a post-doctoral researcher in the Department of Media Technology, Aalto University. His research activities include room acoustics, musical acoustics, and signal processing.

Antti Kuusinen received an M.Sc. degree in March 2012. He is currently working as a doctoral candidate at the Department of Media Technology at Aalto University School of Science under the supervision of Professor Tapio Lokki. In his doctoral research he focuses on the perceptual characteristics of concert hall acoustics. This research also includes development of descriptive vocabulary for sound, music, and room acoustics as well as elaboration of listening test methodology and statistical data analysis.

Dr. Tapio Lokki was born in Helsinki, Finland, in 1971. He has studied acoustics, audio signal processing, and computer science at the Helsinki University of Technology (TKK) and received an M.Sc. degree in electrical engineering in 1997 and a D.Sc. (Tech.) degree in computer science and engineering in 2002.

At present Dr. Lokki is an Associate Professor (tenured) with the Department of Media Technology at Aalto University. Prof. Lokki leads the virtual acoustics team jointly with Prof. Lauri Savioja. The research aims to create novel objective and subjective ways to evaluate concert hall acoustics. In addition, the team develops physically-based room acoustics modeling methods to obtain authentic auralization. Furthermore, the team studies augmented reality audio. The team is funded by the Academy of Finland and by Prof. Lokki's Starting Grant from the European Research Council (ERC).