Speech Enhancement With Natural Sounding Residual Noise Based On Connected Time-Frequency Speech Presence Regions
1. INTRODUCTION
The performance of many speech enhancement methods relies mainly on the quality of a noise power spectral density (PSD) estimate. When the noise estimate differs from the true noise, it will lead to artifacts in the enhanced speech.
The approach taken in this paper is based on connected region speech presence detection. Our aim is to exploit spectral and temporal masking mechanisms in the human auditory system [1] to reduce the perception of these artifacts in
speech presence regions and eliminate the artifacts in speech
absence regions. We achieve this by leaving downscaled natural sounding background noise in the enhanced speech in
connected time-frequency regions with speech absence. The
downscaled natural sounding background noise will spectrally and temporally mask artifacts in the speech estimate
while preserving the naturalness of the background noise.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the definition of speech presence regions, we are inspired by the work of Yang [2]. Yang demonstrates high perceptual quality of a speech enhancement method where constant gain is applied in frames with no detected speech presence. Yang lets a single decision cover a full frame. Thus, musical noise is present in the full spectrum of the enhanced
speech in frames with speech activity. We therefore extend
the notion of speech presence to individual time-frequency
locations. This, in our experience, significantly improves the
naturalness of the residual noise. The speech enhancement
method, proposed in this paper, thereby eliminates audible
musical noise in the enhanced speech. However, fluctuating
speech presence decisions will reduce the naturalness of the
enhanced speech and the background noise. Thus, reasonably connected regions of the same speech presence decision
must be established.
To achieve this, we use spectral-temporal periodogram smoothing. To this end, we make use of the spectral-temporal smoothing method by Martin and Lotter [3], which extends the original groundbreaking work of Martin [4, 5]. Martin and Lotter derive optimum smoothing coefficients for (generalized) χ²-distributed spectrally smoothed spectrograms, which is particularly well suited for noise types with a smooth power spectrum. The underlying assumption in this approach is that the real and imaginary parts of the associated STFT coefficients for the averaged periodograms have the same means and variances. For the application of
2.

After an introduction to the signal model, we give a structural description of the algorithm to provide an algorithmic overview.

2.1. Signal model

We consider noisy speech y(i) = s(i) + n(i), where s(i) is the clean speech and n(i) is additive noise. The short-time Fourier transform (STFT) of the noisy speech is

    Y(λ, k) = Σ_{ℓ=0}^{L−1} h(ℓ) y(λM + ℓ) e^{−j2πℓk/K},    (1)

where h(ℓ) is the analysis window of length L, M is the frame skip, K is the FFT size, λ is the frame index, and k is the frequency bin index. By linearity of the STFT,

    Y(λ, k) = S(λ, k) + N(λ, k),    (2)

where S(λ, k) and N(λ, k) are the STFTs of the clean speech and the noise, respectively.
2.2.

The structure of the proposed algorithm and the names of variables with a central role are shown in Figure 1. After applying an analysis window to the noisy speech, we take the STFT, from which we calculate periodograms

    P_Y(λ, k) = |Y(λ, k)|².    (3)

These periodograms are spectrally smoothed, yielding P̄_Y(λ, k), and then temporally smoothed to produce P̄(λ, k). These smoothed periodograms are temporally minimum tracked, and by comparing ratios and differences of the minimum tracked values to P̄(λ, k), they are used for speech presence detection. As a distinct feature of the proposed method, we use speech presence detection to achieve low-biased noise PSD estimates, but also noise periodogram estimates P̂_N(λ, k), which equal P_Y(λ, k) when D(λ, k) = 0, that is, no detected speech presence. When D(λ, k) = 1, that is, detected speech presence, the noise periodogram estimate equals the noise PSD estimate, that is, a recursively smoothed bias compensation factor applied on the minimum tracked values. The bias compensation factor is a recursive smoothing of the power ratios between the noise periodogram estimates and the minimum tracks. This factor is only updated while no speech is present in the frames and kept fixed while speech is present. A noise magnitude spectrum estimate |N̂(λ, k)| is obtained from the noise PSD estimate.
[Figure 1: Block diagram of the proposed algorithm. The noisy speech y(i) is windowed and transformed by the STFT into Y(λ, k); the periodogram P_Y(λ, k) = |Y(λ, k)|² is spectrally smoothed, temporally smoothed, and minimum tracked; speech presence detection yields decisions D(λ, k), which after temporal smoothing drive the noise estimation |N̂(λ, k)| and the speech enhancement |Ŝ(λ, k)|; the enhanced magnitude is combined with the phase of Y(λ, k), inverse STFT transformed, and WOLA synthesized into ŝ(i).]
3. SPECTRAL-TEMPORAL PERIODOGRAM SMOOTHING

The noisy speech periodograms are first smoothed across frequency with a window b(ℓ) of length L = 2D + 1,

    P̄_Y(λ, k) = Σ_{ℓ=−D}^{D} b(ℓ) P_Y(λ, k − ℓ),    (4)

and then smoothed across time with a first-order recursive filter,

    P̄(λ, k) = α(λ, k) P̄(λ − 1, k) + (1 − α(λ, k)) P̄_Y(λ, k),    (5)

where, following Martin and Lotter [3], the optimum smoothing parameter is

    α(λ, k) = 2 / (2 + K̃ (P̄(λ − 1, k)/E{|N(λ, k)|²} − 1)²),    (6)

with the noise power approximated by the previous noise periodogram estimate,

    E{|N(λ, k)|²} = P̂_N(λ − 1, k),    (7)

and with K̃ the equivalent degrees of freedom of the χ²-distributed spectrally smoothed periodograms,

    K̃ = (4D + 2) (Σ_{ℓ=0}^{L−1} b²(ℓ))² / (L Σ_{ℓ=0}^{L−1} b⁴(ℓ)).    (8)

3.3.

Pseudocode for the complete spectral-temporal periodogram smoothing method is provided in Algorithm 1. A smoothing parameter correction factor c(λ, k), proposed by Martin [5], is multiplied on α(λ, k). Additionally, in this paper, we lower-limit the resulting smoothing parameters to ensure a minimum degree of smoothing, that is,

    α̃(λ, k) = max(c(λ, k) α(λ, k), 0.4).    (9)
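The smoothing pipeline of (4)–(9) can be sketched as follows. This is a minimal illustration under our own naming, not the paper's implementation; the correction factor c(λ, k) is omitted, so only the lower limit of (9) is applied.

```python
import numpy as np

def spectral_smooth(P_Y, b):
    """Smooth one periodogram frame across frequency with window b (eq. (4))."""
    # 'same'-length convolution: P_bar[k] = sum_l b[l] * P_Y[k - l]
    return np.convolve(P_Y, b, mode="same")

def optimal_alpha(P_bar_prev, noise_psd, K_tilde):
    """Optimum recursive smoothing parameter (eq. (6)), per Martin and Lotter [3]."""
    ratio = P_bar_prev / np.maximum(noise_psd, 1e-12)
    return 2.0 / (2.0 + K_tilde * (ratio - 1.0) ** 2)

def temporal_smooth(P_bar_Y, P_bar_prev, noise_psd, K_tilde, alpha_floor=0.4):
    """One frame of first-order recursive smoothing (eq. (5)), with the
    smoothing parameter lower-limited as in eq. (9)."""
    alpha = np.maximum(optimal_alpha(P_bar_prev, noise_psd, K_tilde), alpha_floor)
    return alpha * P_bar_prev + (1.0 - alpha) * P_bar_Y
```

Note that when the previous smoothed periodogram matches the noise power, α(λ, k) reaches its maximum of 1 and the estimate is held; large deviations drive α toward 0 so that speech onsets are tracked quickly.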
(9)
We now base a speech presence detection method on comparisons, at each frequency, between the smoothed noisy
speech periodograms and temporal minimum tracks of the
smoothed noisy speech periodograms.
4.1. Temporal minimum tracking

From the spectrally and temporally smoothed noisy speech periodograms P̄(λ, k), we track temporal minimum values P_min(λ, k) within a minimum search window of length D_min, that is,

    P_min(λ, k) = min{ P̄(λ′, k) : λ − D_min < λ′ ≤ λ }.    (10)
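A direct (unoptimized) sketch of the sliding-window minimum tracking in (10), with our own naming:

```python
from collections import deque
import numpy as np

def track_minimum(frames, D_min=150):
    """Yield P_min(lambda, k): the per-bin minimum over the most recent
    D_min smoothed periodogram frames (eq. (10)); D_min = 150 as in Table 4."""
    window = deque(maxlen=D_min)  # holds the most recent frames only
    for frame in frames:
        window.append(np.asarray(frame, dtype=float))
        yield np.stack(window).min(axis=0)
```

This costs O(D_min) per frame; a monotonic-wedge deque per frequency bin would bring the per-frame cost down to amortized O(1) if needed.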
The decision rules that are used for speech presence detection have the threshold values listed in Table 2. For noise estimation, we use the two parameters from Table 4. The speech enhancement method uses the parameter settings that are listed in Table 5.
4.2.

[Table 1: Variables for the spectral-temporal periodogram smoothing.]

Variable       Value                    Description
D              7                        Spectral window length: 2D + 1
b(ℓ)           G_b triang(2D + 1)^i     Spectral smoothing window
M              154220^ii                Number of frames
K̃              20.08^iii                Equivalent degrees of freedom of χ²-distribution
P̂_N(−1, k)     P_Y(0, k)                Initial noise periodogram estimate
c(−1)          1                        Initial correction variable

[Table 2: Threshold constants for the speech presence decision rules.]

Value   Description
6       Constant for ratio-based decision rule
0.5     Constant for difference-based decision rule

The speech presence decisions D(λ, k) are obtained by decision rules of two forms: a ratio-based rule, which compares P̄(λ, k) with the minimum track P_min(λ, k) scaled by the ratio constant of Table 2 (13), and a difference-based rule, which compares P̄(λ, k) with P_min(λ, k) plus a term scaled by the difference constant of Table 2 (15). In both cases, D(λ, k) = 1 (detected speech presence) when P̄(λ, k) exceeds the threshold, and D(λ, k) = 0 otherwise.
and intensity of environmental noise [11], and it can be adjusted empirically. This is also the case for the difference-rule constant. For applications where a reasonable objective performance measure can be defined, the two constants can be obtained by interpreting the decision rule as an artificial neural network and then conducting a supervised training of this network [9].
Speech at frequencies below 100 Hz is considered perceptually unimportant, and bins below this frequency are therefore always classified with speech absence. Real-life noise
sources often have a large part of their power at the low frequencies, so this rule ensures that this power does not cause
the speech presence detection method to falsely classify these
low-frequency bins as if speech is present. If less than 5% of
the K periodogram bins are classified with speech presence,
we expect that these decisions have been falsely caused by the
noise characteristics, and all decisions in the current frame
are reclassified to speech absence. When the speech presence
decisions are used in a speech enhancement method, as we
propose in Section 6, this reclassification will ensure the naturalness of the background noise in periods of speaker silence.
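The two safeguards described above can be sketched as a post-processing step on a frame of binary decisions. The function and parameter names, and the explicit bin-to-frequency mapping, are our own:

```python
import numpy as np

def postprocess_decisions(D, freqs_hz, min_speech_fraction=0.05, low_cut_hz=100.0):
    """Enforce the two rules from the text: bins below 100 Hz are always
    classified with speech absence, and a frame in which fewer than 5% of
    the bins indicate speech presence is reclassified entirely to absence."""
    D = D.copy().astype(bool)
    D[freqs_hz < low_cut_hz] = False       # low-frequency bins: speech absence
    if D.mean() < min_speech_fraction:     # too few detections: blame the noise
        D[:] = False
    return D
```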
5. NOISE ESTIMATION
[Table 3: Analysis/synthesis setup.]

Variable   Value                     Description
—          8 kHz                     Sample frequency
K          256                       FFT size
—          256                       Frame size
—          128                       Frame skip
h(ℓ)       G_h^{−1} Hanning(K)^i     Analysis window
h_s(ℓ)     G_{h_s} Hanning(K)^ii     Synthesis window

^i G_h is the square root of the energy of Hanning(K), which scales the analysis window to unit energy. This is to avoid scaling factors throughout the paper.
^ii G_{h_s} scales the synthesis window h_s(ℓ) such that the analysis window h(ℓ), multiplied with h_s(ℓ), yields a Hanning(K) window.
    P̂_N(λ, k) = { R̄_min(λ) P_min(λ, k)   if D(λ, k) = 1,
                  P_Y(λ, k)              if D(λ, k) = 0.    (16)

This noise periodogram estimate equals the true noise periodogram |N(λ, k)|² when the speech presence detection is correctly detecting no-speech presence. When entering a region with speech presence, the noise periodogram estimate will take on the smooth shape of the minimum track, scaled with the bias compensation factor in (18) such that the power develops smoothly into the speech presence region.
The magnitude spectrum, at time index λ, is obtained by taking the square root of the noise periodogram estimate, that is,

    |N̂(λ, k)| = √(P̂_N(λ, k)).    (17)

The bias compensation factor in (16) is recursively smoothed in frames with no detected speech presence and kept fixed otherwise,

    R̄_min(λ) = { R̄_min(λ − 1)                                  if Σ_{k=0}^{K−1} D(λ, k) > 0,
                 α_min R̄_min(λ − 1) + (1 − α_min) R_min(λ)      if Σ_{k=0}^{K−1} D(λ, k) = 0,    (18)

where 0 ≤ α_min ≤ 1 is a constant recursive smoothing parameter, and the power ratio between the noise periodogram estimates and the minimum tracks is

    R_min(λ) = (Σ_{k=0}^{K−1} P̂_N(λ − 1, k)) / (Σ_{k=0}^{K−1} P_min(λ, k)).    (19)

5.2.

The noise PSD estimate consists of the power-scaled minimum tracks at all time-frequency locations,

    P̃_N(λ, k) = R̄_min(λ) P_min(λ, k),    (20)

with corresponding noise magnitude spectrum

    |Ñ(λ, k)| = √(P̃_N(λ, k)).    (21)
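One frame of the noise estimation in (16)–(21) can be sketched as follows. This is a simplified illustration with our own names; the temporal smoothing of the decisions shown in Figure 1 is omitted:

```python
import numpy as np

def noise_estimates(P_Y, P_min, D, R_bar_prev, PN_prev, alpha_min=0.7):
    """Return (noise periodogram estimate, noise PSD estimate, updated R_bar).

    The bias compensation factor R_bar is only updated in frames with no
    detected speech presence (eq. (18)) and kept fixed otherwise.
    """
    if not D.any():                        # speech-absent frame: update factor
        R = PN_prev.sum() / max(P_min.sum(), 1e-12)   # power ratio (eq. (19))
        R_bar = alpha_min * R_bar_prev + (1.0 - alpha_min) * R
    else:                                  # speech present: keep factor fixed
        R_bar = R_bar_prev
    PN_psd = R_bar * P_min                 # noise PSD estimate (eq. (20))
    PN_per = np.where(D, PN_psd, P_Y)      # noise periodogram estimate (eq. (16))
    return PN_per, PN_psd, R_bar
```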
6. SPEECH ENHANCEMENT
[Table 4: Noise estimation parameters.]

Variable   Value    Description
D_min      150^i    Minimum tracking window length
α_min      0.7      Scaling factor smoothing parameter

^i Corresponds …

[Table 5: Speech enhancement parameters.]

Description
Noise scaling factor for no-speech presence
Noise overestimation factor for speech presence
Attenuation rule order for speech presence
with only a single speech presence decision covering all frequencies in each frame, has previously been proposed by
Yang [2]. Moreover, Cohen and Berdugo [11] propose a binary detection of speech presence/absence (called the indicator function in their paper), which is similar to the one
we propose in this paper. However, their decision includes
noisy speech periodogram bins without smoothing, hence
some decisions will not be regionally connected. In our experience, this leads to artifacts if the decisions are used directly
in a speech enhancement scheme with two different attenuation rules for speech absence and speech presence. Cohen
and Berdugo smooth their binary decisions to obtain estimated speech presence probabilities, which are used for a soft
decision between two separate attenuation functions. Our
approach, as opposed to this, is to obtain adequately timefrequency smoothed spectra from which connected speech
presence regions can be obtained directly in a robust manner. As a consequence, we avoid distortion in speech absence
regions, and thereby obtain a natural sounding background
noise.
Let the generalized spectral subtraction variant be given
similar to the one proposed by Berouti et al. [12], but with
the decision of which attenuation rule to use given explicitly
by the proposed speech presence decisions, instead of comparisons between the estimated speech power and an estimated noise floor. The immediate advantage of our approach
is a higher degree of control with the properties of the enhancement algorithm. Our proposed method is given by
    |Ŝ(λ, k)| = { (|Y(λ, k)|^{a₁} − γ₁ |N̂(λ, k)|^{a₁})^{1/a₁}   if D(λ, k) = 1,
                  α₀ |Y(λ, k)|                                  if D(λ, k) = 0,    (22)

where α₀ is the noise scaling factor for no-speech presence, γ₁ the noise overestimation factor for speech presence, and a₁ the attenuation rule order for speech presence, cf. Table 5. The complex speech estimate reuses the noisy phase,

    Ŝ(λ, k) = |Ŝ(λ, k)| e^{j∠Y(λ, k)}.    (23)
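A sketch of the dual attenuation rule in (22)-(23). The parameter values below are placeholders only (the Table 5 values did not survive extraction), and the subtraction is floored at zero as a standard practical safeguard:

```python
import numpy as np

def enhance_frame(Y, N_mag, D, alpha0=0.1, gamma1=1.0, a1=2.0):
    """Apply one attenuation rule per speech-presence region (eq. (22)) and
    reuse the noisy phase (eq. (23)). alpha0, gamma1, a1 are placeholder
    values, not the paper's settings."""
    Y_mag = np.abs(Y)
    # generalized spectral subtraction where speech is detected, floored at 0
    sub = np.maximum(Y_mag ** a1 - gamma1 * N_mag ** a1, 0.0) ** (1.0 / a1)
    # downscaled noisy speech (natural sounding residual noise) elsewhere
    S_mag = np.where(D, sub, alpha0 * Y_mag)
    return S_mag * np.exp(1j * np.angle(Y))
```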
7. EXPERIMENTAL SETUP

8. EXPERIMENTAL RESULTS
[Figure: spectrograms over time (s) and frequency (Hz); magnitudes in dB.]

Spectrograms of the noise periodogram estimate and the noise PSD estimate, obtained using the methods we propose in Section 5, are shown in Figures 5a and 5b, respectively.
We evaluate the performance of the noise estimation methods by means of their spectral distortion, which we measure as segmental noise-to-error ratios (SegNERs). We calculate the SegNERs in the time-frequency domain, as the ratio (in dB) between the noise energy and the noise estimation error energy. These values are upper and lower limited by 35 and 0 dB [15], respectively, that is,

    SegNER(λ) = min{ max{ NER(λ), 0 }, 35 },    (24)

where

    NER(λ) = 10 log₁₀ ( Σ_{k=0}^{K−1} |N(λ, k)|² / Σ_{k=0}^{K−1} |N(λ, k) − N̂(λ, k)|² ),    (25)

and the final measure is the average over all M frames,

    SegNER = (1/M) Σ_{λ=0}^{M−1} SegNER(λ).    (26)
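The SegNER measure of (24)-(26) in a few lines (our own sketch; the small floor on the error energy guards against division by zero when the estimate is exact):

```python
import numpy as np

def seg_ner(N_true, N_est, lo=0.0, hi=35.0):
    """Segmental noise-to-error ratio (eqs. (24)-(26)).

    N_true, N_est: (M, K) arrays of true and estimated noise magnitudes."""
    num = np.sum(np.abs(N_true) ** 2, axis=1)                  # noise energy
    den = np.sum(np.abs(N_true - N_est) ** 2, axis=1)          # error energy
    ner = 10.0 * np.log10(num / np.maximum(den, 1e-12))        # eq. (25)
    return float(np.mean(np.clip(ner, lo, hi)))                # eqs. (24), (26)
```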
[Table 6: SegNER (dB) of the noise estimates at input SNRs of 0, 5, and 10 dB; column order as recovered from the extraction: noise periodogram estimate (proposed), noise PSD estimate (proposed), and the two reference noise estimation methods [3], [5].]

Highway traffic
SNR (dB)    0:   19.3   4.6   3.6   1.0
            5:   17.0   4.6   3.1   1.8
           10:   14.7   4.4   2.6   2.4

Car interior
SNR (dB)    0:   18.3   3.0   2.7   1.9
            5:   16.6   3.1   2.3   2.1
           10:   15.0   3.2   2.0   2.6
As an objective measure of time-domain waveform similarity, we list the signal-to-noise ratios, and as a subjective measure of speech quality, we conduct an informal listening test. In this test, test subjects give scores from the scale in Table 7, ranging from 1 to 5 in steps of 0.1, to three different speech enhancement methods, with the noisy speech as a reference signal. A higher score is given to the preferred speech enhancement method. The test subjects are asked to take parameters, such as the naturalness of the enhanced speech, the quality of the speech, and the degree of noise reduction into
account when assigning a score to an estimate.

[Table 7: Opinion score scale.]

Score   Description
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

The presentation order of estimates from individual methods is blinded, randomized, and varies in each test set and for each test subject. A total of 8 listeners, all working within the field of
speech signal processing, participated in the test. The proposed speech enhancement method was compared with our
implementation of two reference methods.
(i) MMSE-LSA. Minimum mean-square error log-spectral amplitude estimation, as proposed by Ephraim
and Malah [7].
(ii) MMSE-LSA-DD. Decision-directed MMSE-LSA,
which is the MMSE-LSA estimation in combination
with a smoothing mechanism [7]. Constants are as
proposed by Ephraim and Malah.
All three methods in the test use the proposed noise PSD estimate, as shown in Figure 5b. Also, they all use the analysis/synthesis setup described in Section 7. The enhanced
speech obtained from the noisy speech signal in Figure 2a is
shown in Figure 6.
SNRs and mean opinion scores (MOSs) from the informal subjective listening test are listed in Tables 8 and 9. All results are averaged over both speakers and listeners. The best obtained results are emphasized using bold letters. To identify if the proposed method is significantly better, that is, has a higher MOS, than MMSE-LSA-DD, we use the matched sample design [16], where the absolute values of the opinion scores are eliminated as a source of variation. Let d̄ be the mean of the opinion score difference between the proposed method and the MMSE-LSA-DD. Using this formulation, we test the null hypothesis H₀ of no MOS improvement against the one-sided alternative that the proposed method scores higher.
[Table 8: MOS and SNR results; the grouping of columns into test conditions was partially lost in extraction and is reproduced in order of appearance.]

Method            MOS    SNR (dB)   MOS    SNR (dB)   MOS    SNR (dB)   MOS
Proposed method   3.50   10.3       3.56   13.0       3.74   16.5       3.95
MMSE-LSA-DD       2.75   11.1       2.85   15.0       3.07   15.4       3.29
MMSE-LSA          1.63   9.3        1.92   14.0       2.04   12.6       2.37
Noisy speech      —      5.0        —      10.0       —      10.0       —
[Table 9: MOS and SNR results for the second test condition, reproduced in order of appearance.]

Method            MOS    SNR (dB)   MOS
Proposed method   3.53   13.4       3.82
MMSE-LSA-DD       2.54   10.9       2.99
MMSE-LSA          1.89   7.7        2.07
Noisy speech      —      5.0        —
[Significance tests of the MOS difference between the proposed method and MMSE-LSA-DD.]

Input SNR (dB)   Test statistic   Test result   Interval estimate
0                z = 10.3         Reject H₀     0.75 ± 0.19
5                z = 10.2         Reject H₀     0.72 ± 0.18
10               z = 10.1         Reject H₀     0.67 ± 0.17

The test statistic is

    z = d̄ / (s_d/√n),    (27)

where s_d is the standard deviation of the opinion score differences and n their number, and H₀ is rejected

    if z > z_{.01}.    (28)
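The matched sample test can be sketched as follows (our own formulation of the standard paired z-test; z_{.01} ≈ 2.326 for the one-sided 1% level of (28)):

```python
import numpy as np

def matched_sample_test(scores_a, scores_b, z_crit=2.326):
    """Paired (matched sample) z-test [16]: does method A receive a
    significantly higher opinion score than method B at the 1% level?
    Returns (z, reject_H0)."""
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    n = d.size
    z = d.mean() / (d.std(ddof=1) / np.sqrt(n))  # test statistic, eq. (27)
    return z, bool(z > z_crit)                   # decision rule, eq. (28)
```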
[Significance tests for the second test condition.]

Input SNR (dB)   Test statistic   Test result   Interval estimate
0                z = 11.4         Reject H₀     1.00 ± 0.23
5                z = 9.4          Reject H₀     0.83 ± 0.23
10               z = 6.7          Reject H₀     0.66 ± 0.25

9. DISCUSSION
We have proposed noise estimation and speech enhancement methods based on a simple connected region speech presence detection method. Despite the simplicity, the proposed methods are shown to have superior performance when compared to our implementation of state-of-the-art reference methods in the case of both noise estimation and speech enhancement.
In the first proposed noise estimation method, the connected speech presence regions are used to achieve noise periodogram estimates in the regions where speech is absent.
In the remaining regions, where speech is present, minimum tracks of the smoothed noisy speech periodograms are
bias compensated with a factor that is updated in regions
with speech absence. A second proposed noise estimation
method provides a noise PSD estimate by means of the same
power-scaled minimum tracks that are used by the noise periodogram estimation method when speech is present. It is
shown that the noise PSD estimate has less spectral distortion
than both our implementation of 2 -based noise estimation
[3] and MS noise estimation [5]. This can be explained by a
more accurate bias compensation factor, which uses speech
presence information. The noise periodogram estimate is by
far the less spectrally distorted noise estimate of the tested
noise estimation methods. This verifies the connected region
speech presence principle which is fundamental for the proposed speech enhancement method.
Our proposed enhancement method uses different attenuation rules for each of the two types of speech presence regions. When no speech is present, the noisy speech is downscaled and left in the speech estimate as natural sounding
masking noise, and when speech is present, a noise PSD estimate is used in a traditional generalized spectral subtraction.
In addition to enhancing the speech, the most distinct feature of the proposed speech enhancement method is that it
leaves natural sounding background noise matching the actual surroundings of the person wearing the hearing aid. The
proposed method performs well at SNRs equal to or higher
than 0 dB for noise types with slowly changing and spectrally smooth periodograms. Rapid, and speech-like, changes
in the noise will be treated as speech, and will therefore be
enhanced, causing a decrease in the naturalness of the background noise. At very low SNRs, the detection of speech presence will begin to fail. In this case, we suggest the implementation of the proposed method in a scheme, where low SNR
is detected and causes a change to an approach with only a
single and very conservative attenuation rule. Strong tonal
interferences will affect the speech presence decisions as well as the noise estimation and enhancement method, and should be detected and removed by preprocessing of the noisy signal immediately after the STFT analysis. Otherwise, a sufficiently strong tonal interference with duration longer than the minimum search window will cause the signal to be treated as if speech is absent, and the speech enhancement algorithm will downscale the entire noisy speech by multiplication with the no-speech presence scaling factor α₀.
Our approach generalizes to other noise reduction
schemes. As an example, the proposed binary scheme can
also be used with MMSE-LSA-DD for the speech presence
regions. For such a combination, we expect performance
similar to, or better than, what we have shown in this paper
for the generalized spectral subtraction. This is supported by
the findings of Cohen and Berdugo [11], who have shown that
a soft-decision approach improves MMSE-LSA-DD.
The informal listening test confirms that listeners prefer the downscaled background noise with fully preserved
naturalness over the less realistic whitened residual noise
from, for example, MMSE-LSA-DD. From our experiments,
we can conclude, with a confidence level of 99%, that the
proposed speech enhancement method receives significantly
higher MOS than MMSE-LSA-DD at all tested combinations
of SNR and noise type.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
for many constructive comments and suggestions to the previous versions of the manuscript, which have greatly improved the presentation of this work. This work was supported by The Danish National Centre for IT Research, Grant
no. 329, and Microsound A/S.
REFERENCES
[1] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, no. 4, pp. 451–515, 2000.
[2] J. Yang, "Frequency domain noise suppression approaches in mobile telephone systems," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '93), vol. 2, pp. 363–366, Minneapolis, Minn, USA, April 1993.
[3] R. Martin and T. Lotter, "Optimal recursive smoothing of non-stationary periodograms," in Proc. International Workshop on Acoustic Echo Control and Noise Reduction (IWAENC '01), pp. 43–46, Darmstadt, Germany, September 2001.
[4] R. Martin, "Spectral subtraction based on minimum statistics," in Proc. 7th European Signal Processing Conference (EUSIPCO '94), 1994.
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]