Analysis of Existing Noise Estimation Algorithms
In this chapter, some of the existing noise estimation algorithms that are based on tracking the noise power spectrum of noisy speech are discussed. Most of these algorithms can be classified broadly into two classes: the first updates the noise estimate by tracking the silence regions of speech, while the other updates the noise estimate using the histogram of the noisy speech power spectrum.
2.1 Minimum Statistics Noise Estimation (MSNE)
Martin's method [13] was based on minimum statistics and optimal smoothing of the noisy speech power spectral density. The method rested on two major observations. The first was that speech and noise are statistically independent, which implies that the power spectrum of the noisy speech is equal to the sum of the power spectra of the clean speech and the noise.
That is,

|Y(λ, k)|² = |X(λ, k)|² + |D(λ, k)|²        (2.1)

where |Y(λ, k)|², |X(λ, k)|² and |D(λ, k)|² were the power spectra of the noisy speech, clean speech and noise respectively, and λ and k denoted the time (frame) index and frequency bin index respectively. The second observation was that the power spectrum of the noisy speech often decays to the level of the noise power spectrum. Hence the estimate of the noise power spectral density was obtained by tracking the minimum of the noisy speech power spectrum in each frequency bin separately.
2.1.1 Principles of Minimum Statistics Noise Estimation Algorithm
The noise variance was estimated by tracking the minimum of noisy speech power
spectral density over a fixed window length. This window length was chosen wide enough to
bridge the broadest peak in any speech signal. It was found experimentally that window lengths of approximately 0.8–1.4 s gave good results. To search for the minimum, a first-order recursive (smoothed) version of the noisy speech power spectral density was used:
P_min(λ, k) = α P_min(λ − 1, k) + (1 − α) |Y(λ, k)|²        (2.2)

where P_min(λ, k) is the minimum power spectral density of the current frame, P_min(λ − 1, k) is the minimum power spectral density of the previous frame and α is the smoothing constant.
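To make the minimum-tracking idea concrete, the following is a minimal Python/NumPy sketch of Eq. (2.2) combined with a sliding-window minimum search. The function name, the fixed smoothing constant alpha = 0.85 and the window length of 100 frames are illustrative assumptions, not values prescribed by the method (the text only suggests a window of roughly 0.8–1.4 s).

```python
import numpy as np

def msne_noise_estimate(noisy_psd, alpha=0.85, win_frames=100):
    """Illustrative minimum-statistics noise tracking (sketch, not Martin's full method).

    noisy_psd  : array of shape (n_frames, n_bins) holding |Y(lambda, k)|^2
    alpha      : fixed smoothing constant of Eq. (2.2) (assumed value)
    win_frames : minimum-search window length in frames (assumed value)
    """
    n_frames, n_bins = noisy_psd.shape
    p_smooth = np.empty_like(noisy_psd)
    noise_est = np.empty_like(noisy_psd)
    p_smooth[0] = noisy_psd[0]
    noise_est[0] = noisy_psd[0]
    for lam in range(1, n_frames):
        # Eq. (2.2): first-order recursive smoothing of the noisy-speech PSD
        p_smooth[lam] = alpha * p_smooth[lam - 1] + (1.0 - alpha) * noisy_psd[lam]
        # Noise estimate: minimum of the smoothed PSD over the last win_frames frames,
        # tracked in each frequency bin separately
        start = max(0, lam - win_frames + 1)
        noise_est[lam] = p_smooth[start:lam + 1].min(axis=0)
    return noise_est
```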
2.1.2 Deriving Optimal Time-Frequency Dependent Smoothing Factor
The smoothing parameter α used in Eq. (2.2) had to be very low to follow the non-stationarity of the speech signal. On the other hand, it had to be close to one to keep the variance of the minimum tracking as small as possible. Hence there was a need for time- and frequency-dependent smoothing factors in place of a fixed smoothing factor. This was derived for the speech-absent region. The requirement was that the smoothed power spectrum P(λ, k) had to be equal to the noise variance σ_D²(λ, k) during speech pauses. Hence the smoothing parameter was derived by minimizing the mean squared error between P(λ, k) and σ_D²(λ, k) as follows:

E{ (P(λ, k) − σ_D²(λ, k))² | P(λ − 1, k) }        (2.3)
where

P(λ, k) = α(λ, k) P(λ − 1, k) + (1 − α(λ, k)) |Y(λ, k)|²        (2.4)

Note that in Eq. (2.4) the time-frequency dependent smoothing factor α(λ, k) was used instead of the fixed α defined in Eq. (2.2). Substituting Eq. (2.4) into Eq. (2.3) and setting the first derivative to zero gave the optimum value for α(λ, k):
α_opt(λ, k) = 1 / ( 1 + ( P(λ − 1, k) / σ_D²(λ, k) − 1 )² )        (2.5)
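As a sketch of how Eq. (2.5) could be applied per frequency bin, the snippet below computes α_opt(λ, k) from the previous smoothed spectrum and the current noise-variance estimate; the clipping range and the small constant guarding against division by zero are illustrative safeguards, not part of the derivation.

```python
import numpy as np

def optimal_smoothing(p_prev, sigma_d2, alpha_min=0.3, alpha_max=0.96):
    """Eq. (2.5): alpha_opt = 1 / (1 + (P(lambda-1, k) / sigma_D^2(lambda, k) - 1)^2).

    p_prev   : smoothed PSD of the previous frame, shape (n_bins,)
    sigma_d2 : current noise-variance estimate, shape (n_bins,)
    The clipping limits are assumed values used only to keep the
    recursion numerically well behaved.
    """
    ratio = p_prev / np.maximum(sigma_d2, 1e-12)   # guard against division by zero
    alpha_opt = 1.0 / (1.0 + (ratio - 1.0) ** 2)
    return np.clip(alpha_opt, alpha_min, alpha_max)
```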
2.1.3 Drawbacks of MSNE
The noise power spectrum estimate in MSNE was obtained by tracking the minimum of the noisy speech power spectrum over a specified window of length L frames. This length was based on the idea that the window would encompass at least one silence period of the noisy speech, which in turn means that it tracks at least one noise-only frame. Also, since there was no way to adjust the length of this window based on the width of the speech peaks, the window length had to be chosen large enough to encompass the broadest peak possible in any speech waveform. As a consequence, the noise estimate can lag behind sudden increases in the noise level by up to the duration of this window.
2.2 Minima Controlled Recursive Averaging (MCRA)
In minima controlled recursive averaging (MCRA) [2] the noise estimate was updated by recursively averaging past spectral values of the noisy speech, controlled by time- and frequency-dependent smoothing factors. These smoothing factors were calculated based on the speech presence probability in each frequency bin separately. This probability was in turn calculated using the ratio of the noisy speech power spectrum to its local minimum computed over a fixed window of time.
2.2.1 Noise Spectrum Estimation
The derivation of the noise power spectrum from the speech presence probability was based on the following two hypotheses:

H₀(λ, k): Y(λ, k) = D(λ, k)
H₁(λ, k): Y(λ, k) = X(λ, k) + D(λ, k)        (2.6)
where Y(λ, k), X(λ, k) and D(λ, k) represented the short-time Fourier transforms of the noisy speech, clean speech and noise respectively, and H₀(λ, k) and H₁(λ, k) represented the speech-absent and speech-present hypotheses respectively. The noise variance was represented as

σ_D²(λ, k) = E{ |D(λ, k)|² }        (2.7)
The update of the noise estimate under the above two hypotheses was written as follows:

H′₀(λ, k): σ_D²(λ + 1, k) = α_d σ_D²(λ, k) + (1 − α_d) |Y(λ, k)|²        (2.8)
H′₁(λ, k): σ_D²(λ + 1, k) = σ_D²(λ, k)

where σ_D²(λ, k) was the estimate of the noise variance and α_d was the smoothing factor.
The overall noise estimate was obtained based on the speech presence probability as

σ_D²(λ + 1, k) = σ_D²(λ + 1, k | H′₁(λ, k)) p′(λ, k) + σ_D²(λ + 1, k | H′₀(λ, k)) (1 − p′(λ, k))        (2.9)
where p′(λ, k) = P( H′₁(λ, k) | Y(λ, k) ) is the conditional speech presence probability.
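A minimal sketch of the MCRA update of Eqs. (2.8)–(2.9) is given below, assuming the conditional speech presence probability p′(λ, k) has already been computed from the minimum-controlled ratio test; the value alpha_d = 0.95 is an assumed smoothing factor, not one taken from [2].

```python
import numpy as np

def mcra_noise_update(sigma_d2, y_psd, p_speech, alpha_d=0.95):
    """One frame of the MCRA noise update (Eqs. 2.8-2.9), applied bin-wise.

    sigma_d2 : current noise-variance estimate sigma_D^2(lambda, k), shape (n_bins,)
    y_psd    : current noisy power spectrum |Y(lambda, k)|^2, shape (n_bins,)
    p_speech : conditional speech presence probability p'(lambda, k), shape (n_bins,)
    Returns the updated estimate sigma_D^2(lambda + 1, k).
    """
    update_h0 = alpha_d * sigma_d2 + (1.0 - alpha_d) * y_psd  # Eq. (2.8), speech absent
    update_h1 = sigma_d2                                      # Eq. (2.8), speech present
    # Eq. (2.9): blend the two hypotheses with the speech presence probability
    return p_speech * update_h1 + (1.0 - p_speech) * update_h0
```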
N(l, k) = η(l, k) N(l − 1, k) + (1 − η(l, k)) |Y(l, k)|²        (3.3)

where N(l, k) is the smoothed power spectrum, l is the frame index, k is the frequency index, |Y(l, k)|² is the short-time power spectrum of the noisy speech and η is a smoothing constant [7]. The smoothing constant η is not fixed but varies with time and frequency. The above recursive equation provides a smoothed version of the periodogram |Y(l, k)|² [11].
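As a small illustration of the recursion in Eq. (3.3) as reconstructed above, a time-frequency varying smoothing of the periodogram could be written as follows; treating the smoothing constant as a per-bin array is an assumption of this sketch.

```python
import numpy as np

def smooth_periodogram(n_prev, y_psd, eta):
    """Eq. (3.3): recursive smoothing of the periodogram |Y(l, k)|^2.

    n_prev : smoothed spectrum of the previous frame N(l - 1, k)
    y_psd  : current periodogram |Y(l, k)|^2
    eta    : per-bin smoothing constants eta(l, k), same shape as n_prev
    """
    return eta * n_prev + (1.0 - eta) * y_psd
```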
3.2 SERA (Spectral Entropy Recursive Averaging) Noise Estimation Algorithm
Fig. 3.2: SERA Algorithm
In this algorithm, the noisy speech signal is segmented into a number of frames and the FFT is performed on those frames [8]. To discriminate the various frames of the noisy speech signal, the entropy is calculated. The threshold determination, the classification of the noisy speech signal, and the way the noise power is estimated and updated are shown in Fig. 3.3.
Fig. 3.3: Classification of Noisy Speech Signal
3.3 Determination of Entropy
The proposed noise estimation method classifies the noisy speech precisely into three categories: pure speech, non-speech and quasi-speech. For this purpose, two thresholds are introduced for the entropy H(l):

H(l) = − Σ_k S(l, k) log( S(l, k) )        (3.4)

H(l) is called the entropy of the noisy speech signal, and is a quantitative measure of how certain the outcome of the random noisy speech signal is.
where

S(l, k) = Y_energy(l, k) / Σ_k Y_energy(l, k)        (3.5)

and Y_energy(l, k) is the energy of the noisy speech in frequency bin k of frame l.
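The spectral entropy of Eqs. (3.4)–(3.5) can be sketched as follows; using the squared FFT magnitude as Y_energy(l, k) and adding a small constant to avoid log(0) are assumptions made for illustration.

```python
import numpy as np

def frame_entropy(frame, n_fft=256, eps=1e-12):
    """Spectral entropy H(l) of one analysis frame (Eqs. 3.4-3.5).

    frame : 1-D array of time-domain samples of one frame
    The squared FFT magnitude is used as Y_energy(l, k) (an assumption),
    and eps avoids log(0) in silent bins.
    """
    y_energy = np.abs(np.fft.rfft(frame, n_fft)) ** 2      # Y_energy(l, k)
    s = y_energy / (np.sum(y_energy) + eps)                # Eq. (3.5): normalised spectrum
    return -np.sum(s * np.log(s + eps))                    # Eq. (3.4): entropy H(l)
```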
R() = max
N(l, k) = α N(l − 1, k) + (1 − α) |Y(l, k)|²        (3.8)

N(l, k) is the noise spectrum estimated in the non-speech frame. The constant α is known as the forgetting factor (or look-ahead factor, or smoothing factor) and lies between 0.7 and 0.9 [9], [10].
3.5.2 Update of Noise Estimation for Quasi-Speech
The proposed algorithm for updating the noise spectrum in speech-present frames was based on classifying the frequency bins of each frame as speech-present or speech-absent. This was done by tracking the local minimum of the noisy speech and then deciding speech presence in each frequency bin separately using the ratio of the noisy speech power to its local minimum. Based on that decision, a frequency-dependent smoothing parameter was calculated to update the noise power spectrum. The purpose of introducing the quasi-speech frame is to analyze the noisy speech signal accurately.
If T₂(l) < H_avg(l) < T₁(l), the frame is treated as quasi-speech and the noise spectrum is updated as

N(l, k) = α_s(l, k) N(l − 1, k) + (1 − α_s(l, k)) |Y(l, k)|²        (3.9)

where α_s(l, k) is the time-frequency dependent smoothing factor computed in Section 3.5.5.
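The three-way classification and the corresponding noise updates can be sketched as below. The thresholds t1 and t2, the forgetting factor alpha and the per-bin factors alpha_s of Eq. (3.13) are assumed inputs, and treating high entropy as noise-like (non-speech) is an assumption of this sketch, since the surviving text states only the quasi-speech condition explicitly.

```python
import numpy as np

def sera_update_noise(noise, y_psd, h_avg, t1, t2, alpha=0.8, alpha_s=None):
    """Dispatch the noise update according to the entropy-based frame class (sketch).

    noise   : previous noise spectrum N(l - 1, k), shape (n_bins,)
    y_psd   : current noisy power spectrum |Y(l, k)|^2, shape (n_bins,)
    h_avg   : averaged entropy H_avg(l) of the current frame
    t1, t2  : entropy thresholds with t2 < t1 (assumed to be given)
    alpha   : forgetting factor of Eq. (3.8), assumed value in 0.7-0.9
    alpha_s : per-bin smoothing factors of Eq. (3.13) for quasi-speech frames
    """
    if h_avg >= t1:
        # Non-speech frame (high entropy assumed noise-like): Eq. (3.8)
        return alpha * noise + (1.0 - alpha) * y_psd
    elif t2 < h_avg < t1:
        # Quasi-speech frame: bin-wise update of Eq. (3.9)
        return alpha_s * noise + (1.0 - alpha_s) * y_psd
    else:
        # Pure speech frame: keep the previous noise estimate
        return noise.copy()
```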
3.5.3 Tracking the Minimum of Noisy Speech
For tracking the minimum of the noisy speech power spectrum over a fixed search
window length, various methods were proposed. These methods were sensitive to outliers and
also the noise update was dependent on the length of the minimum-search window. To track the minimum of the noisy speech by continuously averaging past spectral values, the following non-linear rule is used instead:
If P_min(l − 1, k) < P(l, k) then

P_min(l, k) = γ P_min(l − 1, k) + ((1 − γ) / (1 − β)) ( P(l, k) − β P(l − 1, k) )        (3.10)

else

P_min(l, k) = P(l, k)        (3.11)

where P_min(l, k) is the local minimum of the noisy speech power spectrum. The constants were determined experimentally; γ = 0.998 and β = 0.96, together with a third constant in the range 0.6 to 0.7, gave good results. The look-ahead factor β controls the adaptation time of the local minimum.
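A sketch of this non-linear minimum-tracking rule (Eqs. 3.10–3.11) applied bin-wise, using the experimentally reported values γ = 0.998 and β = 0.96:

```python
import numpy as np

def track_local_minimum(p_min_prev, p_prev, p_curr, gamma=0.998, beta=0.96):
    """Non-linear local-minimum tracking (Eqs. 3.10-3.11), applied bin-wise.

    p_min_prev : previous local minimum P_min(l - 1, k)
    p_prev     : previous smoothed noisy PSD P(l - 1, k)
    p_curr     : current smoothed noisy PSD P(l, k)
    """
    rising = p_min_prev < p_curr
    # Eq. (3.10): let the minimum rise slowly while the PSD is increasing
    p_min_rise = gamma * p_min_prev + ((1.0 - gamma) / (1.0 - beta)) * (p_curr - beta * p_prev)
    # Eq. (3.11): otherwise the minimum follows the PSD directly
    return np.where(rising, p_min_rise, p_curr)
```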
3.5.4 Speech Presence Probability
To determine speech presence in each frequency bin, the ratio of the noisy speech power spectrum to its local minimum is defined as
S_r(l, k) = P(l, k) / P_min(l, k)        (3.12)
This ratio is then compared with a frequency-dependent threshold; if the ratio is greater than the threshold, the bin is taken as a speech-present frequency bin, otherwise it is taken as a speech-absent frequency bin. This is based on the principle that the power spectrum of the noisy speech will be nearly equal to its local minimum when speech is absent. Hence, the smaller the ratio defined in Eq. (3.12), the higher the possibility that the bin belongs to a noise-only region, and vice versa.
If S_r(l, k) > δ(k) then
    I(l, k) = 1 (speech present)
else
    I(l, k) = 0 (speech absent)
end

where δ(k) is the frequency-dependent threshold whose optimal value is determined experimentally. Note that a fixed threshold δ was used in place of δ(k) for all frequencies. From the above rule, the speech-presence probability p(l, k) is updated using the first-order recursion given later in Eq. (3.16).
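A sketch of the bin-wise ratio test of Eq. (3.12) and the resulting indicator I(l, k); the fixed threshold value delta = 5 is only an assumed example, since the text states that the optimal value is found experimentally.

```python
import numpy as np

def speech_presence_indicator(p_curr, p_min, delta=5.0):
    """Compute S_r(l, k) = P(l, k) / P_min(l, k) and the indicator I(l, k).

    p_curr : smoothed noisy PSD P(l, k), shape (n_bins,)
    p_min  : local minimum P_min(l, k), shape (n_bins,)
    delta  : decision threshold (assumed example value)
    """
    s_r = p_curr / np.maximum(p_min, 1e-12)    # Eq. (3.12)
    indicator = (s_r > delta).astype(float)    # I(l, k) = 1 for speech-present bins
    return s_r, indicator
```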
3.5.5 Calculating Frequency Dependent Smoothing Constant
By using the above speech-presence probability estimate, the time-frequency dependent smoothing factor α_s(l, k) is computed as follows:

α_s(l, k) = α_d + (1 − α_d) p(l, k)        (3.13)

where α_d is a constant.
P_sp(l, k) is the speech presence probability, given by

P_sp(l, k) = |Y(l, k)|² / P_min(l, k)        (3.14)
P_min(l, k) is the minimum noisy speech spectrum and it is updated by the following equation:

P_min(l, k) = ζ P_min(l − 1, k) + (1 − ζ) |Y(l, k)|²        (3.15)

where ζ is a smoothing factor, P_min(l, k) is the minimum power spectral density of the current frame, P_min(l − 1, k) is the minimum power spectral density of the previous frame and |Y(l, k)|² is the short-time power spectrum of the noisy speech.
The speech-presence probability is updated using the following first-order recursion:

p(l, k) = α_p p(l − 1, k) + (1 − α_p) I(l, k)        (3.16)

where α_p is a smoothing constant, p(l, k) is the speech-presence probability of the current frame and p(l − 1, k) is that of the previous frame. The recursive equation Eq. (3.16) implicitly exploits the correlation of speech presence in adjacent frames.
In a practical implementation, the smoothing parameter [11] is limited to a maximum value of 0.96 in order to avoid deadlock when r(l, k) = 1, where

r(l, k) = P(l, k) / N(l, k)        (3.17)

Eq. (3.17) is a smoothed version of the a posteriori SNR.
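Putting Eqs. (3.13)–(3.16) together, one frame of the probability-driven noise update could look like the sketch below; alpha_d = 0.85, alpha_p = 0.2 and the 0.96 cap on the smoothing factor are assumed example values (the cap follows the deadlock remark above).

```python
import numpy as np

def update_noise_with_probability(noise, y_psd, p_prev, indicator,
                                  alpha_d=0.85, alpha_p=0.2, alpha_max=0.96):
    """One frame of the probability-controlled noise update (sketch of Eqs. 3.13-3.16).

    noise     : previous noise spectrum N(l - 1, k)
    y_psd     : current noisy power spectrum |Y(l, k)|^2
    p_prev    : previous speech-presence probability p(l - 1, k)
    indicator : I(l, k) from the ratio test of Eq. (3.12)
    alpha_d, alpha_p, alpha_max : assumed example constants
    """
    # Eq. (3.16): first-order recursion for the speech-presence probability
    p = alpha_p * p_prev + (1.0 - alpha_p) * indicator
    # Eq. (3.13): time-frequency dependent smoothing factor, capped to avoid deadlock
    alpha_s = np.minimum(alpha_d + (1.0 - alpha_d) * p, alpha_max)
    # Bin-wise recursive averaging of the noise spectrum with alpha_s(l, k)
    new_noise = alpha_s * noise + (1.0 - alpha_s) * y_psd
    return new_noise, p
```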
7.3.1 Adaptive Kalman Filtering Algorithm
Since the noise changes with the surrounding environment, it is necessary to update the noise estimate constantly, so that a more accurate description of the noise can be obtained. An adaptive Kalman filtering algorithm for speech enhancement can adapt to changes in the environmental noise and can constantly update the estimate of the background noise.
The Kalman filtering algorithm itself is well known. The adaptive Kalman filtering algorithm estimates the system process noise and the measurement noise on-line, according to the measured and filtered values, tracking changes of the noise in real time in order to amend the filter parameters and improve the filtering effect.
In this adaptive Kalman filter, a reasonable threshold is set and used to determine whether the current speech frame is noise or not. The procedure consists of two main steps: one is updating the variance of the environmental noise R_v(n), and the other is updating the threshold U.
1) Updating the variance of the environmental noise:

R_v(n) = (1 − d) R_v(n) + d R_u(n)        (7.3.1a)
In the above equation, d is a loss factor that limits the length of the filtering memory and enhances the role of new observations in the current estimate, so that new data play a major role in the estimation while old data are gradually forgotten. According to [7], its formula is

d = (1 − b) / (1 − b^(t+1))        (7.3.1b)

where b is the forgetting factor (0 < b < 1), usually in the range 0.95 to 0.99. Here the value of b is taken as 0.99.
Before implementing Eq. (7.3.1a), the variance of the current speech frame R_u(n) is compared with the threshold U updated in the previous iteration. If R_u(n) is less than or equal to U, the current speech frame can be considered as noise, and the algorithm then re-estimates the noise variance. R_u(n) cannot replace R_v(n) directly; in order to reduce the error, the weighted update of Eq. (7.3.1a) is used.
2) Updating the threshold:

U = (1 − d) U + d R_u(n)        (7.3.1c)
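The frame-wise update of the noise variance and the threshold (Eqs. 7.3.1a–7.3.1c) could be sketched as follows. The frame variance is computed directly from the samples, b = 0.99 follows the value quoted above, and the initial values of R_v and U are assumptions; the SNR-based restriction discussed below is omitted from this sketch.

```python
import numpy as np

def adaptive_noise_threshold(frames, b=0.99, r_v0=1e-3, u0=1e-3):
    """Track the environmental-noise variance R_v(n) and threshold U frame by frame (sketch).

    frames   : iterable of 1-D arrays, the speech frames
    b        : forgetting factor (0 < b < 1), 0.99 as quoted in the text
    r_v0, u0 : assumed initial values of R_v and U
    """
    r_v, u = r_v0, u0
    for t, frame in enumerate(frames):
        r_u = np.var(frame)                     # variance of the current frame, R_u(n)
        d = (1.0 - b) / (1.0 - b ** (t + 1))    # Eq. (7.3.1b): loss factor
        if r_u <= u:
            # Frame judged to be noise: re-estimate the noise variance, Eq. (7.3.1a)
            r_v = (1.0 - d) * r_v + d * r_u
        # Eq. (7.3.1c): update the threshold (SNR limitation from the text not applied here)
        u = (1.0 - d) * u + d * r_u
    return r_v, u
```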
In Eq. (7.3.1c), d is used again to reduce the error. However, there will be a large error when the noise is large, because the threshold update of U is not restricted by the condition R_u(n) ≤ U; it is affected only by R_u(n). Therefore, another limitation must be added before implementing the threshold update. In order to rule out speech frames whose SNR (signal-to-noise ratio) is high enough, the frame SNR is defined as
SNR(n) = 10 log₁₀( R_u(n) / R_v(n) )        (7.3.1d)
The SNR of the whole speech signal is

SNR₀(n) = 10 log₁₀( … )        (7.3.1e)
In Eqs. (7.3.1d) and (7.3.1e), n is the number of speech frames, and