
ANALYSIS OF EXISTING NOISE ESTIMATION ALGORITHMS

In this chapter, some of the existing noise estimation algorithms, which are based on tracking the noise power spectrum of noisy speech, are discussed. Most of these algorithms can be broadly classified into two classes: the first class updates the noise estimate by tracking the silence regions of speech, while the second updates the noise estimate using the histogram of the noisy speech power spectrum.
2.1 Minimum Statistics Noise Estimation (MSNE)

Martin's method [13] was based on minimum statistics and optimal smoothing of the noisy speech power spectral density. The method rested on two major observations. The first was that, since speech and noise are independent, the power spectrum of the noisy speech equals the sum of the power spectra of the clean speech and the noise.
That is,

|Y(l, k)|² = |X(l, k)|² + |D(l, k)|²    (2.1)
where |Y(l, k)|², |X(l, k)|² and |D(l, k)|² were the power spectra of the noisy speech, the clean speech and the noise respectively, and l and k denoted the time (frame) index and the frequency bin index respectively. The second observation was that the power spectrum of the noisy speech often becomes equal to the power spectrum of the noise. Hence the estimate of the noise power spectral density was obtained by tracking the minimum of the noisy speech power spectrum in each frequency bin separately.


2.1.1 Principles of Minimum Statistics Noise Estimation Algorithm
The noise variance was estimated by tracking the minimum of the noisy speech power spectral density over a fixed window length. This window length was chosen wide enough to bridge the broadest peak in any speech signal. It was found experimentally that window lengths of approximately 0.8-1.4 s gave good results. For searching the minimum, the noisy speech power spectral density was first smoothed using a first-order recursive equation
P_min(l, k) = α P_min(l − 1, k) + (1 − α) |Y(l, k)|²    (2.2)

where P_min(l, k) is the minimum power spectral density of the current frame, P_min(l − 1, k) is the minimum power spectral density of the previous frame and α is the smoothing constant.
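A minimal sketch of this smoothing-and-minimum-tracking step is given below. It assumes the noisy speech has already been framed and transformed so that |Y(l, k)|² is available; the window length (in frames) and the smoothing constant are illustrative values rather than the ones used in [13].

```python
import numpy as np

def minimum_statistics_noise(noisy_psd, alpha=0.85, win_len=96):
    """Track the noise PSD as the minimum of a recursively smoothed
    noisy-speech power spectrum over a sliding window (Sec. 2.1.1).

    noisy_psd : array of shape (num_frames, num_bins) holding |Y(l, k)|^2
    alpha     : smoothing constant of the first-order recursion (Eq. 2.2)
    win_len   : search-window length in frames (roughly 0.8-1.4 s of speech)
    """
    num_frames, _ = noisy_psd.shape
    smoothed = np.empty_like(noisy_psd, dtype=float)
    noise_est = np.empty_like(noisy_psd, dtype=float)
    smoothed[0] = noisy_psd[0]
    noise_est[0] = noisy_psd[0]
    for l in range(1, num_frames):
        # First-order recursive smoothing of the periodogram (Eq. 2.2)
        smoothed[l] = alpha * smoothed[l - 1] + (1 - alpha) * noisy_psd[l]
        # Minimum of the smoothed spectrum over the last win_len frames,
        # taken in each frequency bin separately
        start = max(0, l - win_len + 1)
        noise_est[l] = smoothed[start:l + 1].min(axis=0)
    return noise_est
```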
2.1.2 Deriving Optimal Time-Frequency Dependent Smoothing Factor
The smoothing parameter α used in Eq. (2.2) had to be very low to follow the non-stationarity of the speech signal. On the other hand, it had to be close to one to keep the variance of the minimum tracking as small as possible. Hence there was a need for time- and frequency-dependent smoothing factors in place of a fixed smoothing factor. These were derived for the speech-absent region. The requirement was that the smoothed power spectrum P(l, k) had to be equal to the noise variance σ_D²(l, k) during speech pauses. Hence the smoothing parameter was derived by minimizing the conditional mean squared error between P(l, k) and σ_D²(l, k) as follows

E{ (P(l, k) − σ_D²(l, k))² | P(l − 1, k) }    (2.3)
where

P(l, k) = α(l, k) P(l − 1, k) + (1 − α(l, k)) |Y(l, k)|²    (2.4)

Note that in Eq. (2.4) a time-frequency dependent smoothing factor α(l, k) was used instead of the fixed α defined in Eq. (2.2). Substituting Eq. (2.4) into Eq. (2.3) and setting the first derivative to zero gave the optimum value for α(l, k)
α_opt(l, k) = 1 / ( 1 + ( P(l − 1, k) / σ_D²(l, k) − 1 )² )    (2.5)
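The sketch below shows how Eq. (2.4) and Eq. (2.5) combine in a single update step; it assumes the smoothed spectrum of the previous frame and a current noise-variance estimate are already available, and the variable names are illustrative.

```python
import numpy as np

def optimal_smoothing_update(p_prev, noise_var, y_psd):
    """One frame of the optimally smoothed power spectrum (Eqs. 2.4-2.5).

    p_prev    : P(l-1, k), smoothed power spectrum of the previous frame
    noise_var : current estimate of the noise variance sigma_D^2(l, k)
    y_psd     : |Y(l, k)|^2, periodogram of the current noisy frame
    All arguments are arrays over the frequency bins k.
    """
    eps = 1e-12  # guards against division by zero in silent bins
    # Time-frequency dependent optimal smoothing factor (Eq. 2.5)
    alpha_opt = 1.0 / (1.0 + (p_prev / (noise_var + eps) - 1.0) ** 2)
    # Recursive smoothing with the optimal factor (Eq. 2.4)
    return alpha_opt * p_prev + (1.0 - alpha_opt) * y_psd
```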
2.1.3 Drawbacks of MSNE
The noise power spectrum estimate in MSNE was obtained by tracking the minimum of the noisy speech power spectrum over a specified window of length L frames. This length was chosen so that the window would encompass at least one silence period of the noisy speech, which in turn ensures that it tracks at least one noise-only frame. Also, since there was no way to adjust the length of this window based on the width of the speech peaks, the window length had to be chosen large enough to encompass the broadest peak possible in any speech waveform.

2.2 Minima Controlled Recursive Averaging (MCRA)
In minima controlled recursive averaging (MCRA) [2], the noise estimate was updated by averaging the past spectral values of the noisy speech, controlled by time- and frequency-dependent smoothing factors. These smoothing factors were calculated from the signal presence probability in each frequency bin separately. This probability was in turn calculated using the ratio of the noisy speech power spectrum to its local minimum, computed over a fixed time window.
2.2.1 Noise Spectrum Estimation
The derivation for noise power spectrum from the signal presence probability was based
on the following two hypotheses.
H_0(l, k): Y(l, k) = D(l, k)
H_1(l, k): Y(l, k) = X(l, k) + D(l, k)    (2.6)
where Y(l, k), X(l, k) and D(l, k) represented the short-time Fourier transforms of the noisy speech, the clean speech and the noise respectively, and H_0(l, k) and H_1(l, k) represented the speech-absent and speech-present hypotheses respectively. The noise variance was represented as

σ_D²(l, k) = E[ |D(l, k)|² ]    (2.7)
The update of the noise estimate under the above two hypotheses was written as follows

H'_0(l, k):  σ_D²(l + 1, k) = α_d σ_D²(l, k) + (1 − α_d) |Y(l, k)|²    (2.8)
H'_1(l, k):  σ_D²(l + 1, k) = σ_D²(l, k)

where σ_D²(l, k) was the estimate of the noise variance and α_d was the smoothing factor. The overall noise estimate was obtained based on the speech presence probability as
σ_D²(l + 1, k) = σ_D²(l + 1, k | H'_1) p'(l, k) + σ_D²(l + 1, k | H'_0) (1 − p'(l, k))    (2.9)
where p'(l, k) = P( H'_1(l, k) | Y(l, k) ) was the speech presence probability. The noise variance for the two hypotheses defined in Eq. (2.8) was substituted and simplified as follows
σ_D²(l + 1, k) = α̃_d(l, k) σ_D²(l, k) + (1 − α̃_d(l, k)) |Y(l, k)|²    (2.10)

where

α̃_d(l, k) = α_d + (1 − α_d) p'(l, k)    (2.11)

Eq. (2.11) was based on the principle that the noise estimate was updated (with 0 < α_d < 1) whenever silence was detected, and otherwise kept constant.
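A compact sketch of this presence-probability controlled update is shown below, assuming the speech presence probability p'(l, k) has already been computed for the current frame; the variable names and the default value of α_d are illustrative.

```python
import numpy as np

def mcra_noise_update(noise_var, y_psd, p_speech, alpha_d=0.95):
    """One MCRA noise-variance update (Eqs. 2.10-2.11).

    noise_var : current noise variance estimate, one value per frequency bin
    y_psd     : |Y(l, k)|^2 of the current noisy frame
    p_speech  : speech presence probability p'(l, k), in [0, 1]
    alpha_d   : base smoothing factor, 0 < alpha_d < 1
    """
    # Time-varying smoothing factor (Eq. 2.11): close to 1 where speech is
    # likely, so the noise estimate is effectively frozen in those bins.
    alpha_tilde = alpha_d + (1.0 - alpha_d) * p_speech
    # Recursive averaging controlled by the presence probability (Eq. 2.10)
    return alpha_tilde * noise_var + (1.0 - alpha_tilde) * y_psd
```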



2.2.2 Drawbacks of MCRA
The major drawback of the noise estimation algorithm in MCRA was the update of the local minimum of the noisy speech for increasing noise levels. According to the minimum tracking rule in MCRA, the minimum value was chosen as the minimum of the previous local-minimum estimate and the current noisy speech power.
2.3 Weighted Average Technique (WAT)
In the weighted average technique [5], the noise power spectrum estimate was updated by comparing the noisy speech power spectrum to the current noise estimate, without an explicit Voice Activity Detector. In WAT, the noise estimate was updated continuously by smoothing the spectral values of the noisy speech which fell below a threshold. This smoothing was represented as follows

D(l, k) = α D(l − 1, k) + (1 − α) |Y(l, k)|²    (2.12)

where D(l, k) is the noise power spectral density estimate of the current frame and D(l − 1, k) is that of the previous frame. The threshold for this method was taken as β D(l − 1, k), where β takes a value in the range of 1.5 to 2.5. This threshold was adaptive in the sense that it changes depending on the noise power level present in the noisy speech. Thus the threshold can follow the slow changes in noise power levels for slowly varying noise statistics. The overall algorithm can be summarized as follows
If |Y(l, k)|² < β D(l − 1, k) then
    D(l, k) = α D(l − 1, k) + (1 − α) |Y(l, k)|²    (2.13)
else
    D(l, k) = D(l − 1, k)    (2.14)
end
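A minimal per-frame sketch of this weighted-average update, with illustrative values for the smoothing constant α and the threshold factor β, is given below.

```python
import numpy as np

def wat_noise_update(noise_psd_prev, y_psd, alpha=0.9, beta=2.0):
    """Weighted-average noise update (Eqs. 2.13-2.14), applied per bin.

    noise_psd_prev : D(l-1, k), previous noise PSD estimate
    y_psd          : |Y(l, k)|^2 of the current noisy frame
    alpha          : smoothing constant
    beta           : threshold factor, typically 1.5 to 2.5
    """
    below = y_psd < beta * noise_psd_prev            # likely noise-only bins
    smoothed = alpha * noise_psd_prev + (1.0 - alpha) * y_psd
    # Update only where the noisy power falls below the adaptive threshold,
    # otherwise keep the previous estimate (Eq. 2.14).
    return np.where(below, smoothed, noise_psd_prev)
```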




2.3.1 Drawbacks of WAT
In the weighted averaging technique, a very simple and computationally efficient procedure was used for noise estimation. This method, however, has a drawback when a high-SNR speech segment is followed by a low-SNR speech segment. In this case, the initial estimate of the noise power spectrum is very low during the high-SNR regions, and hence the threshold, which is based on the current noise estimate, is also very low. During the low-SNR regions the noise energy itself will be very high, but the threshold for finding noise-only regions will remain very low, since it is based on the noise estimate from the high-SNR region. This may result in a situation where the noisy speech never becomes smaller than the threshold. Thus the noise estimate will never be updated as long as the noise power stays at that higher level [15]. Since the weighted average technique updates the noise estimate only in the segments classified as noise, it fails to track the noise level completely.

To overcome this drawback, we have proposed a reliable and fast noise estimation technique, SERA (Spectral Entropy Recursive Averaging), for speech enhancement in real-time environments. SERA is described in Chapter 3, the results are presented in Chapter 4, and Chapter 5 gives the conclusion.








SPECTRAL ENTROPY RECURSIVE AVERAGING (SERA) NOISE ESTIMATION ALGORITHM

This chapter deals with the proposed algorithm, named the Spectral Entropy Recursive Averaging (SERA) noise estimation algorithm, which is based on estimating the noise power spectrum using only the power spectrum of the noisy speech, as shown in Figure 3.1. The method is based on spectral entropy and on tracking the minimum of the noisy speech power spectrum. Hence the update for varying noise power levels is much faster compared to other algorithms, and at the same time the noise power spectrum is not overestimated.

In SERA, noise estimation is updated in both speech pauses and also speech present
frames. Speech presence is determined by computing the ratio of the noisy speech power
spectrum to its local minimum, which is computed by averaging past values of the noisy
speech power spectra with a look-ahead factor.

3.1. Proposed Noise Estimation Algorithm



Fig. 3.1: Overview of the proposed work



In this work, the noise estimation algorithm is improved in the following aspects:

- Update of the noise estimate without using a Voice Activity Decision.
- Estimate of the speech-presence probability exploiting the correlation of power spectral components in neighboring frames.

Let the noisy speech signal be denoted as

y(n) = x(n) + d(n)    (3.1)

where x(n) is the original speech and d(n) is the noise. The Fourier transforms of y(n), x(n) and d(n) in the l-th frame and at the k-th frequency bin are related by

Y(l, k) = X(l, k) + D(l, k)    (3.2)

The smoothed power spectrum of the noisy speech is computed using the following first-order recursive equation

N(l, k) = P(l, k) N(l − 1, k) + (1 − P(l, k)) |Y(l, k)|²    (3.3)

where N(l, k) is the smoothed power spectrum, l is the frame index, k is the frequency index, |Y(l, k)|² is the short-time power spectrum of the noisy speech and P(l, k) is the smoothing factor [7]. The smoothing factor is not fixed but varies with time and frequency. The above recursive equation provides a smoothed version of the periodogram |Y(l, k)|² [11].





3.2 SERA (Spectral Entropy Recursive Averaging) Noise Estimation Algorithm



Fig. 3.2: SERA Algorithm
In this algorithm the noisy speech signal is segmented into a number of frames and the FFT is performed on those frames [8]. To discriminate the various frames of the noisy speech signal, the spectral entropy is calculated. The threshold determination, the classification of the noisy speech signal, and the way the noise power is estimated and updated are shown in Figure 3.3.


Fig. 3.3: Classification of Noisy Speech Signal

3.3 Determination of Entropy
The proposed noise estimation method classifies the noisy speech precisely into three categories: pure speech, non-speech and quasi-speech. For this purpose, two thresholds are introduced for the entropy H(l)

H(l) = − Σ_k S(l, k) log S(l, k)    (3.4)

H(l) is called the entropy of the noisy speech signal, which is a quantitative measure of how uncertain the outcome of the random noisy speech signal is.
where

S(l, k) = ( Y_energy(l, k) + R(l) ) / Σ_k ( Y_energy(l, k) + R(l) )    (3.5)

Y_energy(l, k) is the energy of the noisy speech and R(l) = max_k { Y(l, k) } is a constant used to stabilize S(l, k).
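A short sketch of the entropy computation for one frame is given below; the normalization in Eq. (3.5) is assumed to add the stabilization constant R(l) to every bin before normalizing, and the function name is illustrative.

```python
import numpy as np

def frame_entropy(y_energy):
    """Spectral entropy H(l) of one frame (Eqs. 3.4-3.5).

    y_energy : array of per-bin energies Y_energy(l, k) for this frame.
    The stabilization constant R(l) = max_k Y(l, k) is assumed to be
    added to every bin before normalization.
    """
    r = np.max(y_energy)                            # R(l), stabilization constant
    stabilized = y_energy + r
    s = stabilized / (np.sum(stabilized) + 1e-12)   # S(l, k), Eq. (3.5)
    return -np.sum(s * np.log(s + 1e-12))           # H(l), Eq. (3.4)
```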



3.4 Determination of Threshold Conditions
The entropy is highly dependent on the SNR of the noisy speech and is controlled by max_k { Y(l, k) }. The stabilization parameter R(l) is adjusted in each frame in order to follow rapid changes in the noise power spectrum.

Let

T_1(l) = r_1 E[ H_avg(l) ]    (3.6)
T_2(l) = r_2 E[ H_avg(l) ]    (3.7)

where T_1(l) and T_2(l) are thresholds used to classify the noisy speech into non-speech, original speech and quasi-speech. r_1 and r_2 are 0.98 and 0.95 respectively, which are determined by experiment. E[H_avg(l)] denotes an average over the recent number of initial silence frames including the l-th frame. If H_avg(l) > T_1(l − 1), then T_1(l) and T_2(l) are updated by Eqs. (3.6) and (3.7) respectively.
3.5 Classifying Noisy Speech into Speech Present/Absent Frames
The power spectrum of the noisy speech is equal to the sum of the speech power
spectrum and noise power spectrum since speech and the background noise are assumed to be
independent. Also in any speech sentence there are pauses between words which do not contain
any speech. Those frames will contain only background noise. The noise estimate can be updated
by tracking those noise-only frames.
To identify those frames, a simple procedure is used which calculates the ratio of noisy
speech power spectrum to the noise power spectrum at three different frequency bands.
3.5.1 Update of Noise Estimation for Non-Speech
The noise estimate is then updated with a constant smoothing factor if the frame is classified as a speech-absent frame. This rule can be stated as follows

If H_avg(l) > T_1(l) then

N(l, k) = α N(l − 1, k) + (1 − α) |Y(l, k)|²    (3.8)

N(l, k) is the noise spectrum estimated in the non-speech frame. α is known as the forgetting factor (or look-ahead factor, or smoothing factor) and lies between 0.7 and 0.9 [9], [10].

3.5.2 Update of Noise Estimation for Quasi-Speech

The proposed algorithm for updating the noise spectrum in speech-present frames was based on classifying speech-present or speech-absent frequency bins in each frame. This was done by tracking the local minimum of the noisy speech and then deciding speech presence in each frequency bin separately, using the ratio of the noisy speech power to its local minimum. Based on that decision, a frequency-dependent smoothing parameter was calculated to update the noise power spectrum. The purpose of introducing the quasi-speech frame is to analyze the noisy speech signal more accurately.

If T_2(l) < H_avg(l) < T_1(l) then

N(l, k) = P(l, k) N(l − 1, k) + (1 − P(l, k)) |Y(l, k)|²    (3.9)


3.5.3 Tracking the Minimum of Noisy Speech
For tracking the minimum of the noisy speech power spectrum over a fixed search
window length, various methods were proposed. These methods were sensitive to outliers and
also the noise update was dependent on the length of the minimum-search window. For tracking
the minimum of the noisy speech by continuously averaging past spectral values, a different non-
linear rule is used.
If P_min(l − 1, k) ≤ P(l, k) then

    P_min(l, k) = γ P_min(l − 1, k) + ((1 − γ)/(1 − β)) (P(l, k) − β P(l − 1, k))    (3.10)

If P_min(l − 1, k) > P(l, k) then

    P_min(l, k) = P(l, k)    (3.11)

γ = 0.998, β = 0.96 and α = 0.6 to 0.7 were determined experimentally. P_min is the local minimum of the noisy speech power spectrum, and γ, β and α are constants determined experimentally. The look-ahead factor β controls the adaptation time of the local minimum.
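A sketch of this non-linear minimum-tracking rule for one frame is given below; gamma and beta take the values quoted above, and the correction term follows the form of Eq. (3.10) as written above.

```python
import numpy as np

def track_minimum(p_min_prev, p_smoothed, gamma=0.998, beta=0.96):
    """Continuous minimum tracking of the noisy-speech PSD (Eqs. 3.10-3.11).

    p_min_prev : P_min(l-1, k), local minimum from the previous frame
    p_smoothed : P(l, k), smoothed noisy-speech PSD of the current frame
    """
    # While the smoothed PSD stays above the stored minimum, let the
    # minimum rise slowly towards it (Eq. 3.10)
    rising = gamma * p_min_prev + ((1.0 - gamma) / (1.0 - beta)) * (
        p_smoothed - beta * p_min_prev)
    # If the smoothed PSD drops below the stored minimum, take it directly
    return np.where(p_min_prev > p_smoothed, p_smoothed, rising)
```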
3.5.4 Speech Presence Probability
To determine speech presence in each frequency bin, the ratio of the noisy speech power spectrum to its local minimum is defined as

S_r(l, k) = P(l, k) / P_min(l, k)    (3.12)

This ratio is then compared with a frequency-dependent threshold; if the ratio is greater than the threshold, the bin is taken as a speech-present frequency bin, otherwise it is taken as a speech-absent frequency bin. This is based on the principle that the power spectrum of the noisy speech will be nearly equal to its local minimum when speech is absent. Hence the smaller the ratio defined in Eq. (3.12), the higher the possibility that it is a noise-only region, and vice versa.

If S_r(l, k) > δ(k) then
    I(l, k) = 1    (speech present)
else
    I(l, k) = 0    (speech absent)
end

where δ(k) is the frequency-dependent threshold whose optimal value is determined experimentally. Note that a fixed threshold δ was used in place of δ(k) for all frequencies. From the above rule, the speech-presence probability p(l, k) is updated using the first-order recursion given later in Eq. (3.16).
3.5.5 Calculating Frequency Dependent Smoothing Constant
By using the above speech-presence probability estimate, the time-frequency dependent smoothing factor α_s(l, k) is computed as follows

α_s(l, k) = α_d + (1 − α_d) p(l, k)    (3.13)

where α_d is a constant.
P_sp(l, k) is a speech-presence probability given by

P_sp(l, k) = |Y(l, k)|² / P_min(l, k)    (3.14)
P_min(l, k) is the minimum noisy speech spectrum and it is updated by the following equation

P_min(l, k) = β P_min(l − 1, k) + (1 − β) |Y(l, k)|²    (3.15)

where β is a smoothing factor, P_min(l, k) is the minimum power spectral density of the current frame, P_min(l − 1, k) is the minimum power spectral density of the previous frame and |Y(l, k)|² is the short-time power spectrum of the noisy speech.


The speech-presence probability is then updated as follows

p(l, k) = α_p p(l − 1, k) + (1 − α_p) I(l, k)    (3.16)

where α_p is a smoothing constant, p(l, k) is the speech-presence probability of the current frame and p(l − 1, k) is that of the previous frame. The recursive equation Eq. (3.16) implicitly exploits the correlation for speech presence in adjacent frames.
In the practical implementation, the smoothing parameter [11] is limited to a maximum value of 0.96 in order to avoid deadlock when r(l, k) = 1, where

r(l, k) = N(l, k) / σ_D²(l, k)    (3.17)

σ_D²(l, k) being the noise variance estimate. Eq. (3.17) is a smoothed version of the posterior SNR.
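A sketch that strings Eqs. (3.12), (3.13) and (3.16) together with the noise update of Eq. (3.9) for a single frame is given below; the threshold value and the smoothing constants are illustrative, and the state carried between frames (previous noise estimate, local minimum and presence probability) is assumed to be maintained by the caller.

```python
import numpy as np

def sera_frame_update(y_psd, p_smoothed, p_min, p_prob_prev, noise_prev,
                      delta=2.0, alpha_p=0.2, alpha_d=0.85):
    """Presence-probability controlled noise update for one frame.

    y_psd       : |Y(l, k)|^2 of the current frame
    p_smoothed  : P(l, k), smoothed noisy-speech power spectrum
    p_min       : P_min(l, k), tracked local minimum
    p_prob_prev : p(l-1, k), speech-presence probability of the previous frame
    noise_prev  : N(l-1, k), previous noise estimate
    """
    s_ratio = p_smoothed / np.maximum(p_min, 1e-12)    # Eq. (3.12)
    indicator = (s_ratio > delta).astype(float)        # I(l, k)
    # Speech-presence probability, recursively smoothed (Eq. 3.16)
    p_prob = alpha_p * p_prob_prev + (1.0 - alpha_p) * indicator
    # Time-frequency dependent smoothing factor (Eq. 3.13)
    alpha_s = alpha_d + (1.0 - alpha_d) * p_prob
    # Noise estimate: frozen where speech is likely, updated elsewhere
    noise = alpha_s * noise_prev + (1.0 - alpha_s) * y_psd
    return noise, p_prob
```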







7.3.1 Adaptive Kalman Filtering Algorithm
Because the noise changes with the surrounding environment, it is necessary to update the noise estimate constantly so that a more accurate description of the noise is obtained. An adaptive Kalman filtering algorithm for speech enhancement can adapt to changes in the environmental noise and constantly update the estimate of the background noise. The Kalman filtering algorithm itself is well known. The adaptive Kalman filtering algorithm can estimate the process noise and measurement noise on-line from the measured and filtered values, tracking changes of the noise in real time in order to amend the filter parameters and improve the filtering performance.
In this adaptive Kalman filter, a reasonable threshold is set to determine whether the current speech frame is noise or not. The procedure consists of two main steps: updating the variance of the environmental noise R_v(n), and updating the threshold U.

1) Updating the variance of the environmental noise by

R_v(n) = (1 − d) R_v(n) + d R_u(n)    (7.3.1a)
In the above equation, d is the loss factor that limits the length of the filtering memory and enhances the role of new observations in the current estimate, making new data play the major role in the estimation while old data are gradually forgotten. According to [7], its formula is

d = (1 − b) / (1 − b^(t+1))    (7.3.1b)

b is the forgetting factor (0 < b < 1), usually ranging from 0.95 to 0.99. In this work the value of b is taken as 0.99.
Before implementing the noise-variance update in (7.3.1a), the variance of the current speech frame R_u(n) is compared with the threshold U that was updated in the previous iteration. If R_u(n) is less than or equal to U, the current speech frame can be considered as noise, and the algorithm then re-estimates the noise variance.

R_u(n) cannot replace R_v(n) directly; in order to reduce the error, a smoothed update is used.

2) Updating the threshold by

U = (1 − d) U + d R_u(n)    (7.3.1c)
In (7.3.1c), d is used again to reduce the error. However, there will be a large error when the noise is large, because the threshold update U is not restricted by the limitation R_u(n) ≤ U; it is only affected by R_u(n). So another limitation must be added before implementing (7.3.1c). In order to rule out speech frames whose SNR (signal-to-noise ratio) is high enough, let σ_s² be the variance of the pure speech signal, σ_y² the variance of the input noisy speech signal, and σ_v² the variance of the background noise. Two SNRs are calculated and compared. According to [6], the one for the current speech frame is
SNR_1(n) = 10 log_10( (σ_y²(n) − σ_v²(n)) / σ_v²(n) )    (7.3.1d)

The other, for the whole speech signal, is

SNR_0(n) = 10 log_10( σ_s²(n) / σ_v²(n) )    (7.3.1e)
In (7.3.1d) and (7.3.1e), n is the index of the speech frame, and σ_s² has been updated in order to achieve a higher accuracy. The speech frame is treated as noise when SNR_1(n) is less than or equal to SNR_0(n), or when SNR_0(n) is less than zero; such frames are then subjected to the second limitation (R_u(n) ≤ U). However, if SNR_1(n) is larger than SNR_0(n), the noise estimate is attenuated to avoid damaging the speech signal.
The recursive estimation of the adaptive Kalman filtering algorithm is shown below.

[Initialization]
S(0) = 0,  R_v(1) = σ_y²(1)  (variance of the first speech frame)    (7.3.1f)

[Iteration]
If SNR_1(n) ≤ SNR_0(n) or SNR_0(n) < 0 then    (7.3.1g)
    If R_u(n) ≤ U then
        (R_v(n) is the variance of the environmental noise, R_u(n) is the variance of the current speech frame)
        1. R_v(n) = (1 − d) R_v(n) + d R_u(n)    (7.3.1h)
    End
    2. U = (1 − d) U + d R_u(n)    (7.3.1i)
Else
    3. R_v(n) = R_v(n) / 1.2    (7.3.1j)
End
4. R_s(n) = σ_y²(n) − R_v(n)    (7.3.1k)
5. K(n) = R_s(n) / (R_s(n) + R_v(n))    (7.3.1l)
6. S(n) = K(n) · y(n)    (7.3.1m)
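A sketch of this recursion for one frame is given below. It assumes the per-frame variances and the two SNRs have already been computed as in Eqs. (7.3.1a)-(7.3.1e), and the speech-variance step follows the reconstruction of Eq. (7.3.1k) as the noisy-frame variance minus the noise variance, which is an assumption; the variable names are illustrative.

```python
import numpy as np

def adaptive_kalman_frame(y, var_y, ru, rv, u, snr1, snr0, b=0.99, n=1):
    """One frame of the adaptive Kalman noise/threshold update (7.3.1f-m).

    y          : samples of the current noisy speech frame
    var_y      : variance of the current noisy frame, sigma_y^2(n)
    ru         : R_u(n), variance of the current speech frame
    rv         : R_v(n), current estimate of the environmental noise variance
    u          : threshold U from the previous iteration
    snr1, snr0 : frame and whole-signal SNRs (Eqs. 7.3.1d-e)
    """
    y = np.asarray(y, dtype=float)
    d = (1.0 - b) / (1.0 - b ** (n + 1))       # loss factor (Eq. 7.3.1b)
    if snr1 <= snr0 or snr0 < 0:               # frame is treated as noise
        if ru <= u:
            rv = (1.0 - d) * rv + d * ru       # noise variance (Eq. 7.3.1h)
        u = (1.0 - d) * u + d * ru             # threshold update (Eq. 7.3.1i)
    else:                                      # speech-dominated frame
        rv = rv / 1.2                          # attenuate the noise estimate
    rs = max(var_y - rv, 0.0)                  # speech variance (assumed, Eq. 7.3.1k)
    k = rs / (rs + rv) if (rs + rv) > 0 else 0.0   # Kalman-like gain
    s = k * y                                  # enhanced frame (Eq. 7.3.1m)
    return s, rv, u
```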
