Marz Koll 2002 Speech-Pause

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 2, FEBRUARY 2002, p. 109

Speech Pause Detection for Noise Spectrum Estimation by Tracking Power Envelope Dynamics
Mark Marzinzik and Birger Kollmeier
Abstract: A speech pause detection algorithm is an important and sensitive part of most single-microphone noise reduction schemes for enhancement of speech signals corrupted by additive noise, as an estimate of the background noise is usually determined when speech is absent. An algorithm is proposed which detects speech pauses by adaptively tracking minima in a noisy signal's power envelope both for the broadband signal and for the high-pass and low-pass filtered signal. In poor signal-to-noise ratios (SNRs), the proposed algorithm maintains a low false-alarm rate in the detection of speech pauses while the standardized algorithm of ITU G.729 shows an increasing false-alarm rate in unfavorable situations. These characteristics are found with different types of noise and indicate that the proposed algorithm is better suited to be used for noise estimation in noise reduction algorithms, as speech deteriorations may thus be kept at a low level. It is shown that in connection with the Ephraim–Malah noise reduction scheme [1], the speech pause detection performance can even be further increased by using the noise-reduced signal instead of the noisy signal as input for the speech pause decision unit.

Index Terms: Envelope dynamics, envelope minima, noise estimation, noise reduction, speech pause detection.

Manuscript received May 15, 2001; revised September 24, 2001. This work was supported in part by a research grant from GN ReSound. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hynek Hermansky.
The authors are with the Medical Physics Department, Carl von Ossietzky University Oldenburg, D-26111 Oldenburg, Germany.
Publisher Item Identifier S 1063-6676(02)01523-7.
1063–6676/02$17.00 © 2002 IEEE

I. INTRODUCTION

New technologies in mobile telecommunication, robust speech recognition and digital hearing aids are a strong driving force in the development of real-time noise reduction algorithms. The number of publications on single-microphone noise reduction algorithms indicates an unbroken interest in this research field over the past two or three decades. A crucial point for these kinds of algorithms is the concurrent estimate of the target speech spectrum and the interfering noise spectrum in particular. Since most realistic noisy environments are characterized by nonstationarity, it is necessary to update the noise spectrum estimate as often as possible to maintain an effective noise reduction. This can be done, for example, whenever target speech is absent, which means that the input signal consists of noise only. Another constraint is the limited complexity of the algorithm when it is supposed to be implemented in digital circuits. Hence, computational and memory requirements should be as low as possible.

Different algorithms have been proposed which continuously update the noise estimate and hence avoid the need for explicit speech pause detection. Martin [2], [3] uses the minimum of the sub-band signal power within a time window of about 1 s as an estimate of the noise power in the respective sub-band. This idea was already formulated by Paul [4]. Doblinger [5] proposed a continuous noise estimation scheme similar to Martin's which is computationally more efficient. This scheme was, however, not systematically tested. Hirsch [6] and Hirsch and Ehrlicher [7] proposed an algorithm which is based on the observation that the most commonly occurring spectral magnitude value in clean speech is zero. Hence, given noisy speech, their algorithm measures the distribution density function of the spectral magnitude and determines the maxima, which are then used as an estimate of the respective noise magnitude. These kinds of algorithms, which avoid speech pause detection for noise estimation, are supposed to cope better with nonstationary (i.e., fluctuating) noise, since they are generally faster in their adaptation to changing noise levels even during speech activity. On the other hand, the continuous update of the noise estimate (independently in the sub-bands) is susceptible to erroneously capturing speech energy. This, however, leads inevitably to speech deterioration in a subsequent noise reduction process. Fischer and Stahl [8] investigated a spectral subtraction noise reduction algorithm with a continuous noise spectrum updating scheme. They found that the corruption of the noise estimate by speech is too large to be further considered and conclude that voice activity detection plays an important role and cannot be fully omitted. Recently, Nemer et al. [9] proposed to use the kurtosis (fourth-order statistics) of the noisy signal to continuously estimate speech and noise energies. The examples presented used noisy speech signals with positive signal-to-noise ratios (SNRs) and yield promising results, but further research is required to extend these results to negative SNRs and different classes of noise, respectively.

Most authors reporting on noise reduction refer to speech pause detection when dealing with the problem of noise estimation. As Hirsch [6] pointed out, "this is a very difficult and ultimately unsolved problem for realistic situations with a varying noise level." A lot of studies thus evade the problem by using an ideal speech pause detection using the clean speech signal or by using only short test signals with an initial noise-only period for noise estimation without the need for updating the noise spectrum estimate. In some applications like audio restoration (e.g., restoration of old gramophone recordings) the noise estimation indeed can often be done "manually" off-line. However, other applications like noise reduction for mobile communication and for digital hearing aids require automatic updating of the noise spectrum estimate. Most authors agree that voice activity or speech pause detectors, respectively, are a very sensitive and often limiting part of systems for the reduction of additive noise in speech [10], [11].
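The minimum-tracking idea attributed above to Martin [2], [3] (the minimum of the sub-band power within a window of about 1 s serves as a noise estimate) can be illustrated with a minimal sketch. The function name `sliding_min_noise_floor` and the array layout are assumptions made for illustration. The sketch shows only the sliding-window minimum described in the text and does not reproduce any further details of the published method.

```python
import numpy as np


def sliding_min_noise_floor(power, win_frames):
    """Running minimum over the last `win_frames` frames, per sub-band.

    power: array of shape (n_frames, n_bands) holding sub-band powers
    (hypothetical layout). Returns an array of the same shape.
    """
    power = np.asarray(power, dtype=float)
    floor = np.empty_like(power)
    for n in range(power.shape[0]):
        start = max(0, n - win_frames + 1)       # window covers frames start..n
        floor[n] = power[start:n + 1].min(axis=0)
    return floor
```

With a 4 ms frame shift, a window of about 1 s would correspond to roughly 250 frames. Because the minimum is taken without any speech/pause decision, the estimate keeps adapting during speech activity, which is the property discussed in the paragraph on continuous noise estimation.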
Various procedures for speech pause detection have been described in the literature so far. Kang and Fransen [12] proposed a very simple scheme. Whenever the low-pass band energy (in the frequency range from 0 to 1 kHz) of a current signal frame is below a specific fraction of the low-pass band dynamic range as scanned in the past frames, the frame is used for updating the noise spectrum estimate. Obviously, this procedure has strong limitations. It will only work with higher SNRs and will fail in noises with prominently low frequencies. A more elaborate algorithm using adaptive energy thresholds was proposed by van Gerven and Xie [13]. Elberling et al. [14] used the so-called synchro method for spectral estimation of the background noise. This procedure makes use of the specific characteristic of voiced speech sounds, i.e., that the energy is confined to pitch-harmonic frequencies. Based on successive multiplication of the envelopes from neighboring pairs of band-pass signals, followed by a summation over all resulting signal-products, a global measure of energy synchronization is obtained which is then used to classify the time frames of the input signal into those dominated by speech (high synchronization) and those not dominated by speech (low synchronization). This patent application is reported to work successfully in SNRs ranging from −9 to +9 dB with various noises. However, an increase of wrong speech pause decisions with decreasing SNR is reported. Sheikhzadeh et al. [15] proposed a pause detection algorithm based on an auto-correlation voicing detection which was performed on the enhanced signal (i.e., after the noise reduction rather than on the noisy signal). Although extensive testing is mentioned, no performance results are presented. However, the authors state that the algorithm is not supposed to work well below SNRs of 0 dB. Dendrinos and Bakamidis [10] presented an algorithm for determining the starting and ending points of speech segments in colored-noise environments through singular value decomposition based on some thresholds which have been determined experimentally. Good performance was shown for SNRs higher than 0 dB. However, the complexity of the algorithm makes a real-time implementation difficult. Recently, El-Maleh and Kabal [16] performed a comparative study of three voice activity detection (VAD) algorithms: a VAD used in the GSM cellular system [17], the VAD used in the enhanced variable rate codec (EVRC) of the North American CDMA-based PCS and cellular systems [18], and a third-order statistics based VAD [19]. Unfortunately, the authors did not investigate false-alarm rates and hit rates systematically but present only some noisy waveforms with the respective VAD decisions. However, the EVRC VAD is reported to show consistent superiority over the other VADs. Davídek et al. [20] implemented a speech activity detector using cepstral coefficients for use in a real-time noise cancellation system. However, a comprehensive evaluation of the detector itself is not given. Abdallah et al. [21] introduced a local entropic criterion for speech signal detection. Very good performance down to SNRs of 20 dB is reported. However, only white noise was tested so far. McKinley and Whipple [22] suggested a model based speech pause detection algorithm which is claimed to be robust for low SNRs. The speech pause detection problem is formulated into a decision theory framework. However, this algorithm requires extensive training of a Hidden Markov Model with the set of speech prototypes to be encountered. Itoh and Mizushima [23] proposed a speech/nonspeech identification based on four different parameters. The first is the maximum value of the auto-correlation function of the LPC residual signal, which represents the degree of the periodicity of the signal waveform. Second is a spectral slope parameter, third is a reflection coefficient which itself is computed from some PARCOR coefficients, and fourth is the signal energy. For each of the parameters, Itoh and Mizushima used empirically determined thresholds for a speech/stationary noise/nonstationary noise decision. It seems, however, that the decision for nonstationary noise is made only on the basis of the spectral slope parameter. Unfortunately, the proposed algorithm was not tested in low SNR situations.

Irrespective of the actual kind of speech pause detector used, a comprehensive and fair evaluation should include its hit rate as well as its false-alarm rate using different noises with a large variety of SNRs. These measures reveal most of an algorithm's capabilities and deficiencies. For an application in noise reduction, the problem is that a speech pause detection algorithm with a high false-alarm rate results in remarkably deteriorated speech after the noise reduction. On the other hand, a speech pause detection algorithm that finds too few of the actual speech pauses results in worse reduction of the noise. Hence, noise estimation is a very sensitive stage in the noise reduction process.

The algorithm for speech pause detection that will be described in the next section adaptively tracks the dynamics of the signal's temporal power envelope as well as of its low- and high-pass frequency band power envelopes. After a number of threshold comparisons, a frame-by-frame decision is made on the presence of a speech pause. This approach was motivated by the work of Festen et al. [24], who used the minima in the signal envelope for estimating the noise level in a speech-plus-noise signal to control an AGC (automatic gain control) algorithm for hearing aids. The proposed algorithm can be regarded as an extension of the simple scheme proposed by Kang and Fransen [12]. In order to assess its applicability to real-time noise reduction for practical applications (see above), both the hit rate and false-alarm rate are evaluated for a large range of SNRs and different types of noise and compared to a voice activity detector (VAD) algorithm recommended by the International Telecommunication Union [25].

II. ALGORITHM

The speech pause detection algorithm calculates the signal's temporal power envelope by summing up the squares of the spectral components of the input signal in each short-time frame

$$E(p) = \sum_{\omega} |S(\omega, p)|^2 \qquad (1)$$

Here, $S(\omega, p)$ denotes the spectral component of the noisy input signal at frequency $\omega$ at time frame $p$. In addition, a low-pass band power envelope and a high-pass band power envelope are calculated:

$$E_{LP}(p) = \sum_{\omega_{LP}} |S(\omega_{LP}, p)|^2 \qquad (2)$$

$$E_{HP}(p) = \sum_{\omega_{HP}} |S(\omega_{HP}, p)|^2 \qquad (3)$$
where $\omega_{LP}$ runs over all spectral components up to the cut-off frequency, and $\omega_{HP}$ runs over the remaining spectral components. In order to slightly smooth the envelopes, the broadband envelope $E$, the low-pass envelope $E_{LP}$, and the high-pass envelope $E_{HP}$ are averaged over a few frames by a recursive low-pass filter of first order with a release time constant $\tau_E$; no smoothing is performed in case of an increase in energy (i.e., attack time zero) to avoid smearing over onsets. The algorithm tracks the minimum value and the maximum value of each envelope and uses these for the speech pause decision as described by the following scheme.

1) After an assumed 200 ms initial phase of noise only the minimum and maximum values are set as follows:

(4)

This guarantees that the minimum envelope values correspond roughly with the noise energy at the beginning.

2) The minimum and maximum values are updated for each of the three envelopes in the following manner.
• If the current envelope value is larger than the maximum value for the corresponding envelope, then the maximum value is set to the current value. Otherwise, the maximum value slowly decays. This is done by a recursive low-pass filter of first order with a release time constant $\tau_{decay}$, which takes as input the current envelope value.
• If the current envelope value is smaller than the minimum value for the corresponding envelope, then the minimum value is set to the current value. Otherwise, the minimum value is slowly raised. This is done by a recursive low-pass filter of first order with attack time constant $\tau_{raise}$, which takes as input the current envelope value.

3) The differences between the maximum and the minimum values are calculated for each envelope as

$$\Delta = E_{\max} - E_{\min}, \qquad \Delta_{LP} = E_{LP,\max} - E_{LP,\min}, \qquad \Delta_{HP} = E_{HP,\max} - E_{HP,\min} \qquad (5)$$

4) Three different criteria are introduced of which only one has to be true for making the decision that target speech is not present in the actual frame: a) the speech pause decision can be made because of low signal dynamics in both the low-pass and the high-pass band (Dyn Speech Pause); b) the decision can be based on the low-pass band information (LP Speech Pause); and c) it can be made upon the high-band information (HP Speech Pause). These decision criteria are derived as follows.

a) If $\Delta_{LP}$ is smaller than some threshold $\eta$ and also $\Delta_{HP} < \eta$, then it is assumed that only noise is present due to the very small dynamic range of the signal (→ Dyn Speech Pause).

b) If a) is not true, it is checked whether the low-pass dynamic range $\Delta_{LP}$ is bigger than a threshold value (otherwise the dynamic range in the low-pass band is very small and it should not receive too much attention → no LP Speech Pause). Now, if the difference between the current $E_{LP}$ and $E_{LP,\min}$ of the low-pass band envelope is smaller than some fraction $\rho$ of $\Delta_{LP}$ (which means that the actual envelope is near its minimum), a closer look at the high-pass band is necessary to support a speech pause decision.

Case 1) $\Delta_{HP}$ of the high-pass band is smaller than threshold $\eta$.
In this case no additional information can be obtained from the high-pass band because of its small dynamic range. Now, if at least $E$ (the signal's envelope) lies in the lower half of its dynamic range [i.e., in the lower half between $E_{\min}$ and $E_{\max}$], the current frame can be assumed to be a speech pause because of the closeness of the low-pass band energy to its minimum value (→ LP Speech Pause); otherwise, however, there is not enough support for a speech pause decision (→ no LP Speech Pause).

Case 2) $\Delta_{HP}$ is bigger than two times the threshold $\eta$.
In this case, there is enough dynamic range to pay attention to the high-pass band. Thus, it is demanded that the difference between the current $E_{HP}$ and $E_{HP,\min}$ of the high-pass envelope is smaller than two times the fraction $\rho$ of $\Delta_{HP}$ to support the small envelope value in the low-pass band. Then a noise-only frame is assumed (→ LP Speech Pause). This demand is not as strict as that for the low-pass band, to account for the case that the disturbing noise has a rather high-frequency characteristic. But if this condition is not fulfilled, speech may be present in the actual frame (→ no LP Speech Pause).

Case 3) $\Delta_{HP}$ is smaller than two times the threshold $\eta$, but bigger than $\eta$.
In this case, which is not as clear as Case 2, it is only demanded that $E_{HP}$ (the high-pass envelope) lies in the lower half of its dynamic range to support the small envelope value in the low-pass band. Then it is assumed that target speech is absent (→ LP Speech Pause). However, if this condition is not fulfilled, speech may be present in the actual frame (→ no LP Speech Pause).

c) Condition b) accounts for the case that the disturbing noise has a rather high-frequency characteristic, hence the speech pause decision should mainly be made upon the information in the
low-pass band. To account also for the case that it has a rather low-frequency characteristic, the same conditions as under condition b) have to be checked but now with reversed roles of the low-pass and the high-pass bands to determine whether target speech is absent (HP Speech Pause).

Fig. 1. Flowchart of the proposed speech pause detection algorithm operating on a single time frame. See text for details.

Fig. 1 gives a flowchart of the proposed speech pause detection algorithm. The flowchart is not fully symmetrical with respect to LP and HP speech pause detection since several redundant tests are omitted.

Due to its flexible design this novel approach for speech pause detection can easily be adjusted to obtain a rather low false-alarm rate by adapting the main parameters, the threshold $\eta$ and the fraction $\rho$. Generally, a low false-alarm rate is desirable to reduce speech distortions in the subsequent noise reduction process. However, this also results in a reduced hit rate.

During the development of the algorithm, noisy signals generated from various different noise types and speech signals at several SNRs were used for performance verification. Finally, the following values were chosen for the free parameters: The input signal was digitized with a sampling frequency of 22 050 Hz and partitioned in Hann-windowed segments of length 8 ms with 4 ms overlap. These segments were padded with zeros and a 256-point FFT was performed. This framework is compatible with most single-microphone noise reduction algorithms which can thus easily be integrated. Such short segments are motivated by the fact that then the same signal analysis and synthesis as necessary for a real-time noise reduction environment can be used. Due to the longer signal delay, longer window lengths in real-time signal processing applications would cause problems with lip reading and would cause stuttering when speaking. The cut-off frequency between low-pass and high-pass band was set to 2 kHz, motivated by the fact that excluding speech frequencies above 1.9 kHz has a roughly similar effect on speech intelligibility as excluding those below this value [26]. The time constant $\tau_E$ for the envelope smoothing was set to 32 ms. The time constants $\tau_{decay}$ and $\tau_{raise}$ were both set to 3 s. These constants were determined by examination of the envelopes from several speech samples. With these settings a good approximation to the actual dynamic range of the signal and of its "placement" in the level area under a variety of conditions was achieved. However, systematic variations of these parameters were not investigated. The threshold $\eta$ was set to 5 dB and the fraction $\rho$ was set to 0.1.

III. EXAMPLES

To illustrate the speech pause detection scheme, Figs. 3–5 show some detection examples using a target sentence of approximately 5 s length mixed with different noises (digitally added).

Fig. 3 shows an example with car noise. This type of noise was recorded in the cabin of a driving car and has dominant parts in the low frequency range. The bar at the bottom of the panels shows the real speech pauses which were determined manually. [For comparison, the waveform of the clean sentence is displayed in Fig. 2 (upper panel); the lower panel shows the mixed signal with an SNR of −5 dB.] The speech pause decisions of the algorithm are displayed in the other bottom three bars. The distinct bars give additional information about the reason for the speech pause decision. The first bar shows a symbol whenever a speech pause is detected due to a small dynamic range of the signal in the low-pass band as well as in the high-pass band, and generally in the initial noise estimation phase (the first 200 ms). The second bar shows a symbol whenever a speech pause is detected on the basis of the low-pass band information. Finally, a symbol in the third bar means that the decision was based on the high-pass band information.

The car noise example shows that it is worthwhile to consider band-limited envelopes. In this case, the signal's low-pass band envelope (as well as its broadband envelope) is strongly disturbed by the noise. However, the high-pass envelope is "clean enough" for obtaining reliable speech pause decisions (Fig. 3). Actually, the third bar in the figure panels shows that the decision is mainly based on the high-pass information.
Fig. 2. Upper panel: Waveform of the sentence "I played in a theater festival, honoring the German writer Heiner Müller." Lower panel: Sentence mixed with car noise at −5 dB SNR.

Fig. 3. Low-pass band power envelope (upper panel) and high-pass band power envelope (lower panel) of the sentence displayed in Fig. 2 when mixed with car noise at −5 dB SNR (solid curves). The dashed curves display $E_{\min}$ and $E_{\max}$ of the respective band. The detected as well as the actual speech pauses are displayed in the additional bars (see text for details).
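The three bars in these figures correspond to the decision criteria a) to c) of Section II. A minimal sketch of that per-frame decision is given below, assuming envelope values in dB, the stated defaults of 5 dB for the threshold and 0.1 for the fraction, and a comparison value equal to the threshold itself for the first check in b). That comparison value is an assumption. The class `Band` and both function names are hypothetical, and the sketch treats b) and c) symmetrically, whereas the flowchart in Fig. 1 omits some redundant tests.

```python
from dataclasses import dataclass


@dataclass
class Band:
    e: float        # current (smoothed) envelope value in dB
    e_min: float    # tracked minimum in dB
    e_max: float    # tracked maximum in dB

    @property
    def delta(self):
        """Dynamic range: tracked maximum minus tracked minimum."""
        return self.e_max - self.e_min

    def near_min(self, frac):
        """True if the envelope lies within frac * delta of its minimum."""
        return self.e - self.e_min < frac * self.delta

    def in_lower_half(self):
        """True if the envelope lies in the lower half of its dynamic range."""
        return self.e < self.e_min + 0.5 * self.delta


def band_pause(main, other, broad, eta, frac):
    """Criterion b) with main=low-pass, other=high-pass; criterion c) with roles swapped."""
    if main.delta <= eta:                 # assumed comparison value: eta itself
        return False                      # too little dynamic range in the main band
    if not main.near_min(frac):
        return False                      # main-band envelope is not near its minimum
    if other.delta < eta:                 # Case 1: other band is uninformative
        return broad.in_lower_half()
    if other.delta > 2.0 * eta:           # Case 2: other band has enough dynamics
        return other.near_min(2.0 * frac)
    return other.in_lower_half()          # Case 3: intermediate dynamic range


def is_speech_pause(broad, lp, hp, eta=5.0, frac=0.1):
    """Frame decision: any one of the Dyn, LP or HP criteria is sufficient."""
    dyn = lp.delta < eta and hp.delta < eta
    return dyn or band_pause(lp, hp, broad, eta, frac) or band_pause(hp, lp, broad, eta, frac)
```

Lowering `eta` or `frac` makes the individual tests stricter in this sketch, which mirrors the trade-off between false-alarm rate and hit rate described in the text.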
Fig. 4 shows an example, where the sentence is mixed with the noise of a drilling machine at 5 dB SNR. This noise makes it impossible to get reliable speech pause information from the high-pass channel, but in this case the low-pass band information can be used. Comparison with the lowest bar in the figures (the "true" speech pauses) shows that a good speech pause detection is obtained. Although the algorithm wrongly considers the time frames around 0.6 s ("p" from "played"), 1.2 s ("th" from "theater") and around 1.5 s ("f" from "festival") as noise, these speech parts actually sound very similar to equally short segments of the drill noise. Hence, these wrong decisions are assumed to have no adverse effects on the speech quality when used for noise estimation in a noise reduction algorithm.

Fig. 5 shows an example with restaurant noise, which is neither mainly low-frequency nor high-frequency in its characteristics. As can be seen at the second and third bar in the figures, the speech pause detection, indeed, is sometimes based on the low-pass band information and sometimes on the high-pass information. In combination, a good speech pause detection performance is obtained.

Fig. 4. Low-pass band power envelope (upper panel) and high-pass band power envelope (lower panel) of the sentence displayed in Fig. 2 when mixed with drilling machine noise at +5 dB SNR (solid curves). The dashed curves display $E_{\min}$ and $E_{\max}$ of the respective band. The detected as well as the actual speech pauses are displayed in the additional bars (see text for details).

Fig. 5. Low-pass band power envelope (upper panel) and high-pass band power envelope (lower panel) of the sentence displayed in Fig. 2 when mixed with restaurant noise at +5 dB SNR (solid curves). The dashed curves display $E_{\min}$ and $E_{\max}$ of the respective band. The detected as well as the actual speech pauses are displayed in the additional bars. See text for details.

IV. COMPARISON WITH G.729 VAD ALGORITHM

In 1996 the International Telecommunication Union (ITU) "standardized" a voice activity detector (VAD) algorithm for a speech coding scheme as its Recommendation G.729 Annex B [25]. The VAD algorithm makes a voice activity decision every 10 ms based on differential parameters of the full-band energy, the low-pass band energy, the zero-crossing rate and a spectral distortion measure. These are obtained at each frame as differences between each parameter and its respective long-term average. The output of the VAD module is either 1 or 0, indicating the presence or absence of voice activity, respectively. Several publications have compared their own algorithms with the G.729 VAD so far [27], [28].

Using the G.729 algorithm here as a competitor is motivated by the fact that it has proven successful in a wide range of conditions and that it is available from the ITU. Comparing a novel algorithm with this "standard" makes it also comparable to other algorithms, if these are tested against this "standard." Of course, the G.729 algorithm was originally intended to be used in less noisy environments.

A. Procedure

A female reading of a short story (41 s length) from the German PhonDat database [29] was used to test the performance of the proposed algorithm versus the G.729 algorithm. The speech signal was mixed with a car noise, a multi-talker babble noise, an aircraft engine noise, and a factory noise,
respectively, which were taken from the NOISEX-92 database [30]. SNRs from −10 dB to 20 dB were employed. Negative SNRs do often occur in real-life situations and especially hearing-impaired persons have enormous problems holding conversations in noisy environments. Of course, the frequency shape of a noise signal has a strong influence on its masking effect. While the speech reception threshold (i.e., the SNR where 50% of the speech is intelligible) for some machinery noises can be very low (for drill noise it is about −20 dB; [31]), for cafeteria noise, e.g., it may be much higher (about −4 dB; [31]) but still negative.

False-alarm rates (i.e., the fraction of all real speech frames that were erroneously detected as speech pauses) and hit rates (i.e., the fraction of all real speech pauses that were correctly detected as speech pauses) were determined in each noise condition for both the proposed algorithm and the G.729 algorithm. For the calculation of the false-alarm rate as well as the hit rate, the "real" speech frames and "real" speech pauses were determined using the G.729 VAD algorithm on the clean speech signal. Using the G.729 itself as reference takes into consideration that no simple rule exists even for determining pauses in clean speech. Since the G.729 algorithm is recommended by the ITU, it can be taken for granted that it works well for clean speech. Note that in the comparative test with the proposed new algorithm this may give an advantage for the G.729 algorithm, as it defines the "clean" standard. Hand-labeling of the real speech pauses was not considered since an automatic procedure was much more economical for determination of even very short pauses.

Finally, both algorithms are compared in terms of receiver operating characteristics (ROC) (see footnote 1).

Footnote 1: According to Egan [32], the receiver operating characteristic (ROC) is a function which summarizes the possible performances of an observer faced with the task of detecting a signal in noise. In general, the ROC is given as a plot of the hit rate versus the false-alarm rate which is obtained by modifying the decision criterion. In the present study, the signal to be detected is a "speech pause" occurring in a noisy speech signal.

B. Results

The detection results are shown in Figs. 6 and 7. The upper panels show the false-alarm rate, the lower panels present the hit rate of both algorithms.

Fig. 6. Speech pause detection performance of the proposed algorithm and the G.729 VAD algorithm in car noise and multi-talker babble noise with SNRs ranging from −10 to +20 dB. The upper panel shows the false-alarm rates and the lower panel shows the hit rates with the respective algorithms.

Fig. 7. Speech pause detection performance of the proposed algorithm and the G.729 VAD algorithm in aircraft engine and factory noise with SNRs ranging from −10 to +20 dB. The upper panel shows the false-alarm rates and the lower panel shows the hit rates with the respective algorithms.

The comparison with the G.729 Annex B algorithm shows that the proposed speech pause detection algorithm yields a clearly lower false-alarm rate in each of the four different noises
over the entire range of SNRs that were tested (cf., Figs. 6 and 7). On the other hand, fewer speech pauses are actually detected than with the G.729 algorithm.

The false-alarm rates are lowest in car noise, followed by the multi-talker babble noise, the factory noise, and the aircraft engine noise. However, a principal difference between the algorithms is observed: While the proposed algorithm keeps the false-alarm rate and the hit rate almost constant with changing SNR, the performance of the G.729 algorithm strongly depends on the SNR: the lower the SNR, the larger the false-alarm rate as well as the hit rate. It is striking that the performance of the G.729 algorithm in car noise is rather poor even at moderate noise levels of 20 dB.

In terms of receiver operating characteristics (ROC), the working point of the G.729 algorithm shifts up and to the right in ROC space with decreasing SNR, while the working point of the proposed algorithm stays nearly at the same place in the ROC space. In general, the false-alarm rates can be decreased by changing threshold criteria in the algorithm's decision rules. This is, of course, connected with a decrease of the hit rates.

Whether the proposed algorithm is generally "better" than the G.729 algorithm can be examined by comparing them in ROC space (in terms of discriminability, i.e., the area under the ROC curve). Figs. 8–10 show ROC curves of the proposed algorithm using car noise, babble noise, and aircraft noise, respectively. The upper panels were obtained at SNRs of −10 dB; for the lower panels SNRs of +10 dB were used. The curves were generated by varying the threshold in the decision rule of the proposed algorithm (cf., Section II) from 1 to 25 dB in 1-dB steps.

Fig. 8. ROC curve of the proposed algorithm using car noise at −10 dB SNR (upper panel) and +10 dB SNR (lower panel). The curve was generated by varying the threshold in the decision rule from 1 to 25 dB in 1-dB steps. For comparison, the performance of the G.729 VAD algorithm is also indicated.

Fig. 9. ROC curve of the proposed algorithm using babble noise at −10 dB SNR (upper panel) and +10 dB SNR (lower panel). The curve was generated by varying the threshold in the decision rule from 1 to 25 dB in 1-dB steps. For comparison, the performance of the G.729 VAD algorithm is also indicated.

Since in all noise conditions the G.729 algorithm falls below the ROC curve of the proposed algorithm, it may be concluded that the discriminability is better with the proposed speech pause detection algorithm.

Additionally, in Fig. 10 (upper panel) the ROC curve was determined for the proposed algorithm using a noise-reduced signal as input for the speech pause detection (by employing the single-microphone noise reduction algorithm from Ephraim and Malah [1], on a frame-by-frame basis) instead of the noisy signal. The detected speech pauses are in turn used to adjust the noise spectrum estimate for the noise reduction. Although this leads to a recursive design of the signal flow, no stability
duction process. In fact, the proposed speech pause detection

problems were observed for a wide range of input signals and
algorithm maintains a low false-alarm rate over a wide range of
SNRs.
SNRs while the hit rate decreases only slightly at poorer SNRs.
This modified algorithm is denoted as “Proposed Algo NR.”
Hence, the algorithm keeps a relatively fixed position in ROC
Actually, the discriminability of the speech pause detection al-
space over a wide range of SNRs. In contrast to the proposed
gorithm is further increased by this modification as can be seen
algorithm, the algorithm of the ITU Recommendation G.729
at the larger area under the ROC curve (cf., Fig. 10, upper panel).
yields very large false-alarm rates (but also larger hit rates) at
low SNRs.
C. Discussion Obviously, the G.729 was not designed to detect the true
In a noise estimation application for noise reduction algo- speech pauses in adverse noise conditions. In conditions where
rithms it is generally proposed to operate the speech pause the speech is hardly noticeable, the G.729 VAD algorithm rather
detection at rather low hit rates to keep the false-alarm rate decides to classify this situation as speech-free (i.e., a kind of
low. Large false-alarm rates in the speech pause detection lead extended speech pause). Since this behavior is inherent in the
to wrong noise spectrum estimates which include significant algorithmic design of the G.729 scheme, it cannot be overcome
speech parts and hence cause artifacts in a subsequent noise re- by global changes of its threshold parameters. In a noise reduc-
noises. If the noise is strongly fluctuating in its characteristics

between speech pauses, a noise estimate determined only when
speech is absent is not sufficient to ensure effective noise re-
duction. For such conditions, noise reduction schemes have to
be employed which exploit other features (for example separa-
tion in space between noise and target source [33]), or a running
noise estimate has to be determined from the noisy signal and
not only during speech pauses.
Apart from that, low hit rates in the proposed algorithm do not
necessarily mean that some speech pause intervals are not de-
tected at all, but rather that several frames during speech pauses
are not detected as such (see for example Fig. 3). For the ad-
justment of a noise spectrum estimate, the proposed algorithm
can hence be employed at rather low hit rates to obtain low
false-alarm rates and still detects at least some frames during
most speech pauses. The proposed algorithm has successfully
been employed in several experiments with single-microphone
noise reduction algorithms [31].
It might seem strange that the false-alarm rates of the pro-
posed algorithm increase slightly for better SNRs, but this is
due to the fact that the G.729 defines the clean reference. Very
soft consonant parts (with insignificant low energy) are classi-
fied as speech pause by the proposed algorithm. However, these
parts are classified as speech by the G.729 algorithm.
V. CONCLUSIONS
The proposed speech pause detection algorithm maintains a
low and approximately constant false-alarm rate over a wide
range of SNRs. The hit rate decreases only slightly at poorer
SNRs.
Since the proposed speech pause detection algorithm was
shown to be superior to the G.729 VAD algorithm in terms
of discriminability (area under the ROC curve) in speech with
noise, it should be preferred in applications where noise distur-
bances may occur.
The performance can be further enhanced if the algorithm
is combined with the single-microphone noise reduction algo-
rithm proposed by Ephraim and Malah [1] and the noise reduced
signal is employed for the speech pause detection.
0
Fig. 10. ROC curve of the proposed algorithm using aircraft noise at 10 dB
+
SNR (upper panel) and 10 dB SNR (lower panel). The curve was generated
The relatively low complexity of the algorithm should allow
an immediate application in, for example, digital hearing aids
by varying the threshold in the decision rule from 1 to 25 dB in 1-dB steps.
For comparison, the performance of the G.729 VAD algorithm is also indicated. or cellular phones. The delay time due to the signal processing
is below 10 ms.
tion application, this behavior probably makes it impossible for
a noise reduction algorithm to “retrieve” the speech signal, if the ACKNOWLEDGMENT
whole signal is classified as noise. As the proposed algorithm The authors thank the anonymous reviewers for critical
detects speech pauses by tracking envelope minima, its behavior reading of the manuscript and for their helpful comments.
at very poor SNRs differs here. It still decides for speech pauses
only when energy minima occur. REFERENCES
The threshold parameters in the proposed speech pause de- [1] Y. Ephraim and D. Malah, “Speech enhancement using a min-
tection algorithm were determined empirically to obtain low imum mean-square error short-time spectral amplitude estimator,”
false-alarm rates for a wide range of input signals and SNRs. IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp.
1109–1121, June 1984.
By this, speech deteriorations due to wrong noise spectrum es- [2] R. Martin, “An efficient algorithm to estimate the instantaneous SNR of
timates (i.e., including speech energy) in any subsequent noise speech signals,” in Proc. EUROSPEECH’93, vol. 1, 1993.
reduction processing are minimized. However, low false-alarm [3] , “Spectral subtraction based on minimum statistics,” in Signal Pro-
cessing VII, Theories and Applications. Proceedings of EUSIPCO-94,
rates are connected with lower hit rates which could also lead vol. 1, M. J. J. Holt, C. F. N. Cowan, P. M. Grant, and W. A. Sandham,
to signal deteriorations for certain types of strongly fluctuating Eds. Lausanne, Switzerland, 1994.
[4] D. B. Paul, “The spectral envelope estimation vocoder,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 786–794, Apr. 1981.
[5] G. Doblinger, “Computationally efficient speech enhancement by spectral minima tracking in subbands,” in Proc. 4th Eur. Conf. Speech Communication Technology EUROSPEECH’95. Madrid, Spain, Sept. 1995, pp. 1513–1516.
[6] H. G. Hirsch, “Estimation of noise spectrum and its application to SNR-estimation and speech enhancement,” Int. Comput. Sci. Inst., Berkeley, CA, Tech. Rep. TR-93-012, 1993.
[7] H. G. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1995, vol. 1, 1995, pp. 153–156.
[8] A. Fischer and V. Stahl, “On improvement measures for spectral subtraction applied to robust automatic speech recognition in car environments,” in Proc. Workshop Robust Methods Speech Recognition Adverse Conditions, Tampere, Finland, May 1999, pp. 75–78.
[9] E. Nemer, R. Goubran, and S. Mahmoud, “SNR estimation of speech signals using subbands and fourth-order statistics,” IEEE Signal Processing Lett., vol. 6, pp. 171–174, July 1999.
[10] M. Dendrinos and S. Bakamidis, “Voice activity detection in colored-noise environment through singular value decomposition,” in Proc. 5th Int. Conf. Signal Processing Applications and Technology. Waltham, MA: DSP Associates, 1994, vol. 1, pp. 137–141.
[11] P. Sovka and P. Pollák, “The study of speech/pause detectors for speech enhancement methods,” in Proc. 4th Eur. Conf. Speech Communication Technology EUROSPEECH’95. Madrid, Spain: ESCA, Sept. 1995, pp. 1575–1578.
[12] G. S. Kang and L. J. Fransen, “Quality improvement of LPC-processed noisy speech by using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 930–942, June 1989.
[13] S. Van Gerven and F. Xie, “A comparative study of speech detection methods,” in Proc. 5th Eur. Conf. Speech Communication Technology, EUROSPEECH’97, Rhodes, Greece, 1997.
[14] C. Elberling, C. Ludvigsen, and G. Keidser, “The design and testing of a noise reduction algorithm based on spectral subtraction,” Scand. Audiol., vol. Suppl. 38, pp. 39–49, 1993.
[15] H. Sheikhzadeh, R. L. Brennan, and H. Sameti, “Real-time implementation of HMM-based MMSE algorithm for speech enhancement in hearing aid applications,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1995, vol. 1, 1995, pp. 808–811.
[16] K. El-Maleh and P. Kabal, “Comparison of voice activity detection algorithms for wireless personal communications systems,” in Proc. CCECE’97 Can. Conf. Electrical Computer Engineering, vol. 2, 1997, pp. 470–473.
[17] K. Srinivasan and A. Gersho, “Voice activity detection for cellular networks,” in Proc. IEEE Speech Coding Workshop, 1993, pp. 85–86.
[18] TIA, “Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems,” Document PN-3292, 1996.
[19] M. Rangoussi and G. Carayannis, “Higher order statistics based Gaussianity test applied to on-line speech processing,” in Proc. IEEE Asilomar Conf., 1995, pp. 303–307.
[20] V. Davídek, J. Šika, and J. Štusák, “Noise cancellation system on the TMS320C31,” in Proc. 1st Eur. DSP Education Research Conf., Paris, France, 1996, pp. 134–138.
[21] I. Abdallah, S. Montrésor, and M. Baudry, “Speech signal detection in noisy environment using a local entropic criterion,” in Proc. 5th Eur. Conf. Speech Communication Technology, EUROSPEECH’97, Rhodes, Greece, 1997.
[22] B. L. McKinley and G. H. Whipple, “Model based speech pause detection,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Los Alamitos, CA, 1997, pp. 1179–1182.
[23] K. Itoh and M. Mizushima, “Environmental noise reduction based on speech/nonspeech identification for hearing aids,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Conference Proceedings. Los Alamitos, CA: IEEE Comput. Soc. Press, 1997, pp. 419–422.
[24] J. M. Festen, J. N. Van Dijkhuizen, and R. Plomp, “The efficacy of a multichannel hearing aid in which the gain is controlled by the minima in the temporal envelope,” Scand. Audiol., vol. Suppl. 38, pp. 101–110, 1993.
[25] ITU-T Recommendation G.729, Annex B: A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70, 1996.
[26] D. M. Jones, “Noise,” in Stress and Fatigue in Human Performance, R. Hockey, Ed. New York: Wiley, 1983, ch. 3, pp. 61–95.
[27] J. Stegmann and G. Schröder, “Robust voice-activity detection based on the wavelet transform,” in Proc. 1997 IEEE Workshop Speech Coding Telecommunications, New York, 1997, pp. 99–100.
[28] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Lett., vol. 6, pp. 1–3, Jan. 1999.
[29] C. Draxler, “Introduction to the Verbmobil-PhonDat Database of spoken German,” in Proc. 3rd Int. Conf. Practical Application Prolog, Paris, France, 1995, pp. 201–212.
[30] H. J. M. Steeneken and F. W. M. Geurtsen, “Description of the RSG.10 noise database,” TNO Inst. Perception, Soesterberg, The Netherlands, Tech. Rep. IZF 1988-3, 1988.
[31] M. Marzinzik, “Noise reduction schemes for digital hearing aids and their use for the hearing impaired,” Ph.D. dissertation, Carl von Ossietzky Universität, Oldenburg, Germany, 2000. [Online]. Available: http://docserver.bis.uni-oldenburg.de/publikationen/dissertation/2001/marnoi00/marnoi00.html
[32] J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic, 1975.
[33] T. Wittkop, “Two-channel noise reduction algorithms motivated by models of binaural interaction,” Ph.D. dissertation, Carl von Ossietzky Univ., Oldenburg, Germany, 2001.

Mark Marzinzik was born in 1970 in Bremen, Germany, and studied physics from 1990 to 1996 at the Carl von Ossietzky Universität, Oldenburg, Germany. He received the Ph.D. degree in physics (supervised by B. Kollmeier) in 2000 with a dissertation on “Noise reduction schemes for digital hearing aids and their use for the hearing impaired.”
He is currently a Research Associate with the Medical Physics Department, Universität Oldenburg. His studies focus on dynamic compression and noise reduction for digital hearing aids.

Birger Kollmeier was born in 1958. He received the Diplom degree in physics in 1982, the M.D. degree in 1986, the Ph.D. degree in physics (supervised by M. R. Schroeder) in 1986 and the Ph.D. degree in medicine in 1989, all from the Universität Göttingen, Germany. He received the Fulbright Scholarship and was with Washington University and Central Institute for the Deaf in St. Louis, MO, from 1982 to 1983.
He was an Assistant Professor (1986–1991) and Associate Professor (1991–1992) at the Third Physikalisches Institut, Universität Göttingen. Since 1993, he has been Full Professor of physics and Head of the Medical Physics Department at the Universität Oldenburg, Germany. He has authored or co-authored more than 100 original papers and six books and has supervised 21 completed Ph.D. dissertations.
Dr. Kollmeier is vice president of the German Audiological Society and has received various prizes and honors.
