Interspeech 2018
2-6 September 2018, Hyderabad
Early detection of continuous and partial audio events using CNN
Ian McLoughlin1,2, Yan Song2, Lam Pham1, Ramaswamy Palaniappan1, Huy Phan3, Yue Lang4

1 The University of Kent, School of Computing, Medway, UK
2 The University of Science and Technology of China, Hefei, PRC
3 University of Oxford, Department of Engineering Science, Oxford, UK
4 Huawei European Research Center, Munich, Germany

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Sound event detection is an extension of the static auditory
classification task into continuous environments, where performance depends jointly upon the detection of overlapping events
and their correct classification. Several approaches have been
published to date which either develop novel classifiers or employ well-trained static classifiers with a detection front-end.
This paper takes the latter approach, by combining a proven
CNN classifier acting on spectrogram image features, with
time-frequency shaped energy detection that identifies seed regions within the spectrogram that are characteristic of auditory
energy events. Furthermore, the shape detector is optimised to
allow early detection of events as they are developing. Since
some sound events naturally have longer durations than others,
waiting until completion of entire events before classification
may not be practical in a deployed system. The early detection
capability of the system is thus evaluated for the classification
of partial events. Performance for continuous event detection is
shown to be good, with accuracy being maintained well when
detecting partial events.
Index Terms: sound event detection, convolutional neural networks, audio classification, segmentation.
1. Introduction

Continuous sound event detection means the identification of sound events as they occur in a continuous audio medium. It extends the classification of isolated and separated sounds into real-world machine hearing scenarios. This is important for smart home and vehicle environments, speech interaction and telecommunication systems, and has relevance to audio-based security monitoring, ambient event detection and auditory scene analysis. Sound event detection research has traditionally been driven by techniques developed for speech recognition, including Mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) features with Gaussian mixture models (GMMs) and hidden Markov models (HMMs) [1, 2, 3, 4, 5]. However, these features and methods have more recently been surpassed by spectrogram-based techniques [6, 7], especially for the classification of noise-corrupted sounds. Recent systems have demonstrated very good results from the use of deep learning, including deep neural networks (DNNs) [8, 9, 10, 11] and convolutional neural networks (CNNs) [12, 13]. Both DNN and CNN classifiers perform well in the presence of acoustic background noise, with the latter demonstrating superior noise robustness.

While acoustic noise robustness is an important real-world attribute of such systems, practical methods must also be able to distinguish between the absence of sound events, the presence of individual events and the occurrence of overlapping events, and to do so at signal-to-noise ratios (SNRs) that are unknown a priori. The task is particularly difficult when many possible sound classes are involved, and when some classes have an inherently noise-like sound.

This paper proposes a detection front-end that identifies seed regions in spectrogram image features which have the characteristic time-frequency shape of sound events, prior to classification. Detected seed regions are then passed to a well-trained CNN, which classifies zero, one or multiple events. The seed region detector is further optimised to enable early event detection. This is inspired by systems such as [14, 15], which aim to enable reliable classification of sound events as they are occurring, rather than waiting until they have completed (i.e. online classification). This is an important requirement for future real-time machine hearing systems that need to classify sound events with long durations. We evaluate performance on the standard continuous audio event detection task first developed in [16] and extended in [17], and then evaluate the abilities of the system when forced to perform partial detection. Results show very good performance for full event detection, degrading gracefully as classification is performed earlier.

2. Background
The back-end classifier in many recent sound event detection systems is typically trained in a supervised fashion using data presented as individual files, where each file contains an isolated sound event of a single class, without added noise. In
the baseline CNN classifier used in this paper (in Section 3),
spectrogram image features (SIF) are obtained from individual
labelled sounds, conditioned, downsampled, and used to train
a CNN. Since the training material has no added background
noise, a basic energy detector is easily capable of identifying
regions of interest in the SIF prior to training.
For classification of detected sounds, many types of feature have been explored in the research literature, including raw waveform, MFCC, and several kinds of spectrogram and correlogram, as have many kinds of back-end classifier, for example MFCC-HMM [18], SIF-SVM [8], SIF-DNN [8] and SIF-CNN [12]. Each of those systems was evaluated on clean and noisy isolated sounds (known as robust sound event classification), using a standard 50-class evaluation of real-world sounds
first proposed by Dennis [18]. However real-world audio is continuous rather than discrete, with sounds of unknown duration
occurring at unknown, perhaps overlapping, times. A detection
operation is thus required in conjunction with the classification
task.
For this reason, an experimental evaluation was proposed
by the authors [17] that combined detection and classification of real-world sounds in continuous waveforms containing overlapping sounds, with the test material as illustrated in Fig. 1. The system proposed and evaluated in this paper for the robust classification of continuous and overlapping sounds uses identical training data, but extends the evaluation further through the development of early-detection capabilities, inspired by those first introduced in [19].

Figure 1: Illustration of the continuous test material.
Early detection is another capability that is important for real-world sound event detection. Some sound events have longer durations than others, and waiting until completion of entire sound events before classification – as most current systems do (including [17]) – may be impractical for longer sounds. Early detection is needed for online detection, and the degree of earliness is a factor in the classification latency of a system.
Figure 2: Block diagram of the classifier in test mode.
3. The proposed detection system
The proposed system is shown in Fig. 2, roughly divisible into
the detection process (top half) and the classifier (bottom half).
Within the classifier, a CNN architecture is employed that is
unchanged from the baseline classifier in [17]; this means that
improvements in performance are due to the capabilities of the
detection system alone.
3.1. Spectrogram image features

Both DNN and CNN classifiers have been shown to be very capable of extracting discriminative information from spectrogram features [8, 12, 13], with the best performing classifiers being CNN-based and acting on SIF features. The SIF extraction process is: (a) take the FFT magnitude of overlapping analysis windows (size 25 ms, overlap 20 ms), (b) downsample in both time and frequency to a 52 × 40 patch, (c) normalise in amplitude, and (d) optionally denoise prior to classification.
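As a rough illustration of steps (a) to (c), the following NumPy sketch extracts a fixed-size SIF patch. The sampling rate and the block-averaging used for downsampling are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def sif_patch(x, fs=16000, win_ms=25, hop_ms=5, shape=(52, 40)):
    """Sketch of SIF extraction: (a) FFT magnitude of overlapping windows
    (25 ms windows, 20 ms overlap -> 5 ms hop), (b) downsample in time and
    frequency to a 52 x 40 patch, (c) normalise amplitude to [0, 1].
    Assumes the input segment is long enough to fill 40 time blocks."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win, hop)]
    spec = np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # freq x time
    # (b) crude downsampling by averaging rectangular blocks
    f_idx = np.linspace(0, spec.shape[0], shape[0] + 1, dtype=int)
    t_idx = np.linspace(0, spec.shape[1], shape[1] + 1, dtype=int)
    patch = np.array([[spec[f_idx[i]:f_idx[i+1], t_idx[j]:t_idx[j+1]].mean()
                       for j in range(shape[1])] for i in range(shape[0])])
    # (c) normalise in amplitude
    return (patch - patch.min()) / (patch.max() - patch.min() + 1e-12)
```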
3.2. Energy detector

During training – which uses clean and labelled sound files – energy gating is used to select SIF patches for classification, with up to 9 patches per file (with one sound per file and 2500 files in total) contributing to the training. While this works well when testing clean sounds, the method is easily defeated by background noise, and it does not work well for overlapping sounds or complex multi-part sounds. More noise-robust methods are thus required for testing.

Waveform frames are processed sequentially from each sound file during training, with up to 9 of the highest-energy frames, each with its immediate 40-frame context, being selected as image patches. A hold-off of 20 frames is imposed until the next patch can be selected, and frames with energy lower than 10% of the peak energy inside the context are excluded. This confers a degree of noise resistance, with the hold-off period designed to ensure that loud sounds spanning multiple frames do not dominate over quieter sounds occurring elsewhere. This applies to sounds characterised by a strong attack energy and a sustained release, or multi-part sounds that have double or multiple energy peaks (e.g. stapler, footsteps, doorbell).

The CNN classifier outputs posterior probabilities Pk for each image patch, over the k = 1...50 classes. The index n = arg max(Pk) identifies the highest-probability class, but is only accepted if Pn > Pth; otherwise the sound event is classed as noise. As mentioned above, this energy-gated detector is used primarily during training.

3.3. Shape-based seed detection

Discrete sound events in nature are characterised by their acoustic energy, which is often the result of the conversion of kinetic energy to sound, where the cause is percussive or frictional, or the resonance of moving air (itself the conversion of kinetic energy in the air to correlated wave motion). The observation of the authors is that the physical basis for sound creation means that sound energy from single events tends to be either narrowband in frequency yet of relatively long duration, or wideband but of shorter duration. Percussive sounds, clicks, staplers, claps and bangs have wideband, short-duration energy releases. Horns, whistles, bells and squeaks typically have narrowband acoustic releases, but of longer duration. Even if the same amount of energy is generated/received for each sound, its shape in the time-frequency space will differ. This observation motivated the creation of a shape-based detector that detects either narrow-but-long or wide-but-short regions.

In operation, the detector computes the energy from the spectrogram S of Lx frequency bins over Ly frames, indexed by i, so that Ei = Σ_{y=0}^{Ly} |S(i, y)|, and then the box filter-smoothed envelope Ẽi = Σ_{l=1}^{P} al·E(i−l) is extracted, where al = 1 for 0 < l < 240. Peak candidates are obtained from the differential of the envelope, Ẽ′, and are then sorted by peak energy, with a 240-frame hold-off and a minimum height threshold of 1.0. Energy is computed over a longer time span, encouraging both short-duration wideband energy events and longer narrowband events (i.e. instantaneous frame energy is unimportant). Thresholding then improves noise rejection. Similar to the thresholding mentioned in Section 3.2, Pth defines the minimum probability threshold for detection of any class at the output of the classifier (we sweep Pth during experiments, but the best results are generally obtained when Pth ≈ 0.05).
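As a rough sketch of this seed detection step, the code below computes a per-frame energy trace from a magnitude spectrogram, smooths it with a 240-frame box filter, and picks peak candidates from the differentiated envelope with a hold-off and a minimum-height threshold. It is one reading of the description above (in particular, it sums energy over frequency for each frame) and not the authors' exact implementation.

```python
import numpy as np

def seed_candidates(spec, box_len=240, hold_off=240, min_height=1.0):
    """Sketch of shape-based seed detection: box-filter smoothing of the
    frame-energy trace, differentiation, then peak picking with a hold-off
    period and a minimum height threshold.
    `spec` is a magnitude spectrogram (frequency bins x frames)."""
    energy = np.abs(spec).sum(axis=0)                    # per-frame energy E_i
    kernel = np.ones(box_len)                            # a_l = 1 (box filter)
    envelope = np.convolve(energy, kernel, mode="same")  # smoothed envelope
    diff = np.diff(envelope)                             # differential of envelope
    # candidate peaks: local maxima (sign change of diff) above the height threshold
    peaks = [i for i in range(1, len(diff))
             if diff[i - 1] > 0 >= diff[i] and envelope[i] > min_height]
    # sort by peak energy, then enforce the hold-off between accepted seeds
    accepted = []
    for p in sorted(peaks, key=lambda i: envelope[i], reverse=True):
        if all(abs(p - q) >= hold_off for q in accepted):
            accepted.append(p)
    return sorted(accepted)
```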
Figure 3: DET curve for the energy-only front-end detector operating on clean, 20dB, 10dB and 0dB SNR sound recordings.

Figure 4: DET curve for the shape-based front-end detector operating on clean, 20dB, 10dB and 0dB SNR sound recordings.
3.4. Convolutional neural network
CNNs are well known in image classification [20, 21], and in
this application, the spectrogram patch is an image. The CNN
structure used, derived from [17], can be seen in Fig. 2. It has 5 layers (2 convolutional layers, 2 subsampling layers and 1 fully connected layer), with an input dimensionality of 52 × 40 = 2080 and 50 output classes from a single fully connected layer. The first and second convolutional layers consist of 6 and 12 kernels respectively, each with a kernel size of 5 × 5. The subsampling layers
employ average pooling with a common factor of 2:1. Batch
normalization [22] is applied before each convolutional layer.
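A minimal PyTorch sketch of a network with this shape (6 and 12 kernels of size 5 × 5, 2:1 average pooling, batch normalisation before each convolution, and a single fully connected layer mapping to 50 classes) is given below. Details that the paper does not specify, such as the choice of non-linearity, are assumptions.

```python
import torch
import torch.nn as nn

class SIFCNN(nn.Module):
    """Sketch of the 5-layer CNN described above: 2 conv layers (6 and 12
    kernels, 5x5), 2 average-pooling layers (2:1), 1 fully connected layer,
    with batch normalisation before each convolution. The sigmoid
    activation is an assumption."""
    def __init__(self, n_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(1),
            nn.Conv2d(1, 6, kernel_size=5),   # 1x52x40 -> 6x48x36
            nn.Sigmoid(),
            nn.AvgPool2d(2),                  # -> 6x24x18
            nn.BatchNorm2d(6),
            nn.Conv2d(6, 12, kernel_size=5),  # -> 12x20x14
            nn.Sigmoid(),
            nn.AvgPool2d(2),                  # -> 12x10x7
        )
        self.classifier = nn.Linear(12 * 10 * 7, n_classes)

    def forward(self, x):                     # x: (batch, 1, 52, 40)
        h = self.features(x).flatten(1)
        return self.classifier(h)             # per-class scores (logits)

# Example: posterior probabilities and thresholded decision for one patch
model = SIFCNN()
patch = torch.rand(1, 1, 52, 40)
posteriors = torch.softmax(model(patch), dim=1)
p_max, n = posteriors.max(dim=1)
label = n.item() if p_max.item() > 0.05 else None  # Pth = 0.05; None means noise
```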
3.5. The evaluation task

The sound material used for training and evaluation consists of 4000 recordings divided into 50 different sound event classes of 80 files each. The files were randomly selected from the Real World Computing Partnership (RWCP) Sound Scene Database in Real Acoustic Environments [23] across a subset of 50 classes, as specified in [7]. Of the 80 files in each class, 50 were randomly selected to form the training set (50 × 50 = 2500), with the remainder (30 × 50 = 1500) being used for evaluation.

The evaluation material is formed by first creating 100 separate 1-min long empty test files, into which 15 randomly-selected test sound events are inserted at random time indices. The random nature of the selection means that some sounds are represented multiple times per test file, and that double and even triple overlap events occur. In the original definition of the evaluation method [17], noise was randomly selected from random positions within four different NOISEX-92 noises; however, the tests in the current paper employ only AWGN, which improves the repeatability of the experiments. One further change is made to the current evaluation compared to the testing methodology described in [17]. This is the adoption of a much stricter criterion for class detection: in the current paper, any class with posterior probability exceeding Pth in any analysis frame is counted as a detection, with a detection being correct only if the candidate class matches the ground truth. There may be between 0 and many (up to 50 if Pth is low) detections per analysis frame, and perhaps several hundred analysis frames for each ground truth class region. Yet each ground truth class region can only contribute either 0 or 1 correct detections. In the previous work [17], detections were made for each analysis frame in the same way, but correct detections were counted for each analysis segment, rather than for a whole ground truth class region. Therefore, it was possible for there to be many correct detections within a single ground truth class region (e.g. if one ground truth class region contained 20 analysis segments, there could be up to 20 correct detections counted across that region, rather than up to 1 in the current system). The stricter criterion is important because we are measuring early detection, which affects event-based detection much more than frame-based detection. We therefore first re-evaluate the baseline detector from [17] using the stricter criterion.

In the reported results, we define precision as P = M/N, where M is the number of ground truth sound events detected correctly and N is the total number of detected events. Recall is computed as R = M/K, where K is the total number of ground truth sound events in the test. The composite F1 score combines both metrics to yield a single overall performance figure: F1 = 2/(P^-1 + R^-1).
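For clarity, the sketch below computes these event-level scores directly from the counts defined above (M correct event detections, N total detections, K ground-truth events); the function and argument names are illustrative only.

```python
def event_scores(m_correct, n_detected, k_ground_truth):
    """Event-level precision, recall and F1, per the definitions above:
    P = M/N, R = M/K, F1 = 2 / (1/P + 1/R) = 2PR / (P + R)."""
    precision = m_correct / n_detected if n_detected else 0.0
    recall = m_correct / k_ground_truth if k_ground_truth else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f1

# e.g. the energy-based detector, full events, clean: P=0.711, R=0.567 -> F1 ~ 0.631
```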
4. Results and discussion

4.1. Energy and shape detection
We first explore the performance of the system with a basic energy detector. Fig. 3 plots (1−R) against (1−P) for a range of Pth thresholds in clean and noisy conditions. The results
show degradation in overall detection and classification performance due to the presence of noise. This is not unexpected,
given that region detection is based only on patch energy. The
shape-based detector of Section 3.3 was then applied and the
above tests repeated, with results plotted in Fig. 4. In this case,
very little degradation was experienced at 20dB SNR, or even
at 10dB SNR, although at 0dB SNR it is significantly degraded.
Further results are given in Table 1. Results were obtained for a range of thresholds Pth around the maximal-F1 region, and the scores at which peak F1 occurs are reported for each test. For now, consider just the lines beginning
with “full”, which are the results in which early detection is not
being evaluated.
It is interesting to note that the highest F1 score actually
occurs when low levels of noise are present – due to the fact that
even ‘clean’ recordings contain low levels of noise, and that it
is better to spread noise evenly than to cluster it around sound
events. The same phenomenon was found in CNN classification
of isolated sounds (e.g. in [12]) where low levels of background
noise tended to be beneficial to performance. Nevertheless, as
noise increases beyond 10dB SNR, performance degrades, so
that scores at 0dB are very poor, in common with prior methods
such as [17]. Even with isolated sound event classification [8],
recognition of sounds in 0dB SNR is extremely challenging.
From the overall results presented so far, the best F1 for each noise condition for the shape-based detector system compares well with that of the energy-based detector, apart from in 0dB AWGN.
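The threshold sweep behind Table 1 can be sketched as follows: for each condition, sweep Pth over the stated range, score the resulting detections, and keep the operating point with the highest F1. Here, evaluate_at_threshold is a placeholder for the detection-plus-classification pipeline described above.

```python
import numpy as np

def best_operating_point(evaluate_at_threshold,
                         thresholds=np.arange(0.01, 0.96, 0.05)):
    """Sweep the detection threshold Pth and report the point of peak F1.
    `evaluate_at_threshold(p_th)` is assumed to return (precision, recall)
    for the full detection + classification pipeline at that threshold."""
    best = (0.0, 0.0, 0.0, None)                  # (F1, P, R, Pth)
    for p_th in thresholds:
        precision, recall = evaluate_at_threshold(p_th)
        if precision > 0 and recall > 0:
            f1 = 2.0 / (1.0 / precision + 1.0 / recall)
            if f1 > best[0]:
                best = (f1, precision, recall, float(p_th))
    return best
```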
Table 1: Precision (P), recall (R) and F1 score for the original energy-based detector and the proposed shape-based detector performing
feature selection with backend CNN-based classification. The results report the best achieved F1 score over a Pth range [0.01:0.95]
with a step size of 0.05 for clean, 20dB, 10dB and 0dB SNR AWGN and early detection degrees of 100%, 50%, 25% and 12.5%.
                |       Clean        |        20dB        |        10dB        |        0dB
Earliness       |   P     R     F1   |   P     R     F1   |   P     R     F1   |   P     R     F1
Energy-based detector
full            | 0.711 0.567 0.631  | 0.732 0.600 0.659  | 0.725 0.617 0.667  | 0.711 0.533 0.610
50%             | 0.711 0.517 0.598  | 0.749 0.587 0.658  | 0.763 0.580 0.659  | 0.798 0.487 0.605
25%             | 0.667 0.373 0.479  | 0.776 0.403 0.531  | 0.740 0.473 0.577  | 0.511 0.403 0.451
12.5%           | 0.084 0.057 0.068  | 0.135 0.073 0.095  | 0.127 0.083 0.101  | 0.025 0.077 0.038
Shape-based detector
full            | 0.852 0.633 0.727  | 0.843 0.647 0.732  | 0.814 0.670 0.735  | 0.582 0.547 0.564
50%             | 0.839 0.627 0.718  | 0.851 0.630 0.724  | 0.870 0.623 0.726  | 0.633 0.540 0.583
25%             | 0.750 0.490 0.593  | 0.790 0.477 0.595  | 0.786 0.490 0.604  | 0.659 0.450 0.535
12.5%           | 0.376 0.137 0.200  | 0.373 0.137 0.200  | 0.361 0.143 0.205  | 0.292 0.150 0.198
4.2. Early detection

Early detection was then explored by creating four sets of experimental continuous sound recordings. Each used the same random selection of sound events, starting positions and overlaps, but only included the beginning segment of each sound in the test. It is thus a task of detecting partial sounds, but since these segments all include the beginning of the sounds in question, with the end truncated, it forces the system to perform detection on just the early parts of each sound. The task is illustrated in Fig. 5, which shows a fixed short segment of spectrogram from a single experimental condition, from three of the early detection databases. In each case, these are clean sounds without additional AWGN. In the figure, the same three events are present, starting at the same position in each recording. The full event data (a) includes the entire sound for each of the three events, whereas in (b) only the first half of each sound has been included, and in (c) only the first 12.5% has been pasted in. The 25% data is not shown for space reasons, but follows a similar pattern. For each of the experiments, classification uses this data alone, with no a priori information regarding the length of each event. It is interesting to note that as the length of each event is curtailed, the degree of overlap also reduces; the full data test in spectrogram (a) has a significant overlap between the second and third events. The overlap is small when only 50% of the sounds are included, and is absent in the 12.5% case (although overlap still occurs in other parts of the test database).

Figure 5: Spectrogram of a fixed segment of one of the 100 test files from the (a) full, (b) 50% and (c) 12.5% event test databases.

Full results for precision, recall and F1 score are presented in Table 1 for both the energy-based and shape-based detectors, for each early detection condition. The shape-based detector results degrade much less for the early detection cases than do the energy-based detector results. In fact, degradation due to early event detection is small up to even 25%, and may even be beneficial in some cases (for example, the slightly improved accuracy for 50% early detection), which we believe is due to a trade-off between less data being available for classification and the reduction in overlap. Fig. 6 shows the peak F1 score for each tested condition of the shape-based detector.

Figure 6: F1 score achieved by the shape-based detector in different levels of AWGN, for each early detection condition.
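A minimal sketch of how such partial-event test material could be assembled is shown below: events are truncated to a chosen fraction of their length before being added into otherwise empty one-minute files at their original start positions. The sample rate, variable names, and the scaling of noise to a target SNR are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def build_partial_test_file(events, starts, earliness=0.5, fs=16000,
                            duration_s=60, snr_db=None, rng=None):
    """Sketch: insert the first `earliness` fraction of each event into an
    empty 1-minute file at its original start sample, then optionally add
    AWGN at a target SNR. `events` is a list of 1-D waveforms and `starts`
    the corresponding start samples (assumed to lie within the file)."""
    rng = rng or np.random.default_rng(0)
    out = np.zeros(int(fs * duration_s))
    for event, start in zip(events, starts):
        partial = event[: max(1, int(len(event) * earliness))]  # keep onset only
        end = min(start + len(partial), len(out))
        if end > start:
            out[start:end] += partial[: end - start]
    if snr_db is not None:                        # add AWGN at the requested SNR
        sig_power = np.mean(out ** 2) + 1e-12
        noise_power = sig_power / (10 ** (snr_db / 10))
        out += rng.normal(0.0, np.sqrt(noise_power), len(out))
    return out
```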
5. Conclusion
This paper has proposed a shape-based front-end detector that
operates in conjunction with a well-trained isolated sound CNN
classifier, to perform robust early sound event detection. The
baseline CNN classifier is first evaluated in clean and noisy conditions, using a standard acoustic noise database, with a simple
energy-based front-end. The proposed shape-based detector is
then evaluated in the same conditions, and shown to improve
performance. The early-detection task is derived from the standard test methodology, allowing performance to be evaluated
for four early-detection conditions. The new detector, allied with the backend CNN classifier, is shown to perform very well even when 50% of each sound event is omitted, and to degrade
gracefully as detection is forced on the basis of less and less
classification data.
6. References
[1] H. Phan, M. Maass, R. Mazur, and A. Mertins, "Random regression forests for acoustic event detection and classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 20–31, 2015.
[2] H. Phan, L. Hertel, M. Maass, R. Mazur, and A. Mertins, "Learning representations for nonspeech audio events through their similarities to speech patterns," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 807–822, April 2016.
[3] J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad, and A. Serralheiro, "Non-speech audio event detection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 1973–1976.
[4] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[5] T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound event detection in multisource environments using source separation," in Workshop on Machine Listening in Multisource Environments, 2011, pp. 36–40.
[6] J. Dennis, H. D. Tran, and H. Li, "Spectrogram image feature for sound event classification in mismatched conditions," IEEE Signal Processing Letters, vol. 18, no. 2, pp. 130–133, 2011.
[7] J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 367–377, 2013.
[8] I. McLoughlin, H.-M. Zhang, Z.-P. Xie, Y. Song, and W. Xiao, "Robust sound event classification using deep neural networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 540–552, Mar. 2015.
[9] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, "Polyphonic sound event detection using multi label deep neural networks," in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–7.
[10] T. L. Nwe, T. H. Dat, and B. Ma, "Convolutional neural network with multi-task learning scheme for acoustic scene classification," in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec 2017, pp. 1347–1350.
[11] J. Li, W. Dai, F. Metze, S. Qu, and S. Das, "A comparison of deep learning methods for environmental sound detection," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 126–130.
[12] H. Zhang, I. McLoughlin, and Y. Song, "Robust sound event recognition using convolutional neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 559–563.
[13] T. Heittola, E. Çakır, and T. Virtanen, The Machine Learning Approach for Analysis of Sound Scenes and Events. Springer International Publishing, 2018, pp. 13–40.
[14] H. Phan, M. Maass, R. Mazur, and A. Mertins, "Acoustic event detection and localization with regression forests," in 15th Annual Conference of the International Speech Communication Association (Interspeech), Singapore, September 2014, pp. 1–5.
[15] H. Phan, P. Koch, I. McLoughlin, and A. Mertins, "Enabling early audio event detection with neural networks," arXiv preprint arXiv:1712.02116, 2017.
[16] H. Zhang, I. McLoughlin, and Y. Song, "Robust sound event detection in continuous audio environments," in Proc. Interspeech, Sep. 2016.
[17] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, W. Xiao, and H. Phan, "Continuous robust sound event classification using time-frequency features and deep learning," PLoS ONE, vol. 12, no. 9, p. e0182309, 2017.
[18] J. W. Dennis, "Sound event recognition in unstructured environments using spectrogram image processing," Ph.D. dissertation, Nanyang Technological University, Singapore, 2014.
[19] H. Phan, M. Maass, R. Mazur, and A. Mertins, "Early event detection in audio streams," in 2015 IEEE International Conference on Multimedia and Expo (ICME), June 2015, pp. 1–6.
[20] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, 1995.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of ICML, 2015, pp. 448–456.
[23] S. Nakamura, K. Hiyane, F. Asano, T. Yamada, and T. Endo, "Data collection in real acoustical environments for sound scene understanding and hands-free speech recognition," in EUROSPEECH, 1999, pp. 2255–2258.