
Polyphonic Sound Event Detection by using Capsule Neural Networks

2019, IEEE Journal of Selected Topics in Signal Processing

https://doi.org/10.1109/JSTSP.2019.2902305


Fabio Vesperini, Leonardo Gabrielli, Emanuele Principi*, and Stefano Squartini, Senior Member, IEEE

*Corresponding author. The authors are with the A3Lab, Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy. E-mail: [email protected], {l.gabrielli,e.principi,s.squartini}@univpm.it

Abstract—Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, deep learning offers valuable techniques for this goal, such as convolutional neural networks (CNNs). The capsule neural network (CapsNet) architecture has recently been introduced in the image processing field with the intent to overcome some of the known limitations of CNNs, specifically regarding the scarce robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapped images. This motivated the authors to employ CapsNets to deal with the polyphonic SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called dynamic routing that encourages learning part-whole relationships and improves the detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing how the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to the state-of-the-art algorithms.

Index Terms—Capsule Neural Networks, Convolutional Neural Network, Polyphonic Sound Event Detection, DCASE, Computational Audio Processing

I. INTRODUCTION

Human cognition relies on the ability to sense, process, and understand the surrounding environment and its sounds. Although the skill of listening and understanding the origin of sounds is natural for living beings, it is still a very challenging task for computers. Sound event detection (SED), or acoustic event detection, aims to mimic this cognitive feature by means of artificial systems. Basically, a SED algorithm is designed to detect the onset and offset times for a variety of sound events captured in an audio recording and to associate a textual descriptor, i.e., a label, with each of these events.

In recent years, SED has received significant interest from the computational auditory scene analysis community [1], due to its potential in several engineering applications. Indeed, the automatic recognition of sound events and scenes can have a considerable impact in a wide range of applications where sound or sound sensing is advantageous with respect to other modalities. This is the case of acoustic surveillance [2], healthcare monitoring [3], [4] or urban sound analysis [5], where the short duration of certain events (e.g., a human fall, a gunshot or a glass breaking) or personal privacy motivate the exploitation of audio information rather than, e.g., image processing. In addition, audio processing is often less computationally demanding compared to other multimedia domains, thus embedded devices can be easily equipped with microphones and sufficient computational capacity to locally process the captured signal. These could be smart home devices for home automation purposes or sensors for wildlife and biodiversity monitoring (e.g., bird call detection [6]).
SED algorithms in a real-life scenario face many challenges. These include the presence of simultaneous events, environmental noise and events of the same class produced by different sources [7]. Since multiple events are very likely to overlap, a polyphonic SED algorithm, i.e., an algorithm able to detect multiple simultaneous events, needs to be designed. Finally, the effects of noise and intra-class variability represent further challenges for SED in real-life situations.

Traditionally, polyphonic acoustic event analysis has been approached with statistical modelling methods, including hidden Markov models (HMMs) [8], Gaussian mixture models (GMMs) [9], non-negative matrix factorization (NMF) [10] and support vector machines (SVMs) [11]. In the recent era of "deep learning", different neural network architectures have been successfully used for sound event detection and classification tasks, including feed-forward neural networks (FNNs) [12], deep belief networks [13], convolutional neural networks (CNNs) [14] and recurrent neural networks (RNNs) [15]. In addition, these architectures laid the foundation for end-to-end systems [16], [17], in which the feature representation of the audio input is automatically learnt from the raw audio signal waveform.

A. Related Works

The use of deep learning models has been motivated by the increased availability of datasets and computational resources, and it has resulted in significant performance improvements. Methods based on CNNs and RNNs have established the new state-of-the-art performance on the SED task, thanks to their capability to learn the non-linear relationship between time-frequency features of the audio signal and a target vector representing the sound events. In [18], the authors show how "local" patterns can be learned by a CNN and exploited to improve the detection and classification of non-speech acoustic events occurring in conversation scenes, in particular compared to an FNN-based system which processes multiple-resolution spectrograms in parallel.

The combination of the CNN structure with recurrent units has further increased the detection performance by taking advantage of the characteristics of each architecture. This is the case of convolutional recurrent neural networks (CRNNs) [19], which provided state-of-the-art performance especially in the case of polyphonic SED. CRNNs consolidate the CNN property of local shift invariance with the capability to model short- and long-term temporal dependencies provided by the RNN layers. This architecture has also been employed in almost all of the best performing algorithms proposed in the recent editions of research challenges such as the IEEE Audio and Acoustic Signal Processing (AASP) Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) [20]. On the other hand, if the datasets are not sufficiently large, problems such as overfitting can be encountered with these models, which are typically composed of a considerable number of free parameters (i.e., more than 1M).

Encouraging polyphonic SED performance has been obtained using CapsNets in preliminary experiments conducted on the Bird Audio Detection task on the occasion of the DCASE 2018 challenge [21], and confirmed by the results reported in [22]. The CapsNet [23] is a recently proposed architecture for image classification based on the grouping of activation units into novel structures introduced in [24], named capsules, along with a procedure called dynamic routing.
The capsule has been designed to represent a set of properties for an entity of interest, while dynamic routing is included to allow the network to implicitly learn global coherence and to identify part-whole relationships between capsules. The authors of [23] show that CapsNets outperform state-of-the-art approaches based on CNNs for digit recognition in the MNIST dataset case study. They designed the CapsNet to learn how to assign the suited partial information to the entities that the neural network has to predict in the final classification. This property should overcome the limitations of solutions such as max-pooling, currently employed in CNNs to provide local translation invariance, but often reported to cause an excessive information loss. Theoretically, the introduction of dynamic routing can supply invariance for any property captured by a capsule, also allowing the model to be trained adequately without requiring extensive data augmentation or dedicated domain adaptation procedures.

B. Contribution

The proposed system is a fully data-driven approach based on the CapsNet deep neural architecture presented by Sabour et al. [23]. This architecture has shown promising results on the classification of highly overlapped digit images. In the audio field, a similar condition can be found in the detection of multiple concomitant sound events from acoustic spectral representations, and we therefore propose to employ the CapsNet for polyphonic SED in real-life recordings. The novel computational structure based on capsules, combined with the routing mechanism, allows the model to be invariant to intra-class affine transformations and to identify part-whole relationships between data features. In the SED case study, it is hypothesized that this characteristic confers to the CapsNet the ability to effectively select the most representative spectral features of each individual sound event and to separate them from the overlapped descriptions of the other sounds in the mixture.

This hypothesis is supported by the previously mentioned related works. Specifically, in [21], the CapsNet is exploited in order to obtain the prediction of the presence of heterogeneous polyphonic sounds (i.e., bird calls) on unseen audio files recorded in various conditions. In [22], the authors proposed a CapsNet for sound event detection that uses gated convolutions in the initial layers of the network, and an attention layer that operates in parallel with the final capsule layer. The outputs of the two layers are merged and used to obtain the final prediction. The algorithm is evaluated on the weakly-labeled dataset of the DCASE 2017 challenge [25] with promising results. In [26], capsule networks have been applied to a speech command recognition task, and the authors obtained a significant performance improvement with respect to CNNs.

In this paper, we present an extensive analysis of SED conducted on real-life audio datasets and compare the results with state-of-the-art methods. In addition, we propose a variant of the dynamic routing procedure which takes into account the temporal dependence of adjacent frames. The proposed method outperforms previous SED approaches in terms of detection error rate in the case of polyphonic SED, while it has comparable performance with respect to CNNs in the case of monophonic SED. The whole system is composed of a feature extraction stage and a detection stage.
The feature extraction stage transforms the audio signal into acoustic spectral features, while the second stage processes these features to detect the onset and offset times of specific sound events. In this latter stage we include the capsule units. The network parameters are obtained by supervised learning, using annotations of sound event activity as target vectors. We have evaluated the proposed method on three datasets of real-life recordings and we have compared its performance both with the results of experiments with a traditional CNN architecture and with the performance of well-established algorithms which have been assessed on the same datasets.

The rest of the paper is organized as follows. In Section II, the task of polyphonic SED is formally described and the stages of the proposed approach are detailed, including a presentation of the characteristics of the CapsNet architecture. In Section III, we present the evaluation set-up used to assess the performance of the proposed algorithm and the comparative methods. In Section IV, the results of the experiments are discussed and compared with baseline methods. Section V finally presents our conclusions for this work.

II. PROPOSED METHOD

The aim of polyphonic SED is to find and classify the sound events present in an audio signal. The algorithm we propose is composed of two main stages: sound representation and polyphonic detection. In the sound representation stage, the audio signal is transformed into a two-dimensional time-frequency representation to obtain, for each frame t of the audio signal, a feature vector x_t ∈ R^F, where F represents the number of frequency bands.

Sound events possess temporal characteristics that can be exploited for SED, thus certain events can be efficiently distinguished by their time evolution. Impulsive sounds are extremely compact in time (e.g., gunshot, object impact), while other sound events have indefinite length (e.g., wind blowing, people walking). Other events can be distinguished by their spectral evolution (e.g., bird singing, car passing by). Long-term time domain information is very beneficial for SED and motivates the use of a temporal context, allowing the algorithm to extract information from a chronological sequence of input features. Consequently, the features are presented as a context window matrix X_{t:t+T−1} ∈ R^{T×F×C}, where T ∈ N is the number of frames that defines the sequence length of the temporal context, F ∈ N is the number of frequency bands and C is the number of audio channels. The target output matrix is defined as Y_{t:t+T−1} ∈ N^{T×K}, where K is the number of sound event classes.

In the SED stage, the task is to estimate the probabilities p(Y_{t:t+T−1} | X_{t:t+T−1}, θ) ∈ R^{T×K}, where θ denotes the parameters of the neural network. The network outputs, i.e., the event activity probabilities, are then compared to a threshold in order to obtain the event activity predictions Ŷ_{t:t+T−1} ∈ N^{T×K}. The parameters θ are trained by supervised learning, using the frame-based annotation of the sound event classes as target output: if class k is active during frame t, Y(t, k) is equal to 1, and it is set to 0 otherwise. The case of polyphonic SED implies that this target output matrix can have multiple non-zero elements in the same frame t, since several classes can be simultaneously present. Indeed, polyphonic SED can be formulated as a multi-label classification problem in which the sound event classes are detected by means of multi-label annotations over consecutive time frames. The onset and offset times for each sound event are obtained by combining the classification results over consequent time frames. The trained model is then used to predict the activity of the sound event classes in an audio stream without any further post-processing operations or prior knowledge of the event locations.
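As a concrete illustration of the frame-based multi-label formulation above, the following is a minimal NumPy sketch of how the target matrix Y and the binarized predictions Ŷ could be built. It assumes annotations given as (onset, offset, class index) tuples and a 20 ms frame step; the function names are illustrative and not part of the released code.

```python
import numpy as np

def frame_targets(annotations, n_frames, n_classes, hop=0.02):
    """Build the frame-based multi-label target matrix Y (T x K).

    `annotations` is assumed to be a list of (onset_s, offset_s, class_idx)
    tuples; `hop` is the frame step in seconds (20 ms in the paper).
    """
    Y = np.zeros((n_frames, n_classes), dtype=np.int8)
    for onset, offset, k in annotations:
        t0, t1 = int(onset / hop), int(np.ceil(offset / hop))
        Y[t0:min(t1, n_frames), k] = 1   # several classes may be active in the same frame
    return Y

def binarize(probabilities, threshold=0.5):
    """Turn event activity probabilities (T x K) into binary predictions Ŷ."""
    return (probabilities >= threshold).astype(np.int8)
```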
A. Feature Extraction

For our purpose, we use two acoustic spectral representations, the magnitude of the Short Time Fourier Transform (STFT) and LogMel coefficients, obtained from all the audio channels and extensively used in other SED algorithms. Except where differently stated, we study the performance of binaural audio features and compare it with that of features extracted from a single-channel audio signal. In all cases, we operate on audio signals sampled at 16 kHz and we calculate the STFT with a frame size equal to 40 ms and a frame step equal to 20 ms. Furthermore, the audio signals are normalized to the range [−1, 1] in order to have the same dynamic range for all the recordings. The STFT is computed on 1024 points for each frame, while the LogMel coefficients are obtained by filtering the STFT magnitude spectrum with a filter-bank composed of 40 triangular filters evenly spaced on the mel frequency scale [27]. In both cases, the logarithm of the energy of each frequency band is computed. The input matrix X_{t:t+T−1} concatenates T = 256 consequent STFT or LogMel vectors for each channel C = {1, 2}, thus the resulting feature tensor is X_{t:t+T−1} ∈ R^{256×F×C}, where F is equal to 513 for the STFT and to 40 for the LogMels. The range of the feature values is then normalized according to the mean and the standard deviation computed on the training sets of the neural networks.
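The following is a minimal sketch of the feature extraction described above, using Librosa (which the paper reports as its feature-extraction library). The parameter values follow the text (16 kHz, 40 ms frames, 20 ms hop, 1024-point STFT, 40 mel bands); the function name is illustrative, and the per-dataset mean/standard-deviation normalization is left out.

```python
import numpy as np
import librosa

def extract_features(path, kind="logmel"):
    """Compute STFT-magnitude or LogMel features, returned as (frames, F, C)."""
    y, sr = librosa.load(path, sr=16000, mono=False)        # keep both channels if stereo
    y = np.atleast_2d(y) / np.max(np.abs(y))                 # normalize waveform to [-1, 1]
    feats = []
    for channel in y:
        spec = np.abs(librosa.stft(channel, n_fft=1024,
                                   win_length=640, hop_length=320))  # 513 frequency bins
        if kind == "logmel":
            mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=40)
            spec = mel_fb @ spec                              # 40 mel bands
        feats.append(np.log(spec + 1e-10).T)                  # log of the band energy, (frames, F)
    return np.stack(feats, axis=-1)                           # (frames, F, C)
```

Context windows X of T = 256 frames would then be sliced from the returned matrix and standardized with the training-set statistics.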
B. Background on Capsule Networks

Capsules have been introduced to overcome some limitations of CNNs, in particular the loss of information caused by the max-pooling operator used for obtaining translational invariance [23], [24]. The main idea behind capsules is to replace conventional neurons with local units that produce a vector output (capsules) incorporating all the information detected in the input. Moreover, lower-level capsules are connected to higher-level ones with a set of weights determined during inference by using a dynamic routing mechanism. These two aspects represent the main differences from conventional neural networks, where neurons output a single scalar value and connection weights are determined in the training phase by using back-propagation [23], [24].

Recalling the original formulation in [23], [24], a layer of a capsule network is divided into multiple computational units named capsules. Considering capsule j, its total input s_j is calculated as:

  s_j = Σ_i α_ij û_{j|i} = Σ_i α_ij W_ij u_i,    (1)

where α_ij are the coupling coefficients between capsule i in the lower-level layer and capsule j, u_i is the output of capsule i, W_ij are transformation matrices, and û_{j|i} are prediction vectors. The vector output of capsule j is calculated by applying a non-linear squashing function that makes the length of short vectors close to zero and the length of long vectors close to 1:

  v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖).    (2)

Using the squashing function of Eq. (2) allows the magnitude of the vector to be interpreted as a probability, in particular the probability that the entity represented by the capsule is present in the input [23].

The coefficients α_ij measure how likely capsule i may activate capsule j. Thus, the value of α_ij should be relatively high if the properties of capsule i coincide with the properties of capsule j in the layer above. As shown in detail in the next section, this is obtained by using the notion of agreement between capsules in two consecutive layers. The coupling coefficients are calculated by the iterative process of dynamic routing, and capsules in the higher layers should include capsules in the layer below in terms of the entity they identify. Dynamic routing iteratively attempts to find these associations and supports capsules in learning features that ensure these connections. The "routing-by-agreement" algorithm introduced in [23] represents an evolution of the simpler routing mechanism intrinsic in max-pooling and is described in the next section.

  1:  procedure ROUTING(û_{j|i}, r, l)
  2:    ∀ capsule i in layer l and capsule j in layer (l+1): β_ij ← 0
  3:    for r iterations do
  4:      ∀ capsule i in layer l: α_ij ← exp(β_ij) / Σ_k exp(β_ik)
  5:      ∀ capsule j in layer (l+1): s_j ← Σ_i α_ij û_{j|i}
  6:      ∀ capsule j in layer (l+1): v_j ← (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
  7:      ∀ capsule i in layer l and capsule j in layer (l+1): β_ij ← β_ij + û_{j|i} · v_j
  8:    end for
  9:    return v_j
  10: end procedure

Fig. 1. The dynamic routing algorithm proposed in [23].

1) Dynamic Routing: After giving a qualitative description of the routing mechanism, we describe in detail the algorithm used in [23] to compute the coupling coefficients. The "routing-by-agreement" algorithm operates as shown in Fig. 1. The algorithm is executed for each layer l of the network and for r iterations, and it outputs the vectors v_j of layer (l+1). In essence, the algorithm represents the forward pass of the network. As shown in line 4, the coupling coefficients α_ij are determined by applying the softmax function to the coefficients β_ij:

  α_ij = exp(β_ij) / Σ_k exp(β_ik).    (3)

The softmax function ensures that α_ij ∈ (0, 1), thus making α_ij the probability that capsule i in the lower-level layer sends its output to capsule j in the upper-level layer. The coefficients β_ij are initialized to zero so that the coupling coefficients α_ij all have the same initial value. After this step, the β_ij coefficients are updated by using an iterative algorithm which exploits the agreement between the output of capsule j, v_j, and the prediction of capsule i, û_{j|i}, in the layer below. The agreement is measured by the scalar product û_{j|i} · v_j, and it provides a measure of how similar the directions (i.e., the properties of the entity they represent) of capsules i and j are.

2) Margin loss function: The length of the vector v_j is used to represent the probability that the entity represented by capsule j exists. The CapsNet has to be trained to produce a long instantiation vector at the corresponding k-th capsule if the event that it represents is present in the input audio sequence. A separate margin loss is defined for each target class k as:

  L_k = T_k max(0, m⁺ − ‖v_k‖)² + λ (1 − T_k) max(0, ‖v_k‖ − m⁻)²,    (4)

where T_k = 1 if an event of class k is present, while λ is a down-weighting factor of the loss for absent sound event classes. m⁺, m⁻ and λ are set equal to 0.9, 0.1 and 0.5, respectively, as suggested in [23]. The total loss is simply the sum of the losses of all the output capsules.
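To make Eqs. (1)-(4) and the procedure of Fig. 1 concrete, the following is a minimal NumPy sketch of the squashing function, the routing-by-agreement iterations and the margin loss. Shapes and names are illustrative; the actual system implements these operations as trainable Keras/TensorFlow layers.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Non-linear squashing of Eq. (2): short vectors -> ~0, long vectors -> ~1."""
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, r=3):
    """Routing-by-agreement (Fig. 1).

    u_hat: prediction vectors û_{j|i} with shape (I, J, G), i.e. I lower-level
    capsules and J upper-level capsules of dimension G.
    Returns the upper-level capsule outputs v_j with shape (J, G).
    """
    I, J, _ = u_hat.shape
    beta = np.zeros((I, J))                                # routing logits β_ij
    for _ in range(r):
        alpha = np.exp(beta) / np.exp(beta).sum(axis=1, keepdims=True)  # Eq. (3)
        s = np.einsum('ij,ijg->jg', alpha, u_hat)          # Eq. (1)
        v = squash(s)                                      # Eq. (2)
        beta = beta + np.einsum('ijg,jg->ij', u_hat, v)    # agreement update (line 7)
    return v

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of Eq. (4); v has shape (K, G), targets is a binary vector of length K."""
    lengths = np.linalg.norm(v, axis=-1)
    losses = targets * np.maximum(0.0, m_pos - lengths) ** 2 \
             + lam * (1 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return losses.sum()
```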
C. CapsNet for Polyphonic Sound Event Detection

The architecture of the neural network is shown in Fig. 2. The first stages of the model are traditional CNN blocks which act as feature extractors on the input X_{t:t+T−1}. The input of each CNN block is zero-padded in order to preserve its dimension and, after each block, max-pooling [28] is used to halve the dimensions on the frequency axis only. Thus, the output of the first CNN layers has dimension T × F′ × Q, where F′ < F is the number of elements on the frequency axis after max-pooling, and Q is the number of kernels in the final CNN block.

[Fig. 2. Flow chart of the capsule neural network architecture used for polyphonic sound event detection: input features → convolutional layers → primary capsules → dynamic routing → detection capsules → Euclidean norm.]

This tensor is then used as input to the Primary Capsule Layer, which represents the lowest level of multi-dimensional entities. The processing stages occurring after the CNN blocks are depicted in Fig. 3. Basically, the Primary Capsule Layer is a convolutional layer with J · M filters, i.e., it contains M convolutional capsules with J kernels each. The output tensor of this layer has dimension T × F′ × J · M, and it is then reshaped in order to obtain a T × F′ · J × M tensor. The capsule vectors u_i are represented by the T · F′ · J vectors of dimension M of this tensor, obtained after applying the squashing operation of Eq. (2).

[Fig. 3. Details of the processing stages that occur after the initial CNN layers. The dimension of the vectors u_i is 1 × 1 × M, the dimension of the vectors v_j is 1 × 1 × G. The decision stage after the Euclidean norm calculation is not shown for simplicity.]

The final layer, or Detection Capsule Layer, is a time-distributed layer composed of K densely connected capsule units with G elements. With "time-distributed" we mean that the same weights are applied for each time index. For each t, thus, the Detection Capsule Layer outputs K vectors v_j composed of G elements. This differs from the architecture proposed in [26], where all the capsule vectors from the Primary Capsule Layer are processed as a whole. Since the previous layer is also a capsule layer, the dynamic routing algorithm is used to compute the output. The background class was included in the set of K target events, in order to represent its instances with a dedicated capsule unit and to train the system to recognize the absence of events. In the evaluation, however, we consider only the outputs relative to the target sound events.

The model predictions are obtained by computing the Euclidean norm of the output of each Detection Capsule. These values represent the probabilities that one of the target events is active in a frame t of the input feature matrix X_{t:t+T−1}, thus we consider them as the network output predictions.

In [23], the authors propose a series of densely connected neuron layers stacked at the bottom of the CapsNet, with the aim of regularizing the weight training by reconstructing the input image. Here, this technique entails an excessive complexity of the model to train, due to the higher number of units needed to reconstruct X_{t:t+T−1} ∈ R^{T×F×C}, and it yielded poor performance in our preliminary experiments. We decided, thus, to use dropout [29] and L2 weight regularization [30] as regularization techniques, as done in [22].
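The sketch below shows how the CNN front-end and the Primary Capsule reshaping of Fig. 2/3 could look in Keras, the library the paper reports using. The hyperparameter values are illustrative defaults, not the ones selected by the random search, and the Detection Capsule layer with dynamic routing (see the NumPy routing sketch above) would follow as a custom layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def squash(s, eps=1e-9):
    # Eq. (2), applied along the capsule dimension (last axis).
    n2 = tf.reduce_sum(tf.square(s), axis=-1, keepdims=True)
    return n2 / (1.0 + n2) * s / tf.sqrt(n2 + eps)

def build_capsnet_frontend(T=256, F=40, C=1, Q=32, M=8, J=4, n_blocks=3):
    """CNN blocks plus Primary Capsule layer; returns capsule vectors u_i (size M)."""
    x_in = layers.Input(shape=(T, F, C))
    x = x_in
    for _ in range(n_blocks):                                  # zero-padded CNN blocks
        x = layers.Conv2D(Q, (4, 4), padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)           # halve the frequency axis only
    x = layers.Conv2D(J * M, (3, 3), padding='same')(x)        # Primary Capsule convolution: J*M filters
    f_prime = x.shape[2]                                       # F' after pooling
    u = layers.Reshape((T, f_prime * J, M))(x)                 # T x F'*J x M capsule tensor
    u = layers.Lambda(squash)(u)                               # Eq. (2) on each M-dim vector
    return tf.keras.Model(x_in, u)

# model = build_capsnet_frontend()
# model.summary()   # output shape: (None, 256, F'*J, M)
```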
III. EXPERIMENTAL SET-UP

In order to evaluate the performance of the proposed method, we performed a series of experiments on three datasets provided to the participants of different editions of the DCASE challenge [25], [31]. We evaluated the results by comparing the system based on the capsule architecture with a traditional CNN. The hyperparameters of each network have been optimized with a random search strategy [32]. Furthermore, we report the baselines provided by the challenge organizers and the best state-of-the-art performance.

A. Dataset

We assessed the proposed method on three datasets: two contain stereo recordings from real-life environments, while the third contains artificially generated monophonic mixtures of isolated sound events and real background audio. In order to evaluate the proposed method in polyphonic real-life conditions, we used the TUT Sound Events 2016 & 2017 datasets, which were included in the corresponding editions of the DCASE challenge. For the monophonic SED case study, we used the TUT Rare Sound Events 2017 dataset, which represents task 2 of the DCASE 2017 challenge.

1) TUT Sound Events 2016: The TUT Sound Events 2016 (TUT-SED 2016) dataset (http://www.cs.tut.fi/sgn/arg/dcase2016/) consists of recordings from two acoustic scenes, respectively "Home" (indoor) and "Residential area" (outdoor), which we considered as two separate subsets. These acoustic scenes were selected by the challenge organizers to represent common environments of interest in applications for safety and surveillance (outside home) and human activity monitoring or home surveillance [31]. The dataset was collected in Finland by the Tampere University of Technology in different locations by means of a binaural recording system. A total amount of around 54 and 59 minutes of audio is provided respectively for the "Home" and "Residential area" scenarios. The sound events present in each recording were manually annotated without any further cross-verification, due to the high level of subjectivity inherent to the problem. For the "Home" scenario a total of 11 classes were defined, while for the "Residential area" scenario 7 classes were annotated.

Each scenario of the TUT-SED 2016 has been divided into two subsets: a Development dataset and an Evaluation dataset. The split was done based on the number of examples available for each sound event class. In addition, for the Development dataset a cross-validation setup is provided in order to easily compare the results of different approaches on this dataset. The setup consists of 4 folds, so that each recording is used exactly once as test data. In more detail, the "Residential area" set consists of 5 recordings in the Evaluation set and 12 recordings in the Development set, while the "Home" set consists of 5 recordings in the Evaluation set and 10 recordings in the Development set, in turn divided into 4 folds as training and validation subsets.

2) TUT Sound Events 2017: The TUT Sound Events 2017 (TUT-SED 2017) dataset (http://www.cs.tut.fi/sgn/arg/dcase2017/) consists of recordings of street acoustic scenes with various levels of traffic and other activities, for a total of 121 minutes of audio. The scene was selected as representing an environment of interest for the detection of sound events related to human activities and hazard situations. It is a subset of the TUT Acoustic Scenes 2016 dataset [31], from which the TUT-SED 2016 dataset was also taken. Thus, the recording setup, the annotation procedure, the dataset splitting and the cross-validation setup are the same as described above. The two datasets also share some audio content, in particular the "Residential area" scenario. The 6 target sound event classes were selected to represent common sounds related to human presence and traffic, and they include brakes squeaking, car, children, large vehicle, people speaking and people walking.
The Evaluation set of the TUT-SED 2017 consists of 29 minutes of audio, whereas the Development set is composed of 92 minutes of audio which are employed in the cross-validation procedure.

3) TUT Rare Sound Events 2017: The TUT Rare Sound Events 2017 (TUT-Rare 2017) dataset [25] consists of isolated sounds of three different target event classes (respectively, baby crying, glass breaking and gunshot) and of 30-second long recordings of everyday acoustic scenes serving as background, such as park, home, street, cafe, train, etc. [31]. In this case we consider a monophonic SED task, since the sound events are artificially mixed with the background sequences without overlap. In addition, the event potentially present in each test file is known a priori, thus it is possible to train different models, each one specialized for a single sound event. In the Development set, we used 750, 750 and 1250 sequences for training respectively the baby cry, glass-break and gunshot models, while we used 100 sequences as validation set and 500 sequences as test set for all of them. In the Evaluation set-up, the training and test sequences of the Development set are combined into a single training set, while the validation set is the same used in the Development dataset. The system is evaluated against an "unseen" set of 1500 samples (500 for each target class) with a sound event presence probability equal to 0.5 for each class.

B. Evaluation Metrics

In this work we used the Error Rate (ER) as the primary evaluation metric, to ensure comparability with the reference systems. In particular, for the evaluations on the TUT-SED 2016 and 2017 datasets we consider a segment-based ER with a one-second segment length, while for the TUT-Rare 2017 the evaluation metric is the event-based error rate calculated using the onset-only condition with a collar of 500 ms. In the segment-based ER, the ground truth and the system output are compared on a fixed time grid, thus sound events are marked as active or inactive in each segment. For the event-based ER, the ground truth and the system output are compared at the event instance level.

The ER score is calculated in a single time segment of one second length from intermediate statistics, i.e., the number of substitutions (S(t1)), insertions (I(t1)), deletions (D(t1)) and active sound events in the annotations (N(t1)) for a segment t1. Specifically:
1) Substitutions S(t1) are the number of ground truth events for which we have a false positive and a false negative in the same segment;
2) Insertions I(t1) are events in the system output that are not present in the ground truth, i.e., the false positives which cannot be counted as substitutions;
3) Deletions D(t1) are events in the ground truth that are not correctly detected by the system, i.e., the false negatives which cannot be counted as substitutions.
These intermediate statistics are accumulated over the segments of the whole test set to compute the evaluation metric ER.
Thus, the total error rate is calculated as:

  ER = ( Σ_{t1=1}^{T} S(t1) + Σ_{t1=1}^{T} I(t1) + Σ_{t1=1}^{T} D(t1) ) / Σ_{t1=1}^{T} N(t1),    (5)

where T is the total number of segments t1. If there are multiple scenes in the dataset, as in the TUT-SED 2016, the evaluation metrics are calculated for each scene separately and the results are then presented as the average across the scenes. A detailed and visualized explanation of the segment-based ER score in the multi-label setting can be found in [33].
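The following is a simplified sketch of the segment-based ER of Eq. (5) for binary activity matrices, assuming a 20 ms hop so that 50 frames form the one-second segments used in the paper. It is not the official sed_eval implementation used by the DCASE challenges; it only illustrates how the intermediate statistics are accumulated.

```python
import numpy as np

def segment_error_rate(y_true, y_pred, frames_per_segment=50):
    """Segment-based ER (Eq. 5) for (frames, K) binary activity matrices."""
    n_segments = len(y_true) // frames_per_segment
    S = I = D = N = 0
    for s in range(n_segments):
        sl = slice(s * frames_per_segment, (s + 1) * frames_per_segment)
        ref = y_true[sl].max(axis=0)            # class active anywhere in the segment
        est = y_pred[sl].max(axis=0)
        fn = np.sum((ref == 1) & (est == 0))    # missed events
        fp = np.sum((ref == 0) & (est == 1))    # extra events
        S += min(fn, fp)                        # substitutions pair a false positive with a false negative
        D += max(0, fn - fp)                    # deletions: remaining false negatives
        I += max(0, fp - fn)                    # insertions: remaining false positives
        N += ref.sum()                          # active events in the reference
    return (S + I + D) / max(N, 1)
```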
C. Comparative Algorithms

Since the datasets we used were employed to develop and evaluate the algorithms proposed by the participants of the DCASE challenges, we can compare our results with the most recent approaches in the state of the art. In addition, each challenge task came along with a baseline method that consists in a basic approach to SED. It represents a reference for the participants of the challenges while they were developing their systems.

1) TUT-SED 2016: The baseline system is based on mel-frequency cepstral coefficient (MFCC) acoustic features and multiple GMM-based classifiers. In detail, for each event class, a binary classifier is trained using the audio segments annotated as belonging to the event class for the positive model, and the rest of the audio for the model which represents the negative class. The decision is based on the likelihood ratio between the positive and negative models for each individual class, with a sliding window of one second. To the best of our knowledge, the best performing method for this dataset is an algorithm we proposed in 2017 [34], based on binaural MFCC features and a multi-layer perceptron (MLP) neural network used as classifier. The detection task is performed by an adaptive-energy voice activity detector (VAD) which precedes the MLP and determines the starting and ending points of an event-active audio sequence.

2) TUT-SED 2017: In this case the baseline method relies on an MLP architecture using 40 LogMels as audio representation [25]. The network is fed with a feature vector comprising a 5-frame temporal context. The neural network is composed of two dense layers of 50 hidden units per layer with 20% dropout, while the network output layer contains K sigmoid units (where K is the number of classes) that can be active at the same time and represent the network prediction of event activity for each context window. The state-of-the-art algorithm is based on the CRNN architecture [35]. The authors compared both monaural and binaural acoustic features, observing that binaural features in general have performance similar to single-channel features on the Development dataset, although the best result on the Evaluation dataset is obtained using monaural LogMels as network inputs. According to the authors, this may suggest that the dataset was not large enough to train the CRNN fed with binaural features.

3) TUT-Rare 2017: The baseline [31] and the state-of-the-art methods of the DCASE 2017 challenge (Rare-SED) were based on architectures very similar to the one employed for the TUT-SED 2016 and described above. For the baseline method, the only difference lies in the output layer, which in this case is composed of a single sigmoid unit. The first classified algorithm [36] takes 128 LogMels as input and processes them frame-wise by means of a CRNN with 1D filters in the first stage.

D. Neural Network Configuration

We performed a hyperparameter search by running a series of experiments over predetermined ranges. We selected the configuration that leads, for each network architecture, to the best results in the cross-validation procedure on the Development dataset of each task, and we used this architecture to compute the results on the corresponding Evaluation dataset. The number and shape of the convolutional layers, the non-linear activation function and the regularizers, in addition to the capsule dimensions and the maximum number of routing iterations, have been varied for a total of 100 configurations. Details of the searched hyperparameters and their ranges are reported in Table I.

TABLE I
Hyperparameters optimized in the random-search phase and the resulting best performing models.

  Parameter                        Range            Distribution
  Batch Normalization              [yes - no]       random choice
  CNN layers Nr.                   [1 - 4]          uniform
  CNN kernels Nr.                  [4 - 64]         log-uniform
  CNN kernels dim.                 [3×3 - 8×8]      uniform
  Pooling dim.                     [1×1 - 2×5]      uniform
  CNN activation                   [tanh - ReLU]    random choice
  CNN dropout                      [0 - 0.5]        uniform
  CNN L2                           [yes - no]       random choice
  Primary Capsules Nr. M           [2 - 8]          uniform
  Primary Capsules kernels dim.    [3×3 - 5×5]      uniform
  Primary Capsules dimension J     [2 - 16]         uniform
  Detection Capsules dimension G   [2 - 16]         uniform
  Capsules dropout                 [0 - 0.5]        uniform
  Routing iterations               [1 - 5]          uniform

The neural network training was accomplished by the AdaDelta stochastic gradient-based optimization algorithm [37] on the margin loss function, for a maximum of 100 epochs and with a batch size equal to 20. The optimizer hyperparameters were set according to [37] (i.e., initial learning rate lr = 1.0, ρ = 0.95, ε = 10⁻⁶). The trainable weights were initialized according to the Glorot-uniform scheme [38] and an early stopping strategy was employed during the training in order to avoid overfitting: if the validation ER did not decrease for 20 consecutive epochs, the training was stopped and the last saved model was selected as the final model. In addition, dropout and L2 weight regularization (with λ = 0.01) have been used as weight regularization techniques [29]. The algorithm has been implemented in the Python language using Keras [39] and Tensorflow [40] as deep learning libraries, while Librosa [41] has been used for feature extraction. The source code is available at https://gitlab.com/a3labShares/capsule-for-sed.

For the CNN models, we performed a similar random hyperparameter search procedure for each dataset, considering only the first two blocks of Table I and replacing the capsule layers with feed-forward layers with sigmoid activation function. On the TUT-SED 2016 and 2017 datasets, the event activity probabilities are simply thresholded at a fixed value equal to 0.5, in order to obtain the binary activity matrix used to compute the reference metric. On the TUT-Rare 2017, the network output signal is post-processed as proposed in [42]: it is convolved with an exponential decay window, then processed with a sliding median filter with a local window size, and finally a threshold is applied.
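As an illustration of the random-search phase over the ranges of Table I, the sketch below draws one configuration per call. The dictionary keys and the interpretation of the 2D ranges as integer pairs are illustrative simplifications, not the exact sampling code used in the experiments.

```python
import math
import random

def sample_config():
    """Draw one configuration from the Table I ranges (illustrative sketch)."""
    return {
        "batch_norm":        random.choice([True, False]),
        "cnn_layers":        random.randint(1, 4),
        "cnn_kernels":       int(round(2 ** random.uniform(math.log2(4), math.log2(64)))),  # log-uniform
        "cnn_kernel_dim":    random.randint(3, 8),                        # k -> k x k kernels
        "pool_dim":          (random.randint(1, 2), random.randint(1, 5)),  # between 1x1 and 2x5
        "cnn_activation":    random.choice(["tanh", "relu"]),
        "cnn_dropout":       random.uniform(0.0, 0.5),
        "cnn_l2":            random.choice([True, False]),
        "primary_caps_M":    random.randint(2, 8),
        "primary_caps_kdim": random.randint(3, 5),                        # k -> k x k kernels
        "caps_dim_J":        random.randint(2, 16),
        "caps_dim_G":        random.randint(2, 16),
        "caps_dropout":      random.uniform(0.0, 0.5),
        "routing_iters":     random.randint(1, 5),
    }

configs = [sample_config() for _ in range(100)]   # 100 random configurations, as in the paper
```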
IV. RESULTS

In this section, we present the results for all the datasets and experiments described in Section III. The evaluation of the Capsule- and CNN-based methods has been conducted on the Development sets of each examined dataset using random combinations of the hyperparameters given in Table I.

A. TUT-SED 2016

Results on the TUT-SED 2016 dataset are shown in Table III, while Table II reports the configurations which yielded the best performance on the Evaluation dataset. All the found models have ReLU as non-linear activation function and use dropout as weight regularization, while batch normalization applied after each convolutional layer seems to be effective only for the CapsNet.

TABLE II
Hyperparameters of the best performing models on the TUT polyphonic SED 2016 & 2017 Evaluation datasets.

                                  TUT-SED 2016 Home         TUT-SED 2016 Residential    TUT-SED 2017 Street
                                  CapsNet     CNN           CapsNet      CNN            CapsNet      CNN
  CNN kernels Nr.                 [32,32,8]   [64,64,16,64] [4,16,32,4]  [64]           [4,16,32,4]  [64,64,16,64]
  CNN kernels dim.                6×6         5×5           4×4          5×5            4×4          5×5
  Pooling dim. (F axis)           [4,3,2]     [2,2,2,2]     [2,2,2,2]    [2]            [2,2,2,2]    [2,2,2,2]
  MLP layers dim.                 -           [85,65]       -            [42,54,66,139] -            [85,65]
  Primary Capsules Nr. M          8           -             7            -              7            -
  Primary Capsules kernels dim.   4×4         -             3×3          -              3×3          -
  Primary Capsules dimension J    9           -             16           -              16           -
  Detection Capsules dimension G  11          -             8            -              8            -
  Routing iterations              3           -             4            -              4            -
  # Params                        267 K       343 K         252 K        709 K          223 K        342 K

TABLE III
Results of the best performing models in terms of ER on the TUT-SED 2016 & 2017 datasets.

  TUT-SED 2016 - Home
                 Development                                     Evaluation
                 LogMels  Bin. LogMels  STFT   Bin. STFT         LogMels  Bin. LogMels  STFT   Bin. STFT
  CNN            11.15    11.58         1.06   1.07              6.80     8.43          0.95   0.92
  CapsNet        0.58     0.59          0.44   0.39              0.74     0.75          0.61   0.69

  TUT-SED 2016 - Residential Area
  CNN            3.24     3.11          0.64   1.10              2.36     2.76          1.00   1.35
  CapsNet        0.36     0.34          0.32   0.32              0.72     0.75          0.78   0.68

  TUT-SED 2016 - Averaged
  CNN            7.20     7.35          0.85   1.09              4.58     5.60          0.98   1.14
  CapsNet        0.47     0.47          0.38   0.36              0.73     0.75          0.70   0.69
  Baseline [31]:            Development 0.91, Evaluation 0.88
  State-of-the-art [34]:    Development 0.78, Evaluation 0.79

  TUT-SED 2017
  CNN            1.56     2.12          0.57   0.60              1.38     1.79          0.67   0.65
  CapsNet        0.45     0.42          0.36   0.36              0.58     0.64          0.61   0.64
  Baseline [25]:            Development 0.69, Evaluation 0.93
  State-of-the-art [35]:    Development 0.52, Evaluation 0.79

Table III reports the results for each combination of architecture and features we evaluated. The use of the STFT as acoustic representation is beneficial for both architectures with respect to the LogMels. In particular, the CapsNet obtains the lowest ER on the cross-validation performed on the Development dataset when it is fed with the binaural version of such features. On the two scenarios of the Evaluation dataset, a model based on CapsNet and binaural STFT obtains an averaged ER equal to 0.69, which improves on both the challenge baseline [31] (by 0.19) and the best score reported in the literature [34] (by 0.10). The comparative method based on CNNs does not seem to fit at all when LogMels are used as input, while its performance is aligned with the challenge baseline based on GMM classifiers when the models are fed with the monaural STFT. This discrepancy can be explained by the enhanced ability of the CapsNet to exploit small training datasets, in particular due to the effect of the routing mechanism on the weight training. In fact, the TUT-SED 2016 dataset is composed of a small amount of audio and the sound events occur sparsely (i.e., only 49 minutes of the total audio contain at least one active event); thus, the overall results of the comparative methods (CNNs, Baseline, and State-of-the-art) on this dataset are quite low compared to the other datasets.

Another CapsNet property worth highlighting is the lower number of free parameters of the models compared to the evaluated CNNs. As shown in Table II, the considered architectures have 267 K and 252 K free parameters respectively for the "Home" and the "Residential area" scenarios. This is a relatively low number of parameters to be trained (e.g., a popular deep architecture for image classification such as AlexNet [43] is composed of 60 M parameters), and the best performing CapsNets of each considered scenario have even fewer parameters with respect to the CNNs (-22% and -64% respectively for the "Home" and the "Residential area" scenarios). Thus, the high performance of the CapsNet can be explained by the architectural advantage rather than by model complexity. In addition, there can be a significant performance shift for the same type of network with the same number of parameters, which means that a suitable hyperparameter search (e.g., over the number of filters in the convolutional layers or the dimension of the capsule units) is crucial in finding the best performing network structure.

1) Closer Look at Network Outputs: A comparative example of the neural network outputs, which are regarded as event activity probabilities, is presented in Fig. 4. The monaural STFT of a 40-second sequence of the "Residential area" dataset is shown along with the event annotations and the network outputs of the best performing CapsNet and CNN models. For this example, we chose the monaural STFT as input feature because it generally yields the best results over all the considered datasets.

[Fig. 4. STFT spectrogram of the input sequence (a), ground truth (b) and event activity probabilities for the CapsNet (c) and the CNN (d) for a sequence of test examples from the TUT-SED 2016 dataset.]
Fig. 4 shows a bird singing event lasting for the whole sequence and correctly detected by both architectures. When the car passing by event overlaps the bird singing, the CapsNet detects its presence more clearly. The people speaking event is only slightly detected by both models, while the object banging activates the corresponding capsule exactly in correspondence of the event annotation. It must be noted that the dataset is composed of unverified, manually labelled real-life recordings, which may present a degree of subjectivity that affects the training. Nevertheless, the CapsNet exhibits remarkable detection capability especially in the condition of overlapping events, while the CNN outputs are definitely more "blurred" and the people walking event is wrongly detected in this sequence.

B. TUT-SED 2017

The bottom of Table III reports the results obtained on the TUT-SED 2017 dataset. As for the TUT-SED 2016, the best performing models on the Development dataset are those fed with the binaural STFT of the input signal. In this case, we can also observe the interesting performance obtained by the CNNs, which on the Evaluation dataset obtain a lower ER (i.e., equal to 0.65) with respect to the state-of-the-art algorithm [35], based on CRNNs. The CapsNet confirms its effectiveness and obtains the lowest ER, equal to 0.58, with LogMel features, although with a slight margin with respect to the other inputs (i.e., -0.03 compared to the STFT features and -0.06 compared to both the binaural version of the LogMels and the STFT spectrograms). It is worth highlighting that in the Development cross-validation the CapsNet models yielded significantly better performance with respect to the other reported approaches, while the CNNs have decidedly worse performance.
On the Evaluation dataset, however, the ER scores of the CapsNets suffer a larger relative deterioration with respect to the CNN ones. This is related to the fact that the CapsNets are subject to larger random fluctuations of the ER from epoch to epoch. In the absence of ground truth labels and, thus, of the early stopping strategy, the model taken after a fixed number of training epochs is sub-optimal and, with the CapsNet, more prone to large errors than with the CNN. Notwithstanding this weakness, the absolute performance obtained both with monaural and binaural spectral features is consistent and improves the state-of-the-art result, with a reduction of the ER of up to 0.21 in the best case.

This is particularly evident in Fig. 5, which shows the output of the two best performing systems for a sequence of approximately 20 seconds which contains highly overlapping sounds. The event classes "people walking" and "large vehicle" overlap for almost the whole sequence duration and they are well detected by the CapsNet, although they are of different nature: the "large vehicle" has a typical timbre and is almost stationary, while the "people walking" class comprises impulsive and desultory sounds. The CNN does not seem to be able to distinguish between the "large vehicle" and the "car" classes, detecting confidently only the latter, while the activation corresponding to the "people walking" class is modest. The presence of the "brakes squeaking" class, which has a specific spectral profile mostly located in the highest frequency bands (as shown in the spectrogram), is detected only by the CapsNet. We can take this as a concrete experimental validation of the effectiveness of the routing.

[Fig. 5. STFT spectrogram of the input sequence (a), ground truth (b) and event activity probabilities for the CapsNet (c) and the CNN (d) for a sequence of test examples from the TUT-SED 2017 dataset.]

The number of free parameters amounts to 223 K for the best configuration shown in Table II and is similar to those found for the TUT-SED 2016, which also in this case corresponds to a reduction of 35% with respect to the best CNN layout.

C. TUT-Rare SED 2017

The advantage provided by the routing procedure to the CapsNet is particularly effective in the case of polyphonic SED. The results on the monophonic SED task have been obtained by using the TUT-Rare SED 2017 dataset and are shown in Table V. In this case, the evaluation metric is the event-based ER calculated using the onset-only condition. We performed a separate random search for each of the three sound event classes, both for the CapsNets and the CNNs, and we report the score averaged over the three classes. The setups that obtained the best performance on the Evaluation dataset are shown in Table IV.
This is the largest dataset we evaluated, and its main characteristic is the high unbalance between the amount of background sound and the target sound events. From the analysis of the results of the individual classes on the Evaluation set (not included here for the sake of conciseness), we notice that both architectures achieve the best performance on the glass break class (0.25 and 0.24 respectively for the CNN and the CapsNet with LogMel features), due to its clear spectral fingerprint compared to the background sound. The worst performing class is the gunshot (ER equal to 0.58 for the CapsNet), although the noise produced by different instances of this class involves similar spectral components. The low performance is probably due to the fast decay of this sound, which means that in this case the routing procedure is not sufficient to avoid confusing the gunshot with other background noises, especially in the case of dataset unbalancing and low event-to-background ratio. A solution to this issue could be the combination of the CapsNet with RNN units, as proposed in [19] for the CNNs, which yields an efficient modelling of the gunshot by the CRNN and improves the detection abilities even in polyphonic conditions. The baby cry class consists of short, harmonic sounds, and it is detected with almost the same accuracy by the two architectures. Finally, the CNN shows better generalization performance with respect to the CapsNet, although its ER score is far from the state of the art obtained with the aforementioned CRNNs [36] or with a hierarchical framework [42]. In addition, in this case the CNN models have a reduced number of trainable parameters (-36%) compared to the CapsNets, except for the "gunshot" case which, as mentioned, is also the configuration that obtains the worst results.

TABLE IV
Hyperparameters of the best performing models on the TUT-Rare 2017 monophonic SED Evaluation datasets.

                                  Baby cry                  Glass break               Gunshot
                                  CapsNet     CNN           CapsNet     CNN           CapsNet   CNN
  CNN kernels Nr.                 [16,64,32]  [16,32,8,16]  [16,64,32]  [16,32,8,16]  [16,16]   [16,64,32,32]
  CNN kernels dim.                6×6         8×8           6×6         8×8           8×8       7×7
  Pooling dim. (F axis)           [4,3,2]     [3,3,2,2]     [4,3,2]     [3,3,2,2]     [5,2]     [5,4,2,1]
  MLP layers dim.                 -           [212,67]      -           [212,67]      -         [112,51]
  Primary Capsules Nr. M          7           -             7           -             8         -
  Primary Capsules kernels dim.   3×3         -             3×3         -             3×3       -
  Primary Capsules dimension J    8           -             8           -             8         -
  Detection Capsules dimension G  14          -             14          -             6         -
  Routing iterations              5           -             5           -             1         -
  # Params                        131 K       84 K          131 K       84 K          30 K      211 K

TABLE V
Results of the best performing models in terms of ER on the TUT-Rare SED 2017 dataset.

                           Development            Evaluation
                           LogMels    STFT        LogMels    STFT
  CNN                      0.29       0.21        0.41       0.46
  CapsNet                  0.17       0.20        0.45       0.54
  Baseline [31]            0.53       -           0.64       -
  Hierarchic CNNs [42]     0.13       -           0.22       -
  State-of-the-art [36]    0.07       -           0.13       -

D. Alternative Dynamic Routing for SED

We observed that the original routing procedure implies the initialization of the coefficients β_ij to zero each time the procedure is restarted, i.e., after each input sample has been processed. This is reasonable in the case of image classification, for which the CapsNet was originally proposed. In the case of an audio task, we clearly expect a higher correlation between samples belonging to adjacent temporal frames X. We thus investigated the possibility of initializing the coefficients β_ij to zero only at the very first iteration, while for the subsequent X assigning them the last values they had at the end of the previous iterative procedure. We experimented with this variant considering the best performing models of the analyzed polyphonic SED scenarios, taking into account only the systems fed with the monaural STFT. As shown in Table VI, the modification we propose to the routing procedure is effective in particular on the Evaluation datasets, conferring improved generalization properties to the models we tested even without a dedicated hyperparameter optimization.

TABLE VI
Results of the tests performed with our proposed variant of the routing procedure.

  TUT-SED 2016 - Home        Development         Evaluation
  CapsNet                    0.44                0.61
  CapsNet - NR               0.41 (-6.8%)        0.58 (-4.9%)

  TUT-SED 2016 - Residential
  CapsNet                    0.32                0.78
  CapsNet - NR               0.31 (-3.1%)        0.72 (-7.7%)

  TUT-SED 2016 - Average
  CapsNet                    0.38                0.70
  CapsNet - NR               0.36 (-5.3%)        0.65 (-7.1%)

  TUT-SED 2017 - Street
  CapsNet                    0.36                0.61
  CapsNet - NR               0.36 (0.0%)         0.58 (-4.9%)
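The sketch below illustrates the routing variant of Section IV-D in NumPy: the routing logits β_ij are zeroed only once and are then carried over from one input sequence to the next. Shapes follow the earlier routing sketch; this is an illustration of the idea, not the exact implementation used in the experiments.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def routing_with_memory(u_hat, beta=None, r=3):
    """Routing variant: β_ij persist across adjacent input sequences.

    u_hat: prediction vectors û_{j|i}, shape (I, J, G).
    beta:  logits returned by the previous call, or None for the very first sequence.
    """
    I, J, _ = u_hat.shape
    if beta is None:                                  # zero initialization only once
        beta = np.zeros((I, J))
    for _ in range(r):
        alpha = np.exp(beta) / np.exp(beta).sum(axis=1, keepdims=True)
        v = squash(np.einsum('ij,ijg->jg', alpha, u_hat))
        beta = beta + np.einsum('ijg,jg->ij', u_hat, v)
    return v, beta                                    # beta is reused for the next sequence

# usage: carry beta across consecutive context windows
# beta = None
# for u_hat in sequence_of_prediction_tensors:
#     v, beta = routing_with_memory(u_hat, beta)
```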
V. CONCLUSION

In this work, we proposed to apply a novel neural network architecture, the CapsNet, to the polyphonic SED task. The architecture is based on both convolutional and capsule layers. The convolutional layers extract high-level time-frequency feature maps from input matrices which provide an acoustic spectral representation with a long temporal context. The obtained feature maps are then used as input to the Primary Capsule layer, which is connected to the Detection Capsule layer that extracts the event activity probabilities. These last two layers are involved in the iterative routing-by-agreement procedure, which computes the outputs based on a measure of likelihood between a capsule and its parent capsules. This architecture thus combines the ability of convolutional layers to learn local translation-invariant filters with the ability of capsules to learn part-whole relations by using the routing procedure.

Part of the novelty of this work resides in the adaptation of the CapsNet architecture to the audio event detection task, with special care on the input data, the layer interconnections and the regularization techniques. The routing procedure has also been modified to account for an assumed temporal correlation within the data, with a further average performance improvement of 6% among the polyphonic SED tasks. An extensive evaluation of the algorithm is proposed, with comparison to recent state-of-the-art methods on three different datasets. The experimental results demonstrate that the use of the dynamic routing procedure is effective and provides significant performance improvements in the case of overlapping sound events compared to traditional CNNs and other established methods in polyphonic SED. Interestingly, the CNN-based method obtained the best performance in the monophonic SED case study, thus emphasizing the suitability of the CapsNet architecture for dealing with overlapping sounds. We showed that this model is particularly effective with small-sized datasets, such as the TUT-SED 2016, which contains a total of 78 minutes of audio for the development of the models, of which one third is background noise. Furthermore, the number of trainable parameters of the network is reduced with respect to other deep learning architectures, confirming the architectural advantage given by the introduced features also in the task of polyphonic SED.

The results we observed in this work are consistent with many other classification tasks in various domains [44]–[46], and they prove that the CapsNet is an effective approach which enhances the well-established representation capabilities of CNNs also in the audio field. However, several aspects still remain unexplored and require further studies. The robustness of CapsNets to overlapping signals (i.e., images or sounds) has been demonstrated in this work as well as in [23]. In [23], the authors also demonstrated the capability of CapsNets to be invariant to affine transformations of images, such as rotations. In the audio case study, this characteristic could be exploited to obtain invariance with respect to the source position by using a space-time representation of multi-channel audio signals. Moreover, regularization methods can be investigated to overcome the lack of generalization which seems to affect the CapsNets. Furthermore, regarding the SED task, the addition of recurrent units may be explored to enhance the detection of particular (i.e., impulsive) sound events in real-life audio, and the recently proposed variant of routing based on the Expectation-Maximization (EM) algorithm [47] can be investigated in this context.
ACKNOWLEDGEMENT

This research has been partly supported by the Italian University and Research Consortium CINECA. We acknowledge them for the availability of high-performance computing resources and support.

REFERENCES

[1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational analysis of sound scenes and events. Springer, 2018.
[2] M. Crocco, M. Cristani, A. Trucco, and V. Murino, "Audio surveillance: a systematic review," ACM Computing Surveys (CSUR), vol. 48, no. 4, p. 52, 2016.
[3] Y.-T. Peng, C.-Y. Lin, M.-T. Sun, and K.-C. Tsai, "Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models," in Proc. of ICME, 2009.
[4] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, "Reliable detection of audio events in highly noisy environments," Pattern Recognition Letters, vol. 65, pp. 22–28, 2015.
[5] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
[6] T. Grill and J. Schlüter, "Two convolutional neural networks for bird detection in audio signals," in Proc. of EUSIPCO. IEEE, Aug 2017, pp. 1764–1768.
[7] D. Stowell and D. Clayton, "Acoustic event detection for multiple overlapping similar sources," in Proc. of WASPAA. IEEE, 2015, pp. 1–5.
[8] N. Degara, M. E. Davies, A. Pena, and M. D. Plumbley, "Onset event decoding exploiting the rhythmic structure of polyphonic music," IEEE J. Sel. T. in Signal Proc., vol. 5, no. 6, pp. 1228–1239, 2011.
[9] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Audio context recognition using audio event histograms," in Proc. of EUSIPCO, 2010, pp. 1272–1276.
[10] J. J. Carabias-Orti, T. Virtanen, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "Musical instrument sound multi-excitation model for non-negative spectrogram factorization," IEEE J. Sel. T. in Signal Proc., vol. 5, no. 6, pp. 1144–1158, 2011.
[11] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 209–215, 2003.
[12] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, "Robust sound event classification using deep neural networks," IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 3, pp. 540–552, 2015.
[13] A.-r. Mohamed, G. E. Dahl, G. Hinton, et al., "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 14–22, 2012.
[14] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Proc. of MLSP. IEEE, 2015, pp. 1–6.
[15] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of ICASSP. IEEE, 2013, pp. 6645–6649.
[16] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network,” in Proc. of ICASSP. IEEE, 2016, pp. 5200–5204.
[17] B. Wu, K. Li, F. Ge, Z. Huang, M. Yang, S. M. Siniscalchi, and C.-H. Lee, “An end-to-end deep learning approach to simultaneous speech dereverberation and acoustic modeling for robust speech recognition,” IEEE J. Sel. T. in Signal Proc., vol. 11, no. 8, pp. 1289–1300, 2017.
[18] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, “Exploiting spectro-temporal locality in deep learning based acoustic event detection,” EURASIP J. on Audio, Speech, and Music Process., vol. 2015, no. 1, p. 26, Sep. 2015.
[19] E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE Trans. Audio, Speech, Language Process., vol. 25, no. 6, pp. 1291–1303, 2017.
[20] T. Virtanen, A. Mesaros, T. Heittola, A. Diment, E. Vincent, E. Benetos, and B. Elizalde, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017). Tampere University of Technology, Laboratory of Signal Processing, Nov. 2017.
[21] F. Vesperini, L. Gabrielli, E. Principi, and S. Squartini, “A capsule neural networks based approach for bird audio detection,” DCASE2018 Challenge, Tech. Rep., Sep. 2018.
[22] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, “Capsule routing for sound event detection,” in Proc. of the European Signal Processing Conference, Rome, Italy, Sep. 3-7 2018, pp. 2255–2259.
[23] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
[24] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. of ICANN. Springer, 2011, pp. 44–51.
[25] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proc. of DCASE, 2017.
[26] J. Bae and D.-S. Kim, “End-to-end speech command recognition with capsule network,” in Proc. of Interspeech, Hyderabad, India, Sep. 2-6 2018, pp. 776–780.
[27] M. Slaney, “Auditory toolbox,” Interval Research Corporation, Tech. Rep., vol. 10, 1998.
[28] D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for object recognition,” in Proc. of ICANN. Berlin, Heidelberg: Springer, 2010, pp. 92–101.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[30] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[31] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in Proc. of EUSIPCO, Aug. 2016, pp. 1128–1132.
[32] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[33] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[34] M. Valenti, D. Tonelli, F. Vesperini, E. Principi, and S. Squartini, “A neural network approach for sound event detection in real life audio,” in Proc. of EUSIPCO. IEEE, 2017, pp. 2754–2758.
[35] S. Adavanne and T. Virtanen, “A report on sound event detection with different binaural features,” arXiv preprint arXiv:1710.02997, 2017.
[36] H. Lim, J. Park, and Y. Han, “Rare sound event detection using 1D convolutional recurrent neural networks,” in Proc. of DCASE, 2017, pp. 80–84.
[37] M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[38] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. of AISTATS, 2010, pp. 249–256.
[39] F. Chollet et al., “Keras,” https://github.com/keras-team/keras, 2015.
[40] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” pp. 265–283, 2016. [Online]. Available: https://www.tensorflow.org/
[41] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proc. of SciPy, 2015, pp. 18–25.
[42] F. Vesperini, D. Droghini, E. Principi, L. Gabrielli, and S. Squartini, “Hierarchic ConvNets framework for rare sound event detection,” in Proc. of EUSIPCO. IEEE, Sep. 3-7 2018.
[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[44] F. Deng, S. Pu, X. Chen, Y. Shi, T. Yuan, and S. Pu, “Hyperspectral image classification with capsule network using limited training samples,” Sensors, vol. 18, no. 9, 2018.
[45] Y. Shen and M. Gao, “Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization,” in Machine Learning in Medical Imaging, Y. Shi, H.-I. Suk, and M. Liu, Eds. Springer International Publishing, 2018, pp. 389–397.
[46] M. A. Jalal, R. Chen, R. K. Moore, and L. Mihaylova, “American sign language posture understanding with deep neural networks,” in Proc. of FUSION. IEEE, 2018, pp. 573–579.
[47] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with EM routing,” in Proc. of ICLR, Vancouver, BC, 2018.

Fabio Vesperini was born in San Benedetto del Tronto, Italy, in May 1989. He received the M.Sc. degree (cum laude) in electronic engineering from Università Politecnica delle Marche (UnivPM) in 2015. In 2014 he spent seven months as a visiting student at the Technische Universität München, where he carried out his master's thesis project on acoustic novelty detection. He is currently a Ph.D. student at the Department of Information Engineering of UnivPM. His research interests are in the fields of digital signal processing and machine learning for intelligent audio analysis.

Leonardo Gabrielli received the M.Sc. and Ph.D. degrees in electronics engineering from Università Politecnica delle Marche, Italy, in 2011 and 2015, respectively. His main research topics are related to audio signal processing and machine learning, with application to sound synthesis, Computational Sound Design, Networked Music Performance, Music Information Retrieval, and audio classification. He is a co-founder of DowSee srl and holds several industrial patents. He is a co-author of more than 30 scientific papers.
Emanuele Principi was born in Senigallia (Ancona), Italy, in January 1978. He received the M.S. degree in electronic engineering (with honors) from Università Politecnica delle Marche (Italy) in 2004, and the Ph.D. degree from the same university in 2009 under the supervision of Prof. Francesco Piazza. In November 2006 he joined the 3MediaLabs research group coordinated by Prof. Francesco Piazza at Università Politecnica delle Marche, where he collaborated on several regional and European projects on audio signal processing. Dr. Principi is author and co-author of several international scientific peer-reviewed articles in the areas of speech enhancement for robust speech and speaker recognition and intelligent audio analysis. He is a member of the IEEE CIS Task Force on Computational Audio Processing and a reviewer for several international journals. His current research interests are in the area of machine learning and digital signal processing for the smart grid (energy task scheduling, non-intrusive load monitoring, computational intelligence for vehicle-to-grid) and intelligent audio analysis (multi-room voice activity detection and speaker localization, acoustic event detection, fall detection).

Stefano Squartini (IEEE Senior Member, IEEE CIS Member) was born in Ancona, Italy, in March 1976. He received the Italian Laurea degree (with honors) in electronic engineering from the University of Ancona (now Polytechnic University of Marche, UnivPM), Italy, in 2002, and the Ph.D. degree from the same university in November 2005. He also worked as a post-doctoral researcher at UnivPM from June 2006 to November 2007, when he joined the Department of Information Engineering (DII) as Assistant Professor in Circuit Theory. He has been Associate Professor at UnivPM since November 2014. His current research interests are in the area of computational intelligence and digital signal processing, with special focus on speech/audio/music processing and energy management. He is author and co-author of more than 190 international scientific peer-reviewed articles. He is Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, and the IEEE Transactions on Emerging Topics in Computational Intelligence, and is also a member of the editorial boards of Cognitive Computation, Big Data Analytics, and Artificial Intelligence Review. He has joined the organizing and technical program committees of more than 70 international conferences and workshops in the recent past. He is the Organizing Chair of the IEEE CIS Task Force on Computational Audio Processing.