Computational Audio Processing
Recent papers in Computational Audio Processing
Source localization algorithms play a fundamental role in systems that capture human speech. In this paper, a Speaker Localization (SLOC) algorithm based on Deep Neural Networks (DNN) is evaluated and compared with state-of-the-art approaches. The speaker position in the room under analysis is determined directly by the DNN, making the proposed algorithm fully data-driven. Two neural network architectures are investigated: the Multi-Layer Perceptron (MLP) and the Convolutional Neural Network (CNN). GCC-PHAT (Generalized Cross Correlation-PHAse Transform) Patterns, computed from the audio signals captured by the microphones, are used as input features for the DNN. In particular, a multi-room case study is addressed, in which the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested on the home-recorded DIRHA dataset, which provides multiple wall and ceiling microphone signals for each room. Specifically, the focus is on the speaker localization task in two distinct neighboring rooms. For comparison, two algorithms proposed in the literature for the same application context are evaluated: Cross-power Spectrum Phase Speaker Localization (CSP-SLOC) and Steered Response Power using the Phase Transform Speaker Localization (SRP-SLOC). Besides providing an extensive analysis of the proposed method, the article shows that the DNN-based algorithm significantly outperforms the state-of-the-art approaches evaluated on the DIRHA dataset, with an average localization error, expressed as Root Mean Square Error (RMSE), of 324 mm and 367 mm for the Simulated and Real subsets, respectively.
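As a rough illustration of the GCC-PHAT features mentioned above, the following sketch computes the PHAT-weighted cross-correlation of two microphone signals and reads the delay from its peak. It is not the authors' feature extractor: the function names, the `max_lag` parameter and the toy signals are assumptions, and a naive DFT is used only to keep the example dependency-free (an FFT routine would normally be used).

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse when requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, max_lag):
    """Return (lag, pattern) with pattern covering lags -max_lag..max_lag.

    A positive lag means `sig` is a delayed copy of `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad to avoid circular wrap-around
    s = dft(list(sig) + [0.0] * (n - len(sig)))
    r = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [a * b.conjugate() for a, b in zip(s, r)]
    # PHAT weighting: keep only the phase of the cross-power spectrum.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in dft(whitened, inverse=True)]
    # Reorder so that index 0 corresponds to lag -max_lag.
    pattern = cc[-max_lag:] + cc[:max_lag + 1]
    lag = max(range(len(pattern)), key=pattern.__getitem__) - max_lag
    return lag, pattern


# Toy signals (assumed): `sig` is `ref` delayed by 3 samples.
rng = random.Random(0)
ref = [rng.gauss(0.0, 1.0) for _ in range(64)]
sig = [0.0] * 3 + ref
lag, pattern = gcc_phat(sig, ref, max_lag=8)
```

In this sketch the `pattern` vector for one microphone pair plays the role of a single GCC-PHAT Pattern; how patterns from several pairs are combined before being fed to the network is not specified here.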
This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Interesting advancements are observed with respect to a previous work. An extensive comparative analysis is carried out among four different neural networks (NN): the Deep Belief Network (DBN), the Multi-Layer Perceptron (MLP), the Bidirectional Long Short-Term Memory recurrent neural network (BLSTM) and the Convolutional Neural Network (CNN). The latter has recently met with great success in the computational audio processing field and has been successfully employed in our task. Two home-recorded datasets are used in order to approximate real-life scenarios. They contain audio files from several microphones arranged in various rooms, from which six features are extracted and used as input for the deep neural classifiers. The output stage has been redesigned with respect to the authors' previous contribution, in order to take advantage of the networks' discriminative ability. Our study consists of a multi-stage analysis focusing on the selection of the features, the network size and the input microphones. Results are evaluated in terms of Speech Activity Detection error rate (SAD). As a result, best SAD values of 5.8% and 2.6% are reached on the two considered datasets, respectively. In addition, significant robustness with respect to microphone positioning is observed in the case of the CNN.
In sound reproduction systems, the audio crossover plays a fundamental role. Nowadays, digital crossovers based on IIR filters are commonly employed, and their non-linear phase is a relevant issue. For this reason, solutions aimed at IIR filters that approximate a linear-phase behavior have recently been proposed. One of the latest exploits Fractional Derivative theory and uses Evolutionary Algorithms to explore the solution space for the IIR filter design: the IIR filter phase error is minimized to achieve a quasi-linear phase response. Nonetheless, this approach is not suitable for crossover design, since the transition-band behavior of the single filter is not predictable. This led the authors to propose a modified design technique that includes suitable constraints, such as the cutoff frequency of the amplitude response, in the ad-hoc Particle Swarm Optimization algorithm that explores the space of IIR filter solutions. Simulations show not only that better-performing filters can be obtained, but also that crossovers with a fully flat response can be achieved.
In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or state-space models. Only recent studies introduced (pseudo-)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short-term frame are predicted from the previous frames by means of Long Short-Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as an activation signal to detect novel events. There is no evidence of studies that compare previous efforts to automatically recognize novel events from audio signals and that give a broad and in-depth evaluation of recurrent neural network-based autoencoders. The present contribution aims to consistently evaluate our recent novel approaches in order to fill this gap in the literature and to provide insight through extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and state-of-the-art methods, the article shows that RNN-based autoencoders outperform statistical approaches by up to an absolute improvement of 16.4% in average F-measure over the three databases.
Novelty detection is the task of recognising events that differ from a model of normality. This paper proposes an acoustic novelty detector based on neural networks trained with an adversarial training strategy. The proposed approach is composed of a feature extraction stage that calculates Log-Mel spectral features from the input signal. Then, an autoencoder network, trained on a corpus of "normal" acoustic signals, is employed to detect whether or not a segment contains an abnormal event. A novelty is detected if the Euclidean distance between the input and the output of the autoencoder exceeds a certain threshold. The innovative contribution of the proposed approach resides in the training procedure of the autoencoder network: instead of using the conventional training procedure that minimises only the Mean Squared Error loss function, here we adopt an adversarial strategy, where a discriminator network is trained to distinguish between the output of the autoencoder and data sampled from the training corpus. The autoencoder is then also trained using the binary cross-entropy loss calculated at the output of the discriminator network. The performance of the algorithm has been assessed on a corpus derived from the PASCAL CHiME dataset. The results showed that the proposed approach provides a relative performance improvement equal to 0.26% compared to the standard autoencoder. The significance of the improvement has been evaluated with a one-tailed z-test and was found to be significant with p < 0.001. The presented approach thus showed promising results on this task, and it could be extended as a general training strategy for autoencoders if confirmed by additional experiments.
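The detection rule described above (flag a segment when the Euclidean distance between the autoencoder input and output exceeds a threshold) can be sketched as follows. This is only an illustration: `toy_reconstruct` is a hypothetical stand-in for a trained autoencoder, and the feature values and threshold are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_novel(features, reconstruct, threshold):
    """Flag a segment as novel when its reconstruction error is too large."""
    return euclidean(features, reconstruct(features)) > threshold


def toy_reconstruct(features, low=0.0, high=1.0):
    # Hypothetical stand-in for a trained autoencoder: it can only
    # reproduce values inside the range seen in "normal" data.
    return [min(max(v, low), high) for v in features]


normal_segment = [0.2, 0.8, 0.5]
abnormal_segment = [0.2, 3.0, 0.5]
normal_flag = is_novel(normal_segment, toy_reconstruct, threshold=0.5)
abnormal_flag = is_novel(abnormal_segment, toy_reconstruct, threshold=0.5)
```

In this toy setting the normal segment is reconstructed exactly (distance 0), while the abnormal one is clipped and yields a distance of 2.0, which exceeds the assumed threshold.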
The task of Speaker LOCalization (SLOC) has been the focus of numerous works in which SLOC is performed on pure speech data, which requires an Oracle Voice Activity Detection (VAD) algorithm. Nevertheless, this ideal working condition is not met in a real-world scenario, where the employed VADs do commit errors. This work addresses the issue with an extensive analysis of the relationship between several data-driven VAD and SLOC models, and finally proposes a reliable framework for VAD and SLOC. The effectiveness of the approach discussed here is assessed in a multi-room scenario, which is close to a real-world environment. Furthermore, to the best of the authors' knowledge, only one contribution proposes a single framework for VAD and SLOC in this scenario; however, that solution does not rely on data-driven approaches. This work extends the authors' previous research on the VAD and SLOC tasks by proposing numerous advancements to the original neural network architectures. In detail, four different models based on convolutional neural networks (CNNs) are tested for VAD, in order to clearly highlight the advantages of the introduced novelties. In addition, two different CNN models are studied for SLOC. Furthermore, the training of the data-driven models is improved through a specific data augmentation technique. During this procedure, the room impulse responses (RIRs) of two virtual rooms are generated from knowledge of the room size, the reverberation time, and the placement of microphones and sources. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is taken into account for a fair comparison with the proposed method. As a result, the proposed method is more accurate than the baseline framework, and remarkable improvements are especially observed when the data aug-*
In the past years, several hybridization techniques have been proposed to synthesize novel audio content that derives its properties from two audio sources. These algorithms, however, usually provide no feature learning, leaving the user, often intentionally, to explore parameters by trial and error. The introduction of machine learning algorithms in the music processing field calls for an investigation into how their properties, such as the ability to learn semantically meaningful features, can be exploited. In this first work, we adopt a Neural Network Autoencoder architecture and enhance it to exploit temporal dependencies. In our experiments, the architecture was able to modify the original timbre so that it resembled what it had learned during the training phase, while preserving the pitch envelope of the input.
Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, deep learning offers valuable techniques for this goal, such as convolutional neural networks (CNNs). The capsule neural network (CapsNet) architecture has recently been introduced in the image processing field with the intent of overcoming some of the known limitations of CNNs, specifically their scarce robustness to affine transformations (i.e., perspective, size, orientation) and to the detection of overlapped images. This motivated the authors to employ CapsNets for the polyphonic SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through so-called dynamic routing, which encourages learning part-whole relationships and improves the detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to the state-of-the-art algorithms.
This paper presents a novel application of convolutional neural networks (CNNs) to the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems that competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute improvements of 6.4% and 9% with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches an accuracy of 77.0%, improving on the challenge winner's score by 1%.
Cry detection is an important facility in both residential and public environments, and it can meet different needs of both private and professional users. In this paper, we investigate the problem of cry detection in professional environments, such as Neonatal Intensive Care Units (NICUs). The aim of our work is to propose a cry detection method based on deep neural networks (DNNs) and also to evaluate whether a properly designed synthetic dataset can replace data acquired in the field for training the DNN-based cry detector. In this way, a massive data collection campaign in NICUs can be avoided, and the cry detector can be easily retargeted to different NICUs. The paper presents different solutions based on single-channel and multi-channel DNNs. The experimental evaluation is conducted on the synthetic dataset, created by simulating the acoustic scene of a real NICU, and on a real dataset containing audio acquired in the same NICU. The evaluation revealed that using real data in the training phase achieves the highest overall performance, with an Area Under Precision-Recall Curve (PRC-AUC) equal to 87.28%, when signals are processed with a beamformer and a post-filter and a single-channel DNN is used. The performance of the same method, however, drops to 70.61% when training is performed on the synthetic dataset. By contrast, under the same conditions, the new single-channel architecture introduced in this paper achieves the highest performance, with a PRC-AUC equal to 80.48%, thus showing that the acoustic scene simulation strategy can be used to train a cry detection method with positive results. Index terms: infant cry detection, deep neural networks, neonatal intensive care unit, data augmentation, acoustic scene simulation, computational audio processing.
In this paper, we propose a system for rare sound event detection using a hierarchical and multi-scaled approach based on Convolutional Neural Networks (CNN). The task consists in detecting event onsets in artificially generated mixtures. Spectral features are extracted from frames of the acoustic signals; then a first event detection stage operates as a binary classifier at frame rate and passes to the second stage contiguous blocks of frames that are assumed to contain a sound event. The second stage refines the event detection of the first network, discarding blocks that contain background sounds wrongly classified by the first stage. Finally, the effective onset time of the active event is obtained. The performance of the algorithm has been assessed with the material provided for the second task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The overall error rate and F-measure achieved on the evaluation dataset, equal to 0.22 and 88.50% respectively, significantly outperform the challenge baseline, and the system guarantees improved generalization performance with a reduced number of free network parameters with respect to other competitive algorithms.
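The hand-off between the two stages described above can be pictured with a small sketch: frame-level binary decisions are grouped into contiguous blocks, a second-stage check discards some of them, and the onset is read from the first frame of a surviving block. The function names, the `is_event` predicate (a placeholder for the second-stage network) and the 20 ms hop are assumptions made for illustration.

```python
def frames_to_blocks(active):
    """Group consecutive active frames into (start, end) pairs, end exclusive."""
    blocks, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(active)))
    return blocks


def refine(blocks, is_event):
    """Keep only the blocks accepted by the second-stage check."""
    return [b for b in blocks if is_event(b)]


def onset_seconds(block, hop_seconds):
    """Onset time of a block, given the hop between frames in seconds."""
    return block[0] * hop_seconds


# First-stage output (assumed): one binary decision per frame.
decisions = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
candidates = frames_to_blocks(decisions)
# Placeholder second stage: reject blocks shorter than two frames.
events = refine(candidates, is_event=lambda b: b[1] - b[0] >= 2)
onset = onset_seconds(events[0], hop_seconds=0.02)
```

Here the isolated single-frame detection at index 6 is treated as a first-stage false alarm and dropped, which mirrors the role of the second stage without reproducing the actual network.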
The amount of time an infant cries in a day helps the medical staff evaluate his/her health conditions. Extracting this information requires a cry detection algorithm able to operate in environments with challenging acoustic conditions, since multiple noise sources, such as interfering cries, medical equipment, and people, may be present. This paper proposes an algorithm for detecting infant cries in such environments. The proposed solution is a multiple-stage detection algorithm: the first stage is composed of an eight-channel filter-and-sum beamformer, followed by an Optimally Modified Log-Spectral Amplitude estimator (OMLSA) post-filter for reducing the effect of interferences. The second stage is the Deep Neural Network (DNN) based cry detector, which takes audio Log-Mel features as inputs. A synthetic dataset mimicking a real neonatal hospital scenario has been created for training the network and evaluating the performance. Additionally, a dataset containing cries acquired in a real neonatology department has been used for assessing the performance in a real scenario. The algorithm has been compared to a popular approach for voice activity detection based on Long-Term Spectral Divergence, and the results show that the proposed solution achieves superior detection performance on both synthetic and real data.
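To give an idea of what the beamforming front end does, the sketch below implements delay-and-sum, the simplest special case of a filter-and-sum beamformer, in which each channel's filter is a pure integer delay. This is an assumed simplification, not the eight-channel design of the paper; the signals and delays are toy values.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer delay (in samples) and average.

    `delays[i]` is how many samples channel i lags behind the reference,
    so channel i is read `delays[i]` samples ahead to compensate.
    """
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]


# Toy two-microphone example (assumed): the second microphone
# receives the same source two samples later.
source = [0, 1, 0, -1, 0, 1, 0, -1]
mic_a = list(source)
mic_b = [0, 0] + source
output = delay_and_sum([mic_a, mic_b], delays=[0, 2])
```

With the delays compensated, the two copies of the source add coherently, so `output` reproduces the source; uncorrelated noise on the channels would instead be attenuated by the averaging.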
Nowadays, the detection of human falls is a problem recognized by the entire scientific community. Methods that perform well use human fall samples in the training set, while methods that do not use them work well only under certain conditions. Since examples of human falls are very difficult to obtain, there is a strong need to develop systems that can work well even with little or no data available for their training phase. In this article, we present a first study on a few-shot learning Siamese Neural Network applied to human fall detection using audio signals. This method has been compared with algorithms based on SVM and OCSVM, all evaluated under the same conditions. The proposed approach is able to learn the differences between signals belonging to different classes of events. In the classification phase, using only one human fall signal as a template, it achieves an F1-Measure of about 80% for the human fall class, while the SVM-based method reaches around 69% when it is trained under the same data knowledge conditions.
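Classification from a single template per class, as described above, can be sketched as a nearest-template decision in an embedding space. The sketch below is an assumption-laden illustration: `toy_embed` is a hand-written placeholder for the embedding that a Siamese network would learn, and the signals are invented.

```python
import math


def classify_one_shot(query, templates, embed):
    """Return the label whose single template is closest to the query."""
    q = embed(query)
    return min(
        templates,
        key=lambda label: math.dist(q, embed(templates[label])),
    )


def toy_embed(signal):
    # Placeholder for a learned embedding: mean and peak absolute amplitude.
    mags = [abs(v) for v in signal]
    return (sum(mags) / len(mags), max(mags))


# One template per class (assumed toy signals).
templates = {
    "fall": [0.0, 0.9, -0.8, 0.1],
    "nonfall": [0.1, -0.1, 0.1, -0.1],
}
query = [0.0, 1.0, -0.7, 0.0]
label = classify_one_shot(query, templates, toy_embed)
```

The point of the Siamese training is that the learned embedding places signals of the same class close together, so that a single template is enough at classification time; the placeholder embedding here only makes the mechanics visible.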
Detecting the presence of speakers and suitably localizing them in indoor environments are undoubtedly two important tasks in the speech processing community. Several algorithms have been proposed for Voice Activity Detection (VAD) and Speaker LOCalization (SLOC) so far, while their accomplishment by means of a joint integrated model has not received much attention. In particular, to the authors' knowledge, no studies have focused on the cooperative exploitation of VAD and SLOC information by means of machine learning. For this reason, the authors propose in this work a data-driven approach for joint speech detection and speaker localization, relying on a Convolutional Neural Network (CNN) which simultaneously processes LogMel and GCC-PHAT Patterns features. The proposed algorithm is compared with a two-stage model composed of the cascade of a neural network (NN) based VAD and an NN based SLOC, discussed in the authors' previous contributions. Computer simulations, carried out on the DIRHA dataset addressing a multi-room acoustic environment, show that the proposed method achieves a remarkable relative reduction of the speech activity detection error, equal to 33%, compared to the original NN based VAD. Moreover, the overall localization accuracy is improved as well, by employing the joint model as speech detector and the standard neural SLOC system in cascade.
Supporting people in their homes is an important issue for both ethical and practical reasons. Indeed, in recent years the scientific community has devoted particular attention to detecting human falls, since the consequences of a fall are the leading cause of death among elderly people. In this paper, we propose a human fall classification system based on an innovative floor acoustic sensor able to capture the acoustic waves transmitted through the floor. The algorithm employed is able to discriminate human falls from non-falls, and it is based on Mel-Frequency Cepstral Coefficients and a two-class Support Vector Machine. The dataset employed for performance evaluation is composed of falls of a human-mimicking doll, everyday objects and everyday noises. The obtained results show that the proposed solution is suitable for human fall detection in realistic scenarios, guaranteeing a 0% miss probability at very low false positive rates.
This paper presents and compares two algorithms based on artificial neural networks (ANNs) for sound event detection in real-life audio. Both systems have been developed and evaluated with the material provided for the third task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. For the first algorithm, we make use of an ANN trained on different features extracted from the down-mixed mono channel audio. Secondly, we analyse a binaural algorithm where the same feature extraction is performed on four different channels: the two binaural channels, the averaged monaural signal and the difference between the binaural channels. The proposed feature set comprises, along with mel-frequency cepstral coefficients and log-mel energies, activity information extracted with two different voice activity detection (VAD) algorithms. Moreover, we present results obtained with two different neural architectures, namely multi-layer perceptrons (MLPs) and recurrent neural networks. The highest scores obtained on the DCASE 2016 evaluation dataset are achieved by an MLP trained on binaural features and adaptive energy VAD; they consist of an averaged error rate of 0.79 and an averaged F1 score of 48.1%, thus marking an improvement over the best score registered in the DCASE 2016 challenge.
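The four channels used by the binaural algorithm above can be derived from a stereo recording as in the following sketch. The function name, the dictionary keys and the sign convention of the difference channel (left minus right) are assumptions; the abstract does not state them.

```python
def binaural_channels(left, right):
    """Build the four analysis channels from a two-channel recording."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]   # averaged monaural signal
    diff = [l - r for l, r in zip(left, right)]          # assumed sign: left - right
    return {"left": list(left), "right": list(right), "mono": mono, "diff": diff}


# Toy stereo frame (assumed values).
left = [1.0, 0.5, -0.5]
right = [0.0, 0.5, 0.5]
channels = binaural_channels(left, right)
```

The same feature extraction would then be applied to each of the four entries of `channels`, and the resulting feature vectors combined as network input.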
This paper focuses on employing Convolutional Neural Networks (CNN) with 3-D kernels for Voice Activity Detectors in multi-room domestic scenarios (mVAD). This technology is compared with the Multi-Layer Perceptron (MLP), and interesting advancements are observed with respect to previous works of the authors. In order to approximate real-life scenarios, the DIRHA dataset is exploited. It has been recorded in a home environment by means of several microphones arranged in various rooms. Our study consists of a multi-stage analysis focusing on the selection of the network size and of the input microphones in relation to their number and position. Results are evaluated in terms of Speech Activity Detection error rate (SAD). The CNN-mVAD outperforms the other method with significant robustness in terms of performance statistics, achieving in the best overall case a SAD equal to 7.0%.
A Speaker Localization algorithm based on Neural Networks for multi-room domestic scenarios is proposed in this paper. The approach is fully data-driven and employs a Neural Network fed by GCC-PHAT (Generalized Cross Correlation Phase Transform) Patterns, calculated from the microphone signals, to determine the speaker position in the room under analysis. In particular, we deal with a multi-room case study, in which the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested on the home-recorded DIRHA dataset, which provides multiple wall and ceiling microphone signals for each room. In particular, we focused on the speaker localization problem in two distinct neighbouring rooms. We assumed the presence of an Oracle multi-room Voice Activity Detector (VAD) in our experiments. A three-stage optimization procedure has been adopted to find the best network configuration and combination of GCC-PHAT Patterns. Moreover, an algorithm based on Time Difference of Arrival (TDOA), recently proposed in the literature for the same application context, has been considered as a term of comparison. As a result, the proposed algorithm outperforms the reference one, with an average localization error, expressed in terms of RMSE, of 525 mm against 1465 mm. Finally, we also assessed the algorithm performance when a real VAD, recently proposed by some of the authors, is used. Even though a degradation of the localization capability is registered (an average RMSE equal to 770 mm), a remarkable improvement over the state-of-the-art performance is still obtained.
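The localization error reported above is an RMSE in millimetres. One plausible way to compute such a figure, assuming 2-D positions and an error defined as the Euclidean distance between estimated and reference positions, is sketched below; the exact definition used in the paper is not given in the abstract, and the coordinates are invented.

```python
import math


def localization_rmse(estimates, references):
    """RMSE of the Euclidean position error, in the units of the inputs."""
    squared = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(estimates, references)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy positions in millimetres (assumed): one exact estimate and one
# estimate that is 500 mm away from the reference position.
estimates = [(0.0, 0.0), (300.0, 400.0)]
references = [(0.0, 0.0), (0.0, 0.0)]
rmse_mm = localization_rmse(estimates, references)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones, which is worth keeping in mind when comparing RMSE figures across methods.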
Vehicle noise emissions are highly dependent on the road surface roughness and materials. A classification of the road surface conditions may be useful in several respects, from driving assistance to in-car audio equalization. In the present work, we exploit deep neural networks to classify the road surface roughness using microphones placed inside and outside the vehicle. A database is built to test our classification algorithms, and results are reported showing that roughness classification is feasible with the proposed approach.
Falls are the primary cause of injury-related death among the elderly. The scientific community has devoted particular attention to them, since injuries can be limited by an early detection of the event. The solution proposed in this paper is based on a combined One-Class SVM (OCSVM) and template-matching classifier that discriminates human falls from nonfalls in a semisupervised framework. Acoustic signals are captured by means of a Floor Acoustic Sensor; then Mel-Frequency Cepstral Coefficients and Gaussian Mean Supervectors (GMSs) are extracted for the fall/nonfall discrimination. Here we propose a single-sensor, two-stage, user-aided approach: in the first stage, the OCSVM detects abnormal acoustic events. In the second, the template-matching classifier produces the final decision by exploiting a set of template GMSs related to the events marked as false positives by the user. The performance of the algorithm has been evaluated on a corpus containing human falls and nonfall sounds. Compared to the OCSVM-only approach, the proposed algorithm improves the performance by 10.14% in clean conditions and 4.84% in noisy conditions. Compared to Popescu and Mahnot (2009), the performance improvement is 19.96% in clean conditions and 8.08% in noisy conditions.
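The two-stage, user-aided decision described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: `toy_is_abnormal` is a hypothetical stand-in for the trained OCSVM, the Euclidean distance and the `max_distance` threshold are assumed choices for the template matching, and the vectors are toy stand-ins for Gaussian Mean Supervectors.

```python
import math


def two_stage_decision(features, is_abnormal, fp_templates, max_distance):
    """Stage 1 flags abnormal events; stage 2 vetoes known false positives."""
    if not is_abnormal(features):
        return "nonfall"
    for template in fp_templates:
        # An abnormal event close to something the user already marked
        # as a false positive is not reported as a fall.
        if math.dist(features, template) <= max_distance:
            return "nonfall"
    return "fall"


def toy_is_abnormal(features):
    # Hypothetical stand-in for the OCSVM: far from the origin is abnormal.
    return sum(v * v for v in features) > 1.0


# Template marked as a false positive by the user (assumed toy vector).
fp_templates = [[2.0, 0.0]]
quiet = two_stage_decision([0.1, 0.1], toy_is_abnormal, fp_templates, 0.5)
known_fp = two_stage_decision([2.1, 0.0], toy_is_abnormal, fp_templates, 0.5)
fall = two_stage_decision([0.0, 3.0], toy_is_abnormal, fp_templates, 0.5)
```

The user feedback only ever adds templates to `fp_templates`, so in this sketch the second stage can reduce false alarms over time but cannot recover falls that the first stage failed to flag.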