INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
Child behaviour is a topic of great scientific interest across a wide range of disciplines, including social sciences and artificial intelligence (AI). Knowledge in these different fields is not yet integrated to its full potential. The aim of this workshop was to bring researchers from these fields together. The first two workshops had a significant impact. In this workshop, we discussed topics such as the use of AI techniques to better examine and model interactions and children's emotional development, and the analysis of head movement patterns with respect to child age. This workshop was a successful new step towards the objective of bridging social sciences and AI, attracting contributions from various academic fields on child behaviour analysis. This document summarizes the accepted papers. CCS CONCEPTS: • Human-centered computing → Empirical studies in HCI; • Applied computing → Law, social and behavioral sciences.
This dataset contains data generated as part of the AudioCommons project (DS 5.8.1). Data take the form of Python and MATLAB code implementing the timbral models documented in D5.8. References: D5.8 (2019): A. Pearce, S. Safavi, T. Brookes, R. Mason, W. Wang, M. Plumbley, "Release of timbral characterisation tools for semantically annotating non-musical content", AudioCommons Deliverable Report - http://www.audiocommons.org/materials/
2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017
In this paper we present three methodologies for the fusion of different speaker verification modes of operation. Specifically, we investigate a knowledge-based (rule-based) method grounded in biometrics and security knowledge, a data-driven method based on machine learning fusion models, and a combination of the two. The experimental results indicate that the hybrid fusion architecture, which combines knowledge-based and data-driven fusion, offers both robustness against spoofing and improved speaker verification performance.
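As a rough illustration of such a hybrid architecture (not the authors' implementation; the scores, the rejection threshold, and the choice of a logistic-regression fuser below are all assumptions), a rule-based gate can screen for spoofing before a learned model fuses the per-mode scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training scores: one row per trial, columns are the three SV modes
# (fixed-passphrase, text-dependent, text-independent); 1 = target, 0 = impostor.
train_scores = np.array([[1.2, 0.8, 0.5], [-0.4, -0.9, -0.2],
                         [0.9, 1.1, 0.7], [-1.0, -0.3, -0.8]])
train_labels = np.array([1, 0, 1, 0])

# Data-driven part: a learned fusion model over the per-mode scores.
fuser = LogisticRegression().fit(train_scores, train_labels)

def hybrid_decision(scores, reject_threshold=-1.5):
    """Knowledge-based rule first, learned fusion second (assumed design)."""
    # Rule: a very low score in any single mode suggests a spoofed trial; reject outright.
    if min(scores) < reject_threshold:
        return 0
    # Otherwise defer to the data-driven fused decision.
    return int(fuser.predict(np.array([scores]))[0])

print(hybrid_decision([1.0, 0.9, 0.6]))   # expected: accept (1)
print(hybrid_decision([1.0, -2.0, 0.6]))  # expected: rule-based reject (0)
```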
The emergence of media outlets and public relations tools such as TV, radio and the Internet since the 20th century has provided companies with a good platform for advertising their goods and services. Advertisement recognition is an important task that can help companies measure the efficiency of their advertising campaigns in the market and compare their performance with competitors in order to gain better business insights. Advertisement recognition is usually performed manually with the help of human labour, or through automated methods based mainly on heuristic features; these methods usually lack scalability and the ability to generalize to different situations. In this paper, we present an automated method for advertisement recognition based on audio processing that makes this process fairly simple and eliminates the human factor from the equation. This method has ultimately been used in Miras information t...
2017 IEEE 13th International Colloquium on Signal Processing & its Applications (CSPA), 2017
In this paper, we propose a methodology for the fusion of different modes of speaker verification (SV) operation (fixed-passphrase, text-dependent and text-independent), using regression fusion models. The experimental results, with and without spoofing attack conditions and using different single-mode speaker verification engines (GMM-UBM, HMM-UBM and i-vector), indicated improvement in all experiments. The best speaker verification performance of 6.75% EER is achieved when fusing the scores from the three modes of operation of HMM-UBM based speaker verification systems, a relative improvement of 22.32% compared to the best performing single-mode engine.
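Regression fusion of this kind can be sketched in a few lines. The following is a minimal, hypothetical example: the scores and labels are stand-ins, and the use of plain linear regression is an assumption rather than necessarily the model used in the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-trial scores from the three modes of one engine;
# regression target: 1.0 = target speaker, 0.0 = impostor.
X = np.array([[2.1, 1.7, 1.2], [0.2, -0.5, 0.1],
              [1.8, 2.0, 1.5], [-0.7, -1.1, -0.4]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Regression fusion: learn weights mapping the three mode scores to one fused score.
reg = LinearRegression().fit(X, y)
fused_score = reg.predict(np.array([[1.5, 1.2, 0.9]]))[0]
print(f"fused score: {fused_score:.3f}")  # thresholded downstream to accept/reject
```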
WOCCI 2017: 6th International Workshop on Child Computer Interaction, 2017
Speaker recognition is a well established research area, but it mainly focuses on adult speech. Recent work on children's speech shows that not all findings from speaker recognition on adult speech are directly applicable to children's speech. There are a variety of applications for speaker recognition from children's speech: for example, it could be used as a safeguard for a child during his or her interactions on social networking websites, or as one of the main blocks in automatic tutoring systems for educational purposes at schools. In this research we evaluated two scoring methods for speaker recognition within the i-vector framework using two simulated environments: a classroom (30 students) and a school (288 students). The first method is based on the PLDA scoring approach and the second on the cosine similarity measure. Results show that the first method outperforms the second in the simulated school, whereas the second scoring method performs better for recognising a child in a classroom.
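The cosine scoring method is simple to express. Below is a minimal sketch in which the i-vectors are random stand-ins and length normalisation is an assumed preprocessing step:

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between two length-normalised i-vectors."""
    enroll = enroll_ivec / np.linalg.norm(enroll_ivec)
    test = test_ivec / np.linalg.norm(test_ivec)
    return float(np.dot(enroll, test))

rng = np.random.default_rng(0)
# Hypothetical classroom: 30 enrolled children, each with a 400-dimensional i-vector.
enrolled = {f"child_{i}": rng.normal(size=400) for i in range(30)}
test_ivec = rng.normal(size=400)

# Identify the enrolled child whose i-vector is closest to the test utterance.
best_match = max(enrolled, key=lambda name: cosine_score(enrolled[name], test_ivec))
print(best_match)
```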
We present a comparative evaluation of different classification algorithms for a fusion engine used in a speaker identity selection task. The fusion engine combines the scores from a number of classifiers, each of which uses the GMM-UBM approach to match speaker identity. The performance of the evaluated classification algorithms was examined in both the text-dependent and text-independent operation modes. The experimental results indicated a significant improvement in speaker identification accuracy of approximately 7% and 14.5% for the text-dependent and text-independent scenarios, respectively. Based on these findings, we suggest using fusion with a discriminative algorithm such as a Support Vector Machine in real-world speaker identification applications, where the text-independent scenario predominates.
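A toy version of such a discriminative fusion stage might look as follows; the simulated scores and the RBF kernel are assumptions, so this is a sketch rather than the paper's setup. The idea is that an SVM maps each trial's full vector of GMM-UBM scores to an identity instead of simply picking the maximum score:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_trials, n_speakers = 200, 5  # hypothetical: 5 enrolled speakers

# Simulated GMM-UBM score vectors: the true speaker's score is shifted upwards.
labels = rng.integers(0, n_speakers, n_trials)
scores = rng.normal(size=(n_trials, n_speakers))
scores[np.arange(n_trials), labels] += 2.0

# Discriminative fusion: the SVM learns the identity from the whole score pattern
# rather than simply taking the arg-max score.
svm = SVC(kernel="rbf").fit(scores[:150], labels[:150])
accuracy = (svm.predict(scores[150:]) == labels[150:]).mean()
print(f"identity selection accuracy: {accuracy:.2f}")
```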
This paper presents an experimental study investigating the effect of frequency sub-bands on regional accent identification (AID) and speaker identification (SID) performance on the ABI-1 corpus. The AID and SID systems are based on Gaussian mixture modeling. The SID experiments show up to 100% accuracy when using the full 11.025 kHz bandwidth. The best AID performance of 60.34% is obtained when using band-pass filtered (0.23-3.4 kHz) speech. The experiments using isolated narrow sub-bands show that the regions 0-0.77 kHz and 3.40-11.02 kHz are the most useful for SID, while the region 0.34-3.44 kHz is best for AID. AID experiments are also performed with intersession variability compensation, which provides the biggest performance gain in the 2.23-5.25 kHz region.
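Sub-band experiments of this kind boil down to band-limiting the speech before feature extraction. A minimal sketch follows; the filter order and the use of a Butterworth design are assumptions, not details from the paper:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, fs, low_hz, high_hz, order=6):
    """Band-limit a speech signal to one sub-band before feature extraction."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

fs = 22050  # sampling rate giving the 11.025 kHz full bandwidth used on ABI-1
speech = np.random.randn(fs)  # stand-in for one second of speech

# For example, the 0.23-3.4 kHz band reported as best for accent identification.
aid_band = bandpass(speech, fs, 230, 3400)
```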
2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016
Keeping track of the multiple passwords, PINs, memorable dates and other authentication details needed to gain remote access to accounts is one of modern life's less appealing challenges. Voice-based verification as a biometric technology, for both children and adults, could be a good replacement for this old-fashioned, memory-dependent procedure. Using voice for authentication could be beneficial in several application areas, including security, protection, education, and call-based and web-based services. Voice-based biometric applications are subject to different types of spoofing attacks. The most accessible and affordable type of spoofing for a voice-based biometric system is a replay attack: the playback of a pre-recorded speech sample, which presents a genuine risk to automatic speaker verification technology. This work presents two architectures for detecting frauds caused by replay attacks in voice-based biometric authentication systems. Experimental results confirmed that the performance of both methods could be further improved by applying a machine learning algorithm to perform fusion at the score level, using independent sources of scores from the different architectures.
(1) Background: Situated in the domain of urban sound scene classification by humans and machines, the research in this project will be a first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. The acoustic distinction between outdoor and indoor scenes is an active research field and can be automated with some success. A much subtler difference is the change in the indoor soundscape induced by an open window. Being able to determine this, however, would allow applications in warning systems and be a prerequisite for an app-based urban sound mapping project. Acoustic detection requires neither line of sight, nor sensors at the window frame, nor knowledge of the number of windows or their size. The task, however, varies substantially in difficulty with the amount of sound inside and outside. From the point of view of machine classification, the lack of specificity is the most problematic aspect: very few sounds, if any, can be assumed to originate exclusively from outside and be present at all times to aid automatic detection. The required generalisation ability, however, can be assumed for humans, who might also use very subtle cues in the change of reverberations. (2) Aims: The aims are (a) to determine the degree of reliability with which an open window can be recognised by humans and machines under varying circumstances based only on acoustic cues; (b) to investigate whether the findings for humans and machines can inform each other and can be used for further application-related research, e.g., window noise cancellation. (3) Method: (a) Dataset acquisition: A recording kit consisting of a dedicated laptop and microphone will be given to volunteers. Custom-programmed software will remind the user to specify the window state (establishing the so-called ground truth). (b) Perception experiments: Thirty participants will judge whether a window is open or closed in the recorded clips. After an extended familiarisation phase, they will proceed [...]
This dataset contains data generated as part of the AudioCommons project (DS 5.6.3). Data take the form of Python and MATLAB code implementing the timbral models documented in D5.6 and evaluated in D5.7. References: D5.6 (2018): A. Pearce, S. Safavi, T. Brookes, R. Mason, W. Wang, M. Plumbley, "Second prototype of timbral characterisation tool for semantically annotating non-musical content", AudioCommons Deliverable Report - http://www.audiocommons.org/materials/ D5.7 (2018): A. Pearce, S. Safavi, T. Brookes, R. Mason, W. Wang, M. Plumbley, "Evaluation report on the second prototypes of the timbral characterisation tools", AudioCommons Deliverable Report - http://www.audiocommons.org/materials/
2018 52nd Asilomar Conference on Signals, Systems, and Computers
In this paper, we compare different deep neural networks (DNNs) for extracting speech signals from competing speakers in room environments, including the conventional fully-connected multilayer perceptron (MLP) network, the convolutional neural network (CNN), the recurrent neural network (RNN), and the recently proposed capsule network (CapsNet). Each DNN takes as input both spectral features and converted spatial features that are robust to position mismatch, and outputs the separation mask for target source estimation. In addition, a psychoacoustically-motivated objective function is integrated in each DNN, which exploits the perceptual importance of each time-frequency (TF) unit in the training process. Objective evaluations are performed on the separated sounds using the converged models, in terms of PESQ, SDR and STOI. Overall, all the implemented DNNs greatly improve the quality and intelligibility of the embedded target source compared to the original recordings. In particular, the bidirectional RNN, whether along the temporal direction or along the frequency bins, outperforms the other DNN structures with consistent improvement.
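The mask-based pipeline shared by all these networks can be sketched independently of the specific DNN. In the toy example below the network's output is replaced by a random soft mask, and the STFT settings are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
mixture = np.random.randn(fs)  # stand-in for a recorded two-speaker mixture

# STFT analysis: the DNNs operate on these time-frequency (TF) units.
freqs, times, spec = stft(mixture, fs=fs, nperseg=512)

# Stand-in for the network output: a soft separation mask in [0, 1] per TF unit.
mask = np.random.rand(*spec.shape)

# Masking and resynthesis: apply the mask to the mixture spectrogram, then invert.
_, target_estimate = istft(mask * spec, fs=fs, nperseg=512)
```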
This paper proposes automatic smoking habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representations of utterances by applying factor analysis to Gaussian mixture model means and weights, respectively. Each framework is evaluated using different classification algorithms to detect smokers. Finally, score-level fusion of the i-vector-based and NFA-based recognizers is considered to improve classification accuracy. The proposed method is evaluated on telephone speech signals from speakers whose smoking habits are known, drawn from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation databases. Experimental results over 1194 utterances show the effectiveness of the proposed approach for the automatic smoking habit detection task.
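Score-level fusion of two recognizers can be as simple as a weighted sum; a minimal sketch follows, in which the scores and the weight are hypothetical and the paper's actual fusion model may differ:

```python
import numpy as np

# Hypothetical per-utterance smoker scores from the two recognizers.
ivector_scores = np.array([0.8, -0.3, 1.1, 0.2])
nfa_scores = np.array([0.5, -0.6, 0.9, -0.1])

# Convex combination at the score level; the weight would be tuned on held-out data.
alpha = 0.6  # assumed value, not taken from the paper
fused = alpha * ivector_scores + (1 - alpha) * nfa_scores
decisions = fused > 0.0  # True = classified as a smoker
print(decisions)
```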
Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017
The 6th International Workshop on Child Computer Interaction (WOCCI) was held in conjunction with ICMI 2017 in Glasgow, Scotland, on November 13, 2017. The workshop included ten papers spanning a range of topics relevant to child computer interaction, including speech therapy, reading tutoring, interaction with robots, and storytelling using figurines, among others. In addition, an invited talk entitled "Automatic Recognition of Children's Speech for Child-Computer Interaction" was given by Martin Russell, Professor of Information Engineering at the University of Birmingham.
We present a comparative evaluation of different classification algorithms for the task of speaker identity selection based on GMM-UBM speaker identification scores. The performance of the evaluated classification algorithms was examined in both text-dependent and text-independent operation modes for speaker identification. The experimental results indicated a significant improvement in speaker identification accuracy of approximately 7% and 14.5% for the text-dependent and text-independent scenarios, respectively. Keywords: speaker identification; classification; machine learning.
The primary focus of autonomous driving research is to improve driving accuracy and reliability. While great progress has been made, state-of-the-art algorithms still fail at times, and some of these failures are due to faults in sensors. Such failures may have fatal consequences, so it is important that automated cars foresee problems ahead as early as possible. By using real-world data and artificially injecting different types of sensor faults into healthy signals, data models can be trained using machine learning techniques. This paper proposes a novel fault detection, isolation, identification and prediction (based on detection) architecture for multiple faults in multi-sensor systems such as autonomous vehicles. Our detection, identification and isolation platform uses two distinct and efficient deep neural network architectures and achieved very impressive performance. Utilizing the sensor fault detection system's output, we then introduce our health index measure an...
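The artificial fault injection step can be illustrated with a small helper that corrupts a healthy signal with common fault types; the fault taxonomy below (bias, drift, stuck-at, noise) is an assumption about typical sensor faults, not a list taken from the paper:

```python
import numpy as np

def inject_fault(signal, kind, rng):
    """Corrupt a healthy sensor signal with a synthetic fault (assumed fault types)."""
    faulty = signal.copy()
    start = int(rng.integers(0, len(signal) // 2))  # random fault onset
    if kind == "bias":      # constant offset from the onset onwards
        faulty[start:] += 2.0
    elif kind == "drift":   # slowly growing error
        faulty[start:] += np.linspace(0.0, 3.0, len(signal) - start)
    elif kind == "stuck":   # sensor freezes at its last healthy value
        faulty[start:] = faulty[start]
    elif kind == "noise":   # increased measurement noise
        faulty[start:] += rng.normal(0.0, 1.0, len(signal) - start)
    return faulty

rng = np.random.default_rng(42)
healthy = np.sin(np.linspace(0, 20, 1000))              # stand-in for one sensor channel
training_example = inject_fault(healthy, "drift", rng)  # labelled 'drift' for training
```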
Although speaker verification is an established area of speech technology, previous studies have been restricted to adult speech. This paper investigates speaker verification for children's speech, using the PF-STAR children's speech corpus. A contemporary GMM-based speaker verification system, using MFCC features and maximum score normalization, is applied to adult and child speech at various bandwidths using comparable test and training material. The results show that the Equal Error Rate (EER) for child speech is almost four times greater than that for adults. A study of the effect of bandwidth on EER shows that for adult speaker verification, the spectrum can be conveniently partitioned into three frequency bands: up to 3.5-4 kHz, which contains individual differences in the part of the spectrum due to primary vocal tract resonances; the region between 4 kHz and 6 kHz, which contains further speaker-specific information and gives a significant reduction in EER; and the region above...
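The EER metric used throughout these experiments can be computed directly from target and impostor score distributions; a minimal sketch with synthetic scores:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the operating point where false accept and false reject rates are equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

rng = np.random.default_rng(0)
targets = rng.normal(1.5, 1.0, 500)      # hypothetical genuine-trial scores
impostors = rng.normal(0.0, 1.0, 5000)   # hypothetical impostor-trial scores
print(f"EER: {equal_error_rate(targets, impostors):.3f}")
```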
Speech and speaker recognition is one of the most important research and development areas and has received considerable attention in recent years. The desire to produce a natural form of communication between humans and machines can be considered the motivating factor behind such developments. Speech has the potential to influence numerous fields of research and development. In this paper, MirasVoice, a bilingual (English-Farsi) speech corpus, is presented. Over 50 native Iranian speakers who were able to speak both Farsi and English volunteered to help create this bilingual corpus. The volunteers read text documents and then answered questions spontaneously in both English and Farsi. A text-independent GMM-UBM speaker verification engine was designed in this study to validate and explore the performance of this corpus. This multilingual speech corpus could be used in a variety of language-dependent and language-independent applications. For exampl...
Speech signals contain important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of these types of characteristics has a wide range of commercial, medical and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. Many such applications depend on reliable systems using short speech segments without regard to the spoken text (text-independent). All these applications are also applicable to children's speech. This research aims to develop accurate methods and tools to identify different characteristics of speakers. Our experiments cover speaker recognition, gender recognition, age-group classification, and accent identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/ps...
Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states: two stationary states (open and closed) and two transitional states (open-to-closed and closed-to-open). We have then built our machine recognition baselines for different scenarios (open set versus closed set) using a deep learning framework. A human listening test was also performed to compare human and machine performance in detecting the window state using only acoustic cues. Our experimental results reveal that, with a simple machine baseline system, humans and machines achieve similar average performance in the closed set experiments.
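A machine baseline of this kind reduces to extracting spectral features per clip and training a classifier. The sketch below uses synthetic stand-in clips and a small MLP; the actual Open-Window baselines use a deep learning framework, so this is an assumed simplification:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def spectral_features(clip, n_bands=40):
    """Crude banded log-power summary of a clip (a stand-in for log-mel features)."""
    power = np.abs(np.fft.rfft(clip)) ** 2
    bands = np.array_split(power, n_bands)
    return np.log(np.array([band.mean() for band in bands]) + 1e-10)

rng = np.random.default_rng(0)
# Synthetic stand-in clips; the real Open-Window recordings are not reproduced here.
clips = [rng.normal(size=16000) for _ in range(100)]
labels = rng.integers(0, 2, 100)  # 0 = window closed, 1 = window open

X = np.array([spectral_features(clip) for clip in clips])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X[:80], labels[:80])
print(f"closed-set accuracy: {clf.score(X[80:], labels[80:]):.2f}")
```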