Skip to main content

maryam najafian

Massachusetts Institute of Technology (MIT), Computer Science, Faculty Member

University of Texas at Dallas, Computer Science, Post-Doc

University of Birmingham, Electronic, Electrical and Computer Engineering, PhD student, Intern Apple Siri

Followers

71

Following

10

Co-authors

6

Public Views

I am a research scientist at MIT's Institute for Data, Systems, and Society (IDSS) and Computer Science and AI (CSAIL) alumni
Address: Birmingham, United kingdom

less

InterestsView All (10)

Uploads

Papers by maryam najafian

Exploiting convolutional neural networks for phonotactic based dialect identification

by maryam najafian and Sameer Khurana

ICASSP, 2018

In this paper, we investigate different approaches for Dialect Identification (DID) in Arabic bro... more In this paper, we investigate different approaches for Dialect Identification (DID) in Arabic broadcast speech. Dialects differ in their inventory of phonological segments. This paper proposes a new phonotactic based feature representation approach which enables discrimination among different occurrences of the same phone n-grams with different phone duration and probability statistics. To achieve further gain in accuracy we used multilingual phone recognizers, trained separately on Arabic, English, Czech, Hungarian and Russian languages. We use Support Vector Machines (SVMs), and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. The final system fusion results in 24.7% and 19.0% relative error rate reduction compared to that of a conventional phonotactic DID, and i-vectors with bottleneck features.

QMDIS: QCRI-MIT Advanced Dialect Identification System

by maryam najafian and Ahmed Ali

Interspeech, 2017

As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (D... more As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (DID) for Arabic languages, we present the QCRI-MIT Advanced Dialect Identification System (QMDIS). QMDIS is an automatic spoken DID system for Di-alectal Arabic (DA). In this paper, we report a comprehensive study of the three main components used in the spoken DID task: phonotactic, lexical and acoustic. We use Support Vector Machines (SVMs), Logistic Regression (LR) and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. We perform all our experiments on a publicly available dataset and present new state-of-the-art results. QMDIS discriminates between the five most widely used dialects of Arabic: namely Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic (MSA). We report ≈ 73% accuracy for system combination. All the data and the code used in our experiments are publicly available for research.

Automatic Speech Recognition Of Arabic Multi-genre Broadcast Media

by maryam najafian, Wei-Ning Hsu, and Ahmed Ali

ASRU, 2017

This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi... more This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of Multi-Dialect and Multi-Genre MGB-2 data recorded from the Aljazeera Arabic TV channel. In this paper, we report our investigations of a range of signal pre-processing, data augmentation , topic-specific language model adaptation, accent specific retraining , and deep learning based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-delay Neural Networks (TDNNs), Long Short-term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a Bidirectional version of the Prioritized Grid LSTM (BPGLSTM) model. We propose a system combination for three purely sequence trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination using the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to 65.44% baseline for a DNN system.

Environment aware speaker diarization for moving targets using parallel DNN-based recognizers

Current diarization algorithms are commonly applied to the outputs of single non-moving microphon... more Current diarization algorithms are commonly applied to the outputs of single non-moving microphones. They do not explicitly identify the content of overlapped segments from multiple speakers or acoustic events. This paper presents an acoustic environment aware child-adult diarization applied to the audio recorded by a single microphone attached to moving targets under realistic high noise conditions. This system exploits a parallel deep neural network and hidden Markov model based approach which enables tracking of rapid turn changes in audio segments as well as capturing the cross talk labels for overlapped speech. The proposed system out-performs the state-of-the-art diarization systems without the need to prior clustering or front-end speech activity detection.

Speaker Diarization And Speech Activity Detection Fusion For Detecting Hot Language Areas During Different Classroom Activities Using Deep Neural Networks

A vocalization heatmap can help with large-scale monitoring of the child language environment dur... more A vocalization heatmap can help with large-scale monitoring of the child language environment during different classroom activities. Here we present a case study that uses Speaker's location and diarization information to detect hot language areas within a childcare center. We propose a Deep Neu-ral Network-Hidden Markov Model (DNN-HMM) based di-arization and Speech Activity Detection (SAD) fused system. It detects the time label of non-speech acoustics, as well as the speech generated by each child and the speech directed to them by other children and adults. Experimental results show that our system is robust against speech and non-speech segment confusions occurred due to background noise, as it relies on embedded discriminative acoustic model level rather than feature level speech activity detection. This system is suitable for real-time vocalization monitoring since it doesn't rely on a prior clustering or speech activity detection stages.

Improving speech recognition using limited accent diverse British English training data with deep neural networks

by maryam najafian and Martin Russell

Despite the recent advances in acoustic modelling tasks modelling speech data coming from differe... more Despite the recent advances in acoustic modelling tasks modelling speech data coming from different speakers with varying accents, age, and speaking styles is a fundamental challenge for Deep Neu-ral Networks (DNNs) based Automatic Speech Recognition (ASR). A relative gain of 46.85% is achieved in recognising the Accents of British Isles corpus by applying a baseline DNN model rather than a Gaussian mixture model. However, even for powerful DNN based systems accents remain a challenge. Our study shows that for a 'difficult' accent such as Glaswegian the relative word error rate is 78.9% higher than that of the standard southern English accent. In this work we propose four multi-accent learning strategies, and evaluate their effectiveness within the context of DNN based acoustic modelling framework. Using an i-vector based accent identification system with 78% accuracy to label the training data. We present a novel study on the effect of increase in the accent diversity, the 'dif-ficulty' and the amount of supplemented training data on the ASR performance. On average a further ASR gain of 27.24 % is achieved using the proposed strategies. Our results show that across all accent regions supplementing the training set with a small amount of data from the most 'difficult' accent (2.25 hours of Glaswegian accent) leads to a similar gain in performance as using a large amount of accent diverse data (8.96 hours from 14 accent regions). Although the ideas presented are focused on DNN based analysis with limited amount of multi-accented data, they are applicable for training all classifiers with multi-conditional limited resources.

Modelling Accents for Automatic Speech Recognition

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accent-robust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest , or data from " difficult " accent-groups is included in the training set. Index Terms: Multi-accent speech recognition, acoustic data selection, deep neural network 1. Research Summary In the 'accent of English' book by Wells an accent is defined as a " a pattern of pronunciation used by a speaker for whom English is the native language or, more generally, by the community or social grouping to which he or she belongs ". This differentiates accent from dialect, which includes the use of words or phrases that are characteristic of that community. It includes varieties of English spoken as a first language in different countries (for example, US vs Australian English), geographical variations within a country, and patterns of pronunciation associated with particular social or ethnic groups. The recent growth in applications of ASR systems forces developers to consider approaches that deliver consistent high performance across different accent groups. It is of high importance for those systems to be able to deal with accented speakers. Over the last decade, DNN based systems achieved superior accuracy compared to GMM based systems in many applications. This has been attributed to be due to their discriminative nature, and layer by layer invariant feature learning ability. This enhances their robustness against different sources of variation, for example accent. Various techniques, such as adding accent discriminative acoustic features, adding an accent-specific layer on top of a DNN acoustic model, accent-specific pronunciation adaptation, accent-specific polyphone decision trees, and acoustic model adaptation, were proposed in literature. In this research we investigate, whether it is better to train a simple DNN system on a multi-accented data and rely on its layer-wise discriminative learning to learn different accent patterns in the training data, or it is better to do GMM based accent plus speaker adaptation. In the DNN system, we explored the effect of the amount of pre-training and training data and their accent diversity on the final performance. In our GMM system , we construct an accent-dependent acoustic model for 14 different British accents, and use Accent Identification (AID) to select the model that corresponds to a new speaker. For each new speaker we do accent-dependent acoustic model adaptation , followed by the speaker adaptation. Results of our AID system shows that with 43s of speech an individual's accent can be determined with 81% accuracy using unsupervised AID (i-Vectors). Thus, a possible solution is to use AID for accent-dependent ASR model selection and then apply unsupervised speaker adaptation to the GMM-HMM system. In DNN-HMM systems it is possible to use this AID to analyse the accent diversity of the training data and investigates its effect on the final performance. In our experiments we extracted the adaptation and test data from the Accents of British Isles (ABI-1) corpus containing data from 14 different regions of British Isles. These regional accents fall into four groups namely, Northern English, Southern English, Irish and Scottish. We applied an AID system to explore the accent diversity of the WSJCAM0 speech corpus and realised that it consists of a range of Northern English, Southern English, Irish and Scot-tish accents. Using the WSJCAM0 corpus, we achieved a relative gain of 46.9% in recognizing the ABI-1 corpus by applying DNNs rather than GMMs for the acoustic modelling using the WSJCAM0 training data. This shows that DNN systems are better than GMMs in dealing with the multi-accented data. A clear effect of accent is evident in the performance of our GMM-HMM speech recognition systems, even after applying multiple acoustic model adaptation. We observed that the accuracy of this GMM system, even after applying Maximum A Posterior adaptation to 40 minutes of accent-specific data, followed by unsupervised speaker adaptation using Maximum Likelihood Linear Regression with 43s (7.3%WER) of data, could not match the accuracy of our baseline DNN system (6.9%WER). Despite, the major gain achieved by using the DNNs rather than GMMs in modelling the acoustic data, the effect of accent is still evident and the system doesn't perform uniformly well across all accent groups. Even in our best system (DNN-HMM) the percentage Word Error Rate (%WER) for the most challenging accent Glaswegian (13.34%) nearly 5 times bigger than that for standard southern English accent (2.84%). Adding an accent-specific layer on top of a multi-accent neural network acoustic model is one potential solution and will be addressed in our future work. Although the work targets British English it is very likely that the techniques described are applicable to accented speech in other languages.

Modelling Accents for Automatic Speech Recognition

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accentrobust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest, or data from "difficult" accent-groups is included in the training set.

Automatic measurement and analysis of the child verbal communication using classroom acoustics within a child care center

Understanding the language environment of early learners is a challenging task for both human and... more Understanding the language environment of early learners is a challenging task for both human and machine, and it is critical in facilitating effective language development among young children. This papers presents a new application for the existing diarization systems and investigates the language environment of young children using a turn taking strategy employing an i-vector based baseline that captures adult-to-child or child-to-child conversational turns across different classrooms in a child care center. Detecting speaker turns is necessary before more in depth subsequent analysis of audio such as word count, speech recognition, and keyword spotting which can contribute to the design of future learning spaces specifically designed for typically developing children, or those at-risk with communication limitations. Experimental results using naturalistic child-teacher classroom settings indicate the proposed rapid child-adult speech turn taking scheme is highly effective under noisy classroom conditions and results in 27.3% relative error rate reduction compared to the baseline results produced by the LIUM diarization toolkit.

Delay reduction in real-time recognition of human activity for stroke rehabilitation

by maryam najafian and Roozbeh Nabiei

Assisting patients to perform activity of daily living (ADLs) is a challenging task for both huma... more Assisting patients to perform activity of daily living (ADLs) is a challenging task for both human and machine. Hence, developing a computer-based rehabilitation system to re-train patients to carry out daily activities is an essential step towards facilitating rehabilitation of stroke patients with apraxia and action disorganization syndrome (AADS). This paper presents a real-time hidden Markov model (HMM) based human activity recognizer, and proposes a technique to reduce the time-delay occurred during the decoding stage. Results are reported for complete tea-making trials. In this study, the input features are recorded using sensors attached to the objects involved in the tea- making task, plus hand coordinate data captured using Kinect TM sensor. A coaster of sensors, comprising an accelerometer and three force-sensitive resistors, are packaged in a unit which can be easily attached to the base of an object. A parallel asynchronous set of detectors, each responsible for the detection of one sub-goal in the tea-making task, are used to address challenges arising from overlaps between human actions. The proposed activity recognition system with the modified HMM topology provides a practical solution to the action recognition problem and reduces the time-delay by 64% with no loss in accuracy.

Employing speech and location information for automatic assessment of child language environments

Assessment of the language environment of children in early childhood is a challenging task for b... more Assessment of the language environment of children in early childhood is a challenging task for both human and machine, and understanding the classroom environment of early learners is an essential step towards facilitating language acquisition and development. This paper explores an approach for intelligent language environment monitoring based on the duration of child-to-child and adult-to-child conversations and a child's physical location in classrooms within a childcare center. Amount of child's communication with other children and adults was measured using an i-vector based child-adult diariza-tion system (developed at CRSS) with 69% accuracy was applied to the noisy classroom audio recordings to detect the rapid conversational turns. Furthermore the average time spent by each child across different activity areas within the classroom was measured using a location tracking system. The proposed solution here offers unique opportunities to assess speech and language interaction for children, and quantify location context which would contribute to improved language environments.

Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems

The para-linguistic information in a speech signal includes clues to the geographical and social ... more The para-linguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with recognition of the 14 regional accents of British English. For Accent Identification (AID), acoustic methods exploit differences between the distributions of sounds, while phonotactic approaches exploit the sequences in which these sounds occur. We demonstrate these methods are good complements for each other and use their confusion matrices for further analysis. Our relatively simple i-vector and phonotactic fused system with recognition accuracy of 84.87% outperforms the i-vector fused results reported in literature, by 4.7%. Further analysis on distribution of British English accents has been carried out by analyzing the low dimensional representation of i-vector AID feature space.

Acoustic Model Selection for Recognition of Regional Accented Speech

Accent is cited as an issue for speech recognition systems \cite{huang2004accent}. Research has s... more Accent is cited as an issue for speech recognition systems \cite{huang2004accent}. Research has shown that accent mismatch between the training and the test data will result in significant accuracy reduction in Automatic Speech Recognition (ASR) systems. Using HMM based ASR trained on a standard English accent, our study shows that the error rates can be up to seven times higher for accented speech, than for standard English. Hence the development of accent-robust ASR systems is of significant importance.

This research investigates different acoustic modelling techniques for compensating for the effects of regional accents on the performance of ASR systems. The study includes conventional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and more contemporary Deep Neural Network (DNN)-HMM systems. In both cases we consider both supervised and unsupervised techniques. This work uses the WSJCAM0 corpus as a set of `accent neutral' data and accented data from the Accents of the British Isles (ABI) corpora.

Initially, we investigated a model selection approach, based on automatic accent identification (AID). Three AID systems were developed and evaluated in this work, namely i-vector, phonotactic, and ACCDIST-SVM. Each focuses on a different property of speech to achieve AID. We use two-dimensional projections based on Expectation Maximization-Principal Component Analysis (EM-PCA) and Linear Discriminative Analysis (LDA) to visualise the different accent spaces and use these visualisations to analyse the AID and ASR results.

In GMM-HMM based ASR systems, we show that using a small amount of data from a test speaker to select an accented acoustic model using AID, results in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. A possible objection to AID-based model selection is that in each accent there exist speakers who have varying degrees of accent, or whose accent exhibits properties of other accents. This motived us to investigated whether using an acoustic model created based on neighbouring speakers in the accent space can result in better performance. In conclusion, the maximum reduction in error rate achieved over all GMM-HMM based adaptation approaches is obtained by using AID to select an accent-specifc model followed by speaker adaptation. It is also shown that the
accuracy of an AID system does not have a high impact on the gain obtained by accent adaptation. Hence, in real time applications one can use a simple AID system for accented acoustic model selection.

Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior
performance compared to more traditional GMM-HMM systems, due to their discriminative learning ability. Our research confirms this by showing that a DNN-HMM system outperforms a GMM-HMM system, even after the latter has benefited from two stages of accent followed by speaker adaptation. We investigate the effect of adding different types of accented data to the baseline training set. The addition of data is either supervised or unsupervised, depending on whether the added data corresponds to the accent of the test speaker. Our results show that overall accuracy of the DNN-HMM system on accented data is maximized when either the accent diversity of the supplementary training data is highest, or data from the most `difficult' accent groups is included in the training set.

Finally, the performance of the baseline DNN-HMM system on accented data prompts an investigation of the accent characteristics of the WSJCAM0 corpus, which suggests that instead of being `neutral' it contains speech that exhibits characteristics of many of the accents in the ABI corpus.

Modelling Accents for Automatic Speech Recognition

EUSCIPCO 2015

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accent-robust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest , or data from " difficult " accent-groups is included in the training set. Index Terms: Multi-accent speech recognition, acoustic data selection, deep neural network 1. Research Summary In the 'accent of English' book by Wells an accent is defined as a " a pattern of pronunciation used by a speaker for whom English is the native language or, more generally, by the community or social grouping to which he or she belongs ". This differentiates accent from dialect, which includes the use of words or phrases that are characteristic of that community. It includes varieties of English spoken as a first language in different countries (for example, US vs Australian English), geographical variations within a country, and patterns of pronunciation associated with particular social or ethnic groups. The recent growth in applications of ASR systems forces developers to consider approaches that deliver consistent high performance across different accent groups. It is of high importance for those systems to be able to deal with accented speakers. Over the last decade, DNN based systems achieved superior accuracy compared to GMM based systems in many applications. This has been attributed to be due to their discriminative nature, and layer by layer invariant feature learning ability. This enhances their robustness against different sources of variation, for example accent. Various techniques, such as adding accent discriminative acoustic features, adding an accent-specific layer on top of a DNN acoustic model, accent-specific pronunciation adaptation, accent-specific polyphone decision trees, and acoustic model adaptation, were proposed in literature. In this research we investigate, whether it is better to train a simple DNN system on a multi-accented data and rely on its layer-wise discriminative learning to learn different accent patterns in the training data, or it is better to do GMM based accent plus speaker adaptation. In the DNN system, we explored the effect of the amount of pre-training and training data and their accent diversity on the final performance. In our GMM system , we construct an accent-dependent acoustic model for 14 different British accents, and use Accent Identification (AID) to select the model that corresponds to a new speaker. For each new speaker we do accent-dependent acoustic model adaptation , followed by the speaker adaptation. Results of our AID system shows that with 43s of speech an individual's accent can be determined with 81% accuracy using unsupervised AID (i-Vectors). Thus, a possible solution is to use AID for accent-dependent ASR model selection and then apply unsupervised speaker adaptation to the GMM-HMM system. In DNN-HMM systems it is possible to use this AID to analyse the accent diversity of the training data and investigates its effect on the final performance. In our experiments we extracted the adaptation and test data from the Accents of British Isles (ABI-1) corpus containing data from 14 different regions of British Isles. These regional accents fall into four groups namely, Northern English, Southern English, Irish and Scottish. We applied an AID system to explore the accent diversity of the WSJCAM0 speech corpus and realised that it consists of a range of Northern English, Southern English, Irish and Scot-tish accents. Using the WSJCAM0 corpus, we achieved a relative gain of 46.9% in recognizing the ABI-1 corpus by applying DNNs rather than GMMs for the acoustic modelling using the WSJCAM0 training data. This shows that DNN systems are better than GMMs in dealing with the multi-accented data. A clear effect of accent is evident in the performance of our GMM-HMM speech recognition systems, even after applying multiple acoustic model adaptation. We observed that the accuracy of this GMM system, even after applying Maximum A Posterior adaptation to 40 minutes of accent-specific data, followed by unsupervised speaker adaptation using Maximum Likelihood Linear Regression with 43s (7.3%WER) of data, could not match the accuracy of our baseline DNN system (6.9%WER). Despite, the major gain achieved by using the DNNs rather than GMMs in modelling the acoustic data, the effect of accent is still evident and the system doesn't perform uniformly well across all accent groups. Even in our best system (DNN-HMM) the percentage Word Error Rate (%WER) for the most challenging accent Glaswegian (13.34%) nearly 5 times bigger than that for standard southern English accent (2.84%). Adding an accent-specific layer on top of a multi-accent neural network acoustic model is one potential solution and will be addressed in our future work. Although the work targets British English it is very likely that the techniques described are applicable to accented speech in other languages.

Unsupervised Model Selection For Recognition Of Regional Accented Speech

Interspeech 2014

This paper is concerned with automatic speech recognition (ASR) for accented speech. Given a smal... more This paper is concerned with automatic speech recognition (ASR) for accented speech. Given a small amount of speech from a new speaker, is it better to apply speaker adaptation to the baseline, or to use accent identification (AID) to identify the speaker’s accent and select an accent-dependent acoustic model? Three accent-based model selection methods are inves- tigated: using the “true” accent model, and unsupervised model selection using i-Vector and phonotactic-based AID. All three methods outperform the unadapted baseline. Most significantly, AID-based model selection using 43s of speech performs bet- ter than unsupervised speaker adaptation, even if the latter uses five times more adaptation data. Combining unsupervised AID- based model selection and speaker adaptation gives an average relative reduction in ASR error rate of up to 47%.

Acoustic Model Selection Using Limited Data For Accent Robust Speech Recognition

by maryam najafian and Saeid Safavi

EUSIPCO 2014

This paper investigates techniques to compensate for the effects of regional accents of British E... more This paper investigates techniques to compensate for the effects of regional accents of British English on automatic speech recogni- tion (ASR) performance. Given a small amount of speech from a new speaker, is it better to apply speaker adaptation, or to use accent identification (AID) to identify the speaker’s accent followed by accent-dependent ASR? Three approaches to accent-dependent modelling are investigated: using the ‘correct’ accent model, choos- ing a model using supervised (ACCDIST-based) accent identifi- cation (AID), and building a model using data from neighbouring speakers in ‘AID space’. All of the methods outperform the accent- independent model, with relative reductions in ASR error rate of up to 44%. Using on average 43s of speech to identify an appro- priate accent-dependent model outperforms using it for supervised speaker-adaptation, by 7%.

Comparison of Speaker Verification Performance for Adult and Child Speech

by Saeid Safavi and maryam najafian

WOCCI 2014

Although speaker verification is an established area of speech technology, previous studies have ... more Although speaker verification is an established area of speech technology, previous studies have been restricted to adult speech. This paper investigates speaker verification for children’s speech, using the PF-STAR children’s speech corpus. A contemporary GMM-based speaker verification system, using MFCC features and maximum score normalization, is applied to adult and child speech at various bandwidths using comparable test and training material. The results show that the Equal Error Rate (EER) for child speech is almost four times greater than that for adults. A study of the effect of bandwidth on EER shows that for adult speaker verification, the spectrum can be conveniently partitioned into three frequency bands: up to 3.5-4kHz, which contains individual differences in the part of the spectrum due to primary vocal tract resonances, the region between 4kHz and 6kHz, which contains further speaker-specific information and gives a significant reduction in EER, and the region above 6kHz. These finding are consistent with previous research. For young children’s speech a similar pattern emerges, but with each region shifted to higher frequency values.

Speaker Recognition for Children’s Speech

by maryam najafian and Saeid Safavi

Interspeech 2012, Sep 2012

This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids... more This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids corpus and GMM-UBM and GMM-SVM SR systems. Regions of the spectrum containing important speaker information for children are identified by conducting SR experiments over 21 frequency bands. As for adults, the spectrum can be split into four regions, with the first (containing primary vocal tract resonance information) and third (corresponding to high- frequency speech sounds) being most useful for SR. However, the frequencies at which these regions occur are from 11% to 38% higher for children. It is also noted that sub- band SR rates are lower for younger children. Finally results are presented of SR experiments to identify a child in a class (30 children, similar age) and school (288 children, varying ages). Class performance depends on age, with accuracy varying from 90% for young children to 99% for older children. The identification rate achieved for a child in a school is 81%.

Talks by maryam najafian

What's Happening In Accents & Dialects ?

modelling accents for automatic speech recognition to create better automated phone systems.

Maryam Najafian, School of Electronic, Electrical and Computer Engineering, College of Engineerin... more Maryam Najafian, School of Electronic, Electrical and Computer Engineering, College of Engineering and Physical Sciences finalist from the 2013 Birmingham Three Minute Thesis competition, delivers her three-minute presentation explaining her PhD research on modelling accents for automatic speech recognition to create better automated phone systems.

Exploiting convolutional neural networks for phonotactic based dialect identification

by maryam najafian and Sameer Khurana

ICASSP, 2018

In this paper, we investigate different approaches for Dialect Identification (DID) in Arabic bro... more In this paper, we investigate different approaches for Dialect Identification (DID) in Arabic broadcast speech. Dialects differ in their inventory of phonological segments. This paper proposes a new phonotactic based feature representation approach which enables discrimination among different occurrences of the same phone n-grams with different phone duration and probability statistics. To achieve further gain in accuracy we used multilingual phone recognizers, trained separately on Arabic, English, Czech, Hungarian and Russian languages. We use Support Vector Machines (SVMs), and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. The final system fusion results in 24.7% and 19.0% relative error rate reduction compared to that of a conventional phonotactic DID, and i-vectors with bottleneck features.

QMDIS: QCRI-MIT Advanced Dialect Identification System

by maryam najafian and Ahmed Ali

Interspeech, 2017

As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (D... more As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (DID) for Arabic languages, we present the QCRI-MIT Advanced Dialect Identification System (QMDIS). QMDIS is an automatic spoken DID system for Di-alectal Arabic (DA). In this paper, we report a comprehensive study of the three main components used in the spoken DID task: phonotactic, lexical and acoustic. We use Support Vector Machines (SVMs), Logistic Regression (LR) and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. We perform all our experiments on a publicly available dataset and present new state-of-the-art results. QMDIS discriminates between the five most widely used dialects of Arabic: namely Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic (MSA). We report ≈ 73% accuracy for system combination. All the data and the code used in our experiments are publicly available for research.

Automatic Speech Recognition Of Arabic Multi-genre Broadcast Media

by maryam najafian, Wei-Ning Hsu, and Ahmed Ali

ASRU, 2017

This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi... more This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of Multi-Dialect and Multi-Genre MGB-2 data recorded from the Aljazeera Arabic TV channel. In this paper, we report our investigations of a range of signal pre-processing, data augmentation , topic-specific language model adaptation, accent specific retraining , and deep learning based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-delay Neural Networks (TDNNs), Long Short-term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a Bidirectional version of the Prioritized Grid LSTM (BPGLSTM) model. We propose a system combination for three purely sequence trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination using the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to 65.44% baseline for a DNN system.

Environment aware speaker diarization for moving targets using parallel DNN-based recognizers

Current diarization algorithms are commonly applied to the outputs of single non-moving microphon... more Current diarization algorithms are commonly applied to the outputs of single non-moving microphones. They do not explicitly identify the content of overlapped segments from multiple speakers or acoustic events. This paper presents an acoustic environment aware child-adult diarization applied to the audio recorded by a single microphone attached to moving targets under realistic high noise conditions. This system exploits a parallel deep neural network and hidden Markov model based approach which enables tracking of rapid turn changes in audio segments as well as capturing the cross talk labels for overlapped speech. The proposed system out-performs the state-of-the-art diarization systems without the need to prior clustering or front-end speech activity detection.

Speaker Diarization And Speech Activity Detection Fusion For Detecting Hot Language Areas During Different Classroom Activities Using Deep Neural Networks

A vocalization heatmap can help with large-scale monitoring of the child language environment dur... more A vocalization heatmap can help with large-scale monitoring of the child language environment during different classroom activities. Here we present a case study that uses Speaker's location and diarization information to detect hot language areas within a childcare center. We propose a Deep Neu-ral Network-Hidden Markov Model (DNN-HMM) based di-arization and Speech Activity Detection (SAD) fused system. It detects the time label of non-speech acoustics, as well as the speech generated by each child and the speech directed to them by other children and adults. Experimental results show that our system is robust against speech and non-speech segment confusions occurred due to background noise, as it relies on embedded discriminative acoustic model level rather than feature level speech activity detection. This system is suitable for real-time vocalization monitoring since it doesn't rely on a prior clustering or speech activity detection stages.

Improving speech recognition using limited accent diverse British English training data with deep neural networks

by maryam najafian and Martin Russell

Despite the recent advances in acoustic modelling tasks modelling speech data coming from differe... more Despite the recent advances in acoustic modelling tasks modelling speech data coming from different speakers with varying accents, age, and speaking styles is a fundamental challenge for Deep Neu-ral Networks (DNNs) based Automatic Speech Recognition (ASR). A relative gain of 46.85% is achieved in recognising the Accents of British Isles corpus by applying a baseline DNN model rather than a Gaussian mixture model. However, even for powerful DNN based systems accents remain a challenge. Our study shows that for a 'difficult' accent such as Glaswegian the relative word error rate is 78.9% higher than that of the standard southern English accent. In this work we propose four multi-accent learning strategies, and evaluate their effectiveness within the context of DNN based acoustic modelling framework. Using an i-vector based accent identification system with 78% accuracy to label the training data. We present a novel study on the effect of increase in the accent diversity, the 'dif-ficulty' and the amount of supplemented training data on the ASR performance. On average a further ASR gain of 27.24 % is achieved using the proposed strategies. Our results show that across all accent regions supplementing the training set with a small amount of data from the most 'difficult' accent (2.25 hours of Glaswegian accent) leads to a similar gain in performance as using a large amount of accent diverse data (8.96 hours from 14 accent regions). Although the ideas presented are focused on DNN based analysis with limited amount of multi-accented data, they are applicable for training all classifiers with multi-conditional limited resources.

Modelling Accents for Automatic Speech Recognition

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accent-robust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest , or data from " difficult " accent-groups is included in the training set. Index Terms: Multi-accent speech recognition, acoustic data selection, deep neural network 1. Research Summary In the 'accent of English' book by Wells an accent is defined as a " a pattern of pronunciation used by a speaker for whom English is the native language or, more generally, by the community or social grouping to which he or she belongs ". This differentiates accent from dialect, which includes the use of words or phrases that are characteristic of that community. It includes varieties of English spoken as a first language in different countries (for example, US vs Australian English), geographical variations within a country, and patterns of pronunciation associated with particular social or ethnic groups. The recent growth in applications of ASR systems forces developers to consider approaches that deliver consistent high performance across different accent groups. It is of high importance for those systems to be able to deal with accented speakers. Over the last decade, DNN based systems achieved superior accuracy compared to GMM based systems in many applications. This has been attributed to be due to their discriminative nature, and layer by layer invariant feature learning ability. This enhances their robustness against different sources of variation, for example accent. Various techniques, such as adding accent discriminative acoustic features, adding an accent-specific layer on top of a DNN acoustic model, accent-specific pronunciation adaptation, accent-specific polyphone decision trees, and acoustic model adaptation, were proposed in literature. In this research we investigate, whether it is better to train a simple DNN system on a multi-accented data and rely on its layer-wise discriminative learning to learn different accent patterns in the training data, or it is better to do GMM based accent plus speaker adaptation. In the DNN system, we explored the effect of the amount of pre-training and training data and their accent diversity on the final performance. In our GMM system , we construct an accent-dependent acoustic model for 14 different British accents, and use Accent Identification (AID) to select the model that corresponds to a new speaker. For each new speaker we do accent-dependent acoustic model adaptation , followed by the speaker adaptation. Results of our AID system shows that with 43s of speech an individual's accent can be determined with 81% accuracy using unsupervised AID (i-Vectors). Thus, a possible solution is to use AID for accent-dependent ASR model selection and then apply unsupervised speaker adaptation to the GMM-HMM system. In DNN-HMM systems it is possible to use this AID to analyse the accent diversity of the training data and investigates its effect on the final performance. In our experiments we extracted the adaptation and test data from the Accents of British Isles (ABI-1) corpus containing data from 14 different regions of British Isles. These regional accents fall into four groups namely, Northern English, Southern English, Irish and Scottish. We applied an AID system to explore the accent diversity of the WSJCAM0 speech corpus and realised that it consists of a range of Northern English, Southern English, Irish and Scot-tish accents. Using the WSJCAM0 corpus, we achieved a relative gain of 46.9% in recognizing the ABI-1 corpus by applying DNNs rather than GMMs for the acoustic modelling using the WSJCAM0 training data. This shows that DNN systems are better than GMMs in dealing with the multi-accented data. A clear effect of accent is evident in the performance of our GMM-HMM speech recognition systems, even after applying multiple acoustic model adaptation. We observed that the accuracy of this GMM system, even after applying Maximum A Posterior adaptation to 40 minutes of accent-specific data, followed by unsupervised speaker adaptation using Maximum Likelihood Linear Regression with 43s (7.3%WER) of data, could not match the accuracy of our baseline DNN system (6.9%WER). Despite, the major gain achieved by using the DNNs rather than GMMs in modelling the acoustic data, the effect of accent is still evident and the system doesn't perform uniformly well across all accent groups. Even in our best system (DNN-HMM) the percentage Word Error Rate (%WER) for the most challenging accent Glaswegian (13.34%) nearly 5 times bigger than that for standard southern English accent (2.84%). Adding an accent-specific layer on top of a multi-accent neural network acoustic model is one potential solution and will be addressed in our future work. Although the work targets British English it is very likely that the techniques described are applicable to accented speech in other languages.

Modelling Accents for Automatic Speech Recognition

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accentrobust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest, or data from "difficult" accent-groups is included in the training set.

Automatic measurement and analysis of the child verbal communication using classroom acoustics within a child care center

Understanding the language environment of early learners is a challenging task for both human and... more Understanding the language environment of early learners is a challenging task for both human and machine, and it is critical in facilitating effective language development among young children. This papers presents a new application for the existing diarization systems and investigates the language environment of young children using a turn taking strategy employing an i-vector based baseline that captures adult-to-child or child-to-child conversational turns across different classrooms in a child care center. Detecting speaker turns is necessary before more in depth subsequent analysis of audio such as word count, speech recognition, and keyword spotting which can contribute to the design of future learning spaces specifically designed for typically developing children, or those at-risk with communication limitations. Experimental results using naturalistic child-teacher classroom settings indicate the proposed rapid child-adult speech turn taking scheme is highly effective under noisy classroom conditions and results in 27.3% relative error rate reduction compared to the baseline results produced by the LIUM diarization toolkit.

Delay reduction in real-time recognition of human activity for stroke rehabilitation

by maryam najafian and Roozbeh Nabiei

Assisting patients to perform activity of daily living (ADLs) is a challenging task for both huma... more Assisting patients to perform activity of daily living (ADLs) is a challenging task for both human and machine. Hence, developing a computer-based rehabilitation system to re-train patients to carry out daily activities is an essential step towards facilitating rehabilitation of stroke patients with apraxia and action disorganization syndrome (AADS). This paper presents a real-time hidden Markov model (HMM) based human activity recognizer, and proposes a technique to reduce the time-delay occurred during the decoding stage. Results are reported for complete tea-making trials. In this study, the input features are recorded using sensors attached to the objects involved in the tea- making task, plus hand coordinate data captured using Kinect TM sensor. A coaster of sensors, comprising an accelerometer and three force-sensitive resistors, are packaged in a unit which can be easily attached to the base of an object. A parallel asynchronous set of detectors, each responsible for the detection of one sub-goal in the tea-making task, are used to address challenges arising from overlaps between human actions. The proposed activity recognition system with the modified HMM topology provides a practical solution to the action recognition problem and reduces the time-delay by 64% with no loss in accuracy.

Employing speech and location information for automatic assessment of child language environments

Assessment of the language environment of children in early childhood is a challenging task for b... more Assessment of the language environment of children in early childhood is a challenging task for both human and machine, and understanding the classroom environment of early learners is an essential step towards facilitating language acquisition and development. This paper explores an approach for intelligent language environment monitoring based on the duration of child-to-child and adult-to-child conversations and a child's physical location in classrooms within a childcare center. Amount of child's communication with other children and adults was measured using an i-vector based child-adult diariza-tion system (developed at CRSS) with 69% accuracy was applied to the noisy classroom audio recordings to detect the rapid conversational turns. Furthermore the average time spent by each child across different activity areas within the classroom was measured using a location tracking system. The proposed solution here offers unique opportunities to assess speech and language interaction for children, and quantify location context which would contribute to improved language environments.

Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems

The para-linguistic information in a speech signal includes clues to the geographical and social ... more The para-linguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with recognition of the 14 regional accents of British English. For Accent Identification (AID), acoustic methods exploit differences between the distributions of sounds, while phonotactic approaches exploit the sequences in which these sounds occur. We demonstrate these methods are good complements for each other and use their confusion matrices for further analysis. Our relatively simple i-vector and phonotactic fused system with recognition accuracy of 84.87% outperforms the i-vector fused results reported in literature, by 4.7%. Further analysis on distribution of British English accents has been carried out by analyzing the low dimensional representation of i-vector AID feature space.

Acoustic Model Selection for Recognition of Regional Accented Speech

Accent is cited as an issue for speech recognition systems \cite{huang2004accent}. Research has s... more Accent is cited as an issue for speech recognition systems \cite{huang2004accent}. Research has shown that accent mismatch between the training and the test data will result in significant accuracy reduction in Automatic Speech Recognition (ASR) systems. Using HMM based ASR trained on a standard English accent, our study shows that the error rates can be up to seven times higher for accented speech, than for standard English. Hence the development of accent-robust ASR systems is of significant importance.

This research investigates different acoustic modelling techniques for compensating for the effects of regional accents on the performance of ASR systems. The study includes conventional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and more contemporary Deep Neural Network (DNN)-HMM systems. In both cases we consider both supervised and unsupervised techniques. This work uses the WSJCAM0 corpus as a set of `accent neutral' data and accented data from the Accents of the British Isles (ABI) corpora.

Initially, we investigated a model selection approach, based on automatic accent identification (AID). Three AID systems were developed and evaluated in this work, namely i-vector, phonotactic, and ACCDIST-SVM. Each focuses on a different property of speech to achieve AID. We use two-dimensional projections based on Expectation Maximization-Principal Component Analysis (EM-PCA) and Linear Discriminative Analysis (LDA) to visualise the different accent spaces and use these visualisations to analyse the AID and ASR results.

In GMM-HMM based ASR systems, we show that using a small amount of data from a test speaker to select an accented acoustic model using AID, results in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. A possible objection to AID-based model selection is that in each accent there exist speakers who have varying degrees of accent, or whose accent exhibits properties of other accents. This motived us to investigated whether using an acoustic model created based on neighbouring speakers in the accent space can result in better performance. In conclusion, the maximum reduction in error rate achieved over all GMM-HMM based adaptation approaches is obtained by using AID to select an accent-specifc model followed by speaker adaptation. It is also shown that the
accuracy of an AID system does not have a high impact on the gain obtained by accent adaptation. Hence, in real time applications one can use a simple AID system for accented acoustic model selection.

Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior
performance compared to more traditional GMM-HMM systems, due to their discriminative learning ability. Our research confirms this by showing that a DNN-HMM system outperforms a GMM-HMM system, even after the latter has benefited from two stages of accent followed by speaker adaptation. We investigate the effect of adding different types of accented data to the baseline training set. The addition of data is either supervised or unsupervised, depending on whether the added data corresponds to the accent of the test speaker. Our results show that overall accuracy of the DNN-HMM system on accented data is maximized when either the accent diversity of the supplementary training data is highest, or data from the most `difficult' accent groups is included in the training set.

Finally, the performance of the baseline DNN-HMM system on accented data prompts an investigation of the accent characteristics of the WSJCAM0 corpus, which suggests that instead of being `neutral' it contains speech that exhibits characteristics of many of the accents in the ABI corpus.

Modelling Accents for Automatic Speech Recognition

EUSCIPCO 2015

Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Au... more Accent is cited as an issue for speech recognition systems. If they are to be widely deployed, Automatic Speech Recognition (ASR) systems must deliver consistently high performance across user populations. Hence the development of accent-robust ASR is of significant importance. This research investigates techniques for compensating for the effects of accents on performance of Hidden Markov Model (HMM) based ASR systems. Recently, HMM systems based on Deep Neural Networks (DNNs) have achieved superior performance to more traditional systems based on Gaussian Mixture Models (GMMs), due to the discriminative nature of DNNs. Our research confirms, this by showing that a DNN system outperforms the GMM system even after an accent-dependent acoustic model was selected using Accent Identification (AID), followed by speaker adaptation. The average performance of the DNN system over all accent groups is maximized when either accent diversity is highest , or data from " difficult " accent-groups is included in the training set. Index Terms: Multi-accent speech recognition, acoustic data selection, deep neural network 1. Research Summary In the 'accent of English' book by Wells an accent is defined as a " a pattern of pronunciation used by a speaker for whom English is the native language or, more generally, by the community or social grouping to which he or she belongs ". This differentiates accent from dialect, which includes the use of words or phrases that are characteristic of that community. It includes varieties of English spoken as a first language in different countries (for example, US vs Australian English), geographical variations within a country, and patterns of pronunciation associated with particular social or ethnic groups. The recent growth in applications of ASR systems forces developers to consider approaches that deliver consistent high performance across different accent groups. It is of high importance for those systems to be able to deal with accented speakers. Over the last decade, DNN based systems achieved superior accuracy compared to GMM based systems in many applications. This has been attributed to be due to their discriminative nature, and layer by layer invariant feature learning ability. This enhances their robustness against different sources of variation, for example accent. Various techniques, such as adding accent discriminative acoustic features, adding an accent-specific layer on top of a DNN acoustic model, accent-specific pronunciation adaptation, accent-specific polyphone decision trees, and acoustic model adaptation, were proposed in literature. In this research we investigate, whether it is better to train a simple DNN system on a multi-accented data and rely on its layer-wise discriminative learning to learn different accent patterns in the training data, or it is better to do GMM based accent plus speaker adaptation. In the DNN system, we explored the effect of the amount of pre-training and training data and their accent diversity on the final performance. In our GMM system , we construct an accent-dependent acoustic model for 14 different British accents, and use Accent Identification (AID) to select the model that corresponds to a new speaker. For each new speaker we do accent-dependent acoustic model adaptation , followed by the speaker adaptation. Results of our AID system shows that with 43s of speech an individual's accent can be determined with 81% accuracy using unsupervised AID (i-Vectors). Thus, a possible solution is to use AID for accent-dependent ASR model selection and then apply unsupervised speaker adaptation to the GMM-HMM system. In DNN-HMM systems it is possible to use this AID to analyse the accent diversity of the training data and investigates its effect on the final performance. In our experiments we extracted the adaptation and test data from the Accents of British Isles (ABI-1) corpus containing data from 14 different regions of British Isles. These regional accents fall into four groups namely, Northern English, Southern English, Irish and Scottish. We applied an AID system to explore the accent diversity of the WSJCAM0 speech corpus and realised that it consists of a range of Northern English, Southern English, Irish and Scot-tish accents. Using the WSJCAM0 corpus, we achieved a relative gain of 46.9% in recognizing the ABI-1 corpus by applying DNNs rather than GMMs for the acoustic modelling using the WSJCAM0 training data. This shows that DNN systems are better than GMMs in dealing with the multi-accented data. A clear effect of accent is evident in the performance of our GMM-HMM speech recognition systems, even after applying multiple acoustic model adaptation. We observed that the accuracy of this GMM system, even after applying Maximum A Posterior adaptation to 40 minutes of accent-specific data, followed by unsupervised speaker adaptation using Maximum Likelihood Linear Regression with 43s (7.3%WER) of data, could not match the accuracy of our baseline DNN system (6.9%WER). Despite, the major gain achieved by using the DNNs rather than GMMs in modelling the acoustic data, the effect of accent is still evident and the system doesn't perform uniformly well across all accent groups. Even in our best system (DNN-HMM) the percentage Word Error Rate (%WER) for the most challenging accent Glaswegian (13.34%) nearly 5 times bigger than that for standard southern English accent (2.84%). Adding an accent-specific layer on top of a multi-accent neural network acoustic model is one potential solution and will be addressed in our future work. Although the work targets British English it is very likely that the techniques described are applicable to accented speech in other languages.

Unsupervised Model Selection For Recognition Of Regional Accented Speech

Interspeech 2014

This paper is concerned with automatic speech recognition (ASR) for accented speech. Given a smal... more This paper is concerned with automatic speech recognition (ASR) for accented speech. Given a small amount of speech from a new speaker, is it better to apply speaker adaptation to the baseline, or to use accent identification (AID) to identify the speaker’s accent and select an accent-dependent acoustic model? Three accent-based model selection methods are inves- tigated: using the “true” accent model, and unsupervised model selection using i-Vector and phonotactic-based AID. All three methods outperform the unadapted baseline. Most significantly, AID-based model selection using 43s of speech performs bet- ter than unsupervised speaker adaptation, even if the latter uses five times more adaptation data. Combining unsupervised AID- based model selection and speaker adaptation gives an average relative reduction in ASR error rate of up to 47%.

Acoustic Model Selection Using Limited Data For Accent Robust Speech Recognition

by maryam najafian and Saeid Safavi

EUSIPCO 2014

This paper investigates techniques to compensate for the effects of regional accents of British E... more This paper investigates techniques to compensate for the effects of regional accents of British English on automatic speech recogni- tion (ASR) performance. Given a small amount of speech from a new speaker, is it better to apply speaker adaptation, or to use accent identification (AID) to identify the speaker’s accent followed by accent-dependent ASR? Three approaches to accent-dependent modelling are investigated: using the ‘correct’ accent model, choos- ing a model using supervised (ACCDIST-based) accent identifi- cation (AID), and building a model using data from neighbouring speakers in ‘AID space’. All of the methods outperform the accent- independent model, with relative reductions in ASR error rate of up to 44%. Using on average 43s of speech to identify an appro- priate accent-dependent model outperforms using it for supervised speaker-adaptation, by 7%.

Comparison of Speaker Verification Performance for Adult and Child Speech

by Saeid Safavi and maryam najafian

WOCCI 2014

Although speaker verification is an established area of speech technology, previous studies have ... more Although speaker verification is an established area of speech technology, previous studies have been restricted to adult speech. This paper investigates speaker verification for children’s speech, using the PF-STAR children’s speech corpus. A contemporary GMM-based speaker verification system, using MFCC features and maximum score normalization, is applied to adult and child speech at various bandwidths using comparable test and training material. The results show that the Equal Error Rate (EER) for child speech is almost four times greater than that for adults. A study of the effect of bandwidth on EER shows that for adult speaker verification, the spectrum can be conveniently partitioned into three frequency bands: up to 3.5-4kHz, which contains individual differences in the part of the spectrum due to primary vocal tract resonances, the region between 4kHz and 6kHz, which contains further speaker-specific information and gives a significant reduction in EER, and the region above 6kHz. These finding are consistent with previous research. For young children’s speech a similar pattern emerges, but with each region shifted to higher frequency values.

Speaker Recognition for Children’s Speech

by maryam najafian and Saeid Safavi

Interspeech 2012, Sep 2012

This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids... more This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids corpus and GMM-UBM and GMM-SVM SR systems. Regions of the spectrum containing important speaker information for children are identified by conducting SR experiments over 21 frequency bands. As for adults, the spectrum can be split into four regions, with the first (containing primary vocal tract resonance information) and third (corresponding to high- frequency speech sounds) being most useful for SR. However, the frequencies at which these regions occur are from 11% to 38% higher for children. It is also noted that sub- band SR rates are lower for younger children. Finally results are presented of SR experiments to identify a child in a class (30 children, similar age) and school (288 children, varying ages). Class performance depends on age, with accuracy varying from 90% for young children to 99% for older children. The identification rate achieved for a child in a school is 81%.

What's Happening In Accents & Dialects ?

modelling accents for automatic speech recognition to create better automated phone systems.

Maryam Najafian, School of Electronic, Electrical and Computer Engineering, College of Engineerin... more Maryam Najafian, School of Electronic, Electrical and Computer Engineering, College of Engineering and Physical Sciences finalist from the 2013 Birmingham Three Minute Thesis competition, delivers her three-minute presentation explaining her PhD research on modelling accents for automatic speech recognition to create better automated phone systems.