We have previously developed a Fishervoice framework that maps the JFA-mean supervectors into a compressed discriminant subspace using nonparametric Fisher's discriminant analysis. It was shown that performing cosine distance scoring (CDS) on these Fishervoice-projected vectors (denoted as f-vectors) can outperform classical joint factor analysis. Unlike the i-vector approach, in which channel variability is suppressed in the classification stage, in the Fishervoice framework channel variability is suppressed when the f-vectors are constructed. In this paper, we investigate whether channel variability can be further suppressed by performing Gaussian probabilistic linear discriminant analysis (PLDA) in the classification stage. We also use random subspace sampling to enrich the speaker-discriminative information in the f-vectors. Experiments on NIST SRE10 show that PLDA can significantly boost the performance of Fishervoice in speaker verification, with a relative decrease of 14.4% in minDCF (from 0.526 to 0.450).
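Cosine distance scoring between two speaker vectors reduces to a normalized dot product. As a minimal illustration (not the paper's full pipeline, which first projects supervectors into the Fishervoice subspace), a sketch in NumPy:

```python
import numpy as np

def cosine_distance_score(enroll_vec, test_vec):
    """Cosine distance scoring (CDS): the score is the cosine of the
    angle between two speaker vectors; higher means the two utterances
    are more likely from the same speaker."""
    enroll_vec = np.asarray(enroll_vec, dtype=float)
    test_vec = np.asarray(test_vec, dtype=float)
    return float(np.dot(enroll_vec, test_vec) /
                 (np.linalg.norm(enroll_vec) * np.linalg.norm(test_vec)))

# A trial is accepted when the score exceeds a tuned threshold.
same = cosine_distance_score([1.0, 0.5], [2.0, 1.0])   # parallel vectors -> 1.0
diff = cosine_distance_score([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors -> 0.0
```

Because the score depends only on direction, length normalization of the vectors is implicit, which is one reason CDS pairs well with discriminant projections.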
In this paper, we propose an integration of random subspace sampling and Fishervoice for speaker verification. In the previous random sampling framework [1], we randomly sample the JFA feature space into a set of low-dimensional subspaces. For every random subspace, we use Fishervoice to model the intrinsic vocal characteristics in a discriminant subspace. The complex speaker characteristics are modeled through multiple subspaces. Through a fusion rule, we form a more powerful and stable classifier that can preserve most of the discriminative information. In many cases, however, random subspace sampling may discard too much useful discriminative information in a high-dimensional feature space. Instead of increasing the number of random subspaces or using more complex fusion rules, which increase system complexity, we attempt to improve the performance of each individual weak classifier. Hence, we propose to investigate the integration of random subspace sampling with the Fishervoice approach…
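The random subspace idea can be sketched as sampling index sets from the full feature space and fusing per-subspace scores. The helper names and the mean-fusion rule below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspaces(dim, subspace_dim, num_subspaces):
    """Sample index sets that define low-dimensional random subspaces
    of a dim-dimensional feature space."""
    return [rng.choice(dim, size=subspace_dim, replace=False)
            for _ in range(num_subspaces)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(enroll, test, subspaces):
    """Score each weak classifier (one per subspace) independently,
    then fuse by averaging -- a simple stand-in for the fusion rule."""
    return float(np.mean([cosine(enroll[idx], test[idx])
                          for idx in subspaces]))

enroll = rng.standard_normal(100)
subs = random_subspaces(dim=100, subspace_dim=20, num_subspaces=10)
self_score = fused_score(enroll, enroll, subs)   # identical vectors score 1.0
```

Averaging many weak per-subspace scores stabilizes the decision, which is the motivation for the multiple-subspace ensemble described above.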
Our ongoing work that applies Fishervoice to map joint factor analysis (JFA)-mean supervectors into a compressed discriminant subspace has shown that performing cosine distance scoring on the Fishervoice-projected vectors outperforms classical JFA. In this paper, we refine Fishervoice for low-dimensional i-vectors by using only the nonparametric between-class scatter matrix to substitute for the parametric one in linear discriminant analysis (LDA). The task of the 2016 speaker recognition evaluation (SRE16) only has unlabeled in-domain training data and labeled out-of-domain training data for model training. Support vector machine (SVM) scoring can capture the discriminative information embedded in the unlabeled in-domain training data. We perform probabilistic linear discriminant analysis (PLDA) before SVM scoring for inter-session compensation with speaker-label information from the out-of-domain training data. This approach constitutes CUHK's submission for SRE16. In this paper, we present…
Linear discriminant analysis (LDA) is an effective and widely used discriminative technique for speaker verification. However, it only utilizes information on the global structure to perform classification. Some variants of LDA, such as local pairwise LDA (LPLDA), have been proposed to preserve more information on the local structure in the linear projection matrix. However, considering that the local structure may vary substantially across regions, summing up related components to construct a single projection matrix may not be sufficient. In this paper, we present a speaker-aware strategy that preserves distinct local-structure information in a set of linear discriminant projection matrices and allocates them to different local regions for dimension reduction and classification. Experiments on NIST SRE2010 and NIST SRE2016 show that the speaker-aware strategy can boost the performance of both LDA and LPLDA backends in i-vector systems and x-vector systems.
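Parametric LDA, which the local variants build on, solves an eigenproblem on the within- and between-class scatter matrices. A minimal NumPy sketch (the ridge term and the toy dimensions are assumptions added for numerical stability, not part of the method as published):

```python
import numpy as np

def lda_projection(X, y, out_dim):
    """Standard (parametric) LDA: maximize between-class scatter S_b
    relative to within-class scatter S_w via the top eigenvectors of
    S_w^{-1} S_b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)        # between-class scatter
    # small ridge keeps S_w invertible on toy data (an assumption here)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:out_dim]]

# two classes separated along the first axis
X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])
y = np.array([0, 0, 1, 1])
W = lda_projection(X, y, out_dim=1)
proj = X @ W
```

The speaker-aware strategy in the abstract replaces this single projection with a set of region-specific ones; the eigenproblem per region has the same shape.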
Recently, adversarial attacks on automatic speaker verification (ASV) systems have attracted widespread attention, as they pose severe threats to ASV systems. However, methods to defend against such attacks are limited. Existing approaches mainly focus on retraining ASV systems with adversarial data augmentation. Moreover, countermeasure robustness against different attack settings is insufficiently investigated. Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training. A VGG-like binary classification detector is introduced and demonstrated to be effective at detecting adversarial samples. To investigate detector robustness in a realistic defense scenario where unseen attack settings may exist, we analyze the impact of various kinds of unseen attack settings and observe that the detector is robust against unseen substitute ASV systems (6.27% EER_det degradation in the worst case), but weak against unseen perturbation methods (50.37% EER_det degradation in the worst case). The weak robustness against unseen perturbation methods points to a direction for developing stronger countermeasures.
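The separate-detector idea can be illustrated with the VGG-like CNN swapped for a plain logistic-regression classifier; the feature layout and training hyperparameters below are assumptions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_detector(X, y, lr=0.1, steps=500):
    """Train a binary detector (1 = adversarial, 0 = genuine) by
    logistic regression with full-batch gradient descent -- a minimal
    stand-in for the paper's VGG-like detection network."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        g = p - y                               # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# toy features: genuine samples near 0, adversarial samples shifted
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train_detector(X, y)
accuracy = float((((X @ w + b) > 0) == y).mean())
```

The key design point matches the abstract: the detector is trained apart from the ASV system, so the ASV model itself never sees adversarial data.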
Odyssey 2020: The Speaker and Language Recognition Workshop
Speaker verification systems usually suffer from mismatch between training and evaluation data, such as speaker population mismatch and channel and environment variations. Addressing this issue requires the system to generalize well to unseen data. In this work, we incorporate Bayesian neural networks (BNNs) into the deep neural network (DNN) x-vector speaker verification system to improve the system's generalization ability. With the weight-uncertainty modeling provided by BNNs, we expect the system to generalize better on the evaluation data and make verification decisions more accurately. Our experimental results indicate that the DNN x-vector system benefits from BNNs, especially when the mismatch problem is severe, as in evaluations on out-of-domain data. Specifically, results show that the system benefits from BNNs by relative EER decreases of 2.66% and 2.32% for short- and long-utterance in-domain evaluations, respectively. Additionally, the fusion of DNN x-vector and Bayesian x-vector systems achieves further improvement. Moreover, out-of-domain evaluations, e.g. models trained on Voxceleb1 and evaluated on the NIST SRE10 core test, suggest that BNNs can bring a larger relative EER decrease of around 4.69%. Index terms: speaker verification, Bayesian neural network, DNN x-vector, uncertainty modelling
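Weight-uncertainty modeling amounts to drawing network weights from a learned posterior and averaging the resulting predictions. A single-layer Monte Carlo sketch (the factorized-Gaussian posterior and the sample count are assumptions; the paper applies this inside an x-vector DNN):

```python
import numpy as np

rng = np.random.default_rng(1)

def bayesian_linear_predict(x, w_mean, w_logstd, n_samples=50):
    """Predict with weight uncertainty: draw weight samples from a
    factorized Gaussian posterior N(w_mean, exp(w_logstd)^2) and average
    the outputs (a Monte Carlo estimate of the predictive mean)."""
    std = np.exp(w_logstd)
    outs = []
    for _ in range(n_samples):
        w = w_mean + std * rng.standard_normal(w_mean.shape)
        outs.append(x @ w)
    return np.mean(outs, axis=0)

x = np.ones(3)
w_mean = np.array([1.0, 2.0, 3.0])
w_logstd = np.full(3, -10.0)            # near-zero posterior variance
pred = bayesian_linear_predict(x, w_mean, w_logstd)
```

With a near-deterministic posterior the prediction collapses to the point-estimate output `x @ w_mean`; widening the posterior is what gives the averaged prediction its regularizing, generalization-friendly behavior.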
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 1, 2020
This work investigates the vulnerability of Gaussian mixture model (GMM) i-vector based speaker verification systems to adversarial attacks, and the transferability of adversarial samples crafted from GMM i-vector systems to x-vector systems. In detail, we formulate the GMM i-vector system as a scoring function over enrollment and testing utterance pairs. We then leverage the fast gradient sign method (FGSM) to optimize testing utterances for adversarial sample generation. These adversarial samples are used to attack both GMM i-vector and x-vector systems. We measure system vulnerability by the degradation of the equal error rate and the false acceptance rate. Experimental results show that GMM i-vector systems are seriously vulnerable to adversarial attacks, and the crafted adversarial samples prove to be transferable and pose threats to neural-network speaker-embedding based systems (e.g. x-vector systems).
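FGSM perturbs the input by a fixed step in the sign direction of the objective's gradient. A toy sketch with an analytic gradient (the quadratic similarity objective here stands in for the actual ASV scoring function, which the paper differentiates through):

```python
import numpy as np

def fgsm_perturb(x, grad_fn, eps):
    """One FGSM step: move every component of x by eps in the sign
    direction of the gradient of the attack objective."""
    return x + eps * np.sign(grad_fn(x))

# toy objective: increase similarity to (i.e. reduce distance from) a
# target enrollment vector; the gradient of -0.5*||x - target||^2 is target - x
target = np.array([1.0, -1.0, 2.0])
grad_fn = lambda x: target - x

x = np.zeros(3)
adv = fgsm_perturb(x, grad_fn, eps=0.1)
# adv is strictly closer to the target than the clean input x was
```

Because the step size is bounded by `eps` per component, the perturbation stays small while still moving the score in the attacker's favor, which is what makes the crafted samples hard to notice yet transferable.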
Developing a voice conversion (VC) system for a particular speaker typically requires considerable data from both the source and target speakers. This paper aims to effectuate VC across arbitrary speakers, which we call any-to-any VC, with only a single target-speaker utterance. Two systems are studied: (1) the i-vector-based VC (IVC) system and (2) the speaker-encoder-based VC (SEVC) system. Phonetic PosteriorGrams are adopted as speaker-independent linguistic features extracted from speech samples. Both systems train a multi-speaker deep bidirectional long short-term memory (DBLSTM) VC model, taking in additional inputs that encode speaker identities in order to generate the outputs. In the IVC system, the speaker identity of a new target speaker is represented by i-vectors. In the SEVC system, the speaker identity is represented by a speaker embedding predicted from a separately trained model. Experiments verify the effectiveness of both systems in achieving VC based on only a single target-speaker utterance. Furthermore, the IVC approach is superior to SEVC in terms of the quality of the converted speech and its similarity to utterances produced by the genuine target speaker.
We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g. connected digit strings. Text-constrained verification, owing to its smaller, limited vocabulary, can deliver better performance than text-independent verification for a short utterance. We study the problem with a "phonetically aware" deep neural net (DNN) and its capability for "stochastic phonetic alignment" in constructing supervectors and estimating the corresponding i-vectors on two speech databases: a large-vocabulary, conversational, speaker-independent database (Fisher) and a small-vocabulary, continuous-digit database (RSR2015 Part III). The phonetic alignment efficiency and the resultant speaker verification performance are compared across differently sized senone sets that characterize the phonetic pronunciations of utterances in the two databases. Performance on the RSR2015 Part III evaluation shows relative EER improvements of 7.89% for male speakers and 3.54% for female speakers with only digit-related senones. DNN bottleneck features were also studied to investigate their capability of extracting phonetically sensitive information, which is useful for text-independent or text-constrained speaker verification. We found that by combining MFCCs with bottleneck features in tandem, EERs can be further reduced.
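The tandem front end mentioned above is frame-wise concatenation of the two feature streams; the dimensions in this sketch are illustrative, not the paper's actual feature sizes:

```python
import numpy as np

def tandem_features(mfcc, bottleneck):
    """Frame-wise concatenation of MFCC and DNN bottleneck features,
    the common 'tandem' front end: both streams must share the same
    number of frames."""
    assert mfcc.shape[0] == bottleneck.shape[0]
    return np.hstack([mfcc, bottleneck])

# e.g. 100 frames of 39-dim MFCCs plus 40-dim bottleneck features
frames = tandem_features(np.zeros((100, 39)), np.zeros((100, 40)))
```

The combined stream then feeds the same i-vector extractor as the MFCCs alone, which is why the fusion requires no change to the back end.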
Papers by Jinghua Zhong