Papers by Dr. Sharada V. Chougule
Speech is the most natural means of communication for human beings and can be produced instantaneously when required. When treated as a signal, it is useful in a number of speech-related applications performed by machines, such as speech recognition, speaker recognition, and speech synthesis. The basic requirement for using speech in such applications is to analyze the speech signal and extract characteristics useful for the specific task. In this paper, frequency-domain methods for analyzing the speech signal are discussed along with their significance in specific applications. The properties and specific features that can be extracted from frequency-domain analysis are also described, together with the mathematical analysis behind them.
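As a rough illustration of the frequency-domain analysis described above (not code from the paper), the short-time magnitude spectrum of one windowed frame can be sketched as follows; the frame length, FFT size, and 500 Hz test tone are arbitrary choices for the demo:

```python
import numpy as np

def frame_spectrum(frame, n_fft=512):
    """Magnitude spectrum of one short-time frame (Hamming-windowed)."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed, n_fft))

# 25 ms frame of a 500 Hz tone sampled at 8 kHz
fs, f0, n_fft = 8000, 500, 512
t = np.arange(int(0.025 * fs)) / fs
spec = frame_spectrum(np.sin(2 * np.pi * f0 * t), n_fft)
peak_hz = np.argmax(spec) * fs / n_fft   # dominant frequency of the frame
```

Sliding this analysis over successive overlapping frames yields the short-time spectra on which the features discussed in these papers are built.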
This paper gives an overview of various methods and techniques used for feature extraction and modeling in speaker recognition. Research in speaker recognition has evolved from short-time features reflecting spectral properties of speech (low-level, or physical, traits) to high-level features (behavioural traits) such as prosody, phonetic information, and conversational patterns. Low-level acoustic information such as cepstral features has dominated, as these features give very low error rates (especially in quiet conditions), but they are more prone to error in noisy conditions. In this paper, various features, along with the modeling techniques used in speaker recognition, are discussed.
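One concrete step shared by the cepstral front ends surveyed above is the discrete cosine transform that turns log filter-bank energies into cepstral coefficients. A minimal sketch (a generic formulation, not any paper's implementation; the 26-band input size is an assumption):

```python
import numpy as np

def cepstra_from_fbank(log_energies, n_ceps=13):
    """Cepstral coefficients as the DCT-II of log filter-bank energies,
    the final step of an MFCC-style front end."""
    M = len(log_energies)
    n = np.arange(M)
    return np.array([np.dot(log_energies, np.cos(np.pi * k * (n + 0.5) / M))
                     for k in range(n_ceps)])

# A spectrally flat frame: all energy in c0, higher cepstra vanish
flat = cepstra_from_fbank(np.full(26, 3.0))
```

The flat-spectrum case shows why cepstra decorrelate the filter-bank energies: smooth spectral shapes compress into the first few coefficients.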
Automatic speech and speaker recognition technology is in growing demand for a variety of voice-operated devices. Although the input to all such systems is the speech signal, the features useful for each application/task differ. Among the different speech sounds, vowel sounds are spectrally well defined and well represented by formants. Formants, which represent resonances of the vocal tract, result from the physiology of the individual's speech production mechanism as well as from the nature of the speech (words) being spoken. In this way, formants are features of the speech as well as of the speaker. In this paper, the significance of formants for speech and speaker recognition is explored through experimental analysis. Formant tracking and estimation are done using an adaptive formant filter bank and a single-pole formant-based filter. Twelve vowel sounds represented in ARPABET (Advanced Research Projects Agency alphabet) form are used to estimate the first four formants. The analysis based on extracting and emphasizing speaker spec...
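The paper estimates formants with an adaptive formant filter bank; a common alternative, shown here only as an illustrative sketch, derives formant candidates from the roots of a linear prediction polynomial. The LPC order, frame length, and 700 Hz test resonance are assumptions for the demo, not values from the paper:

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction polynomial [1, -a1, ..., -ap]."""
    r = np.correlate(x, x, 'full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.concatenate(([1.0], -np.linalg.solve(R, r[1:])))

def formant_freqs(frame, fs, order=8):
    """Formant candidates: angles of complex LPC roots in the upper half-plane."""
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 1e-2]   # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

# A damped 700 Hz resonance stands in for a single vowel formant
fs = 8000
t = np.arange(400) / fs
frame = np.exp(-120 * t) * np.sin(2 * np.pi * 700 * t)
est = formant_freqs(frame, fs, order=2)
```

In practice a higher LPC order is used so that several formants (and spectral tilt) can be captured at once; root bandwidths then help separate true formants from spurious poles.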
In this paper, robust front-end features are proposed to improve speaker identification (SI) performance under real-world conditions, such as mismatch between training and testing conditions. The commonly used MFCC features are highly sensitive to effects such as channel and environment mismatch. The characteristics of speech change with room acoustics, channel, and microphone, as well as with background noise, which adversely affects the performance of the SI system. To make the front-end features robust, an asymmetric Hamming-cosine taper is used, which gives better spectral estimation and reduces interfering band-limited noise. To incorporate time-varying information, second-order derivatives of the cepstral coefficients are concatenated to the MFCC features. Convolutional errors are minimized using cepstral mean normalization (CMN), and compensation for additive noise is achieved by magnitude spectral subtraction (MSS). The performance of closed-set tex...
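The cepstral mean normalization and derivative-concatenation steps described above can be sketched generically as follows (a standard formulation, not the paper's exact recipe; the regression width and toy ramp features are assumptions):

```python
import numpy as np

def cmn(feats):
    """Cepstral mean normalization: remove each dimension's mean over all
    frames, suppressing stationary convolutional (channel) effects."""
    return feats - feats.mean(axis=0, keepdims=True)

def deltas(feats, width=2):
    """Regression-based time derivatives over +/- width frames (edges wrap)."""
    num = sum(k * (np.roll(feats, -k, axis=0) - np.roll(feats, k, axis=0))
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

feats = np.arange(20.0)[:, None].repeat(3, axis=1)      # toy frames x dims
augmented = np.hstack([cmn(feats), deltas(deltas(feats))])  # append delta-deltas
```

On the linear toy ramp the first derivative is constant and the second derivative vanishes away from the wrapped edges, which is exactly the behaviour the regression formula is meant to capture.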
International Journal of Image, Graphics and Signal Processing, 2017
Mismatch in speech data is one of the major reasons limiting the use of speaker recognition technology in real-world applications. Extracting speaker-specific features is a crucial issue in the presence of noise and distortion. The performance of a speaker recognition system depends on the characteristics of the extracted features. The devices used to acquire the speech, as well as the surrounding conditions in which it is collected, affect the extracted features and hence degrade recognition rates. In view of this, a feature-level approach is used to analyze the effect of sensor and environment mismatch on speaker recognition performance. The goal here is to investigate the robustness of segmental features under speech data mismatch and degradation. A set of features derived from filter bank energies, namely Mel Frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), Log Filter Bank Energies (LOGFBs), and Spectral Subband Centroids (SSCs), is used to evaluate robustness under mismatch conditions. A novel feature extraction technique, named Normalized Dynamic Spectral Features (NDSF), is proposed to compensate for sensor and environment mismatch. A significant enhancement in recognition results is obtained with the proposed feature extraction method.
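Of the feature sets listed, spectral subband centroids have the simplest closed form: the power-weighted mean frequency within each band. A minimal sketch under assumed band edges and FFT parameters (not the paper's configuration):

```python
import numpy as np

def subband_centroids(power_spec, band_edges_hz, fs, n_fft):
    """Spectral subband centroid (SSC): power-weighted mean frequency
    within each subband of one frame's one-sided power spectrum."""
    freqs = np.arange(len(power_spec)) * fs / n_fft
    cents = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = (freqs >= lo) & (freqs < hi)
        p = power_spec[band]
        cents.append(float(np.dot(freqs[band], p) / (p.sum() + 1e-12)))
    return np.array(cents)

spec = np.zeros(257)          # 512-point FFT, one-sided spectrum
spec[32] = 1.0                # lone peak at 32 * 8000 / 512 = 500 Hz
ssc = subband_centroids(spec, [0, 1000, 2000, 4000], fs=8000, n_fft=512)
```

Because centroids track where energy sits rather than how much there is, they tend to be less sensitive to broadband level changes than raw filter-bank energies.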
Procedia Computer Science, 2015
The widespread use of automatic speaker recognition technology in real-world applications demands robustness against various realistic conditions. In this paper, a robust spectral feature set, called NDSF (Normalized Dynamic Spectral Features), is proposed for automatic speaker recognition under mismatch conditions. Magnitude spectral subtraction is performed on the spectral features to compensate for additive noise. A spectral-domain modification is further performed using a time-difference approach, followed by a Gaussianization non-linearity. Histogram normalization is applied to these dynamic spectral features to compensate for the effect of channel mismatch and for some non-linear effects introduced by handset transducers. Feature extraction using the proposed features is carried out for a text-independent automatic speaker recognition (identification) system. The performance of the proposed feature set is compared with conventional cepstral features (mel-frequency cepstral coefficients and linear prediction cepstral coefficients) under the acoustic mismatch condition caused by the use of different sensors. Studies are performed on two databases: the multi-variability speaker recognition (MVSR) database developed by IIT Guwahati, and a multi-speaker continuous (Hindi) speech database (by the Department of Information Technology, Government of India). From the experimental analysis, it is observed that spectral-domain dynamic features enhance robustness by reducing additive noise and the channel effects caused by sensor mismatch. The proposed NDSF features are found to be more robust than cepstral features on both datasets.
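One plausible reading of the histogram normalization / Gaussianization step is a rank-based mapping of each feature dimension's empirical CDF onto a standard normal. This sketch is an assumption about the general technique, not the paper's code:

```python
import numpy as np
from statistics import NormalDist

def gaussianize(feats):
    """Rank-based histogram normalization: map each feature dimension's
    empirical CDF onto a standard normal distribution (2-D frames x dims)."""
    n = len(feats)
    nd = NormalDist()
    out = np.empty_like(feats, dtype=float)
    for d in range(feats.shape[1]):
        ranks = np.argsort(np.argsort(feats[:, d]))   # 0..n-1 rank per frame
        out[:, d] = [nd.inv_cdf((r + 0.5) / n) for r in ranks]
    return out

rng = np.random.default_rng(0)
skewed = rng.exponential(size=(200, 4))   # deliberately non-Gaussian input
mapped = gaussianize(skewed)
```

The mapping is monotonic per dimension, so frame ordering is preserved while arbitrary monotone channel distortions of each dimension are removed.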
2014 IEEE Global Conference on Wireless Computing & Networking (GCWCN), 2014
Speaker recognition under mismatched conditions has been a vital issue in recent years, as training and testing speech data can be distorted due to various practical conditions. Filter-bank-based cepstral features are used in many speech, speaker, and language recognition tasks. These filter banks are mainly designed according to auditory principles of speech processing, with variation in the shape of the filters and the localization of their frequencies (center, left, and right). The set of band-pass filters can capture information related to the human vocal tract, which is one of the main distinguishing characteristics of an individual. The non-uniform nature of a (e.g., mel-scale warped) filter bank may cause loss of information in high-frequency bands, which may carry speaker-specific information. Therefore, uniformly (linearly) spaced filter bank cepstral coefficients can capture speaker-specific information better, especially under mismatched conditions. In this paper, cepstral features derived from known psycho-acoustic filter banks (MFCCs and BFCCs) are compared with uniformly spaced filter bank cepstral coefficients (UFCCs) for the text-dependent and text-independent cases. Experimental results show that BFCCs are better in the text-dependent case, whereas UFCC features give better results than conventional MFCCs in the text-independent case under mismatch conditions. The results indicate that the nature of the filter bank plays an important role in extracting speaker-relevant features.
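The mel-warped versus uniform spacing contrasted above can be made concrete by comparing the center frequencies of the two filter-bank designs (standard mel-scale formula assumed; the filter count and band limits are arbitrary choices for the demo):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_centers(n_filters, f_lo, f_hi, scale="mel"):
    """Center frequencies of a triangular filter bank, spaced either
    uniformly on the mel scale or uniformly (linearly) in Hz."""
    if scale == "mel":
        pts = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi),
                                    n_filters + 2))
    else:
        pts = np.linspace(f_lo, f_hi, n_filters + 2)
    return pts[1:-1]   # drop the two outer edge points

mel_c = filter_centers(24, 0, 4000, "mel")
lin_c = filter_centers(24, 0, 4000, "linear")
```

The mel bank packs filters densely at low frequencies and spreads them out above, which is exactly the high-band coarseness the paper argues may discard speaker-specific detail; the linear bank keeps resolution constant across the band.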
Advances in Intelligent Systems and Computing, 2014
Over the years, MFCCs (Mel Frequency Cepstral Coefficients) have been used as a standard acoustic feature set for speech and speaker recognition. The models derived from these features give optimal speaker recognition performance when training and testing conditions are the same, but mismatch between training and testing conditions, and in the type of channel used for creating the speaker model, drastically degrades the performance of a speaker recognition system. In this experimental study, the performance of MFCCs for closed-set text-independent speaker recognition is examined under different training and testing conditions. Magnitude spectral subtraction is used to estimate the magnitude spectrum of clean speech from the additive-noise magnitude. The mel-warped cepstral coefficients are then normalized by subtracting their mean, referred to as cepstral mean normalization, to reduce the effect of convolutional noise caused by a change of channel between training and testing. The performance of these modified MFCCs has been tested using a multi-speaker continuous (Hindi) speech database (by the Department of Information Technology, Government of India). The improved MFCCs boost speaker recognition performance considerably compared to conventional MFCCs.
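Magnitude spectral subtraction as described above can be sketched in a few lines; the over-subtraction factor, spectral floor, and toy spectra are assumptions for illustration, not values from the paper:

```python
import numpy as np

def magnitude_spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from the noisy one,
    flooring the result at a small fraction of the noisy magnitude so that
    over-subtraction never produces negative magnitudes."""
    clean = noisy_mag - alpha * noise_mag
    return np.maximum(clean, floor * noisy_mag)

noisy = np.array([1.0, 2.0, 3.0])
noise = np.array([0.5, 0.5, 5.0])   # last bin deliberately over-estimated
clean = magnitude_spectral_subtraction(noisy, noise)
```

The floor is what keeps the method usable in practice: without it, bins where the noise estimate exceeds the observation would go negative and corrupt the subsequent log filter-bank stage.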
2015 International Conference on Computing, Communication and Security (ICCCS), 2015
The speech features used for speaker recognition should uniquely reflect the characteristics of the speaker's vocal tract apparatus and contain negligible information about the linguistic content of the speech. Cepstral features such as Linear Prediction Cepstral Coefficients (LPCCs) and Mel Frequency Cepstral Coefficients (MFCCs) are the most commonly used features for the speaker recognition task, but they are found to be sensitive to noise and distortion. Other complementary features initially used for speech recognition can also be useful for the speaker recognition task. In this work, Line Spectral Pair (LSP) features (derived from baseline linear predictive coefficients) are used for text-independent speaker identification. For LSP features, the power spectral density at any frequency tends to depend only on the LSPs close to that frequency; in contrast, for cepstral features, a change in a single parameter affects the whole spectrum. The goal here is to investigate the performance of LSP features against conventional cepstral features in the presence of acoustic disturbance. Experiments are carried out using the TIMIT and NTIMIT datasets to analyze performance under acoustic and channel distortions. It is observed that the LSP features perform comparably to conventional cepstral features on the TIMIT dataset and show enhanced identification results on the NTIMIT dataset.
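The locality property of LSPs noted above comes from the sum/difference polynomial construction. A sketch of LSP (line spectral frequency) extraction from LPC coefficients, using a small stable example polynomial chosen purely for illustration:

```python
import numpy as np

def lsp_from_lpc(a):
    """Line spectral frequencies: angles in (0, pi) of the roots of the
    sum polynomial  P(z) = A(z) + z^-(p+1) A(1/z)  and the difference
    polynomial      Q(z) = A(z) - z^-(p+1) A(1/z),
    where a = [1, a1, ..., ap] holds the LPC polynomial coefficients."""
    P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))
    Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))
    angles = []
    for poly in (P, Q):
        r = np.roots(poly)
        angles.extend(np.angle(r[np.imag(r) > 0]))   # one per conjugate pair
    # drop the trivial roots at z = +/-1 (angles 0 and pi)
    return np.array(sorted(w for w in angles if 0 < w < np.pi))

# Stable order-2 LPC polynomial A(z) = 1 - 1.2 z^-1 + 0.6 z^-2
lsf = lsp_from_lpc(np.array([1.0, -1.2, 0.6]))
```

For a minimum-phase A(z) the roots of P and Q lie on the unit circle and interlace, so the sorted angles form a well-ordered, locally sensitive parameterization of the LPC spectrum.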