

Dept. of Electronics and Electrical Communication Engineering
Indian Institute of Technology Kharagpur

DSP Laboratory (EC39201)

Prasit Mazumder
22EC10057

Group Number: 39
2nd October 2024

Experiment 04: Speech Recognition with Primarily
Temporal Cues

AIM
This experiment aims to deepen our understanding of the relative importance of low-frequency
temporal patterns and spectral content in speech perception. To do so, speech signals are
modified with bandpass filters and their intelligibility is then evaluated.

THEORY
Spectral features reveal phonetic content and vocal tract resonances, which are crucial
for speech recognition. Shannon et al. (1995) demonstrated that speech can be understood
with very little spectral information, provided the temporal envelope is preserved: the
amplitude fluctuations of the temporal envelope strongly influence speech intelligibility.
Shannon et al. segmented the speech stream into frequency bands and used the temporal
envelope of each band to modulate noise, preserving temporal cues while discarding spectral
detail. Strikingly, as few as three noise-modulated envelope bands maintained speech
intelligibility. Additional bands further improved intelligibility, although beyond eight
the gains were negligible.
To reproduce these results, we will use bandpass filters to delineate frequency ranges
and analyse speech over 1, 2, 3, 4, 8, and 16 bands. We will compare temporal structure
and speech intelligibility across these band counts to assess the relative significance
of temporal and spectral information in speech perception.
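The band divisions used throughout can be computed directly: the analysis range of 90 Hz to 5.76 kHz spans six octaves (a factor of 64), which is split into N logarithmically spaced bands. A small Python sketch (illustrative only; the function name `band_edges` is ours, not part of the original MATLAB code):

```python
# Logarithmically spaced band edges over the 90 Hz - 5760 Hz (six-octave) range.
def band_edges(n_bands, f_lo=90.0, f_hi=5760.0):
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)   # 64^(1/N) for the six-octave span
    return [f_lo * ratio**i for i in range(n_bands + 1)]

# Two-band case: the signal splits at 720 Hz.
print(band_edges(2))   # [90.0, 720.0, 5760.0]
```

Note that equal spacing on a log axis makes the low-frequency bands much narrower in hertz, mirroring human auditory sensitivity.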

METHOD
1. Signal Processing Setup
The speech signal was processed to analyze the effects of different frequency bands on
intelligibility, involving filtering, envelope extraction, and noise modulation.
• Speech Signal: Sampled at 16 kHz, the signal contained frequency components
from 90 Hz to 5.76 kHz, covering six octaves essential for representing both low-
frequency (prosody and intonation) and high-frequency (phoneme recognition) ele-
ments.
• Filters: Fourth-order Butterworth bandpass filters were used, chosen for their flat
frequency response and steep roll-off. The frequency ranges for the 1, 2, 3, 4, 8,
and 16 bands were logarithmically spaced between 90 Hz and 5.76 kHz, mirroring
human auditory sensitivity to lower frequencies. For example, the 2-band case
split the signal into 90 Hz to 720 Hz and 720 Hz to 5.76 kHz.
• Envelope Extraction: The Hilbert transform was applied to each band to ex-
tract the temporal envelope, which was low-pass filtered at 240 Hz to retain slow
amplitude fluctuations critical for speech rhythm and intelligibility.

• Noise Modulation: The extracted envelope modulated white Gaussian noise,
which was filtered through the same bandpass filters to match the speech spectral
content. This modulated noise was summed across all bands to create versions of
the speech signal for intelligibility testing.
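The per-band processing described above can be sketched in Python as a rough equivalent of the MATLAB pipeline (using SciPy; the helper name `vocode` and the use of zero-phase `filtfilt` are our choices, not taken from the original code):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def vocode(x, fs, n_bands, f_lo=90.0, f_hi=5760.0, env_cut=240.0):
    """Noise-vocode x: per band, extract the Hilbert envelope,
    low-pass it at env_cut Hz, and modulate band-matched noise."""
    rng = np.random.default_rng(0)
    # Logarithmically spaced band edges over the six-octave range.
    edges = f_lo * (f_hi / f_lo) ** (np.arange(n_bands + 1) / n_bands)
    b_env, a_env = butter(4, env_cut / (fs / 2), 'low')
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], 'bandpass')
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))           # temporal envelope of the band
        env = filtfilt(b_env, a_env, env)     # keep slow fluctuations (< 240 Hz)
        noise = filtfilt(b, a, rng.standard_normal(len(x)))
        out += env * noise                    # envelope-modulated, band-limited noise
    return out
```

For example, `vocode(x, 16000, 3)` produces the three-band condition from a speech vector sampled at 16 kHz.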

2. MATLAB Code for Speech Signal Processing


clear all;

% info = audioinfo('fivewo.wav');
[x, Fs] = audioread('fivewo.wav');
t = 1:length(x);
subplot(211);
plot(t, x);

w1 = 90;                 % lower edge of the first band (Hz)
w2 = 5760;               % upper edge of the analysis range (Hz)
number = 16;             % number of bands
sig = zeros(1, length(t));
for i = 1:number
    % band-limit the speech and extract its Hilbert envelope
    y(i,:) = bpf(2, w1, w1*64^(1/number), Fs, x);
    env(i,:) = abs(hilbert(y(i,:)));
    % modulate band-limited white Gaussian noise with the envelope
    noise = transpose(wgn(length(t), 1, 1));
    n = bpf(2, w1, w1*64^(1/number), Fs, noise);
    sig = sig + env(i,:) .* n;
    fprintf("%d ", w1);
    w1 = w1*64^(1/number);   % step to the next logarithmically spaced band
end
subplot(212);
plot(t, sig);
audiowrite('fivewo_op_16band.wav', sig, Fs);

% Local functions must follow the script body in MATLAB.
function y = bpf(n, w1, w2, fs, x)
    % nth-order design; the 'bandpass' transform doubles the effective order
    [b, a] = butter(n, [w1, w2]/(fs/2), 'bandpass');
    y = filter(b, a, x);
end

3. Output Audio Signals


All the processed output audio signals from the 1 band to 16 bands cases can be accessed
through the following Google Drive link:
https://drive.google.com/drive/folders/17NlRCLGrLN3V6x8pUG793FbW9fxi6WSp?usp=sharing

DISCUSSION
• For N = 1 or 2 (where N denotes the number of bands), the speech in the output audio
was unintelligible, although its rhythmic pattern was still perceptible. Recognition
improved from N = 3 onward.
• The speech was clearly identifiable with N = 8 and N = 16 bands, and the clarity at
N = 16 was similar to that at N = 8.

• As the number of bands rises, the time-domain representation of the signal progressively
approaches the original, and its spectrum starts to resemble the true spectrum of the
audio.
• The spacing between bands is critical for accurately reconstructing the audio signal.
• Most of the energy in human speech lies in the low-frequency range. Applying narrower
filters with closely spaced bands there captures more of the speech information, so
increasing the number of bands improves the clarity of the output speech.
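The point about band spacing can be made concrete: with logarithmic spacing, the low-frequency bands are far narrower in hertz than the high-frequency ones, concentrating resolution where most speech energy lies. A quick Python check (band edges computed as in the Method section):

```python
# Bandwidths (Hz) of 8 logarithmically spaced bands over 90-5760 Hz.
n = 8
edges = [90.0 * (5760.0 / 90.0) ** (i / n) for i in range(n + 1)]
widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
print(round(widths[0], 1), round(widths[-1], 1))
# lowest band is roughly 61 Hz wide; the highest spans over 2.3 kHz
```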

Final Audio Spectrum Plot

Figure 1: Final Audio Spectrum for Different Band Numbers

CONCLUSION
The results of this experiment support the findings of Shannon et al. (1995), who showed
that intelligible speech can be reconstructed with very little spectral information,
provided that temporal cues are preserved. Our observations indicate that a minimum of
three frequency bands suffices for basic intelligibility, while about eight bands yield
near-perfect speech recognition.
