39 22EC10057 Prasit
Prasit Mazumder
22EC10057
Group Number: 39
2nd October 2024
Experiment 04: Speech Recognition with Primarily
Temporal Cues
AIM
This experiment aims to clarify the relative importance of low-frequency temporal
patterns and spectral (frequency) content in speech perception. To do this, speech
signals are split into frequency bands with bandpass filters, each band is reduced to a
noise-modulated temporal envelope, and the intelligibility of the result is evaluated.
THEORY
Spectral features reveal phonetic content and vocal tract resonances, which are crucial
for speech recognition. Shannon et al. (1995) demonstrated that speech can be
comprehended with very little spectral information, provided the temporal envelope is
preserved. The temporal envelope, i.e. the slow variation of amplitude over time,
therefore carries much of the information needed for intelligibility.
Shannon et al. divided the speech signal into bands and modulated noise with the
temporal envelope of each band, preserving temporal cues but sacrificing spectral detail
within each band. Notably, as few as three noise-modulated envelope bands maintained
speech intelligibility. Additional bands improved intelligibility, although beyond eight
the gains were negligible.
To reproduce these results, we use bandpass filters to divide the speech signal into
1, 2, 3, 4, 8, and 16 bands. We then compare the temporal structure and intelligibility
of the resynthesised speech across these band counts to assess the relative significance
of temporal and spectral information in speech perception.
METHOD
1. Signal Processing Setup
The speech signal was processed to analyze the effects of different frequency bands on
intelligibility, involving filtering, envelope extraction, and noise modulation.
• Speech Signal: Sampled at 16 kHz, the signal contained frequency components
from 90 Hz to 5.76 kHz, covering six octaves essential for representing both low-
frequency (prosody and intonation) and high-frequency (phoneme recognition) ele-
ments.
• Filters: Fourth-order Butterworth bandpass filters were used, chosen for their
maximally flat passband response. The band edges for the 1, 2, 3, 4, 8, and 16 band
cases were logarithmically spaced between 90 Hz and 5.76 kHz, mirroring the roughly
logarithmic frequency resolution of human hearing. For example, the 2-band case
split the signal into 90 Hz to 720 Hz and 720 Hz to 5.76 kHz.
• Envelope Extraction: The Hilbert transform was applied to each band to ex-
tract the temporal envelope, which was low-pass filtered at 240 Hz to retain slow
amplitude fluctuations critical for speech rhythm and intelligibility.
• Noise Modulation: The extracted envelope modulated white Gaussian noise,
which was filtered through the same bandpass filters to match the speech spectral
content. This modulated noise was summed across all bands to create versions of
the speech signal for intelligibility testing.
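The logarithmic band spacing described in the list above can be sketched as follows. This is a minimal illustration, not the listing used in the experiment; the names fLow, fHigh, nBands, and edges are assumptions introduced here.

```matlab
% Sketch: logarithmically spaced band edges between 90 Hz and 5.76 kHz.
% fLow, fHigh, nBands and edges are illustrative names, not from the report.
fLow   = 90;      % lower edge of the analysis range (Hz)
fHigh  = 5760;    % upper edge of the analysis range (Hz), six octaves above fLow
nBands = 2;       % try 1, 2, 3, 4, 8 or 16

% Every band spans the same frequency ratio, (fHigh/fLow)^(1/nBands).
k     = 0:nBands;
edges = fLow * (fHigh / fLow) .^ (k / nBands);

for b = 1:nBands
    fprintf('Band %d: %.1f Hz to %.1f Hz\n', b, edges(b), edges(b + 1));
end
```

With nBands = 2 the edges come out as 90 Hz, 720 Hz, and 5760 Hz, which matches the 2-band example given in the method.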
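% x, t and Fs (speech samples as a row vector, time axis, sampling rate) are defined
% earlier; bpf(order, fLow, fHigh, Fs, signal) is the bandpass helper used in this report.
% Note: the 240 Hz envelope low-pass described above is not applied in this listing.
w1 = 90;                         % lower edge of the current band (Hz)
w2 = 5760;                       % upper limit of the analysis range (Hz)
number = 16;                     % number of bands
ratio = (w2 / w1)^(1 / number);  % edge ratio per band, equal to 64^(1/number)
sig = zeros(1, length(t));       % accumulator for the summed output
for i = 1:number
    y(i,:) = bpf(2, w1, w1 * ratio, Fs, x);     % band-limited speech
    env(i,:) = abs(hilbert(y(i,:)));            % temporal envelope of the band
    noise = transpose(wgn(length(t), 1, 1));    % white Gaussian noise as a row vector
    n = bpf(2, w1, w1 * ratio, Fs, noise);      % noise limited to the same band
    sig = sig + env(i,:) .* n;                  % add the envelope-modulated noise
    fprintf("%.1f ", w1);                       % print the lower band edge
    w1 = w1 * ratio;                            % move on to the next band
end
subplot(212)
plot(t, sig);
% audiowrite clips floating-point samples outside [-1, 1]; rescale sig first if needed.
audiowrite('fivewo_op_16band.wav', sig, Fs);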
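The listing calls a helper bpf whose definition is not included in this report. One plausible form, assuming it wraps butter and filter from the Signal Processing Toolbox, is sketched below, together with a hypothetical smoothEnvelope helper for the 240 Hz envelope low-pass described in the method. Both functions are assumptions made for illustration and may differ from what was actually used; the test tone stands in for the recorded speech.

```matlab
% Demo on a synthetic tone; local functions follow the script body
% (functions inside a script need MATLAB R2016b or later).
Fs = 16000;                                        % sampling rate (Hz)
t  = (0:Fs-1) / Fs;                                % one second of samples
x  = sin(2*pi*440*t) .* (1 + 0.5*sin(2*pi*4*t));   % 440 Hz tone with 4 Hz tremolo

y   = bpf(2, 90, 720, Fs, x);        % lower band of the 2-band case
env = smoothEnvelope(y, Fs, 240);    % slow amplitude contour of that band

function y = bpf(n, fLo, fHi, Fs, x)
    % Hypothetical reconstruction of the report's helper.
    % butter(n, ..., 'bandpass') returns a filter of order 2n, so n = 2
    % gives the fourth-order bandpass described in the method.
    [b, a] = butter(n, [fLo fHi] / (Fs / 2), 'bandpass');
    y = filter(b, a, x);
end

function env = smoothEnvelope(y, Fs, fc)
    % Hilbert envelope followed by a zero-phase low-pass at fc Hz.
    % The second-order low-pass is an assumed choice.
    [b, a] = butter(2, fc / (Fs / 2), 'low');
    env = filtfilt(b, a, abs(hilbert(y)));
end
```

Using filtfilt rather than filter for the envelope keeps the smoothed envelope time-aligned with the band signal, at the cost of doubling the effective filter order.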
DISCUSSION
• For N = 1 or 2 (where N denotes the number of bands), the speech in the
supplied audio clip was unintelligible, although its rhythmic pattern was
perceptible. Speech became recognisable from N = 3.
• The voice was distinctly identifiable with N = 8 and N = 16 bands, while the
clarity at N = 16 was similar to that at N = 8.
• As the number of bands rises, the time-domain waveform of the output
progressively resembles the original, and its spectrum approaches the
spectrum of the original audio signal.
• The spacing between bands is critical for accurately decoding the audio signal.
• Most of the energy of the human voice lies at low frequencies. Using narrower,
more closely spaced bands preserves more of the speech information.
Consequently, increasing the number of bands improves the clarity of the
output voice.
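The frequency-domain comparison noted in the discussion can be made with a one-sided magnitude spectrum. The sketch below uses a synthetic 1 kHz tone in place of the original or processed speech; magSpectrum is a hypothetical helper introduced here and is not part of the report's code.

```matlab
Fs = 16000;                          % sampling rate (Hz)
t  = (0:Fs-1) / Fs;                  % one second of samples
x  = sin(2*pi*1000*t);               % stand-in for the speech or processed signal

[f, mag] = magSpectrum(x, Fs);
[~, iPeak] = max(mag);
fprintf('Spectral peak at %.0f Hz\n', f(iPeak));
% plot(f, 20*log10(mag + eps)) would display the spectrum in dB.

function [f, mag] = magSpectrum(x, Fs)
    % One-sided magnitude spectrum of a real signal (assumes an even length).
    N   = numel(x);
    X   = abs(fft(x)) / N;           % two-sided magnitude, scaled by length
    nh  = floor(N / 2) + 1;          % bins from DC up to Nyquist
    mag = X(1:nh);
    mag(2:end-1) = 2 * mag(2:end-1); % fold negative frequencies into positive ones
    f   = (0:nh-1) * Fs / N;         % frequency axis (Hz)
end
```

Calling magSpectrum on the original signal and on each processed output, then overlaying the curves, gives a direct view of how the spectrum changes with the number of bands.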
CONCLUSION
The results of this experiment support the findings of Shannon et al. (1995), who
showed that intelligible speech can be reconstructed with little spectral information,
provided that temporal cues are preserved. Our observations indicate that a minimum of
three frequency bands suffices for intelligibility, while eight bands are required for
near-perfect speech recognition.