Asian Journal of Convergence in Technology
Issn No.:2350-1146, I.F-2.71
Volume II, Issue III
Speech Signal Analysis: A frequency domain approach
Sharada V Chougule,
Department of Electronics and Telecommunication Engg.
Finolex Academy of Management and Technology, Ratnagiri
Maharashtra, India
[email protected]
Abstract - Speech is the most natural means of
communication for human beings. It can be produced
instantaneously when required. When treated as a signal, it is
useful in a number of speech-related applications performed
by machines, such as speech recognition, speaker recognition and
speech synthesis. The basic requirement for using speech in
these applications is to analyze the speech signal and extract
the characteristics that are useful for the specific task. In
this paper, frequency domain methods for analyzing the speech
signal are discussed along with their significance in specific
applications. The properties or specific features that can be
extracted from frequency domain analysis are also described,
together with the mathematical analysis behind them.
Index Terms – Speech recognition, Speaker recognition,
Speech synthesis
Mahesh S Chavan
Department of Electronics Engg.
KIT’s College of Engg.Kolhapur
Maharashtra, India
[email protected]
language. The lungs, vocal folds and vocal tract are the three main
groups of speech production organs. While speaking, the lungs
act as an energy source, forcing air under pressure
through the trachea (wind-pipe) towards the vocal folds. The
tensed vocal folds within the larynx (a complex system of
cartilages, muscles and ligaments) are set into vibration by this
air pressure (Bernoulli oscillation). The vocal folds
chop the airflow into quasi-periodic pulses (during voiced
sounds), which are “spectrally shaped” by the vocal tract. The
vocal tract starts at the larynx and ends at the lips and the
nasal cavity. The positions of the various articulators, such as the jaw,
tongue and lips, determine the type of sound produced. Fig.1
shows the elements of the human speech production organs and
the nature of the spectral content at each stage.
I. INTRODUCTION
Human speech is a multi-disciplinary area spanning
communication, linguistics and computer science. Though it
occurs naturally and easily, speech is complicated in nature
and carries a variety of information. Along with conveying
the actual message, it also indirectly conveys the
language, emotion and identity of the speaker. All
this information is embedded in the speech signal in a very
complex way. Speech signal analysis is necessary to
understand the nature and characteristics of the signal and is an integral
part of any speech-related application. Speech analysis can be
performed in the time domain as well as in the frequency domain.
The structure of the speech production organs can be
examined, analyzed and modeled better through spectral analysis
than through time domain parameters. In this paper, various
approaches to the frequency domain analysis of the speech
signal are discussed along with their pros and cons. The
remainder of the paper is organized as follows: Section II describes
the structure of the human speech production organs. In
Section III, the source-filter model of speech production is
discussed. Sections IV and V present various approaches
to the frequency domain analysis of the speech signal. The
paper ends with the conclusion in Section VI.
II. HUMAN SPEECH PRODUCTION MECHANISM
Speech is a sequence of sounds intended to convey some
message to the listener. Speech production begins with the
formation of ideas in the speaker’s mind, which are represented in
the form of words and sentences by applying the rules of the
www.asianssr.org
Special issues in Communication
Fig.1. Speech production organs and subsequent spectrum analogy [2]
The airflow velocity at the vocal folds results in periodic
oscillations representing the time-varying area of the glottis (the
slit-like orifice between the two folds). The duration of one
glottal cycle (T0) is referred to as the pitch period, and the reciprocal
of the pitch period is called the fundamental frequency (F0 = 1/T0). Such
periodic oscillations are generally observed during vowel
sounds, which may have one to four pitch periods over the
duration of the sound [1]. The function of the vocal tract is to
generate perceptually distinct sounds by varying the positions
of the various articulators in the oral and nasal paths. This is analogous
to creating resonances at different frequencies. These
resonances are observed as peaks in the spectrum and are
usually called formant frequencies or simply formants (e.g.
F1, F2, ...). Thus the location of the formants is one of the important
characteristics of speech sounds. In practice there is some
Mail:
[email protected]
variation (within a certain range) in the location of the formants due to
differences in the structure of the speech production system from
person to person. Thus formants can be characteristic of the
speech (‘what is spoken’) as well as of the speaker (‘who is
speaking’).
resonances are observed as peaks in the spectrum, which can
be modeled by an all-pole system V(z) in which each pair of
complex conjugate poles corresponds to one formant.
As observed in Fig. 3, there are three formants at approximately
750 Hz, 2000 Hz and 2900 Hz.
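The link between complex conjugate pole pairs and formant peaks can be illustrated with a minimal sketch. It places one pole pair per formant at the three frequencies quoted above and checks that the magnitude response of the resulting all-pole system peaks near them. The sampling rate of 8000 Hz, the formant bandwidths and all function names are assumptions made for this illustration; they are not taken from the paper.

```python
import numpy as np

def all_pole_from_formants(formants_hz, bandwidths_hz, fs):
    """Denominator A(z) of V(z) = 1 / A(z), one conjugate pole pair per formant."""
    denominator = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)       # pole radius set by the bandwidth
        theta = 2.0 * np.pi * f / fs       # pole angle set by the formant frequency
        # (1 - r e^{j theta} z^-1)(1 - r e^{-j theta} z^-1)
        section = np.array([1.0, -2.0 * r * np.cos(theta), r * r])
        denominator = np.convolve(denominator, section)
    return denominator

def magnitude_response_db(denominator, fs, n_points=4096):
    """20*log10|1/A(e^{jw})| at n_points frequencies in [0, fs/2)."""
    A = np.fft.rfft(denominator, 2 * n_points)[:n_points]
    freqs = np.arange(n_points) * fs / (2.0 * n_points)
    return freqs, -20.0 * np.log10(np.abs(A))

def peak_frequencies(freqs, mag_db):
    """Frequencies of the interior local maxima of the response."""
    inner = mag_db[1:-1]
    idx = np.where((inner > mag_db[:-2]) & (inner > mag_db[2:]))[0] + 1
    return freqs[idx]

fs = 8000.0                          # assumed sampling rate
formants = [750.0, 2000.0, 2900.0]   # formant locations quoted in the text
bandwidths = [80.0, 100.0, 120.0]    # assumed bandwidths, for illustration only
a = all_pole_from_formants(formants, bandwidths, fs)
freqs, mag_db = magnitude_response_db(a, fs)
peaks = peak_frequencies(freqs, mag_db)
print(np.round(peaks))
```

Narrower assumed bandwidths move the poles closer to the unit circle and make the spectral peaks sharper; the peak locations stay close to the pole angles.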
Fig.3. Resonances of the vocal tract (plot titled "Vocal Tract Response"; horizontal axis ticks 0 to 4000, vertical axis ticks -100 to -40)
III. DIGITAL MODEL OF SPEECH PRODUCTION
Fig.2. Digital Model of Speech Production
The process of generating speech in the human speech
production system can be characterized by a digital
model. Each element of the human speech production system is
represented as part of a linear time-varying system. Voiced
sounds are periodic in nature. For these, an impulse train
generator produces a periodic train of impulses at the
fundamental frequency (representing the
frequency of glottal pulses). The natural glottal pulse waveform
can be replaced by a synthetic pulse waveform of the form
[3]:
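The general form of the vocal tract transfer function as an all-pole
system is given by:
$$ V(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (2) $$
where G is a gain factor, a_k are the filter coefficients and p is
the order of the model.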
The lossless tube model ignores all losses except those
at the glottis and the lips. Of these two, the effect at the lips
is treated as a radiation loss: the pressure at the lips is
related to the volume velocity by an operation equivalent to
high-pass filtering. The radiation model can then be a first-order
(all-zero) system characterized by:
$$ R(z) = R_0 \left(1 - z^{-1}\right) \qquad (3) $$
where R_0 is a gain constant.
In most cases, the radiation model and the vocal tract model
are combined to form a single system.
The discrete-time model thus formed is useful for analyzing the
speech signal in order to investigate the characteristics of the
excitation model (such as pitch period and voiced/unvoiced
classification) and of the vocal tract model (e.g. formants), which
imparts the spectral shape or peaks.
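The complete discrete-time model can be sketched end to end as follows. This is only an illustration under stated assumptions: a single vocal tract resonance at 750 Hz, a sampling rate of 8000 Hz, a raised-cosine synthetic glottal pulse, a radiation filter 1 - z^{-1} and all function names are choices made here, not values taken from the paper.

```python
import numpy as np

def glottal_pulse(n_open=25, n_close=10):
    """Assumed synthetic glottal pulse: raised-cosine opening, cosine closing."""
    n = np.arange(n_open + n_close + 1)
    opening = 0.5 * (1.0 - np.cos(np.pi * n / n_open))
    closing = np.cos(np.pi * (n - n_open) / (2.0 * n_close))
    return np.where(n <= n_open, opening, closing)

def all_pole_filter(x, a):
    """y(n) = x(n) - a[1] y(n-1) - a[2] y(n-2) - ... with a[0] = 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def resonator(formant_hz, bandwidth_hz, fs):
    """Denominator coefficients of one complex conjugate pole pair."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2.0 * np.pi * formant_hz / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

def synthesize(n_samples, voiced, fs=8000, f0=100.0, gain=1.0, seed=0):
    """Excitation -> vocal tract (all-pole) -> radiation (first difference)."""
    if voiced:
        period = int(round(fs / f0))
        impulses = np.zeros(n_samples)
        impulses[::period] = 1.0                      # impulse train generator
        excitation = np.convolve(impulses, glottal_pulse())[:n_samples]
    else:
        rng = np.random.default_rng(seed)             # random noise generator
        excitation = rng.standard_normal(n_samples)
    vocal_tract_out = all_pole_filter(gain * excitation, resonator(750.0, 80.0, fs))
    return np.diff(vocal_tract_out, prepend=0.0)      # y(n) = v(n) - v(n-1)

voiced_speech = synthesize(8000, voiced=True)
unvoiced_speech = synthesize(8000, voiced=False)
```

Once the start-up transient has decayed, the voiced output repeats every 80 samples (the assumed pitch period), whereas the unvoiced output has no such periodicity.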
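$$ g(n) = \begin{cases} \frac{1}{2}\left[1 - \cos\left(\pi n / N_1\right)\right], & 0 \le n \le N_1 \\ \cos\left(\pi (n - N_1) / (2 N_2)\right), & N_1 \le n \le N_1 + N_2 \\ 0, & \text{otherwise} \end{cases} \qquad (1) $$
where N_1 and N_2 are the lengths, in samples, of the opening and
closing phases of the pulse.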
The nature of the resulting glottal flow waveform is shown in
Fig.2. As the sequence g(n) has finite length, its z-transform
G(z) is an all-zero system, which has a low-pass filtering effect in
the frequency domain. A random noise generator provides the
excitation for unvoiced sounds. AV and AN are the gain
parameters for the voiced and unvoiced excitation, respectively.
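The low-pass effect of a finite-length glottal pulse can be checked numerically. The sketch below assumes a commonly used synthetic pulse shape (raised-cosine opening phase, cosine closing phase); the phase lengths of 25 and 10 samples, the 8000 Hz sampling rate and the function name are illustrative assumptions.

```python
import numpy as np

def rosenberg_pulse(n1, n2):
    """Finite-length pulse: n1 samples of opening phase, n2 of closing phase."""
    n = np.arange(n1 + n2 + 1)
    opening = 0.5 * (1.0 - np.cos(np.pi * n / n1))
    closing = np.cos(np.pi * (n - n1) / (2.0 * n2))
    return np.where(n <= n1, opening, closing)

fs = 8000                                  # assumed sampling rate
g = rosenberg_pulse(n1=25, n2=10)          # assumed phase lengths

# G(z) evaluated on the unit circle; g(n) is finite, so G(z) has only zeros.
n_fft = 1024
G = np.abs(np.fft.rfft(g, n_fft))
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
G_db = 20.0 * np.log10(G / G[0] + 1e-12)   # magnitude relative to DC, in dB

print("largest level above 2 kHz relative to DC (dB):", G_db[freqs >= 2000].max())
```

The magnitude falls off with frequency, which is the low-pass behaviour the model attributes to G(z).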
Considering speech as a quasi-stationary signal, the relation
between the glottal airflow velocity input and the vocal tract airflow
velocity output can be approximated by a linear filter whose
characteristics depend on the shape and size of the vocal
tract. Under ideal conditions, the vocal tract can be modeled as
a concatenation of lossless tubes of different areas and lengths.
The resonances (formants) of these tubes arise from the
different vocal tract configurations adopted during speaking. These
IV. FREQUENCY DOMAIN ANALYSIS OF SPEECH SIGNALS
Frequency domain analysis of the speech signal is mostly
performed using Fourier analysis, the power spectrum, spectral
envelope detection and the speech spectrogram. Taking the Fourier
transform of the entire speech signal provides only gross
information about the frequency components present in the
signal, without any timing information (when a
particular frequency component is present). The Short-Time
Fourier Transform (STFT) gives a better representation of time-varying frequency components.
A. Short-Time Fourier Transform (STFT)
Speech is a slowly varying signal that is assumed to be quasi-stationary, i.e. when observed over a short time interval of 10-20 ms its characteristics remain relatively constant.
This leads to short-time analysis, in which the speech signal is
divided into short segments called frames using a convenient
window function (mostly a tapered window such as the Hamming
window). Adjacent frames are overlapped (30-50% overlap)
so that samples attenuated near the tapered edges of one window
are well represented in the neighbouring frame; the taper itself
reduces spectral leakage. The STFT of a windowed frame is
given as:
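$$ S(n, \omega) = \sum_{m=-\infty}^{\infty} s(m)\, w(n - m)\, e^{-j \omega m} \qquad (4) $$
where S(n, ω) gives the short-time spectrum of the speech signal,
reflecting the time-varying properties of the speech signal,
and w(n - m) is the shifted window sequence, which slides
over the entire speech signal s(m).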
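The framing, windowing and transform steps of short-time analysis can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name, the 20 ms frame length, the 50% overlap and the test tone are assumptions, and the phase of each frame is referenced to the start of the frame rather than to absolute time, which leaves the magnitude spectrum unchanged.

```python
import numpy as np

def stft(signal, fs, frame_ms=20.0, overlap=0.5, n_fft=None):
    """Short-time spectra of Hamming-windowed, overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_ms * 1e-3 * fs))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    if n_fft is None:
        n_fft = frame_len
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop   # needs len(signal) >= frame_len
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # one row per frame
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2.0) / fs  # frame centres
    return times, freqs, spectrum

# Usage on an assumed test signal: one second of a 1000 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2.0 * np.pi * 1000.0 * t)
times, freqs, spectrum = stft(tone, fs)
dominant_hz = freqs[np.argmax(np.abs(spectrum), axis=1)]
```

For this stationary tone every frame reports the same dominant frequency; for speech, the rows of `spectrum` change from frame to frame as the sound changes.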
vocal tract. There should be a proper compromise between
window length and spectral detail. Selecting a small
window (3-5 ms) may give poor spectral (frequency) resolution,
whereas a large window (100-300 ms) gives poor time
resolution. As observed in Fig. 5 and Fig.6,
a proper window size reveals both the slowly varying
detail of the spectrum (its envelope), which characterizes the vocal tract,
and the rapidly varying detail, such as the pitch and its harmonics, which
relates to the excitation source. The shape of the window function (other than
the rectangular window) does not affect the characteristics much.
The Hamming window is a good choice considering its main lobe width
and peak side lobe amplitude.
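The trade-off between window length and frequency resolution can be made concrete with a short calculation. The sketch assumes the usual approximation that the main lobe of a length-N Hamming window is about 8π/N radians per sample wide, i.e. about 4·fs/N Hz, and a sampling rate of 8000 Hz; the function name is hypothetical.

```python
def hamming_resolution(window_ms, fs):
    """Window length in samples and approximate main-lobe width in Hz."""
    n = int(round(window_ms * 1e-3 * fs))
    main_lobe_hz = 4.0 * fs / n        # approximation: 8*pi/N rad/sample
    return n, main_lobe_hz

fs = 8000                              # assumed sampling rate
for window_ms in (5, 20, 200):
    n, width = hamming_resolution(window_ms, fs)
    print(f"{window_ms:4d} ms -> {n:5d} samples, main lobe about {width:.0f} Hz")
```

Under these assumptions a 5 ms window smears each spectral component over roughly 800 Hz, far wider than the spacing between pitch harmonics, while a 200 ms window narrows this to about 20 Hz at the cost of averaging over many pitch periods.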
Fig.6. Power spectral density of speech signal of word ‘had’ (axes: Power Spectrum Magnitude (dB) versus Frequency, 0 to 8000)
Fig.4. Speech signal showing voiced and unvoiced frame (panel "Speech Signal": Amplitude versus Sample index (x 10^4); second panel: Amplitude versus Frame index)
As shown in Fig.4, the information contained in short frames
of the speech signal varies with the nature of the speech
sounds.
[Fig.5 panels: "Windowed Speech Signal" and "Spectrum of speech signal"]
B. Spectrogram Analysis
Another approach to obtaining the properties of the speech signal is
the spectrogram. The spectrogram (also called a sonogram)
is similar in principle to the spectrum obtained using Fourier
analysis, the difference being that it accounts for time as well
as frequency. It is a graphical display
of the magnitude of the time-varying spectral characteristics
[3]. It also applies framing and windowing before
analyzing the spectrum. It is thus a 3-D view of the spectrum,
with time displayed on the horizontal axis and frequency on the
vertical axis. The magnitude of the frequency components
(energy) over time is shown as the degree of darkness
(the more the energy, the darker the display).
There are two types of spectrogram, depending on the
length of the window function used for framing the signal,
namely the wide-band spectrogram and the narrowband
spectrogram. The general form of the spectrogram of a windowed
speech waveform is expressed as:
(5)
Fig.5 Windowed speech frame and its spectrum (spectrum panel: Magnitude (dB) versus Frequency (Hz), 0 to 8000 Hz)
The power spectrum of the speech signal is also a useful
analysis technique, especially for locating peaks in the
spectrum. As discussed in section III, these peaks represent the
formants of the speech sound, which are characteristic of the
In equation (5), the terms are the spectrum of the shifted window
function and the multiplicative spectral component of the
glottal flow input g(n) and the time-varying system impulse
response h(n).
In the wide-band spectrogram, spectral analysis is performed on a
short segment of windowed speech, around 10 ms long, giving a
broader bandwidth for analysis. This gives better time
resolution, so that individual pitch periods in voiced regions are
resolved and appear as vertical striations in the graphical display. In
contrast, the narrowband spectrogram uses a longer window,
around 50 ms, giving a narrow analysis bandwidth. The narrowband
spectrogram thus better resolves the individual pitch
harmonics, which appear as horizontal striations, with the
prominent formants visible as darker bands. The spectrogram is a reliable tool for
estimating the pitch and formants of the speech signal.
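The difference between the two spectrogram types can be reproduced on a synthetic signal. The sketch below is an illustration under assumptions: an impulse train with a 100 Hz fundamental stands in for voiced excitation, the sampling rate is 8000 Hz, the hop is 2 ms, and the function names are hypothetical. It measures how strongly the frame energy fluctuates in time (vertical striations) and how clearly the harmonics stand out in frequency (horizontal striations).

```python
import numpy as np

def power_spectrogram(signal, fs, window_ms, hop_ms=2.0):
    """|STFT|^2 with a Hamming window; rows are frames, columns are frequencies."""
    signal = np.asarray(signal, dtype=float)
    n_win = int(round(window_ms * 1e-3 * fs))
    hop = int(round(hop_ms * 1e-3 * fs))
    window = np.hamming(n_win)
    starts = range(0, len(signal) - n_win + 1, hop)
    frames = np.stack([signal[s:s + n_win] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)
    return freqs, power

def temporal_modulation(power):
    """Ratio of largest to smallest frame energy (1 means no variation in time)."""
    energy = power.sum(axis=1)
    return energy.max() / energy.min()

fs = 8000
f0 = 100                                   # assumed fundamental frequency
x = np.zeros(fs)
x[::fs // f0] = 1.0                        # one pulse per pitch period

freqs_wide, wide = power_spectrogram(x, fs, window_ms=10.0)      # wide-band
freqs_narrow, narrow = power_spectrogram(x, fs, window_ms=50.0)  # narrowband

wide_modulation = temporal_modulation(wide)
narrow_modulation = temporal_modulation(narrow)

# Narrowband: compare average power at harmonic bins (multiples of 100 Hz)
# with bins 40 Hz above each harmonic.
mean_narrow = narrow.mean(axis=0)
harmonic_power = mean_narrow[5::5].mean()
between_power = mean_narrow[2::5].mean()
```

With the 10 ms window the frame energy rises and falls with every pitch pulse while the spectrum of each frame is flat; with the 50 ms window the frame energy is nearly constant and the energy concentrates at multiples of the assumed fundamental.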
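cepstral domain. In the quefrency domain, the vocal tract
components are represented by the slowly varying components
concentrated in the low quefrency region, and the excitation
components by the fast varying components in
the high quefrency region. Thus the cepstrum of the speech
signal is the IDFT of the logarithm of the magnitude spectrum of
the speech signal, given by [5]:
$$ c(n) = \mathrm{IDFT}\{\log|S(\omega)|\} = \mathrm{IDFT}\{\log|E(\omega)| + \log|H(\omega)|\} \qquad (8) $$
where S(ω), E(ω) and H(ω) are the spectra of the speech signal s(n),
the excitation e(n) and the vocal tract impulse response h(n).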
Fig.5. Inverted Wideband and Narrowband Spectrogram (panels "Wideband Spectrogram" and "Narrowband Spectrogram": Frequency (kMel) versus Time (s))
Fig.6. Cepstrum of voiced segment of speech signal (panel "Speech Signal and its Cepstrum": Amplitude versus Time (s) and Amplitude versus Quefrency (s))
V. CEPSTRAL ANALYSIS
From the source-filter theory of speech signal modelling [4],
speech is considered to be the response of a linear time-varying
system, where the excitation is created by vocal fold vibration
and the vocal tract determines the characteristics of the system. In
the time domain, speech is represented as the convolution
of these two entities. The objective of cepstral analysis is to
separate the speech into its source and system components
without any a priori knowledge about the source and/or the system
[5]. Convolution in the time domain becomes multiplication of
the respective spectra in the frequency domain. If e(n) is the
excitation and h(n) is the impulse response of the system, then the
speech signal s(n) in the time and frequency domains is given by:
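$$ s(n) = e(n) * h(n) \quad \text{and} \quad S(\omega) = E(\omega)\, H(\omega) \qquad (6) $$
Taking the logarithm of the magnitude spectrum gives:
$$ \log|S(\omega)| = \log|E(\omega)| + \log|H(\omega)| \qquad (7) $$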
Here the magnitude spectra of the excitation component and the
vocal tract component become additive and hence linearly separable.
The inverse DFT of the log spectrum transforms it from the
frequency domain to the quefrency domain, also referred to as the
The cepstrum maps all the slowly varying components of the
log magnitude spectrum to the low quefrency region and the fast
varying components to the high quefrency region. In the log
magnitude spectrum, the slowly varying components represent
the envelope corresponding to the vocal tract, and the fast
varying components correspond to the excitation source. As a result, the
vocal tract and excitation source components are naturally
separated in the cepstrum of speech.
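The separation described above can be tried on a synthetic voiced frame. The following is a sketch under assumptions, not the authors' procedure: the excitation is an impulse train with an 80-sample period (100 Hz at an assumed 8000 Hz sampling rate), the vocal tract is a single assumed resonance at 1000 Hz, and the function names, the 40 ms frame, the FFT length and the lifter cutoff of 20 samples are illustrative choices.

```python
import numpy as np

def real_cepstrum(frame, n_fft=1024):
    """IDFT of the log magnitude spectrum of a (windowed) frame."""
    spectrum = np.fft.fft(frame, n_fft)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # small floor avoids log(0)
    return np.real(np.fft.ifft(log_magnitude))

def pitch_from_cepstrum(cepstrum, fs, f0_min=60.0, f0_max=400.0):
    """F0 from the largest cepstral peak in the plausible pitch-period range."""
    q_min = int(fs / f0_max)
    q_max = int(fs / f0_min)
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return fs / peak

def spectral_envelope(cepstrum, cutoff):
    """Smoothed log spectrum: keep only quefrencies below cutoff (both sides)."""
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    if cutoff > 1:
        liftered[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]
    return np.real(np.fft.fft(liftered))

# Synthetic voiced speech: impulse train through one two-pole resonator.
fs = 8000
period = 80                                   # assumed pitch period in samples
excitation = np.zeros(fs)
excitation[::period] = 1.0
r = np.exp(-np.pi * 100.0 / fs)               # assumed 100 Hz bandwidth
theta = 2.0 * np.pi * 1000.0 / fs             # assumed 1000 Hz resonance
speech = np.zeros(fs)
for n in range(fs):
    speech[n] = excitation[n]
    if n >= 1:
        speech[n] += 2.0 * r * np.cos(theta) * speech[n - 1]
    if n >= 2:
        speech[n] -= r * r * speech[n - 2]

frame = speech[4000:4320] * np.hamming(320)   # 40 ms Hamming-windowed frame
c = real_cepstrum(frame)
f0_est = pitch_from_cepstrum(c, fs)
envelope = spectral_envelope(c, cutoff=20)
envelope_peak_hz = np.argmax(envelope[:513]) * fs / 1024.0
```

The high-quefrency peak recovers the pitch of the excitation, while the low-quefrency part, transformed back to frequency, gives a smooth envelope whose maximum lies near the assumed resonance.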
VI. CONCLUSION
In this paper, methods for speech analysis in the frequency
domain are discussed. Short-time analysis is essential for
deriving the important characteristics of the speech
signal. All these methods are based on source-system
modeling of the speech signal. The characteristics of the vocal folds
are represented in terms of the pitch period or fundamental
frequency, the glottal flow waveform, etc. Resonances of the
spectrum, or formants, are the most prominent features of the
vocal tract. All of these features are well distinguished through
frequency domain analysis. The usefulness of the extracted
features depends upon the end task to be performed.
The spectrographic display is a useful speech analysis tool for
understanding or investigating the frequency components of speech
sounds with respect to time. Cepstral analysis is a convenient
technique for separating the source and system parameters of the
linear speech production system. Variability of the same speech
sound from speaker to speaker is the preferred characteristic for identifying an
individual from his or her voice, whereas consistency of speech
sounds across a number of speakers is the desirable
characteristic for speech recognition. Frequency domain
analysis is useful for a variety of other applications such as
language recognition, emotion recognition and speaker
diarization/indexing.
REFERENCES
[1] Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles
and Practice, Pearson Education, 2002.
[2] Hiroshi Shimodaira and Steve Renals, Automatic Speech Recognition, Lectures Series.
[3] A. E. Rosenberg, “Effect of glottal pulse shape on the quality of natural
vowels”, Journal of Acoustic Soc. Am., Vol. 49, No. 2, pp. 583-590, February
1971.
[4] Lawrence R. Rabiner and Ronald W. Schafer, Digital Processing of
Speech Signals, Prentice Hall International, 1978.
[5] Alan V. Oppenheim and Ronald W. Schafer, “From Frequency to
Quefrency: A History of the Cepstrum”, IEEE Signal Processing
Magazine, pp. 95-110, September 2004.