Speech Coding Journal
The input-output relationship is specified using a reference implementation, but novel implementations are allowed, provided that input-output equivalence is maintained. Speech coders differ primarily in bit rate (measured in bits per sample or bits per second), complexity (measured in operations per second), delay (measured in milliseconds between recording and playback), and the perceptual quality of the synthesized speech. Narrowband (NB) coding refers to coding of speech signals whose bandwidth is less than 4 kHz (8 kHz sampling rate), while wideband (WB) coding refers to coding of 7-kHz-bandwidth signals (14-16 kHz sampling rate). NB coding is more common than WB coding, mainly because of the narrowband nature of the wireline telephone channel (300-3600 Hz). More recently, however, there has been an increased effort in wideband speech coding because of several applications such as videoconferencing.
There are different types of speech coders. Table 1
summarizes the bit rates, algorithmic complexity, and
standardized applications of the four general classes of
coders described in this article; Table 2 lists a selection
of specific speech coding standards. Waveform coders
attempt to code the exact shape of the speech signal
waveform, without considering the nature of human
speech production and speech perception. These coders
are high-bit-rate coders (typically above 16 kbps). Linear
prediction coders (LPCs), on the other hand, assume that
the speech signal is the output of a linear time-invariant
(LTI) model of speech production. The transfer function
of that model is assumed to be all-pole (autoregressive
model). The excitation function is a quasiperiodic signal
constructed from discrete pulses (18 per pitch period),
pseudorandom noise, or some combination of the two. If
the excitation is generated only at the receiver, based on a
transmitted pitch period and voicing information, then the
system is designated as an LPC vocoder. LPC vocoders that
provide extra information about the spectral shape of the
excitation have been adopted as coder standards between
2.0 and 4.8 kbps. LPC-based analysis-by-synthesis coders
(LPC-AS), on the other hand, choose an excitation function
by explicitly testing a large set of candidate excitations
and choosing the best. LPC-AS coders are used in most
standards between 4.8 and 16 kbps. Subband coders are
frequency-domain coders that attempt to parameterize the
speech signal in terms of spectral properties in different
frequency bands. These coders are less widely used than
LPC-based coders but have the advantage of being scalable
MARK HASEGAWA-JOHNSON
University of Illinois at Urbana-Champaign
Urbana, Illinois

ABEER ALWAN
University of California at Los Angeles
Los Angeles, California

1. INTRODUCTION
Table 1. The four general classes of speech coders discussed in this article

  Class            Rates (kbps)   Complexity   Standardized Applications       Section
  Waveform coders  16-64          Low          Landline telephone              2
  Subband coders   12-256         Medium       Teleconferencing, audio         3
  LPC-AS coders    4.8-16         High         Digital cellular                4
  LPC vocoders     2.0-4.8        High         Satellite telephony, military   5
Table 2. A selection of speech coding standards

  Application        Rate (kbps)  BW (kHz)  Organization  Standard   Algorithm            Year
  Teleconferencing   64           3.4       ITU           G.711      µ-law/A-law PCM      1988
                     16-40        3.4       ITU           G.726      ADPCM                1990
                     16-40        3.4       ITU           G.727      Embedded ADPCM       1990
                     48-64        7         ITU           G.722      Subband ADPCM        1988
                     16           3.4       ITU           G.728      LD-CELP              1992
  Digital cellular   13           3.4       ETSI          Full-rate  RPE-LTP              1992
                     12.2         3.4       ETSI          EFR        ACELP                1997
                     7.9          3.4       TIA           IS-54      VSELP                1990
                     6.5          3.4       ETSI          Half-rate  VSELP                1995
                     8.0          3.4       ITU           G.729      ACELP                1996
                     4.75-12.2    3.4       ETSI          AMR        ACELP                1998
                     1-8          3.4       CDMA-TIA      IS-96      QCELP                1993
  Multimedia         5.3-6.3      3.4       ITU           G.723.1    MPLPC, ACELP         1996
                     2.0-18.2     3.4-7.5   ISO           MPEG-4     HVXC, CELP           1998
  Satellite          4.15         3.4       INMARSAT      M          IMBE                 1991
   telephony         3.6          3.4       INMARSAT      Mini-M     AMBE                 1995
  Secure             2.4          3.4       DDVPC         FS1015     LPC-10e              1984
   communications    2.4          3.4       DDVPC         MELP       MELP                 1996
                     4.8          3.4       DDVPC         FS1016     CELP                 1989
                     16-32        3.4       DDVPC         CVSD       CVSD
2. WAVEFORM CODING
Many speech and audio applications use an odd number of
reconstruction levels, so that background noise signals
with a very low level can be quantized exactly to
s_{K/2} = 0. One important exception is the A-law companded
PCM standard [48], which uses an even number of
reconstruction levels.
2.1.1. Uniform PCM. Uniform PCM is the name given
to quantization algorithms in which the reconstruction
levels are uniformly distributed between Smax and Smin .
The advantage of uniform PCM is that the quantization
error power is independent of signal power; high-power
signals are quantized with the same resolution as
low-power signals. Invariant error power is considered
desirable in many digital audio applications, so 16-bit
uniform PCM is a standard coding scheme in digital audio.
The error power and SNR of a uniform PCM coder
vary with bit rate in a simple fashion. Suppose that a
signal is quantized using B bits per sample. If zero is a
reconstruction level, then the quantization step size Δ is

    Δ = (S_max − S_min) / (2^B − 1)    (2)
Assuming that quantization errors are uniformly distributed between −Δ/2 and Δ/2, the quantization error power is

    10 log10 E[e²(n)] = 10 log10 (Δ²/12)    (3)
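The relations in Eqs. (2) and (3) can be checked numerically. The sketch below is purely illustrative (the Gaussian test signal and parameter choices are not part of any standard): it quantizes a signal with a uniform quantizer and compares the empirical error power with Δ²/12.

```python
import numpy as np

def uniform_pcm(s, n_bits, s_min=-1.0, s_max=1.0):
    """Uniform quantizer: reconstruction levels evenly spaced on [s_min, s_max]."""
    delta = (s_max - s_min) / (2 ** n_bits - 1)   # step size, Eq. (2)
    idx = np.clip(np.round((s - s_min) / delta), 0, 2 ** n_bits - 1)
    return s_min + idx * delta

rng = np.random.default_rng(0)
s = np.clip(rng.normal(0.0, 0.25, 100_000), -1.0, 1.0)
for b in (8, 12, 16):
    e = s - uniform_pcm(s, b)
    delta = 2.0 / (2 ** b - 1)
    # empirical error power versus the delta^2 / 12 prediction of Eq. (3)
    print(b, np.mean(e ** 2), delta ** 2 / 12)
```

Because the step size is fixed, the error power is the same whether the input is loud or quiet, which is exactly the "invariant error power" property discussed above.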
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS
µ-law companded PCM applies a logarithmic compressor to the signal before uniform quantization:

    t(n) = S_max · (log(1 + µ|s(n)/S_max|) / log(1 + µ)) · sign(s(n))    (4)
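A minimal compressor/expander pair implementing Eq. (4) might look like the following sketch (the value µ = 255, standard in North American telephony, is assumed here):

```python
import numpy as np

def mu_law_compress(s, s_max=1.0, mu=255.0):
    """Compress s according to Eq. (4)."""
    return s_max * np.log1p(mu * np.abs(s) / s_max) / np.log1p(mu) * np.sign(s)

def mu_law_expand(t, s_max=1.0, mu=255.0):
    """Invert Eq. (4) to recover the signal before uniform quantization."""
    return (s_max / mu) * np.expm1(np.abs(t) / s_max * np.log1p(mu)) * np.sign(t)

x = np.linspace(-1.0, 1.0, 101)
y = mu_law_expand(mu_law_compress(x))   # round trip: y should equal x
```

Compression expands the quantizer's resolution for low-level samples at the expense of high-level samples, which approximately equalizes SNR across signal levels.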
Successive speech samples are highly correlated. The long-term average spectrum of voiced speech is reasonably well approximated by the function S(f) = 1/f above about 500 Hz; the first-order intersample correlation coefficient is correspondingly high.
[Figure 1. Input-output characteristic t(n) of the µ-law compander, shown for µ = 0 and µ = 256.]

In differential PCM (DPCM), only the difference between each sample and its predicted value s_p(n) is quantized:

    d(n) = s(n) − s_p(n)    (6)

and the decoder reconstructs

    ŝ(n) = d̂(n) + s_p(n)    (7)
[The encoder subtracts the predictor output s_p(n) from s(n), quantizes the difference d(n), and transmits the quantizer codeword; both the encoder and the decoder add the quantized difference d̂(n) to the output of the predictor P(z) to form the reconstructed signal ŝ(n).]
Figure 2. Schematic of a DPCM coder.
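A toy first-order DPCM codec following the schematic of Figure 2 can be sketched as follows. The predictor coefficient, step size, and word length below are illustrative choices, not taken from any standard:

```python
import math

def dpcm_encode(s, a=0.9, n_bits=6, step=0.02):
    """Quantize d(n) = s(n) - a*s_hat(n-1); track the decoder's reconstruction."""
    s_hat, codes, recon = 0.0, [], []
    qmax = 2 ** (n_bits - 1) - 1
    for x in s:
        pred = a * s_hat                                 # s_p(n), one-tap predictor
        q = max(-qmax, min(qmax, round((x - pred) / step)))
        codes.append(q)
        s_hat = pred + q * step                          # matches the decoder exactly
        recon.append(s_hat)
    return codes, recon

def dpcm_decode(codes, a=0.9, step=0.02):
    s_hat, out = 0.0, []
    for q in codes:
        s_hat = a * s_hat + q * step                     # Eq. (7)
        out.append(s_hat)
    return out

s = [0.5 * math.sin(2 * math.pi * n / 40) for n in range(200)]
codes, recon = dpcm_encode(s)
```

Note that the encoder predicts from its own reconstructed samples, not from the clean input, so encoder and decoder stay in lockstep and the quantization error does not accumulate.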
3. SUBBAND CODING
[Figure: a perceptual subband audio coder. The PCM input passes through an analysis filterbank; in parallel, an FFT-based model computes masking thresholds and signal-to-mask ratios, which drive dynamic bit allocation and the scaling and quantization of each subband. The multiplexed bitstream crosses a (possibly noisy) channel; the decoder reads the dynamic bit allocation, dequantizes and rescales the subbands, and reconstructs PCM output with a synthesis filterbank.]

Linear predictive coders model speech as the output of an all-pole filter driven by an excitation U(z):

    S(z) = U(z) / (1 − Σ_{i=1}^{p} a_i z^{−i})    (11)
[Figure: an LPC analysis-by-synthesis coder. In the encoder, each candidate codevector u(n) from the codebook(s) of LPC excitation vectors is passed through the LPC synthesis filter 1/A(z) and the perceptual weighting filter W(z), and the candidate minimizing the perceptually weighted error against s_w(n) is selected; the decoder fetches the specified codevector and filters it through 1/A(z).]

4.2.

The normalized magnitude spectrum of a one-tap pitch prediction filter g/(1 − bz^{−T₀}) is

    |H(e^{jω})|² = g² / (1 + b² − 2b cos(ωT₀))    (13)
Figure 5. Normalized magnitude spectrum of the pitch prediction filter for several values of the prediction coefficient.
The perceptual weighting filter is typically a bandwidth-expanded ratio of the LPC polynomial:

    W(z) = A(z/γ₁) / A(z/γ₂),   0 < γ₂ < γ₁ ≤ 1    (17)

and the weighted speech signal is

    S_w(z) = W(z)S(z)    (19)

[Figure: speech spectrum (amplitude in dB, 0-4000 Hz) compared with white noise at 5 dB SNR, with and without perceptual weighting.]

The excitation is chosen to minimize the perceptually weighted error

    E = Σ_n (s_w(n) − ŝ_w(n))²    (20)
4.3.2. Adaptive Postfiltering. Despite the use of perceptually weighted error minimization, the synthesized
speech coming from an LPC-AS coder may contain audible
quantization noise. In order to minimize the perceptual
effects of this noise, the last step in the decoding process is
often a set of adaptive postfilters [11,80]. Adaptive postfiltering improves the perceptual quality of noisy speech by
giving a small extra emphasis to features of the spectrum
that are important for human-to-human communication,
including the pitch periodicity (if any) and the peaks in
the spectral envelope.
A pitch postfilter (or long-term predictive postfilter)
enhances the periodicity of voiced speech by applying
either an FIR or IIR comb filter to the output. The time
delay and gain of the comb filter may be set equal to the
transmitted pitch lag and gain, or they may be recalculated
at the decoder using the reconstructed signal s (n). The
pitch postfilter is applied only if the proposed comb filter
gain is above a threshold; if the comb filter gain is below
threshold, the speech is considered unvoiced, and no pitch
postfilter is used. For improved perceptual quality, the LPC excitation signal may be interpolated to a higher sampling rate in order to allow the use of fractional pitch periods; for example, the postfilter in the ITU G.729 coder uses pitch periods quantized to 1/8 sample.
A short-term predictive postfilter enhances peaks in the spectral envelope. The form of the short-term postfilter is similar to that of the masking function M(z) introduced in the previous section; the filter has peaks at the same frequencies as 1/A(z), but the peak-to-valley ratio is less than that of 1/A(z).
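A common short-term postfilter form is a bandwidth-expanded ratio of the LPC polynomial, A(z/β)/A(z/α) with β < α, so that the filter peaks where 1/A(z) peaks but with a reduced peak-to-valley ratio. The sketch below is illustrative only: the LPC coefficients and the α, β values are invented, not those of any particular standard.

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Replace A(z) by A(z/gamma): coefficient a_i becomes a_i * gamma**i."""
    return a * gamma ** np.arange(len(a))

def iir_filter(num, den, x):
    """Direct-form filtering; den[0] is assumed to be 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n >= i)
        acc -= sum(den[i] * y[n - i] for i in range(1, len(den)) if n >= i)
        y[n] = acc
    return y

a = np.array([1.0, -1.2, 0.5])      # illustrative quantized LPC polynomial A(z)
alpha, beta = 0.8, 0.5              # illustrative expansion factors, beta < alpha
num = bandwidth_expand(a, beta)     # numerator  A(z/beta)
den = bandwidth_expand(a, alpha)    # denominator A(z/alpha): flattened copy of 1/A(z)
impulse = np.zeros(64)
impulse[0] = 1.0
hpf = iir_filter(num, den, impulse) # impulse response of the postfilter
```

Scaling the coefficients by γ < 1 pulls the poles and zeros toward the origin, which broadens the formant peaks without moving their center frequencies.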
Postfiltering may change the gain and the average spectral tilt of ŝ(n). In order to correct these problems, systems that employ postfiltering may pass the final signal through a one-tap FIR preemphasis filter, and then rescale the gain.
[Figure: frame-based analysis of the speech signal s(n) to extract the LPC coefficients.]
The excitation is formed as a gain-weighted combination of codevectors:

    U = GX,   G = [g₁, g₂, …],   X = [X₁; X₂; …]    (22)
    ŝ_w(n) = Σ_{i=0}^{∞} h(i) u(n − i)    (23)

where h(n) is the infinite-length impulse response of H(z). Suppose that ŝ_w(n) has already been computed for n < 0, and the coder is now in the process of choosing the optimal u(n) for the subframe 0 ≤ n ≤ L − 1. The sum above can be divided into two parts: a part that depends on the current subframe input, and a part that does not:
    Ŝ_w = Ŝ_ZIR + UH    (24)

where Ŝ_ZIR is the zero-input response of the weighted synthesis filter, U = [u(0), …, u(L − 1)], and H is the convolution matrix

        ⎡ h(0)  h(1)  …  h(L−1) ⎤
    H = ⎢  0    h(0)  …  h(L−2) ⎥    (25)
        ⎢  ⋮     ⋮    ⋱    ⋮    ⎥
        ⎣  0     0    …   h(0)  ⎦
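The structure of Eqs. (24)-(25) can be verified numerically. The sketch below uses an invented, truncated impulse response (in a real coder h(n) comes from W(z)/A(z)) and checks that the product UH matches direct convolution:

```python
import numpy as np

L = 8
h = 0.8 ** np.arange(L)             # illustrative truncated impulse response of H(z)

# Convolution matrix of Eq. (25): row i of H is h(n) delayed by i samples
H = np.zeros((L, L))
for i in range(L):
    H[i, i:] = h[: L - i]

u = np.random.default_rng(1).normal(size=L)
zero_state = u @ H                   # the UH term of Eq. (24)
direct = np.convolve(u, h)[:L]       # zero-state response by direct convolution
```

Precomputing H once per subframe is what makes it cheap to evaluate the weighted synthesis output for every candidate excitation during the codebook search.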
[Figure: CELP excitation generation. An adaptive codebook stores the past excitation u(n − D) for delays D_min ≤ D ≤ D_max with gain b; a stochastic codebook stores codevectors c₁(n), …, c_K(n) with gain g. The summed excitation u(n) is filtered by W(z)/A(z), and the squared error against s_w(n) is minimized.]
truncated impulse response of the filter W(z)/A(z), as discussed in Section 4.4 [3,97]. Davidson and Lin separately
proposed center clipping the stochastic codevectors, so that
most of the samples in each codevector are zero [15,67].
Lin also proposed structuring the stochastic codebook so
that each codevector is a slightly-shifted version of the previous codevector; such a codebook is called an overlapped
codebook [67]. Overlapped stochastic codebooks are rarely
used in practice today, but overlapped-codebook search
methods are often used to reduce the computational complexity of an adaptive codebook search. In the search of
an overlapped codebook, the correlation RX and autocorrelation
introduced in Section 4.4 may be recursively
computed, thus greatly reducing the complexity of the
codebook search [63].
Most CELP coders optimize the adaptive codebook
index and gain first, and then choose a stochastic
codevector and gain in order to minimize the remaining
perceptually weighted error. If all the possible pitch
periods are longer than one subframe, then the entire
content of the adaptive codebook is known before the
beginning of the codebook search, and the efficient
overlapped codebook search methods proposed by Lin
may be applied [67]. In practice, the pitch period of a
female speaker is often shorter than one subframe. In
order to guarantee that the entire adaptive codebook is
known before beginning a codebook search, two methods
are commonly used: (1) the adaptive codebook search may simply be constrained to consider only pitch periods longer than L samples; in this case, the adaptive codebook will lock onto values of D that are an integer multiple of the actual pitch period (if the same integer multiple is not chosen for each subframe, the reconstructed speech quality is usually good); and (2) adaptive codevectors with delays of D < L may be constructed by simply repeating the most recent D samples as necessary to fill the subframe.
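Method 2 above can be sketched in a few lines (`past` is assumed to hold the previous excitation samples u(n)):

```python
def adaptive_codevector(past, D, L):
    """Adaptive codevector of length L at pitch delay D, taken from the past
    excitation. When D < L, the most recent D samples are repeated as
    necessary to fill the subframe (method 2 in the text)."""
    if D >= L:
        start = len(past) - D
        return list(past[start:start + L])
    seg = list(past[-D:])            # most recent D samples
    vec = []
    while len(vec) < L:
        vec.extend(seg)
    return vec[:L]
```

With this construction the entire codevector is known before the search begins, even for the short pitch periods typical of female speakers.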
4.5.3. SELP, VSELP, ACELP, and LD-CELP. Rose and
Barnwell demonstrated that reasonable speech quality
is achieved if the LPC excitation vector is computed completely recursively, using two closed-loop pitch predictors
in series, with no additional information [82]. In their
self-excited LPC algorithm (SELP), the LPC excitation
is initialized during the first subframe using a vector of
samples known at both the transmitter and receiver. For
all frames after the first, the excitation is the sum of an
arbitrary number of adaptive codevectors:
u(n) =
sum(||^2)
g
cK (n )
Stochastic codebook
M
bm u(n Dm )
(39)
m=1
2
m=1
gm ckm (n)
(40)
5. LPC VOCODERS
5.1.

The LPC excitation is recovered from the signal by inverse filtering:

    u(n) = (1/G) ( s(n) − Σ_{i=1}^{p} a_i s(n − i) )    (46)

The pitch period may be estimated from the average magnitude difference function (AMDF) of the LPC residual d(n),

    AMDF(m) = (1/(N − |m|)) Σ_{n=|m|}^{N−1} |d(n) − d(n − |m|)|    (47)

by choosing the lag m ≥ 20 samples at which the AMDF is minimized.
[Figure: LPC vocoder and MELP synthesis. A pulse train (voiced) or white noise (unvoiced), selected by the voiced/unvoiced switch and scaled by the gain G, excites the transfer function 1/A(z); in the MELP model, the pulse train and an added white-noise component (frication, aspiration) are each passed through source spectral shaping filters before exciting 1/A(z).]
Figure 11. The MELP speech synthesis model.
5.4.
6.1.1. Intelligibility. Speech coder intelligibility is evaluated by coding a number of prepared words, asking listeners to write down the words they hear, and calculating
the percentage of correct transcriptions (an adjustment for
guessing may be subtracted from the score). The diagnostic
rhyme test (DRT) and diagnostic alliteration test (DALT)
are intelligibility tests which use a controlled vocabulary
to test for specific types of intelligibility loss [101,102].
Each test consists of 96 pairs of confusable words spoken in isolation. The words in a pair differ in only one
distinctive feature, where the distinctive feature dimensions proposed by Voiers are voicing, nasality, sustention,
sibilation, graveness, and compactness. In the DRT, the words in a pair differ in only one distinctive feature of the initial consonant; for instance, "jest" and "guest" differ in the sibilation of the initial consonant. In the DALT, words differ in the final consonant; for instance, "oaf" and "oath" differ in the graveness of the final consonant. Listeners
hear one of the words in each pair, and are asked to select
the word from two written alternatives. Professional testing firms employ trained listeners who are familiar with
the speakers and speech tokens in the database, in order
to minimize test-retest variability.
Intelligibility scores quoted in the speech coding
literature often refer to the composite results of a DRT.
In a comparison of two federal standard coders, the LPC
10e algorithm resulted in 90% intelligibility, while the
FS-1016 CELP algorithm had 91% intelligibility [64].
An evaluation of waveform interpolative (WI) coding
published DRT scores of 87.2% for the WI algorithm, and
87.7% for FS-1016 [61].
A different kind of coding technique that has properties of both waveform and LPC-based coders has been proposed [59,60] and is called prototype waveform interpolation (PWI). PWI uses both interpolation in the frequency domain and forward-backward prediction in the time domain. The technique is based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per interval of 20-30 ms. The assumption exploits the fact that voiced speech can be interpreted as a concatenation of slowly evolving pitch cycle waveforms. The prototype waveform is described by
a set of linear prediction (LP) filter coefficients describing
the formant structure and a prototype excitation waveform, quantized with analysis-by-synthesis procedures.
The speech signal is reconstructed by filtering an excitation signal consisting of the concatenation of (infinitesimal)
sections of the instantaneous excitation waveforms. By
coding the voiced and unvoiced components separately, a
2.4-kbps version of the coder performed similarly to the
4.8-kbps FS1016 standard [61].
Recent work has aimed at reducing the computational
complexity of the coder for rates between 1.2 and 2.4 kbps
by including a time-varying waveform sampling rate and
a cubic B-spline waveform representation [62,86].
6.
6.1.2. Numerical Measures of Perceptual Quality. Perhaps the most commonly used speech quality
measure is the mean opinion score (MOS). A mean opinion score is computed by coding a set of spoken phrases
using a variety of coders, presenting all of the coded
speech together with undegraded speech in random order,
asking listeners to rate the quality of each phrase on
a numerical scale, and then averaging the numerical
ratings of all phrases coded by a particular coder. The
five-point numerical scale is associated with a standard
set of descriptive terms: 5 = excellent, 4 = good, 3 = fair,
2 = poor, and 1 = bad. A rating of 4 is supposed to correspond to standard toll-quality speech, quantized at 64 kbps
using ITU standard G.711 [48].
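The averaging step described above is straightforward; a minimal sketch (the data layout, pairs of coder name and listener score, is an illustrative assumption) is:

```python
from collections import defaultdict

def mean_opinion_scores(ratings):
    """ratings: iterable of (coder, score) pairs on the 5-point scale
    (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad)."""
    totals = defaultdict(lambda: [0.0, 0])
    for coder, score in ratings:
        totals[coder][0] += score
        totals[coder][1] += 1
    return {coder: s / n for coder, (s, n) in totals.items()}
```

In a real test the phrases are presented in random order, mixed with undegraded reference speech, so that each coder's mean is computed over the same listening conditions.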
Mean opinion scores vary considerably depending
on background noise conditions; for example, CVSD
performs significantly worse than LPC-based methods in
quiet recording conditions, but significantly better under
extreme noise conditions [96]. Gender of the speaker may
also affect the relative ranking of coders [96]. Expert
listeners tend to give higher rankings to speech coders
with which they are familiar, even when they are not
consciously aware of the order in which coders are
presented [96]. Factors such as language and location of
the testing laboratory may shift the scores of all coders up
or down, but tend not to change the rank order of individual
coders [39]. For all of these reasons, a serious MOS test
must evaluate several reference coders in parallel with the
coder of interest, and under identical test conditions. If an
MOS test is performed carefully, intercoder differences
of approximately 0.15 opinion points may be considered
significant. Figure 12 is a plot of MOS as a function of bit
rate for coders evaluated under quiet listening conditions
in five published studies (one study included separately
tabulated data from two different testing sites [96]).
The diagnostic acceptability measure (DAM) is an
attempt to control some of the factors that lead to
variability in published MOS scores [100]. The DAM
employs trained listeners, who rate the quality of
standardized test phrases on 10 independent perceptual
scales, including six scales that rate the speech itself
(fluttering, thin, rasping, muffled, interrupted, nasal),
and four scales that rate the background noise (hissing, buzzing, babbling, rumbling). Each of these is a 100-point scale, with a range of approximately 30 points between the LPC-10e algorithm (50 points) and clean speech (80 points) [96]. Scores on the various perceptual
scales are combined into a composite quality rating. DAM
scores are useful for pointing out specific defects in a
speech coding algorithm. If the only desired test outcome
is a relative quality ranking of multiple coders, a carefully
controlled MOS test in which all coders of interest are
The segmental SNR of frame n compares the signal and error energies within the frame:

    SNR(n) = Σ_{m=n}^{n+N−1} s²(m) / Σ_{m=n}^{n+N−1} e²(m)    (49)
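Eq. (49), averaged in dB across frames (the usual segmental-SNR convention), can be sketched as follows; the frame length N = 160 is an illustrative choice:

```python
import numpy as np

def segmental_snr_db(s, s_hat, N=160):
    """Eq. (49) evaluated per N-sample frame, averaged in dB over the frames."""
    vals = []
    for n in range(0, len(s) - N + 1, N):
        sig = np.sum(s[n:n + N] ** 2)
        err = np.sum((s[n:n + N] - s_hat[n:n + N]) ** 2)
        if sig > 0 and err > 0:
            vals.append(10 * np.log10(sig / err))
    return float(np.mean(vals))
```

Averaging in the log domain keeps quiet frames from being swamped by loud ones, which is why segmental SNR tracks perceived quality better than a single global SNR.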
The noise-to-mask ratio (NMR) in band k compares the coding error energy in that band to the integral of the masked threshold |M(e^{jω})|² over the band:

    NMR(n, k) = Σ_{m=n}^{n+N−1} e_k²(m) / ∫_{ω_k}^{ω_{k+1}} |M(e^{jω})|² dω    (51)

and the overall NMR of the frame combines the per-band ratios:

    NMR(n) = Σ_{k=0}^{K−1} NMR(n, k)    (52)
7. NETWORK ISSUES
to reconstruct d(n)
in the prediction loop. Any bits not
used in the prediction loop are marked as optional by
the signaling channel mode flag. If network congestion
disrupts traffic at a router between sender and receiver,
the router is allowed to drop optional bits from the coded
speech packets.
Embedded ADPCM algorithms produce codewords that
contain enhancement and core bits. The feedforward (FF)
path of the codec utilizes both enhancement bits and core
bits, while the feedback (FB) path uses core bits only.
With this structure, enhancement bits can be discarded or
dropped during network congestion.
An important example of a multimode coder is QCELP,
the speech coder standard that was adopted by the TIA
North American digital cellular standard based on codedivision multiple access (CDMA) technology [9]. The coder
selects one of four data rates every 20 ms depending on the
speech activity; for example, background noise is coded at a
lower rate than speech. The four rates are approximately
1 kbps (eighth rate), 2 kbps (quarter rate), 4 kbps (half
rate), and 8 kbps (full rate). QCELP is based on the CELP
structure but integrates implementation of the different
rates, thus reducing the average bit rate. For example,
at the higher rates, the LSP parameters are more finely
quantized and the pitch and codebook parameters are
updated more frequently [23]. The coder provides good
quality speech at average rates of 4 kbps.
Another example of a multimode coder is ITU standard G.723.1, an LPC-AS coder that can operate at two rates: 5.3 or 6.3 kbps [50]. At 6.3 kbps, the coder is a multipulse LPC (MPLPC) coder, while the 5.3-kbps coder is an algebraic CELP (ACELP) coder. The frame size is 30 ms with an additional lookahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. The ACELP and MPLPC coders share the same LPC analysis algorithm and frame/subframe structure, so that most of the program code is used by both coders. As mentioned earlier, in ACELP, an algebraic transformation of the transmitted index produces the excitation signal for the synthesizer.
In MPLPC, on the other hand, the perceptually weighted error is minimized by choosing the amplitudes and positions of a number of pulses in the excitation signal. Voice activity detection (VAD) is used to reduce the bit rate during silent periods, and switching from one bit rate to another is done on a frame-by-frame basis.
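A toy energy-based selector illustrating how VAD-style switching might choose among QCELP's four rates is sketched below. The thresholds are invented for illustration; real coders adapt them to a running background-noise estimate:

```python
import numpy as np

def select_rate(frame, thresholds=(-60.0, -45.0, -30.0)):
    """Return one of QCELP's four rate names based on frame energy in dB."""
    energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
    for rate, thr in zip(("eighth", "quarter", "half"), thresholds):
        if energy_db < thr:
            return rate
    return "full"
```

Frames classified as background noise land in the low-rate bins, so the average bit rate over a conversation falls well below the full rate.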
Multimode coders have been proposed over a wide variety of bandwidths. Taniguchi et al. proposed a multimode
ADPCM coder at bit rates between 10 and 35 kbps [94].
Johnson and Taniguchi proposed a multimode CELP algorithm at data rates of 4.05.3 kbps in which additional
stochastic codevectors are added to the LPC excitation vector when channel conditions are sufficiently good to allow
high-quality transmission [55]. The European Telecommunications Standards Institute (ETSI) has recently proposed a standard for adaptive multirate coding at rates
between 4.75 and 12.2 kbps.
7.3. Joint Source-Channel Coding
In speech communication systems, a major challenge is
to design a system that provides the best possible speech
quality throughout a wide range of channel conditions. One
solution consists of allowing the transceivers to monitor
the state of the communication channel and to dynamically
allocate the bitstream between source and channel coding
accordingly. For low-SNR channels, the source coder
operates at low bit rates, thus allowing powerful forward
error control. For high-SNR channels, the source coder
uses its highest rate, resulting in high speech quality,
but with little error control. An adaptive algorithm selects
a source coder and channel coder based on estimates
of channel quality in order to maintain a constant
total data rate [95]. This technique is called adaptive
multirate (AMR) coding, and requires the simultaneous
implementation of an AMR source coder [24], an AMR
channel coder [26,28], and a channel quality estimation
algorithm capable of acquiring information about channel
conditions with a relatively small tracking delay.
The notion of determining the relative importance
of bits for further unequal error protection (UEP)
was pioneered by Rydbeck and Sundberg [83]. Rate-compatible channel codes, such as Hagenauer's rate-compatible punctured convolutional codes (RCPC) [34], are a collection of codes providing a family of channel coding rates. By puncturing bits in the bitstream, the channel coding rate of RCPC codes can be varied instantaneously, providing UEP by imparting different degrees of protection to different segments. Cox et al. [13]
address the issue of channel coding and illustrate how
RCPC codes can be used to build a speech transmission
scheme for mobile radio channels. Their approach is
based on a subband coder with dynamic bit allocation
proportional to the average energy of the bands. RCPC
codes are then used to provide UEP.
Relatively few AMR systems describing source and
channel coding have been presented. The AMR systems [99,98,75,44] combine different types of variable rate
CELP coders for source coding with RCPC and cyclic
redundancy check (CRC) codes for channel coding and
were presented as candidates for the European Telecommunications Standards Institute (ETSI) GSM AMR codec
8. STANDARDS

9. FINAL REMARKS
BIOGRAPHIES
Mark A. Hasegawa-Johnson received his S.B., S.M.,
and Ph.D. degrees in electrical engineering and computer
science from MIT in 1989, 1989, and 1996, respectively.
From 1989 to 1990 he worked as a research engineer
at Fujitsu Laboratories Ltd., Kawasaki, Japan, where
he developed and patented a multimodal CELP speech
coder with an efficient algebraic fixed codebook. From
1996 to 1999 he was a postdoctoral fellow in the Electrical
Engineering Department at UCLA. Since 1999, he has
been on the faculty of the University of Illinois at
Urbana-Champaign. Dr. Hasegawa-Johnson holds four
U.S. patents and is the author of four journal articles
and twenty conference papers. His areas of interest include
speech coding, automatic speech understanding, acoustics,
and the physiology of speech production.
Abeer Alwan received her Ph.D. in electrical engineering from MIT in 1992. Since then, she has been with
the Electrical Engineering Department at UCLA, California, as an assistant professor (1992-1996), associate professor (1996-2000), and professor (2000-present). Professor Alwan established and directs the Speech Processing and Auditory Perception Laboratory at UCLA
(http://www.icsl.ucla.edu/spapl). Her research interests
include modeling human speech production and perception mechanisms and applying these models to speechprocessing applications such as automatic recognition,
compression, and synthesis. She is the recipient of the NSF
Research Initiation Award (1993), the NIH FIRST Career
Development Award (1994), the UCLA-TRW Excellence
in Teaching Award (1994), the NSF Career Development Award (1995), and the Okawa Foundation Award
in Telecommunications (1997). Dr. Alwan is an elected
member of Eta Kappa Nu, Sigma Xi, Tau Beta Pi, and the
New York Academy of Sciences. She served as an elected
member on the Acoustical Society of America Technical
Committee on Speech Communication (1993-1999), and on the IEEE Signal Processing Technical Committees on Audio and Electroacoustics (1996-2000) and Speech Processing (1996-2001). She is an editor-in-chief of the journal Speech
Communication.
BIBLIOGRAPHY
22. S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, New York, 1989.

25. I. Gerson and M. Jasiuk, Vector sum excited linear prediction (VSELP), in B. S. Atal, V. S. Cuperman, and A. Gersho, eds., Advances in Speech Coding, Kluwer, Dordrecht, The Netherlands, 1991, pp. 69-80.

43. ISO/IEC, Information Technology, Very Low Bitrate Audio-Visual Coding, Part 3: Audio, Subpart 2: Parametric Coding, Technical Report ISO/JTC 1/SC 29/N2203PAR, ISO/IEC, 1998.

46. ITU-T, 5-, 4-, 3- and 2-bits per Sample Embedded Adaptive Differential Pulse Code Modulation (ADPCM), Technical Report G.727, International Telecommunications Union, Geneva, 1990.
58. P. Kabal and R. Ramachandran, The computation of line spectral frequencies using Chebyshev polynomials, IEEE Trans. Acoust. Speech Signal Process. ASSP-34: 1419-1426 (1986).

59. W. Kleijn, Speech coding below 4 kb/s using waveform interpolation, Proc. GLOBECOM 1991, Vol. 3, pp. 1879-1883.

60. W. Kleijn and W. Granzow, Methods for waveform interpolation in speech coding, Digital Signal Process. 1(4): 215-230 (1991).

61. W. Kleijn and J. Haagen, A speech coder based on decomposition of characteristic waveforms, Proc. ICASSP, 1995, pp. 508-511.

62. W. Kleijn, Y. Shoham, D. Sen, and R. Hagen, A low-complexity waveform interpolation coder, Proc. ICASSP, 1996, pp. 212-215.

63. W. B. Kleijn, D. J. Krasinski, and R. H. Ketchum, Improved speech quality and efficient vector quantization in SELP, Proc. ICASSP, 1988, pp. 155-158.

64. M. Kohler, A comparison of the new 2400 bps MELP federal standard with other standard coders, Proc. ICASSP, 1997, pp. 1587-1590.

65. P. Kroon, E. F. Deprettere, and R. J. Sluyter, Regular-pulse excitation: A novel approach to effective and efficient multipulse coding of speech, IEEE Trans. ASSP 34: 1054-1063 (1986).

66. W. LeBlanc, B. Bhattacharya, S. Mahmoud, and V. Cuperman, Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding, IEEE Trans. Speech Audio Process. 1: 373-385 (1993).

67. D. Lin, New approaches to stochastic coding of speech sources at very low bit rates, in I. T. Young et al., eds., Signal Processing III: Theories and Applications, Elsevier, Amsterdam, 1986, pp. 445-447.

68. A. McCree and J. C. De Martin, A 1.7 kb/s MELP coder with improved analysis and quantization, Proc. ICASSP, 1998, Vol. 2, pp. 593-596.

69. A. McCree et al., A 2.4 kbps MELP coder candidate for the new U.S. Federal standard, Proc. ICASSP, 1996, Vol. 1, pp. 200-203.

70. A. V. McCree and T. P. Barnwell, III, A mixed excitation LPC vocoder model for low bit rate speech coding, IEEE Trans. Speech Audio Process. 3(4): 242-250 (1995).

71. B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, San Diego, 1997.

72. P. Noll, MPEG digital audio coding, IEEE Signal Process. Mag. 14(5): 59-81 (1997).

73. B. Novorita, Incorporation of temporal masking effects into bark spectral distortion measure, Proc. ICASSP, Phoenix, AZ, 1999, pp. 665-668.

74. E. Paksoy, W.-Y. Chan, and A. Gersho, Vector quantization of speech LSF parameters with generalized product codes, Proc. ICASSP, 1992, pp. 33-36.

75. E. Paksoy et al., An adaptive multi-rate speech coder for digital cellular telephony, Proc. ICASSP, 1999, Vol. 1, pp. 193-196.

76. K. K. Paliwal and B. S. Atal, Efficient vector quantization of LPC parameters at 24 bits/frame, IEEE Trans. Speech Audio Process. 1: 3-14 (1993).