Singer Identification in Popular Music Recordings Using Voice Coding Features
Brian Whitman
ABSTRACT
In most popular music, the vocals sung by the lead singer are
the focal point of the song. The unique qualities of a singer's
voice make it relatively easy for us to identify a song as
belonging to that particular artist. With little training, if one is
familiar with a particular singer's voice, one can usually
recognize that voice in other pieces, even when hearing a song
for the first time. The research presented in this paper attempts
to automatically establish the identity of a singer using
acoustic features extracted from songs in a database of popular
music. As a first step, an untrained algorithm for automatically
extracting vocal segments from within songs is presented.
Once these vocal segments are identified, they are presented to
a singer identification system that has been trained on data
taken from other songs by the same artists in the database.
1. INTRODUCTION
The singing voice is the oldest musical instrument and one
with which almost everyone has a great deal of familiarity.
Given the importance and usefulness of vocal communication,
it is not surprising that our auditory physiology and
perceptual apparatus have evolved to a high level of sensitivity
to the human voice. Once we are exposed to the sound of a
particular person's speaking voice, it is relatively easy to
identify that voice, even with very little training. For the most
part, the same holds true for the singing voice. Once we
become familiar with the sound of a particular singer's voice,
we can usually identify that voice, even when hearing a piece
for the first time.
Not only is the voice the oldest musical instrument, it is also
one of the most complex from an acoustic standpoint. This is
primarily due to the rapid acoustic variation involved in the
singing process. In order to pronounce different words, a
singer must move their jaw, tongue, teeth, etc., changing the
shape and thus the acoustic properties of their vocal tract. No
other instrument exhibits the amount of physical variation of
the human voice. This complexity has affected research in both
analysis and synthesis of singing [1].
In spite of this complexity, voice identification is almost
effortless to us. But perhaps what is more remarkable is that
even in the presence of interfering sounds, such as instruments
or background noise, we can still identify the voice of a
familiar singer. Thus, our process of identification most likely
depends on features invariant to these environmental
variations. As will be discussed later, the search for such
invariant features that can be used for robust automatic
identification is no easy task.
2. BACKGROUND
A significant amount of research has been performed on
speaker (talker) identification from digitized speech for
applications such as identity verification. These systems for
the most part use features similar to those used in speech
recognition. Many of these systems are trained on pristine data
(without background noise), and performance tends to degrade
in noisy environments. And since they are trained on spoken
data, they perform poorly on singing voice input. For more on
talker identification systems, see [2].
In the realm of music information retrieval, there is a
burgeoning amount of interest and work on automatic song
and artist identification from acoustic data. Such systems
would obviously be useful for anyone attempting to ascertain
the title or performing artist of a new piece of music and could
also aid preference-based searches for music. Another area
where this research has generated a great deal of interest is
copyright protection and enforcement. Most of these systems
utilize frequency domain features extracted from recordings,
which are then used to train a classifier built using one of
many machine learning techniques. Robust song identification
from acoustic parameters has proven to be very successful
(with accuracy greater than 99% in some cases) in identifying
songs included in the database [3]. Artist identification is a
much more difficult task, and not as well-defined as individual
song identification. A recent example of an artist identification
system is [4], which reports accuracies of approximately 50%
in artist identification on a database of about 250 songs.
Also relevant to the task of singer identification is work in
musical instrument identification. Our ability to distinguish
different voices (even when singing or speaking the same
phrase) is akin to our ability to distinguish different
instruments (even when playing the same notes). Thus, it is
likely that many of the features used in automatic instrument
identification systems will be useful for singer identification
as well. Work by Martin [5] on solo instrument identification
demonstrates the importance of both spectral and temporal
features and highlights the difficulty in building machine
listening systems that generalize beyond a limited set of
training conditions.
Obviously, singer identification and artist identification can
amount to the same thing in many situations. In [6],
Berenzweig and Ellis use vocal music as an input to a speech
recognition system, achieving a success rate of 80% in
isolating vocal regions. In [7], Berenzweig, Ellis, and Lawrence
use a neural network trained on radio recordings to similarly
segment songs into vocal and non-vocal regions. By focusing
on voice regions alone, they were able to improve artist
identification by 15%.
The system presented here also attempts to perform
segmentation of vocal regions prior to singer identification.
After segmentation, the classifier uses features drawn from
voice coding, as described in Section 4.
3. VOCAL SEGMENTATION

Vocal regions are located using a measure of harmonicity. Each frame of
the signal is passed through inverse comb filters at a range of candidate
pitch lags, and the harmonicity is taken as the ratio of the energy of the
original frame to the minimum energy remaining after filtering:

H = \frac{E_{original}}{\min_i \left( E_{filtered,i} \right)}    (1)

[Figure: block diagram of the inverse comb filter, in which the input is
delayed by z^{-N} and subtracted from itself to produce the output y[n].]
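The harmonicity measure of equation (1) can be sketched in a few lines of
code. The following Python fragment is an illustrative sketch, not the
implementation used in this work; the lag range and the assumption of a
single frame array are placeholders chosen for the example.

    import numpy as np

    def harmonicity(frame, min_lag=40, max_lag=400):
        """Ratio of original frame energy to the minimum energy left after
        inverse comb filtering over candidate pitch lags (eq. 1)."""
        e_original = np.sum(frame ** 2)
        e_filtered = []
        for lag in range(min_lag, max_lag + 1):
            # Inverse comb filter: y[n] = x[n] - x[n - N]
            y = frame[lag:] - frame[:-lag]
            e_filtered.append(np.sum(y ** 2))
        return e_original / (min(e_filtered) + 1e-12)

Frames whose harmonicity exceeds a fixed threshold (values in the range
2.0 to 2.6 appear in the segmentation table below) would be labeled vocal.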
4. SINGER IDENTIFICATION
The features are derived from linear prediction, in which each sample of
the signal s[n] is estimated as a weighted combination of the previous p
samples:

\hat{s}[n] = \sum_{k=1}^{p} a_k s[n-k]    (2)

The predictor coefficients a_k are chosen to minimize the total squared
prediction error over the analysis frame:

E = \sum_m \left( s[m] - \hat{s}[m] \right)^2    (3)
The transfer function relating the source signal and the signal
estimate is shown [12] to be an all-pole filter:

H[z] = \frac{G}{A[z]}    (4)
where the denominator is defined as follows:

A[z] = 1 - \sum_{k=1}^{p} a_k z^{-k}    (5)
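As an illustration of equations (2) through (5), the sketch below computes
order-p prediction coefficients for a single frame using the standard
autocorrelation method. The prediction order and the expectation of a
pre-windowed frame are assumptions for the example, not settings reported
in this paper.

    import numpy as np

    def lpc(frame, order=12):
        """Autocorrelation-method linear prediction: solve the normal
        equations for the coefficients a_k of eq. (2), which minimize the
        squared error of eq. (3)."""
        n = len(frame)
        # Autocorrelation at lags 0..order
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        # Toeplitz system built from lags 0..order-1
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        # Denominator polynomial A[z] = 1 - sum_k a_k z^{-k}  (eq. 5)
        A = np.concatenate(([1.0], -a))
        return a, A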
For warped linear prediction, the frequency axis is passed through an
all-pass transformation controlled by the warping parameter a:

\tilde{\omega} = \omega + 2\tan^{-1}\left( \frac{a \sin\omega}{1 - a\cos\omega} \right)    (6)
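Equation (6) can be transcribed directly; the warping parameter value in
the example below is only illustrative and is not a value specified in the
text.

    import numpy as np

    def warp_frequency(omega, a):
        """All-pass frequency warping of eq. (6): maps normalized frequency
        omega (radians per sample) to its warped counterpart."""
        return omega + 2.0 * np.arctan(a * np.sin(omega) / (1.0 - a * np.cos(omega)))

    # Example: warp a linear frequency axis with an illustrative parameter.
    omega = np.linspace(0, np.pi, 512)
    omega_warped = warp_frequency(omega, a=0.6)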
Segmentation results for different harmonicity thresholds:

Threshold   Vocal Segments   Non-vocal Segments   All Segments
2.0         55.4%            53.1%                55.4%
2.3         40.5%            69.2%                55.1%
2.6         30.7%            79.3%                54.9%
Singer identification results, mean accuracy in percent (standard
deviation in parentheses), for the GMM and SVM classifiers:

Classifier   Vocal Segments   Non-vocal Segments   All Segments
GMM          32.1 (16.6)      31.3 (17.1)          33.4 (16.5)
SVM          39.6 (30.7)      35.0 (30.4)          45.3 (29.6)

Classifier   Vocal Segments   Non-vocal Segments   All Segments
GMM          36.7 (18.1)      33.0 (17.4)          38.5 (16.6)
SVM          35.8 (17.6)      34.0 (26.8)          41.5 (28.8)
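The tables above compare Gaussian mixture model (GMM) and support vector
machine (SVM) classifiers trained on the extracted features. The sketch
below shows one way such a comparison could be set up with scikit-learn;
the placeholder feature matrices, label vectors, number of mixture
components, and kernel choice are all assumptions for illustration, not
the configuration used to produce the results reported here.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVC

    # Placeholder data stands in for real per-frame feature vectors
    # (X: n_frames x n_features) and singer labels (y).
    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(200, 12)), rng.integers(0, 4, size=200)
    X_test = rng.normal(size=(50, 12))

    # GMM approach: fit one mixture per singer, classify by maximum likelihood.
    gmms = {s: GaussianMixture(n_components=8, covariance_type="diag")
                .fit(X_train[y_train == s])
            for s in np.unique(y_train)}
    scores = np.column_stack([gmms[s].score_samples(X_test) for s in sorted(gmms)])
    gmm_pred = np.array(sorted(gmms))[np.argmax(scores, axis=1)]

    # SVM approach: a single multi-class support vector classifier.
    svm_pred = SVC(kernel="rbf").fit(X_train, y_train).predict(X_test)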
7. ACKNOWLEDGEMENTS
8. REFERENCES